Tracking your infrastructure health with Consul

The bigger your system becomes, the harder it is to maintain all of its services. Some parts fall down after a failed operation, some get stuck after a server reboot, and some simply exit because of a panic.

You might say, “Let’s set it all up in a monitoring system and everything will be fine!” Yes, but there are three problems:

  1. With pull monitoring, gathering health checks becomes too costly. The check state travels to the server even when it hasn’t changed, which floods your network once you’re running hundreds or even thousands of services.
  2. A typical monitoring system can only alert you when hardware or software fails. You can rarely set up a simple trigger there, such as a process restart: you just get a message, and it may wake you up at night to fix an issue that a simple restart would solve. Needless to say, this should be automated.
  3. Monitoring can go down as well. Ideally there should be meta-monitoring, that is, health checks for the monitoring system itself.

In my case, I solved all of these issues with Consul.

Meet Consul


Consul is a service discovery tool from the famous HashiCorp team. Service discovery here is based on health checks. The main idea of a health check is to ensure that every service component works as expected: the web page returns 200, the database responds to queries, and so on. If a health check succeeds, its state is “passing”; otherwise it is “critical”, or occasionally “warning”. Consul has a few basic concepts:

  1. Consul represents a bunch of services inside a consistent cluster. The cluster has a single elected leader; for reliability you run several server nodes, any of which can take over leadership. Each node syncs its health checks with the leader.
  2. There is no constant chatter about service state: data is sent to the cluster leader only when the state changes.
  3. Besides health checks, Consul has watches. A watch tracks the state of a subject and fires when that state matches the watch condition. The subject is usually a health check, an event, a key, or the service itself. The trigger can be, for example, a shell script handler that runs when a service check becomes critical.
  4. Consul has a multi-datacenter concept: each datacenter should run its own cluster with its own leader, and the clusters are kept in sync so they can track each other. This provides failover.
  5. For storing arbitrary data, Consul provides a key/value store. I’m skipping this subject because I don’t use it yet.

Cluster installation


Building a Consul cluster is pretty easy. Here are the steps I automated with Ansible (playbook link here):

  1. Download the archive. Also get the web UI sources and unpack them into a location of your choice (say, /var/www).
  2. Unzip the archive into /usr/bin (or any directory in $PATH). Consul is typical Golang software, so there is no installation process. Simple philosophy: get it and launch it! Just place the binary in the directory and run the consul command to test it.
  3. Configure the Consul agent. The configuration is usually split across multiple files within a single directory. Let’s create the /etc/consul.d directory and place our first service definition, nginx.json, in it.
{
    "service": {
        "check": {
            "interval": "60s",
            "script": "curl localhost",
            "timeout": "3s"
        },
        "name": "nginx",
        "port": 80
    }
}
    Here we just check that the curl request returns anything but a connection reset: when the connection fails, curl exits with code 1. In Consul, a script check that exits 0 is passing, an exit code of 1 is warning, and anything else is critical.
    Alright, now we have a small configuration for the Consul agent.
  4. Create a HashiCorp (Atlas) account and get an access token. Linking your Consul agents within a single account makes your life much easier: you don’t need to add agents to the cluster manually, they are joined through your account.
    Just be careful with tokens and don’t share them.
  5. Set up a system daemon. By default Consul is launched from the shell and doesn’t provide a service interface, which makes it awkward to track and manage the process. Let’s create a systemd unit for automated start/stop and launch at boot.
    Here is an example unit for the Consul agent daemon: /etc/systemd/system/consul-agent.service.
[Service]
ExecStart=/bin/sh -c '/usr/bin/consul agent\
 -data-dir /tmp/consul -config-dir /etc/consul.d -dc=aws-oregon\
 -atlas-join -atlas=cazorla19/infrastructure\
 -atlas-token="hfsdkjsefhoi.atlasv1.yOurAtlAsToKenHere" >> /var/log/consul/consul.log 2>&1'
Restart=always

[Install]
WantedBy=multi-user.target

Here is the one for the Consul server: /etc/systemd/system/consul-server.service.

[Service]
ExecStart=/bin/sh -c '/usr/bin/consul agent -server -bootstrap-expect 1\
 -data-dir /tmp/consul -config-dir /etc/consul.d -dc=aws-oregon\
 -ui-dir /var/www -atlas-join -atlas=cazorla19/infrastructure\
 -atlas-token="hfsdkjsefhoi.atlasv1.yOurAtlAsToKenHere" >> /var/log/consul/consul.log 2>&1'
Restart=always

[Install]
WantedBy=multi-user.target

You can see the difference in the start commands. The -server flag tells Consul to run the agent in server mode, so this node is a candidate for cluster leadership. The -bootstrap-expect flag configures the minimum number of server nodes needed to reach quorum in the cluster. You don’t need to create the data directory: Consul creates it itself if the process owner has the required permissions. Providing the Atlas account credentials lets nodes join the cluster automatically.

The -ui-dir flag serves the web interface from the specified location. The web interface is another cool Consul feature; I have never installed a web panel as easily as this one. By default it is accessible on port 8500, but for security and ease of access I’d suggest configuring Nginx on port 80 to proxy it with HTTP basic authentication.

server {
    listen 80 default_server;
    server_name consul.example.com;
    access_log              /var/log/nginx/consul.example.com/access.log;
    error_log               /var/log/nginx/consul.example.com/error.log;
    location / {
      proxy_pass            http://127.0.0.1:8500;
      include               /etc/nginx/proxy.conf;
      auth_basic            "Restricted";
      auth_basic_user_file  /etc/nginx/conf.d/.htpasswd;
    }
}
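The auth_basic_user_file referenced above has to exist before Nginx will let anyone in. A minimal sketch of creating it, assuming the htpasswd utility (from apache2-utils or httpd-tools) is available and “admin” is just an example username:

# create the password file used by auth_basic_user_file (prompts for a password)
htpasswd -c /etc/nginx/conf.d/.htpasswd admin
# test the configuration and reload Nginx
nginx -t && systemctl reload nginx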

Consul logs all events to stdout. We configured the systemd unit to redirect stdout to a dedicated file, /var/log/consul/consul.log. If you’re worried about it flooding the disk, you can configure logrotate to process the logs.

/var/log/consul/consul.log {
    daily
    rotate 5
    missingok
    notifempty
    copytruncate
}

Note: don’t forget to create the /var/log/consul directory beforehand, so that the log path used in the unit files is valid.
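A minimal sketch, assuming the agent runs as root as in the units above (otherwise adjust the ownership):

mkdir -p /var/log/consul

Then reload systemd and enable the units so they start at boot: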

systemctl daemon-reload
systemctl enable consul-[agent/server].service

  6. Double check that you have the configs, the units, the binary, and the log destination in place. After that, start the daemon and check the log file.

systemctl start consul-[agent/server].service

At last, the Consul cluster is set up. Run the consul members command to check that all systems are go.
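Besides consul members, you can also poke the HTTP API and the built-in DNS interface to confirm that the nodes and checks are registered. A quick sketch, assuming the default ports 8500 and 8600 on the server node:

# list checks that are currently critical (empty output means all is well)
curl http://127.0.0.1:8500/v1/health/state/critical
# the DNS lookup returns only healthy instances of the service
dig @127.0.0.1 -p 8600 nginx.service.consul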

Don’t be put off by this long manual. In reality it’s easier than it reads.

Health checks provisioning


As you can see on the screenshot, each node has more than one service. That’s because the node configuration includes more than one. By default, every node also has the Serf health check, which tracks the state of the agent itself; if the server stops responding, the Serf check shows it.

Let’s add the new services: for example, a Node.js application and an AWS Elastic Load Balancer health check.

{
    "service": {
        "check": {
            "interval": "30s",
            "script": "nc -zv 127.0.0.1 3000",
            "timeout": "3s"
        },
        "name": "myapp",
        "port": 3000
    }
}
{
    "service": {
        "check": {
            "interval": "30s",
            "script": "nc -zv app.myapp.com 443",
            "timeout": "3s"
        },
        "name": "elb",
        "port": 443
    }
}

Save them in two different files and reload Consul by running the consul reload command.
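Assuming the same layout as before, that could look like this (the file names are up to you; the /v1/agent/services endpoint just confirms that the agent picked them up):

# assuming you created myapp.json and elb.json in the current directory
mv myapp.json elb.json /etc/consul.d/
consul reload
# verify the agent now knows about the new services
curl http://127.0.0.1:8500/v1/agent/services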

Setting up a watch


OK, now we track all the services we need. But what if a service check fails? Do you just need to know about it, or can a simple step resolve it? For these kinds of questions we can use a Consul watch.

A watch can be run as a daemon with a single command, but it is much cleaner to add the watch to the existing configuration.

Let’s create a new file, /etc/consul.d/watch.json, and add the watch configuration. It will fire when a check’s state is critical or warning.

{
   "watches": [
        {
            "type": "checks",
            "state": "warning",
            "handler": "/opt/consul_handler.sh"
        },
        {
            "type": "checks",
            "state": "critical",
            "handler": "/opt/consul_handler.sh"
        }
     ]
}

Don’t forget to run consul reload.
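By the way, before committing the watch to the configuration you can try it ad hoc from the command line, as mentioned earlier. A rough sketch with the same handler:

consul watch -type=checks -state=critical /opt/consul_handler.sh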

Handler attachment


I mentioned a special handler Bash script in the watch configuration. This script is where service failures are handled or alerts are shipped. Let’s make a handler for our single node with the Nginx and Node.js services.

#!/bin/bash
# probe each service and store its exit code in a variable named after it
nginx=$(curl -s localhost > /dev/null 2>&1; echo $?)
myapp=$(nc -zv -w 3 localhost 3000 > /dev/null 2>&1; echo $?)
services=('nginx' 'myapp')
for service in "${services[@]}"; do
        # look up the exit code stored in the variable named after the service
        eval status=\$$service
        if [[ "$status" -ne 0 ]]
                then service "$service" start
        fi
done

Here we simply try to start services that have exited. Naive logic, but it works in many cases. Instead, you can also use the mail command to send an alert that a service has failed.
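If you would rather be notified than have the service restarted automatically, a minimal sketch of a mail-based handler could look like this. It assumes a working local MTA and the mailx mail command; admin@example.com is a placeholder. Consul passes the affected checks to the handler as JSON on stdin, so the simplest thing is to forward that payload:

#!/bin/bash
# the watch hands us the matching checks as JSON on stdin
payload=$(cat)
echo "$payload" | mail -s "Consul check state changed on $(hostname)" admin@example.com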

Getting bulletproof


Now we’ve got some bulletproofing in our infrastructure. Here is how it works:

  1. The monitored process is killed (try it yourself as a test);
  2. The Consul health check becomes critical;
  3. The Consul watch is triggered and calls the handler script;
  4. The handler script starts the service again.

This is just the beginning. With this scheme you can deal with much more complicated outages: enhance your health checks and put all the necessary troubleshooting logic into the handler.

Pitfalls


  1. Forced service stop. Sometimes you have to stop a service for maintenance, but that won’t work if you forget about Consul: with the configuration described above, it will automatically start the service again.
    There are two solutions. The first is to simply stop Consul on the node. The second is to add logic so that Consul understands the service is supposed to be stopped (see the sketch after this list).
  2. Server agent reboot. Recently, when I rebooted the machine running the Consul server, everything fell apart: I got thousands of RPC errors from all nodes, and leader elections kept failing. At first it looked like a disaster.
    The issue was the stale data the Consul agents were holding. This works fine in a local cloud, but in AWS EC2 the instance metadata changes after every stop/start cycle, so the cluster gets confused by the mismatch.
    To troubleshoot it, I stopped all agents, removed the data directories, and started the agents again. With fresh metadata the cluster began working again. If you use the key/value store, though, this is probably not a solution for you.
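For the first pitfall, Consul’s maintenance mode is one way to mark an intentional stop. Note that maintenance mode itself registers a critical check, so the handler would need to recognize it and skip such services. A rough sketch of the workflow, with nginx as an example:

# mark the service as being under maintenance before stopping it
consul maint -enable -service=nginx -reason="planned maintenance"
service nginx stop
# ...do the maintenance work, then bring everything back
service nginx start
consul maint -disable -service=nginx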

Future improvements


  1. Write handlers with much more reasonable logic. Track all your routine troubleshooting steps, describe them in a script, and implement them in the handler. An automated outage solution is possible, as long as the Consul agent is up (a sketch follows this list).
  2. Track the internal service state with checks. Being able to reach a remote database on port 5432 doesn’t really mean the database works; there can be outages such as transaction ID wraparound in PostgreSQL that need to be tracked as well.
  3. Work with Docker containers. A combination with Nomad, another HashiCorp product, may also be useful: Nomad is a scheduler responsible for launching containers or processes when some event happens.
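As a starting point for the first item, here is a rough sketch of a handler that acts only on the checks Consul reports instead of re-probing everything itself. It assumes jq is installed and that the service names registered in Consul match the systemd unit names:

#!/bin/bash
# Consul passes the matching checks to the handler as a JSON array on stdin.
# Restart only the services whose checks are not passing.
jq -r '.[] | select(.Status != "passing") | .ServiceName' \
    | sort -u \
    | while read -r svc; do
        [ -n "$svc" ] && systemctl restart "$svc"
      done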