Continuing with infrastructure improvements, I spent some time on monitoring. Monitoring is an inseparable part of the DevOps workflow: it lets you keep a clear view of every part of the service you maintain.
Monitoring work is interesting as well as complicated. Like databases, this field has no ready-made solution that fits every project. For each situation you have to figure out which metrics matter and which graphs to check first.
Why is it important?
If you don’t see what’s going on with your service right now and what happened before, you aren’t really maintaining it. Why? Because without tracking your service you can’t:
- follow the path of your service and check the trend of its improvement.
- make correct decisions and check how implemented decisions affect the system.
- catch emergency situations and act to prevent them.
- understand the current performance of the service and its limits.
This “cannot” list could stretch to the height of Yao Ming. But by monitoring your service and its surroundings, you can come up with improvement suggestions quickly.
In general, here is the picture of my setup.
I’ll briefly describe the components:
1. Sources. This could be the AWS API (the Python boto3 library or the AWS SDK for your favorite language). I also have the service’s PostgreSQL application database here; from it I take and calculate database health numbers and business metrics. There are the APIs of third-party tools, such as SendGrid for delivering e-mail and Twilio for sending SMS; I use them to collect data on how many resources we have spent. And there are log files for further parsing and data aggregation.
2. Python/Bash scripts. A script is a standalone program. Its main goal is to get data from the sources, parse it, and flush the fresh metrics to a file. I use Bash only for processing local log files. For everything else, Python is much faster.
3. Cron. The cron scheduler runs the scripts directly on each monitored Linux server. The schedule depends on the script: the interval between runs can be 1 minute, 1 hour, or 1 day.
4. Metric filesystem. The directory /opt/metrics on the server keeps one file per metric. It looks like the /proc pseudo-filesystem. The file contents are refreshed by the scripts that cron runs. I chose this solution following the UNIX way: our life is just a bunch of files. After all, what could be easier than storing your metrics in an ordered file tree?
5. Zabbix agent. Reads the files listed in its configuration with the cat command and extracts a single number, the metric, from each of them. It is called by the Zabbix server.
6. Zabbix server. Keeps a schedule of item checks and sends requests to the Zabbix agent at the scheduled time. It stores the received values in a MySQL database and draws graphs from the item values. Zabbix 3 can build trends from the history of your metric values.
Moreover, the Zabbix server sends emergency messages by e-mail and to the project team’s Slack channel. That’s one thing you should definitely do with your monitoring: set up a notification engine that tracks the health of your system at every moment.
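To make the "one file per metric" idea concrete, here is a minimal Python sketch of what a cron-driven script could do at the end of its run. It is an illustration only, not the code from my setup: the helper names `write_metric` and `read_metric` and the example metric name are assumptions, and only the /opt/metrics directory comes from the description above.

```python
import os
import tempfile

METRICS_DIR = "/opt/metrics"  # directory named in the post


def write_metric(name, value, base_dir=METRICS_DIR):
    """Write one number to <base_dir>/<name>, replacing the old value atomically."""
    path = os.path.join(base_dir, name)
    directory = os.path.dirname(path)
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it over
    # the target, so a reader never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(f"{value}\n")
        # mkstemp creates the file with mode 0600; the agent must be able to read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return path


def read_metric(name, base_dir=METRICS_DIR):
    """Read the number back, roughly what `cat` plus a numeric parse would give."""
    with open(os.path.join(base_dir, name)) as f:
        return float(f.read().strip())
```

A script would call something like `write_metric("app/new_users", 42)` (a made-up metric name) as its last step. On the agent side, a flexible user parameter along the lines of `UserParameter=metric.read[*],cat /opt/metrics/$1` could expose each file as an item; the key name `metric.read` is hypothetical.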
Battle of a thousand monitoring systems
Arguing about which centralized monitoring system is the leader is as funny as it is meaningless. Zabbix? Nagios? Icinga? InfluxDB? Grafana? Graphite? It’s easy to get lost in this ocean of options, and just as easy to get swallowed by useless battles after you have already picked something.
Which system is yours? That depends only on your wishes and taste, and only you can answer the question. Read about their differences, understand what’s important to you, and install the tool that covers more of what you find important.
I have been trying out Zabbix 3.2. The third version has become much better in its web interface and data visualisation, but it still lacks scalability. I’m also tired of using the same system for all projects. The next time I start from scratch, I’d rather try Prometheus.
Here is the list of metrics I started to track in my setup. I skipped the SLA stuff, because relations in our project are so friendly that we don’t have one. But I hope almost every metric here could be included in an SLA.
1. Application. That’s the first and main thing you have to care about. Every application has its own quality indicators, which should be checked regularly. My project is a web application, so the things I’m keeping an eye on are:
- payment systems healthcheck (simple HTTP requests with the requests Python 3 module);
- Facebook authentication failures (log parsing with a one-line Bash pipe);
- a list of crucial metrics and their response times (the same log parsing).
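The log-parsing items above can be sketched in Python as well (I do this part with a Bash pipe, so treat this as an equivalent illustration rather than my actual command). The log format is an assumption: five whitespace-separated fields, `timestamp method path status seconds`. The `/auth/facebook` path and both function names are hypothetical.

```python
from collections import defaultdict


def parse_line(line):
    """Split one log line into (method, path, status, seconds), or None if malformed."""
    parts = line.split()
    if len(parts) != 5:
        return None
    _timestamp, method, path, status, seconds = parts
    try:
        return method, path, int(status), float(seconds)
    except ValueError:
        return None


def summarize(lines, auth_path="/auth/facebook"):
    """Count failed Facebook logins and average the response time per path."""
    auth_failures = 0
    totals = defaultdict(lambda: [0, 0.0])  # path -> [request count, total seconds]
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that do not match the assumed format
        _method, path, status, seconds = parsed
        if path == auth_path and status >= 400:
            auth_failures += 1
        totals[path][0] += 1
        totals[path][1] += seconds
    avg_time = {path: total / count for path, (count, total) in totals.items()}
    return auth_failures, avg_time
```

Calling `summarize(open(log_path))` on a real file would return the failure count and a per-path average, and each number can then be flushed to its own metric file.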
2. Hardware. The main sysadmin mantra: CPU, memory, disk, network. I mostly stuck with the default Zabbix metrics here. They are enough to check what’s going on with your memory or network traffic. Later I’ll add some other metrics. It’s also definitely important to watch the RAID if you have one. But I don’t.
3. AWS resources. Here I worked out what we use from Amazon, which main components we want to track, and how to get their metrics. I used Python with the boto3 module (and partly the old boto). The list of monitored services:
- CloudFront (the free dashboard in the AWS console is enough)
- S3 (download/upload stats + bucket size)
- Elastic Transcoder (count, duration and size of uploaded videos)
- CodeDeploy (simple deployment stats divided by qa/staging/production groups)
- Auto Scaling groups (number of EC2 instances launched)
- ELB (a lot of metrics from the CloudWatch API)
- RDS (the same thing with CloudWatch)
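As an illustration of the CloudWatch items in this list, here is a hedged sketch of pulling one ELB metric. It is not my published code. The function name is made up, and it assumes a Classic Load Balancer, whose metrics live in the `AWS/ELB` namespace with the `LoadBalancerName` dimension. The client is passed in, so the function itself does not import boto3.

```python
from datetime import datetime, timedelta, timezone


def elb_latest_latency(cloudwatch, load_balancer_name, minutes=5):
    """Return the newest one-minute average ELB latency (seconds) seen in the
    last `minutes`, or None if CloudWatch returned no datapoints."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName="Latency",
        Dimensions=[{"Name": "LoadBalancerName", "Value": load_balancer_name}],
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=["Average"],
    )
    datapoints = response.get("Datapoints", [])
    if not datapoints:
        return None
    # Datapoints are not guaranteed to arrive sorted, so pick the newest explicitly.
    latest = max(datapoints, key=lambda point: point["Timestamp"])
    return latest["Average"]
```

With boto3 installed and AWS credentials configured, `boto3.client("cloudwatch")` can be passed as the `cloudwatch` argument. The same pattern should work for RDS by switching the namespace, metric name, and dimension.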
I hope to publish my source code on GitHub later.
4. Business metrics. The fun stuff I want to know about my service. It gives me a full understanding of how popular we are and what we can expect in the future. I get the data with SQL queries to Postgres through the Python module psycopg2. Things I have in monitoring:
- transaction count (split by cash-in/cash-out)
- total transactions balance
- transaction stats by provider
- new users count
- new users count from social networks (Facebook, Google+ etc.)
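A business metric like the transaction count can be sketched as a small function over a database connection. The table and column names (`transactions`, `direction`) are assumptions for illustration, not my real schema. The function uses only the standard DB-API cursor calls, so it should accept a psycopg2 connection as well as other DB-API drivers.

```python
def transaction_counts(conn):
    """Return {direction: count} from a hypothetical `transactions` table."""
    cur = conn.cursor()
    try:
        # No date filter here; a real script would likely limit the time range.
        cur.execute(
            "SELECT direction, COUNT(*) FROM transactions GROUP BY direction"
        )
        return {direction: count for direction, count in cur.fetchall()}
    finally:
        cur.close()
```

With psycopg2, `conn` would come from `psycopg2.connect(...)` with your own connection settings, and each value in the returned dictionary would go to its own metric file.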
5. Third-party services. Almost every well-known service of this kind has API clients for all popular programming languages, so I used the Python sendgrid and twilio modules. Here I check:
- sent SMS messages + calculated charge
- e-mail delivery requests with success rate
- error statistics by reason (spam report, provider block, invalid MX domain name, etc.)
6. Database. Database monitoring is definitely a huge topic, and I think I’m still a newbie at it. I’ll try to cover it in a future series of posts. Here I’ll just mention the things I did for PostgreSQL:
- RDS instance utilization (mentioned before, CloudWatch helps);
- index usage (pg_stat_all_indexes), to work out which indexes are bloating and which were never used;
- slow queries list;
- crucial SQL query times (EXPLAIN command).
We also have a Redis cache that takes the load of heavy SELECT queries. Here I monitor keyspace hits, misses, and the overall cache hit rate. I get this data from a Python script that issues the Redis INFO command.
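The hit rate itself is simple arithmetic over two counters from INFO: hits divided by hits plus misses. Here is a minimal sketch, assuming the INFO output has already been parsed into a dictionary. The function name is made up, while `keyspace_hits` and `keyspace_misses` are the real field names from the stats section.

```python
def cache_hit_rate(info):
    """Compute the cache hit rate (0.0 to 1.0) from parsed Redis INFO stats.

    Returns None when there have been no lookups yet, to avoid dividing by zero.
    """
    hits = int(info.get("keyspace_hits", 0))
    misses = int(info.get("keyspace_misses", 0))
    total = hits + misses
    if total == 0:
        return None
    return hits / total
```

With the redis-py client, `redis.Redis().info("stats")` returns such a dictionary. Keep in mind that both counters accumulate since the server started (or since the stats were last reset), so the result is a long-term rate; for a per-interval rate you would have to subtract the previous reading.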
1. Don’t get stuck on the choice of monitoring system. As I mentioned before, just get the facts about the competitors, analyze them, and take your pick. Improve what you have instead of thinking about what you could replace. In monitoring, the most important component is the person doing it, not the monitoring server itself.
2. Set the metric refresh interval according to your needs. Look at how quickly your metric changes and choose how frequently you want to refresh it. For example, CPU usage can spike at any moment, so I need to refresh that metric every minute. Twilio is different: we send only a couple of SMS a day, so I don’t need to collect zeros almost every moment, and a daily refresh is enough.
3. Build general screens. A screen, as a board of graphs and triggers, lets you check the full picture of a monitored subject. For a load balancer, for example, you can collect the request count, latency, connection errors, and so on. Or you could base a screen on the utilization of one resource, such as memory.
Almost every system offers a way to build simple dashboards like these, so feel free to use it.
4. Share it with the team. Tell your stakeholders and developers what you monitor. The former will be interested in analyzing things like the new user count or transaction statistics. The latter will care about hardware utilization and database behavior, and they won’t be surprised when a service component goes down or the database slows down.
5. Something broke? Monitor it! After troubleshooting and analyzing how the component works, set a trigger on the issue. Over time you’ll get flexible monitoring that lets you know about any fault in your system. Example: when you run into trouble with Facebook authentication, write that failure to a log file and set a trigger on it. In such cases you’ll also need help from the developers. If the problem recurs, you’ll be notified immediately. This kind of notification gives you plenty of room for further troubleshooting.
Bonus: my GitHub repo with AWS monitoring.