Grok patterns for Logstash: how to write them


One of the most important Ops goals is keeping systems healthy, and there is no easier way to check on a system than reading its log files. You can do it directly from the terminal, filtering with grep, sed, and awk, but it's easy to get confused and lose track of the real problem. That's where log management systems come to the rescue.

This article is about filtering log records with the "grok" pattern engine. It includes samples of grok patterns and Logstash filters, but first you need to understand the log processing pipeline as a whole.

What is grok?

Grok is a pattern engine that parses logs against templates. Log records produced by a given piece of software usually share the same format. The idea behind grok is to turn well-formed log records into a structure suitable for further automatic processing.

What is Logstash?

Logstash is log processing software. It can handle any log file format: essentially, Logstash is a bundle of filters, one per format. With grok patterns you can configure those filters with extra features such as time tracking, GeoIP lookups, and so on.

Generally, a whole log management server consists of:

  1. Filebeat on the nodes, which ships logs to the server.
  2. Logstash, which receives logs from the nodes' Filebeat instances over an SSL-secured connection. The main idea of Logstash is described above. Besides that, I'd like to point out the input and output sections of the Logstash configuration, where you specify the interconnected software: the input is usually Filebeat, the output Elasticsearch.
  3. Elasticsearch as a document-oriented NoSQL database (you can use it for a myriad of other purposes too). In this setup Elasticsearch receives JSON-formatted log records and stores them. If all the management tools run on one host, you don't really need any extra configuration.
  4. Kibana as the frontend application for log visualisation. It provides browser access to the log records stored in Elasticsearch. With Kibana you can discover logs, filter them as easily as it gets, build graphs on any field, and so on.

A clear ELK description and installation instructions can be found in the official documentation. Here I just want to briefly point out the role grok patterns play in this scheme.

I’ve got JSON-formatted logs. So what?

Now you no longer need to scroll log files manually and stop at every strange record. Everything is automated, visualised, and tidy. That is grok's contribution too.

Pattern syntax

The grok pattern syntax is simple. Each field starts with a "%" symbol and is enclosed in braces. Fields can follow one another directly or be separated by a delimiter: a whitespace, a semicolon, a colon, whatever. Each field has two parts separated by a colon.

The first part is the field type. It could be an integer, a word, an IP address, or anything else; the basic log field types cover most cases. If you have basic regexp knowledge, you can easily pick a suitable type. But you can also build your own field types for special needs: just declare such a type in the patterns file referenced by your Logstash filter and use it in the main pattern below it. Of course, a field can include other fields.
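For instance, a custom type is declared in the patterns file as a name, a space, and its definition, which may itself reuse built-in grok types such as YEAR or TIME (the names below are hypothetical examples, not part of the standard set):

```
SESSIONID [A-F0-9]{16}
NGINXERRORTIME %{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME}
```

After that, %{SESSIONID:session} or %{NGINXERRORTIME:timestamp} can be used in the main pattern like any built-in type.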

The second part is the field name. In the resulting JSON key-value document it becomes the key. Elasticsearch indexes records by this name, and Kibana shows it as a table column. Names make discovery clearer: you start to recognise what each part of the message is for.
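Under the hood, each grok field translates into a named regular-expression capture group. Here is a minimal sketch in Python that mimics the %{TYPE:name} idea; it is a toy illustration with a hand-picked type table, not how Logstash itself is implemented:

```python
import re

# A few grok-like types mapped to plain regular expressions (simplified).
GROK_TYPES = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "NUMBER": r"\d+",
}

def grok_to_regex(pattern: str) -> str:
    """Replace each %{TYPE:name} with a named capture group."""
    def repl(m):
        grok_type, name = m.group(1), m.group(2)
        return f"(?P<{name}>{GROK_TYPES[grok_type]})"
    return re.sub(r"%\{(\w+):(\w+)\}", repl, pattern)

regex = grok_to_regex("%{IP:client} %{WORD:method} %{NUMBER:status}")
match = re.match(regex, "55.3.244.1 GET 200")
print(match.groupdict())
# {'client': '55.3.244.1', 'method': 'GET', 'status': '200'}
```

The resulting dictionary is exactly the kind of key-value structure that gets serialised to JSON and shipped to Elasticsearch.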

Write your first pattern

The goal is to structure Nginx logging. A record in the combined Nginx log format looks like this:

- - [23/Aug/2010:03:50:59 +0000] "POST /wordpress3/wp-admin/admin-ajax.php HTTP/1.1" 200 2 "" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.25 Safari/534.3"

First we have to understand how the fields are separated and what each one stands for. The delimiters are not uniform: some fields are divided by whitespace, while others contain whitespace themselves, so those are enclosed in quotes, and the quotes are delimiters rather than data. Field by field:

Client IP address
Authenticated user ID (blank, "-", in this case)
Authenticated user name (blank, "-", in this case)
23/Aug/2010:03:50:59 – access date and time
+0000 – server timezone (UTC here)
"POST /wordpress3/wp-admin/admin-ajax.php HTTP/1.1" – HTTP request (method and requested page) with protocol version
200 – response status code
2 – size of the response in bytes
"" – referrer URL of the web client (blank here)
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.25 Safari/534.3 – user agent description

Now we want to structure this record and cut off the useless parts. First, create a new pattern file. It belongs in the patterns directory referenced by the Logstash filter configuration, usually /etc/logstash/patterns. Then write a personal pattern for Nginx log records:

NGINX %{IPORHOST:nginx_clientip} (?:%{USER:nginx_user_ident}|-) (?:%{USER:nginx_user_auth}|-) \[%{HTTPDATE:timestamp}\] "(?:%{WORD:nginx_http_request} %{URIPATHPARAM:nginx_request_desc}(?: HTTP/%{NUMBER:nginx_http_version})?|-)" %{NUMBER:nginx_response} (?:%{NUMBER:nginx_bytes}|-) "(?:%{URI:nginx_referrer}|-)" %{GREEDYDATA:user_agent}

Explaining notes:

  1. The square brackets in \[%{HTTPDATE:timestamp}\] are escaped with backslashes, as the regexp rules require. The brackets, like every literal symbol between fields, won't be delivered to Elasticsearch; that keeps the stored records readable.
  2. The construction (?:%{USER:nginx_user_auth}|-) is a condition: if the field is not "-", it is delivered as nginx_user_auth; otherwise it is dropped.
  3. The construction (?: HTTP/%{NUMBER:nginx_http_version})? extracts only the number of the HTTP protocol version for shipping. The literal "HTTP" is left out (everyone already knows it's HTTP). Frankly, I'd suggest skipping the version entirely, because it's almost always 1.1, and 2.0 isn't coming soon.
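To see what the pattern extracts, here is a sketch in Python using a hand-written regexp equivalent of the NGINX grok pattern above (simplified: grok types are replaced by plain character classes, and the sample log line is a hypothetical one with every field filled in, including a made-up client IP and user):

```python
import re

# Simplified regexp equivalent of the NGINX grok pattern; field names mirror it.
NGINX_RE = re.compile(
    r'(?P<nginx_clientip>\S+) '
    r'(?P<nginx_user_ident>\S+) '
    r'(?P<nginx_user_auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<nginx_http_request>\w+) (?P<nginx_request_desc>\S+)'
    r'(?: HTTP/(?P<nginx_http_version>[\d.]+))?" '
    r'(?P<nginx_response>\d+) '
    r'(?P<nginx_bytes>\d+|-) '
    r'"(?P<nginx_referrer>[^"]*)" '
    r'"(?P<user_agent>[^"]*)"'
)

line = ('192.0.2.10 - frank [23/Aug/2010:03:50:59 +0000] '
        '"POST /wordpress3/wp-admin/admin-ajax.php HTTP/1.1" 200 2 '
        '"-" "Mozilla/5.0"')

fields = NGINX_RE.match(line).groupdict()
print(fields["nginx_response"])      # 200
print(fields["nginx_http_version"])  # 1.1
```

Note how the literal "HTTP/" and the square brackets around the timestamp are matched but not captured, just as in the grok pattern.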

The next step is to reference this pattern from a Logstash filter file. Create a new file in /etc/logstash/conf.d or append the filter to an existing file; let's take the first option. The config file should have a .conf extension and a number prefix that sets the ordering. So, create a 10-nginx-filter.conf file and add these lines:

filter {
      if [type] == "nginx" {
          grok {
              patterns_dir => "/etc/logstash/patterns"
              match => { "message" => "%{NGINX}" }
              named_captures_only => true
          }
          date {
              match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
          }
          geoip {
              source => "nginx_clientip"
          }
      }
}

Explaining notes:

  1. The log type is checked first, right after the Filebeat input. You can set a type for each node in its Filebeat config file.
  2. patterns_dir points at the directory that holds your pattern; here it's the default, /etc/logstash/patterns.
  3. The named_captures_only option permits only named fields in the further output.
  4. The match option applies the pattern to the "message" field of each record.
  5. The date directive chooses the field used for time tracking and converts it to the standard Logstash timestamp format. The format string must match the field's contents, so for an HTTPDATE timestamp it is "dd/MMM/yyyy:HH:mm:ss Z".
  6. geoip is one of Logstash's features: give it the client IP address field and it will track the client's location. Additional geoip fields such as city, state, and country are put into the JSON automatically. With geoip you can pick out records by client location and determine the source of a threat to the server: a client could be a source of spamming, snooping, etc.
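After the filter runs, Elasticsearch receives a JSON document per log record. Roughly, it looks like the sketch below; the exact set of metadata and geoip fields varies by Logstash version, and every value here is a hypothetical example, not real output:

```python
import json

# Hypothetical sketch of a document shipped to Elasticsearch after filtering.
doc = {
    "@timestamp": "2010-08-23T03:50:59.000Z",   # set by the date filter
    "type": "nginx",                            # set in the Filebeat config
    "nginx_clientip": "192.0.2.10",
    "nginx_http_request": "POST",
    "nginx_request_desc": "/wordpress3/wp-admin/admin-ajax.php",
    "nginx_response": "200",
    "nginx_bytes": "2",
    "geoip": {"country_name": "United States"},  # added by the geoip filter
}
print(json.dumps(doc, indent=2))
```

Every key here comes from a field name in the grok pattern, which is exactly why choosing readable names matters.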

Now you can easily sort through your logs and troubleshoot without worry. Have fun, but keep Logstash's Java memory requirements in mind 🙂

Bonus: here is my GitHub repo with Logstash filters and grok patterns. You're welcome to take a look if you're interested.

