
Real-Time Centralized Log Analysis With BigQuery & Go

Logs contain a massive amount of important information about your application. But that information is mixed in with vast amounts of noise. How do you surface what's important?

By streaming all your logs into Google's BigQuery, you can run real-time SQL queries on your data for cheap (or even free!).

Example Queries

Are there any bots running wild on your site?

SELECT 
  ip,
  COUNT(*) AS num 
FROM [logs_appname.logs] 
WHERE 
  log_type = 'nginx-access' 
GROUP BY ip 
ORDER BY num DESC
[
  {
    "ip": "XXX.XX.XXX.XXX",
    "num": "205184"
  },
  {
    "ip": "XXX.XX.XXX.XX",
    "num": "52638"
  },
  {
    "ip": "XXX.XXX.XXX.XX",
    "num": "52583"
  }
]

How quickly is your app server rendering pages?

SELECT 
  NTH(50, QUANTILES(response_time, 100)) AS median,
  NTH(75, QUANTILES(response_time, 100)) AS seventy_fifth,
  NTH(90, QUANTILES(response_time, 100)) AS ninetieth,
  NTH(99, QUANTILES(response_time, 100)) AS ninety_ninth
FROM [logs_appname.logs] 
WHERE
  log_type = 'nginx-access'
[
  {
    "median": "0.007",
    "seventy_fifth": "0.021",
    "ninetieth": "0.161",
    "ninety_ninth": "1.135"
  }
]

Or your most common 404 errors:

SELECT
  path,
  COUNT(*) AS num 
FROM [logs_appname.logs] 
WHERE
  log_type = 'nginx-access' 
  AND status = 404 
GROUP BY path 
ORDER BY num DESC
[
  {
    "path": "/old-blog-post",
    "num": "271105"
  },
  {
    "path": "/images/broken-image.png",
    "num": "135585"
  },
  {
    "path": "/favicon.ico",
    "num": "52595"
  }
]

Formatting Your Logs

The first step is to output your logs in JSON format. This will help in the (basic) processing we will do before streaming them into BigQuery.

This is quite easy to do for nginx:

log_format logstalker '{"ip":"$remote_addr", "timestamp":"$time_local",'
                      '"domain":"$http_host", "request":"$request", "status":$status,'
                      '"referrer":"$http_referer", "user_agent":"$http_user_agent",'
                      '"response_time":$request_time}';

server {
    listen 80;
    root /srv/app/current/public;

    access_log /var/log/nginx/access.log logstalker;
    error_log /var/log/nginx/error.log error;
}
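
With this configuration, every request is written out as a single line of JSON. A sample entry (the values here are purely illustrative) looks something like:

{"ip":"203.0.113.7", "timestamp":"20/Mar/2017:10:15:32 +0000", "domain":"example.com", "request":"GET /old-blog-post HTTP/1.1", "status":404, "referrer":"-", "user_agent":"Mozilla/5.0", "response_time":0.007}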

For application logs this is harder, as you generally cannot configure their format. For Rails apps I built a custom logger that outputs everything in JSON (and includes a lot more information about your app's performance): logstalker_rails.

Loading Them Into BigQuery

Once you have your log files in JSON, the next step is to get them into BigQuery.
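
The destination table needs a schema that matches these JSON fields (and the logs_appname dataset itself has to exist already). As a rough sketch — the project name, field list, and column types below are assumptions based on the queries and the nginx format above, not the exact schema I use — creating the table with the google.golang.org/api/bigquery/v2 client might look like this:

package main

import (
  "context"
  "log"

  "golang.org/x/oauth2/google"
  bigquery "google.golang.org/api/bigquery/v2"
)

func main() {
  ctx := context.Background()

  // Authenticate with Application Default Credentials.
  httpClient, err := google.DefaultClient(ctx, bigquery.BigqueryScope)
  if err != nil {
    log.Fatal(err)
  }
  service, err := bigquery.New(httpClient)
  if err != nil {
    log.Fatal(err)
  }

  // Columns mirror the JSON log fields. "my-project" is a placeholder, and
  // the TIMESTAMP column assumes the tailer converts nginx's $time_local
  // into a BigQuery-friendly format.
  table := &bigquery.Table{
    TableReference: &bigquery.TableReference{
      ProjectId: "my-project",
      DatasetId: "logs_appname",
      TableId:   "logs",
    },
    Schema: &bigquery.TableSchema{
      Fields: []*bigquery.TableFieldSchema{
        {Name: "log_type", Type: "STRING"},
        {Name: "ip", Type: "STRING"},
        {Name: "timestamp", Type: "TIMESTAMP"},
        {Name: "domain", Type: "STRING"},
        {Name: "request", Type: "STRING"},
        {Name: "path", Type: "STRING"},
        {Name: "status", Type: "INTEGER"},
        {Name: "referrer", Type: "STRING"},
        {Name: "user_agent", Type: "STRING"},
        {Name: "response_time", Type: "FLOAT"},
      },
    },
  }

  if _, err := bigquery.NewTablesService(service).Insert("my-project", "logs_appname", table).Do(); err != nil {
    log.Fatal(err)
  }
}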

One option would be to bulk load them hourly or daily... but it's 2017 and we want these logs streamed in real-time.

So I wrote a program that tails log files and does just that. In my experience, log entries are available for querying within 5-10 seconds of being generated on the server. Not bad!

This program was written in Go and can be downloaded from GitHub.

While it does some light processing of the log entries, the gist of it is:

// Start at the end of the log file (Whence: 2 == io.SeekEnd) and follow it
// as new lines are written.
seek := tail.SeekInfo{Offset: 0, Whence: 2}
t, err := tail.TailFile(config.LogFilename, tail.Config{
  Location: &seek,
  Follow:   true,
  Logger:   tail.DiscardingLogger,
})
if err != nil {
  log.Fatal(err)
}

// Parse each new line and stream it to BigQuery in its own goroutine,
// skipping lines that fail to parse.
for line := range t.Lines {
  parsed, err := parserFn(config.Host, strings.Replace(line.Text, "\\", "", -1))
  if err == nil {
    go stream(&config, tabledataService, parsed)
  }
}
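
The stream call above is what hands each parsed entry to BigQuery's streaming API (tabledata.insertAll). The full implementation is in the repository; a minimal self-contained sketch of the idea — using the same bigquery/v2 client as in the schema sketch above, with placeholder project, dataset, and table names — might look like:

package main

import (
  "context"
  "log"

  "golang.org/x/oauth2/google"
  bigquery "google.golang.org/api/bigquery/v2"
)

// streamRow sends one parsed log entry to BigQuery's streaming insert
// (tabledata.insertAll) endpoint. Project, dataset, and table names are
// placeholders. A production version would also inspect the response's
// InsertErrors field.
func streamRow(tabledata *bigquery.TabledataService, row map[string]bigquery.JsonValue) error {
  req := &bigquery.TableDataInsertAllRequest{
    Rows: []*bigquery.TableDataInsertAllRequestRows{{Json: row}},
  }
  _, err := tabledata.InsertAll("my-project", "logs_appname", "logs", req).Do()
  return err
}

func main() {
  ctx := context.Background()

  // Same Application Default Credentials setup as in the schema sketch above.
  httpClient, err := google.DefaultClient(ctx, bigquery.BigqueryScope)
  if err != nil {
    log.Fatal(err)
  }
  service, err := bigquery.New(httpClient)
  if err != nil {
    log.Fatal(err)
  }
  tabledata := bigquery.NewTabledataService(service)

  // Stream one example row; the real program does this once per log line.
  row := map[string]bigquery.JsonValue{
    "log_type":      "nginx-access",
    "ip":            "203.0.113.7",
    "path":          "/old-blog-post",
    "status":        404,
    "response_time": 0.007,
  }
  if err := streamRow(tabledata, row); err != nil {
    log.Println("insert failed:", err)
  }
}

Streaming each row in its own goroutine keeps the tail loop from ever blocking on a network call; the trade-off is that delivery order is not guaranteed, which is fine for logs.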

BigQuery

If you have never used BigQuery, I cannot recommend it enough. Even for the relatively high-traffic client we set this up for, the cost of storing and analyzing these logs is less than $100 per month.

And unlike the Amazon-hosted databases DynamoDB and Redshift, you do not have to reserve capacity up front. BigQuery is truly pay as you go, with a very generous free tier and storage as low as 2 cents per GB.

I have run queries over hundreds of gigabytes of data, with complex joins, and the results were returned in less than 5 seconds. BigQuery is amazing!