Introducing Versions: Develop data products using Git. Join the waitlist

Publish SQL-based endpoints on NGINX log analysis

Building a highly scalable log analytics tool with Tinybird and exporting your queries as an API.
Javier Santana
Jan 28, 2021
  min read

While Tinybird is not a log analysis tool per se, it’s a tool that you can use as the foundation to analyze logs-like datasets at scale. Here you’ll learn:

  • how to use it to analyze NGINX logs
  • how to publish your SQL queries as API endpoints to be used in other tools
  • and how to model your data sources and endpoints to make it 10X faster.

An NGINX log looks like this:

It’s not a CSV file but a “space separated file” so in principle Tinybird wouldn’t load it because we just support CSV files. But if we analyze the file using our CLI tool:

The delimiter is guessed so it can be loaded.

The format looks like this:

We need to transform it to something useful. To transform data we use a pipe.

Now let’s query the new pipe:

You could now publish an endpoint:

So far so good. But what happens when you have millions of entries a day, the generated endpoints wouldn’t be as fast as needed. For example, for endpoints serving real time dashboards, where the user can pick different date ranges, add filters and so on, waiting for more than a second is not the best solution. So there is an easy solution: generate views as data arrives.

So, let’s create a Materialized view (MV from now on) for the previous endpoint:

There are many things to notice:

  • The materialized view SQL is almost the same than the endpoint but instead of count it uses countState and the same for avg. Those -State modifiers are needed when pushing data to a MV. The datasource where the MV pushes the data has two columns with a special data type to store those {% code-line %}-State{% code-line-end %}
  • The Engine is not the regular MergeTree, it’s an {% code-line %}AggregatingMergeTree{% code-line-end %} that tells ClickHouse to aggregate columns on merge operations (executed periodically, you don’t need to worry about that)
  • The sorting key tells ClickHouse which column we are grouping by

We push the files

We use {% code-line %}--push-deps{% code-line-end %} to upload all the dependencies and {% code-line %}--populate{% code-line-end %} to fill the materialized view with the data already in {% code-line %}log_transform{% code-line-end %}

Now we check the amount of rows in that table:

If we push new data with new IP addresses, the amount of rows will rise but, comparing the amount of rows in the original table with the MV datasource, you can understand why it is going to be faster. And the good thing - this table is updated when you push new data, it does not need to be recalculated.

The last thing is to change the endpoint to use the new table

Let’s test the endpoint and compare with the old one

Both are really fast but the last one is one order of magnitude faster.

You could create MV for any case you can think of, for example:

  • unique ip addreeses by day: {% code-line %}select toDate(time) day, uniqState(ip_address) uniq_ip from log_transform group by day{% code-line-end %}
  • percentile 95 of payload size per hour: {% code-line %}select toStartOfHour(time) hour, quantileState(0.95)(ip_address) q95 from log_transform group by hour{% code-line-end %}
  • requests per month: {% code-line %}select toStartOfMonth(time) month, countState() requests_count from log_transform group by month{% code-line-end %}

Materialized views are a really powerful feature specially when you are creating real time endpoints.

We created a github repo with the data project so you can reproduce the post commands in your Tinybird account and start analysing your own logs.