Introducing Versions: Develop data products using Git. Join the waitlist

Starting with Kafka

I just want to share my thoughts on Kafka after using it for a few months, always from a practical point of view. I don’t know anything more than the basics ...
Javier Santana
Jun 25, 2021
  min read

I just want to share my thoughts on Kafka after using it for a few months, always from a practical point of view. I don’t know anything more than the basics about Kafka internals. I knew the principles of log processing and so on but had never got my hands dirty. So this is just the opinion of a newcomer, not an expert.

Everything began because we started seeing event-gathering use cases: a website, with a reasonably high amount of traffic that wants to gather events, to measure in real-time what’s going on. Like a customized version of Google analytics but without sampling. We’ve spent the last few months working on an integration with Kafka to cover these use cases.

I have to admit that I was biased against Kafka, I thought it was a large, complex piece of technology that added nothing but an extra layer of complexity.

I still think the same, it is complex, adding Kafka to your stack will add a bunch of services and some extra decisions to make about configuration and so on. In most cases, a single process app sending data to a file and a small HTTP server serving it will perform much better with less downtime. If you compare the burden of understanding, setting-up, configuring and starting to use compared to, for example, Redis, it’s a massive amount of work.

Of course, when things get difficult is when Kafka shines and all that complexity gives you the flexibility to tackle massive amounts of load and complex producer, consumer, processor use cases. But even in this situation, you’ll need to know much more than just Kafka 101.

To be honest, this is true no matter what tech you use, a simple load balancer can become a nightmare to maintain under certain circumstances. High-load software projects always get complex somehow, in the same way that an F1 car needs a highly complex system, a good bunch of engineers and mechanics, and constant maintenance to go fast, so does a software project.

The key here is how soon you hit the complexity; ideally, a tool should become complex as your project does.

Anyway, the most interesting thing about Kafka is that it enables something pretty useful for projects at scale: a clear way to coordinate and measure and scale data consumption. I’m wondering why HTTP load balancers didn’t get the consumer approach of Kafka. And the same kind of metrics. And a non-HTTP protocol and a way to coordinate workers :)

Coordinate: Kafka doesn’t only rely on the server to do things, clients know things like “What partition I should consume.” or load-balancing between brokers.

Scale: when consuming it is clear when you need more computing power just by looking at the number of unprocessed messages. This sounds obvious to do but it’s not, at least to me. This is a pretty old queueing problem that is useful to understand and measure (there is a lot of theory written about queueing from the early days in telecommunication systems) but it’s usually hidden in most pieces of software.

Kafka is not the only tool that does these things, I guess all message brokers do this but I never used them.

Anyway, some practical information that I didn’t know but wish I had known:

  • You need to know more things than I expected to be able to use it: partitions, messages, assignments, brokers, consumers, different kinds of timeouts… and that’s just the surface.
  • Setting up a cluster by yourself is a pain in the ass. Using Confluent cloud is expensive but it just works, that’s why people use it.
  • Even for a local install, Kafka is a PITA. I recommend using Redpanda, a single command and you have a Kafka compatible API up and running. No Zookeeper, no docker-compose up, no Kubernetes.
  • Kafka does not have a good HTTP interface (for a good reason). HTTP Proxy is the default choice but Confluent cloud does not provide it (Redpanda does). I think using a custom protocol is the right choice but there is an impedance mismatch between HTTP based systems and Kafka (although not in all cases).
  • Partition number, compression, and all the settings become relevant when you are in trouble.
  • Managing errors in consumers is hard. The good part is that the data is in Kafka so you can replay it but the bad part is that if the problem is a logic problem (not a temporal glitch) you are in bigger trouble. This article explains it pretty well.

So, if you ask me, you need to know what you are doing before you add Kafka to your stack but it does bring really good architectural patterns when your platform usage starts to grow.

We have been happily ingesting millions of events per minute for some months now, without any major problems. Actually, Kafka saved us from some complicated situations. Probably a fast HTTP server writing batches to a file would also have worked but I think that when you want to scale things up without much pain Kafka is a good solution.