Many people see Kafka as a messaging system, but in reality it’s more than that: it’s a distributed streaming platform. It can be used as a traditional messaging platform, but the extra capabilities also make it more complex.
In this post we’ll introduce the main concepts present in Kafka and see how they can be used to build different applications, from traditional publish/subscribe all the way up to streaming applications.
This is (kind of) the “big picture” of the concepts used in Kafka. It might seem overwhelming, but don’t worry: we’ll cover all of them in detail below.
Kafka is message-based: the main elements it deals with are messages. A message is a simple key/value pair. The key and the value can be anything serialisable, such as sensor measurements, meter readings or user actions.
Messages are published into topics. A topic is a logical grouping of related messages. It can be thought of as a channel. A topic can be “water meter readings” or “user clicks”.
Physically, topics are split into partitions. A partition lives on a physical node and persists the messages it receives. A partition can be replicated onto other nodes in a leader/follower relationship. There is only one “leader” node for a given partition, and it accepts all reads and writes; in case of failure a new leader is chosen. The other nodes just replicate messages from the “leader” to ensure fault-tolerance. The important thing to note here is that message ordering makes sense only inside a partition. There is no global ordering across a topic as a whole, only a separate order inside each partition.
Producers choose the topic and the partition into which they publish their messages. Because ordering is available only inside a partition, choosing the right partition is often a key factor in the application design.
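A common way to pick the partition is to derive it from the message key, so that all messages with the same key land in the same partition and stay ordered relative to each other. The sketch below illustrates the idea in plain Python. The topic, the meter ids and the choose_partition helper are hypothetical, and zlib.crc32 is only a stand-in hash (Kafka’s default partitioner uses a different hash function).

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition; the same key always maps to the same partition."""
    # zlib.crc32 is a stand-in for illustration, not the hash Kafka uses.
    return zlib.crc32(key) % num_partitions

# Hypothetical topic with 3 partitions, each modelled as a plain list.
partitions = [[] for _ in range(3)]

readings = [("meter-1", 10), ("meter-2", 7), ("meter-1", 12),
            ("meter-2", 9), ("meter-1", 15)]

for meter_id, value in readings:
    p = choose_partition(meter_id.encode("utf-8"), len(partitions))
    partitions[p].append((meter_id, value))

# All readings of a given meter sit in one partition, in publication order.
meter_1_partition = choose_partition(b"meter-1", len(partitions))
meter_1_values = [v for k, v in partitions[meter_1_partition] if k == "meter-1"]
```

With this scheme the readings of one meter are totally ordered, while readings of different meters may end up in different partitions with no ordering between them.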
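Finally, consumers are organised into consumer groups. Within a group, each partition of the subscribed topics is assigned to exactly one consumer, so every consumer typically owns one or more partitions (a consumer can also sit idle when the group has more consumers than there are partitions). Consumers read messages only from their assigned partitions.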
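To make the assignment rule concrete, here is a minimal sketch of a round-robin assignment inside one consumer group. It is only an illustration: the assign_partitions function and the consumer names are hypothetical, and Kafka offers several assignment strategies whose results can differ from this one.

```python
def assign_partitions(partitions, consumers):
    """Round-robin: each partition goes to exactly one consumer of the group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment

topic_partitions = [0, 1, 2, 3]

# Two consumers share four partitions.
two = assign_partitions(topic_partitions, ["consumer-a", "consumer-b"])

# Five consumers for four partitions: one consumer gets nothing to read.
five = assign_partitions(topic_partitions, ["c1", "c2", "c3", "c4", "c5"])
```

The second call shows why adding consumers beyond the number of partitions does not add throughput: the extra consumer has no partition to read from.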
Unlike message queues, where messages are removed from the queue once they are consumed, consuming a message in Kafka doesn’t remove it from the partition. Instead, each consumer maintains an offset marking the messages it has already consumed. In case of failure, the consumer can then restart processing from where it left off. Offset management is also an important design decision, as it affects the delivery guarantees.
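The following sketch models a partition as an append-only list to show the difference with a queue: reading leaves the log untouched, and the only thing that moves is the consumer’s offset. The Partition class and its method names are hypothetical and are not part of any Kafka client.

```python
class Partition:
    """Append-only log: reading never removes messages."""

    def __init__(self):
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset of the message just written

    def read_from(self, offset, max_messages=10):
        return self.log[offset:offset + max_messages]

partition = Partition()
for message in ["a", "b", "c", "d"]:
    partition.append(message)

# The committed offset is the position of the next message to read.
committed_offset = 0
first = partition.read_from(committed_offset, max_messages=2)
committed_offset += len(first)

# Simulated restart: a fresh consumer only knows the committed offset.
after_restart = partition.read_from(committed_offset)
```

After the restart the consumer continues with the messages it has not seen yet, while all four messages are still stored in the partition.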
That’s quite a lot of information. But fear not: we’re going to see how to apply these concepts to build different messaging patterns. Let’s start with the simplest one: a single-partition topic with a single consumer.
Single partition / Single consumer
This is the simplest pattern by far. The producer publishes messages into the single-partition topic and the consumer consumes the messages from that single partition.
Of course, as there is a single partition, this solution doesn’t scale. However, if you don’t need scaling, it provides a total ordering of the messages, which can be really helpful depending on your use case.
If there are multiple Kafka nodes, the partition can be replicated to provide fault-tolerance. However, we need to restart the consumer if the partition leader fails. A supervisor for the consumer might come in handy to restart it when the partition topology changes.
Single partition / Multiple consumers
This is a simple variation of the above where multiple consumers consume from the same partition. As messages are not removed from Kafka when they are consumed, it’s possible to add more than one consumer, each maintaining its own message offset. Of course, the consumers must be in different consumer groups; otherwise only one of them would be assigned the partition. This pattern is also known as fan-out.
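As a rough illustration of fan-out, the sketch below keeps one offset per consumer group over the same single partition. The group names and the poll helper are hypothetical and only mimic the behaviour described above.

```python
# The single partition, modelled as a list of messages.
log = ["click-1", "click-2", "click-3"]

# One committed offset per consumer group (group names are made up).
offsets = {"billing": 0, "analytics": 0}

def poll(group, max_messages):
    """Return the next messages for a group and advance only that group's offset."""
    start = offsets[group]
    batch = log[start:start + max_messages]
    offsets[group] = start + len(batch)
    return batch

billing_batch = poll("billing", 3)      # reads everything
analytics_batch = poll("analytics", 1)  # reads at its own pace
```

Each group sees every message, and a slow group does not hold back a fast one because their offsets are independent.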
At-least-once delivery
To achieve at-least-once delivery, the producer must make sure its messages have been persisted in Kafka. Otherwise, if the node fails after receiving a message but before persisting it, the message is lost and at-least-once delivery cannot be guaranteed. Fortunately, the producer can wait for its message to be acknowledged by Kafka, which confirms that it has been persisted.
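The producer side of this guarantee can be sketched as “resend until acknowledged”. The toy broker below is entirely hypothetical (FlakyBroker and send_with_ack are not Kafka APIs); in a real client this behaviour is driven by the producer’s acknowledgement settings, such as the acks configuration.

```python
class FlakyBroker:
    """Toy broker that drops the first `failures` writes before persisting them."""

    def __init__(self, failures):
        self.failures = failures
        self.log = []

    def send(self, message):
        if self.failures > 0:
            self.failures -= 1
            return False          # no acknowledgement: nothing was persisted
        self.log.append(message)
        return True               # acknowledgement: the message is in the log

def send_with_ack(broker, message, max_retries=5):
    """Resend until the broker acknowledges; give up after max_retries attempts."""
    for attempt in range(1, max_retries + 1):
        if broker.send(message):
            return attempt
    raise RuntimeError("message was never acknowledged")

broker = FlakyBroker(failures=2)
attempts = send_with_ack(broker, "reading-42")
```

In this toy model a missing acknowledgement always means the message was not stored. In a real system the acknowledgement itself can be lost, in which case a retry produces a duplicate, which is exactly why the guarantee is “at least once”.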
On the consumer side, Kafka doesn’t push messages to consumers. Instead, it’s the consumers who pull messages from Kafka. To improve efficiency, messages are not pulled one by one but in (mini) batches. For example, every few seconds the consumer polls for any messages published after a given offset, and Kafka returns the corresponding batch of messages. Once all the messages have been processed, the consumer confirms the batch to Kafka, the offset for the batch is committed in Kafka, and the next poll request returns the following messages. Now assume that the consumer fails in the middle of a batch (i.e. some messages have been processed and others haven’t). When the consumer recovers, it starts again from the beginning of that same batch, reprocessing some messages that were already processed. This is at-least-once delivery.
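The failure scenario described above can be simulated in a few lines. This is a simplified model, not client code: the log, the state dictionary and the run_consumer function are hypothetical, and the “crash” is just an early return before the commit.

```python
log = ["m0", "m1", "m2", "m3", "m4", "m5"]
state = {"committed": 0}   # offset stored on the Kafka side
processed = []             # side effects performed by the consumer

def run_consumer(batch_size, crash_after=None):
    """Process one batch; the offset is committed only after the whole batch."""
    start = state["committed"]
    batch = log[start:start + batch_size]
    for i, message in enumerate(batch):
        if crash_after is not None and i == crash_after:
            return False               # simulated crash: nothing is committed
        processed.append(message)
    state["committed"] = start + len(batch)
    return True

run_consumer(batch_size=3, crash_after=2)  # handles m0 and m1, then crashes
run_consumer(batch_size=3)                 # restart: the same batch is replayed
```

After the restart, m0 and m1 have been processed twice and m2 once: no message is lost, but some are seen more than once.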
Multiple partitions / Multiple consumers
As partitions and consumer groups are managed by Kafka, there is not much to change in the application code. However, as ordering works on a per-partition basis, we must make sure that our application is compatible with this constraint.
However, because of the “at-least-once” delivery guarantee, this pattern is only safe for stateless processing, where handling the same message twice does no harm. If we need stateful processing, we can no longer have Kafka maintain the offsets for us. Instead, the application must manage its offsets itself. This is known as self-managed offsets.
The idea is that once a message has been processed, the application persists its state along with the message offset into persistent storage. Upon failure, the application recovers from the last persisted offset, even if that is in the middle of a batch, and skips any messages that were already processed. This is (almost) exactly-once delivery.
It’s almost exactly once because if there is a failure after the application has computed the new state but before the state (and the offset) is persisted, then the application will have to reprocess the last message (because its result wasn’t saved). This may or may not be an issue, depending on the side effects performed during the state computation.
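Here is a minimal sketch of self-managed offsets, assuming the state is a running total and the persistent storage is a local JSON file. The file name, the checkpoint layout and the helper functions are all hypothetical; a database transaction would play the same role. The key point is that the state and the offset are written together in a single step.

```python
import json
import os
import tempfile

def load_checkpoint(path):
    """Return the last persisted state and offset, or a fresh start."""
    if not os.path.exists(path):
        return {"offset": 0, "total": 0}
    with open(path) as f:
        return json.load(f)

def save_checkpoint(path, checkpoint):
    """Write state and offset together, then swap the file in one step."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(checkpoint, f)
    os.replace(tmp, path)

def process(log, path, crash_at=None):
    checkpoint = load_checkpoint(path)
    for offset in range(checkpoint["offset"], len(log)):
        if offset == crash_at:
            return checkpoint  # simulated crash before handling this message
        checkpoint = {"offset": offset + 1,
                      "total": checkpoint["total"] + log[offset]}
        save_checkpoint(path, checkpoint)
    return checkpoint

log = [5, 3, 8, 2]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

process(log, path, crash_at=2)   # handles offsets 0 and 1, then crashes
final = process(log, path)       # resumes at offset 2, skipping 0 and 1
```

Because the total and the offset are saved in the same file, the restarted run never adds the first two values twice. Any side effect performed outside the checkpoint (sending an email, calling another service) could still be repeated for the message in flight at the time of the crash.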
The application can persist its state and offsets into any persistent storage, such as a database. Another option is to use Kafka itself to persist the state. In this case, the application publishes its state into a different topic every time a message is processed. This is exactly what Kafka Streams does behind the scenes to maintain the application state.
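The idea of storing state in a second topic can be sketched as follows. This only illustrates the principle and makes no claim about how Kafka Streams is implemented: the changelog list, the record layout and the process and restore functions are hypothetical.

```python
input_log = [("meter-1", 10), ("meter-2", 7), ("meter-1", 12)]
changelog = []  # hypothetical state topic, also modelled as a list

def process(state, start, stop):
    """Update the per-key totals and publish each new value to the changelog."""
    for offset in range(start, stop):
        key, value = input_log[offset]
        state[key] = state.get(key, 0) + value
        changelog.append({"next_offset": offset + 1,
                          "key": key, "value": state[key]})

def restore():
    """Rebuild the state and the input offset by replaying the changelog."""
    state, next_offset = {}, 0
    for record in changelog:
        state[record["key"]] = record["value"]
        next_offset = record["next_offset"]
    return state, next_offset

process({}, 0, 2)                 # handle two messages, then lose the in-memory state
state, next_offset = restore()    # recover from the changelog alone
process(state, next_offset, len(input_log))
```

After recovery the application continues from the third input message with the totals it had before the failure.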
Kafka is built on top of a few simple principles that, when combined, make it possible to build a wide range of applications.
Kafka is also an essential component for building reactive systems because it is message-driven, resilient (reliable message storage + replication), elastic (partitioning) and responsive (consumer groups).
Designing applications still requires careful thinking (partitioning key, offset management, partition rebalancing, …), but a good understanding of these concepts lets you appreciate the strengths and shortcomings of a solution.