How to Get Started – Kafka Basic Concepts Explained

In this article, let us look at what Kafka is, how it works, and its basic architecture and fundamental concepts.

Introduction

Decoupled service interactions and asynchronous communication are core to building a microservices architecture.

For example, let’s say there is a UserService that signs up a user and then calls an EmailService that sends out a welcome email to the new user. The UserService should not wait for the EmailService to complete its operation.

This operation sequence should be split into two different workflows, and each service should work on its own without waiting for the other to complete.

This approach is also called Event-Driven Architecture, where services communicate with each other via events and notifications. But there should be a way for the EmailService to know that there is a job waiting for it to do, right?

Message Queues help create that common space, to which different services can write their requests, and others can pick them up from that space.
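To make the idea concrete, here is a minimal, purely illustrative sketch of that "common space" using an in-memory queue from the Python standard library. The service names mirror the hypothetical UserService and EmailService from the example above; a real system would use a broker like Kafka rather than an in-process queue.

```python
import queue

# The shared "space" both services agree on (a stand-in for a broker).
signup_events = queue.Queue()

def user_service_signup(email):
    # UserService publishes the event and returns immediately;
    # it does not wait for the welcome email to be sent.
    signup_events.put({"type": "user_signed_up", "email": email})
    return "signup complete"

def email_service_poll():
    # EmailService picks up a pending job whenever it is ready.
    event = signup_events.get()
    return f"welcome email sent to {event['email']}"

print(user_service_signup("new.user@example.com"))  # prints "signup complete"
print(email_service_poll())
```

The key point is that `user_service_signup` returns before any email work happens – the two workflows are independent, connected only by the shared queue.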

But Kafka isn’t just another message queue – in fact, it doesn’t even work like one internally. So what is it?

What is Kafka?

The official documentation describes Kafka as a distributed, partitioned, replicated commit log service. Each of these terms has its own meaning and describes a feature of Kafka.

It runs as a clustered service with many nodes, each containing different partitions. It works on the Publisher-Subscriber model, where multiple producers write data to Kafka and multiple subscribers listen for data. It also employs mechanisms that provide a highly reliable and scalable communication system for large-scale architectures.

What is Zookeeper?

A typical Kafka cluster consists of multiple server nodes, which store and process data, and Zookeeper – a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

The Kafka nodes communicate with each other through Zookeeper, which manages and maintains metadata about them. In general, producers and consumers connect to Kafka, while the nodes connect to Zookeeper.

Zookeeper contains metadata – Kafka contains data.

Flavors of Kafka

Kafka is an open-source platform, maintained by the community and distributed under an open-source license. There are, however, commercial versions of Kafka that are based on OSS Kafka with additional features and offerings. Examples include Confluent Kafka and Conduktor.

For example, the following table describes the differences between Apache Kafka – the OSS Kafka – and Confluent Kafka – a commercial Kafka platform built by Confluent, generally used for enterprise purposes.

Apache Kafka | Confluent Kafka
Open-source software | Commercial software, built on top of OSS Kafka
No enterprise support | Enterprise support for paid versions
No additional features | Additional features such as Connectors, a Dashboard and RESTful APIs, which aren’t present in OSS Kafka
Supports Linux, Mac and Windows operating systems | Supports only Linux; can be installed and run via Docker

Differences between Apache Kafka and Confluent Kafka

What are Topics and Partitions?

As mentioned before, Kafka is a Partitioned service.

Topics and Partitions are core to how data is written to and maintained in a Kafka cluster. A Topic is a dedicated space where applications can write messages to and read messages from. Think of it like an individual folder in a file system where you can store and retrieve data.

Pic Credits: Kafka Overview – IBM

UserService can write a job request to a particular space within Kafka, called a Topic, and the EmailService receives the request and can work on it independently.

A Producer is any service that writes to a Kafka topic, and a Consumer is any service that reads from the Topic. In the above example, the UserService is a producer that writes an Event to a Kafka Topic, and the EmailService is a consumer that subscribes to it.

But since a microservice architecture can have services scaled up or down based on the request load – there could be multiple UserService instances and multiple EmailService instances running.

Kafka is highly distributed and available – a single topic can have multiple partitions, where individual messages can be placed.

Let’s say your Kafka cluster has a topic called myTopic with 3 partitions, and data is being written to it – the messages are spread across all 3 partitions, from which other services subscribed to this topic can read.

Data is written to a Topic and can be assigned to any Partition, unless a “Key” is specified. Each message written to a Kafka Topic can contain an optional Key; messages with the same Key always land in the same Partition, while messages without a Key are distributed across Partitions by Kafka.

Note that all data written to Kafka is stored and transmitted as raw bytes.
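Since Kafka itself only sees bytes, producers serialize keys and values before sending, and consumers deserialize them on the way out. The serializer choice is up to the application; the sketch below uses UTF-8 for the key and JSON for the value purely as one common illustrative choice.

```python
import json

def serialize(key, value):
    # Kafka stores and transmits only bytes, so the application decides
    # how to encode its keys and values. JSON is just one common option.
    key_bytes = key.encode("utf-8") if key is not None else None
    value_bytes = json.dumps(value).encode("utf-8")
    return key_bytes, value_bytes

k, v = serialize("user-42", {"type": "user_signed_up", "email": "new.user@example.com"})
print(k)  # b'user-42'
```

Real client libraries expose this as configurable key and value serializers, so the rest of your code can work with plain objects.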

Why are Partitions important?

These partitions enable parallel processing of messages. Think of partitions as sub-folders within a folder, where each consumer can read from one or more partitions.

They allow parallel read operations on the consumer side, which speeds up consumption and scales with the number of consumer instances reading.

The Producer can specify the Partition to which a message should be written. If no Key is specified, Kafka writes to partitions in a round-robin fashion.
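The partitioning logic above can be sketched in a few lines. This is an illustration only: Kafka’s default partitioner actually hashes the key bytes with murmur2, while this sketch uses md5 just to get a stable number, and the partition count of 3 matches the earlier myTopic example.

```python
import hashlib
from itertools import count

NUM_PARTITIONS = 3          # matches the 3-partition myTopic example
_round_robin = count()      # per-producer counter for keyless messages

def choose_partition(key):
    if key is None:
        # No key: spread messages across partitions round-robin.
        return next(_round_robin) % NUM_PARTITIONS
    # Keyed: hash the key so the same key always maps to the same
    # partition (Kafka really uses murmur2; md5 here is illustrative).
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# The same key always lands in the same partition:
print(choose_partition("user-42") == choose_partition("user-42"))  # True
```

Keeping all messages for one key in one partition is what lets Kafka guarantee per-key ordering, since ordering holds only within a partition.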

What is an Offset?

An offset is a sequence number given to a message written inside a partition. This can tell us to what extent messages have been written to the topic partition, and how many have been read so far (on the consumer side).

An Offset is unique only within a Partition – the same offset number can appear for a different message in another Partition of the same Topic. We can uniquely identify a message by combining the Topic, Partition and Offset.
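A toy model (not Kafka’s actual implementation) makes this concrete: each partition keeps its own append-only list, the offset is just a message’s position in that list, and so the triple (topic, partition, offset) identifies a message uniquely while an offset alone does not.

```python
# Per-partition append-only logs: (topic, partition) -> list of messages.
log = {}

def append(topic, partition, message):
    records = log.setdefault((topic, partition), [])
    records.append(message)
    # The offset is simply the message's sequence number in its partition.
    return len(records) - 1

o1 = append("myTopic", 0, "hello")
o2 = append("myTopic", 1, "world")
print(o1, o2)  # 0 0 – same offset, different partitions, different data
```

Consumers track how far they have read by remembering the last offset they consumed per partition, which is exactly the "how many have been read so far" bookkeeping described above.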

You need to remember the following while working with Kafka Topics, Partitions and Offsets –

  • Kafka consists of topics. Each topic can represent a dataset, or a table without any constraints.
  • Each Topic can have one or more partitions.
  • Each partition maintains an offset for data written.
  • Data is written to a topic, and can be assigned to any partition, unless a key is specified.
  • Data is ordered only inside a partition and not across partitions.
  • Offset 1 in Partition 0 doesn’t represent the same data in Offset 1 in Partition 1.
  • You can only append to a topic – messages cannot be updated once written.

Conclusion

In this article, we have briefly discussed Kafka and its basic concepts. We have seen what Kafka is, and looked at the flavors of Kafka with differences between them. Then we jumped into the components of Kafka – along with the functioning of Zookeeper. Then we looked at Kafka Topics, Partitions and Offsets and how they are important in sending and receiving messages in an efficient and distributed manner.

In the upcoming articles, we’ll see how we can build a Producer-Consumer interaction using Apache Kafka, along with setting up Kafka in our local machine and some hands-on code.


Ram

I'm a full-stack developer and a software enthusiast who likes to play around with cloud and tech stack out of curiosity. You can connect with me on Medium, Twitter or LinkedIn.