Apache Kafka is an awesome way to stream data between your applications. It’s often used to integrate components in a distributed system, like microservices. But it does come with some rather confusing terminology. So what’s the difference between Kafka, Kafka Streams and Kafka Connect? Let’s find out.
Kafka, Kafka Streams and Kafka Connect are all tools that form part of the Kafka ecosystem of event streaming. These tools have some similarities, but there are also some key differences that set them apart.
Apache Kafka is an event streaming platform. Applications publish a stream of events or messages to a topic on Kafka. The stream can be consumed independently by many consumers, and messages in the topic can even be replayed if needed. Kafka is massively scalable and fault-tolerant.
Kafka Streams is an API for writing applications that transform and enrich data in Apache Kafka, usually by publishing the transformed data onto a new topic. The data processing itself happens within your application, not on a Kafka broker.
Kafka Connect is an API for moving data into and out of Kafka. It standardises the integration of other applications into Kafka, by letting you write and share connectors for moving data to/from popular applications like databases.
Applications don’t magically share data with each other. This is why we often use messaging tools like Apache Kafka to move data around. Want to share data in real time between your applications? You might want to use Apache Kafka.
At first glance, Apache Kafka just seems like a message broker, which can ship a message from a producer to a consumer. But actually Kafka is a lot more than that.
Kafka is primarily a distributed event log. It’s usually run as a cluster of several brokers, which are joined together.
How does Apache Kafka work?
With Kafka, messages are published onto topics. These topics are like never-ending log files. Producers put their messages onto a topic. Consumers drop in at any time, receive messages from the topic, and can even rewind and replay old messages.
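To make the "never-ending log file" idea concrete, here is a toy in-memory model in Python. It is only a sketch of the concept, not the Kafka client API: the `TopicLog` class and the event names are made up for illustration, and a real topic is split into partitions and stored on brokers.

```python
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """Toy stand-in for a single-partition topic: an append-only list."""
    records: list = field(default_factory=list)

    def append(self, message):
        """Producer side: add a message and return its offset (its index in the log)."""
        self.records.append(message)
        return len(self.records) - 1

    def read_from(self, offset):
        """Consumer side: return every message at or after the given offset."""
        return self.records[offset:]


topic = TopicLog()
for event in ["order-created", "order-paid", "order-shipped"]:
    topic.append(event)

# A consumer that has already read the first two messages resumes at offset 2.
latest = topic.read_from(2)

# Reading does not remove anything, so the same consumer can rewind to
# offset 0 and replay the whole topic.
replayed = topic.read_from(0)
```

The point to notice is that reading is just "give me everything from offset N", which is why dropping in late and replaying old messages both work.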
Messages are only deleted from a topic when you want them to be deleted. You can even run a Kafka broker that keeps every message ever (set log retention to “forever”), and Kafka will never delete anything.
One of the best things about Kafka is that it can replicate messages between brokers in the cluster. So consumers can keep receiving messages, even if a broker crashes. Life-changing!
This makes Kafka very capable of handling all sorts of scenarios, from simple point-to-point messaging, to stock price feeds, to processing massive streams of website clicks, and even using Kafka like a database (yes, some people are doing that).
If you’re a developer and you want to play around with a Kafka cluster for testing and development, try the free developer plan from Cloudkarafka.
Kafka vs traditional message brokers
One of the major things that sets Kafka apart from “traditional” message brokers like RabbitMQ or ActiveMQ is that a topic in Kafka doesn’t know or care about its consumers. It’s simply a log that consumers can dive into and access data from, at any time.
With queues on a traditional message broker, by contrast, each message is delivered only once, to a single consumer. And with traditional topics, consumers who subscribe can only receive messages from that point forward; they can’t rewind.
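The difference is easier to see side by side. The sketch below is a simplified Python model, not real broker behaviour: the consumer names `billing` and `analytics` are hypothetical, and they stand in for what Kafka would call separate consumer groups.

```python
from collections import deque

messages = ["m1", "m2", "m3"]

# Traditional queue: taking a message removes it, so each message
# reaches exactly one of the competing consumers.
queue = deque(messages)
worker_a = [queue.popleft()]
worker_b = [queue.popleft(), queue.popleft()]

# Kafka-style log: the messages stay put, and each consumer only keeps
# its own position (offset) in the log.
log = list(messages)
offsets = {"billing": 0, "analytics": 0}


def poll(consumer):
    """Return the unread messages for one consumer and advance its offset."""
    unread = log[offsets[consumer]:]
    offsets[consumer] = len(log)
    return unread


billing_seen = poll("billing")
analytics_seen = poll("analytics")

# Rewinding is just resetting an offset; the log itself is untouched.
offsets["analytics"] = 1
analytics_replay = poll("analytics")
```

With the queue, the two workers split the messages between them. With the log, both consumers see everything, and either one can go back without affecting the other.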
Why would you use it?
Kafka is massively scalable. Think of the biggest thing you can think of, and then double it. Kafka can handle that.
People talk about Kafka being scalable because it can handle a very large number of messages and consumers, due to the way that it spreads the load across a cluster of brokers. It spreads your messages across these brokers, in topic segments known as partitions.
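One common way messages end up in a particular partition is by hashing the message key, so that messages with the same key always land in the same partition. Here is a rough Python sketch of that idea. The partition count and keys are made up, and CRC32 is only a stand-in: it is not the hash that Kafka's own partitioner uses.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for one topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition by hashing it.

    CRC32 is used here only as a stable stand-in hash; it is not the hash
    Kafka's own partitioner uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


partitions = defaultdict(list)
events = [("user-1", "click"), ("user-2", "click"), ("user-1", "purchase")]
for key, value in events:
    partitions[partition_for(key)].append((key, value))
```

Because each partition can live on a different broker, adding partitions (and brokers) is how the load gets spread out.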
Kafka is pretty damn performant. Even a fairly small Kafka cluster can support very high throughput of messages.
When would you use it?
Kafka is suited very well to these types of use cases:
Collecting metrics. Instead of writing metrics to a log file or database, you can write data to a Kafka “topic”, which other consumers might also be interested in reading.
Collecting high-volume events (e.g. website clicks)
Sharing database change events (called “change data capture”). This method is sometimes used for sharing data between microservices.
Last-value queues – where you might publish a bunch of information to a topic, but you want people to be able to access the “last value” quickly, e.g. stock prices.
Simple messaging (similar to RabbitMQ or ActiveMQ). Kafka can do simple messaging too.
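The "last value" use case from the list above boils down to keeping only the newest record for each key. A minimal Python sketch of the idea, using hypothetical ticker symbols and prices rather than anything Kafka-specific:

```python
def last_values(log):
    """Collapse a keyed log into the most recent value for each key."""
    latest = {}
    for key, value in log:
        latest[key] = value  # later records overwrite earlier ones
    return latest


# Hypothetical stock ticks, oldest first.
price_log = [("ACME", 101.5), ("INIT", 20.0), ("ACME", 102.25), ("INIT", 19.75)]
current = last_values(price_log)
```

A consumer that only cares about the current price can work from `current` instead of replaying every tick.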
What are the alternatives?
Amazon Web Services have their own data streaming product called Kinesis, which works along similar lines to Kafka.
If you only need traditional point-to-point messaging, you could use a classic message broker like RabbitMQ or ActiveMQ, or AWS Simple Queue Service (SQS).
How do you use it?
You can run your own cluster of Kafka brokers, or pay for a managed Kafka service from some cloud providers. Client applications can then connect to your cluster of brokers to publish and consume events.
Kafka also includes a Producer and Consumer API, so that you can send and receive messages from your applications. But these APIs have a few limitations, as Stephane Maarek writes.
So Kafka is often paired with Kafka Streams and Kafka Connect, which are simpler APIs that make it easier to do the two really common things that people want to do with Kafka: process data in Kafka topics, and connect Kafka to external systems.
So let’s check out each of these projects in turn.
If you’re interested in running Apache Kafka on Kubernetes, then make sure you take a look at the Strimzi open source project. Or, if you’re using OpenShift, then Red Hat offers its own version of Kafka, which is called the Red Hat AMQ streams component.
Kafka Streams is another project from the Apache Kafka community. It’s basically a Java API for processing and transforming data inside Kafka topics.
Kafka Streams, or the Streams API, makes it easier to transform or filter data from one Kafka topic and publish it to another Kafka topic, although you can use Streams for sending events to external systems if you wish. (But, for doing that, you might find it easier to use Kafka Connect, which we’ll look at shortly.)
You can think of Kafka Streams as a Java-based toolkit that lets you change and modify messages in Kafka in real time, before the messages reach your external consumers.
How do you use it?
To use Kafka Streams, you first import it into your Java application as a library (JAR file). The library gives you the Kafka Streams Java API.
With the API, you can write code to process or transform individual messages, one-by-one, and then publish those modified messages to a new Kafka topic, or to an external system.
With Kafka Streams, all your stream processing takes place inside your app, not on the brokers. You can even run multiple instances of your Kafka Streams-based application if you’ve got a firehose of messages and you need to handle high volumes.
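The shape of that processing (read a message, decide whether to keep it, transform it, write it somewhere new) can be sketched in a few lines. Note that this is plain Python over in-memory lists, purely to show the pattern: Kafka Streams itself is a Java library, and the `process` function and the page-view events here are invented for the example.

```python
def process(input_topic, keep, transform):
    """Read each message, drop the ones we do not want, transform the rest."""
    output_topic = []
    for message in input_topic:
        if keep(message):
            output_topic.append(transform(message))
    return output_topic


# Hypothetical topic contents: raw page-view events.
page_views = [
    {"user": "ann", "path": "/pricing"},
    {"user": "bot-7", "path": "/pricing"},
    {"user": "raj", "path": "/docs"},
]

# Filter out bot traffic, then enrich each event with a derived field.
clean_views = process(
    page_views,
    keep=lambda e: not e["user"].startswith("bot-"),
    transform=lambda e: {**e, "section": e["path"].strip("/")},
)
```

The input list is left untouched and the results go to a new list, which mirrors the usual pattern of reading from one topic and publishing to another.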
Are there any alternatives?
Kafka Streams isn’t the only way to process data within Kafka. You could also use another open source project like Apache Samza or Apache Storm. But Kafka Streams allows you to do your stream processing using Kafka-specific tools.
And it’s pretty popular.
“I love @kafkastreams, however I love burning shit more.” (Anna McDonald 🤖🧮🔍, @jbfletch_, July 1, 2021)
And so what if you want to bring data in or out of Kafka from other systems? Then you might want to look at Kafka Connect.
The final tool in this rundown of the Kafka projects is Kafka Connect.
Kafka Connect is a tool for connecting different input and output systems to Kafka. Think of it like an engine that can run a number of different components, which can stream Kafka messages into databases, Lambda functions, S3 buckets, or apps like Elasticsearch or Snowflake.
So it makes it much easier to connect Kafka to the other systems in your big ball of mud architecture, without having to write all the glue code yourself. (And let’s be honest, it’s often much better to use someone else’s tried and tested code than write your own.)
How do you use it?
To use Kafka Connect, you download the Connect distribution, set the configuration files how you want them, and then start a Kafka Connect instance.
To use additional connectors, you can find them on places like Confluent Hub or community projects on GitHub. Then you unzip the download into your target environment, and tell Kafka Connect where to look for connectors.
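To give a feel for what those configuration files contain, here is a small Python sketch that renders two of them as `.properties` text: one for the Connect worker, one for a single connector. Treat it as an illustration under assumptions. The broker address, directories, file name and topic name are all placeholders, the `to_properties` helper is made up for this example, and the exact settings you need may differ between Kafka versions, so check them against the documentation for yours.

```python
def to_properties(settings: dict) -> str:
    """Render a flat dict as the key=value lines of a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


# Worker settings: where Kafka is, how to (de)serialise records, where to
# keep offsets, and where to look for connector plugins.
# All addresses and paths below are placeholders.
worker = {
    "bootstrap.servers": "localhost:9092",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "offset.storage.file.filename": "/tmp/connect.offsets",
    "plugin.path": "/opt/connectors",
}

# One connector: read lines from a local file into a topic, using the file
# source connector from the Apache Kafka project. Names are made up.
file_source = {
    "name": "local-file-source",
    "connector.class": "FileStreamSource",
    "tasks.max": 1,
    "file": "/tmp/input.txt",
    "topic": "connect-test",
}

worker_properties = to_properties(worker)
connector_properties = to_properties(file_source)
```

The `plugin.path` entry is the "tell Kafka Connect where to look for connectors" step: it points at the directory where you unzipped your downloaded connectors.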
You can also run Kafka Connect in containers. Some projects that use Kafka Connect offer their own pre-built Docker image. For example, Debezium has a ready-made Connect image that you can pull and run.
The idea of Kafka Connect is to minimise the amount of code you need to write to get data flowing between Kafka and your other systems.
What are the alternatives?
You don’t have to use Kafka Connect to integrate Kafka with your other apps and databases. You can write your own code using the Producer and Consumer API, or use the Streams API.
Or you could even use an integration framework that supports Kafka, like Apache Camel or Spring Integration.
Some integration frameworks, such as Apache Camel, also have support for Kafka Connect. This lets you integrate with Kafka using tools that you might already be familiar with.
If you want to know more about Apache Kafka, Streams and Connect, then I recommend these articles: