Kafka, Kafka Streams and Kafka Connect: What’s the difference?
Apache Kafka is an awesome way to stream data between applications. It’s often used to connect up components in a distributed system, so it’s especially useful if you’re using microservices. But it does have some rather confusing terminology. So what is the difference between Kafka, Kafka Streams and Kafka Connect? Let’s find out.
Apache Kafka is a set of tools designed for event streaming.
People tend to get pretty excited about Kafka, because it’s fault-tolerant (this means that Kafka will keep running even when some of its components fail), and it can be run at huge scale. 📏⛰
Kafka, Kafka Streams and Kafka Connect are all components in the Kafka project. These three components seem similar, but there are some key things that set them apart.
But first, for all you busy TL;DR types, here’s the executive summary:
Apache Kafka is a back-end application that provides a way to share streams of events between applications.
An application publishes a stream of events or messages to a topic on a Kafka broker. The stream can then be consumed independently by other applications, and messages in the topic can even be replayed if needed.
Kafka Streams is an API for writing client applications that transform data in Apache Kafka. You usually do this by publishing the transformed data onto a new topic. The data processing itself happens within your client application, not on a Kafka broker.
Kafka Connect is an API for moving data into and out of Kafka. It provides a pluggable way to integrate other applications with Kafka, by letting you use and share connectors to move data to or from popular applications, like databases.
P.S. Want the no-tech beginner’s guide to Kafka? Check out the fantastic Gently Down The Stream, which is an illustrated guide to Kafka by Mitch Seymour.
At first glance, Apache Kafka just seems like a message broker, which can ship a message from a producer to a consumer. But actually Kafka is a lot more than that.
Kafka is primarily a distributed event log. It’s usually run as a cluster of several brokers, which are joined together.
If you’re just getting started with Apache Kafka, you will probably want to learn the basics first. It’s worth investing in a course, as Kafka can become very complicated, very quickly!
This Kafka beginners’ course from Stephane Maarek is well-paced, and explains all of the complex terminology. It contains more than 7 hours of video lectures, sprinkled with examples that help you to understand the technical concepts.
And, most importantly, it includes a real-world project that you can follow, so you can start using Kafka for real.
Now let’s learn a bit more about Apache Kafka.
How does Apache Kafka work?
With Kafka, messages are published onto topics. These topics are like never-ending log files. Producers put their messages onto a topic. Consumers drop in at any time, receive messages from the topic, and can even rewind and replay old messages.
Messages are only deleted from a topic according to the retention settings that you choose. You can even run a Kafka broker that keeps every message ever (set log retention to “forever”), and Kafka will never delete anything.
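To make the “never-ending log file” idea a bit more concrete, here’s a tiny in-memory sketch in Python. It is not Kafka’s API: the ToyTopic class and its method names are made up for illustration, and it only models a single log with numbered positions (offsets).

```python
class ToyTopic:
    """An append-only list standing in for a single Kafka topic log."""

    def __init__(self):
        self._log = []

    def publish(self, message):
        """Append a message and return its offset (its position in the log)."""
        self._log.append(message)
        return len(self._log) - 1

    def read_from(self, offset):
        """Return every message at or after the offset; nothing is removed."""
        return self._log[offset:]


topic = ToyTopic()
for event in ["order-created", "order-paid", "order-shipped"]:
    topic.publish(event)

# A consumer that drops in late can still start from the beginning...
first_read = topic.read_from(0)
# ...and can rewind to offset 1 and replay from there.
replay = topic.read_from(1)
```

Reading doesn’t remove anything from the log, which is why any number of consumers can read the same messages at their own pace.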
One of the best things about Kafka is that it can replicate messages between brokers in the cluster. So consumers can keep receiving messages, even if a broker crashes. Life-changing!
This makes Kafka very capable of handling all sorts of scenarios, from simple point-to-point messaging, to stock price feeds, to processing massive streams of website clicks, and even using Kafka like a database (yes, some people are doing that).
If you’re learning Kafka and you want a Kafka cluster to play around with, try the free “Developer Duck” plan from Cloudkarafka.
Kafka vs traditional message brokers
One of the major things that sets Kafka apart from “traditional” message brokers like RabbitMQ or ActiveMQ, is that a topic in Kafka doesn’t know or care about its consumers. It’s simply a log that consumers can dive into and access data from, at any time.
On the other hand, on a traditional message broker, a message on a queue is usually delivered to one consumer and then removed. And with traditional topics, consumers who subscribe can only receive messages from that point forward; they can’t rewind.
Why would you use it?
Kafka is massively scalable. Think of the biggest thing you can think of, and then double it. Kafka can handle that.
People talk about Kafka being scalable because it can handle a very large number of messages and consumers, due to the way that it spreads the load across a cluster of brokers. Each topic is split into smaller chunks known as partitions, and those partitions are spread across the brokers.
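As a rough illustration of how messages might end up in different partitions, here’s a small Python sketch that picks a partition by hashing a message key. This is an assumption-laden toy: the partition count and the choose_partition helper are hypothetical, and CRC32 is only a stand-in for whatever hash a real Kafka client uses.

```python
import zlib

NUM_PARTITIONS = 3  # hypothetical partition count for this example


def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition number. CRC32 is a stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# One list per partition, standing in for the partition logs on the brokers.
partitions = {p: [] for p in range(NUM_PARTITIONS)}

events = [("user-1", "click"), ("user-2", "click"), ("user-1", "scroll")]
for key, value in events:
    partitions[choose_partition(key)].append((key, value))
```

Because the same key always hashes to the same partition, all the events for user-1 stay together and in order, while different keys can be handled by different brokers.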
Kafka is pretty damn performant. Even a fairly small Kafka cluster can support very high throughput of messages.
When would you use it?
Kafka is suited very well to these types of use cases:
Collecting metrics. Instead of writing metrics to a log file or database, you can write data to a Kafka “topic”, which any interested consumer can then read.
Collecting high-volume events (e.g. website clicks)
Sharing database change events (called “change data capture”). This method is sometimes used for sharing data between microservices.
Last-value queues – where you might publish a bunch of information to a topic, but you want people to be able to access the “last value” quickly, e.g. stock prices.
Simple messaging (similar to RabbitMQ or ActiveMQ). Kafka can do simple messaging too.
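The “last value” use case in the list above boils down to keeping only the newest message for each key. Here’s a minimal Python sketch of that idea; the latest_values function and the sample prices are invented for illustration, and this is not how Kafka itself implements it.

```python
def latest_values(messages):
    """Fold a stream of (key, value) pairs into the newest value per key."""
    latest = {}
    for key, value in messages:
        latest[key] = value  # later messages overwrite earlier ones
    return latest


prices = [("ACME", 101.5), ("INIT", 20.0), ("ACME", 102.25)]
snapshot = latest_values(prices)
```

A consumer that only cares about the current price can read the snapshot instead of replaying every price change.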
What are the alternatives?
Amazon Web Services has its own data streaming product called Kinesis, which works on similar principles to Kafka.
If you only need traditional point-to-point messaging, you could use a classic message broker like RabbitMQ or ActiveMQ, or AWS Simple Queue Service (SQS).
How do you use it?
You can run your own cluster of Kafka brokers, or pay for a managed Kafka service from some cloud providers. Your applications can then connect to your cluster of brokers, to publish and consume events.
Kafka also includes a Producer and Consumer API, so that you can send and receive messages from your applications. But these APIs have a few limitations, as Stephane Maarek writes.
So Kafka is often paired with Kafka Streams and Kafka Connect, which are simpler APIs that make it easier to do the two really common things that people want to do with Kafka: process data in Kafka topics, and connect Kafka to external systems.
So let’s check out each of these projects in turn.
If you’re interested in running Apache Kafka on Kubernetes, then make sure you take a look at the Strimzi open source project. Or, if you’re using OpenShift, then Red Hat offers its own version of Kafka, which is called the Red Hat AMQ streams component.
Kafka Streams is another project from the Apache Kafka community. It’s basically a Java API for processing and transforming data inside Kafka topics.
Kafka Streams, or the Streams API, makes it easier to transform or filter data from one Kafka topic and publish it to another Kafka topic, although you can use Streams for sending events to external systems if you wish. (But, for doing that, you might find it easier to use Kafka Connect, which we’ll look at shortly.)
You can think of Kafka Streams as a Java-based toolkit that lets you change and modify messages in Kafka in real time, before the messages reach your external consumers.
If you’re looking to get started with Streams, you should grab a copy of Mastering Kafka Streams and ksqlDB by Mitch Seymour. It covers many aspects of data processing with Kafka Streams, including the central Processor API.
Let’s dive a bit more into Kafka Streams.
How do you use it?
To use Kafka Streams, you first import it into your Java application as a library (JAR file). The library gives you the Kafka Streams Java API.
With the API, you can write code to process or transform individual messages, one-by-one, and then publish those modified messages to a new Kafka topic, or to an external system.
With Kafka Streams, all your stream processing takes place inside your app, not on the brokers. You can even run multiple instances of your Kafka Streams-based application if you’ve got a firehose of messages and you need to handle high volumes.
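The shape of that processing (read a message, decide whether to keep it, transform it, write it somewhere new) can be sketched in a few lines. Kafka Streams itself is a Java library, so treat this Python snippet purely as an illustration of the idea, with plain lists standing in for topics and a made-up process function rather than the real Streams API.

```python
def process(input_topic, keep, transform):
    """Read each message, drop those failing `keep`, emit the transformed rest."""
    output_topic = []
    for message in input_topic:
        if keep(message):
            output_topic.append(transform(message))
    return output_topic


clicks = ["/home", "/pricing", "/healthcheck", "/docs"]

# Filter out health checks, then upper-case what is left.
page_views = process(
    clicks,
    keep=lambda path: path != "/healthcheck",
    transform=str.upper,
)
```

Notice that the input list is left untouched. In the same way, a Streams app publishes its results to a new topic rather than changing the messages in the original one.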
Are there any alternatives?
Kafka Streams isn’t the only way to process data within Kafka. You could also use another open source project like Apache Samza or Apache Storm. But Kafka Streams allows you to do your stream processing using Kafka-specific tools.
And it’s pretty popular.
“I love @kafkastreams, however I love burning shit more.” pic.twitter.com/6Em09O9Qwi (Anna McDonald 🤖🧮🔍, @jbfletch_, July 1, 2021)
And so what if you want to bring data in or out of Kafka from other systems? Then you might want to look at Kafka Connect.
The final tool in this rundown of the Kafka projects is Kafka Connect.
Kafka Connect is a tool for connecting different input and output systems to Kafka. Think of it like an engine that can run a number of different components, which can stream Kafka messages into databases, Lambda functions, S3 buckets, or apps like Elasticsearch or Snowflake.
So it makes it much easier to connect Kafka to the other systems in your big ball of mud architecture, without having to write all the glue code yourself. (And let’s be honest, it’s often much better to use someone else’s tried and tested code than write your own.)
How do you use it?
To use Kafka Connect, you download the Connect distribution, set the configuration files how you want them, and then start a Kafka Connect instance.
To use additional connectors, you can find them on places like Confluent Hub or community projects on GitHub. Then you unzip the download into your target environment, and tell Kafka Connect where to look for connectors.
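To give a feel for what a connector configuration looks like, here’s a sketch that builds one as JSON in Python. It uses the simple file source connector that is part of the Apache Kafka project. The connector name, file path and topic name are hypothetical values chosen for the example. JSON in this shape is what you would typically send to the /connectors endpoint of Kafka Connect’s REST interface when running in distributed mode.

```python
import json

connector = {
    "name": "local-file-source",  # hypothetical connector name
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/input.txt",  # hypothetical file to read lines from
        "topic": "file-lines",     # hypothetical topic to publish lines to
    },
}

# Serialise the definition so it can be saved or sent to Kafka Connect.
payload = json.dumps(connector, indent=2)
print(payload)
```

The point is that you describe what you want in configuration (which connector, which file, which topic), and Kafka Connect does the work of moving the data.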
You can also run Kafka Connect in containers. Some projects that use Kafka Connect offer their own pre-built Docker image. Debezium, for example, has a ready-made Connect image that you can pull and run.
The idea of Kafka Connect is to minimise the amount of code you need to write to get data flowing between Kafka and your other systems.
What are the alternatives?
You don’t have to use Kafka Connect to integrate Kafka with your other apps and databases. You can write your own code using the Producer and Consumer API, or use the Streams API.
Or you could even use an integration framework that supports Kafka, like Apache Camel or Spring Integration.
Some integration frameworks, such as Apache Camel, also have support for Kafka Connect. This lets you integrate with Kafka using tools that you might already be familiar with.
If you want to know more about Apache Kafka, Streams and Connect, then I recommend these articles: