
As more and more organizations have come to rely on streaming data to provide real-time insights, a number of applications have sprung up to handle the myriad technical challenges that streaming data presents. One of the most popular options is Apache Kafka. In this article, we’ll take a brief look at what Kafka is, how it works, and what challenges it’s meant to solve.

What Is Kafka?

Let’s start with the basic question: What is Kafka and how does it work? Kafka started out as a project at LinkedIn to make data ingestion with Hadoop easier. Since then, it’s evolved into what Apache describes as a “distributed streaming platform,” but what does that mean? In short, Kafka is a publish-subscribe messaging system that also processes and stores data as it passes through.

Like other publish-subscribe message brokers, it lets different systems broadcast and receive events without having to know exactly where the data is coming from or going to. That said, Kafka has a few key advantages over other message brokers:

  • It’s general purpose. Kafka is meant to connect multiple systems, which makes it extremely attractive for large enterprise systems as well as small startups that are cobbling together their own applications. It’s equally adept at activity tracking, operational monitoring, log aggregation, and stream processing.
  • It takes availability seriously. While all message brokers act as storage systems for messages in transit, Kafka goes to the trouble of writing data to disk and replicating it. That means data is much less likely to be lost in transit, making Kafka an attractive option for applications that demand both speed and retention, for example, applications subject to regulatory and compliance mandates.
  • It enables real-time processing of data streams. The Kafka Streams library is designed for building streaming applications that can handle core business functions without adding additional complexity or dependencies. This can be a major advantage for applications that want to be able to process streaming data in real time but don’t need the heavy analytic tools of a Spark or Flink-type service.
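
To make the publish-subscribe model concrete, here’s a minimal producer sketch in Java using the standard Kafka client. The broker address (localhost:9092), topic name (“Logins”), and event payload are assumptions for illustration:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class LoginEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all"); // wait until all replicas have the record: the durability discussed above

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The producer only knows the topic; it has no idea who will consume the event.
            producer.send(new ProducerRecord<>("Logins", "user-42", "user-42 logged in"));
        } // close() flushes any buffered records before exiting
    }
}
```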

How Does It Work?

Before we go any further, let’s cover a little vocabulary. Kafka’s main abstraction is the topic, which represents a stream of records in a specific category. For example, all records of users logging in or out of an application might go to a topic called “Logins.” Any number of subscribers can subscribe to “Logins,” and they’ll be able to pull new messages from that stream at whatever rate they can handle.
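
A subscriber, in turn, is just a consumer polling the topic. Here’s a hedged sketch of one: the group id (“login-auditors”) is made up for the example, but the subscribe-and-poll loop is the standard consumer API, and each consumer reads at whatever pace it can sustain:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class LoginEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "login-auditors"); // hypothetical consumer group
        props.put("auto.offset.reset", "earliest"); // start from the oldest retained message
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("Logins"));
            while (true) {
                // Pull model: the consumer asks for records at its own rate.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```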

What happens when a subscriber can’t keep up? Kafka holds on to all messages for a set period of time to prevent that data from being lost. This emphasis on availability is one of the main reasons Kafka is attractive for large enterprises where data retention and event logging are major considerations.
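
That retention window is configurable per topic. As a sketch of how you might set it, here’s Kafka’s AdminClient creating the “Logins” topic with a seven-day retention period; the partition count and replication factor are arbitrary choices for the example:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateLoginsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2: illustrative values only
            NewTopic logins = new NewTopic("Logins", 3, (short) 2)
                    .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singletonList(logins)).all().get();
        }
    }
}
```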

Stream Processing

Tools like Spark are great for heavy-duty streaming analytics. But what if your needs are more in the realm of simple processing?

Kafka Streams was designed to handle, in the words of Jay Kreps, one of Kafka’s original developers, “core functions in the business rather than computing analytics about the business.” The goal is to make it easy to build streaming services without adding new dependencies: there are no new frameworks or clusters to add.

For example, say you’re building a logistics application that needs to accept orders, order new inventory, and re-price products as they sell out, all in real time. Kafka Streams can make it so that new orders trigger changes to the database (in this case, your inventory), which in turn trigger additional processes (like adjusting prices or placing new orders), all without having to change your application beyond adding a new library.
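
Here’s a rough sketch of what that might look like with the Kafka Streams DSL. Everything here is an assumption for illustration, including the topic names (“orders”, “inventory-updates”) and the idea that order records arrive keyed by product ID with a quantity as the value:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class InventoryApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "inventory-app"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Long().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Each record on "orders": key = product ID, value = quantity ordered.
        KStream<String, Long> orders = builder.stream("orders");

        orders.groupByKey()
              .reduce(Long::sum) // running total of units sold per product
              .toStream()
              // Republish the totals so downstream services (repricing,
              // reordering) can react to inventory changes as they happen.
              .to("inventory-updates");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```

Because Streams is just a library, this runs as an ordinary Java process: no new cluster to stand up, and scaling out is a matter of starting more instances with the same application id.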

How Does Kafka Compare to Flume?

Often mentioned in the same breath as Kafka is Flume, another open-source project from the Apache Foundation. So how do these two data-streaming tools compare?

The major difference between the two is that, while Kafka is a general-purpose messaging system, Flume is specifically designed to ingest data from a large number of sources into the Hadoop ecosystem. In practice, that means Flume is optimized for sending data to HDFS and HBase, so if your data lives in other systems, like Cassandra or Hive, Kafka might be easier to implement.

Another difference is that Flume comes with a number of built-in sinks and sources that can work out of the box, assuming they match your requirements. With Kafka, you should be prepared to code your own consumers and producers, though as Kafka’s popularity grows, more and more systems ship their own Kafka integrations.

A final difference is that Kafka is a pull system, whereas Flume is a push system. Consistent with its focus on reliability, Kafka stores messages for a set number of days until subscribers are ready to pull them, whereas Flume continually pushes messages to subscribers. Flume also doesn’t replicate data across its nodes, meaning events are only as durable as the disk they’re stored on: if one of your nodes goes down, you could lose access to that data until the node comes back online.

Is Kafka Right for Your Project?

We’ve put together a few questions to help you determine if you need a stream-processing framework like Kafka.

  • What sort of processing do you need? If you need analytics about your business, you’ll probably want to go with an option like Spark Streaming. If, however, what you really need is something that can power event-driven microservices, then Kafka might be just right.
  • Are you already committed to the Hadoop ecosystem? If so, it might be easier to go with Flume, which is designed to plug right into HDFS and HBase. If you’re using an array of different systems (or if you’ve built your own), then the flexibility of Kafka might be a major appeal.
  • How much development work are you willing to do? While Kafka is general purpose and highly scalable, it does require some development work to get the consumers and producers up and running. For that, you’ll likely need a qualified Kafka expert or data engineer.