Kafka Streams Basics: Real-Time Data Processing Made Simple
2026-04-01 · 12 min read
If you've worked with Apache Kafka, you know it as a distributed event streaming platform — great for moving data between systems. But what happens when you need to process that data in real time? That's where Kafka Streams comes in.
What is Kafka Streams?
Kafka Streams is a client library for building real-time applications and microservices that process data stored in Kafka topics. Unlike heavyweight frameworks like Apache Flink or Spark Streaming, Kafka Streams runs as part of your application — no separate cluster required.
Just a JAR dependency, no external infra
Built-in guarantee for data correctness
Aggregations, joins, and windowing
Scale by adding instances; auto-recovery
Core Concepts
Stream (KStream)
A KStreamrepresents an unbounded, continuously updating stream of records. Each record is an independent, immutable fact — like an event log. Think of it as "user X clicked button Y at time T."
Table (KTable)
A KTableis a changelog stream where each record is an update to a key. Only the latest value per key matters. Think of it as a materialized view — like "user X's current balance is 500."
Topology
A topology is the processing graph — a DAG (directed acyclic graph) of stream processors connected by streams. You define your topology using the Streams DSL or the lower-level Processor API.
Kafka Streams Processing Topology
How It Works Under the Hood
1. Partitions Drive Parallelism
Kafka Streams leverages Kafka's partition model. Each input topic partition maps to a stream task. If your topic has 8 partitions, you get 8 tasks — these can be distributed across multiple application instances automatically.
Scaling Model
Each instance gets assigned a subset of partitions. Add more instances = more parallelism.
2. State Stores
For stateful operations (aggregations, joins), Kafka Streams uses local state stores backed by RocksDB. These stores are:
- Persisted to disk for fast recovery
- Backed up to a Kafka changelog topic for fault tolerance
- Automatically restored when a task migrates to a new instance
3. Processing Guarantees
Kafka Streams supports exactly-once semantics (EOS)by coordinating consumer offsets and producer writes within a single transaction. Even if your app crashes mid-processing, you won't get duplicate results.
4. Simple Deployment
Unlike Flink or Spark, there's no cluster manager. Deploy your Kafka Streams app like any normal Java/Kotlin application. Scaling? Just start more instances. Kafka's consumer group protocol handles rebalancing automatically.
A Simple Example
Here's a word count topology — the "hello world" of stream processing:
StreamsBuilder builder = new StreamsBuilder();
// Read from input topic as a stream
KStream<String, String> textLines = builder.stream("input-topic");
// Process: split lines into words, group, and count
KTable<String, Long> wordCounts = textLines
.flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
.groupBy((key, word) -> word)
.count(Materialized.as("word-counts-store"));
// Write results to output topic
wordCounts.toStream().to("output-topic",
Produced.with(Serdes.String(), Serdes.Long()));
// Build and start the topology
KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();This reads text from input-topic, counts word occurrences in real time, and writes updated counts to output-topic. The state store keeps the running totals, backed by RocksDB locally and a changelog topic in Kafka.
When to Use Kafka Streams
Real-World Use Cases
Aggregate transaction patterns per user within time windows
Transform database change events into downstream formats
Join event streams with reference data (KStream-KTable join)
Aggregate metrics and push to a serving layer
Group user activity into sessions using windowed operations
When NOT to Use Kafka Streams
Kafka Streams vs. Alternatives
Kafka Streams
Kafka-native microservices
- DeploymentLibrary (no cluster)
- LatencyMilliseconds
- StateRocksDB + changelog
- Exactly-onceYes (Kafka only)
- LanguageJava / Kotlin
Apache Flink
Complex pipelines
- DeploymentCluster required
- LatencyMilliseconds
- StateRocksDB + checkpoints
- Exactly-onceYes (any sink)
- LanguageJava / Scala / Python
Spark Streaming
Batch + streaming
- DeploymentCluster required
- LatencySeconds (micro-batch)
- StateIn-memory / external
- Exactly-onceLimited
- LanguageJava / Scala / Python
Key Takeaways
- Kafka Streams is a library, not a framework — deploy it like any app
- The KStream/KTable duality is the foundation for modeling real-time data
- Scaling = starting more instances; Kafka handles the rest
- State is local (fast) but backed by Kafka (durable)
- Use it when Kafka is your backbone and you need real-time, stateful processing
Kafka Streams strikes a unique balance: powerful enough for production-grade stream processing, simple enough to embed in a microservice. If Kafka is already in your stack, it's the most pragmatic choice for real-time data processing.