Back to home

Kafka Streams Basics: Real-Time Data Processing Made Simple

2026-04-01 · 12 min read

If you've worked with Apache Kafka, you know it as a distributed event streaming platform — great for moving data between systems. But what happens when you need to process that data in real time? That's where Kafka Streams comes in.

Prerequisites
This article assumes basic familiarity with Apache Kafka concepts (topics, partitions, consumers, producers). If you're new to Kafka, start with the official introduction first.

What is Kafka Streams?

Kafka Streams is a client library for building real-time applications and microservices that process data stored in Kafka topics. Unlike heavyweight frameworks like Apache Flink or Spark Streaming, Kafka Streams runs as part of your application — no separate cluster required.

Lightweight

Just a JAR dependency, no external infra

Exactly-once

Built-in guarantee for data correctness

Stateful & stateless

Aggregations, joins, and windowing

Elastic & fault-tolerant

Scale by adding instances; auto-recovery

Core Concepts

Stream (KStream)

A KStreamrepresents an unbounded, continuously updating stream of records. Each record is an independent, immutable fact — like an event log. Think of it as "user X clicked button Y at time T."

Table (KTable)

A KTableis a changelog stream where each record is an update to a key. Only the latest value per key matters. Think of it as a materialized view — like "user X's current balance is 500."

Stream-Table Duality
Every stream can be viewed as a table (by replaying events to build current state), and every table can be viewed as a stream (by capturing each change as an event). This duality lets you model almost any real-time processing scenario.

Topology

A topology is the processing graph — a DAG (directed acyclic graph) of stream processors connected by streams. You define your topology using the Streams DSL or the lower-level Processor API.

Kafka Streams Processing Topology

Source Topicinput-topicProcessorfilter / map /aggregateKStream / KTableState StoreRocksDB(local + changelog)Sink Topicoutput-topic

How It Works Under the Hood

1. Partitions Drive Parallelism

Kafka Streams leverages Kafka's partition model. Each input topic partition maps to a stream task. If your topic has 8 partitions, you get 8 tasks — these can be distributed across multiple application instances automatically.

Scaling Model

Instance 1
Instance 2
Instance 3
Instance 4

Each instance gets assigned a subset of partitions. Add more instances = more parallelism.

2. State Stores

For stateful operations (aggregations, joins), Kafka Streams uses local state stores backed by RocksDB. These stores are:

  • Persisted to disk for fast recovery
  • Backed up to a Kafka changelog topic for fault tolerance
  • Automatically restored when a task migrates to a new instance

3. Processing Guarantees

Kafka Streams supports exactly-once semantics (EOS)by coordinating consumer offsets and producer writes within a single transaction. Even if your app crashes mid-processing, you won't get duplicate results.

4. Simple Deployment

Unlike Flink or Spark, there's no cluster manager. Deploy your Kafka Streams app like any normal Java/Kotlin application. Scaling? Just start more instances. Kafka's consumer group protocol handles rebalancing automatically.

A Simple Example

Here's a word count topology — the "hello world" of stream processing:

java
StreamsBuilder builder = new StreamsBuilder();

// Read from input topic as a stream
KStream<String, String> textLines = builder.stream("input-topic");

// Process: split lines into words, group, and count
KTable<String, Long> wordCounts = textLines
    .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
    .groupBy((key, word) -> word)
    .count(Materialized.as("word-counts-store"));

// Write results to output topic
wordCounts.toStream().to("output-topic",
    Produced.with(Serdes.String(), Serdes.Long()));

// Build and start the topology
KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();

This reads text from input-topic, counts word occurrences in real time, and writes updated counts to output-topic. The state store keeps the running totals, backed by RocksDB locally and a changelog topic in Kafka.

When to Use Kafka Streams

You already use Kafka — integrates natively, no new infrastructure
Real-time processing — event-driven transformations, enrichment, routing
Stateful operations — aggregations, windowed computations, stream-table joins
Microservice architecture — embed processing logic directly in services
Exactly-once matters — financial transactions, inventory, deduplication

Real-World Use Cases

Fraud Detection

Aggregate transaction patterns per user within time windows

CDC Processing

Transform database change events into downstream formats

Event Enrichment

Join event streams with reference data (KStream-KTable join)

Real-time Dashboards

Aggregate metrics and push to a serving layer

Session Tracking

Group user activity into sessions using windowed operations

When NOT to Use Kafka Streams

You don't use Kafka — only works with Kafka as source/sink
Batch processing — Spark or Flink batch mode is better for historical data
Complex event processing — pattern matching is easier in Flink CEP
Multi-cluster — designed for a single Kafka cluster
Non-JVM languages — Java/Kotlin only (ksqlDB offers a SQL alternative)

Kafka Streams vs. Alternatives

Kafka Streams

Kafka-native microservices

  • DeploymentLibrary (no cluster)
  • LatencyMilliseconds
  • StateRocksDB + changelog
  • Exactly-onceYes (Kafka only)
  • LanguageJava / Kotlin

Apache Flink

Complex pipelines

  • DeploymentCluster required
  • LatencyMilliseconds
  • StateRocksDB + checkpoints
  • Exactly-onceYes (any sink)
  • LanguageJava / Scala / Python

Spark Streaming

Batch + streaming

  • DeploymentCluster required
  • LatencySeconds (micro-batch)
  • StateIn-memory / external
  • Exactly-onceLimited
  • LanguageJava / Scala / Python

Key Takeaways

TL;DR
  • Kafka Streams is a library, not a framework — deploy it like any app
  • The KStream/KTable duality is the foundation for modeling real-time data
  • Scaling = starting more instances; Kafka handles the rest
  • State is local (fast) but backed by Kafka (durable)
  • Use it when Kafka is your backbone and you need real-time, stateful processing

Kafka Streams strikes a unique balance: powerful enough for production-grade stream processing, simple enough to embed in a microservice. If Kafka is already in your stack, it's the most pragmatic choice for real-time data processing.