@jeff_drumgod

Batch processing, Node.js, Kafka, MongoDB

How I unstuck a Node.js/Kafka pipeline without rewriting the architecture

In production, large batches were choking the event loop, inflating memory, and triggering cascading timeouts. The fix came from async generators and a surgical change to the existing flow.

Published on 2026-04-05 · Updated on 2026-04-05

  • Stable event loop during large batches
  • Constant memory usage
  • Refactor delivered in hours, not weeks

Case summary

Problem
Legacy loops were firing I/O with no concurrency control and degrading the pipeline in production.
Context
Internal continuous-processing system writing to MongoDB, publishing to Kafka, and enriching data through internal APIs.
Decision
Replace iteration with async generators and `for await...of`, keeping the existing architecture intact.
Result
Health checks recovered, memory stopped growing with batch size, and throughput stayed consistent from the first item to the last.

At first glance, this looked like a classic scale problem: high throughput, continuous operation, external dependencies, and growing infrastructure pressure. The team’s first conversation was about adding pods and considering a rewrite.

Except the bottleneck was not Kafka, MongoDB, or pod count. It was the iteration model: the code treated async I/O like synchronous work and accumulated too much before giving control back to the runtime.

The symptoms

Batch processing relied on legacy JavaScript iteration patterns. At low volume this looked harmless. At thousands of items per batch, the system started degrading until it stopped responding. Health checks were failing because the event loop was starved. Memory climbed in spikes because the code pulled too much work in at once. There was no real backpressure on the Kafka producer or the MongoDB driver, so any slow operation triggered a domino effect of retries and cascading timeouts.

Where the problem actually was

The team thought they needed more infrastructure. In practice, the iteration design made no sense for a single-threaded, event-driven runtime.

Scaling pods or increasing memory would have masked the symptom for a while, but not fixed how the runtime was being pressured. The question shifted from “how many resources do we need?” to “why can’t Node.js breathe?”

That shift in perspective unblocked everything. Instead of staring at infra dashboards, I went and read the main processing loop. And there it was: the code was firing hundreds of async operations at once without ever waiting for any of them to finish before spawning the next.
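Stripped of the MongoDB and Kafka specifics, the loop looked roughly like this (a simplified sketch; `processItem` stands in for the real enrich-write-publish work, which is not shown in the original):

```javascript
// Anti-pattern (simplified): every item's I/O is started immediately.
// With thousands of items per batch, all promises, payloads, and driver
// buffers live in memory at once, and nothing yields control back to
// the event loop between them.
async function processBatch(items, processItem) {
  const promises = [];
  for (const item of items) {
    promises.push(processItem(item)); // fires the I/O, never waits
  }
  // Memory and in-flight work peak at O(batch size), with no backpressure.
  return Promise.all(promises);
}
```

At low volume this finishes fine, which is exactly why it survived code review for so long.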

The rewrite that didn’t happen

There was a proposal to rebuild the flow by splitting responsibilities across more components. That could work, but the cost was high for what was actually broken.

The system did not need architectural theater. It needed better iteration and backpressure. The difference between weeks of rewriting and hours of refactoring came down to that clarity: the problem was localized, so the fix could be localized too. The proposed rebuild meant more operational complexity, a learning curve, and a full rollout for the team, all without addressing the root cause.

What changed in the code

The decision was to move iteration to async generators with `for await...of`. That enabled lazy evaluation, natural flow control, and a clearer place to apply intentional concurrency.

Streams would have been the textbook answer, but they would also require reshaping the entire system as a stream pipeline. Async generators delivered the key benefit with a localized change and a low learning curve because the team already understood the `for...of` model. The code change was small. The production impact was not.
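A minimal sketch of the shape that change took (names are illustrative; `fetchPage` stands in for a paged MongoDB read and `publish` for the Kafka producer call, neither of which is specified in the original):

```javascript
// Async generator: items are pulled lazily, one page at a time, so
// memory stays bounded by the page size instead of the batch size.
async function* items(fetchPage, pageSize) {
  let offset = 0;
  let page;
  while ((page = await fetchPage(offset, pageSize)).length > 0) {
    yield* page;
    offset += page.length;
  }
}

// `for await...of` gives natural backpressure: the next item is only
// pulled after the previous publish resolves, and each await hands
// control back to the event loop, so health checks keep responding.
async function run(fetchPage, publish, pageSize = 100) {
  let count = 0;
  for await (const item of items(fetchPage, publish ? pageSize : pageSize)) {
    await publish(item);
    count++;
  }
  return count;
}
```

Intentional concurrency still fits this shape: yield pages instead of items and await a bounded `Promise.all` per page, keeping the in-flight count explicit rather than accidental.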

In the end

Beyond knowing Node.js, Kafka, or MongoDB, the point here is knowing where to look when everything seems to need more resources. The right diagnosis avoided weeks of rewriting and solved the problem in hours. Runtime investigation came before scaling infra, and the choice between the elegant solution and the one that works was deliberate.

Before and after

| Aspect | Before | After |
| --- | --- | --- |
| Event loop | Blocked during large batches | Stable, with health checks responding |
| Memory | Spikes proportional to batch size | Constant usage, independent of volume |
| Throughput | Progressive degradation across the batch | Consistent speed from first item to last |
| Infrastructure | Pressure to scale resources | No infrastructure change needed |

Stack and environment

  • Node.js
  • Kafka
  • MongoDB
  • JavaScript/TypeScript
  • High-volume internal pipeline

In one sentence

A problem that looked like a scale issue was an iteration issue. The fix took hours and did not touch infra.

When the problem is the wrong iteration model, more infrastructure only buys time to stay wrong at a larger scale.
