@jeff_drumgod

Batch processing, Node.js, Kafka, MongoDB

How I unstuck a Node.js/Kafka pipeline without rewriting the architecture

In production, large batches were choking the event loop, inflating memory, and triggering cascading timeouts. The fix came from async generators and a surgical change to the existing flow.

Published on 2026-04-05 · Updated on 2026-04-05

  • Stable event loop during large batches
  • Constant memory usage
  • Refactor delivered in hours, not weeks

Case summary

Problem
Legacy loops were firing I/O with no concurrency control and degrading the pipeline in production.
Context
Internal continuous-processing system writing to MongoDB, publishing to Kafka, and enriching data through internal APIs.
Decision
Replace iteration with async generators and `for await...of`, keeping the existing architecture intact.
Result
Health checks recovered, memory stopped growing with batch size, and throughput stayed consistent from the first item to the last.

At first glance, this looked like a classic scale problem: high throughput, continuous operation, external dependencies, and growing infrastructure pressure. The team’s first conversation was about adding pods and considering a rewrite.

Except the bottleneck was not Kafka, MongoDB, or pod count. It was the iteration model: the code treated async I/O like synchronous work and accumulated too much before giving control back to the runtime.

The symptoms

Batch processing relied on legacy JavaScript iteration patterns. At low volume this looked harmless. At thousands of items per batch, the system started degrading until it stopped responding. Health checks were failing because the event loop was starved. Memory climbed in spikes because the code pulled too much work in at once. There was no real backpressure on the Kafka producer or the MongoDB driver, so any slow operation triggered a domino effect of retries and cascading timeouts.

Where the problem actually was

The team thought they needed more infrastructure. In practice, the iteration design made no sense for a single-threaded, event-driven runtime.

Scaling pods or increasing memory would have masked the symptom for a while, but not fixed how the runtime was being pressured. The question shifted from “how many resources do we need?” to “why can’t Node.js breathe?”

That shift in perspective unblocked everything. Instead of staring at infra dashboards, I went and read the main processing loop. And there it was: the code was firing hundreds of async operations at once without ever waiting for any of them to finish before spawning the next.
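Stripped of the MongoDB and Kafka specifics, the loop looked roughly like this (a simplified sketch; `processItem` stands in for the real enrich-write-publish work, which is not shown in the original):

```javascript
// Anti-pattern (simplified): every item's I/O is started immediately.
// With thousands of items per batch, all promises, payloads, and driver
// buffers live in memory at once, and nothing yields control back to
// the event loop between them.
async function processBatch(items, processItem) {
  const promises = [];
  for (const item of items) {
    promises.push(processItem(item)); // fires the I/O, never waits
  }
  // Memory and in-flight work peak at O(batch size), with no backpressure.
  return Promise.all(promises);
}
```

At low volume this finishes fine, which is exactly why it survived code review for so long.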

The rewrite that didn’t happen

There was a proposal to rebuild the flow by splitting responsibilities across more components. That could work, but the cost was high for what was actually broken.

The system did not need architectural theater. It needed better iteration and backpressure. The difference between weeks of rewriting and hours of refactoring came down to that clarity: the problem was localized, so the fix could be localized too. The proposed rebuild meant more operational complexity, a learning curve, and a full rollout for the team, all without addressing the root cause.

What changed in the code

The decision was to move iteration to async generators with `for await...of`. That enabled lazy evaluation, natural flow control, and a clearer place to apply intentional concurrency.

Streams would have been the textbook answer, but they would also require reshaping the entire system as a stream pipeline. Async generators delivered the key benefit with a localized change and a low learning curve because the team already understood the `for...of` model. The code change was small. The production impact was not.
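A minimal sketch of the shape that change took (names are illustrative; `fetchPage` stands in for a paged MongoDB read and `publish` for the Kafka producer call, neither of which is specified in the original):

```javascript
// Async generator: items are pulled lazily, one page at a time, so
// memory stays bounded by the page size instead of the batch size.
async function* items(fetchPage, pageSize) {
  let offset = 0;
  let page;
  while ((page = await fetchPage(offset, pageSize)).length > 0) {
    yield* page;
    offset += page.length;
  }
}

// `for await...of` gives natural backpressure: the next item is only
// pulled after the previous publish resolves, and each await hands
// control back to the event loop, so health checks keep responding.
async function run(fetchPage, publish, pageSize = 100) {
  let count = 0;
  for await (const item of items(fetchPage, publish ? pageSize : pageSize)) {
    await publish(item);
    count++;
  }
  return count;
}
```

Intentional concurrency still fits this shape: yield pages instead of items and await a bounded `Promise.all` per page, keeping the in-flight count explicit rather than accidental.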

In the end

Beyond knowing Node.js, Kafka, or MongoDB, the point here is knowing where to look when everything seems to need more resources. The right diagnosis avoided weeks of rewriting and solved the problem in hours. Runtime investigation came before scaling infra, and the choice between the elegant solution and the one that works was deliberate.

Before and after

| Aspect | Before | After |
| --- | --- | --- |
| Event loop | Blocked during large batches | Stable, with health checks responding |
| Memory | Spikes proportional to batch size | Constant usage, independent of volume |
| Throughput | Progressive degradation across the batch | Consistent speed from first item to last |
| Infrastructure | Pressure to scale resources | No infrastructure change needed |

Stack and environment

  • Node.js
  • Kafka
  • MongoDB
  • JavaScript/TypeScript
  • High-volume internal pipeline

In one sentence

A problem that looked like a scale issue was an iteration issue. The fix took hours and did not touch infra.

When the problem is the wrong iteration model, more infrastructure only buys time to stay wrong at a larger scale.
