@jeff_drumgod

FinOps, AWS, Cloudflare, SaaS multi-tenant

How I cut over 80% of a SaaS platform's AWS bill without freezing the product

A content-delivery SaaS platform was scaling its AWS bill faster than revenue. Over roughly 12 months, monthly spend dropped close to 85% while the product kept shipping, including new generative AI features.

Published on 2026-04-09 · Updated on 2026-04-09

  • Cumulative reduction close to 85%
  • Per-tenant cost now measurable, no longer estimated
  • Development and new features never stopped

Case summary

Problem
Multi-tenant SaaS with monthly AWS costs climbing steadily, no real per-tenant cost tracking, and less than 10% of the operation documented.
Context
Content-delivery web platform, original architecture not designed with FinOps in mind, team shipping new features in parallel.
Decision
Tackle measurement, code, and service selection at the same time, keeping AWS where it made sense and moving to Cloudflare where the edge carried the heavier cost.
Result
Cumulative drop close to 85% in monthly cost, real per-tenant consumption visibility, consistent web vitals improvement, without pausing the roadmap.

When I joined, the cost conversation already had a resigned tone to it. The AWS bill had been climbing for months, every previous attempt to tighten things up had been short-lived, and the team had a shared sense that the operation had a life of its own. The product worked, customers were active, but the bill was growing at a pace no meeting could slow down.

This was not a case of “turning off idle machines.” It was a platform built to ship value fast, and nobody had gone back to ask whether the setup still made sense for the way it actually ran. You can run software for a while without thinking about FinOps, but eventually the bill catches up.

A note before anything else

This case involves AWS and Cloudflare, and part of the solution was moving responsibilities from one to the other. This is not a criticism of AWS. AWS is a solid platform and it remained in use across several core parts of the operation. What was wrong was the architecture, which had not been designed to use AWS efficiently. The issue was less the tool itself and more how it had been put to work.

The starting point

The cost curve over previous months made it clear that without intervention, the bill would keep climbing. But the initial concern was not even the total. It was something else: nobody could say, with any confidence, how much each customer actually cost. There was an estimated split based on “tenant size” that worked for internal reports but could not survive a serious question. Without real per-tenant cost, any discussion about pricing, margins, or churn turned into guesswork.

There was an aggravating factor: less than 10% of the operation was documented. No reliable architecture diagram, no dependency map, no record of why each service had been set up the way it was. Much of the early work was not “optimizing.” It was investigating: opening the code, tracing routes, and pulling on configuration threads until I understood what was actually running. It felt almost archaeological, and the documentation was written as the decisions took shape.

What changed and why

The useful question shifted from “how much more resource do we need?” to “where is this system losing control?” That reframe unlocked everything. The work moved on three fronts at once.

Measurement

Before cutting anything, I built an internal calculator to understand the real weight of each line item, not by service name, but by what it represented in the operation. That made it clear that data transfer was the dominant cost, followed by CDN, WAF, and multi-tenant SSL in aggregate. CPU was not the main villain. The edge was. Instead of debating cost based on impressions, the team started debating it based on evidence.
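The idea behind that calculator can be sketched roughly like this: map raw billing line items onto operational categories and rank them by spend. The category rules, field names, and sample figures below are my own illustration, not the platform's real data.

```typescript
// Illustrative sketch: group raw billing line items into operational
// categories and rank them by spend. Category names and matching rules
// are hypothetical, not the platform's actual mapping.
type LineItem = { service: string; usageType: string; costUsd: number };

// Assumed mapping from usage-type/service keywords to what they
// represent in the operation.
const CATEGORY_RULES: Array<[RegExp, string]> = [
  [/DataTransfer|Bytes-Out/i, "data transfer"],
  [/CloudFront|CDN/i, "cdn"],
  [/WAF/i, "waf"],
  [/SSL|Certificate/i, "multi-tenant ssl"],
  [/BoxUsage|Fargate|Lambda-GB/i, "compute"],
];

function categorize(item: LineItem): string {
  for (const [pattern, category] of CATEGORY_RULES) {
    if (pattern.test(item.usageType) || pattern.test(item.service)) {
      return category;
    }
  }
  return "other";
}

// Sum cost per category and sort descending, so the dominant cost
// surfaces first instead of being buried under service names.
function rankCategories(items: LineItem[]): Array<[string, number]> {
  const totals = new Map<string, number>();
  for (const item of items) {
    const cat = categorize(item);
    totals.set(cat, (totals.get(cat) ?? 0) + item.costUsd);
  }
  return [...totals.entries()].sort((a, b) => b[1] - a[1]);
}
```

Even a crude ranking like this is enough to settle the "what actually weighs the most" argument with numbers instead of intuition.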

Code and architecture

There was waste inside the code itself. In some React and Next.js flows, components hit the same endpoint multiple times. Even with Next.js caching part of the path, this still generated CPU cost and server-side processing. At scale with many tenants, that becomes recurring spend without delivering new value. Those flows were fixed.

There were also misplaced responsibilities. Security rules like rate limiting sat in a layer that made every request carry unnecessary memory and processing cost. The protection made sense; the problem was where it ran. Moving that weight to a more appropriate layer improved the ratio between protection and cost.
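To make the "where it runs" point concrete, here is a toy fixed-window rate limiter of the kind that can sit at the edge (for example, inside a Cloudflare Worker) instead of inside the application layer. The in-memory `Map` is only to illustrate the logic; a real edge deployment would keep counters in something like Durable Objects or KV.

```typescript
// Minimal fixed-window rate limiter, shown as a pure class so the logic
// is testable. Limits and window size are illustrative.
type Window = { windowStart: number; count: number };

class RateLimiter {
  private windows = new Map<string, Window>();

  constructor(private limit: number, private windowMs: number) {}

  // Returns true if the request is allowed for this client in the
  // current window, false if the client has exceeded the limit.
  allow(clientKey: string, now: number): boolean {
    const w = this.windows.get(clientKey);
    if (!w || now - w.windowStart >= this.windowMs) {
      this.windows.set(clientKey, { windowStart: now, count: 1 });
      return true;
    }
    w.count += 1;
    return w.count <= this.limit;
  }
}
```

Running this check at the edge means rejected requests never reach the origin, so the memory and processing cost of enforcing the rule stops being paid on every request that gets through to the application.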

And there was a mismatched compute model. The platform used ECS with auto-scaling, which always embedded paid idle margin. Where that model no longer justified itself, migrating to Lambda swapped reserved-capacity cost for per-execution cost, removing the paid slack that had been treated as inevitable.

None of this involved a rewrite. It required patience for fine-grained diagnosis, in a codebase where the code itself was often the only existing documentation of how the system worked.

Right tool for each job

AWS stayed where it made sense. But CDN, WAF, multi-tenant SSL, and data transfer were significantly more expensive in aggregate than on Cloudflare. SSL for SaaS and Workers covered the edge at a cost-performance ratio the previous architecture could not match.

Cloudflare has pricier CPU in some scenarios, but the edge and transfer savings more than made up for it. The criterion was never ideological. It was understanding where each workload ran best at the lowest acceptable complexity cost. Treating AWS and Cloudflare as competitors leaves money on the table. Treating them as tools with different strengths is what made a reduction of this scale possible.
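For a sense of what "Workers covering the edge" looks like, here is a hypothetical Worker-style handler: it resolves the tenant from the custom hostname (the kind provisioned through SSL for SaaS) and serves a cacheable response. The hostname-to-tenant table and response body are invented for illustration; a real Worker would fetch from an origin or edge storage with caching.

```typescript
// Hypothetical edge handler sketch for a multi-tenant platform.
// Hostnames and tenants are made up for illustration.
const TENANT_BY_HOSTNAME: Record<string, string> = {
  "content.acme-corp.example": "acme",
  "cdn.globex.example": "globex",
};

async function handleRequest(request: Request): Promise<Response> {
  const hostname = new URL(request.url).hostname;
  const tenant = TENANT_BY_HOSTNAME[hostname];
  if (!tenant) return new Response("Unknown tenant", { status: 404 });

  // A real Worker would fetch from the origin or edge storage here.
  // The response is tagged per tenant and marked cacheable so repeat
  // requests are absorbed at the edge instead of reaching the origin.
  return new Response(`content for ${tenant}`, {
    headers: {
      "x-tenant": tenant,
      "cache-control": "public, max-age=300",
    },
  });
}
```

Serving and terminating TLS for each tenant's custom domain at the edge is exactly the aggregate (CDN + WAF + multi-tenant SSL + transfer) that was cheaper to carry on Cloudflare in this case.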

What happened to the bill

The drop came in waves. The first few months brought single-digit cuts from code adjustments and rightsizing. Once the first round of refactoring landed, the reduction crossed 20% and reached the low 30s.

The stronger wave hit when the edge migration started handling real traffic. The monthly cost fell to less than half of the starting point. There were months where the number crept back up slightly, which is normal when you are migrating workloads while keeping part of the traffic on the old model. I did not treat that as a setback, but as the natural cost of a transition done with the product still running.

By the last full month before this publication, the cumulative reduction was around 70%. With the final design reaching steady state, the projection points to something close to 85%.

All of this while the product kept shipping

None of this stopped development. The roadmap kept moving. The team shipped new features, including generative AI capabilities, which bring their own set of cost, latency, and architecture challenges. This was not just about cutting. It was about reducing waste while building something that is, by nature, expensive.

That required picking battles carefully. Not every optimization was worth pursuing that month, because it competed with a feature that could not slip. Not every new feature was accepted as-is, because some carried cost decisions that only made sense after the next refactoring wave.

What does not show up in the spreadsheet

It is tempting to reduce this case to “I cut the AWS bill.” But the most useful outcomes are not in the savings chart.

We reached the point where we know how much each tenant actually costs the company. That changes pricing conversations, decisions about customers who consume a lot but pay little, and how the sales team negotiates plans. It moves from “I think” to “I can show you.”
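The mechanics behind "I can show you" are simple once usage is actually measured: each shared cost pool is split across tenants in proportion to their measured consumption, instead of by an estimated tenant size. A sketch, with invented tenant names and metrics:

```typescript
// Sketch of per-tenant cost allocation: a shared cost pool (e.g. data
// transfer for a month) is split in proportion to measured usage.
// Tenant names and usage units are illustrative.
type UsageByTenant = Record<string, number>;

function allocatePool(poolCostUsd: number, usage: UsageByTenant): UsageByTenant {
  const total = Object.values(usage).reduce((sum, v) => sum + v, 0);
  const allocation: UsageByTenant = {};
  for (const [tenant, used] of Object.entries(usage)) {
    // Guard against an empty pool to avoid dividing by zero.
    allocation[tenant] = total > 0 ? poolCostUsd * (used / total) : 0;
  }
  return allocation;
}
```

Run per pool (transfer, compute, edge) and summed per tenant, this turns the pricing and churn conversations described above into arithmetic rather than estimates.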

Web vitals became more consistent across regions, which matters for a product that exists to deliver web content. Incidents became rarer and easier to isolate, partly because documentation now existed where before there was only oral memory.

And the team could breathe again. When cost stopped being a shadow over every technical decision, people could go back to discussing product and evolution without every idea hitting the wall of “but how much will that cost us?” Retaining good people got easier, and that alone is a form of savings that never shows up on the invoice.

What I take from this process

High cloud cost is usually the result of old decisions that keep running without review. Some sit in infrastructure. Some sit in code. Some sit in responsibilities that ended up in the wrong layer and stayed there too long. When that kind of review is done with measurement, you stop trying to save in the dark and start applying engineering to what actually weighs the most.

There is no “best cloud.” There is the right cloud for each piece of the problem. That is what made the difference here, and it did so without burning out the people who were building the product.

Before and after

| Aspect | Before | After |
| --- | --- | --- |
| Monthly cost | Continuously rising, projected to hit a new high the following month | Cumulative reduction close to 85% over roughly 12 months |
| Per-tenant cost | Rough estimate, no real traceability | Near-exact calculation per customer, usable in commercial decisions |
| Data transfer | Largest line item on the bill, hard to attribute to customers | Absorbed by Cloudflare's edge, with predictable cost |
| CDN, WAF, and multi-tenant SSL | Expensive when aggregated | Cloudflare with SSL for SaaS and Workers covering the edge |
| Documentation | Less than 10% of operations recorded anywhere | Decisions, trade-offs, and critical flows documented throughout the process |
| Roadmap | Pressure to freeze features to control costs | New releases and generative AI features shipped in parallel |
| Web vitals | Inconsistent across regions and routes | Consistent improvement, aligned with the platform's content-delivery purpose |

Stack and environment

  • AWS (EC2, ECS, Lambda, S3, Data Transfer)
  • Cloudflare (CDN, WAF, SSL for SaaS, Workers)
  • React / Next.js (platform front-end)
  • Multi-tenant content-delivery SaaS
  • Per-tenant cost observability
  • Generative AI model integration
  • FinOps applied to a live product

In one sentence

This was not a standalone FinOps project. It was applied engineering on a live operation, removing waste from code, architecture, and infrastructure without stalling the product or burning out the people building it.

Expensive cloud is rarely the cloud's fault. It is almost always the result of architectural decisions that nobody had time to revisit after the product started running.
