
Rebuild vs Refactor: How to Decide When Your SaaS Platform Is Breaking

A practical decision framework for founders and CTOs facing a degrading SaaS platform. When strangler fig refactoring is enough, when a rewrite is the only honest answer, and how to avoid the traps that kill both.

Jahja Nur Zulbeari | 11 min read

The Question Nobody Wants to Answer Honestly

Your SaaS platform is visibly struggling. Deployments take two hours and fail one in four times. Shipping a feature that used to take a sprint now takes a quarter. Your senior engineers spend more time in incident calls than writing code. New hires take six months to become productive because the codebase is an archaeological site, not a system. (If you are still at the earlier stage of deciding what to build and how to structure it, read our guide on SaaS platform architecture decisions before the technical debt accumulates.)

You know something has to change. The question is whether you modernise what you have or start again.

This is one of the highest-stakes decisions a founder or CTO makes. Get it wrong in either direction and the cost is measured in years: a premature rebuild that consumes eighteen months and delivers nothing, or a refactoring programme that cleans the surface while the structural rot deepens underneath.

This post gives you a framework for making that decision rigorously, without the motivated reasoning that typically corrupts it.

Why the Decision Gets Made Badly

Most “rebuild vs refactor” decisions are made emotionally, not analytically. Engineers who built the original system want to refactor it — they understand it, they feel ownership, and a rebuild implicitly criticises their prior work. Engineers who joined later want to rebuild — they did not write the original code, they are not attached to it, and a greenfield project sounds more interesting than archaeology.

Neither position is analysis. Both produce confident recommendations for opposite conclusions from the same codebase.

The decision also gets corrupted by sunk cost reasoning. “We’ve invested three years in this, we can’t throw it away” is not architecture thinking. You are not throwing away three years — you are deciding whether the next three years are better spent extending a broken foundation or building a sound one. The past investment is irrelevant to that question.

The framework below is designed to force objective signals into the conversation.

Signal 1: Deployment Frequency and Failure Rate

A healthy engineering team deploying a maintainable codebase ships to production multiple times per week. If your deployment frequency has dropped below once per week, or if more than 20% of deployments require an immediate rollback or hotfix, that is a concrete signal that the system is actively resisting change.

The question is why. Deployment friction comes from two very different sources:

Code quality problems show up as test suites that are too slow, flaky, or absent; build pipelines that are not parallelised; deployments that require manual steps that engineers have never automated. These are solvable with disciplined refactoring and investment in engineering infrastructure. They do not require a rebuild.

Architectural problems show up as deployments that require coordinating changes across multiple tightly-coupled modules simultaneously; data migrations that cannot be run without downtime; configuration that is spread across the codebase in ways that make it impossible to deploy a change without touching thirty files. If coordinating a deployment requires a spreadsheet and a thirty-minute video call, that is architecture, not code quality.

Measure your mean time to deploy a one-line change. If it is measured in hours, not minutes, you have an architecture problem.
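The thresholds above (frequency below once per week, more than 20% of deployments rolled back, hours-long deploys) are easy to compute once deployments are logged. A minimal sketch, assuming a hypothetical record shape you would populate from your CI system's deployment history:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Deployment:
    started: datetime       # when the deploy began
    finished: datetime      # when it completed (or was rolled back)
    rolled_back: bool       # did it need an immediate rollback or hotfix?

def deploy_health(deploys: list) -> dict:
    """Deploys per week, change-failure rate, and mean deploy duration."""
    if not deploys:
        return {"per_week": 0.0, "failure_rate": 0.0, "mean_minutes": 0.0}
    span_days = (max(d.finished for d in deploys)
                 - min(d.started for d in deploys)).days or 1
    failures = sum(1 for d in deploys if d.rolled_back)
    total_minutes = sum((d.finished - d.started).total_seconds()
                        for d in deploys) / 60
    return {
        "per_week": len(deploys) / span_days * 7,
        "failure_rate": failures / len(deploys),
        "mean_minutes": total_minutes / len(deploys),
    }
```

The numbers matter less than the trend: run this over rolling 90-day windows and watch whether frequency is falling while failure rate and duration rise.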

Signal 2: Time-to-Feature

Track how long it takes to go from a well-scoped feature request to that feature in production. Not discovery, not design — just implementation and delivery for a feature whose requirements are already clear.

In a healthy codebase, a medium-complexity feature takes one to two sprints. If your honest answer is one to two quarters, something structural is wrong. Features should not compound in difficulty over time — if they do, the codebase is not a foundation you are building on, it is a constraint you are working around.

Pay attention to where the time goes. If it goes into understanding — engineers spending more time reading code than writing it — you have a comprehensibility problem that refactoring can address. If it goes into untangling — every change requires touching seven other modules because the dependencies are incoherent — you have an architectural coupling problem. If it goes into re-implementing — you keep rebuilding the same infrastructure because there is no shared layer — you have a structural duplication problem that may be too deep to refactor economically.

Signal 3: Test Coverage Debt and Knowledge Silos

These two metrics are leading indicators that predict future degradation, not just present pain.

Test coverage debt is not the same as test coverage percentage. A codebase can have 80% line coverage with almost no useful tests — happy-path unit tests that verify the obvious and miss the business logic edge cases that actually break. The meaningful question is: can your engineers refactor a core module with confidence that the tests will catch regressions? If the answer is “only if we’re very careful” or “we always test manually afterwards,” you have test debt that will compound every time you touch anything.

A refactoring programme can address test debt incrementally — write tests for the behaviour you understand, then refactor the code. This is the classic “characterisation test” approach: before you change anything, write tests that document what the system currently does. Then you have a safety net.
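A characterisation test can look almost trivial on the page. The point is what the assertions mean: they pin down observed behaviour, not intended behaviour. A minimal sketch, where `legacy_prorate` stands in for a real legacy function (the function and its quirks are hypothetical):

```python
def legacy_prorate(amount_cents: int, days_used: int, days_in_period: int) -> int:
    # Stand-in for tangled legacy billing logic we do not yet fully understand.
    if days_in_period <= 0:
        return 0
    return amount_cents * days_used // days_in_period

def test_prorate_characterisation():
    # These assertions document what the code DOES, not what it SHOULD do.
    # If one looks wrong, that is a finding to record, not a test to "fix".
    assert legacy_prorate(3000, 15, 30) == 1500
    assert legacy_prorate(3000, 0, 30) == 0
    assert legacy_prorate(3000, 10, 0) == 0  # surprising, but current behaviour
```

Once a module is covered this way, refactoring becomes mechanical: any change that turns a test red has changed behaviour, and you decide deliberately whether that change was intended.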

Knowledge silos are more dangerous. If only one engineer understands how the billing module works, and a different engineer is the only one who can safely touch the authentication layer, and the two of them are the only people who can deploy — you have a bus factor problem that is architecturally embedded. The knowledge is not just in people’s heads; it is in code that is incomprehensible without years of context.

Silos that are purely human (knowledge in people’s heads that has not been documented) can be addressed with documentation and pairing. Silos that are architectural (the code is so tangled that no amount of documentation makes it comprehensible to someone new) are structural and do not respond to refactoring.

Signal 4: Performance Problems — Architecture or Code Quality?

Performance degradation is one of the most misread signals in this decision. Engineers instinctively reach for profilers and query optimisers, and sometimes that is the right answer. But persistent performance problems that resist optimisation often signal architectural issues: a data model that requires N+1 queries because the relationships are wrong, a caching strategy that cannot work because the access patterns are incoherent, a synchronous request chain that spans five services because the boundaries were drawn incorrectly.
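The N+1 shape mentioned above is worth seeing concretely. A deliberately tiny illustration against an in-memory SQLite table (the `invoices` schema is hypothetical): the first function issues one query per tenant, the second issues one query total. In a real system the batched form is only available if the data model supports it; when the relationships are wrong, you are stuck with the first shape, and no profiler can reduce the query count for you:

```python
import sqlite3

def invoices_n_plus_one(conn, tenant_ids):
    # One round trip per tenant: query count grows linearly with N.
    return {
        tid: conn.execute(
            "SELECT id FROM invoices WHERE tenant_id = ? ORDER BY id", (tid,)
        ).fetchall()
        for tid in tenant_ids
    }

def invoices_batched(conn, tenant_ids):
    # One round trip total; grouping happens in application code.
    marks = ",".join("?" * len(tenant_ids))
    rows = conn.execute(
        f"SELECT tenant_id, id FROM invoices "
        f"WHERE tenant_id IN ({marks}) ORDER BY id",
        list(tenant_ids),
    ).fetchall()
    grouped = {tid: [] for tid in tenant_ids}
    for tenant_id, invoice_id in rows:
        grouped[tenant_id].append((invoice_id,))
    return grouped
```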

Ask this question: have you already optimised the obvious things, and are you now optimising the wrong things because the right things cannot be optimised without changing the structure? If so, you are treating an architectural symptom with a code-quality remedy.

The architecture decisions you make early in a SaaS product’s life — data model, tenancy strategy, API design — determine the ceiling of what optimisation can achieve. If your ceiling is too low, optimisation cannot lift you past it.

The Strangler Fig Pattern: The Default for Most Situations

If you have concluded that the problem is architectural but the system is still functioning — users are still getting value, the business is still growing — the strangler fig pattern is almost always the right approach before you commit to a full rewrite. Our custom SaaS development practice regularly performs this type of structured migration, and the cost comparison between approaches is significant.

The pattern works like this: you build new functionality on a new, well-designed foundation alongside the existing system. You expose both behind a routing layer (often a thin facade or an API gateway) that sends requests to old or new depending on which capabilities have been migrated. Over time, you migrate capabilities from old to new, the new system grows, and the old system shrinks until it handles nothing and can be deleted.
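The routing layer at the heart of the pattern can be very small. A minimal sketch, with illustrative capability names and handler dictionaries rather than a real gateway configuration:

```python
class StranglerRouter:
    """Facade that dispatches each capability to legacy or new.

    As capabilities are migrated, requests flow to the new system;
    when `migrated` covers everything, the legacy system can be deleted.
    """

    def __init__(self, legacy_handlers: dict, new_handlers: dict):
        self.legacy = legacy_handlers
        self.new = new_handlers
        self.migrated = set()  # capabilities now served by the new system

    def migrate(self, capability: str):
        # Refuse to cut over a capability the new system cannot serve yet.
        if capability not in self.new:
            raise ValueError(f"no new implementation for {capability!r}")
        self.migrated.add(capability)

    def handle(self, capability: str, *args):
        target = self.new if capability in self.migrated else self.legacy
        return target[capability](*args)
```

In production this role is usually played by an API gateway or reverse-proxy routing rules rather than in-process dispatch, but the property is the same: migration happens one capability at a time, and each cutover is reversible by removing it from the migrated set.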

The advantages are significant. You are running the new architecture against real production traffic, which means you discover integration problems, edge cases, and missing business logic before you have fully committed. You continue shipping product features throughout the migration. You can stop and reassess at any point — if the new architecture is not working, you have not bet the company on it.

The disadvantages are real too. It requires maintaining two systems simultaneously, which has coordination overhead. If the legacy system is deeply tangled, identifying clean migration boundaries is hard work. And it takes longer than a clean rewrite when measured in calendar time — though rarely longer when measured in risk-adjusted delivery probability.

For most SaaS modernisation projects, the strangler fig is the correct default. Start here.

When a Full Rebuild Is the Honest Answer

There are specific circumstances where the strangler fig pattern is not viable and a full rebuild is the right call. Be honest about these criteria, because engineers who want to rebuild will over-apply them.

The data model is fundamentally wrong. This is the clearest signal — and understanding what “correct” looks like is a prerequisite. The SaaS product development process covers data model design as a formal discovery stage, precisely because retrofitting it is so expensive. If the schema does not reflect the domain — if your core entities are wrong, if you have been encoding business logic in column naming conventions, if multi-tenancy was bolted on after the fact in a way that touches every query — then no amount of incremental migration fixes the underlying problem. Every new feature you build on the strangler fig inherits the wrong foundation. You will do the migration work and still end up with a broken data model.

The codebase is actively preventing you from understanding what it does. There is a threshold of complexity beyond which the code cannot be understood incrementally. If your engineers cannot safely change a module without running the full test suite and manually checking five other parts of the system, and if that is true for most of the codebase, then the comprehension cost of a refactoring programme exceeds the cost of a clean rebuild. You would spend more time reverse-engineering the existing system than building a new one.

The technology stack is a competitive liability. This one requires more discipline. “We’d rather use a different framework” is not a reason to rebuild — that is preference, not need. But if your stack cannot support the capabilities your product needs to compete — if you need real-time features and your architecture is fundamentally synchronous, if you need multi-region deployment and your data layer cannot support it — then the stack limitation is a strategic problem, not just a technical preference.

The Second-System Trap

If you decide to rebuild, the biggest risk is not technical. It is the second-system trap: the tendency to use the rebuild as an opportunity to fix every mistake you made the first time, build every abstraction you wished you had, and produce a perfect, generalised architecture that solves problems you do not yet have.

The first system was constrained by ignorance — you did not know what the product needed to become. The second system is constrained by hubris — you know what went wrong, so you compensate by removing all constraints. The result is typically a system that takes three times longer to build than estimated and collapses under the weight of its own unused abstraction.

The discipline required is treating the rebuild as a product problem. Define the scope by what your users need, not by what your engineers want to do differently. Every feature in the rebuild scope should map to a user outcome. If an engineer proposes an abstraction that does not directly enable a user capability, it should require explicit justification. Set a hard delivery date and protect it.

The cost of a failed software project is not just the money spent — it is the market position lost while your engineering team was heads-down on infrastructure. That cost is real and it compounds.

The Decision Framework

Work through these questions in order. Stop when you have your answer.

1. Is your data model correct? If no, and if fixing it requires structural changes across the entire schema, you are likely looking at a rebuild. If yes, or if the data model problems are localised, continue.

2. Are your deployment problems rooted in code quality or architecture? If code quality (test debt, manual steps, missing automation), refactoring can address it. If architecture (coupled deployments, migration-dependent releases), continue.

3. Can you identify clean strangler fig migration boundaries? If the system is modular enough that you can route specific capabilities to a new implementation without pulling everything else with it, strangler fig is viable. If everything is tangled, continue.

4. Is the technology stack a strategic liability? If yes, for reasons that affect competitive capability (not just preference), factor this into the rebuild case. If no, this is not a factor.

5. Do you have the organisational capacity for a parallel rebuild? A rebuild that halts feature development is a commercial risk. If you cannot staff a dedicated rebuild track independently of the feature team, you need to phase the work differently, not skip the rebuild decision.

If you reach step 5 with consistent rebuild signals, you have an honest case for a full rebuild. If you are reaching for rebuild at step 2 on the basis of code quality problems, you are likely falling into the emotional reasoning described at the start of this post.
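The five questions above can be sketched as a short decision function. Every input is a judgment call, not a computable metric, so this is a way to make the reasoning explicit, not a tool that decides for you. Question 4 (stack liability) is omitted because it weights the rebuild case rather than gating it:

```python
def rebuild_or_refactor(
    data_model_broken_globally: bool,   # Q1: schema wrong across the board?
    deploy_problems_architectural: bool, # Q2: architecture, not code quality?
    clean_strangler_boundaries: bool,    # Q3: can capabilities be routed out?
    can_staff_parallel_rebuild: bool,    # Q5: rebuild without halting features?
) -> str:
    """Walk the framework's questions in order; stop at the first answer."""
    if data_model_broken_globally:
        # A strangler fig would inherit the broken foundation (Q1).
        return "rebuild" if can_staff_parallel_rebuild else "phase the rebuild"
    if not deploy_problems_architectural:
        # Code-quality problems: refactoring and infrastructure work suffice.
        return "refactor"
    if clean_strangler_boundaries:
        return "strangler fig"
    # Architectural problems with no clean boundaries: honest rebuild case.
    return "rebuild" if can_staff_parallel_rebuild else "phase the rebuild"
```

Notice where "refactor" exits: before any rebuild branch is reachable. That is the point made above about step 2; code-quality pain alone never routes to a rebuild.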

What to Do Next

If you are in this decision and want an independent technical perspective, the starting point is a structured codebase audit — not a surface review, but a systematic analysis of data model integrity, coupling metrics, deployment pipeline analysis, and test coverage quality.

At Zulbera, we work with founders and CTOs on exactly this kind of assessment as part of our custom SaaS development engagements and enterprise platform work. The goal is not to sell you a rebuild — it is to give you a rigorous answer to the question before you commit significant resource in either direction.

The decision deserves that rigour. The consequences of getting it wrong are too large to make it on instinct.

Jahja Nur Zulbeari


Founder & Technical Architect

Zulbera — Digital Infrastructure Studio
