Introducing SPF: An Assessment for Choosing Software by Fit

The recurring decision this method was built for is mundane to state and surprisingly hard to settle: given a need, what software should we adopt to meet it, and in what order relative to everything else on the roadmap? The instance that prompted it was a sequencing question — whether to put declarative configuration of our database environment ahead of building pipeline orchestration. Both were defensible. What was missing wasn't an opinion; it was an honest way to score the choice, so the answer rested on something more durable than whoever argued longest.

The instrument most teams reach for is an automation maturity ladder: manual at the bottom, scripted above it, declarative above that, autonomous at the top, with an implied instruction to climb. The trouble is that the ladder encodes a claim that doesn't survive scrutiny — that more automation is always better. It isn't. Over-automating a problem is a real and expensive failure mode, and a ladder is structurally incapable of seeing it, because it only measures height.

So the intention behind SPF was narrow and practical: a way to choose software that is matched to what a need actually requires, and that treats over-engineering as the failure it is. That intention has good precedent. The levels-of-automation literature has long framed its central question not as how high can we go but as which functions should be automated and to what extent — an explicitly bidirectional question (Parasuraman, Sheridan and Wickens, 2000). SPF takes that same posture and points it at software selection.

What SPF is

SPF assesses a need, and the candidate solutions that could meet it, and produces a recommendation you can defend rather than merely assert. The name is the method — it makes you look at three things:

Stack — where in your architecture the need lives.
Paradigm — how a candidate solution works, from manual to autonomous.
Fit — how closely a solution sits to what the need actually requires.

The first two are axes you locate a need on; the third is the score you compute from them. (The name is lifted from sunscreen on purpose: it's protection against over-exposure — except the exposure you're guarding against is over-engineering.)

Each axis deliberately borrows its shape from an existing, well-understood standard, so the method isn't asking you to trust something invented from scratch.

The Stack axis: a layered-architecture model

The vertical axis is an abstraction stack, and it's the same shape as the layered models software has used for decades — most familiarly the OSI networking model, where each layer is built on the one beneath it and consumes the abstractions the lower layer exposes without needing to know its internals. Read bottom to top: infrastructure provisioning, host and OS configuration, service configuration, schema and data model, orchestration, and finally data products with their lineage and observability.

The ordering is by dependency, and there's a falsifiable test for it: could this layer operate correctly if the one beneath it did not exist? Orchestration needs a configured, reachable database; a configured database doesn't need an orchestrator. So configuration sits genuinely beneath orchestration — a fact about the dependency graph, not a matter of taste.

That observation is what settled the sequencing question that started this. Orchestration is a coordinator: it acts on servers, accounts, and permissions that have to exist first, and the configuration layer is what produces them. You can't orchestrate against infrastructure you haven't defined yet. "Configuration first" stopped being a preference and became a consequence of where the two needs sit on the stack.
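The dependency ordering can be sketched in a few lines of Python. This is only an illustration: the `StackLayer` enum and the `sequence` helper are names invented here, not part of SPF, and the numbering simply follows the six layers listed above, bottom to top.

```python
from enum import IntEnum


class StackLayer(IntEnum):
    """The six stack layers from the post, numbered bottom (1) to top (6)."""
    INFRA_PROVISIONING = 1
    HOST_OS_CONFIG = 2
    SERVICE_CONFIG = 3
    SCHEMA_DATA_MODEL = 4
    ORCHESTRATION = 5
    DATA_PRODUCTS = 6


def sequence(needs: dict[str, StackLayer]) -> list[str]:
    """Order needs so that lower layers are tackled first.

    Assumption: needs on the same layer keep their input order
    (Python's sorted() is stable).
    """
    return sorted(needs, key=lambda name: needs[name])


roadmap = {
    "pipeline orchestration": StackLayer.ORCHESTRATION,
    "declarative database configuration": StackLayer.SERVICE_CONFIG,
}
print(sequence(roadmap))
# ['declarative database configuration', 'pipeline orchestration']
```

Sorting by layer is the whole trick: once both needs are located on the stack, the order falls out of the numbers rather than out of the argument.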

Like every layered model, the axis is unambiguous for distant layers and a little blurry for adjacent ones (service configuration and schema bleed into each other). That's a property of the genre, not a defect, and you use the axis for the distinctions it draws cleanly.

The Paradigm axis: two standards, braided

The horizontal axis is where SPF does something less conventional: it braids two existing standards that each capture half of what "how a solution works" means.

The first strand is the levels-of-automation scale — how much of the work is delegated to the machine. This is old, well-studied ground, beginning with Sheridan and Verplank's levels of automation (1978) and refined by Parasuraman, Sheridan and Wickens (2000); its most familiar descendant is SAE J3016, the levels of driving automation from 0 to 5 that anchor "Level 0" at no automation. These scales measure who decides and acts — human, machine, or some split.

The second strand is the imperative-versus-declarative distinction — how the behaviour is specified. This comes from a different tradition: the desired-state configuration lineage, where you declare an end state and an engine converges to it, formalized in Mark Burgess's work on convergent configuration and promise theory (1995 onward). The design principle that governs it is the rule of least power (Berners-Lee & Mendelsohn, 2006): choose the least powerful language sufficient for a purpose, because less power is more predictable and more reusable.

SPF's five paradigm levels are what you get when you lay those two strands along one line:

0 — Manual. You do it by hand; nothing is encoded.
1 — Imperative. You write the steps; a machine runs them. You own the how.
2 — Declarative. You declare the desired state; an engine reaches it and corrects drift. You own the what.
3 — Intelligent-assisted. You give a goal; an AI or heuristic proposes a change, but a deterministic gate stays in charge.
4 — Autonomous. You give a goal; the system senses, decides, and acts at runtime, ungated.

It matters to be honest that this is a braid. The two strands co-vary in infrastructure tooling, which is why one axis is workable, but they are independent in principle — Terraform is fully declarative yet decides nothing at runtime, and a hand-coded control loop is highly autonomous yet thoroughly imperative. Neither levels-of-automation nor the imperative/declarative distinction can express the other on its own; the SAE-style scales say nothing about specification style, and the paradigm distinction says nothing about autonomy. For a tool that sits off the diagonal where the two strands part ways, score it by its autonomy and flag it for a closer look.
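One way to make the "score by autonomy, flag the off-diagonal" rule concrete is to record both strands per tool. The sketch below is hypothetical: the `Tool` record, the `place` helper, and in particular the `expected_style` mapping (which spec style counts as "on the diagonal" for each level) are assumptions made for illustration, not definitions from the standards cited above.

```python
from dataclasses import dataclass

# SPF paradigm levels, as listed in the post.
LEVELS = {0: "manual", 1: "imperative", 2: "declarative", 3: "assisted", 4: "autonomous"}


@dataclass(frozen=True)
class Tool:
    name: str
    autonomy: int    # 0-4: how much deciding and acting is delegated
    spec_style: str  # "imperative" or "declarative": how behaviour is specified


def expected_style(level: int) -> str:
    # Assumption (not from the post): on the diagonal, levels 0-1 are
    # specified step by step and levels 2-4 by desired state or goal.
    return "imperative" if level <= 1 else "declarative"


def place(tool: Tool) -> tuple[int, bool]:
    """Return (paradigm level to score, needs_closer_look)."""
    if tool.autonomy not in LEVELS:
        raise ValueError(f"autonomy must be 0-4, got {tool.autonomy}")
    off_diagonal = tool.spec_style != expected_style(tool.autonomy)
    return tool.autonomy, off_diagonal


cron_script = Tool("nightly cron script", autonomy=1, spec_style="imperative")
control_loop = Tool("hand-coded control loop", autonomy=4, spec_style="imperative")
print(place(cron_script))   # (1, False)
print(place(control_loop))  # (4, True)
```

The hand-coded control loop is scored at level 4 because that is how much it decides on its own, and the flag records that its specification style does not match what the single axis would suggest.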

The framework as a matrix

Put the two axes together and you get a grid, and because every cell is a real category of solution, the matrix doubles as a map of the tooling landscape: a need lives in a row, and choosing software means picking a column. Here it is filled in with representative tools — placements are approximate, since plenty of tools straddle two columns:

Stack layer ↓ / Paradigm →	0 · Manual	1 · Imperative	2 · Declarative	3 · Assisted	4 · Autonomous
1 · Infra provisioning	cloud-console click-ops	Azure / AWS CLI, shell	Terraform, Bicep, CloudFormation, Pulumi	AI-assisted IaC behind review	autoscalers (Karpenter, Cluster Autoscaler)
2 · Host / OS config	SSH in and edit	bash, Ansible (procedural)	Puppet, Chef, DSC, Ansible (state)	drift detection w/ suggested fixes	self-healing config agents (CFEngine)
3 · Service config (e.g. SQL Server)	SSMS, hand-run T-SQL	`sp_configure` / setup scripts	SqlServerDsc, DSC, dbatools	policy-as-code w/ review, SQL Assessment	continuous drift auto-remediation
4 · Schema / data model	hand-run `ALTER` statements	bespoke migration scripts	dbt, Flyway, Liquibase, DACPAC	AI-generated migrations to review	auto-applied schema evolution
5 · Orchestration	run jobs by hand	cron + scripts, classic Airflow / Prefect DAGs	Dagster assets, Kestra, Airflow assets	AI-authored pipelines behind a gate	self-healing / agentic orchestration
6 · Data products / observability	spreadsheets, eyeballing	custom monitoring scripts	data contracts, OpenLineage, dbt tests	anomaly detection, alerts to review	autonomous data-quality remediation

Reading left to right, each column hands more of the work to the machine. Two things from the rest of this post are already visible in the grid: artifact authority is richest in the declarative and assisted columns and thins out at autonomous, and the rule of least power says pick the leftmost column that fully covers the need. The matrix shows what exists; the Fit score, next, tells you which column to aim for.

Scoring fit: the heart of the method

This is what separates SPF from a relabeled ladder. Rather than scoring how high a solution climbs, you score how close it lands to what the need requires. Three numbers per capability.

The needed level, N — derived from demand, not aspiration. Repetition, volatility, blast radius, and scale push N up; specifiability and longevity cap it:

One-off or rarely repeated → N = 0–1
Repeated, drift-prone, specifiable, long-lived → N = 2
The above, plus real value from AI-accelerated authoring behind a gate → N = 3
Desired state genuinely unspecifiable and runtime adaptation essential → N = 4

The last line is load-bearing: an unspecifiable problem is the only legitimate reason to reach level 4. If you can write the desired state down, you almost certainly shouldn't be paying for runtime inference.
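The rubric is short enough to write down as code. The following is a minimal sketch, not a validated demand model: the `Demand` fields and the `needed_band` function are names invented here, and the fallback for cases the rubric does not mention is an assumption, flagged in the comments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Demand:
    repeated: bool
    drift_prone: bool
    specifiable: bool
    long_lived: bool
    ai_authoring_value: bool = False
    runtime_adaptation_essential: bool = False


def needed_band(d: Demand) -> tuple[int, int]:
    """Return the (lowest, highest) needed level N for a demand profile."""
    # Level 4 is reserved for unspecifiable problems that must adapt at runtime.
    if not d.specifiable and d.runtime_adaptation_essential:
        return (4, 4)
    if d.repeated and d.drift_prone and d.specifiable and d.long_lived:
        # Gated AI-accelerated authoring lifts a level-2 need to level 3.
        return (3, 3) if d.ai_authoring_value else (2, 2)
    # One-off or rarely repeated work lands in the 0-1 band.
    # Assumption: any profile the rubric does not cover also falls here.
    return (0, 1)


one_off = Demand(repeated=False, drift_prone=False, specifiable=True, long_lived=False)
db_config = Demand(repeated=True, drift_prone=True, specifiable=True, long_lived=True)
print(needed_band(one_off))    # (0, 1)
print(needed_band(db_config))  # (2, 2)
```

Returning a band rather than a single number keeps the rubric's "0–1" honest: choosing between manual and a script for a one-off is still a judgment call.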

The fit, which is Current (the paradigm level the capability sits at today) minus N, and it is signed. Zero is matched. Negative is under-served: you pay in manual toil, drift, and errors. Positive is over-served: you pay in setup cost, complexity, fragility, and lost authority. Both directions are failures, and the signed gap is exactly what a maturity ladder cannot represent, because a ladder only ever reports "not high enough."

The priority, which is the absolute fit times the stakes (a 1–3 weight from risk and frequency), so a small misfit on a critical system outranks a large misfit on something trivial.
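As a small sketch of the arithmetic, assuming paradigm levels are the integers 0 to 4 and stakes is an integer weight from 1 to 3 (the function names are illustrative):

```python
def fit(current: int, needed: int) -> int:
    """Signed fit: negative is under-served, positive is over-served, zero is matched."""
    return current - needed


def priority(current: int, needed: int, stakes: int) -> int:
    """Absolute fit weighted by stakes (1 = low, 3 = critical)."""
    if stakes not in (1, 2, 3):
        raise ValueError("stakes must be 1, 2, or 3")
    return abs(fit(current, needed)) * stakes


# A small misfit on a critical system...
critical = priority(current=1, needed=2, stakes=3)
# ...outranks a larger misfit on something trivial.
trivial = priority(current=4, needed=2, stakes=1)
print(fit(1, 2), critical)  # -1 3
print(fit(4, 2), trivial)   # 2 2
```

The sign survives in `fit` so the direction of the recommended move is never lost; only `priority` takes the absolute value, because urgency does not care which way you missed.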

Underneath all three sits the rule of least power, and it has an economic shape worth drawing out, because it's the part that makes the scoring more than a slogan.

The economics behind the U

Effort and cost along the paradigm axis split into a fixed part and a marginal part that move in opposite directions. The up-front cost rises as you move right: manual costs almost nothing to start, while declarative is expensive to begin — you model the desired state, learn the tooling, make operations idempotent. But the marginal cost — per run, per environment — falls as you move right: manual costs the same every time and never amortizes, while declarative re-applies and self-heals for almost nothing.

So total cost is a trade between a rising fixed cost and a falling marginal one, and which level comes out cheapest depends on how many times you'll run the thing. Plotted against paradigm level, the total cost-and-risk of operating a capability is U-shaped: high on the left from toil and drift, high on the right from waste and fragility, lowest at the level the need actually requires. The job of an SPF score is to find the bottom of that U, not the top of the ladder.

This is also why a bought solution changes the maths. A packaged declarative tool means someone else already paid the fixed modeling cost; adopting it drops your intercept and pulls the break-even point sharply left, so it can be worth adopting at a fraction of the repetition that would justify building your own.
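The fixed-versus-marginal trade can be sketched with a linear cost model. Everything numeric below is hypothetical (hours chosen only to show the shape), and the model covers cost alone, not the risk side of the U.

```python
import math


def total_cost(fixed: float, marginal: float, runs: int) -> float:
    """Up-front cost plus per-run cost over a number of runs."""
    return fixed + marginal * runs


def break_even_runs(left: tuple[float, float], right: tuple[float, float]):
    """Smallest whole number of runs at which `right` costs no more than `left`.

    Each option is (fixed_cost, marginal_cost_per_run).
    Returns None when the right-hand option saves nothing per run.
    """
    (fixed_l, marginal_l), (fixed_r, marginal_r) = left, right
    saving_per_run = marginal_l - marginal_r
    if saving_per_run <= 0:
        return None
    return max(0, math.ceil((fixed_r - fixed_l) / saving_per_run))


# Hypothetical figures in hours: (fixed, marginal per run).
manual = (0.0, 4.0)
built_declarative = (80.0, 0.5)    # you pay the modeling cost yourself
bought_declarative = (16.0, 0.5)   # someone else already paid most of it

print(break_even_runs(manual, built_declarative))   # 23
print(break_even_runs(manual, bought_declarative))  # 5
```

With these made-up numbers, building your own pays off after 23 runs and buying after 5, which is the "break-even pulled left" effect in miniature.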

SPF in one picture

To make it concrete, here is SPF reading three illustrative capabilities. Each row locates a need on the map — its stack layer and the paradigm level it requires — and scores it against where it sits today. The Fit column is signed: negative is under-served, positive is over-built, zero is matched.

Capability	Stack layer	Needs	Today	Fit	The move
Database configuration	3 · service config	2 · declarative	0 · manual	−2	Adopt a mature declarative tool
Schema migrations	4 · schema	2 · declarative	2 · declarative	0	Matched — leave it alone
One-off data backfill	5 · orchestration	1 · imperative	4 · autonomous	+3	Over-built — replace the agent with a script

Three capabilities on the SPF map. Vertical position is the stack layer; horizontal position is the paradigm. Every recommended move points toward the fit level — rightward for the under-served database config, leftward for the over-built backfill, and nowhere at all for the already-matched migrations.

The signed fit is what makes the backfill row legible. On a paradigm axis alone it scores a 4 — top of the ladder, apparently exemplary. SPF reads it as over-built: runtime autonomy spent on a job that runs once, where an imperative script is the better fit. And the matched migrations score a priority of zero and earn the rarest recommendation in software — leave them alone.

Scoring one solution against another

The per-capability picture above reads the current state. But because Fit is a number, the same machinery scores competing solutions against the enterprise's needs, so you can rank one vendor or approach against another and defend the choice.

Suppose the enterprise has three needs, each with a needed level and a stakes weight, and three candidate approaches are on the table: build it ourselves with scripts (which lands every need at imperative), adopt a declarative platform (declarative across the board), or buy an autonomous AI-ops suite (autonomous everywhere). Scoring each need as signed fit, then weighting by stakes:

Enterprise need	Stack	Needs (N)	Stakes	Scripts · lvl 1	Declarative platform · lvl 2	Autonomous suite · lvl 4
Database configuration	3	2	3	−1 → 3	0 → 0	+2 → 6
Schema migrations	4	2	2	−1 → 2	0 → 0	+2 → 4
Pipeline orchestration	5	2	2	−1 → 2	0 → 0	+2 → 4
Total weighted misfit				7	0	14

Each cell is signed fit → weighted misfit (|fit| × stakes). Lower total is a better fit for the enterprise.
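The table's totals can be reproduced directly. This sketch takes the needs, stakes, and candidate levels from the table above; the variable and function names are illustrative.

```python
# (need, needed level N, stakes) rows from the table.
needs = [
    ("Database configuration", 2, 3),
    ("Schema migrations", 2, 2),
    ("Pipeline orchestration", 2, 2),
]

# Each candidate lands every need at a single paradigm level.
candidates = {
    "Scripts": 1,
    "Declarative platform": 2,
    "Autonomous suite": 4,
}


def total_weighted_misfit(level: int) -> int:
    """Sum of |fit| x stakes across all needs for one candidate level."""
    return sum(abs(level - needed) * stakes for _, needed, stakes in needs)


totals = {name: total_weighted_misfit(level) for name, level in candidates.items()}
ranking = sorted(totals, key=totals.get)  # lowest misfit first

print(totals)   # {'Scripts': 7, 'Declarative platform': 0, 'Autonomous suite': 14}
print(ranking)  # ['Declarative platform', 'Scripts', 'Autonomous suite']
```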

The ranking is the whole argument in one number. The declarative platform wins outright with zero misfit. The autonomous suite (top of the ladder, the "most advanced" option, the one with the best demo) scores worst, because it over-serves every need and charges you for power you won't use. The scripts leave everything under-served. A maturity score would have put the worst-fitting option first and the best-fitting one second; SPF ranks them by fit.

This also handles the messy real case where solutions cover different things. If a candidate doesn't address a need at all, score it as level 0 for that row — a large under-served misfit — which is precisely how a narrow point-tool loses to a platform that covers more of the enterprise's needs, even when the point-tool is slicker at the one thing it does.
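A sketch of the coverage rule, using the three needs from the table above. The two candidates here are hypothetical: a point tool that matches database configuration perfectly and covers nothing else, and a platform that covers everything but sits one level too high.

```python
# need -> (needed level N, stakes), from the enterprise-needs table.
NEEDS = {
    "Database configuration": (2, 3),
    "Schema migrations": (2, 2),
    "Pipeline orchestration": (2, 2),
}


def total_weighted_misfit(coverage: dict[str, int]) -> int:
    """Total |fit| x stakes; a need the candidate does not address scores as level 0."""
    total = 0
    for need, (needed, stakes) in NEEDS.items():
        level = coverage.get(need, 0)  # uncovered need -> level 0 (manual)
        total += abs(level - needed) * stakes
    return total


# Hypothetical candidates, for illustration only.
point_tool = {"Database configuration": 2}   # perfect fit on one need
platform = {need: 3 for need in NEEDS}       # slightly over-built everywhere

print(total_weighted_misfit(point_tool))  # 8
print(total_weighted_misfit(platform))    # 7
```

The point tool scores zero misfit on the one need it covers and still loses, because the two needs it ignores are charged as fully manual.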

What the scores tell you to do

A fit score isn't only a diagnosis; it points at an action, and the actions are more varied than "climb."

When a capability is under-served and a mature solution already exists at its fit level, adopt it directly rather than building your way up one rung at a time. Writing the imperative version first, then migrating to the declarative tool later, pays the setup cost twice; the buy-versus-build maths above is why skipping the intermediate rung is usually correct.

When a capability is over-served, the action is to dial back — the move no maturity ladder will suggest. Reducing capability, and the cost and fragility that came with it, is a legitimate and money-saving outcome of an assessment.

And sometimes the action is to wait, because the map isn't static and new tools increasingly compete on what they leave behind, not just what they execute. A declarative solution hands you the specification as a version-controlled, authoritative record of intent — Burgess's original point about convergent configuration was precisely that the automation code is a description of the desired end state. A strong new entrant may offer richer artifacts still: lineage, audit trails, reproducible documentation. Those artifacts are a real selection criterion because they make a system reviewable and defensible later. One caution keeps it honest: artifact authority peaks around the declarative and gated-assisted levels and drops at full autonomy, where generated documentation describes behaviour after the fact rather than determining it — so "better artifacts" argues for a strong declarative tool, not the most autonomous one. Waiting carries the cost of the toil you haven't yet solved, so it's an option-value call — but it's a call SPF puts on the table and the ladder never does.
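The mapping from score to action can be sketched as a small function. The `recommend` name and its `mature_solution_exists` flag are hypothetical, and the branch for an under-served need with no mature tool is an assumption: the post leaves that as a judgment call between waiting and building.

```python
def recommend(current: int, needed: int, mature_solution_exists: bool = True) -> str:
    """Turn a signed fit into one of the actions discussed above."""
    gap = current - needed
    if gap == 0:
        return "leave it alone"
    if gap > 0:
        return f"dial back to level {needed}"
    if mature_solution_exists:
        # Skip intermediate rungs: go straight to the fit level.
        return f"adopt a mature solution at level {needed}"
    # Assumption: with no mature tool available, the choice is an
    # option-value call between waiting and building.
    return "weigh waiting against building"


print(recommend(current=0, needed=2))  # adopt a mature solution at level 2
print(recommend(current=2, needed=2))  # leave it alone
print(recommend(current=4, needed=1))  # dial back to level 1
print(recommend(current=0, needed=2, mature_solution_exists=False))
# weigh waiting against building
```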

Where SPF still needs work

Two parts are softer than the rest, and they're the inputs a real spend decision depends on.

The first is assessing need. Today N comes from a short rubric (repetition, volatility, blast radius, and scale pushing it up; specifiability and longevity capping it), which is directionally right but hand-wavy. To trust it for a purchasing decision you'd want those factors as structured, measurable inputs: runs per quarter, number of target environments, change rate, blast radius, regulatory exposure. Turning that into a defensible demand model is the next piece of work, and the levels-of-automation literature's function-by-function analysis is a reasonable starting frame for it.

The second is evidence of real gains. SPF tells you the direction of a good move but not its magnitude, and magnitude is what justifies budget. There's a usable evidence base — for example the DevOps delivery-and-recovery metrics popularized by the DORA research and Accelerate — and pulling those numbers in is what turns a sensible reframe into a business case.

Fit, not altitude

That's SPF: locate a need on the Stack, judge a solution on the Paradigm, and score the Fit between them. It came out of a sequencing decision we had no honest way to score, and the move that unstuck it was giving up altitude as the goal.

A maturity ladder asks "how high have we climbed?", and the answer is always "not high enough," forever. SPF asks "what does this need require, and what's the closest, cheapest way to land exactly there?" — and that question has an answer you can reach, and then stop.

References

The thinking here leans on a few existing standards and ideas, one set per axis:

Sheridan, T. B., & Verplank, W. L. (1978). Human and Computer Control of Undersea Teleoperators. MIT Man-Machine Systems Laboratory. — The original levels-of-automation scale. semanticscholar.org
Parasuraman, R., Sheridan, T. B., & Wickens, C. D. (2000). "A Model for Types and Levels of Human Interaction with Automation." IEEE Transactions on Systems, Man, and Cybernetics — Part A, 30(3), 286–297. — Frames automation as a choice of what to automate and to what extent, the posture SPF adopts. doi:10.1109/3468.844354
SAE International, J3016. Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles. — The 0–5 driving-automation levels; the model for anchoring "Level 0" at no automation. sae.org
Berners-Lee, T., & Mendelsohn, N. (2006). The Rule of Least Power. W3C Technical Architecture Group Finding. — Choose the least powerful tool sufficient for the purpose. w3.org
Burgess, M. (1995). "Cfengine: a site configuration engine." USENIX Computing Systems, 8(3). — The convergent, desired-state configuration model (later, promise theory) underpinning the declarative end of the paradigm axis. markburgess.org
ISO/IEC 7498-1. The Basic Reference Model (OSI). — The layered-architecture template the Stack axis follows.
Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps. — Delivery and recovery metrics, a starting point for sizing the gains an SPF-recommended move would produce.
