Introducing RAVE: Reliability & Validation Engineering
Modern software systems are incredibly sophisticated.
We have CI pipelines, observability platforms, automated testing, SLOs, and increasingly complicated deployment strategies. On paper, reliability should be better than ever.
And yet reliability still feels uncertain.
We deploy systems hoping the tests are sufficient.
We monitor dashboards hoping they will tell us when something goes wrong.
We run incident reviews hoping we learn the right lessons.
And a question keeps nagging at me:
Can we actually prove that our systems are reliable?
Not just observe them.
Not just monitor them.
Prove it.
Reliability is usually implicit
In most organisations, reliability lives in a mixture of places:
CI pipelines
automated tests
observability systems
deployment automation
operational runbooks
incident processes
Each of these contributes something useful. But they rarely form a coherent model.
The result is that reliability becomes implicit.
We assume certain properties hold:
that every pull request ran tests
that deployments can be rolled back
that production metrics are monitored
that secrets aren't leaked
that alerts fire when systems degrade
But those assumptions are rarely formalised or validated continuously.
From assumptions to claims
The starting point for RAVE is simple.
Instead of vague assumptions about reliability, we define explicit claims.
For example:
Every change must pass automated tests before merging
Deployments must be reversible within two minutes
Production services must expose health checks
Critical services must emit availability metrics
Secrets must never appear in logs
Each of these is a reliability claim about the system.
Once written down, the question becomes:
How do we know this claim is actually true?
Claims require evidence
A claim without evidence is just a statement.
So the next step is gathering evidence.
Evidence might come from many places:
CI pipeline results
deployment workflows
monitoring systems
security scanners
configuration repositories
incident records
For example:
Claim:
Every PR must pass tests before merging.
Evidence:
GitHub branch protection rules
CI pipeline test results
merge history
Or:
Claim:
Deployments can be rolled back within two minutes.
Evidence:
deployment system configuration
rollback scripts
automated validation checks
Reliability begins to look less like an abstract quality and more like something that can be continuously validated.
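To make the Claim and Evidence pairing concrete, here is a minimal sketch in Python. It is not a RAVE implementation. The `Claim` and `Evidence` classes and the two check functions are hypothetical names invented for illustration, and the checks are stubs standing in for real lookups against a Git host or CI system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    """One source of evidence; `check` returns True when the evidence is present."""
    source: str
    check: Callable[[], bool]


@dataclass
class Claim:
    """An explicit reliability claim plus the evidence that backs it."""
    statement: str
    evidence: List[Evidence] = field(default_factory=list)

    def is_supported(self) -> bool:
        # A claim without evidence is just a statement, so it is not supported.
        return bool(self.evidence) and all(e.check() for e in self.evidence)


# Hypothetical stubs: a real check would query the Git host or the CI system.
def branch_protection_enabled() -> bool:
    return True


def latest_ci_run_passed() -> bool:
    return True


pr_tests = Claim(
    statement="Every PR must pass tests before merging.",
    evidence=[
        Evidence("GitHub branch protection rules", branch_protection_enabled),
        Evidence("CI pipeline test results", latest_ci_run_passed),
    ],
)

# No evidence attached yet, so this one stays unsupported.
rollback = Claim(statement="Deployments can be rolled back within two minutes.")

print(pr_tests.is_supported())
print(rollback.is_supported())
```

The one design choice worth noting is that an empty evidence list counts as unsupported rather than trivially true, which matches the idea that a claim without evidence is just a statement.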
Reliability is a graph
As I explored this idea, something interesting emerged.
Claims rarely exist in isolation.
They depend on other claims.
For example:
"All tests must pass before merging"
only makes sense if:
pull requests are required
tests exist
CI pipelines run automatically
In other words, reliability properties form a dependency graph.
Some claims support others.
Some claims require underlying capabilities.
Reliability becomes something we can model and reason about structurally.
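One way to picture that structure is a small sketch where each claim lists the claims it depends on. The `DEPENDS_ON` mapping and the `holds` function are illustrative names I am assuming here, not part of any existing tool, and the rule used (a claim holds only if it has evidence and every dependency holds) is just one possible interpretation.

```python
from typing import Dict, FrozenSet, List, Set

# Each claim maps to the claims it depends on (illustrative data only).
DEPENDS_ON: Dict[str, List[str]] = {
    "all tests pass before merging": [
        "pull requests are required",
        "tests exist",
        "CI pipelines run automatically",
    ],
    "pull requests are required": [],
    "tests exist": [],
    "CI pipelines run automatically": [],
}


def holds(claim: str, has_evidence: Set[str],
          _seen: FrozenSet[str] = frozenset()) -> bool:
    """A claim holds if it has its own evidence and every dependency holds."""
    if claim in _seen:
        # A circular dependency cannot prove anything.
        return False
    if claim not in has_evidence:
        return False
    seen = _seen | {claim}
    return all(holds(dep, has_evidence, seen) for dep in DEPENDS_ON.get(claim, []))


everything = set(DEPENDS_ON)
missing_ci = everything - {"CI pipelines run automatically"}

print(holds("all tests pass before merging", everything))   # True
print(holds("all tests pass before merging", missing_ci))   # False
```

In the second call the top-level claim has its own evidence, but it still fails because one of the claims underneath it is unproven.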
Enter RAVE
This is where RAVE (Reliability & Validation Engineering) comes in.
RAVE is an attempt to formalise a simple idea:
Reliability should be something we can prove with evidence, not just hope for.
The core concepts are straightforward:
Claims
Statements about reliability properties.
Evidence
Signals that demonstrate those claims are satisfied.
Validation workflows
Processes that gather and check evidence continuously.
Reliability graphs
A structure that shows how reliability properties depend on each other.
Together, these ideas form the beginnings of a model for provable reliability.
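As a rough illustration of a validation workflow, the sketch below runs every check for every claim and reports which claims are currently proven and which are not. The `run_validation` function, the check functions, and the report layout are all assumptions made for this example. In practice something like this might run on a schedule or as a CI job.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List

Check = Callable[[], bool]


def run_validation(claims: Dict[str, List[Check]]) -> Dict[str, object]:
    """Run all checks per claim and sort the claims into proven and unproven."""
    proven: List[str] = []
    unproven: List[str] = []
    for statement, checks in claims.items():
        try:
            ok = bool(checks) and all(check() for check in checks)
        except Exception:
            # Evidence that cannot be gathered proves nothing.
            ok = False
        (proven if ok else unproven).append(statement)
    return {
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "proven": proven,
        "unproven": unproven,
    }


# Hypothetical checks: real ones would query deployment config or a log scanner.
def health_endpoint_configured() -> bool:
    return True


def log_scan_clean() -> bool:
    raise RuntimeError("scanner unavailable")


claims = {
    "Production services must expose health checks": [health_endpoint_configured],
    "Secrets must never appear in logs": [log_scan_clean],
    "Critical services must emit availability metrics": [],
}

report = run_validation(claims)
print(report["proven"])
print(report["unproven"])
```

A check that raises, or a claim with no checks at all, lands in the unproven list, so the report separates "shown to hold right now" from everything else.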
Why this matters
The industry has made huge progress in automation and observability.
But we still lack a clear way to answer questions like:
How reliable is this system really?
Which reliability guarantees actually hold?
Which ones are currently unproven?
RAVE is an experiment in making those questions easier to answer.
What comes next
Over the next few posts, I'll explore ideas including:
the Claim → Evidence model for reliability
why reliability properties form graphs
how CI/CD pipelines can produce continuous evidence
what Reliability as Code might look like
and how these ideas could evolve into tools like RAVEgraph
If you're interested in platform engineering, SRE, or reliable delivery, I'd love to hear your thoughts as these ideas develop.
Subscribe if you'd like to follow along.