Introducing RAVE: Reliability & Validation Engineering

Modern software systems are incredibly sophisticated.

We have CI pipelines, observability platforms, automated testing, SLOs, and increasingly complicated deployment strategies. On paper, reliability should be better than ever.

And yet reliability still feels uncertain.

We deploy systems hoping the tests are sufficient.
We monitor dashboards hoping they will tell us when something goes wrong.
We run incident reviews hoping we learn the right lessons.

And a question keeps nagging at me:

Can we actually prove that our systems are reliable?

Not just observe them.
Not just monitor them.

Prove it.

Reliability is usually implicit

In most organisations, reliability lives in a mixture of places:

  • CI pipelines

  • automated tests

  • observability systems

  • deployment automation

  • operational runbooks

  • incident processes

Each of these contributes something useful. But they rarely form a coherent model.

The result is that reliability becomes implicit.

We assume certain properties hold:

  • that every pull request ran tests

  • that deployments can be rolled back

  • that production metrics are monitored

  • that secrets aren't leaked

  • that alerts fire when systems degrade

But those assumptions are rarely formalised, let alone validated continuously.

From assumptions to claims

The starting point for RAVE is simple.

Instead of vague assumptions about reliability, we define explicit claims.

For example:

  • Every change must pass automated tests before merging

  • Deployments must be reversible within two minutes

  • Production services must expose health checks

  • Critical services must emit availability metrics

  • Secrets must never appear in logs

Each of these is a reliability claim about the system.

Once written down, the question becomes:

How do we know this claim is actually true?
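To make "written down" concrete, here is a minimal sketch of the claims above as data. The Claim class, the identifier scheme, and the three status values are my own assumptions for illustration, not a defined RAVE format. The point is only that every claim starts out unproven until something checks it.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    UNPROVEN = "unproven"    # no evidence gathered yet
    SATISFIED = "satisfied"  # evidence supports the claim
    VIOLATED = "violated"    # evidence contradicts the claim


@dataclass
class Claim:
    claim_id: str  # hypothetical identifier, chosen for this example
    statement: str
    status: Status = Status.UNPROVEN


CLAIMS = [
    Claim("tests-before-merge", "Every change must pass automated tests before merging"),
    Claim("fast-rollback", "Deployments must be reversible within two minutes"),
    Claim("health-checks", "Production services must expose health checks"),
    Claim("availability-metrics", "Critical services must emit availability metrics"),
    Claim("no-secrets-in-logs", "Secrets must never appear in logs"),
]


def unproven(claims):
    """Return the ids of claims that nothing has validated yet."""
    return [c.claim_id for c in claims if c.status is Status.UNPROVEN]


# Until evidence is attached, every claim is listed here.
print(unproven(CLAIMS))
```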

Claims require evidence

A claim without evidence is just a statement.

So the next step is gathering evidence.

Evidence might come from many places:

  • CI pipeline results

  • deployment workflows

  • monitoring systems

  • security scanners

  • configuration repositories

  • incident records

For example:

Claim:

Every PR must pass tests before merging.

Evidence:

  • GitHub branch protection rules

  • CI pipeline test results

  • merge history

Or:

Claim:

Deployments can be rolled back within two minutes.

Evidence:

  • deployment system configuration

  • rollback scripts

  • automated validation checks

Reliability begins to look less like an abstract quality and more like something that can be continuously validated.
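The two examples above could be sketched roughly like this. Everything here is an assumption made for illustration: the Evidence class, the dictionary shapes, and the sample values stand in for what a real integration with GitHub or a deployment system would return. A claim is treated as holding only when it has evidence and every check passes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Evidence:
    source: str                # where the signal came from
    check: Callable[[], bool]  # returns True if the signal supports the claim


def validate(evidence: list[Evidence]) -> dict[str, bool]:
    """Run every check and report the result per source."""
    return {e.source: e.check() for e in evidence}


def claim_holds(evidence: list[Evidence]) -> bool:
    """A claim holds only if there is evidence and every check passes."""
    results = validate(evidence)
    return bool(results) and all(results.values())


# Hypothetical snapshots of what the real systems might report.
branch_protection = {"require_pull_request": True, "required_checks": ["unit-tests"]}
merge_history = [
    {"pr": 101, "tests_passed": True},
    {"pr": 102, "tests_passed": True},
]
rollback_drill_seconds = [74, 95, 131]  # made-up rollback durations

tests_before_merge = [
    Evidence("branch protection", lambda: bool(branch_protection["required_checks"])),
    Evidence("merge history", lambda: all(m["tests_passed"] for m in merge_history)),
]
fast_rollback = [
    Evidence("rollback drills", lambda: max(rollback_drill_seconds) <= 120),
]

print(claim_holds(tests_before_merge))  # True
print(claim_holds(fast_rollback))       # False: one drill took 131 seconds
```

Note that an empty evidence list makes a claim fail rather than pass. That is a deliberate choice in this sketch: a claim with no evidence is just a statement.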

Reliability is a graph

As I explored this idea, something interesting emerged.

Claims rarely exist in isolation.

They depend on other claims.

For example:

"All tests must pass before merging"

only makes sense if:

  • pull requests are required

  • tests exist

  • CI pipelines run automatically

In other words, reliability properties form a dependency graph.

Some claims support others.
Some claims require underlying capabilities.

Reliability becomes something we can model and reason about structurally.
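Here is a small sketch of that structure, using the standard library's graphlib module. The claim names and the rule I apply (a claim counts as proven only if its own evidence holds and every claim it depends on is proven) are assumptions for illustration, not a settled part of RAVE.

```python
from graphlib import TopologicalSorter

# Each claim maps to the claims it depends on (names are illustrative).
DEPENDS_ON = {
    "tests-pass-before-merge": {
        "pull-requests-required",
        "tests-exist",
        "ci-runs-automatically",
    },
    "pull-requests-required": set(),
    "tests-exist": set(),
    "ci-runs-automatically": set(),
}


def effective_status(depends_on, own_evidence):
    """Mark a claim proven only if its evidence holds and all dependencies are proven."""
    proven = {}
    # static_order() yields dependencies before the claims that rely on them.
    for claim in TopologicalSorter(depends_on).static_order():
        deps_ok = all(proven[d] for d in depends_on.get(claim, ()))
        proven[claim] = own_evidence.get(claim, False) and deps_ok
    return proven


# Hypothetical evidence results for each claim taken in isolation.
own_evidence = {
    "tests-pass-before-merge": True,
    "pull-requests-required": True,
    "tests-exist": True,
    "ci-runs-automatically": False,  # e.g. CI has to be triggered by hand
}

status = effective_status(DEPENDS_ON, own_evidence)
print(status["tests-pass-before-merge"])  # False
```

Even though the top-level claim has passing evidence of its own, it is reported as unproven because one of the claims underneath it fails. That is the kind of structural reasoning a flat checklist cannot give you.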

Enter RAVE

This is where RAVE — Reliability & Validation Engineering — comes in.

RAVE is an attempt to formalise a simple idea:

Reliability should be something we can prove with evidence, not just hope for.

The core concepts are straightforward:

Claims
Statements about reliability properties.

Evidence
Signals that demonstrate those claims are satisfied.

Validation workflows
Processes that gather and check evidence continuously.

Reliability graphs
A structure that shows how reliability properties depend on each other.

Together, these ideas form the beginnings of a model for provable reliability.

Why this matters

The industry has made huge progress in automation and observability.

But we still lack a clear way to answer questions like:

  • How reliable is this system really?

  • Which reliability guarantees actually hold?

  • Which ones are currently unproven?

RAVE is an experiment in making those questions easier to answer.

What comes next

Over the next few posts I’ll explore ideas including:

  • the Claim → Evidence model for reliability

  • why reliability properties form graphs

  • how CI/CD pipelines can produce continuous evidence

  • what Reliability as Code might look like

  • and how these ideas could evolve into tools like RAVEgraph

If you're interested in platform engineering, SRE, or reliable delivery, I'd love to hear your thoughts as these ideas develop.

Subscribe if you'd like to follow along as the ideas evolve.
