SRE Programme Improving Availability from 99.5% to 99.97%

Project Reference: 37/45

Engagement Duration: 16 weeks

ZippyOPS Team: 4 architects

Measurable Outcomes: 4

The Challenge

What the Client Was Facing

A payment processing fintech had committed to a 99.9% uptime SLA but was averaging 99.5%, which cost it £400k a year in SLA penalties. The team had no SLO framework, alert thresholds were set arbitrarily, and post-mortems were optional and inconsistent.
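To put the gap in concrete terms, the availability percentages can be converted into hours of downtime per year. The short Python sketch below is illustrative only: it assumes a 365-day year and continuous measurement, and the helper name `annual_downtime_hours` is ours, not something taken from the client's tooling.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def annual_downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Figures taken from the case study: the SLA target and the observed average.
    for label, pct in [("SLA target", 99.9), ("observed average", 99.5)]:
        print(f"{label}: {pct}% -> {annual_downtime_hours(pct):.2f} h/year")
```

Under these assumptions the SLA allows roughly 8.76 hours of downtime a year, while a 99.5% average corresponds to roughly 43.8 hours, about five times the allowance.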

Our Role

What ZippyOPS Was Engaged To Do

ZippyOPS was brought in to address the root causes of the availability shortfall and deliver measurable outcomes within a fixed 16-week engagement. Our team was embedded with the client's engineers for the whole project.

The Solution

How We Solved It

ZippyOPS implemented the Google SRE methodology — running SLI/SLO definition workshops, configuring error budget dashboards in Grafana, implementing burn rate alerting and establishing a blameless post-mortem culture. A chaos engineering programme was introduced to proactively find reliability weaknesses.
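As a rough illustration of burn rate alerting, the Python sketch below computes how fast an error budget is being consumed and pages only when both a long and a short window exceed a threshold. It is a minimal sketch under stated assumptions, not the client's configuration: the 99.9% SLO, the 14.4 threshold (the value the Google SRE Workbook suggests for a 1-hour/5-minute window pair) and the names `Window`, `burn_rate` and `should_page` are all chosen for illustration. In practice this logic would normally be expressed as alerting rules in the monitoring stack rather than application code.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one lookback window (hypothetical shape)."""
    total: int   # all requests in the window
    failed: int  # requests that violated the SLI


def burn_rate(window: Window, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO permits; higher values mean it will run out early.
    """
    if window.total == 0:
        return 0.0  # no traffic, nothing burned
    error_rate = window.failed / window.total
    return error_rate / (1 - slo)


def should_page(long_w: Window, short_w: Window, slo: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    The short window confirms the problem is still happening, which
    reduces pages for incidents that have already recovered.
    """
    return (burn_rate(long_w, slo) >= threshold
            and burn_rate(short_w, slo) >= threshold)


if __name__ == "__main__":
    slo = 0.999  # assumed SLO, set equal to the SLA target for illustration
    last_hour = Window(total=100_000, failed=2_000)     # 2% errors
    last_five_min = Window(total=10_000, failed=300)    # 3% errors
    print(should_page(last_hour, last_five_min, slo))
```

Requiring both windows to breach the threshold is a design choice: the long window shows that a meaningful share of the budget has been spent, and the short window shows the burn is ongoing.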

Technologies Used

Prometheus, Grafana, Loki, Tempo, PagerDuty, OpenTelemetry, LitmusChaos, Python, Kubernetes, Datadog

The Results

Measurable Outcomes Delivered

✓ Service availability improved from 99.5% to 99.97% within 6 months

✓ £400k in annual SLA penalties eliminated: first full year with zero penalty clauses triggered

✓ Error budget framework adopted by all 12 service teams, so reliability is now measured, not guessed

✓ Chaos engineering identified 4 critical failure modes before they caused production incidents

Work With Us

Want Similar Results for Your Team?

Book a free consultation and let's discuss how ZippyOPS can deliver the same transformation for your organisation.
