📡 Bootcamp

SRE & Observability Bootcamp

Build Reliable Systems. Then Prove They're Reliable.

A practitioner-led bootcamp on Site Reliability Engineering — SLIs, SLOs, error budgets, full-stack observability with OpenTelemetry and Grafana, chaos engineering and blameless incident response.

Duration: 8 Weeks

Total Hours: 64 Hours

Level: Intermediate–Advanced

Format: Online + Offline

Certificate: Yes

Delivery Format

Train How You Learn Best

💻 Online — Live Instructor-Led

Live sessions via Zoom with a ZippyOPS practitioner. 4 sessions per week, all recordings provided. Ask questions in real time and get code reviewed live.

🏢 Offline — Chennai Lab Sessions

In-person at ZippyOPS Chennai labs. Mon–Fri batches. Lab machines provided. Direct hands-on access to instructors throughout every session.

Who Should Attend

Is This Bootcamp Right for You?

✅ This bootcamp is for you if…

DevOps and platform engineers moving into SRE
Software engineers embedded in SRE teams wanting deeper foundations
Engineers at companies adopting SRE methodologies for the first time
On-call engineers frustrated by alert noise and reactive incident response

📋 Prerequisites

Experience running production systems on Kubernetes or in the cloud
Working knowledge of Docker and containers
Basic understanding of metrics and monitoring concepts

Full Curriculum

What You'll Learn — Week by Week

SRE Foundations & the Reliability Mindset

Week 1

SRE vs DevOps vs traditional ops — what SRE actually means in practice
The error budget model — why it aligns engineering and product priorities
Toil — identifying, measuring and eliminating operational toil systematically
SRE book principles — how Google applies them and how real companies adapt
Building an SRE culture — reliability as a team responsibility, not a team name
Lab: Conduct a toil audit on a provided operations runbook and produce an automation roadmap

SLIs, SLOs and Error Budgets

Week 2

SLI design — choosing the right indicators for each service type
SLO setting — the art of choosing targets that reflect user experience
Error budget calculation, burn rate alerting and budget policies
Availability vs latency vs correctness SLOs — designing for each
Lab: Define SLIs and SLOs for a 5-service application and implement Grafana error budget dashboards
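
As a rough illustration of the error budget and burn rate arithmetic covered this week, here is a minimal Python sketch. The `SLO` class, the function names and the sample numbers are assumptions made for illustration and are not taken from the course material.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A request-based SLO (hypothetical helper type)."""
    target: float      # e.g. 0.999 means 99.9% of requests must be good
    window_days: int   # length of the rolling SLO window

    @property
    def error_budget(self) -> float:
        # Fraction of requests that may fail within the window.
        return 1.0 - self.target


def burn_rate(slo: SLO, bad: int, total: int) -> float:
    """Observed error ratio divided by the budget; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    return (bad / total) / slo.error_budget


def budget_consumed(slo: SLO, rate: float, hours: float) -> float:
    """Fraction of the window's budget spent if `rate` is sustained for `hours`."""
    return rate * hours / (slo.window_days * 24)
```

Under these assumptions, a 99.9% SLO over 30 days with 144 failures in 10,000 requests burns at roughly 14.4 times the sustainable rate, which would spend about 2% of the monthly budget in one hour.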

Full-Stack Observability with Prometheus & Grafana

Week 3

Prometheus architecture — scraping, storage, remote write and federation
PromQL — queries, aggregations, rate(), irate() and histogram_quantile()
Recording rules for performance and alerting rules for reliability
Grafana dashboard design — variable templating, annotations and drill-down
Alertmanager — routing, grouping, inhibition and silencing
Lab: Build a complete Prometheus + Grafana observability stack for a Kubernetes cluster with SLO dashboards

Logs, Tracing & OpenTelemetry

Week 4

Structured logging — why unstructured logs are unqueryable and how to fix it
Loki — log aggregation, LogQL queries and log-based alerting
Distributed tracing concepts — spans, traces, propagation and sampling
OpenTelemetry — instrumentation, collectors and exporters
Tempo — trace storage and trace-to-log-to-metric correlation in Grafana
Lab: Instrument a 10-service application — reduce MTTR from 3 hours to under 15 minutes
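
To make the structured logging topic concrete, the sketch below emits one JSON object per log line using only the Python standard library. The logger name `checkout`, the `fields` convention passed through `extra` and the sample field names are hypothetical and chosen only for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def demo() -> dict:
    """Log one event and return it parsed back from JSON."""
    stream = io.StringIO()
    log = build_logger(stream)
    log.info("payment failed", extra={"fields": {"order_id": "A-17", "latency_ms": 412}})
    return json.loads(stream.getvalue())
```

Because every field is a named key, a log backend can filter on values such as `order_id` or `latency_ms` directly instead of relying on regular expressions over free text.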

Incident Response & On-Call Engineering

Week 5

Incident lifecycle — detection, triage, coordination, resolution and retrospective
Severity classification — P1/P2/P3 definitions and escalation policies
On-call design — healthy rotations, escalation paths and runbooks
Blameless post-mortems — how to run them and what makes them valuable
Lab: Run a full simulated incident from detection to post-mortem using PagerDuty

Capacity Planning & Performance Engineering

Week 6

Load testing with k6 — scripting, scenarios, thresholds and distributed testing
Capacity modelling — predicting resource requirements from load test data
Kubernetes resource requests and limits — practical tuning methodology
Horizontal and vertical pod autoscaling — configuration and pitfalls
Lab: Load test a Kubernetes application to find its breaking point, then tune it to handle 10× load
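
For the autoscaling topic, the following sketch approximates the core replica calculation described in the Kubernetes Horizontal Pod Autoscaler documentation: scale the current replica count by the ratio of observed to target metric, round up, then clamp to the configured bounds. It is a simplification that assumes a single metric and ignores unready pods, missing metrics and stabilization windows. The function name and default values are illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA-style calculation for a single metric."""
    ratio = current_metric / target_metric
    # Skip scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp the result to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas averaging 90% CPU against a 60% target, this sketch asks for 6 replicas. A reading of 63% falls inside the tolerance band and leaves the count unchanged.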

Chaos Engineering

Week 7

Chaos engineering principles — hypothesis-based experimentation, not random destruction
LitmusChaos — pod failures, node drains, network latency and disk pressure
Gremlin — CPU, memory, network and application-level chaos experiments
GameDay design — running structured failure exercises with clear hypotheses
Lab: Design and run 3 GameDay scenarios — document all findings and remediation actions

Capstone Project

Week 8

Take a provided 8-service application from zero observability to full SRE implementation
Define SLIs, SLOs and error budget policies for all 8 services
Deploy Prometheus, Grafana, Loki and Tempo with full correlation
Implement burn rate alerting and PagerDuty on-call routing
Run a load test, tune HPA settings and run a chaos engineering GameDay
Live review with ZippyOPS SRE engineers

On Completion

Earn Your ZippyOPS Certificate

🎓

ZippyOPS Certified Site Reliability Engineer (ZCSRE)

Tests practical knowledge of SLO design, full-stack observability implementation and chaos engineering through a capstone project that brings a provided system to 99.9% documented reliability.

Enroll Today

Ready to Level Up?

Seats are limited per batch. Contact us to check availability and get full pricing for the next online or offline cohort.
