SRE & Observability Bootcamp

Build Reliable Systems. Then Prove They're Reliable.

A practitioner-led bootcamp on Site Reliability Engineering: SLIs, SLOs, error budgets, full-stack observability with OpenTelemetry and Grafana, chaos engineering and blameless incident response.

Duration: 8 Weeks
Total Hours: 64 Hours
Level: Intermediate–Advanced
Format: Online + Offline
Certificate: Yes
Delivery Format

Train How You Learn Best

💻 Online: Live Instructor-Led

Live sessions via Zoom with a ZippyOPS practitioner. 4 sessions per week, all recordings provided. Ask questions in real time and get code reviewed live.

🏒 Offline β€” Chennai Lab Sessions

In-person at ZippyOPS Chennai labs. Mon–Fri batches. Lab machines provided. Direct hands-on access to instructors throughout every session.

Who Should Attend

Is This Bootcamp Right for You?

✅ This bootcamp is for you if you are…

  • A DevOps or platform engineer moving into SRE
  • A software engineer embedded in an SRE team who wants deeper foundations
  • An engineer at a company adopting SRE methodologies for the first time
  • An on-call engineer frustrated by alert noise and reactive incident response

📋 Prerequisites

  • Experience running production systems in Kubernetes or cloud
  • Working knowledge of Docker and containers
  • Basic understanding of metrics and monitoring concepts
Full Curriculum

What You'll Learn β€” Week by Week

01
SRE Foundations & the Reliability Mindset
Week 1
  • SRE vs DevOps vs traditional ops: what SRE actually means in practice
  • The error budget model: why it aligns engineering and product priorities
  • Toil: identifying, measuring and eliminating operational toil systematically
  • SRE book principles: how Google applies them and how real companies adapt
  • Building an SRE culture: reliability as a team responsibility, not a team name
  • Lab: Conduct a toil audit on a provided operations runbook and produce an automation roadmap
02
SLIs, SLOs and Error Budgets
Week 2
  • SLI design: choosing the right indicators for each service type
  • SLO setting: choosing targets that reflect user experience
  • Error budget calculation, burn rate alerting and budget policies
  • Availability vs latency vs correctness SLOs: designing for each
  • Lab: Define SLIs and SLOs for a 5-service application and implement Grafana error budget dashboards
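
As a rough illustration of the error budget and burn rate arithmetic listed above, here is a minimal Python sketch. The function names, the 99.9% target and the request counts are illustrative assumptions rather than course material. Burn rate is taken here as the observed error ratio divided by the error ratio the SLO allows.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Requests that may fail in the window without breaching the SLO."""
    return (1.0 - slo_target) * total_requests


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """Budget consumption speed; 1.0 means the budget lasts exactly one window."""
    return observed_error_ratio / (1.0 - slo_target)


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    return 1.0 - failed_requests / error_budget(slo_target, total_requests)


if __name__ == "__main__":
    # Hypothetical service: 99.9% availability SLO, 1,000,000 requests in the window.
    print(error_budget(0.999, 1_000_000))
    print(burn_rate(0.0144, 0.999))
    print(budget_remaining(0.999, 1_000_000, 250))
```

With a 99.9% SLO over 1,000,000 requests, the budget works out to about 1,000 failed requests. A burn rate above 1 means the budget would be spent before the window ends, which is the condition burn rate alerts are built around.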
03
Full-Stack Observability with Prometheus & Grafana
Week 3
  • Prometheus architecture: scraping, storage, remote write and federation
  • PromQL: queries, aggregations, rate(), irate() and histogram_quantile()
  • Recording rules for performance and alerting rules for reliability
  • Grafana dashboard design: variable templating, annotations and drill-down
  • Alertmanager: routing, grouping, inhibition and silencing
  • Lab: Build a complete Prometheus + Grafana observability stack for a Kubernetes cluster with SLO dashboards
04
Logs, Tracing & OpenTelemetry
Week 4
  • Structured logging: why unstructured logs are hard to query and how to fix that
  • Loki: log aggregation, LogQL queries and log-based alerting
  • Distributed tracing concepts: spans, traces, propagation and sampling
  • OpenTelemetry: instrumentation, collectors and exporters
  • Tempo: trace storage and trace-to-log-to-metric correlation in Grafana
  • Lab: Instrument a 10-service application, with the target of reducing MTTR from 3 hours to under 15 minutes
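
To make the structured logging point concrete, the sketch below emits one JSON object per log line using only the Python standard library. The `JsonFormatter` class, the `checkout` logger name and the `service` and `trace_id` fields are hypothetical names chosen for illustration; a real setup would more likely take the trace ID from its tracing library than pass it by hand.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object so fields can be queried."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for key in ("service", "trace_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def demo() -> dict:
    """Log one event into a buffer and parse it back as JSON."""
    buffer = io.StringIO()
    log = build_logger(buffer)
    log.info("payment failed", extra={"service": "checkout", "trace_id": "abc123"})
    return json.loads(buffer.getvalue())
```

Because every line is a JSON object, a log backend can filter on `trace_id` or `service` as fields instead of relying on text matching, which is what makes log-to-trace correlation practical.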
05
Incident Response & On-Call Engineering
Week 5
  • Incident lifecycle: detection, triage, coordination, resolution and retrospective
  • Severity classification: P1/P2/P3 definitions and escalation policies
  • On-call design: healthy rotations, escalation paths and runbooks
  • Blameless post-mortems: how to run them and what makes them valuable
  • Lab: Run a full simulated incident from detection to post-mortem using PagerDuty
06
Capacity Planning & Performance Engineering
Week 6
  • Load testing with k6: scripting, scenarios, thresholds and distributed testing
  • Capacity modelling: predicting resource requirements from load test data
  • Kubernetes resource requests and limits: practical tuning methodology
  • Horizontal and vertical pod autoscaling: configuration and pitfalls
  • Lab: Load test a Kubernetes application to find its breaking point, then tune it to handle 10× load
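
The capacity modelling and autoscaling topics above come down to a small amount of arithmetic, sketched here in Python. `hpa_desired_replicas` follows the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler (ceil of current replicas times current metric over target metric), but it is simplified: it ignores the tolerance band, min/max replica bounds and stabilisation windows. `replicas_for_load` is a hypothetical back-of-envelope model, and the `headroom` parameter and all numbers are assumptions for illustration.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float) -> int:
    """Simplified HPA rule: scale replicas in proportion to metric / target."""
    return math.ceil(current_replicas * (current_metric / target_metric))


def replicas_for_load(peak_rps: float, rps_per_pod: float,
                      headroom: float = 0.3) -> int:
    """Pods needed for a peak load, keeping a fraction of each pod in reserve.

    rps_per_pod would come from a load test (the sustainable rate per pod);
    headroom is the share of that capacity deliberately left unused.
    """
    usable_rps_per_pod = rps_per_pod * (1.0 - headroom)
    return math.ceil(peak_rps / usable_rps_per_pod)


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target.
    print(hpa_desired_replicas(4, 90.0, 60.0))
    # 2,000 requests/s peak, 150 requests/s per pod measured, 30% headroom.
    print(replicas_for_load(2000, 150, 0.3))
```

Comparing the two results for the same workload is a quick sanity check: if the autoscaler would settle well below what the load test model predicts, the target metric or the pod resource requests probably need another look.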
07
Chaos Engineering
Week 7
  • Chaos engineering principles: hypothesis-based experimentation, not random destruction
  • LitmusChaos: pod failures, node drains, network latency and disk pressure
  • Gremlin: CPU, memory, network and application-level chaos experiments
  • GameDay design: running structured failure exercises with clear hypotheses
  • Lab: Design and run 3 GameDay scenarios, documenting all findings and remediation actions
08
Capstone Project
Week 8
  • Take a provided 8-service application from zero observability to full SRE implementation
  • Define SLIs, SLOs and error budget policies for all 8 services
  • Deploy Prometheus, Grafana, Loki and Tempo with full correlation
  • Implement burn rate alerting and PagerDuty on-call routing
  • Run a load test, tune HPA settings and run a chaos engineering GameDay
  • Live review with ZippyOPS SRE engineers
On Completion

Earn Your ZippyOPS Certificate

🎓
ZippyOPS Certified Site Reliability Engineer (ZCSRE)

The certification tests practical knowledge of SLO design, full-stack observability implementation and chaos engineering through a capstone that brings a provided system to a documented 99.9% reliability.

Enroll Today

Ready to Level Up?

Seats are limited per batch. Contact us to check availability and get full pricing for the next online or offline cohort.
