The purpose of this workshop is to walk through some sample Operations scenarios and identify the role that Operations teams play in improving the availability of workloads.

The lab focuses on Operations activities that prevent availability-impacting incidents from occurring. It also looks at ways to reduce the time it takes to detect incidents (MTTD), with the goal of detecting events early, before they lead to availability-impacting incidents. It then looks at ways to reduce the time it takes to resolve an incident (MTTR), especially through proactive engagement and automated responses.
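As a rough illustration of the two metrics, the sketch below computes MTTD and MTTR from a list of incident records. The `Incident` record, its field names, and the sample timestamps are hypothetical and exist only for this example. It also assumes MTTD is measured from the start of an incident to its detection and MTTR from detection to resolution; some teams measure MTTR from the start of the incident instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime   # when the underlying event began
    detected: datetime  # when Operations became aware of it
    resolved: datetime  # when service was restored


def mean_delta(deltas):
    """Return the arithmetic mean of an iterable of timedeltas."""
    deltas = list(deltas)
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detect: average of (detected - started)."""
    return mean_delta(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time to resolve: average of (resolved - detected).

    Assumption: the clock starts at detection, not at incident start.
    """
    return mean_delta(i.resolved - i.detected for i in incidents)


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0),
             datetime(2024, 1, 5, 10, 12),
             datetime(2024, 1, 5, 11, 0)),
    Incident(datetime(2024, 2, 9, 14, 0),
             datetime(2024, 2, 9, 14, 4),
             datetime(2024, 2, 9, 14, 30)),
]

print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
```

Tracking both numbers over time shows whether earlier detection and automated responses are actually shortening incidents.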

While the scenarios here might not directly represent your operating environment, they are a reference point for Operations approaches you can take with your workloads to reduce MTTD and MTTR and to improve availability through Operational Excellence.

There are four major concepts to consider in Operations activities:

  • Prevent:
      • Readiness, awareness, telemetry, guardrails
  • Detect:
      • Controls, monitoring, anticipating failure, raising events
  • Respond:
      • Trends, consistency, validation, automation
  • Learn:
      • RCA, ops metrics, improvement, shared learnings