Site Reliability Engineering (SRE)
SRE is a set of practices that grew out of Google's experience of treating operations as a software problem. Its commitment to the full service lifecycle helps organizations build, deploy, monitor, and maintain software systems. SRE combines technical and cultural practices with the shared objective of meeting the expected reliability targets.
The five basic principles of the DevOps philosophy, and the way SRE implements them, are:
1. Break down organizational silos
Large companies have a complex organizational structure with a
multitude of teams, often working separately in "silos". Each team
sees only part of the whole, which breeds inefficiency. The task of
DevOps and SRE is to align teams with each other around shared goals
and a common vision.
2. Accept failures in the product lifecycle
Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are
used to assess failures. An SLI measures an aspect of service behavior
over time, such as availability or response time. An SLO is the agreed
target for a specific indicator that must be met. Each failure leads to
reassessment and optimization of the objectives. SRE quantifies the
acceptable level of failure as an "error budget": the gap between the
SLO and 100%. Teams can spend this budget to test limits and make more
radical changes, and so potentially innovate faster.
3. Implement changes in small, quick steps
Like DevOps, SRE encourages continuous improvement through small and
frequent development steps.
4. Use standard tools and automation
Incompatibility and integration issues between technologies create
silos, even in a DevOps environment. SRE introduces common technologies
and shared access to information across different IT teams. SRE's policy
is to automate manual tasks that are repetitive, reactive, and produce
no lasting improvement, which SRE calls "toil". Automation should free
up capacity for work that brings long-term benefits.
5. Base reliability on measurement data
Stakeholders need to agree on a common way to measure reliability and
on what to do when a measurement falls outside its target. Key DevOps
metrics are deployment frequency, the time from commit to release, the
number of failed deployments, and the time required to recover from a
failure.
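As a rough illustration of principles 2 and 5, the Python sketch below derives an availability SLI from request counts, compares it with an SLO, and reports how much of the error budget is left. The names `ErrorBudgetReport`, `error_budget_report`, and `release_decision`, the request counts, and the "freeze when the budget is spent" rule are all assumptions made for this example. They are not taken from any particular tool or from Google's own policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudgetReport:
    sli: float               # measured fraction of successful requests
    allowed_failures: float  # failures the SLO tolerates in this window
    budget_remaining: float  # unspent share of the budget (negative if overspent)
    within_slo: bool         # True when the SLI meets or exceeds the SLO


def error_budget_report(total_requests: int, failed_requests: int,
                        slo: float) -> ErrorBudgetReport:
    """Summarize an availability SLI against an SLO for one time window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")

    # SLI: proportion of requests that succeeded.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: the share of requests allowed to fail, i.e. 1 - SLO.
    allowed_failures = (1.0 - slo) * total_requests
    budget_remaining = 1.0 - failed_requests / allowed_failures
    return ErrorBudgetReport(sli, allowed_failures, budget_remaining, sli >= slo)


def release_decision(report: ErrorBudgetReport) -> str:
    """Hypothetical policy: stop risky releases once the budget is spent."""
    return "ship" if report.budget_remaining > 0 else "freeze"


if __name__ == "__main__":
    # Assumed numbers: one million requests, 250 failures, a 99.9% SLO.
    report = error_budget_report(1_000_000, 250, slo=0.999)
    print(f"SLI={report.sli:.5f} within_slo={report.within_slo}")
    print(f"budget remaining={report.budget_remaining:.2f}")
    print(f"decision={release_decision(report)}")
```

The decision rule is deliberately simple. In practice, the stakeholders would agree on the measurement window, the SLO value, and the response to an exhausted budget before any of this is automated.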
Learn more
- Official site at Google