Site Reliability Engineering (SRE)

SRE is a set of practices that grew out of Google's experience of treating operations as a software problem. Its commitment to the full service lifecycle enables organizations to successfully build, deploy, monitor, and maintain software systems. SRE combines technical and cultural aspects with the shared objective of meeting the expected reliability targets.

The five basic principles of the DevOps philosophy, and the way SRE implements them, are:

1. Break down organizational silos

Large companies have a complex organizational structure with a multitude of teams often working separately in "silos". Each team has a different view of the whole, which encourages inefficiency. The task of DevOps and SREs is to better align teams with each other towards overall goals and towards a common vision.

2. Accept failures in the product lifecycle

Service Level Indicators (SLI) and Service Level Objectives (SLO) are used to assess failures. SLIs measure the service level actually delivered over time, for example the proportion of failed requests. An SLO is an agreed target for a specific metric, such as availability or response time, that must be met. Each failure leads to reassessment and optimization of the objectives. SRE quantifies the acceptable level of risk as an "error budget", which teams can spend on testing the limits and on more radical changes in order to potentially innovate faster.

3. Implement changes in small, quick steps

Like DevOps, SRE encourages continuous improvement through small and frequent development steps.

4. Use standard tools and automation

Incompatibility and integration issues between technologies create silos, even in a DevOps environment. SRE introduces common technologies and cross-access to information across different IT teams. SRE's policy is to automate manual tasks that are repetitive, reactive, and produce no lasting improvement. Automation should free up capacity for work that brings long-term benefits.

5. Base reliability on measurement data

The various stakeholders need to agree on a common way to measure reliability and on what to do when the value is out of specification. Key DevOps metrics are the number of deployments over time, the time from commit to release, the number of failed deployments, and the time required to recover.
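
The error budget mentioned under principle 2 is one way to decide "what to do when the value is out of specification". The following is a minimal sketch, not a prescribed implementation: it assumes a request-based SLI (share of successful requests in a measurement window), and the function names `error_budget`, `budget_remaining`, and `can_release` are hypothetical names chosen for illustration.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests that may fail in the window without breaching the SLO.

    Assumption: the SLO is expressed as a success ratio, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # No traffic in the window: nothing could be spent.
        return 1.0
    return 1.0 - failed_requests / budget


def can_release(slo_target: float, total_requests: int,
                failed_requests: int) -> bool:
    """Hypothetical policy: risky changes are allowed only while budget remains."""
    return budget_remaining(slo_target, total_requests, failed_requests) > 0


if __name__ == "__main__":
    # Illustrative numbers only: a 99.9% SLO over one million requests.
    for failed in (400, 1200):
        left = budget_remaining(0.999, 1_000_000, failed)
        print(f"failed={failed} remaining={left:.0%} "
              f"release_allowed={can_release(0.999, 1_000_000, failed)}")
```

With a 99.9% target and one million requests, roughly 1,000 failures are tolerated. At 400 failures about 60% of the budget is left and changes can continue; at 1,200 failures the budget is overspent and, under this assumed policy, releases would pause until reliability recovers.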

Learn more: Official site at Google

Databricks logs collection with Azure Monitor at a Workspace Scale

Categories: Cloud Computing, Data Engineering, Adaltas Summit 2021 | Tags: Metrics, Monitoring, Spark, Azure, Databricks, Log4j

Databricks is an optimized data analytics platform based on Apache Spark. Monitoring the Databricks platform is crucial to ensure data quality, job performance, and security issues by limiting access to…

By Claire PLAYE

May 10, 2022
