Site Reliability Engineer – SRE

By May 7, 2021 Powerlearnings

Compiled by Kiran Kumar, Business analyst at Powerup Cloud Technologies

Contributor Agnel Bankien, Head – Marketing at Powerup Cloud Technologies

Summary

SRE is a systematic and automated approach to enhancing IT service delivery, using standardized tools and practices for the proper implementation of DevOps concepts. Let us look at the SRE team’s roles and responsibilities, how their effort and productivity can be assessed, and why they need organizational support to ensure uninterrupted functioning. The blog also covers the various models of SRE implementation, current SRE tools, and the benefits SRE delivers.

Index

1. What is Site Reliability Engineering (SRE)?

2. SRE roles and responsibilities 

2.1 Build software engineering

2.2 Fixing support escalation

2.3 On-Call Process Optimization 

2.4 Documenting Knowledge

2.5 Optimizing SDLC

3. SRE challenges

4. Why organizational SRE support is important?

5. Measuring SRE efforts and effectiveness

6. SRE implementation and its models 

6.1 A composite solo team

6.2 Application or Product-specific

6.3 Infrastructure specific

6.4 Tool centered

6.5 Embedded

6.6 Consulting

7. DevOps and SRE

8. SRE tools and technologies

9. Conclusion

1. What is Site Reliability Engineering (SRE)?

Site reliability engineering (SRE) is a systematic software engineering approach that uses software as a tool to manage IT infrastructure and operations. In the words of Benjamin Treynor Sloss, VP engineering at Google Cloud, “SRE is what happens when you ask a software engineer to design an operations function.” 

The notion was conceptualized at Google in 2003 when the technology giant established SRE to make its websites highly available, reliable and scalable. 

Subsequently, site reliability engineering was embraced by the IT industry to mitigate risks and issues, manage systems and automate operations like capacity and performance planning, disaster management and quality monitoring.

Standardization and automation are the fundamental components of an SRE model. The idea is to shift operations from manual control to a DevOps approach, in which large, complex systems are implemented and managed on the cloud through software code and automation, improving the efficiency and sustainability of the business.

2. SRE Roles and Responsibilities 

According to LinkedIn’s ‘Emerging Jobs 2020’ report, Site Reliability Engineer is among the top 10 in-demand jobs for 2020.

Site reliability engineering is a progressive form of QA that strengthens the collaboration between software developers and IT operations in the DevOps environment.

Individual SREs within a team typically come from roles such as:

  • System administrators, engineers, and operations experts
  • Software developers and engineers
  • Quality assurance and test engineers
  • Build automation pipeline experts
  • Database administrators

SRE responsibilities include:

2.1 Build Software Engineering

To develop and improve IT operations and support, site reliability engineers build purpose-made tools that mitigate risk, manage incidents, and support services such as production deployment, code changes, alerting and monitoring.

2.2 Fixing Support Escalation

SREs must be capable of identifying critical support escalations and routing them to the relevant teams. They must also work with those teams to remediate issues. As SRE functions mature, systems become more robust and reliable, leading to fewer support escalations.

2.3 On-Call Process Optimization 

In some cases, SREs take on on-call responsibilities and optimize the on-call process. They implement runbook tools and other automation techniques to prepare incident response teams, improve their real-time collaboration, and review documentation.

2.4 Documenting Knowledge

SREs often work alongside IT operations, development, support and other technical teams, accumulating a large body of knowledge in the process. Documenting this information comprehensively ensures a smooth flow of operations among teams.

2.5 Optimizing SDLC 

Site reliability engineers ensure that IT resources and software developers detect and assess incidents and document their findings to facilitate informed decision-making. Based on post-incident reviews, SREs can then optimize the SDLC to improve service reliability.

3. SRE Challenges

Organizations are often reluctant to change, and any change tends to come with additional costs, lost productivity, and uncertainty. A few challenges are:

  • Maintaining a high level of network and application availability,
  • Establishing standards and performance metrics for monitoring purposes,
  • Analyzing cloud and on-premise infrastructure scalability and limitations,
  • Understanding application requirements, test and debugging needs as well as general security issues to ensure smooth function of systems,
  • Keeping alerting systems in place to detect, prioritize and resolve incidents if any,
  • Generating best-practice documents and conducting training to facilitate smooth collaboration between identified SRE resources and other cross-functional teams.

4. Why Organizational SRE Support is Important?

In today’s digital ecosystems, enterprises must constantly work towards highly integrated setups in controlled environments, which poses a colossal challenge. Building dependable services while ensuring uninterrupted functioning of the business is the key to improved performance and user experience.

Organizations need to actively obtain buy-in from senior leaders to set up SRE teams and processes. SRE’s success is directly proportional to the support it receives from teams and business units. SRE creates better visibility across all SDLC phases, and while reliability engineers concentrate on incident management and scalability concerns, DevOps teams can focus on continuous integration and deployment.

With an SRE setup, organizations can adhere better to SLAs and SLOs, run tests, monitor software and hardware, be prepared for incidents, reduce issue escalation, speed up software development and save the potential costs of downtime. It promotes automation, post-incident reviews and on-call support engineering to gauge the level of reliability in new deployments and infrastructure.
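
To make the link between an SLO and downtime concrete, the arithmetic can be sketched as below. This is only an illustration: the function name, the 30-day window and the sample targets are assumptions, not figures from any particular SLA.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime a service can accrue in the window and still meet the SLO.

    slo_percent is an availability target such as 99.9 (hypothetical example value).
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    # Each extra "nine" cuts the permitted downtime by a factor of ten.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability -> {minutes:.2f} minutes per 30 days")
```

A 99.9% target over 30 days leaves roughly 43 minutes of downtime, which is why tighter targets usually call for more automation and faster incident response.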

5. Measuring SRE Efforts and Effectiveness

Irrespective of whether an organization has fully adopted DevOps or is still transitioning, it is vital to continuously improve the people, process, and technology within IT organizations.

This creates the need for a formalized SRE process to improve the health of systems. Setting up metrics and benchmarks for performance, uptime and user experience paves the way to an effective monitoring strategy.

  • Define a Benchmark for Good Latency Rates

This determines the time taken to service a request. Monitoring the latency of successful requests separately from that of failed ones keeps a check on system health. The process highlights underperforming services, allowing teams to detect incidents faster.

  • Control Traffic Demands

SRE teams must monitor the amount of stress a system can take at a given point in time in terms of user interactions, number of applications running, or service transaction events. This gives organizations a clearer picture of the customer’s product experience and the system’s ability to hold up to changes in demand.

  • Track the Rate of Errors

It is important to track the rate of requests that are failing, both across individual services and across the system as a whole. It is also crucial to understand whether they are manual errors or obvious errors like failed HTTP requests. SRE teams must also categorize errors as critical, medium, or minor, to help organizations keep a check on the true health of a service and take appropriate action to fix errors.

  • Monitor Overall Service Capacity

Saturation provides a high-level overview of how much more capacity the service has. As most systems start to deteriorate before they hit 100% utilization, SRE teams must set a benchmark for acceptable levels of saturation, thus ensuring service performance and availability for customers.
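
The four measurements above (latency, traffic, errors and saturation) can be derived from a simple request log. The following is a minimal sketch, not a monitoring implementation: the `Request` record, the choice of the 95th percentile, and the rule that any HTTP status of 500 or above counts as a failure are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def p95(values):
    """95th percentile of the samples, or None when there are no samples."""
    if not values:
        return None
    if len(values) == 1:
        return values[0]
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile.
    return quantiles(values, n=100, method="inclusive")[94]


def summarize(requests, window_seconds, used_capacity, total_capacity):
    """Summarize one observation window into the four measurements."""
    # Assumption: 5xx responses are failures; latency is tracked separately for each group.
    ok = [r.latency_ms for r in requests if r.status < 500]
    failed = [r.latency_ms for r in requests if r.status >= 500]
    total = len(requests)
    return {
        "latency_p95_ok_ms": p95(ok),
        "latency_p95_failed_ms": p95(failed),
        "traffic_rps": total / window_seconds,
        "error_rate": len(failed) / total if total else 0.0,
        "saturation": used_capacity / total_capacity,
    }


if __name__ == "__main__":
    sample = [Request(100, 200)] * 8 + [Request(900, 500)] * 2
    print(summarize(sample, window_seconds=10, used_capacity=6, total_capacity=8))
```

Each value can then be compared against a benchmark agreed by the team, for example alerting when saturation crosses a threshold well below 100%.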

6. SRE Implementation and its Models

To implement SRE as per business specifications, organizational teams need to weigh the options and choose the approach that works best for them. To begin with, an SRE proponent can be consulted at project kick-off for clarification and advice, and can also test which SRE models best suit the organization.

The SRE team can then work sequentially with the product teams, sit alongside them, or function as a discrete centralized unit.

The primary focus is continuous improvement with a data-driven approach to operations. SRE supports automation practices and failure restoration while also ensuring error reduction and prevention.

Simulation-based inference aids in effective root cause analysis of incidents, performance, and capacity. SRE helps determine the permissible amount of risk through error budgeting and offers change management techniques for when the system becomes unstable, thus ensuring a controlled risk environment.
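
Error budgeting can be illustrated with a small calculation. The sketch below assumes a request-based SLO and a simple policy of freezing risky changes once the budget is spent; the function names, the threshold and the sample numbers are hypothetical and would differ between organizations.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative when the budget is overspent.

    slo_target is a success ratio such as 0.999 (hypothetical example value).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def release_decision(remaining: float, freeze_below: float = 0.0) -> str:
    """Assumed policy: keep shipping while budget remains, otherwise freeze risky changes."""
    return "proceed" if remaining > freeze_below else "freeze risky changes"


if __name__ == "__main__":
    remaining = error_budget_remaining(0.999, total_requests=1_000_000, failed_requests=400)
    print(f"budget remaining: {remaining:.0%} -> {release_decision(remaining)}")
```

With a 99.9% target and one million requests, about 1,000 failures are permitted; 400 failures leave roughly 60% of the budget, while 1,200 failures would overspend it and trigger the freeze.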

A few SRE Models to Look at:

6.1 A Composite Solo Team

  • It is a single SRE team and is the most widely accepted model.
  • Best fit for smaller setups with a narrower product range and customer base.
  • Handles all organizational processes and is capable of identifying patterns and similarities between different projects.
  • Enables implementation of integrative solutions across different software projects.

6.2 Application or Product-specific

  • Such teams focus on upgrading reliability of one specific product or application of the organization that is business-critical.
  • Bridges the gap between business objectives and the outcome of the team’s effort.
  • Best suited for large organizations that cannot implement SRE for their entire range of products or services and therefore focus specifically on crucial deliveries. 

6.3 Infrastructure Specific

  • Defines Infrastructure as Code and enables automation of repetitive tasks to simplify and speed up software delivery.
  • Infrastructure SRE teams improve the quality, continuity and performance of the business.
  • Such models are suitable for large organizations with multiple independent development teams that need common, uniform processes applicable to all.
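
The idea behind Infrastructure as Code can be sketched as comparing a declared, desired state with what actually exists and deriving the changes needed. The example below is a toy illustration only: the `plan` function and the resource names are hypothetical, and real IaC tooling handles far more detail.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compare declared infrastructure with the current state and list the changes needed."""
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "update": sorted(
            name for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
        "delete": sorted(actual.keys() - desired.keys()),
    }


if __name__ == "__main__":
    # Hypothetical resources: the desired state lives in version control as code.
    desired = {"web": {"instances": 3}, "db": {"instances": 1}}
    actual = {"web": {"instances": 2}, "cache": {"instances": 1}}
    print(plan(desired, actual))
    # Once the actual state matches the declaration, the plan is empty.
    print(plan(desired, desired))
```

Because the same declaration always yields the same plan, every development team can apply one uniform process instead of repeating manual steps.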

6.4 Tool Centered

  • Tool-centered SRE teams exclusively create tools and features to enhance productivity as well as mitigate process hold-ups and restrictions.
  • They need SRE charters, defined scope, and continuous monitoring to keep themselves segregated from the Infrastructure team.
  • Any organization that requires software tools other than those provided by DevOps or SaaS platforms can form this SRE team.

6.5 Embedded

  • SRE experts are embedded within development teams to address specific issues of product development.
  • Helps change environment configurations to enhance performance across the SDLC and enables implementation of development best practices.
  • It is beneficial to have the embedded team when organizations have just begun the SRE journey and need to accelerate transformation. 

6.6 Consulting

  • The consulting model helps build specific tools that complement the existing processes.
  • Helps scale the impact of SRE outside of the initial project scope and can be performed by third party specialists.
  • Can be adopted before the actual SRE implementation begins, to help the organization understand SRE best practices.

7. DevOps and SRE

DevOps offers high-speed, high-quality service delivery and contemporary designs that increase business value and operational responsiveness, and instills a modernized approach to development, automation and operations culture. SRE, on the other hand, can be considered an implementation of DevOps. Like DevOps, SRE is also about team culture and relations, and constantly works towards building a tightly integrated development and operations team.

Both DevOps and SRE enable speedy application development along with enhanced quality, reliability and performance of services.

SRE depends on site reliability engineers who possess both development and operations skills and are embedded within the development teams to ensure smooth communication and workflow. The SRE team assists developers who need guidance with operational functions and helps balance tasks between DevOps and SRE: the DevOps team focuses on code, new features and the development pipeline, whereas SRE ensures reliability.

8. SRE Tools and Technologies

Google Site Reliability Engineer Liz Fong-Jones states, “One SRE team is going to have a really difficult time supporting 50 different software engineering teams if they’re each doing their own separate thing, and they’re each using separate tooling.”

Hence, standardization of SRE tools and practices is a must for the proper implementation of DevOps principles.

SRE toolchains align with the DevOps phases: the DevOps toolchain helps teams choose the right tools to plan, build, integrate, test, monitor, evaluate, deploy, release and operate the software they develop.

[Figure: Tools used by Site Reliability Engineers]

9. Conclusion

SRE establishes a healthy and productive relationship between development and operations, thus letting organizations envision a holistic picture towards better resilience, authenticity, and speed of their IT services.  

As Jonathan Schwietert, SRE Team Lead at VictorOps, rightly states, “SRE shines a light on the unknowns in a running system, it eliminates fear associated with the unknown. SRE brings awareness to the maintainers of a system and builds confidence toward handling unknown issues.”

SRE is one of the most efficient team models for implementing DevOps standards in present-day software delivery systems, helping organizations pursue more productive and progressive business objectives going forward.
