Site Reliability Engineer

10932

31 May, 2024 to 31 May, 2025

Göteborg (Onsite)

Our client is seeking a Senior Site Reliability Engineer who excels on the operational side of DevOps. Attention to detail, proactivity, and problem-solving skills are key, as is the ability to communicate and collaborate effectively.

Location: Gothenburg, minimum 3 days on site

Language: Fluent English

Job description

Position: Senior SRE Engineer within Platform Operations and Support

• Be a service-minded team player with a quality-driven approach.

• Manage and dispatch incidents and service requests.

• Provide high-quality support, drive troubleshooting and root cause analyses (RCAs), and act as an advisor to development teams.

• Be responsible for maintaining platform availability, shortening time to market for new features, and improving performance.

• Play a crucial role in troubleshooting and quality assurance from an end-to-end perspective.

• Focus on understanding, monitoring, and improving the production system, actively preventing future incidents.

• Be a guiding light for continuous improvement and innovation.

Overview of responsibilities

System support & troubleshooting

• Guide and coordinate junior colleagues within the team.

• Assist in the initial technical analysis of production incidents.

• Support the development team in building capabilities for alerts and monitoring.

• Conduct code reviews for reported cases, fix development, and delivery.

Infrastructure Automation and Configuration Management

• Develop and maintain automation tools, scripts, and configuration management systems.

• Implement Infrastructure as Code (IaC) practices using tools like Ansible, Terraform, or Kubernetes.

• Collaborate with development and operations teams to automate build, test, and deployment processes.

Reliability Engineering and Resilience

• Design and implement systems and processes to enhance infrastructure reliability and resilience.

• Continuously improve system reliability by analyzing logs and trends, identifying areas for improvement, and implementing preventative measures.

System Monitoring and Incident Response

• Develop and manage monitoring tools and systems to track software and infrastructure health, performance, security, and availability.

• Set up alerts, dashboards, and metrics for proactive detection and response to incidents.

• Investigate and diagnose root causes of incidents and work towards resolution in a timely manner.

Continuous Improvement and Collaboration

• Drive a culture of continuous improvement by identifying areas for automation and efficiency.

• Document procedures, incidents, and best practices for knowledge sharing and team efficiency.

• Stay updated on industry trends and emerging technologies to propose innovative solutions.

• Collaborate closely with cross-functional teams to ensure smooth operation of systems.

Required skills & experience

• Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience) with 5+ years of DevOps/SRE work.

• Proficient in scripting/programming languages such as Python and Bash.

• Experience with cloud platforms (AWS preferred).

• Experience in DevOps practices, CI/CD, and monitoring tools.

• Experience with automation tools and configuration management frameworks such as Terraform, AWS CDK, Puppet, or Ansible.

• Strong troubleshooting and problem-solving skills with keen attention to detail.

• Excellent communication and collaboration skills to work effectively in a cross-functional team environment.

• Strong experience in system administration, infrastructure management, or site reliability engineering.

If you have a potential candidate, please email me the CV, the rate, and a description of how the consultant meets the requirements, marking the listed skills as follows: green = good knowledge, yellow = some knowledge, red = no knowledge. Thanks!

Cornelia.renhult@nexergroup.com