Full Job Description
As a Site Reliability Engineer (SRE), you will bridge the gap between development and operations by applying a software engineering mindset to system administration. You will be responsible for maintaining the reliability and performance of our critical systems and network services. This role involves monitoring performance, refining standard operating procedures, and implementing automation to streamline processes. Your expertise will be essential in incident response and developing tools and practices that prevent incidents from occurring in the first place. Additionally, you will collaborate closely with software development and infrastructure teams to enhance scalability and reliability across all systems.
The Work You'll Do:
- Collaborate with development teams to improve service performance and resilience
- Design and optimize continuous deployment pipelines and cloud infrastructure
- Develop tools for monitoring, logging, and operations management to ensure reliability
- Implement automation processes to simplify operations and reduce manual intervention
- Participate in on-call rotations and handle operational incidents
What You'll Bring:
- Ability to perform under pressure in fast-paced environments
- Deep understanding of system administration and software engineering principles
- Excellent problem-solving skills and attention to detail
- Experience with monitoring tools like Prometheus, Grafana, or Splunk
- Proficiency in a high-level programming language such as Python or Go
Qualifications:
- 5-7 years of experience in site reliability engineering or operations
- Familiarity with cloud platforms like AWS, Google Cloud, or Azure
- Master's degree in Computer Science or related field
- Strong understanding of containerization and orchestration technologies such as Docker and Kubernetes
- Track record of improving system reliability through automation and monitoring