
Senior Site Reliability Engineer
- Lisboa
- Permanent
- Full-time
- Participate in an on-call rotation with the Cloud Engineering team to respond to production incidents and outages.
- Operate and evolve infrastructure using Infrastructure as Code (Terraform), configuration management tools, and containerized platforms on AWS.
- Build and maintain observability tooling to detect symptoms before they lead to outages.
- Automate repetitive tasks and processes to reduce operational toil.
- Collaborate with Engineering and Product teams to design resilient systems that meet performance and reliability goals.
- Troubleshoot production issues across application, network, and infrastructure layers.
- Document systems, processes, and runbooks to improve team transparency and onboarding.
- 5+ years of hands-on experience with AWS in both development and operations contexts.
- Strong Linux system administration skills, including performance tuning and debugging (experience with eBPF tracing is a plus).
- Software development background and strong coding skills in one or more of the following: Go, Python, Ruby.
- Experience with Infrastructure as Code, particularly Terraform.
- Familiarity with CI/CD pipelines, configuration management tools (e.g., Ansible, Puppet, Chef), and artifact management tools (e.g., Artifactory, Nexus).
- A mindset for resilient systems design, thinking about edge cases, failure modes, and graceful degradation.
- Excellent communication skills in English, both written and spoken.
- Comfortable in a fast-paced environment and adaptable to shifting priorities.
- Experience with EKS or ECS.
- Familiarity with chaos engineering practices.
- Knowledge of OpenTelemetry or other distributed tracing systems.
- Knowledge of Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
- Experience setting up error budgets and conducting post-incident reviews.