
Observability Engineer - Site Reliability Engineer
- Leiria
- Permanent
- Full-time
- Design, implement, and maintain observability solutions covering metrics, logs, traces, and RUM.
- Work with tools such as Grafana Cloud, Tempo, Loki, Mimir, Alloy, and OpenTelemetry.
- Build reliable alerting and monitoring pipelines based on SLOs/SLAs, focusing on low-maintenance automation.
- Ensure the health and integrity of observability data flows from instrumentation to dashboards.
- Collaborate with development and operations teams to embed observability by design into the software lifecycle.
- Define and promote best practices and standards for observability across the organization.
- Support the modernization of observability by replacing and evolving legacy monitoring and alerting solutions.
- Monitor observability-related costs and contribute to FinOps efforts by identifying optimization opportunities.
- 3+ years of experience as an SRE, Observability Engineer, or equivalent role.
- Practical experience with OpenTelemetry or similar instrumentation tools.
- Knowledge of Kubernetes, Helm, Terraform, and ArgoCD.
- Experience designing and managing telemetry pipelines (metrics/logs/traces), exporters, and sidecars.
- Expertise in performance monitoring, alerting, dashboarding, and root cause analysis.
- Knowledge of Java development and application instrumentation.
- Product-oriented mindset with a bias for automation and a “you build it, you run it” culture.
- Fluency in English.
- Knowledge of APM and distributed tracing solutions.
- Experience with FinOps practices applied to observability.
- Hands-on involvement in replacing legacy monitoring stacks.
- Experience with cloud environments (Azure preferred).
- Contributions to open-source observability tools.
- High-impact role in a leading e-commerce company undergoing digital transformation.
- Collaborative team and technical ownership.
- Flexible working hours and remote work options.
- Opportunity to work with modern observability technologies.
- Influence in architectural decisions and long-term platform strategy.