
System Reliability Engineer
- Lisbon
- Permanent
- Full-time
- Continue accelerating and enhancing our Platform as a Service for Vodafone customers within Vodafone's footprint
- Introduce service propositions in markets beyond Vodafone's current footprint
- Address the long-tail, lower-volume segment globally through a digital self-service platform
- Develop and govern resilience strategies that span system architecture, deployment, monitoring, and incident response
- Define and track stability KPIs (e.g., MTTD, MTTR, error budgets), partnering with performance and operations teams to meet or exceed targets
- Design and implement fault injection testing, chaos engineering practices, and scenario-based simulations to validate platform robustness
- Collaborate with product, infrastructure, architecture and development teams to re-design services with built-in redundancy, failover, and graceful degradation
- Drive automation and observability improvements to reduce noise, increase fault detection speed, and support predictive failure mitigation
- Contribute to the design and maintenance of our Business Continuity and Disaster Recovery (BC/DR) plan, ensuring IoT systems remain resilient and recoverable in the face of unexpected disruptions
- Own the resilience roadmap and continuously assess emerging threats, technologies, and architectural shifts to guide evolution of stability practices
- Evangelize a culture of resilience through internal communication, workshops, and post-incident learning programs
- Engineering excellence - Deliver new capabilities and services efficiently while continuously enhancing the resilience, scalability, and cost-effectiveness of our IoT platform
- Platform availability and fault tolerance
- Reduction in recurrence of critical incidents
- Adoption of engineering best practices aligned with future-proof architecture
- Delivery focus - Consistently meet or exceed delivery expectations, ensuring the right customer experience, delivering tangible business outcomes, and achieving financial targets
- Improved service-level attainment (SLA/SLO adherence)
- Reduced mean time to detect (MTTD) and mean time to recover (MTTR)
- Operational efficiency gains through automation and proactive issue resolution
- Stakeholder management - Foster trusted, transparent, and outcome-driven relationships with business and technical stakeholders
- Cross-team alignment on resilience goals, metrics, and ownership
- Effective communication during incidents and planned changes
- Stakeholder satisfaction with the stability, predictability, and responsiveness of platform services
- Improve the Connectivity service delivered to Vodafone IoT customers
- Ensure the owned IoT platforms are properly dimensioned, so that capacity is used efficiently
- Guarantee owned IoT Connectivity platforms can cope with new products being delivered to IoT customers
- Manage stakeholders and vendors as required for the technical delivery and report project progress & activities
- Degree in Software Engineering, Computer Science, or a related discipline
- Good understanding of the DevSecOps methodology and mindset
- Good understanding of information security
- Scripting experience with languages such as Bash, Python, Perl, Groovy, or PowerShell
- Proven experience with high-availability system design, chaos engineering principles, and proactive failure mitigation strategies
- Experience with ISO 22301
- Good understanding of system monitoring tools and automated testing frameworks
- Industry experience with software platforms on Linux, across on-premises and cloud server technologies
- Deep understanding of SRE principles including SLOs/SLIs, error budgets, observability, toil reduction, and automation
- Demonstrated ability to balance operational stability with delivery velocity
- Understanding of security principles, practices and standards and how they translate into real-world technical solutions
- Hands-on experience with infrastructure provisioning and configuration management tools such as Terraform or Ansible. Demonstrated ability to eliminate manual processes through scripting (e.g., Python, Bash, Go)
- Strong command of telemetry, logging, and alerting stacks (e.g., Prometheus, Grafana, ELK, Datadog, Splunk)
- Experience defining meaningful SLIs and building dashboards that drive actionable insight
- Skilled in leading and participating in incident response with a calm, structured approach
- Experience driving blameless postmortems, root cause analysis, and continuous improvement across teams
- Good knowledge of DevSecOps principles
- Expertise in identifying and resolving system bottlenecks, latency issues, and throughput constraints
- Proficient in forecasting demand and managing system growth in a cost-efficient manner
- Proven ability to work closely with software engineers, infrastructure teams, product owners, and business stakeholders to embed reliability into the development lifecycle
- Consultative, customer-focused design mindset
- Strong presentation and communication skills for technical, business, and (senior) management audiences
- Strong work-planning and time-management skills
- Willingness to learn and a strong sense of ownership and autonomy
- Hybrid Work Model - Flexible hybrid work model with 8-10 in-office days per month, managed by team leaders
- Vodafone Products and Services - Employees get a mobile phone, free communication plan, data card, and various discounts on services and products
- Recognition - Recognition programs for innovative, creative, high-potential employees and exemplary behaviors
- Health and Well-being - Well-being Program offers nutrition and psychological consultations, webinars, workshops, and discounts on various services and products
- Learning - Access to Communities of Practice and a customizable digital training platform with high-quality content (namely Harvard Business Publishing and Skillsoft)
- Local and International Mobility - Internal recruitment with local and international rotation opportunities across departments and roles