**Excellent opportunity to work REMOTELY with a U.S.-based company. Candidates living in Mexico, Central, or South America are welcome to apply.**

**About The Company**

Bydrec, Inc. is a California-based company that connects top Tech talent from Latin America with U.S. companies looking to expand their development teams. Learn more at bydrec.com.

Our client is a dynamic company that requires a proactive and self-assured engineer to help define and lead this project. The ideal candidate must be able to ensure their platform remains fast, resilient, and scalable, especially during high-traffic live events. This is a unique opportunity to contribute to the future of reliability at a company where uptime and user experience are paramount.

**What You’ll Do**

* Optimize Performance: Continuously monitor and analyze system performance, identify bottlenecks, and implement solutions to improve efficiency and scalability across our cloud-native infrastructure.
* Monitoring & Alerting: Design and manage robust observability systems using Prometheus, Grafana, ELK stack, and APM tools to ensure real-time visibility into platform health.
* Incident Management: Lead incident response efforts, perform root cause analysis, and drive post-mortem processes to prevent recurrence and improve system resilience.
* Cloud Infrastructure: Architect and maintain infrastructure across Azure and GCP, ensuring high availability, security, and cost-effectiveness.
* Automation & Tooling: Build and maintain automation scripts and playbooks using Python and Ansible to reduce manual effort and improve deployment consistency.
* Container Orchestration: Manage Kubernetes clusters to support dynamic scaling and seamless deployment of microservices.
* CI/CD & GitOps: Collaborate with development teams to enhance GitLab pipelines and promote GitOps practices for reliable and repeatable deployments.
* Cross-Team Collaboration: Work closely with Engineering, Development, and Technical Operations to align reliability goals with product and business objectives.

**Technical Requirements**

* 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles within a SaaS or cloud-native environment.
* Strong expertise in:
  * Azure and GCP cloud platforms
  * Kubernetes and container orchestration
  * Monitoring tools: Prometheus, Grafana, ELK stack, APM solutions
  * Automation: Python, Ansible
  * CI/CD: GitLab
* Proven success in performance tuning, incident response, and system scalability.
* Excellent communication and collaboration skills across technical and non-technical teams.
* Initiative, confidence, and a builder’s mindset: ready to shape a nascent function and drive impact from day one.
* Sense of urgency during critical incidents, as the work focuses on maintaining high availability.
* Advanced level of English

**Must Have Skills**

* Experience using APM (Application Performance Monitoring) tools, also referred to as Observability platforms.
* Skill in leveraging logs for monitoring, alerting, and forensics.
* Expertise working with modern cloud-native environments, with experience in both on-premise and cloud infrastructure (due to the ongoing migration).

Site Reliability Engineer (SRE)
