Lead Operational Intelligence Developer
Job Description
We are looking for a highly experienced and dynamic **Lead Operational Intelligence Developer** to join our team. In this role, you will take ownership of leading the development, maintenance, and enhancement of our Elastic & Observability Platform deployed across GCP and Elastic Cloud. You will drive strategic initiatives, guide a high-performing technical team, and ensure platform reliability while fostering innovation and enabling self-service capabilities for platform consumers. This position also involves participating in an on-call rotation to oversee platform health and functionality.

**Responsibilities**

* Oversee the availability, functionality, performance, and security of observability and search platforms to exceed business SLAs
* Provide technical leadership during complex incidents and escalate resolutions promptly during on-call periods
* Develop and maintain comprehensive platform documentation, standard operating procedures, and knowledge-sharing resources
* Collaborate with cross-functional teams, stakeholders, and vendors to oversee operational requirements, drive strategic initiatives, and manage installations, troubleshooting, and upgrades
* Lead the enhancement of platform features and self-service capabilities, including advanced Elastic Synthetics and chargeback automation
* Architect and implement proof-of-concepts for platform innovation, such as AI-driven observability, advanced data processing models, or Kubernetes-based platform migration
* Supervise the building, deployment, and maintenance of Elastic clusters using Infrastructure-as-Code (IaC) tools like Terraform and Ansible, while mentoring team members on best practices
* Oversee platform lifecycle management activities, including component upgrades, capacity planning, cost optimization, and evolving compliance requirements
* Continuously assess and fine-tune ELK stack performance, including ingestion, indexing, and query optimization for large-scale environments
* Establish and enhance comprehensive alerting and incident management workflows, integrating sophisticated monitoring tools such as Kibana Rules, Watchers, and PagerDuty
* Supervise the ingestion, enrichment, backup, and restoration of large-scale platform data while optimizing data workflows
* Lead and plan critical operational events such as SSL certificate rotations, cluster migrations, or scalability optimization projects

**Requirements**

* 5+ years of experience in Operational Intelligence, with a proven track record of leadership and technical expertise in managing large-scale observability platforms
* Demonstrated ability to architect and manage Elastic clusters in complex, multi-cloud environments
* In-depth knowledge of Elastic Stack components, including advanced configurations of Elasticsearch, Kibana, and Logstash
* Advanced proficiency in Infrastructure-as-Code (IaC) tools like Terraform and Ansible, with demonstrated flexibility in adapting other tools like Jenkins CI or GitOps frameworks
* Advanced Python scripting skills for automation, data processing, and extending platform interoperability
* Deep understanding of incident management frameworks and workflows with tools like PagerDuty, Uptrends, and other enterprise monitoring solutions
* Proven expertise in troubleshooting and resolving complex platform challenges under tight SLAs
* Strong capability in managing and scaling fault-tolerant platforms while ensuring performance, security, and compliance across large distributed systems
* Demonstrated ability to mentor and grow team members, manage priorities, and act as a bridge between technical and non-technical teams
* Excellent command of English (B2+ level), both written and spoken, with a strong emphasis on technical communication skills

**Nice to have**

* Expertise in scripting with Groovy or experience in advanced Linux administration to optimize platform processes
* Track record of optimizing observability workflows with additional integrations or customizations in tools like Uptrends, PagerDuty, or Elastic features
* Hands-on experience with advanced Elastic Synthetics setups for robust monitoring and custom synthetic testing frameworks
* Experience driving strategic initiatives such as modernization through AI tooling, cloud-native transitions, or cost-saving observability optimizations

**We offer**

* International projects with top brands
* Work with global teams of highly skilled, diverse peers
* Healthcare benefits
* Employee financial programs
* Paid time off and sick leave
* Upskilling, reskilling and certification courses
* Unlimited access to the LinkedIn Learning library and 22,000+ courses
* Global career opportunities
* Volunteer and community involvement opportunities
* EPAM Employee Groups
* Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
Job originally posted on: LinkedIn