Responsibilities
- Be on an on-call (PagerDuty) rotation to respond to incidents, and support service engineers with customer incidents.
- Use your on-call shift to prevent incidents from ever happening.
- Build monitoring that alerts on symptoms rather than on outages.
- Document every action so your findings turn into repeatable actions and then into automation.
- Improve operational processes (such as deployments and upgrades) to make them as efficient as possible.
- Design, build and maintain core infrastructure that supports hundreds of thousands of concurrent users.
- Debug production issues across services and levels of the stack.
- Plan the growth of system infrastructure.

Qualifications
- Bachelor's degree in computer systems or a related field.
- Experience handling multiple on-call shifts for mission-critical systems, and responsibility for the tools and processes used to debug and correct failures.
- Experience navigating more than one incident through to the retrospective process.
- Strong software engineering skills, primarily in backend software development.
- Comfort with hands-on development, navigating through multiple programming languages, digging deep in the stack, and using cloud infrastructure (AWS, GCE, Azure, Kubernetes, Docker).
- Experience with mentorship and helping teammates level up their craft and technical skills.
- An understanding of continuous improvement and evolving systems.
- A commitment to, and drive for, quality, technical excellence and results.
- Experience working with a variety of open-source software, including nginx, Redis, Memcached and MySQL.
- Familiarity with network and web protocols, from IP to HTTP.
Site Resiliency Engineer