Platform Reliability Engineer
Ncounter is supporting a highly sophisticated, technology-driven trading environment in its search for a Platform Reliability Engineer to help operate, engineer, and continuously improve a large-scale distributed production platform used by researchers and software engineers. This role sits at the intersection of software engineering, infrastructure engineering, and production operations, with a strong focus on reliability, automation, observability, and operational excellence across mission-critical systems. You will work closely with developers and infrastructure teams to maintain resilient services, diagnose complex production issues, and engineer tooling and automation that reduce operational toil while improving platform stability and performance.
Key Responsibilities
• Improve reliability and resilience of production platform services
• Build automation and internal tooling to streamline operational workflows
• Design observability across metrics, logging, tracing, and alerting
• Diagnose complex production issues and improve system performance
• Contribute to operational runbooks, incident reviews, and reliability standards
Experience Required
• Background in SRE, Production Engineering, or platform operations supporting large-scale systems
• Strong Linux troubleshooting experience across distributed or containerised environments
• Programming capability in Python with Git-based workflows and CI/CD pipelines
• Hands-on experience with observability platforms and monitoring systems
• Experience operating high-availability infrastructure and improving system resilience
Exposure to technologies such as Kubernetes, Prometheus, Grafana, ELK, Kafka, PostgreSQL, Redis, Terraform, or Ansible would be beneficial.
If you enjoy solving complex reliability challenges and building the tooling that keeps large-scale platforms operating smoothly, we would welcome a conversation.