Job Description
Overview
Principal Platform Engineer, Reliability and Observability
Ncounter is hiring a Principal Platform Engineer to own reliability and observability across a mission-critical trading platform. This is a deeply technical role focused on keeping complex, distributed systems stable, measurable, and predictable under real-time load. You will work directly on shared platform services that underpin trading and research workloads, where latency, partial failure, and blind spots in monitoring are not tolerated.
Observability is a core engineering concern here, not a bolt-on toolset. You will design and operate metrics, logging, tracing, and alerting pipelines that ingest high-volume telemetry, expose system behaviour under stress, and materially reduce operational risk. The role blends production engineering, platform tooling, automation, and reliability-led architecture, with direct ownership of systems running at scale.
Responsibilities
- Owning reliability and observability for shared platform services in Linux and Kubernetes environments
- Designing and operating high-throughput metrics, logging, and tracing pipelines for real-time systems
- Applying reliability engineering principles to harden services against latency degradation, cascading failure, and outages
- Reducing operational toil through automation, GitOps workflows, and platform tooling
- Improving on-call signal quality through alert design, runbooks, and post-incident learning
- Partnering with engineers to bake observability and resilience into services by default
Core Technical Background
- Strong experience in SRE, production engineering, or platform reliability with ownership of live systems
- Deep Linux systems knowledge, including debugging and performance tuning
- Software engineering skills in Python or Go, plus solid Git and CI/CD experience
- Hands-on expertise with observability stacks covering metrics, logs, traces, and alerting
- Experience running systems at scale, including high availability (HA), disaster recovery (DR), and incident response
Nice to Have
- Infrastructure automation with Terraform or Ansible
This is a role for engineers who enjoy understanding how systems really behave under pressure and who want to own reliability as a first-class engineering problem. If you like solving hard platform problems where observability directly drives system correctness, this is worth a conversation.