Production Engineer at Meta, working on the rate-limiting platform (RIM) that protects 100% of Meta's infra services. My focus is reliability, scalability, and automation: config canaries and health checks that stop outages before they start, real-time alerting across the fleet, and capacity work that maximizes resource utilization for compute-hungry workloads — including model-training infrastructure on external GPU clouds.
I'm a Production Engineer focused on reliability, scalability, and automation for infrastructure that operates at enormous scale. My work lives in the layer most people never see — the rate-limiting, capacity, and observability machinery that keeps services healthy under heavy load.
At Meta I work on the RIM rate-limiting platform that protects 100% of the company's infra services. I build the safety and automation around it — config canaries, health checks, self-service onboarding, automated SLO provisioning, and real-time alerting — so the platform scales without scaling the operational burden.
A growing part of my work is capacity efficiency for compute-intensive workloads: extending our rate-limiting hosts onto external GPU clouds, and using host stacking and service-side optimization to maximize resource utilization for model-training infrastructure. That intersection, keeping AI-scale infra reliable while getting the most out of every host, is exactly where I want to keep building.
Open to conversations about reliability, scale, and the systems that will power the AI era.