Production Engineer · Meta

Rishabh Agrawal

I build rate-limiting and capacity infrastructure that keeps large-scale systems reliable — and squeezes the most out of every GPU and host.

Production Engineer at Meta, working on the rate-limiting platform (RIM) that protects 100% of Meta's infra services. My focus is reliability, scalability, and automation: config canaries and health checks that stop outages before they start, real-time alerting across the fleet, and capacity work that maximizes resource utilization for compute-hungry workloads — including model-training infrastructure on external GPU clouds.

01

About

I'm a Production Engineer focused on reliability, scalability, and automation for infrastructure that operates at enormous scale. My work lives in the layer most people never see — the rate-limiting, capacity, and observability machinery that keeps services healthy under heavy load.

At Meta I work on the RIM rate-limiting platform that protects 100% of the company's infra services. I build the safety and automation around it — config canaries, health checks, self-service onboarding, automated SLO provisioning, and real-time alerting — so the platform scales without scaling the operational burden.

A growing part of my work is capacity efficiency for compute-intensive workloads: extending our rate-limiting hosts onto external GPU clouds and using host stacking and service-side optimization to maximize resource utilization for model-training infrastructure. That intersection — keeping AI-scale infra reliable while getting the most out of every host — is exactly where I want to keep building.

role: Production Engineer @ Meta
based: Bay Area, California
focus: Reliability · Scale · Infra + AI
core: Python · C++ · Distributed systems
02

Experience

Production Engineer · Meta
Jul 2022 — Present · Menlo Park, CA
  • Lead reliability work on the RIM rate-limiting platform protecting 100% of Meta's infra services: config canaries and health checks that prevent unintended throttling from config regressions.
  • Expanded capacity by extending RIM rate-limiting hosts onto external GPU clouds (AWS, CoreWeave), maximizing resource utilization for compute-intensive and model-training workloads.
  • Built an end-to-end real-time alerting system notifying 10k+ infra users of resource throttling — enabling near real-time incident detection and 50% faster mitigation.
  • Automated the platform for self-service onboarding, with improved observability and automated SLO provisioning — scaling adoption without scaling operational load.
Software Engineer II · UiPath
Jan 2022 — Jul 2022 · Austin, TX
  • Architected and shipped new CloudElements marketplace import/export APIs plus a privacy-aware file extension, unlocking a new revenue stream.
  • Cut internal latency SLOs by 15% and integrated mutation testing with 150+ new tests for >95% coverage.
Software Engineer · Amazon
Aug 2020 — Jan 2022 · Austin, TX
  • Built a serverless solution (Lambda, S3, SQS, SNS) to automate international order export documentation and compliance.
  • Added live metrics monitoring and anomaly detection across 20+ KPIs with global alerting.
Software Engineer Intern · Copart
Aug 2019 — Jun 2020 · Dallas, TX
  • Migrated a legacy billing service to Spring Boot APIs with an event-driven architecture (RabbitMQ/Kafka).
  • Built a microservice that moved check printing to MICR printers — saving over $300K annually.
03

Skills & Tooling

Languages

Python · C++ · Java · Hack

Reliability & Observability

SLOs as code · Real-time alerting · Anomaly detection · Config canaries

Scale & Data

Rate limiting (RIM) · Scuba · Presto · Distributed systems

Cloud & Infra

AWS · CoreWeave · Kubernetes · Docker · CI/CD
04

Education

The University of Texas at Arlington

M.S. in Computer Science
2018 — 2020

GLBITM (UPTU), India

B.S. in Computer Science
2012 — 2016

Let's talk infrastructure.

Open to conversations about reliability, scale, and the systems that will power the AI era.