Production Engineer · Meta

Rishabh Agrawal

I build rate-limiting and capacity infrastructure that keeps large-scale systems reliable — and squeezes the most out of every GPU and host.

Production Engineer at Meta, working on the rate-limiting platform (RIM) that protects 100% of Meta's infra services. My focus is reliability, scalability, and automation: config canaries and health checks that stop outages before they start, real-time alerting across the fleet, and capacity work that maximizes resource utilization for compute-hungry workloads — including model-training infrastructure on external GPU clouds.

Get in touch → LinkedIn · GitHub

About

I'm a Production Engineer focused on reliability, scalability, and automation for infrastructure that operates at enormous scale. My work lives in the layer most people never see — the rate-limiting, capacity, and observability machinery that keeps services healthy under heavy load.

At Meta I work on the RIM rate-limiting platform that protects 100% of the company's infra services. I build the safety and automation around it — config canaries, health checks, self-service onboarding, automated SLO provisioning, and real-time alerting — so the platform scales without scaling the operational burden.

A growing part of my work is capacity efficiency for compute-intensive workloads: extending our rate-limiting hosts onto external GPU clouds and using host stacking and service-side optimization to maximize resource utilization for model-training infrastructure. That intersection — keeping AI-scale infra reliable while getting the most out of every host — is exactly where I want to keep building.

role: Production Engineer @ Meta

based: Bay Area, California

focus: Reliability · Scale · Infra + AI

core: Python · C++ · Distributed systems

Experience

Production Engineer · Meta

Jul 2022 — Present · Menlo Park, CA

Lead reliability work on the RIM rate-limiting platform protecting 100% of Meta's infra services — config canaries and health checks that prevent unintended throttling from config regressions.
Expanded capacity by extending RIM rate-limiting hosts onto external GPU clouds (AWS, CoreWeave), maximizing resource utilization for compute-intensive and model-training workloads.
Built an end-to-end real-time alerting system notifying 10k+ infra users of resource throttling — enabling near real-time incident detection and 50% faster mitigation.
Automated the platform for self-service onboarding, with improved observability and automated SLO provisioning — scaling adoption without scaling operational load.

Software Engineer II · UiPath

Jan 2022 — Jul 2022 · Austin, TX

Architected and shipped new CloudElements marketplace import/export APIs plus a privacy-aware file extension, unlocking a new revenue stream.
Cut internal latency SLOs by 15% and integrated mutation testing with 150+ new tests for >95% coverage.

Software Engineer · Amazon

Aug 2020 — Jan 2022 · Austin, TX

Built a serverless solution (Lambda, S3, SQS, SNS) to automate international order export documentation and compliance.
Added live metrics monitoring and anomaly detection across 20+ KPIs with global alerting.

Software Engineer Intern · Copart

Aug 2019 — Jun 2020 · Dallas, TX

Migrated a legacy billing service to Spring Boot APIs with an event-driven architecture (RabbitMQ/Kafka).
Built a microservice that moved check printing to MICR printers — saving over $300K annually.

Skills & Tooling

Languages

Python · C++ · Java · Hack

Reliability & Observability

SLOs as code · Real-time alerting · Anomaly detection · Config canaries

Scale & Data

Rate limiting (RIM) · Scuba · Presto · Distributed systems

Cloud & Infra

AWS · CoreWeave · Kubernetes · Docker · CI/CD

Education

The University of Texas at Arlington

M.S. in Computer Science

2018 — 2020

GLBITM (UPTU), India

B.S. in Computer Science

2012 — 2016

Let's talk infrastructure.

Open to conversations about reliability, scale, and the systems that will power the AI era.