Skip to content

Commit 27ce0c0

Browse files
authored
Merge pull request #69 from Marcinthecloud/docs/data-platform-refresh
docs: refresh Pipelines, R2 Data Catalog, and R2 SQL references
2 parents ffcc622 + 0554300 commit 27ce0c0

17 files changed

Lines changed: 906 additions & 2346 deletions

File tree

skills/cloudflare/SKILL.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -69,6 +69,8 @@ Need storage?
6969
├─ Strongly-consistent per-entity state → durable-objects/ (DO storage)
7070
├─ Secrets management → secrets-store/
7171
├─ Streaming ETL to R2 → pipelines/
72+
├─ Managed Apache Iceberg catalog on R2 → r2-data-catalog/
73+
├─ Serverless SQL analytics over Iceberg tables → r2-sql/
7274
└─ Persistent cache (long-term retention) → cache-reserve/
7375
```
7476

@@ -126,6 +128,7 @@ Need analytics?
126128
├─ Custom high-cardinality metrics from Workers → analytics-engine/
127129
├─ Client-side (RUM) performance data → web-analytics/
128130
├─ Workers Logs and real-time debugging → observability/
131+
├─ SQL over Iceberg data lake (logs, events) → r2-sql/ (+ pipelines/, r2-data-catalog/)
129132
└─ Raw logs (Logpush to external tools) → Cloudflare docs
130133
```
131134

Lines changed: 54 additions & 69 deletions
Original file line numberDiff line numberDiff line change
@@ -1,105 +1,90 @@
11
# Cloudflare Pipelines
22

3-
ETL streaming platform for ingesting, transforming, and loading data into R2 with SQL transformations.
3+
Streaming ingest: receive events over HTTP/Workers/Logpush, transform with SQL, write to R2 as Iceberg tables or Parquet/JSON files.
44

5-
## Overview
5+
## Documentation
66

7-
Pipelines provides:
8-
- **Streams**: Durable event buffers (HTTP/Workers ingestion)
9-
- **Pipelines**: SQL-based transformations
10-
- **Sinks**: R2 destinations (Iceberg tables or Parquet/JSON files)
7+
This reference is a fast-start with verified code and gotchas. For limits, settings, full SQL syntax, and pricing, **retrieve the live docs** — use the `cloudflare-docs` MCP/search tool if available, otherwise `webfetch` the URL. Docs are source of truth over this file.
118

12-
**Status**: Open beta (Workers Paid plan)
13-
**Pricing**: No charge beyond standard R2 storage/operations
9+
| Topic | URL |
10+
|-------|-----|
11+
| Overview / getting started | `https://developers.cloudflare.com/pipelines/getting-started/` |
12+
| Streams (write, manage, Logpush) | `https://developers.cloudflare.com/pipelines/streams/` |
13+
| Sinks | `https://developers.cloudflare.com/pipelines/sinks/` |
14+
| Pipelines & SQL transforms | `https://developers.cloudflare.com/pipelines/pipelines/` |
15+
| SQL reference (statements, types) | `https://developers.cloudflare.com/pipelines/sql-reference/` |
16+
| Wrangler commands | `https://developers.cloudflare.com/pipelines/reference/wrangler-commands/` |
17+
| Terraform | `https://developers.cloudflare.com/pipelines/reference/terraform/` |
18+
| Limits | `https://developers.cloudflare.com/pipelines/platform/limits/` |
19+
| Pricing | `https://developers.cloudflare.com/pipelines/platform/pricing/` |
20+
| Metrics (GraphQL) | `https://developers.cloudflare.com/pipelines/observability/metrics/` |
1421

15-
## Architecture
22+
## Three Components
1623

1724
```
18-
Data Sources → Streams → Pipelines (SQL) → Sinks → R2
19-
↑ ↓ ↓
20-
HTTP/Workers Transform Iceberg/Parquet
25+
Sources → Stream → Pipeline (SQL) → Sink → R2
26+
↑ ↓ ↓
27+
HTTP / Workers / Transform Iceberg (Data Catalog)
28+
Logpush (row-level) or Parquet/JSON files
2129
```
2230

23-
| Component | Purpose | Key Feature |
24-
|-----------|---------|-------------|
25-
| Streams | Event ingestion | Structured (validated) or unstructured |
26-
| Pipelines | Transform with SQL | Immutable after creation |
27-
| Sinks | Write to R2 | Exactly-once delivery |
31+
| Component | Purpose |
32+
|-----------|---------|
33+
| **Stream** | Receives events (HTTP endpoint, Worker binding, or Logpush). Structured (schema-validated) or unstructured. |
34+
| **Pipeline** | SQL connecting a stream to a sink. Row-level transforms only — no GROUP BY/aggregation. |
35+
| **Sink** | Writes to R2 — Iceberg via Data Catalog, or raw Parquet/JSON. |
36+
37+
**Status:** Open beta (Workers Paid for production). Pricing announced; verify billing status in docs.
2838

2939
## Quick Start
3040

3141
```bash
32-
# Interactive setup (recommended)
42+
# Interactive — creates stream + sink + pipeline, optionally bucket + catalog
3343
npx wrangler pipelines setup
3444
```
3545

36-
**Minimal Worker example:**
46+
Minimal Worker producer:
3747
```typescript
38-
interface Env {
39-
STREAM: Pipeline;
40-
}
48+
interface Env { MY_STREAM: Pipeline; }
4149

4250
export default {
43-
async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
44-
const event = { user_id: "123", event_type: "purchase", amount: 29.99 };
45-
46-
// Fire-and-forget pattern
47-
ctx.waitUntil(env.STREAM.send([event]));
48-
49-
return new Response('OK');
51+
async fetch(req: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
52+
ctx.waitUntil(env.MY_STREAM.send([{ event_id: crypto.randomUUID(), amount: 29.99 }]));
53+
return new Response("OK");
5054
}
5155
} satisfies ExportedHandler<Env>;
5256
```
5357

5458
## Which Sink Type?
5559

5660
```
57-
Need SQL queries on data?
58-
→ R2 Data Catalog (Iceberg)
59-
✅ ACID transactions, time-travel, schema evolution
60-
❌ More setup complexity (namespace, table, catalog token)
61-
62-
Just file storage/archival?
63-
→ R2 Storage (Parquet)
64-
✅ Simple, direct file access
65-
❌ No built-in SQL queries
66-
67-
Using external tools (Spark/Athena)?
68-
→ R2 Storage (Parquet with partitioning)
69-
✅ Standard format, partition pruning for performance
70-
❌ Must manage schema compatibility yourself
71-
```
72-
73-
## Common Use Cases
61+
Need SQL queries / ACID / time-travel on the data?
62+
→ R2 Data Catalog (Iceberg) ✅ R2 SQL, schema evolution ❌ more setup
7463
75-
- **Analytics pipelines**: Clickstream, telemetry, server logs
76-
- **Data warehousing**: ETL into queryable Iceberg tables
77-
- **Event processing**: Mobile/IoT with enrichment
78-
- **Ecommerce analytics**: User events, purchases, views
64+
Just archival / external tools (Spark, Athena)?
65+
→ R2 raw files (Parquet/JSON) ✅ simple, partitioned files ❌ no built-in SQL
66+
```
7967

80-
## Reading Order
68+
## Critical Behaviors (read before building)
8169

82-
**New to Pipelines?** Start here:
83-
1. [configuration.md](./configuration.md) - Setup streams, sinks, pipelines
84-
2. [api.md](./api.md) - Send events, TypeScript types, SQL functions
85-
3. [patterns.md](./patterns.md) - Best practices, integrations, complete example
86-
4. [gotchas.md](./gotchas.md) - Critical warnings, troubleshooting
70+
These are non-obvious and prevent most failures — see [gotchas.md](gotchas.md) for detail.
8771

88-
**Task-based routing:**
89-
- Setup pipeline → [configuration.md](./configuration.md)
90-
- Send/query data → [api.md](./api.md)
91-
- Implement pattern → [patterns.md](./patterns.md)
92-
- Debug issue → [gotchas.md](./gotchas.md)
72+
- **Everything is immutable after creation** — stream schema, pipeline SQL, sink config. To change, delete and recreate.
73+
- **Sinks create their own table** — they cannot target an existing Iceberg table.
74+
- **`__ingest_ts` is added automatically** (TIMESTAMP, partitioned by day). Don't define it in your schema.
75+
- **Data isn't queryable immediately** — first flush takes **3–7 minutes** (warm-up + table creation) even with a short roll interval.
76+
- **Schema validation is deferred** — invalid events are accepted then silently dropped. Monitor via GraphQL error metrics.
77+
- **Binding field renamed `pipeline``stream`** (June 2026); old field still accepted.
9378

94-
## In This Reference
79+
## Reading Order
9580

96-
- [configuration.md](./configuration.md) - wrangler.jsonc bindings, schema definition, sink options, CLI commands
97-
- [api.md](./api.md) - Pipeline binding interface, send() method, HTTP ingest, SQL function reference
98-
- [patterns.md](./patterns.md) - Fire-and-forget, schema validation with Zod, integrations, performance tuning
99-
- [gotchas.md](./gotchas.md) - Silent validation failures, immutable pipelines, latency expectations, limits
81+
1. [configuration.md](configuration.md) — schema, streams, sinks, pipelines (CLI + REST + Terraform), bindings
82+
2. [api.md](api.md) `send()`, HTTP ingest, REST API, pipeline SQL, lifecycle states
83+
3. [patterns.md](patterns.md) — fire-and-forget, validation, Logpush, observability, end-to-end
84+
4. [gotchas.md](gotchas.md) — silent drops, immutability, REST≠CLI field names
10085

10186
## See Also
10287

103-
- [r2](../r2/) - R2 storage backend for sinks
104-
- [queues](../queues/) - Compare with Queues for async processing
105-
- [workers](../workers/) - Worker runtime for event ingestion
88+
- [r2-data-catalog](../r2-data-catalog/) — Iceberg sink destination
89+
- [r2-sql](../r2-sql/) — query the ingested data
90+
- [r2](../r2/) · [queues](../queues/) · [workers](../workers/)

0 commit comments

Comments
 (0)