catalog: add per-timeframe console cluster utilization overview views#37307
catalog: add per-timeframe console cluster utilization overview views#37307jubrad wants to merge 1 commit into
Conversation
The Console's cluster-detail page polls a replica-utilization rollup that,
for every timeframe except "Last 14 days", was recomputed ad-hoc on each
request: it builds the whole-fleet rollup (per-replica metrics aggregation,
five per-bucket arg-max top-1s, a multi-way join) and only filters to the one
selected cluster at the very end. That recompute is CPU-bound on
mz_catalog_server, so it serializes under concurrent users and its cost grows
linearly with the number of clusters in the deployment.
This adds two new indexed views alongside the existing 14-day overview so the
Console can read a maintained, per-cluster indexed lookup for every timeframe
it offers instead of recomputing:
- mz_console_cluster_utilization_overview_3h (1-minute buckets, 3h window)
- mz_console_cluster_utilization_overview_24h (5-minute buckets, 24h window)
- mz_console_cluster_utilization_overview (now 1-hour buckets, 14d;
previously 8-hour buckets)
All three share one SQL body (console_cluster_utilization_overview_sql) and
output relation so they stay in sync, and each is indexed on cluster_id in
mz_catalog_server. The view body is also cleaned up to drop a redundant
re-join of replica_history that the binning step never needed.
The two new views read only a 3h/24h window of metrics, so they are cheap to
maintain (~10MB and ~510MB of arrangements, ~1-10s hydration at 100 replicas).
The 14-day view dominates cost (input-bound on 14 days of 1-minute samples);
the 8h->1h change roughly doubles its arrangement memory but leaves hydration
unchanged.
The Console gates use of these views on the environment version, falling back
to the ad-hoc query on older environments.
Updates the catalog snapshot tests (oid, information_schema_tables,
mz_catalog_server_index_accounting, catalog_server_explain, autogenerated
mz_internal) and the catalog/indexes/explain-analyze testdrive files to cover
the new objects.
c052d01 to
ccd2a38
Compare
The cluster-detail page's replica-utilization graph previously recomputed the whole-fleet rollup ad-hoc for every timeframe except "Last 14 days", which serializes on mz_catalog_server under concurrent users and scales with the number of clusters in the deployment. This points the graph at the pre-materialized, per-cluster indexed overview views (added in the companion catalog PR): a new selector picks the finest view whose window covers the selected range — _3h (1-min), _24h (5-min), or the 14-day view (now 1-hour) — so each load is an indexed lookup keyed on cluster_id. Ranges longer than any view (30 days) still use the ad-hoc query. Use of the new views is gated on the environment version via useEnvironmentGate (CLUSTER_UTILIZATION_OVERVIEW_VIEWS_VERSION); older environments keep today's behavior (ad-hoc for short ranges, the pre-existing 8-hour view for 14 days). The remaining ad-hoc query (buildReplicaUtilizationHistoryQuery, used for the 30-day range and on older environments) is also optimized: the cluster filter is pushed down into replica_history so the rollup only processes the target cluster's replicas, the redundant replica_history re-join is removed, and the offline-events scan is restricted to the cluster's replicas. Also fixes a latent bug where the optional endDate bound used startDate. types/materialize.d.ts is hand-edited to add the new view types; it should be regenerated with `yarn gen:types` against an environment that has the views. Depends on MaterializeInc#37307.
|
cc @leedqin Cluster-utilization overview — performance findingsBacking the Console's cluster-detail replica-utilization graphs. The Console used TL;DR — ship the three maintained indexed views +
The problemEach page load issued the full rollup (five per-bucket arg-max top-1s, a 6-way The fix, and what it buys (before/after)Three maintained indexed views —
Before is pinned at ~0.33 qps regardless of concurrency (clients just queue; p50 Weighing the options
Option 2 — single 1-min base. Tempting: at 1/min scrape each 1-min bucket
Opt 2 serializes on the worker and blows up — 82× worse snapshots at N=32. Rules queriesSee Q4 (base) and Q5 (unmaterialized re-bin) in the appendix. Option 3 — base + a maintained 1h view on top. Both arrangements are querySee Q7 in the appendix. Key supporting findings
ConclusionShip the three maintained indexed views with the MethodologyAll runs on a local Appendix: queriesThe harness shadows the real system relations with user tables of identical
SELECT m.occurred_at, m.replica_id, r.size, r.cluster_id,
(SUM(m.cpu_nano_cores::float8)/(NULLIF(s.cpu_nano_cores,0)*s.processes)) AS cpu_percent,
(SUM(m.memory_bytes::float8)/(NULLIF(s.memory_bytes,0)*s.processes)) AS memory_percent,
(SUM(m.disk_bytes::float8)/(NULLIF(s.disk_bytes,0)*s.processes)) AS disk_percent,
SUM(m.memory_bytes::float8) AS memory_bytes, SUM(m.disk_bytes::float8) AS disk_bytes,
s.disk_bytes*s.processes AS total_disk_bytes, s.memory_bytes*s.processes AS total_memory_bytes,
MAX(m.heap_bytes::float8) AS heap_bytes, MAX(m.heap_limit) AS heap_limit,
COALESCE(MAX(m.heap_bytes::float8/NULLIF(m.heap_limit,0)),
SUM(m.memory_bytes::float8)/(NULLIF(s.memory_bytes,0)*s.processes)) AS heap_percent
FROM replica_history AS r
JOIN replica_sizes AS s ON r.size = s.size
JOIN replica_metrics_history AS m ON m.replica_id = r.replica_id
GROUP BY m.occurred_at, m.replica_id, r.size, r.cluster_id,
s.cpu_nano_cores, s.memory_bytes, s.disk_bytes, s.processes
WITH binned AS (SELECT * FROM <src>),
max_memory AS (SELECT DISTINCT ON (cluster_id, b, replica_id) cluster_id, b, replica_id, memory_percent, occurred_at
FROM binned OPTIONS (DISTINCT ON INPUT GROUP SIZE = <GS>) ORDER BY cluster_id, b, replica_id, COALESCE(memory_bytes,0) DESC),
max_disk AS (SELECT DISTINCT ON (cluster_id, b, replica_id) cluster_id, b, replica_id, disk_percent, occurred_at
FROM binned OPTIONS (DISTINCT ON INPUT GROUP SIZE = <GS>) ORDER BY cluster_id, b, replica_id, COALESCE(disk_bytes,0) DESC),
max_cpu AS (SELECT DISTINCT ON (cluster_id, b, replica_id) cluster_id, b, replica_id, cpu_percent, occurred_at
FROM binned OPTIONS (DISTINCT ON INPUT GROUP SIZE = <GS>) ORDER BY cluster_id, b, replica_id, COALESCE(cpu_percent,0) DESC),
max_heap AS (SELECT DISTINCT ON (cluster_id, b, replica_id) cluster_id, b, replica_id, heap_percent, occurred_at
FROM binned OPTIONS (DISTINCT ON INPUT GROUP SIZE = <GS>) ORDER BY cluster_id, b, replica_id, COALESCE(heap_bytes,0) DESC),
max_mad AS (SELECT DISTINCT ON (cluster_id, b, replica_id) cluster_id, b, replica_id, memory_and_disk_percent, occurred_at
FROM binned OPTIONS (DISTINCT ON INPUT GROUP SIZE = <GS>) ORDER BY cluster_id, b, replica_id, COALESCE(memory_and_disk_percent,0) DESC)
SELECT max_memory.b, max_memory.replica_id, max_memory.cluster_id,
max_memory.memory_percent, max_memory.occurred_at AS mem_at,
max_disk.disk_percent, max_cpu.cpu_percent, max_heap.heap_percent, max_mad.memory_and_disk_percent
FROM max_memory
JOIN max_disk USING (b, replica_id, cluster_id) JOIN max_cpu USING (b, replica_id, cluster_id)
JOIN max_heap USING (b, replica_id, cluster_id) JOIN max_mad USING (b, replica_id, cluster_id)Q1. Shipped rollup (maintained indexed view)Byte-for-byte the builtin body (real definition in CREATE VIEW util AS
WITH replica_metrics_history AS (
SELECT m.occurred_at, m.replica_id, r.size,
(SUM(m.cpu_nano_cores::float8)/(NULLIF(s.cpu_nano_cores,0)*s.processes)) AS cpu_percent,
(SUM(m.memory_bytes::float8)/(NULLIF(s.memory_bytes,0)*s.processes)) AS memory_percent,
(SUM(m.disk_bytes::float8)/(NULLIF(s.disk_bytes,0)*s.processes)) AS disk_percent,
SUM(m.disk_bytes::float8) AS disk_bytes, SUM(m.memory_bytes::float8) AS memory_bytes,
s.disk_bytes*s.processes AS total_disk_bytes, s.memory_bytes*s.processes AS total_memory_bytes,
MAX(m.heap_bytes::float8) AS heap_bytes, MAX(m.heap_limit) AS heap_limit,
COALESCE(MAX(m.heap_bytes::float8/NULLIF(m.heap_limit,0)),
SUM(m.memory_bytes::float8)/(NULLIF(s.memory_bytes,0)*s.processes)) AS heap_percent
FROM replica_history AS r
JOIN replica_sizes AS s ON r.size = s.size
JOIN replica_metrics_history AS m ON m.replica_id = r.replica_id
GROUP BY m.occurred_at, m.replica_id, r.size, s.cpu_nano_cores, s.memory_bytes, s.disk_bytes, s.processes
),
binned AS (
SELECT *, date_bin('1 minute', occurred_at, '1970-01-01'::timestamp) AS bucket_start
FROM replica_metrics_history
WHERE mz_now() <= date_bin('1 minute', occurred_at, '1970-01-01'::timestamp) + INTERVAL '14 days'
),
max_memory AS (SELECT DISTINCT ON (bucket_start, replica_id) bucket_start, replica_id, memory_percent, occurred_at
FROM binned OPTIONS (DISTINCT ON INPUT GROUP SIZE = 60) ORDER BY bucket_start, replica_id, COALESCE(memory_bytes,0) DESC),
max_disk AS (SELECT DISTINCT ON (bucket_start, replica_id) bucket_start, replica_id, disk_percent, occurred_at
FROM binned OPTIONS (DISTINCT ON INPUT GROUP SIZE = 60) ORDER BY bucket_start, replica_id, COALESCE(disk_bytes,0) DESC),
max_cpu AS (SELECT DISTINCT ON (bucket_start, replica_id) bucket_start, replica_id, cpu_percent, occurred_at
FROM binned OPTIONS (DISTINCT ON INPUT GROUP SIZE = 60) ORDER BY bucket_start, replica_id, COALESCE(cpu_percent,0) DESC),
max_memory_and_disk AS (SELECT DISTINCT ON (bucket_start, replica_id) bucket_start, replica_id, memory_percent, disk_percent, memory_and_disk_percent, occurred_at
FROM (SELECT *, CASE WHEN disk_bytes IS NULL AND memory_bytes IS NULL THEN NULL
ELSE (COALESCE(memory_bytes,0)+COALESCE(disk_bytes,0))/NULLIF((total_memory_bytes+total_disk_bytes),0) END AS memory_and_disk_percent
FROM binned) AS inner_mad
OPTIONS (DISTINCT ON INPUT GROUP SIZE = 60) ORDER BY bucket_start, replica_id, COALESCE(memory_and_disk_percent,0) DESC),
max_heap AS (SELECT DISTINCT ON (bucket_start, replica_id) bucket_start, replica_id, heap_percent, occurred_at
FROM binned OPTIONS (DISTINCT ON INPUT GROUP SIZE = 60) ORDER BY bucket_start, replica_id, COALESCE(heap_bytes,0) DESC),
offline AS (
SELECT date_bin('1 minute', occurred_at, '1970-01-01'::timestamp) AS bucket_start, replica_id,
jsonb_agg(jsonb_build_object('replicaId', rsh.replica_id, 'occurredAt', rsh.occurred_at,
'status', rsh.status, 'reason', rsh.reason)) AS offline_events
FROM replica_status_history AS rsh
WHERE process_id = 0 AND status = 'offline'
AND mz_now() <= date_bin('1 minute', occurred_at, '1970-01-01'::timestamp) + INTERVAL '14 days'
GROUP BY bucket_start, replica_id
)
SELECT bucket_start, replica_id, max_memory.memory_percent, max_memory.occurred_at AS max_memory_at,
max_disk.disk_percent, max_disk.occurred_at AS max_disk_at,
max_memory_and_disk.memory_and_disk_percent, max_memory_and_disk.memory_percent AS mad_mem,
max_memory_and_disk.disk_percent AS mad_disk, max_memory_and_disk.occurred_at AS mad_at,
max_heap.heap_percent, max_heap.occurred_at AS max_heap_at, max_cpu.cpu_percent, max_cpu.occurred_at AS max_cpu_at,
offline.offline_events, bucket_start + INTERVAL '1 minute' AS bucket_end,
replica_name_history.new_name AS name, replica_history.cluster_id, replica_history.size
FROM max_memory
JOIN max_disk USING (bucket_start, replica_id)
JOIN max_cpu USING (bucket_start, replica_id)
JOIN max_memory_and_disk USING (bucket_start, replica_id)
JOIN max_heap USING (bucket_start, replica_id)
JOIN replica_history USING (replica_id)
CROSS JOIN LATERAL (
SELECT new_name FROM replica_name_history
WHERE replica_id = replica_name_history.id
AND bucket_start + INTERVAL '1 minute' >= COALESCE(replica_name_history.occurred_at, '1970-01-01'::timestamp)
ORDER BY replica_name_history.occurred_at DESC LIMIT 1
) AS replica_name_history
LEFT JOIN offline USING (bucket_start, replica_id);
CREATE INDEX util_idx IN CLUSTER bench ON util (cluster_id);Q2. Per-cluster point lookup (before vs after)The per-pageload query, and the one swept for throughput. "before" = no index on SELECT * FROM util WHERE cluster_id = $1;Q3. Design A — 14d/1h view aloneCREATE VIEW view_a AS
-- five_max(src, 60), src = (SELECT *, date_bin('1 hour', occurred_at, '1970-01-01') AS b
-- FROM (SELECT *, <memory_and_disk_percent CASE> FROM ( <rmh> ) rmh
-- WHERE mz_now() <= date_bin('1 hour', occurred_at, '1970-01-01') + INTERVAL '14 days') z);
CREATE INDEX view_a_idx IN CLUSTER bench ON view_a (cluster_id);Q4. Design B — 1-min baseCREATE VIEW base_1min AS
SELECT date_bin('1 minute', occurred_at, '1970-01-01'::timestamp) AS bucket_start,
replica_id, cluster_id, size, occurred_at,
cpu_percent, memory_percent, disk_percent, heap_percent, memory_bytes, disk_bytes, heap_bytes,
total_memory_bytes, total_disk_bytes,
CASE WHEN disk_bytes IS NULL AND memory_bytes IS NULL THEN NULL
ELSE (COALESCE(memory_bytes,0)+COALESCE(disk_bytes,0))/NULLIF((total_memory_bytes+total_disk_bytes),0) END AS memory_and_disk_percent
FROM ( <rmh> ) rmh
WHERE mz_now() <= date_bin('1 minute', occurred_at, '1970-01-01'::timestamp) + INTERVAL '14 days';
CREATE INDEX base_1min_idx IN CLUSTER bench ON base_1min (cluster_id);Q5. Design B — unmaterialized re-bin (collapses under SUBSCRIBE)CREATE VIEW rebin_1h AS
-- five_max(src, 60), src = (SELECT *, date_bin('1 hour', occurred_at, '1970-01-01') AS b FROM base_1min)
SELECT * FROM rebin_1h WHERE cluster_id = $1; -- per-query top-1, NOT maintainedQ6. SUBSCRIBE probe-- design A: v = maintained five_max view (indexed); design B: v = unmaterialized rebin_1h
BEGIN;
DECLARE cur CURSOR FOR SUBSCRIBE (SELECT * FROM v WHERE cluster_id = 'c<i>');
FETCH ALL cur; -- time to first full snapshotQ7. Base + 1h chain-- base_1min + index (Q4), plus:
CREATE VIEW hour AS
-- five_max(src, 60), src = (SELECT *, date_bin('1 hour', occurred_at, '1970-01-01') AS b FROM base_1min)
CREATE INDEX hour_idx IN CLUSTER bench ON hour (cluster_id);Q8. Offline jsonb on vs offThe "off" variant drops exactly the |
Motivation
The Console's cluster-detail page polls a replica-utilization rollup that, for every timeframe except "Last 14 days", is recomputed ad-hoc on each request: it builds the whole-fleet rollup (per-replica metrics aggregation, five per-bucket arg-max top-1s, a multi-way join) and only filters down to the one selected cluster at the very end.
That recompute is CPU-bound on
mz_catalog_server, so it serializes under concurrent users and its cost grows linearly with the number of clusters in the deployment. In a local repro on ascale=1,workers=1cluster (matchingmz_catalog_server) with synthetic fleets at a realistic 1-min metric cadence:The indexed lookup is flat in both fleet size and concurrency; the recompute collapses.
What this does
Adds two new indexed views alongside the existing 14-day overview, so the Console can read a maintained, per-cluster indexed lookup for every timeframe it offers:
mz_console_cluster_utilization_overview_3h— 1-minute buckets, 3-hour windowmz_console_cluster_utilization_overview_24h— 5-minute buckets, 24-hour windowmz_console_cluster_utilization_overview— now 1-hour buckets / 14 days (was 8-hour)All three share one SQL body (
console_cluster_utilization_overview_sql) and output relation so they stay in sync, and each is indexed oncluster_idinmz_catalog_server. The shared body also drops a redundant re-join ofreplica_historythat the binning step never needed.Cost / rollout
The two new views only read a 3h/24h window, so they are cheap to maintain (~10 MB and ~510 MB of arrangements; ~1–10 s hydration at 100 replicas). The 14-day view dominates cost (input-bound on 14 days of 1-minute samples — ~1.7 GB / ~33 s hydration per 50 replicas with the
GROUP SIZEhints); the 8h→1h change roughly doubles its arrangement memory but leaves hydration unchanged. Incremental cost of the whole change is ~18 MB/replica.Because the cost is modest for typical deployments and significant only for very large fleets, these ship always-on; the companion Console PR gates use of the views on the environment version and falls back to the ad-hoc query on older environments. A system-flag kill-switch can be added as a fast-follow if a large environment needs it.
Tests
Updates the catalog snapshot tests (
oid,information_schema_tables,mz_catalog_server_index_accounting,catalog_server_explain, autogeneratedmz_internal) and thecatalog/indexes/explain-analyzetestdrive files to cover the new objects.