Skip to content

feat(metrics): 3 Prometheus counters for scaling dive (#7756)#7762

Merged
JohnMcLear merged 2 commits into
developfrom
feat/scaling-dive-counters
May 15, 2026
Merged

feat(metrics): 3 Prometheus counters for scaling dive (#7756)#7762
JohnMcLear merged 2 commits into
developfrom
feat/scaling-dive-counters

Conversation

@JohnMcLear

Copy link
Copy Markdown
Member

Summary

Adds three Prometheus rows to `/stats/prometheus` so the load-test harness for #7756 can attribute where time goes on the server, not just the gauge headline (CPU / event-loop / memory):

  • `etherpad_pad_users{padId}` — gauge derived from `sessioninfos` on scrape. Lets the harness confirm a target pad actually has the expected concurrency.
  • `etherpad_changeset_apply_duration_seconds` — histogram observed inside `handleUserChanges`. Separates "apply path is slow" from "fan-out is slow" when latency rises.
  • `etherpad_socket_emits_total{type}` — counter at the broadcast emit sites (`handleCustomObjectMessage`, `handleCustomMessage`, `sendChatMessageToPadClients`) and the per-socket `NEW_CHANGES` loop in `updatePadClients`. Bucketed by message type so the harness can measure the amplification factor of each lever (especially the fan-out batching lever).

Metric handles live in a new `src/node/prom-instruments.ts` rather than `prometheus.ts` itself, to avoid a circular import — `prometheus.ts` already `require`s `PadMessageHandler` to read `sessioninfos`.

This is the follow-up PR planned in section 6 of `docs/superpowers/specs/2026-05-15-scaling-dive-design.md` (in `ether/etherpad-load-test`). The load-test harness already tolerates these metrics' absence, so it ships unblocked; this PR just makes the next dive run more informative.

Test Plan

  • Backend test `src/tests/backend-new/specs/prom-instruments.test.ts` passes (3/3 locally) — verifies the recording helpers move the underlying counter/histogram.
  • Existing backend test suite remains green.
  • After merge, `curl http://your-etherpad/stats/prometheus | grep etherpad_` shows the three new rows.
  • Trigger the Scaling dive workflow with `core_ref` pointing at this branch or develop-post-merge; verify the harness reports include `etherpad_socket_emits_total{type=NEW_CHANGES}` and `etherpad_changeset_apply_duration_seconds_count` in the JSON snapshot.

🤖 Generated with Claude Code

Per the spec section 6 of #7756: enables the load-test harness to
attribute *where* time goes on the server, not just the gauge headline
(CPU / event-loop / memory) the dive doc starts from.

New /stats/prometheus rows:

- etherpad_pad_users{padId} — gauge, derived from sessioninfos on
  each scrape. Lets the harness confirm the pad it points at actually
  has the expected concurrency.

- etherpad_changeset_apply_duration_seconds — histogram observed
  inside handleUserChanges. Separates "apply path is slow" from
  "fan-out is slow" when latency rises.

- etherpad_socket_emits_total{type} — counter at the broadcast
  emit sites (handleCustomObjectMessage, handleCustomMessage,
  sendChatMessageToPadClients) and inside the NEW_CHANGES per-socket
  loop in updatePadClients. Bucketed by message type so the harness
  can measure the amplification factor of each lever (especially the
  fan-out batching lever).

Metric handles live in a new prom-instruments.ts module rather than
in prometheus.ts itself, so PadMessageHandler can import the
recording helpers without creating a circular dependency
(prometheus.ts already requires PadMessageHandler).

Tests: smoke test verifies recordSocketEmit + recordChangesetApply
move the underlying counters/histogram.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@qodo-code-review

Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@qodo-free-for-open-source-projects

Copy link
Copy Markdown

Review Summary by Qodo

Add Prometheus metrics for scaling dive load testing

✨ Enhancement

Grey Divider

Walkthroughs

Description
• Add three Prometheus metrics for scaling dive load testing
  - etherpad_pad_users{padId} gauge tracks concurrent users per pad
  - etherpad_changeset_apply_duration_seconds histogram measures edit processing time
  - etherpad_socket_emits_total{type} counter tracks broadcast messages by type
• Create new prom-instruments.ts module to avoid circular imports
• Instrument hot paths in PadMessageHandler to record metrics
• Add smoke tests verifying metric recording functionality
Diagram
flowchart LR
  PMH["PadMessageHandler<br/>hot path"] -->|import| PI["prom-instruments.ts<br/>metric handles"]
  PI -->|register| PROM["prometheus.ts<br/>central registry"]
  PMH -->|recordSocketEmit| SE["socketEmitsTotal<br/>counter"]
  PMH -->|recordChangesetApply| CAD["changesetApplyDuration<br/>histogram"]
  PROM -->|getPadUsersMap| PUG["padUsersGauge<br/>gauge"]
  PROM -->|expose| STATS["/stats/prometheus<br/>endpoint"]
Loading

Grey Divider

File Changes

1. src/node/prom-instruments.ts ✨ Enhancement +36/-0

New Prometheus instruments module for hot path

• New module defining three Prometheus instruments for the scaling dive
• Exports changesetApplyDuration histogram, socketEmitsTotal counter, and padUsersGauge gauge
• Provides recordChangesetApply() and recordSocketEmit() helper functions for hot-path recording
• Separated from prometheus.ts to avoid circular dependency with PadMessageHandler

src/node/prom-instruments.ts


2. src/node/handler/PadMessageHandler.ts ✨ Enhancement +21/-0

Instrument message handlers with metric recording

• Import recordChangesetApply and recordSocketEmit from new prom-instruments module
• Add getPadUsersMap() function to derive per-pad user counts from sessioninfos
• Instrument handleCustomObjectMessage() to record broadcast emits
• Instrument handleCustomMessage() to record broadcast emits
• Instrument sendChatMessageToPadClients() to record CHAT_MESSAGE emits
• Instrument handleUserChanges() to measure changeset apply duration
• Instrument updatePadClients() to record NEW_CHANGES emits per socket

src/node/handler/PadMessageHandler.ts


3. src/node/prometheus.ts ✨ Enhancement +14/-0

Register new metrics and populate pad users gauge

• Import and register three new metrics from prom-instruments.ts with central registry
• Update monitor() function to populate padUsersGauge by calling getPadUsersMap()
• Reset padUsersGauge on each scrape to avoid stale labels for drained pads

src/node/prometheus.ts



4. src/tests/backend-new/specs/prom-instruments.test.ts 🧪 Tests +47/-0

Smoke tests for metric recording helpers

• New smoke test file verifying recordSocketEmit() increments counter by message type
• Test verifies fallback to "unknown" label when type is undefined
• Test verifies recordChangesetApply() observes duration in histogram
• Tests reset metrics before each case to ensure isolation

src/tests/backend-new/specs/prom-instruments.test.ts


Grey Divider

Qodo Logo

@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented May 15, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)

Grey Divider


Action required

1. New Prometheus metrics lack flag ✓ Resolved 📘 Rule violation ⚙ Maintainability
Description
The PR introduces new Prometheus metrics (etherpad_pad_users,
etherpad_changeset_apply_duration_seconds, etherpad_socket_emits_total) that are registered and
emitted unconditionally. This violates the requirement that new features be gated behind a feature
flag and disabled by default.
Code

src/node/prometheus.ts[R26-37]

+// Added for the #7756 scaling dive: lets the load-test harness attribute
+// where time goes (apply path vs. fan-out) and confirm per-pad concurrency.
+// The metric handles live in prom-instruments.ts to avoid a circular import
+// with PadMessageHandler (which records into them on the hot path).
+import {padUsersGauge, changesetApplyDuration, socketEmitsTotal} from './prom-instruments';
+register.registerMetric(padUsersGauge);
+register.registerMetric(changesetApplyDuration);
+register.registerMetric(socketEmitsTotal);
+
client.collectDefaultMetrics({register});
const monitor = async function () {
Evidence
PR Compliance ID 9 requires new features to be behind a feature flag and disabled by default. The
diff shows the new metrics are registered in src/node/prometheus.ts and emitted/recorded from
PadMessageHandler with no flag checks or configuration gate, meaning they are enabled by default.

src/node/prometheus.ts[26-48]
src/node/handler/PadMessageHandler.ts[606-610]
Best Practice: Repository guidelines

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
New Prometheus metrics are always enabled, but compliance requires new features to be behind a feature flag and disabled by default.
## Issue Context
This PR registers three new metrics and emits them on the hot path and in the Prometheus monitor loop without any enable/disable mechanism.
## Fix Focus Areas
- src/node/prometheus.ts[26-48]
- src/node/handler/PadMessageHandler.ts[50-50]
- src/node/handler/PadMessageHandler.ts[609-609]
- src/node/handler/PadMessageHandler.ts[630-630]
- src/node/handler/PadMessageHandler.ts[671-671]
- src/node/handler/PadMessageHandler.ts[791-791]
- src/node/handler/PadMessageHandler.ts[904-904]
- src/node/handler/PadMessageHandler.ts[957-957]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Histogram includes fan-out ✓ Resolved 🐞 Bug ≡ Correctness
Description
handleUserChanges() starts etherpad_changeset_apply_duration_seconds before any processing and only
stops it in finally, after updatePadClients() completes, so the histogram includes broadcast/fan-out
time rather than isolating the apply path. This prevents the scaling-dive analysis from
distinguishing “apply is slow” vs “fan-out is slow”.
Code

src/node/handler/PadMessageHandler.ts[791]

+  const stopHistogram = recordChangesetApply();
Evidence
The timer is started at the top of handleUserChanges(), then updatePadClients() is awaited, and only
afterwards is stopHistogram() called in finally, so the histogram necessarily includes fan-out time.

src/node/handler/PadMessageHandler.ts[789-905]
src/node/prom-instruments.ts[11-15]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`etherpad_changeset_apply_duration_seconds` is intended to time the *apply path*, but the timer currently spans the entire `handleUserChanges()` including `await exports.updatePadClients(pad)` (fan-out). This makes the metric misleading and defeats the PR’s stated purpose.
## Issue Context
- The timer is started near the beginning of `handleUserChanges()` and stopped in `finally`.
- `updatePadClients()` is awaited before the timer is stopped, so its work is included in the duration.
## Fix Focus Areas
- src/node/handler/PadMessageHandler.ts[788-905]
## Implementation notes
- Move `recordChangesetApply()` start to immediately before the “apply” work you want to measure (e.g., just before `pad.appendRevision(...)`).
- Call the returned `stopHistogram()` immediately after the apply work completes and **before** any socket emits / `updatePadClients()` fan-out.
- If you also want total end-to-end latency, keep using the existing `stats.timer('edits')` (or add a second histogram explicitly for fan-out/total).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

3. Unbounded metrics label ✓ Resolved 🐞 Bug ☼ Reliability
Description
handleCustomMessage() increments etherpad_socket_emits_total with msgString (an HTTP API parameter)
as the label value, allowing an API caller to generate unbounded distinct label values and
potentially exhaust memory/CPU via high-cardinality time series. Although the endpoint is API-key
protected, this still creates a sharp footgun if the key is leaked or a client misbehaves.
Code

src/node/handler/PadMessageHandler.ts[630]

+  recordSocketEmit(msg.data.type);
Evidence
The HTTP API passes an arbitrary msg string into handleCustomMessage(), which assigns it to
data.type and uses it as the type label for the Prometheus counter; prom-client counters create
distinct time series per label value.

src/node/db/API.ts[863-866]
src/node/handler/PadMessageHandler.ts[614-631]
src/node/prom-instruments.ts[17-21]
src/node/prom-instruments.ts[34-36]
src/node/handler/APIHandler.ts[180-185]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`etherpad_socket_emits_total{type=...}` uses the message type as a Prometheus label. In `handleCustomMessage()`, that type comes directly from the HTTP API `msg` parameter, which means label cardinality is unbounded.
## Issue Context
- `sendClientsMessage(padID, msg)` passes `msg` directly into `PadMessageHandler.handleCustomMessage(padID, msg)`.
- `handleCustomMessage()` sets `data.type = msgString` and then calls `recordSocketEmit(msg.data.type)`.
- `recordSocketEmit()` uses `socketEmitsTotal.labels(type).inc()`, creating a new time series per distinct `type`.
## Fix Focus Areas
- src/node/handler/PadMessageHandler.ts[620-631]
- src/node/db/API.ts[863-866]
- src/node/prom-instruments.ts[17-21]
- src/node/prom-instruments.ts[34-36]
## Implementation notes
Pick one of:
1) **Normalize/bucket**: Map any unknown/untrusted type to a small bounded set (e.g., `API_CUSTOM_MESSAGE`, `CUSTOM`, `other`, `unknown`).
2) **Allowlist**: Only allow a fixed set of label values (e.g., `NEW_CHANGES`, `CHAT_MESSAGE`, etc.); everything else becomes `other`.
3) **Remove label at this site**: For `handleCustomMessage()`, increment with a constant label regardless of `msgString`.
Also consider enforcing a max length/character set if you keep any user-provided label value.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread src/node/prometheus.ts
Comment thread src/node/handler/PadMessageHandler.ts Outdated
JohnMcLear added a commit to ether/etherpad-load-test that referenced this pull request May 15, 2026
…lly emits (#99)

Confirmed from the first real dive run (25934713423): core's
/stats/prometheus uses prom-client's collectDefaultMetrics output
(process_cpu_user_seconds_total, nodejs_eventloop_lag_p95_seconds,
process_resident_memory_bytes, ...) — not the nodejs_cpu_gauge /
nodejs_eventloop_latency_gauge names that src/node/metrics.ts
defines but never registers. The Scraper's default allowlist was
filtering EVERYTHING out, so all dive reports had empty cpu_user /
evloop_p95_ms / rss_mb columns.

Two changes:

1. Update DEFAULT_KEEP prefixes to match real prom-client names.
   Includes 'etherpad_' as a single prefix that covers all current
   and future etherpad_ rows (including the three added in
   ether/etherpad#7762).

2. Update Reporter CSV column mapping to read process_cpu_user_seconds_total,
   nodejs_eventloop_lag_p95_seconds, and process_resident_memory_bytes
   (converting seconds -> ms and bytes -> MB as before).

CSV column names stay stable; only the underlying lookup keys change.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… label cardinality

Three issues raised on the initial PR:

1. **Feature flag.** Per project compliance rule, new features must
   be behind a flag and disabled by default. Adds
   `settings.scalingDiveMetrics` (default `false`). When off,
   recordSocketEmit() / recordChangesetApply() short-circuit to
   no-ops and the metrics are never even registered with the
   Prometheus register. Enable only when running the
   ether/etherpad-load-test scaling-dive harness.

2. **Histogram scope.** Previously the
   etherpad_changeset_apply_duration_seconds timer wrapped the
   whole handleUserChanges() body — including
   `await exports.updatePadClients(pad)` — so the histogram
   measured apply+fan-out, defeating its stated purpose. Now
   stopped immediately after the apply work (`assert.equal(...rev,
   r)`), before the ACCEPT_COMMIT socket emit and the
   updatePadClients call. Failed applies deliberately don't observe
   so the success-path distribution stays clean.

3. **Label cardinality.** handleCustomMessage was passing the
   user-supplied msgString (an HTTP-API param) directly as the
   `type` label value. A misbehaving API caller could grow
   prom-client's internal label map until OOM. Now bucketed against
   a known-types allowlist; anything outside it lands in `other`.

Tests updated: 5/5 — covers happy path, "other" bucketing of
unknown/unsafe labels, and that the flag-disabled state is a true
no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JohnMcLear JohnMcLear merged commit 986f139 into develop May 15, 2026
31 checks passed
@JohnMcLear JohnMcLear deleted the feat/scaling-dive-counters branch May 15, 2026 18:54
JohnMcLear added a commit that referenced this pull request May 16, 2026
Two findings from rigorous N=3 scoring:

1. Lever 3 (#7768) is NOT a perf win. When you compare like-for-
   like matrix entries (develop-baseline vs PR-baseline), the
   per-socket serialization is slightly net-negative across the
   curve. My earlier "70% drop" was a single-run outlier; the
   subsequent "tighter envelope" was a cross-matrix-entry
   comparison confounded by noise. The serialization is still a
   real correctness fix (race on concurrent fan-outs + lost
   revisions on emit error) so the PR stays open, but the
   recommendation is now correctness-only.

2. Lever 8b (#7774) — engine.io flush deferral. The follow-up to
   the closed lever 8 that actually patches Socket.sendPacket
   instead of just transport.send. queueMicrotask-coalesced flush
   gives the transport multi-packet batches to work with at last.
   N=3 shows tighter tail at step 300-350 (122 → 110 max at 350,
   71 → 58 max at 300). Not a cliff-mover. The only PR in this
   program with N=3-confirmed perf benefit.

Final disposition:
- Merge: #7774 (modest perf), #7768 (correctness), #7762 (already
  merged, instruments).
- The cliff at 350-400 authors is hardware-bound on a 4-vCPU
  runner, not code-bound. Production with more cores per host
  scales proportionally with no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

1 participant