Skip to content

Conversation

@9larsons
Copy link
Contributor

@9larsons 9larsons commented Jan 29, 2026

ref https://linear.app/ghost/issue/NY-970/

Summary

Adds api_top_pages_v3 endpoint using a daily materialized view for improved performance at scale.

Changes

File Purpose
endpoints/api_top_pages_v3.pipe New endpoint using daily MV
pipes/mv_daily_pages.pipe Daily aggregation of page visits
datasources/_mv_daily_pages.datasource MV storage schema
tests/api_top_pages_v3.yaml Endpoint tests
scripts/benchmark-top-pages.sh Performance comparison
scripts/compare-top-pages.sh Result validation

Benchmarks (10M rows)

Query v1 v3 Improvement
30-day 450ms, 12M rows 150ms, 25K rows 3x faster
All-time ❌ MEMORY_LIMIT 300ms, 160K rows Works vs crashes

Why differences?

v3 may show 1-2 fewer visits for some pages. This is more accurate — v1 over-counts when sessions cross midnight at date boundaries (counts page views outside the requested range).

Test plan

  • Run yarn dev:analytics
  • Run yarn data:analytics:generate (you can use yarn data:analytics:generate 5000000 for larger test data sets)
  • Run ./benchmark-top-pages.sh 5 to verify performance
  • Run ./compare-top-pages.sh --limit 100 to validate accuracy
Uses pre-aggregated daily data to reduce query scans from millions of
raw hits to thousands of MV rows. Prevents memory limit errors at scale.

Includes benchmark and comparison scripts for validation.
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 29, 2026

Walkthrough

This PR introduces a new optimized Tinybird endpoint (api_top_pages_v3) for querying top pages data. It includes a materialized view data source and pipe for daily page visit pre-aggregation, reducing query scan volume from millions of raw hits to aggregated records. The endpoint combines historical pre-aggregated daily data with a real-time daily window. Supporting changes include test fixtures covering various filter combinations, benchmarking and validation scripts for performance comparison, and modifications to the Tinybird service layer to support versioned endpoint URLs.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the primary change: adding a new api_top_pages_v3 endpoint that uses a daily materialized view for performance improvements.
Description check ✅ Passed The pull request description clearly describes the changeset, including the purpose (new api_top_pages_v3 endpoint), files modified, performance benchmarks, and behavioral differences with test instructions.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@ghost/core/core/server/data/tinybird/scripts/benchmark-top-pages.sh`:
- Around line 39-42: The DATE_TO/DATE_FROM calculation uses local time which can
shift ranges for APIs expecting UTC; replace the current date calls in
benchmark-top-pages.sh so they run date in the UTC timezone (use TZ=UTC before
the date invocation) rather than adding the GNU-only -u flag, and keep the
existing fallback that tries BSD-style (-v-30d) then GNU-style (-d '30 days
ago'); update DATE_TO, DATE_FROM (and remove reliance on a separate TIMEZONE
variable if redundant) so both BSD/macOS and GNU/Linux produce UTC YYYY-MM-DD
values.

In `@ghost/core/core/server/data/tinybird/scripts/compare-top-pages.sh`:
- Around line 63-70: The SITE_UUID assignment in compare-top-pages.sh currently
captures the full Tinybird JSON response into SITE_UUID; update the SITE_UUID
extraction to parse Tinybird's JSON and pull the first row's site_uuid from the
data array (e.g., use curl to call "${TB_HOST}/v0/sql?q=..." and pipe to jq to
extract .data[0].site_uuid), keep the same auth header variables
(TB_TOKEN/TB_HOST), and preserve the existing empty-check/exit logic that
follows; change the SITE_UUID variable assignment only and ensure the script
fails if jq returns empty.
🧹 Nitpick comments (1)
ghost/core/core/server/data/tinybird/scripts/compare-top-pages.sh (1)

94-103: Fail fast on HTTP errors from Tinybird.

curl -s won’t fail on 4xx/5xx, which can mask errors until JSON parsing. Consider -fS so non-2xx responses stop the script with a clear error.

🔧 Suggested change
-curl -s -H "Authorization: Bearer $TB_TOKEN" \
+curl -fsS -H "Authorization: Bearer $TB_TOKEN" \
     "${TB_HOST}/v0/pipes/api_top_pages.json?site_uuid=${SITE_UUID}&date_from=${DATE_FROM}&date_to=${DATE_TO}&timezone=${TIMEZONE}&limit=${LIMIT}" \
     > "$V1_RESPONSE"
@@
-curl -s -H "Authorization: Bearer $TB_TOKEN" \
+curl -fsS -H "Authorization: Bearer $TB_TOKEN" \
     "${TB_HOST}/v0/pipes/api_top_pages_v3.json?site_uuid=${SITE_UUID}&date_from=${DATE_FROM}&date_to=${DATE_TO}&timezone=${TIMEZONE}&limit=${LIMIT}" \
     > "$V3_RESPONSE"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants