Added api_top_pages_v3 endpoint with daily materialized view #26060
Uses pre-aggregated daily data to reduce query scans from millions of raw hits to thousands of MV rows. Prevents memory limit errors at scale. Includes benchmark and comparison scripts for validation.
Walkthrough
This PR introduces a new optimized Tinybird endpoint (`api_top_pages_v3`) for querying top pages data. It includes a materialized view data source and pipe for daily page visit pre-aggregation, reducing query scan volume from millions of raw hits to aggregated records. The endpoint combines historical pre-aggregated daily data with a real-time daily window. Supporting changes include test fixtures covering various filter combinations, benchmarking and validation scripts for performance comparison, and modifications to the Tinybird service layer to support versioned endpoint URLs.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@ghost/core/core/server/data/tinybird/scripts/benchmark-top-pages.sh`:
- Around line 39-42: The DATE_TO/DATE_FROM calculation uses local time which can
shift ranges for APIs expecting UTC; replace the current date calls in
benchmark-top-pages.sh so they run date in the UTC timezone (use TZ=UTC before
the date invocation) rather than adding the GNU-only -u flag, and keep the
existing fallback that tries BSD-style (-v-30d) then GNU-style (-d '30 days
ago'); update DATE_TO, DATE_FROM (and remove reliance on a separate TIMEZONE
variable if redundant) so both BSD/macOS and GNU/Linux produce UTC YYYY-MM-DD
values.
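A minimal sketch of the UTC date calculation described above, assuming the script only needs `YYYY-MM-DD` values for `DATE_FROM` and `DATE_TO`. The helper names `utc_today` and `utc_days_ago` are hypothetical and not taken from benchmark-top-pages.sh.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not from the PR's script.

# Today's date in UTC, formatted as YYYY-MM-DD.
utc_today() {
  TZ=UTC date +%Y-%m-%d
}

# The date N days ago in UTC. Tries the BSD/macOS form first (-v-Nd),
# then falls back to the GNU/Linux form (-d 'N days ago').
utc_days_ago() {
  local days="$1"
  TZ=UTC date -v-"${days}"d +%Y-%m-%d 2>/dev/null \
    || TZ=UTC date -d "${days} days ago" +%Y-%m-%d
}

DATE_TO="$(utc_today)"
DATE_FROM="$(utc_days_ago 30)"
echo "date_from=${DATE_FROM} date_to=${DATE_TO}"
```

Setting `TZ=UTC` on each invocation keeps both branches of the fallback in the same timezone, so the range does not shift depending on the machine's local time.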
In `@ghost/core/core/server/data/tinybird/scripts/compare-top-pages.sh`:
- Around line 63-70: The SITE_UUID assignment in compare-top-pages.sh currently
captures the full Tinybird JSON response into SITE_UUID; update the SITE_UUID
extraction to parse Tinybird's JSON and pull the first row's site_uuid from the
data array (e.g., use curl to call "${TB_HOST}/v0/sql?q=..." and pipe to jq to
extract .data[0].site_uuid), keep the same auth header variables
(TB_TOKEN/TB_HOST), and preserve the existing empty-check/exit logic that
follows; change the SITE_UUID variable assignment only and ensure the script
fails if jq returns empty.
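The extraction suggested above could look roughly like the following. This is a sketch under assumptions: the function names `parse_site_uuid` and `fetch_site_uuid` are hypothetical, the SQL query is passed in by the caller rather than guessed here, and the query is assumed to request JSON output (for example by ending in `FORMAT JSON`) so that the response carries a `data` array.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative, not from compare-top-pages.sh.

# Reads a Tinybird JSON response on stdin and prints the first row's site_uuid.
# Returns non-zero if the field is missing, null, or the data array is empty.
parse_site_uuid() {
  local uuid
  uuid="$(jq -r '.data[0].site_uuid // empty')" || return 1
  if [ -z "$uuid" ]; then
    echo "Error: no site_uuid found in Tinybird response" >&2
    return 1
  fi
  printf '%s\n' "$uuid"
}

# Runs the caller-supplied SQL query against ${TB_HOST}/v0/sql using
# ${TB_TOKEN}, then extracts the site_uuid from the JSON response.
fetch_site_uuid() {
  local query="$1"
  curl -fsS -G -H "Authorization: Bearer ${TB_TOKEN}" \
    --data-urlencode "q=${query}" \
    "${TB_HOST}/v0/sql" | parse_site_uuid
}

# Example usage (query text is up to the caller):
#   SITE_UUID="$(fetch_site_uuid "$QUERY")" || exit 1
```

Keeping the parsing step separate from the HTTP call means the empty-check can be exercised against a saved response without contacting Tinybird.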
🧹 Nitpick comments (1)
ghost/core/core/server/data/tinybird/scripts/compare-top-pages.sh (1)
94-103: Fail fast on HTTP errors from Tinybird.
`curl -s` won't fail on 4xx/5xx, which can mask errors until JSON parsing. Consider adding `-fS` so non-2xx responses stop the script with a clear error.
🔧 Suggested change

    -curl -s -H "Authorization: Bearer $TB_TOKEN" \
    +curl -fsS -H "Authorization: Bearer $TB_TOKEN" \
       "${TB_HOST}/v0/pipes/api_top_pages.json?site_uuid=${SITE_UUID}&date_from=${DATE_FROM}&date_to=${DATE_TO}&timezone=${TIMEZONE}&limit=${LIMIT}" \
       > "$V1_RESPONSE"
    @@
    -curl -s -H "Authorization: Bearer $TB_TOKEN" \
    +curl -fsS -H "Authorization: Bearer $TB_TOKEN" \
       "${TB_HOST}/v0/pipes/api_top_pages_v3.json?site_uuid=${SITE_UUID}&date_from=${DATE_FROM}&date_to=${DATE_TO}&timezone=${TIMEZONE}&limit=${LIMIT}" \
       > "$V3_RESPONSE"
ref https://linear.app/ghost/issue/NY-970/
Summary
Adds `api_top_pages_v3` endpoint using a daily materialized view for improved performance at scale.
Changes
- `endpoints/api_top_pages_v3.pipe`
- `pipes/mv_daily_pages.pipe`
- `datasources/_mv_daily_pages.datasource`
- `tests/api_top_pages_v3.yaml`
- `scripts/benchmark-top-pages.sh`
- `scripts/compare-top-pages.sh`
Benchmarks (10M rows)
Why differences?
v3 may show 1-2 fewer visits for some pages. This is more accurate: v1 over-counts when a session crosses midnight at a date boundary, because it counts page views that fall outside the requested range.
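To make the boundary effect concrete, here is a toy model in shell and awk. It is an illustration under assumptions, not the actual v1 or v3 pipe logic: it assumes v1 behaves like a session-scoped filter (every hit of a session that started in range is counted) and v3 like a per-day filter (only hits whose own date is in range are counted). The sample hits and the function names are made up.

```shell
#!/usr/bin/env bash
# Toy data (made up): session_id,timestamp,pathname. Session s1 crosses midnight.
hits='s2,2024-01-31 10:00:00,/a
s1,2024-01-31 23:58:00,/a
s1,2024-02-01 00:03:00,/b'

# Assumed v1-like behaviour: count a visit for every page touched by a
# session whose first hit falls inside [from, to], regardless of hit date.
count_by_session_start() {
  printf '%s\n' "$hits" | awk -F, -v from="$1" -v to="$2" '
    { day = substr($2, 1, 10) }
    !($1 in start) { start[$1] = day }   # first hit per session = session start
    start[$1] >= from && start[$1] <= to { if (!seen[$1 FS $3]++) visits[$3]++ }
    END { for (p in visits) print p, visits[p] }' | sort
}

# Assumed v3-like behaviour: count a visit only when the hit itself
# falls inside [from, to].
count_by_hit_day() {
  printf '%s\n' "$hits" | awk -F, -v from="$1" -v to="$2" '
    { day = substr($2, 1, 10) }
    day >= from && day <= to { if (!seen[$1 FS $3]++) visits[$3]++ }
    END { for (p in visits) print p, visits[p] }' | sort
}

echo "session-scoped:"; count_by_session_start 2024-01-31 2024-01-31
# prints "/a 2" then "/b 1"
echo "per-day:";        count_by_hit_day 2024-01-31 2024-01-31
# prints "/a 2"
```

In this model the `/b` view at 00:03 on 2024-02-01 is attributed to the 2024-01-31 range by the session-scoped count but not by the per-day count, which matches the kind of 1-2 visit difference described above.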
Test plan
- `yarn dev:analytics`
- `yarn data:analytics:generate` (you can use `yarn data:analytics:generate 5000000` for larger test data sets)
- `./benchmark-top-pages.sh 5` to verify performance
- `./compare-top-pages.sh --limit 100` to validate accuracy