-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Minor optimizations to store benchmark #421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR optimizes store benchmark performance by replacing expensive stack introspection with a fast ContextVar-based approach for tracking method execution, and upgrades benchmark runner infrastructure for better resource allocation.
- Replaces
inspect-based stack walking with O(1) ContextVar lookups for method tracking - Refactors the
@trackeddecorator to use token-based ContextVar management for proper cleanup - Upgrades benchmark runner pools and worker counts for high-load scenarios
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| agentlightning/store/collection_based.py | Replaces stack introspection with ContextVar-based tracking; refactors @tracked decorator to set/reset context variables using tokens; removes nearest_lightning_store_method_from_stack() function and inspect import |
| agentlightning/store/collection/base.py | Updates import and function call from nearest_lightning_store_method_from_stack() to get_current_store_methods() |
| .github/workflows/benchmark.yml | Increases worker counts (32→64, 96→96, 64→64) and switches runner pool from agl-runner-cpu to agl-runner-cpu-high for three high-load benchmark scenarios |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/ci |
|
🚀 CI Watcher for correlation id-3665139471-mj9zqb2d triggered by comment 3665139471
✅ All runs completed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| age = time.monotonic() - ts | ||
| if age > stale_after: | ||
| # Only warn once per stale snapshot (check if we haven't warned about this timestamp yet) | ||
| if last_warned_ts != ts: | ||
| logger.warning( | ||
| "%s Heartbeat consumer: snapshot stale (age=%.2fs > %.2fs); skipping update.", | ||
| self._log_prefix(), | ||
| age, | ||
| stale_after, | ||
| ) | ||
| last_warned_ts = ts | ||
| continue |
Copilot
AI
Dec 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The consumer thread's while loop does not wait/sleep when the snapshot is stale (age > stale_after). This causes a busy-wait loop that could consume unnecessary CPU while waiting for a new snapshot. Consider adding a small sleep when the snapshot is stale.
| async def emit_progress(progress_made: bool) -> None: | ||
| if progress_made: | ||
| async with active_lock: | ||
| pending_ids = list(active_rollouts) | ||
| await tracker.handle_progress(progress_made=True, pending_rollout_ids=pending_ids, store=store) | ||
| return | ||
| async with active_lock: | ||
| pending_ids = list(active_rollouts) | ||
| await tracker.handle_progress(progress_made=False, pending_rollout_ids=pending_ids, store=store) |
Copilot
AI
Dec 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The emit_progress function duplicates logic for acquiring the active_lock and getting pending_ids. The logic for when progress_made is True unnecessarily acquires the lock twice (once to get pending_ids before the handle_progress call, and potentially again if that function were to need it). Consider simplifying to acquire the lock once regardless of progress_made value.
| # ContextVars for tracking the current store method without expensive stack introspection. | ||
| # These are set by the @tracked decorator and read by tracking_context in collection/base.py. |
Copilot
AI
Dec 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment mentions "tracking_context in collection/base.py" but this function could be called from multiple places. Consider updating the comment to be more general, such as "read by get_current_store_methods()" to better reflect the actual usage pattern.
| class RolloutProgressTracker: | ||
| """Helper for tracking rollout progress and surfacing stale worker states.""" | ||
|
|
Copilot
AI
Dec 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The class docstring should document the max_stale_seconds parameter.
| if snap is None: | ||
| # probably just started | ||
| logger.debug("%s Heartbeat consumer: no snapshot yet; skipping update.", self._log_prefix()) | ||
| continue |
Copilot
AI
Dec 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The consumer thread's while loop does not wait/sleep when snap is None (no snapshot yet). This causes a busy-wait loop that could consume unnecessary CPU while waiting for the first snapshot to be produced. Consider adding a small sleep when no snapshot is available.
It's now very close to all-pass.