Skip to content

Conversation

@ultmaster
Copy link
Contributor

@ultmaster ultmaster commented Dec 17, 2025

  1. Fix workers that are trying to acquire GPU status.
  2. Fix stale system_snapshot causing the whole rollout to hang.
  3. Add more debugging logs
  4. Fix find_nearest_store_method performance issue
  5. benchmark fixes and tests.

It's now very close to all-pass.

Copilot AI review requested due to automatic review settings December 17, 2025 02:11
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes store benchmark performance by replacing expensive stack introspection with a fast ContextVar-based approach for tracking method execution, and upgrades benchmark runner infrastructure for better resource allocation.

  • Replaces inspect-based stack walking with O(1) ContextVar lookups for method tracking
  • Refactors the @tracked decorator to use token-based ContextVar management for proper cleanup
  • Upgrades benchmark runner pools and worker counts for high-load scenarios

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
agentlightning/store/collection_based.py Replaces stack introspection with ContextVar-based tracking; refactors @tracked decorator to set/reset context variables using tokens; removes nearest_lightning_store_method_from_stack() function and inspect import
agentlightning/store/collection/base.py Updates import and function call from nearest_lightning_store_method_from_stack() to get_current_store_methods()
.github/workflows/benchmark.yml Increases worker counts (32→64, 96→96, 64→64) and switches runner pool from agl-runner-cpu to agl-runner-cpu-high for three high-load benchmark scenarios

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ultmaster
Copy link
Contributor Author

/ci

@github-actions
Copy link

github-actions bot commented Dec 17, 2025

🚀 CI Watcher for correlation id-3665139471-mj9zqb2d triggered by comment 3665139471
🏃‍♀️ Tracking 6 workflow run(s):

✅ All runs completed.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 541 to 552
age = time.monotonic() - ts
if age > stale_after:
# Only warn once per stale snapshot (check if we haven't warned about this timestamp yet)
if last_warned_ts != ts:
logger.warning(
"%s Heartbeat consumer: snapshot stale (age=%.2fs > %.2fs); skipping update.",
self._log_prefix(),
age,
stale_after,
)
last_warned_ts = ts
continue
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The consumer thread's while loop does not wait/sleep when the snapshot is stale (age > stale_after). This causes a busy-wait loop that could consume unnecessary CPU while waiting for a new snapshot. Consider adding a small sleep when the snapshot is stale.

Copilot uses AI. Check for mistakes.
Comment on lines +297 to +305
async def emit_progress(progress_made: bool) -> None:
if progress_made:
async with active_lock:
pending_ids = list(active_rollouts)
await tracker.handle_progress(progress_made=True, pending_rollout_ids=pending_ids, store=store)
return
async with active_lock:
pending_ids = list(active_rollouts)
await tracker.handle_progress(progress_made=False, pending_rollout_ids=pending_ids, store=store)
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The emit_progress function duplicates logic for acquiring the active_lock and getting pending_ids. The logic for when progress_made is True unnecessarily acquires the lock twice (once to get pending_ids before the handle_progress call, and potentially again if that function were to need it). Consider simplifying to acquire the lock once regardless of progress_made value.

Copilot uses AI. Check for mistakes.
Comment on lines +90 to +91
# ContextVars for tracking the current store method without expensive stack introspection.
# These are set by the @tracked decorator and read by tracking_context in collection/base.py.
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment mentions "tracking_context in collection/base.py" but this function could be called from multiple places. Consider updating the comment to be more general, such as "read by get_current_store_methods()" to better reflect the actual usage pattern.

Copilot uses AI. Check for mistakes.
Comment on lines +28 to +30
class RolloutProgressTracker:
"""Helper for tracking rollout progress and surfacing stale worker states."""

Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class docstring should document the max_stale_seconds parameter.

Copilot uses AI. Check for mistakes.
Comment on lines 536 to 539
if snap is None:
# probably just started
logger.debug("%s Heartbeat consumer: no snapshot yet; skipping update.", self._log_prefix())
continue
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The consumer thread's while loop does not wait/sleep when snap is None (no snapshot yet). This causes a busy-wait loop that could consume unnecessary CPU while waiting for the first snapshot to be produced. Consider adding a small sleep when no snapshot is available.

Copilot uses AI. Check for mistakes.
@ultmaster ultmaster merged commit 9f178ac into main Dec 17, 2025
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment