Minor optimizations to store benchmark #421

ultmaster · 2025-12-17T02:11:42Z

Fix workers that are trying to acquire GPU status.
Fix stale system_snapshot causing the whole rollout to hang.
Add more debugging logs.
Fix find_nearest_store_method performance issue.
Benchmark fixes and tests.

The benchmark suite is now very close to all-pass.

Copilot

Pull request overview

This PR optimizes store benchmark performance by replacing expensive stack introspection with a fast ContextVar-based approach for tracking method execution, and upgrades benchmark runner infrastructure for better resource allocation.

Replaces inspect-based stack walking with O(1) ContextVar lookups for method tracking
Refactors the @tracked decorator to use token-based ContextVar management for proper cleanup
Upgrades benchmark runner pools and worker counts for high-load scenarios
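
A minimal sketch of the token-based ContextVar pattern summarized above, not the PR's actual code. The variable `_current_method`, the helper `current_method()`, and the decorator signature `tracked(name)` are assumptions made for illustration; the real `@tracked` decorator and `get_current_store_methods()` may differ in shape and return type.

```python
import asyncio
import functools
from contextvars import ContextVar
from typing import Any, Awaitable, Callable, Optional, Tuple

# Hypothetical name: holds the name of the tracked method currently executing.
_current_method: ContextVar[Optional[str]] = ContextVar("current_method", default=None)


def tracked(name: str) -> Callable[[Callable[..., Awaitable[Any]]], Callable[..., Awaitable[Any]]]:
    """Record `name` in a ContextVar for the duration of the wrapped coroutine."""

    def decorator(func: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
        @functools.wraps(func)
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            token = _current_method.set(name)  # constant-time, no stack walking
            try:
                return await func(*args, **kwargs)
            finally:
                # Restore whatever value was active before, even if func raised.
                _current_method.reset(token)

        return wrapper

    return decorator


def current_method() -> Optional[str]:
    """Constant-time lookup that stands in for the removed stack inspection."""
    return _current_method.get()


@tracked("inner")
async def inner() -> Optional[str]:
    return current_method()


@tracked("outer")
async def outer() -> Tuple[Optional[str], Optional[str], Optional[str]]:
    before = current_method()
    during = await inner()
    after = current_method()  # the token reset in inner() restores "outer"
    return before, during, after


def demo() -> Tuple[Tuple[Optional[str], Optional[str], Optional[str]], Optional[str]]:
    return asyncio.run(outer()), current_method()
```

Resetting with the token rather than setting the variable back to `None` is what keeps nested tracked calls correct: the outer method's name is visible again once the inner call returns.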

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
agentlightning/store/collection_based.py	Replaces stack introspection with ContextVar-based tracking; refactors `@tracked` decorator to set/reset context variables using tokens; removes `nearest_lightning_store_method_from_stack()` function and `inspect` import
agentlightning/store/collection/base.py	Updates import and function call from `nearest_lightning_store_method_from_stack()` to `get_current_store_methods()`
.github/workflows/benchmark.yml	Increases worker counts (32→64, 96→96, 64→64) and switches runner pool from `agl-runner-cpu` to `agl-runner-cpu-high` for three high-load benchmark scenarios

ultmaster · 2025-12-17T12:31:11Z

/ci

github-actions · 2025-12-17T12:31:25Z

🚀 CI Watcher for correlation id-3665139471-mj9zqb2d triggered by comment 3665139471
🏃‍♀️ Tracking 6 workflow run(s):

🟢 GPU Test - PR #421 - ci-gpu - id-3665139471-mj9zqb2d — completed/success
🟢 RAG - PR #421 - ci-rag - id-3665139471-mj9zqb2d — completed/success
🟢 ChartQA - PR #421 - ci-chartqa - id-3665139471-mj9zqb2d — completed/success
🟢 Spider - PR #421 - ci-spider - id-3665139471-mj9zqb2d — completed/success
🟢 Calc-X - PR #421 - ci-calc-x - id-3665139471-mj9zqb2d — completed/success
🟢 APO - PR #421 - ci-apo - id-3665139471-mj9zqb2d — completed/success

✅ All runs completed.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Copilot · 2025-12-17T15:19:05Z

agentlightning/runner/agent.py

+                    age = time.monotonic() - ts
+                    if age > stale_after:
+                        # Only warn once per stale snapshot (check if we haven't warned about this timestamp yet)
+                        if last_warned_ts != ts:
+                            logger.warning(
+                                "%s Heartbeat consumer: snapshot stale (age=%.2fs > %.2fs); skipping update.",
+                                self._log_prefix(),
+                                age,
+                                stale_after,
+                            )
+                            last_warned_ts = ts
+                        continue


The consumer thread's while loop does not wait/sleep when the snapshot is stale (age > stale_after). This causes a busy-wait loop that could consume unnecessary CPU while waiting for a new snapshot. Consider adding a small sleep when the snapshot is stale.
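
One way to address this, sketched under the assumption that the loop does not already block elsewhere in its body (the hunk above does not show the loop header). The function `consume`, its parameters, and the `(timestamp, payload)` snapshot shape are hypothetical and not taken from agentlightning/runner/agent.py. Placing `threading.Event.wait(timeout)` in the loop condition throttles every `continue` path, covering both a stale snapshot and a missing one, while still allowing a prompt shutdown.

```python
import threading
import time
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical snapshot shape: (monotonic timestamp, payload).
Snapshot = Tuple[float, Dict[str, str]]


def consume(
    get_snapshot: Callable[[], Optional[Snapshot]],
    publish: Callable[[Dict[str, str]], None],
    stop: threading.Event,
    interval: float = 0.05,
    stale_after: float = 1.0,
) -> None:
    """Poll for snapshots, pausing up to `interval` seconds before each iteration."""
    # Event.wait acts as an interruptible sleep: it returns True as soon as
    # `stop` is set and False once `interval` seconds have elapsed.
    while not stop.wait(interval):
        snap = get_snapshot()
        if snap is None:
            continue  # nothing produced yet; the wait above prevents spinning
        ts, payload = snap
        if time.monotonic() - ts > stale_after:
            continue  # stale snapshot; also throttled by the wait above
        publish(payload)


def demo() -> List[Dict[str, str]]:
    stop = threading.Event()
    published: List[Dict[str, str]] = []
    now = time.monotonic()
    # Feed one missing, one stale, and one fresh snapshot, then stop the loop.
    feed = iter([None, (now - 60.0, {"id": "stale"}), (now, {"id": "fresh"})])

    def get_snapshot() -> Optional[Snapshot]:
        try:
            return next(feed)
        except StopIteration:
            stop.set()
            return None

    consume(get_snapshot, published.append, stop, interval=0.01, stale_after=5.0)
    return published
```

A plain `time.sleep()` before each `continue` would also remove the busy-wait, at the cost of delaying shutdown by up to one sleep interval.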

Copilot · 2025-12-17T15:19:05Z

tests/benchmark/benchmark_store.py

+        async def emit_progress(progress_made: bool) -> None:
+            if progress_made:
+                async with active_lock:
+                    pending_ids = list(active_rollouts)
+                await tracker.handle_progress(progress_made=True, pending_rollout_ids=pending_ids, store=store)
+                return
+            async with active_lock:
+                pending_ids = list(active_rollouts)
+            await tracker.handle_progress(progress_made=False, pending_rollout_ids=pending_ids, store=store)


The emit_progress function duplicates the logic for acquiring active_lock and building pending_ids: both branches do the same work and differ only in the progress_made value passed to handle_progress. Consider acquiring the lock once and forwarding progress_made directly.
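
A possible simplification, shown as a self-contained sketch. `RecordingTracker` is a hypothetical stand-in for the benchmark's tracker, and the surrounding `main()` setup is assumed for illustration only; just the body of `emit_progress` mirrors the hunk above.

```python
import asyncio
from typing import List, Set, Tuple


class RecordingTracker:
    """Hypothetical stand-in for the benchmark's tracker; records each call."""

    def __init__(self) -> None:
        self.calls: List[Tuple[bool, List[str]]] = []

    async def handle_progress(
        self, *, progress_made: bool, pending_rollout_ids: List[str], store: object
    ) -> None:
        self.calls.append((progress_made, sorted(pending_rollout_ids)))


async def main() -> List[Tuple[bool, List[str]]]:
    active_rollouts: Set[str] = {"r1", "r2"}
    active_lock = asyncio.Lock()
    tracker = RecordingTracker()
    store = object()  # placeholder; the real store is not needed for this sketch

    async def emit_progress(progress_made: bool) -> None:
        # Copy the pending IDs under the lock, then release it before
        # awaiting the tracker so other tasks are not blocked meanwhile.
        async with active_lock:
            pending_ids = list(active_rollouts)
        await tracker.handle_progress(
            progress_made=progress_made, pending_rollout_ids=pending_ids, store=store
        )

    await emit_progress(True)
    await emit_progress(False)
    return tracker.calls


def demo() -> List[Tuple[bool, List[str]]]:
    return asyncio.run(main())
```

The behavior is unchanged from the two-branch version: the lock is held only while copying the set, and `progress_made` is passed through as given.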

Copilot · 2025-12-17T15:19:06Z

agentlightning/store/collection_based.py

+# ContextVars for tracking the current store method without expensive stack introspection.
+# These are set by the @tracked decorator and read by tracking_context in collection/base.py.


The comment mentions "tracking_context in collection/base.py" but this function could be called from multiple places. Consider updating the comment to be more general, such as "read by get_current_store_methods()" to better reflect the actual usage pattern.

Copilot · 2025-12-17T15:19:06Z

tests/benchmark/benchmark_store.py

+class RolloutProgressTracker:
+    """Helper for tracking rollout progress and surfacing stale worker states."""
+


The class docstring should document the max_stale_seconds parameter.

Copilot · 2025-12-17T15:19:06Z

agentlightning/runner/agent.py

+                    if snap is None:
+                        # probably just started
+                        logger.debug("%s Heartbeat consumer: no snapshot yet; skipping update.", self._log_prefix())
+                        continue


The consumer thread's while loop does not wait/sleep when snap is None (no snapshot yet). This causes a busy-wait loop that could consume unnecessary CPU while waiting for the first snapshot to be produced. Consider adding a small sleep when no snapshot is available.

optimize nearest store method

64663be

Copilot AI review requested due to automatic review settings December 17, 2025 02:11

Copilot started reviewing on behalf of ultmaster December 17, 2025 02:12

Copilot AI reviewed Dec 17, 2025

ultmaster added 9 commits December 17, 2025 11:51

.

5a7e744

add more logging

9ee97d9

set debug mode

85ac868

.

aa635ef

add debug log

a16bc14

fix benchmark console

2d9fe7d

optimize system snapshot

fb9b292

minor fix

c142162

remove debug flag

27234de

ultmaster added ci-spider ci-apo ci-calc-x ci-gpu ci-rag ci-chartqa labels Dec 17, 2025

ultmaster requested a review from Copilot December 17, 2025 15:15

Copilot started reviewing on behalf of ultmaster December 17, 2025 15:16

Copilot AI reviewed Dec 17, 2025

resolve comments

0ddcaec

ultmaster merged commit 9f178ac into main Dec 17, 2025
35 checks passed
