Skip to content

Conversation

@ddsfda99
Copy link
Contributor

Problem

When two trainers launch store servers on the same port, the second trainer may incorrectly attach to the first trainer’s store.
This happens because LightningStoreServer.start() only checks /health and doesn’t detect when its own uvicorn fails to bind the port (raising SystemExit or OSError).

Fix

  • Added _server_start_exception tracking to capture uvicorn startup failures.

  • Enhanced startup checks in both start() and run_forever() to verify:

    • uvicorn_server.started flag,
    • serving thread liveness,
    • and recorded startup exceptions.
  • Introduced _handle_failed_start() for safe cleanup and _format_start_failure_reason() to surface a friendly message (“Another process may already be using this port.”).

  • Updated cli/store.py to catch RuntimeError and log cleanly instead of dumping stack traces.

Tests

  • test_server_start_rejects_port_conflict
  • test_run_forever_rejects_port_conflict
    Both confirm that a second store launched on the same port fails fast with the friendly error message.

Manual Verification

# Terminal 1
agl store --port 4747

# Terminal 2
agl store --port 4747
# → Immediately exits with: "Another process may already be using this port."

Fixes: #221
Impact: Prevents trainers from mistakenly sharing the same store process.

@ddsfda99
Copy link
Contributor Author

@microsoft-github-policy-service agree

@ddsfda99 ddsfda99 force-pushed the fix-store-port-conflict branch from c39c3b7 to d74c0ee Compare October 27, 2025 13:33
try:
asyncio.run(server.run_forever())
except RuntimeError as exc:
logger.error(str(exc))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you confirmed whether this logger works here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, confirmed — the logger works here.
When the port is already in use, Uvicorn emits
ERROR: [Errno 98] error while attempting to bind ...,
followed by our CLI logger output:
2025-10-28 07:34:01,766 [ERROR] (Process-4982 agentlightning.cli.store) LightningStore server failed to start on http://0.0.0.0:4747 ...
So the error message is correctly captured and formatted via configure_logger().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any idea why?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we directly raise the exception here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s caught here to print a clean CLI error and exit with code 1.
Re-raising would show a long asyncio/Uvicorn traceback instead.
This keeps the output concise and user-friendly for the command-line entry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that case better using logger.exception or logger.error(exc_info=True) here. str(exc) can be very uninformative in some cases.

@ultmaster
Copy link
Contributor

/ci

@github-actions
Copy link

github-actions bot commented Oct 28, 2025

🚀 CI Watcher for correlation id-3454793152-mha6p0lw triggered by comment 3454793152
🏃‍♀️ Tracking 4 workflow run(s):

✅ All runs completed.

@ultmaster ultmaster merged commit 01955ae into microsoft:main Oct 28, 2025
8 checks passed
github-actions bot pushed a commit that referenced this pull request Oct 28, 2025
@agent-lightning-bot
Copy link
Collaborator

Successfully created backport PR for stable/v0.2.x:

ultmaster pushed a commit that referenced this pull request Oct 28, 2025
(cherry picked from commit 01955ae)

Co-authored-by: ddsfda99 <168000702+ddsfda99@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment