-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Fix store port conflict handling (issue #221) #227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@microsoft-github-policy-service agree |
c39c3b7 to
d74c0ee
Compare
agentlightning/cli/store.py
Outdated
| try: | ||
| asyncio.run(server.run_forever()) | ||
| except RuntimeError as exc: | ||
| logger.error(str(exc)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have you confirmed whether this logger works here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, confirmed — the logger works here.
When the port is already in use, Uvicorn emits
ERROR: [Errno 98] error while attempting to bind ...,
followed by our CLI logger output:
2025-10-28 07:34:01,766 [ERROR] (Process-4982 agentlightning.cli.store) LightningStore server failed to start on http://0.0.0.0:4747 ...
So the error message is correctly captured and formatted via configure_logger().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any idea why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we directly raise the exception here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s caught here to print a clean CLI error and exit with code 1.
Re-raising would show a long asyncio/Uvicorn traceback instead.
This keeps the output concise and user-friendly for the command-line entry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in that case better using logger.exception or logger.error(exc_info=True) here. str(exc) can be very uninformative in some cases.
|
/ci |
|
🚀 CI Watcher for correlation id-3454793152-mha6p0lw triggered by comment 3454793152
✅ All runs completed. |
|
Successfully created backport PR for |
Problem
When two trainers launch store servers on the same port, the second trainer may incorrectly attach to the first trainer’s store.
This happens because
LightningStoreServer.start()only checks/healthand doesn’t detect when its ownuvicornfails to bind the port (raisingSystemExitorOSError).Fix
Added
_server_start_exceptiontracking to captureuvicornstartup failures.Enhanced startup checks in both
start()andrun_forever()to verify:uvicorn_server.startedflag,Introduced
_handle_failed_start()for safe cleanup and_format_start_failure_reason()to surface a friendly message (“Another process may already be using this port.”).Updated
cli/store.pyto catchRuntimeErrorand log cleanly instead of dumping stack traces.Tests
test_server_start_rejects_port_conflicttest_run_forever_rejects_port_conflictBoth confirm that a second store launched on the same port fails fast with the friendly error message.
Manual Verification
Fixes: #221
Impact: Prevents trainers from mistakenly sharing the same store process.