-
Notifications
You must be signed in to change notification settings - Fork 7.2k
[WIP][Data] Revising resource allocator task scheduling decision #60639
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…max_block_size` until estimate becomes available Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…e budget to hold pending task outputs Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request revises the resource allocator's task scheduling decision to be more robust, especially concerning object store memory. It introduces a fallback mechanism to estimate the size of pending task outputs when no historical data is available, using target_max_block_size. This estimate is then used to check if there is sufficient object store memory budget before submitting a new task.
The overall logic is sound and improves scheduling decisions. However, I found a critical issue in resource_manager.py where self._metrics is used instead of op.metrics, which will lead to an AttributeError.
| budget.object_store_memory > ( | ||
| self._metrics.obj_store_mem_max_pending_output_per_task or 0 | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ReservationOpResourceAllocator does not have a _metrics attribute, so self._metrics will raise an AttributeError. It seems you intended to use the metrics from the operator op that is passed as an argument.
| budget.object_store_memory > ( | |
| self._metrics.obj_store_mem_max_pending_output_per_task or 0 | |
| ) | |
| budget.object_store_memory > ( | |
| op.metrics.obj_store_mem_max_pending_output_per_task or 0 | |
| ) |
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
1cf767c to
33cf32d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
| # task outputs) | ||
| budget.object_store_memory > ( | ||
| op.metrics.obj_store_mem_max_pending_output_per_task or 0 | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strict inequality prevents task submission when budget equals threshold
High Severity
The new check uses strict inequality (>) to compare budget.object_store_memory against obj_store_mem_max_pending_output_per_task. When the budget exactly equals the threshold (which can happen when the reservation logic in _update_reservation sets reserved_for_tasks to exactly min_resource_usage.object_store_memory), the condition evaluates to False and prevents task submission. This can cause deadlock in resource-constrained environments where no tasks can ever be scheduled. The comparison likely needs to use >= instead of >.


Description
Related issues
Additional information