Skip to content

Conversation

@alexeykudinkin
Copy link
Contributor

Thank you for contributing to Ray! 🚀
Please review the Ray Contribution Guide before opening a pull request.

⚠️ Remove these instructions before submitting your PR.

💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete.

Description

Briefly describe what this PR accomplishes and why it's needed.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

…max_block_size` until estimate becomes available

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…e budget to hold pending task outputs

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@alexeykudinkin alexeykudinkin requested a review from a team as a code owner January 31, 2026 19:31
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request revises the resource allocator's task scheduling decision to be more robust, especially concerning object store memory. It introduces a fallback mechanism to estimate the size of pending task outputs when no historical data is available, using target_max_block_size. This estimate is then used to check if there is sufficient object store memory budget before submitting a new task.

The overall logic is sound and improves scheduling decisions. However, I found a critical issue in resource_manager.py where self._metrics is used instead of op.metrics, which will lead to an AttributeError.

Comment on lines 870 to 872
budget.object_store_memory > (
self._metrics.obj_store_mem_max_pending_output_per_task or 0
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The ReservationOpResourceAllocator does not have a _metrics attribute, so self._metrics will raise an AttributeError. It seems you intended to use the metrics from the operator op that is passed as an argument.

Suggested change
budget.object_store_memory > (
self._metrics.obj_store_mem_max_pending_output_per_task or 0
)
budget.object_store_memory > (
op.metrics.obj_store_mem_max_pending_output_per_task or 0
)
@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Jan 31, 2026
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

# task outputs)
budget.object_store_memory > (
op.metrics.obj_store_mem_max_pending_output_per_task or 0
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strict inequality prevents task submission when budget equals threshold

High Severity

The new check uses strict inequality (>) to compare budget.object_store_memory against obj_store_mem_max_pending_output_per_task. When the budget exactly equals the threshold (which can happen when the reservation logic in _update_reservation sets reserved_for_tasks to exactly min_resource_usage.object_store_memory), the condition evaluates to False and prevents task submission. This can cause deadlock in resource-constrained environments where no tasks can ever be scheduled. The comparison likely needs to use >= instead of >.

Fix in Cursor Fix in Web

@ray-gardener ray-gardener bot added the data Ray Data-related issues label Feb 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

2 participants