[WIP][Data] Revising resource allocator task scheduling decision #60639

alexeykudinkin · 2026-01-31T19:31:58Z

gemini-code-assist

Code Review

This pull request revises the resource allocator's task scheduling decision to be more robust, especially concerning object store memory. It introduces a fallback mechanism to estimate the size of pending task outputs when no historical data is available, using target_max_block_size. This estimate is then used to check if there is sufficient object store memory budget before submitting a new task.

The overall logic is sound and improves scheduling decisions. However, I found a critical issue in resource_manager.py where self._metrics is used instead of op.metrics, which will lead to an AttributeError.
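The fallback described above can be sketched roughly as follows. This is a minimal illustration, not Ray's implementation: `OpMetricsSketch`, `observed_output_bytes_per_task`, and `has_object_store_budget` are hypothetical names, the 128 MiB value is an assumed figure chosen only for the example, and how the real estimate is derived from historical data is not shown in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OpMetricsSketch:
    """Hypothetical stand-in for an operator's metrics (not Ray's class)."""

    # Observed per-task output size in bytes; None until a sample exists.
    observed_output_bytes_per_task: Optional[int]
    # Configured block-size cap, used as the fallback estimate.
    target_max_block_size: int

    @property
    def obj_store_mem_max_pending_output_per_task(self) -> int:
        # Prefer the observed figure; fall back to the block-size cap.
        if self.observed_output_bytes_per_task is not None:
            return self.observed_output_bytes_per_task
        return self.target_max_block_size


def has_object_store_budget(budget_bytes: int, metrics: OpMetricsSketch) -> bool:
    # Mirrors the comparison in the diff under review (strict ">").
    return budget_bytes > (metrics.obj_store_mem_max_pending_output_per_task or 0)


MiB = 1024 * 1024
cold = OpMetricsSketch(None, 128 * MiB)     # no history yet: fallback applies
warm = OpMetricsSketch(8 * MiB, 128 * MiB)  # history available: estimate applies
```

With a 64 MiB budget, the `cold` operator would be held back because the fallback estimate exceeds the budget, while the `warm` operator would be allowed to submit.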

gemini-code-assist · 2026-01-31T19:33:03Z

python/ray/data/_internal/execution/resource_manager.py

+            budget.object_store_memory > (
+                self._metrics.obj_store_mem_max_pending_output_per_task or 0
+            )


The ReservationOpResourceAllocator does not have a _metrics attribute, so self._metrics will raise an AttributeError. It seems you intended to use the metrics from the operator op that is passed as an argument.

Suggested change

-            budget.object_store_memory > (
-                self._metrics.obj_store_mem_max_pending_output_per_task or 0
-            )
+            budget.object_store_memory > (
+                op.metrics.obj_store_mem_max_pending_output_per_task or 0
+            )
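The failure mode in this comment can be reproduced in isolation. The sketch below uses hypothetical stand-in classes (`MetricsSketch`, `OpSketch`, `AllocatorSketch`), not Ray's actual types; it only assumes, as the comment states, that the allocator has no `_metrics` attribute while the operator exposes `metrics`.

```python
class MetricsSketch:
    """Hypothetical metrics holder; None means no estimate is available."""

    obj_store_mem_max_pending_output_per_task = None


class OpSketch:
    """Hypothetical operator that carries its own metrics."""

    def __init__(self):
        self.metrics = MetricsSketch()


class AllocatorSketch:
    """Hypothetical allocator that, like the one reviewed, has no _metrics."""

    def can_submit_buggy(self, op, budget_object_store_memory):
        # Raises AttributeError: this class never defines self._metrics.
        return budget_object_store_memory > (
            self._metrics.obj_store_mem_max_pending_output_per_task or 0
        )

    def can_submit_fixed(self, op, budget_object_store_memory):
        # Reads the metrics from the operator passed in as an argument.
        return budget_object_store_memory > (
            op.metrics.obj_store_mem_max_pending_output_per_task or 0
        )
```

Note that the `or 0` guard treats a missing estimate (`None`) as zero, so the fixed check passes for any positive budget when no estimate exists.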

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-01-31T19:54:34Z

python/ray/data/_internal/execution/resource_manager.py

+            # task outputs)
+            budget.object_store_memory > (
+                op.metrics.obj_store_mem_max_pending_output_per_task or 0
+            )


Strict inequality prevents task submission when budget equals threshold

High Severity

The new check uses strict inequality (>) to compare budget.object_store_memory against obj_store_mem_max_pending_output_per_task. When the budget exactly equals the threshold (which can happen when the reservation logic in _update_reservation sets reserved_for_tasks to exactly min_resource_usage.object_store_memory), the condition evaluates to False and prevents task submission. This can cause deadlock in resource-constrained environments where no tasks can ever be scheduled. The comparison likely needs to use >= instead of >.
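The boundary case can be illustrated with two small predicates. The function names below are hypothetical, and whether the budget can exactly equal the threshold in practice depends on the reservation logic, which is not reproduced here.

```python
from typing import Optional


def can_submit_strict(budget: int, per_task_estimate: Optional[int]) -> bool:
    # Comparison as written in the diff: rejects budget == estimate.
    return budget > (per_task_estimate or 0)


def can_submit_inclusive(budget: int, per_task_estimate: Optional[int]) -> bool:
    # Comparison proposed in the review: accepts budget == estimate.
    return budget >= (per_task_estimate or 0)


# Budget reserved to exactly the per-task requirement (illustrative numbers).
threshold = 1024
strict_result = can_submit_strict(threshold, threshold)
inclusive_result = can_submit_inclusive(threshold, threshold)
```

Here `strict_result` is `False` and `inclusive_result` is `True`. One design consideration: with `>=`, a zero budget combined with a missing estimate (`None`, treated as 0) also passes, so the inclusive form may need a separate guard if a zero budget should block submission.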

alexeykudinkin added 2 commits January 31, 2026 11:27

Reverting obj_store_mem_max_pending_output_per_task to use `target_max_block_size` until estimate becomes available

62f40c2

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Updated can_submit_task to check whether there's enough object store budget to hold pending task outputs

d8f1f51

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin requested a review from a team as a code owner January 31, 2026 19:31

gemini-code-assist bot reviewed Jan 31, 2026

alexeykudinkin added the go (add ONLY when ready to merge, run all tests) label Jan 31, 2026

cursor bot reviewed Jan 31, 2026

Fixing invalid ref

33cf32d

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin force-pushed the ak/res-allc-fup-1 branch from 1cf767c to 33cf32d Compare January 31, 2026 19:42

cursor bot reviewed Jan 31, 2026

ray-gardener bot added the data (Ray Data-related issues) label Feb 1, 2026
