[data] continue grabbing task state until response is not None #60592

iamjustinhsu · 2026-01-29T22:35:56Z

Description

Previously, I added task_id, node_id, and attempt_number for hanging tasks in #59793. However, this introduced a race condition when querying for task state:

1. Task is submitted
2. Issue detector immediately fires off
3. get_task returns None https://github.com/iamjustinhsu/ray/blob/75f9731f69f4b9c7b973f53b74d0580adb3c4ab9/python/ray/data/_internal/issue_detection/detectors/hanging_detector.py#L161 because task state not ready.

For step 2, the detector only fires when the task wasn't hanging before, or when the task has produced bytes since the last check. My fix is to also fire when previous_state.task_state is None.

I ran this many times, and the race condition no longer reproduced. I'm open to ideas on testing this, too.
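The condition described above can be sketched roughly as follows. This is a simplified illustration, not the code in hanging_detector.py: `TrackedTask` and `needs_state_lookup` are hypothetical names, and only the `task_state` field name comes from the description.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TrackedTask:
    """Hypothetical per-task record; not the detector's real class."""

    bytes_output: int
    task_state: Optional[Any] = None


def needs_state_lookup(previous: Optional[TrackedTask], bytes_output: int) -> bool:
    """Decide whether to query task state again for this task."""
    # The task was not tracked as hanging before.
    if previous is None:
        return True
    # The task has produced bytes since the last check.
    if bytes_output != previous.bytes_output:
        return True
    # The added condition: an earlier lookup returned None, so keep trying.
    return previous.task_state is None
```

Without the last condition, a task whose first lookup raced with submission would keep `task_state` as None for as long as it produced no new bytes.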

Related issues

Additional information

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

gemini-code-assist

Code Review

This pull request aims to fix a race condition where get_task could return None for a hanging task if queried too quickly. The proposed change correctly adds a condition to re-fetch the task state if it was previously None. However, this introduces a subtle bug where the hanging task timer is incorrectly reset, which could delay or prevent the detection of hanging tasks. I've added a comment with details on the issue.

Regarding your question on testing, this race condition could be tested by mocking ray.util.state.get_task to return None on the first call for a given task, and a valid TaskState on a subsequent call. You could then assert that the task state is eventually populated in the detector's internal state and that the hanging issue is correctly reported with the full task details.
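A minimal sketch of that testing idea, using only the standard library. `StateTracker` is a hypothetical, simplified stand-in for the detector's bookkeeping, and the lookup function is injected so the example runs without a Ray cluster; a real test would presumably patch ray.util.state.get_task instead, as suggested above.

```python
from typing import Any, Callable, Dict, Optional
from unittest.mock import Mock


class StateTracker:
    """Hypothetical stand-in for the detector's per-task state bookkeeping."""

    def __init__(self, get_task: Callable[[str], Optional[Any]]):
        self._get_task = get_task
        self.task_states: Dict[str, Optional[Any]] = {}

    def poll(self, task_id: str) -> None:
        # Keep querying until the lookup returns something other than None.
        if self.task_states.get(task_id) is None:
            self.task_states[task_id] = self._get_task(task_id)


def run_race_scenario():
    # Illustrative payload; the real TaskState is a dataclass defined by Ray core.
    fake_state = {"task_id": "t1", "node_id": "n1", "attempt_number": 0}
    # First call simulates the race (state not ready), second call succeeds.
    get_task = Mock(side_effect=[None, fake_state])
    tracker = StateTracker(get_task)

    tracker.poll("t1")
    first = tracker.task_states["t1"]
    tracker.poll("t1")
    second = tracker.task_states["t1"]
    tracker.poll("t1")  # State is already populated, so no further lookup.
    return first, second, get_task.call_count
```

The mock's `side_effect` list is what reproduces the race deterministically: the first poll sees None and the second sees the populated state.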

python/ray/data/_internal/issue_detection/detectors/hanging_detector.py

bveeramani · 2026-01-30T20:04:53Z

python/ray/data/_internal/issue_detection/detectors/hanging_detector.py

+                            # NOTE: The task_id + node_id will not change once we grab the task state.
+                            # Therefore, we can avoid an rpc call if we have already retrieved state info.


I felt confused while reading this code because I don't think it's obvious that task_id and node_id are fields on the task_state dataclass. Could you clarify?

Also, what about all of the other fields that can possibly change? Do we not care about those?

The TaskState is defined by core: https://github.com/iamjustinhsu/ray/blob/d35d310a0759a0112335e6a74583ebe164a7d648/python/ray/util/state/common.py#L731. My previous implementation assumed that a task's node_id and task_id cannot change. After thinking about this more, I'm not sure that holds if a task is retried. Because of this, and in the interest of simplicity, I decided to grab the new state every time.
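To illustrate why fetching on every check is the simpler choice, here is a rough sketch under the assumption (unconfirmed in this thread) that a retried task can report a different node_id and attempt_number. `FakeTaskState`, `describe_hanging_task`, and `simulate_retry` are hypothetical names; only the three field names come from the PR description.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class FakeTaskState:
    """Hypothetical subset of the fields the detector reports."""

    task_id: str
    node_id: str
    attempt_number: int


def describe_hanging_task(
    get_task: Callable[[str], Optional[FakeTaskState]], task_id: str
) -> Optional[Tuple[str, str, int]]:
    # Query on every check instead of caching the first non-None answer.
    state = get_task(task_id)
    if state is None:
        return None
    return (state.task_id, state.node_id, state.attempt_number)


def simulate_retry() -> List[Optional[Tuple[str, str, int]]]:
    # Not ready yet, then first attempt, then an assumed retry on another node.
    responses = iter(
        [
            None,
            FakeTaskState("t1", "node-a", 0),
            FakeTaskState("t1", "node-b", 1),
        ]
    )

    def lookup(task_id: str) -> Optional[FakeTaskState]:
        return next(responses)

    return [describe_hanging_task(lookup, "t1") for _ in range(3)]
```

A cached lookup would keep reporting node-a after the retry; querying each time trades one extra RPC per check for fields that stay current.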

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

python/ray/data/_internal/issue_detection/detectors/hanging_detector.py

[data] continue grabbing task state until response is not None

75f9731

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

iamjustinhsu requested a review from a team as a code owner January 29, 2026 22:35

iamjustinhsu added the "go" label (add ONLY when ready to merge, run all tests) Jan 29, 2026

gemini-code-assist bot reviewed Jan 29, 2026

python/ray/data/_internal/issue_detection/detectors/hanging_detector.py

cursor bot reviewed Jan 29, 2026

python/ray/data/_internal/issue_detection/detectors/hanging_detector.py (outdated)

iamjustinhsu added 3 commits January 29, 2026 15:16

cursor

5f0f9ce

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

me

44fbe02

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

comment

37f2d7a

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

ray-gardener bot added the "data" label (Ray Data-related issues) Jan 30, 2026

bveeramani reviewed Jan 30, 2026

iamjustinhsu added 2 commits January 30, 2026 12:15

refactor

acd6017

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

remove

d35d310

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

cursor bot reviewed Jan 30, 2026

python/ray/data/_internal/issue_detection/detectors/hanging_detector.py

iamjustinhsu added 2 commits January 30, 2026 12:27

fix

49ec6a8

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

add task_id to debug msg too

ff2a285

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
