
Conversation

@iamjustinhsu (Contributor) commented on Jan 29, 2026:

Description

Previously, I added task_id, node_id, and attempt_number for hanging tasks in #59793. However, this introduced a race condition when querying for task state:

  1. Task is submitted
  2. Issue detector immediately fires off
  3. get_task returns None (https://github.com/iamjustinhsu/ray/blob/75f9731f69f4b9c7b973f53b74d0580adb3c4ab9/python/ray/data/_internal/issue_detection/detectors/hanging_detector.py#L161) because the task state is not ready yet.

For (2), we only fire off the query when the task wasn't hanging before, or when the task has produced bytes since the last check. My fix is to also fire it off when previous_state.task_state is None.

I ran this many times and the race condition no longer reproduced. Open to ideas on how to test this, too.
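
For concreteness, a minimal sketch of the extra check (the helper name and the shape of `previous_state` are illustrative; only `get_task` and the `previous_state.task_state is None` condition come from the linked detector code):

```python
from ray.util.state import get_task


def _maybe_refresh_task_state(previous_state, task_id):
    """Hypothetical helper: re-query the state API if we never got a TaskState.

    Before this change we only re-queried when the task was newly hanging or
    had produced bytes since the last check, so a None returned by the first,
    racy get_task() call was never retried.
    """
    if previous_state.task_state is None:
        # get_task may still return None if the state API hasn't caught up;
        # in that case we simply try again on the next detection cycle.
        previous_state.task_state = get_task(task_id)
    return previous_state.task_state
```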

Related issues

Additional information

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu requested a review from a team as a code owner on January 29, 2026 at 22:35
@iamjustinhsu added the go (add ONLY when ready to merge, run all tests) label on Jan 29, 2026
@gemini-code-assist bot left a comment:


Code Review

This pull request aims to fix a race condition where get_task could return None for a hanging task if queried too quickly. The proposed change correctly adds a condition to re-fetch the task state if it was previously None. However, this introduces a subtle bug where the hanging task timer is incorrectly reset, which could delay or prevent the detection of hanging tasks. I've added a comment with details on the issue.

Regarding your question on testing, this race condition could be tested by mocking ray.util.state.get_task to return None on the first call for a given task, and a valid TaskState on a subsequent call. You could then assert that the task state is eventually populated in the detector's internal state and that the hanging issue is correctly reported with the full task details.
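
Something along these lines could work as a self-contained starting point (the `_TrackedTaskState` class and the loop are stand-ins for the detector's real bookkeeping, not the actual detector API):

```python
from unittest import mock


class _TrackedTaskState:
    """Stand-in for the detector's per-task bookkeeping."""

    def __init__(self):
        self.task_state = None


def test_task_state_eventually_populated():
    fake_task_state = mock.Mock(task_id="t1", node_id="n1", attempt_number=0)
    # First call races the task submission and returns None; the retry succeeds.
    get_task = mock.Mock(side_effect=[None, fake_task_state])

    tracked = _TrackedTaskState()
    for _ in range(2):  # two detection cycles
        if tracked.task_state is None:
            tracked.task_state = get_task("t1")

    assert tracked.task_state is fake_task_state
    assert get_task.call_count == 2
```

A real test would instead `mock.patch` `get_task` where the hanging detector imports it, drive the detector itself, and assert that the reported hanging issue carries the full `task_id`/`node_id`/`attempt_number`.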

@ray-gardener bot added the data (Ray Data-related issues) label on Jan 30, 2026
Comment on lines 159 to 160
# NOTE: The task_id + node_id will not change once we grab the task state.
# Therefore, we can avoid an rpc call if we have already retrieved state info.
A Member left a comment:
I felt confused while reading this code because I don't think it's obvious that task_id and node_id are fields on the task_state dataclass. Could you clarify?

A Member added:

Also, what about all of the other fields that can possibly change? Do we not care about those?

@iamjustinhsu (Author) replied on Jan 30, 2026:

The TaskState is defined by core: https://github.com/iamjustinhsu/ray/blob/d35d310a0759a0112335e6a74583ebe164a7d648/python/ray/util/state/common.py#L731. My previous implementation assumed that a task's node_id and task_id cannot change. Thinking about it more, I'm not sure that holds if a task is retried. Because of that, and in the interest of simplicity, I decided to grab the new state every time.
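
A rough sketch of the "grab the new state every time" approach (the helper name is made up; only `get_task` is the real state-API call):

```python
from ray.util.state import get_task


def _refresh_task_state(state, task_id):
    # Always re-query rather than caching, since fields such as node_id may
    # change if the task is retried. The result can still be None if the
    # state API hasn't caught up yet, in which case later cycles retry.
    state.task_state = get_task(task_id)
    return state.task_state
```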

@cursor (Cursor Bugbot) left a comment:

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Labels

data (Ray Data-related issues) · go (add ONLY when ready to merge, run all tests)

2 participants