Support stateful ops in TransformIterator (#9627) #9630

Open: nethum529 wants to merge 1 commit into NVIDIA:main from nethum529:fix/issue-9627-stateful-transform-iterator
@nethum529

Description

closes #9627

A stateful op — a Python callable that closes over a CUDA device array — works as a direct algorithm op (e.g. select's cond, unary_transform's op), but failed when wrapped in a TransformIterator used as an algorithm's input (reduce_into, segmented_reduce, histogram_even). Two symptoms:

  1. Without a return annotation, constructing the iterator raised NotImplementedError: get_return_type not implemented for _StatefulOp.
  2. With a return annotation, execution crashed with cudaErrorLaunchFailure: unspecified launch failure.
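
To make the term concrete, the following is a minimal host-side sketch of what "closes over a device array" means for a Python callable. `FakeDeviceArray`, `make_gather_op`, and `captured_arrays` are hypothetical names used only for illustration, and the sketch does not claim to show how `_jit.py` actually detects stateful ops; it only uses the standard `inspect` module and the presence of a `__cuda_array_interface__` attribute.

```python
import inspect


class FakeDeviceArray:
    """Hypothetical stand-in for a CUDA device array (no GPU needed)."""

    def __init__(self, values):
        self.values = list(values)
        # Only the fields this sketch needs; a real interface dict has more.
        self.__cuda_array_interface__ = {
            "shape": (len(self.values),),
            "typestr": "<i4",
            "version": 3,
        }

    def __getitem__(self, i):
        return self.values[i]


def make_gather_op(table):
    # `op` closes over `table`, so it carries state beyond its argument.
    def op(i):
        return table[i]

    return op


def captured_arrays(fn):
    """Return closure variables of `fn` that expose __cuda_array_interface__."""
    nonlocals = inspect.getclosurevars(fn).nonlocals
    return {
        name: value
        for name, value in nonlocals.items()
        if hasattr(value, "__cuda_array_interface__")
    }


stateful_op = make_gather_op(FakeDeviceArray([10, 20, 30]))
stateless_op = lambda x: x * 2
```

Here `captured_arrays(stateful_op)` contains the single entry `table`, while `captured_arrays(stateless_op)` is empty, which is the distinction the two symptoms above hinge on.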

Root cause

  • _StatefulOp never implemented get_return_type, so TransformIterator could not infer the transformed value type (symptom 1).
  • A stateful op compiles to a device function with signature (void* op_state, void* input, void* output), where op_state points to the packed device-array pointers. But TransformIterator's generated dereference code declared and called the op as if it were stateless, with signature (void* input, void* output), and never carried the op's state into the iterator's state. On device, the op read garbage in place of its state pointers, crashing the launch (symptom 2).

Fix

  • _jit.py: implement _StatefulOp.get_return_type, reusing the same Numba inference path as compilation (extracted into the helpers _state_array_numba_types and _infer_stateful_return_type, now shared with _compile_stateful_op).
  • iterators/_transform.py: for a stateful transform op, append the op's state bytes after the underlying iterator's state (with alignment padding; the underlying state stays at offset 0 where its child ops expect it) and pass static_cast<char*>(state) + op_state_offset as the op's state argument in the generated input/output dereference. This mirrors the existing state-composition pattern in PermutationIterator / compose_iterator_states. Stateless ops are unchanged.
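
The state layout described in the second bullet can be sketched on the host with plain bytes. This is an illustration under stated assumptions, not the code in `_transform.py`: `align_up` and `compose_states_sketch` are hypothetical helpers (distinct from the library's `compose_iterator_states`), the underlying iterator state is assumed to be a single 4-byte integer, and the op state is assumed to be one 8-byte pointer.

```python
import struct


def align_up(offset, alignment):
    """Round `offset` up to the next multiple of `alignment`."""
    return (offset + alignment - 1) // alignment * alignment


def compose_states_sketch(underlying_state, op_state, op_alignment):
    """Append op_state after underlying_state, padded to op_alignment.

    The underlying state stays at offset 0; the returned offset is where
    the op's state begins inside the combined buffer.
    """
    op_state_offset = align_up(len(underlying_state), op_alignment)
    padding = bytes(op_state_offset - len(underlying_state))
    return underlying_state + padding + op_state, op_state_offset


# Assumed example: a counting-style iterator holding one int32, and an op
# whose packed state is a single 64-bit device pointer (made-up value).
underlying = struct.pack("<i", 7)
op_state = struct.pack("<Q", 0xDEADBEEF00)

combined, offset = compose_states_sketch(underlying, op_state, op_alignment=8)

# The dereference would hand the op `state + offset`; on the host we can
# model that by unpacking from the combined buffer at that offset.
recovered_pointer = struct.unpack_from("<Q", combined, offset)[0]
```

In this example the 4-byte underlying state is followed by 4 bytes of padding, so the op state starts at offset 8 and the combined buffer is 16 bytes long.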

This makes index-gather and multi-array transforms fuse directly into reductions/histograms without first materializing an intermediate array.

Tests

Added to tests/compute/test_reduce.py:

  • stateful transform op as a reduce input iterator (annotated and inferred-return-type variants),
  • multiple captured arrays,
  • cache reuse with identical op bytecode but different captured arrays,
  • nested transform iterators (stateful inner under stateless outer),
  • stateful op as a TransformOutputIterator.

Verified on an NVIDIA RTX 4070 (CUDA 13.3): the 6 new tests pass, and the reduce/select/segmented_reduce/histogram/iterators/transform/permutation/zip suites pass with no regressions.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
A stateful op (a callable capturing a CUDA device array) works as a direct
algorithm op but failed inside a TransformIterator used as an algorithm input:
without a return annotation it raised "get_return_type not implemented for
_StatefulOp", and with one it crashed at launch with cudaErrorLaunchFailure.

The op's compiled device function takes its packed state pointers as a leading
argument, but the iterator's generated dereference called it as a stateless
op and never carried the op's state. Implement _StatefulOp.get_return_type,
and compose the op's state after the underlying iterator's state (keeping the
underlying at offset 0) so the dereference can pass the op its state pointer,
mirroring PermutationIterator's state composition.

Closes NVIDIA#9627

Signed-off-by: nethum529 <nethumweerasinghe.nw@gmail.com>
@nethum529 nethum529 requested a review from a team as a code owner June 29, 2026 18:55
@nethum529 nethum529 requested a review from tpn June 29, 2026 18:55
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 29, 2026
@copy-pr-bot

copy-pr-bot Bot commented Jun 29, 2026

Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 29, 2026
@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Contributor

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Transform iterators now support operations that carry captured device-array state.
    • Stateful transform operations can be used in both input and output iterator paths.
    • Return-type inference was improved for stateful transforms, including cases without explicit annotations.
  • Bug Fixes

    • Fixed reduction behavior when multiple iterators use similar transform logic but different captured state.
    • Improved correctness for nested transform iterators and state-aware iterator composition.

Walkthrough

Refactors stateful-op return-type inference in _jit.py into two shared helpers (_state_array_numba_types, _infer_stateful_return_type) used by both _compile_stateful_op and _StatefulOp.get_return_type. Extends TransformIterator to pack the compiled op's state into the iterator's combined state buffer at an alignment-computed offset, passing a derived pointer at dereference time. Six new tests cover single and multiple captured arrays, inferred return types, cache reuse, nested iterators, and the output-iterator path.

Changes

Stateful TransformIterator Support

| Layer / File(s) | Summary |
| --- | --- |
| Stateful return-type inference helpers: python/cuda_cccl/cuda/compute/_jit.py | Extracts _state_array_numba_types and _infer_stateful_return_type; both _compile_stateful_op and _StatefulOp.get_return_type now delegate to them instead of duplicating logic. |
| TransformIterator stateful state packing and deref: python/cuda_cccl/cuda/compute/iterators/_transform.py | Adds _op_state_offset slot; __init__ conditionally appends packed op state after the underlying iterator state and records the offset; _make_input_deref_op and _make_output_deref_op branch on is_stateful to compute and pass op_state pointer in the CUDA device wrapper. |
| Tests: python/cuda_cccl/tests/compute/test_reduce.py | Six new tests covering single/multiple captured arrays, inferred return type, cache reuse, nested stateful iterators, and stateful output iterator. |

Assessment against linked issues

Objective Addressed Explanation
NotImplementedError: get_return_type not implemented for _StatefulOp when no return annotation [#9627]
cudaErrorLaunchFailure when running a stateful op in a TransformIterator [#9627]

Out-of-scope changes

No out-of-scope changes identified.


important: In _transform.py lines 98–121, alignment padding is computed between the underlying iterator state and the op state. The code must guarantee that _op_state_offset is a multiple of the op state's alignment requirement. Verify that the padding calculation matches op_state_alignment exactly, because an off-by-one in the modulo arithmetic would silently misalign the pointer on device, causing the same cudaErrorLaunchFailure the PR is fixing.
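
One way to check the property requested above is an exhaustive test over small sizes and power-of-two alignments. The sketch below is an assumption about what such a check could look like; `op_state_offset_sketch` is a hypothetical formula written for this example and is not copied from `_transform.py`.

```python
def op_state_offset_sketch(underlying_size, op_alignment):
    """Smallest offset >= underlying_size that is a multiple of op_alignment."""
    # In Python, (-n) % a is always in [0, a), so this adds exactly the
    # padding needed and adds zero when the size is already aligned.
    return underlying_size + (-underlying_size) % op_alignment


def check_offsets(max_size=64, alignments=(1, 2, 4, 8, 16)):
    """Return the number of (size, alignment) pairs that were verified."""
    checked = 0
    for alignment in alignments:
        for size in range(max_size + 1):
            offset = op_state_offset_sketch(size, alignment)
            # The offset must be aligned ...
            assert offset % alignment == 0, (size, alignment, offset)
            # ... and must not skip past the nearest aligned slot.
            assert 0 <= offset - size < alignment, (size, alignment, offset)
            checked += 1
    return checked


pairs_checked = check_offsets()
```

The second assertion is what catches the common variant `size + (alignment - size % alignment)`, which stays aligned but wastes a full `alignment` bytes when `size` is already a multiple.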

suggestion: _StatefulOp.get_return_type (lines 987–995) now transforms the function and registers gpu_structs again, duplicating work already done during compilation. Consider caching the transformed function or the inferred TypeDescriptor on the _StatefulOp instance to avoid the redundant re-compilation on repeated get_return_type calls.
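
The caching idea could be sketched as follows. This is a hypothetical illustration: `CachedReturnTypeOp` is not the real `_StatefulOp`, the `get_return_type(input_types)` signature is assumed, and `infer_fn` stands in for whatever expensive inference the real method performs.

```python
class CachedReturnTypeOp:
    """Hypothetical op wrapper that memoizes return-type inference."""

    def __init__(self, infer_fn):
        self._infer_fn = infer_fn
        # Keyed on the input types, since the inferred return type can
        # differ between input types for the same callable.
        self._return_type_cache = {}

    def get_return_type(self, input_types):
        key = tuple(input_types)
        if key not in self._return_type_cache:
            self._return_type_cache[key] = self._infer_fn(key)
        return self._return_type_cache[key]


# Stand-in for an expensive inference step; it records every invocation.
inference_calls = []


def fake_infer(input_types):
    inference_calls.append(input_types)
    return "float64" if "float64" in input_types else "int32"


op = CachedReturnTypeOp(fake_infer)
first = op.get_return_type(["int32"])
second = op.get_return_type(["int32"])  # served from the cache
third = op.get_return_type(["float64"])  # new key, inferred once
```

With this shape, repeated calls for the same input types run the inference step only once per instance.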

suggestion: test_transform_iterator_stateful_op_cache_reuse exercises that two TransformIterator instances with different captured arrays but identical bytecode produce correct results. It would be worth also asserting that the underlying reducer is reused (e.g., by checking a cache-hit counter or that both reductions share the compiled kernel), not just that results are numerically correct — otherwise the test only validates correctness, not the cache-reuse property it claims to cover.
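
A cache-hit assertion of the kind suggested above might look like the sketch below. Everything here is hypothetical: `KernelCacheSketch` is a toy cache keyed on bytecode, not the library's actual build cache, and Python lists stand in for captured device arrays.

```python
class KernelCacheSketch:
    """Toy build cache keyed on an op's bytecode, not its captured values."""

    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, fn):
        code = fn.__code__
        key = (code.co_code, code.co_consts)
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = object()  # placeholder for a compiled kernel
        return self._cache[key]


def make_op(table):
    def op(i):
        return table[i]

    return op


cache = KernelCacheSketch()
op_a = make_op([1, 2, 3])
op_b = make_op([4, 5, 6])  # same bytecode, different captured data

kernel_a = cache.get(op_a)
kernel_b = cache.get(op_b)
```

A test written this way can assert both that `kernel_a is kernel_b` (one build, one hit) and that `op_a` and `op_b` still produce different results, which separates the cache-reuse property from numerical correctness.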


@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (1)
python/cuda_cccl/cuda/compute/_jit.py (1)

846-850: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

suggestion: Captured state arrays are always rebuilt as flat 1-D buffers (shape=len(state_array), strides=itemsize). Reject non-1D CUDA arrays here or document the 1-D-only contract; otherwise a multidimensional capture compiles against the wrong parameter shape.
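
A guard along the lines suggested above could be sketched as below. `require_1d_state_array` and `FakeCudaArray` are hypothetical names, the check reads only the `shape` entry of `__cuda_array_interface__`, and strides are deliberately not validated here.

```python
class FakeCudaArray:
    """Hypothetical object exposing just enough of the CUDA array interface."""

    def __init__(self, shape):
        self.__cuda_array_interface__ = {
            "shape": tuple(shape),
            "typestr": "<f4",
            "version": 3,
        }


def require_1d_state_array(name, array):
    """Return the length of a 1-D captured array, or raise ValueError."""
    shape = tuple(array.__cuda_array_interface__["shape"])
    if len(shape) != 1:
        raise ValueError(
            f"captured array {name!r} has shape {shape}; "
            "only 1-D arrays are supported as stateful-op state"
        )
    return shape[0]


length = require_1d_state_array("table", FakeCudaArray((128,)))

try:
    require_1d_state_array("grid", FakeCudaArray((4, 32)))
    rejected = False
except ValueError:
    rejected = True
```

Raising at capture time turns a silent shape mismatch on device into an immediate, named error on the host.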

Source: Path instructions


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e71a519d-89cb-4d9a-9d9d-091d49fcff60

📥 Commits

Reviewing files that changed from the base of the PR and between e1cb696 and 3b049a9.

📒 Files selected for processing (3)
  • python/cuda_cccl/cuda/compute/_jit.py
  • python/cuda_cccl/cuda/compute/iterators/_transform.py
  • python/cuda_cccl/tests/compute/test_reduce.py
@nethum529

Author

The failing pre-commit.ci check is the mypy hook, unrelated to this PR. numpy is unpinned and 2.5.0's PEP 695 type stubs break mypy under python_version = "3.10":

numpy/__init__.pyi:737: error: Type statement is only supported in Python 3.12 and greater

Labels

None yet

1 participant