Support stateful ops in TransformIterator (#9627) by nethum529 · Pull Request #9630 · NVIDIA/cccl

nethum529 · 2026-06-29T18:55:41Z

Description

A stateful op, i.e. a Python callable that closes over a CUDA device array, works as a direct algorithm op (e.g. select's cond, unary_transform's op), but before this change it failed when wrapped in a TransformIterator used as an algorithm's input (reduce_into, segmented_reduce, histogram_even). There were two symptoms:

- Without a return annotation, constructing the iterator raised NotImplementedError: get_return_type not implemented for _StatefulOp.
- With a return annotation, execution crashed with cudaErrorLaunchFailure: unspecified launch failure.

Root cause

- _StatefulOp never implemented get_return_type, so TransformIterator could not infer the transformed value type (symptom 1).
- A stateful op compiles to a device function with signature (void* op_state, void* input, void* output), where op_state points to the packed device-array pointers. But TransformIterator's generated dereference code declared and called the op as if it were stateless, with signature (void* input, void* output), and never carried the op's state into the iterator's state. On device, the op read garbage in place of its state pointers, crashing the launch (symptom 2).
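The calling-convention mismatch can be illustrated on the host with ctypes. This is only a CPU analogy under stated assumptions: `gather_op`, `table`, `deref_stateful`, `deref_stateless`, and the single-pointer state layout are hypothetical stand-ins, not the code that cuda.compute generates.

```python
import ctypes
import struct

# Hypothetical captured array; on the GPU this would be a device array.
table = (ctypes.c_int32 * 4)(10, 20, 30, 40)

# "Packed state": the raw pointer to the captured array, stored as bytes.
op_state = struct.pack("P", ctypes.addressof(table))


def gather_op(op_state: bytes, index: int) -> int:
    """Stateful op: recover the captured pointer from op_state, then index it."""
    (addr,) = struct.unpack("P", op_state)
    data = ctypes.cast(addr, ctypes.POINTER(ctypes.c_int32))
    return data[index]


def deref_stateful(op_state: bytes, index: int) -> int:
    """Dereference that forwards the op's state as the leading argument."""
    return gather_op(op_state, index)


def deref_stateless(index: int) -> int:
    """Dereference that drops the state, i.e. the pre-fix calling convention."""
    return gather_op(index)  # wrong arity: the state argument is missing
```

Here `deref_stateful(op_state, 2)` reads `table[2]`, while `deref_stateless(2)` raises a TypeError because Python checks the argument count. Device code has no such check, which is why the op ended up reading garbage instead of its state pointers.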

Fix

- _jit.py: implement _StatefulOp.get_return_type, reusing the same Numba inference path as compilation (extracted into the helpers _state_array_numba_types and _infer_stateful_return_type, now shared with _compile_stateful_op).
- iterators/_transform.py: for a stateful transform op, append the op's state bytes after the underlying iterator's state (with alignment padding; the underlying state stays at offset 0 where its child ops expect it) and pass static_cast<char*>(state) + op_state_offset as the op's state argument in the generated input/output dereference. This mirrors the existing state-composition pattern in PermutationIterator / compose_iterator_states. Stateless ops are unchanged.
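The state layout described above can be sketched in plain Python. The helper names `align_up` and `compose_state` are hypothetical and chosen for this illustration; they are not the functions in _transform.py, and the byte values are made up.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (alignment > 0)."""
    return (offset + alignment - 1) // alignment * alignment


def compose_state(underlying: bytes, op_state: bytes, op_alignment: int):
    """Return (combined_state, op_state_offset).

    The underlying iterator's bytes stay at offset 0; zero padding is added
    so that the op's bytes start at a multiple of op_alignment.
    """
    op_state_offset = align_up(len(underlying), op_alignment)
    padding = bytes(op_state_offset - len(underlying))
    return underlying + padding + op_state, op_state_offset


# A 4-byte underlying state (e.g. an int32 counter) plus one 8-byte pointer.
underlying = b"\x01\x00\x00\x00"
op_bytes = b"\xaa" * 8
combined, offset = compose_state(underlying, op_bytes, op_alignment=8)
```

In this sketch the op's state starts at byte 8 of a 16-byte buffer, and the slice `combined[offset:]` plays the role of the pointer `static_cast<char*>(state) + op_state_offset` passed to the op.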

This lets index-gather and multi-array transforms fuse directly into reductions and histograms without first materializing an intermediate array.
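As a rough CPU analogy of that fusion (not the cuda.compute API), a lazy `map` over a closure that captures an array can feed a reduction without building the gathered list first. The names `make_gather`, `values`, and `indices` are hypothetical.

```python
from functools import reduce
import operator

values = [1.5, 2.5, 3.5, 4.5]  # stands in for a captured device array
indices = [3, 0, 3, 1]         # stands in for the underlying input sequence


def make_gather(captured):
    """Return an op that closes over `captured`, like a stateful op."""
    def op(i):
        return captured[i]
    return op


# map() is lazy: gathered values are produced one at a time during the
# reduction, so no intermediate gathered list is materialized.
lazy_gather = map(make_gather(values), indices)
total = reduce(operator.add, lazy_gather, 0.0)
```

The reduction consumes `values[3] + values[0] + values[3] + values[1]` element by element, which is the role the TransformIterator plays as an algorithm input.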

Tests

Added to tests/compute/test_reduce.py:

- stateful transform op as a reduce input iterator (annotated and inferred-return-type variants),
- multiple captured arrays,
- cache reuse with identical op bytecode but different captured arrays,
- nested transform iterators (stateful inner under stateless outer),
- stateful op as a TransformOutputIterator.

Verified on an NVIDIA RTX 4070 (CUDA 13.3): the 6 new tests pass, and the reduce/select/segmented_reduce/histogram/iterators/transform/permutation/zip suites pass with no regressions.

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

A stateful op (a callable capturing a CUDA device array) works as a direct algorithm op but failed inside a TransformIterator used as an algorithm input: without a return annotation it raised "get_return_type not implemented for _StatefulOp", and with one it crashed at launch with cudaErrorLaunchFailure.

The op's compiled device function takes its packed state pointers as a leading argument, but the iterator's generated dereference called it as a stateless op and never carried the op's state.

Implement _StatefulOp.get_return_type, and compose the op's state after the underlying iterator's state (keeping the underlying at offset 0) so the dereference can pass the op its state pointer, mirroring PermutationIterator's state composition.

Closes NVIDIA#9627

Signed-off-by: nethum529 <nethumweerasinghe.nw@gmail.com>

copy-pr-bot · 2026-06-29T18:55:44Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-29T19:03:28Z

📝 Walkthrough

Summary by CodeRabbit

New Features
- Transform iterators now support operations that carry captured device-array state.
- Stateful transform operations can be used in both input and output iterator paths.
- Return-type inference was improved for stateful transforms, including cases without explicit annotations.
Bug Fixes
- Fixed reduction behavior when multiple iterators use similar transform logic but different captured state.
- Improved correctness for nested transform iterators and state-aware iterator composition.

Walkthrough

Refactors stateful-op return-type inference in _jit.py into two shared helpers (_state_array_numba_types, _infer_stateful_return_type) used by both _compile_stateful_op and _StatefulOp.get_return_type. Extends TransformIterator to pack the compiled op's state into the iterator's combined state buffer at an alignment-computed offset, passing a derived pointer at dereference time. Six tests validate the full scenarios.

Changes

Stateful TransformIterator Support

| Layer / File(s) | Summary |
| --- | --- |
| Stateful return-type inference helpers `python/cuda_cccl/cuda/compute/_jit.py` | Extracts `_state_array_numba_types` and `_infer_stateful_return_type`; both `_compile_stateful_op` and `_StatefulOp.get_return_type` now delegate to them instead of duplicating logic. |
| TransformIterator stateful state packing and deref `python/cuda_cccl/cuda/compute/iterators/_transform.py` | Adds `_op_state_offset` slot; `__init__` conditionally appends packed op state after the underlying iterator state and records the offset; `_make_input_deref_op` and `_make_output_deref_op` branch on `is_stateful` to compute and pass `op_state` pointer in the CUDA device wrapper. |
| Tests `python/cuda_cccl/tests/compute/test_reduce.py` | Six new tests covering single/multiple captured arrays, inferred return type, cache reuse, nested stateful iterators, and stateful output iterator. |

Assessment against linked issues

| Objective | Addressed | Explanation |
| --- | --- | --- |
| `NotImplementedError: get_return_type not implemented for _StatefulOp` when no return annotation [`#9627`] | ✅ | |
| `cudaErrorLaunchFailure` when running a stateful op in a `TransformIterator` [`#9627`] | ✅ | |

Out-of-scope changes

No out-of-scope changes identified.

important: In _transform.py lines 98–121, alignment padding is computed between the underlying iterator state and the op state. The code must guarantee that _op_state_offset is a multiple of the op state's alignment requirement. Verify that the padding calculation matches op_state_alignment exactly, because an off-by-one in the modulo arithmetic would silently misalign the pointer on device, causing the same cudaErrorLaunchFailure the PR is fixing.
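One way to guard against the modulo mistake described above is a brute-force property check over small sizes and alignments. The sketch below is independent of the PR's code: `padding_correct`, `padding_buggy`, and `check` are hypothetical names, and the buggy variant is just one plausible slip (using the remainder itself as the padding).

```python
def padding_correct(size: int, alignment: int) -> int:
    """Bytes of padding so that size + padding is a multiple of alignment."""
    return (-size) % alignment


def padding_buggy(size: int, alignment: int) -> int:
    """Plausible mistake: uses the remainder itself as the padding."""
    return size % alignment


def check(padding_fn, max_size: int = 64, alignments=(1, 2, 4, 8, 16)) -> bool:
    """True if every resulting offset is aligned and the padding is minimal."""
    for alignment in alignments:
        for size in range(max_size + 1):
            pad = padding_fn(size, alignment)
            offset = size + pad
            if offset % alignment != 0 or not 0 <= pad < alignment:
                return False
    return True
```

`check(padding_correct)` holds for every pair tested, while `check(padding_buggy)` fails, for example at size 1 with alignment 4, where the offset becomes 2.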

suggestion: _StatefulOp.get_return_type (lines 987–995) now transforms the function and registers gpu_structs again, duplicating work already done during compilation. Consider caching the transformed function or the inferred TypeDescriptor on the _StatefulOp instance to avoid the redundant re-compilation on repeated get_return_type calls.
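A minimal sketch of the caching idea, assuming the inferred type can simply be stored on the instance. `StatefulOpSketch` and its members are hypothetical and do not mirror the real _StatefulOp; the inference step here is a cheap placeholder for the transform and Numba inference work.

```python
class StatefulOpSketch:
    """Hypothetical stand-in for an op whose return type is costly to infer."""

    def __init__(self, func):
        self._func = func
        self._return_type = None  # filled in by the first get_return_type call
        self.inference_runs = 0   # instrumentation for this example only

    def _infer_return_type(self):
        # Placeholder for the expensive step (function transform + inference).
        self.inference_runs += 1
        return type(self._func(0)).__name__

    def get_return_type(self):
        # Infer once, then serve the cached result on later calls.
        if self._return_type is None:
            self._return_type = self._infer_return_type()
        return self._return_type


op = StatefulOpSketch(lambda i: i * 2.0)
first = op.get_return_type()
second = op.get_return_type()
```

After both calls, `op.inference_runs` is 1, so the second call did not repeat the inference.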

suggestion: test_transform_iterator_stateful_op_cache_reuse exercises that two TransformIterator instances with different captured arrays but identical bytecode produce correct results. It would be worth also asserting that the underlying reducer is reused (e.g., by checking a cache-hit counter or that both reductions share the compiled kernel), not just that results are numerically correct — otherwise the test only validates correctness, not the cache-reuse property it claims to cover.
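The kind of assertion suggested above might look like the following sketch, which uses `functools.lru_cache` as a stand-in for the build cache. `build_reducer` and `make_op` are hypothetical, and keying on the op's bytecode alone is an assumption made for the illustration, not a description of how cuda.compute keys its cache.

```python
import functools


@functools.lru_cache(maxsize=None)
def build_reducer(op_code: bytes):
    """Hypothetical build step, cached on the op's bytecode only."""
    return object()  # stands in for a compiled kernel


def make_op(captured):
    """Return closures that share bytecode but capture different arrays."""
    def op(i):
        return captured[i]
    return op


op_a = make_op([1, 2, 3])
op_b = make_op([4, 5, 6])

reducer_a = build_reducer(op_a.__code__.co_code)
reducer_b = build_reducer(op_b.__code__.co_code)
info = build_reducer.cache_info()
```

A test can then assert both that the results are numerically correct and that `reducer_a is reducer_b` with one cache hit and one miss, which checks the reuse property directly.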


coderabbitai

🧹 Nitpick comments (1)

python/cuda_cccl/cuda/compute/_jit.py (1)

846-850: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

suggestion: Captured state arrays are always rebuilt as flat 1-D buffers (shape=len(state_array), strides=itemsize). Reject non-1D CUDA arrays here or document the 1-D-only contract; otherwise a multidimensional capture compiles against the wrong parameter shape.
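A rejection check along these lines could inspect the shape exposed through the CUDA Array Interface. This is a sketch only: `FakeDeviceArray` and `require_1d` are hypothetical, the pointer is a dummy value, and the real code in _jit.py may read the shape differently.

```python
class FakeDeviceArray:
    """Hypothetical object exposing a CUDA Array Interface dictionary."""

    def __init__(self, shape, typestr="<f4"):
        self.__cuda_array_interface__ = {
            "shape": tuple(shape),
            "typestr": typestr,
            "data": (0, False),  # (pointer, read_only); dummy pointer here
            "version": 3,
        }


def require_1d(array) -> int:
    """Return the length of a 1-D captured array, or raise ValueError."""
    shape = array.__cuda_array_interface__["shape"]
    if len(shape) != 1:
        raise ValueError(f"captured arrays must be 1-D, got shape {shape}")
    return shape[0]
```

With this check, a capture of shape `(8,)` is accepted and a capture of shape `(2, 4)` is rejected up front instead of being compiled against a flat parameter shape.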

Source: Path instructions

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e71a519d-89cb-4d9a-9d9d-091d49fcff60

📥 Commits

Reviewing files that changed from the base of the PR and between e1cb696 and 3b049a9.

📒 Files selected for processing (3)

python/cuda_cccl/cuda/compute/_jit.py
python/cuda_cccl/cuda/compute/iterators/_transform.py
python/cuda_cccl/tests/compute/test_reduce.py

nethum529 · 2026-06-30T02:37:35Z

The failing pre-commit.ci check is the mypy hook, unrelated to this PR. numpy is unpinned and 2.5.0's PEP 695 type stubs break mypy under python_version = "3.10":

numpy/__init__.pyi:737: error: Type statement is only supported in Python 3.12 and greater

nethum529 requested a review from a team as a code owner June 29, 2026 18:55

nethum529 requested a review from tpn June 29, 2026 18:55

github-project-automation Bot added this to CCCL Jun 29, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 29, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 29, 2026

coderabbitai Bot reviewed Jun 29, 2026

nethum529 wants to merge 1 commit into NVIDIA:main from nethum529:fix/issue-9627-stateful-transform-iterator
