feat: enable QASYMM8_SIGNED to F32 assembly dequantization #1302

Open
morgolock wants to merge 1 commit into main from pr/asm_dequant_f32

@morgolock

Contributor

Enable QASYMM8_SIGNED input and weights to use the F32 DequantizeFloat assembly output stage, including the direct convolution selection path.

Propagate input and weight zero-points plus the unrounded mathematical K depth so asymmetric offset correction uses the real GEMM/convolution depth rather than arm_gemm's padded internal K.
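
The reason the real K depth matters can be sketched with a single dot product. The following is a minimal illustration, not the arm_gemm implementation: `DotSums`, `accumulate`, and `dequantize_dot` are hypothetical names, and it assumes per-tensor scales and zero-points. Expanding sum((a - a_zp) * (b - b_zp)) gives a term K * a_zp * b_zp, so K must count only the real elements.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: the raw int32 accumulator plus the sums
// needed for asymmetric offset correction.
struct DotSums
{
    int32_t raw;   // sum of a[k] * b[k]
    int32_t a_sum; // sum of a[k]
    int32_t b_sum; // sum of b[k]
};

// Accumulate over the real (unpadded) elements only.
// Assumes a and b have the same length.
DotSums accumulate(const std::vector<int8_t> &a, const std::vector<int8_t> &b)
{
    DotSums s{0, 0, 0};
    for (std::size_t k = 0; k < a.size(); ++k)
    {
        s.raw += static_cast<int32_t>(a[k]) * static_cast<int32_t>(b[k]);
        s.a_sum += a[k];
        s.b_sum += b[k];
    }
    return s;
}

// sum((a - a_zp) * (b - b_zp))
//   = raw - b_zp * sum(a) - a_zp * sum(b) + K * a_zp * b_zp
// k_depth must be the mathematical depth, not a padded internal depth.
float dequantize_dot(const DotSums &s, int32_t a_zp, int32_t b_zp, int32_t k_depth,
                     float a_scale, float b_scale)
{
    const int32_t corrected =
        s.raw - b_zp * s.a_sum - a_zp * s.b_sum + k_depth * a_zp * b_zp;
    return static_cast<float>(corrected) * a_scale * b_scale;
}
```

For a = {1, -2, 3}, b = {4, 5, -6}, a_zp = 2 and b_zp = -1, the corrected accumulator is -34 with the real depth K = 3, which matches summing (a - a_zp) * (b - b_zp) element by element. Passing a padded depth of 4 instead yields -36, so the dequantized result is wrong whenever both zero-points are non-zero.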

Add NEON validation coverage for the direct I8S8F32 convolution path.

Performance was checked on a Cortex-A76, pinned to CPU 4 with one thread. The change is neutral on large workloads and improves the small NHWC signed int8 to F32 convolution path: QASYMM8_SIGNED/RunSmallDequantizeF32 NHWC/no-activation cases show a 1.36x geomean speedup over github/main, with the tiniest single-batch cases improving by roughly 1.45x to 3.13x.

Change-Id: Ie723d3da629d48de6de737c425bf7ad48e0f7feb

@morgolock force-pushed the pr/asm_dequant_f32 branch 2 times, most recently from 72829b3 to 4848723 on July 1, 2026 14:17
Enable QASYMM8_SIGNED input and weights to use the F32 DequantizeFloat assembly output stage, including the direct convolution selection path.

Propagate input and weight zero-points plus the unrounded mathematical K depth so asymmetric offset correction uses the real GEMM/convolution depth rather than arm_gemm's padded internal K.

Fix the interleaved no-merge DequantizeFloat scheduler stride so kernels that do not pack row-sum slots advance between A panels correctly.

Guard the symmetric no-merge dequant support helper with the SME/SME2 feature macros. The helper is only referenced when those no-merge dequantized kernels are compiled, so non-SME builds must not define it unconditionally under Werror.
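
The guard pattern described here can be sketched as follows. This is an illustration only: `EXAMPLE_ENABLE_SME`, `EXAMPLE_ENABLE_SME2`, `supports_symmetric_no_merge_dequant`, and `can_use_no_merge_dequant` are hypothetical names, and the helper body is a placeholder rather than the check used in the patch. The point is that a file-local helper with no callers triggers `-Wunused-function`, which `-Werror` turns into a build failure, so the definition sits behind the same macros as its only call site.

```cpp
// Hypothetical feature macros standing in for the build's SME/SME2 switches.
#if defined(EXAMPLE_ENABLE_SME) || defined(EXAMPLE_ENABLE_SME2)
namespace
{
// Only defined when a caller exists; placeholder logic for illustration.
bool supports_symmetric_no_merge_dequant(int a_zero_point, int b_zero_point)
{
    return a_zero_point == 0 && b_zero_point == 0;
}
} // namespace
#endif

bool can_use_no_merge_dequant(int a_zero_point, int b_zero_point)
{
#if defined(EXAMPLE_ENABLE_SME) || defined(EXAMPLE_ENABLE_SME2)
    return supports_symmetric_no_merge_dequant(a_zero_point, b_zero_point);
#else
    // Without the feature macros the kernels are not compiled, so the
    // helper is neither defined nor referenced.
    (void)a_zero_point;
    (void)b_zero_point;
    return false;
#endif
}
```

Built without either macro, the translation unit contains no unused file-local function, so it compiles cleanly under `-Wall -Werror`.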

Add NEON validation coverage for the direct I8S8F32 convolution path.

Performance was checked on a Cortex-A76, pinned to CPU 4 with one thread. The change is neutral on large workloads and improves the small NHWC signed int8 to F32 convolution path: QASYMM8_SIGNED/RunSmallDequantizeF32 NHWC/no-activation cases show a 1.36x geomean speedup over github/main, with the tiniest single-batch cases improving by roughly 1.45x to 3.13x.

Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com>

Change-Id: Ie723d3da629d48de6de737c425bf7ad48e0f7feb
@morgolock force-pushed the pr/asm_dequant_f32 branch from 4848723 to 2ccd5a7 on July 1, 2026 15:02