feat: enable QASYMM8_SIGNED to F32 assembly dequantization by morgolock · Pull Request #1302 · ARM-software/ComputeLibrary

morgolock · 2026-07-01T11:38:39Z

Enable QASYMM8_SIGNED input and weights to use the F32 DequantizeFloat assembly output stage, including the direct convolution selection path.

Propagate input and weight zero-points plus the unrounded mathematical K depth so asymmetric offset correction uses the real GEMM/convolution depth rather than arm_gemm's padded internal K.
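To illustrate why the depth matters, here is a minimal sketch of asymmetric offset correction for a single dot product. It is not the arm_gemm implementation: the function names `dequantize_dot` and `reference_dot`, the parameter names, and the assumption that padded K entries are zero-filled are all hypothetical and chosen only for this example. The raw accumulator and the row/column sums are unaffected by zero padding, but the constant term `K * a_offset * b_offset` is only correct when K is the real depth.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a ComputeLibrary API): dequantize one int8 dot
// product to float. The vectors may be zero-padded beyond k_real, as an
// internally padded GEMM depth would be (an assumption for this sketch).
float dequantize_dot(const std::vector<std::int8_t> &a,
                     const std::vector<std::int8_t> &b,
                     std::size_t k_real,     // unrounded mathematical depth
                     std::int32_t a_offset,  // input zero-point
                     std::int32_t b_offset,  // weight zero-point
                     float scale)            // a_scale * b_scale
{
    std::int32_t acc = 0, sum_a = 0, sum_b = 0;
    for (std::size_t k = 0; k < a.size(); ++k) // iterates over the padded depth
    {
        acc += static_cast<std::int32_t>(a[k]) * static_cast<std::int32_t>(b[k]);
        sum_a += a[k];
        sum_b += b[k];
    }
    // sum((a - za) * (b - zb)) = acc - zb * sum_a - za * sum_b + K * za * zb
    const std::int32_t corrected = acc - b_offset * sum_a - a_offset * sum_b
                                   + static_cast<std::int32_t>(k_real) * a_offset * b_offset;
    return scale * static_cast<float>(corrected);
}

// Straightforward reference over the real depth only, for comparison.
float reference_dot(const std::vector<std::int8_t> &a,
                    const std::vector<std::int8_t> &b,
                    std::size_t k_real,
                    std::int32_t a_offset,
                    std::int32_t b_offset,
                    float scale)
{
    std::int32_t acc = 0;
    for (std::size_t k = 0; k < k_real; ++k)
    {
        acc += (a[k] - a_offset) * (b[k] - b_offset);
    }
    return scale * static_cast<float>(acc);
}
```

With a real depth of 3 padded to 4, passing 3 as `k_real` matches the reference, while passing the padded depth 4 adds one extra `a_offset * b_offset` term and gives a different result.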

Add NEON validation coverage for the direct I8S8F32 convolution path.

Performance was checked on a Cortex-A76, pinned to CPU 4 with one thread. The change is neutral on large workloads and improves the small NHWC signed int8 to F32 convolution path: QASYMM8_SIGNED/RunSmallDequantizeF32 NHWC/no-activation cases show a 1.36x geomean speedup over github/main, with the tiniest single-batch cases improving by roughly 1.45x to 3.13x.

Change-Id: Ie723d3da629d48de6de737c425bf7ad48e0f7feb

Enable QASYMM8_SIGNED input and weights to use the F32 DequantizeFloat assembly output stage, including the direct convolution selection path.

Propagate input and weight zero-points plus the unrounded mathematical K depth so asymmetric offset correction uses the real GEMM/convolution depth rather than arm_gemm's padded internal K.

Fix the interleaved no-merge DequantizeFloat scheduler stride so kernels that do not pack row-sum slots advance between A panels correctly.

Guard the symmetric no-merge dequant support helper with the SME/SME2 feature macros. The helper is only referenced when those no-merge dequantized kernels are compiled, so non-SME builds must not define it unconditionally under Werror.

Add NEON validation coverage for the direct I8S8F32 convolution path.

Performance was checked on a Cortex-A76, pinned to CPU 4 with one thread. The change is neutral on large workloads and improves the small NHWC signed int8 to F32 convolution path: QASYMM8_SIGNED/RunSmallDequantizeF32 NHWC/no-activation cases show a 1.36x geomean speedup over github/main, with the tiniest single-batch cases improving by roughly 1.45x to 3.13x.

Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com>
Change-Id: Ie723d3da629d48de6de737c425bf7ad48e0f7feb

morgolock force-pushed the pr/asm_dequant_f32 branch 2 times, most recently from 72829b3 to 4848723, on July 1, 2026 14:17

morgolock force-pushed the pr/asm_dequant_f32 branch from 4848723 to 2ccd5a7 on July 1, 2026 15:02

morgolock wants to merge 1 commit into main from pr/asm_dequant_f32
