feat: optimized QASYMM8_SIGNED->F32 direct convolution path #1298

Open
alvoron wants to merge 3 commits into ARM-software:main from alvoron:alvoron_direct_i8_f32_conv

alvoron commented Jun 17, 2026

Motivation

Inference frameworks (e.g., OpenVINO) often run int8-quantized activations with float32 output. Previously this required a multi-step chain: quantized GEMM accumulating into int32, a separate GEMMLowp output-stage operator, and a dequantize/cast step. This adds a single-kernel path that takes QASYMM8_SIGNED input and weights and writes F32 output directly, reducing memory traffic and operator overhead.

Dependency

Requires #1297 to be merged first.
That branch provides the dequant_a_offset/dequant_b_offset fields on AsmGemmInfo, the relaxed type guard in CpuGemmAssemblyDispatch::validate (removing the "Only S32 output" restriction for S8 input), and the create_arm_gemm_dequant changes that pass offsets into DequantizeFloat with the correct combined scale. Without it this branch does not build correctly standalone.

Technical approach

The existing DequantizeFloat output stage is extended with a_offset and b_offset fields. GemmInterleaved is taught to:

  1. Pack row sums of A into the A panel (via transforms_quantized with multiplier = 1) when b_offset != 0, for per-row offset correction.
  2. Compute column sums of B (weights) during set_pretransposed_B_array when a_offset != 0, stored alongside the pretransposed weight buffer.
  3. Apply all three correction terms in dequantize_block_32<float> at merge time.
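For readers unfamiliar with the two auxiliary sums in steps 1 and 2, here is a minimal scalar sketch. It assumes plain row-major matrices and uses hypothetical helper names (`row_sums`, `col_sums`); the actual code packs these sums into the interleaved A panel and the pretransposed B buffer, which this sketch does not attempt to reproduce.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: one int32 sum per row of A (M x K, row-major).
// These are the values multiplied by b_offset in the correction.
std::vector<int32_t> row_sums(const std::vector<int8_t> &a, std::size_t M, std::size_t K)
{
    assert(a.size() == M * K);
    std::vector<int32_t> sums(M, 0);
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t k = 0; k < K; ++k)
            sums[m] += a[m * K + k];
    return sums;
}

// Hypothetical helper: one int32 sum per column of B (K x N, row-major).
// These are the values multiplied by a_offset in the correction.
std::vector<int32_t> col_sums(const std::vector<int8_t> &b, std::size_t K, std::size_t N)
{
    assert(b.size() == K * N);
    std::vector<int32_t> sums(N, 0);
    for (std::size_t k = 0; k < K; ++k)
        for (std::size_t n = 0; n < N; ++n)
            sums[n] += b[k * N + n];
    return sums;
}
```

Because B holds the weights, its column sums can be computed once at pretranspose time, while the row sums of A have to be recomputed for every input.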

K-blocking is conservatively disabled for the DequantizeFloat + MergeStep case (matching the existing Requantize32 policy) since row sums must cover full K.

CpuGemmDirectConv2d detects the QASYMM8_SIGNED->F32 path by type, reads zero-points from QuantizationInfo, and passes them to the dispatch via AsmGemmInfo. CpuConv2d::get_convolution_method automatically routes NHWC QASYMM8_SIGNED->F32 convolutions to GEMM_CONV2D (no explicit flag).

Also fixes a latent bug in dequantize_block_32<float>: the expression val * qp.scale (integer × float, silently truncating for large accumulators) is corrected to static_cast<float>(val) * qp.scale.

Asymmetric correction formula

out[m,n] = (raw_acc[m,n]
− a_offset · Σ_k b[k,n] (per-column, from col sums)
− b_offset · Σ_k a[m,k] (per-row, from row sums)
+ a_offset · b_offset · K (cross-term)
) · (scale_a · scale_b) + bias[n]
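
The formula follows from expanding Σ_k (a[m,k] − a_offset) · (b[k,n] − b_offset). A small scalar check of that identity is sketched below. The function names are illustrative only and are not part of this PR; `scale` stands for the combined scale_a · scale_b, and the layout is assumed to be plain row-major.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference: subtract the zero-points first, then accumulate.
float reference_output(const std::vector<int8_t> &a, const std::vector<int8_t> &b,
                       std::size_t K, std::size_t N, std::size_t m, std::size_t n,
                       int32_t a_offset, int32_t b_offset, float scale, float bias)
{
    assert(K > 0 && a.size() % K == 0 && b.size() == K * N);
    int32_t acc = 0;
    for (std::size_t k = 0; k < K; ++k)
        acc += (a[m * K + k] - a_offset) * (b[k * N + n] - b_offset);
    return static_cast<float>(acc) * scale + bias;
}

// Corrected: accumulate raw int8 products, then apply the three correction terms.
float corrected_output(const std::vector<int8_t> &a, const std::vector<int8_t> &b,
                       std::size_t K, std::size_t N, std::size_t m, std::size_t n,
                       int32_t a_offset, int32_t b_offset, float scale, float bias)
{
    assert(K > 0 && a.size() % K == 0 && b.size() == K * N);
    int32_t raw_acc = 0, row_sum = 0, col_sum = 0;
    for (std::size_t k = 0; k < K; ++k)
    {
        const int32_t av = a[m * K + k];
        const int32_t bv = b[k * N + n];
        raw_acc += av * bv;
        row_sum += av; // Σ_k a[m,k]
        col_sum += bv; // Σ_k b[k,n]
    }
    const int32_t corrected = raw_acc - a_offset * col_sum - b_offset * row_sum +
                              a_offset * b_offset * static_cast<int32_t>(K);
    return static_cast<float>(corrected) * scale + bias;
}
```

Both functions reach the same int32 value before scaling, so their float results match exactly rather than approximately.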

alvoron added 3 commits June 19, 2026 14:43
Three gaps in the assembly dispatch layer prevented QASYMM8_SIGNED input
from producing F32 output:

1. has_opt_impl() had no branch for F32 output when input is S8/
   QASYMM8_SIGNED, causing spurious kernel-not-found errors.  Add a
   DequantizeFloat branch mirroring the existing S32 branch.

2. validate() rejected F32 output for QASYMM8_SIGNED input because it
   had no explicit allowance for that combination.  Add a guard that
   permits QASYMM8_SIGNED/S32/F32 as output types (matching the already-
   existing QASYMM8 guard).

3. AsmGemmInfo gains dequant_a_offset / dequant_b_offset fields so that
   callers can supply quantization zero-points to create_arm_gemm_dequant
   without touching existing callers.

Also fix the __aarch64_ typo in the DequantFP32_SupportedTypes test guard
so that the test now actually executes on AArch64 targets.

Signed-off-by: Aleksandr Voron <aleksandr.voron@intel.com>
…te metadata

The single-kernel QASYMM8_SIGNED->F32 direct convolution path (CpuGemmDirectConv2d
via the arm_gemm DequantizeFloat output stage) produced numerically wrong NHWC
results whenever the input zero-point (a_offset) and weight zero-point (b_offset)
were both non-zero and the per-section K depth was rounded up (e.g. convolutions
with a small channel count and a kernel larger than 1x1). NCHW was unaffected as
it does not use this path.

Root cause: the dequant offset correction computes
  scale * (acc - a_offset*sum_b[n] - b_offset*sum_a[m] + a_offset*b_offset*K)
but the a_offset*b_offset*K cross-term used kern_k, the *padded* accumulation
depth (_Ksections * roundup(_Ksize, k_unroll)), instead of the real number of
MAC terms per output (_Ksize * _Ksections). For rounded-up per-section K this
over-counts the cross-term, adding a constant a_offset*b_offset*(kern_k-K)*scale
error to every output element. 1x1 convolutions with K a multiple of k_unroll
were correct, which is why only a subset of NHWC cases failed.

Fix: fold the cross-term into col_bias by subtracting b_offset*K_real (real K)
from each raw weight column sum in GemmInterleaved::requantize_bias, so
  -a_offset * col_bias[n] * scale
now yields both the a_offset correction and the cross-term with the real K. The
merge kernel dequantize_block_32 no longer reconstructs K, removing its reliance
on the padded kern_k.
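
A scalar illustration of the bug and the fix, as a sketch only. It assumes a single K section and that the padded lanes are zero-filled, so they add nothing to the accumulator or to either sum. All names are hypothetical and do not correspond to the arm_gemm code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Round k up to the next multiple of unroll (the padded depth).
std::size_t roundup(std::size_t k, std::size_t unroll)
{
    return (k + unroll - 1) / unroll * unroll;
}

struct DotTerms
{
    int32_t acc;     // raw int8 x int8 accumulator
    int32_t row_sum; // sum of the A row
    int32_t col_sum; // sum of the B column
};

// Accumulate over the padded depth; padding lanes are assumed to be zero.
DotTerms accumulate_padded(const std::vector<int8_t> &a_row, const std::vector<int8_t> &b_col,
                           std::size_t k_unroll)
{
    assert(a_row.size() == b_col.size());
    const std::size_t k_real = a_row.size();
    const std::size_t kern_k = roundup(k_real, k_unroll);
    DotTerms t{0, 0, 0};
    for (std::size_t k = 0; k < kern_k; ++k)
    {
        const int32_t av = k < k_real ? a_row[k] : 0;
        const int32_t bv = k < k_real ? b_col[k] : 0;
        t.acc += av * bv;
        t.row_sum += av;
        t.col_sum += bv;
    }
    return t;
}

// Old scheme: the merge step rebuilds the cross-term from a depth value.
// Passing the padded kern_k here reproduces the over-count.
float dequant_explicit_cross_term(const DotTerms &t, int32_t a_offset, int32_t b_offset,
                                  std::size_t depth, float scale)
{
    const int32_t v = t.acc - a_offset * t.col_sum - b_offset * t.row_sum +
                      a_offset * b_offset * static_cast<int32_t>(depth);
    return static_cast<float>(v) * scale;
}

// New scheme: fold the cross-term into the column bias using the real K.
int32_t fold_col_bias(int32_t col_sum, int32_t b_offset, std::size_t k_real)
{
    return col_sum - b_offset * static_cast<int32_t>(k_real);
}

// The merge step no longer needs to know K at all.
float dequant_folded(const DotTerms &t, int32_t col_bias, int32_t a_offset, int32_t b_offset,
                     float scale)
{
    const int32_t v = t.acc - a_offset * col_bias - b_offset * t.row_sum;
    return static_cast<float>(v) * scale;
}
```

With K = 3, k_unroll = 4, a_offset = 3, b_offset = -2 and scale = 0.5, the padded depth is 4 and the old scheme is off by a_offset*b_offset*(kern_k-K)*scale = -3.0, while the folded form matches the result obtained with the real K.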

Also fix the validate/configure metadata mismatch: CpuGemmDirectConv2d::validate()
built AsmGemmInfo without dequant_a_offset/dequant_b_offset while configure() set
them, so validation and execution used inequivalent metadata. Both now derive
AsmGemmInfo from a single shared helper (build_asm_metadata) to prevent drift.
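
The shared-helper pattern, sketched with stand-in types. `TensorQuantInfo`, `GemmMetadata`, `build_metadata`, `validate_conv` and `configure_conv` are hypothetical names used for illustration; they are not the real QuantizationInfo, AsmGemmInfo or build_asm_metadata signatures.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for per-tensor quantization parameters.
struct TensorQuantInfo
{
    float   scale;
    int32_t offset;
};

// Hypothetical stand-in for the metadata handed to the GEMM dispatch.
struct GemmMetadata
{
    int32_t dequant_a_offset;
    int32_t dequant_b_offset;
    float   dequant_scale;
};

// Single source of truth: both validate and configure call this.
GemmMetadata build_metadata(const TensorQuantInfo &src, const TensorQuantInfo &weights)
{
    GemmMetadata info{};
    info.dequant_a_offset = src.offset;
    info.dequant_b_offset = weights.offset;
    info.dequant_scale    = src.scale * weights.scale;
    return info;
}

// Validation inspects exactly the metadata that configure will later use.
bool validate_conv(const TensorQuantInfo &src, const TensorQuantInfo &weights)
{
    const GemmMetadata info = build_metadata(src, weights);
    return info.dequant_scale > 0.0f;
}

GemmMetadata configure_conv(const TensorQuantInfo &src, const TensorQuantInfo &weights)
{
    assert(validate_conv(src, weights));
    return build_metadata(src, weights);
}
```

Routing both entry points through one builder means a field added later cannot be set in one path and forgotten in the other.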

Signed-off-by: Aleksandr Voron <aleksandr.voron@intel.com>
@alvoron alvoron force-pushed the alvoron_direct_i8_f32_conv branch from 6e974d7 to 9296252 Compare July 1, 2026 17:40