feat: optimized QASYMM8_SIGNED->F32 direct convolution path #1298
Open
alvoron wants to merge 3 commits into
Two gaps in the assembly dispatch layer prevented QASYMM8_SIGNED input from producing F32 output:

1. has_opt_impl() had no branch for F32 output when input is S8/QASYMM8_SIGNED, causing spurious kernel-not-found errors. Add a DequantizeFloat branch mirroring the existing S32 branch.
2. validate() rejected F32 output for QASYMM8_SIGNED input because it had no explicit allowance for that combination. Add a guard that permits QASYMM8_SIGNED/S32/F32 as output types (matching the already-existing QASYMM8 guard).
3. AsmGemmInfo gains dequant_a_offset / dequant_b_offset fields so that callers can supply quantization zero-points to create_arm_gemm_dequant without touching existing callers.

Also fix the __aarch64_ typo in the DequantFP32_SupportedTypes test guard so that the test now actually executes on AArch64 targets.

Signed-off-by: Aleksandr Voron <aleksandr.voron@intel.com>
…te metadata

The single-kernel QASYMM8_SIGNED->F32 direct convolution path (CpuGemmDirectConv2d via the arm_gemm DequantizeFloat output stage) produced numerically wrong NHWC results whenever the input zero-point (a_offset) and weight zero-point (b_offset) were both non-zero and the per-section K depth was rounded up (e.g. convolutions with a small channel count and a kernel larger than 1x1). NCHW was unaffected as it does not use this path.

Root cause: the dequant offset correction computes

    scale * (acc - a_offset*sum_b[n] - b_offset*sum_a[m] + a_offset*b_offset*K)

but the a_offset*b_offset*K cross-term used kern_k, the *padded* accumulation depth (_Ksections * roundup(_Ksize, k_unroll)), instead of the real number of MAC terms per output (_Ksize * _Ksections). For rounded-up per-section K this over-counts the cross-term, adding a constant a_offset*b_offset*(kern_k-K)*scale error to every output element. 1x1 convolutions with K a multiple of k_unroll were correct, which is why only a subset of NHWC cases failed.

Fix: fold the cross-term into col_bias by subtracting b_offset*K_real (real K) from each raw weight column sum in GemmInterleaved::requantize_bias, so -a_offset * col_bias[n] * scale now yields both the a_offset correction and the cross-term with the real K. The merge kernel dequantize_block_32 no longer reconstructs K, removing its reliance on the padded kern_k.

Also fix the validate/configure metadata mismatch: CpuGemmDirectConv2d::validate() built AsmGemmInfo without dequant_a_offset/dequant_b_offset while configure() set them, so validation and execution used inequivalent metadata. Both now derive AsmGemmInfo from a single shared helper (build_asm_metadata) to prevent drift.

Signed-off-by: Aleksandr Voron <aleksandr.voron@intel.com>
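The padded-K over-count and the folded column bias described in this commit message can be illustrated with a small standalone sketch. This is not the arm_gemm code: the helper names (roundup, fold_col_bias, dequant, padded_k_error) and their signatures are hypothetical and chosen only for illustration, and the sketch assumes plain int8 weights stored as a K x N matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: round v up to the next multiple of m,
// mirroring how a per-section K depth gets padded to k_unroll.
inline int roundup(int v, int m)
{
    assert(m > 0);
    return ((v + m - 1) / m) * m;
}

// Constant error added to every output when the cross-term uses the
// padded depth kern_k instead of the real depth K = k_size * k_sections.
inline float padded_k_error(int k_size, int k_sections, int k_unroll,
                            int32_t a_offset, int32_t b_offset, float scale)
{
    const int kern_k = k_sections * roundup(k_size, k_unroll);
    const int k_real = k_sections * k_size;
    return static_cast<float>(a_offset * b_offset * (kern_k - k_real)) * scale;
}

// Fold the cross-term into the column bias, as the fix describes:
// col_bias[n] = sum_k b[k][n] - b_offset * K_real.
inline std::vector<int32_t> fold_col_bias(const std::vector<std::vector<int8_t>> &b,
                                          int32_t b_offset)
{
    const int32_t k_real = static_cast<int32_t>(b.size());
    const std::size_t n_cols = b.empty() ? 0 : b[0].size();
    std::vector<int32_t> col_bias(n_cols, 0);
    for (const auto &row : b)
        for (std::size_t n = 0; n < n_cols; ++n)
            col_bias[n] += row[n];
    for (auto &c : col_bias)
        c -= b_offset * k_real;
    return col_bias;
}

// Dequantize one accumulator using the folded bias. K does not appear
// here, so the padded depth can no longer leak into the result.
inline float dequant(int32_t acc, int32_t row_sum, int32_t col_bias_n,
                     int32_t a_offset, int32_t b_offset, float scale)
{
    return static_cast<float>(acc - a_offset * col_bias_n - b_offset * row_sum) * scale;
}
```

As an assumed example, take k_size = 3, k_sections = 9 and k_unroll = 4: the padded depth is 36 against a real depth of 27, so with a_offset = 2, b_offset = 3 and scale = 0.5 the old cross-term would add a constant 27.0 to every output. With k_size = 8, k_sections = 1 the two depths coincide and the error is zero, matching the observation that 1x1 convolutions with K a multiple of k_unroll were correct.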
6e974d7 to 9296252
Motivation
Inference frameworks (e.g., OpenVINO) often run int8-quantized activations with float32 output. Previously this required a multi-step chain: quantized GEMM accumulating into int32, a separate GEMMLowp output-stage operator, and a dequantize/cast step. This adds a single-kernel path that takes QASYMM8_SIGNED input and weights and writes F32 output directly, reducing memory traffic and operator overhead.

Dependency
Requires #1297 to be merged first.
That branch provides the dequant_a_offset/dequant_b_offset fields on AsmGemmInfo, the relaxed type guard in CpuGemmAssemblyDispatch::validate (removing the "Only S32 output" restriction for S8 input), and the create_arm_gemm_dequant changes that pass offsets into DequantizeFloat with the correct combined scale. Without it this branch does not build correctly standalone.

Technical approach
The existing DequantizeFloat output stage is extended with a_offset and b_offset fields. GemmInterleaved is taught to:
- compute row sums (transforms_quantized with multiplier = 1) when b_offset != 0, for per-row offset correction.
- compute column sums in set_pretransposed_B_array when a_offset != 0, stored alongside the pretransposed weight buffer.
- apply the offset correction in dequantize_block_32<float> at merge time.

K-blocking is conservatively disabled for the DequantizeFloat + MergeStep case (matching the existing Requantize32 policy) since row sums must cover full K.

CpuGemmDirectConv2d detects the QASYMM8_SIGNED→F32 path by type, reads zero-points from QuantizationInfo, and passes them to the dispatch via AsmGemmInfo. CpuConv2d::get_convolution_method automatically routes NHWC QASYMM8_SIGNED→F32 convolutions to GEMM_CONV2D (no explicit flag).

Also fixes a latent bug in dequantize_block_32<float>: the expression val * qp.scale (integer × float, silently truncating for large accumulators) is corrected to static_cast<float>(val) * qp.scale.

Asymmetric correction formula
out[m,n] = (raw_acc[m,n]
− a_offset · Σ_k b[k,n] (per-column, from col sums)
− b_offset · Σ_k a[m,k] (per-row, from row sums)
+ a_offset · b_offset · K (cross-term)
) · (scale_a · scale_b) + bias[n]
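
The formula above follows from expanding Σ_k (a[m,k] − a_offset)(b[k,n] − b_offset). A minimal sketch that checks it against a direct float reference is given below; the function names, the Matrix alias and the row-major layout are assumptions made for this illustration and do not correspond to arm_gemm or Compute Library APIs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

using Matrix = std::vector<std::vector<int8_t>>; // hypothetical row-major layout

// Reference: dequantize each operand first, then multiply-accumulate in float.
// a is M x K, b is K x N.
inline float reference_output(const Matrix &a, const Matrix &b, std::size_t m, std::size_t n,
                              int32_t a_offset, int32_t b_offset,
                              float scale_a, float scale_b, float bias)
{
    float sum = 0.f;
    for (std::size_t k = 0; k < b.size(); ++k)
    {
        const float av = scale_a * static_cast<float>(a[m][k] - a_offset);
        const float bv = scale_b * static_cast<float>(b[k][n] - b_offset);
        sum += av * bv;
    }
    return sum + bias;
}

// Corrected path: accumulate raw int8 products in int32, then apply the
// column-sum, row-sum and cross-term corrections before scaling once.
inline float corrected_output(const Matrix &a, const Matrix &b, std::size_t m, std::size_t n,
                              int32_t a_offset, int32_t b_offset,
                              float scale_a, float scale_b, float bias)
{
    const int32_t K = static_cast<int32_t>(b.size()); // real depth, no padding
    int32_t raw_acc = 0, row_sum = 0, col_sum = 0;
    for (std::size_t k = 0; k < b.size(); ++k)
    {
        raw_acc += static_cast<int32_t>(a[m][k]) * static_cast<int32_t>(b[k][n]);
        row_sum += a[m][k]; // Σ_k a[m,k]
        col_sum += b[k][n]; // Σ_k b[k,n]
    }
    const int32_t corrected = raw_acc
                              - a_offset * col_sum
                              - b_offset * row_sum
                              + a_offset * b_offset * K;
    return static_cast<float>(corrected) * (scale_a * scale_b) + bias;
}
```

For small inputs the two functions should agree to within float rounding; in a real kernel the row and column sums would be precomputed once per row and per column rather than inside the inner loop as in this sketch.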