feat: enable QASYMM8_SIGNED to F32 assembly dequantization#1302
Open
morgolock wants to merge 1 commit into
Open
feat: enable QASYMM8_SIGNED to F32 assembly dequantization#1302morgolock wants to merge 1 commit into
morgolock wants to merge 1 commit into
Conversation
72829b3 to
4848723
Compare
Enable QASYMM8_SIGNED input and weights to use the F32 DequantizeFloat assembly output stage, including the direct convolution selection path. Propagate input and weight zero-points plus the unrounded mathematical K depth so asymmetric offset correction uses the real GEMM/convolution depth rather than arm_gemm's padded internal K. Fix the interleaved no-merge DequantizeFloat scheduler stride so kernels that do not pack row-sum slots advance between A panels correctly. Guard the symmetric no-merge dequant support helper with the SME/SME2 feature macros. The helper is only referenced when those no-merge dequantized kernels are compiled, so non-SME builds must not define it unconditionally under Werror. Add NEON validation coverage for the direct I8S8F32 convolution path. Performance was checked on a Cortex-A76, pinned to CPU 4 with one thread. The change is neutral on large workloads and improves the small NHWC signed int8 to F32 convolution path: QASYMM8_SIGNED/RunSmallDequantizeF32 NHWC/no-activation cases show a 1.36x geomean speedup over github/main, with the tiniest single-batch cases improving by roughly 1.45x to 3.13x. Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com> Change-Id: Ie723d3da629d48de6de737c425bf7ad48e0f7feb
4848723 to
2ccd5a7
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Enable QASYMM8_SIGNED input and weights to use the F32 DequantizeFloat assembly output stage, including the direct convolution selection path.
Propagate input and weight zero-points plus the unrounded mathematical K depth so asymmetric offset correction uses the real GEMM/convolution depth rather than arm_gemm's padded internal K.
Add NEON validation coverage for the direct I8S8F32 convolution path.
Performance was checked on a A76, pinned to CPU 4 with one thread. The change is neutral on large workloads and improves the small NHWC signed int8 to F32 convolution path: QASYMM8_SIGNED/RunSmallDequantizeF32 NHWC/no-activation cases show a 1.36x geomean speedup over github/main, with the tiniest single-batch cases improving by roughly 1.45x to 3.13x.
Change-Id: Ie723d3da629d48de6de737c425bf7ad48e0f7feb