perf(quantizer): make MatrixRotator::rotate row-contiguous (loop interchange, ~12-15x)#534
Open
cluster2600 wants to merge 4 commits into
…rchange) rotate() computed the (1 x dim) * (dim x dim) matrix-vector product with the reduction on the inner loop and the row-major matrix indexed by column (matrix_[i * dim + j] with i inner, stride dim). That walks each column with a dim-element stride, thrashing cache/TLB for large dim and blocking auto-vectorization of the inner loop. Interchange to i-outer / j-inner so the matrix is read row-contiguously (matrix_[i * dim + j] steps by 1) with out[] as the accumulator. The summation over i keeps the same order, so the result is numerically identical up to FMA/vectorization rounding. Pure scalar: no SIMD intrinsics, no OpenMP, portable across every target (x86 / arm64 / RISC-V), consistent with keeping ENABLE_OPENMP off. Measured ~12-15x single-thread speedup (dim 128-768, clang++ -O3), max_abs_diff <= 1e-6 vs the previous order. The sibling unrotate() already used the contiguous pattern and is unchanged. Adds a unit test pinning rotate() against an independent row-major matvec reference across power-of-two and odd dimensions. Closes alibaba#533
Adds tools/core/distance_bench.cc, a standalone benchmark for the core FP32 Distance::SquaredEuclidean kernel. For each dim it validates the SIMD result against a scalar reference (exits non-zero on mismatch), then times a brute-force scan of both implementations and reports the speedup. Built under the existing BUILD_TOOLS flag; intentionally not registered as a ctest since micro-benchmark numbers are machine dependent. Closes alibaba#535
The Q^T copy in MatrixRotator::init wrote matrix_[j*dim+i] with the inner loop over j, i.e. by column (stride dim), thrashing cache for large dim. Interchange the loops so the inner loop runs over i and the stores are sequential. Q is read strided either way, so making the writes contiguous is the net win. Pure copy: result is unchanged, verified by the existing matrix_rotator_test.
Summary
MatrixRotator::rotate() computes the rotation out = in · matrix_ (a (1×dim) · (dim×dim) matrix-vector product) with the reduction on the inner loop and the row-major matrix indexed by column. This walks each column of matrix_ with a dim-element stride, which thrashes cache/TLB for large dim and prevents the inner loop from auto-vectorizing.

This PR interchanges the loops so the matrix is read row-contiguously. The result is numerically identical up to FMA/vectorization rounding, and the kernel runs ~12–15× faster single-threaded. It stays pure scalar (no SIMD intrinsics, no OpenMP), so it's portable across every target.

Closes #533.
The problem: column-strided access on a row-major matrix
matrix_ is stored row-major. The original loop nest put the reduction index i on the inner loop, so consecutive inner iterations read matrix_[i*dim + j] for i = 0,1,2,…, i.e. down a column, jumping dim floats (≈ dim*4 bytes) each step.

[Flowchart: BEFORE, j outer, i inner (reduction inner). for j in 0..dim (output index) → for i in 0..dim (reduction) → sum += in[i] * matrix_[i*dim + j] → out[j] = sum]

Memory access pattern of the inner loop (every step jumps a full row):

[Flowchart: matrix_[0*dim+j] → matrix_[1*dim+j] → matrix_[2*dim+j] → matrix_[3*dim+j], each step +dim floats]

For dim = 768 that's a 3 KB stride per access: each load lands on a different cache line from the one just touched, so nearly every load misses L1, and the strided loads feeding the serial accumulation into sum block vectorization.
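For concreteness, the original loop order described above can be sketched in C++ as follows. This is a minimal reconstruction from the description, not the code from the repository: the free function rotate_column_strided and its parameter names are hypothetical stand-ins for the MatrixRotator member and its matrix_ field, and the no-aliasing assertion is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the pre-change MatrixRotator::rotate().
// `matrix` is a row-major dim x dim array; computes out = in * matrix.
void rotate_column_strided(const float* in, float* out,
                           const float* matrix, std::size_t dim) {
    assert(in != out && "in and out must not alias");
    for (std::size_t j = 0; j < dim; ++j) {      // output index (outer)
        float sum = 0.0f;
        for (std::size_t i = 0; i < dim; ++i) {  // reduction index (inner)
            // Consecutive i values are dim floats apart in memory,
            // so this load walks down column j of the row-major matrix.
            sum += in[i] * matrix[i * dim + j];
        }
        out[j] = sum;
    }
}
```

With a 2×2 matrix {1, 2, 3, 4} and in = {1, 1}, this yields out = {4, 6}, the column sums.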
The fix: loop interchange → row-contiguous reads
Move the input index i to the outer loop and the output index j to the inner loop, using out[] as the accumulator. Now the inner loop reads matrix_[i*dim + j] for j = 0,1,2,… (unit stride along a row), and the writes to out[] are also unit stride.

[Flowchart: AFTER, i outer, j inner (contiguous). for j in 0..dim: out[j] = 0 → for i in 0..dim (input index) → xi = in[i]; row = &matrix_[i*dim] → for j in 0..dim: out[j] += xi * row[j]]

Inner-loop access pattern is now sequential, which is vectorizer- and prefetcher-friendly:

[Flowchart: row[0] → row[1] → row[2] → row[3], each step +1 float]

This is the classic GAXPY (scalar × vector accumulation) formulation of matvec: each outer iteration adds in[i] times row i of matrix_ to out[]. The reduction over i is performed in the same order as before (i still ascends 0…dim-1), so the only numerical difference is whether the compiler fuses/vectorizes the now-contiguous inner loop. Measured max_abs_diff ≤ 1e-6.

[Flowchart: rotate(in, out) → loop order. j-outer / i-inner: column stride dim ❌ cache miss, no vec. i-outer / j-inner: unit stride ✅ cache hit, vectorized. Same math, same i-order.]

Benchmark
Standalone microbench of the two loop orders, clang++ -O3 -std=c++17, single thread, Apple Silicon, best-of-5. Timing is per call for one rotate(); the chart reports the resulting speedup:

[Bar chart: "rotate() single-thread speedup from loop interchange"; x-axis: dim, y-axis: speedup ×]

dim 128: 11.9×
dim 256: 11.8×
dim 512: 14.2×
dim 768: 14.9×

rotate() is on the query path when matrix rotation is enabled (the random-rotation rotator added in #483). At dim = 768 the old order costs ~0.69 ms per rotation; the interchange brings that under 0.05 ms with no accuracy change.
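The interchanged kernel and a best-of-N timer of the kind the benchmark describes might look like the sketch below. It is an illustration under stated assumptions, not the PR's actual microbench: rotate_row_contiguous and best_seconds are hypothetical names, and the timer uses std::chrono::steady_clock because the PR does not say which clock it used.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>

// Hypothetical stand-in for the interchanged MatrixRotator::rotate().
// `matrix` is a row-major dim x dim array; computes out = in * matrix.
void rotate_row_contiguous(const float* in, float* out,
                           const float* matrix, std::size_t dim) {
    assert(in != out && "out is the accumulator, so it must not alias in");
    for (std::size_t j = 0; j < dim; ++j) out[j] = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) {      // input index (outer)
        const float xi = in[i];
        const float* row = matrix + i * dim;     // start of row i
        for (std::size_t j = 0; j < dim; ++j) {  // output index (inner)
            out[j] += xi * row[j];               // unit-stride load and store
        }
    }
}

// Best-of-`repeats` wall-clock time, in seconds, of a single kernel call.
template <typename Kernel>
double best_seconds(Kernel kernel, const float* in, float* out,
                    const float* matrix, std::size_t dim, int repeats) {
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        kernel(in, out, matrix, dim);
        const auto stop = std::chrono::steady_clock::now();
        const double elapsed =
            std::chrono::duration<double>(stop - start).count();
        if (elapsed < best) best = elapsed;
    }
    return best;
}
```

Calling best_seconds(rotate_row_contiguous, in, out, matrix, dim, 5) mirrors the best-of-5 setup. A real measurement would also need to read out after the timed call so the optimizer cannot discard the work, and a single call at small dim may be too short for the clock's resolution.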
Correctness
Output matches the previous loop order up to FMA/vectorization rounding (max_abs_diff ≤ 1e-6).
tests/core/quantizer/matrix_rotator_test.cc pins rotate() against an independent, plainly-written row-major matvec reference across power-of-two and odd dimensions (1, 2, 3, 7, 16, 31, 64, 128, 257) to exercise inner-loop tails.
Existing tests (integer_quantizer_reformer, uniform_int8_reformer, half_float_reformer, cosine_metric, quantized_integer_metric) pass unchanged.
unrotate() already used the contiguous pattern (matrix_[j*dim + i], i inner) and is not touched.

Scope
Single-kernel change plus its test. No public API change, no new dependency, no build-flag change. The transform is architecture-independent (it removes a column stride on a row-major array); the magnitude of the speedup scales with dim.

Methodology note
The loop interchange was surfaced and proven legal (no carried dependence,
verified before the bit-equivalence check) with a polyhedral auto-scheduling
pass — cluster_compilot, an
implementation of Agentic Auto-Scheduling (arXiv:2511.00592). The change here
is the plain, hand-written result; the tool only identified it.
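As an illustration of the reference check described under Correctness, a dimension sweep could be structured roughly as below. This is a sketch, not the contents of matrix_rotator_test.cc: the RotateFn alias, max_abs_diff_vs_reference, the LCG fill, and the double-precision reference are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical kernel signature: out = in * matrix, matrix row-major dim x dim.
using RotateFn = void (*)(const float*, float*, const float*, std::size_t);

// Runs `rotate` on deterministic data and returns the largest absolute
// difference from a plainly written matvec accumulated in double.
double max_abs_diff_vs_reference(RotateFn rotate, std::size_t dim) {
    assert(rotate != nullptr);
    std::vector<float> in(dim), out(dim), matrix(dim * dim);

    // Small linear congruential generator; values fall in [-0.5, 0.5).
    std::uint32_t state = 12345u;
    auto next = [&state]() {
        state = state * 1664525u + 1013904223u;
        return static_cast<float>(state >> 8) / 16777216.0f - 0.5f;
    };
    for (float& v : in) v = next();
    for (float& v : matrix) v = next();

    rotate(in.data(), out.data(), matrix.data(), dim);

    double worst = 0.0;
    for (std::size_t j = 0; j < dim; ++j) {
        double ref = 0.0;
        for (std::size_t i = 0; i < dim; ++i) {
            ref += static_cast<double>(in[i]) *
                   static_cast<double>(matrix[i * dim + j]);
        }
        worst = std::max(worst, std::fabs(ref - static_cast<double>(out[j])));
    }
    return worst;
}
```

A test would call this for each of 1, 2, 3, 7, 16, 31, 64, 128, 257 and compare the result with a tolerance. Because the reference here accumulates in double, the appropriate bound is a choice for the test author and is not the same measurement as the max_abs_diff ≤ 1e-6 reported against the previous float loop order.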