perf(quantizer): make MatrixRotator::rotate row-contiguous (loop interchange, ~12-15x) by cluster2600 · Pull Request #534 · alibaba/zvec

cluster2600 · 2026-06-27T13:45:12Z

Summary

MatrixRotator::rotate() computes the rotation out = in · matrix_
(a (1×dim) · (dim×dim) matrix-vector product) with the reduction on the
inner loop and the row-major matrix indexed by column. This walks each column
of matrix_ with a dim-element stride, which thrashes cache/TLB for large
dim and prevents the inner loop from auto-vectorizing.

This PR interchanges the loops so the matrix is read row-contiguously.
The result is numerically identical up to FMA/vectorization rounding, and the
kernel runs ~12–15× faster single-threaded. Pure scalar — no SIMD
intrinsics, no OpenMP — so it's portable across every target.

Closes #533.

The problem: column-strided access on a row-major matrix

matrix_ is stored row-major. The original loop nest put the reduction index
i on the inner loop, so consecutive inner iterations read
matrix_[i*dim + j] for i = 0,1,2,… — i.e. down a column, jumping dim
floats (≈ dim*4 bytes) each step.

flowchart TD
    subgraph BEFORE["BEFORE — j outer, i inner (reduction inner)"]
      direction TB
      jb["for j in 0..dim&nbsp;&nbsp;(output index)"]:::seq
      ib["&nbsp;&nbsp;for i in 0..dim&nbsp;&nbsp;(reduction)"]:::bad
      ab["&nbsp;&nbsp;&nbsp;&nbsp;sum += in[i] * matrix_[i*dim + j]"]:::body
      eb["&nbsp;&nbsp;out[j] = sum"]:::body
      jb --> ib --> ab --> eb
    end
    classDef seq  fill:#2b2b2b,color:#fff,stroke:#888
    classDef bad  fill:#7a1f1f,color:#fff,stroke:#d33,stroke-width:2px
    classDef body fill:#1f3a7a,color:#fff,stroke:#46f

Memory access pattern of the inner loop — every step jumps a full row:

flowchart LR
    a["matrix_[0*dim+j]"]:::c --> b["matrix_[1*dim+j]"]:::c --> d["matrix_[2*dim+j]"]:::c --> e["matrix_[3*dim+j]"]:::c
    a -. "+dim floats" .-> b
    b -. "+dim floats" .-> d
    d -. "+dim floats" .-> e
    classDef c fill:#7a1f1f,color:#fff,stroke:#d33,stroke-width:2px

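For dim = 768 that's a 3 KB stride per access, far larger than a cache line,
so every load in the inner loop touches a different line and only one float of
each fetched line is used per pass. The strided loads, together with the serial
dependency on sum, block vectorization.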
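For reference, the original loop order can be sketched as below. This is a minimal standalone sketch, not the zvec source: the free function `rotate_strided` and its `std::vector<float>` parameters are hypothetical stand-ins for the member function and its buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical standalone form of the ORIGINAL loop order (j outer, i inner).
// `matrix` is row-major, dim x dim; computes out = in * matrix.
void rotate_strided(const std::vector<float>& in,
                    const std::vector<float>& matrix,
                    std::vector<float>& out,
                    std::size_t dim) {
  assert(in.size() == dim && matrix.size() == dim * dim);
  out.assign(dim, 0.0f);
  for (std::size_t j = 0; j < dim; ++j) {    // output index
    float sum = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) {  // reduction index
      // Consecutive i values are dim floats apart in memory.
      sum += in[i] * matrix[i * dim + j];
    }
    out[j] = sum;
  }
}
```

The inner loop reads one element from each row of the matrix, which is the column walk described above.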

The fix: loop interchange → row-contiguous reads

Move the input index i to the outer loop and the output index j to the
inner loop, using out[] as the accumulator. Now the inner loop reads
matrix_[i*dim + j] for j = 0,1,2,… — unit stride along a row — and the
writes to out[] are also unit stride.
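The interchanged loop nest described above can be sketched as follows. As before, this is a hedged standalone sketch: `rotate_contiguous` and its parameters are hypothetical names, and the actual member function may differ in signature and buffer types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical standalone form of the INTERCHANGED loop order (i outer, j inner).
// `matrix` is row-major, dim x dim; computes out = in * matrix.
void rotate_contiguous(const std::vector<float>& in,
                       const std::vector<float>& matrix,
                       std::vector<float>& out,
                       std::size_t dim) {
  assert(in.size() == dim && matrix.size() == dim * dim);
  out.assign(dim, 0.0f);                          // out[] is the accumulator
  for (std::size_t i = 0; i < dim; ++i) {         // input index
    const float xi = in[i];
    const float* row = matrix.data() + i * dim;   // start of row i
    for (std::size_t j = 0; j < dim; ++j) {       // unit stride on row and out
      out[j] += xi * row[j];
    }
  }
}
```

In this sketch `out` must be a different vector from `in`, because `out` is zeroed before `in` is read.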

flowchart TD
    subgraph AFTER["AFTER — i outer, j inner (contiguous)"]
      direction TB
      z["for j in 0..dim:&nbsp;&nbsp;out[j] = 0"]:::seq
      ia["for i in 0..dim&nbsp;&nbsp;(input index)"]:::par
      ja["&nbsp;&nbsp;xi = in[i];&nbsp;&nbsp;row = &matrix_[i*dim]"]:::par
      aa["&nbsp;&nbsp;&nbsp;&nbsp;for j in 0..dim:&nbsp;&nbsp;out[j] += xi * row[j]"]:::good
      z --> ia --> ja --> aa
    end
    classDef seq  fill:#2b2b2b,color:#fff,stroke:#888
    classDef par  fill:#234d23,color:#fff,stroke:#0a0
    classDef good fill:#1f7a1f,color:#fff,stroke:#0a0,stroke-width:2px

Inner-loop access pattern is now sequential — vectorizer- and prefetcher-friendly:

flowchart LR
    a["row[0]"]:::g --> b["row[1]"]:::g --> d["row[2]"]:::g --> e["row[3]"]:::g
    a -. "+1 float" .-> b
    b -. "+1 float" .-> d
    d -. "+1 float" .-> e
    classDef g fill:#1f7a1f,color:#fff,stroke:#0a0,stroke-width:2px

This is a classic GAXPY-style (scalar · row of matrix_) reformulation of
matvec: each outer iteration adds in[i] times row i of matrix_ to out[]. The
reduction over i is performed in the same order as before (i still
ascends 0…dim-1), so the only numerical difference is whether the compiler
fuses/vectorizes the now-contiguous inner loop. Measured max_abs_diff ≤ 1e-6.
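The same-order argument can be written out explicitly. The notation is an assumption made for this note only: x stands for in, M for matrix_, y for out, and n for dim.

```latex
% Each output element is a dot product of x with column j of M:
\[
  y_j \;=\; \sum_{i=0}^{n-1} x_i \, M_{ij}, \qquad j = 0, \dots, n-1 .
\]
% Both loop orders build y_j through the same sequence of partial sums:
\[
  s_j^{(0)} = 0, \qquad
  s_j^{(i+1)} = s_j^{(i)} + x_i \, M_{ij}, \qquad
  y_j = s_j^{(n)} .
\]
% Before: for fixed j, s_j^{(i)} lives in the scalar `sum` (i inner).
% After:  for fixed i, all s_j^{(i)} live in out[] and are updated together:
\[
  y \;\leftarrow\; y + x_i \, M_{i,:}, \qquad i = 0, \dots, n-1 .
\]
```

Since every s_j passes through the same additions in the same order, any difference in the result comes from how the compiler contracts or vectorizes the operations, not from reassociation of the sum.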

flowchart LR
    A["rotate(in, out)"] --> B{"loop order"}
    B -- "j-outer / i-inner" --> C["column stride dim&nbsp;&nbsp;❌ cache miss, no vec"]:::bad
    B -- "i-outer / j-inner" --> D["unit stride&nbsp;&nbsp;✅ cache hit, vectorized"]:::good
    C -. "same math,<br/>same i-order" .-> D
    classDef bad  fill:#7a1f1f,color:#fff,stroke:#d33,stroke-width:2px
    classDef good fill:#1f7a1f,color:#fff,stroke:#0a0,stroke-width:2px

Benchmark

Standalone microbench of the two loop orders, clang++ -O3 -std=c++17,
single thread, Apple Silicon, best-of-5. Per-call time for one rotate():

dim	before (column-strided)	after (contiguous)	speedup
128	12.3 µs	1.0 µs	11.9×
256	60.5 µs	5.1 µs	11.8×
512	290.8 µs	20.5 µs	14.2×
768	691.3 µs	46.3 µs	14.9×
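A best-of-N timing loop of the kind described above could look like the sketch below. It is not the harness used for the table: the helper name `best_of_us`, its parameters, and the batch size are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: runs `repeats` timed batches of `calls` invocations
// each and returns the best (minimum) per-call time in microseconds.
template <typename Fn>
double best_of_us(Fn&& fn, int repeats = 5, int calls = 100) {
  assert(repeats > 0 && calls > 0);
  using clock = std::chrono::steady_clock;
  double best = std::numeric_limits<double>::infinity();
  for (int r = 0; r < repeats; ++r) {
    const auto t0 = clock::now();
    for (int c = 0; c < calls; ++c) {
      fn();
    }
    const auto t1 = clock::now();
    const double us =
        std::chrono::duration<double, std::micro>(t1 - t0).count() / calls;
    best = std::min(best, us);  // keep the fastest batch
  }
  return best;
}
```

When timing a kernel at -O3, the callable should consume its output (for example by folding one element into a `volatile` sink), otherwise the compiler may remove the work being measured.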

xychart-beta
    title "rotate() single-thread speedup from loop interchange"
    x-axis "dim" [128, 256, 512, 768]
    y-axis "speedup ×" 0 --> 16
    bar [11.9, 11.8, 14.2, 14.9]

rotate() is on the query path when matrix rotation is enabled (the
random-rotation rotator added in #483). At dim = 768 the old order costs
~0.69 ms per rotation; the interchange brings that under 0.05 ms with no
accuracy change.

Correctness

Same summation order → numerically identical up to FMA rounding
(max_abs_diff ≤ 1e-6).
A new unit test, tests/core/quantizer/matrix_rotator_test.cc, pins
rotate() against an independent, plainly-written row-major matvec reference
across power-of-two and odd dimensions (1, 2, 3, 7, 16, 31, 64, 128, 257)
to exercise inner-loop tails.
The existing reformer/quantizer suites that exercise the rotator
(integer_quantizer_reformer, uniform_int8_reformer, half_float_reformer,
cosine_metric, quantized_integer_metric) pass unchanged.
unrotate() already used the contiguous pattern (matrix_[j*dim + i],
i inner) and is not touched.
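The reference comparison described above can be illustrated with a standalone check. This is not the content of matrix_rotator_test.cc: the function names, the seed, the input scaling, and the use of plain `assert` instead of a test framework are all assumptions of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Kernel under test: hypothetical standalone form of the interchanged loop.
void rotate_contiguous(const float* in, const float* matrix, float* out,
                       std::size_t dim) {
  std::fill(out, out + dim, 0.0f);
  for (std::size_t i = 0; i < dim; ++i) {
    const float xi = in[i];
    const float* row = matrix + i * dim;
    for (std::size_t j = 0; j < dim; ++j) out[j] += xi * row[j];
  }
}

// Independent reference: one plain dot product per output element.
void matvec_reference(const float* in, const float* matrix, float* out,
                      std::size_t dim) {
  for (std::size_t j = 0; j < dim; ++j) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) sum += in[i] * matrix[i * dim + j];
    out[j] = sum;
  }
}

// Largest absolute difference between kernel and reference for one dim,
// using deterministic pseudo-random inputs.
float max_abs_diff_for_dim(std::size_t dim, unsigned seed = 42) {
  assert(dim > 0);
  std::mt19937 rng(seed);
  std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
  std::vector<float> in(dim), matrix(dim * dim), got(dim), want(dim);
  for (float& v : in) v = dist(rng);
  // Scale entries by 1/dim so every output stays within [-1, 1].
  for (float& v : matrix) v = dist(rng) / static_cast<float>(dim);
  rotate_contiguous(in.data(), matrix.data(), got.data(), dim);
  matvec_reference(in.data(), matrix.data(), want.data(), dim);
  float worst = 0.0f;
  for (std::size_t j = 0; j < dim; ++j)
    worst = std::max(worst, std::fabs(got[j] - want[j]));
  return worst;
}
```

Calling `max_abs_diff_for_dim` for dimensions such as 1, 3, 31 and 257 covers lengths that are not a multiple of any vector width, so the remainder iterations of a vectorized inner loop are exercised as well.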

Scope

Single-kernel change plus its test. No public API change, no new dependency, no
build-flag change. The transform is architecture-independent (it removes a
column stride on a row-major array); the size of the speedup grows with dim,
from about 12× at dim = 128 to about 15× at dim = 768 in the benchmark above.

Methodology note

The loop interchange was surfaced and proven legal (it violates no loop-carried
dependence, verified before the equivalence check) with a polyhedral
auto-scheduling pass, cluster_compilot, an implementation of Agentic
Auto-Scheduling (arXiv:2511.00592). The change here is the plain, hand-written
result; the tool only identified it.

…rchange)

rotate() computed the (1 x dim) * (dim x dim) matrix-vector product with the reduction on the inner loop and the row-major matrix indexed by column (matrix_[i * dim + j] with i inner, stride dim). That walks each column with a dim-element stride, thrashing cache/TLB for large dim and blocking auto-vectorization of the inner loop.

Interchange to i-outer / j-inner so the matrix is read row-contiguously (matrix_[i * dim + j] steps by 1) with out[] as the accumulator. The summation over i keeps the same order, so the result is numerically identical up to FMA/vectorization rounding.

Pure scalar: no SIMD intrinsics, no OpenMP, portable across every target (x86 / arm64 / RISC-V), consistent with keeping ENABLE_OPENMP off.

Measured ~12-15x single-thread speedup (dim 128-768, clang++ -O3), max_abs_diff <= 1e-6 vs the previous order.

The sibling unrotate() already used the contiguous pattern and is unchanged. Adds a unit test pinning rotate() against an independent row-major matvec reference across power-of-two and odd dimensions.

Closes alibaba#533

Adds tools/core/distance_bench.cc, a standalone benchmark for the core FP32 Distance::SquaredEuclidean kernel. For each dim it validates the SIMD result against a scalar reference (exits non-zero on mismatch), then times a brute-force scan of both implementations and reports the speedup. Built under the existing BUILD_TOOLS flag; intentionally not registered as a ctest since micro-benchmark numbers are machine dependent. Closes alibaba#535

The Q^T copy in MatrixRotator::init wrote matrix_[j*dim+i] with the inner loop over j, i.e. by column (stride dim), thrashing cache for large dim. Interchange the loops so the inner loop runs over i and the stores are sequential. Q is read strided either way, so making the writes contiguous is the net win. Pure copy: result is unchanged, verified by the existing matrix_rotator_test.

cluster2600 requested review from iaojnh and richyreachy as code owners June 27, 2026 13:45

github-actions Bot assigned cluster2600, feihongxu0824 and richyreachy Jun 27, 2026

cluster2600 mentioned this pull request Jun 28, 2026

Proposal: opt-in benchmarks/ harness for per-query scan throughput (ThreadPool + SIMD dispatch) #535

Open

cluster2600 and others added 2 commits June 29, 2026 14:43

Merge branch 'main' into perf/rotator-loop-interchange

eacea6b

cluster2600 requested a review from JalinWang as a code owner June 29, 2026 12:48

cluster2600 wants to merge 4 commits into alibaba:main from cluster2600:perf/rotator-loop-interchange


3 participants