
perf(quantizer): make MatrixRotator::rotate row-contiguous (loop interchange, ~12-15x) #534

Open
cluster2600 wants to merge 4 commits into alibaba:main from cluster2600:perf/rotator-loop-interchange

@cluster2600

Contributor

Summary

MatrixRotator::rotate() computes the rotation out = in · matrix_
(a (1×dim) · (dim×dim) vector-matrix product) with the reduction on the
inner loop and the row-major matrix indexed by column
(matrix_[i*dim + j] with i inner). This walks each column
of matrix_ with a dim-element stride, which thrashes cache/TLB for large
dim and prevents the inner loop from auto-vectorizing.

This PR interchanges the loops so the matrix is read row-contiguously.
The result is numerically identical up to FMA/vectorization rounding, and the
kernel runs ~12–15× faster single-threaded. Pure scalar — no SIMD
intrinsics, no OpenMP — so it's portable across every target.

Closes #533.

The problem: column-strided access on a row-major matrix

matrix_ is stored row-major. The original loop nest put the reduction index
i on the inner loop, so consecutive inner iterations read
matrix_[i*dim + j] for i = 0,1,2,… — i.e. down a column, jumping dim
floats (≈ dim*4 bytes) each step.

flowchart TD
    subgraph BEFORE["BEFORE — j outer, i inner (reduction inner)"]
      direction TB
      jb["for j in 0..dim  (output index)"]:::seq
      ib["  for i in 0..dim  (reduction)"]:::bad
      ab["    sum += in[i] * matrix_[i*dim + j]"]:::body
      eb["  out[j] = sum"]:::body
      jb --> ib --> ab --> eb
    end
    classDef seq  fill:#2b2b2b,color:#fff,stroke:#888
    classDef bad  fill:#7a1f1f,color:#fff,stroke:#d33,stroke-width:2px
    classDef body fill:#1f3a7a,color:#fff,stroke:#46f

Memory access pattern of the inner loop — every step jumps a full row:

flowchart LR
    a["matrix_[0*dim+j]"]:::c --> b["matrix_[1*dim+j]"]:::c --> d["matrix_[2*dim+j]"]:::c --> e["matrix_[3*dim+j]"]:::c
    a -. "+dim floats" .-> b
    b -. "+dim floats" .-> d
    d -. "+dim floats" .-> e
    classDef c fill:#7a1f1f,color:#fff,stroke:#d33,stroke-width:2px

For dim = 768 that's a 3 KB stride (768 × 4 bytes) per access: consecutive
loads land on different cache lines, so the inner loop gets no spatial
locality, and the strided loads together with the serial reduction into sum
block vectorization.
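
For reference, the original loop order can be sketched as below. This is a minimal reconstruction from the description above, not the repository's code: the free function `rotate_column_strided` and its signature are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the pre-change MatrixRotator::rotate().
// `matrix` holds dim*dim floats in row-major order; computes out = in * matrix.
void rotate_column_strided(const float* in, const float* matrix, float* out,
                           std::size_t dim) {
  assert(in != out);  // in is re-read for every output, so it must not alias out
  for (std::size_t j = 0; j < dim; ++j) {    // output index
    float sum = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) {  // reduction index
      // Consecutive i values are dim floats apart: a walk down column j.
      sum += in[i] * matrix[i * dim + j];
    }
    out[j] = sum;
  }
}
```

Each pass of the inner loop touches dim distinct rows to produce a single output element, which is the access pattern the diagrams above illustrate.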

The fix: loop interchange → row-contiguous reads

Move the input index i to the outer loop and the output index j to the
inner loop, using out[] as the accumulator. Now the inner loop reads
matrix_[i*dim + j] for j = 0,1,2,… (unit stride along a row), and the
writes to out[] are also unit stride.
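
In C++ the interchanged order described above might look like the following sketch. The free function `rotate_row_contiguous` and its signature are illustrative assumptions; the actual member function in the PR may differ in naming and types.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical free-function version of the interchanged kernel.
// `matrix` holds dim*dim floats in row-major order; computes out = in * matrix.
void rotate_row_contiguous(const float* in, const float* matrix, float* out,
                           std::size_t dim) {
  assert(in != out);  // out is used as the accumulator while in is still read
  for (std::size_t j = 0; j < dim; ++j) out[j] = 0.0f;
  for (std::size_t i = 0; i < dim; ++i) {      // input (reduction) index
    const float xi = in[i];
    const float* row = matrix + i * dim;       // row i, contiguous in memory
    for (std::size_t j = 0; j < dim; ++j) {    // output index, unit stride
      out[j] += xi * row[j];
    }
  }
}
```

Because `out` and `row` are plain pointers, a compiler may guard the vectorized loop with a runtime overlap check; that is a property of this sketch, not a statement about the PR's code.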

flowchart TD
    subgraph AFTER["AFTER — i outer, j inner (contiguous)"]
      direction TB
      z["for j in 0..dim:  out[j] = 0"]:::seq
      ia["for i in 0..dim  (input index)"]:::par
      ja["  xi = in[i];  row = &matrix_[i*dim]"]:::par
      aa["    for j in 0..dim:  out[j] += xi * row[j]"]:::good
      z --> ia --> ja --> aa
    end
    classDef seq  fill:#2b2b2b,color:#fff,stroke:#888
    classDef par  fill:#234d23,color:#fff,stroke:#0a0
    classDef good fill:#1f7a1f,color:#fff,stroke:#0a0,stroke-width:2px

Inner-loop access pattern is now sequential — vectorizer- and prefetcher-friendly:

flowchart LR
    a["row[0]"]:::g --> b["row[1]"]:::g --> d["row[2]"]:::g --> e["row[3]"]:::g
    a -. "+1 float" .-> b
    b -. "+1 float" .-> d
    d -. "+1 float" .-> e
    classDef g fill:#1f7a1f,color:#fff,stroke:#0a0,stroke-width:2px

This is the classic AXPY-style formulation of the vector-matrix product
(scalar × row of matrix_, accumulated into out), equivalent to a
column-oriented GAXPY on the transpose. The
reduction over i is performed in the same order as before (i still
ascends 0…dim-1), so the only numerical difference is whether the compiler
fuses multiply-adds into FMAs while vectorizing the now-contiguous inner
loop. Measured max_abs_diff ≤ 1e-6.
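
Written out, both loop orders evaluate the same sums and only the traversal differs. The symbols below are shorthand introduced here, not names from the code: x for in, M for matrix_, y for out, and d for dim.

```latex
y_j = \sum_{i=0}^{d-1} x_i \, M_{ij} \quad (j = 0, \dots, d-1)
\qquad \Longleftrightarrow \qquad
y = \sum_{i=0}^{d-1} x_i \, M_{i,:}
```

The left form is the j-outer loop (one dot product per output element); the right form is the i-outer loop (one scaled row of M added to y per input element). For any fixed j, the terms x_i M_{ij} are accumulated in ascending i in both forms.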

flowchart LR
    A["rotate(in, out)"] --> B{"loop order"}
    B -- "j-outer / i-inner" --> C["column stride dim  ❌ cache miss, no vec"]:::bad
    B -- "i-outer / j-inner" --> D["unit stride  ✅ cache hit, vectorized"]:::good
    C -. "same math,<br/>same i-order" .-> D
    classDef bad  fill:#7a1f1f,color:#fff,stroke:#d33,stroke-width:2px
    classDef good fill:#1f7a1f,color:#fff,stroke:#0a0,stroke-width:2px

Benchmark

Standalone microbench of the two loop orders, clang++ -O3 -std=c++17,
single thread, Apple Silicon, best-of-5. Per-call time for one rotate():

| dim | before (column-strided) | after (contiguous) | speedup |
|---|---|---|---|
| 128 | 12.3 µs | 1.0 µs | 11.9× |
| 256 | 60.5 µs | 5.1 µs | 11.8× |
| 512 | 290.8 µs | 20.5 µs | 14.2× |
| 768 | 691.3 µs | 46.3 µs | 14.9× |

xychart-beta
    title "rotate() single-thread speedup from loop interchange"
    x-axis "dim" [128, 256, 512, 768]
    y-axis "speedup ×" 0 --> 16
    bar [11.9, 11.8, 14.2, 14.9]

rotate() is on the query path when matrix rotation is enabled (the
random-rotation rotator added in #483). At dim = 768 the old order costs
~0.69 ms per rotation; the interchange brings that under 0.05 ms with no
accuracy change.
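
A best-of-N timing loop of the kind used for such numbers could be sketched as follows. This is not the PR's microbenchmark: the helper `best_of_us`, its parameters, and the synthetic inputs are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Hypothetical timing helper: best per-call time in microseconds over `reps`
// timed batches of `iters` calls each. `kernel` is any callable with the
// shape kernel(in, matrix, out, dim).
template <class Kernel>
double best_of_us(Kernel&& kernel, std::size_t dim, int reps, int iters,
                  double* checksum) {
  assert(dim > 0 && reps > 0 && iters > 0 && checksum != nullptr);
  // Synthetic inputs; a real run would use an actual rotation matrix.
  std::vector<float> in(dim, 1.0f), matrix(dim * dim, 0.5f), out(dim, 0.0f);
  double best = std::numeric_limits<double>::infinity();
  for (int r = 0; r < reps; ++r) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int k = 0; k < iters; ++k) {
      kernel(in.data(), matrix.data(), out.data(), dim);
    }
    const auto t1 = std::chrono::steady_clock::now();
    const double batch_us =
        std::chrono::duration<double, std::micro>(t1 - t0).count();
    best = std::min(best, batch_us / iters);
    // Fold the output into a checksum so the result is observably used.
    *checksum += std::accumulate(out.begin(), out.end(), 0.0);
  }
  return best;
}
```

Timing a batch of calls rather than a single call keeps the measurement well above the clock's resolution, which matters when one call takes about a microsecond. Passing each loop order as `kernel` with the same `dim` gives the two per-call times whose ratio is the speedup.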

Correctness

  • Same summation order → numerically identical up to FMA rounding
    (max_abs_diff ≤ 1e-6).
  • New unit test tests/core/quantizer/matrix_rotator_test.cc pins
    rotate() against an independent, plainly-written row-major matvec reference
    across power-of-two and odd dimensions (1, 2, 3, 7, 16, 31, 64, 128, 257)
    to exercise inner-loop tails.
  • The existing reformer/quantizer suites that exercise the rotator
    (integer_quantizer_reformer, uniform_int8_reformer, half_float_reformer,
    cosine_metric, quantized_integer_metric) pass unchanged.
  • unrotate() already used the contiguous pattern (matrix_[j*dim + i],
    i inner) and is not touched.
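
The reference comparison in the second bullet could be sketched roughly as below. The names `RotateFn`, `reference_matvec`, and `max_abs_diff` are hypothetical, and accumulating the reference in double is a choice made for this sketch, not necessarily what matrix_rotator_test.cc does.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Signature assumed for the kernel under test (illustrative, not the class API).
using RotateFn = void (*)(const float* in, const float* matrix, float* out,
                          std::size_t dim);

// Plain reference: one dot product per output element, accumulated in double.
void reference_matvec(const float* in, const float* matrix, float* out,
                      std::size_t dim) {
  for (std::size_t j = 0; j < dim; ++j) {
    double sum = 0.0;
    for (std::size_t i = 0; i < dim; ++i) {
      sum += static_cast<double>(in[i]) * matrix[i * dim + j];
    }
    out[j] = static_cast<float>(sum);
  }
}

// Largest |kernel - reference| over one seeded random problem of size dim.
float max_abs_diff(RotateFn kernel, std::size_t dim, unsigned seed) {
  assert(kernel != nullptr && dim > 0);
  std::mt19937 rng(seed);
  std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
  std::vector<float> in(dim), matrix(dim * dim), got(dim), want(dim);
  for (float& v : in) v = dist(rng);
  for (float& v : matrix) v = dist(rng);
  kernel(in.data(), matrix.data(), got.data(), dim);
  reference_matvec(in.data(), matrix.data(), want.data(), dim);
  float worst = 0.0f;
  for (std::size_t j = 0; j < dim; ++j) {
    worst = std::max(worst, std::fabs(got[j] - want[j]));
  }
  return worst;
}
```

A test would call `max_abs_diff` once per dimension in the list above and compare the result against a tolerance; the odd sizes (3, 7, 31, 257) are the ones that leave a remainder after any vector-width unrolling.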

Scope

Single-kernel change plus its test. No public API change, no new dependency, no
build-flag change. The transform is architecture-independent (it removes a
column stride on a row-major array); the magnitude scales with dim.

Methodology note

The loop interchange was surfaced and proven legal (no carried dependence,
verified before the bit-equivalence check) with a polyhedral auto-scheduling
pass — cluster_compilot, an
implementation of Agentic Auto-Scheduling (arXiv:2511.00592). The change here
is the plain, hand-written result; the tool only identified it.

…rchange)

rotate() computed the (1 x dim) * (dim x dim) matrix-vector product with the
reduction on the inner loop and the row-major matrix indexed by column
(matrix_[i * dim + j] with i inner, stride dim). That walks each column with a
dim-element stride, thrashing cache/TLB for large dim and blocking
auto-vectorization of the inner loop.

Interchange to i-outer / j-inner so the matrix is read row-contiguously
(matrix_[i * dim + j] steps by 1) with out[] as the accumulator. The summation
over i keeps the same order, so the result is numerically identical up to
FMA/vectorization rounding. Pure scalar: no SIMD intrinsics, no OpenMP,
portable across every target (x86 / arm64 / RISC-V), consistent with keeping
ENABLE_OPENMP off.

Measured ~12-15x single-thread speedup (dim 128-768, clang++ -O3), max_abs_diff
<= 1e-6 vs the previous order. The sibling unrotate() already used the
contiguous pattern and is unchanged.

Adds a unit test pinning rotate() against an independent row-major matvec
reference across power-of-two and odd dimensions.

Closes alibaba#533
cluster2600 and others added 2 commits June 29, 2026 14:43
Adds tools/core/distance_bench.cc, a standalone benchmark for the core
FP32 Distance::SquaredEuclidean kernel. For each dim it validates the
SIMD result against a scalar reference (exits non-zero on mismatch),
then times a brute-force scan of both implementations and reports the
speedup.

Built under the existing BUILD_TOOLS flag; intentionally not registered
as a ctest since micro-benchmark numbers are machine dependent.

Closes alibaba#535
cluster2600 requested a review from JalinWang as a code owner June 29, 2026 12:48
The Q^T copy in MatrixRotator::init wrote matrix_[j*dim+i] with the inner
loop over j, i.e. by column (stride dim), thrashing cache for large dim.
Interchange the loops so the inner loop runs over i and the stores are
sequential. Q is read strided either way, so making the writes contiguous
is the net win. Pure copy: result is unchanged, verified by the existing
matrix_rotator_test.
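
The init() change can be pictured with the sketch below. The helper `copy_transposed` is hypothetical, and Q is reached through a caller-supplied accessor `q(i, j)` because the commit message does not say how Q is stored.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: stores the transpose of Q into row-major `matrix`.
// `q(i, j)` is any callable returning element (i, j) of Q; Q's storage
// layout is deliberately left unspecified.
template <class QAccessor>
void copy_transposed(QAccessor&& q, float* matrix, std::size_t dim) {
  assert(matrix != nullptr);
  for (std::size_t j = 0; j < dim; ++j) {
    float* row = matrix + j * dim;             // row j of the transpose
    for (std::size_t i = 0; i < dim; ++i) {
      row[i] = q(i, j);                        // sequential stores
    }
  }
}
```

With j on the outer loop and i on the inner loop, each store goes to the next float in memory, which is the sequential-write pattern the commit message describes.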