@Scottcjn commented on Jan 31, 2026:

## Summary

Adds IBM POWER8 (ppc64le) support with scalar and VSX SIMD I2_S kernels. The VSX implementation uses vmsummbm (via vec_msum) for 16-way signed×unsigned byte multiply-accumulate, plus dcbt prefetch hints with the L3-resident touch hint (TH=0x10) to keep weight tensors cached.
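For illustration, the heart of the kernel is one mixed-sign multiply-sum per 16-byte block. A minimal sketch of that building block (names illustrative, not the PR's exact code):

```c
#include <altivec.h>

/* One vmsummbm: multiply 16 signed int8 activations by 16 unsigned bytes
 * (weights unpacked from 2-bit I2_S) and accumulate the 16 products into
 * 4 int32 lanes: 16 multiply-adds per instruction. */
static inline vector signed int
i2s_mac16(vector signed char act, vector unsigned char wgt,
          vector signed int acc) {
    return vec_msum(act, wgt, acc);
}
```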

Achieves 9.1–10.5x speedup over scalar baseline on prompt processing across three model sizes (700M, 2B, 8B).

Also adds Power Mac G5 (big-endian PowerPC 970) support with AltiVec SIMD kernels via VSX/AltiVec compatibility macros. One code path works on both POWER8 and G5: four macros abstract the ISA differences (vec_vsx_ld vs vec_ld, TH-hinted dcbt vs basic dcbt). This is the first known instance of a 1-bit LLM running on 2003-era big-endian hardware.

## POWER8 Benchmarks

Hardware: IBM Power System S824 (8286-42A), 16c/128t POWER8, 512 GB DDR3
Config: 64 threads, numactl --interleave=all, OMP_PROC_BIND=spread

### Scalar → VSX Speedup

| Model | Params | pp128 (scalar) | pp128 (VSX) | Speedup |
|---|---|---|---|---|
| bitnet_b1_58-large | 728M | 21.48 t/s | 211.48 t/s | 9.8x |
| BitNet-b1.58-2B-4T | 2.74B | 8.04 t/s | 73.03 t/s | 9.1x |
| Llama3-8B-1.58 | 8.03B | 2.60 t/s | 27.39 t/s | 10.5x |

### Full Results (VSX + dcbt)

| Model | Size | pp128 | pp256 | pp512 | tg32 |
|---|---|---|---|---|---|
| bitnet_b1_58-large | 257 MiB | 209.38 t/s | 176.67 t/s | 134.10 t/s | 24.02 t/s |
| BitNet-b1.58-2B-4T | 1.71 GiB | 71.95 t/s | 64.98 t/s | 52.67 t/s | 11.99 t/s |
| Llama3-8B-1.58 | 3.58 GiB | 26.98 t/s | 25.06 t/s | 21.70 t/s | 5.63 t/s |

## Power Mac G5 Benchmarks (Big-Endian)

Hardware: Power Mac G5 Dual 2.0 GHz (PowerPC 970), 8 GB DDR2, Mac OS X 10.5.8
Compiler: GCC 10.5.0, -mcpu=970 -maltivec -Os

| Model | Size | pp6 | tg20 |
|---|---|---|---|
| bitnet_b1_58-large | 257 MiB | 4.68 t/s | 1.70 t/s |

AltiVec kernel microbenchmark: The raw dot product kernel achieves 16.1x speedup over scalar (5.84 GMAC/s vs 0.36 GMAC/s on a 768×1536 matmul). End-to-end inference is limited by Amdahl's law: matmul is only 12-24% of total inference time. The remaining 76-88% is framework overhead (layernorm, softmax, RoPE, activation quantization, and 870 ggml_barrier() synchronizations per token).
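As a rough Amdahl sanity check (our arithmetic, using only the numbers above): overall speedup = 1 / ((1 - p) + p/s). With kernel speedup s = 16.1 and matmul share p between 0.12 and 0.24, the predicted end-to-end gain is only about 1.13-1.29x, roughly in line with the step from the 4.31 t/s scalar prompt eval to 4.68 t/s with AltiVec.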

Use -t 1 (single thread) for best G5 performance — barrier overhead on 870 graph nodes makes multi-threading counterproductive.

The G5 AltiVec kernels share the same code path as POWER8 VSX through 4 compatibility macros:

| Macro | POWER8 (VSX) | G5 (AltiVec) |
|---|---|---|
| I2S_VEC_LD_UC | vec_vsx_ld (unaligned) | vec_ld (aligned) |
| I2S_VEC_LD_SC | vec_vsx_ld (signed char) | vec_ld (aligned) |
| I2S_DCBT_RESIDENT | dcbt TH=0x10 (L3 sticky) | dcbt (basic) |
| I2S_DCBT | dcbt TH=0 (transient) | dcbt (basic) |
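A sketch of how these four macros can be defined (the real definitions live in src/ggml-bitnet-mad.cpp; the dcbt spelling below assumes the server-form RA,RB,TH operand order accepted by recent GNU as):

```c
#if defined(__VSX__)                 /* POWER8, ppc64le */
  #define I2S_VEC_LD_UC(off, p) vec_vsx_ld(off, (const unsigned char *)(p))
  #define I2S_VEC_LD_SC(off, p) vec_vsx_ld(off, (const signed char *)(p))
  /* TH=0x10: hint that the block should stay resident in cache */
  #define I2S_DCBT_RESIDENT(p)  __asm__ volatile("dcbt 0,%0,16" :: "r"(p) : "memory")
  /* TH=0: ordinary transient touch */
  #define I2S_DCBT(p)           __asm__ volatile("dcbt 0,%0,0" :: "r"(p) : "memory")
#elif defined(__ALTIVEC__)           /* G5 / PowerPC 970, big-endian */
  #define I2S_VEC_LD_UC(off, p) vec_ld(off, (const unsigned char *)(p))
  #define I2S_VEC_LD_SC(off, p) vec_ld(off, (const signed char *)(p))
  /* The 970 predates TH hints: plain cache-block touch for both */
  #define I2S_DCBT_RESIDENT(p)  __asm__ volatile("dcbt 0,%0" :: "r"(p) : "memory")
  #define I2S_DCBT(p)           __asm__ volatile("dcbt 0,%0" :: "r"(p) : "memory")
#endif
```

On POWER8 these resolve back to the original VSX forms, which is what the regression note under Testing verifies.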

Vector constants use vec_splat_u8() (generates vspltisb instruction) instead of static const arrays, avoiding Mach-O alignment issues on old Darwin. No endian changes needed: vec_msum accumulates all 16 bytes into 4 int32 lanes, then hsum sums all 4 — the total is identical regardless of lane assignment order.
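Both points in a compact sketch (simplified from the PR's hsum_i32_4_ppc; shown for the big-endian branch):

```c
#include <altivec.h>

/* In-register constant via vspltisb: no constant pool, so no Mach-O
 * section-alignment concerns on old Darwin. */
#define I2S_MASK3 vec_splat_u8(3)

/* Big-endian (G5) horizontal add: vsumsws folds all four int32 lanes
 * into one scalar, so the result does not depend on lane order. */
static inline int hsum_i32_4_be(vector signed int v) {
    vector signed int s = vec_sums(v, vec_splat_s32(0));
    return vec_extract(s, 3);   /* vsumsws leaves the total in lane 3 */
}
```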

always_inline is critical on Mach-O: without it, every i2s_ppc_half call generates VRsave save/restore sequences (mfspr/mtspr, ~20 cycles each), which devastates performance in the inner loop.
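A reduced example of the pattern (the real helper is i2s_ppc_half; this stand-in only shows the attribute placement):

```c
#include <altivec.h>

/* Without always_inline, each out-of-line call pays a VRsave save/restore
 * (mfspr/mtspr, ~20 cycles each) under the 32-bit Darwin ABI. */
__attribute__((always_inline))
static inline vector signed int acc_add(vector signed int a,
                                        vector signed int b) {
    return vec_add(a, b);
}
```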

## Files Changed

  • src/ggml-bitnet-mad.cpp — POWER8 VSX + G5 AltiVec I2_S kernels with compatibility macros
  • include/gemm-config.h — PowerPC architecture detection (__VSX__, __ALTIVEC__, __powerpc__, __ppc__)
  • include/bitnet-lut-kernels.h — PowerPC stub (LUT kernels are x86/ARM-specific)
  • patches/g5-big-endian.patch — GGUF byte-swap for big-endian hosts (sketched just after this list)
  • patches/regex-ppc.h — POSIX regex wrapper for PPC big-endian
  • patches/build_g5.sh — G5 AltiVec build script (-Os, llama-cli)
  • README.md — Build instructions and benchmarks
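The byte-swap follows the standard pattern for reading little-endian data on a big-endian host; a minimal sketch (helper name hypothetical, not the patch's actual gguf_fread_val):

```c
#include <stdint.h>

/* GGUF is little-endian on disk, so on a big-endian host every multi-byte
 * scalar (and the F32/F16/I2_S tensor payloads) is swapped after reading. */
static inline uint32_t le32_to_host(uint32_t v) {
#if defined(__BIG_ENDIAN__)
    return __builtin_bswap32(v);   /* e.g. PowerPC 970 under Darwin */
#else
    return v;                      /* little-endian host: no-op */
#endif
}
```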

## Build

### POWER8 (ppc64le)

```bash
cmake -B build -DCMAKE_C_FLAGS="-mcpu=power8 -mvsx" -DCMAKE_CXX_FLAGS="-mcpu=power8 -mvsx"
cmake --build build --config Release
```

### Power Mac G5 (ppc64be)

```bash
./patches/build_g5.sh /usr/local/gcc-10/bin
```

## Testing

  • POWER8: Verified on IBM Power System S824 with bitnet_b1_58-large (700M), BitNet-b1.58-2B-4T (2B), and Llama3-8B-1.58 (8B). All models produce correct output.
  • Power Mac G5: Verified on Power Mac G5 Dual 2.0 GHz with bitnet_b1_58-large (700M). Produces coherent English text output. AltiVec kernels verified active via microbenchmark (16x raw dot product speedup).
  • POWER8 regression: Same code compiles and runs correctly with -mvsx — the compatibility macros resolve to the original VSX intrinsics on POWER8.
---

Scottcjn and others added 4 commits on January 30, 2026 03:42:
Port Microsoft BitNet to IBM POWER8 (ppc64le). Adds scalar fallback
implementations for all 4 I2_S dot product kernel functions plus the
quantize_i2_s function. Also adds PowerPC defines to gemm-config.h.

Benchmarks on POWER8 S824 (16c/128t, 512GB RAM, scalar-only):
- BitNet Large 700M: 21.5 t/s pp128, 11.2 t/s tg32
- BitNet 2B:         8.0 t/s pp128,  4.1 t/s tg32
- Llama3-8B BitNet:  2.6 t/s pp128,  1.6 t/s tg32

Files changed:
- gemm-config.h: Add PARALLEL_SIZE/ROW_BLOCK_SIZE for __VSX__/__powerpc64__
- ggml-bitnet-mad.cpp: Scalar fallbacks for quantize_i2_s,
  ggml_vec_dot_i2_i8_s_1x1, _1x4_32W, _1xN, _Nx1
- bitnet-lut-kernels.h: Stub header for POWER8 (LUT kernels are x86/ARM)

Next: VSX-optimized kernels using vec_perm for 10-16x speedup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---
Replace the scalar fallbacks with POWER8 AltiVec/VSX-optimized kernels for
all 5 I2_S functions, built on the vec_msum (vmsummbm) instruction.

Key optimizations:
- vec_msum: 16 signed×unsigned byte multiply-accumulates per cycle
- dcbt prefetch hints for weight/activation data
- i2s_vsx_half() helper processes 16-byte blocks in vectorized form
- All 5 kernels: quantize, 1x1, 1x4_32W, 1xN, Nx1

Benchmark results (POWER8 S824, 64 threads):
  700M: pp128 21.48 -> 211.48 t/s (9.8x)
  2B:   pp128  8.04 ->  73.03 t/s (9.1x)
  8B:   pp128  2.60 ->  27.39 t/s (10.5x)
  8B:   tg32   1.61 ->   4.90 t/s (3.0x)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---
Use dcbt with TH=0x10 (L3 resident hint) for weight block prefetch
instead of transient dcbt. Keeps weight data sticky in L3 cache
between token generation steps, avoiding re-fetch from DRAM.

Key design: prefetch only the next block (not a bulk scan) to avoid
overhead on prompt processing. Bulk prefetch hurts pp because
BitNet I2_S blocks are tiny (32 bytes) and vec_msum is so fast
that prefetch overhead dominates.

Results (vs VSX-only baseline):
  700M tg32: 22.77 -> 24.02 t/s (+5.5%)
  2B   tg32: 10.93 -> 11.99 t/s (+9.7%)
  8B   tg32:  4.90 ->  5.63 t/s (+14.9%)
  pp unchanged (within noise)

Full speedup from scalar baseline:
  8B pp128: 2.60 -> 26.98 t/s (10.4x)
  8B tg32:  1.61 ->  5.63 t/s (3.5x)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---
Adds a POWER8/PowerPC section to the README with:
- Build instructions for ppc64le
- Three optimization levels explained (scalar, VSX, dcbt)
- Full benchmark tables for 700M, 2B, and 8B models
- Scalar-to-VSX speedup comparison (9-10x)
- Key technical details (vec_msum, dcbt resident, NUMA)
- Model sources and conversion notes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---
@Scottcjn (Author) commented:

@microsoft-github-policy-service agree

---

@Scottcjn changed the title to "feat(power8): Add IBM POWER8 ppc64le support with VSX SIMD kernels — 10x speedup" on Jan 31, 2026
---

Scott and others added 2 commits on January 31, 2026 15:59:
Add patches and build infrastructure for running BitNet on PowerPC 970
(Power Mac G5) big-endian systems. GGUF format is always little-endian
on disk; this adds byte-swap support for all multi-byte scalar reads
and tensor data.

Key changes:
- g5-big-endian.patch: gguf_fread_val() byte-swap function for GGUF
  reader, tensor data byte-swap for F32/F16/I2_S at load time,
  sizeof(bool)==4 fix for PowerPC GCC
- regex-ppc.h: POSIX regex wrapper replacing broken std::regex on
  PPC big-endian
- build_g5.sh: Build script with G5-safe compiler flags (-Os)

Tested on Power Mac G5 Dual 2.0 GHz, Mac OS X 10.5, GCC 10.5.0.
Produces coherent text at 4.31 t/s prompt eval, 1.61 t/s generation
with bitnet_b1_58-large (728M).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---
Port POWER8 VSX SIMD kernels to G5 AltiVec using 4 compatibility macros
that abstract the ISA differences. One code path, works on both targets.

Compatibility macros:
- I2S_VEC_LD_UC/I2S_VEC_LD_SC: vec_vsx_ld (POWER8) vs vec_ld (G5)
- I2S_DCBT_*: TH-hint dcbt (POWER8) vs basic dcbt (G5)

Key changes across all 4 kernel functions (1x1, 1x4_32W, 1xN, Nx1):
- vec_vsx_ld → I2S_VEC_LD_UC / I2S_VEC_LD_SC (22 sites)
- static const vector arrays → vec_splat_u8() macros (avoids Mach-O
  alignment issues on old Darwin, generates vspltisb instruction)
- hsum_i32_4_vsx → hsum_i32_4_ppc with G5 branch using vec_sums
- POWER8-specific dcbt TH hints → G5-safe basic dcbt fallback
- Architecture guard extended with __powerpc__ and __ppc__
- Build script updated: -Os → -O3 (safe now that vector constants
  are in-register), added llama-bench target

Scalar baseline on G5 Dual 2.0 GHz: pp5 = 4.31 t/s, tg = 1.61 t/s
Target with AltiVec: 12-20 t/s (3-5x speedup via vmsummbm)

No endian changes needed: vec_msum accumulates all 16 bytes into 4
int32 lanes then hsum reduces all 4 - total is identical regardless
of lane assignment order on BE vs LE.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---
@Scottcjn (Author) commented on Feb 1, 2026 via email:

- Change build_g5.sh from -O3 to -Os: higher optimization levels cause
  Bus errors on G5 due to Mach-O ABI stack alignment issues with aggressive
  vector register spills from GCC 10.
- Add __attribute__((always_inline)) to i2s_ppc_half and hsum_i32_4_ppc:
  without this, the Mach-O ABI generates VRsave save/restore sequences
  (mfspr/mtspr, ~20 cycles each) on every function call, devastating
  performance in the inner dot product loop.
- Recommend -t 1 for G5 inference: single thread is faster because
  ggml_barrier() overhead on 870 graph nodes per token exceeds the
  benefit of 2-thread parallelism.
- Remove llama-bench from G5 build (C++ compat issues with GCC 10).

G5 AltiVec kernel microbenchmark: 16.1x raw speedup (5.84 vs 0.36 GMAC/s).
End-to-end limited by Amdahl's law: matmul is 12-24% of total inference.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>