
v0.7.2 #273

Merged
Cyan4973 merged 62 commits into master from dev
Oct 8, 2019
Conversation

Owner

@Cyan4973 Cyan4973 commented Oct 7, 2019

  • Fixed collision ratio of XXH128 for some specific input lengths, reported by @svpv
  • Improved VSX and NEON variants, by @easyaspi314
  • Improved performance of the scalar code path (XXH_VECTOR=0), by @easyaspi314
  • xxhsum: can generate a 128-bit hash with command -H2 (note: for experimental purposes only! XXH128 is not yet frozen)
  • xxhsum: option -q removes status notifications
easyaspi314 and others added 30 commits August 20, 2019 21:06
The VSX codepath is now working on POWER8 and is fully enabled.

The little endian code has been verified on POWER8E, although
a big endian machine was not available.

This uses vpermxor from POWER8 to shuffle on big endian.

There are a few other fixes as well to unify endian memes.
[PPC64][TRAVIS] Fix VSX + add POWER8 support, fix VSX and ARM NEON Travis testing
fix XXH32 and XXH32_digest return types

Clang was using ldmia and ldrd on unaligned pointers. These
instructions don't support unaligned access.

I also check the numerical value of __ARM_ARCH.
IT IS DEFINED BY THE STANDARD
Prevent Clang from emitting unaligned ldm/ldrd on ARMv6, better arm macros
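The fix described here replaces direct pointer casts with memcpy-based reads. A minimal sketch of the idea (this is an illustration, not the library's actual code; `read_u32` is a hypothetical name):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: read a 32-bit value from a possibly unaligned pointer via
 * memcpy. Compilers lower this to a single load on targets that allow
 * unaligned access, and to safe byte loads (never ldm/ldrd, which
 * require alignment) on targets such as ARMv6 that do not. */
static uint32_t read_u32(const void* ptr)
{
    uint32_t val;
    memcpy(&val, ptr, sizeof(val));
    return val;
}
```

Because the cast-based alternative is undefined behavior on misaligned pointers, the compiler is free to assume alignment and emit ldm/ldrd; the memcpy form makes no such promise.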
now generate errors when there is a compiler warning
fix #249

Also fix a few corresponding minor warnings on Visual.
by using a custom variable XXHASH_C_FLAGS
as suggested by @wesm.
Visual Studio tests on Appveyor
Add comment about CRC32 speed comparison
Sorry about the disorganized commit. :(

Yet again, I had to fix ARMv6. Clang went from ldm to ldrd, which
also causes bus errors.

Therefore, I decided to fix the root problem and remove the
XXH_FORCE_DIRECT_MEMORY_ACCESS hack, using only memcpy.

This will kill alignment memes for good, and besides, it didn't
seem to make much of a difference.

Additionally, I added my better 128-bit long multiply
and applied DRY to XXH3_mul128_fold64. This also removes
the cryptic inline assembly hack.

Each method was documented, too (we need more comments).

Also, I added a warning for users who are compiling Thumb-1
code for a target supporting ARM instructions.

While all versions of ARM and Thumb-2 meet XXH3's base requirements,
Thumb-1 does not.

First of all, UMULL is inaccessible in the 16-bit subset. This means
that every XXH_mult32to64 becomes a call to __aeabi_lmul.

Since every operation in XXH3 needs to happen in the Lo registers,
plus having to set up r0-r3 many times for __aeabi_lmul, the output
resembles a game of Rush Hour:

 $ clang -O3 -S --target=arm-none-eabi -march=armv4t -mthumb xxhash.c
 $ grep -c mov xxhash.s
 5472
 $ clang -O3 -S --target=arm-none-eabi -march=armv4t xxhash.c
 $ grep -c mov xxhash.s
 2071

It is much more practical to compile xxHash with the wider instruction
sets, as these restrictions do not apply.

This doesn't warn if ARMv6-M is targeted; Thumb-1 is unavoidable.
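The operation behind XXH_mult32to64 is just a widening 32×32→64 multiply; a minimal sketch (hypothetical standalone name):

```c
#include <stdint.h>

/* Sketch of a 32x32 -> 64 widening multiply, the operation behind
 * XXH_mult32to64. On ARM or Thumb-2 this compiles to a single UMULL;
 * the Thumb-1 subset has no UMULL, so the compiler instead calls
 * __aeabi_lmul and shuttles operands through r0-r3, producing the
 * mov-heavy output measured above. */
static uint64_t mult32to64(uint32_t lhs, uint32_t rhs)
{
    return (uint64_t)lhs * (uint64_t)rhs;
}
```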

Lastly, I removed the pragma clang loop hack which didn't work anymore
since the number of iterations can't be constant evaluated. Now, we
don't have 20 warnings when compiling for x86.
Clang prefers to emit aligned-only instructions with the second variant.

Clang works fine with memcpy.
Better 128-bit multiply, multiple bugfixes.
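The 128-bit multiply folded to 64 bits that this commit mentions can be sketched portably as follows. This is an assumption-laden illustration (hypothetical name; the real XXH3_mul128_fold64 also has platform-specific fast paths beyond the two shown here):

```c
#include <stdint.h>

/* Sketch: 64x64 -> 128 multiply, folded to 64 bits by XORing the two
 * halves of the product. Uses the compiler's 128-bit type when
 * available, else a portable 32-bit decomposition. */
static uint64_t mul128_fold64(uint64_t lhs, uint64_t rhs)
{
#if defined(__SIZEOF_INT128__)
    __uint128_t const product = (__uint128_t)lhs * rhs;
    return (uint64_t)product ^ (uint64_t)(product >> 64);
#else
    /* Split each operand into 32-bit halves and combine partial products. */
    uint64_t const lo_lo = (lhs & 0xFFFFFFFF) * (rhs & 0xFFFFFFFF);
    uint64_t const hi_lo = (lhs >> 32)        * (rhs & 0xFFFFFFFF);
    uint64_t const lo_hi = (lhs & 0xFFFFFFFF) * (rhs >> 32);
    uint64_t const hi_hi = (lhs >> 32)        * (rhs >> 32);
    uint64_t const cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;
    uint64_t const upper = (hi_lo >> 32) + (cross >> 32) + hi_hi;
    uint64_t const lower = (cross << 32) | (lo_lo & 0xFFFFFFFF);
    return lower ^ upper;
#endif
}
```

Writing it this way lets the compiler pick the best instruction sequence per target, without the inline assembly hack the commit removes.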
to reflect the different terms
for the library (BSD-2)
and the command line interface (GPLv2),

answering #253
xxhsum -q no longer displays the "Loading" notification
== xxhsum -H2
fix collisions for xxh128 in 9-16 bytes range
failing `-c` verification test
Cyan4973 and others added 29 commits September 28, 2019 20:02
for a slightly better bias
changing xxh128 results for len within 129-240.
improve xxh128 for mid-size
Reduce void pointers and evil casts.
Previously, XXH3_64bits looked much faster than XXH3_128bits. The truth
is that they are similar on long keys. The difference was that
XXH3_64b's benchmark was unseeded, putting it at an unfair advantage
over XXH128, which is seeded.

I don't think I am going to do the dummy bench. That made things more
complicated.
It's only a start, but it's an improvement. I still have more things I
would like to change, but it is good for now.
I need to stop coding before my coffee. :/
Use both seeded and unseeded variants in the bench
Try to improve some variable names.
especially on the canonical representation paragraph,
to make it clear it's the preferred format for storage and transmission.
The previous XXH3_accumulate_512 loop didn't fare well since XXH128
started swapping the addition.

Neither GCC nor Clang could follow the barely-readable loop, resulting
in garbage code output.

This made XXH3 much slower. Take 32-bit scalar ARM.

Ignoring loads and potential interleaving optimizations, in the main
loop, XXH32 takes 16 cycles for 8 bytes on a typical ARMv6+ CPU, or 2 cpb.

```asm
        mla     r0, r2, r5, r0  @ 4 cycles
        ror     r0, r0, #19     @ 1 cycle
        mul     r0, r0, r6      @ 3 cycles
        mla     r1, r3, r5, r1  @ 4 cycles
        ror     r1, r1, #19     @ 1 cycle
        mul     r1, r1, r6      @ 3 cycles
```

XXH3_64b takes 9, or 1.1 cpb:
```asm
        adds    r0, r0, r2      @ 2 cycles
        adc     r1, r1, r3      @ 1 cycle
        eor     r4, r4, r2      @ 1 cycle
        eor     r5, r5, r3      @ 1 cycle
        umlal   r0, r1, r4, r5  @ 4 cycles
```

Benchmarking on a Pixel 2 XL (with a binary for ARMv4T), previously,
XXH32 got 1.8 GB/s, while XXH3_64b got 1.7.

Now, XXH3_64b gets 2.3 GB/s! This calculates out well (as additional
loads and stores have some overhead).

Unlike before, it is better to disable autovectorization completely, as
the compiler can't vectorize it as well. (Especially with Clang and
NEON, where it extracts to multiply instead of the obvious vmlal.u32!).

On that same device in aarch64 mode, when compiled with
`clang-8 -O3 -DXXH_VECTOR=0 -fno-vectorize -fno-slp-vectorize`,
XXH3's scalar version went from 2.3 GB/s to 4.3 GB/s. For comparison,
the NEON version gets 6.0 GB/s.

However, almost all platforms with decent autovectorization have a
handwritten intrinsics version which is much faster.

For optimal performance, use -fno-tree-vectorize -fno-tree-slp-vectorize
(or simply disable SIMD instructions entirely).

From testing, ARM32 also prefers forced inlining, so I enabled it.

I also fixed some typos.
[SCALAR] Improve scalar XXH3_accumulate_512 loop
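The scalar round described above (a 64-bit add into the swapped lane, an XOR with the secret, then a 32×32→64 multiply-accumulate, matching the adds/adc/eor/umlal sequence) can be sketched like this. Names, the 8-lane width, and the exact lane-swap are assumptions inferred from the commit description, not the library's verbatim code:

```c
#include <stdint.h>
#include <string.h>

static uint64_t load64(const void* p)
{
    uint64_t v;
    memcpy(&v, p, sizeof(v)); /* host byte order; a sketch simplification */
    return v;
}

/* Sketch of one scalar accumulator round over 8 lanes of 8 bytes. */
static void accumulate_round(uint64_t acc[8],
                             const uint8_t* input,
                             const uint8_t* secret)
{
    for (size_t i = 0; i < 8; i++) {
        uint64_t const data_val = load64(input + 8 * i);
        uint64_t const data_key = data_val ^ load64(secret + 8 * i);
        acc[i ^ 1] += data_val;  /* the swapped addition XXH128 introduced */
        acc[i]     += (uint64_t)(uint32_t)data_key * (data_key >> 32);
                                 /* one UMLAL per lane on 32-bit ARM */
    }
}
```

Written as a plain loop per lane, both GCC and Clang can follow the dependencies, unlike the previous barely-readable formulation.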
Fix endianness detection on GCC, avoid XXH_cpuIsLittleEndian.
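Compile-time endianness detection on GCC/Clang can be sketched as follows (hypothetical function name; the point is to prefer the compiler's byte-order macro over a runtime check like XXH_cpuIsLittleEndian):

```c
#include <stdint.h>

/* Sketch: use GCC/Clang's predefined __BYTE_ORDER__ macro when present,
 * so the branch folds away at compile time; fall back to a runtime
 * union check only when the macro is unavailable. */
static int is_little_endian(void)
{
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
    return __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__;
#else
    const union { uint32_t u; uint8_t c[4]; } one = { 1 };
    return one.c[0];
#endif
}
```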
Fixes #258.

```c
BYTE -> xxh_u8
U32  -> xxh_u32
U64  -> xxh_u64
```

Additionally, I hopefully fixed an issue for targets where int is 16
bits. XXH32 used unsigned int for its seed and, in C90 mode, unsigned
int as its U32, which would cause truncation issues. In C90 mode I
check limits.h to make sure UINT_MAX == 0xFFFFFFFFUL, and if it isn't,
I use unsigned long.

We should see if we can set up an AVR CI test. Just to run the
verification program, though, as the benchmark will take a very long
time.

Lastly, the seed types are XXH32_hash_t and XXH64_hash_t for XXH32/64.
This matches xxhash.c and prevents the aforementioned 16-bit int bug.
Improve typedefs, fix 16-bit int/seed type bug
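The limits.h check described above can be sketched as (`xxh_u32_sketch` is a hypothetical name; the actual preprocessor logic in xxhash may differ):

```c
#include <limits.h>

/* Sketch: pick a genuinely 32-bit-capable unsigned type without
 * <stdint.h> (unavailable in strict C90). On targets where int is
 * 16 bits, UINT_MAX is 0xFFFF, so fall back to unsigned long, which
 * C guarantees holds at least 32 bits. */
#if UINT_MAX == 0xFFFFFFFFUL
  typedef unsigned int  xxh_u32_sketch;
#else
  typedef unsigned long xxh_u32_sketch;
#endif
```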
for internal typedef xxh_u32 and xxh_u64
some types were not needed.

Also: xxh_u* types are only necessary within libxxhash,
not xxhsum,
since it can be included.
since it inexplicably complains about `main` since 4e4570f
@Cyan4973 Cyan4973 merged commit e2f4695 into master Oct 8, 2019
