
v0.7.2 #273

Merged
Cyan4973 merged 62 commits into master from dev
Oct 8, 2019
Conversation

Owner

@Cyan4973 Cyan4973 commented Oct 7, 2019

  • Fixed collision ratio of XXH128 for some specific input lengths, reported by @svpv
  • Improved VSX and NEON variants, by @easyaspi314
  • Improved performance of the scalar code path (XXH_VECTOR=0), by @easyaspi314
  • xxhsum: can generate a 128-bit hash with command -H2 (note: for experimental purposes only! XXH128 is not yet frozen)
  • xxhsum: option -q removes status notifications
easyaspi314 and others added 30 commits August 20, 2019 21:06
The VSX codepath is now working on POWER8 and is fully enabled.

The little endian code has been verified on POWER8E, although
a big endian machine was not available.

This uses vpermxor from POWER8 to shuffle on big endian.

There are a few other fixes as well to unify endian memes.
[PPC64][TRAVIS] Fix VSX + add POWER8 support, fix VSX and ARM NEON Travis testing
fix XXH32 and XXH32_digest return types

Clang was using ldmia and ldrd on unaligned pointers. These
instructions don't support unaligned access.

I also check the numerical value of __ARM_ARCH.
IT IS DEFINED BY THE STANDARD
Prevent Clang from emitting unaligned ldm/ldrd on ARMv6, better arm macros
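The fix described here replaces direct pointer casts with memcpy-based reads. A minimal sketch of the idea (this is an illustration, not the library's actual code; `read_u32` is a hypothetical name):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: read a 32-bit value from a possibly unaligned pointer via
 * memcpy. Compilers lower this to a single load on targets that allow
 * unaligned access, and to safe byte loads (never ldm/ldrd, which
 * require alignment) on targets such as ARMv6 that do not. */
static uint32_t read_u32(const void* ptr)
{
    uint32_t val;
    memcpy(&val, ptr, sizeof(val));
    return val;
}
```

Because the cast-based alternative is undefined behavior on misaligned pointers, the compiler is free to assume alignment and emit ldm/ldrd; the memcpy form makes no such promise.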
now generate errors when there is a compiler warning
fix #249

Also fix a few corresponding minor warnings on Visual.
by using a custom variable XXHASH_C_FLAGS
as suggested by @wesm.
Visual Studio tests on Appveyor
Add comment about CRC32 speed comparison
Sorry about the disorganized commit. :(

Yet again, I had to fix ARMv6. Clang went from ldm to ldrd, which
also causes bus errors.

Therefore, I decided to fix the root problem and remove the
XXH_FORCE_DIRECT_MEMORY_ACCESS hack, using only memcpy.

This will kill alignment memes for good, and besides, it didn't
seem to make much of a difference.

Additionally, I added my better 128-bit long multiply
and applied DRY to XXH3_mul128_fold64. This also removes
the cryptic inline assembly hack.

Each method was documented, too (we need more comments).

Also, I added a warning for users who are compiling Thumb-1
code for a target supporting ARM instructions.

While all versions of ARM and Thumb-2 meet XXH3's base requirements,
Thumb-1 does not.

First of all, UMULL is inaccessible in the 16-bit subset. This means
that every XXH_mult32to64 becomes a call to __aeabi_lmul.

Since every operation in XXH3 needs to happen in the Lo registers,
plus having to set up r0-r3 many times for __aeabi_lmul, the output
resembles a game of Rush Hour:

 $ clang -O3 -S --target=arm-none-eabi -march=armv4t -mthumb xxhash.c
 $ grep -c mov xxhash.s
 5472
 $ clang -O3 -S --target=arm-none-eabi -march=armv4t xxhash.c
 $ grep -c mov xxhash.s
 2071

It is much more practical to compile xxHash with the wider instruction
sets, as these restrictions do not apply.

This doesn't warn if ARMv6-M is targeted; Thumb-1 is unavoidable.
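The operation behind XXH_mult32to64 is just a widening 32×32→64 multiply; a minimal sketch (hypothetical standalone name):

```c
#include <stdint.h>

/* Sketch of a 32x32 -> 64 widening multiply, the operation behind
 * XXH_mult32to64. On ARM or Thumb-2 this compiles to a single UMULL;
 * the Thumb-1 subset has no UMULL, so the compiler instead calls
 * __aeabi_lmul and shuttles operands through r0-r3, producing the
 * mov-heavy output measured above. */
static uint64_t mult32to64(uint32_t lhs, uint32_t rhs)
{
    return (uint64_t)lhs * (uint64_t)rhs;
}
```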

Lastly, I removed the pragma clang loop hack which didn't work anymore
since the number of iterations can't be constant evaluated. Now, we
don't have 20 warnings when compiling for x86.
Clang prefers to emit aligned-only instructions with the second variant.

Clang works fine with memcpy.
Better 128-bit multiply, multiple bugfixes.
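The 128-bit multiply folded to 64 bits that this commit mentions can be sketched portably as follows. This is an assumption-laden illustration (hypothetical name; the real XXH3_mul128_fold64 also has platform-specific fast paths beyond the two shown here):

```c
#include <stdint.h>

/* Sketch: 64x64 -> 128 multiply, folded to 64 bits by XORing the two
 * halves of the product. Uses the compiler's 128-bit type when
 * available, else a portable 32-bit decomposition. */
static uint64_t mul128_fold64(uint64_t lhs, uint64_t rhs)
{
#if defined(__SIZEOF_INT128__)
    __uint128_t const product = (__uint128_t)lhs * rhs;
    return (uint64_t)product ^ (uint64_t)(product >> 64);
#else
    /* Split each operand into 32-bit halves and combine partial products. */
    uint64_t const lo_lo = (lhs & 0xFFFFFFFF) * (rhs & 0xFFFFFFFF);
    uint64_t const hi_lo = (lhs >> 32)        * (rhs & 0xFFFFFFFF);
    uint64_t const lo_hi = (lhs & 0xFFFFFFFF) * (rhs >> 32);
    uint64_t const hi_hi = (lhs >> 32)        * (rhs >> 32);
    uint64_t const cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;
    uint64_t const upper = (hi_lo >> 32) + (cross >> 32) + hi_hi;
    uint64_t const lower = (cross << 32) | (lo_lo & 0xFFFFFFFF);
    return lower ^ upper;
#endif
}
```

Writing it this way lets the compiler pick the best instruction sequence per target, without the inline assembly hack the commit removes.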
to reflect the different terms
for the library (BSD-2)
and the command line interface (GPLv2),

answering #253
xxhsum -q no longer displays the "Loading" notification
== xxhsum -H2
fix collisions for xxh128 in 9-16 bytes range
failing `-c` verification test
Cyan4973 and others added 29 commits September 28, 2019 20:02
for a slightly better bias
changing xxh128 results for len within 129-240.
improve xxh128 for mid-size
Reduce void pointers and evil casts.
Previously, XXH3_64bits looked much faster than XXH3_128bits. The truth
is that they are similar on long keys. The difference was that
XXH3_64b's benchmark was unseeded, putting it at an unfair advantage
over XXH128, which is seeded.

I don't think I am going to do the dummy bench. That made things more
complicated.
It's only a start, but it's an improvement. I still have more things I
would like to change, but it is good for now.
I need to stop coding before my coffee. :/
Use both seeded and unseeded variants in the bench
Try to improve some variable names.
especially on the canonical representation paragraph,
to make it clear it's the preferred format for storage and transmission.
The previous XXH3_accumulate_512 loop didn't fare well since XXH128
started swapping the addition.

Neither GCC nor Clang could follow the barely-readable loop, resulting
in garbage code output.

This made XXH3 much slower. Take 32-bit scalar ARM.

Ignoring loads and potential interleaving optimizations, in the main
loop, XXH32 takes 16 cycles for 8 bytes on a typical ARMv6+ CPU, or 2 cpb.

```asm
        mla     r0, r2, r5, r0  @ 4 cycles
        ror     r0, r0, #19     @ 1 cycle
        mul     r0, r0, r6      @ 3 cycles
        mla     r1, r3, r5, r1  @ 4 cycles
        ror     r1, r1, #19     @ 1 cycle
        mul     r1, r1, r6      @ 3 cycles
```

XXH3_64b takes 9, or 1.1 cpb:
```asm
        adds    r0, r0, r2      @ 2 cycles
        adc     r1, r1, r3      @ 1 cycle
        eor     r4, r4, r2      @ 1 cycle
        eor     r5, r5, r3      @ 1 cycle
        umlal   r0, r1, r4, r5  @ 4 cycles
```

Benchmarking on a Pixel 2 XL (with a binary for ARMv4T), previously,
XXH32 got 1.8 GB/s, while XXH3_64b got 1.7.

Now, XXH3_64b gets 2.3 GB/s! This calculates out well (as additional
loads and stores have some overhead).

Unlike before, it is better to disable autovectorization completely, as
the compiler can't vectorize it as well. (Especially with Clang and
NEON, where it extracts to multiply instead of the obvious vmlal.u32!).

On that same device in aarch64 mode, when compiled with
`clang-8 -O3 -DXXH_VECTOR=0 -fno-vectorize -fno-slp-vectorize`,
XXH3's scalar version went from 2.3 GB/s to 4.3 GB/s. For comparison,
the NEON version gets 6.0 GB/s.

However, almost all platforms with decent autovectorization have a
handwritten intrinsics version which is much faster.

For optimal performance, use -fno-tree-vectorize -fno-tree-slp-vectorize
(or simply disable SIMD instructions entirely).

From testing, ARM32 also prefers forced inlining, so I enabled it.

I also fixed some typos.
[SCALAR] Improve scalar XXH3_accumulate_512 loop
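The scalar round described above (a 64-bit add into the swapped lane, an XOR with the secret, then a 32×32→64 multiply-accumulate, matching the adds/adc/eor/umlal sequence) can be sketched like this. Names, the 8-lane width, and the exact lane-swap are assumptions inferred from the commit description, not the library's verbatim code:

```c
#include <stdint.h>
#include <string.h>

static uint64_t load64(const void* p)
{
    uint64_t v;
    memcpy(&v, p, sizeof(v)); /* host byte order; a sketch simplification */
    return v;
}

/* Sketch of one scalar accumulator round over 8 lanes of 8 bytes. */
static void accumulate_round(uint64_t acc[8],
                             const uint8_t* input,
                             const uint8_t* secret)
{
    for (size_t i = 0; i < 8; i++) {
        uint64_t const data_val = load64(input + 8 * i);
        uint64_t const data_key = data_val ^ load64(secret + 8 * i);
        acc[i ^ 1] += data_val;  /* the swapped addition XXH128 introduced */
        acc[i]     += (uint64_t)(uint32_t)data_key * (data_key >> 32);
                                 /* one UMLAL per lane on 32-bit ARM */
    }
}
```

Written as a plain loop per lane, both GCC and Clang can follow the dependencies, unlike the previous barely-readable formulation.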
Fix endianness detection on GCC, avoid XXH_cpuIsLittleEndian.
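Compile-time endianness detection on GCC/Clang can be sketched as follows (hypothetical function name; the point is to prefer the compiler's byte-order macro over a runtime check like XXH_cpuIsLittleEndian):

```c
#include <stdint.h>

/* Sketch: use GCC/Clang's predefined __BYTE_ORDER__ macro when present,
 * so the branch folds away at compile time; fall back to a runtime
 * union check only when the macro is unavailable. */
static int is_little_endian(void)
{
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
    return __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__;
#else
    const union { uint32_t u; uint8_t c[4]; } one = { 1 };
    return one.c[0];
#endif
}
```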
Fixes #258.

```c
BYTE -> xxh_u8
U32  -> xxh_u32
U64  -> xxh_u64
```

Additionally, I hopefully fixed an issue for targets where int is 16
bits. XXH32 used unsigned int for its seed and, in C90 mode, unsigned
int as its U32, which would cause truncation issues. In C90 mode I
check limits.h to make sure UINT_MAX == 0xFFFFFFFFUL, and if it isn't,
I use unsigned long.

We should see if we can set up an AVR CI test. Just to run the
verification program, though, as the benchmark will take a very long
time.

Lastly, the seed types are XXH32_hash_t and XXH64_hash_t for XXH32/64.
This matches xxhash.c and prevents the aforementioned 16-bit int bug.
Improve typedefs, fix 16-bit int/seed type bug
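The limits.h check described above can be sketched as (`xxh_u32_sketch` is a hypothetical name; the actual preprocessor logic in xxhash may differ):

```c
#include <limits.h>

/* Sketch: pick a genuinely 32-bit-capable unsigned type without
 * <stdint.h> (unavailable in strict C90). On targets where int is
 * 16 bits, UINT_MAX is 0xFFFF, so fall back to unsigned long, which
 * C guarantees holds at least 32 bits. */
#if UINT_MAX == 0xFFFFFFFFUL
  typedef unsigned int  xxh_u32_sketch;
#else
  typedef unsigned long xxh_u32_sketch;
#endif
```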
for internal typedef xxh_u32 and xxh_u64
some types were not needed.

Also: xxh_u* types are only necessary within libxxhash,
not xxhsum,
since it can be included.
since it inexplicably complains about `main` since 4e4570f
@Cyan4973 Cyan4973 merged commit e2f4695 into master Oct 8, 2019
