Why does Git use SHA-1, a cryptographic hash function, instead of a faster non-cryptographic hash function?

Related question:

The Stack Overflow question "Why does Git use SHA-1 as version numbers?" asks why Git uses SHA-1 as opposed to sequential numbers for commits.

Personally I think that even using broken SHA-1 over SHA-2 was premature optimization. –  CodesInChaos Mar 1 at 14:00
@CodesInChaos: and besides, baking any particular algorithm into the code was a horrible violation of DI principles. Should be in an XML config file somewhere ;-) –  Steve Jessop Mar 1 at 23:54

1 Answer


You can hear that from Linus Torvalds himself, from when he presented Git at Google back in 2007 (emphasis mine):

We check checksums that are considered cryptographically secure. Nobody has been able to break SHA-1, but the point is, SHA-1, as far as git is concerned, isn't even a security feature. It's purely a consistency check.
The security parts are elsewhere. A lot of people assume that since git uses SHA-1 and SHA-1 is used for cryptographically secure stuff, it's a huge security feature. It has nothing at all to do with security, it's just the best hash you can get.

Having a good hash is good for being able to trust your data; it happens to have some other good features too: it means that when we hash objects, we know the hash is well distributed and we do not have to worry about certain distribution issues.

Internally it means, from the implementation standpoint, that we can trust that the hash is so good that we can use hashing algorithms and know there are no bad cases.

So there are some reasons to like the cryptographic side too, but it's really about the ability to trust your data.
I guarantee you, if you put your data in git, you can trust the fact that five years later, after it was converted from your hard disk to DVD to whatever new technology and you copied it along, you can verify that the data you get back out is the exact same data you put in. And that is something you really should look for in a source code management system.
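For a concrete sense of what that consistency check looks like: Git hashes each object over a small type-and-size header followed by the raw content. Here is a minimal Python sketch of how a blob's object ID is derived (the "blob <size>\0" header is Git's documented object format; the helper name is mine):

    import hashlib

    def git_blob_sha1(content: bytes) -> str:
        # Git hashes "blob <size>\0" followed by the raw bytes, so the
        # object ID covers both the content and its length.
        header = b"blob %d\x00" % len(content)
        return hashlib.sha1(header + content).hexdigest()

    # The well-known empty blob; matches `git hash-object --stdin < /dev/null`
    print(git_blob_sha1(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

If a single bit flips anywhere between your hard disk and the DVD, the recomputed ID no longer matches the stored one, which is exactly the integrity guarantee Torvalds is describing.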


I mentioned in "How would git handle a SHA-1 collision on a blob?" that you could engineer a commit with a particular SHA-1 prefix (still an extremely costly endeavor).
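To illustrate what "engineering a prefix" means, here is a toy brute-force sketch (plain SHA-1 over arbitrary bytes, not a real Git commit; the nonce field is my own invention). The expected cost grows as 16**k for a k-hex-digit prefix, which is why short vanity prefixes are feasible while full 40-digit collisions remain out of reach:

    import hashlib

    def find_prefix_nonce(base: bytes, prefix: str, limit: int = 10**6):
        # Vary a throwaway "nonce" field until the digest starts with the
        # desired hex prefix. Expected tries: about 16 ** len(prefix).
        for nonce in range(limit):
            digest = hashlib.sha1(base + b"\nnonce: %d" % nonce).hexdigest()
            if digest.startswith(prefix):
                return nonce, digest
        return None  # not found within the attempt limit

    print(find_prefix_nonce(b"tree ...\nauthor ...", "cafe"))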
But the point remains, as Eric Sink mentions in "Git: Cryptographic Hashes" (from his 2011 book Version Control by Example):

It is rather important that the DVCS never encounter two different pieces of data which have the same digest. Fortunately, good cryptographic hash functions are designed to make such collisions extremely unlikely.
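To put "extremely unlikely" into rough numbers, here is a back-of-the-envelope birthday-bound estimate (an approximation that assumes SHA-1 behaves like a random 160-bit function):

    # Birthday bound: with n random 160-bit digests, the probability of
    # any accidental collision is roughly n**2 / 2**161.
    n = 10**9                       # a billion objects, far beyond any real repo
    p = float(n) ** 2 / 2.0 ** 161
    print(p)                        # ~3.4e-31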

It is harder to find a good non-cryptographic hash with a low collision rate, unless you consider research like "Finding State-of-the-Art Non-cryptographic Hashes with Genetic Programming".

You can also read "Consider use of non-cryptographic hash algorithm for hashing speed-up", which mentions, for instance, "xxhash", an extremely fast non-cryptographic hash algorithm working at speeds close to RAM limits.
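For scale, here is a rough micro-benchmark sketch comparing the two; it assumes the third-party xxhash Python package is installed (pip install xxhash), and the absolute numbers depend entirely on your hardware:

    import hashlib, os, time
    import xxhash  # third-party: pip install xxhash

    data = os.urandom(64 * 1024 * 1024)  # 64 MiB of input

    for name, fn in [("sha1",  lambda d: hashlib.sha1(d).digest()),
                     ("xxh64", lambda d: xxhash.xxh64(d).digest())]:
        start = time.perf_counter()
        fn(data)
        elapsed = time.perf_counter() - start
        print("%-6s %8.0f MB/s" % (name, 64 / elapsed))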


Discussions around changing the hash in Git are not new:

(Linus Torvalds)

There's not really anything remaining of the mozilla code, but hey, I started from it. In retrospect I probably should have started from the PPC asm code that already did the blocking sanely - but that's a "20/20 hindsight" kind of thing.

Plus hey, the mozilla code being a horrid pile of crud was why I was so convinced that I could improve on things. So that's a kind of source for it, even if it's more about the motivational side than any actual remaining code ;)

And you need to be careful about how you measure the actual optimization gain:

(Linus Torvalds)

I pretty much can guarantee you that it improves things only because it makes gcc generate crap code, which then hides some of the P4 issues.

(John Tapsell - johnflux)

The engineering cost for upgrading git from SHA-1 to a new algorithm is much higher. I'm not sure how it can be done well.

First of all we probably need to deploy a version of git (let's call it version 2 for this conversation) which allows there to be a slot for a new hash value even though it doesn't read or use that space -- it just uses the SHA-1 hash value which is in the other slot.

That way once we eventually deploy yet a newer version of git, let's call it version 3, which produces SHA-3 hashes in addition to SHA-1 hashes, people using git version 2 will be able to continue to inter-operate.
(Although, per this discussion, they may be vulnerable and people who rely on their SHA-1-only patches may be vulnerable.)

In short, switching to any hash is not easy.
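To make the "slot" idea more concrete, here is a minimal sketch of an object ID with a mandatory SHA-1 slot and an optional slot for a successor hash. This is my own toy model of the scheme described above, not Git's actual object format:

    import hashlib

    class DualHashId:
        def __init__(self, payload: bytes, with_sha3: bool = False):
            # The SHA-1 slot is always filled; the SHA-3 slot is only
            # filled by newer ("version 3") clients.
            self.sha1 = hashlib.sha1(payload).hexdigest()
            self.sha3 = hashlib.sha3_256(payload).hexdigest() if with_sha3 else None

        def verify(self, payload: bytes) -> bool:
            # An old ("version 2") client checks only the SHA-1 slot; a
            # newer client also checks SHA-3 whenever the slot is populated.
            if hashlib.sha1(payload).hexdigest() != self.sha1:
                return False
            if self.sha3 is not None:
                return hashlib.sha3_256(payload).hexdigest() == self.sha3
            return True

Older clients simply ignore the extra slot, which is what lets "version 2" and "version 3" inter-operate during the transition.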

It does seem like the recent crop of high quality non-cryptographic hash functions, like xxhash, came out just a little too late -- right after git. –  Praxeolitic Mar 1 at 22:22
@Praxeolitic indeed. There have been discussions about replacing SHA-1 with another hash, but it would simply require quite a bit of work for something which, for now, is working fine. –  VonC Mar 1 at 22:27

Interesting read: whonix.org/forum/index.php/topic,538.msg4278.html#msg4278 –  VonC Mar 5 at 20:52

"we know the hash is well distributed and we do not have to worry about certain distribution issues" - why is this an issue for scm? –  roded Mar 6 at 11:58

@roded the collision rate is low enough to be well-suited for an SCM, where the data is generally not random but mostly text files. –  VonC Mar 6 at 12:27
