Cryptographic and Non-Cryptographic Hash Functions
Many of us - people involved with information technology - have heard about MD5, SHA1, SHA2, and other hash functions, especially if you work in information security. The main idea behind hash functions is to generate a fixed output from a given input, kind of like generating a 'signature' of that input.
When it comes to web development, for example, it's common to encounter scenarios where you need to compare if two files have the same content. Let's call them File1 and File2. Suppose you have to frequently compare these files.
Without hash functions, you would likely need to read the entire content of File1 and File2 to compare if they match. But what if you could generate a 'signature' for each file using a hash function and then only compare the signatures? That would be much faster.
The reason is that the output of the hash function, also known as the digest, has a fixed length and is significantly smaller in size compared to a file, as shown in our example. Here's an example of a digest using a hash function named MD5: 8d92b359921d5716e3856d6a666e19dd.
# Generating the MD5 digest of all ".png" files
# In my current directory using "md5sum"
$ md5sum *.png
036f00f5694e9f10919436c49f610f4b file1.png
d10120ff9e8205a4f7af8f3c7ba721b3 file2.png
The problem is that MD5 - and now SHA1 - have been demoted from being cryptographic hash functions to non-cryptographic hash functions.
Do you know the difference?
We're used to simply calling MD5, SHA1, SHA2, and others as 'hash functions,' but cryptographic hash functions are a special class among hash functions that aim to provide certain security guarantees that non-cryptographic hash functions do not. For example, when obtaining a device fingerprint, you should use a cryptographic hash function to have more guarantees of its output uniqueness.
The ideal cryptographic hash function has six main properties:
- Deterministic: the same message always results in the same hash.
- Quick: it is quick to compute the hash value for any given message.
- One-way function: it is infeasible to generate a message from its hash value except by trying all possible messages.
- Avalanche effect: a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value.
- Collision resistant: it is infeasible to find two different messages with the same hash value.
- Pre-image attack resistant: a pre-image attack on cryptographic hash functions tries to find a message that has a specific hash value. A cryptographic hash function should resist attacks on its pre-image.
On the other hand, non-cryptographic hash functions provide weaker guarantees in exchange for performance improvements. They mainly aim to avoid collisions for non-malicious input. An example would be detecting data corruption due to an unstable network.
Cryptographic Function Speed
Even after MD5 stopped being recommended as a cryptographic hash function, the truth is that it is still being used because of its performance. MD5 is very fast. And that may not be a problem, as long as you don't care about malicious input that could result in collisions. Just treat it like another non-cryptographic function.
But time has passed, and other algorithms have taken its place. Here's a benchmark:
Blake2b?! You probably haven't heard about it, at least not as often as the others. It's not as popular as MD5 and SHA1, but it's very fast and, until someone says otherwise, secure. Learn more about it at https://blake2.net/.
Non-Cryptographic Function Speed
Blake2 already seems pretty fast, but what if you want more speed? If you don't care as much about the mentioned security properties of cryptographic hash functions, Seahash looks like an interesting choice. It has gained popularity recently. Its author wrote about it here. According to "ticki_" on HN, who has a very similar name to "ticki," the author of the Seahash post:
BLAKE(2) is a cryptographic hash function, SeaHash is not. Even the fastest implementations of BLAKE only get around 7.8 cycles/byte (hardware might do it twice as fast). SeaHash gets 0.24 cycles/byte. That's a wide difference, around 32x faster.
Wow, 32x faster. And here are some use cases for non-cryptographic hash functions as long as there is no way to feed them with malicious input:
- Checksums and error correction codes
- Hash tables
- Caches
- Bloom filters
- Finding duplicated records
See Also
- Blake2
- Security Stack Exchange - Difference between a crypto and non-crypto hash function
- Wikipedia - Uses of hash functions
- Wikipedia - Cryptographic hash function
- Wikipedia - Preimage attack
- HN - Seahash discussion (1st)
- HN - Seahash discussion (2nd)
- Ticki - Seahash explained
Epilogue
That's all for today. Thank you.