English Portuguese

Cryptographic and Non-Cryptographic Hash Functions

Many of us - people involved with information technology - heard about MD5, SHA1, SHA2 and other hash functions, specially if you work with information security. The main idea behind hash functions is to generate a fixed output from a given input. It's sort of generating a 'signature' of this input.

When it comes to web development for example, it's common to encounter a scenario where you need to compare if 2 files have the same content. Let's call them File1 and File2. Also suppose that you have to compare those files frequently.

Without hash functions you probably would need to read all content from File1 and all content from File2 to compare if they match. But what if you could generate a 'signature' for each file using a hash function and then only compare the signatures? That would be a lot faster.

The reason is the hash function output - also called digest - have a fixed length and is way smaller in size when compared to a file like in our example. Here's an example of a digest using a hash function named MD5: 8d92b359921d5716e3856d6a666e19dd.

# Generating the MD5 digest of all ".png" files
# In my current directory using "md5sum"
$ md5sum *.png
036f00f5694e9f10919436c49f610f4b  file1.png
d10120ff9e8205a4f7af8f3c7ba721b3  file2.png

The problem is that MD5 - and now SHA1 - has been demoted from being a cryptographic hash function to a non-cryptographic hash function now.

Do you know the difference?

We're used to just call MD5, SHA1, SHA2 and others simply as 'hash functions' and that's it. However cryptographic hash functions are a special class among hash functions that aim to provide certain security guarantees that non-cryptographic hash functions don't. For example, when obtaining a device fingerprinting, you should use a cryptographic hash function to have more guarantees of its output uniqueness.

The ideal cryptographic hash function has six main properties:

  1. Deterministic: the same message always results in the same hash;
  2. Quick: it is quick to compute the hash value for any given message;
  3. One-way function: it is infeasible to generate a message from its hash value except by trying all possible messages;
  4. Avalanche effect: a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value;
  5. Collision resistant: it is infeasible to find two different messages with the same hash value
  6. Pre-image attack resistant: a pre-image attack on cryptographic hash functions tries to find a message that has a specific hash value. A cryptographic hash function should resist attacks on its pre-image.

On the other hand, non-cryptographic hash functions provide weaker guarantees in exchange for performance improvements. They just try to avoid collisions for non malicious input. An example would be to detect data corruption due to an unstable network.

Cryptographic Functions Speed

Even after MD5 stopped being recommended as a cryptographic hash function to be used, the truth is that is still being used. The main point is performance. MD5 is very fast. And that may not be a problem, as long as you don't care about malicious input that could result in collisions. Just treat it like another non-cryptographic function.

But time have passed by and other algorithms took its place. Here's a benchmark:

Hash Functions Speed Benchmarking

Blake2b?! You probably didn't hear about it. At least not so often as the others. It's not popular as MD5 and SHA1 but very fast and until someone tell otherwise, secure. Learn more about it at https://blake2.net/.

Non-Cryptographic Functions Speed

Blake2 seems pretty fast already, but what if you want more speed? As long as you don't care that much the mentioned security properties of cryptographic hash functions, Seahash looks to be an interesting choice. It's became pretty popular recently. It's author wrote about it here. According to "ticki_" in HN, which has a very similar name to "ticki", the author of the Seahash post:

BLAKE(2) is a cryptographic hash function, SeaHash is not. Even the fastest implementations of BLAKE only gets around 7.8 cycles/byte (hardware might do it twice as fast). SeaHash gets 0.24 cycles/byte. That's a wide difference, around 32x faster.

Wow 32x. And here are some uses cases for non-cryptographic hash functions as long as there is no way to feed it with malicious input:

See Also

Epilogue

That's all for today. Thank you.


Share the knowledge :)

Share on Twitter Share on Facebook Share on Google Plus Share on LinkedIn Share on Hacker News