
Learn | Hashes

Intro

A hash is produced by a cryptographic hash function, one of the building blocks of modern security engineering. At its simplest, you give a hash function some data (e.g. images, videos, documents, database records, or anything else you can store on a computer) and it generates a short, fixed-length hash that represents that data. There are many different types of hash functions, and each generates a different hash for the same input, so they are not interchangeable.

A hash typically looks like the following example: the output of a SHA2-256 hash function (256 bits, 32 bytes), encoded as a hex string, for the input text Truestamp rocks!:

$ echo -n 'Truestamp rocks!' | openssl dgst -sha256
206c286fba2e26869322aacbcbbb69540968b017a600039c6ad9111bf99ba2e1

You can think of a hash as a fingerprint of a piece of data: in practice, no other data, however slightly different, will generate the same hash. If you give the same data to the same type of hash function, it will always generate the same hash as its output. This is what's known as a deterministic function.

A hash by any other name...

The output values returned by a hash function are variously called hash values, hash codes, digests, message digests, or simply hashes. We'll generally use the term hash here, but if you see the other terms used, they generally refer to the same thing.

Definition: Deterministic

Wikipedia definition of a deterministic system:

In mathematics, computer science and physics, a deterministic system is a system in which no randomness is involved in the development of future states of the system. A deterministic model will thus always produce the same output from a given starting condition or initial state.

Just like the math equation 1 + 1 = 2 is deterministic and will always create the same output for the same inputs, a hash function will always generate the same hash when provided the same data.

Hash functions have rules

An "ideal" hash function, as you might see it defined in a dry computer science text, has the following properties.

It is deterministic

For any specific type of hash function (e.g. SHA2-256), the same input data will always result in the same output hash. Changing even a single bit of the data from a 1 to a 0 will result in a different hash.
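
Determinism is easy to check for yourself. The following is a minimal sketch using Node's built-in crypto module; sha256Hex is a hypothetical helper name chosen for this example.

```javascript
const crypto = require("crypto")

// Hypothetical helper: hex-encoded SHA-256 hash of a UTF-8 string
function sha256Hex(text) {
  return crypto.createHash("sha256").update(text, "utf8").digest("hex")
}

const first = sha256Hex("Truestamp rocks!")
const second = sha256Hex("Truestamp rocks!")
const changed = sha256Hex("Truestamp rocks?")

console.log(first === second) // true: same input, same hash
console.log(first === changed) // false: one character changed, different hash
```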

It is fast

Modern computers and hash functions are designed to be fast. A modern laptop can easily generate millions, even billions, of hashes per second. Dedicated Application Specific Integrated Circuits (ASICs) optimized for hashing can generate trillions of new hashes per second. Generating a hash for some data is an extremely lightweight task for your computer to perform.
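
If you are curious how fast your own machine is, a rough measurement might look like the sketch below. measureHashRate is a hypothetical helper, and the 200 millisecond duration and batch size of 1,000 are arbitrary choices. For tiny inputs like these, Node's per-call overhead dominates, so the figure it prints will likely understate what optimized native code or dedicated hardware can do.

```javascript
const crypto = require("crypto")

// Hash small, distinct inputs for roughly `durationMs` milliseconds and
// return the approximate number of hashes completed per second.
function measureHashRate(durationMs = 200) {
  const start = process.hrtime.bigint()
  const deadline = start + BigInt(durationMs) * 1000000n
  let count = 0
  while (process.hrtime.bigint() < deadline) {
    // Work in batches so that checking the clock doesn't dominate the loop
    for (let i = 0; i < 1000; i++) {
      crypto.createHash("sha256").update(String(count)).digest()
      count++
    }
  }
  const elapsedSeconds = Number(process.hrtime.bigint() - start) / 1e9
  return count / elapsedSeconds
}

console.log(`about ${Math.round(measureHashRate())} SHA-256 hashes per second`)
```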

Its output is unpredictable

The output of a hash function must never be predictable, and it must not be feasible to produce a specific hash output by manipulating the inputs. This property protects against what is known as a preimage attack.
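
To get a feel for why this holds in practice, here is a toy sketch that searches for an input whose SHA-256 hash merely starts with a few chosen hex characters. findPrefixPreimage and the guess-N input format are made up for this example. Matching 3 hex characters takes about 16^3 = 4,096 tries on average, and every additional character multiplies the work by 16; matching all 64 characters is far beyond any computer.

```javascript
const crypto = require("crypto")

// Try inputs "guess-0", "guess-1", ... until one hashes to a value that
// starts with `prefixHex` (lowercase hex). Returns null if none is found
// within `maxTries` attempts.
function findPrefixPreimage(prefixHex, maxTries = 5000000) {
  for (let i = 0; i < maxTries; i++) {
    const candidate = `guess-${i}`
    const hex = crypto.createHash("sha256").update(candidate).digest("hex")
    if (hex.startsWith(prefixHex)) {
      return { candidate, hex, tries: i + 1 }
    }
  }
  return null
}

// Only 3 hex characters (12 bits) of the 64 need to match here
console.log(findPrefixPreimage("abc"))
```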

It is collision resistant

It should be computationally infeasible to find two different messages which result in the same hash output. This is known as collision resistance. In other words, two different documents should never, in practice, "collide" and produce the same hash. If they could, a hash or signature that is valid for a document would also be valid for its modified "evil twin".
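
Collisions become easy to find once a hash is made too short, which is why output length matters. The sketch below deliberately truncates SHA-256 to its first 3 bytes (24 bits) and looks for two inputs that share that shortened value. findTruncatedCollision and the message-N inputs are hypothetical. Because of the "birthday" effect, a collision on an n bit value is expected after roughly 2^(n/2) attempts: a few thousand here, but around 2^128 for the full 256 bit hash.

```javascript
const crypto = require("crypto")

// Find two different inputs whose SHA-256 hashes share the same first
// `bytes` bytes. Returns null if no collision appears within `maxTries`.
function findTruncatedCollision(bytes = 3, maxTries = 1000000) {
  const seen = new Map() // truncated hash (hex) -> input that produced it
  for (let i = 0; i < maxTries; i++) {
    const input = `message-${i}`
    const digest = crypto.createHash("sha256").update(input).digest()
    const short = digest.subarray(0, bytes).toString("hex")
    if (seen.has(short)) {
      return { a: seen.get(short), b: input, short, tries: i + 1 }
    }
    seen.set(short, input)
  }
  return null
}

console.log(findTruncatedCollision())
```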

It exhibits the "avalanche effect"

A small, even single-bit, change in the input to a hash function should result in a completely different hash output. The hash generated after even the most minor change in input should be statistically indistinguishable from random and should not be able to be correlated in any way with a previous hash output. A more detailed examination of this property can be found here.
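
You can observe the avalanche effect with a few lines of Node. This sketch flips a single bit of the input and counts how many of the 256 output bits change; differingBits is a hypothetical helper. For a well-behaved hash function you would expect roughly half of the bits to differ, though the exact count varies from input to input.

```javascript
const crypto = require("crypto")

// Count how many bits differ between two equal-length byte arrays
function differingBits(a, b) {
  let count = 0
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i]
    while (x) {
      count += x & 1
      x >>= 1
    }
  }
  return count
}

const original = Buffer.from("Truestamp rocks!", "utf8")
const flipped = Buffer.from(original) // copy of the input bytes
flipped[0] ^= 0x01 // flip the lowest bit of the first byte

const h1 = crypto.createHash("sha256").update(original).digest()
const h2 = crypto.createHash("sha256").update(flipped).digest()

console.log(h1.toString("hex"))
console.log(h2.toString("hex"))
console.log(`${differingBits(h1, h2)} of ${h1.length * 8} output bits differ`)
```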

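The one-way nature of hashes helps preserve privacy. Given only a hash, it is computationally infeasible to recover the original data or learn anything meaningful about it, provided the data cannot simply be guessed (a short or predictable input, such as a PIN, can be found by hashing every candidate).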

Types of hash functions

Many different types of hash functions have been created over the years. Some are no longer considered secure to use, meaning that one or more of the key rules of hashing algorithms have been broken.

Three families of hash functions have been most commonly used in recent years. Each is considered likely to remain secure for many years to come and is recommended for use with Truestamp.

SHA-2

SHA-2 (Secure Hash Algorithm 2) is a set of cryptographic hash functions designed by the United States National Security Agency (NSA) and first published in 2001.

Its predecessor, SHA-1, is still in common use (e.g. Git commit hashes) but is no longer considered secure.

There are a number of variants of SHA-2 to choose from. The two that are most commonly used are:

  • SHA-256 (256 bits, 32 byte output)
  • SHA-512 (512 bits, 64 byte output)

The SHA-256 hash function is arguably the one in most common use in 2021.

When submitting these hash function types to Truestamp you would give them the name sha2-256 or sha2-512.

Extra credit

If you want to learn more about the inner workings of SHA2-256 from a bits and bytes perspective this video goes into great detail.

SHA-3

SHA-3 (Secure Hash Algorithm 3) is the latest member of the Secure Hash Algorithm family of standards, released by NIST in 2015. Although part of the same series of standards, SHA-3 is internally different from the MD5-like structure of SHA-1 and SHA-2.

SHA-3 is designed so that it can be directly substituted for SHA-2 in current applications if necessary, and to significantly improve the robustness of NIST's overall hash algorithm toolkit. It is not intended as an immediate replacement for SHA-2, which is still considered secure.

There are a number of variants of SHA-3 to choose from. The two that are most commonly used are:

  • SHA3-256 (256 bits, 32 byte output)
  • SHA3-512 (512 bits, 64 byte output)

When submitting these hash function types to Truestamp you would give them the name sha3-256 or sha3-512.
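
The sketch below hashes the same text with SHA2-256 and SHA3-256 to show that the two produce hashes of the same length but with entirely different values. It assumes your Node build exposes the algorithm names sha256 and sha3-256 through OpenSSL; those are Node identifiers, not the Truestamp type names, and digestHex is a hypothetical helper.

```javascript
const crypto = require("crypto")

const message = "Truestamp rocks!"

// Hypothetical helper: hex-encoded hash using one of Node's algorithm names
function digestHex(algorithm, text) {
  return crypto.createHash(algorithm).update(text, "utf8").digest("hex")
}

const sha2 = digestHex("sha256", message)
const sha3 = digestHex("sha3-256", message)

console.log("sha256  ", sha2)
console.log("sha3-256", sha3)
console.log(sha2.length === sha3.length) // true: both are 64 hex characters
console.log(sha2 === sha3) // false: different functions, different hashes
```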

Keccak vs SHA-3?

The algorithm originally submitted to NIST for the SHA-3 competition was called Keccak. Controversially the output of Keccak and the final SHA-3 were different. There is a good explanation of the difference.

Therefore, Keccak and SHA-3 are NOT interchangeable terms. These two hash functions have different outputs as can be seen here.

The Ethereum blockchain uses Keccak-256 as its chosen hash function, presumably to avoid the controversy over the SHA-3 changes. You will sometimes see this referred to in Ethereum documentation or blog posts as SHA-3, which is incorrect.

Keccak hashes can be submitted to Truestamp with the type keccak-256 or keccak-512.

BLAKE

BLAKE is a cryptographic hash function based on Dan Bernstein's ChaCha stream cipher. The BLAKE2 hash function, based on BLAKE, was announced in 2012. The BLAKE3 hash function, based on BLAKE2, was announced in 2020.

BLAKE2b is faster than MD5, SHA-1, SHA-2, and SHA-3 on 64-bit x86-64 and ARM architectures. BLAKE2 provides better security than SHA-2 and similar to that of SHA-3: immunity to length extension, indifferentiability from a random oracle, etc. BLAKE2 is a family of variants, though, and choosing one is not as simple as it should be.

According to its documentation BLAKE3 is:

  • Much faster than MD5, SHA-1, SHA-2, SHA-3, and BLAKE2.
  • Secure, unlike MD5 and SHA-1. And secure against length extension, unlike SHA-2.
  • Highly parallelizable across any number of threads and SIMD lanes, because it's a Merkle tree on the inside.
  • Capable of verified streaming and incremental updates, again because it's a Merkle tree.
  • A PRF, MAC, KDF, and XOF, as well as a regular hash.
  • One algorithm with no variants, which is fast on x86-64 and also on smaller architectures.

The more modern BLAKE3 would be a recommended choice.

When submitting these hash function types to Truestamp you would give them the name blake3, blake2b-256, blake2b-512, blake2s-256, or blake2s-512.
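
A quick way to experiment with BLAKE2 is Node's built-in crypto module. This sketch assumes your Node build exposes OpenSSL's blake2b512 and blake2s256 algorithms (these are OpenSSL identifiers, not Truestamp type names) and checks for them before use. tryDigestHex is a hypothetical helper. As far as we know, BLAKE3 and the 256 bit BLAKE2b variant are not available through crypto.createHash and would need a third-party package.

```javascript
const crypto = require("crypto")

// Returns a hex-encoded hash, or null when this Node build does not
// provide the requested algorithm.
function tryDigestHex(algorithm, text) {
  if (!crypto.getHashes().includes(algorithm)) return null
  return crypto.createHash(algorithm).update(text, "utf8").digest("hex")
}

for (const algorithm of ["blake2b512", "blake2s256"]) {
  const hex = tryDigestHex(algorithm, "Truestamp rocks!")
  console.log(algorithm, hex === null ? "(not available in this Node build)" : hex)
}
```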

What hash function types does Truestamp accept?

The Truestamp service will allow you to submit SHA-1, the SHA-2 family, the SHA-3 family, and the BLAKE family of hashes. If you need to submit another type of hash, please let us know.

All hashes are validated by byte length against the expectations of the hash function type provided. For example, you cannot specify a hash type of sha2-256 (32 bytes) and submit a hash that is 64 bytes of encoded data.
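
A length check of this kind might look like the following sketch. It is an illustration only and not Truestamp's actual implementation: the EXPECTED_BYTES table and the hasExpectedLength helper are hypothetical, and only hex input and a handful of the type names are covered.

```javascript
// Expected hash sizes in bytes for a few of the type names used above
const EXPECTED_BYTES = new Map([
  ["sha2-256", 32],
  ["sha2-512", 64],
  ["sha3-256", 32],
  ["sha3-512", 64],
])

// Returns true when `hexHash` is valid hex and decodes to exactly the
// number of bytes expected for `type`.
function hasExpectedLength(type, hexHash) {
  const expected = EXPECTED_BYTES.get(type)
  if (expected === undefined) return false // unknown hash type
  // Check the characters first: Buffer.from() silently stops decoding at
  // the first invalid hex character rather than raising an error.
  if (!/^([0-9a-fA-F]{2})+$/.test(hexHash)) return false
  return Buffer.from(hexHash, "hex").length === expected
}

const hex = "206c286fba2e26869322aacbcbbb69540968b017a600039c6ad9111bf99ba2e1"
console.log(hasExpectedLength("sha2-256", hex)) // true: 32 bytes
console.log(hasExpectedLength("sha2-512", hex)) // false: 64 bytes expected
```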

Learn more details of the comparison of various cryptographic hash functions.

Hash encodings

The output of a hash function is a collection of bytes known as a digest. These bytes are often referred to as a byte array or binary data. These raw bytes are great for computers to work with, but make it hard to store them or pass them around in an email, text message, or database. For this reason hashes are almost always encoded to a text representation that can more easily be shared or stored by humans.

We've created a simple interactive demo, written in JavaScript, of hash encodings:

hashes.js
// For Node.js
// Run this file in your terminal using node:
// $ node hashes.js

const crypto = require("crypto")

const message = "Truestamp rocks!"

// Create a SHA-256 hash of `message`; digest() returns a Node Buffer (a Uint8Array subclass)
const hash = crypto.createHash("sha256")
const hash_update = hash.update(message, "utf-8")
const digest = hash_update.digest()

// Output the hash using different encodings

console.log("as Bytes")
console.log(digest)
//= <Buffer 20 6c 28 6f ba 2e 26 86 93 22 aa cb cb bb 69 54 09 68 b0 17 a6 00 03 9c 6a d9 11 1b f9 9b a2 e1>

console.log("as Hex")
console.log(digest.toString("hex"))
//= 206c286fba2e26869322aacbcbbb69540968b017a600039c6ad9111bf99ba2e1

console.log("as Base64")
console.log(digest.toString("base64"))
//= IGwob7ouJoaTIqrLy7tpVAlosBemAAOcatkRG/mbouE=

The Truestamp API accepts document hashes encoded as either hex or base64 strings.
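
Hex and base64 are just two ways of writing down the same bytes, so you can convert freely between them. This short sketch decodes the two example strings from this page and confirms they represent the same 32 byte hash; the variable names are arbitrary.

```javascript
const hex = "206c286fba2e26869322aacbcbbb69540968b017a600039c6ad9111bf99ba2e1"
const base64 = "IGwob7ouJoaTIqrLy7tpVAlosBemAAOcatkRG/mbouE="

// Decode each text encoding back into raw bytes
const fromHex = Buffer.from(hex, "hex")
const fromBase64 = Buffer.from(base64, "base64")

console.log(fromHex.length) // 32
console.log(fromHex.equals(fromBase64)) // true: identical bytes

// Re-encode the bytes in the other format
console.log(fromHex.toString("base64") === base64) // true
console.log(fromBase64.toString("hex") === hex) // true
```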

Extra credit

It's easy to play around with various hash formats using CyberChef.

You can also try using the openssl command line application which is available on most operating systems:

# a file with some data in it
$ cat truestamp.txt
Truestamp rocks!

# create a hex encoded hash representing the file contents
$ openssl dgst -sha256 truestamp.txt
SHA256(truestamp.txt)= 206c286fba2e26869322aacbcbbb69540968b017a600039c6ad9111bf99ba2e1

# create the same hash with the same content, but not stored in a file and with Hex output
$ echo -n 'Truestamp rocks!' | openssl dgst -sha256
206c286fba2e26869322aacbcbbb69540968b017a600039c6ad9111bf99ba2e1

# create the same hash with the same content, but not stored in a file and with Base64 output
$ echo -n 'Truestamp rocks!' | openssl dgst -sha256 -binary | openssl enc -base64
IGwob7ouJoaTIqrLy7tpVAlosBemAAOcatkRG/mbouE=
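
If you would rather hash a file from Node than from openssl, one possible approach is sketched below. sha256FileHex is a hypothetical helper; it reads the file in 64 KiB chunks (an arbitrary size) and feeds each chunk to the hash, so even very large files never need to fit in memory at once.

```javascript
const crypto = require("crypto")
const fs = require("fs")

// Hypothetical helper: hex-encoded SHA-256 hash of a file's contents
function sha256FileHex(path) {
  const hash = crypto.createHash("sha256")
  const chunk = Buffer.alloc(64 * 1024)
  const fd = fs.openSync(path, "r")
  try {
    let bytesRead
    // Passing null as the position reads from the current file offset
    while ((bytesRead = fs.readSync(fd, chunk, 0, chunk.length, null)) > 0) {
      hash.update(chunk.subarray(0, bytesRead))
    }
  } finally {
    fs.closeSync(fd)
  }
  return hash.digest("hex")
}

// Usage: node hash-file.js truestamp.txt
if (process.argv[2]) {
  console.log(sha256FileHex(process.argv[2]))
}
```

Feeding the data in pieces gives the same hash as hashing it all in one call, which is what makes this chunked approach safe.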

Security

One of the most common questions about hashes and security is whether someone could simply guess an input that produces a given hash. This is sometimes referred to as a brute force attack, as you are trying every possible value through sheer brute force in order to find a match. This thinking vastly underestimates the "search space" of a typical 256 or 512 bit hash. This short, fun video does a great job of explaining this universe-sized problem.
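
Some back-of-the-envelope arithmetic makes the size of that search space concrete. The numbers in this sketch are assumptions chosen for illustration: a machine making one trillion guesses per second, a year of 365.25 days, and a universe about 13.8 billion years old. JavaScript's BigInt keeps the arithmetic exact.

```javascript
// Assumed figures, for illustration only
const hashesPerSecond = 10n ** 12n // one trillion guesses per second
const secondsPerYear = 31557600n // 365.25 days
const ageOfUniverseYears = 13800000000n // about 13.8 billion years

// Number of possible outputs of a 256 bit hash
const searchSpace = 2n ** 256n

// Whole years needed to try every possible output once
const years = searchSpace / (hashesPerSecond * secondsPerYear)

console.log(`possible outputs: ${searchSpace}`)
console.log(`years to try them all: ${years}`)
console.log(`multiples of the age of the universe: ${years / ageOfUniverseYears}`)
```

Even at that rate the result is a number of years with 58 digits, vastly longer than the universe has existed.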

Are hashes safe from Quantum computers?

The question of whether or not cryptographic hash algorithms are safe in a post-quantum world is often raised. The best indications so far, without the necessary large quantum computer actually existing yet to test it, are that modern hash functions are safe in a post-quantum world.

A quantum search (Grover's algorithm) offers a theoretical speedup that would seem to reduce the security level by half (e.g. 256 bit security would be reduced to 128 bit security). However, 128 bit security is still considered intractable to brute-force using any known or conceived computing power. So both 256 bit and 512 bit hashes are considered post-quantum secure.

Fun fact: while hashes are considered post-quantum secure, current asymmetric encryption algorithms like RSA, including those used by tools such as PGP, are most definitely not. There is much work going on to get post-quantum secure encryption algorithms into everyday use.