# How Large is a 160 Bit Number?

When we combined SQL and Git to make Dolt, the world's first version controlled database, we took a lot of inspiration from Git's use of content addressed storage. Merkle Trees and DAGs allow us to continually add content to the database without fear that the data will get corrupted over time. Dolt, similar to Git, uses cryptographic checksums to address every piece of information in the database. These data blocks are called "Chunks", and each is addressed by a 20 byte key calculated from the chunk's content. Hence the name, "content addressed."

Every once in a while, you'll hear a thoughtful engineer chime in and suggest that content addressed objects, chunks in our case, could collide. That is to say, two chunks which have different data could end up with the same key. And if that were to happen, the entire system would fall apart.

It's usually about this time I roll my eyes.

Seriously though, it's virtually impossible to get my head around how big a 160 bit number is. The likelihood of two Dolt chunks having an address collision is very low. Very very low. Not impossible, but very very very very low. Let's discuss.

## Crypto Checksums

I'm not a cryptographer, but I play one on TV. Cryptographic checksums, also called Hash Functions, are very useful because they give us a one way function. They take in any number of bytes, and output a number. What makes them useful for cryptography is that it is very difficult (hopefully impossible) to take the number and reverse the function to determine what the input was.

For our purposes, we simply want to generate checksums of data blobs and ensure that they don't collide, because they are the addresses we'll use to refer to the chunks. Collision resistance is a measure of how many unique data blobs you would need, on average, to wind up with two identical keys. This is generally represented as 2 raised to some power, such as `2^100`, which states that you would need this many attempts, on average, to have a collision:

```
$ python
>>> import math
>>> 2 ** 100
1267650600228229401496703205376
>>> math.log10(2 ** 100)
30.10299956639812 # ~ 1 followed by 30 zeros.
```

Put another way, you would need a decimal number with 30 zeros behind it. Also, for a perfect hash function, the number of bits in the output must be twice the exponent of the desired collision resistance. In order to get `2^100` of resistance, you need a function which outputs a 200 bit number. This is due to the Birthday Paradox: in a random group of people, it's far more likely that two of them share a birthday (a collision) than intuition suggests. I won't dwell on this - you have the internet at your disposal.
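The math behind the paradox is compact. As a quick sketch (mine, not from any particular library), the standard birthday-bound approximation says the probability of at least one collision among `n` draws from `N` possible values is about `1 - e^(-n(n-1)/2N)`:

```
import math

def collision_probability(n_items: int, space: int) -> float:
    """Approximate birthday-bound probability that n random draws
    from `space` possible values contain at least one collision."""
    return 1 - math.exp(-n_items * (n_items - 1) / (2 * space))

# The classic birthday paradox: 23 people, 365 possible birthdays.
print(f"{collision_probability(23, 365):.2f}")  # ~0.50
```

With only 23 people, the odds of a shared birthday are already a coin flip, which is why the output size of a hash has to be twice the exponent of the resistance you want.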

Research into these cryptographic hash functions has been happening in the public sphere since the late 1980s. The MD* family of checksums were some of the earliest, and they all produced a 128 bit result. With a perfect hash function, that would give `2^64` of collision resistance. It's worth calling out that today that's not even considered secure, given how much compute power there is in the world. MD5 is the most commonly known MD* hash function, and pretty much everyone avoids using it. It turns out that a perfect checksum algorithm is really hard to achieve. The collision resistance of MD5 is currently considered to be `2^24`, which is abysmal. Your phone can easily crack that.

```
>>> 2 ** 24
16777216
>>> math.log10(2 ** 24)
7.224719895935548
```

The next generation of checksums that gained widespread use was SHA (later called SHA-0), which tried to compensate for not having a perfect one way function by producing a 160 bit result. The idea was that even if we know the function isn't perfect, there is significantly more than 128 bits of entropy in the output, therefore it must be better than `2^64`. Right?

SHA-1 followed with more improvements but kept the 160 bit result. SHA-1 is what Git uses today, and in many ways it has fallen into the same state of shambles as MD5. It actually suffers from some of the same problematic root causes as MD5. Current research puts the strength of SHA-1 at about `2^64` of collision resistance. Wrong.

```
>>> 2 ** 64
18446744073709551616
>>> math.log10(2 ** 64)
19.265919722494797
```

`1` followed by 19 zeros seems like a pretty big number. It's 10 quintillion. I can't even conceptualize how big a quintillion is, and we'll talk more about that below. Remember, this is saying that if you use SHA-1, even with its known warts, you would need 10 quintillion objects before there was a collision, on average.

Learning from previous mistakes, the SHA-2 family, which includes SHA-256 and SHA-512, greatly expanded the number of bits in the output (256 and 512 bits respectively). They didn't just expand the number of bits, though; they also addressed some of the underlying core design flaws of MD and SHA, and SHA-512 is currently considered to have a full `2^256` level of strength, which means no major vulnerability has been found in it.
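You can see the output sizes of all the hash functions discussed above directly with Python's standard `hashlib` module:

```
import hashlib

data = b"chunk contents go here"

# Digest sizes in bits for the hash functions discussed above.
for name in ("md5", "sha1", "sha256", "sha512"):
    digest = hashlib.new(name, data).digest()
    print(f"{name:>6}: {len(digest) * 8} bits")
```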

## What Dolt Does

All of Dolt's data is stored as byte buffers keyed by their checksum. Dolt uses the first 20 bytes of the SHA-512 of the data. The particulars of this decision predate Dolt and were made by the Noms team, whose work was the foundation Dolt was built on top of. There are a few things to note though:

- SHA-512 has proven to be a very good hash function, and hasn't been tarnished by vulnerabilities discovered in the last 22 years. We can consider ourselves lucky.
- Using a 20 byte key results in less space used in memory and on disk than the full 64 byte value returned by SHA-512.
- A 20 byte key yields a collision resistance of `2^80`, which works for our application.

This last point is where we'll dwell for the rest of this post.
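As a rough sketch of the addressing scheme (in Python for illustration; Dolt's actual implementation is in Go), truncating SHA-512 to 20 bytes looks like this:

```
import hashlib

def chunk_address(data: bytes) -> bytes:
    """Address a chunk by the first 20 bytes of its SHA-512 checksum."""
    return hashlib.sha512(data).digest()[:20]

addr = chunk_address(b"some chunk of table data")
print(len(addr))    # 20 bytes -> a 160 bit address
print(addr.hex())
```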

## Collisions! Oh My!

When I was migrating Amazon to Git more than a decade ago, there was a pretty consistent tone I heard from Amazon's very opinionated development culture: Git's use of SHA-1 wasn't a guarantee that there would never be key collisions. It's true, it's not a guarantee, but very few things in life are guaranteed. A similar debate at the time was that you couldn't use UUIDs as primary keys in your database because they might collide as well. I think we hear this concern less now, but it's not completely dead. I had someone ask me about collisions in a DM on Discord after I published our storage review, which is primarily the reason I'm writing this. The idea that two random 20 byte addresses would collide is very hard for people to let go of. As I mentioned above, I can't really conceive of how large 10 quintillion is. At the same time, I stated that SHA-1's collision resistance of `2^64` is to be avoided by most. But is `2^64` enough? How about `2^80`, which is what Dolt has?

### Let's count to 1 Billion

Chances are that if you have any sort of concept of how large the number 1,000,000,000 is, it's because someone described the life of a billionaire for you. For a billionaire, if they drop a $100 bill, it isn't worth their time to stop and pick it up. If you worked an hourly job 40 hours a week for 50 years, you would need to make $10,000 an hour to earn 1 billion dollars. A billion dollars is pretty much infinite money; unless you have a hobby of building rocket ships, I guess.

If you counted every second as it passed from the day you were born, you would reach a billion sometime in the 32nd year of your life. In my 40s now, I have a visceral sense of how long 32 years is. Granted, I was sleeping for some of that time, but put in these terms, 1,000,000,000 sounds like a number that is on the human level. The fact that there are 8 billion people on earth is a daunting idea, but it's still in the realm of something I can imagine.

In collision resistance terms, a hash with 1 billion possible outputs gives you only `2^15`:

```
>>> math.log2(1000000000)
29.897352853986263 # Bits required to represent this many values.
>>> 2 ** 30
1073741824 # Yep. ~1 Billion
>>> 2 ** 15
32768 # The collision resistance of a perfect hash which returns a 30 bit number.
```

This tells us that if we had a perfect hash function which returned a value between 1 and 1,000,000,000, and we used it as a means to address data, that on average we would have a collision after storing 32K objects.
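You can verify this empirically. Here's a quick sketch (mine, not from Dolt) that draws random 30 bit values until one repeats; the first collision reliably shows up after tens of thousands of draws, nowhere near a billion:

```
import random

def draws_until_collision(bits: int) -> int:
    """Draw random values from a 2**bits space until one repeats."""
    seen = set()
    draws = 0
    while True:
        value = random.getrandbits(bits)
        draws += 1
        if value in seen:
            return draws
        seen.add(value)

# Average a few trials against a ~1 billion value (30 bit) space.
trials = [draws_until_collision(30) for _ in range(20)]
avg = sum(trials) / len(trials)
print(f"average draws before a collision: {avg:,.0f}")
```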

The critical point to get across is that 1 billion is a number which runs up against my ability to even understand it. In collision resistance terms though, it's pretty small.

### Let's Count the Atoms in your Cup of Coffee

To count all those seconds for 32 years, you are going to need a lot of coffee. How many atoms are in that cup? What is the collision resistance of that number?

Does anyone else remember learning about Avogadro's number in chemistry? Avogadro's number, `6.022 x 10^23`, is the number of particles in one mole of a substance, and it can help with questions such as "How many atoms are in this cup of coffee?"

```
volume_cup_liters = 0.24 # Volume of the cup in liters
density_water_g_per_ml = 1 # Density of water in grams per milliliter
molar_mass_water_g_per_mol = 18 # Molar mass of water in grams per mole
avogadro_number = 6.022e23 # Avogadro's number
# Convert volume to grams (since density of water is 1 g/mL and 1 mL = 1 g)
mass_water_g = volume_cup_liters * 1000 * density_water_g_per_ml # 0.24 L * 1000 mL/L * 1 g/mL = 240 g
# Calculate the number of moles of water
moles_water = mass_water_g / molar_mass_water_g_per_mol
# Calculate the number of water molecules
molecules_water = moles_water * avogadro_number
# Each water molecule has 3 atoms (2 hydrogen and 1 oxygen)
atoms = molecules_water * 3
print(f"{atoms:.2e}") # 2.41e+25
```

You can check my math, but the punchline is that your cup of coffee has `2.41 x 10^25` atoms. That's a big number. If my brain maxes out at 9 zeros, 25 zeros is an entirely different ballgame. If each of those atoms of coffee expanded to the size of a grain of rice, your cup of coffee would cover every square inch of the earth with 125 meters of rice.

And what about collision resistance?

```
>>> math.log2(2.41e+25)
84.3172355186393
>>> 2 ** (84.317 / 2) # ~2^42 resistance
4908774390646.901
>>> math.log10(2 ** (84.317/2))
12.69097307219995
```

If your perfect hash function returned an 84 bit key, then you could use it to store about 4.9 trillion objects before you would likely have a collision. That's 3 orders of magnitude beyond what my brain can make sense of, but when I think about all that rice my coffee turned into, the effects of the Birthday Paradox are obvious. 5 trillion may be a number I can't get my head around, but I've worked with computer systems that have that many objects. Gonna need more bits.

### Let's Count the Atoms which make up the Oceans

I'll google this one for you. And ask ChatGPT too, because that's what we do now.

The oceans of earth are made up of approximately `1.34 x 10^47` atoms.

```
>>> math.log2(1.34e+47)
156.5528534603891
>>> 2 ** 78 # ~156.55 / 2
302231454903657293676544
```

Look at that: a 156 bit number can represent the number of atoms contained in the oceans. When I think about all the rice I turned my cup of coffee into, and multiply that by the size of our oceans, I do hit the point where my head explodes.

Using a 160 bit address space, we can expect that we won't have any collisions until we store about a septillion (`10^24`) objects.

```
>>> 2 ** 80
1208925819614629174706176
```

Now that's a big number. If I can barely fit 1 billion in my mind, a septillion is 1 trillion times 1 trillion. Remember, that's how many objects we can expect to store before there is a collision.

## Dolt's Requirements

How many objects does Dolt need to store? We've been working on pulling all of MediaWiki content into a database, which is currently over 300GB. The import isn't done, but we have about 2.8 million pages imported. The number of chunks in that database is ~170 million.

Maybe in a few years we'll have databases which are 100x or 1,000x that size - so having 10 billion objects in a Dolt database isn't completely out of the realm of possibility. Getting back to our question: is 160 bits enough? Yes. It's plenty. If you had 1 billion objects in your database, the chance that the next object added collides with an existing one is roughly one in `10^39` - far better than one in a trillion. I'll take those odds.
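A quick back-of-the-envelope check of that probability (my arithmetic, not a formal analysis): the chance that one new chunk lands on any of a billion occupied addresses is simply a billion divided by the size of the address space.

```
from decimal import Decimal

address_space = Decimal(2) ** 160   # number of possible 20 byte addresses
objects = Decimal(10) ** 9          # a 1 billion chunk database

# Probability that the next chunk written collides with any existing chunk.
p_next = objects / address_space
print(f"1 in {1 / p_next:.2e}")     # roughly 1 in 10^39
```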

Robin is not wrong. But when you keep in mind that a collision of two 160 bit numbers is as probable as two people selecting the same atom of hydrogen out of the ocean, I think we can agree that the odds are in our favor.