Modelling and Simulating Hash Functions

Programmer Adventures in Modelling and Simulating Hash Functions

I've been curious about the properties of hash functions lately. I had three questions, that I attempted to answer:

What is the order of SHA-2 (x)?
What is the order of SHA-2 (SHA-2 (x))?
What is the order of PBKDF2 (x)?

For this research, I define order to mean the size of the output set of the given function (in other words, how many unique outputs there are). Ideally, this would be 2^256 for SHA-256. If the cost of executing the function is considered to be one operation, than the complexity of finding a collision is directly related to the order of the function.

When the input set is small, finding the order of a given function is easy. Just run all possible inputs through the function, and count the number of unique results. Unfortunately, the input set of SHA-256 is impossibly large (though technically finite). Since I'm a programmer by nature, I chose to answer the above questions by modelling and simulating.

The model I use is a random oracle. These represent perfect hash functions; they are impossible to invert except through brute-force. For small input sizes, one can generate a random oracle easily. It's just a random LUT! To answer the first question using this model, we need to know the expected order of a random oracle. This is equal to the average of all possible random oracles (for a given input size). Again, we run into a problem, because the set of possible random oracles is too large. So, instead, I estimate the expected order by generating a suitably large number of random oracles and averaging their orders.

To answer the last two questions, we do the same, but apply the random oracle in the specified construct. e.g. Oracle (Oracle (x)). With enough samples, we should hopefully get good estimates for the expected orders and thus have answers to my questions. The Python code for all of this is listed at the bottom of this post. In the meantime...

Results

Code:

Testing 8 bits
Oracle:  162.212890625
Exponential:  19.470703125
Chained:  120.920898438

Testing 12 bits
Oracle:  2589.109375
Exponential:  95.3359375
Chained:  1920.23925781

Testing 16 bits
Oracle:  41426.6992188
Exponential:  1278.45996094
Chained:  30704.7412109

Testing 17 bits
Oracle:  82854.8759766
Exponential:  2516.46582031
Chained:  61408.5410156

Testing 18 bits
Oracle:  165699.530273
Exponential:  5046.56347656
Chained:  122825.053711

Testing 19 bits
Oracle:  331412.545898
Exponential:  10073.0644531
Chained:  245653.294922

Testing X bits means running the simulations with an input set of size 2 ^ X. Oracle is just the plain random oracle model. Exponential is 101 iterations of a random oracle iteratively applied (e.g. Oracle (Oracle (x)), but with 101 Oracles). Chained is similar to PBKDF2 (password and salt combined, no HMAC).

The results are certainly interesting. Over all the tested input sizes, we see consistently that the estimated expected order of a hash function is only 63% of the input set. In other words, based on these models, one would expect the order of SHA-256 to be ~2^255.3. Pretty good! On the other hand, SHA-256 in an Exponential construct is weaker, resulting in an order of ~2^250.3. SHA-256 in PBKDF2 would be ~2^254.9.

None of this is particularly surprising, but it's educational to have some concrete numbers. Of particular interest is the result for the Exponential construct. Some have conjectured that double SHA-2 is weaker than regular SHA-2. While this seems to be strictly true, the weakness is nowhere near significant. After 100 iterations of SHA-2, one would expect to see the output space cut down to 1% of its usual size. But 1% of 2^256 is still impossibly massive and impossible to search. If we factor in increased computational complexity, searching the space is still more expensive than single SHA-2. (NOTE: Doubling hash functions can have other benefits, too. e.g. HMAC_SHA1 should remain strong in spite of SHA1's weaknesses)

Chained (pseudo-PBKDF2) is interesting too. At its core it uses the hash function recursively like Exponential does, so seemingly the larger your iteration parameter for PBKDF2 is, the less pseudo entropy you're injecting into the system. But the expected order is still quite large, and of course computational complexity grows drastically.

Perhaps all of this could have been easily proven using modern algebra, but I'm a better programmer than I am a mathematician. So to all those crypto-interested programmers out there, I hope this has been useful.

The Code

Code:

import random
import os


# If True, causes the PRNG to be reseeded with entropy often.
EXTRA_RANDOM = True


# Represents an instance of a PseudoRandom Function
class PRF (object):
	def __init__ (self, func, input_bits):
		self.input_bits = input_bits
		self.func = func
	
	# Calculates the size of the output set
	def output_order (self):
		outputs = set ()

		for x in xrange (2 ** self.input_bits):
			y = self.func (x)
			outputs.add (y)

		return len (outputs)

	# Estimates the size of the output set
	def estimate_output_order (self, num_samples=2**16):
		num_samples = min (num_samples, 2 ** self.input_bits)
		inputs = random.sample (xrange (2 ** self.input_bits), num_samples)
		outputs = set ()

		for x in inputs:
			y = self.func (x)
			outputs.add (y)

		return len (outputs) * (2 ** self.input_bits) / num_samples


# A real random oracle generator
class RandomOracle (object):
	def __init__ (self, input_bits):
		self.input_bits = input_bits
	
	def generate (self):
		if EXTRA_RANDOM:
			random.seed (os.urandom (32))

		permutation = [random.randrange (2 ** self.input_bits) for i in xrange (2 ** self.input_bits)]

		def model (x):
			return permutation[x]

		return PRF (model, self.input_bits)


# Generates exponential versions of the given PRF generator.
# i.e. y = PRF (PRF (PRF (x)))
class ExponentialPRF (object):
	def __init__ (self, prf, iterations):
		self.prf = prf
		self.iterations = iterations
	
	def generate (self):
		prf = self.prf.generate ()

		def model (x):
			U = x

			for i in xrange (self.iterations):
				U = prf.func (U)

			return U

		return PRF (model, prf.input_bits)


# Generates chained versions of the given PRF generator.
# i.e. y = PRF (x) ^ PRF (PRF (x)) ^ PRF (PRF (PRF (x))) ^ ...
class ChainedPRF (object):
	def __init__ (self, prf, iterations):
		self.prf = prf
		self.iterations = iterations

	def generate (self):
		prf = self.prf.generate ()

		def model (x):
			U = x
			T = 0
			
			for i in xrange (self.iterations):
				U = prf.func (U)
				T = T ^ U

			return T
		
		return PRF (model, prf.input_bits)


# Given a PRF generator, estimates the expected size of the output set.
def estimate_family_order (generator, num_samples=1024):
	total = 0

	for i in xrange (num_samples):
		prf = generator.generate ()
		total += prf.output_order ()
	
	return total / float (num_samples)


# Some tests
for bits in [8, 12, 16, 17, 18, 19, 20, 21, 22, 23, 24]:
	print "Testing %d bits" % bits

	oracle = RandomOracle (bits)
	exponential = ExponentialPRF (oracle, 101)
	chained = ChainedPRF (oracle, 101)

	print "Oracle: ", estimate_family_order (oracle)
	print "Exponential: ", estimate_family_order (exponential)
	print "Chained: ", estimate_family_order (chained)
	print ""

TL;DR: Recursively applying hash functions like SHA-2 insignificantly weakens the function, if one ignores computational complexity. Blah, blah, blah, programming is fun and cryptography is awesome.