Bitcoin Forum
 Author Topic: SHA-256 as a boolean function  (Read 10907 times)
killerstorm
Legendary

Offline

Activity: 994

 December 28, 2011, 10:58:00 AM

I'm experimenting with an alternative way to compute SHA-256 -- through tracing boolean functions.

Here's the idea:

1. Represent SHA-256 as a huge-ass DAG (directed acyclic graph) with nodes being boolean operations -- AND, OR, NOT, XOR. (One SHA256 block expansion + transform has 121k nodes after some optimization.)

2. Since we go through ~4 billion different nonces it makes sense to assume all other variables constant and optimize some nodes away. I get 86k nodes with 32 variables left.

3. Pre-compute fringe nodes with few dependencies via binary decision diagrams, normal forms or whatever. Optimize out parts of the graph which do not need to be computed with certain configurations of the nonce.

4. Now we can compute SHA-256 with about as many CPU instructions as there were nodes in the DAG. It is possible to compute many combinations at once in parallel, e.g. on a CPU with 64-bit operations it is possible to compute 64 combinations in parallel.
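As an illustration of steps 1-2, here is a minimal sketch (in Python, with a made-up node representation -- not the actual implementation) of a gate builder that folds constants away while the DAG is being built:

```python
# Hypothetical boolean-DAG node builder with constant folding.
# Node shapes and names are illustrative only.
AND, OR, XOR, NOT, VAR, CONST = 'AND', 'OR', 'XOR', 'NOT', 'VAR', 'CONST'

def const(v):  return (CONST, v)
def var(name): return (VAR, name)

def mk(op, a, b=None):
    """Build a gate node, simplifying whenever an operand is a constant."""
    if op == NOT:
        if a[0] == CONST:
            return const(1 - a[1])
        return (NOT, a)
    ca, cb = a[0] == CONST, b[0] == CONST
    if ca and cb:  # both operands known: evaluate the gate away entirely
        f = {AND: lambda x, y: x & y, OR: lambda x, y: x | y,
             XOR: lambda x, y: x ^ y}[op]
        return const(f(a[1], b[1]))
    if ca or cb:   # one constant operand: the gate simplifies
        c, other = (a, b) if ca else (b, a)
        if op == AND: return other if c[1] else const(0)
        if op == OR:  return const(1) if c[1] else other
        if op == XOR: return mk(NOT, other) if c[1] else other
    return (op, a, b)
```

Fixing the 480 non-nonce header bits means feeding them in as `const(...)` leaves; the folding rules then shrink the graph (e.g. from the 121k nodes mentioned above toward the 86k figure) before any evaluation happens.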

I've got some preliminary results: without precomputation it should, theoretically, be about as fast as a traditional implementation of SHA-256 on CPUs, probably a bit slower due to the required load/store operations. With pre-computation and other optimizations it might be faster. Or maybe not. I don't know.

Anybody want to discuss? I'm not an expert in discrete math, so maybe some SAT solver or boolean function optimization techniques I don't know of would be useful here.

Theoretical considerations:

* the boolean function implementation, however, does not need ROTR (rotate right) -- it is optimized away at the graph-building stage
* further, the boolean function implementation can take advantage of pre-computation (I see some GPU implementations pre-compute some things, but they cannot go deep).

So this might be of interest for implementation on GPUs which do not have ROTR.

colored coins proof-of-concept: private currencies, stock/bond p2p exchange

Tips and donations: 16v13Fa9cPmfFzpm9mmbWwAkXY4gyY6uh4
copumpkin
Donator
Sr. Member

Offline

Activity: 266

I'm actually a pineapple

 December 28, 2011, 11:21:53 AM

It seems like the main benefit from your proposed approach is that you're considering the function over a bunch of (uniformly distributed) inputs simultaneously, so you'd essentially be trying to reduce needless duplication of work across separate calls to the hash function? At first glance, it does seem like it could be a fruitful direction to pursue.
killerstorm
Legendary

Offline

Activity: 994

 December 28, 2011, 11:52:27 AM

More or less so. The aim is, indeed, to 'reduce needless duplication of work'. However, it's not possible to process all inputs at once.

Generally speaking, the main part of SHA-256 is a function of 512 boolean variables. We're going to change only 32 of those variables while looking for a certain result, so we can assume the other 480 are fixed. So for a given block header we can reduce SHA-256 to a function of 32 variables instead of 512, optimizing it in the process.

Furthermore, since we're using hardware which can do 32, 64, 128 or 256 binary operations at once, we can exploit this by considering many combinations at once. Say, on 64-bit hardware we can essentially vary 6 bits at once. Thus we effectively have a function of 32 - 6 = 26 variables, so we only need to enumerate 2^26 combinations. Furthermore, it is possible to optimize things a bit for each such combination.

Processing 64 (or more) combinations at once merely makes this competitive with a traditional implementation; it does not reduce the number of bit operations done by the CPU. The number of bit operations must be reduced at the step where we assume certain bits are fixed.
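The 'vary 6 bits at once on 64-bit hardware' trick can be sketched like this (a hypothetical bitsliced evaluation; the names are illustrative):

```python
# 64 nonce combinations evaluated at once: bit i of each 64-bit word
# holds the value for combination (lane) i.
MASK64 = (1 << 64) - 1

# The six varied nonce bits take fixed "counter" patterns across lanes:
# bit i of SLICE[j] equals bit j of the lane index i.
SLICE = [0xAAAAAAAAAAAAAAAA,
         0xCCCCCCCCCCCCCCCC,
         0xF0F0F0F0F0F0F0F0,
         0xFF00FF00FF00FF00,
         0xFFFF0000FFFF0000,
         0xFFFFFFFF00000000]

def NOT(x):    return ~x & MASK64
def XOR(x, y): return x ^ y

# Example DAG node: f(b0, b1) = b0 XOR NOT(b1), for all 64 lanes at once.
word = XOR(SLICE[0], NOT(SLICE[1]))
for i in range(64):
    b0, b1 = (i >> 0) & 1, (i >> 1) & 1
    assert (word >> i) & 1 == b0 ^ (1 - b1)
```

Every AND/OR/XOR/NOT node of the DAG becomes one word operation, which is exactly why the per-combination cost matches the node count.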

kokjo
Legendary

Offline

Activity: 1050

You are WRONG!

 December 28, 2011, 11:59:50 AM

Quote from: killerstorm
The aim is, indeed, to 'reduce needless duplication of work'. [...]

you are forgetting that SHA also has an internal state...

"The whole problem with the world is that fools and fanatics are always so certain of themselves and wiser people so full of doubts." -Bertrand Russell
killerstorm
Legendary

Offline

Activity: 994

 December 28, 2011, 12:53:50 PM

Yep, right. So it is 768 bits in the general case. But it doesn't make any difference in this context.

kokjo
Legendary

Offline

Activity: 1050

You are WRONG!

 December 28, 2011, 12:57:59 PM

Quote from: killerstorm
Yep, right. So it is 768 bits in the general case. But it doesn't make any difference in this context.
Yes, because you need to optimize for every single block: the static variables will change with every block.
Also, the internal state is larger than 256 bits; it can be precomputed per block, but it will still change every block.

I think it will take longer to optimize for every block than it would to just do the plain work.

jetmine
Jr. Member

Offline

Activity: 53

 December 29, 2011, 08:13:46 PM

Quote from: killerstorm
Anybody want to discuss?

It's an interesting thought, but in my opinion it is doomed to fail.

From what I understand (here + usenet), you need almost as many resources to calculate one bit as a traditional method needs to calculate the full result. If this is correct, it matches quite well with what I would expect from following through your idea mentally.

The problem is the non-linearity from the adders. While you can easily bitslice the other logic, or the adders alone, you can't do it with adders + logic combined. As soon as you merge the two domains, for example when you perform a logic function on an adder output, you need to know the actual values of the bits.

When adders are expressed as logic functions, the fan-in is huge. It's natural: if you want to know the MSB of an adder result, you need to know the carry, and for the carry, you need to know the lower bits and their carries. As a consequence, the inputs cover (almost) all bits. Therefore the logic function for the MSB needs all input bits, a function for the near-MSB needs almost all input bits, etc.
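The carry-chain point can be made concrete with a small sketch that tracks which input bits each sum bit of a ripple-carry adder depends on (illustrative only):

```python
def adder_fanin(width=32):
    """Return, for each sum bit of a ripple-carry adder a+b, the set of
    input bits it depends on (sum_i = a_i XOR b_i XOR carry_in_i)."""
    deps = []
    carry = set()                        # carry into bit 0: nothing
    for i in range(width):
        bit_inputs = {('a', i), ('b', i)}
        deps.append(bit_inputs | carry)  # sum bit i sees its inputs + carry
        carry = bit_inputs | carry       # carry ripples every lower bit up
    return deps

deps = adder_fanin(32)
assert len(deps[0]) == 2    # LSB depends on just a_0, b_0
assert len(deps[31]) == 64  # MSB depends on every bit of both operands
```

So the MSB of a single 32-bit addition, written as one logic function, already has fan-in 64 -- which is the source of the huge fan-ins described above.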

Since SHA2 is cryptographically good, there is not much to optimize away by using logic functions. The dynamic inputs avalanche over the static ones very quickly. A few steps into the algorithm, almost every bit is somehow influenced by an MSB or near-MSB (which means that all input bits are relevant). You just cannot save much.

To calculate one output bit you need to do (almost) as many calculations as a traditional implementation. I think this is where you are now, except that your adder replacements are less efficient in terms of operation count (because you've separated the single add operation out into several logic operations), and therefore you get a higher operation count.

If I've misunderstood something and you're not really at this point, you should try harder.  It must be possible to get there.

However, there is another thing to think about. You are calculating just one bit of the result, and will probably (with 50% odds) want to calculate more bits. To do that, you will need to process (almost) all inputs again. This is where the real inefficiency starts: your algorithm is inherently unable to "cache" partial intermediates. You can think of the traditional algorithm with adders as calculating partial results which are then used several times as inputs into the next stages.

Your algorithm, on the other hand, has to re-calculate everything from scratch, because it didn't separate out what is, for example, the result of an adder, but rather fused every single add operation with thousands of other operations into one big logic blob.

This is my biggest complaint, and makes me think the approach is doomed.

• You are missing out on the partial result reuse.
• You can't get significantly below 100% resource usage for a single bit calculation either.
• You would need to get below 50% for it to make sense.
• The cryptographical properties of SHA2 make it impossible to achieve 50%.
Red Emerald
Hero Member

Offline

Activity: 742

 December 30, 2011, 12:38:09 AM

If possible, what would the benefits of this be?

kokjo
Legendary

Offline

Activity: 1050

You are WRONG!

 December 30, 2011, 10:43:56 AM

Quote from: Red Emerald
If possible, what would the benefits of this be?

Almost none. We could calculate SHA-256 a little faster, but only per block; we would need to re-optimize for every block...

killerstorm
Legendary

Offline

Activity: 994

 December 30, 2011, 11:17:18 AM

Quote from: jetmine
From what I understand (here + usenet), you need almost as many resources to calculate one bit as a traditional method needs to calculate the full result. If this is correct, it matches quite well with what I would expect from following through your idea mentally.

I was thinking about this approach initially, but it doesn't make sense, because that one bit shares a lot of computations with nearby bits (this is fairly obvious; I just wanted to verify that actual results match intuition and there is no unexpected speedup). Now I'm aiming at computing as many bits as necessary, e.g. 32 or 64 of them. It is true that computing one bit costs about as much as computing 32 of them, due to re-use of common dependencies. Computing 64 bits would require somewhat more, due to the round structure of SHA-256 (see below).

Quote from: jetmine
The problem is the non-linearity from the adders.

That's right.

Quote from: jetmine
input will cover (almost) all bits.  Therefore the logic function for the MSB will need all input bits.  A function for the near-MSB will need almost all input bits. Etc.

But half of result bits are pretty far from MSB. They will be mixed eventually, but only after a few rounds.

Quote from: jetmine
Since SHA2 is cryptografically good, there is not much to optimize away by using logic functions.  The dynamic inputs avalanche over the static ones very quickly.  A few steps into the algorithm,

Many people say that, and initially I had the same impression. But try inspecting the SHA2 compression function closer: one round does very little. It modifies only two registers (and shifts the rest), consuming only 32 bits of the input block at a time.

Thus when you do SHA-256 for Bitcoin mining, you don't need to compute the first three rounds (of 64) at all. That's already a ~5% speedup. Then the fourth round boils down to a single addition. The fifth is pretty simple too, as its input bits are constant, so you just need to compute S0 (which boils down to two XORs per bit if you ignore ROTR) and do an addition with a constant. And so on; all operations are only required by round 8 or so.

Likewise, we are only interested in H at the 64th round. Which is just G at round 63. Which is F at round 62. Which is E at round 61. Which is D + S1 + Ch + W[60] + K[60] at round 60.

So you can chop off the last four rounds. Only 57 rounds out of 64 are required, so ~11% of the computation goes away. Furthermore, a lot of the leftover rounds can be simplified, a lot of the block-expansion parts can be simplified, and some bits in the first few meaningful rounds have very few dependencies and can be simplified as well.
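For reference, here is one round of the standard SHA-256 compression function in Python, which makes the shift structure visible (this is the textbook definition, not killerstorm's DAG code):

```python
# One SHA-256 compression round (FIPS 180-4 definition). Note that only
# two genuinely new values are computed (new a and new e); the other six
# registers just shift down.
def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

def sha256_round(state, k, w):
    a, b, c, d, e, f, g, h = state
    S1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & g)
    t1 = (h + S1 + ch + k + w) & 0xFFFFFFFF
    S0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t2 = (S0 + maj) & 0xFFFFFFFF
    return ((t1 + t2) & 0xFFFFFFFF,  # new a
            a, b, c,                 # b, c, d merely shift
            (d + t1) & 0xFFFFFFFF,   # new e
            e, f, g)                 # f, g, h merely shift
```

Since the round only consumes one message word `w` per step, rounds whose `w` doesn't involve the nonce can be precomputed for a given header, and the shift structure is what lets the last rounds be peeled off when only H is tested.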

Quote from: jetmine
and almost every bit is somehow influence by an MSB or near-MSB (which means that all input bits are relevant now).  You just cannot save much.

It is true that I cannot chop away, like, 99% of computation, but some modest speedup might be possible.

killerstorm
Legendary

Offline

Activity: 994

 December 30, 2011, 11:34:34 AM

Quote from: Red Emerald
If possible, what would the benefits of this be?

Let's say 10-20% speedup in an ideal case.

Quote from: kokjo
Almost none. We could calculate SHA-256 a little faster, but only per block; we would need to re-optimize for every block...

I believe the optimization can be done fairly quickly, so its cost would be negligible -- especially if you use the GPU for the main computation and the CPU for code generation. My only concern is the OpenCL/CUDA compiler not being fast enough, but that's just a technical difficulty.

kokjo
Legendary

Offline

Activity: 1050

You are WRONG!

 December 30, 2011, 11:55:13 AM

Quote from: killerstorm
I believe optimization can be done fairly quickly, thus its cost would be negligible. [...]
dream on...

killerstorm
Legendary

Offline

Activity: 994

 December 30, 2011, 12:14:47 PM

For people curious about the possibility of optimizing 'fringe' nodes, here are some BDD (binary decision diagram) complexity results: http://paste.lisp.org/display/126772

These are results for all subtrees of height 13 in the computation DAG after constant propagation and other simple optimizations. 'Vars' is the effective number of meaningful variables in the BDD after optimization; 'leaves' is the number of distinct outcomes.

E.g. a fully random boolean expression of 32 vars would have 32 vars and something like 2^32 leaves.

The maximum here is 32 vars and 19676 leaves, while most are very simple -- some even constants (zero variables).

Of course, this doesn't mean that SHA-256 is broken: the registers after all 64 rounds have a height of ~3500. At large heights it becomes close to a random boolean function, with billions of distinct results. But low heights can be pre-computed and optimized out. I don't know how much yet.
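The thread doesn't show how 'vars' and 'leaves' were measured. As a rough illustration, subfunction complexity can be gauged by building a reduced decision diagram over a truth table and counting its internal nodes (a toy stand-in for the tool used here, not the actual code):

```python
def bdd_size(truth, nvars):
    """Count internal nodes of a reduced ordered decision diagram.

    truth maps every nvars-tuple of 0/1 inputs to the function's value.
    0 nodes = constant function; more nodes = more complex subfunction.
    """
    nodes = {}
    def build(assign):
        if len(assign) == nvars:
            return truth[assign]
        lo = build(assign + (0,))
        hi = build(assign + (1,))
        if lo == hi:                 # this variable is irrelevant here
            return lo
        key = (len(assign), lo, hi)  # share identical sub-diagrams
        return nodes.setdefault(key, key)
    build(())
    return len(nodes)
```

On this toy measure, a constant subfunction scores 0, a single variable scores 1, and the XOR of two variables scores 3 -- mirroring how most fringe subtrees above turned out to be far simpler than a random 32-variable function.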

jetmine
Jr. Member

Offline

Activity: 53

 December 30, 2011, 12:30:10 PM

Quote from: killerstorm
so ~11% of the computation goes away.

Off-topic. You're not talking about a boolean logic implementation anymore; these optimizations apply to traditional implementations just as well.

Your point was to calculate just one output bit using a big and wide input logic function (and then parallelize to make good use of available instruction sets).

Quote from: killerstorm
It is true that I cannot chop away, like, 99% of computation, but some modest speedup might be possible.

A modest speedup is not achievable -- only a very tiny one (if any) compared to an equally well optimized traditional implementation. And even a modest speedup in the boolean part would not translate into a total algorithm speedup. Imagine that you need, for example, only 98% for one bit (compared to a traditional algorithm). There is a 50% chance that the bit is 0 and you want to calculate more bits. But you've already used 98% of the budget. Now what? Calculate a traditional result (98+100=198%)? Or calculate another single bit (98+98=196%)? It could be 0 too, and further complicate your life.

Calculating just a single bit only makes sense if you can achieve a generous speedup, so you can use it as a "fast oracle" to decide whether or not to run the "expensive full calculation" on a particular nonce.

As said, the cryptographic nature of SHA2 makes it impossible to achieve a generous speedup, or even a modest one. Therefore the whole idea is doomed to fail.

I hope I explained it sufficiently clear this time.
killerstorm
Legendary

Offline

Activity: 994

 December 30, 2011, 01:07:16 PM

Quote from: jetmine
Off-topic. You're not talking about a boolean logic implementation anymore; these optimizations apply to traditional implementations just as well.

Sure, I've just illustrated that the diffusion of an SHA-256 round isn't as good as people think. (BTW, even though these optimizations are applicable, I'm not sure that existing miners exploit all of them. Early miners like cpuminer do not seem to take advantage of them at all.)

And if it is not that good, there are optimization possibilities on the bit level too (which aren't accessible on the word level).

Quote from: jetmine
Your point was to calculate just one output bit using a big and wide input logic function (and then parallelize to make good use of available instruction sets).

I think you've missed the main part of my reply: I'm not calculating just one bit of output, I can calculate all bits of output using this method. So your criticism isn't applicable.

Let's say H is the content of the H register at the 64th round, and H[0] is its first bit, H[1] the second, and so on.

We need to find input such that

Code:
H[0] or H[1] or ... or H[31] = 0

or alternatively:

not(H[0]) and not(H[1]) and ... and not(H[31]) = 1

That would be true for approximately one in 4 billion inputs. If that's not enough, we can include bits of G in the target function. Or we can compute fewer bits of H if that helps. Or we can express the value of H[0] in terms of other expressions and do early rejection at that level.
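In a bitsliced setting, that target function is just a handful of word operations. A hypothetical sketch (the `H_bits` list and lane layout are assumptions, not the actual code):

```python
# H_bits: 32 sixty-four-bit words, one per output bit of H; lane i of
# each word belongs to nonce combination i.
MASK64 = (1 << 64) - 1

def winning_lanes(H_bits):
    """Bit i of the result is set iff all given H bits are 0 in lane i."""
    any_set = 0
    for hb in H_bits:          # H[0] or H[1] or ... or H[31], per lane
        any_set |= hb
    return ~any_set & MASK64   # not(H[0]) and ... and not(H[31]), per lane

# If some H bit is set in every lane except lane 5, only lane 5 survives:
H = [0] * 32
H[7] = MASK64 & ~(1 << 5)     # H[7] is 1 in every lane except lane 5
assert winning_lanes(H) == 1 << 5
```

This is why computing "one more bit" is nearly free here: all 32 OR terms share the same upstream DAG, and the final reduction is 32 word-ORs plus a NOT.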

Quote from: jetmine
A modest speedup is not achievable, only a very tiny one (if any) compared to an equally well optimized traditional implementation.

I don't know yet. So far I've got pretty good results with BDDs, better than I expected.

Quote from: jetmine
want to calculate more bits of it.  But you've already used 98% of the budget.  Now what?  Calculate a traditional result (98+100=198%)?  Or calculate another single bit (98+98=196%)?

Calculating another single bit would take only ~0.1%, because it depends on exactly the same inputs as the first bit. Or see above -- it is possible to formulate the whole mining problem as a single boolean expression.

Quote from: jetmine
I hope I explained it sufficiently clear this time.

It was actually sufficiently clear the first time too, but for some reason you're missing that it is possible to reuse computations to calculate all bits of the output.

In that case 2% speedup would be 2% speedup, plain and clear.

Right now I see BDDizing fringe nodes as the only viable optimization, so I guess we need to wait until I've BDDized everything that is BDDizable before jumping to conclusions.

But thanks for looking into this -- so far you're the one who has paid the most attention.

killerstorm
Legendary

Offline

Activity: 994

 December 30, 2011, 10:32:20 PM

Concrete numbers: one SHA-256, 32 boolean variables in the nonce position, no BDD optimization, just DAG nodes:

* 99940  operations to compute last 1 bit of output
* 101349 operations to compute last 16 bits of output
* 102510 operations to compute last 32 bits of output
* 104780 operations for 64 bits

(includes OR'ing them).

So computing more bits costs more, but not very much.

Now let's compare it with a traditional SHA-256 implementation. This page: https://en.bitcoin.it/wiki/Why_a_GPU_mines_faster_than_a_CPU#Why_are_AMD_GPUs_faster_than_Nvidia_GPUs.3F gives an idea: "~3250 to execute the SHA-256 compression function" (on an Nvidia GTX 590).

From https://en.bitcoin.it/wiki/Mining_hardware_comparison we get that 1024 ALUs working at 1215 MHz make 193 Mhash/sec. 1024*1215/193 = 6446. Mining requires 2 SHA-256 hash operations, so let's halve that: 6446/2 = 3223. Quite similar to the number on the wiki page, so I guess that's it.

To compute the hash for 32 different combinations, an NVIDIA GPU requires 3223*32 = 103136 cycles.

This is the same number as required by the boolean function implementation, so they should be roughly equal in performance, and perhaps the boolean function would be even faster with additional optimizations (although the second SHA-256 might not optimize as well as the first one).
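The back-of-envelope numbers above can be checked directly (Python as a calculator):

```python
# ALUs * clock (MHz) / hashrate (Mhash/s) = cycles per double SHA-256
cycles_double = 1024 * 1215 // 193   # 6446
cycles_single = cycles_double // 2   # 3223, close to the wiki's ~3250
cycles_32 = cycles_single * 32       # 103136 cycles for 32 nonce combos
print(cycles_double, cycles_single, cycles_32)
```

So the ~102.5k DAG operations for 32 combinations land almost exactly on the 103136-cycle figure for a traditional GPU implementation.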

As for the optimization potential: at least 2000 DAG nodes can be reduced to sufficiently simple (< 1000 leaves) BDDs. This number will get higher after I remove 5 variables to compute 32 combos in parallel.

Then there are 'hybrid' BDDs which combine dynamic nodes with static ones; they will penetrate the DAG even further.

And then I haven't yet tried boolean function optimization and statistical early rejection.

So some speedup for NVIDIA GPUs and CPU mining isn't ruled out. AMD hardware is much harder to beat, but I haven't yet looked at which advanced instructions it supports; things like BFI_INT might replace several graph nodes.

kano
Legendary

Offline

Activity: 2240

Linux since 1997 RedHat 4

 January 01, 2012, 04:17:02 AM

Wow.
I'm surprised.
OK, this is something I've been looking at for a while without getting very far.
My main thought about it all this time (over a few months) has been: why has no one else even mentioned it?
My first and last comment on the subject, unless I complete my work on it some time in the far future.
Edit: I should add that my work on it is related but follows a different path, with what I hope is a better chance of a performance increase.
I was hoping for 10x... but that could be dreaming.

Pool: https://kano.is Here on Bitcointalk: Forum BTC: 1KanoPb8cKYqNrswjaA8cRDk4FAS9eDMLU
FreeNode IRC: irc.freenode.net channel #kano.is Majority developer of the ckpool code
Help keep Bitcoin secure by mining on pools with full block verification on all blocks - and NO empty blocks!
gmaxwell
Moderator
Legendary

Offline

Activity: 2324

 January 02, 2012, 12:05:11 AM

Quote from: killerstorm
And then I haven't yet tried boolean function optimization and statistical early rejection.

I tried statistical early rejection using about a giga-hash-second-month of share data (or so) from real miners, and wasn't able to find anything useful given intermediate state bits or state-bit pairs. Perhaps with your graph techniques you'll find something more complicated which is useful...

but I'm doubtful -- we have something like 3x the number of rounds of the best attack, the complete construction achieves full diffusion many times over, and the later your early-out distinguisher has to run, the better it has to be in order to be worthwhile.

Bitcoin will not be compromised
legitnick
Hero Member

Offline

Activity: 532

 January 02, 2012, 12:20:27 AM

I think something like this http://hashcat.net/wiki/mask_attack would be more reasonable/realistic. It used to take me 6 hours to brute-force a WPA PSK of 8 digits; now I can do it in under 30 minutes.

5 BITCOIN RAFFLE GIVEAWAY
"I dont lift" - Lord Furrycoat
gmaxwell
Moderator
Legendary

Offline

Activity: 2324

 January 02, 2012, 06:39:01 AM

Quote from: legitnick
I think something like this http://hashcat.net/wiki/mask_attack would be more reasonable/realistic. It used to take me 6 hours to brute-force a WPA PSK of 8 digits; now I can do it in under 30 minutes.

This is already how mining works: (barring a surprising weakness in SHA-256) the searched keyspace has uniform prior probability, and the parts that have known acceptable values are already fixed.
