killerstorm
Legendary
Offline
Activity: 994


December 28, 2011, 10:58:00 AM 

I'm experimenting with an alternative way to compute SHA256  through tracing boolean functions. Here's the idea: 1. Represent SHA256 as a hugeass DAG (directed acyclic graph) with nodes being boolean operations  AND, OR, NOT, XOR. (One SHA256 block expansion + transform has 121k nodes after some optimization.) 2. Since we go through ~4 billion different nonces it makes sense to assume all other variables constant and optimize some nodes away. I get 86k nodes with 32 variables left. 3. Precompute fringe nodes with few dependencies via binary decision diagrams, normal forms or whatever. Optimize out parts of graph which do not need to be computed with certain configurations of nonce. 4. Now we can compute SHA256 with about as many CPU instructions as there were nodes in DAG. It is possible to compute many combinations at once in parallel, e.g. on CPU which can do 64bit operations in parallel it is possible to compute 64 combinations. I've got some preliminary results, without precomputation it, theoretically, should be about as fast as traditional implementation of SHA256 on CPUs, probably a bit slower due to required load/store operations. With precomputation and other optimizations it might be faster. Or maybe not. I don't know. Anybody wants to discuss? I'm not an expert in discrete math, so maybe some SAT solver or boolean function optimization techniques I don't know would be useful here. Here's discussion in sci.crypt: http://groups.google.com/group/sci.crypt/browse_thread/thread/06ffe40905f725feTheoretical considerations: * traditional implementation makes advantage of native integer addition, it is very expensive to express addition in form of boolean ops * boolean function implementation, however, does not need rotateright  it is optimized away at graph building stage * further, boolean function implementation can take advantage of precomputation (I see some GPU implementations precompute some things, but they cannot go deep). So this might be of interest for implementation on GPUs which do not have ROTR.







Advertised sites are not endorsed by the Bitcoin Forum. They may be unsafe, untrustworthy, or illegal in your jurisdiction. Advertise here.


copumpkin
Donator
Sr. Member
Offline
Activity: 266
I'm actually a pineapple


December 28, 2011, 11:21:53 AM 

It seems like the main benefit from your proposed approach is that you're considering the function over a bunch of (uniformly distributed) inputs simultaneously, so you'd essentially be trying to reduce needless duplication of work across separate calls to the hash function? At first glance, it does seem like it could be a fruitful direction to pursue.




killerstorm
Legendary
Offline
Activity: 994


December 28, 2011, 11:52:27 AM 

Moreorless so. The aim is, indeed, to 'reduce needless duplication of work'. However, it's not possible to process all inputs at once.
Generally speaking, main part of SHA256 is a function of 512 boolean variables. We're going to change only 32 boolean variables to look for certain result, so we can assume other 480 variables fixed. So for a certain block header we can reduce SHA256 to a function of 32 variables instead of 512 ones, optimizing it in process.
Furthermore, as we're using hardware which can do 32, 64, 128 or 256 binary operations at once, we can utilize this by considering many combinations at once. Let's say for a 64bit hardware we can, essentially, vary 6 bits at once. Thus we already have a function of (326) = 26 variables, so we need only 2^26 combinations. Furthermore, it is possible to optimize things a bit for each such combination.
Processing 64 (or more) combinations at once merely makes it competitive with traditional implementation, it does not reduce number of bit ops done by CPU. Number of bin ops should be reduced at a step when we assume certain bits fixed.




kokjo
Legendary
Offline
Activity: 1050
You are WRONG!


December 28, 2011, 11:59:50 AM 

Moreorless so. The aim is, indeed, to 'reduce needless duplication of work'. However, it's not possible to process all inputs at once.
Generally speaking, main part of SHA256 is a function of 512 boolean variables. We're going to change only 32 boolean variables to look for certain result, so we can assume other 480 variables fixed. So for a certain block header we can reduce SHA256 to a function of 32 variables instead of 512 ones, optimizing it in process.
Furthermore, as we're using hardware which can do 32, 64, 128 or 256 binary operations at once, we can utilize this by considering many combinations at once. Let's say for a 64bit hardware we can, essentially, vary 6 bits at once. Thus we already have a function of (326) = 26 variables, so we need only 2^26 combinations. Furthermore, it is possible to optimize things a bit for each such combination.
Processing 64 (or more) combinations at once merely makes it competitive with traditional implementation, it does not reduce number of bit ops done by CPU. Number of bin ops should be reduced at a step when we assume certain bits fixed.
you are forgetting the SHA also is having a internal state...

"The whole problem with the world is that fools and fanatics are always so certain of themselves and wiser people so full of doubts." Bertrand Russell



killerstorm
Legendary
Offline
Activity: 994


December 28, 2011, 12:53:50 PM 

Yep, right. So it is 768 bits in general case. But it doesn't make any difference in this context.




kokjo
Legendary
Offline
Activity: 1050
You are WRONG!


December 28, 2011, 12:57:59 PM 

Yep, right. So it is 768 bits in general case. But it doesn't make any difference in this context.
yes because you need to optimise for every single block, as the staticvariables, will change with every block. also the internal state is larger then 256 bit, and it can be precomputed per block, but it will still change every block. i think it will take longer time to optimise for every block, then it will to just do the plain work.

"The whole problem with the world is that fools and fanatics are always so certain of themselves and wiser people so full of doubts." Bertrand Russell



jetmine
Jr. Member
Offline
Activity: 53


December 29, 2011, 08:13:46 PM 

Anybody wants to discuss? It's an interesting thought, but in my opinion it is doomed to fail. From what I understand (here + usenet) you are near to need "almost as few" resources to calculate one bit, as a traditional method needs to calculate the full result. If this is correct, it matches quite well with what I would expect from following through your idea mentally. The problem is the nonlinearity from the adders. While you can easily bitslice the other logic, or adders alone, you can't do it with adders + logic. As soon as you merge the two domains, for example when you perform a logic function on an adder output, you need to know the actual values of bits. When adders are expressed as logic functions, the fanin is huge. It's natural. If you want to know the MSB of an adder result, you need to know the carry. For the carry, you need to know the lower bits and their carries. As a consequence, the input will cover (almost) all bits. Therefore the logic function for the MSB will need all input bits. A function for the nearMSB will need almost all input bits. Etc. Since SHA2 is cryptografically good, there is not much to optimize away by using logic functions. The dynamic inputs avalanche over the static ones very quickly. A few steps into the algorithm, and almost every bit is somehow influence by an MSB or nearMSB (which means that all input bits are relevant now). You just cannot save much. To calculate one output bit you need to do (almost) as many calculations as a traditional implementation. I think this is where you are now, except that your adder replacements are less efficient in terms of "operation count" (because you've separated the single add operation out as several logic operations) and therefore you get a higher operation count. If I've misunderstood something and you're not really at this point, you should try harder. It must be possible to get there.However, there is another thing to think about. You are calculating just one bit of the result, and will probably (50%) want to calculate more bits. To do that, you will need to process again (almost) all inputs. This is where the real inefficiency starts. Your algorithm is inherently not able to "cache" partial intermediates. You can think of the traditional algorithm with adders as calculating partial results which are then used several times as inputs into the next stages. Your algorithm, on the other hand, has to recalculate everything from scratch, because it didn't separate out what is result of an adder for example, but rather fusioned every single add operation with thousands of other operations into one big logic blob. This is my biggest complain, and makes me think the approach is doomed.  You are missing out on the partial result reuse.
 You can't get significantly below 100% resource usage for a single bit calculation either.
 You would need to get below 50% for it to make sense.
 The cryptographical properties of SHA2 make it impossible to achieve 50%.




Red Emerald


December 30, 2011, 12:38:09 AM 

If possible, what would the benefits of this be?




kokjo
Legendary
Offline
Activity: 1050
You are WRONG!


December 30, 2011, 10:43:56 AM 

If possible, what would the benefits of this be?
almost none. we could calculate sha256 a little bit faster, but only per block. we would need to optimize for every block...

"The whole problem with the world is that fools and fanatics are always so certain of themselves and wiser people so full of doubts." Bertrand Russell



killerstorm
Legendary
Offline
Activity: 994


December 30, 2011, 11:17:18 AM 

From what I understand (here + usenet) you are near to need "almost as few" resources to calculate one bit, as a traditional method needs to calculate the full result. If this is correct, it matches quite well with what I would expect from following through your idea mentally.
I was thinking about this approach initially, but that doesn't make sense as that one bit shares a lot of computations with bits nearby (it is fairly obvious, I just wanted to verify that actual results match intuitive ones and there is no unexpected speedup). Now I'm aiming at computing as many bits as necessary. E.g. 32 or 64 of them. It is true that computing one bit costs about as much as computing 32 of them due to reuse of common dependencies. Computing 64 bits would require somewhat more due to a round structure of SHA256 (see below). The problem is the nonlinearity from the adders. That's right. input will cover (almost) all bits. Therefore the logic function for the MSB will need all input bits. A function for the nearMSB will need almost all input bits. Etc. But half of result bits are pretty far from MSB. They will be mixed eventually, but only after a few rounds. Since SHA2 is cryptografically good, there is not much to optimize away by using logic functions. The dynamic inputs avalanche over the static ones very quickly. A few steps into the algorithm, Many people say that, and initially I had a same impression. But try inspecting SHA2 compression function closer: one round does very little. It modifies only two registers (and shifts the rest), eating only 32 bits of input block at once. Thus when you do SHA256 for bitcoin mining, you don't need to compute first three rounds (of 64) at all. That's already a ~5% speedup. Then fourth round boils to a single addition. Then fifth is pretty simple too, as input bits are constant, so you just need to compute S0 (which boils down to two XORs per bit if you ignore ROTR) and do an addition with constant. And so on. So it only requires all operations by round 8 or so. Likewise, we are only interest in H on 64th round. Which is just G at round 63. Which is F at round 62. Which is E at round 61. Which is D + S1 + Ch + W[60] + K[60] at round 60. So you can chop off last four rounds. Only 57 rounds out of 64 are required, so ~11% of computations go away. Furthermore, a lot of leftover rounds can be simplified out. Furthermore, a lot of block expansion parts can be simplified. Furthermore, some bits in first few meaningful rounds have very few dependencies and can be simplified. and almost every bit is somehow influence by an MSB or nearMSB (which means that all input bits are relevant now). You just cannot save much. It is true that I cannot chop away, like, 99% of computation, but some modest speedup might be possible.




killerstorm
Legendary
Offline
Activity: 994


December 30, 2011, 11:34:34 AM 

If possible, what would the benefits of this be?
Let's say 1020% speedup in an ideal case. almost none. we could calculate sha256 a little bit faster, but only per block. we would need to optimize for every block...
I believe optimization can be done fairly quickly, thus its cost would be negligible. Especially if you use GPU for main computation and CPU for code generation. I would only be concerned with OpenCL/CUDA compiler not being fast enough, but that's just a technical difficulty.




kokjo
Legendary
Offline
Activity: 1050
You are WRONG!


December 30, 2011, 11:55:13 AM 

If possible, what would the benefits of this be?
Let's say 1020% speedup in an ideal case. almost none. we could calculate sha256 a little bit faster, but only per block. we would need to optimize for every block...
I believe optimization can be done fairly quickly, thus its cost would be negligible. Especially if you use GPU for main computation and CPU for code generation. I would only be concerned with OpenCL/CUDA compiler not being fast enough, but that's just a technical difficulty. dream on...

"The whole problem with the world is that fools and fanatics are always so certain of themselves and wiser people so full of doubts." Bertrand Russell



killerstorm
Legendary
Offline
Activity: 994


December 30, 2011, 12:14:47 PM 

For people curious about possibility of optimization of 'fringe' nodes, here are some BDD (binary decision diagram) complexity results: http://paste.lisp.org/display/126772These are results for all subtrees of height 13 in computation DAG after constant propagation and other simple optimizations. Vars is effective number of meaningful variables in BDD after optimization, leafs means number of distinct outcomes. E.g. a fully random boolean expression of 32 vars would have 32 vars and something like 2^32 leafs. Here maximum we have is 32 vars, 19676 leafs, while most are very simple, some even constants (zero variables). Of course, it doesn't mean that SHA256 is broken as registers after all 64 rounds have height of ~3500. At high heights it becomes close to random boolean function, with billions of distinct results. But low heights can be precomputed and optimized out. I don't know how much yet.




jetmine
Jr. Member
Offline
Activity: 53


December 30, 2011, 12:30:10 PM 

so ~11% of computations go away. Offtopic. You're not talking about boolean logic implementation anymore. These optimizations apply to traditional implementations just as well. Your point was to calculate just one output bit using a big and wide input logic function (and then parallelize to make good use of available instruction sets). It is true that I cannot chop away, like, 99% of computation, but some modest speedup might be possible.
A modest speedup is not achievable, only a very tiny one (if any) compared to an equally well optimized traditional implementation. Yet even a modest speedup in the boolean part would not translate into a total algorithm speedup. Imagine that you for example need only 98% for one bit (compared to a traditional algorithm). There is a 50% chance that the bit is 0 and you want to calculate more bits of it. But you've already used 98% of the budget. Now what? Calculate a traditional result (98+100=198%)? Or calculate another single bit (98+98=196%)? It could be 0, too, and further complicate your life. To calculate just a single bit makes only sense if you can achieve a generous speedup, so you can use it as "fast oracle" to decide whether or not to run the "expensive full calculation" on a particular nonce. As said, the cryptographic nature of SHA2 makes it impossible to achieve the generous speedup, nor a modest one. Therefore the whole idea is doomed to fail. I hope I explained it sufficiently clear this time.




killerstorm
Legendary
Offline
Activity: 994


December 30, 2011, 01:07:16 PM 

Offtopic. You're not talking about boolean logic implementation anymore. These optimizations apply to traditional implementations just as well.
Sure, I've just illustrated that SHA256 round's diffusion isn't as good as people think about it. (BTW, even though these optimizations are applicable, I'm not sure that existing miners exploit all of them. Early miners like cpuminer do not seem to take advantage of them at all.) And if it is not so good, there are optimization possibilities on bit level too (which aren't accessible on word level). Your point was to calculate just one output bit using a big and wide input logic function (and then parallelize to make good use of available instruction sets). I think you've missed the main part of my reply  I'm not calculating just one bit of output, I can calculate all bits of output using this method. So your criticism isn't applicable. Let's say H is content of H register on 64th round, and H[0] is first bit of it, H[1] is second and so on. We need to find input such that H[0] or H[1] or ... H[31] = 0
or alternatively:
not (H[0]) and not (H[1]) ... not(H[31]) = 1
That would be true for approximately one in 4 billion inputs. If that's not enough, we can include bits of G into a target function. Or we can compute fewer bits of H if that helps. Or we can express values of H[0] in terms of other expressions and do early rejection on that level. A modest speedup is not achievable, only a very tiny one (if any) compared to an equally well optimized traditional implementation. I don't know yet. So far, I've got pretty good results with BDD, better than I expected. want to calculate more bits of it. But you've already used 98% of the budget. Now what? Calculate a traditional result (98+100=198%)? Or calculate another single bit (98+98=196%)? Calculating another single bit would take only 0.1% because it depends on exactly same inputs as the first bit. Or see above  it is possible to formulate whole mining problem in a form of boolean expression. I hope I explained it sufficiently clear this time. It was actually sufficiently clear the first time too, but for some reason you're missing that it is possible to reuse computations to calculate all bits of output. In that case 2% speedup would be 2% speedup, plain and clear. Right now I see BDDizing fringe nodes as only viable optimization, so I guess we need to wait until I'll BDDize all what is BDDizable before jumping to conclusions. But thanks for looking into this, so far you're the one who paid most attention.




killerstorm
Legendary
Offline
Activity: 994


December 30, 2011, 10:32:20 PM 

Concrete numbers: one SHA256, 32 boolean variables in nonce position, no BDD optimization, just DAG nodest: * 99940 operations to compute last 1 bit of output * 101349 operations to compute last 16 bits of output * 102510 operations to compute last 32 bits of output * 104780 operations for 64 bits (includes OR'ing them). So computing more bits costs more, but not very much. Now let's compare it with a traditional SHA256 implementation, this page: https://en.bitcoin.it/wiki/Why_a_GPU_mines_faster_than_a_CPU#Why_are_AMD_GPUs_faster_than_Nvidia_GPUs.3F gives an idea: "~3250 to execute the SHA256 compression function" (on Nvidia GTX 590). From https://en.bitcoin.it/wiki/Mining_hardware_comparison we get that 1024 ALUs working at 1215MHz make 193 Mhash/sec. 1024*1215/193 = 6446. Mining requires 2 SHA256 has operations, so let's half that: 6446/2=3223. Quite similar to number on wiki page, so I guess that's it. To compute hash for 32 different combinations NVIDIA GPU requires 3223*32 = 103136 cycles. This is same number as required by boolean function implemention, so they should be roughly equal in performance, and perhaps boolean function would be even faster with additional optimizations (although second SHA256 might not optimize as well as the first one). So on the optimization potential, at least 2000 DAG nodes can be reduced to sufficiently simple (< 1000 leafs) BDDs. This number is going to get higher after I remove 5 variables to compute 32 combos in parallel. Then there are 'hybrid' BDDs which combines dynamic nodes with static ones, it is going to penetrate DAG even further. And then I haven't yet tried boolean function optimization and statistical early rejection. So some speedup for NVIDIA GPUs and CPU mining isn't ruled out. AMD hardware is much harder to beat. But I haven't yet looked at what advanced instructions it supports, things like BFI_INT might replace several graph nodes.




kano
Legendary
Online
Activity: 2030
Linux since 1997 RedHat 4


January 01, 2012, 04:17:02 AM 

Wow. I'm surprised. OK this is something I've been looking at for a while and not got very far with it. My main thought about it all this time (over a few months) has been why I've heard no one else even mention it. My first and last comment on the subject unless I complete my work on it some time in the far future Edit: I should add that my work on it is related but follows a different path, with what I hope to be a better chance of performance increase. I was hoping 10x ... but that could be dreaming

Pool: https://kano.is Here on Bitcointalk: Forum BTC: 1 KanoPb8cKYqNrswjaA8cRDk4FAS9eDMLU FreeNode IRC: irc.freenode.net channel #kano.isHelp keep Bitcoin secure by mining on pools with full block verification on all blocks



gmaxwell
Moderator
Legendary
Offline
Activity: 2128


January 02, 2012, 12:05:11 AM 

And then I haven't yet tried boolean function optimization and statistical early rejection.
I tried early statistical early rejection using about a gigahashsecondmonth of shares data (or so) from real miners and wasn't able to find anything useful given intermediate state bits or state bit pairs. Perhaps with your graph techniques you'll find something more complicated which is useful... but I'm doubtful— we have something like 3x the number of rounds of the best attack and the complete construction gets complete diffusion many times over and the later your early out distinguisher has to run the better it has to be in order to be worthwhile.




legitnick


January 02, 2012, 12:20:27 AM 

I think something like this http://hashcat.net/wiki/mask_attack would be more reasonable/realistic. It used to take me 6 hours to brute force a WPA PSK of 8 digits, now I can do it in under 30 minutes.




gmaxwell
Moderator
Legendary
Offline
Activity: 2128


January 02, 2012, 06:39:01 AM 

I think something like this http://hashcat.net/wiki/mask_attack would be more reasonable/realistic. It used to take me 6 hours to brute force a WPA PSK of 8 digits, now I can do it in under 30 minutes. This is already how mining works, (barring surprising weakness in SHA256) the searched keyspace has uniform prior probability... the parts that have known acceptable values are already fixed.




