Inaba


February 11, 2012, 03:28:45 PM 

I'm not trying to discourage you from anything, Wondermine; Maybe you will find something that's been missed, somewhere, but it seems unlikely is all I'm saying. Maybe getting a fully working implementation then refining is a good course of action, but then again maybe genius needs to take a different path, and who am I do argue. It's all beyond my abilities anyway! I may purchase a couple DE0's just to see how it goes.

If you're searching these lines for a point, you've probably missed it. There was never anything there in the first place.




Advertised sites are not endorsed by the Bitcoin Forum. They may be unsafe, untrustworthy, or illegal in your jurisdiction. Advertise here.

wondermine
Jr. Member
Offline
Activity: 57


February 11, 2012, 03:51:14 PM 

wondermine, I wish you the best. I really do. However, please take a look at the SHA256 algorithm. http://en.wikipedia.org/wiki/Sha256The 32 bit values b, c, d, f, g, and h are trivially derived from the previous round, i.e. copied from a, b, c, e, f, and g, respectively. The 32 bit value e is derived from the previous round's d, h, e, f, and g (i.e. 5/8th of the previous round's 256 bits are used to derive it). The 32 bit value a is derived from the previous round's h, e, f, g, a, b, and c  i.e. 7/8th of the previous round's 256 bits are used to derive it. Now think this through over just one more round. Only four 32 bit values are trivially derived from their "grandfather" round. The other four 32 bit values are derived from brutal mixing of almost all bits of the grandfather round. And so on. After just 4 rounds, a single bit change in the greatgreatgrandfather round influences ALL bits of the current round. Thus, any notion of shaving more than 4 or 5 rounds off the 128 total rounds is a pipe dream. In other words, speeding up an implementation of SHA256 cannot be done by mathematical tricks. Rather, the operations of each round should be optimized. There is no real reason why the clock is a measly 200 MHz (and thus the clock cycle 5 ns) in the best currently available implementations, such as the ZTEX implementation. Think about it: 5 ns, that is a delay straight from the 70s. A TTL technologylike delay. Certainly we can do better than that?!? Analyzing the operations for their contribution to the delay yields: rightrotate ... instant, no delay at all xor ... minor delay, bit by bit, probably a few dozen picoseconds and ... minor delay, bit by bit, probably a few dozen picoseconds not ... minor delay, bit by bit, probably a few dozen picoseconds + ... this should be scrutinized. A 32 bit add operation can be quite costly and the fastest possible implementation should be pursued. Adding insult to injury, SHA256 features not just binary or ternary adds, but 4fold adds (in the t1 function) and 5fold adds (e := d+t1) and 6fold adds (a := t1 + t2). So, there you go. The biggest detriment to performance is probably the 6fold 32 bit wide add in a := t1 + t2. If you can speed this operation up, maybe by precomputing partial results in the PREVIOUS round, then bringing them to the table when needed, the entire SHA256 will be sped up (assuming optimal placing and routing). I'm very familiar with the SHA256 algorithm and understand its complexity. And I don't mean from reading Wikipedia. What you fail to capture here is that I'm not looking at this to know exact values to avoid... it's a matter of probability. Why does SHA256 use 64 opaque rounds? Precisely so something like this does not happen. It might be a waste of time and resources if checking a value wasn't so resourcefriendly, but it's not. As far as "more standard" optimizations, I already mentioned I would do that. And I know adding is the biggest resource hog here. I'm looking into the best ways to do that, precalculation, DSP chips on custom boards (to offload logic that does not require programming and other benefits)... I may be a student but I'm not exactly lacking in mathematical or engineering understanding. The 200MHz issue... Actually there is a reason we're down in the "70s" range of timing. Optimizing for high clockability is a huge challenge. It is a problem, also something I'm already looking into. I'm looking into quadpumping and other technologies that major manufacturers use. I'm going to take it from your knowledge of some of this that you understand how hard it is to come by clean clocking, and then making sure that doesn't become unstable. Have you looked at commercial IP for SHA or other block ciphers? They don't run much higher than 200MHz stably, and they cost thousands of dollars and have had many engineers optimize the hell out of them. So all that to say, yes I know what routes of action to take. I also won't say I'm looking into something unless I believe it's feasible and have done adequate research.




runeks


February 12, 2012, 04:42:54 AM 

^ Just so we're clear, wouldn't a discovery like that be a major one, essentially  to a certain extent  compromising SHA256? Wouldn't it mean that it doesn't qualify as a cryptographic hash function? And perhaps that Bitcoin (and whatever else uses SHA256) should migrate to a new and safer hash function?
Wouldn't this be the kind of stuff someone might do as a PhD in cryptology, and you're doing it as a sideproject for the mining FPGA you're developing in your spare time while studying. :=)
Nice work by the way! I'm going to send you a couple BTCs.




Inspector 2211


February 12, 2012, 05:50:25 AM 

wondermine, I wish you the best. I really do. However, please take a look at the SHA256 algorithm. http://en.wikipedia.org/wiki/Sha256The 32 bit values b, c, d, f, g, and h are trivially derived from the previous round, i.e. copied from a, b, c, e, f, and g, respectively. The 32 bit value e is derived from the previous round's d, h, e, f, and g (i.e. 5/8th of the previous round's 256 bits are used to derive it). The 32 bit value a is derived from the previous round's h, e, f, g, a, b, and c  i.e. 7/8th of the previous round's 256 bits are used to derive it. Now think this through over just one more round. Only four 32 bit values are trivially derived from their "grandfather" round. The other four 32 bit values are derived from brutal mixing of almost all bits of the grandfather round. And so on. After just 4 rounds, a single bit change in the greatgreatgrandfather round influences ALL bits of the current round. Thus, any notion of shaving more than 4 or 5 rounds off the 128 total rounds is a pipe dream. In other words, speeding up an implementation of SHA256 cannot be done by mathematical tricks. Rather, the operations of each round should be optimized. There is no real reason why the clock is a measly 200 MHz (and thus the clock cycle 5 ns) in the best currently available implementations, such as the ZTEX implementation. Think about it: 5 ns, that is a delay straight from the 70s. A TTL technologylike delay. Certainly we can do better than that?!? Analyzing the operations for their contribution to the delay yields: rightrotate ... instant, no delay at all xor ... minor delay, bit by bit, probably a few dozen picoseconds and ... minor delay, bit by bit, probably a few dozen picoseconds not ... minor delay, bit by bit, probably a few dozen picoseconds + ... this should be scrutinized. A 32 bit add operation can be quite costly and the fastest possible implementation should be pursued. Adding insult to injury, SHA256 features not just binary or ternary adds, but 4fold adds (in the t1 function) and 5fold adds (e := d+t1) and 6fold adds (a := t1 + t2). So, there you go. The biggest detriment to performance is probably the 6fold 32 bit wide add in a := t1 + t2. If you can speed this operation up, maybe by precomputing partial results in the PREVIOUS round, then bringing them to the table when needed, the entire SHA256 will be sped up (assuming optimal placing and routing). I'm very familiar with the SHA256 algorithm and understand its complexity. And I don't mean from reading Wikipedia. What you fail to capture here is that I'm not looking at this to know exact values to avoid... it's a matter of probability. Why does SHA256 use 64 opaque rounds? Precisely so something like this does not happen. It might be a waste of time and resources if checking a value wasn't so resourcefriendly, but it's not. As far as "more standard" optimizations, I already mentioned I would do that. And I know adding is the biggest resource hog here. I'm looking into the best ways to do that, precalculation, DSP chips on custom boards (to offload logic that does not require programming and other benefits)... I may be a student but I'm not exactly lacking in mathematical or engineering understanding. The 200MHz issue... Actually there is a reason we're down in the "70s" range of timing. Optimizing for high clockability is a huge challenge. It is a problem, also something I'm already looking into. I'm looking into quadpumping and other technologies that major manufacturers use. I'm going to take it from your knowledge of some of this that you understand how hard it is to come by clean clocking, and then making sure that doesn't become unstable. Have you looked at commercial IP for SHA or other block ciphers? They don't run much higher than 200MHz stably, and they cost thousands of dollars and have had many engineers optimize the hell out of them. So all that to say, yes I know what routes of action to take. I also won't say I'm looking into something unless I believe it's feasible and have done adequate research. >to offload logic You can't offload logic. Not enough IO pins on an FPGA. There are seven additions per round, 125 rounds  do you want to connect 875 external adders? These 875 external adders would need 96 pins (64 inputs, 32 outputs) connected to the FPGA each, for a grand total of 84,000 pins...

KICKICO   ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █     ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █        ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █   



wondermine
Jr. Member
Offline
Activity: 57


February 12, 2012, 08:38:42 AM 

[deleted, argumentative]




Defkin


February 12, 2012, 09:41:34 AM 

interested




nelisky
Legendary
Offline
Activity: 1554


February 12, 2012, 11:05:12 AM 

Are you familiar with the term "SerDes"?
That reminds me that old internet slang word: "modem"




wondermine
Jr. Member
Offline
Activity: 57


February 12, 2012, 11:54:41 AM 

Alright everyone, I've put up a forum on my website dedicated to Nanominer. There's a whole bunch of new information as well as better ways to keep track of the project there, so head on over to http://nonverba.org/forum/ and check it out!




nelisky
Legendary
Offline
Activity: 1554


February 12, 2012, 12:00:24 PM 

I'm a single forum type of person, but I will try to keep that one in sight too.
Also, I can get you a VPS as a donation, send me an email (my address is in my profile) with what you need and we'll see what I can manage.




Inspector 2211


February 12, 2012, 03:50:08 PM 

Are you familiar with the term "SerDes"?
Yes, as in "device that iS not suitablE for sending 64 bits to an addeR anD rEceiving 32 bitS back, all within 1 or 2 ns total"  SERDES. You'd be shooting for 100 Gbit/s SerDes units  possible in ASICs now (if expensive), not yet available in FPGAs. Furthermore, FPGAs typically have only about half a dozen or a dozen or maybe two dozens of SerDes units  not 875.

KICKICO   ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █     ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █        ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █   



finway


February 12, 2012, 06:11:19 PM 

Interesting!




wondermine
Jr. Member
Offline
Activity: 57


February 12, 2012, 08:22:35 PM 

The start of a repository of resources, information, and eventually tutorials has been started here: http://nonverba.org/forum/viewtopic.php?f=8&t=6Check it out, sign up, post, get some discussion going. There's been a fair number of people mentioning they'd like to get into FPGA programming, so let's do it.




wondermine
Jr. Member
Offline
Activity: 57


February 13, 2012, 03:49:40 AM 

From my forum at http://www.nonverba.org/forum:So I've been talking to a number of people about the possibility of using sidechannel vulnerabilities in the bitcoin mining process to speed up mining. It's all been vague until now, so I'm going to give a very tangible example. Assume we roll up the algorithm: 2x SHA2 = 128 clocks/mining operation. Now, the final addition we do is to add the round constant, something that will always be the same. In this specific situation it's 0x5be0cd19. Now, before we commit our 128th clock cycle to adding that to our previous 32bit "h" value, we can add very resource friendly code to determine whether our "h" value is too high to yield a winning number. We can say that if "h" is greater than 0xa41f32e6, it cannot yield a winning digest and we should not waste that clock cycle. Great, that was easy, and isn't nearly the magnitude of problem I'm working on, but it's here to get the idea across. This one is a completely predictable example, the true optimization will come from probabilityderived solutions earlier. Anyways... By eating a couple of logic elements and little block ram, we can implement this check. Now, how does that all translate into performance gain? Well, 64.11% of hashes fall into this range. That means that 64.11% of the time, we can save one clock cycle on a core, which is 1/128 or a 0.78125% speed increase. 0.6411 * 0.0078125 = 0.005% increase in performance. Pulling, say, 500MH/s, that's 2.50MH/s performance enhancement over time. (I'm pretty sure this math is accurate, but if it's not perfect, it's at least close, these are the ranges we're talking about.) It doesn't maybe sound like a lot, but think about it, we just pulled 2.5MH/s out of thin air. Or rather, out of unused block ram. It didn't cost us time, or power, or logic. The earlier these constants can be checked out, the less their probability of success, but the higher their effect on performance. In case this was tl;dr material: We just pulled 2.5MH/s out of thin air and can do it repeatedly.There are a lot of these optimizations that could fit on even a Cyclone series FPGA. The only thing standing between us and them is time and mathematical analysis. Hopefully this gives you a clearer picture of what I'm talking about, and lets you know that it's actually feasible and just requires time and effort, rather than being voodoo. Oh, and this process isn't just for FPGAs, it works on any mining platform. These are the optimizations you can get your hands on before anyone else with donations. Seeing the incentive yet?




rjk


February 13, 2012, 04:26:03 AM 

I was wondering what method you were going to use to determine which partially finished hashes to throw away prematurely, but I suppose I understand a little better now. I suppose the question is, how does this affect (for instance) miners that underclock RAM (since you are now going to be using some)? And is the added complexity going to raise power requirements by activating parts of the chips that might otherwise remain dormant? Thinking specifically in GPU terms here.




ummas


February 13, 2012, 03:17:10 PM 

@wondermine i understand the need to have own forum, where you can speare information better, but dont forget to post here any news about nanominer. This place is the best place i know to get more ppl intrested in nanominer. ideas are gr8. just pull out somme hashes first please




pieppiep


February 13, 2012, 03:29:05 PM 

From my forum at http://www.nonverba.org/forum:Now, the final addition we do is to add the round constant, something that will always be the same. In this specific situation it's 0x5be0cd19. Now, before we commit our 128th clock cycle to adding that to our previous 32bit "h" value, we can add very resource friendly code to determine whether our "h" value is too high to yield a winning number. We can say that if "h" is greater than 0xa41f32e6, it cannot yield a winning digest and we should not waste that clock cycle. Great, that was easy, and isn't nearly the magnitude of problem I'm working on, but it's here to get the idea across. This one is a completely predictable example, the true optimization will come from probabilityderived solutions earlier. Anyways... I don't understand. If the h value of the midstate is 0x5be0cd19, then the next h value that is added to it must be exactly 0xa41f32e7 to get the 0x00000000 value to make a valid share at difficulty 1. Isn't that a way bigger advantage? Now you know for sure you got a value you want and not some strange percentage. I assume no pools use shares less than difficulty 1 and for mining in pools with difficulty greater than 1 it should be easy for the miner software on the host computer to check if the share with difficulty 1 is also valid for the pool with difficulty x.

KICKICO   ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █     ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █        ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █ ██ ██ █   



wondermine
Jr. Member
Offline
Activity: 57


February 13, 2012, 10:23:46 PM 

From my forum at http://www.nonverba.org/forum:Now, the final addition we do is to add the round constant, something that will always be the same. In this specific situation it's 0x5be0cd19. Now, before we commit our 128th clock cycle to adding that to our previous 32bit "h" value, we can add very resource friendly code to determine whether our "h" value is too high to yield a winning number. We can say that if "h" is greater than 0xa41f32e6, it cannot yield a winning digest and we should not waste that clock cycle. Great, that was easy, and isn't nearly the magnitude of problem I'm working on, but it's here to get the idea across. This one is a completely predictable example, the true optimization will come from probabilityderived solutions earlier. Anyways... I don't understand. If the h value of the midstate is 0x5be0cd19, then the next h value that is added to it must be exactly 0xa41f32e7 to get the 0x00000000 value to make a valid share at difficulty 1. Isn't that a way bigger advantage? Now you know for sure you got a value you want and not some strange percentage. I assume no pools use shares less than difficulty 1 and for mining in pools with difficulty greater than 1 it should be easy for the miner software on the host computer to check if the share with difficulty 1 is also valid for the pool with difficulty x. I'll answer this last one here. You're right, there's a certain value that will work 100% of the time for difficulty one on that last addition. This answer was given for the sake of simplicity, but it was also given in nonabsolute terms to illustrate how probabilistic advantages will work. Real advantage can be taken earlier in the algorithm where we're dealing with probability thresholds, not absolute certainty. SHA2 at 64 rounds is random (succeeds at mathematical randomness tests). However given certain values at certain stages, that unravels. The ultimate example is adding a round constant which is a value we know 100% for certain. The goal we're going for is to find them a little earlier, say round 1267 where the probability might be in the fractions of a percentage, and the relationship to check may not be greater than, less than, or equal to; it's going to be something completely different that specially relates to a SHA2 round. Further answers directly related to the mathematics of it will be at http://www.nonverba.org/forum. I was wondering what method you were going to use to determine which partially finished hashes to throw away prematurely, but I suppose I understand a little better now. I suppose the question is, how does this affect (for instance) miners that underclock RAM (since you are now going to be using some)? And is the added complexity going to raise power requirements by activating parts of the chips that might otherwise remain dormant? Thinking specifically in GPU terms here.
I'm not an expert as far as GPU mining, but unless underclocking RAM actually invalidates the RAM functionally, it will not be a problem. As far as increasing power consumption, I don't believe it should be a detectable amount, this again from an educated guess; we can see. The reason I say so is that "not using" the RAM doesn't mean "not powering" the RAM, it means not accessing it. Now, we're just accessing it, the amount of electricity we're sending to the memory hasn't changed (the electricity related to logic signaling is absurdly small). RAM doesn't lie electrically dormant, just logically.




rjk


February 13, 2012, 11:02:07 PM 

I'm not an expert as far as GPU mining, but unless underclocking RAM actually invalidates the RAM functionally, it will not be a problem. As far as increasing power consumption, I don't believe it should be a detectable amount, this again from an educated guess; we can see. The reason I say so is that "not using" the RAM doesn't mean "not powering" the RAM, it means not accessing it. Now, we're just accessing it, the amount of electricity we're sending to the memory hasn't changed (the electricity related to logic signaling is absurdly small). RAM doesn't lie electrically dormant, just logically.
No, underclocking just makes the RAM run slower. The speed of the RAM is independent of the core in a GPU. The reason I ask is because with traditional miners, there is a little bit of a curve in efficiency through the range of 150400 Mhz in RAM speed. Some speeds might be unstable and slow mining, others could be a little bit faster mining, but never very much difference. Even if the RAM usage isn't much, it still can affect the speed if it is needed in a critical part of the hash algo, since the clocks are independent, is basically all I was trying to say. The reason for messing about with the RAM speed in the first place is to save several watts of power.




wondermine
Jr. Member
Offline
Activity: 57


February 14, 2012, 10:37:07 PM 

DE4230 Stratix IV GX: 800 MH/s
Who wants to write a script to handle UART interfacing rather than this JTAG business? PM/email me if you're interested.
++More (nonmathematical) design improvements to be implemented in the near future.





