Bitcoin Forum
Author Topic: Nanominer - Modular FPGA Mining Platform  (Read 18939 times)
Inaba
Legendary
*
Offline

Activity: 1260
Merit: 1000
February 11, 2012, 03:28:45 PM
 #101

I'm not trying to discourage you from anything, Wondermine; maybe you will find something that's been missed somewhere, but all I'm saying is that it seems unlikely.  Maybe getting a fully working implementation and then refining it is a good course of action, but then again maybe genius needs to take a different path, and who am I to argue. Smiley  It's all beyond my abilities anyway!  I may purchase a couple of DE0s just to see how it goes.

If you're searching these lines for a point, you've probably missed it.  There was never anything there in the first place.
wondermine (OP)
Newbie
*
Offline

Activity: 59
Merit: 0
February 11, 2012, 03:51:14 PM
 #102

wondermine, I wish you the best. I really do.

However, please take a look at the SHA-256 algorithm. http://en.wikipedia.org/wiki/Sha-256
The 32 bit values b, c, d, f, g, and h are trivially derived from the previous round, i.e. copied from a, b, c, e, f, and g, respectively.
The 32 bit value e is derived from the previous round's d, h, e, f, and g (i.e. 5/8th of the previous round's 256 bits are used to derive it).
The 32 bit value a is derived from the previous round's h, e, f, g, a, b, and c - i.e. 7/8th of the previous round's 256 bits are used to derive it.

Now think this through over just one more round. Only four 32 bit values are trivially derived from their "grandfather" round.
The other four 32 bit values are derived from brutal mixing of almost all bits of the grandfather round.

And so on.

After just 4 rounds, a single bit change in the great-great-grandfather round influences ALL bits of the current round.

Thus, any notion of shaving more than 4 or 5 rounds off the 128 total rounds is a pipe dream.
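(For reference, here's a minimal Python sketch of one SHA-256 compression round, following the pseudocode on the Wikipedia page linked above; it just makes the copy-versus-mix structure described here concrete, and the variable names are my own.)
Code:
MASK = 0xFFFFFFFF

def rotr(x, n):
    # 32-bit right rotation
    return ((x >> n) | (x << (32 - n))) & MASK

def sha256_round(a, b, c, d, e, f, g, h, k, w):
    # One compression round per FIPS 180-4; k is the round constant,
    # w the message-schedule word, all arithmetic modulo 2^32.
    s1  = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch  = (e & f) ^ (~e & g)
    t1  = (h + s1 + ch + k + w) & MASK        # mixes h, e, f, g
    s0  = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t2  = (s0 + maj) & MASK                   # mixes a, b, c
    # New state: b, c, d, f, g, h are plain copies of the old a, b, c, e, f, g;
    # only the new a (7/8 of the old state) and new e (5/8) do any mixing.
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)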

In other words, speeding up an implementation of SHA-256 cannot be done by mathematical tricks.

Rather, the operations of each round should be optimized.

There is no real reason why the clock is a measly 200 MHz (and thus the clock cycle 5 ns) in the best currently available implementations,
such as the ZTEX implementation. Think about it: 5 ns, that is a delay straight from the 70s. A TTL technology-like delay. Certainly we can do better than that?!?

Analyzing the operations for their contribution to the delay yields:
rightrotate ... instant, no delay at all
xor ... minor delay, bit by bit, probably a few dozen picoseconds
and ... minor delay, bit by bit, probably a few dozen picoseconds
not ... minor delay, bit by bit, probably a few dozen picoseconds
+ ... this should be scrutinized. A 32 bit add operation can be quite costly and the fastest possible implementation should be pursued.
      Adding insult to injury, SHA-256 features not just binary or ternary adds, but 4-fold adds (in the t1 function) and 5-fold adds
      (e := d+t1) and 6-fold adds (a := t1 + t2).

So, there you go. The biggest detriment to performance is probably the 6-fold 32 bit wide add in a := t1 + t2.
If you can speed this operation up, maybe by pre-computing partial results in the PREVIOUS round, then bringing them to the table when needed, the entire SHA-256 will be sped up (assuming optimal placing and routing).
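(A hedged software model of that precompute idea - not anyone's actual HDL, and the function names are made up for illustration. The one partial sum that is certainly available a round early is k + w, which takes one operand off the in-round add tree for t1.)
Code:
MASK = 0xFFFFFFFF

def precompute_kw(k_consts, w_schedule):
    # k[i] and w[i] are both known before round i starts, so their sum can be
    # formed during round i-1 (a pipeline register in hardware) instead of on
    # the critical path of round i.
    return [(k + w) & MASK for k, w in zip(k_consts, w_schedule)]

def t1_in_round(h, s1, ch, kw):
    # With kw prepared a round early, the in-round add tree for t1 shrinks
    # from h + s1 + ch + k + w (five operands) to four operands.
    return (h + s1 + ch + kw) & MASK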

I'm very familiar with the SHA-256 algorithm and understand its complexity.  And I don't mean from reading Wikipedia.  What you fail to capture here is that I'm not looking at this to know exact values to avoid...  it's a matter of probability.  Why does SHA-256 use 64 opaque rounds? Precisely so something like this does not happen.  This might be a waste of time and resources if checking a value weren't so cheap - but it is cheap.

As far as "more standard" optimizations go, I already mentioned I would do that.  And I know adding is the biggest resource hog here.  I'm looking into the best ways to handle that: precalculation, DSP chips on custom boards (to offload logic that does not require programming, among other benefits)...  I may be a student, but I'm not exactly lacking in mathematical or engineering understanding.

The 200 MHz issue... there actually is a reason we're down in the "70s" range of timing.  Optimizing for a high clock rate is a huge challenge. It is a problem, and also something I'm already looking into - quad-pumping and other techniques that the major manufacturers use.  Given your knowledge of some of this, I'll assume you understand how hard it is to get clean clocking and then keep it stable.  Have you looked at commercial IP for SHA or for block ciphers? Those cores don't run much higher than 200 MHz stably, and they cost thousands of dollars and have had many engineers optimize the hell out of them.

So, all that to say: yes, I know what routes of action to take.  I also won't say I'm looking into something unless I believe it's feasible and have done adequate research.
runeks
Legendary
*
Offline

Activity: 980
Merit: 1008
February 12, 2012, 04:42:54 AM
 #103

^ Just so we're clear, wouldn't a discovery like that be a major one, essentially - to a certain extent - compromising SHA-256? Wouldn't it mean that it doesn't qualify as a cryptographic hash function? And perhaps that Bitcoin (and whatever else uses SHA-256) should migrate to a new and safer hash function?

Wouldn't this be the kind of stuff someone might do as a PhD in cryptology? And you're doing it as a side project for the mining FPGA you're developing in your spare time while studying. :=)

Nice work by the way! I'm going to send you a couple BTCs.
Inspector 2211
Sr. Member
****
Offline

Activity: 448
Merit: 250
February 12, 2012, 05:50:25 AM
 #104

[...] I'm looking into the best ways to handle that: precalculation, DSP chips on custom boards (to offload logic that does not require programming, among other benefits)... [...]

>to offload logic

You can't offload logic.
Not enough IO pins on an FPGA.
There are seven additions per round, 125 rounds - do you want to connect 875 external adders?
These 875 external adders would need 96 pins (64 inputs, 32 outputs) connected to the FPGA each, for a grand total of 84,000 pins...
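(Back-of-the-envelope check of those figures in Python, taking the numbers in the post as given:)
Code:
adds_per_round = 7
rounds         = 125                      # figure used in the post
adders         = adds_per_round * rounds  # 875
pins_per_adder = 64 + 32                  # two 32-bit operands in, one 32-bit sum out
print(adders, adders * pins_per_adder)    # 875 84000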

wondermine (OP)
Newbie
*
Offline

Activity: 59
Merit: 0
February 12, 2012, 08:38:42 AM
Last edit: February 12, 2012, 04:39:35 PM by wondermine
 #105

[deleted, argumentative]
Defkin
Member
**
Offline

Activity: 80
Merit: 10
February 12, 2012, 09:41:34 AM
 #106

interested
nelisky
Legendary
*
Offline

Activity: 1540
Merit: 1002
February 12, 2012, 11:05:12 AM
 #107

Are you familiar with the term "SerDes"?

That reminds me of that old internet slang word: "modem"

Smiley
wondermine (OP)
Newbie
*
Offline

Activity: 59
Merit: 0
February 12, 2012, 11:54:41 AM
 #108

Alright everyone, I've put up a forum on my website dedicated to Nanominer.  There's a whole bunch of new information as well as better ways to keep track of the project there, so head on over to
http://nonverba.org/forum/ and check it out!
nelisky
Legendary
*
Offline

Activity: 1540
Merit: 1002
February 12, 2012, 12:00:24 PM
 #109

I'm a single-forum type of person, but I will try to keep an eye on that one too.

Also, I can get you a VPS as a donation; send me an email (my address is in my profile) with what you need and we'll see what I can manage.
Inspector 2211
Sr. Member
****
Offline

Activity: 448
Merit: 250
February 12, 2012, 03:50:08 PM
Last edit: February 12, 2012, 05:36:36 PM by Inspector 2211
 #110

Are you familiar with the term "SerDes"?

Yes, as in "device that iS not suitablE for sending 64 bits to an addeR anD rEceiving 32 bitS back, all within 1 or 2 ns total" - SERDES.

You'd be shooting for 100 Gbit/s SerDes units - possible in ASICs now (if expensive), not yet available in FPGAs.

Furthermore, FPGAs typically have only half a dozen, a dozen, or maybe two dozen SerDes units - not 875.
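(A rough per-adder bandwidth estimate - my own back-of-the-envelope, not from any datasheet - shows why:)
Code:
bits_per_add = 64 + 32                  # operands out plus result back, per adder
for budget_ns in (1, 2):
    print(budget_ns, "ns ->", bits_per_add / budget_ns, "Gbit/s per adder")
# 1 ns -> 96.0 Gbit/s per adder
# 2 ns -> 48.0 Gbit/s per adder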

finway
Hero Member
*****
Offline

Activity: 714
Merit: 500
February 12, 2012, 06:11:19 PM
 #111

Interesting!

wondermine (OP)
Newbie
*
Offline

Activity: 59
Merit: 0
February 12, 2012, 08:22:35 PM
 #112

A repository of resources, information, and eventually tutorials has been started here: http://nonverba.org/forum/viewtopic.php?f=8&t=6

Check it out, sign up, post, get some discussion going.  There's been a fair number of people mentioning they'd like to get into FPGA programming, so let's do it.
wondermine (OP)
Newbie
*
Offline

Activity: 59
Merit: 0
February 13, 2012, 03:49:40 AM
 #113

From my forum at http://www.nonverba.org/forum:

So I've been talking to a number of people about the possibility of using side-channel vulnerabilities in the bitcoin mining process to speed up mining. It's all been vague until now, so I'm going to give a very tangible example.

Assume we roll up the algorithm: 2x SHA-2 = 128 clocks/mining operation.
Now, the final addition we do is to add a value that is always the same - for the second hash it's the initial h word, 0x5be0cd19.
Now, before we commit our 128th clock cycle to adding that to our previous 32-bit "h" value, we can add very resource-friendly code to determine whether our "h" value is too low to yield a winning number. We can say that if "h" is not greater than 0xa41f32e6, the sum cannot wrap around to a winning digest and we should not waste that clock cycle. Great, that was easy, and isn't nearly the magnitude of problem I'm working on, but it's here to get the idea across. This one is a completely predictable example; the true optimization will come from probability-derived shortcuts earlier in the algorithm. Anyways...

By eating a couple of logic elements and a little block RAM, we can implement this check.
Now, how does that all translate into performance gain? Well, 64.11% of hashes fall into this range. That means that 64.11% of the time, we can save one clock cycle on a core, which is 1/128, or a 0.78125% speed increase.
0.6411 * 0.0078125 ≈ 0.005, i.e. a 0.5% increase in performance. Pulling, say, 500 MH/s, that's a 2.50 MH/s performance enhancement over time. (I'm pretty sure this math is accurate, but if it's not perfect, it's at least close; these are the ranges we're talking about.)
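(Reproducing that arithmetic as stated, purely as a sanity check:)
Code:
p_discard    = 0xa41f32e6 / 2**32    # ~0.6411: fraction of h values the check can reject
cycle_saving = 1 / 128               # one clock out of 128 per double hash
gain         = p_discard * cycle_saving
print(f"{gain:.4%}")                         # ~0.50%
print(f"{500 * gain:.2f} MH/s at 500 MH/s")  # 2.50 MH/s at 500 MH/s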

It maybe doesn't sound like a lot, but think about it: we just pulled 2.5 MH/s out of thin air. Or rather, out of unused block RAM. It didn't cost us time, or power, or logic. The earlier these checks can be made, the lower their probability of success, but the greater their effect on performance when they do fire.

In case this was tl;dr material: We just pulled 2.5MH/s out of thin air and can do it repeatedly.

There are a lot of these optimizations that could fit on even a Cyclone series FPGA. The only thing standing between us and them is time and mathematical analysis. Hopefully this gives you a clearer picture of what I'm talking about, and lets you know that it's actually feasible and just requires time and effort, rather than being voodoo.

Oh, and this process isn't just for FPGAs, it works on any mining platform.  These are the optimizations you can get your hands on before anyone else with donations.  Seeing the incentive yet? Smiley
rjk
Sr. Member
****
Offline

Activity: 448
Merit: 250
February 13, 2012, 04:26:03 AM
 #114

I was wondering what method you were going to use to determine which partially finished hashes to throw away prematurely, but I suppose I understand a little better now. I suppose the question is, how does this affect (for instance) miners that underclock RAM (since you are now going to be using some)? And is the added complexity going to raise power requirements by activating parts of the chips that might otherwise remain dormant? Thinking specifically in GPU terms here.

Mining Rig Extraordinaire - the Trenton BPX6806 18-slot PCIe backplane [PICS] Dead project is dead, all hail the coming of the mighty ASIC!
ummas
Sr. Member
****
Offline

Activity: 274
Merit: 250
February 13, 2012, 03:17:10 PM
 #115

@wondermine
I understand the need to have your own forum, where you can spread information better, but don't forget to post any news about Nanominer here. This place is the best place I know to get more people interested in Nanominer. The ideas are great - just pull out some hashes first, please Cheesy
pieppiep
Hero Member
*****
Offline

Activity: 1596
Merit: 502
February 13, 2012, 03:29:05 PM
 #116

From my forum at http://www.nonverba.org/forum:
Now, the final addition we do is to add a value that is always the same - for the second hash it's the initial h word, 0x5be0cd19. Now, before we commit our 128th clock cycle to adding that to our previous 32-bit "h" value, we can add very resource-friendly code to determine whether our "h" value is too low to yield a winning number. We can say that if "h" is not greater than 0xa41f32e6, the sum cannot wrap around to a winning digest and we should not waste that clock cycle. [...]
I don't understand.
If the h value of the midstate is 0x5be0cd19, then the next h value that is added to it must be exactly 0xa41f32e7 to get the 0x00000000 value to make a valid share at difficulty 1.
Isn't that a way bigger advantage? Now you know for sure you got a value you want and not some strange percentage.
I assume no pools use shares below difficulty 1, and for pools with difficulty greater than 1 it should be easy for the miner software on the host computer to check whether a difficulty-1 share is also valid at the pool's difficulty.
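(A quick sanity check of that modular arithmetic, in Python since the wrap-around is easy to see there; IV_H is just a name for the second hash's initial h word:)
Code:
IV_H = 0x5be0cd19                      # initial h word of the second SHA-256
winning_h = (-IV_H) & 0xFFFFFFFF       # the only pre-addition h that wraps to zero
assert winning_h == 0xa41f32e7
assert (winning_h + IV_H) & 0xFFFFFFFF == 0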
wondermine (OP)
Newbie
*
Offline

Activity: 59
Merit: 0
February 13, 2012, 10:23:46 PM
Last edit: February 13, 2012, 10:33:48 PM by wondermine
 #117

If the h value of the midstate is 0x5be0cd19, then the next h value that is added to it must be exactly 0xa41f32e7 to get the 0x00000000 value to make a valid share at difficulty 1.
Isn't that a way bigger advantage? Now you know for sure you got a value you want and not some strange percentage. [...]

I'll answer this last one here.

You're right, there's a certain value that will work 100% of the time for difficulty one on that last addition.  This answer was given for the sake of simplicity, but it was also given in non-absolute terms to illustrate how probabilistic advantages will work.  Real advantage can be taken earlier in the algorithm, where we're dealing with probability thresholds, not absolute certainty.  SHA-2 at 64 rounds looks random (it passes mathematical randomness tests).  However, given certain values at certain stages, that unravels.  The ultimate example is that final addition of a constant we know 100% for certain.  The goal is to find such opportunities a little earlier, say around rounds 126-127, where the probability might be in the fractions of a percent, and the relationship to check may not be a simple greater-than, less-than, or equal-to; it's going to be something completely different that relates specifically to a SHA-2 round.
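(A purely illustrative skeleton of that idea - the round count, the cut-in point, and especially the can_still_win predicate are placeholders, since the post deliberately doesn't specify the actual relationship to test:)
Code:
def hash_with_early_abort(state, rounds, round_fn, can_still_win):
    # state: the 8-word working state; round_fn: one compression round;
    # can_still_win: the (unspecified) cheap predicate run near the end.
    for i in range(rounds):
        if i >= rounds - 2 and not can_still_win(i, state):
            return None          # reject this nonce without finishing
        state = round_fn(i, state)
    return state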

Further answers directly related to the mathematics of it will be at http://www.nonverba.org/forum.

I was wondering what method you were going to use to determine which partially finished hashes to throw away prematurely, but I suppose I understand a little better now. I suppose the question is, how does this affect (for instance) miners that underclock RAM (since you are now going to be using some)? And is the added complexity going to raise power requirements by activating parts of the chips that might otherwise remain dormant? Thinking specifically in GPU terms here.

I'm not an expert on GPU mining, but unless underclocking the RAM actually breaks it functionally, it will not be a problem.  As far as increasing power consumption goes, I don't believe it should be a detectable amount - again an educated guess; we'll see.  The reason I say so is that "not using" the RAM doesn't mean "not powering" the RAM, it means not accessing it.  Now we're simply accessing it; the amount of electricity we're sending to the memory hasn't changed (the power related to logic signaling is absurdly small).  RAM doesn't lie electrically dormant, just logically.
rjk
Sr. Member
****
Offline

Activity: 448
Merit: 250
February 13, 2012, 11:02:07 PM
 #118

I'm not an expert on GPU mining, but unless underclocking the RAM actually breaks it functionally, it will not be a problem. [...]
No, underclocking just makes the RAM run slower. The speed of the RAM is independent of the core clock in a GPU. The reason I ask is that with traditional miners there is a little bit of an efficiency curve through the 150-400 MHz range of RAM speed. Some speeds might be unstable and slow mining down, others could make mining a little bit faster, but there's never very much difference.

All I was really trying to say is that even if the RAM usage isn't much, it can still affect speed if the RAM is needed in a critical part of the hash algorithm, since the two clocks are independent. The reason for messing about with the RAM speed in the first place is to save several watts of power.

Mining Rig Extraordinaire - the Trenton BPX6806 18-slot PCIe backplane [PICS] Dead project is dead, all hail the coming of the mighty ASIC!
wondermine (OP)
Newbie
*
Offline

Activity: 59
Merit: 0
February 14, 2012, 10:37:07 PM
Last edit: February 14, 2012, 10:47:28 PM by wondermine
 #119

DE4-230 Stratix IV GX: 800 MH/s

Who wants to write a script to handle UART interfacing rather than this JTAG business?  PM/email me if you're interested.

++More (non-mathematical) design improvements to be implemented in the near future.
lame.duck
Legendary
*
Offline

Activity: 1270
Merit: 1000
February 15, 2012, 12:45:35 AM
 #120

There is already a mining script that uses a serial connection to the FPGA ...

https://bitcointalk.org/index.php?topic=62823.0