RandyFolds
|
|
January 31, 2012, 09:41:58 PM |
|
4-6 weeks. I had to do it.....I'm sorry...... Bad yochdog! That line is reserved for use by RandyFold's god-size presence (TM) only. Fixed that for ya...
|
|
|
|
Inspector 2211
|
|
January 31, 2012, 10:00:40 PM |
|
Let's not confuse unrolling with pipelining. Current open-source designs are fully unrolled, but I have yet to see a proper pipelined design that is open-sourced. Maybe pipelining is what BFL did? By pipelining the unrolled design they could significantly crank up the clock, since the FPGAs are limited more by the propagation delay in the signal routing than by the propagation delay in the actual logic.
My understanding is that the lack of pipelining is due to the lack of registers in an FPGA, is this correct or not?
A Spartan6-LX150 has 184000 flip-flops, and for a double SHA-256 only 32768 flip-flops are needed: 128 stages x 256 bits wide = 32768. That fits easily if you have 184000 at your disposal. Thus, I find it very hard to believe that current designs are not pipelined. Also, a typical design such as the ZTEX design achieves 200 MH/s at 200 MHz. Assuming it is not a fully pipelined design, that would mean that all 128 (or 125) stages have to percolate through in a mere 5 ns, because 5 ns is the clock period at 200 MHz. 40 ps (picoseconds) per stage? I don't think so.
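The arithmetic in this post can be checked with a short sketch. The function names are made up for illustration. Like the post, it counts only the 256-bit working state per stage. A real pipeline would also have to register message-schedule words, which this sketch ignores.

```python
# Back-of-the-envelope check of the figures quoted in the post.
# All names here are illustrative, not taken from any miner source code.

STAGES = 128                   # 2 x 64 rounds for a double SHA-256
STATE_BITS = 256               # working-state width per stage (post's assumption)
FLIPFLOPS_AVAILABLE = 184_000  # Spartan6-LX150 figure quoted in the post
CLOCK_HZ = 200e6               # 200 MHz, as quoted for the ZTEX design


def pipeline_registers(stages: int, width: int) -> int:
    """Flip-flops needed to register the state once per stage."""
    return stages * width


def clock_period_ns(clock_hz: float) -> float:
    """Clock period in nanoseconds."""
    return 1e9 / clock_hz


def per_stage_delay_ps(clock_hz: float, stages: int) -> float:
    """Time budget per stage if ALL stages had to settle in one clock."""
    return clock_period_ns(clock_hz) * 1000 / stages


if __name__ == "__main__":
    regs = pipeline_registers(STAGES, STATE_BITS)
    print(f"registers needed : {regs} of {FLIPFLOPS_AVAILABLE}")
    print(f"clock period     : {clock_period_ns(CLOCK_HZ)} ns")
    print(f"per-stage budget : {per_stage_delay_ps(CLOCK_HZ, 125)} ps (125 stages)")
```

With 125 stages the unpipelined budget works out to the 40 ps per stage mentioned above, which is the basis of the argument that the design must be pipelined.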
|
|
|
|
2112
Legendary
Offline
Activity: 2128
Merit: 1073
|
|
January 31, 2012, 10:01:08 PM Last edit: January 31, 2012, 10:34:37 PM by 2112 |
|
My understanding is that the lack of pipelining is due to the lack of registers in an FPGA, is this correct or not?
No, I don't think so. I think that the limitations are due to the heuristics used by FPGA synthesis tools. At least in Xilinxes the registers are essentially free. Pretty much each slice can have direct combinatorial outputs or registered outputs mixed with no restrictions. I shouldn't have written about no pipelining. The more accurate way would be inflexible pipelining. It would be better to describe level of unrolling and level of pipelining as two variables that are somewhat independent. I just looked again into the folder that I used to store the Verilog source code for Bitcoin hashers. It seems like some of them are indeed pipelined, but the level of pipelining is equal to the level of unrolling. It seems like ztex uses 125-way unrolling and 125-way pipelining. So in a single clock the design computes one round each of the hashes for nonces N-124 to N. When nonce N is on the input the output shows the final hash for nonce N-124. In general a 125-way unrolled design can be pipelined anywhere from 1 to 125 stages. There are also other possible ways of pipelining the SHA-256. For example the (W(i) + K(i)) expansion function uses a five-way adder: K(i) + S1(W(i-2)) + W(i-7) + S0(W(i-15)) + W(i-16). One could factor out the last two addends S0(W(i-15)) + W(i-16) and precompute them in the previous round as S0(W(i-14)) + W(i-15). Or even go two rounds deep and compute S0(W(i-13)) + W(i-14). And so forth. My guess is that the number of possible valid transformations overwhelms the synthesis tools and they blow up either on memory usage or time. Anyway, those are just my speculations. I haven't spent much eyeball time analyzing the available codes.
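The "precompute the last two addends one round early" idea can be sketched in software. This is only a model of the arithmetic, not of any posted Verilog. The function names are hypothetical, K(i) is left out because it is a constant, and S0/S1 are the standard SHA-256 small-sigma functions.

```python
MASK = 0xFFFFFFFF  # all SHA-256 additions are modulo 2**32


def rotr(x: int, n: int) -> int:
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def s0(x: int) -> int:
    """SHA-256 small sigma 0."""
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def s1(x: int) -> int:
    """SHA-256 small sigma 1."""
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def schedule_direct(block_words):
    """Message schedule with the full four-term sum computed in round i."""
    w = list(block_words)
    for i in range(16, 64):
        w.append((s1(w[i - 2]) + w[i - 7] + s0(w[i - 15]) + w[i - 16]) & MASK)
    return w


def schedule_precomputed(block_words):
    """Same schedule, but S0(W(i-15)) + W(i-16) is prepared one round early."""
    w = list(block_words)
    partial = (s0(w[1]) + w[0]) & MASK  # partial sum needed by round 16
    for i in range(16, 64):
        # Round i only adds two fresh terms to the registered partial sum.
        w.append((s1(w[i - 2]) + w[i - 7] + partial) & MASK)
        # Prepare the partial sum for round i+1: S0(W(i-14)) + W(i-15).
        partial = (s0(w[i - 14]) + w[i - 15]) & MASK
    return w


if __name__ == "__main__":
    words = [(i * 0x9E3779B9) & MASK for i in range(16)]  # arbitrary test block
    print(schedule_direct(words) == schedule_precomputed(words))
```

Both functions produce the same 64 words. In hardware the `partial` variable would correspond to an extra register, which shortens the adder chain inside each round at the cost of more flip-flops.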
|
|
|
|
2112
|
|
January 31, 2012, 10:06:04 PM |
|
Thus, I find it very hard to believe that current designs are not pipelined.
Yeah, you are right and I was wrong. It seems like the N-way unrolled designs are also N-way pipelined. But the degree of pipelining doesn't have to equal the degree of unrolling.
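A toy model may help separate the two variables. In the sketch below, `rounds` is the degree of unrolling and `stages` is the degree of pipelining. The round function is a made-up stand-in, not real SHA-256, and every name is hypothetical. Whether the lag comes out as `stages` or `stages - 1` depends on whether the output register is counted, so the off-by-one against the N-124 figure above is only a convention.

```python
def toy_round(state: int, r: int) -> int:
    """Stand-in for one hash round; NOT the real SHA-256 round function."""
    return (state * 1103515245 + 12345 + r) & 0xFFFFFFFF


def run_pipeline(nonces, rounds=8, stages=4):
    """Clock nonces through `rounds` unrolled rounds split into `stages` stages.

    Returns a list of (cycle, nonce, final_state) tuples, one per nonce.
    """
    assert rounds % stages == 0, "this sketch needs equal-sized stages"
    per_stage = rounds // stages
    regs = [None] * stages                 # one register after each stage
    outputs = []
    feed = list(nonces) + [None] * stages  # extra cycles to drain the pipe
    for cycle, nonce in enumerate(feed):
        # Whatever sits in the last register is a finished result.
        if regs[-1] is not None:
            outputs.append((cycle, *regs[-1]))
        # All registers update on the same clock edge from the old values.
        stage_inputs = [None if nonce is None else (nonce, nonce)] + regs[:-1]
        new_regs = [None] * stages
        for s, item in enumerate(stage_inputs):
            if item is not None:
                n, state = item
                for r in range(s * per_stage, (s + 1) * per_stage):
                    state = toy_round(state, r)  # combinational logic in stage s
                new_regs[s] = (n, state)
        regs = new_regs
    return outputs


if __name__ == "__main__":
    for stages in (1, 2, 4, 8):
        out = run_pipeline(range(3), rounds=8, stages=stages)
        print(stages, [(cycle, nonce) for cycle, nonce, _ in out])
```

The final values are identical for every choice of `stages`, and one result comes out per clock in all cases. What changes is the latency and how many rounds of logic must settle within a single clock period, which is what limits the achievable clock rate.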
|
|
|
|
2112
|
|
January 31, 2012, 10:32:56 PM Last edit: February 01, 2012, 12:18:41 AM by 2112 |
|
My guess is that the number of possible valid transformations overwhelm the synthesis tools and they blow up either on memory usage or time.
I apologize, I'm having problems posting and editing the posts. There are just so many logically equivalent ways to synthesize the SHA-256. For example somebody earlier posted a snippet of his synthesis where he used the adders in the DSP blocks on the Virtex 6 chip. For this to be really beneficial on Spartan 6 chips one has to write location-dependent Verilog: when near a DSP block use its adder, when far away synthesize the adder using local slice resources. The number of available trade-offs is immense. And thus far I have talked only about synthesis. But the full working design requires two more steps: place and route. This opens another set of dimensions that need to be explored for optimization. One guy here on this forum is working on a design where he wrote a Java program to generate a Verilog program that does hashing. The Verilog is all location-constrained to the particular slices. Somebody else posted code that explicitly uses ternary adders Y = A + B + C. As far as I know Xilinx ISE will always synthesize adder trees Y = (A + B) + C or Y = A + (B + C) or Y = (A + C) + B. On some other site I've found an implementation that pipelines rounds in pairs: a 128-way unrolled Bitcoin hash would've had 64-way pipelining. Again, it wasn't for Spartan 6, but some other Xilinx chip.
|
|
|
|
rjk
Sr. Member
Offline
Activity: 448
Merit: 250
1ngldh
|
|
January 31, 2012, 10:39:19 PM |
|
My guess is that the number of possible valid transformations overwhelm the synthesis tools and they blow up either on memory usage or time.
So if I could get my hands on a 4x hex core server with 256GB of RAM, you FPGA guys would love me long time?
|
|
|
|
DiabloD3
Legendary
Offline
Activity: 1162
Merit: 1000
DiabloMiner author
|
|
January 31, 2012, 10:42:00 PM |
|
My guess is that the number of possible valid transformations overwhelm the synthesis tools and they blow up either on memory usage or time.
So if I could get my hands on a 4x hex core server with 256GB of RAM, you FPGA guys would love me long time? They may even use lube.
|
|
|
|
RandyFolds
|
|
January 31, 2012, 10:48:30 PM |
|
To the FPGA guys here: Why is it 'rolled' and not 'furled'? It seems way more appropriate.
Because it's always been unrolling loops. Ears are unfurled, loops are unrolled. And just because "it's always been that way", it's ok? You don't happen to live in Alabama, now, do you?
|
|
|
|
2112
|
|
January 31, 2012, 11:07:45 PM |
|
So if I could get my hands on a 4x hex core server with 256GB of RAM, you FPGA guys would love me long time?
I recall somebody posting a screenshot of a control session for an Amazon EC2 farm containing over 50 machines doing the Xilinx design. I really don't think that using more brute-force would be helpful. The SHA-family of algorithms are very regular and pretty much every bit depends on every bit. This hits a weak spot in the global optimization algorithm used by the FPGA tools. I think that the way forward goes through the use of specialized synthesis tools that don't make generic assumptions about what kind of circuitry is being synthesized.
|
|
|
|
RandyFolds
|
|
January 31, 2012, 11:10:53 PM |
|
So if I could get my hands on a 4x hex core server with 256GB of RAM, you FPGA guys would love me long time?
I recall somebody posting a screenshot of a control session for an Amazon EC2 farm containing over 50 machines doing the Xilinx design. I really don't think that using more brute-force would be helpful. The SHA-family of algorithms are very regular and pretty much every bit depends on every bit. This hits a weak spot in the global optimization algorithm used by the FPGA tools. I think that the way forward goes through the use of specialized synthesis tools that don't make generic assumptions about what kind of circuitry is being synthesized. Anyone remember that thread where a guy had some crazy graphic utility for FPGA design? I didn't understand a lick of what everyone was talking about, but it seemed that he was hand plotting it, and the pictures were awesome...
|
|
|
|
fizzisist
|
|
January 31, 2012, 11:13:58 PM |
|
Anyone remember that thread where a guy had some crazy graphic utility for FPGA design? I didn't understand a lick of what everyone was talking about, but it seemed that he was hand plotting it, and the pictures were awesome...
https://bitcointalk.org/index.php?topic=49971
|
|
|
|
makomk
|
|
January 31, 2012, 11:24:48 PM |
|
It seems like some of them are indeed pipelined, but the level of pipelining is equal to the level of unrolling. It seems like ztex uses 125-way unrolling and 125-way pipelining. So the design computes in a single clock rounds of hashes for nonces (N-124 to N). When nonce N is on the input the output shows the final hash for nonce N-124.
In general a 125-way unrolled design can be pipelined anywhere from 1 to 125 stages.
ztex's latest code actually has two pipeline stages for every SHA-256 round, which is partly why it's so much faster; ISE has trouble routing the design efficiently. It varies as to how much sense this makes though. Also, the FPGA synthesis tools support something called register rebalancing where they move the registers that divide up the calculations into pipeline stages backwards and forwards in order to get the best speed, so it's not necessarily a simple question of one (or two) pipeline stages per round.
Somebody else posted a code that explicitly uses ternary adders Y = A + B + C. As far as I know Xilinx ISE will always synthesize adder trees Y = (A + B) + C or Y = A + (B + C) or Y = (A + C) + B.
Actually, I seem to recall that it's quite happy to automatically use ternary adders on Spartan-6.
|
Quad XC6SLX150 Board: 860 MHash/s or so. SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
|
|
|
RandyFolds
|
|
January 31, 2012, 11:32:53 PM |
|
Anyone remember that thread where a guy had some crazy graphic utility for FPGA design? I didn't understand a lick of what everyone was talking about, but it seemed that he was hand plotting it, and the pictures were awesome...
https://bitcointalk.org/index.php?topic=49971
That one! What is it?
|
|
|
|
2112
|
|
January 31, 2012, 11:46:08 PM Last edit: February 01, 2012, 12:05:40 AM by 2112 |
|
I don't think he did manual placement using his hands and mouse. From my understanding he wrote a Java program that produced Verilog with explicit location constraints as well as a Tcl script that controlled the fpgaeditor to put the finishing touches on the signal routing.
That one! What is it?
That is "fpgaeditor" from Xilinx ISE. This is actually how FPGA design started; if I recall correctly the suite was called X-ACT, not ISE. But for sure it wasn't called X-Acto, although it felt like using one. The automatic circuit synthesis was an expensive upgrade. I have no personal experience beyond peeking over coworkers' shoulders.
|
|
|
|
yochdog
Legendary
Offline
Activity: 2044
Merit: 1000
|
|
February 01, 2012, 04:48:34 AM |
|
4-6 weeks. I had to do it.....I'm sorry...... Bad yochdog! That line is reserved for use by RandyFold's god-like presence (TM) only. I wish I could "like" this.
|
I am a trusted trader! Ask Inaba, Luo Demin, Vanderbleek, Sannyasi, Episking, Miner99er, Isepick, Amazingrando, Cablez, ColdHardMetal, Dextryn, MB300sd, Robocoder, gnar1ta$ and many others!
|
|
|
RandyFolds
|
|
February 01, 2012, 04:22:50 PM |
|
Well......it's February.
|
|
|
|
DeathAndTaxes
Donator
Legendary
Offline
Activity: 1218
Merit: 1079
Gerald Davis
|
|
February 01, 2012, 04:57:48 PM |
|
I got my tracking #!
Er wait that was just a tracking # for dog food from petflow. Sorry.
|
|
|
|
Inaba
Legendary
Offline
Activity: 1260
Merit: 1000
|
|
February 01, 2012, 05:00:15 PM |
|
Petflow is the shiznit. No more lugging 50lb bags around!
|
If you're searching these lines for a point, you've probably missed it. There was never anything there in the first place.
|
|
|
DeathAndTaxes
|
|
February 01, 2012, 05:02:00 PM Last edit: February 01, 2012, 09:48:45 PM by DeathAndTaxes |
|
Petflow is the shiznit. No more lugging 50lb bags around!
Isn't it. I like the scheduler too. Set it & forget it.
|
|
|
|
kano
Legendary
Offline
Activity: 4592
Merit: 1851
Linux since 1997 RedHat 4
|
|
February 01, 2012, 08:15:45 PM |
|
So I'm sure people have emailed Sonny and asked where their single(s) is/are. Anyone gonna post the reply he gave this time? Since they have redone the power on the board they should also have some new performance figures .... anyone got them yet?
|
|
|
|
|