Algorithmically placed FPGA miner: 255MH/s/chip, supports all known boards

Quote from: kano on May 23, 2012, 01:53:26 AM

Well, the device itself (GPU, FPGA) does 2 sha256 rounds of 64 in length.
However, there is a VERY simple optimisation of that to remove 8 rounds
(4 at the beginning of the 1st and 4 at the end of the 2nd) that is done by GPUs (and most? FPGAs?)
(i.e. a 6.25% gain) that is not available with this at all.
(so you also need to subtract 6.25% from any gain)
Maybe that is what he is referring to?

That is already "subtracted" from the results, and apparently both the MH/s and MH/J are still better for the rolled version. This is most likely due to the Spartan6's awful long distance routing fabric, which means that keeping things very close to each other pays off (which is one reason why 85 small, 64-clocks-per-hash cores together are faster than just three 2-clocks-per-hash cores, you can just clock them at much higher frequencies, and you can utilize more area on the chip).

Thats an interesting hack. Thats exactly the same reason why GPUs unroll the entire thing, just so the in registers are kept in registers instead of pushed back to local or global RAM.

2112

Legendary

Offline

Activity: 2128
Merit: 1065

Re: Algorithmically placed FPGA miner: 245MH/s/chip and still rising

May 23, 2012, 11:18:43 PM

#248

This is most likely due to the Spartan6's awful long distance routing fabric,

This isn't Spartan's fault. This is a property of any modern FPGA: most of the delay and energy loss occurs in the routing fabric. So the easiest way to speed up the design is to minimize the demand on routing resources.

I was always perplexed why everyone here was focusing on unrolling the combinatorial logic. After gaining some experience with the currently available EDA tool suites for FPGA it became obvious: they make the place and route of repetitive designs very difficult.

The "sea of tight hashers" approach will probably be also beneficial for the future ASIC designs, although not by such a wide margin.

Does anyone know if bitfury's design stores the SHA-256 constants in BRAMs or has them spread over through the SLICEs?

Please comment, critique, criticize or ridicule BIP 2112: https://bitcointalk.org/index.php?topic=54382.0
Long-term mining prognosis: https://bitcointalk.org/index.php?topic=91101.0

Inspector 2211

Sr. Member

Offline

Activity: 448
Merit: 250

Re: Algorithmically placed FPGA miner: 245MH/s/chip and still rising

May 23, 2012, 11:34:05 PM

#249

▄█▄ ▄█ ▀█▀ ▄ ▄███▄▄████▄▀ ▄▄▀▄ ▀█▄██████████▀▄█████▀▄▀ ▄█▀▄███████████████████▄ ▄██▀█▀▀▀▀███▀▀▀█████▄▄▄▀█▀▄ ▄█▀▀ ▀████▀▄████████ █▀█▄▄ ██▀ ▀ ▀ ▀██████████▄ ▄▀▀█▄ ▀ ▀ ███▀▀▀▀▀████▌ ▄ ▀ ████████████▌ █ █████████████▀ ▀▀▀██▀▀██▀▀ ▀▀ ▀▀

This is most likely due to the Spartan6's awful long distance routing fabric,

This isn't Spartan's fault. This is a property of any modern FPGA: most of the delay and energy loss occurs in the routing fabric. So the easiest way to speed up the design is to minimize the demand on routing resources.

I was always perplexed why everyone here was focusing on unrolling the combinatorial logic. After gaining some experience with the currently available EDA tool suites for FPGA it became obvious: they make the place and route of repetitive designs very difficult.

The "sea of tight hashers" approach will probably be also beneficial for the future ASIC designs, although not by such a wide margin.

Does anyone know if bitfury's design stores the SHA-256 constants in BRAMs or has them spread over through the SLICEs?

In a completely unrolled design, there are no long lines.
The start vector is fed in on the left side, then the calculations percolate down to the right, and at the right a "matching" circuit determines if a "golden nonce" was found. There is no feedback from the right side to the left side.
Thus, while I do think that Bitfury's approach is EASIER (as one only has to worry about a few hundred wires and their associated delays, and not tens of thousands), I fail to see why it is inherently faster. I don't think it is inherently faster.
Maybe the Xilinx router goofs up wires that would be short and local and sends them the long way like a crooked cab driver an out-of-town tourist. But, to reiterate, a fully unrolled miner does not involve a feedback from the right side to the left side.

BTC-GREEN

Ecological Community in the Green Planet
❱❱❱❱❱❱ WHITEPAGE | ANN THREAD ❰❰❰❰❰❰

.

FACEBOOK ❱❱ TWITTER ❱❱ YOUTUBE
J O I N I C O IIILIVE

TheSeven

Hero Member

Offline

Activity: 504
Merit: 500

FPGA Mining LLC

Re: Algorithmically placed FPGA miner: 245MH/s/chip and still rising

May 23, 2012, 11:39:23 PM

#250

Quote from: Inspector 2211 on May 23, 2012, 11:34:05 PM

This is most likely due to the Spartan6's awful long distance routing fabric,

This isn't Spartan's fault. This is a property of any modern FPGA: most of the delay and energy loss occurs in the routing fabric.

That one has a particularly bad routing fabric though. Virtex, Kintex or even Artix are all much better.

And as pointed out above already, most of your other claims don't really apply here, especially for ASICs I think a pipelined design is likely to perform better for several reasons. The only downside that I can think of right now is that a sea of small cores approach has much better damage containment properties, thus increasing yield.

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY

DiabloD3

Legendary

Offline

Activity: 1162
Merit: 1000

DiabloMiner author

Re: Algorithmically placed FPGA miner: 245MH/s/chip and still rising

May 23, 2012, 11:45:07 PM

#251

This is most likely due to the Spartan6's awful long distance routing fabric,

This isn't Spartan's fault. This is a property of any modern FPGA: most of the delay and energy loss occurs in the routing fabric. So the easiest way to speed up the design is to minimize the demand on routing resources.

I was always perplexed why everyone here was focusing on unrolling the combinatorial logic. After gaining some experience with the currently available EDA tool suites for FPGA it became obvious: they make the place and route of repetitive designs very difficult.

The "sea of tight hashers" approach will probably be also beneficial for the future ASIC designs, although not by such a wide margin.

Does anyone know if bitfury's design stores the SHA-256 constants in BRAMs or has them spread over through the SLICEs?

In a completely unrolled design, there are no long lines.
The start vector is fed in on the left side, then the calculations percolate down to the right, and at the right a "matching" circuit determines if a "golden nonce" was found. There is no feedback from the right side to the left side.
Thus, while I do think that Bitfury's approach is EASIER (as one only has to worry about a few hundred wires and their associated delays, and not tens of thousands), I fail to see why it is inherently faster. I don't think it is inherently faster.
Maybe the Xilinx router goofs up wires that would be short and local and sends them the long way like a crooked cab driver an out-of-town tourist. But, to reiterate, a fully unrolled miner does not involve a feedback from the right side to the left side.

Theres a small difference, though. There technically is enough room to fit 2 full hashes on a Spartan 6, but due to how the leftover space is arranged, it probably will never fit (so eldentyrell fit 1 and a half). However, a shitload of tiny rolled engines would easily fit into weirdly shaped unused space. I think someone did the math and said they're almost at the equiv of 2 full hashes.

bitfury

Sr. Member

Offline

Activity: 266
Merit: 251

Re: Algorithmically placed FPGA miner: 245MH/s/chip and still rising

May 23, 2012, 11:53:36 PM

#252

This is most likely due to the Spartan6's awful long distance routing fabric,

This isn't Spartan's fault. This is a property of any modern FPGA: most of the delay and energy loss occurs in the routing fabric. So the easiest way to speed up the design is to minimize the demand on routing resources.

I was always perplexed why everyone here was focusing on unrolling the combinatorial logic. After gaining some experience with the currently available EDA tool suites for FPGA it became obvious: they make the place and route of repetitive designs very difficult.

The "sea of tight hashers" approach will probably be also beneficial for the future ASIC designs, although not by such a wide margin.

Does anyone know if bitfury's design stores the SHA-256 constants in BRAMs or has them spread over through the SLICEs?

You have all the clues... Turn on your head and just guess using data you have - print screen from PlanAhead - I certify that it is correct one... Try placing some BRAM and watch your timings... Why would you ask then ?

With routing fabric - it is the same... Open FPGA Editor, and start placing routes manually, understood how QUAD, DOUBLE, SINGLE routes works within spartan, what are costs of switch to switch hop, and switch to logic entry etc. It is interesting, believe me :-) Most pity however with them - is that P&R tool is far from ideal, and less routing resources left - worse design it produces. In SHA-256 round expander kills routing, as taking that w[0], w[1] and w[9] requires a lot of routing, because you basically pulling data from N rounds behind... so you basically put either SRL or BRAM to do that... near end of game... however if working really hard on it - spartan has barely enough resources just to route these parallel rounds - if you find right placement schema to use more adequately vertical and horizontal interconnect. Also interconnect works in one direction only, so if rounds placed in smart way, you'll get more efficiency in routing resources usage ( i.e. A,B <---> C,D while A --> C and B <--- D are interconnected and placed into same regions).

So I really respect author's work of fitting 1.5 parallel rounds into Spartan 6 - it is tough and very nice work. And probably Spartan is showing his bad temper in error rates. In case of rolled rounds - only single round failures, in case of unrolled rounds - if some part of chip fails more frequently than other - you get higher performance degradation. In my experience during debug runs - it starts to degrade from central slices to peripheral, when you rise clocks. It is interesting indeed if design performance would actually match performance that tools display.

Finally I would say that implementing FPGA design mostly about placement and routing... Do not even start trying it, if you are not prepared to waste weeks figuring all of that things, or use only simple designs, when you have about clocks 2-3 times smaller than chip's maximums... designs @ 50 - 100 Mhz would be easy....

DiabloD3

Legendary

Offline

Activity: 1162
Merit: 1000

DiabloMiner author

Re: Algorithmically placed FPGA miner: 245MH/s/chip and still rising

May 24, 2012, 12:01:38 AM

#253

Finally I would say that implementing FPGA design mostly about placement and routing... Do not even start trying it, if you are not prepared to waste weeks figuring all of that things, or use only simple designs, when you have about clocks 2-3 times smaller than chip's maximums... designs @ 50 - 100 Mhz would be easy....

I completely agree. I currently have the most optimized OpenCL kernel for GPUs out there, and the most recent version took me 2 weeks of 6-8 hour a day fiddling to get it done, after 1+ year of working on previous versions.

FPGA design is about 2-3 times harder.

Inspector 2211

Sr. Member

Offline

Activity: 448
Merit: 250

Re: Algorithmically placed FPGA miner: 245MH/s/chip and still rising

May 24, 2012, 12:08:05 AM

#254

▄█▄ ▄█ ▀█▀ ▄ ▄███▄▄████▄▀ ▄▄▀▄ ▀█▄██████████▀▄█████▀▄▀ ▄█▀▄███████████████████▄ ▄██▀█▀▀▀▀███▀▀▀█████▄▄▄▀█▀▄ ▄█▀▀ ▀████▀▄████████ █▀█▄▄ ██▀ ▀ ▀ ▀██████████▄ ▄▀▀█▄ ▀ ▀ ███▀▀▀▀▀████▌ ▄ ▀ ████████████▌ █ █████████████▀ ▀▀▀██▀▀██▀▀ ▀▀ ▀▀

In SHA-256 round expander kills routing, as taking that w[0], w[1] and w[9] requires a lot of routing, because you basically pulling data from N rounds behind...

Oh yeah, I totally forgot about that.
Now you got me almost convinced that such a sea of small blocks is the better way to do it.
Live and learn...

BTC-GREEN

Ecological Community in the Green Planet
❱❱❱❱❱❱ WHITEPAGE | ANN THREAD ❰❰❰❰❰❰

.

FACEBOOK ❱❱ TWITTER ❱❱ YOUTUBE
J O I N I C O IIILIVE

bitfury

Sr. Member

Offline

Activity: 266
Merit: 251

Re: Algorithmically placed FPGA miner: 245MH/s/chip and still rising

May 24, 2012, 12:13:56 AM

#255

Quote from: Inspector 2211 on May 23, 2012, 11:34:05 PM

In a completely unrolled design, there are no long lines.
The start vector is fed in on the left side, then the calculations percolate down to the right, and at the right a "matching" circuit determines if a "golden nonce" was found. There is no feedback from the right side to the left side.
Thus, while I do think that Bitfury's approach is EASIER (as one only has to worry about a few hundred wires and their associated delays, and not tens of thousands), I fail to see why it is inherently faster. I don't think it is inherently faster.
Maybe the Xilinx router goofs up wires that would be short and local and sends them the long way like a crooked cab driver an out-of-town tourist. But, to reiterate, a fully unrolled miner does not involve a feedback from the right side to the left side.

TheSeven said correctly - Spartan routing resources are ugly. no handy BENTQUADs etc.... plus 50% of Slices.X. adds up problems. With Artix my highest expectation 2x Spartan.... but I am afraid to make such predictions, because I've heard that on 28-nm chips there's even more problems with power distribution..... Do not want to make again troubles, like having estimation of 500 Mh/s per chip, then target of 400 Mh/s and finishing with 300 Mh/s.

About "there's no long lines" - I've already commented, but will try to draw it, where epic fail for parallel expander is exactly....

say computing w0+w1 and feeding to w9:

   ---+---------------------------------
   ---+---------------------------------
   ---+--------------------------------
   ---+-------------------------------
   ---+------------------------------
   ---+-----------------------------
   ---+----------------------------
   ---+----------------------------
---+---------------------------
w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15 w16

How many wires ? biggest cross-section just for that ? 9x32 bits :-)
The same happens when pushing w9 to w16... and w14 to w16...
Lazy to calculate - but near 512 bits cross-section...

And in Spartan-6 there's difficult to pass more than 256-bit cross-section in 8 slices height long-way (there's
32 QUAD routes per each switch - so 256-bits would use QUAD routes in horizontal case for 8 slices height).

Then what will happen - it will go to DOUBLE route, and will go wide outside of your round expander area slowing
down interconnect for other parts of design....

I've started with that :-( Plus it is a question how this design would survive reality that sha256 is VERY TOUGH TEST for bit error rates. even small infrequent errors are amplified by avalanche effect through rounds.

with unrolled rounds however it is true - no problem there - it works like charm... unrolled design is also more compact than rolled one.... and rolled design within 240 slices is very difficult... even 248 would be easier. as in 240 I had to fight for each register, and reuse parts of logics to do other things.... in my design rounds only looks similar, but in reality there's 3 kinds of rounds with special cases. and they are different.

PS. You've answered before I written post... Anyway I think this will be helpful for those who try with parallel rounds... With ASICs it will do same mess BTW

lots of wires for round expander

+ lots of clock problems.

PPS. So getting quick and dense parallel design is tough task - that's why I respect this work!

2112

Legendary

Offline

Activity: 2128
Merit: 1065

Re: Algorithmically placed FPGA miner: 245MH/s/chip and still rising

May 24, 2012, 01:16:49 AM

#256

Quote from: 2112 on May 24, 2012, 01:16:49 AM

You have all the clues... Turn on your head and just guess using data you have - print screen from PlanAhead - I certify that it is correct one... Try placing some BRAM and watch your timings... Why would you ask then ?

I'm asking because I'm not fully up-to-speed on possible space-time tradeoffs on the current Xilinx platforms. When I worked on them professionally we had the information about the routing and bitstream format available directly from Xilinx (maybe under NDA, I'm not sure, it was years ago).

I've also remember the comments from a poster who implemented the bitcoin hashers on Virtex-6 and quick-and-dirty solution was to use DSP48s for some fraction of the adders in SHA-256 mixing steps.

In theory at least it should be possible to fill every BRAM with multiple copies of the constants and use those constants at least in those hashing cells that are close to the BRAMs. As far as I understand your design you currently have just one class/macro of hashing cell, but have plans on implementing another class/macro to fill out the space that currently remains unused.

Overall, I'll venture to guess that the ultimate Spartan-6 bitstream will use the sea-of-hashers concept and the hashers will be a heterogenous mixture: close-to-DSP48, close-to-BRAM and far-from-DSP-and-BRAM. I occasionally talk to my friends who do digital design and they always mention "don't leave any FPGA resource unused, even at the expense of partially mangling the original algorithm".

I guess the ultimate way to express all the above is that the design space tradeoffs are multidimensional space of clock-freq*number-of-gates*time-to-market.

Please comment, critique, criticize or ridicule BIP 2112: https://bitcointalk.org/index.php?topic=54382.0
Long-term mining prognosis: https://bitcointalk.org/index.php?topic=91101.0

DiabloD3

Legendary

Offline

Activity: 1162
Merit: 1000

DiabloMiner author

Re: Algorithmically placed FPGA miner: 245MH/s/chip and still rising

May 24, 2012, 01:27:10 AM

#257

You have all the clues... Turn on your head and just guess using data you have - print screen from PlanAhead - I certify that it is correct one... Try placing some BRAM and watch your timings... Why would you ask then ?

I'm asking because I'm not fully up-to-speed on possible space-time tradeoffs on the current Xilinx platforms. When I worked on them professionally we had the information about the routing and bitstream format available directly from Xilinx (maybe under NDA, I'm not sure, it was years ago).

I've also remember the comments from a poster who implemented the bitcoin hashers on Virtex-6 and quick-and-dirty solution was to use DSP48s for some fraction of the adders in SHA-256 mixing steps.

In theory at least it should be possible to fill every BRAM with multiple copies of the constants and use those constants at least in those hashing cells that are close to the BRAMs. As far as I understand your design you currently have just one class/macro of hashing cell, but have plans on implementing another class/macro to fill out the space that currently remains unused.

Overall, I'll venture to guess that the ultimate Spartan-6 bitstream will use the sea-of-hashers concept and the hashers will be a heterogenous mixture: close-to-DSP48, close-to-BRAM and far-from-DSP-and-BRAM. I occasionally talk to my friends who do digital design and they always mention "don't leave any FPGA resource unused, even at the expense of partially mangling the original algorithm".

I guess the ultimate way to express all the above is that the design space tradeoffs are multidimensional space of clock-freq*number-of-gates*time-to-market.

Thats pretty much my analysis of this too. Everything that can lead to faster hashing is on the table no matter how insane or ugly.

Quote from: DiabloD3 on May 23, 2012, 11:45:07 PM

TheSeven

Hero Member

Offline

Activity: 504
Merit: 500

FPGA Mining LLC

Re: Algorithmically placed FPGA miner: 245MH/s/chip and still rising

May 24, 2012, 08:12:36 AM

#258

Theres a small difference, though. There technically is enough room to fit 2 full hashes on a Spartan 6, but due to how the leftover space is arranged, it probably will never fit (so eldentyrell fit 1 and a half). However, a shitload of tiny rolled engines would easily fit into weirdly shaped unused space. I think someone did the math and said they're almost at the equiv of 2 full hashes.

Not quite. Due to additional overhead of each core, it is only equivalent to ~1.3 fully unrolled cores hashes-per-clock wise. What bumps this to >1.5 times the total hashing speed is the higher speed those little cores can run at.

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY

TheSeven

Hero Member

Offline

Activity: 504
Merit: 500

FPGA Mining LLC

Re: Algorithmically placed FPGA miner: 245MH/s/chip and still rising

May 24, 2012, 08:24:32 AM

#259

Quote from: bitfury on May 24, 2012, 12:13:56 AM

With ASICs it will do same mess BTW

lots of wires for round expander

+ lots of clock problems.

With (fully custom) ASICs, however, you can just match your exact routing needs with wires, which should take care of the routing problems.
I'm certainly not an expert on that area, but I'd expect the overhead of intermediate result storage (in a rolled design) to outweigh the routing overhead (in an unrolled deisn).
As I stated above already a rolled design might still be useful to increase yield by containing defects into smaller functional units.

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY

DiabloD3

Legendary

Offline

Activity: 1162
Merit: 1000

DiabloMiner author

Re: Algorithmically placed FPGA miner: 245MH/s/chip and still rising

May 24, 2012, 03:01:26 PM

#260

Quote from: TheSeven on May 24, 2012, 08:24:32 AM

Quote from: bitfury on May 24, 2012, 12:13:56 AM

With ASICs it will do same mess BTW

lots of wires for round expander

+ lots of clock problems.

With (fully custom) ASICs, however, you can just match your exact routing needs with wires, which should take care of the routing problems.
I'm certainly not an expert on that area, but I'd expect the overhead of intermediate result storage (in a rolled design) to outweigh the routing overhead (in an unrolled deisn).
As I stated above already a rolled design might still be useful to increase yield by containing defects into smaller functional units.

Thats only if you get real ASIC. SASIC still screws you the same way since its just a hardwired version of the FPGA.