In a completely unrolled design, there are no long lines.
The start vector is fed in on the left side, then the calculations percolate down to the right, and at the right a "matching" circuit determines if a "golden nonce" was found. There is no feedback from the right side to the left side.
Thus, while I do think that Bitfury's approach is EASIER (as one only has to worry about a few hundred wires and their associated delays, and not tens of thousands), I fail to see why it is inherently faster. I don't think it is inherently faster.
Maybe the Xilinx router goofs up wires that would be short and local and sends them the long way, like a crooked cab driver taking an out-of-town tourist for a ride. But, to reiterate, a fully unrolled miner does not involve any feedback from the right side to the left side.
TheSeven said it correctly: Spartan routing resources are ugly. There are no handy BENTQUADs etc., plus 50% of the slices being SLICEX adds to the problems. With Artix my highest expectation is 2x Spartan, but I am afraid to make such predictions, because I've heard that on 28-nm chips there are even more problems with power distribution. I do not want to cause trouble again, like having an estimate of 500 Mh/s per chip, then a target of 400 Mh/s, and finishing with 300 Mh/s.
About "there are no long lines": I've already commented on this, but I will try to draw exactly where the epic fail for a parallel expander is.
Say we compute w0+w1 and feed it to w9:
---+---------------------------------
---+---------------------------------
---+--------------------------------
---+-------------------------------
---+------------------------------
---+-----------------------------
---+----------------------------
---+----------------------------
---+---------------------------
w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15 w16
How many wires? The biggest cross-section just for that? 9x32 bits :-)
The same happens when pushing w9 to w16, and w14 to w16.
I'm too lazy to calculate it exactly, but it is near a 512-bit cross-section.
And in Spartan-6 it is difficult to pass more than a 256-bit cross-section the long way within a height of 8 slices (there are
32 QUAD routes per switch, so 256 bits would use up the QUAD routes in the horizontal direction for a height of 8 slices).
Then what happens: the router falls back to DOUBLE routes and spreads wide outside your round expander area, slowing
down the interconnect for other parts of the design.
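
The estimate above can be checked with a rough count. This is only a sketch under my reading of the drawing: the words w0..w63 sit in a single row, each new word w[t] is built by carrying a partial sum from column t-16 to column t-7, then on to column t, plus one bus from column t-2. The function names are made up for illustration, and the 32 QUAD routes and 8-slice height are simply the figures quoted in the post, not values taken from a datasheet.

```python
WORD_BITS = 32

# Figures quoted in the post, not verified against Xilinx documentation.
QUAD_ROUTES_PER_SWITCH = 32
REGION_HEIGHT_SLICES = 8


def expander_buses(num_words=64):
    """List (source column, destination column) for every 32-bit bus.

    Assumed layout: w[t-16] plus the sigma0 term of w[t-15] ("w0+w1" in the
    post) is carried to the w[t-7] column, the running sum is carried on to
    the w[t] column, and the sigma1 term of w[t-2] joins it there.
    """
    buses = []
    for t in range(16, num_words):
        buses.append((t - 16, t - 7))  # 9 columns, e.g. w0 -> w9
        buses.append((t - 7, t))       # 7 columns, e.g. w9 -> w16
        buses.append((t - 2, t))       # 2 columns, e.g. w14 -> w16
    return buses


def cross_section_bits(cut, buses):
    """Bits crossing the boundary between column cut-1 and column cut."""
    return WORD_BITS * sum(1 for src, dst in buses if src < cut <= dst)


def peak_cross_section_bits(num_words=64):
    """Worst-case cross-section over all column boundaries."""
    buses = expander_buses(num_words)
    return max(cross_section_bits(cut, buses) for cut in range(1, num_words))


demand = peak_cross_section_bits()
capacity = QUAD_ROUTES_PER_SWITCH * REGION_HEIGHT_SLICES
print("peak cross-section:", demand, "bits")
print("QUAD capacity over 8 slices:", capacity, "bits")
print("overflow pushed onto other routes:", max(0, demand - capacity), "bits")
```

Under these assumptions the three hops contribute 9 + 7 + 2 = 18 buses at the worst cut, so 576 bits, which agrees with "near 512" and is well over the 256 bits that the QUAD routes can carry in that height.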
I started with that :-( Plus it is an open question how such a design would survive the reality that SHA-256 is a VERY TOUGH TEST for bit error rates. Even small, infrequent errors are amplified by the avalanche effect through the rounds.
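
To make the amplification concrete, here is a small software sketch, not anybody's miner code. It runs a plain SHA-256 compression of one block and optionally flips a single bit of working register a after a chosen round, then counts how many output bits change. The fault model (one flipped bit in a) and all function names are my own assumptions for illustration; a real miner hashes an 80-byte header twice, which is left out to keep this short.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF


def first_primes(n):
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def icbrt(n):
    """Largest integer r with r**3 <= n (binary search)."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


# Constants derived from their definition (fractional bits of the cube and
# square roots of the first primes) instead of a typed-in table.
K = [icbrt(p << 96) & MASK for p in first_primes(64)]
H0 = [math.isqrt(p << 64) & MASK for p in first_primes(8)]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def pad_single_block(msg):
    """SHA-256 padding for a message of at most 55 bytes (one block)."""
    assert len(msg) <= 55
    return msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))


def sha256_block(block, fault_round=None):
    """Compress one 64-byte block; optionally flip bit 0 of `a` after a round."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = H0
    for t in range(64):
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[t] + w[t]) & MASK
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        h, g, f, e, d, c, b, a = g, f, e, (d + t1) & MASK, c, b, a, (t1 + t2) & MASK
        if t == fault_round:
            a ^= 1  # injected single-bit error (assumed fault model)
    return [(x + y) & MASK for x, y in zip(H0, (a, b, c, d, e, f, g, h))]


def hamming(x, y):
    """Number of differing bits between two lists of 32-bit words."""
    return sum(bin(p ^ q).count("1") for p, q in zip(x, y))


block = pad_single_block(b"golden nonce")
clean = sha256_block(block)
# Sanity check of the sketch against the standard library.
assert struct.pack(">8I", *clean) == hashlib.sha256(b"golden nonce").digest()
for r in (0, 32, 56, 63):
    print("fault after round", r, "->", hamming(clean, sha256_block(block, fault_round=r)), "bits differ")
```

A flip early in the pipeline should scramble roughly half of the 256 output bits, while a flip in the very last round only touches a few, so one rare glitch anywhere but the final rounds gives a completely wrong hash.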
With unrolled rounds, however, it is true: no problem there, it works like a charm. An unrolled design is also more compact than a rolled one, and a rolled design within 240 slices is very difficult; even 248 would be easier. In 240 I had to fight for each register and reuse parts of the logic to do other things. In my design the rounds only look similar; in reality there are 3 kinds of rounds with special cases, and they are different.
PS. You answered before I had written my post. Anyway, I think this will be helpful for those who try parallel rounds. With ASICs it will make the same mess, BTW:
lots of wires for the round expander
+ lots of clock problems.
PPS. So getting a quick and dense parallel design is a tough task; that's why I respect this work!