iidx
August 14, 2011, 11:40:18 PM
Quote:
200MH is simply way out of the question for an S6-LX150. That won't stop me from trying. As far as I can tell with the poking around I've done so far, the current bottleneck on the S6-LX150 is the far dependencies caused by the W calculations. These references make it so that the rounds are not isolated, and so cannot be routed into a uniform chain. This forces ISE to do completely absurd routing, splattering the placement of a round's components across a good 1/4th of the chip. And that, obviously, leads to massive routing delays. On my last few compiles, the worst-case paths were >80% routing (8ns+ of routing, with 2ns of logic).

Yeah, it looks like a "giant snake" that traverses the chip.

The current critical path is approximately two 3-way 32-bit adders implemented as 16 total slices, thanks to the Spartan-6 fast carry look ahead chains. Is there a means of optimizing that logic that I have missed? These are the adders that I tried to move into DSP48s, as they have dedicated carry paths to and from adjacent DSPs in a column. I didn't look at how to optimize the actual math/operations, though.
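
To put the quoted timing figures in perspective, here is a rough back-of-the-envelope sketch. It takes the 8 ns routing and 2 ns logic numbers at face value (the post says "8ns+", so they are lower bounds), ignores setup time and clock skew, and the "routing halved" case is purely a hypothetical what-if, not something measured in the thread.

```python
# Worst-case path figures quoted in the post (treated as exact here).
routing_ns = 8.0
logic_ns = 2.0

path_ns = routing_ns + logic_ns
routing_share = routing_ns / path_ns      # fraction of the delay spent in routing
fmax_mhz = 1000.0 / path_ns               # a 1 ns period corresponds to 1000 MHz

# Hypothetical what-if: same logic delay, routing delay cut in half.
fmax_if_routing_halved_mhz = 1000.0 / (routing_ns / 2 + logic_ns)

print(f"routing share: {routing_share:.0%}")
print(f"clock ceiling from this path: {fmax_mhz:.0f} MHz")
print(f"ceiling with half the routing delay: {fmax_if_routing_halved_mhz:.0f} MHz")
```

With these inputs the path caps the clock at roughly 100 MHz, which is why placement matters more here than shaving logic levels.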

ngzhang
August 15, 2011, 04:25:12 AM, last edit: August 15, 2011, 05:06:59 AM by ngzhang
|
Quote:
200MH is simply way out of the question for an S6-LX150. That won't stop me from trying.

Good luck!

Besides pipelining, there is another way to improve performance in IC design, which is logic-copy (replication). I haven't read the code yet (because I put all my spare time into the dual XC6SLX150 mining board design), but after reading this thread: if we are facing terrible routing problems, why not try another architecture?

One possible way is to implement a small, optimized rolled-up core, something like a calculation unit (maybe better built with DSP48As) around a single 512-bit register (maybe using LUTs as distributed RAM instead of registers), running at 200 MHz+ (it's very possible) and taking about 64 clocks per hash. We could implement 100+ of them per chip. This way we can also get a very high MH/s. Certainly, I'm not an expert, this is just for discussion.

EDIT1: I found this: http://www.heliontech.com/downloads/fast_hash_xilinx_datasheet.pdf#view=Fit

In this commercial IP core they use 309 slices (the SLX150 has 23038 of them) and get a throughput of 977 Mbps. If we use 80% of the slices of one SLX150, we can implement 60 of these cores, for a total throughput of about 58 Gbps, or about 228 MH/s. So reaching 200 MH/s is very possible, isn't it?
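
The EDIT1 estimate can be reproduced with a few lines of arithmetic. This is only a sketch of the post's own reasoning: every figure comes from the post, and the 256-bits-per-hash conversion is the post's assumption (it is questioned in the replies), not an established fact.

```python
TOTAL_SLICES = 23038      # XC6SLX150 slice count, as quoted in the post
SLICES_PER_CORE = 309     # size of the commercial core, as quoted in the post
CORE_MBPS = 977           # per-core throughput, as quoted in the post
USABLE_FRACTION = 0.80    # "80% of the slices"
BITS_PER_HASH = 256       # the post's assumption, disputed later in the thread

# 59.6 cores fit exactly; the post rounds this to 60.
cores = round(TOTAL_SLICES * USABLE_FRACTION / SLICES_PER_CORE)
total_mbps = cores * CORE_MBPS
mhash_per_s = total_mbps / BITS_PER_HASH   # Mbit/s divided by bits per hash

print(f"{cores} cores, {total_mbps / 1000:.1f} Gbps, {mhash_per_s:.0f} MH/s")
```

The result only holds if routing, clocking and control overhead do not grow as cores are added, which the single-core datasheet numbers cannot show.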

lame.duck
August 15, 2011, 09:54:20 AM
|
Hm, the datasheet gives 126 MHz and 1 clock cycle per hashing round; I would interpret these numbers as giving us approx. 1 MHash/s for bitcoin hashing.

There are some papers on the net on SHA2 cores (McEnvoy and another one) which are capable of running at 120 MHz on a quite old Virtex2 using 1k LUTs/slices(?), reaching similar performance numbers. They use a pipelined design which needs ca. 68 rounds for a single SHA2 hash.

But regarding the resource usage, there are only the numbers for a single core, not how the design scales up to an FPGA full of cores.

ngzhang
August 15, 2011, 10:06:59 AM
|
Quote from: lame.duck
Hm, the datasheet gives 126 MHz and 1 clock cycle per hashing round; I would interpret these numbers as giving us approx. 1 MHash/s for bitcoin hashing.
There are some papers on the net on SHA2 cores (McEnvoy and another one) which are capable of running at 120 MHz on a quite old Virtex2 using 1k LUTs/slices(?), reaching similar performance numbers. They use a pipelined design which needs ca. 68 rounds for a single SHA2 hash.
But regarding the resource usage, there are only the numbers for a single core, not how the design scales up to an FPGA full of cores.

I apologize if there is a misunderstanding, but the datasheet says the IP core can run at 126 MHz and provide a hash rate of 977 Mbps. That means approx. 8 bit/clk, and also a bitcoin hashing rate of 3.8 MH/s (1 bitcoin hash is 256 bits of data, is that right?).

lame.duck
August 15, 2011, 11:58:46 AM
|
Quote from: lame.duck
Hm, the datasheet gives 126 MHz and 1 clock cycle per hashing round; I would interpret these numbers as giving us approx. 1 MHash/s for bitcoin hashing.
There are some papers on the net on SHA2 cores (McEnvoy and another one) which are capable of running at 120 MHz on a quite old Virtex2 using 1k LUTs/slices(?), reaching similar performance numbers. They use a pipelined design which needs ca. 68 rounds for a single SHA2 hash.
But regarding the resource usage, there are only the numbers for a single core, not how the design scales up to an FPGA full of cores.

Quote from: ngzhang
I apologize if there is a misunderstanding, but the datasheet says the IP core can run at 126 MHz and provide a hash rate of 977 Mbps. That means approx. 8 bit/clk, and also a bitcoin hashing rate of 3.8 MH/s (1 bitcoin hash is 256 bits of data, is that right?).

IMHO no, one bitcoin hash uses 2 'normal' SHA256 hashes, so this would give 1.9 MHash/s, which is still double what you get from the MHz/64 = MHash/s assumption. I have no clue how the throughput is counted; my understanding so far was that it is the output data rate for hashing 64 bit chunks of input data. (If the input data set to be hashed is larger than 64 bit, the input will be processed in 64 bit chunks that are expanded to 256 bit, but the output data will not grow in size.)
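
One way to reconcile the figures in this exchange, offered as a sketch and not as a reading of the datasheet. It assumes the 977 Mbps figure counts input message bits (the thread does not settle this), uses the fact that SHA-256 consumes its input in 512-bit blocks, and assumes a miner needs two compression runs per nonce because the first block of the 80-byte header can be precomputed once per work unit.

```python
CLOCK_HZ = 126e6             # clock rate quoted from the datasheet in the thread
THROUGHPUT_BPS = 977e6       # throughput quoted from the datasheet in the thread
BLOCK_BITS = 512             # SHA-256 processes the message in 512-bit blocks
COMPRESSIONS_PER_NONCE = 2   # assumption: double SHA-256 with a precomputed midstate

# If throughput counts input bits, this is how many clocks one block takes.
cycles_per_block = CLOCK_HZ * BLOCK_BITS / THROUGHPUT_BPS
blocks_per_second = THROUGHPUT_BPS / BLOCK_BITS
nonces_per_second = blocks_per_second / COMPRESSIONS_PER_NONCE

print(f"clocks per 512-bit block: {cycles_per_block:.1f}")
print(f"bitcoin hashes per second per core: {nonces_per_second / 1e6:.2f} M")
```

Under those assumptions the core spends about 66 clocks per block, which fits one clock per round plus a little overhead, and a single core delivers a bit under 1 MHash/s.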

makomk
August 16, 2011, 02:00:59 PM
|
Quote:
200MH is simply way out of the question for an S6-LX150. That won't stop me from trying. As far as I can tell with the poking around I've done so far, the current bottleneck on the S6-LX150 is the far dependencies caused by the W calculations. These references make it so that the rounds are not isolated, and so cannot be routed into a uniform chain. This forces ISE to do completely absurd routing, splattering the placement of a round's components across a good 1/4th of the chip. And that, obviously, leads to massive routing delays. On my last few compiles, the worst-case paths were >80% routing (8ns+ of routing, with 2ns of logic).

I saw similar failures at one point. Try enabling register duplication for the Map stage and/or register rebalancing during synthesis. I think I can probably hit at least 140 MHz for 70 Mhash/s on SLX75 with two pipeline stages per round and both of those enabled, plus some other bits, but I need to fix some stuff and test the changes in simulation.

lame.duck
August 17, 2011, 01:43:24 PM
|
Quote:
Finally got around to coding some maximum clock speed improvements for users of smaller Cyclone III and IV devices - now available from my new partial-unroll-speed branch. Expected minimum device size and speed is roughly as follows:

So far I've got:

EP3C25C6: 135 MHz
EP2C35C6: 111 (108) MHz
EP2C35C8: 80 MHz

for the 85 degrees Celsius slow timing model, after playing with the options suggested by the Timing Optimization Advisor. One point: rerunning the compile process a second time does not always give a better or equal result (with the timing-driven options on), so it could be wise to work with revisions, or make some other provision for keeping the best bitstream.

One idea I got from the numbers: for these cases, would it be more performant to use only one pipeline that computes both hashes alternately, at a lower resource count, even if the two pipelines are not 100% identical?

Anoynomous
August 17, 2011, 09:57:44 PM
|
Hi to all, I am having a little trouble here. I have some experience designing a SHA-1 hash cracker on an FPGA, so this project caught my interest. When I downloaded the code and tried to compile it for the S6 LX150, it took about an hour just to synthesize the code, and then the software said I had overused my resources. So I wanted to know, where did I go wrong?

fpgaminer (OP)
August 17, 2011, 10:54:34 PM
|
Quote from: Anoynomous
I am having a little trouble here. I have some experience designing a SHA-1 hash cracker on an FPGA, so this project caught my interest. When I downloaded the code and tried to compile it for the S6 LX150, it took about an hour just to synthesize the code, and then the software said I had overused my resources. So I wanted to know, where did I go wrong?

Which project did you use? For S6-LX150, this is probably the preferred project to start from: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/LX150_makomk_Test

You'll want to adjust main_pll.v:98 to 5 for 50MHz, to make the compile easier and the firmware actually usable (assuming you have the S6-LX150T dev board) without cooling.

Anoynomous
August 18, 2011, 01:46:18 AM
|
Well, I had used LX150_test, and I don't have an LX150 dev board, so I think I will just share my ideas here. The critical path in this circuit is:

    t1 = rx_state[`IDX(7)] + e1_w + ch_w + rx_w[31:0] + k

But k and rx_w[31:0] can be calculated one loop ahead and added to rx_state[`IDX(7)] at the point below:

    state_buf[`IDX(7)] <= rx_state[`IDX(6)];

The new code should look like this, where k and rx_w are those of the next loop:

    state_buf[`IDX(7)] <= rx_state[`IDX(6)] + rx_w[31:0] + k;

This will reduce the adders to:

    t1 = rx_state[`IDX(7)] + e1_w + ch_w;

This should improve clock speed, provided routing issues don't interfere.
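
The rewrite above relies on the fact that the h input of one SHA-256 round is the g value of the previous round, so h + w + k can be formed a round early. Below is a minimal software model of that idea, not the project's Verilog: the function names are made up for illustration, and e0/e1 are assumed to correspond to the big-sigma functions behind e0_w and e1_w. It checks the early-addition variant against Python's hashlib.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF

def icbrt(n):
    """Floor of the cube root of a non-negative integer, by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# SHA-256 constants: fractional bits of cube/square roots of the first primes.
PRIMES = [p for p in range(2, 312) if all(p % d for d in range(2, math.isqrt(p) + 1))]
K = [icbrt(p << 96) & MASK for p in PRIMES]
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def e0(a): return rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
def e1(e): return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
def s0(x): return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)
def s1(x): return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)
def ch(e, f, g): return (e & f) ^ (~e & g & MASK)
def maj(a, b, c): return (a & b) ^ (a & c) ^ (b & c)

def compress(state, block, precompute):
    """One SHA-256 compression; precompute=True uses the early h + w + k sum."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):
        w.append((s1(w[t - 2]) + w[t - 7] + s0(w[t - 15]) + w[t - 16]) & MASK)
    a, b, c, d, e, f, g, h = state
    t1_part = (h + w[0] + K[0]) & MASK           # seed for round 0
    for t in range(64):
        if precompute:
            t1 = (t1_part + e1(e) + ch(e, f, g)) & MASK
            # g becomes h next round, so next round's h + w + k is known now
            # (the value formed in the last round is simply unused).
            t1_part = (g + w[(t + 1) % 64] + K[(t + 1) % 64]) & MASK
        else:
            t1 = (h + e1(e) + ch(e, f, g) + w[t] + K[t]) & MASK
        t2 = (e0(a) + maj(a, b, c)) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]

def sha256_short(msg, precompute):
    """SHA-256 of a message shorter than 56 bytes (one padded block)."""
    assert len(msg) < 56
    block = msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))
    return struct.pack(">8I", *compress(H0, block, precompute))

assert sha256_short(b"abc", True) == hashlib.sha256(b"abc").digest()
```

The model only shows that the arithmetic is equivalent; whether it shortens the real critical path depends on placement and routing, as the post already notes.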

Anoynomous
August 18, 2011, 02:02:19 AM
|
If the above solution is applied, the calculation of new_w will be the new critical path:

    new_w = s1_w + rx_w[319:288] + s0_w + rx_w[31:0];

Again, s0_w can be calculated a loop ahead and added to rx_w[31:0]. This way new_w is shortened to:

    new_w = s1_w + rx_w[319:288] + rx_w[31:0];

decreasing the critical path and possibly increasing the clock frequency.

Can anybody tell me the percentage of LUTs utilized after synthesis? There may be a possibility of replacing the adder logic.

fpgaminer (OP)
August 18, 2011, 04:30:49 AM
|
Quote from: Anoynomous
Well, I had used LX150_test, and I don't have an LX150 dev board, so I think I will just share my ideas here.

Oops, sorry, LX150_Test isn't really usable at the moment. I really need to add a useful README outlining all those different project variations ...

Thank you for contributing your idea! Please take a look at the project variation I linked: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/LX150_makomk_Test

You will find that your idea, for the most part, has already been implemented in there. Specifically look around this line.

BUT: You did point something out that I think I missed. In the code I linked you'll see that the pre-calculated T1 value is stored in a separate register, not tx_state[7] as you listed in your example. On looking at my code, I believe you are correct; tx_state[7] is never used (except for the last round) so it could be removed or replaced with the partial calculation. Good catch, Anoynomous! Not sure if the compiler catches this optimization automatically or not.

Quote from: Anoynomous
Again, s0_w can be calculated a loop ahead and added to rx_w[31:0]. This way new_w is shortened to:

Now that, I hadn't thought of. Another fantastic catch, Anoynomous! Double check me on this:

    tx_pre_w <= s0(rx_w[2]) + rx_w[1]; // Calculate the next round's s0 + the next round's w[0].
    tx_new_w <= s1(rx_w[14]) + rx_w[9] + rx_pre_w;

Quote from: Anoynomous
If the above solution is applied, the calculation of new_w will be the new critical path:

The calculation of tx_state[0] is the current critical path:

    t1 = rx_t1_part + e1_w + ch_w
    tx_state[0] <= t1 + e0_w + maj_w;

Which is actually pretty good, since it's implemented as only two adders.
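
The "double check me" request can be checked in software. The sketch below models the message schedule as a sliding 16-word window (index 0 is the oldest word, an assumption about how rx_w is indexed) and compares the pre_w form against the textbook recurrence. The function names and test words are made up for illustration.

```python
MASK = 0xFFFFFFFF

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def s0(x): return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)
def s1(x): return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)

def schedule_reference(first16):
    """Textbook SHA-256 message schedule, 64 words."""
    w = list(first16)
    for t in range(16, 64):
        w.append((s1(w[t - 2]) + w[t - 7] + s0(w[t - 15]) + w[t - 16]) & MASK)
    return w

def schedule_pre_w(first16):
    """Same schedule, with s0 + w[0] formed one step ahead, as in the post."""
    window = list(first16)                           # window[0] is the oldest word
    out = list(first16)
    pre_w = (s0(window[1]) + window[0]) & MASK       # seed for the first step
    for _ in range(48):
        new_w = (s1(window[14]) + window[9] + pre_w) & MASK
        pre_w = (s0(window[2]) + window[1]) & MASK   # next step's s0 term + w[0]
        window = window[1:] + [new_w]                # shift the window by one word
        out.append(new_w)
    return out

words = [(0x9E3779B1 * (i + 1)) & MASK for i in range(16)]
assert schedule_pre_w(words) == schedule_reference(words)
```

So the two-line form is arithmetically equivalent, provided pre_w is seeded correctly for the first step; the model does not address register count or timing.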

Anoynomous
August 18, 2011, 05:26:10 AM
|
Quote from: fpgaminer
Double check me on this:
    tx_pre_w <= s0(rx_w[2]) + rx_w[1]; // Calculate the next round's s0 + the next round's w[0].
    tx_new_w <= s1(rx_w[14]) + rx_w[9] + rx_pre_w;

Right, though tx_pre_w can be stored in place of the w[0] that is passed on to the next loop; that will save a register.

makomk
August 18, 2011, 09:38:55 AM
|
Quote from: fpgaminer
BUT: You did point something out that I think I missed. In the code I linked you'll see that the pre-calculated T1 value is stored in a separate register, not tx_state[7] as you listed in your example. On looking at my code, I believe you are correct; tx_state[7] is never used (except for the last round) so it could be removed or replaced with the partial calculation. Good catch, Anoynomous!
Not sure if the compiler catches this optimization automatically or not.

I'm reasonably sure Altera's compiler for Cyclone IV does, because of the large decrease in resource usage. On Cyclone IV it uses fewer resources to store the partially pre-calculated T1 value than it does to store tx_state[`IDX(7)], because registering logic outputs is practically free but registering the output of another register ties up an entire LE per bit that can't be used for anything else. No idea if Xilinx's tools catch this though.

Quote from: fpgaminer
Double check me on this:
    tx_pre_w <= s0(rx_w[2]) + rx_w[1]; // Calculate the next round's s0 + the next round's w[0].
    tx_new_w <= s1(rx_w[14]) + rx_w[9] + rx_pre_w;

Quote from: Anoynomous
Right, though tx_pre_w can be stored in place of the w[0] that is passed on to the next loop; that will save a register.

Oooh, cunning - nice one Anoynomous! Costs a register overall due to having to get rx_w[2] out of storage, but might be worthwhile. In theory could it be cheaper to do this with s1(rx_w[14]) + rx_w[9] instead?

    tx_pre_w <= s1(rx_w[15]) + rx_w[10]; // Calculate the next round's s1 + the next round's w[9].
    tx_new_w <= s0(rx_w[1]) + rx_w[0] + rx_pre_w;
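
The alternative split proposed at the end of this post can be sanity-checked the same way. This is a sketch of the arithmetic only, with the same assumed window convention (index 0 is the oldest word, index 15 the newest) and made-up function names; it says nothing about which variant is cheaper in registers.

```python
MASK = 0xFFFFFFFF

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def s0(x): return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)
def s1(x): return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)

def schedule_reference(first16):
    """Textbook SHA-256 message schedule, 64 words."""
    w = list(first16)
    for t in range(16, 64):
        w.append((s1(w[t - 2]) + w[t - 7] + s0(w[t - 15]) + w[t - 16]) & MASK)
    return w

def schedule_pre_s1(first16):
    """Variant that forms s1 + w[9] one step ahead instead of s0 + w[0]."""
    window = list(first16)
    out = list(first16)
    pre_w = (s1(window[14]) + window[9]) & MASK       # seed for the first step
    for _ in range(48):
        new_w = (s0(window[1]) + window[0] + pre_w) & MASK
        # After the shift, window[15] sits at index 14 and window[10] at index 9.
        pre_w = (s1(window[15]) + window[10]) & MASK
        window = window[1:] + [new_w]
        out.append(new_w)
    return out

words = [(0x01234567 * (i + 1)) & MASK for i in range(16)]
assert schedule_pre_s1(words) == schedule_reference(words)
```

Note that this variant reads the newest word (index 15) in the same step that produced it one cycle earlier, which is fine in this model but is the kind of dependency that would need checking in simulation.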

mb300sd
August 19, 2011, 12:17:28 AM
|
Do you know if the code will run/fit on a Spartan XC2S30? I have a ProxMark3 (RFID hacking tool) whose FPGA I've been playing with, and I'm wondering if it's capable of mining. I can deal with the ARM code to interface between the FPGA and USB.

ngzhang
August 19, 2011, 03:20:07 AM
|
Quote from: mb300sd
Do you know if the code will run/fit on a Spartan XC2S30? I have a ProxMark3 (RFID hacking tool) whose FPGA I've been playing with, and I'm wondering if it's capable of mining. I can deal with the ARM code to interface between the FPGA and USB.

I'm very sad to say, it is impossible. The LX150 we use has approx. 150,000 logic cells, but the XC2S30 has fewer than 1,000 of them. In addition, the logic cells in Spartan-6 are far more capable than those in Spartan-2.

rph
August 20, 2011, 09:18:23 PM
|
There are ways to get the critical path down to a single 2-input 32 bit adder. If you think carefully about what you're building.
-rph

newMeat1
August 21, 2011, 03:16:52 AM
|
I sure hope you're right!

fpgaminer (OP)
August 21, 2011, 09:07:06 AM
|
Quote from: rph
There are ways to get the critical path down to a single 2-input 32 bit adder. If you think carefully about what you're building.

You want 3-input adders on 6 series Spartans, not 2-input. And yes, of course you can reduce the critical path to a single adder, but it requires an immense quantity of registers. And before you suggest it, don't tell me to run the FPGA faster to avoid extra pipeline registers. Spartan-6 isn't designed to run faster than ~250MHz. The memory doesn't run faster than that, and I think even the DSPs top out at that level.

Venkatesh Srinivas
August 21, 2011, 02:13:52 PM
|
For anyone who has run this design on the LX9 microboard, what sort of hashrate did you get? And how many slices were used (and at what unrolling level?).
Thanks, -- vs