Bitcoin Forum
Author Topic: Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013)  (Read 432863 times)
iidx
Newbie
August 14, 2011, 11:40:18 PM
 #481

Quote
200MH is simply way out of the question for an S6-LX150.
That won't stop me from trying ;D

As far as I can tell with the poking around I've done so far, the current bottleneck on the S6-LX150 is the far dependencies caused by the W calculations. These references make it so that the rounds are not isolated, and so cannot be routed into a uniform chain. This forces ISE to do completely absurd routing, splattering the placement of a round's components across a good 1/4th of the chip. And that, obviously, leads to massive routing delays. On my last few compiles, the worst-case paths were >80% routing (8ns+ of routing, with 2ns of logic).

Yeah, it looks like a "giant snake" that traverses the chip :D

Quote
The current critical path is approximately two 3-way 32-bit adders implemented as 16 total slices, thanks to the Spartan-6 fast carry-lookahead chains. Is there a means of optimizing that logic that I have missed?

These are the adders that I tried to move into DSP48s, as they have dedicated carry paths to and from adjacent DSPs in a column. I didn't look at how to optimize the actual math/operations, though.
ngzhang
Hero Member
August 15, 2011, 04:25:12 AM
Last edit: August 15, 2011, 05:06:59 AM by ngzhang
 #482

Quote
200MH is simply way out of the question for an S6-LX150.
That won't stop me from trying ;D
Good luck! ;)


Besides pipelining, there is another way to improve performance in IC design: logic replication.
I haven't read the code yet (I'm putting all my spare time into the dual XC6SLX150 mining board design), but after reading this thread: if we are facing terrible routing problems, why not try another architecture?
One possible way is to implement a rolled-up, optimized core, like a calculation unit (maybe better built with DSP48As) around a single 512-bit register (maybe using LUTs as distributed RAM instead of registers), running at 200 MHz+ (which is very possible), at about 64 clocks per hash. We could then implement 100+ of them per chip.
This way we could also get a very high MH/s.

Of course, I'm not an expert; this is just for discussion.
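
A back-of-the-envelope check of this proposal, as a minimal Python sketch. The 100 cores, 200 MHz and 64 clocks are the figures from the post; the function name and the assumption that one Bitcoin hash costs two SHA-256 compressions (with the first block's midstate reused) are illustrative additions, not something the post states.

```python
def rolled_core_hashrate(cores, clock_hz, clocks_per_compression=64,
                         compressions_per_hash=2):
    """Estimate Bitcoin hashes per second for many small rolled-up cores.

    Assumption: each core finishes one SHA-256 compression every
    `clocks_per_compression` clocks, and one Bitcoin hash needs
    `compressions_per_hash` compressions.
    """
    return cores * clock_hz / (clocks_per_compression * compressions_per_hash)


# Figures from the post: 100 cores at 200 MHz, 64 clocks per compression.
rate = rolled_core_hashrate(cores=100, clock_hz=200e6)
print(f"{rate / 1e6:.2f} MH/s")
```

Under these assumptions the estimate depends entirely on whether 100 such cores really fit and route at 200 MHz, which the post leaves open.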


EDIT1:
I found this:
http://www.heliontech.com/downloads/fast_hash_xilinx_datasheet.pdf#view=Fit
This commercial IP core uses 309 slices (the SLX150 has 23,038 of them) and delivers a throughput of 977 Mbps.
If we use 80% of the slices of one SLX150, we can implement 60 of these cores, for a total throughput of 58 Gbps, or about 228 MH/s.

So reaching 200 MH/s is very possible, isn't it?
lame.duck
Legendary
August 15, 2011, 09:54:20 AM
 #483


Quote
EDIT1:
I found this:
http://www.heliontech.com/downloads/fast_hash_xilinx_datasheet.pdf#view=Fit
This commercial IP core uses 309 slices (the SLX150 has 23,038 of them) and delivers a throughput of 977 Mbps.
If we use 80% of the slices of one SLX150, we can implement 60 of these cores, for a total throughput of 58 Gbps, or about 228 MH/s.

So reaching 200 MH/s is very possible, isn't it?

Hm, the datasheet gives 126 MHz and 1 clock cycle per hashing round; I would interpret these numbers as approx. 1 MHash/s for Bitcoin hashing.

There are some papers on the net on SHA-2 cores (McEvoy and another one) that run at 120 MHz on a quite old Virtex-2 using about 1k LUTs/slices (?) and reach similar performance numbers; they use a pipelined design that needs ca. 68 rounds for a single SHA-2 hash.

But regarding resource usage, these are only the numbers for a single core, not how the design scales up to an FPGA full of cores.
ngzhang
Hero Member
August 15, 2011, 10:06:59 AM
 #484


Quote
Hm, the datasheet gives 126 MHz and 1 clock cycle per hashing round; I would interpret these numbers as approx. 1 MHash/s for Bitcoin hashing.

There are some papers on the net on SHA-2 cores (McEvoy and another one) that run at 120 MHz on a quite old Virtex-2 using about 1k LUTs/slices (?) and reach similar performance numbers; they use a pipelined design that needs ca. 68 rounds for a single SHA-2 hash.

But regarding resource usage, these are only the numbers for a single core, not how the design scales up to an FPGA full of cores.

Apologies if I am misunderstanding, but the datasheet says the IP core can run at 126 MHz and provide a throughput of 977 Mbps.


That means approx. 8 bits/clk.
It also means a Bitcoin hash rate of 3.8 MH/s (1 Bitcoin hash is 256 bits of data, is that right?).

lame.duck
Legendary
August 15, 2011, 11:58:46 AM
 #485

Quote
Apologies if I am misunderstanding, but the datasheet says the IP core can run at 126 MHz and provide a throughput of 977 Mbps.

That means approx. 8 bits/clk.
It also means a Bitcoin hash rate of 3.8 MH/s (1 Bitcoin hash is 256 bits of data, is that right?).

IMHO no. One Bitcoin hash takes two 'normal' SHA-256 hashes, but that would still give 1.9 MHash/s, which is double what the MHz/64 = MHash/s assumption gives. I have no clue how the throughput is counted; my understanding so far was that it is the data rate for hashing 64-byte (512-bit) chunks of input data. (If the input to be hashed is larger than one chunk, it is processed chunk by chunk, each folded into the 256-bit state, so the output does not grow in size.)
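
The competing readings of the datasheet figure can be put side by side in a few lines of Python. This is only a sketch: it assumes the 977 Mbps counts message (input) bits, which the thread does not confirm, and that a Bitcoin hash costs two 512-bit compressions when the first block's midstate is reused. The function names are made up for illustration.

```python
BLOCK_BITS = 512   # SHA-256 consumes the message in 512-bit blocks
DIGEST_BITS = 256  # and produces a 256-bit digest


def bits_per_clock(throughput_bps, clock_hz):
    """Average number of bits processed per clock cycle."""
    return throughput_bps / clock_hz


def clocks_per_block(throughput_bps, clock_hz):
    """Clocks per 512-bit block, if throughput counts input bits."""
    return BLOCK_BITS * clock_hz / throughput_bps


def hashes_per_second(throughput_bps, bits_per_hash):
    """Hash rate implied by charging `bits_per_hash` bits to each hash."""
    return throughput_bps / bits_per_hash


throughput = 977e6  # datasheet throughput quoted in the thread
clock = 126e6       # datasheet clock quoted in the thread

print("bits per clock:      ", round(bits_per_clock(throughput, clock), 2))
print("clocks per block:    ", round(clocks_per_block(throughput, clock), 1))
print("256 bits per hash:   ", round(hashes_per_second(throughput, DIGEST_BITS) / 1e6, 2), "MH/s")
print("2 blocks per hash:   ", round(hashes_per_second(throughput, 2 * BLOCK_BITS) / 1e6, 2), "MH/s")
```

If the input-bit reading is right, the core spends roughly 66 clocks per block, which fits the "1 clock cycle per hashing round" remark, and the per-core Bitcoin rate lands near the 1 MHash/s estimate rather than 3.8 MH/s.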
makomk
Hero Member
August 16, 2011, 02:00:59 PM
 #486

Quote
200MH is simply way out of the question for an S6-LX150.
That won't stop me from trying ;D

As far as I can tell with the poking around I've done so far, the current bottleneck on the S6-LX150 is the far dependencies caused by the W calculations. These references make it so that the rounds are not isolated, and so cannot be routed into a uniform chain. This forces ISE to do completely absurd routing, splattering the placement of a round's components across a good 1/4th of the chip. And that, obviously, leads to massive routing delays. On my last few compiles, the worst-case paths were >80% routing (8ns+ of routing, with 2ns of logic).
I saw similar failures at one point. Try enabling register duplication for the Map stage and/or register rebalancing during synthesis. I think I can probably hit at least 140 MHz for 70 Mhash/s on SLX75 with two pipeline stages per round and both of those enabled, plus some other bits, but I need to fix some stuff and test the changes in simulation.

Quad XC6SLX150 Board: 860 MHash/s or so.
SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
lame.duck
Legendary
August 17, 2011, 01:43:24 PM
 #487

Finally got around to coding some maximum clock speed improvements for users of smaller Cyclone III and IV devices - now available from my new partial-unroll-speed branch. Expected minimum device size and speed is roughly as follows:

So far I've got:

EP3C25C6 135 MHz
EP2C35C6 111 (108) MHz
EP2C35C8 80 MHz

for the 85 °C slow timing model, after playing with the options suggested by the Timing Optimization Advisor. One observation: rerunning the compile a second time does not always give a better or equal result (with the timing-driven options on), so it could be wise to work with revisions, or to make some other provision for keeping the best bitstream.

One idea I got from these numbers: for the cases addressed here, would it perform better to use only one pipeline that computes both hashes alternately, at a lower resource count, even though the two pipelines are not 100% identical?
Anoynomous
Newbie
August 17, 2011, 09:57:44 PM
 #488

Hi all,
I'm having a little trouble here. I have some experience designing a SHA-1 hash cracker on an FPGA, so this project caught my interest. When I downloaded the code and tried to compile it for the S6 LX150, it took about an hour just to synthesize, and then the software said I had overused my resources. So I wanted to know: where did I go wrong?
fpgaminer (OP)
Hero Member
August 17, 2011, 10:54:34 PM
 #489

Quote
I'm having a little trouble here. I have some experience designing a SHA-1 hash cracker on an FPGA, so this project caught my interest. When I downloaded the code and tried to compile it for the S6 LX150, it took about an hour just to synthesize, and then the software said I had overused my resources. So I wanted to know: where did I go wrong?
Which project did you use?

For S6-LX150, this is probably the preferred project to start from:
https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/LX150_makomk_Test
You'll want to adjust main_pll.v:98 to 5 for 50MHz, to make the compile easier and the firmware actually usable (assuming you have the S6-LX150T dev board) without cooling.

Anoynomous
Newbie
August 18, 2011, 01:46:18 AM
 #490

For S6-LX150, this is probably the preferred project to start from:
https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/LX150_makomk_Test
You'll want to adjust main_pll.v:98 to 5 for 50MHz, to make the compile easier and the firmware actually usable (assuming you have the S6-LX150T dev board) without cooling.

Well, I had used LX150_Test, and I don't have an LX150 dev board, so I think I'll just share my ideas here.

The critical path in this circuit is "t1 = rx_state[`IDX(7)] + e1_w + ch_w + rx_w[31:0] + k".
But k and rx_w[31:0] can be calculated one loop ahead and added to rx_state[`IDX(7)] at the point below:

state_buf[`IDX(7)] <= rx_state[`IDX(6)];


The new code should look like this:
state_buf[`IDX(7)] <= rx_state[`IDX(6)] + rx_w[31:0] + k;
----> where k and rx_w are those of the next loop

This reduces the adders to:
t1 = rx_state[`IDX(7)] + e1_w + ch_w;

This should improve clock speed, provided routing issues don't interfere.
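
The idea can be sanity-checked in software before touching the HDL. The following Python sketch models one SHA-256 round both ways and compares the results over 64 rounds. It is a behavioural model only, not the project's Verilog: the function names are invented, the partial sum is kept in a separate variable (`t1_part`) rather than in the h slot so the final states can be compared directly, and random 32-bit words stand in for the real K constants and W values, since the identity does not depend on what they are.

```python
import random

MASK = 0xFFFFFFFF  # all additions are modulo 2**32


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def e0(x):
    return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22)


def e1(x):
    return rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25)


def ch(x, y, z):
    return ((x & y) ^ (~x & z)) & MASK


def maj(x, y, z):
    return (x & y) ^ (x & z) ^ (y & z)


def round_direct(state, w, k):
    """Textbook round: T1 is a five-operand sum."""
    a, b, c, d, e, f, g, h = state
    t1 = (h + e1(e) + ch(e, f, g) + w + k) & MASK
    t2 = (e0(a) + maj(a, b, c)) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)


def round_precomputed(state, t1_part, w_next, k_next):
    """Round where h + w + k for this round arrives already summed."""
    a, b, c, d, e, f, g, _h = state
    t1 = (t1_part + e1(e) + ch(e, f, g)) & MASK  # three-operand sum
    t2 = (e0(a) + maj(a, b, c)) & MASK
    next_state = ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)
    # The next round's h is this round's g, so its partial sum is known now.
    next_part = (g + w_next + k_next) & MASK
    return next_state, next_part


def compare(rounds=64, seed=0):
    """Return True if both formulations end in the same state."""
    rng = random.Random(seed)
    w = [rng.getrandbits(32) for _ in range(rounds)]
    k = [rng.getrandbits(32) for _ in range(rounds)]
    init = tuple(rng.getrandbits(32) for _ in range(8))

    direct = init
    for t in range(rounds):
        direct = round_direct(direct, w[t], k[t])

    pre = init
    part = (init[7] + w[0] + k[0]) & MASK  # prime the first partial sum
    for t in range(rounds):
        last = t + 1 == rounds
        pre, part = round_precomputed(pre, part,
                                      0 if last else w[t + 1],
                                      0 if last else k[t + 1])
    return direct == pre


print(compare())
```

The model says nothing about timing or routing; it only confirms that moving the `w + k` addition one round earlier leaves the computed state unchanged.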





Anoynomous
Newbie
August 18, 2011, 02:02:19 AM
 #491

If the above solution is applied, the calculation of new_w becomes the new critical path:
new_w = s1_w + rx_w[319:288] + s0_w + rx_w[31:0];

Again, s0_w can be calculated a loop ahead and added to rx_w[31:0]. This way new_w is shortened to:

new_w = s1_w + rx_w[319:288] + rx_w[31:0];

where rx_w[31:0] now already holds the pre-added s0 term, decreasing the critical path and possibly increasing the clock frequency.

Can anybody tell me the percentage of LUTs utilized after synthesis? There may be a possibility of replacing the adder logic :)

fpgaminer (OP)
Hero Member
August 18, 2011, 04:30:49 AM
 #492

Quote
Well, I had used LX150_Test, and I don't have an LX150 dev board, so I think I'll just share my ideas here.
Oops, sorry, LX150_Test isn't really usable at the moment. I really need to add a useful README outlining all those different project variations ...

Thank you for contributing your idea!

Please take a look at the project variation I linked: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/LX150_makomk_Test

You will find that your idea, for the most part, has already been implemented in there. Specifically look around this line.

BUT: You did point something out that I think I missed. In the code I linked you'll see that the pre-calculated T1 value is stored in a separate register, not tx_state[7] as you listed in your example. On looking at my code, I believe you are correct; tx_state[7] is never used (except for the last round) so it could be removed or replaced with the partial calculation. Good catch, Anoynomous!

Not sure if the compiler catches this optimization automatically or not.

Quote
Again, s0_w can be calculated a loop ahead and added to rx_w[31:0]. This way new_w is shortened to:
Now that, I hadn't thought of. Another fantastic catch, Anoynomous!

Double check me on this:

Code:
tx_pre_w <= s0(rx_w[2]) + rx_w[1];     // Calculate the next round's s0 + the next round's w[0].
tx_new_w <= s1(rx_w[14]) + rx_w[9] + rx_pre_w;

Quote
If the above solution is applied, the calculation of new_w becomes the new critical path...
The calculation of tx_state[0] is the current critical path:
Code:
t1 = rx_t1_part + e1_w + ch_w;
tx_state[0] <= t1 + e0_w + maj_w;
Which is actually pretty good, since it's implemented as only two adders.

Anoynomous
Newbie
August 18, 2011, 05:26:10 AM
 #493

Double check me on this:

Code:
tx_pre_w <= s0(rx_w[2]) + rx_w[1];     // Calculate the next round's s0 + the next round's w[0].
tx_new_w <= s1(rx_w[14]) + rx_w[9] + rx_pre_w;

Right. Though tx_pre_w can be stored in w[0]'s place, which is passed on to the next loop anyway; that saves a register.
makomk
Hero Member
August 18, 2011, 09:38:55 AM
 #494

BUT: You did point something out that I think I missed. In the code I linked you'll see that the pre-calculated T1 value is stored in a separate register, not tx_state[7] as you listed in your example. On looking at my code, I believe you are correct; tx_state[7] is never used (except for the last round) so it could be removed or replaced with the partial calculation. Good catch, Anoynomous!

Not sure if the compiler catches this optimization automatically or not.
I'm reasonably sure Altera's compiler for Cyclone IV does because of the large decrease in resource usage. On Cyclone IV it uses less resources to store the partially pre-calculated T1 value than it does to store tx_state[`IDX(7)] because registering logic outputs is practically free but registering the output of another register ties up an entire LE per bit that can't be used for anything else. No idea if Xilinx's tools catch this though.

Double check me on this:

Code:
tx_pre_w <= s0(rx_w[2]) + rx_w[1];     // Calculate the next round's s0 + the next round's w[0].
tx_new_w <= s1(rx_w[14]) + rx_w[9] + rx_pre_w;

right.. though the tx_pre_w can be saved at w[0]'s place that is to be transmitted to next loop, it will save a register..


Oooh, cunning - nice one Anoynomous! Costs a register overall due to having to get rx_w[2] out of storage, but might be worthwhile. In theory could it be cheaper to do this with s1(rx_w[14]) + rx_w[9] instead?

Code:
tx_pre_w <= s1(rx_w[15]) + rx_w[10];     // Calculate the next round's s1 + the next round's w[9].
tx_new_w <= s0(rx_w[1]) + rx_w[0] + rx_pre_w;
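
Both pre-calculation variants discussed here (the s0 half and the s1 half) can be checked against the plain four-operand message schedule with a small Python model. This is a sketch under assumptions: `w` is a 16-word sliding window with `w[0]` the oldest word, mirroring the `rx_w[n]` indexing in the snippets above; the function names are invented; and a random block stands in for real work data.

```python
import random

MASK = 0xFFFFFFFF  # all additions are modulo 2**32


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def s0(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def s1(x):
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def schedule_direct(block, total=64):
    """Reference: new_w is a four-operand sum computed in one step."""
    w, out = list(block), list(block)
    while len(out) < total:
        new_w = (s1(w[14]) + w[9] + s0(w[1]) + w[0]) & MASK
        out.append(new_w)
        w = w[1:] + [new_w]
    return out


def schedule_pre_s0(block, total=64):
    """Pre-add s0(w[1]) + w[0] one step ahead (the tx_pre_w idea)."""
    w, out = list(block), list(block)
    pre = (s0(w[1]) + w[0]) & MASK            # prime the first step
    while len(out) < total:
        new_w = (s1(w[14]) + w[9] + pre) & MASK
        pre = (s0(w[2]) + w[1]) & MASK        # next step's s0 term + w[0]
        out.append(new_w)
        w = w[1:] + [new_w]
    return out


def schedule_pre_s1(block, total=64):
    """Alternative: pre-add s1(w[14]) + w[9] one step ahead instead."""
    w, out = list(block), list(block)
    pre = (s1(w[14]) + w[9]) & MASK           # prime the first step
    while len(out) < total:
        new_w = (s0(w[1]) + w[0] + pre) & MASK
        pre = (s1(w[15]) + w[10]) & MASK      # next step's s1 term + w[9]
        out.append(new_w)
        w = w[1:] + [new_w]
    return out


rng = random.Random(0)
block = [rng.getrandbits(32) for _ in range(16)]
reference = schedule_direct(block)
print(schedule_pre_s0(block) == reference, schedule_pre_s1(block) == reference)
```

Either variant only ever reads words already present in the current window, so neither depends on the `new_w` being produced in the same step. Which one is cheaper in registers is a question for the synthesis tools, not for this model.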

mb300sd
Legendary
August 19, 2011, 12:17:28 AM
 #495

Do you know if the code will run/fit on a Spartan XC2S30? I have a ProxMark3 (RFID hacking tool) whose FPGA I've been playing with, and I'm wondering if it's capable of mining. I can deal with the ARM code to interface between the FPGA and USB.

1D7FJWRzeKa4SLmTznd3JpeNU13L1ErEco
ngzhang
Hero Member
August 19, 2011, 03:20:07 AM
 #496

Do you know if the code will run/fit on a Spartan XC2S30? I have a ProxMark3 (RFID hacking tool) that I've been playing with the FPGA on, wondering if its capable of mining.. I can deal with the ARM code to interface between the FPGA and USB.

I'm very sad to say it is impossible...
The LX150 we use has approx. 150,000 logic cells, but the XC2S30 has fewer than 1,000.
In addition, the logic cells in Spartan-6 are far more capable than those in Spartan-II.
rph
Full Member
August 20, 2011, 09:18:23 PM
 #497

There are ways to get the critical path down to a single 2-input 32 bit adder.
If you think carefully about what you're building.

-rph

Ultra-Low-Cost DIY FPGA Miner: https://bitcointalk.org/index.php?topic=44891
newMeat1
Full Member
August 21, 2011, 03:16:52 AM
 #498

I sure hope you're right!

fpgaminer (OP)
Hero Member
August 21, 2011, 09:07:06 AM
 #499

Quote
There are ways to get the critical path down to a single 2-input 32 bit adder.
If you think carefully about what you're building.
You want 3-input adders on 6 series Spartans, not 2-input. And yes, of course you can reduce the critical path to a single adder, but it requires an immense quantity of registers.

And before you suggest it, don't tell me to run the FPGA faster to avoid extra pipeline registers :P. Spartan-6 isn't designed to run faster than ~250MHz. The memory doesn't run faster than that, and I think even the DSPs top out at that level.

Venkatesh Srinivas
Newbie
August 21, 2011, 02:13:52 PM
 #500

For anyone who has run this design on the LX9 microboard, what sort of hashrate did you get? And how many slices were used (and at what unrolling level)?

Thanks,
-- vs