Bitcoin Forum
Author Topic: Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013)  (Read 432945 times)
kingcoin
April 15, 2013, 08:17:40 AM
 #741

I have just pushed the experimental KC705 code to the repo.  Here is the project.  This is a DSP48E1 based design, and I have compiled and run it at 400MH/s.  I

Great! Thank you. I thought it would be interesting to browse the DSP48 code to see how you achieve the impressive performance.
anomalies
April 17, 2013, 01:36:43 PM
 #742

Hi, total newbie here. :D

I just want to ask: since I got this FPGA for free (my friend bought it and decided not to use it, for whatever reason), could I use it for BTC mining?

Genesys™ Virtex-5 FPGA Development Board
http://www.digilentinc.com/Products/Detail.cfm?Prod=GENESYS

thank you for your kind answer.

regards,
Reggie0
April 17, 2013, 03:02:41 PM
Last edit: April 17, 2013, 05:15:17 PM by Reggie0
 #743

fpgaminer: is there any advantage to using "{a,b,c}<={x,y,z};" instead of "a<=x;b<=y;c<=z;"?
(In my opinion it only helps make the code more readable.)
Reggie0
April 17, 2013, 05:20:49 PM
 #744

Hi, total newbie here. :D

I just want to ask: since I got this FPGA for free (my friend bought it and decided not to use it, for whatever reason), could I use it for BTC mining?

Genesys™ Virtex-5 FPGA Development Board
http://www.digilentinc.com/Products/Detail.cfm?Prod=GENESYS

thank you for your kind answer.

regards,

You can probably use it, but it will be slow, because 50k logic gates are not enough for fully unrolled pipelines. As far as I know, the Spartan-6 LX90T produces 90 MH/s, and it has almost twice as many gates.
ihtfp
April 17, 2013, 05:27:20 PM
 #745

Hi, total newbie here. :D

I just want to ask: since I got this FPGA for free (my friend bought it and decided not to use it, for whatever reason), could I use it for BTC mining?

Genesys™ Virtex-5 FPGA Development Board
http://www.digilentinc.com/Products/Detail.cfm?Prod=GENESYS

thank you for your kind answer.

regards,

You can probably use it, but it will be slow, because 50k logic gates are not enough for fully unrolled pipelines. As far as I know, the Spartan-6 LX90T produces 90 MH/s, and it has almost twice as many gates.
I wouldn't use it.  This FPGA only has 28k flip flops.  The Spartan6 LX150 has 184k for comparison.   As Reggie0 said, you wouldn't be able to use fully unrolled logic.

ihtfp
April 17, 2013, 05:56:49 PM
 #746

Looks good!  I tried to do the same thing on a V6 LX130T (use almost all DSPs and pipeline the rest of the LUT adders), but there aren't enough registers in that device for the tx_w and tx_state delays :(  So many 512- and 256-bit registers...


   If you are short on flip flops, have you considered using the BRAMs?  You would need 11 primitives (there are 264 in the LX130T) to make a 792-bit-wide memory.  You can set the BRAM to 'write first' mode, which will echo the data to the output.  The clk-to-out for unpipelined BRAM is ~2.0ns, which is slower than a FF.
   Since the BRAMs are dual port, you can use both sides of the memory (with different locked addresses) and get enough storage for 48 stages of a fully unrolled algorithm.
   I've never tried this; I was just thinking of how to make use of all the unused BRAM lying around.  I usually run out of LUTs, but I need to rethink whether this is worthwhile with the DSP48 implementation.
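
The 48-stage figure follows from the numbers in this post. As a rough sketch in Python (the 72-bit width per BRAM is an assumption taken from the x72 configuration mentioned later in the thread, and the function name is made up for illustration):

```python
import math

def bram_delay_stages(total_brams, register_bits, bits_per_bram=72, ports_per_bram=2):
    """Estimate how many wide pipeline registers a device's BRAMs could hold.

    Assumes each BRAM exposes `bits_per_bram` data bits per port and that each
    port, locked to its own fixed address, stores one stage's worth of data.
    Returns (BRAMs needed per stage, number of stages that fit).
    """
    brams_per_stage = math.ceil(register_bits / bits_per_bram)
    groups = total_brams // brams_per_stage
    return brams_per_stage, groups * ports_per_bram

# Figures from the post: 792 bits per stage, 264 BRAMs in the LX130T.
per_stage, stages = bram_delay_stages(total_brams=264, register_bits=792)
print(per_stage, stages)  # 11 48
```

This only counts storage; it says nothing about whether the routing to and from the BRAM columns would meet timing.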

anomalies
April 17, 2013, 06:32:01 PM
 #747

@ihtfp: thanks for the info. Now I know why he didn't use it. :D
I've got 5 of those, though. Shame they can't be fully utilised.
iidx
April 17, 2013, 09:18:10 PM
 #748

I think the problem is that linking 11 BRAMs together requires a lot of LUTs for address decode/routing, since the BRAMs are arranged in columns throughout the chip.  Plus, linking 11 together would probably result in a minimum period much higher than 2.0ns (2.0ns is for 1 BRAM, I think).

So, you would need 128 (hashers) * 11 (BRAMs) for one pipeline stage = 1408 total BRAMs.  Of course, you're not suggesting you use BRAM for all the delay.  However, I think the slices you would sacrifice to connect the BRAMs and create their address logic would be more expensive than just using the built in FFs or DMEMs (plus the speed hit).

I'm hoping by floor planning each hashing module I can get to quick speeds.  Currently the logic delay I am facing is only around ~2.0 ns, with the routes taking the rest.  So with some nice routing I would hopefully meet my target.

The V6LX130 isn't even as big as the S6 150, but at least it has DSP48s.

I may also need to cut down the PCIe link from 4x to 1x and reduce its performance settings to regain some of the space that is being used up.

IIDX

ihtfp
April 17, 2013, 09:32:45 PM
 #749

IIDX,
 
    The addressing would be constant, so no decoding would be needed.  The address lines would be tied off to constants.
    The 2.0ns is the clk-to-out time for a data output.  Since all outputs are in parallel (each BRAM configured as x72, and grouped together to give very wide access), the individual BRAM bit delay would not change.  No demuxing of outputs would be necessary.
    The number of BRAMs needed is only half what you show, since you can use both sides (Port A & Port B) independently (assign each side a fixed, but different, address).
    You are right, though, about getting the data from the BRAMs to the LUTs needed for the computation.  There is a routing delay which is probably too large.
    Obviously this is not the optimum solution; I'm only bringing it up as a last resort if the available flip flops have run out.
Regards,
ihtfp
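
To put the two BRAM counts side by side, here is a small Python sketch. It assumes IIDX's figures of 128 pipeline stages and 11 BRAMs per stage, plus the 264 BRAMs quoted for the LX130T; the helper name is hypothetical.

```python
def brams_needed(pipeline_stages, brams_per_stage, use_both_ports=True):
    """Total BRAM primitives needed to hold every stage's delay registers.

    With both ports in use, each group of BRAMs serves two stages, so the
    count is halved (rounded up for an odd number of stages).
    """
    groups = (pipeline_stages + 1) // 2 if use_both_ports else pipeline_stages
    return groups * brams_per_stage

single_port = brams_needed(128, 11, use_both_ports=False)
dual_port = brams_needed(128, 11)
available = 264  # BRAM count quoted in the thread for the LX130T

print(single_port, dual_port, dual_port <= available)  # 1408 704 False
```

Even with both ports used, the whole delay chain would not fit, which matches the suggestion that BRAM would only cover part of the delay as a last resort.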




AJRGale
April 18, 2013, 02:39:54 AM
 #750

So, looking into this whole mining-with-FPGA thing, and the code you people are working on: how many logic cells/gates are required for a full rollout? Also, what's the smallest unit you can get it running on (the bare minimum for a half rollout, or whatever you call it)?

I just want to dip my toe into FPGA mining with a cheap and nasty chipset ;)

Just tell me to piddle off elsewhere if it's the wrong spot to ask.
fpgaminer (OP)
April 18, 2013, 05:14:42 AM
 #751

Quote
fpgaminer: is there any advantage to using "{a,b,c}<={x,y,z};" instead of "a<=x;b<=y;c<=z;"?
(In my opinion it only helps make the code more readable.)
No advantage, no.  As you pointed out, it would only be for readability.

ihtfp
April 18, 2013, 04:56:34 PM
 #752

So, looking into this whole mining-with-FPGA thing, and the code you people are working on: how many logic cells/gates are required for a full rollout? Also, what's the smallest unit you can get it running on (the bare minimum for a half rollout, or whatever you call it)?

I just want to dip my toe into FPGA mining with a cheap and nasty chipset ;)

Just tell me to piddle off elsewhere if it's the wrong spot to ask.

AJRGale,
    I think you'll want at least a Spartan6 LX150.  This is the cheapest device I would use.  I would only run a fully pipelined implementation -- one that can do one hash per clock cycle.  If you can get hold of a Kintex7 or Virtex7 board you'll be a lot better off, because you can instantiate more miners.
    fpgaminer has posted a lot of useful code on GitHub.
    I don't speak Altera, so I'm not sure about specific devices.
AJRGale
April 19, 2013, 03:21:57 PM
 #753

So 150K gates? Like a Cyclone V? (I have no idea what the gates-to-logic-cells ratio really is.) So does that mean 75K gates for a half miner?

Either way, I can't find a Spartan6 LX150, but I can find a http://www.adafruit.com/products/451 "DE0-Nano - Altera Cyclone IV FPGA starter board".
A miner could run on it, but from what I've read here only the smallest one, at 5 MH/s...

Maybe I should look at the code and work out how to use the dev suite; it might tell me what it needs to run. I have no idea what I'll be looking at, though :/
senseless
April 19, 2013, 03:41:41 PM
Last edit: April 20, 2013, 03:20:26 AM by senseless
 #754

I have just pushed the experimental KC705 code to the repo.

Thanks!

I ordered my AC701 today. I'm playing with the eval software now. The clocks run a little bit slower than the Kintex line, but it has nearly as many DSPs as the chip you're using. I have high hopes for a minimum of 600 MH/s and am shooting for 800 MH/s. The initial compile shows 92% DSP usage, 43% LUT usage, 67% memory LUT usage and a clock of 345 MHz or so. I should be able to squeeze another core in there.
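
For a fully pipelined core that produces one hash per clock cycle, the hash rate is roughly the clock rate times the number of cores. A minimal Python sketch, assuming that simple model (the `loop_log2` folding parameter is an assumption about how partially unrolled builds scale, and the function name is made up; the core counts are illustrative, not taken from the post):

```python
def estimated_hashrate_mhs(clock_mhz, cores, loop_log2=0):
    """Rough hash rate in MH/s.

    Assumes each fully unrolled core (loop_log2 = 0) finishes one hash per
    clock, and that a core folded by 2**loop_log2 needs that many clocks
    per hash.
    """
    return cores * clock_mhz / (1 << loop_log2)

for cores in (1, 2, 3):
    print(cores, estimated_hashrate_mhs(345, cores))
# 1 345.0
# 2 690.0
# 3 1035.0
```

Under this model, a 345 MHz clock would need two cores to clear 600 MH/s and three to clear 800 MH/s.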

I was wondering, did it really take them 2 weeks to process & ship your unit to you after ordering? That's a big yes. They're not going to ship my card for 2 weeks after ordering :(  Maybe they've got a large order queue? Maybe each card is made to order? No idea. It seems a rather long time to wait, though.

AJR,

If you're going to get into it I would highly recommend you get the 705 or the 701.

http://www.xilinx.com/products/boards-and-kits/EK-K7-KC705-G.htm
http://www.xilinx.com/products/boards-and-kits/EK-A7-AC701-G.htm

The 705 will have room for more hashers, but I believe the Artix chip may be more cost effective.



AJRGale
April 19, 2013, 05:24:32 PM
 #755

I have just pushed the experimental KC705 code to the repo.
....
AJR,

If you're going to get into it I would highly recommend you get the 705 or the 701.
...


That's outside my budget. It would be nice, though.
If I can get a cheap and nasty one going and earn half a dozen coins over the next few months, then I will get one.
senseless
April 19, 2013, 05:52:43 PM
 #756

That's outside my budget. It would be nice, though.
If I can get a cheap and nasty one going and earn half a dozen coins over the next few months, then I will get one.

My biggest problem was software. If you're going for used chips, make sure whatever chip you buy has free development software for it. Wanting licensed software for a dev board is one of the reasons I bought the kit.

AJRGale
April 19, 2013, 08:24:30 PM
 #757

The Nano I was looking at does come with "two CDs with the software necessary to 'compile' and 'upload' code to the board", but I'm not sure if the EP4CE22F17C6N is usable.
fpgaminer (OP)
April 19, 2013, 08:53:05 PM
 #758

Quote
So I need to figure out what additional logic exists besides the SHA-256 module, how do they interact with each other and how do they interact with the SHA-256 modules?
First, please note that there are multiple "flavors" of the hashing code, and for the most part they are optimized for synthesis to FPGA targets.  I would highly suggest you hire an ASIC engineer who can take the time to understand SHA-256 and the needs of the Bitcoin proof of work mining algorithm himself.

Second, the SHA-256 hashing units do need a controlling unit.  As you can see in the modules you linked, there are a few top-level signals that are expected to be driven by a controller.  Most importantly rx_state and rx_input.  And you need a controller to check the results, and talk to the outside world.

This is the top-level module for one of the projects you linked to.  In there you will find the code that controls the sha256_transform instances, and how they are connected together.

kramble
April 19, 2013, 08:57:26 PM
Last edit: April 19, 2013, 09:08:38 PM by kramble
 #759

The Nano I was looking at does come with "two CDs with the software necessary to 'compile' and 'upload' code to the board", but I'm not sure if the EP4CE22F17C6N is usable.

The DE0-Nano is great for getting started learning about FPGAs, but it won't make you any useful coin. 5 MHash/sec is about right; it will go faster, but not without risk of overheating, and certainly no more than about 25 MHash/sec (using Makomk's modified power supply). To put that in context, 5 MHash/sec will currently earn approx 0.0003 bitcoin per day (and roughly 20% less every 2 weeks as the difficulty increases).

If you do decide to get a DE0-Nano, start with the DE2_70_Unoptimized_Pipelined project. You'll need to increase CONFIG_LOG_LOOP2 to 4 to get it to compile (that's one eighth of a core, I think). I cheated and edited the fpgaminer.qsf file directly to configure it for the EP4CE22, but it's probably safer to create a new project from scratch and add in the source files.

Mark

Github https://github.com/kramble BLC BkRaMaRkw3NeyzsZ2zUgXsNLogVVkQ1iPV
AJRGale
April 20, 2013, 06:22:28 AM
 #760

Heh, that's about $1 a month (at the ~$100/coin mark), so that thing is not going to break even in this lifetime, I think.

So I might have to go and hunt down a second-hand Spartan6 with 150K gates (or similar).

So the question is: will any 150K-gate FPGA work with the full miner, or is there something I'm missing? (E.g. http://www.digilentinc.com/Products/Detail.cfm?NavPath=2,400,790&Prod=BASYS2 with 250K gates: slap on the full miner, and bam, 1 hash per clock?)
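
The $1-a-month estimate above can be checked with a short calculation. This is only a sketch: the difficulty of 9.0e6 is an assumed ballpark for April 2013 (it is not stated in the thread), the 25 BTC block reward is the value in effect at that time, and the function names are made up for illustration.

```python
def expected_btc_per_day(hashrate_hs, difficulty, block_reward_btc=25.0):
    """Expected mining income in BTC per day.

    On average, finding a block takes difficulty * 2**32 hashes.
    """
    hashes_per_block = difficulty * 2**32
    blocks_per_day = hashrate_hs * 86_400 / hashes_per_block
    return blocks_per_day * block_reward_btc

def monthly_usd(btc_per_day, usd_per_btc, days=30):
    """Income in USD over `days` days, ignoring difficulty increases."""
    return btc_per_day * days * usd_per_btc

# 5 MHash/sec at an assumed difficulty of 9.0e6 gives about 0.00028 BTC/day,
# in line with the "approx 0.0003" quoted above.
print(expected_btc_per_day(5e6, difficulty=9.0e6))

# The quoted 0.0003 BTC/day at ~$100/coin:
print(round(monthly_usd(0.0003, 100.0), 2))  # 0.9
```

Since the difficulty was said to be rising about 20% every 2 weeks, the real figure would shrink from month to month.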