Bitcoin Forum
Author Topic: Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013)  (Read 432941 times)
kramble
Sr. Member
****
Offline Offline

Activity: 384
Merit: 250



View Profile WWW
March 29, 2013, 10:45:37 PM
 #681

What are the 384 bits for when it sends the new nonce and buffer?

Is that wasted space or used for something?

It's the fixed part of the input to the second round of the sha256 transform (I explained it a few posts ago, probably not too well though).

And yes, there is wasted bandwidth in the altsource_probe comms, but it doesn't matter as this has no effect on the hash rate, since it only does a getwork every few seconds compared to a hash rate of millions per second.
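
For readers following along, here is a minimal Python sketch of where a constant 384-bit block can come from. It assumes (this is one reading of the answer above, not something confirmed in this post) that the 384 bits are the standard SHA-256 padding appended to the 80-byte block header. The function name `sha256_padding` is made up for illustration.

```python
import struct

def sha256_padding(message_len_bytes):
    """Return the padding bytes SHA-256 appends to a message of this length."""
    # One 0x80 byte, then zeros, then the message length in bits as a
    # big-endian 64-bit integer, so the total is a multiple of 64 bytes.
    zeros = (55 - message_len_bytes) % 64
    return b"\x80" + b"\x00" * zeros + struct.pack(">Q", 8 * message_len_bytes)

HEADER_LEN = 80  # a Bitcoin block header is 80 bytes

padding = sha256_padding(HEADER_LEN)
padded_len = HEADER_LEN + len(padding)

# The first 64-byte block is pure header data; the second block holds the
# remaining header bytes (including the 32-bit nonce) plus the padding.
second_block_data_bits = (HEADER_LEN - 64) * 8
second_block_padding_bits = len(padding) * 8

print(second_block_data_bits, second_block_padding_bits)
```

Under that reading, the second 64-byte block carries 128 bits of header data followed by 384 bits that never change from one getwork to the next.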

I guess you'll need to read up on the bitcoin hashing algorithm, but I'm not really the guy to explain it (no expert me, so I'm not going to make a fool of myself trying).

All the best
Mark

Github https://github.com/kramble BLC BkRaMaRkw3NeyzsZ2zUgXsNLogVVkQ1iPV
iidx
Newbie
*
Offline Offline

Activity: 35
Merit: 0


View Profile
March 30, 2013, 12:50:35 AM
 #682

I've trawled this topic and github but am still not sure what the best starting point for a new (Kintex7-325) fpga miner would be.  
Multicore is key and a pointer to a working open source software/fpga combo for a serial interface would be hugely appreciated but any sensible starting point would be fine - I expect to do some work!
I'm poking about with the verilog_xilinx port at the moment.

I started with the verilog_xilinx port back in 2011 to put on a handful of ML605s (V6 240s).  I also have some K7 325s and K7 480s at work, but only did build tests for those because I didn't have permanent access to those boards.

I would suggest starting with verilog_xilinx or one of the Ztex ports.  I used PCIe for mine, so I modified the interfaces to take in 32 bit words.  Unfortunately that means I don't have a starting place for you to use if you are going to try and use the serial port.

However, I would not try to fit more than 3 instances of the fully unrolled verilog_xilinx version into that 325 without changing some of the adders into DSP48s.  On the V6 240 I can fit 3 instances if I use most of the DSP48s to replace some of the adders in the design.  Sadly, the K7 325 doesn't have that many more adders.  I don't think I was successful getting 4 instances of the verilog_xilinx port to fit.

Technically you could actually just make several instances of the entire design... and just use multiple serial ports to talk to it ;)
fpgaminer (OP)
Hero Member
*****
Offline Offline

Activity: 560
Merit: 517



View Profile WWW
March 30, 2013, 09:24:56 AM
 #683

Quote
Not possible. Max per channel is 256 bits.
Quick note, kramble; the max is actually 511-bits.

Quote
Does the X6500 use JTAG for communication or does it use some more effective protocol?
It uses JTAG.  There is an FTDI chip on there that allows bit-banging pins over USB, and so compatible software bit-bangs JTAG to talk to the FPGA.  The entire protocol sitting on top of JTAG is described in jtag_comm.v.

Quote
I've trawled this topic and github but am still not sure what the best starting point for a new (Kintex7-325) fpga miner would be.  
I would recommend starting with the X6000_ztex_comm4 project.  That's the same code that generated the bitstreams on the fpgamining.com website.  You'll want to remove the jtag communication related code and replace it with serial communication, or something else.  You can then multi-core that and exchange some resources for DSP48s as iidx mentioned.

It's also possible to implement a miner using only DSP48s and misc. logic, achieving about 500MH/s.  I haven't released any code for that yet.

kingcoin
Sr. Member
****
Offline Offline

Activity: 262
Merit: 250


View Profile
March 30, 2013, 09:59:50 AM
 #684

It's also possible to implement a miner using only DSP48s and misc. logic, achieving about 500MH/s.  I haven't released any code for that yet.

On a single core?
fpgaminer (OP)
Hero Member
*****
Offline Offline

Activity: 560
Merit: 517



View Profile WWW
March 30, 2013, 10:57:26 AM
 #685

Quote
On a single core?
Yes.  DSP48E1's can run up to 500MHz, so if you unroll all the calculations, replace all the additions with DSP48E1's, register everything, and throw in misc. logic for the non-linear calculations you can get about 500MH/s, depending on the speed grade.  It all fits into a Kintex-7 160 (the 160 has a higher DSP48 density than the 325).  Probably some room left over for a normal hashing core, though I'm not sure.  The DSP design requires quite a lot of registers.

EDIT: And yes, I implemented the design, so it's feasible.  It was never fully debugged though, because at the time the Xilinx simulator couldn't handle Kintex's DSP48E1's very well.
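
The clock-to-hash-rate relationship above can be sketched in a few lines of Python. This assumes a pipelined core that emits one result every `cycles_per_hash` clocks (1 for a fully unrolled pipeline); the function names and the 64-cycle comparison figure are illustrative assumptions, not measurements from any released design.

```python
def hash_rate_mhs(clock_mhz, cores=1, cycles_per_hash=1):
    """Hash rate in MH/s for `cores` identical cores at `clock_mhz`."""
    return clock_mhz * cores / cycles_per_hash

def seconds_per_nonce_range(rate_mhs, nonce_bits=32):
    """Time to sweep the full nonce range at the given hash rate."""
    return (2 ** nonce_bits) / (rate_mhs * 1e6)

unrolled = hash_rate_mhs(500)                    # one result per clock
rolled = hash_rate_mhs(500, cycles_per_hash=64)  # hypothetical rolled core
sweep = seconds_per_nonce_range(unrolled)

print(unrolled, rolled, round(sweep, 2))
```

At 500 MH/s a single core exhausts the 32-bit nonce range in under nine seconds, which gives a feel for how often fresh work has to arrive.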

kingcoin
Sr. Member
****
Offline Offline

Activity: 262
Merit: 250


View Profile
March 30, 2013, 03:38:00 PM
 #686

Quote
On a single core?
Yes.  DSP48E1's can run up to 500MHz

But the Spartan6 fpga fabric does not run at 500MHz. Maybe some clever interleaving might make it possible to run the fabric interface at a lower clock.
fpgaminer (OP)
Hero Member
*****
Offline Offline

Activity: 560
Merit: 517



View Profile WWW
March 31, 2013, 12:27:29 AM
 #687

Quote
But the Spartan6 fpga fabric does not run at 500MHz.
Sorry, I was talking about Kintex 7, which most certainly can.  Kintex 7 has similar performance to the Virtex 6, at less cost.

Quote
What is the best way to run multiple devices (25 through USB)?
Your best bet would be to modify the code to use a serial interface that could be chained, instead of the altsource_probe.  You could do it with the current altsource_probe, but you'd need to modify this code to find multiple devices.  Also, last I checked, Quartus had issues handling more than one device plugged into the same USB controller.

senseless
Hero Member
*****
Offline Offline

Activity: 1118
Merit: 541



View Profile
March 31, 2013, 11:33:53 AM
Last edit: March 31, 2013, 12:44:03 PM by senseless
 #688

I had this idea for a bitcoin fpga design. I'm not an fpga designer/coder, but since you guys are talking about new designs I thought I would throw this in there. The code would be split into 2 segments on different clocks/plls. The first pll would be a master controller on chip that interfaces with the mining software. It will receive nonces from the code and store them in memory (as opposed to sending the nonces direct to the miners). The miners would run on their own pll (separate clock), would read nonces from memory and write any golden nonces back to a different memory segment. Each miner would provision their own pool in memory to hold N nonces and N golden nonces.

Rough flow chart:

Master controller:
Software sends nonce to master controller -> on-chip master controller saves nonce to memory under a hashing core -> on-chip master controller looks for golden nonce in separate memory area -> golden nonce sent back to software for reporting to network/pool

Hasher Cores:
Hashing core reads new nonce from its memory segment -> hashing core performs hashes on this nonce range (flipping nonce in memory) -> if golden nonce is found write to a different memory segment

Software Signals:

Reqs (Requests nonce from software)
Rest (overwrites the nonces in on-chip memory: number of cores * nonce pool size per core = number of nonces to request, then overwrite in memory)
Nonc (sending nonce from software to chip)
Stat (Requesting stats on chip processing speed)
Gnon (sending golden nonce upstream to software)

Thoughts:

Nonces can be flipped in memory and then pulled to start the next hash, so the core works through the whole nonce range. When it detects the last nonce, it starts working on the next nonce in the memory pool. For instance, provision room in memory for 3 nonces per core; once the 4 billion results of nonce 0 are completed, it would start working on nonce 1. Meanwhile, the master controller would see in memory that nonce 0 is finished (completely calculated 4 billion flips) and overwrite that memory segment with a fresh nonce. The reset signal would only need to overwrite every existing memory segment with new nonces; it does not need to reset the cores or make any changes, as nonces are flipped in memory.
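
The flow described above can be modelled as a toy simulation in Python. This is only a sketch of the idea as I read it: what the post calls a "nonce" is treated as a work item, integers stand in for work items, a trivial predicate stands in for the real hash check, and every class and method name is hypothetical.

```python
from collections import deque
from itertools import count

class HasherCore:
    """One hashing core with its own work pool and result area."""

    def __init__(self, pool_size=3):
        self.pool_size = pool_size
        self.work = deque()    # work items waiting to be hashed
        self.golden = deque()  # (work item, nonce) pairs found by this core

    def free_slots(self):
        return self.pool_size - len(self.work)

    def run_one_item(self, is_golden, nonce_range):
        """Sweep the whole nonce range for the oldest work item, if any."""
        if not self.work:
            return
        item = self.work.popleft()
        for nonce in range(nonce_range):
            if is_golden(item, nonce):
                self.golden.append((item, nonce))

class MasterController:
    """Talks to the software side; refills pools and collects results."""

    def __init__(self, cores, work_source):
        self.cores = cores
        self.work_source = work_source

    def refill(self):
        # Top up every core's pool so it never runs dry.
        for core in self.cores:
            while core.free_slots() > 0:
                core.work.append(next(self.work_source))

    def collect(self):
        # Drain the golden-nonce area of every core.
        found = []
        for core in self.cores:
            while core.golden:
                found.append(core.golden.popleft())
        return found

    def reset(self):
        """Discard stale work everywhere, then load fresh work."""
        for core in self.cores:
            core.work.clear()
        self.refill()

# Toy run with two cores and a pool of three work items each.
cores = [HasherCore(pool_size=3), HasherCore(pool_size=3)]
master = MasterController(cores, work_source=count(1))
master.refill()  # core 0 gets 1, 2, 3; core 1 gets 4, 5, 6
for core in cores:
    core.run_one_item(lambda item, nonce: item + nonce == 10, nonce_range=8)
results = master.collect()
master.refill()  # each core consumed one item, so each gets one new item
print(results)
```

In hardware the two classes would sit in different clock domains, so the shared pools would need a safe crossing mechanism; the simulation ignores that entirely.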

..

The reason I came up with this sort of idea for a design is that, after playing with the code, the worst case slack seems to be when it reports a golden nonce upstream. Hence you can seriously overclock the design over fmax and it works fine, other than reporting bad results upstream to the software. I'm able to push my clock rate almost up to 275 MHz without the compile failing completely (with edge/corner timing errors). Allowing a master controller to be on its own separate clock/PLL from the hasher cores themselves would allow the fmax of the hashing cores to skyrocket, while you can set the controller at a more conservative level for software communications.

.... Hell my chip has 8 PLLs, could probably put every core on its own PLL so a slow down in one hashing core does not affect the others. (Which would probably be ideal, every "core" would have its own fmax and timing)

..

It would be nice if we could come up with a fully functioning/optimized unrolled multi-core design so anyone could take said design and produce a top level structured asic design (print their own chips). Just make sure to release under a license which requires all modifications to be reported. It's not really THAT expensive to get your own structured asic produced from design, takes awhile to complete but < 100K should be fine at > 90nm. This would also be nice as new technologies come along (14nm altera, etc) we can move to hardcopies of those devices on the same design which should meet performance/power wise to BFL's target estimates for their asic design (or better).




kingcoin
Sr. Member
****
Offline Offline

Activity: 262
Merit: 250


View Profile
March 31, 2013, 03:34:30 PM
 #689

Quote
But the Spartan6 fpga fabric does not run at 500MHz.
Sorry, I was talking about Kintex 7, which most certainly can.  Kintex 7 has similar performance to the Virtex 6, at less cost.

I would imagine getting the fabric to run at 500MHz in a Kintex-7 device is also a challenge. Running the design as-is through Vivado with a 325 speed grade -2 target does not meet timing closure at 250MHz. But of course with careful placement and constraints it might be possible.
kingcoin
Sr. Member
****
Offline Offline

Activity: 262
Merit: 250


View Profile
March 31, 2013, 03:58:29 PM
 #690

This would also be nice as new technologies come along (14nm altera, etc) we can move to hardcopies of those devices on the same design which should meet performance/power wise to BFL's target estimates for their asic design (or better).


How many e.g. Stratix devices do you have to make before the unit cost including the NRE is lower for Hardcopy?


senseless
Hero Member
*****
Offline Offline

Activity: 1118
Merit: 541



View Profile
March 31, 2013, 08:03:26 PM
Last edit: March 31, 2013, 08:51:48 PM by senseless
 #691

This would also be nice as new technologies come along (14nm altera, etc) we can move to hardcopies of those devices on the same design which should meet performance/power wise to BFL's target estimates for their asic design (or better).

...


I don't have the hardcopy prices from Altera or Xilinx directly. But, based on my investigation of some other fabless companies, it should be somewhere around 300K chips @ 45nm and somewhere around 1M chips for 28nm to get competitive pricing. The pricing I was getting at 45nm was around 20$/GHash @ 300K units (2.5-3Gh/s per unit). Keep in mind these were from fabless companies; they were just reselling someone else's services but did their own in-house design conversion. The fabless companies are obviously going to be a bit higher on per-unit and NRE costs, as that's where they get their cash from, as opposed to going direct with Altera's or Xilinx's structured ASIC processes with no middle man.


kingcoin
Sr. Member
****
Offline Offline

Activity: 262
Merit: 250


View Profile
March 31, 2013, 08:23:01 PM
 #692

According to Altera, NRE for 90nm was in the range of $240K to $345K, which is fairly low.  http://www.altera.com/products/devices/hardcopy-asics/about/migration/hrd-migration.html
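
The break-even question can be put into a small Python sketch. Only the NRE range comes from the figures quoted above; the two per-unit prices are hypothetical placeholders chosen purely to make the arithmetic concrete, and `break_even_units` is an illustrative name.

```python
def break_even_units(nre, fpga_unit_cost, asic_unit_cost):
    """Smallest unit count at which NRE + ASIC units costs less than FPGAs.

    Costs are whole dollars. Returns None when the ASIC is not cheaper
    per unit, because then the NRE is never recovered.
    """
    saving_per_unit = fpga_unit_cost - asic_unit_cost
    if saving_per_unit <= 0:
        return None
    # Need n * saving_per_unit > nre, so take the next integer above nre / saving.
    return nre // saving_per_unit + 1

FPGA_UNIT_COST = 150  # hypothetical price per FPGA, NOT from the thread
ASIC_UNIT_COST = 30   # hypothetical price per converted chip, NOT from the thread

low = break_even_units(240_000, FPGA_UNIT_COST, ASIC_UNIT_COST)
high = break_even_units(345_000, FPGA_UNIT_COST, ASIC_UNIT_COST)
print(low, high)
```

With real quotes substituted for the placeholder prices, the same function answers the question for any device pair.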
senseless
Hero Member
*****
Offline Offline

Activity: 1118
Merit: 541



View Profile
March 31, 2013, 08:59:04 PM
Last edit: March 31, 2013, 09:19:51 PM by senseless
 #693

According to Altera, NRE for 90nm was in the range of $240K to $345K, which is fairly low.  http://www.altera.com/products/devices/hardcopy-asics/about/migration/hrd-migration.html

The pricing isn't bad at all, but the only open source designs available to be taken for conversion do not have a very good multi-core base design. Could easily take this design into an Avalon-style 1 chip per core, but that seems like an awful waste of PCB space. The 20$/Ghash pricing before was with a single chip operating at 250MHz with 10 cores on it. It would be 28 chips to reach 68Gh/s as opposed to Avalon's 240 chips to reach that speed. The pricing I got is still a little high. It won't be effective (competition price match) until it hits like 10$/Ghash, at which point people could build their own units for less than the cost of Avalon's, BFL's, etc.

It should be possible to get a miner @ 800$ cost with 70Gh/s @ 200-400W (28nm-45nm).
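
The chip counts and cost target in this post follow from simple arithmetic, sketched here in Python. It assumes each core produces one hash per clock; the function name is illustrative.

```python
import math

def chips_needed(target_ghs, cores_per_chip, clock_mhz):
    """Chips required to reach target_ghs, assuming one hash per clock per core."""
    per_chip_ghs = cores_per_chip * clock_mhz / 1000
    return math.ceil(target_ghs / per_chip_ghs)

# 10 cores at 250 MHz gives 2.5 GH/s per chip.
multi_core_chips = chips_needed(68, cores_per_chip=10, clock_mhz=250)

# 240 chips for the same 68 GH/s works out to roughly 0.28 GH/s per chip.
per_chip_at_240 = 68 / 240

# An $800 miner at 70 GH/s, expressed in dollars per GH/s.
cost_per_ghs = 800 / 70

print(multi_core_chips, round(per_chip_at_240, 3), round(cost_per_ghs, 2))
```

The $800 / 70 GH/s target comes out a little above the 10$/Ghash threshold mentioned above.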

Maybe some sort of non-profit coop to collect funds to get the initial design conversion, mask printing and chips made? Could then just sell chips on as needed basis close to cost.

2112
Legendary
*
Offline Offline

Activity: 2128
Merit: 1073



View Profile
March 31, 2013, 09:40:52 PM
 #694

I would imagine getting the fabric to run at 500MHz in a Kintex-7 device is also a challenge. Running the design as-is through Vivado with a 325 speed grade -2 target does not meet timing closure at 250MHz. But of course with careful placement and constraints it might be possible.
I just wanted to point out one thing: trying to achieve a timing closure is a blind alley. What you should really aim for is power optimization. To my knowledge none of the popular toolchains has such a goal available.

With the unrolled design the fanout of some registers is high enough to trigger combinatorial logic duplication when searching for the closure. I haven't tried Vivado, but ISE was even doing the register duplication. This is exactly what you don't want to do when doing an FPGA design that has to compete with an ASIC design. In the absence of pure power optimization your next-best goal is try to optimize for the area.

I guess working with the two unrolled copies of SHA-256 produces such a wild mess of trees primitives that it is possible to lose one's bearings in the jungle of vines signals.

Please comment, critique, criticize or ridicule BIP 2112: https://bitcointalk.org/index.php?topic=54382.0
Long-term mining prognosis: https://bitcointalk.org/index.php?topic=91101.0
tbd
Newbie
*
Offline Offline

Activity: 45
Merit: 0


View Profile
April 01, 2013, 01:02:27 AM
 #695

Maybe some sort of non-profit coop to collect funds to get the initial design conversion, mask printing and chips made? Could then just sell chips on as needed basis close to cost.

I like this idea.
fpgaminer (OP)
Hero Member
*****
Offline Offline

Activity: 560
Merit: 517



View Profile WWW
April 01, 2013, 06:17:20 AM
 #696

Quote
I would imagine getting the fabric to run at 500MHz in a Kintex-7 device is also a challenge. Running the design as-is through Vivado with a 325 speed grade -2 target does not meet timing closure at 250MHz.
In this case, the DSP48E1's are taking care of the heavy lifting; three and two-way 32-bit addition.  The rest of the fabric only needs to handle registers, routing signals, and the non-linear math.  Obviously the Kintex 7 fabric is capable of handling these frequencies for modest logic, otherwise the DSP's would be unusable in the first place :P

But, as I said before, I already did a rudimentary implementation of this design and synthesized/routed it.  Timing reported ~400MHz on the devkit.
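
To make the split between additions and "non-linear math" concrete, here is a software sketch of one SHA-256 round in Python, following the standard algorithm. The comments marking which part would map to DSP slices and which to ordinary fabric are my reading of the description above, not taken from the unreleased design. The round constants are derived from prime roots rather than typed in, and the result is checked against `hashlib`.

```python
import hashlib
import struct
from math import isqrt

MASK = 0xFFFFFFFF

def icbrt(n):
    """Integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

def first_primes(n):
    primes, c = [], 2
    while len(primes) < n:
        if all(c % p for p in primes):
            primes.append(c)
        c += 1
    return primes

PRIMES = first_primes(64)
K = [icbrt(p << 96) & MASK for p in PRIMES]      # fractional bits of cube roots
H0 = [isqrt(p << 64) & MASK for p in PRIMES[:8]]  # fractional bits of square roots

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def sha256_round(state, k, w):
    a, b, c, d, e, f, g, h = state
    # Bitwise part: rotations, XOR, AND (plain fabric logic in an FPGA).
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & g)
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    # Additions modulo 2**32 (the candidates for DSP slices).
    t1 = (h + s1 + ch + k + w) & MASK
    t2 = (s0 + maj) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)

def compress(state, block):
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        x0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        x1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + x0 + w[i - 7] + x1) & MASK)
    s = tuple(state)
    for i in range(64):  # a fully unrolled core lays these 64 rounds out in space
        s = sha256_round(s, K[i], w[i])
    return [(x + y) & MASK for x, y in zip(state, s)]

def sha256_single_block(msg):
    """SHA-256 of a message short enough (<= 55 bytes) to fit one block."""
    assert len(msg) <= 55
    block = msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))
    return struct.pack(">8I", *compress(H0, block))

print(sha256_single_block(b"abc") == hashlib.sha256(b"abc").digest())
```

Each round contains a handful of 32-bit additions chained together, which is why carry-chain delay tends to dominate and why moving the adders elsewhere is attractive.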

Quote
What you should really aim for is power optimization. To my knowledge none of the popular toolchains has such a goal available.
Quartus, ISE, and Vivado all have options to target minimizing power.  I don't know how good they are at it; probably not very.

Quote
I guess working with the two unrolled copies of SHA-256 produces such a wild mess of trees primitives that it is possible to lose one's bearings in the jungle of vines signals.
Actually, unrolled cores are very straight-forward designs.  The issue is that FPGA's are routing constrained, especially in the Spartan 6's, and the tools aren't designed to handle these sorts of long chains.  The Kintex 7 chips are much nicer with respect to routing resources and consistency.  Also, the newer Vivado Studio tool does a much better job than ISE in my experiences with it so far.  It's a shame Vivado does not support S6.

Quote
Could easily take this design into an avalon style 1 chip per core; but seems like an awful waste of PCB space.
In my opinion, Avalon was smart in this regard and did it right.  Using lots of chips is a very good thing for these early mining ASIC's; I would not have recommended it any other way.  This is because there are rather large Minimum Order Quantities when producing ASIC's.  If you sell 1000 units, each with 4 chips, you aren't going to reach the necessary MOQ's, which are at least 50K chips.  Selling 1000 units, each with 240 chips, puts you in that beautiful quantity where the fabs and factories start giving you the time of day.  And the cost of everything else goes down.  In the long-run, yes, bigger chips are a better idea since they require less overall supporting circuitry and PCB space.
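
The minimum-order argument above reduces to a quick check, sketched here in Python using only the figures from this post (1000 units, 4 or 240 chips per unit, a 50K-chip MOQ). The function names are illustrative.

```python
MOQ_CHIPS = 50_000  # "at least 50K chips", per the post

def chips_ordered(units_sold, chips_per_unit):
    return units_sold * chips_per_unit

def meets_moq(units_sold, chips_per_unit, moq=MOQ_CHIPS):
    return chips_ordered(units_sold, chips_per_unit) >= moq

few_big_chips = chips_ordered(1000, 4)       # 4,000 chips
many_small_chips = chips_ordered(1000, 240)  # 240,000 chips

print(meets_moq(1000, 4), meets_moq(1000, 240))
```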

Slightly related:  I would not recommend fully unrolled cores for an ASIC design.  It will certainly result in higher performance per area due to optimizations unique to the unrolled designs, but it means higher failure rates and lower clock speeds due to intra-die variations.  Fully rolled cores that can be individually enabled and clocked (or clocked in regions) should give better yield and overclocking.

fpgaminer (OP)
Hero Member
*****
Offline Offline

Activity: 560
Merit: 517



View Profile WWW
April 01, 2013, 06:32:20 AM
 #697

Quote
I had this idea for a bitcoin fpga design. I'm not an fpga designer/coder, but since you guys are talking about new designs I thought I would throw this in there.
Thank you for sharing your idea, Senseless.  I love getting people engaged in this field of engineering.

Quote
The code would be split into 2 segments on different clocks/plls.
Forgive me if I misunderstand your design, but I believe you have replicated what the current FPGA mining designs are already doing.  For example, on the X6500 board, the jtag_comm module communicates with the mining core in the rx_hash_clk clock domain, and communicates with the outside world in the jtag clock domain.  You can see the Asynchronous FIFO that shuttles golden nonces from rx_hash_clk clock to jtag clock here.

There is certainly work that could be done there, though.  JTAG is not a good communication method for this sort of task.  On the X6500 it was simply chosen to reduce cost and complexity.
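
For anyone curious how an asynchronous FIFO can pass its read and write pointers safely between two clock domains, a common textbook technique is Gray-coding the pointers so that only one bit changes per increment. The Python sketch below illustrates that general technique; whether the FIFO referred to above works this way is an assumption on my part and is not stated in this thread.

```python
def bin_to_gray(n):
    """Convert a binary counter value to its Gray-code equivalent."""
    return n ^ (n >> 1)

def gray_to_bin(g):
    """Convert a Gray-code value back to plain binary."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

DEPTH = 16  # hypothetical FIFO depth; must be a power of two for clean wraparound

# Between consecutive pointer values (including wraparound) exactly one bit
# flips, so a pointer sampled mid-transition is off by at most one position.
bit_flips = [
    bin(bin_to_gray(i) ^ bin_to_gray((i + 1) % DEPTH)).count("1")
    for i in range(DEPTH)
]
print(bit_flips)
```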

senseless
Hero Member
*****
Offline Offline

Activity: 1118
Merit: 541



View Profile
April 01, 2013, 06:56:15 AM
Last edit: April 01, 2013, 07:18:02 AM by senseless
 #698

Quote
I had this idea for a bitcoin fpga design. I'm not an fpga designer/coder, but since you guys are talking about new designs I thought I would throw this in there.
Thank you for sharing your idea, Senseless.  I love getting people engaged in this field of engineering.

Quote
The code would be split into 2 segments on different clocks/plls.
Forgive me if I misunderstand your design, but I believe you have replicated what the current FPGA mining designs are already doing.  For example, on the X6500 board, the jtag_comm module communicates with the mining core in the rx_hash_clk clock domain, and communicates with the outside world in the jtag clock domain.  You can see the Asynchronous FIFO that shuttles golden nonces from rx_hash_clk clock to jtag clock here.

There is certainly work that could be done there, though.  JTAG is not a good communication method for this sort of task.  On the X6500 it was simply chosen to reduce cost and complexity.

Correct, something like that. I was thinking on-die memory segments could be used. But anything that would separate the hasher clock from the software communicator should be a good thing. I hadn't seen that code as I was working on the Altera branches. They must be doing something right to achieve 200MH/s per chip on a Spartan LX150, which in this thread (and on the hardware comparison page) topped out at 100MH/s on other boards (unless I missed some updates somewhere). The ztex design seems to be clocking 1 core at 200+MHz versus the other designs without hasher/controller separation clocking at 100MHz with 1 core. Would be amazing to double the clock rate of my Altera chips from 220 to 440 w/ 3 cores!

Quote
Slightly related:  I would not recommend fully unrolled cores for an ASIC design.  It will certainly result in higher performance per area due to optimizations unique to the unrolled designs, but it means higher failure rates and lower clock speeds due to intra-die variations.  Fully rolled cores that can be individually enabled and clocked (or clocked in regions) should give better yield and overclocking.

What sort of pipelining would you recommend? I suppose 64 cycles per hash would be the smallest footprint and the highest clocked design. At some point routing issues will become a concern, so I guess I'll need to optimize the pipeline unrolling per chip. Pipelining would also allow for a greater use of available space (on an sasic at least). I would love to be able to better utilize all of the logic available on my chip (lacking 8% MLABs for a 4th fully unrolled core).

kingcoin
Sr. Member
****
Offline Offline

Activity: 262
Merit: 250


View Profile
April 01, 2013, 07:35:28 AM
 #699

Quote
I would imagine getting the fabric to run at 500MHz in a Kintex-7 device is also a challenge. Running the design as-is through Vivado with a 325 speed grade -2 target does not meet timing closure at 250MHz.
The rest of the fabric only needs to handle registers, routing signals, and the non-linear math.  Obviously the Kintex 7 fabric is capable of handling these frequencies for modest logic

This was part of my concern. Getting that part to run at 500MHz is a challenge, especially with multiple cores when utilization goes up.

The issue is that FPGA's are routing constrained, especially in the Spartan 6's, and the tools aren't designed to handle these sorts of long chains.  The Kintex 7 chips are much nicer with respect to routing resources and consistency.

Yes. The CLB is pretty similar to the Spartan6, but it seems like the new switching matrix is quite effective when it comes to this type of logic/routing.

Altera Stratix-V does not seem to match this type of logic very well, at least with the current tools, as the Stratix-IV seems to outperform the Stratix-V. I don't understand why, as the ALM does not seem to be radically different from the Stratix-IV's.
kingcoin
Sr. Member
****
Offline Offline

Activity: 262
Merit: 250


View Profile
April 01, 2013, 07:46:47 AM
 #700

I was thinking on-die memory segments could be used.

FIFOs are usually implemented using embedded memory on the FPGAs. Even if you claim not to be an FPGA designer/coder, you think like one :)

But if you run your miner clock domain way above the fmax it will quite often work, as most devices are usually faster than their marked speed grade. But when you get your next board/batch it might fail constantly since you got slower devices. Also you have to be careful so that timing errors in the faster clock domain will not propagate into the slower clock domains, e.g. the FIFO enqueue signal being stuck asserted due to a timing error etc. It can potentially be a lot worse than just a bad nonce.