kramble
|
|
March 29, 2013, 10:45:37 PM |
|
What are the 384 bits for when it sends the new nonce and buffer?
Is that wasted space or used for something?
It's the fixed part of the input to the second round of the sha256 transform (I explained it a few posts ago, probably not too well though). And yes, there is wasted bandwidth in the altsource_probe comms, but it doesn't matter as this has no effect on the hash rate, since it only does a getwork every few seconds compared to a hash rate of millions per second. I guess you'll need to read up on the bitcoin hashing algorithm, but I'm not really the guy to explain it (no expert me, so I'm not going to make a fool of myself trying). All the best, Mark
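A minimal sketch of where a figure of 384 bits can come from, assuming it refers to the SHA-256 padding of the 80-byte block header (that reading is an assumption, not something confirmed in the thread). The header fills one 64-byte block plus 16 bytes of a second block; the remaining 48 bytes of that second block are padding that never changes. The function names here are illustrative only.

```python
import struct


def sha256_pad(message: bytes) -> bytes:
    """Apply standard SHA-256 padding: 0x80, zero fill, 64-bit big-endian bit length."""
    bit_len = len(message) * 8
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    padded += struct.pack(">Q", bit_len)
    return padded


def second_block_fixed_part(header: bytes) -> bytes:
    """Return the bytes of the second 64-byte block that do not depend on the header."""
    assert len(header) == 80, "a bitcoin block header is 80 bytes"
    padded = sha256_pad(header)
    assert len(padded) == 128  # exactly two 64-byte blocks
    return padded[80:]


# Two different headers give the same trailing bytes in the second block.
fixed_a = second_block_fixed_part(bytes(80))
fixed_b = second_block_fixed_part(bytes(range(80)))
fixed_bits = len(fixed_a) * 8
print(fixed_bits, fixed_a == fixed_b)
```

Because these bytes are constant, sending them with every work unit costs bandwidth but carries no information, which matches the "wasted bandwidth" remark above.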
|
|
|
|
iidx
|
|
March 30, 2013, 12:50:35 AM |
|
I've trawled this topic and github but am still not sure what the best starting point for a new (Kintex7-325) fpga miner would be. Multicore is key and a pointer to a working open source software/fpga combo for a serial interface would be hugely appreciated but any sensible starting point would be fine - I expect to do some work! I'm poking about with the verilog_xilinx port at the moment.
I started with the verilog_xilinx port back in 2011 to put on a handful of ML605s (V6 240s). I also have some K7 325s and K7 480s at work, but only did build tests for those because I didn't have permanent access to those boards. I would suggest starting with verilog_xilinx or one of the Ztex ports. I used PCIe for mine, so I modified the interfaces to take in 32 bit words. Unfortunately that means I don't have a starting place for you to use if you are going to try and use the serial port. However, I would not try to fit more than 3 instances of the fully unrolled verilog_xilinx version into that 325 without changing some of the adders into DSP48s. On the V6 240 I can fit 3 instances if I use most of the DSP48s to replace some of the adders in the design. Sadly, the K7 325 doesn't have that many more adders. I don't think I was successful getting 4 instances of the verilog_xilinx port to fit. Technically you could actually just make several instances of the entire design... and just use multiple serial ports to talk to it
|
|
|
|
fpgaminer (OP)
|
|
March 30, 2013, 09:24:56 AM |
|
Not possible. Max per channel is 256 bits.
Quick note, kramble; the max is actually 511 bits.
Does the X6500 use JTAG for communication or does it use some more effective protocol?
It uses JTAG. There is an FTDI chip on there that allows bit-banging pins over USB, and so compatible software bit-bangs JTAG to talk to the FPGA. The entire protocol sitting on top of JTAG is described in jtag_comm.v.
I've trawled this topic and github but am still not sure what the best starting point for a new (Kintex7-325) fpga miner would be.
I would recommend starting with the X6000_ztex_comm4 project. That's the same code that generated the bitstreams on the fpgamining.com website. You'll want to remove the JTAG communication related code and replace it with serial communication, or something else. You can then multi-core that and exchange some resources for DSP48s as iidx mentioned. It's also possible to implement a miner using only DSP48s and misc. logic, achieving about 500MH/s. I haven't released any code for that yet.
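One detail of going multi-core is making sure the cores do not repeat each other's work. A minimal sketch, assuming each core is simply given its own contiguous slice of the 32-bit nonce space; the function name and the contiguous split are illustrative choices, not taken from any of the projects mentioned.

```python
from typing import List, Tuple

NONCE_SPACE = 2**32  # the block header nonce is a 32-bit field


def split_nonce_space(num_cores: int) -> List[Tuple[int, int]]:
    """Return one (start, end_exclusive) nonce range per core, covering the space once."""
    if num_cores < 1:
        raise ValueError("need at least one core")
    base, extra = divmod(NONCE_SPACE, num_cores)
    ranges = []
    start = 0
    for core in range(num_cores):
        # Spread the remainder over the first `extra` cores.
        size = base + (1 if core < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


ranges = split_nonce_space(3)
for core_id, (lo, hi) in enumerate(ranges):
    print(f"core {core_id}: {lo:#010x} .. {hi - 1:#010x}")
```

In hardware the same effect is often had by fixing a few top nonce bits per core, which avoids any division; the version above just makes the bookkeeping explicit.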
|
|
|
|
kingcoin
|
|
March 30, 2013, 09:59:50 AM |
|
It's also possible to implement a miner using only DSP48s and misc. logic, achieving about 500MH/s. I haven't released any code for that yet.
On a single core?
|
|
|
|
fpgaminer (OP)
|
|
March 30, 2013, 10:57:26 AM |
|
On a single core?
Yes. DSP48E1's can run up to 500MHz, so if you unroll all the calculations, replace all the additions with DSP48E1's, register everything, and throw in misc. logic for the non-linear calculations you can get about 500MH/s, depending on the speed grade. It all fits into a Kintex-7 160 (the 160 has a higher DSP48 density than the 325). Probably some room left over for a normal hashing core, though I'm not sure. The DSP design requires quite a lot of registers.
EDIT: And yes, I implemented the design, so it's feasible. It was never fully debugged though, because at the time the Xilinx simulator couldn't handle Kintex's DSP48E1's very well.
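The 500MH/s figure follows from a fully unrolled, fully registered pipeline producing one hash result per clock. A small sketch of that arithmetic; the `cycles_per_hash` parameter and the 64-cycle rolled example are assumptions added for comparison, not figures from the post.

```python
NONCE_SPACE = 2**32


def hash_rate(clock_hz: float, cycles_per_hash: int = 1, cores: int = 1) -> float:
    """Hashes per second for `cores` identical cores at the given clock."""
    return clock_hz * cores / cycles_per_hash


def sweep_seconds(rate_hps: float) -> float:
    """Time to try every 32-bit nonce once at the given hash rate."""
    return NONCE_SPACE / rate_hps


unrolled = hash_rate(500e6)                   # one result per clock
rolled = hash_rate(500e6, cycles_per_hash=64) # hypothetical fully rolled core
sweep = sweep_seconds(unrolled)

print(f"unrolled: {unrolled / 1e6:.0f} MH/s, nonce sweep in {sweep:.2f} s")
print(f"rolled:   {rolled / 1e6:.2f} MH/s")
```

At that rate a core exhausts the nonce range in under nine seconds, so the host only needs to supply fresh work every few seconds.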
|
|
|
|
kingcoin
|
|
March 30, 2013, 03:38:00 PM |
|
On a single core? Yes. DSP48E1's can run up to 500MHz
But the Spartan6 fpga fabric does not run at 500MHz. Maybe some clever interleaving might make it possible to run the fabric interface at a lower clock.
|
|
|
|
fpgaminer (OP)
|
|
March 31, 2013, 12:27:29 AM |
|
But the Spartan6 fpga fabric does not run at 500MHz.
Sorry, I was talking about Kintex 7, which most certainly can. Kintex 7 has similar performance to the Virtex 6, at less cost.
What is the best way to run multiple devices (25 through USB)?
Your best bet would be to modify the code to use a serial interface that could be chained, instead of the altsource_probe. You could do it with the current altsource_probe, but you'd need to modify this code to find multiple devices. Also, last I checked, Quartus had issues handling more than one device plugged into the same USB controller.
|
|
|
|
senseless
|
|
March 31, 2013, 11:33:53 AM Last edit: March 31, 2013, 12:44:03 PM by senseless |
|
I had this idea for a bitcoin fpga design. I'm not an fpga designer/coder, but since you guys are talking about new designs I thought I would throw this in there. The code would be split into 2 segments on different clocks/plls. The first pll would be a master controller on chip that interfaces with the mining software. It will receive nonces from the code and store them in memory (as opposed to sending the nonces direct to the miners). The miners would run on their own pll (separate clock), would read nonces from memory and write any golden nonces back to a different memory segment. Each miner would provision their own pool in memory to hold N nonces and N golden nonces.
Rough flow chart:
Master controller: Software sends nonce to master controller -> on-chip master controller saves nonce to memory under a hashing core -> on-chip master controller looks for golden nonce in separate memory area -> golden nonce sent back to software for reporting to network/pool
Hasher Cores: Hashing core reads new nonce from its memory segment -> hashing core performs hashes on this nonce range (flipping nonce in memory) -> if golden nonce is found write to a different memory segment
Software Signals:
Reqs: requests nonce from software
Rest: overwrites nonces in on-chip memory (number of cores * nonce pool size per core = number of nonces to request, then overwrite in memory)
Nonc: sends nonce from software to chip
Stat: requests stats on chip processing speed
Gnon: sends golden nonce upstream to software
Thoughts:
Nonces can be flipped in memory and then pulled to start the next hash in the nonce range. When a core detects the last nonce, it starts working on the next nonce in the memory pool. For instance, provision room in memory for 3 nonces per core; once the 4 billion results of nonce 0 are completed, the core would start working on nonce 1. Meanwhile, the master controller would see in memory that nonce 0 is finished (all 4 billion flips calculated) and overwrite that memory segment with a fresh nonce. The reset signal would only need to overwrite every existing memory segment with new nonces; it does not need to reset the cores or make any changes, as nonces are flipped in memory.
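The flow described so far can be modelled in software to check that the bookkeeping hangs together. This is only a behavioural sketch in Python, not HDL: the class names, the tiny nonce range, and the toy "golden" test are all assumptions made for illustration. Each core owns a small pool of work items and a separate pool for golden nonces, and the master controller only ever touches those pools.

```python
import hashlib
from collections import deque

POOL_SLOTS = 3      # work items buffered per core (the post suggests 3)
NONCE_RANGE = 2000  # toy value; real hardware would sweep 2**32 nonces


def is_golden(work: bytes, nonce: int) -> bool:
    """Toy target: double SHA-256 whose last byte is zero."""
    data = work + nonce.to_bytes(4, "little")
    digest = hashlib.sha256(hashlib.sha256(data).digest()).digest()
    return digest[-1] == 0


class HashingCore:
    def __init__(self):
        self.work_pool = deque()    # filled by the master controller
        self.golden_pool = deque()  # drained by the master controller

    def process_next(self) -> bool:
        """Sweep the nonce range for the oldest work item, if any."""
        if not self.work_pool:
            return False
        work = self.work_pool.popleft()
        for nonce in range(NONCE_RANGE):
            if is_golden(work, nonce):
                self.golden_pool.append((work, nonce))
        return True


class MasterController:
    def __init__(self, cores):
        self.cores = cores
        self.issued = 0  # stand-in for work fetched from the mining software

    def refill(self):
        """Top up every core's pool so no core ever waits for work."""
        for core in self.cores:
            while len(core.work_pool) < POOL_SLOTS:
                core.work_pool.append(self.issued.to_bytes(8, "big"))
                self.issued += 1

    def collect(self):
        """Gather golden nonces from all cores for reporting upstream."""
        found = []
        for core in self.cores:
            while core.golden_pool:
                found.append(core.golden_pool.popleft())
        return found


cores = [HashingCore() for _ in range(2)]
master = MasterController(cores)
master.refill()
for core in cores:
    core.process_next()
results = master.collect()
master.refill()
print(f"{len(results)} golden nonces, {master.issued} work items issued")
```

In an FPGA the two deques would be small memories or FIFOs, and the interesting part is that they are the only point where the two clock domains meet.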
..
The reason I came up with this sort of design is this: after playing with the code, the worst-case slack seems to be on the path that reports a golden nonce upstream. That is why you can seriously overclock the design over fmax and it works fine, other than reporting bad results upstream to the software. I'm able to push my clock rate almost up to 275MHz without the compile failing completely (with edge/corner timing errors). Putting the master controller on its own clock/PLL, separate from the hasher cores, would allow the fmax of the hashing cores to skyrocket while the controller stays at a more conservative clock for software communications.
.... Hell my chip has 8 PLLs, could probably put every core on its own PLL so a slow down in one hashing core does not affect the others. (Which would probably be ideal, every "core" would have its own fmax and timing)
..
It would be nice if we could come up with a fully functioning/optimized unrolled multi-core design so anyone could take said design and produce a top level structured asic design (print their own chips). Just make sure to release under a license which requires all modifications to be reported. It's not really THAT expensive to get your own structured asic produced from design, takes awhile to complete but < 100K should be fine at > 90nm. This would also be nice as new technologies come along (14nm altera, etc) we can move to hardcopies of those devices on the same design which should meet performance/power wise to BFL's target estimates for their asic design (or better).
|
|
|
|
kingcoin
|
|
March 31, 2013, 03:34:30 PM |
|
But the Spartan6 fpga fabric does not run at 500MHz.
Sorry, I was talking about Kintex 7, which most certainly can. Kintex 7 has similar performance to the Virtex 6, at less cost.
I would imagine getting the fabric to run at 500MHz in a Kintex-7 device is also a challenge. Running the design as-is through Vivado with a 325 speed grade -2 target does not meet timing closure at 250MHz. But of course with careful placement and constraints it might be possible.
|
|
|
|
kingcoin
|
|
March 31, 2013, 03:58:29 PM |
|
This would also be nice as new technologies come along (14nm altera, etc) we can move to hardcopies of those devices on the same design which should meet performance/power wise to BFL's target estimates for their asic design (or better).
How many e.g. Stratix devices do you have to make before the unit cost including the NRE is lower for Hardcopy?
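The question is a break-even calculation: the converted part wins once the NRE spread over the run is smaller than the per-unit saving. A sketch of that arithmetic follows; every price in it is a made-up placeholder, since no actual Hardcopy or FPGA prices are given in the thread.

```python
def break_even_units(nre: int, fpga_unit_cost: int, asic_unit_cost: int):
    """Smallest run size at which (asic_unit_cost + nre / n) < fpga_unit_cost.

    Returns None if the converted part is not cheaper per unit at all.
    """
    saving = fpga_unit_cost - asic_unit_cost
    if saving <= 0:
        return None
    # Need n > nre / saving; with integers that is floor(nre / saving) + 1.
    return nre // saving + 1


# Hypothetical numbers only: 100K NRE, 500 per FPGA, 100 per converted chip.
units = break_even_units(nre=100_000, fpga_unit_cost=500, asic_unit_cost=100)
print(units)
```

With real quotes plugged in, the same function answers the question directly.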
|
|
|
|
senseless
|
|
March 31, 2013, 08:03:26 PM Last edit: March 31, 2013, 08:51:48 PM by senseless |
|
This would also be nice as new technologies come along (14nm altera, etc) we can move to hardcopies of those devices on the same design which should meet performance/power wise to BFL's target estimates for their asic design (or better).
... I don't have the hardcopy prices from Altera or Xilinx directly. But, based on my investigation of some other fabless companies, it should be somewhere around 300K chips @ 45nm and somewhere around 1M chips for 28nm to get competitive pricing. The pricing I was getting at 45nm was around 20$/GHash @ 300K units (2.5-3Gh/s per unit). Keep in mind these were from fabless companies; they were just reselling someone else's services but did their own in-house design conversion. The fabless companies are obviously going to be a bit higher on per-unit cost and NRE, as that's where they get their cash from, as opposed to going direct with Altera's or Xilinx's structured ASIC processes with no middle man.
|
|
|
|
|
senseless
|
|
March 31, 2013, 08:59:04 PM Last edit: March 31, 2013, 09:19:51 PM by senseless |
|
The pricing isn't bad at all; but the only open source designs available to be taken for conversion do not have a very good multi-core base design. Could easily take this design into an avalon style 1 chip per core; but seems like an awful waste of PCB space. The 20$/Ghash pricing before was with a single chip operating at 250mhz with 10 cores on it. It would be 28 chips to reach 68Gh/s as opposed to avalon's 240 chips to reach that speed. The pricing I got is still a little high. It won't be effective (competition price match) until it hits like 10$/Ghash at which point people could build their own units for less than the cost of avalon's, bfls, etc. It should be possible to get a miner @ 800$ cost with 70Gh/s @ 200-400W (28nm-45nm). Maybe some sort of non-profit coop to collect funds to get the initial design conversion, mask printing and chips made? Could then just sell chips on as needed basis close to cost.
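The chip counts above can be checked with a few lines of arithmetic, assuming each core delivers one hash per clock (an assumption; the post does not say how the per-chip rate was derived). Rates are kept in MH/s so everything stays in integers.

```python
def chip_rate_mhs(cores: int, clock_mhz: int) -> int:
    """Per-chip hash rate in MH/s, assuming one hash per clock per core."""
    return cores * clock_mhz


def chips_needed(target_mhs: int, per_chip_mhs: int) -> int:
    """Ceiling division: chips required to reach at least the target rate."""
    return -(-target_mhs // per_chip_mhs)


per_chip = chip_rate_mhs(cores=10, clock_mhz=250)  # 2500 MH/s = 2.5 Gh/s
chips = chips_needed(68_000, per_chip)             # target of 68 Gh/s
avalon_per_chip = 68_000 / 240                     # MH/s per chip at 240 chips

print(per_chip, chips, round(avalon_per_chip, 1))
```

This reproduces the 28-chip figure and puts a 240-chip unit at roughly 283 MH/s per chip.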
|
|
|
|
2112
|
|
March 31, 2013, 09:40:52 PM |
|
I would imagine getting the fabric to run at 500MHz in a Kintex-7 device is also a challenge. Running the design as-is through Vivado with a 325 speed grade -2 target does not meet timing closure at 250MHz. But of course with careful placement and constraints it might be possible.
I just wanted to point out one thing: trying to achieve timing closure is a blind alley. What you should really aim for is power optimization. To my knowledge none of the popular toolchains has such a goal available. With the unrolled design the fanout of some registers is high enough to trigger combinatorial logic duplication when searching for the closure. I haven't tried Vivado, but ISE was even doing register duplication. This is exactly what you don't want when doing an FPGA design that has to compete with an ASIC design. In the absence of pure power optimization, your next-best goal is to optimize for area. I guess working with the two unrolled copies of SHA-256 produces such a wild mess of trees (primitives) that it is possible to lose one's bearings in the jungle of vines (signals).
|
|
|
|
tbd
|
|
April 01, 2013, 01:02:27 AM |
|
Maybe some sort of non-profit coop to collect funds to get the initial design conversion, mask printing and chips made? Could then just sell chips on as needed basis close to cost.
I like this idea.
|
|
|
|
fpgaminer (OP)
|
|
April 01, 2013, 06:17:20 AM |
|
I would imagine getting the fabric to run at 500MHz in a Kintex-7 device is also a challenge. Running the design as-is through Vivado with a 325 speed grade -2 target does not meet timing closure at 250MHz.
In this case, the DSP48E1's are taking care of the heavy lifting: three- and two-way 32-bit addition. The rest of the fabric only needs to handle registers, routing signals, and the non-linear math. Obviously the Kintex 7 fabric is capable of handling these frequencies for modest logic, otherwise the DSP's would be unusable in the first place. But, as I said before, I already did a rudimentary implementation of this design and synthesized/routed it. Timing reported ~400MHz on the devkit.
What you should really aim is power optimization. To my knowledge none of the popular toolchains has such a goal available.
Quartus, ISE, and Vivado all have options to target minimizing power. I don't know how good they are at it; probably not very.
I guess working with the two unrolled copies of SHA-256 produces such a wild mess of trees (primitives) that it is possible to lose ones bearing in the jungle of vines (signals).
Actually, unrolled cores are very straightforward designs. The issue is that FPGA's are routing constrained, especially in the Spartan 6's, and the tools aren't designed to handle these sorts of long chains. The Kintex 7 chips are much nicer with respect to routing resources and consistency. Also, the newer Vivado Studio tool does a much better job than ISE in my experience with it so far. It's a shame Vivado does not support S6.
Could easily take this design into an avalon style 1 chip per core; but seems like an awful waste of PCB space.
In my opinion, Avalon was smart in this regard and did it right. Using lots of chips is a very good thing for these early mining ASIC's; I would not have recommended it any other way. This is because there are rather large Minimum Order Quantities when producing ASIC's. If you sell 1000 units, each with 4 chips, you aren't going to reach the necessary MOQ's, which are at least 50K chips. Selling 1000 units, each with 240 chips, puts you in that beautiful quantity where the fabs and factories start giving you the time of day. And the cost of everything else goes down. In the long run, yes, bigger chips are a better idea since they require less overall supporting circuitry and PCB space.
Slightly related: I would not recommend fully unrolled cores for an ASIC design. It will certainly result in higher performance per area due to optimizations unique to the unrolled designs, but it means higher failure rates and lower clock speeds due to intra-die variations. Fully rolled cores that can be individually enabled and clocked (or clocked in regions) should give better yield and overclocking.
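The MOQ argument is simple multiplication, spelled out here with the figures from the post (1000 units, 4 versus 240 chips per unit, a 50K-chip minimum order). The function names are illustrative.

```python
MOQ_CHIPS = 50_000  # minimum order quantity quoted in the post


def chips_ordered(units: int, chips_per_unit: int) -> int:
    """Total chips a production run of `units` miners would consume."""
    return units * chips_per_unit


def meets_moq(units: int, chips_per_unit: int, moq: int = MOQ_CHIPS) -> bool:
    return chips_ordered(units, chips_per_unit) >= moq


few_big_chips = chips_ordered(1000, 4)       # 4,000 chips
many_small_chips = chips_ordered(1000, 240)  # 240,000 chips

print(few_big_chips, meets_moq(1000, 4))
print(many_small_chips, meets_moq(1000, 240))
```

The same 1000-unit run misses the minimum by more than a factor of ten with 4 chips per unit and clears it almost five times over with 240.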
|
|
|
|
fpgaminer (OP)
|
|
April 01, 2013, 06:32:20 AM |
|
I had this idea for a bitcoin fpga design. I'm not an fpga designer/coder, but since you guys are talking about new designs I thought I would throw this in there.
Thank you for sharing your idea, Senseless. I love getting people engaged in this field of engineering.
The code would be split into 2 segments on different clocks/plls.
Forgive me if I misunderstand your design, but I believe you have replicated what the current FPGA mining designs are already doing. For example, on the X6500 board, the jtag_comm module communicates with the mining core in the rx_hash_clk clock domain, and communicates with the outside world in the jtag clock domain. You can see the Asynchronous FIFO that shuttles golden nonces from rx_hash_clk clock to jtag clock here. There is certainly work that could be done there, though. JTAG is not a good communication method for this sort of task. On the X6500 it was simply chosen to reduce cost and complexity.
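A common ingredient of asynchronous FIFOs is passing the read and write pointers between clock domains in Gray code, so that only one bit changes per increment and a pointer sampled mid-transition is off by at most one position. Whether the X6500 FIFO does exactly this is an assumption; the sketch below only demonstrates the encoding itself, in Python rather than Verilog.

```python
def to_gray(n: int) -> int:
    """Binary to reflected Gray code."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Gray code back to binary by folding in successive right shifts."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


DEPTH = 16  # FIFO depth; a power of two keeps the wrap-around single-bit too

for ptr in range(DEPTH):
    nxt = (ptr + 1) % DEPTH
    changed = bin(to_gray(ptr) ^ to_gray(nxt)).count("1")
    print(f"{ptr:2d} -> {to_gray(ptr):04b}  (bits changed to next: {changed})")
```

Every step, including the wrap from 15 back to 0, flips exactly one bit, which is the property that makes the pointer safe to pass through a two-flop synchronizer.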
|
|
|
|
senseless
|
|
April 01, 2013, 06:56:15 AM Last edit: April 01, 2013, 07:18:02 AM by senseless |
|
I had this idea for a bitcoin fpga design. I'm not an fpga designer/coder, but since you guys are talking about new designs I thought I would throw this in there.
Thank you for sharing your idea, Senseless. I love getting people engaged in this field of engineering.
The code would be split into 2 segments on different clocks/plls.
Forgive me if I misunderstand your design, but I believe you have replicated what the current FPGA mining designs are already doing. For example, on the X6500 board, the jtag_comm module communicates with the mining core in the rx_hash_clk clock domain, and communicates with the outside world in the jtag clock domain. You can see the Asynchronous FIFO that shuttles golden nonces from rx_hash_clk clock to jtag clock here. There is certainly work that could be done there, though. JTAG is not a good communication method for this sort of task. On the X6500 it was simply chosen to reduce cost and complexity.
Correct, something like that. I was thinking on-die memory segments could be used. But anything that would separate the hasher clock from the software communicator should be a good thing. I hadn't seen that code as I was working on the altera branches. They must be doing something right to achieve 200mh/s per chip on a spartan lx150, which in this thread (and on the hardware comparison page) topped out at 100mh/s on other boards (unless I missed some updates somewhere). The ztex design seems to be clocking 1 core at 200+mhz versus the other designs without hasher/controller separation clocking at 100mhz with 1 core. Would be amazing to double the clock rate of my altera chips from 220 to 440 w/ 3 cores!
Slightly related: I would not recommend fully unrolled cores for an ASIC design. It will certainly result in higher performance per area due to optimizations unique to the unrolled designs, but it means higher failure rates and lower clock speeds due to intra-die variations. Fully rolled cores that can be individually enabled and clocked (or clocked in regions) should give better yield and overclocking.
What sort of pipelining would you recommend? I suppose 64 cycles per hash would be the smallest footprint and the highest clocked design. At some point routing issues will become a concern, so I guess I'll need to optimize the pipeline unrolling per chip. Pipelining would also allow for a greater use of available space (on an sasic at least). I would love to be able to better utilize all of the logic available on my chip (lacking 8% MLABs for a 4th fully unrolled core).
|
|
|
|
kingcoin
|
|
April 01, 2013, 07:35:28 AM |
|
I would imagine getting the fabric to run at 500MHz in a Kintex-7 device is also a challenge. Running the design as-is through Vivado with a 325 speed grade -2 target does not meet timing closure at 250MHz.
The rest of the fabric only needs to handle registers, routing signals, and the non-linear math. Obviously the Kintex 7 fabric is capable of handling these frequencies for modest logic
This was part of my concern. Getting that part to run at 500MHz is a challenge, especially with multiple cores when utilization goes up.
The issue is that FPGA's are routing constrained, especially in the Spartan 6's, and the tools aren't designed to handle these sorts of long chains. The Kintex 7 chips are much nicer with respect to routing resources and consistency.
Yes. The CLB is pretty similar to the Spartan6, but it seems like the new switching matrix is quite effective when it comes to this type of logic/routing. Altera Stratix-V does not seem to match this type of logic very well, at least with the current tools, as the Stratix-IV seems to outperform the Stratix-V. I don't understand why, as the ALM does not seem to be radically different from the Stratix-IV.
|
|
|
|
kingcoin
|
|
April 01, 2013, 07:46:47 AM |
|
I was thinking on-die memory segments could be used.
FIFO's are usually implemented using embedded memory on the FPGA's. Even if you claim not to be an FPGA designer/coder, you think like one. But if you run your miner clock domain way above the fmax it will quite often work, as most devices are usually faster than their marked speed grade. But when you get your next board/batch it might fail constantly since you got slower devices. Also you have to be careful so that timing errors in the faster clock domain will not propagate into the slower clock domains, e.g. the FIFO enqueue signal being stuck asserted due to a timing error etc. It can potentially be a lot worse than just a bad nonce.
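Bad nonces from an overclocked core are at least cheap to catch on the host: recompute the double SHA-256 and discard anything that does not check out. A minimal sketch, assuming the usual coarse test that the last four bytes of the digest are zero; the function names are made up here, and the genesis block header is used only as a well-known test vector.

```python
import hashlib
import struct


def double_sha256(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def nonce_is_plausible(header76: bytes, nonce: int) -> bool:
    """Recompute the hash on the host; accept only if the top 32 bits are zero."""
    header = header76 + struct.pack("<I", nonce)
    return double_sha256(header)[-4:] == b"\x00\x00\x00\x00"


# Test vector: the bitcoin genesis block header (first 76 bytes, then its nonce).
GENESIS_MERKLE_ROOT = "4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b"
GENESIS_NONCE = 2083236893
genesis_header76 = (
    struct.pack("<I", 1)                          # version
    + bytes(32)                                   # previous block hash (all zero)
    + bytes.fromhex(GENESIS_MERKLE_ROOT)[::-1]    # merkle root, little-endian
    + struct.pack("<II", 1231006505, 0x1D00FFFF)  # timestamp, difficulty bits
)

genesis_ok = nonce_is_plausible(genesis_header76, GENESIS_NONCE)
genesis_hash = double_sha256(genesis_header76 + struct.pack("<I", GENESIS_NONCE))[::-1].hex()
print(genesis_ok, genesis_hash)
```

A check like this filters bad results before they reach the pool, but it does nothing about the worse failure described above, where a timing error wedges the control logic itself.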
|
|
|
|
|