|
makomk
|
 |
July 14, 2011, 09:03:59 PM |
|
By the way - and this is probably what I get for not reading data sheets - I only just realized that the Virtex-5 and Virtex-6 series don't suffer from the annoying "only half the CLBs have fast carry logic" restriction of Spartan-6 FPGAs. Guess that could partly explain why Spartan-6 FPGAs have so much trouble with this design...
It appears ISE has no trouble at all fitting a 150 MHash/sec design on the Kintex-7 XC7K70T either (as in, it reaches that clock without even trying hard) - though note that this is an untested tweaked design and obviously I don't have the FPGA to run it on anyway. Kintex-7 is supposed to basically be a much cheaper 28nm equivalent of the Virtex-6 series.
|
Quad XC6SLX150 Board: 860 MHash/s or so. SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
|
|
|
|
fpgaminer (OP)
|
 |
July 14, 2011, 10:07:50 PM |
|
Quote: I also found a way to program the FPGA without Altera tools, using UrJTAG. There is a script included for this.
Fantastic! I played with UrJTAG once, but it just barfed on me. Maybe I'll give it another try. Their disclaimer that it might damage the hardware is a bit disconcerting ... but I doubt it's likely.
Quote: Too good to be true?
Unless you have the code, or the physical device in your hand, you shouldn't believe it. This applies to every number someone quotes, including mine. It's like that line you find in a lot of version control tutorials and guides: if it isn't committed, it doesn't exist.
Quote: By the way - and this is probably what I get for not reading data sheets - I only just realized that the Virtex-5 and Virtex-6 series don't suffer from the annoying "only half the CLBs have fast carry logic" restriction of Spartan-6 FPGAs. Guess that could partly explain why Spartan-6 FPGAs have so much trouble with this design...
That might explain it. Since SHA-256 is almost entirely addition logic, that means half of a Spartan-6 chip is useless. But I couldn't even get 64 rounds to fit on my LX150. Maybe the CLBs are ordered in such a way that it was forced to route around the useless ones, causing massive routing delays? If so, padding with registers between rounds might fix that (as others have suggested). If those useless CLBs can't be used for anything else, might as well pack 'em with a bunch of registers. It could also be a combination of poor routing design and having to route around the useless CLBs.
Quote: It appears ISE has no trouble at all fitting a 150 MHash/sec design on the Kintex-7 XC7K70T either (as in, it reaches that clock without even trying hard) - though note that this is an untested tweaked design and obviously I don't have the FPGA to run it on anyway. Kintex-7 is supposed to basically be a much cheaper 28nm equivalent of the Virtex-6 series.
A great first number! I'm very interested in the Kintex-7 series, and hope they are made available soon. I'll be checking their booth at CES next year for sure. Think they'll sell me a devkit for Bitcoins?
|
|
|
|
|
makomk
|
 |
July 14, 2011, 10:29:50 PM Last edit: July 14, 2011, 10:45:02 PM by makomk |
|
Quote: That might explain it. Since SHA-256 is almost entirely addition logic, that means half of a Spartan-6 chip is useless. But I couldn't even get 64 rounds to fit on my LX150. Maybe the CLBs are ordered in such a way that it was forced to route around the useless ones, causing massive routing delays? If so, padding with registers between rounds might fix that (as others have suggested). If those useless CLBs can't be used for anything else, might as well pack 'em with a bunch of registers. It could also be a combination of poor routing design and having to route around the useless CLBs.
My personal suspicion is that support for this particular Spartan-6 misfeature in ISE is buggy and badly tested; I'm fairly sure that at one point I was trying to cram on more adders than there were SLICEM/Ls to fit them in, and it tried to map them anyway without realising this was impossible. It also doesn't seem to report usage of this particular resource in any useful way. It's possible that their routing design is also bad, of course. I recently upgraded to ISE 13.2 and its behaviour in this department seems to have changed quite a bit from ISE 13.1, though I'm not sure if this is entirely for the better...
Edit: Think I may have been wrong about this; it's a bit under the maximum number of SLICEM/Ls available in theory.
Also, have you tried disabling shift register inference or turning up the threshold for it? Those compete for the same resources as adders.
Quote: A great first number! I'm very interested in the Kintex-7 series, and hope they are made available soon. I'll be checking their booth at CES next year for sure. Think they'll sell me a devkit for Bitcoins?
Heheheheheheh. They certainly seem impressive, though I'll wait until someone's actually submitting shares with one, or at least until they're actually available for sale, before I actually believe it ;-).
Edit 2: 200 MHash/sec, though that increased total build time to over an hour. I have a feeling this isn't even pushing the tools slightly yet...
|
|
|
|
|
TheSeven
|
 |
July 14, 2011, 11:11:41 PM |
|
Quote: Also, have you tried disabling shift register inference or turning up the threshold for it? Those compete for the same resources as adders.
That's my suspicion as well; however, I failed to figure out how to configure this and didn't have time to search thoroughly. Do you have a hint?
|
My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY
|
|
|
|
makomk
|
 |
July 14, 2011, 11:27:31 PM |
|
Quote: That's my suspicion as well; however, I failed to figure out how to configure this and didn't have time to search thoroughly. Do you have a hint?
"HDL Options" section of the XST configuration, options "Shift register extraction" and "Shift register minimum size". The default is to create shift registers as short as 2 stages, which is a bit daft. (I assume you've already figured out this quirk of ISE, but for some reason you have to select fpgaminer_top in the hierarchy before it'll show you the list of process steps and let you configure them.) This probably matters most for Spartan-6; other Xilinx FPGAs don't seem to have such annoying limitations on adders.
|
|
|
|
|
makomk
|
 |
July 16, 2011, 11:17:46 AM |
|
By the way, anyone running with LOOP_LOG2=1, 2 or 3 might find the experimental partial-unroll-opt branch useful - it reduces the resource usage somewhat. Unfortunately it also breaks LOOP_LOG2=4 and greater and hasn't been tested on actual FPGAs.
|
|
|
|
|
fpgaminer (OP)
|
 |
July 17, 2011, 08:05:58 AM |
|
July 17th, 2011 - Code Updates and Minor Cleanup
teknohog's Xilinx Verilog port on the public repo has been updated.
teknohog's serial modifications to makomk's code have been added as a separate project.
OrphanedGland's port to Stratix devices, using VHDL, has been merged into the public repo.
To top it all off, I updated the project's main README.md file to prominently include a list of contributors and their donation addresses, because they deserve recognition for their hard work. I will modify the first post in this thread to include the same list.
As it wasn't mentioned before in the first post, I am mentioning here that makomk made improvements to my base Verilog code. These changes improved both the overall performance of the design and its area consumption, allowing the design to fit on a smaller, cheaper EP4CE75 chip. Great work makomk!
|
|
|
|
|
fpgaminer (OP)
|
 |
July 17, 2011, 08:07:38 AM Last edit: July 19, 2011, 11:26:44 PM by fpgaminer |
|
udif and makomk, I do not have donation addresses for you. If you would like your donation address added to the first post with the Contributors list, and in the github project's README.md, please contact me and let me know.
|
|
|
|
lame.duck
Legendary
Offline
Activity: 1270
Merit: 1000
|
 |
July 17, 2011, 08:23:03 PM |
|
Quote: By the way, anyone running with LOOP_LOG2=1, 2 or 3 might find the experimental partial-unroll-opt branch useful - it reduces the resource usage somewhat. Unfortunately it also breaks LOOP_LOG2=4 and greater and hasn't been tested on actual FPGAs.
Great work! On my EP3C25C6 board the device utilisation is reduced by approx. 20% and the Fmax increases from 87 MHz to 103 MHz, but maybe there is still some room. I did several compile runs and got different results each time; the difference is substantial, 10 MHz or so. I could verify that it works by earning shares.
For the EP2C35C8 I am able to change LOOP_LOG2 from 3 to 2 with no change in Fmax. The test with the real hardware isn't done yet, but maybe tomorrow.
Is the toplevel compatible with the serial toplevel?
And going OT, could someone produce a bitstream for the EP1C80 for me?
|
|
|
|
|
|
makomk
|
 |
July 17, 2011, 08:59:47 PM |
|
Quote: Great work! On my EP3C25C6 board the device utilisation is reduced by approx. 20% and the Fmax increases from 87 MHz to 103 MHz, but maybe there is still some room. I did several compile runs and got different results each time; the difference is substantial, 10 MHz or so. I could verify that it works by earning shares. For the EP2C35C8 I am able to change LOOP_LOG2 from 3 to 2 with no change in Fmax. The test with the real hardware isn't done yet, but maybe tomorrow.
Glad to hear a success story!
Quote: Is the toplevel compatible with the serial toplevel?
Unfortunately, some manual merging of changes is probably going to be required because both this and serial support modify fpgaminer_top.v. Shouldn't be too difficult to do in theory though, especially now that teknohog's merged his serial code with an older version of my modifications.
Quote: And going OT, could someone produce a bitstream for the EP1C80 for me?
I wasn't even aware there was such a device...
|
|
|
|
lame.duck
|
 |
July 17, 2011, 09:15:11 PM |
|
Quote: And going OT, could someone produce a bitstream for the EP1C80 for me?
Quote: I wasn't even aware there was such a device...
Err... EP1S80F1508C6. Unfortunately Altera only issues Quartus licenses that are sold through licensed distributors, so I have no possibility of producing a bitstream myself.
|
|
|
|
|
mlut
Newbie
Offline
Activity: 6
Merit: 0
|
 |
July 19, 2011, 07:28:40 AM |
|
Has anyone got the current version on GitHub (fpgaminer + makomk modifications) running successfully on an FPGA? It compiles without problems for an EPS4GX230 device, but it produces only stales, no shares. The "original" fpgaminer version runs without problems ...
|
|
|
|
|
|
fpgaminer (OP)
|
 |
July 19, 2011, 11:25:36 PM |
|
Quote: Has anyone got the current version on GitHub (fpgaminer + makomk modifications) running successfully on an FPGA? It compiles without problems for an EPS4GX230 device, but it produces only stales, no shares. The "original" fpgaminer version runs without problems ...
Are you using projects/DE2_115_makomk_mod/ with CONFIG_LOOP_LOG2 set to something other than 0? It must be set to 0 for the version on my public repo. The version on makomk's repo will work with other settings, but it basically just reverts to the unoptimized version if you do that so there's little point in using it unless CONFIG_LOOP_LOG2 is 0. You also mentioned in your PM that you're using an "old" mining tcl script. How old? It was updated a few weeks ago to reflect changes in the code.
|
|
|
|
|
fpgaminer (OP)
|
 |
July 20, 2011, 12:48:32 AM |
|
Just a quick progress report. I dug back into my Spartan-6 LX150, updated its code to the version that can roll up, and used a CONFIG_LOOP_LOG2 setting of 5; I just wanted to get something compiled and working. And after waiting long enough, it did indeed churn out a valid result! So ... progress! I am going to clean up the project and get it into the public repo. From there I will crank down the LOOP_LOG2 setting as low as it will go, and begin adding extra pipelining to see if it will push further.
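As a rough sanity check on what a given LOOP_LOG2 setting costs, the sketch below estimates hash rate from clock frequency. It assumes that each step of LOOP_LOG2 halves the unrolled logic and halves throughput, so that a fully unrolled core (LOOP_LOG2=0) produces one hash per clock, and that valid settings run from 0 to 5. The function name and the 50 MHz clock are illustrative only and are not taken from the repo.

```python
def estimated_mhash_per_sec(clock_mhz: float, loop_log2: int) -> float:
    """Estimate MHash/s for a rolled-up SHA-256 miner core.

    Assumption: throughput is one hash per 2**loop_log2 clock cycles,
    i.e. LOOP_LOG2=0 is fully unrolled and yields one hash per clock.
    """
    if not 0 <= loop_log2 <= 5:
        raise ValueError("loop_log2 is expected to be in the range 0..5")
    return clock_mhz / (1 << loop_log2)


if __name__ == "__main__":
    # Hypothetical 50 MHz clock, swept across every LOOP_LOG2 setting.
    for loop_log2 in range(6):
        rate = estimated_mhash_per_sec(50.0, loop_log2)
        print(f"LOOP_LOG2={loop_log2}: ~{rate:.2f} MHash/s")
```

Under these assumptions a setting of 5 is useful for proving the design works, but each step down towards 0 doubles the estimated hash rate at the same clock.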
|
|
|
|
mlut
Newbie
Offline
Activity: 6
Merit: 0
|
 |
July 20, 2011, 09:28:48 AM |
|
Quote: Has anyone got the current version on GitHub (fpgaminer + makomk modifications) running successfully on an FPGA? It compiles without problems for an EPS4GX230 device, but it produces only stales, no shares. The "original" fpgaminer version runs without problems ...
Quote: Are you using projects/DE2_115_makomk_mod/ with CONFIG_LOOP_LOG2 set to something other than 0? It must be set to 0 for the version on my public repo. The version on makomk's repo will work with other settings, but it basically just reverts to the unoptimized version if you do that so there's little point in using it unless CONFIG_LOOP_LOG2 is 0. You also mentioned in your PM that you're using an "old" mining tcl script. How old? It was updated a few weeks ago to reflect changes in the code.
Correct, I used DE2_115_makomk_mod and changed the FPGA type and the clock pin; CONFIG_LOOP_LOG2 was 0. Do you also publish a version where CONFIG_LOOP_LOG2 can be different from 0?
Because the virtual_wires are the same, I thought that I could still use the tcl scripts from your "original" design. Is this not a good idea?
|
|
|
|
|
|
fpgaminer (OP)
|
 |
July 21, 2011, 01:52:53 AM |
|
Quote: Correct, I used DE2_115_makomk_mod and changed the FPGA type and the clock pin; CONFIG_LOOP_LOG2 was 0.
Did you set the correct voltage for the clock pin? Is the clock 50 MHz? If it is not 50 MHz, you will have to adjust the sdc file so that Quartus knows the correct speed of the clock. What dev board is it, by the way?
Quote: Because the virtual_wires are the same, I thought that I could still use the tcl scripts from your "original" design. Is this not a good idea?
That is correct, the tcl mining script should work fine, as long as you're using the latest version: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/scripts/mine
Quote: Do you also publish a version where CONFIG_LOOP_LOG2 can be different from 0?
The unoptimized version works with any setting of CONFIG_LOOP_LOG2. makomk's personal repository also has a version of his code that works with any setting of CONFIG_LOOP_LOG2, but as I pointed out there isn't much point in using it if CONFIG_LOOP_LOG2 isn't 0, because it basically reverts to the unoptimized version.
|
|
|
|
|
TheSeven
|
 |
July 21, 2011, 08:50:26 AM |
|
Quote: Just a quick progress report. I dug back into my Spartan-6 LX150, updated its code to the version that can roll up, and used a CONFIG_LOOP_LOG2 setting of 5; I just wanted to get something compiled and working. And after waiting long enough, it did indeed churn out a valid result! So ... progress! I am going to clean up the project and get it into the public repo. From there I will crank down the LOOP_LOG2 setting as low as it will go, and begin adding extra pipelining to see if it will push further.
Just to let you know, I just managed to route a fully-unrolled doubly-pipelined miner at 50MHz on the LX150 overnight, so it's definitely doable. As I don't have access to an LX150 I haven't verified its correctness yet, but even if it has some bugs, I don't think fixing them would affect timing much. I might try to improve it during the weekend.
|
|
|
|
|
fpgaminer (OP)
|
 |
July 21, 2011, 09:56:25 AM |
|
Quote: Just to let you know, I just managed to route a fully-unrolled doubly-pipelined miner at 50MHz on the LX150 overnight, so it's definitely doable. As I don't have access to an LX150 I haven't verified its correctness yet, but even if it has some bugs, I don't think fixing them would affect timing much. I might try to improve it during the weekend.
Is the code available anywhere? I'd be happy to compile and run it on my board. I'm guessing it's a modified version of your VHDL port?
I tried compiling a version of my code, LOOP_LOG2=1, and two extra pipeline registers with Register Balancing enabled. It couldn't even get past Mapping, with an area constraint error.
|
|
|
|
|
makomk
|
 |
July 21, 2011, 12:20:01 PM Last edit: July 21, 2011, 02:09:45 PM by makomk |
|
Quote: I tried compiling a version of my code, LOOP_LOG2=1, and two extra pipeline registers with Register Balancing enabled. It couldn't even get past Mapping, with an area constraint error.
The LOOP_LOG2>0 code isn't terribly efficient. There are some changes to make it a bit better in the partial-unroll-opt branch on my github repo; I'll see if I can get them merged into DE2_115_makomk_mod and send you a pull request at some point. Just put together an untested port of them to Xilinx and got an apparently successful synthesis run for the XC6SLX75 at LOOP_LOG2=1 and 50 MHz. This is with some experimental block-RAM-as-shift-register code that I'm not sure will work, though. Will upload the code in a bit.
Edit: Also, the DCM_SP doesn't seem to work quite the way I might expect. Curious.
Edit 2: Aha - with the way this code uses it, the CLKFX_DIVIDE and CLKFX_MULTIPLY settings are effectively irrelevant, as is CLKDV_DIVIDE.
Edit 3: Can't quite get it to hit 100 MHz with LOOP_LOG2=1 on the XC6SLX75 (actual period = 11.045ns). Looks like I'd need to port over the patch for the extra pipeline stage to compute the initial value of H+K+W[0].
Untested 50 MHz code dump here
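To put the reported period in frequency terms, here is a minimal sketch that converts a minimum clock period in nanoseconds to Fmax in MHz and checks it against a target. The helper names are hypothetical; only the 11.045 ns figure and the 50 MHz and 100 MHz targets come from the post above.

```python
def fmax_mhz(period_ns: float) -> float:
    """Convert a minimum clock period in nanoseconds to Fmax in MHz."""
    if period_ns <= 0:
        raise ValueError("period must be positive")
    # 1 / (period in ns) gives GHz, so scale by 1000 to get MHz.
    return 1000.0 / period_ns


def meets_target(period_ns: float, target_mhz: float) -> bool:
    """Return True if the reported period is short enough for the target clock."""
    return fmax_mhz(period_ns) >= target_mhz


if __name__ == "__main__":
    period = 11.045  # ns, as reported by the tools for the XC6SLX75 run
    print(f"Fmax is roughly {fmax_mhz(period):.2f} MHz")
    for target in (50.0, 100.0):
        print(f"meets {target:.0f} MHz: {meets_target(period, target)}")
```

A period of 11.045 ns corresponds to roughly 90.5 MHz, so the run comfortably meets 50 MHz but falls short of 100 MHz, which needs a period of 10 ns or less.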
|
|
|
|
|
fpgaminer (OP)
|
 |
July 22, 2011, 12:49:32 AM |
|
Quote: The LOOP_LOG2>0 code isn't terribly efficient.
It's grotesquely inefficient compared to LOOP_LOG2=0. For the vanilla code (DE2_115_Unoptimized), LOOP_LOG2=1 is almost as big as LOOP_LOG2=0 on Altera. It's terrifying.
Quote: Just put together an untested port of them to Xilinx and got an apparently successful synthesis run for the XC6SLX75 at LOOP_LOG2=1 and 50 MHz. This is with some experimental block-RAM-as-shift-register code that I'm not sure will work, though. Will upload the code in a bit.
Thank you for sharing. I'll muck with the code a bit and try it out on my LX150. I've got my fingers crossed ...
Quote: Edit: Also, the DCM_SP doesn't seem to work quite the way I might expect. Curious. Edit 2: Aha - with the way this code uses it, the CLKFX_DIVIDE and CLKFX_MULTIPLY settings are effectively irrelevant, as is CLKDV_DIVIDE.
The DCM confuses the heck out of me. I really should read its datasheet and get my head straight. Anyway, the code I use is on the public repo, so you can take a look at the DCM that coregen made for my project: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/blob/master/projects/LX150_Test/hdl/main_pll.v
It uses the CLKDV_DIVIDE parameter to take an input 100MHz clock and spit out a 50MHz clock.
Quote: Edit 3: Can't quite get it to hit 100 MHz with LOOP_LOG2=1 on the XC6SLX75 (actual period = 11.045ns).
If it fully unrolls on the LX150 and gets close to 100MHz I will be very happy. That alone will yield higher MHash/s/$ than the current Altera solution, and we can build up from there. Those DSP slices are sitting all lonely and unused.
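Since the DCM parameters caused some confusion, here is a small sketch of how the output frequencies relate to the input clock. It assumes the usual relationships for a DCM: CLK0 runs at the input frequency, CLKDV is the input divided by CLKDV_DIVIDE, and CLKFX is the input multiplied by CLKFX_MULTIPLY and divided by CLKFX_DIVIDE. The class and the example values are illustrative, not taken from main_pll.v, and the legal ranges for each parameter should be checked against the datasheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DcmSettings:
    """Illustrative model of DCM output frequencies (not a vendor API)."""

    clkin_mhz: float
    clkfx_multiply: int
    clkfx_divide: int
    clkdv_divide: float

    def clk0_mhz(self) -> float:
        # CLK0 follows the input clock; no multiply or divide applies.
        return self.clkin_mhz

    def clkdv_mhz(self) -> float:
        # CLKDV depends only on CLKDV_DIVIDE.
        return self.clkin_mhz / self.clkdv_divide

    def clkfx_mhz(self) -> float:
        # CLKFX depends only on CLKFX_MULTIPLY and CLKFX_DIVIDE.
        return self.clkin_mhz * self.clkfx_multiply / self.clkfx_divide


if __name__ == "__main__":
    # Hypothetical settings: a 100 MHz input divided down to 50 MHz.
    dcm = DcmSettings(clkin_mhz=100.0, clkfx_multiply=2,
                      clkfx_divide=4, clkdv_divide=2.0)
    print(f"CLK0  = {dcm.clk0_mhz():.1f} MHz")
    print(f"CLKDV = {dcm.clkdv_mhz():.1f} MHz")
    print(f"CLKFX = {dcm.clkfx_mhz():.1f} MHz")
```

Each parameter only affects its own output in this model, so a setting has no effect on the design unless the matching output is the one actually driving the logic.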
|
|
|
|
|