Phew, finally tracked down the bug. The K and K_next wires in sha256_transform.HASHERS were not getting the right values. K was easy:
assign K = Ks_mem[(NUM_ROUNDS/LOOP)*cnt+i];
K_next is a little more complicated, because it has to use cnt differently for each of the HASHERS. For example, if LOOP_LOG2=1, then K_next in HASHERS[1] needs to alternate between Ks_mem[34] and Ks_mem[2], when cnt=0 and cnt=1 respectively. In HASHERS[2], K_next alternates between Ks_mem[3] and Ks_mem[35] respectively. It's a little weird, but it makes sense. When LOOP_LOG2=1, HASHERS[0] alternates between doing fresh work (cnt=0, Round 0) and doing old work (cnt=1, Round 32): new, old, new, old, etc. Since HASHERS[1] is directly connected to HASHERS[0], it alternates as well, but in the opposite fashion: it gets old work, then new work, old, new, old, new, etc.
I threw a quick hack into my code for K_next (it's on the public repo) that only works for LOOP_LOG2=1:
if(i & 1) assign K_next = Ks_mem[(NUM_ROUNDS/LOOP)*!cnt[0]+i+1];
else assign K_next = Ks_mem[(NUM_ROUNDS/LOOP)*cnt+i+1];
Off the top of my head, I think this will work in the general case:
assign K_next = Ks_mem[(NUM_ROUNDS/LOOP)*((cnt+i) & (LOOP-1)) +i+1];
That basically says: adjust cnt by our position in the HASHERS chain. I can try it out later and check. With that fix, the code works correctly in ModelSim, and it works on live hardware. So my LX150T dev board finally gets 25MH/s of performance. Progress! I may run a compile of LOOP_LOG2=0 overnight and see if that finishes. I'm finally making pleasant progress with the LX150 because of your hard work, makomk, so thank you.
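Here is a quick Python model of the K_next index math, just to sanity-check the expressions above. It is only a sketch: NUM_ROUNDS=64 is assumed from the Ks_mem[34]/Ks_mem[2] example, the function names are made up for illustration, and it only shows that the general expression agrees with the LOOP_LOG2=1 hack. It says nothing about whether the general expression is right for larger LOOP values.

```python
NUM_ROUNDS = 64  # assumed: 64 SHA-256 rounds, so NUM_ROUNDS/LOOP = 32 when LOOP_LOG2=1


def k_next_hack(i, cnt):
    """Index used by the LOOP_LOG2=1 hack: odd HASHERS use the inverted cnt bit."""
    stride = NUM_ROUNDS // 2
    phase = ((cnt & 1) ^ 1) if (i & 1) else cnt  # models Verilog !cnt[0]
    return stride * phase + i + 1


def k_next_general(i, cnt, loop_log2):
    """Index from the proposed general expression: shift cnt by chain position i."""
    loop = 1 << loop_log2
    stride = NUM_ROUNDS // loop
    return stride * ((cnt + i) & (loop - 1)) + i + 1


# Indices seen by HASHERS[1] and HASHERS[2] as cnt goes 0, 1.
for i in (1, 2):
    print(i, [k_next_general(i, cnt, 1) for cnt in (0, 1)])
```

For HASHERS[1] this gives indices 34 and 2, and for HASHERS[2] it gives 3 and 35, matching the values described above.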
|
|
|
I got your code rolled into an LX150T project with ChipScope (JTAG) for the "virtual wires." It compiles just fine at 50MHz and LOOP_LOG2=1. However, it does not deliver correct results when run on the live chip. The results are consistent, and the chip reads 54C at the surface, so I don't think the timing is wrong, nor is it overheating. I will run the code through ModelSim and see if I can track down the bug. https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/LX150_makomk_TestSide
Note: I tried out ISE's power analysis tool. I've never used it before, and I don't have a post P&R simulation vector to give it, so I just set some over-estimated toggling values. 120 FF toggle, 100% BRAM usage, I think. It reported something like 1.4W. Seems a bit low to me ... any idea if that seems reasonable to you?
Bigger, from what I remember. These changes get it down to about 51k LEs for LOOP_LOG2=1 on Cyclone IV, which isn't great but...
That's great progress in my book. Well done, makomk! The rolled up designs are important as well, because they can help fill out large chips.
they appear to do not-so-great things to the routing and placement... ಠ_ಠ
|
|
|
The highest-performance working FPGA miner is currently based on Altera FPGAs. Altera did a fantastic job protecting their FPGAs; they are absolutely 100% immune to all bitstream encryption attacks ... mostly because Altera FPGAs don't actually have encrypted bitstreams. Jokes aside, thank you for sharing the news!
|
|
|
The most the manufacturer could do is something like designing the chips to fail to ever find solutions that exceed a particular difficulty. To check for these kinds of defects, someone would have to imagine each particular subtle flaw and test for it.
I don't see what benefit that would have to the manufacturer, unless they created a competing company that sucked up all the business after the "scam" was discovered, essentially getting people to buy the same chip twice ... which is actually ingenious. MWAHAHAHA. Anyway, you'd just feed the chip Difficulty 1 work to work around the "feature". The chip should be operating on Difficulty 1 work anyway, because there's no reason to have the extra logic in the chip to check for other difficulties.
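To make the Difficulty 1 point concrete, here is a rough Python sketch of the split I have in mind. It assumes the chip does only a cheap "top 32 bits of the hash are zero" test and the host software compares against the real target. The function names are made up for illustration, and the quick test is slightly looser than the exact Difficulty 1 target, so the host check still matters.

```python
import hashlib

# Difficulty 1 target: 32 zero bits, then 0xFFFF, then zeros (256 bits total).
DIFF1_TARGET = 0xFFFF << 208


def bitcoin_hash(header: bytes) -> bytes:
    """Double SHA-256 of a block header."""
    return hashlib.sha256(hashlib.sha256(header).digest()).digest()


def quick_check(digest: bytes) -> bool:
    """Hardware-style test: the most significant 32 bits of the hash are zero.

    Bitcoin compares the digest as a little-endian number, so those bits
    are the last four bytes of the digest.
    """
    return digest[28:32] == b"\x00\x00\x00\x00"


def meets_target(digest: bytes, target: int) -> bool:
    """Host-side test against an arbitrary target (any difficulty)."""
    return int.from_bytes(digest, "little") <= target


digest = bitcoin_hash(b"\x00" * 80)  # dummy 80-byte header, not real work
print(quick_check(digest), meets_target(digest, DIFF1_TARGET))
```

With this split, a chip that quietly refused to report high-difficulty solutions would have to do extra work to hide them, since it never needs to know the real target in the first place.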
|
|
|
The LOOP_LOG2>0 code isn't terribly efficient.
It's grotesquely inefficient compared to LOOP_LOG2=0. For the vanilla code (DE2_115_Unoptimized), LOOP_LOG2=1 is almost as big as LOOP_LOG2=0 on Altera. It's terrifying.
Just put together an untested port of them to Xilinx and got an apparently successful synthesis run for the XC6SLX75 at LOOP_LOG2=1 and 50 MHz. This is with some experimental block-RAM-as-shift-register code that I'm not sure will work, though. Will upload the code in a bit.
Thank you for sharing. I'll muck with the code a bit and try it out on my LX150. I've got my fingers crossed ...
Edit: Also, the DCM_SP doesn't seem to work quite the way I might expect. Curious. Edit 2: Aha - with the way this code uses it, the CLKFX_DIVIDE and CLKFX_MULTIPLY settings are effectively irrelevant, as is CLKDV_DIVIDE.
The DCM confuses the heck out of me. I really should read its datasheet and get my head straight. Anyway, the code I use is on the public repo, so you can take a look at the DCM that coregen made for my project: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/blob/master/projects/LX150_Test/hdl/main_pll.v
It uses the CLKDV_DIVIDE parameter to take an input 100MHz clock and spit out a 50MHz clock.
Edit 3: Can't quite get it to hit 100 MHz with LOOP_LOG2=1 on the XC6SLX75 (actual period = 11.045ns).
If it fully unrolls on the LX150 and gets close to 100MHz I will be very happy. That alone will yield higher MHash/s/$ than the current Altera solution, and we can build up from there. Those DSP slices are sitting all lonely and unused.
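For anyone following the numbers, here is a small Python sketch of the clock arithmetic in this post. The assumptions: CLKDV_DIVIDE is 2 (inferred from 100MHz in and 50MHz out), and the design produces one hash every 2^LOOP_LOG2 clock cycles, which is consistent with the 25MH/s at 50MHz and LOOP_LOG2=1 figure from my LX150T. The function names are made up for illustration.

```python
def clkdv_mhz(clkin_mhz, clkdv_divide):
    """DCM CLKDV output: the input clock divided by CLKDV_DIVIDE."""
    return clkin_mhz / clkdv_divide


def period_ns_to_mhz(period_ns):
    """Convert a timing report period in ns to a maximum clock in MHz."""
    return 1000.0 / period_ns


def hashrate_mhs(clock_mhz, loop_log2):
    """Assumed throughput: one hash per 2**loop_log2 clock cycles."""
    return clock_mhz / (1 << loop_log2)


print(clkdv_mhz(100, 2))                      # clock fed to the miner
fmax = period_ns_to_mhz(11.045)               # XC6SLX75 result from Edit 3
print(round(fmax, 2), round(hashrate_mhs(fmax, 1), 2))
print(hashrate_mhs(100, 0))                   # fully unrolled at 100MHz
```

So the 11.045ns period works out to roughly 90.5MHz, or about 45MH/s at LOOP_LOG2=1 under that assumption, while a fully unrolled design at 100MHz would be 100MH/s.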
|
|
|
Just to let you know, I just managed to route a fully-unrolled doubly-pipelined miner at 50MHz on the LX150 overnight, so it's definitely doable. As I don't have access to an LX150 I haven't verified its correctness yet, but even if it has some bugs, I don't think fixing them would affect timing much. I might try to improve it during the weekend.
Is the code available anywhere? I'd be happy to compile and run it on my board. I'm guessing it's a modified version of your VHDL port? I tried compiling a version of my code with LOOP_LOG2=1 and two extra pipeline registers, with Register Balancing enabled. It couldn't even get past Mapping; it failed with an area constraint error.
|
|
|
I haven't done much of anything in Verilog yet, so maybe it would be a good exercise for me to port it over to Verilog... (no promises have been made, emphasis on maybe).
Don't sweat it, I was just letting you know your contributions are welcome. And just so you know, there are VHDL ports of the mining core on the public repo. So if you are curious to study how the mining core works, you can choose either variant to learn from; they follow the same ideas.
|
|
|
Correct, I used DE2_115_makomk_mod, changed the FPGA type and the clock pin, CONFIG_LOOP_LOG2 was 0.
Did you set the correct voltage for the clock pin? Is the clock 50MHz? If it is not 50MHz, you will have to adjust the sdc file, so Quartus knows the correct speed of the clock. What dev board is it, by the way?
Because the virtual_wires are the same I thought that I can still use the tcl scripts from your "original" design !?!?!? Is this not a good idea?
That is correct, the tcl mining script should work fine, as long as you're using the latest version: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/scripts/mine
Do you publish also a version where CONFIG_LOOP_LOG2 can be different from 0?
The unoptimized version works with any setting of CONFIG_LOOP_LOG2. makomk's personal repository also has a version of his code that works with any setting of CONFIG_LOOP_LOG2, but as I pointed out there isn't much point to using it if CONFIG_LOOP_LOG2 isn't 0 because it basically reverts to the unoptimized version.
|
|
|
Just a quick progress report. I dug back into my Spartan-6 LX150, updated its code to the version that can roll up, and used a CONFIG_LOOP_LOG2 setting of 5; I just wanted to get something compiled and working. And after waiting long enough, it did indeed churn out a valid result! So ... progress! I am going to clean up the project and get it into the public repo. From there I will crank down the LOOP_LOG2 setting as low as it will go, and begin adding extra pipelining to see if it will push further.
|
|
|
*sigh* I have better things to do with my time than deal with this. My personal opinion: Nothing indicates to me this is real. Until further notice, I do not advise purchasing this product.
We used fpga miner as a starting base to develop our design. When we said it is 'compact' we meant it fits into 45K gates.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

If you used the Open Source FPGA Bitcoin Mining project as your starting base, I am making you publicly legally aware that that project is covered entirely by the terms of the GNU General Public License. By using any part of the Open Source FPGA Bitcoin Mining project, you must conform to the legal obligations set forth in the GNU GPL. If you do not conform to the legal obligations set forth in the GNU GPL, you will be prosecuted in all applicable legal jurisdictions.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (MingW32)

iQIcBAEBAgAGBQJOJiDFAAoJEFFoGj2A5YKR5uIQAJ6OyAziHZAPpwHzc3RYhy+2
d/p0zGb/+fPQjrjtvGVCai15JL6yC1B4SZgJ584SQNIATl9w16K8994DaI/7Wh+R
uWGzti+E1qI+Z5oJU3K+ku9RwSTL3Bu/aqJgu/csYdibOQO2A4zwAeFSE6mqim/4
DEpfNbUU4Ca1BhR754I+aFHQnY6/3tDhp3U6dl1clDAxs+aTkFRItLdNjlrTUjAW
Tqt9+ei3ZovSK6ceYYOzI84qY5Lh48QVthGAoiGOK3YC/ggtL+JC/1vut3ZGNcy8
1UNhn4LWwLn+gi/A3Ge146SBxBvGDOdXb7SD+6nRlGXFIKKIoAGMW4nzhXbEHXQu
wInXUMnQCaCAmrXLom35+HFBmo7InzWeqPUxDqpW0DpWN6RfufDH8fryIRLdfz9T
BH/PY7QRoZub51B13NEPU1kEv365ADIDgPvribZsF+14wxRO2RGr/JzgBaCQDZEo
1sQnIsNI+iSJm7tqdAxYrDA8t+Bb/+IImTf0K8Vjk4mG+NHhEXxIkVGgh4+Aso0U
AnaTTlY0A0xEXth6H6wkbOKSbNkT+gsSzbWNVT1hFCpzJuUXyHoN3+JvRtnXBzJh
i/No2AFBC6LyAPxzMOwS7QzcmIx1dnebXJ812w6gCB5T/R3Jvtf5JX5TQr698vTy
Es4gk6uQYSR9nFgPCjS6
=QRTV
-----END PGP SIGNATURE-----
our asic has 45K logical gates and gathers 500 Mhash\sec at 120MHz and one cycle per full-hash.
SHA-256, by its very nature, cannot be optimized. (1) The Open Source FPGA Bitcoin Miner uses 75K Altera Logic Elements to perform a full Bitcoin hash per clock cycle. This translates to somewhere between 225K gates and 975K gates. You say you get ~4 Bitcoin hashes per clock cycle from 45K gates. This is massively incorrect on your part.
(1) For curious parties, SHA-256 itself cannot be optimized. Optimization of SHA-256 is akin to an attack on the algorithm itself. You can, however, provide application specific implementations that exploit constant data in various ways. This applies to Bitcoin, as it does have some constant data fed into the hashing function, but only to a limited extent. The 75K Logic Elements figure I quoted above is already taking advantage of a good majority of these optimizations.
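The arithmetic behind that objection, as a short Python sketch. The inputs are the figures quoted above; the one assumption I'm adding is that gate count scales linearly with hashes per clock cycle, which is a rough rule of thumb and not a measured result.

```python
CLAIMED_HASHRATE = 500e6          # hashes per second claimed for the ASIC
CLAIMED_CLOCK = 120e6             # claimed clock in Hz
CLAIMED_GATES = 45_000            # claimed logic gate count

LES_PER_HASH_PER_CLOCK = 75_000   # Altera LEs for one full hash per clock
GATES_FOR_75K_LES = (225_000, 975_000)  # low and high gate estimates

# How many full Bitcoin hashes per clock cycle the claim implies.
hashes_per_clock = CLAIMED_HASHRATE / CLAIMED_CLOCK

# Gate-per-LE conversion factors implied by the 225K..975K range.
gates_per_le = tuple(g / LES_PER_HASH_PER_CLOCK for g in GATES_FOR_75K_LES)

# Assumption: area scales linearly with hashes per clock.
gates_needed = tuple(g * hashes_per_clock for g in GATES_FOR_75K_LES)
shortfall = tuple(g / CLAIMED_GATES for g in gates_needed)

print(f"hashes per clock: {hashes_per_clock:.2f}")
print(f"gates per LE: {gates_per_le[0]:.0f} to {gates_per_le[1]:.0f}")
print(f"gates needed: {gates_needed[0]:,.0f} to {gates_needed[1]:,.0f}")
print(f"claim is short by a factor of {shortfall[0]:.0f}x to {shortfall[1]:.0f}x")
```

Even at the low end of the gate estimate, the claimed 45K gates is well over an order of magnitude short under that scaling assumption.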
|
|
|
Has anyone got the current version on github (fpgaminer + makomk modifications) successfully running in an FPGA? It compiles without problems for an EP4SGX230 device but gives me only Stales, no Shares. The "original" fpgaminer version runs without problems ...
Are you using projects/DE2_115_makomk_mod/ with CONFIG_LOOP_LOG2 set to something other than 0? It must be set to 0 for the version on my public repo. The version on makomk's repo will work with other settings, but it basically just reverts to the unoptimized version if you do that so there's little point in using it unless CONFIG_LOOP_LOG2 is 0. You also mentioned in your PM that you're using an "old" mining tcl script. How old? It was updated a few weeks ago to reflect changes in the code.
|
|
|
udif and makomk, I do not have donation addresses for you. If you would like your donation address added to the first post with the Contributors list, and in the github project's README.md, please contact me and let me know.
|
|
|
July 17th, 2011 - Code Updates and Minor Cleanup
teknohog's Xilinx Verilog port on the public repo has been updated. teknohog's serial modifications to makomk's code have been added as a separate project. OrphanedGland's port to Stratix devices, using VHDL, has been merged into the public repo. To top it all off, I updated the project's main README.md file, to prominently include a list of contributors and their donation addresses, because they deserve recognition for their hard work. I will modify the first post in this thread to include the same list.
As it wasn't mentioned before on the first post, I am mentioning here that makomk made improvements to my base Verilog code. These changes improved both the overall performance of the design, and its area consumption, allowing the design to fit on a smaller, cheaper EP4CE75 chip. Great work makomk!
|
|
|
Well, did you know a real Altera USB-Blaster is $300? Highway robbery. At least Terasic is only $50.
Xilinx's is about the same price. Surprisingly, though, my $8 knock-off USB-Blaster works better than the $50 Terasic one. The Terasic one wouldn't program some on-board flash chips I had, but the cheapo $8 one did.
|
|
|
Quick note: I didn't know much about JTAG's specifics, so I found this tutorial on it to be very helpful. It gives it in terms of a PC controlling a parallel port adapter, with C code.
|
|
|
I think it's moving between states because I use the Debug tool to force it. That's why TCK doesn't have to toggle. In fact, maybe it's not supposed to and then TDI is my only problem.
State won't change without a cycle of TCK. From the Device Handbook you linked: Transitions in the state machine occur on the rising edge of TCK. How are you measuring the pins? With a multimeter or logic analyzer? If you're using a multimeter, perhaps Quartus leaves TCK high when idle, so when you tell it to change state it quickly toggles TCK low and then high, which you wouldn't be able to see on a multimeter. The same may be true of TDI; Quartus holding it low when idle, and only changing it when it needs to perform an operation. My educated guess is that TDI is the only issue, because TCK, TDO, and TMS are all needed to get the IDCODE, unless I've read the process of reading IDCODE wrong. Hooking TDI, TCK, and TMS from your USB-Blaster to a logic analyzer would reveal if it is correctly switching those signals, and narrow the problem down.
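To illustrate the "state won't change without a cycle of TCK" point, here is a small Python model of the standard JTAG TAP controller. The transition table is the usual 16-state IEEE 1149.1 state machine; the TapModel class and its drive method are made-up names for this sketch, and it models nothing about a particular Altera device.

```python
# state -> (next state when TMS=0, next state when TMS=1)
TAP = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle": ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan": ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR": ("Shift-DR", "Exit1-DR"),
    "Shift-DR": ("Shift-DR", "Exit1-DR"),
    "Exit1-DR": ("Pause-DR", "Update-DR"),
    "Pause-DR": ("Pause-DR", "Exit2-DR"),
    "Exit2-DR": ("Shift-DR", "Update-DR"),
    "Update-DR": ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan": ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR": ("Shift-IR", "Exit1-IR"),
    "Shift-IR": ("Shift-IR", "Exit1-IR"),
    "Exit1-IR": ("Pause-IR", "Update-IR"),
    "Pause-IR": ("Pause-IR", "Exit2-IR"),
    "Exit2-IR": ("Shift-IR", "Update-IR"),
    "Update-IR": ("Run-Test/Idle", "Select-DR-Scan"),
}


class TapModel:
    """Tracks the TAP state from the TCK and TMS pin levels."""

    def __init__(self):
        self.state = "Test-Logic-Reset"
        self.tck = 0

    def drive(self, tck, tms):
        """Apply pin levels; the state only advances on a 0 -> 1 edge of TCK."""
        if self.tck == 0 and tck == 1:
            self.state = TAP[self.state][tms]
        self.tck = tck


tap = TapModel()
tap.drive(1, 0)                 # one rising edge with TMS=0
print(tap.state)
for tms in (1, 0, 1, 0):        # wiggle TMS while TCK stays high
    tap.drive(1, tms)
print(tap.state)                # unchanged, no rising edge occurred
```

In the model, wiggling TMS while TCK sits at one level never moves the state, which is why a TCK that truly never toggles can't be walking the TAP through its states.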
|
|
|
I wonder if this change would be a net improvement: On each 'getwork' call, check if the oldest block is stale. If so, delete it and do the check one more time. Delete up to two stale transactions per 'getwork' call. This would amortize the cost of emptying over many calls to 'getwork'.
This still dumps the costs on the very important 'getwork' calls right after a block is found when LP triggers a mess of them. Maybe we should exempt the first 75% (estimated) of calls after a new block is found and free up to 5 on each call?
It sounds like this is forming into a kind of garbage collection. That being the case, you'd want to run your "garbage collection" only when there is time available. If bitcoind detects that it doesn't have any getworks to process, and nothing else to do, it could then spend some time picking up the trash, so to speak. How smart is bitcoind about storing the transaction lists for the possible new blocks? I imagine most of the time they're exactly the same, except for the generation transaction. Could it store a list of unique transaction lists, excluding the generation transaction, instead? Then the blocks would only need a reference to a transaction list, along with their specific extraNonce, instead of each block carrying duplicates of transaction lists that only differ by their extraNonce.
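Here is a rough Python sketch of the two ideas together: candidate blocks share one stored transaction list and carry only a reference plus their extraNonce, and stale blocks are freed a few at a time. This is not how bitcoind actually stores anything; the WorkStore class, its methods, and the "stale means built for an older height" rule are all assumptions for illustration.

```python
from collections import deque


class WorkStore:
    """Toy store for getwork candidate blocks with shared transaction lists."""

    def __init__(self):
        self.shared = {}       # tx tuple -> [shared tuple, reference count]
        self.blocks = deque()  # oldest first: (height, shared tuple, extra_nonce)

    def add_block(self, height, txs, extra_nonce):
        key = tuple(txs)
        entry = self.shared.setdefault(key, [key, 0])
        entry[1] += 1
        # The block keeps a reference to the shared list, not its own copy.
        self.blocks.append((height, entry[0], extra_nonce))

    def collect_stale(self, current_height, limit=2):
        """Free up to `limit` of the oldest blocks built for an older height."""
        freed = 0
        while self.blocks and freed < limit and self.blocks[0][0] < current_height:
            _, txs, _ = self.blocks.popleft()
            entry = self.shared[txs]
            entry[1] -= 1
            if entry[1] == 0:
                del self.shared[txs]  # last reference gone, drop the list
            freed += 1
        return freed


store = WorkStore()
for nonce in range(5):
    store.add_block(100, ["tx_a", "tx_b"], nonce)
print(len(store.blocks), len(store.shared))   # five blocks, one stored list
print(store.collect_stale(101, limit=2))      # amortized: a couple per call
print(store.collect_stale(101, limit=100))    # idle time: clear the rest
```

The small limit would be used on a busy getwork path, and the large one when the server has nothing else to do.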
|
|
|
This seems odd. Are you sure you're measuring the right pins? It's getting late, but I can't see how TDO could be toggling if TCK is stuck... I agree. There's no way to get an IDCODE without toggling TCK. You should have a pull-down resistor on TCK. If you disconnect the USB-Blaster, is Pin 1, TCK, low? And is TDI being pulled high with the USB-Blaster disconnected? I'm skeptical of my USB-blaster, because I bought it cheap on eBay. It may not be working right. That would explain why TDI is stuck, but I don't think it explains TCK. Do you have a bit-banging device to drive those pins manually from your PC? Maybe even switches and LEDs on a breadboard to create your own little bit-banging device. You could manually test JTAG that way and see if the problem was with your USB-Blaster.
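If you do build a bit-banging rig, the IDCODE read is short enough to sketch. Below is a Python version run against a fake target so it can be tried without hardware. Assumptions: the FakeTarget class, its clock/tdo methods, and the IDCODE value are made up for illustration; the TAP table is reduced to the data-register path; and it relies on the standard behavior that a device implementing IDCODE selects it after Test-Logic-Reset. On real hardware, clock() and tdo() would be replaced with code that drives TMS/TDI, pulses TCK, and samples TDO.

```python
IDCODE = 0x12345679  # made-up value; bit 0 of a real IDCODE is always 1

# Reduced TAP table: state -> (next if TMS=0, next if TMS=1). IR path not modeled.
NEXT = {
    "reset": ("idle", "reset"),
    "idle": ("idle", "select_dr"),
    "select_dr": ("capture_dr", "select_ir"),
    "select_ir": (None, "reset"),
    "capture_dr": ("shift_dr", "exit1_dr"),
    "shift_dr": ("shift_dr", "exit1_dr"),
    "exit1_dr": ("pause_dr", "update_dr"),
    "pause_dr": ("pause_dr", "exit2_dr"),
    "exit2_dr": ("shift_dr", "update_dr"),
    "update_dr": ("idle", "select_dr"),
}


class FakeTarget:
    """Stands in for the chip: one call to clock() is one full TCK pulse."""

    def __init__(self):
        self.state = "reset"
        self.shift = 0
        self.clocks = 0

    def clock(self, tms, tdi=0):
        if self.state == "capture_dr":
            self.shift = IDCODE                      # capture on the rising edge
        elif self.state == "shift_dr":
            self.shift = (self.shift >> 1) | (tdi << 31)
        nxt = NEXT[self.state][tms]
        if nxt is None:
            raise NotImplementedError("IR path is not modeled")
        self.state = nxt
        self.clocks += 1

    def tdo(self):
        return self.shift & 1


def read_idcode(target):
    for _ in range(5):            # five clocks with TMS=1 force Test-Logic-Reset
        target.clock(1)
    for tms in (0, 1, 0, 0):      # Idle, Select-DR, Capture-DR, then into Shift-DR
        target.clock(tms)
    value = 0
    for bit in range(32):
        value |= target.tdo() << bit            # least significant bit comes out first
        target.clock(1 if bit == 31 else 0)     # leave Shift-DR on the last bit
    target.clock(1)               # Exit1-DR -> Update-DR
    target.clock(0)               # back to idle
    return value


target = FakeTarget()
print(hex(read_idcode(target)), target.clocks)
```

Every step of the read is a TCK pulse, so if you step through this by hand with switches and LEDs you can see exactly which pin stops behaving.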
|
|
|
|