Iulius
|
|
June 01, 2011, 04:28:49 PM |
|
I also made two attempts (after converting the code to VHDL):
- one serialized, where I want to fit the FFs into block RAMs
- one parallel solution with one or more FF stages skipped between stages
The first should be quite good for devices with a lot of small block RAMs but few FFs (old Altera), or any other small device. I'm still working on it.
The second is great if your device is too small for 80k but you still want a full chain working, because some logic can be shared (64 unrolled stages need only about as much logic as ~40 single stages). With one skipped FF stage I only need 40k FF/LUT pairs on a V6 running at 70-80 MHz (not tweaked). The LUT count obviously doesn't decrease much here, so you either need 6/7-input LUTs or a very special device, as most have more FFs than LUTs.
I will try to get a version running at around 15 MH/s on a BeMicro stick ($49), or maybe ~20 MH/s on a BeMicro SDK ($79).
This will make for very easy expansion and easy PC communication. One can easily plug 10 of these into one PC.
I will post code once it fits well on a specific device.
|
|
|
|
fpgaminer (OP)
|
|
June 02, 2011, 07:49:16 AM |
|
June 2nd, 2011 - Flexible Unrolling Added: Smaller Device Support
Thanks to the patch submitted by Udif, the code now supports a configurable amount of loop unrolling. The original design was fully unrolled, with 128 total round modules. By adjusting the CONFIG_LOOP_LOG2 Verilog define, you can choose to unroll to 64 round modules, 32, 16, 8, or 4. This makes the design smaller, at a proportional cost in speed, which should allow it to run on many more FPGAs.
If you're interested in trying the code on a smaller FPGA, open the projects/DE2_115_Unoptimized_Pipelined project in Quartus. Then go to Assignments->Settings->Analysis & Synthesis Settings->Verilog HDL Input. You should see a CONFIG_LOOP_LOG2 macro setting, which you can set from 0 to 5. 0 gives full unrolling (largest, fastest), and 5 gives the smallest design. You will also need to go to Assignments->Device and choose your FPGA, and set the correct clock pin in Assignments->Pin Planner. Then just compile and program!
If you would like help adjusting the design for your specific chip and board, just let me know. You just need to know the pin location of a clock source, and exactly which Altera FPGA it is (including package and speed grade).
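As a rough illustration of the size/speed trade-off described above, here is a minimal Python sketch of how CONFIG_LOOP_LOG2 would map to the number of round modules. The helper names are made up for this example, and the cycles-per-hash figure is only an assumption (each module reused 2**n times); the real clock rate will differ from device to device.

```python
FULL_ROUND_MODULES = 128  # fully unrolled design, per the post above


def round_modules(config_loop_log2: int) -> int:
    """Round modules instantiated for a given CONFIG_LOOP_LOG2 (0..5)."""
    if not 0 <= config_loop_log2 <= 5:
        raise ValueError("CONFIG_LOOP_LOG2 must be between 0 and 5")
    return FULL_ROUND_MODULES >> config_loop_log2


def cycles_per_hash(config_loop_log2: int) -> int:
    """Assumed clock cycles per full hash: each module is reused 2**n times."""
    return 1 << config_loop_log2


for n in range(6):
    print(f"CONFIG_LOOP_LOG2={n}: {round_modules(n)} round modules, "
          f"~{cycles_per_hash(n)} cycle(s) per hash (assumed)")
```

Under that assumption, a setting of 4 would leave 8 round modules and take about 16 cycles per hash.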
|
|
|
|
nathanrees19
|
|
June 02, 2011, 11:23:07 AM |
|
+1 Well done!
No luck for DE0-Nano users. With the setting at 3:
Error: Fitter requires 1397 LABs to implement the project, but the device contains only 1395 LABs
With optimisation set for minimum LEs, it uses 22330. There are 22320 available.
Oh well, it fits with 4.
|
|
|
|
TheSeven
|
|
June 02, 2011, 02:39:53 PM |
|
Quote from: nathanrees19
+1 Well done! No luck for DE0-Nano users. With the setting at 3: Error: Fitter requires 1397 LABs to implement the project, but the device contains only 1395 LABs. With optimisation set for minimum LEs, it uses 22330. There are 22320 available. Oh well, it fits with 4.

I bet that you could get rid of those 10 LEs somehow. However, this still wouldn't mean that the design can be successfully routed, and even if it can be, timing performance would be awful. It's likely that you'll get more MH/s with the smaller version.
@fpgaminer: As you apparently seem to know the Altera side of things rather well, and as the Altera FPGAs seem to be the more cost-effective ones, which FPGA would you suggest for a COPACOBANA-like miner design?
|
My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY
|
|
|
Adeq
|
|
June 02, 2011, 04:10:38 PM |
|
Has anyone run this miner on a Terasic DE0-Nano board? How many Mhash/s do you get?
|
|
|
|
TheSeven
|
|
June 02, 2011, 04:28:01 PM |
|
XC5VLX110T-1FF1136: "Measuring FPGA performance... FPGA running at 119.958388 MH/s"
|
|
|
|
|
fpgaminer (OP)
|
|
June 02, 2011, 08:47:26 PM |
|
Quote from: TheSeven
XC5VLX110T-1FF1136: "Measuring FPGA performance... FPGA running at 119.958388 MH/s"

Very nice! That's a Virtex 5, right? How much utilization are you getting? If you have some spare space you could update your VHDL code to support parameterized unrolling like the latest update and squeeze another hashing core in there.

Quote from: nathanrees19
With optimisation set for minimum LEs, it uses 22330. There are 22320 available.
Oh well, it fits with 4

Try commenting out line 107, the virtual_wire for "NONC". It isn't currently used and should save a few LEs.
|
|
|
|
TheSeven
|
|
June 02, 2011, 08:57:16 PM |
|
Quote from: Disposition
Quote:
Hm. Might be ~20% faster than mine, according to their datasheet, but will need more difficult handling.
are going to email them and request some more detail.

They aren't going to tell you how it works. They want to sell it.

Quote from: fpgaminer
Quote from: TheSeven
XC5VLX110T-1FF1136: "Measuring FPGA performance... FPGA running at 119.958388 MH/s"
Very nice! That's a Virtex 5, right? How much utilization are you getting? If you have some spare space you could update your VHDL code to support parameterized unrolling like the latest update and squeeze another hashing core in there.

Yes, a huge but slow Virtex 5, at 97% slice usage, and getting it to run at 120MHz at these utilization levels was a bit of a challenge. I might nevertheless add parameterized unrolling to accommodate other people's needs. I'm probably going to release the source code soon, including the Python frontend.
|
|
|
|
Disposition
|
|
June 02, 2011, 08:59:00 PM |
|
Quote from: TheSeven
Quote from: Disposition
are going to email them and request some more detail.
They aren't going to tell you how it works. They want to sell it.

Haha, I know. I'm just wondering about pricing, and maybe I could request a sample or something :3
Also see below: http://opencores.org/project,sha_core
|
|
|
|
fpgaminer (OP)
|
|
June 02, 2011, 09:08:05 PM |
|
Quote from: TheSeven
@fpgaminer: As you apparently seem to know the Altera side of things rather well, and as the Altera FPGAs seem to be the more cost-effective ones, which FPGA would you suggest for a COPACOBANA-like miner design?

It seems like the Xilinx devices are far cheaper, for reasons I can't explain. The only problem is my code doesn't currently support Xilinx. I'm working on it, but until it's done there's no way to know for sure whether the utilization and performance are similar.
Back-of-the-napkin: You'd want to go with, obviously, the chips giving the lowest $ per MH/s figure. The price scaling on the chips is very non-linear. A quick scan over the Altera Cyclone 4 series shows that a Cyclone 4, C40-7N chip has the lowest (best) ratio. It's $82.96 in singles from Altera's website and would produce an estimated 40MH/s (if you can fit a half-mining core in it). That's about $2.07 per MH/s. You'd probably get a small discount if you're buying 128 to build a big array, or if their sales team can price match Xilinx's offerings. 128 of those babies would get you 5GH/s and consume ~320 Watts.
For reference, Xilinx's SLX150-3N chips are ~$120 in some quantity, with an estimated performance of 160MH/s. That's $0.75 per MH/s.

Quote from: TheSeven
Yes, a huge but slow Virtex 5, at 97% slice usage, and getting it to run at 120MHz at these utilization levels was a bit of a challenge. I might nevertheless add parameterized unrolling to accommodate other people's needs. I'm probably going to release the source code soon, including the Python frontend.

Wow, I'm surprised utilization is 97%, but I guess that's fairly close to the 90K LEs that my unoptimized version uses on Cyclone 3/4. You can apply the usual optimizations (last 3 rounds aren't needed, pre-calced W, etc.) to get better utilization, but I'm not sure you could cram much into whatever space that would save you. You can also play with the adder trees a bit, moving W + k into the prior rounds to improve MHz performance.
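The back-of-the-napkin figures above can be reproduced with a few lines of Python. This is only a sketch using the prices and hash-rate estimates quoted in this post; the dictionary layout and function name are made up for illustration, and none of the numbers are measured results.

```python
# (price in USD, estimated MH/s), both taken from the estimates in the post
CHIPS = {
    "Cyclone 4 C40-7N": (82.96, 40.0),
    "Xilinx SLX150-3N": (120.00, 160.0),
}


def usd_per_mhs(price_usd: float, mhs: float) -> float:
    """Cost efficiency: lower is better."""
    return price_usd / mhs


for name, (price, mhs) in CHIPS.items():
    print(f"{name}: ${usd_per_mhs(price, mhs):.3f} per MH/s")

ARRAY_SIZE = 128
array_mhs = ARRAY_SIZE * CHIPS["Cyclone 4 C40-7N"][1]  # 5120 MH/s, roughly 5 GH/s
watts_per_chip = 320 / ARRAY_SIZE                       # implied by the ~320 W estimate
print(f"{ARRAY_SIZE} chips: {array_mhs:.0f} MH/s, ~{watts_per_chip} W per chip")
```

Both estimates depend on whether the core actually fits and routes on the part in question, so the ratio between the two chips is the more useful number than either figure alone.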
|
|
|
|
TheSeven
|
|
June 02, 2011, 09:18:59 PM Last edit: June 02, 2011, 09:35:56 PM by TheSeven |
|
Quote from: fpgaminer
Quote from: TheSeven
@fpgaminer: As you apparently seem to know the Altera side of things rather well, and as the Altera FPGAs seem to be the more cost-effective ones, which FPGA would you suggest for a COPACOBANA-like miner design?
It seems like the Xilinx devices are far cheaper, for reasons I can't explain. The only problem is my code doesn't currently support Xilinx. I'm working on it, but until it's done there's no way to know for sure whether the utilization and performance are similar.

Really? I had the opposite impression.
BTW, if you want to, you could host my VHDL version in your git repository as well, for the Xilinx users.

Quote from: fpgaminer
Back-of-the-napkin: You'd want to go with, obviously, the chips giving the lowest $ per MH/s figure. The price scaling on the chips is very non-linear. A quick scan over the Altera Cyclone 4 series shows that a Cyclone 4, C40-7N chip has the lowest (best) ratio. It's $82.96 in singles from Altera's website and would produce an estimated 40MH/s (if you can fit a half-mining core in it). That's about $2.07 per MH/s. You'd probably get a small discount if you're buying 128 to build a big array, or if their sales team can price match Xilinx's offerings. 128 of those babies would get you 5GH/s and consume ~320 Watts.
For reference, Xilinx's SLX150-3N chips are ~$120 in some quantity, with an estimated performance of 160MH/s. That's $0.75 per MH/s.

Never ever. If you mean the XC6SLX150-3, this isn't going to cross 120MH/s, at least not with this design. I've tried to synthesize a fully-unrolled (131 pipeline stages) design for that FPGA, and even though it fits (it doesn't on the LX100 variant), there seem to be difficulties routing it, and attempting to synthesize it for 120MHz didn't work out at all. The router is still running (for almost 48 hours now), so I don't even know yet which frequency it reached.

Quote from: fpgaminer
Quote from: TheSeven
Yes, a huge but slow Virtex 5, at 97% slice usage, and getting it to run at 120MHz at these utilization levels was a bit of a challenge. I might nevertheless add parameterized unrolling to accommodate other people's needs. I'm probably going to release the source code soon, including the Python frontend.
Wow, I'm surprised utilization is 97%, but I guess that's fairly close to the 90K LEs that my unoptimized version uses on Cyclone 3/4. You can apply the usual optimizations (last 3 rounds aren't needed, pre-calced W, etc.) to get better utilization, but I'm not sure you could cram much into whatever space that would save you. You can also play with the adder trees a bit, moving W + k into the prior rounds to improve MHz performance.

This is where you apparently know more about the algorithm than I do. I've just translated your code to VHDL. Could you provide some details on your optimizations? Is the optimized Verilog code available somewhere? Getting rid of 3 rounds would very likely allow the clock frequency to be increased even further.
EDIT: Now that I think about it, I see how the last rounds could be removed, but I'm fairly certain that the synthesis tool was clever enough to have done that already.
|
|
|
|
fpgaminer (OP)
|
|
June 02, 2011, 10:06:58 PM |
|
Wow, Terasic apparently noticed this project and posted it on their Facebook page. Neato! And they also gave a shout-out to Bitcoin!

Quote from: TheSeven
BTW, if you want to, you could host my VHDL version in your git repository as well, for the Xilinx users.

Sure, that'd be great!

Quote from: TheSeven
Never ever. If you mean the XC6SLX150-3, this isn't going to cross 120MH/s, at least not with this design. I've tried to synthesize a fully-unrolled (131 pipeline stages) design for that FPGA, and even though it fits (it doesn't on the LX100 variant), there seem to be difficulties routing it, and attempting to synthesize it for 120MHz didn't work out at all. The router is still running (for almost 48 hours now), so I don't even know yet which frequency it reached.

You mention 131 pipelined stages. I assume you mean something like 128 round modules, 2 final adder stages, and a third ... misc stage?
My estimate for the 6SLX150-3N comes from assuming that Xilinx's estimate of equivalent LCs is accurate and that the design would use a similar number of 4-LUTs and FFs on Xilinx as on Altera.
The optimized design can fit into 80K LEs on a Cyclone 3, possibly 75K with some more cramming, at 80MHz and one cycle per full hash. Assuming LE == LC, two of those would fit into the 6SLX150-3N because it has the equivalent of 150K LCs. That would achieve a combined hash rate of 160MH/s.
Obviously, you can see that my assumptions all come from "LE == LC." Indeed, when I synthesized my design for the 6SLX150, utilization was roughly what I expected. But as you pointed out, routing is where I got stuck. ISE refused to route the design. That's on my list of things to do: figure out why ISE won't route the design, even though it will route a cut-down design (only implementing a few rounds instead of all 128) just fine, and utilization is <70%.

Quote from: TheSeven
Could you provide some details on your optimizations? Is the optimized verilog code available somewhere?

The code isn't on the public repo yet, no, because it's a work in progress and I haven't taken the time to put it up yet. It's on the list of things to do.
There are four classes of optimizations:
1) The last 3 rounds of the second SHA-256 pass are not needed. You only need to check that the final H word (Round64.H + InitialState.H) is equal to 0, and the last three rounds do not affect H. Here's the math:
// Round numbers are 1 based, so we go from Round1 to Round64
Round61.E = Round60.D + Round60.H + s1 + ch + k[61] + w[61]
Round62.F = Round61.E
Round63.G = Round62.F
Round64.H = Round63.G
Let's simplify:
32'h00000000 == Round64.H + InitialState.H
Round64.H == Round63.G == Round62.F == Round61.E == Round60.D + Round60.H + s1 + ch + k[61] + w[61]
32'h00000000 == Round60.D + Round60.H + s1 + ch + k[61] + w[61] + InitialState.H
32'h00000000 == Round60.D + Round60.H + s1 + ch + 32'h90befffa + w[61] + 32'h5be0cd19 // K and InitialState are known
32'h136032ED == Round60.D + Round60.H + s1 + ch + w[61]
That allows you to remove 3 full rounds and 9 adders.
2) Pre-calculating the first few stages of the first SHA-256 pass is also possible. The first pass is calculated from the 512-bit DATA string that getwork requests give us, and the MIDSTATE. Between pieces of work, all of that data remains constant, except for the nonce, which we are incrementing. The nonce is at W[3] (0 based), and W[3] isn't used until Round4. So, you can run the first 3 rounds of SHA-256 on a controller (PC, microcontroller, whatever) before handing the "work" to the FPGA. The FPGA then picks up where you left off and never has to calculate those first 3 rounds. That amounts to giving the FPGA Midstate, Data, and Midstate', where Midstate' is the 256-bit state as of your pre-calced Round3.
3) Quite a large amount of W can be pre-calced off the FPGA as well. Again, Data is constant except for the 4th word, which is the nonce. I won't go into the math here, because it's rather long. The first 16 values of W are sourced directly from Data, so that's a no-brainer. However, also note that all of Data after the first 4 words (after the nonce) is constant for all getwork requests! So you can actually hard code those values, as they are in the code on the public repo. What that code doesn't do, however, is just add K to the known W values, thus saving you an adder for the first 16 rounds.
After the first 16 rounds, everything up to round 35 (I think) has some amount of W that can be pre-calculated off the FPGA and given to it with the rest of the work data.
4) The final optimization is this:
t1 := h + s1 + ch + k[i] + w[i]
Having to add K and W makes the adder tree larger. K is a constant and always known. W can be calculated at any point before a given round. H is also known at least one round ahead of time (as you've learned in the first optimization).
So it's possible to do this calculation in the previous round:
pre-t1 = g + k[i+1] + w[i+1]
and then in the next round:
t1 := pre-t1 + s1 + ch
That shrinks the critical path down, allowing for higher clock rates. Note that I haven't implemented that particular optimization, so double check that I did my math right.
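Optimizations 1 and 4 can be sanity-checked in software before touching any HDL. The following is a minimal Python sketch, not the miner's code: it builds a plain SHA-256 compression function (constants derived from the standard prime-root definition, then checked against hashlib), stops after round 61 to confirm that the final H word is already determined, and confirms that hoisting g + k[i+1] + w[i+1] into the previous round gives the same state. All function and variable names are hypothetical, and the one-block "abc" message is arbitrary test data.

```python
import hashlib
from math import isqrt

MASK = 0xFFFFFFFF

def first_primes(n):
    found, c = [], 2
    while len(found) < n:
        if all(c % p for p in found):
            found.append(c)
        c += 1
    return found

def icbrt(n):
    """Integer cube root by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Standard definitions: fractional bits of cube/square roots of the first primes
K = [icbrt(p << 96) & MASK for p in first_primes(64)]
IV = [isqrt(p << 64) & MASK for p in first_primes(8)]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def schedule(block):
    w = [int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4)]
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    return w

def rounds(state, w, n_rounds, hoist_kw=False):
    """Run the first n_rounds rounds; hoist_kw=True uses the pre-t1 form."""
    a, b, c, d, e, f, g, h = state
    pre = (h + K[0] + w[0]) & MASK           # pre-t1 for the first round
    for i in range(n_rounds):
        s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        if hoist_kw:
            t1 = (pre + s1 + ch) & MASK
            if i + 1 < len(K):               # current g is next round's h
                pre = (g + K[i + 1] + w[i + 1]) & MASK
        else:
            t1 = (h + s1 + ch + K[i] + w[i]) & MASK
        t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [a, b, c, d, e, f, g, h]

msg = b"abc"  # arbitrary message that fits in one padded block
block = msg + b"\x80" + bytes(55 - len(msg)) + (8 * len(msg)).to_bytes(8, "big")
w = schedule(block)
final = rounds(IV, w, 64)
digest_words = [(s + v) & MASK for s, v in zip(IV, final)]
digest = b"".join(x.to_bytes(4, "big") for x in digest_words)

e61 = rounds(IV, w, 61)[4]                   # Round61.E in 1-based numbering
early_h = (e61 + IV[7]) & MASK               # final H word, three rounds early
target = (-(K[60] + IV[7])) & MASK           # constant from the derivation above
print(digest == hashlib.sha256(msg).digest(), early_h == digest_words[7], hex(target))
```

Note that k[61] in the 1-based numbering of the post corresponds to K[60] here, since the list is 0-based. The check covers only the arithmetic; whether a synthesizer finds these savings on its own is a separate question.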
|
|
|
|
TheSeven
|
|
June 02, 2011, 11:00:17 PM |
|
Quote from: fpgaminer
Wow, Terasic apparently noticed this project and posted it on their Facebook page. Neato! And they also gave a shout-out to Bitcoin!

Nice one, liked it.

Quote from: fpgaminer
Quote from: TheSeven
BTW, if you want to, you could host my VHDL version in your git repository as well, for the Xilinx users.
Sure, that'd be great!

http://dl.dropbox.com/u/23683845/fpgaminer-xilinx.zip

Quote from: fpgaminer
Quote from: TheSeven
Never ever. If you mean the XC6SLX150-3, this isn't going to cross 120MH/s, at least not with this design. I've tried to synthesize a fully-unrolled (131 pipeline stages) design for that FPGA, and even though it fits (it doesn't on the LX100 variant), there seem to be difficulties routing it, and attempting to synthesize it for 120MHz didn't work out at all. The router is still running (for almost 48 hours now), so I don't even know yet which frequency it reached.
You mention 131 pipelined stages. I assume you mean something like 128 round modules, 2 final adder stages, and a third ... misc stage?

2x 64 digesters, 2 final adders and the =0 comparison.

Quote from: fpgaminer
My estimate for the 6SLX150-3N comes from assuming that Xilinx's estimate of equivalent LCs is accurate and that the design would use a similar number of 4-LUTs and FFs on Xilinx as on Altera.
The optimized design can fit into 80K LEs on a Cyclone 3, possibly 75K with some more cramming, at 80MHz and one cycle per full hash. Assuming LE == LC, two of those would fit into the 6SLX150-3N because it has the equivalent of 150K LCs. That would achieve a combined hash rate of 160MH/s.
Obviously, you can see that my assumptions all come from "LE == LC." Indeed, when I synthesized my design for the 6SLX150, utilization was roughly what I expected. But as you pointed out, routing is where I got stuck. ISE refused to route the design. That's on my list of things to do: figure out why ISE won't route the design, even though it will route a cut-down design (only implementing a few rounds instead of all 128) just fine, and utilization is <70%.

I don't think two of those pipelines actually fit in there. I've got 60% slice usage on the LX150, and on the LX100 the placer failed even though it has more slices than were actually used on the LX150.
Have you tried routing it at a lower clock frequency? Start out with something ridiculous like 10MHz to check whether the design can be routed at all, and then increase it until things get tight. If it doesn't route at 10MHz either, my suspicion would be that the Spartan series just doesn't support routing as flexible as the Virtex series does.
Regarding your optimizations, I'm fairly sure that at least part 1 and possibly also part 4 will be performed automatically by xst.
|
|
|
|
fpgaminer (OP)
|
|
June 02, 2011, 11:48:12 PM |
|
Quote from: TheSeven
Regarding your optimizations, I'm fairly sure that at least part 1 and possibly also part 4 will be performed automatically by xst.

I don't trust synthesizers. I've gotten optimizations by simply moving things into separate modules to clearly define the hierarchy.

Quote from: TheSeven
Have you tried routing it at a lower clock frequency? Start out with something ridiculous like 10MHz to check if the design can be routed at all, and then increase it until things get tight.

Yeah, I will have to try that. Last time I tried, it was at 50MHz. I tried routing 16 rounds only. Worked fine. 32 was fine. But then at 64 it failed. So it's doing something weird ...

Quote from: TheSeven
If it doesn't route for 10MHz as well, my suspicion would be that the Spartan series just don't support as flexible routing as the Virtex series do.

I certainly hope not. The Spartan series is supposed to compete with Altera's Cyclone, which routes these designs without any problems. Their architectures are certainly different, but if it can't even route a 64-round SHA-256 design on an LX150 then ... I dunno what to think about Xilinx.

Thank you! It will take me some time to double check it and commit it to the repo; it's a busy week at my job. I'm posting these replies during compilation rounds.
Anyway, this is great input, TheSeven. I really appreciate it. It makes me a little more worried about cramming the design into my LX150, but it's all good discussion and I'll keep working at it regardless.
|
|
|
|
Silverpike
|
|
June 03, 2011, 12:17:43 AM |
|
Quote from: fpgaminer
3) Quite a large amount of W can be pre-calced off the FPGA as well. Again, Data is constant except for the 4th word which is the nonce. I won't go into the math here, because it's rather long. The first 16 values of W are sourced directly from Data, so that's a no-brainer. However, also note that all of Data after the first 4 words (after the nonce) is constant for all getwork requests! So you can actually hard code those values as they are in the code on the public repo. What that code doesn't do, however, is just add K with the known W values, thus saving you an adder for the first 16 rounds.
After the first 16 rounds, everything up to round 35 (I think) has some amount of W that can be pre-calculated off the FPGA and given to it with the rest of the work data.

I just wanted to add that this statement is not true. The computation of future rounds of W (above the initial 16 values) uses the mixing functions S0() and S1() (in the SHA nomenclature), which are not composable with add arithmetic. So computing S0(W[0]) + A is not the same as computing S0(W[0] + A). The nonce is the value feeding the S0() function, so you need to recompute the entire W block if it changes. It's a great idea (I tried this too), but it just doesn't work.
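The non-composability point is easy to check numerically. Below is a minimal Python sketch, assuming S0() refers to the standard SHA-256 message-schedule function (rotate right by 7, rotate right by 18, shift right by 3); the function name and the sample values are chosen only for illustration.

```python
MASK = 0xFFFFFFFF


def rotr(x: int, n: int) -> int:
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def small_sigma0(x: int) -> int:
    """SHA-256 message-schedule mixing function (S0 in the post above)."""
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


x, a = 0x00000001, 0x00000001
add_outside = (small_sigma0(x) + a) & MASK   # S0(x) + a
add_inside = small_sigma0((x + a) & MASK)    # S0(x + a)
print(hex(add_outside), hex(add_inside))     # the two values differ

# XOR, by contrast, does pass through, since rotates and shifts are bitwise
y = 0x12345678
xor_passes = small_sigma0(x ^ y) == small_sigma0(x) ^ small_sigma0(y)
print(xor_passes)
```

This only shows that an addition cannot be moved through S0(); it says nothing about schedule words whose inputs do not involve the nonce at all.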
|
|
|
|
lizthegrey
|
|
June 03, 2011, 01:00:10 AM |
|
Wow, cool! I doubt I'd get anywhere close to reasonable hash rates with my Nexys 2 board, but I'm glad that someone has done the work to make things Xilinx-compatible!
|
|
|
|
nathanrees19
|
|
June 03, 2011, 01:06:25 AM |
|
Quote from: fpgaminer
Try commenting out line 107, the virtual_wire for "NONC". It isn't currently used and should save a few LEs.

It worked! There was enough room left to add a blinking LED when it finds the golden nonce. Just.
It seems to be reporting an Fmax of 82.31 MHz. Is this actually safe without a heatsink?
|
|
|
|
fpgaminer (OP)
|
|
June 03, 2011, 01:20:24 AM |
|
Quote from: nathanrees19
It worked! There was enough room left to add a blinking LED when it finds the golden nonce. Just.

Yay!

Quote from: nathanrees19
It seems to be reporting an Fmax of 82.31 MHz. Is this actually safe without a heatsink?

It isn't on the C4-115 chip, but for the tiny C4-22 it might be. Check with PowerPlay. Make sure no heatsink and no fan is selected, and the toggle rate is ~65%. See what it says the junction temperature (JT) is.

Quote from: Silverpike
The computation of future rounds of W

Well, I will certainly double check my math, but you can most certainly compute some of W after the initial 16. Example (0 indexed):
w[16] = w[0] + s0(w[1]) + w[9] + s1(w[14])
All those values are known and do not change during the course of a work unit. The same applies to w[17] and w[18]. I don't have my notes with me for the rest.
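Which schedule words can be prepared off-chip can be worked out mechanically. The following Python sketch assumes the standard SHA-256 recurrence w[i] = w[i-16] + s0(w[i-15]) + w[i-7] + s1(w[i-2]) and that the nonce is the only changing word, at w[3] (0 indexed). The function name is hypothetical; it counts, for each w[i], how many of its four terms are free of the nonce.

```python
NONCE_INDEX = 3  # assumption from the thread: the nonce sits at w[3], 0 indexed


def nonce_free_terms() -> dict:
    """Map i (16..63) to the number of terms of w[i] that do not depend on the nonce."""
    depends = [i == NONCE_INDEX for i in range(16)]
    free = {}
    for i in range(16, 64):
        sources = (i - 16, i - 15, i - 7, i - 2)
        free[i] = sum(not depends[s] for s in sources)
        depends.append(any(depends[s] for s in sources))
    return free


free = nonce_free_terms()
fully_constant = [i for i, n in free.items() if n == 4]
partly_constant = [i for i, n in free.items() if 0 < n < 4]
print("fully constant per work unit:", fully_constant)
print("partly constant (some terms can be pre-added):", partly_constant)
```

Under these assumptions the sketch reports w[16] and w[17] as fully constant, and w[18] through w[33] as partly constant. Of the three examples named above, w[18] would therefore still need one on-chip term, s0(w[3]), while the other three terms could be summed in advance.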
|
|
|
|
nathanrees19
|
|
June 03, 2011, 02:16:11 AM |
|
Quote from: fpgaminer
Check with PowerPlay. Make sure no heatsink and no fan is selected, and the toggle rate is ~65%. See what it says the junction temperature (JT) is.

With a 50 MHz clock and the toggle rate manually set to 65%, it reports 48 °C for the junction temperature. I might just keep it at 50 to be safe.
|
|
|
|
|