Bitcoin Forum
Author Topic: Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013)  (Read 432886 times)
Iulius
June 01, 2011, 04:28:49 PM
 #101

I also made two attempts (after converting the code to VHDL):

- a serialized one, where I want to fit the FFs into block RAMs

- a parallel one, with 1 or more FF stages skipped between the remaining stages

The first should be quite good for devices with a lot of small block RAMs but few FFs (old Altera parts) or any other small device; I'm still working on it.

The second is great if your device is too small for 80k but you still want to have a full chain working, because some logic can be shared (64 unrolled stages need only as much logic as ~40 single stages).
With 1 skipped FF stage I need only 40k FF/LUT pairs on a V6 running at 70-80 MHz (not tweaked). The LUT count obviously doesn't decrease much here, so you need either 6/7-input LUTs or a very special device, as most have more FFs than LUTs.


I will try to get a version running at around 15 MH/s on a BeMicro stick ($49) or maybe ~20 MH/s on a BeMicro SDK ($79).

This will make for very easy expansion and easy PC communication. One can easily plug 10 of these into one PC.


I will post the code once it fits well on a specific device.
fpgaminer (OP)
June 02, 2011, 07:49:16 AM
 #102

June 2nd, 2011 - Flexible Unrolling Added: Smaller Device Support

Thanks to the patch submitted by Udif, the code now supports a configurable amount of loop unrolling. The original design was fully unrolled, with 128 total round modules. By adjusting the CONFIG_LOOP_LOG2 Verilog define, you can choose to unroll to 64 round modules, 32, 16, 8, or 4. This makes the design smaller, at the equivalent cost of speed, which should allow it to run on many more FPGAs.

If you're interested in trying the code on a smaller FPGA, open the projects/DE2_115_Unoptimized_Pipelined project in Quartus. Then go to Assignments->Settings->Analysis & Synthesis Settings->Verilog HDL Input. You should see a CONFIG_LOOP_LOG2 macro setting, which you can set from 0 to 5. 0 gives full unrolling (largest, fastest), and 5 gives the smallest design. You will also need to go to Assignments->Device and choose your FPGA, and set the correct clock pin in Assignments->Pin Planner. Then just compile and program!
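
A rough Python sketch of that size/speed trade-off. It assumes that each step of CONFIG_LOOP_LOG2 halves the number of round modules and doubles the clock cycles needed per hash (one reading of "equivalent cost of speed"); `unroll_config` and `estimated_mhs` are made-up helper names and are not part of the repo.

```python
def unroll_config(loop_log2, full_modules=128):
    """Return (round_modules, cycles_per_hash) for a CONFIG_LOOP_LOG2 value.

    Assumption: halving the round modules doubles the cycles per hash.
    """
    if not 0 <= loop_log2 <= 5:
        raise ValueError("CONFIG_LOOP_LOG2 must be between 0 and 5")
    return full_modules >> loop_log2, 1 << loop_log2


def estimated_mhs(clock_mhz, loop_log2):
    """Rough hash rate in MH/s: one hash every cycles_per_hash clock cycles."""
    _, cycles = unroll_config(loop_log2)
    return clock_mhz / cycles


for n in range(6):
    modules, cycles = unroll_config(n)
    print(f"CONFIG_LOOP_LOG2={n}: {modules} round modules, {cycles} cycle(s) per hash")
```

The real Fmax will differ per setting and per device, so treat the MH/s figure as an upper-level estimate only.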

If you would like help adjusting the design for your specific chip and board, just let me know. You just need to know the pin location of a clock source and exactly which Altera FPGA it is (including package and speed grade).

nathanrees19
June 02, 2011, 11:23:07 AM
 #103

+1

Well done!

No luck for DE0-Nano users. With the setting at 3:

Code:
Error: Fitter requires 1397 LABs to implement the project, but the device contains only 1395 LABs

With optimisation set for minimum LEs, it uses 22330. There are 22320 available.

Oh well, it fits with 4 Tongue
TheSeven
June 02, 2011, 02:39:53 PM
 #104

Quote
With optimisation set for minimum LEs, it uses 22330. There are 22320 available.

Oh well, it fits with 4
I bet that you could get rid of those 10 LEs somehow. However, this still wouldn't mean that the design can be successfully routed, and even if it can be, timing performance would be awful. It's likely that you'll get more MH/s with the smaller version.

@fpgaminer: As you apparently seem to know the altera side of things rather well, and as the altera FPGAs seem to be the more cost-effective ones, which FPGA would you suggest for a copacobana-like miner design? Smiley

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY
Adeq
June 02, 2011, 04:10:38 PM
 #105

Has anyone run this miner on a Terasic DE0-Nano board? How many MH/s do you get?
TheSeven
June 02, 2011, 04:28:01 PM
 #106

XC5VLX110T-1FF1136: "Measuring FPGA performance... FPGA running at 119.958388 MH/s"

Disposition
June 02, 2011, 08:35:36 PM
Last edit: June 02, 2011, 08:58:02 PM by Mesmer
 #107

I'm doing some research on this topic and found http://www.heliontech.com/fast_hash.htm

Thoughts?

I am going to email them and request some more detail.
fpgaminer (OP)
June 02, 2011, 08:47:26 PM
 #108

Quote
XC5VLX110T-1FF1136: "Measuring FPGA performance... FPGA running at 119.958388 MH/s"
Very nice  Cool That's a Virtex 5, right? How much utilization are you getting? If you have some spare space you could update your VHDL code to support parameterized unrolling like the latest update and squeeze another hashing core in there.


Quote
With optimisation set for minimum LEs, it uses 22330. There are 22320 available.

Oh well, it fits with 4
Try commenting out line 107, the virtual_wire for "NONC". It isn't currently used and should save a few LEs.

TheSeven
June 02, 2011, 08:57:16 PM
 #109

Quote
I'm doing some research on this topic and found http://www.heliontech.com/fast_hash.htm
Thoughts?
Hm. Might be ~20% faster than mine, according to their datasheet, but will need more difficult handling.

Quote
I am going to email them and request some more detail.
They aren't going to tell you how it works. They want to sell it. Smiley

Quote
XC5VLX110T-1FF1136: "Measuring FPGA performance... FPGA running at 119.958388 MH/s"
Very nice  Cool That's a Virtex 5, right? How much utilization are you getting? If you have some spare space you could update your VHDL code to support parameterized unrolling like the latest update and squeeze another hashing core in there.
Yes, a huge but slow Virtex 5, at 97% slice usage, and getting it to run at 120MHz at these utilization levels was a bit of a challenge.
I might nevertheless add parameterized unrolling to accommodate other people's needs. I'm probably going to release the source code soon, including the Python frontend.

Disposition
June 02, 2011, 08:59:00 PM
 #110

Quote
I am going to email them and request some more detail.
They aren't going to tell you how it works. They want to sell it. Smiley

Haha I know, just wondering about pricing, and maybe I could request a sample or something :3

Also, see below.

http://opencores.org/project,sha_core
fpgaminer (OP)
June 02, 2011, 09:08:05 PM
 #111

Quote
@fpgaminer: As you apparently seem to know the altera side of things rather well, and as the altera FPGAs seem to be the more cost-effective ones, which FPGA would you suggest for a copacobana-like miner design?
It seems like the Xilinx devices are far cheaper, for reasons I can't explain. The only problem is my code doesn't currently support Xilinx. I'm working on it, but until it's done there's no way to know for sure if the utilization and performance is similar.

Back-of-the-napkin:
You'd want to go with, obviously, the chips giving the lowest $ per MH/s figure. The price scaling on the chips is very non-linear. A quick scan over the Altera Cyclone 4 series shows that a Cyclone 4, C40-7N chip has the lowest (best) ratio. It's $82.96 in singles from Altera's website and would produce an estimated 40MH/s (if you can fit a half-mining core in it). That's about $2.07 per MH/s. You'd probably get a small discount if you're buying 128 to build a big array, or if their sales team can price match Xilinx's offerings. 128 of those babies would get you 5GH/s and consume ~320 watts.

For reference, Xilinx's SLX150-3N chips are ~$120 in some quantity, with an estimated performance of 160MH/s. That's $0.75 per MH/s.
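
The back-of-the-napkin numbers can be reproduced with a few lines of Python. Prices and MH/s estimates are the ones quoted in this post; `dollars_per_mhs` is just an illustrative helper name.

```python
def dollars_per_mhs(price_usd, mhs):
    """Cost efficiency of a chip: lower is better."""
    return price_usd / mhs


# (price in USD, estimated MH/s), both taken from the figures in this post
chips = {
    "Cyclone 4 C40-7N": (82.96, 40),
    "Xilinx SLX150-3N": (120.00, 160),
}

for name, (price, mhs) in chips.items():
    print(f"{name}: ${dollars_per_mhs(price, mhs):.2f} per MH/s")

# A 128-chip Cyclone array at the estimated 40 MH/s per chip.
array_mhs = 128 * 40
print(f"128 x C40: {array_mhs / 1000:.2f} GH/s")
```

Both hash rates are estimates, not measurements, so the ratio is only as good as those inputs.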

Quote
Yes, a huge but slow Virtex 5, at 97% slice usage, and getting it to run at 120MHz at these utilization levels was a bit of a challenge.
I might nevertheless add parameterized unrolling to accommodate other people's needs. I'm probably going to release the source code soon, including the Python frontend.
Wow, I'm surprised utilization is 97%, but I guess that's fairly close to the 90K LEs that my unoptimized version uses on Cyclone 3/4. You can apply the usual optimizations (last 3 rounds aren't needed, pre-calced W, etc) to get better utilization but I'm not sure you could cram much into whatever space that would save you. You can also play with the adder trees a bit, moving W + k into the prior rounds to improve MHz performance.

TheSeven
June 02, 2011, 09:18:59 PM
Last edit: June 02, 2011, 09:35:56 PM by TheSeven
 #112

Quote
@fpgaminer: As you apparently seem to know the altera side of things rather well, and as the altera FPGAs seem to be the more cost-effective ones, which FPGA would you suggest for a copacobana-like miner design?
It seems like the Xilinx devices are far cheaper, for reasons I can't explain. The only problem is my code doesn't currently support Xilinx. I'm working on it, but until it's done there's no way to know for sure if the utilization and performance is similar.

Really? I had the opposite impression.
BTW, if you want to, you could host my VHDL version in your git repository as well, for the Xilinx users.

Quote
Back-of-the-napkin:
You'd want to go with, obviously, the chips giving the lowest $ per MH/s figure. The price scaling on the chips is very non-linear. A quick scan over the Altera Cyclone 4 series shows that a Cyclone 4, C40-7N chip has the lowest (best) ratio. It's $82.96 in singles from Altera's website and would produce an estimated 40MH/s (if you can fit a half-mining core in it). That's about $2.07 per MH/s. You'd probably get a small discount if you're buying 128 to build a big array, or if their sales team can price match Xilinx's offerings. 128 of those babies would get you 5GH/s and consume ~320 watts.

For reference, Xilinx's SLX150-3N chips are ~$120 in some quantity, with an estimated performance of 160MH/s. That's $0.75 per MH/s.

Never ever. If you mean the XC6SLX150-3, this isn't going to cross 120MH/s, at least not with this design.
I've tried to synthesize a fully-unrolled (131 pipeline stages) design for that FPGA, and even though it fits (it doesn't on the LX100 variant), there seem to be difficulties routing it, and attempting to synthesize it for 120MHz didn't work out at all. The router is still running (for almost 48 hours now), so I don't even know yet which frequency it reached.

Quote
Yes, a huge but slow Virtex 5, at 97% slice usage, and getting it to run at 120MHz at these utilization levels was a bit of a challenge.
I might nevertheless add parameterized unrolling to accommodate other people's needs. I'm probably going to release the source code soon, including the Python frontend.
Wow, I'm surprised utilization is 97%, but I guess that's fairly close to the 90K LEs that my unoptimized version uses on Cyclone 3/4. You can apply the usual optimizations (last 3 rounds aren't needed, pre-calced W, etc) to get better utilization but I'm not sure you could cram much into whatever space that would save you. You can also play with the adder trees a bit, moving W + k into the prior rounds to improve MHz performance.

This is where you apparently know more about the algorithm than I do. I've just translated your code to VHDL.
Could you provide some details on your optimizations? Is the optimized verilog code available somewhere?
Getting rid of 3 rounds would very likely allow the clock frequency to be increased even further.
EDIT: Now that I think about it, I see how the last rounds could be removed, but I'm fairly certain that the synthesis tool was clever enough to have done that already.

fpgaminer (OP)
June 02, 2011, 10:06:58 PM
 #113

Wow, Terasic apparently noticed this project and posted it on their Facebook page. Neato! And they also gave a shout-out to Bitcoin!  Cheesy

Quote
BTW, if you want to, you could host my VHDL version in your git repository as well, for the Xilinx users.
Sure, that'd be great!

Quote
Never ever. If you mean the XC6SLX150-3, this isn't going to cross 120MH/s, at least not with this design.
I've tried to synthesize a fully-unrolled (131 pipeline stages) design for that FPGA, and even though it fits (it doesn't on the LX100 variant), there seem to be difficulties routing it, and attempting to synthesize it for 120MHz didn't work out at all. The router is still running (for almost 48 hours now), so I don't even know yet which frequency it reached.
You mention 131 pipelined stages. I assume you mean something like, 128 round modules, 2 final adder stages, and a third ... misc stage?

My estimate for the 6SLX150-3N comes from assuming that Xilinx's estimation on equivalent LCs is accurate and that the design would use a similar number of 4-LUTs and FFs on Xilinx, as compared to Altera.

The optimized design can fit into 80K LEs on a Cyclone 3, possibly 75K with some more cramming; at 80MHz and one cycle per full-hash. Assuming LE == LC that would allow two of those to fit into the 6SLX150-3N because it has the equivalent of 150K LCs. That would achieve a combined hash-rate of 160MH/s.
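
As a quick sanity check of that estimate, here is a small Python sketch. It takes "LE == LC" at face value, ignores routing entirely, and uses a hypothetical helper name (`estimate_mhs`); it only shows how sensitive the result is to the 75K versus 80K core size.

```python
def estimate_mhs(device_lcs, core_les, clock_mhz, cycles_per_hash=1):
    """Return (cores that fit, combined MH/s), assuming LE == LC and no routing limits."""
    cores = device_lcs // core_les
    return cores, cores * clock_mhz / cycles_per_hash


# 6SLX150 treated as 150K LCs, core running at 80 MHz, one hash per cycle
print(estimate_mhs(150_000, 75_000, 80))  # core crammed into 75K LEs
print(estimate_mhs(150_000, 80_000, 80))  # core at 80K LEs
```

Under these assumptions two cores only fit at the 75K figure, which is why the extra cramming matters for the 160MH/s number.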

Obviously, you can see that my assumptions all come from "LE == LC." Indeed, when I've synthesized my design for the 6SLX150, utilization was roughly what I expected. But as you pointed out, routing is where I got stuck. ISE refused to route the design. That's on my list of things to do; figure out why ISE won't route the design, even though it will route a cut-down design (only implementing a few rounds instead of all 128) just fine, and utilization is <70%.

Quote
Could you provide some details on your optimizations? Is the optimized verilog code available somewhere?
The code isn't on the public repo yet, no, because it's a work in progress and I haven't taken the time to put it up yet. It's on the list of things to do Wink

There are four classes of optimizations:

1) The last 3 rounds of the second SHA-256 pass are not needed. You only need to check that the final H word (Round64.H plus the initial state's H) is equal to 0, and the last three rounds do not affect H. Here's the math:

Code:
// Round numbers are 1 based, so we go from Round1 to Round64
Round61.E = Round60.D + Round60.H + s1 + ch + k[61] + w[61]
Round62.F = Round61.E
Round63.G = Round62.F
Round64.H = Round63.G

Let's simplify:

Round64.H == Round63.G == Round62.F == Round61.E == Round60.D + Round60.H + s1 + ch + k[61] + w[61]
32'h00000000 == Round64.H + InitialState.H
32'h00000000 == Round60.D + Round60.H + s1 + ch + k[61] + w[61] + InitialState.H
32'h00000000 == Round60.D + Round60.H + s1 + ch + 32'h90befffa + w[61] + 32'h5be0cd19  //K and InitialState are known
32'h136032ED == Round60.D + Round60.H + s1 + ch + w[61]

That allows you to remove 3 full rounds and 9 adders.
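
For anyone who wants to check the constant, here is a minimal Python sketch of the early test. It assumes a plain single-block SHA-256 of a 32-byte message (the shape of the second pass) and compares against hashlib. The names `early_sum` and `TARGET` are made up for illustration, and the round constants are derived from prime roots rather than typed out. Indices in the code are 0-based, so `K[60]` and `w[60]` are the post's k[61] and w[61].

```python
import hashlib
import math
import struct

M = 0xFFFFFFFF


def first_primes(n):
    out, c = [], 2
    while len(out) < n:
        if all(c % p for p in out):
            out.append(c)
        c += 1
    return out


def icbrt(n):
    """Integer cube root by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


# Standard SHA-256 constants: fractional bits of cube/square roots of primes.
P = first_primes(64)
K = [icbrt(p << 96) & M for p in P]
IV = [math.isqrt(p << 64) & M for p in P[:8]]  # IV[7] is InitialState.H


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & M


def big_s1(e):
    return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)


def ch(e, f, g):
    return (e & f) ^ (~e & g)


def round_step(s, k, w):
    a, b, c, d, e, f, g, h = s
    t1 = (h + big_s1(e) + ch(e, f, g) + k + w) & M
    t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) + ((a & b) ^ (a & c) ^ (b & c))) & M
    return ((t1 + t2) & M, a, b, c, (d + t1) & M, e, f, g)


def schedule(block):
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & M)
    return w


# Value the 60-round sum must equal for the final H word to be zero.
TARGET = -(K[60] + IV[7]) & M


def early_sum(msg32):
    """Run only 60 rounds and return Round60.D + Round60.H + s1 + ch + w[61]."""
    block = msg32 + b"\x80" + bytes(23) + (256).to_bytes(8, "big")
    w = schedule(block)
    s = tuple(IV)
    for i in range(60):
        s = round_step(s, K[i], w[i])
    a, b, c, d, e, f, g, h = s
    return (d + h + big_s1(e) + ch(e, f, g) + w[60]) & M


msg = hashlib.sha256(b"example first-pass input").digest()
final_h = int.from_bytes(hashlib.sha256(msg).digest()[28:], "big")
print(hex(TARGET), (early_sum(msg) == TARGET) == (final_h == 0))
```

The comparison against `TARGET` gives the same yes/no answer as checking the last digest word for zero, without running rounds 62 to 64.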


2) Pre-calculating the first few stages of the first SHA-256 pass is also possible. The first pass is calculated from the 512-bit DATA string that getwork requests give us, and the MIDSTATE. Within a piece of work, all of that data remains constant, except for the nonce, which we are incrementing. The nonce is at W[3] (0 based), and W[3] isn't used until Round4. So, you can run the first 3 rounds of SHA-256 on a controller (PC, microcontroller, whatever) before handing the "work" to the FPGA. The FPGA then picks up where you left off and never has to calculate those first 3 rounds.

That amounts to giving the FPGA Midstate, Data, and Midstate', where Midstate' is the 256-bit state as of your pre-calced Round3.
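
A minimal Python sketch of that split, checked against hashlib. It assumes an 80-byte header hashed as plain big-endian SHA-256 words with the nonce in the last 4 bytes (getwork's byte-order quirks are ignored), and the function names `prepare_work` and `finish_hash` are hypothetical.

```python
import hashlib
import math
import struct

M = 0xFFFFFFFF


def first_primes(n):
    out, c = [], 2
    while len(out) < n:
        if all(c % p for p in out):
            out.append(c)
        c += 1
    return out


def icbrt(n):
    """Integer cube root by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


P = first_primes(64)
K = [icbrt(p << 96) & M for p in P]
IV = [math.isqrt(p << 64) & M for p in P[:8]]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & M


def round_step(s, k, w):
    a, b, c, d, e, f, g, h = s
    t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)) + ((e & f) ^ (~e & g)) + k + w) & M
    t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) + ((a & b) ^ (a & c) ^ (b & c))) & M
    return ((t1 + t2) & M, a, b, c, (d + t1) & M, e, f, g)


def schedule(block):
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & M)
    return w


def run(state, w, start, stop):
    """Apply rounds start..stop-1 (0-based) without the final feed-forward."""
    s = tuple(state)
    for i in range(start, stop):
        s = round_step(s, K[i], w[i])
    return s


def feed_forward(prev, s):
    return [(x + y) & M for x, y in zip(prev, s)]


def prepare_work(header):
    """Controller side: returns (Midstate, Midstate'). Never touches the nonce."""
    midstate = feed_forward(IV, run(IV, schedule(header[:64]), 0, 64))
    w_first3 = struct.unpack(">3I", header[64:76])  # W[0..2] of the second block
    return midstate, run(midstate, w_first3, 0, 3)


def finish_hash(midstate, midstate_prime, header):
    """'FPGA' side: the remaining 61 rounds of the second block, then feed-forward."""
    block2 = header[64:80] + b"\x80" + bytes(39) + (640).to_bytes(8, "big")
    out = feed_forward(midstate, run(midstate_prime, schedule(block2), 3, 64))
    return struct.pack(">8I", *out)


header = bytes(range(76)) + struct.pack(">I", 0xDEADBEEF)
mid, mid_prime = prepare_work(header)
print(finish_hash(mid, mid_prime, header) == hashlib.sha256(header).digest())
```

Because `prepare_work` only reads the first 76 bytes, its result can be reused for every nonce of the same work unit.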


3) Quite a large amount of W can be pre-calced off the FPGA as well. Again, Data is constant except for the 4th word, which is the nonce. I won't go into the math here, because it's rather long. The first 16 values of W are sourced directly from Data, so that's a no-brainer. However, also note that all of Data after the first 4 words (after the nonce) is constant for all getwork requests! So you can actually hard code those values, as the code on the public repo does. What that code doesn't do, however, is pre-add K to the known W values, which would save you an adder for the first 16 rounds.

After the first 16 rounds, everything up to round 35 (I think) has some amount of W that can be pre-calculated off the FPGA and given to it with the rest of the work data.


4) The final optimization is this:

Code:
t1 := h + s1 + ch + k[i] + w[i]
Having to add K and W makes the adder tree larger. K is a constant and always known. W can be calculated at any point before a given round. H is also known at least one round ahead of time (as you've learned in the first optimization). So it's possible to do this calculation in the previous round:

Code:
pre-t1 = g + k[i+1] + w[i+1]

and then in the next round:

Code:
t1 := pre-t1 + s1 + ch

That shrinks the critical path down, allowing for higher clock rates. Note that I haven't implemented that particular optimization, so double check that I did my math right.
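
Here is a small Python sketch to double check the algebra. It runs the 64 rounds once in the textbook form and once with the h + k + w part formed one round early, then compares the results. It is a software model only and says nothing about actual timing; the function names are made up, and the round constants are derived rather than typed in.

```python
import math

M = 0xFFFFFFFF


def first_primes(n):
    out, c = [], 2
    while len(out) < n:
        if all(c % p for p in out):
            out.append(c)
        c += 1
    return out


def icbrt(n):
    """Integer cube root by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


P = first_primes(64)
K = [icbrt(p << 96) & M for p in P]
IV = tuple(math.isqrt(p << 64) & M for p in P[:8])


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & M


def big_s1(e):
    return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)


def ch(e, f, g):
    return (e & f) ^ (~e & g)


def t2_of(a, b, c):
    return ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) + ((a & b) ^ (a & c) ^ (b & c))) & M


def rounds_standard(state, w):
    a, b, c, d, e, f, g, h = state
    for i in range(64):
        t1 = (h + big_s1(e) + ch(e, f, g) + K[i] + w[i]) & M
        t2 = t2_of(a, b, c)
        a, b, c, d, e, f, g, h = (t1 + t2) & M, a, b, c, (d + t1) & M, e, f, g
    return (a, b, c, d, e, f, g, h)


def rounds_pre_t1(state, w):
    a, b, c, d, e, f, g, h = state
    pre_t1 = (h + K[0] + w[0]) & M  # seed value for the very first round
    for i in range(64):
        t1 = (pre_t1 + big_s1(e) + ch(e, f, g)) & M  # shorter adder chain
        t2 = t2_of(a, b, c)
        if i < 63:
            # This round's g becomes next round's h, so the next round's
            # h + k + w can already be formed here, in parallel with t1.
            pre_t1 = (g + K[i + 1] + w[i + 1]) & M
        a, b, c, d, e, f, g, h = (t1 + t2) & M, a, b, c, (d + t1) & M, e, f, g
    return (a, b, c, d, e, f, g, h)


# Arbitrary test schedule; any 64 words work for this equivalence check.
w = [(i * 0x9E3779B1 + 12345) & M for i in range(64)]
print(rounds_standard(IV, w) == rounds_pre_t1(IV, w))
```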

TheSeven
June 02, 2011, 11:00:17 PM
 #114

Quote
Wow, Terasic apparently noticed this project and posted it on their Facebook page. Neato! And they also gave a shout-out to Bitcoin!  Cheesy

Nice one, liked it Smiley

Quote
BTW, if you want to, you could host my VHDL version in your git repository as well, for the Xilinx users.
Sure, that'd be great!

http://dl.dropbox.com/u/23683845/fpgaminer-xilinx.zip

Quote
Never ever. If you mean the XC6SLX150-3, this isn't going to cross 120MH/s, at least not with this design.
I've tried to synthesize a fully-unrolled (131 pipeline stages) design for that FPGA, and even though it fits (it doesn't on the LX100 variant), there seem to be difficulties routing it, and attempting to synthesize it for 120MHz didn't work out at all. The router is still running (for almost 48 hours now), so I don't even know yet which frequency it reached.
You mention 131 pipelined stages. I assume you mean something like, 128 round modules, 2 final adder stages, and a third ... misc stage?

2x 64 digesters, 2 final adders and the =0 comparison

Quote
My estimate for the 6SLX150-3N comes from assuming that Xilinx's estimation on equivalent LCs is accurate and that the design would use a similar number of 4-LUTs and FFs on Xilinx, as compared to Altera.

The optimized design can fit into 80K LEs on a Cyclone 3, possibly 75K with some more cramming; at 80MHz and one cycle per full-hash. Assuming LE == LC that would allow two of those to fit into the 6SLX150-3N because it has the equivalent of 150K LCs. That would achieve a combined hash-rate of 160MH/s.

Obviously, you can see that my assumptions all come from "LE == LC." Indeed, when I've synthesized my design for the 6SLX150, utilization was roughly what I expected. But as you pointed out, routing is where I got stuck. ISE refused to route the design. That's on my list of things to do; figure out why ISE won't route the design, even though it will route a cut-down design (only implementing a few rounds instead of all 128) just fine, and utilization is <70%.

I don't think two of those pipelines actually fit in there. I've got 60% slice usage on the LX150, and on the LX100 the placer failed even though it has more slices than slices actually used on the LX150.
Have you tried routing it at a lower clock frequency? Start out with something ridiculous like 10MHz to check if the design can be routed at all, and then increase it until things get tight.
If it doesn't route for 10MHz as well, my suspicion would be that the Spartan series just don't support as flexible routing as the Virtex series do.

Regarding your optimizations, I'm fairly sure that at least part 1 and possibly also part 4 will be performed automatically by xst.

fpgaminer (OP)
June 02, 2011, 11:48:12 PM
 #115

Quote
Regarding your optimizations, I'm fairly sure that at least part 1 and possibly also part 4 will be performed automatically by xst.
I don't trust synthesizers. Tongue I've gotten optimizations by simply moving things into separate modules to clearly define the hierarchy.

Quote
Have you tried routing it at a lower clock frequency? Start out with something ridiculous like 10MHz to check if the design can be routed at all, and then increase it until things get tight.
Yeah, I will have to try that. Last time I tried, it was at 50MHz.

I tried routing 16 rounds only. Worked fine. 32 was fine. But then at 64 it failed. So it's doing something weird ...

Quote
If it doesn't route for 10MHz as well, my suspicion would be that the Spartan series just don't support as flexible routing as the Virtex series do.
I certainly hope not. The Spartan series is supposed to compete with Altera's Cyclone which routes these designs without any problems. Their architectures are certainly different, but if it can't even route a 64-round SHA-256 design on an LX150 then ... I dunno what to think about Xilinx.

Quote
Thank you! It will take me some time to double check it and commit it to the repo; it's a busy week at my job. I'm posting these replies during compilation rounds  Wink

Anyway, this is great input TheSeven, I really appreciate it. It makes me a little more worried about cramming the design into my LX150, but it's all good discussion and I'll keep working at it regardless.

Silverpike
June 03, 2011, 12:17:43 AM
 #116

Quote
3) Quite a large amount of W can be pre-calced off the FPGA as well. Again, Data is constant except for the 4th word, which is the nonce. I won't go into the math here, because it's rather long. The first 16 values of W are sourced directly from Data, so that's a no-brainer. However, also note that all of Data after the first 4 words (after the nonce) is constant for all getwork requests! So you can actually hard code those values, as the code on the public repo does. What that code doesn't do, however, is pre-add K to the known W values, which would save you an adder for the first 16 rounds.

After the first 16 rounds, everything up to round 35 (I think) has some amount of W that can be pre-calculated off the FPGA and given to it with the rest of the work data.

I just wanted to add that this statement is not true. The computation of future rounds of W (above the initial 16 values) uses mixing functions S0() and S1() (in the SHA nomenclature), which are not composable with add arithmetic. So computing S0(W[0]) + A is not the same as computing S0(W[0] + A). The nonce is the value feeding the S0() function, so you need to recompute the entire W block if it changes.

It's a great idea (I tried this too), but it just doesn't work. Wink
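
A tiny Python illustration of the non-composability point, using the message-schedule mixing function (rotate right 7, rotate right 18, shift right 3). The name `small_s0` is just a label chosen here, and x = 1, A = 1 is an arbitrary example pair.

```python
M = 0xFFFFFFFF


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & M


def small_s0(x):
    """Message-schedule mixing function: ROTR7 ^ ROTR18 ^ SHR3."""
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


x, a = 1, 1
mixed_then_added = (small_s0(x) + a) & M   # S0(x) + A
added_then_mixed = small_s0((x + a) & M)   # S0(x + A)
print(hex(mixed_then_added), hex(added_then_mixed))
```

The two values differ, so an addend cannot simply be moved through the mixing function.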
lizthegrey
June 03, 2011, 01:00:10 AM
 #117

Wow, cool! I doubt I'd get anywhere close to reasonable hash rates with my nexys 2 board, but I'm glad that someone has done work to make things xilinx-compatible!
nathanrees19
June 03, 2011, 01:06:25 AM
 #118

Quote
Try commenting out line 107, the virtual_wire for "NONC". It isn't currently used and should save a few LEs.

It worked! There was enough room left to add a blinking LED when it finds the golden nonce. Just.

It seems to be reporting an Fmax of 82.31 MHz. Is this actually safe without a heatsink?
fpgaminer (OP)
June 03, 2011, 01:20:24 AM
 #119

Quote
It worked! There was enough room left to add a blinking LED when it finds the golden nonce. Just.
Grin Yay!

Quote
It seems to be reporting an Fmax of 82.31 MHz. Is this actually safe without a heatsink?
It isn't on the C4-115 chip, but for the tiny C4-22 it might be. Check with PowerPlay. Make sure no heatsink and no fan is selected, and the toggle rate is ~65%. See what it reports for the junction temperature (JT).

Quote
The computation of future rounds of W
Well I will certainly double check my math, but you can most certainly compute some of W after the initial 16. Example (0 indexed):

Code:
w[16] = w[0] + s0(w[1]) + w[9] + s1(w[14])

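All those values are known and do not change during the course of a work unit. The same applies to w[17]. w[18] picks up s0(w[3]), the nonce, so only its nonce-free terms (w[2] + w[11] + s1(w[16])) can be pre-added. I don't have my notes with me for the rest.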
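
To see which schedule words can be handled off-chip, here is a short Python sketch that tracks nonce dependence through the standard recurrence w[i] = w[i-16] + s0(w[i-15]) + w[i-7] + s1(w[i-2]). It assumes the nonce sits at w[3] (0-based), as described earlier in the thread; `classify_schedule` is a hypothetical helper. Note that under this recurrence w[18] already contains s0(w[3]), so it shows up as only partly pre-computable.

```python
NONCE_WORD = 3  # 0-based position of the nonce in the second block


def classify_schedule(nonce_word=NONCE_WORD):
    """Split w[16..63] into nonce-free words and words with some nonce-free terms."""
    tainted = [i == nonce_word for i in range(16)]
    fully, partly = [], []
    for i in range(16, 64):
        # Does each of the four terms of w[i] depend on the nonce?
        terms = [tainted[i - 16], tainted[i - 15], tainted[i - 7], tainted[i - 2]]
        tainted.append(any(terms))
        if not any(terms):
            fully.append(i)    # whole word can be computed off the FPGA
        elif not all(terms):
            partly.append(i)   # only the nonce-free terms can be pre-added
    return fully, partly


fully, partly = classify_schedule()
print("fully pre-computable:", fully)
print("partly pre-computable:", partly)
```

Words that land in neither list have all four terms depending on the nonce, so nothing about them can be prepared ahead of time.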

nathanrees19
June 03, 2011, 02:16:11 AM
 #120

Quote
Check with PowerPlay. Make sure no heatsink and no fan is selected, and the toggle rate is ~65%. See what it reports for the junction temperature (JT).

With a 50 MHz clock and the toggle rate manually set to 65%, it reports 48 °C for the junction temperature. I might just keep it at 50 to be safe.