Bitcoin Forum
December 11, 2016, 08:09:47 AM *
Author Topic: FPGA development board "Icarus" - DisContinued/ important announcement  (Read 184376 times)
TheSeven
Hero Member
*****
Offline Offline

Activity: 504


FPGA Mining LLC


View Profile WWW
March 08, 2012, 08:24:27 PM
 #621

I'm referring to how cgminer works.
It gets a timeout, checks the count of how many timeouts and if it has reached some limit it will then go through the process of starting fresh work (that cgminer already has queued ready to go)
Then it starts to write new work down on the Icarus.

I think that's how just about all miners work today (including MPBM).

Are you suggesting that there is some period of time AFTER it starts to write the new work to the Icarus that a valid nonce could be returned?
If so then I guess we could add that to cgminer also, but that time would need to be VERY accurate to ensure the old reader isn't taking a nonce from the new work.

Yes, there is a very small period after starting to upload a job where nonces from the previous job could still be coming in.
After that, there is a longer period (about 5.6ms) during which there will be garbage nonces, if any, because the board is working on a mixture between two jobs that won't yield any sensible result.
After that, normal operation will continue with nonce 0 of the new job.

Icarus hashes a pair of nonces in roughly 6 nanoseconds (11.3s ~= 380MH/s ~= 3ns per nonce if it were a single device = 6ns per pair)
... though I'd be curious to know if the hashing process is a complete cycle per pair or the pairs are stepping through a stepped cycle
i.e. is there some delay before the first nonce-check completes, and then the remaining (2^31 - 1) sequential results are closer together than this initial delay?
I've still not quite got my understanding of that inside FPGA processing clear to me yet.

The FPGAs usually use a pipelined design. I don't know what the exact pipeline depth is, but I'd assume that it's somewhere between 128 and 270 stages.
So generating a full double-hash will need N clock cycles, but N nonces are being processed in parallel. Basically there's a hardware implementation of each sha256 round, and the work bubbles through that chain, one step (sha256 round) per clock cycle. See http://en.wikipedia.org/wiki/Instruction_pipeline for the general idea, except that we have on the order of a hundred sha256 round stages instead of the 5 processor pipeline stages described there.
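
To make the latency-versus-throughput point concrete, here is a minimal Python sketch of a generic pipeline. It is only a toy model, not the Icarus bitstream: the function name run_pipeline and the depth of 4 are made up for illustration, and the real depth is unknown as stated above.

```python
def run_pipeline(depth, cycles):
    """Toy model of a `depth`-stage pipeline fed one nonce per clock.

    Returns a list of (cycle, nonce) pairs, one for each nonce that has
    reached the last stage. Cycles are counted from 1.
    """
    stages = [None] * depth              # stages[0] is the input side
    finished = []
    for cycle in range(1, cycles + 1):
        # On every clock edge each stage takes over its predecessor's
        # work and a fresh nonce enters the first stage.
        stages = [cycle - 1] + stages[:-1]
        if stages[-1] is not None:
            finished.append((cycle, stages[-1]))
    return finished


# Hypothetical 4-stage pipeline observed for 8 clock cycles.
results = run_pipeline(depth=4, cycles=8)
for cycle, nonce in results:
    print(f"cycle {cycle}: result for nonce {nonce}")
```

The first result only appears once the pipe is full (cycle 4 here), and after that one result comes out on every clock. That is the "initial delay, then results closer together" behaviour asked about in the quote.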

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY
Glasswalker
Sr. Member
****
Offline Offline

Activity: 350



View Profile WWW
March 08, 2012, 08:43:38 PM
 #622

Current Icarus code (at least the released stuff) is based on the ZTex code.

The ZTex code has a central core module which has a variable number of stages. The SHA-2 (SHA256) spec calls for 64 stages per hash. But the way bitcoin uses it, it only needs a full hash on one stage, and a partial hash on the other.

So the ZTex code (and therefore the Icarus code) does 64 stages on one core, and 61 stages on the other core, for a total of 125 stages. It has all of those stages fully unrolled so it takes 125 clocks (probably slightly more, haven't looked at the UART code, and controlling logic in depth yet) to fully load the pipeline, after which it runs 1 hash per clock once the pipe is loaded.

The Icarus has 2 FPGAs, each running independent hashing cores, which divide the nonce space between them to split the work (but each operates essentially independent). That's based on my basic understanding of the Verilog source code for it.
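
As a rough illustration of splitting the nonce space and of where the 11.3 second figure in this thread comes from, here is a small Python sketch. The contiguous-halves split and the helper name nonce_ranges are assumptions for illustration; the actual split used by the bitstream may differ. The 380MH/s figure is the one quoted earlier in the thread.

```python
NONCE_SPACE = 2 ** 32        # a nonce is a 32-bit value
NUM_FPGAS = 2                # Icarus has two FPGAs
TOTAL_HASHRATE = 380e6       # hashes per second, figure quoted in the thread


def nonce_ranges(num_devices, space=NONCE_SPACE):
    """Split the nonce space into equal contiguous ranges (assumed scheme)."""
    step = space // num_devices
    return [(i * step, (i + 1) * step - 1) for i in range(num_devices)]


ranges = nonce_ranges(NUM_FPGAS)
sweep_seconds = NONCE_SPACE / TOTAL_HASHRATE   # time to exhaust one job
ns_per_nonce = 1e9 / TOTAL_HASHRATE            # averaged over the whole board

for first, last in ranges:
    print(f"FPGA range: {first:#010x} .. {last:#010x}")
print(f"full sweep: {sweep_seconds:.2f} s, {ns_per_nonce:.2f} ns per nonce")
```

Each FPGA ends up with 2^31 nonces, and the whole board exhausts a job in about 11.3 seconds at that hash rate.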

Edit: At second glance this may or may not be correct... I did say "Basic" understanding lol... My verilog is rusty as hell...

It may actually be fully unrolled so that it's doing the entire hash in a single clock (for a given SHA256 hash) and pipelining the bitcoin (double SHA) hash into 2 stages.

*runs back to look at the code again*

lol Smiley

Just trying to make Bitcoin a Success... One crazy project at a time. (13rwPKskyATcAq3PpnCikfFG8989DQ8M3c)
HashVoodoo Open Source FPGA Mining Bitstream: https://github.com/pmumby/hashvoodoo-fpga-bitcoin-miner
kano
Legendary
*
Offline Offline

Activity: 1932


Linux since 1997 RedHat 4


View Profile
March 08, 2012, 10:01:10 PM
 #623

Current Icarus code (at least the released stuff) is based on the ZTex code.

The ZTex code has a central core module which has a variable number of stages. The SHA-2 (SHA256) spec calls for 64 stages per hash. But the way bitcoin uses it, it only needs a full hash on one stage, and a partial hash on the other.

So the ZTex code (and therefore the Icarus code) does 64 stages on one core, and 61 stages on the other core, for a total of 125 stages. It has all of those stages fully unrolled so it takes 125 clocks (probably slightly more, haven't looked at the UART code, and controlling logic in depth yet) to fully load the pipeline, after which it runs 1 hash per clock once the pipe is loaded.
...
Ignoring the pipeline question I asked, I hope it doesn't do 64 + 61.
(well actually I should say I hope it does do this coz then there is a speed up still available)

The 2nd sha256 is actually just 60.5 - but that is probably what you meant by 61.

The 1st sha256 is 61 also - the first 3 'stages' are exactly the same for all nonce in a range, so repeating them 4 billion times is a waste.
There is also the nonce-constant values of W0-W2 & W4-W15 (W4-W15 are constant over all time)
Then the calculation of W16, W17 is also constant across the nonce range.
(and there are other partial calculations you can do also that are constant across a nonce-range)
Edit: the partial ones are W18 (S0), W19 (S0 and S1, S1 is a constant over all time) W20 (S1 - again a constant over all time) W21 (S1 = 0) W22-W30 (S1) all these partial calculations shouldn't be done 4 billion times if at all possible (and some of the +W values for these are also constants per range or even constants over all time)
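
A quick way to check which message-schedule words depend on the nonce is to expand the schedule for two different nonces and compare. The Python sketch below is not the C program mentioned in this post, just a minimal illustration. It assumes the standard layout of the second 64-byte chunk of an 80-byte block header (W3 is the nonce, W4 is the 0x80000000 padding word, W15 is the 640-bit length); the W0-W2 values are arbitrary placeholders.

```python
MASK = 0xFFFFFFFF


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def s0(x):
    """SHA-256 small sigma 0, used in the message schedule."""
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def s1(x):
    """SHA-256 small sigma 1, used in the message schedule."""
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def schedule(w0, w1, w2, nonce):
    """Expand the second chunk of a block header into W0..W63."""
    # W4 is the padding bit, W5..W14 are zero, W15 is the length in bits.
    w = [w0, w1, w2, nonce, 0x80000000] + [0] * 10 + [640]
    for i in range(16, 64):
        w.append((s1(w[i - 2]) + w[i - 7] + s0(w[i - 15]) + w[i - 16]) & MASK)
    return w


# Placeholder header words; only the nonce differs between the two runs.
a = schedule(0x11111111, 0x22222222, 0x33333333, nonce=0)
b = schedule(0x11111111, 0x22222222, 0x33333333, nonce=1)

nonce_independent = [i for i in range(64) if a[i] == b[i]]
print("unchanged schedule words:", nonce_independent[:17])
```

W16 and W17 come out identical for both nonces, while W18 is the first expanded word that changes, because it picks up S0 of W3.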

Edit2: I wrote a C program many months ago to analyse the double sha256 and optimise it (and spit out an optimised C program to calculate it - that works) and that's where I get that info from - but I know it is correct coz - as I said, the output code works.
I did this for my own understanding of what optimisations there are ... and of course found them all for the normal double sha256 Smiley
If you could actually fit in doing 2 nonce at a time in one chip there are also some more partial calculations across each pair of nonce (that I started working on with my code but didn't finish due to there being no actual use in the results at the time)

Pool: https://kano.is BTC: 1KanoiBupPiZfkwqB7rfLXAzPnoTshAVmb
CKPool and CGMiner developer, IRC FreeNode #ckpool and #cgminer kanoi
Help keep Bitcoin secure by mining on pools with Stratum, the best protocol to mine Bitcoins with ASIC hardware
TheSeven
Hero Member
*****
Offline Offline

Activity: 504


FPGA Mining LLC


View Profile WWW
March 08, 2012, 10:26:30 PM
 #624

The 1st sha256 is 61 also - the first 3 'stages' are exactly the same for all nonce in a range, so repeating them 4 billion times is a waste.
There is also the nonce-constant values of W0-W2 & W4-W15 (W4-W15 are constant over all time)
Then the calculation of W16, W17 is also constant across the nonce range.
(and there are other partial calculations you can do also that are constant across a nonce-range)
Edit: the partial ones are W18 (S0), W19 (S0 and S1, S1 is a constant over all time) W20 (S1 - again a constant over all time) W21 (S1 = 0) W22-W30 (S1) all these partial calculations shouldn't be done 4 billion times if at all possible (and some of the +W values for these are also constants per range or even constants over all time)

The synthesis tools usually do a rather good job at removing logic with constant output values. So while this may not be true for the nonce-dependent ones, most of the all time constant ones have probably already been caught automatically.

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY
kano
Legendary
*
Offline Offline

Activity: 1932


Linux since 1997 RedHat 4


View Profile
March 08, 2012, 10:39:16 PM
 #625

The 1st sha256 is 61 also - the first 3 'stages' are exactly the same for all nonce in a range, so repeating them 4 billion times is a waste.
There is also the nonce-constant values of W0-W2 & W4-W15 (W4-W15 are constant over all time)
Then the calculation of W16, W17 is also constant across the nonce range.
(and there are other partial calculations you can do also that are constant across a nonce-range)
Edit: the partial ones are W18 (S0), W19 (S0 and S1, S1 is a constant over all time) W20 (S1 - again a constant over all time) W21 (S1 = 0) W22-W30 (S1) all these partial calculations shouldn't be done 4 billion times if at all possible (and some of the +W values for these are also constants per range or even constants over all time)

The synthesis tools usually do a rather good job at removing logic with constant output values. So while this may not be true for the nonce-dependent ones, most of the all time constant ones have probably already been caught automatically.
A lot of it is nonce dependent - so not doing that is a BIG waste.
Also, even the ATI OpenCL compiler sux at doing this so I wouldn't be surprised if the tool is poor at optimisation.

As I said in my "Edit2:" above, I did this with C.
On top of all that - using gcc -O2 over the resulting code made a massive speed difference also - something close to running at twice the speed (though that was probably the optimisation of C to assembler)
And the -O2 made doing some of the code optimisations pointless since gcc worked them out itself

Pool: https://kano.is BTC: 1KanoiBupPiZfkwqB7rfLXAzPnoTshAVmb
CKPool and CGMiner developer, IRC FreeNode #ckpool and #cgminer kanoi
Help keep Bitcoin secure by mining on pools with Stratum, the best protocol to mine Bitcoins with ASIC hardware
TheSeven
Hero Member
*****
Offline Offline

Activity: 504


FPGA Mining LLC


View Profile WWW
March 08, 2012, 11:33:20 PM
 #626

Edit2: I wrote a C program many months ago to analyse the double sha256 and optimise it (and spit out an optimised C program to calculate it - that works) and that's where I get that info from - but I know it is correct coz - as I said, the output code works.
I did this for my own understanding of what optimisations there are ... and of course found them all for the normal double sha256 Smiley
If you could actually fit in doing 2 nonce at a time in one chip there are also some more partial calculations across each pair of nonce (that I started working on with my code but didn't finish due to there being no actual use in the results at the time)

I'm not sure if that would make things any better. The wall that the HDL people are currently hitting seems to be mostly routing congestion, not really logic slices yet. Spartan6 routing must be awful. And this idea doesn't really sound like it would improve on that Smiley

As I said in my "Edit2:" above, I did this with C.
On top of all that - using gcc -O2 over the resulting code made a massive speed difference also - something close to running at twice the speed (though that was probably the optimisation of C to assembler)
And the -O2 made doing some of the code optimisations pointless since gcc worked them out itself

Running without -O tells the compiler to literally do what you say, i.e. forbids that kind of optimization (and also writes all kinds of variables to the stack for no good reason, resulting in even more slowdown). -O1 vs. -O2 vs. -O3 vs. -Os might be more interesting than comparing with no -O option at all.

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY
kano
Legendary
*
Offline Offline

Activity: 1932


Linux since 1997 RedHat 4


View Profile
March 08, 2012, 11:44:43 PM
 #627

Going from -O2 to -O3 did nothing.

But my point there was that the gcc compiler is VERY good at optimisation - and applying that optimisation makes a big difference in CPU land.

However, those nonce-range optimisations simply mean doing that work once rather than 2^32 times (assuming the controller that distributes the work to the 2 chips does the setup work)
So without them you are wasting something like 2% ... or using that approximate figure on an Icarus: 98% = 380MH/s, then 100% = around 388MH/s
All very rough but certainly worth doing - since it doesn't increase the power usage or the amount of effort for the Icarus, it simply increases the MH/s
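
The rough figure above can be reproduced in a couple of lines of Python. The 2% waste is the approximate estimate from this post, not a measured value, and the variable names are just for illustration.

```python
measured_mhs = 380.0      # Icarus hash rate quoted in the thread, in MH/s
wasted_fraction = 0.02    # rough estimate of redundant work from this post

# If the measured rate is only 98% of what the logic could deliver,
# the remaining 2% is what the optimisation would recover.
potential_mhs = measured_mhs / (1 - wasted_fraction)
print(f"potential hash rate: {potential_mhs:.1f} MH/s")
```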

Pool: https://kano.is BTC: 1KanoiBupPiZfkwqB7rfLXAzPnoTshAVmb
CKPool and CGMiner developer, IRC FreeNode #ckpool and #cgminer kanoi
Help keep Bitcoin secure by mining on pools with Stratum, the best protocol to mine Bitcoins with ASIC hardware
allinvain
Legendary
*
Offline Offline

Activity: 2002



View Profile
March 09, 2012, 12:41:45 AM
 #628

I am getting 0% invalid shares on 6 boards! I am using MPBM with jobinterval set to 11.3!

By the way I was wrong! the 11->11.3 range is as valuable as any 0.3 seconds within the 11.3 seconds range!

So is there a consensus that setting jobinterval to 11.3 results in the _best_ performance for the Icarus board?


Glasswalker
Sr. Member
****
Offline Offline

Activity: 350



View Profile WWW
March 09, 2012, 03:07:45 AM
 #629

Quick update, after re-reading the Verilog, looks like it is pipelining it (and you're right, 61 stages on both parts, he has some special cases in there, there is also a core doing full 64 stage pipe, but I am not sure what that's for lol, only going over it roughly right now)

I'm intrigued to hear more about your optimizations, since I'm writing my own verilog. Once I get it working and able to calculate hashes (slowly) I'll go over optimizing it, and then things like your suggestions could help quite a bit.

Just trying to make Bitcoin a Success... One crazy project at a time. (13rwPKskyATcAq3PpnCikfFG8989DQ8M3c)
HashVoodoo Open Source FPGA Mining Bitstream: https://github.com/pmumby/hashvoodoo-fpga-bitcoin-miner
TheSeven
Hero Member
*****
Offline Offline

Activity: 504


FPGA Mining LLC


View Profile WWW
March 09, 2012, 07:57:36 AM
 #630

I am getting 0% invalid shares on 6 boards! I am using MPBM with jobinterval set to 11.3!

By the way I was wrong! the 11->11.3 range is as valuable as any 0.3 seconds within the 11.3 seconds range!

So is there a consensus that setting jobinterval to 11.3 results in the _best_ performance for the Icarus board?

I think there is a consensus amongst basically everyone but Energizer that it doesn't. Exactly 11.3 seconds is indeed the sweet spot, but the effective interval will always be a little bit longer than the one calculated at that line of code that Energizer pointed at, there's a bit of jitter due to various reasons.
While the penalty for going lower (and thus adding a bit of a safety margin) is pretty much zero, the penalty for exceeding those 11.3 seconds is huge. That's why the defaults should be fine, and you'll need to hack up the code to change that (jobinterval settings above 8 seconds in the configuration file will just be ignored).
See this post for details: https://bitcointalk.org/index.php?topic=51371.msg780603#msg780603
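
A simplified Python model of why overshooting hurts so much more than undershooting. It assumes that once the board has swept its whole nonce range (about 11.3 seconds) it does no further useful work until the next job arrives, and it ignores the small job upload overhead. Both are simplifications, and useful_fraction is a made-up helper, not anything in MPBM.

```python
SWEEP_SECONDS = 11.3   # time for the board to exhaust one job (from the thread)


def useful_fraction(job_interval):
    """Fraction of wall-clock time spent on fresh nonces (toy model)."""
    if job_interval <= 0:
        raise ValueError("job_interval must be positive")
    # Below the sweep time every second is useful; above it the time
    # after the sweep has finished is assumed to be wasted.
    return min(1.0, SWEEP_SECONDS / job_interval)


for interval in (8.0, 11.0, 11.3, 12.0, 15.0):
    print(f"jobinterval {interval:5.1f} s -> {useful_fraction(interval):.1%} useful")
```

In this model any interval up to 11.3 seconds costs nothing, while every bit of jitter beyond it is lost hashing time, which is the reason for keeping a safety margin.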

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY
kano
Legendary
*
Offline Offline

Activity: 1932


Linux since 1997 RedHat 4


View Profile
March 09, 2012, 08:33:36 AM
 #631

Quick update, after re-reading the Verilog, looks like it is pipelining it (and you're right, 61 stages on both parts, he has some special cases in there, there is also a core doing full 64 stage pipe, but I am not sure what that's for lol, only going over it roughly right now)

I'm intrigued to hear more about your optimizations, since I'm writing my own verilog. Once I get it working and able to calculate hashes (slowly) I'll go over optimizing it, and then things like your suggestions could help quite a bit.
Well here's the output of my code (which is of course fully unrolled) before I started messing with trying to do 2 nonce at the same time.
It also doesn't do the partial Wn calculations, but it's easy to see them.

http://pastebin.com/sxdVSJF1

That has all 3 sha256()'s in it since the first one is the midstate calculation.
Also note that the last sha256() has a lot of constants at the start (that my code determined) that may also not be in the Icarus version (I don't know)
My code worked out constants and converted them to their values.

That code will run and find shares correctly.
It's not perfect in terms of register usage or optimisation of partial calculations, but otherwise it's pretty close to complete.

Pool: https://kano.is BTC: 1KanoiBupPiZfkwqB7rfLXAzPnoTshAVmb
CKPool and CGMiner developer, IRC FreeNode #ckpool and #cgminer kanoi
Help keep Bitcoin secure by mining on pools with Stratum, the best protocol to mine Bitcoins with ASIC hardware
allinvain
Legendary
*
Offline Offline

Activity: 2002



View Profile
March 09, 2012, 09:09:13 AM
 #632

I am getting 0% invalid shares on 6 boards! I am using MPBM with jobinterval set to 11.3!

By the way I was wrong! the 11->11.3 range is as valuable as any 0.3 seconds within the 11.3 seconds range!

So is there a consensus that setting jobinterval to 11.3 results in the _best_ performance for the Icarus board?

I think there is a consensus amongst basically everyone but Energizer that it doesn't. Exactly 11.3 seconds is indeed the sweet spot, but the effective interval will always be a little bit longer than the one calculated at that line of code that Energizer pointed at, there's a bit of jitter due to various reasons.
While the penalty for going lower (and thus adding a bit of a safety margin) is pretty much zero, the penalty for exceeding those 11.3 seconds is huge. That's why the defaults should be fine, and you'll need to hack up the code to change that (jobinterval settings above 8 seconds in the configuration file will just be ignored).
See this post for details: https://bitcointalk.org/index.php?topic=51371.msg780603#msg780603

Thank you for clearing that up for me. That settles it for me, I will leave things as they are. I typically see 0.1% invalids which IMHO is _good_ .

TheSeven
Hero Member
*****
Offline Offline

Activity: 504


FPGA Mining LLC


View Profile WWW
March 09, 2012, 09:40:39 AM
 #633

Quick update, after re-reading the Verilog, looks like it is pipelining it (and you're right, 61 stages on both parts, he has some special cases in there, there is also a core doing full 64 stage pipe, but I am not sure what that's for lol, only going over it roughly right now)

I'm intrigued to hear more about your optimizations, since I'm writing my own verilog. Once I get it working and able to calculate hashes (slowly) I'll go over optimizing it, and then things like your suggestions could help quite a bit.

Actually twice that many pipeline stages (relevant for latency), because each sha256 round is split into two pipeline stages in the ztex core.

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY
Glasswalker
Sr. Member
****
Offline Offline

Activity: 350



View Profile WWW
March 09, 2012, 01:35:52 PM
 #634

Actually twice that many pipeline stages (relevant for latency), because each sha256 round is split into two pipeline stages in the ztex core.

Really? Since I'm trying to get my head around the code anyway, can you elaborate on this? I'm not seeing it in the code I'm looking at for sha256_pipes2.v

I see the main sha256_pipe2_base module, which seems to generate the 64 SHA stages,

Then I see pipe130 (which instantiates sha256_pipe2_base with 64 stages and does a single pass)

Then I see pipe123 (which instantiates sha256_pipe2_base with 61 stages and only seems to output a single 32bit word of hash)

Then I see pipe129 (which instantiates sha256_pipe2_base with 64 stages and does a single pass and outputs a full 256bit hash)

the top module seems to instantiate sha256_pipe130 and sha256_pipe123 (as p1 and p2)

I don't see anywhere where the sha cores are split? (but as I said before, my verilog is pretty rusty, and since I'm trying to brush up and write my own sha core, if you can help me out with what I'm misinterpreting I'd appreciate it) Wink

Thanks!

Just trying to make Bitcoin a Success... One crazy project at a time. (13rwPKskyATcAq3PpnCikfFG8989DQ8M3c)
HashVoodoo Open Source FPGA Mining Bitstream: https://github.com/pmumby/hashvoodoo-fpga-bitcoin-miner
ngzhang
Hero Member
*****
Offline Offline

Activity: 592


We will stand and fight.


View Profile
March 09, 2012, 02:14:33 PM
 #635

 Grin

hi, i'm sorry for disappearing and for not answering many mails over the last few days.
i got a box of boards yesterday and i'm busy testing them.



i must finish some bulk orders before 3/12.

so please have a nice day, my friends. Grin

CEO of Canaan-creative, Founder of Avalon project.
https://canaan.io/
Business contact: love@canaan.io
All PMs will be unread.
Turbor
Legendary
*
Offline Offline

Activity: 1008


BitMinter


View Profile WWW
March 09, 2012, 03:30:06 PM
 #636

FPGA sex Tongue

TheSeven
Hero Member
*****
Offline Offline

Activity: 504


FPGA Mining LLC


View Profile WWW
March 09, 2012, 06:10:56 PM
 #637

Actually twice that many pipeline stages (relevant for latency), because each sha256 round is split into two pipeline stages in the ztex core.

Really? Since I'm trying to get my head around the code anyway, can you elaborate on this? I'm not seeing it in the code I'm looking at for sha256_pipes2.v

I see the main sha256_pipe2_base module, which seems to generate the 64 SHA stages,

Then I see pipe130 (which instantiates sha256_pipe2_base with 64 stages and does a single pass)

Then I see pipe123 (which instantiates sha256_pipe2_base with 61 stages and only seems to output a single 32bit word of hash)

Then I see pipe129 (which instantiates sha256_pipe2_base with 64 stages and does a single pass and outputs a full 256bit hash)

the top module seems to instantiate sha256_pipe130 and sha256_pipe123 (as p1 and p2)

I don't see anywhere where the sha cores are split? (but as I said before, my verilog is pretty rusty, and since I'm trying to brush up and write my own sha core, if you can help me out with what I'm misinterpreting I'd appreciate it) Wink

Thanks!

I've never really known any verilog (I like VHDL much better), but this looks like the sha256_pipe2_base module consists of two pipeline stages:

Code:
for (i = 0; i <= STAGES; i = i + 1) begin : S

reg [511:0] data;
reg [223:0] state;
reg [31:0] t1_p1;
That's the first set of pipeline registers
Code:
if(i == 0)
begin
[...]
end else
begin

reg [511:0] data_buf;
reg [223:0] state_buf;
reg [31:0] data15_p1, data15_p2, data15_p3, t1;
That's the second set of pipeline registers
Code:
always @ (posedge clk)
begin
data_buf <= S[i-1].data;
Just copy the input data in the first stage
Code:
data[479:0] <= data_buf[511:32];
data15_p1 <= `S1( S[i-1].data[`IDX(15)] ); // 3
data15_p2 <= data15_p1; // 1
data15_p3 <= ( ( i == 1 ) ? `S1( S[i-1].data[`IDX(14)] ) : S[i-1].data15_p2 ) + S[i-1].data[`IDX(9)] + S[i-1].data[`IDX(0)]; // 3
data[`IDX(15)] <= `S0( data_buf[`IDX(1)] ) + data15_p3; // 4
Do the actual calculations in the second stage
Code:
state_buf <= S[i-1].state; // 2
Just copy the input data in the first stage
Code:
t1 <= `CH( S[i-1].state[`IDX(4)], S[i-1].state[`IDX(5)], S[i-1].state[`IDX(6)] ) + `E1( S[i-1].state[`IDX(4)] ) + S[i-1].t1_p1; // 6

state[`IDX(0)] <= `MAJ( state_buf[`IDX(0)], state_buf[`IDX(1)], state_buf[`IDX(2)] ) + `E0( state_buf[`IDX(0)] ) + t1; // 7
state[`IDX(1)] <= state_buf[`IDX(0)]; // 1
state[`IDX(2)] <= state_buf[`IDX(1)]; // 1
state[`IDX(3)] <= state_buf[`IDX(2)]; // 1
state[`IDX(4)] <= state_buf[`IDX(3)] + t1; // 2
state[`IDX(5)] <= state_buf[`IDX(4)]; // 1
state[`IDX(6)] <= state_buf[`IDX(5)]; // 1
Do the actual calculations in the second stage
Code:

t1_p1 <= state_buf[`IDX(6)] + data_buf[`IDX(1)] + Ks[`IDX((127-i) & 63)]; // 2
end

end
end

The synthesis software will then do some register balancing and move part of the logic from the second to the first stage in order to equalize delays between those two stages and thus achieve a higher clock rate because the individual stages' critical path delay is reduced.
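
The effect of register balancing on the achievable clock can be illustrated with a tiny Python calculation. The delay numbers are invented for illustration and are not taken from any timing report, and the perfectly even split is an idealisation.

```python
def max_clock_mhz(stage_delays_ns):
    """The clock period is limited by the slowest stage's critical path."""
    return 1000.0 / max(stage_delays_ns)


# Hypothetical delays: a copy-only first stage and a logic-heavy second one.
unbalanced = [1.0, 7.0]
# Same total delay after moving logic across the register boundary.
balanced = [4.0, 4.0]

print(f"unbalanced: {max_clock_mhz(unbalanced):.0f} MHz")
print(f"balanced:   {max_clock_mhz(balanced):.0f} MHz")
```

The total amount of logic is unchanged, but the slowest stage gets shorter, so the whole pipeline can be clocked faster.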

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY
Glasswalker
Sr. Member
****
Offline Offline

Activity: 350



View Profile WWW
March 09, 2012, 06:32:33 PM
 #638

Ooh! *drool*

I think I see several in that box with my name on it! lol

Just trying to make Bitcoin a Success... One crazy project at a time. (13rwPKskyATcAq3PpnCikfFG8989DQ8M3c)
HashVoodoo Open Source FPGA Mining Bitstream: https://github.com/pmumby/hashvoodoo-fpga-bitcoin-miner
Glasswalker
Sr. Member
****
Offline Offline

Activity: 350



View Profile WWW
March 09, 2012, 06:45:17 PM
 #639

In Verilog, a for loop inside a generate block gets synthesized out into multiple blocks of logic (think of it as a fast way to instantiate chunks of logic multiple times over).

So when he's copying data from registers in S[i-1] to the current registers you're right he's moving it from the previous pipeline stage to the current pipeline stage. But that for loop instantiates the number of stages in the pipe as STAGES. (so 64 by default). That's the full 64 stage sha pipeline. Each individual block within a stage doesn't seem to be split further.

At least that's what I got out of his method by reading the code, and it's how I've built mine Wink

Just trying to make Bitcoin a Success... One crazy project at a time. (13rwPKskyATcAq3PpnCikfFG8989DQ8M3c)
HashVoodoo Open Source FPGA Mining Bitstream: https://github.com/pmumby/hashvoodoo-fpga-bitcoin-miner
Energizer
Sr. Member
****
Offline Offline

Activity: 274



View Profile
March 09, 2012, 07:15:57 PM
 #640

I would be grateful if someone with good FPGA programming experience answers this question:

Is it possible to make use of both clock edges to improve the mining speed?

For example: replacing always@(posedge CLK) by always@(posedge CLK or negedge CLK)

Zhang has told me that this would lead to a disaster! I am still wondering if it's possible to use a double-edged clock design at a lower clock rate (100 -> 133 MHz)!