FPGA development board "Icarus" - DisContinued/ important announcement

TheSeven

Hero Member

Activity: 504
Merit: 500

FPGA Mining LLC

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 08, 2012, 08:24:27 PM

#621

Quote from: kano on March 08, 2012, 01:00:15 PM

I'm referring to how cgminer works.
It gets a timeout, checks the count of how many timeouts and if it has reached some limit it will then go through the process of starting fresh work (that cgminer already has queued ready to go)
Then it starts to write new work down on the Icarus.

I think that's how just about all miners work today (including MPBM).

Quote from: kano on March 08, 2012, 01:00:15 PM

Are you suggesting that there is some period of time AFTER it starts to write the new work to the Icarus that a valid nonce could be returned?
If so then I guess we could add that to cgminer also, but that time would need to be VERY accurate to ensure the old reader isn't taking a nonce from the new work.

Yes, there is a very small period after starting to upload a job where nonces from the previous job could still be coming in.
After that, there is a longer period (about 5.6ms) during which there will be garbage nonces, if any, because the board is working on a mixture of two jobs that won't yield any sensible result.
After that, normal operation will continue with nonce 0 of the new job.
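A back-of-the-envelope check of the "about 5.6ms" window, as a sketch only. The thread does not state the serial parameters, so the baud rate (115200), the 8N1 framing and the 64-byte job size below are all assumptions; with those values the upload time happens to land on roughly the same figure.

```python
# Rough estimate of how long one job upload occupies the serial link.
# ASSUMPTIONS (not stated in this thread): 115200 baud, 8N1 framing
# (10 bits on the wire per byte) and a 64-byte job payload.
BAUD_RATE = 115_200
BITS_PER_BYTE_ON_WIRE = 10  # 1 start bit + 8 data bits + 1 stop bit
JOB_BYTES = 64


def upload_time_ms(job_bytes=JOB_BYTES, baud=BAUD_RATE):
    """Time in milliseconds needed to push one job over the UART."""
    return job_bytes * BITS_PER_BYTE_ON_WIRE / baud * 1000.0


print(f"{upload_time_ms():.2f} ms")  # prints "5.56 ms"
```

If the real link parameters differ, only the three constants need changing.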

Quote from: kano on March 08, 2012, 01:00:15 PM

Icarus hashes a pair of nonce in roughly 6 nanoseconds (11.3s ~= 380MH/s ~= 3ns per nonce if it was a single device = 6ns per pair)
... though I'd be curious to know if the hashing process is a complete cycle per pair or the pairs are stepping though a stepped cycle
i.e. is there some delay before the first nonce-check completes, and then the remaining (2^31 - 1) sequential results are closer together than this initial delay?
I've still not quite got my understanding of that inside FPGA processing clear to me yet.

The FPGAs usually use a pipelined design. I don't know what the exact pipeline depth is, but I'd assume that it's somewhere between 128 and 270 stages.
So generating a full double-hash will need N clock cycles, but N nonces are being processed in parallel. Basically there's a hardware implementation of each sha256 round, and the work bubbles through that chain, one step (sha256 round) per clock cycle. See http://en.wikipedia.org/wiki/Instruction_pipeline for the general idea, except that we have a hundred or more sha256 round stages instead of the 5 processor pipeline stages described there.
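To make the latency vs. throughput point concrete, here is a small behavioural model of such a pipeline. It is a toy sketch, not the Icarus logic: the stages do no hashing at all, they only pass a nonce along, and the depth of 128 is just an illustrative value from the range guessed above.

```python
from collections import deque


def run_pipeline(depth, cycles):
    """Feed one nonce per clock into a `depth`-stage pipeline.

    Returns a list of (cycle, nonce) pairs, one for each result that
    leaves the last stage. The nonce fed in at a given cycle is simply
    the cycle number.
    """
    # Index 0 is the first stage, index -1 the last one.
    stages = deque([None] * depth, maxlen=depth)
    results = []
    for cycle in range(cycles):
        finished = stages[-1]     # whatever sits in the last stage leaves now
        stages.appendleft(cycle)  # shift by one stage and feed a new nonce
        if finished is not None:
            results.append((cycle, finished))
    return results


results = run_pipeline(depth=128, cycles=132)
print(results[0])    # first result: nonce 0 leaves at cycle 128
print(len(results))  # after that, one result per clock
```

The first result only appears after `depth` clocks, but from then on a finished nonce drops out every clock, which is why the initial delay is negligible over a full nonce range.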

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY

Glasswalker

Sr. Member

Activity: 407
Merit: 250

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 08, 2012, 08:43:38 PM

#622

Current Icarus code (at least the released stuff) is based on the ZTex code.

The ZTex code has a central core module which has a variable number of stages. The SHA-2 (SHA256) spec calls for 64 stages per hash. But the way bitcoin uses it, it only needs a full hash on one stage, and a partial hash on the other.

So the ZTex code (and therefore the Icarus code) does 64 stages on one core, and 61 stages on the other core, for a total of 125 stages. It has all of those stages fully unrolled, so it takes 125 clocks (probably slightly more; I haven't looked at the UART code and controlling logic in depth yet) to fully load the pipeline, after which it runs 1 hash per clock.

The Icarus has 2 FPGAs, each running independent hashing cores, which divide the nonce space between them to split the work (but each operates essentially independently). That's based on my basic understanding of the Verilog source code for it.
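A sketch of what dividing the nonce space between two cores could look like. The thread does not say how the split is actually done, so the equal, contiguous ranges below are an assumption for illustration only.

```python
NONCE_SPACE = 1 << 32  # nonces are 32-bit values


def nonce_ranges(num_cores):
    """Split the 32-bit nonce space into equal contiguous ranges.

    ASSUMPTION: contiguous, equal-sized ranges. The real bitstream may
    partition the space differently.
    """
    size = NONCE_SPACE // num_cores
    return [(i * size, (i + 1) * size - 1) for i in range(num_cores)]


for start, end in nonce_ranges(2):
    print(f"0x{start:08x} .. 0x{end:08x}")
```

With two cores each range holds 2^31 nonces, so the pair finishes the whole space in the time one core needs for half of it.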

Edit: At second glance this may or may not be correct... I did say "Basic" understanding lol... My verilog is rusty as hell...

It may actually be fully unrolled so that it's doing the entire hash in a single clock (for a given SHA256 hash) and pipelining the bitcoin (double SHA) hash into 2 stages.

*runs back to look at the code again*

lol

BattleDrome: Blockchain based Gladiator Combat for fun and profit!
http://www.battledrome.io/

kano

Legendary

Activity: 4844
Merit: 1932

Linux since 1997 RedHat 4

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 08, 2012, 10:01:10 PM
Last edit: March 08, 2012, 10:27:12 PM by kano

#623

Quote from: Glasswalker on March 08, 2012, 08:43:38 PM

Ignoring the pipeline question I asked, I hope it doesn't do 64 + 61.
(well actually I should say I hope it does do this coz then there is a speed up still available)

The 2nd sha256 is actually just 60.5 - but that is probably what you meant by 61.

The 1st sha256 is 61 also - the first 3 'stages' are exactly the same for all nonce in a range, so repeating them 4 billion times is a waste.
There is also the nonce-constant values of W0-W2 & W4-W15 (W4-W15 are constant over all time)
Then the calculation of W16, W17 is also constant across the nonce range.
(and there are other partial calculations you can do also that are constant across a nonce-range)
Edit: the partial ones are W18 (S0), W19 (S0 and S1, S1 is a constant over all time) W20 (S1 - again a constant over all time) W21 (S1 = 0) W22-W30 (S1) all these partial calculations shouldn't be done 4 billion times if at all possible (and some of the +W values for these are also constants per range or even constants over all time)
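The nonce-independence of W16 and W17 can be checked with a few lines, shown here as a sketch. It uses the standard SHA-256 message schedule and the fixed padding of the second 64-byte block of an 80-byte header (W4 = 0x80000000, W5..W14 = 0, W15 = 640 bits). The W0..W2 values are made-up placeholders, byte order is ignored, and `s0`/`s1` are the small sigma functions from the standard, which may not match the S0/S1 labels used above.

```python
MASK = 0xFFFFFFFF


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def s0(x):  # small sigma 0 of the SHA-256 message schedule
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def s1(x):  # small sigma 1 of the SHA-256 message schedule
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def second_block_words(w0, w1, w2, nonce):
    """W0..W15 for the second 64-byte block of an 80-byte header.

    W3 is the nonce, W4 starts the padding, W15 is the length in bits.
    """
    return [w0, w1, w2, nonce, 0x80000000] + [0] * 10 + [640]


def expand(words, upto):
    """Extend the schedule so that indices 16..upto are filled in."""
    w = list(words)
    for t in range(16, upto + 1):
        w.append((s1(w[t - 2]) + w[t - 7] + s0(w[t - 15]) + w[t - 16]) & MASK)
    return w


# Hypothetical header words, only the nonce differs between a and b.
a = expand(second_block_words(0x11111111, 0x22222222, 0x33333333, nonce=0), 19)
b = expand(second_block_words(0x11111111, 0x22222222, 0x33333333, nonce=1), 19)
print(a[16] == b[16], a[17] == b[17])  # W16, W17 never read W3
print(a[18] == b[18], a[19] == b[19])  # W18, W19 do read W3
```

W16 and W17 only read W0, W1, W2, W9, W10, W14 and W15, so they can be computed once per job. W18 is the first word that pulls in W3, and only through one term, which is where the partial precalculation comes from.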

Edit2: I wrote a C program many months ago to analyse the double sha256 and optimise it (and spit out an optimised C program to calculate it - that works) and that's where I get that info from - but I know it is correct coz - as I said, the output code works.
I did this for my own understanding of what optimisations there are ... and of course found them all for the normal double sha256

If you could actually fit in doing 2 nonce at a time in one chip there are also some more partial calculations across each pair of nonce (that I started working on with my code but didn't finish due to there being no actual use in the results at the time)

Pool: https://kano.is - low 0.5% fee PPLNS 3 Days - Most reliable Solo with ONLY 0.5% fee Bitcointalk thread: Forum
Discord support invite at https://kano.is/ Majority developer of the ckpool code - k for kano
The ONLY active original developer of cgminer. Original master git: https://github.com/kanoi/cgminer

TheSeven

Hero Member

Activity: 504
Merit: 500

FPGA Mining LLC

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 08, 2012, 10:26:30 PM

#624

Quote from: kano on March 08, 2012, 10:01:10 PM

The 1st sha256 is 61 also - the first 3 'stages' are exactly the same for all nonce in a range, so repeating them 4 billion times is a waste.
There is also the nonce-constant values of W0-W2 & W4-W15 (W4-W15 are constant over all time)
Then the calculation of W16, W17 is also constant across the nonce range.
(and there are other partial calculations you can do also that are constant across a nonce-range)
Edit: the partial ones are W18 (S0), W19 (S0 and S1, S1 is a constant over all time) W20 (S1 - again a constant over all time) W21 (S1 = 0) W22-W30 (S1) all these partial calculations shouldn't be done 4 billion times if at all possible (and some of the +W values for these are also constants per range or even constants over all time)

The synthesis tools usually do a rather good job at removing logic with constant output values. So while this may not be true for the nonce-dependent ones, most of the all time constant ones have probably already been caught automatically.

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY

kano

Legendary

Activity: 4844
Merit: 1932

Linux since 1997 RedHat 4

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 08, 2012, 10:39:16 PM

#625

Quote from: TheSeven on March 08, 2012, 10:26:30 PM

Quote from: kano on March 08, 2012, 10:01:10 PM

A lot of it is nonce dependent - so not doing that is a BIG waste.
Also, even the ATI OpenCL compiler sux at doing this so I wouldn't be surprised if the tool is poor at optimisation.

As I said in my "Edit2:" above, I did this with C.
On top of all that - using gcc -O2 over the resulting code made a massive speed difference also - something close to running at twice the speed (though that was probably the optimisation of C to assembler)
And the -O2 made doing some of the code optimisations pointless since gcc worked them out itself

TheSeven

Hero Member

Activity: 504
Merit: 500

FPGA Mining LLC

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 08, 2012, 11:33:20 PM

#626

Quote from: kano on March 08, 2012, 10:01:10 PM

Edit2: I wrote a C program many months ago to analyse the double sha256 and optimise it (and spit out an optimised C program to calculate it - that works) and that's where I get that info from - but I know it is correct coz - as I said, the output code works.
I did this for my own understanding of what optimisations there are ... and of course found them all for the normal double sha256

I'm not sure if that would make things any better. The wall that the HDL people are currently hitting seems to be mostly routing congestion, not really logic slices yet. Spartan6 routing must be awful. And this idea doesn't really sound like it would improve on that.

Quote from: kano on March 08, 2012, 10:39:16 PM

As I said in my "Edit2:" above, I did this with C.
On top of all that - using gcc -O2 over the resulting code made a massive speed difference also - something close to running at twice the speed (though that was probably the optimisation of C to assembler)
And the -O2 made doing some of the code optimisations pointless since gcc worked them out itself

Running without -O tells the compiler to literally do what you say, i.e. forbids that kind of optimization (and also writes all kinds of variables to the stack for no good reason, resulting in even more slowdown). -O1 vs. -O2 vs. -O3 vs. -Os might be more interesting than comparing with no -O option at all.

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY

kano

Legendary

Activity: 4844
Merit: 1932

Linux since 1997 RedHat 4

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 08, 2012, 11:44:43 PM

#627

Going from -O2 to -O3 did nothing.

But my point there was that the gcc compiler is VERY good at optimisation - and applying that optimisation makes a big difference in CPU land.

However, those nonce-range optimisations simply move code so that it runs once rather than 2^32 times (assuming the controller that distributes the work to the 2 chips does the setup work)
So without them you are wasting something like 2% ... or using that approximate figure on an Icarus: 98% = 380MH/s, then 100% = around 388MH/s
All very rough but certainly worth doing - since it doesn't increase the power usage or the amount of effort for the Icarus, it simply increases the MH/s
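The rough arithmetic behind that estimate, as a sketch. The 2% figure is the approximate number from this post, not something measured, and the function name is just for illustration.

```python
def rate_without_waste(measured_mhs, wasted_fraction):
    """Hash rate if the wasted share of the work were put to use.

    `wasted_fraction` is a rough guess (2% in this post), not a
    measured value.
    """
    return measured_mhs / (1.0 - wasted_fraction)


print(round(rate_without_waste(380, 0.02), 1))  # prints 387.8
```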

allinvain

Legendary

Activity: 3080
Merit: 1087

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 09, 2012, 12:41:45 AM

#628

Quote from: Energizer on March 08, 2012, 07:41:31 PM

I am getting 0% invalid shares on 6 boards! I am using MPBM with jobinterval set to 11.3!

By the way I was wrong! the 11->11.3 range is as valuable as any 0.3 seconds within the 11.3 seconds range!

So is there a consensus that setting jobinterval to 11.3 results in the _best_ performance for the Icarus board?

Binance - where I trade. Funds are SAFU!

Glasswalker

Sr. Member

Activity: 407
Merit: 250

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 09, 2012, 03:07:45 AM

#629

Quick update, after re-reading the Verilog, looks like it is pipelining it (and you're right, 61 stages on both parts, he has some special cases in there, there is also a core doing full 64 stage pipe, but I am not sure what that's for lol, only going over it roughly right now)

I'm intrigued to hear more about your optimizations, since I'm writing my own verilog. Once I get it working and able to calculate hashes (slowly) I'll go over optimizing it, and then things like your suggestions could help quite a bit.

BattleDrome: Blockchain based Gladiator Combat for fun and profit!
http://www.battledrome.io/

TheSeven

Hero Member

Activity: 504
Merit: 500

FPGA Mining LLC

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 09, 2012, 07:57:36 AM

#630

Quote from: allinvain on March 09, 2012, 12:41:45 AM

Quote from: Energizer on March 08, 2012, 07:41:31 PM

I am getting 0% invalid shares on 6 boards! I am using MPBM with jobinterval set to 11.3!

By the way I was wrong! the 11->11.3 range is as valuable as any 0.3 seconds within the 11.3 seconds range!

So is there a consensus that setting jobinterval to 11.3 results in the _best_ performance for the Icarus board?

I think there is a consensus amongst basically everyone but Energizer that it doesn't. Exactly 11.3 seconds is indeed the sweet spot, but the effective interval will always be a little bit longer than the one calculated at that line of code that Energizer pointed at, because there's a bit of jitter for various reasons.
While the penalty for going lower (and thus adding a bit of a safety margin) is pretty much zero, the penalty for exceeding those 11.3 seconds is huge. That's why the defaults should be fine, and you'll need to hack up the code to change that (jobinterval settings above 8 seconds in the configuration file will just be ignored).
See this post for details: https://bitcointalk.org/index.php?topic=51371.msg780603#msg780603
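Where the 11.3 seconds come from, as a quick sketch: it is the time the board needs to sweep the whole 32-bit nonce space at the roughly 380MH/s quoted earlier in the thread. The 8 second value is the configuration cap mentioned above; the function names are only for illustration.

```python
NONCE_SPACE = 2 ** 32  # number of nonces per job


def sweep_time_s(hash_rate_hz):
    """Seconds needed to try every nonce of one job."""
    return NONCE_SPACE / hash_rate_hz


def fraction_swept(interval_s, hash_rate_hz):
    """Share of the nonce space covered before the job is replaced."""
    return min(1.0, interval_s * hash_rate_hz / NONCE_SPACE)


print(f"{sweep_time_s(380e6):.2f} s")       # prints "11.30 s"
print(f"{fraction_swept(8.0, 380e6):.3f}")  # prints "0.708"
```

Replacing a job before the sweep finishes just leaves some of its nonces untried, while any interval past the sweep time has no fresh nonces left for that job.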

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY

kano

Legendary

Activity: 4844
Merit: 1932

Linux since 1997 RedHat 4

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 09, 2012, 08:33:36 AM

#631

Quote from: Glasswalker on March 09, 2012, 03:07:45 AM

Well here's the output of my code (which is of course fully unrolled) before I started messing with trying to do 2 nonce at the same time.
It also doesn't do the partial Wn calculations, but it's easy to see them.

http://pastebin.com/sxdVSJF1

That has all 3 sha256()'s in it since the first one is the midstate calculation.
Also note that the last sha256() has a lot of constants at the start (that my code determined) that may also not be in the Icarus version (I don't know)
My code worked out constants and converted them to their values.

That code will run and find shares correctly.
It's not perfect in terms of register usage or optimisation of partial calculations, but otherwise it's pretty close to complete.

allinvain

Legendary

Activity: 3080
Merit: 1087

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 09, 2012, 09:09:13 AM

#632

Quote from: TheSeven on March 09, 2012, 07:57:36 AM

Quote from: allinvain on March 09, 2012, 12:41:45 AM

Quote from: Energizer on March 08, 2012, 07:41:31 PM

I am getting 0% invalid shares on 6 boards! I am using MPBM with jobinterval set to 11.3!

By the way I was wrong! the 11->11.3 range is as valuable as any 0.3 seconds within the 11.3 seconds range!

So is there a consensus that setting jobinterval to 11.3 results in the _best_ performance for the Icarus board?

Thank you for clearing that up for me. That settles it for me, I will leave things as they are. I typically see 0.1% invalids which IMHO is _good_ .

Binance - where I trade. Funds are SAFU!

TheSeven

Hero Member

Activity: 504
Merit: 500

FPGA Mining LLC

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 09, 2012, 09:40:39 AM

#633

Quote from: Glasswalker on March 09, 2012, 03:07:45 AM

Actually twice that many pipeline stages (relevant for latency), because each sha256 round is split into two pipeline stages in the ztex core.

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY

Glasswalker

Sr. Member

Activity: 407
Merit: 250

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 09, 2012, 01:35:52 PM

#634

Quote from: TheSeven on March 09, 2012, 09:40:39 AM

Actually twice that many pipeline stages (relevant for latency), because each sha256 round is split into two pipeline stages in the ztex core.

Really? Since I'm trying to get my head around the code anyway, can you elaborate on this? I'm not seeing it in the code I'm looking at for sha256_pipes2.v

I see the main sha256_pipe2_base module, which seems to generate the 64 SHA stages,

Then I see pipe130 (which instantiates sha256_pipe2_base with 64 stages and does a single pass)

Then I see pipe123 (which instantiates sha256_pipe2_base with 61 stages and only seems to output a single 32bit word of hash)

Then I see pipe129 (which instantiates sha256_pipe2_base with 64 stages and does a single pass and outputs a full 256bit hash)

the top module seems to instantiate sha256_pipe130 and sha256_pipe123 (as p1 and p2)

I don't see anywhere where the sha cores are split? (but as I said before, my verilog is pretty rusty, and since I'm trying to brush up and write my own sha core, if you can help me out with what I'm misinterpreting I'd appreciate it) ;)

Thanks!

BattleDrome: Blockchain based Gladiator Combat for fun and profit!
http://www.battledrome.io/

ngzhang (OP)

Hero Member

Activity: 592
Merit: 501

We will stand and fight.

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 09, 2012, 02:14:33 PM

#635

hi, i'm sorry for disappearing and not answering many mails for a few days.
i got a box of boards yesterday. i'm busy testing them.

i must finish some bulk orders before 3/12.

so please have a nice day, my friends. :D

Turbor

Legendary

Activity: 1022
Merit: 1000

BitMinter

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 09, 2012, 03:30:06 PM

#636

FPGA sex

BitMinter -----> Knives4Bitcoin.com <-----

TheSeven

Hero Member

Activity: 504
Merit: 500

FPGA Mining LLC

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 09, 2012, 06:10:56 PM

#637

Quote from: Glasswalker on March 09, 2012, 01:35:52 PM

Quote from: TheSeven on March 09, 2012, 09:40:39 AM

Actually twice that many pipeline stages (relevant for latency), because each sha256 round is split into two pipeline stages in the ztex core.

Thanks!

I've never really known any verilog (I like VHDL much better), but this looks like the sha256_pipe2_base module consists of two pipeline stages:

Code:

	for (i = 0; i <= STAGES; i = i + 1) begin : S

		reg [511:0] data;
		reg [223:0] state;
		reg [31:0] t1_p1;

That's the first set of pipeline registers

Code:

		if(i == 0) 
		begin
[...]
		end else
		begin

			reg [511:0] data_buf;
			reg [223:0] state_buf;
			reg [31:0] data15_p1, data15_p2, data15_p3, t1;

That's the second set of pipeline registers

Code:

			always @ (posedge clk)
			begin
				data_buf <= S[i-1].data;

Just copy the input data in the first stage

Code:

				data[479:0] <= data_buf[511:32];
				data15_p1 <= `S1( S[i-1].data[`IDX(15)] );											// 3
				data15_p2 <= data15_p1;														// 1
				data15_p3 <= ( ( i == 1 ) ? `S1( S[i-1].data[`IDX(14)] ) : S[i-1].data15_p2 ) + S[i-1].data[`IDX(9)] + S[i-1].data[`IDX(0)];	// 3
				data[`IDX(15)] <= `S0( data_buf[`IDX(1)] ) + data15_p3;										// 4

Do the actual calculations in the second stage

Code:

				state_buf <= S[i-1].state;													// 2

Just copy the input data in the first stage

Code:

				t1 <= `CH( S[i-1].state[`IDX(4)], S[i-1].state[`IDX(5)], S[i-1].state[`IDX(6)] ) + `E1( S[i-1].state[`IDX(4)] ) + S[i-1].t1_p1;	// 6

				state[`IDX(0)] <= `MAJ( state_buf[`IDX(0)], state_buf[`IDX(1)], state_buf[`IDX(2)] ) + `E0( state_buf[`IDX(0)] ) + t1;		// 7
				state[`IDX(1)] <= state_buf[`IDX(0)];												// 1
				state[`IDX(2)] <= state_buf[`IDX(1)];												// 1
				state[`IDX(3)] <= state_buf[`IDX(2)];												// 1
				state[`IDX(4)] <= state_buf[`IDX(3)] + t1;											// 2
				state[`IDX(5)] <= state_buf[`IDX(4)];												// 1
				state[`IDX(6)] <= state_buf[`IDX(5)];												// 1

Do the actual calculations in the second stage

Code:


				t1_p1 <= state_buf[`IDX(6)] + data_buf[`IDX(1)] + Ks[`IDX((127-i) & 63)];							// 2
			end

		end
	end

The synthesis software will then do some register balancing and move part of the logic from the second to the first stage in order to equalize delays between those two stages and thus achieve a higher clock rate because the individual stages' critical path delay is reduced.
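What two register stages per round would mean for latency, as rough arithmetic only. The inputs are taken from estimates in this thread and are assumptions: 125 rounds is Glasswalker's first count (64 + 61), and 190 MHz assumes the 380MH/s figure is split evenly over two FPGAs that each finish one hash per clock. Extra registers around the cores are not counted.

```python
def latency_cycles(rounds, stages_per_round):
    """Clock cycles before the first result leaves the pipeline."""
    return rounds * stages_per_round


def latency_us(cycles, clock_hz):
    """Convert a cycle count into microseconds at the given clock."""
    return cycles / clock_hz * 1e6


# ASSUMED inputs, see the note above.
ROUNDS = 125
STAGES_PER_ROUND = 2
CLOCK_HZ = 190e6

cycles = latency_cycles(ROUNDS, STAGES_PER_ROUND)
print(cycles)                                  # prints 250
print(f"{latency_us(cycles, CLOCK_HZ):.2f} us")  # prints "1.32 us"
```

Splitting each round doubles the latency in cycles but does not change the throughput of one result per clock, and a shorter critical path per stage is what allows the higher clock.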

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY

Glasswalker

Sr. Member

Activity: 407
Merit: 250

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 09, 2012, 06:32:33 PM

#638

Ooh! *drool*

I think I see several in that box with my name on it! lol

BattleDrome: Blockchain based Gladiator Combat for fun and profit!
http://www.battledrome.io/

Glasswalker

Sr. Member

Activity: 407
Merit: 250

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 09, 2012, 06:45:17 PM

#639

In verilog, when you put a for loop in a generate block, it gets synthesized out into multiple blocks of logic (think of it as a fast way to instantiate chunks of logic multiple times over).

So when he's copying data from registers in S[i-1] to the current registers you're right he's moving it from the previous pipeline stage to the current pipeline stage. But that for loop instantiates the number of stages in the pipe as STAGES. (so 64 by default). That's the full 64 stage sha pipeline. Each individual block within a stage doesn't seem to be split further.

At least that's what I got out of his method by reading the code, and it's how I've built mine ;)

BattleDrome: Blockchain based Gladiator Combat for fun and profit!
http://www.battledrome.io/

Energizer

Sr. Member

Activity: 273
Merit: 250

Re: FPGA development board "Icarus" - 3rd batch payment start.

March 09, 2012, 07:15:57 PM

#640

I would be grateful if someone with good FPGA programming experience answers this question:

Is it possible to make use of both clock edges to improve the mining speed?

For example: replacing always@(posedge CLK) by always@(posedge CLK or negedge CLK)

Zhang has told me that this would lead to a disaster! I am still wondering if it's possible to use a double-edged clock design @ lower MHz "100->133"!
