Title: [BOUNTY] sha256 shader for Linux OSS video drivers (15 BTC pledged)
Post by: jgarzik on March 18, 2011, 08:41:45 PM

Become an open source hero, and help bitcoin too!

OK, I think this project would see some real return (in BTC) on Linux, for all the miners out there. It would benefit open source as well.

The Project
-------------------------------------------------------
Successfully load and execute a sha256 "compute shader", using 100% open source video drivers on Linux (using closed source ATI tools to produce shader binary is permitted). Any Linux OS/distribution, as long as it's a recent version. Must work on ATI 5870/5970 hardware.

Rationale
-------------------------------------------------------
1. In theory, the closed source ATI SDK and video driver should not be needed, once we have a compiled shader. It would make life much easier on Linux, and expand our miner base, if stock open source drivers can be used for GPU mining.

2. Open source GPGPU efforts are moving slowly, and this would help jump-start those efforts, by providing a working example. This has the potential to be a high profile contribution to the OSS community.

Details
-------------------------------------------------------
According to some knowledgeable hackers, it should be possible to upload a "compute shader" using current Linux/OSS video drivers, via the Linux DRI APIs. The programmer (or team) would need to figure out how to coax ATI's SDK to produce a compiled, binary object that is then loaded into an open source driver, and executed.

The person or team collecting this bounty will need to be able to accomplish tasks such as rebuilding and replacing the kernel, rebuilding and replacing Mesa (OpenGL/DRI), and rebuilding/replacing the X server. Even though these are non-programming tasks, they are decidedly non-trivial.

This code (from ATI?) should be helpful in demonstrating how to work with 5870/5970 hardware: http://cgit.freedesktop.org/mesa/r600_demo/tree/?h=master

Although this task should be largely a "put together existing pieces and make them work" task, it is still quite complex.

The Pledges (in BTC)
-------------------------------------------------------
I'm hoping to raise at least 200 BTC for this task, if not more. Miners on Linux, consider pledging a block (or part of a block).

15 jgarzik

If you wish to pledge anonymously, send me a PM and I'll coordinate.

Pledges should be payable within 24 hours of a working example being posted publicly.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (250 BTC pledged)
Post by: Luke-Jr on March 18, 2011, 09:02:12 PM

I'm offering 50 BTC to the first only-open-source miner to achieve a minimum of 252 MH/s (that's 95% of my present 265 MH/s) on my Radeon 5850. To claim, please send me an email at luke+openminingbounty@dashjr.org with the SHA256 hash of your miner tbz2, in case this turns out to be a close race.

Edit: This offer is expired.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (300 BTC pledged)
Post by: jgarzik on April 06, 2011, 12:23:32 AM

Bump. Increased bounty to 400 BTC.

Here's a link showing several examples of asm shaders on ATI:
http://cgit.freedesktop.org/xorg/driver/xf86-video-ati/tree/src/evergreen_shader.c

and here is some useful Mesa code for building asm shaders:
http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/r600/r700_assembler.c
http://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/r600/r700_shader.c

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (400 BTC pledged)
Post by: Mahkul on April 08, 2011, 12:20:53 PM

I will pledge 25 BTC for this.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (445 BTC pledged)
Post by: xf2_org on April 18, 2011, 04:17:18 PM

Bump, for the new arrivals. :)

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (445 BTC pledged)
Post by: Zamicol on April 21, 2011, 11:15:05 PM

It's not much, but I'll pledge 10 BTC. Every little bit(coin) counts, right? :D

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (445 BTC pledged)
Post by: teknohog on May 06, 2011, 09:45:08 AM

Here's another 10 BTC. With the recent USD price of bitcoins, I wouldn't even say "it's not much".

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (465 BTC pledged)
Post by: xf2_org on May 19, 2011, 08:26:03 PM

bump, and updated first post to include the two most recent pledges (from past 30 days).

At today's exchange rate, the bounty is over $3,100.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (465 BTC pledged)
Post by: DiabloD3 on May 19, 2011, 09:24:04 PM

Quote from: jgarzik on March 18, 2011, 08:41:45 PM

This is pointless. R600 cannot run compute shaders of the kind we need, R700 (Radeon 4xxx) suck at it.

Also, Mesa has a prototype OpenCL compiler for Gallium targets. Your "bounty" is technically already completed before you started.

If you really want to help open source, go work on that project instead.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (465 BTC pledged)
Post by: MoonShadow on May 19, 2011, 09:46:00 PM

I'm not a programmer, so I don't know what I'm talking about here, but could such a Linux binary permit graphics hardware too old to use the current miners to contribute at a respectable hash/watt rate, even if it sucks at the hash/second rate?

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (465 BTC pledged)
Post by: xf2_org on May 20, 2011, 12:02:29 AM

Quote from: DiabloD3 on May 19, 2011, 09:24:04 PM

This is pointless. R600 cannot run compute shaders of the kind we need, R700 (Radeon 4xxx) suck at it.

That was a reference to the starting point for the architecture in the source code. You find R700/R800/+ hardware support code in directories labelled "r600" due to several similarities.

Quote

Also, Mesa has a prototype OpenCL compiler for Gallium targets. Your "bounty" is technically already completed before you started.

If that was true, then somebody would have collected the $3000+ in free money.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (465 BTC pledged)
Post by: DiabloD3 on May 20, 2011, 02:15:50 AM

Quote from: xf2_org on May 20, 2011, 12:02:29 AM

If that was true, then somebody would have collected the $3000+ in free money.

Not at all. Neither Xorg nor FDO accepts donations in BTC.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (465 BTC pledged)
Post by: DiabloD3 on May 20, 2011, 02:16:47 AM

Quote from: creighto on May 19, 2011, 09:46:00 PM

No. They lack the hardware design to run programs like this, plus they would be amazingly slow.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (465 BTC pledged)
Post by: MoonShadow on May 20, 2011, 03:53:54 AM

Quote from: DiabloD3 on May 20, 2011, 02:16:47 AM

Quote from: creighto on May 19, 2011, 09:46:00 PM

No. They lack the hardware design to run programs like this, plus they would be amazingly slow.

Slow is irrelevant, if they are efficient. There are millions of them. But if it's not possible, it's not possible.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (465 BTC pledged)
Post by: xf2_org on May 20, 2011, 04:23:26 AM

Let's not get distracted by Diablo getting worked up over a directory name.

The bounty is for working on 5870/5970/6990 era hardware.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (465 BTC pledged)
Post by: Basiley on May 20, 2011, 04:35:59 AM

Anyone with [deprecated] RenderMonkey experience?
Note: using a shader allows you to extract MORE from your GPU. Also, a CAL implementation might be considerably faster than an OpenCL one [why, how, and some optimisation tricks have already been covered by the BarsWF creator].
Such experience is also handy for porting solvers to other areas [Linpack/Livermore shader? Biology? Meteorology? Radiophysics?]

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (465 BTC pledged)
Post by: DiabloD3 on May 20, 2011, 06:51:43 AM

Quote from: Basiley on May 20, 2011, 04:35:59 AM

ArtForz has a CAL miner that is slightly faster than the CL kernel poclbm and I based ours on. It's not a particularly huge win, especially when SDK 2.5 is getting rid of CAL support.

Also, I have experimented with GLSL-based miners. The lack of real integer support kills the whole thing.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (465 BTC pledged)
Post by: Basiley on May 20, 2011, 07:04:28 AM

:-(

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (465 BTC pledged)
Post by: DiabloD3 on May 20, 2011, 07:06:51 AM

Quote from: Basiley on May 20, 2011, 07:04:28 AM

:-(

Well, it's why OpenCL was invented in the first place. Using GLSL for even generic computing tasks that don't fit the OpenGL workflow is problematic and really not worth it.

I don't want to shoot down anyone's hopes, but Mesa already is growing OpenCL support for Gallium. What more could we possibly ask for?

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (465 BTC pledged)
Post by: Basiley on May 20, 2011, 07:11:44 AM

Quote from: DiabloD3 on May 20, 2011, 07:06:51 AM

Quote from: Basiley on May 20, 2011, 07:04:28 AM

:-(

Best wishes?
Quicker Gallium3D adoption?
Quicker Mesa development [along with Gallium3D]?
Better drivers [especially free drivers, now about 10x slower than their proprietary counterparts]?
More suitable SDKs?
AMD/NVidia support for both developers and OpenCL itself [today Intel CPUs have better OpenCL support than Nvidia GPUs].
And yes, OpenCL/OpenGL ES is cool, at least in theory. Hail the glorious OpenMAX developer.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (465 BTC pledged)
Post by: DiabloD3 on May 20, 2011, 07:17:37 AM

Quote from: Basiley on May 20, 2011, 07:11:44 AM

Quote from: DiabloD3 on May 20, 2011, 07:06:51 AM

Quote from: Basiley on May 20, 2011, 07:04:28 AM

:-(

Mesa and Gallium are open source projects, you can always start developing for them.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (465 BTC pledged)
Post by: Basiley on May 20, 2011, 07:28:01 AM

I'm a terrible developer and have never written anything serious [the last time was years ago], so I'm hardly helpful for such a project.
Except maybe for polishing docs or translation.
Unless they need an AV engineer, a system administrator, a regional representative [Russia in particular], etc.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (465 BTC pledged)
Post by: Zamicol on June 09, 2011, 02:10:59 AM

I should probably retract my offer... I sold all my Bitcoins the other day. I didn't even think about this until last night. Seeing where this discussion went, I hope that isn't a problem.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: xaci on June 09, 2011, 05:01:57 PM

As Diablo has already pointed out, GLSL (version 1.2 and earlier) supports neither 32-bit integers nor bitwise operators. GLSL 1.2 corresponds to OpenGL 2.1, which is what you'll currently get with FLOSS drivers (i.e. Mesa/Gallium). It is in theory possible to do equivalent calculations using float pairs (16 bits in each) and conditional-arithmetic equivalents for XOR, bit shift, rotation etc. (e.g. division by a power of two is the same as a right shift, rotation may be implemented by moving the fractional bits after a shift, and so on). Look, it *might* work, but there will be absolutely no gain at all. You'll be lucky if you get a few Mhash/s from it.

GLSL versions 1.3 and above have support for 32-bit unsigned integers as well as bitwise operators. I'm in the process of writing a GLSL 1.3 shader, and it should be completed in a few days. The downside with this is that it requires (at least partial support of) OpenGL 3.0. There is no complete FLOSS OpenGL 3.0 implementation. AFAIK, the proprietary ATI/AMD driver has OpenGL 3.0 support for R600 and later. On the other hand, in practice, only the extension GL_EXT_gpu_shader4 is necessary, not the complete OpenGL 3.0 (at least I think that's right). If any of the FLOSS drivers implements that extension, then it would/should be possible to run my (to be written) shader on those drivers as well. In any case, the proprietary ATI/AMD driver should run it, which means it will become possible to mine on R600 and R700 hardware which does not support OpenCL.

Unfortunately it will take a few days before I get access to hardware to test this out. I do have a HD3850, but nowhere to plug it in.

See below (sorry, couldn't attach it) for a ridiculous GLSL 1.2 shader that (partially) calculates SHA256 hashes. (NOTE: it's incomplete, and may even be incorrect since it's untested -- it also crashes my system which has an old Intel IGP with partial GLSL support)

Code:

#version 120

/*
32-bit integers are represented by a vec2. GLSL 2 integers may only have up
to 16-bit precision (in portable code), and they are likely to be implemented
with floats anyway. Instead we use float-pairs, with 16-bit in each (although
floats fit 24-bit precision). A vec4 is also used instead of two vec2, where
possible.
*/

uniform vec4 data[8];		/* Second part of data */
uniform vec4 hash1[4];		/* Second part of hash1 */
uniform vec4 midstate[4];
uniform vec4 target[4];
uniform vec2 nonce_base;

/* Note: N is the width of the buffer and should only be between 1 and 2048 or
so. Preferably less -- around 128 or 256. */
uniform float N;

/* Note: offset is two independent floats, with values between 0 and N. */
varying vec2 varying_nonce_offset;

const vec4 stdstate[4] = vec4[](
	vec4 (float (0x6a09), float (0xe667), float (0xbb67), float (0xae85)),
	vec4 (float (0x3c6e), float (0xf372), float (0xa54f), float (0xf53a)),
	vec4 (float (0x510e), float (0x527f), float (0x9b05), float (0x688c)),
	vec4 (float (0x1f83), float (0xd9ab), float (0x5be0), float (0xcd19)));

const vec4 k[32] = vec4[](
	vec4 (float (0x428a), float (0x2f98), float (0x7137), float (0x4491)),
	vec4 (float (0xb5c0), float (0xfbcf), float (0xe9b5), float (0xdba5)),
	vec4 (float (0x3956), float (0xc25b), float (0x59f1), float (0x11f1)),
	vec4 (float (0x923f), float (0x82a4), float (0xab1c), float (0x5ed5)),

	vec4 (float (0xd807), float (0xaa98), float (0x1283), float (0x5b01)),
	vec4 (float (0x2431), float (0x85be), float (0x550c), float (0x7dc3)),
	vec4 (float (0x72be), float (0x5d74), float (0x80de), float (0xb1fe)),
	vec4 (float (0x9bdc), float (0x06a7), float (0xc19b), float (0xf174)),

	vec4 (float (0xe49b), float (0x69c1), float (0xefbe), float (0x4786)),
	vec4 (float (0x0fc1), float (0x9dc6), float (0x240c), float (0xa1cc)),
	vec4 (float (0x2de9), float (0x2c6f), float (0x4a74), float (0x84aa)),
	vec4 (float (0x5cb0), float (0xa9dc), float (0x76f9), float (0x88da)),

	vec4 (float (0x983e), float (0x5152), float (0xa831), float (0xc66d)),
	vec4 (float (0xb003), float (0x27c8), float (0xbf59), float (0x7fc7)),
	vec4 (float (0xc6e0), float (0x0bf3), float (0xd5a7), float (0x9147)),
	vec4 (float (0x06ca), float (0x6351), float (0x1429), float (0x2967)),

	vec4 (float (0x27b7), float (0x0a85), float (0x2e1b), float (0x2138)),
	vec4 (float (0x4d2c), float (0x6dfc), float (0x5338), float (0x0d13)),
	vec4 (float (0x650a), float (0x7354), float (0x766a), float (0x0abb)),
	vec4 (float (0x81c2), float (0xc92e), float (0x9272), float (0x2c85)),

	vec4 (float (0xa2bf), float (0xe8a1), float (0xa81a), float (0x664b)),
	vec4 (float (0xc24b), float (0x8b70), float (0xc76c), float (0x51a3)),
	vec4 (float (0xd192), float (0xe819), float (0xd699), float (0x0624)),
	vec4 (float (0xf40e), float (0x3585), float (0x106a), float (0xa070)),

	vec4 (float (0x19a4), float (0xc116), float (0x1e37), float (0x6c08)),
	vec4 (float (0x2748), float (0x774c), float (0x34b0), float (0xbcb5)),
	vec4 (float (0x391c), float (0x0cb3), float (0x4ed8), float (0xaa4a)),
	vec4 (float (0x5b9c), float (0xca4f), float (0x682e), float (0x6ff3)),

	vec4 (float (0x748f), float (0x82ee), float (0x78a5), float (0x636f)),
	vec4 (float (0x84c8), float (0x7814), float (0x8cc7), float (0x0208)),
	vec4 (float (0x90be), float (0xfffa), float (0xa450), float (0x6ceb)),
	vec4 (float (0xbef9), float (0xa3f7), float (0xc671), float (0x78f2)));

/* For rotr (>>) use division with appropriate power of 2. */

/* Do not let overflow happen with this function, or use sum_c instead! */
vec2 sum (vec2 a, vec2 b)
{
	vec2 ret;
	ret.x = a.x + b.x;
	ret.y = a.y + b.y;
	if (ret.y >= float(0x10000))
	{
		ret.y -= float(0x10000);
		ret.x += 1.0;
	}
	if (ret.x >= float(0x10000))
		ret.x -= float(0x10000);
	return ret;
}

vec2 sum_c (vec2 a, vec2 b, out float carry)
{
	/* carry is an out parameter, so write it on every path */
	vec2 ret = a + b;
	carry = 0.0;
	if (ret.y >= float(0x10000))
	{
		ret.y -= float(0x10000);
		ret.x += 1.0;
	}
	if (ret.x >= float(0x10000))
	{
		ret.x -= float(0x10000);
		carry = 1.0;
	}
	return ret;
}

vec2 prod (float a, float b)
{
	vec2 ret;
	ret.x = 0;
	ret.y = a * b;
	if (ret.y >= float(0x10000))
	{
		float c = floor (ret.y / float(0x10000));
		ret.x += c;
		ret.y -= c * float(0x10000);
	}
	return ret;
}

/* Note: shift should be a power of two, e.g. to shift 3 steps, use 2^3. */
vec2 sftr (vec2 a, float shift)
{
	vec2 ret = a / shift;
	ret = vec2 (floor (ret.x), floor (ret.y) + fract (ret.x) * float (0x10000));
	return ret;
}

/* Note: shift should be a power of two, e.g. to rotate 3 steps, use 2^3. */
vec2 rotr (vec2 a, float shift)
{
	vec2 ret = a / shift;
	ret = floor (ret) + fract (ret.yx) * float (0x10000);
	return ret;
}

float xor16 (float a, float b)
{
	float ret = 0.0;
	float fact = float (0x8000);
	while (fact >= 1.0)	/* stop after bit 0 instead of halving down to underflow */
	{
		if ((a >= fact) != (b >= fact))	/* exactly one input has this bit set */
			ret += fact;

		if (a >= fact)
			a -= fact;
		if (b >= fact)
			b -= fact;

		fact /= 2.0;
	}
	return ret;
}

vec2 xor (vec2 a, vec2 b)
{
	return vec2 (xor16 (a.x, b.x), xor16 (a.y, b.y));
}

float and16 (float a, float b)
{
	float ret = 0.0;
	float fact = float (0x8000);
	while (fact >= 1.0)
	{
		/* AND: the bit is set only when both inputs have it */
		if (a >= fact && b >= fact)
			ret += fact;

		if (a >= fact)
			a -= fact;
		if (b >= fact)
			b -= fact;

		fact /= 2.0;
	}
	return ret;
}

vec2 and (vec2 a, vec2 b)
{
	return vec2 (and16 (a.x, b.x), and16 (a.y, b.y));
}

/* Bitwise complement ("not") of both 16-bit halves */
vec2 cpl (vec2 a)
{
	return vec2 (float (0xffff), float (0xffff)) - a;
}

#define POW_2_01 2.0
#define POW_2_02 4.0
#define POW_2_03 8.0
#define POW_2_06 64.0
#define POW_2_07 128.0
#define POW_2_09 512.0
#define POW_2_10 1024.0
#define POW_2_11 2048.0
#define POW_2_13 8192.0

vec2 blend (vec2 m16, vec2 m15, vec2 m07, vec2 m02)
{
	vec2 s0 = xor (rotr (m15   , POW_2_07), xor (rotr (m15.yx, POW_2_02), sftr (m15, POW_2_03)));
	vec2 s1 = xor (rotr (m02.yx, POW_2_01), xor (rotr (m02.yx, POW_2_03), sftr (m02, POW_2_10)));
	return sum (sum (m16, s0), sum (m07, s1));
}

vec2 e0 (vec2 a)
{
	return xor (rotr (a, POW_2_02), xor (rotr (a, POW_2_13), rotr (a.yx, POW_2_06)));
}

vec2 e1 (vec2 a)
{
	return xor (rotr (a, POW_2_06), xor (rotr (a, POW_2_11), rotr (a.yx, POW_2_09)));
}

vec2 ch (vec2 a, vec2 b, vec2 c)
{
	return xor (and (a, b), and (cpl (a), c));
}

vec2 maj (vec2 a, vec2 b, vec2 c)
{
	return xor (xor (and (a, b), and (a, c)), and (b, c));
}

void main ()
{
	vec2 nonce_offset = floor (varying_nonce_offset);
	vec2 nonce = sum (nonce_base, sum(prod(nonce_offset.y, N), vec2 (0.0, nonce_offset.x)));

	vec4 w[24];
	vec4 hash0[4];
	vec4 tmp[4];
	#define a (tmp[0].xy)
	#define b (tmp[0].zw)
	#define c (tmp[1].xy)
	#define d (tmp[1].zw)
	#define e (tmp[2].xy)
	#define f (tmp[2].zw)
	#define g (tmp[3].xy)
	#define h (tmp[3].zw)
	vec2 t1, t2;

	/* TODO: Using midstate as state, calculate hash "hash0" of data with nonce applied */
	w[0].xy = blend (data[0].xy, data[0].zw, data[4].zw, data[7].xy);
	w[0].zw = blend (data[0].zw, data[1].xy, data[5].xy, data[7].zw);
	w[1].xy = blend (data[1].xy, data[1].zw, data[5].zw,    w[0].xy);
	w[1].zw = blend (data[1].zw, data[2].xy, data[6].xy,    w[0].zw);
	w[2].xy = blend (data[2].xy, data[2].zw, data[6].zw,    w[1].xy);
	w[2].zw = blend (data[2].zw, nonce.xy,   data[7].xy,    w[1].zw);
	w[3].xy = blend (nonce.xy,   nonce.zw,   data[7].zw,    w[2].xy);
	w[3].zw = blend (nonce.zw,   data[4].xy,    w[0].xy,    w[2].zw);
	w[4].xy = blend (data[4].xy, data[4].zw,    w[0].zw,    w[3].xy);
	w[4].zw = blend (data[4].zw, data[5].xy,    w[1].xy,    w[3].zw);
	w[5].xy = blend (data[5].xy, data[5].zw,    w[1].zw,    w[4].xy);
	w[5].zw = blend (data[5].zw, data[6].xy,    w[2].xy,    w[4].zw);
	w[6].xy = blend (data[6].xy, data[6].zw,    w[2].zw,    w[5].xy);
	w[6].zw = blend (data[6].zw, data[7].xy,    w[3].xy,    w[5].zw);
	w[7].xy = blend (data[7].xy, data[7].zw,    w[3].zw,    w[6].xy);
	w[7].zw = blend (data[7].zw,	w[0].xy,    w[4].xy,    w[6].zw);
	for (int i = 8; i < 24; ++i)
	{
		w[i].xy = blend (w[i-8].xy, w[i-8].zw, w[i-4].zw, w[i-1].xy);
		w[i].zw = blend (w[i-8].zw, w[i-7].xy, w[i-3].xy, w[i-1].zw);
	}
	tmp = midstate;

	/* TODO: Add loop-unrolled of i = 0 to 3, where data is used instead of w. */
	/*for (int i = 4; i < 32; i+=4)
	{
		t1 = sum (sum (sum (sum (h, e1(e)), ch(e,f,g)), k[i+0].xy), w[i-4+0].xy);
		t2 = sum (e0(a), maj(a,b,c)); d = sum (d, t1); h = sum (t1, t2);
		t1 = sum (sum (sum (sum (g, e1(d)), ch(d,e,f)), k[i+0].zw), w[i-4+0].zw);
		t2 = sum (e0(h), maj(h,a,b)); c = sum (c, t1); g = sum (t1, t2);
		t1 = sum (sum (sum (sum (f, e1(c)), ch(c,d,e)), k[i+1].xy), w[i-4+1].xy);
		t2 = sum (e0(g), maj(g,h,a)); b = sum (b, t1); f = sum (t1, t2);
		t1 = sum (sum (sum (sum (e, e1(b)), ch(b,c,d)), k[i+1].zw), w[i-4+1].zw);
		t2 = sum (e0(f), maj(f,g,h)); a = sum (a, t1); e = sum (t1, t2);
		t1 = sum (sum (sum (sum (d, e1(a)), ch(a,b,c)), k[i+2].xy), w[i-4+2].xy);
		t2 = sum (e0(e), maj(e,f,g)); h = sum (h, t1); d = sum (t1, t2);
		t1 = sum (sum (sum (sum (c, e1(h)), ch(h,a,b)), k[i+2].zw), w[i-4+2].zw);
		t2 = sum (e0(d), maj(d,e,f)); g = sum (g, t1); c = sum (t1, t2);
		t1 = sum (sum (sum (sum (b, e1(g)), ch(g,h,a)), k[i+3].xy), w[i-4+3].xy);
		t2 = sum (e0(c), maj(c,d,e)); f = sum (f, t1); b = sum (t1, t2);
		t1 = sum (sum (sum (sum (a, e1(f)), ch(f,g,h)), k[i+3].zw), w[i-4+3].zw);
		t2 = sum (e0(b), maj(b,c,d)); e = sum (e, t1); a = sum (t1, t2);
	}*/

	/* TODO: More iterations... Copy-paste block and fix k-index and W-value. */

	for (int i = 0; i < 4; ++i)
	{
		hash0[i].xy = sum (midstate[i].xy, tmp[i].xy);
		hash0[i].zw = sum (midstate[i].zw, tmp[i].zw);
	}

	vec4 hash[4];
	/* TODO: Using stdstate as state, calculate the hash of (hash0, hash1) */

	/* TODO: Compare with target. */

	gl_FragColor.r = nonce.y / 255.0;
	if (mod (nonce.y, 2.0) == 0.0)
		gl_FragColor.r = 0;
	else
		gl_FragColor.r = 1;
}

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: xf2_org on June 09, 2011, 06:51:08 PM

Quote from: xaci on June 09, 2011, 05:01:57 PM

The bounty is for full-performance assembly that works on ATI 5870/5970 at a minimum, not slow GLSL.

Please ignore Diablo-D3, he has a talent for taking threads off-topic.

The bounty requires loading full performance binary code onto ATI hardware running full open source drivers/stack.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: xaci on June 09, 2011, 07:22:19 PM

Well, I don't think an OpenGL 3 shader will be necessarily slow. The GLSL code is also compiled into a binary to be run on the GPU, after all. The OpenGL 3 shading language has enough features to do SHA256 without ugly hacks. It should in theory be just as fast as an OpenCL equivalent (I will admit that I'm not sure though. However, I believe it's worth a try -- especially for models where OpenCL is not an alternative). Do you know if the FLOSS drivers for ATI 5870/5970 (or other models) support GL_EXT_gpu_shader4?

And just in case it wasn't obvious; I didn't expect to collect a bounty for that code I posted.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: LegitBit on June 09, 2011, 07:36:52 PM

Forgive me if I am totally wrong.. but don't ATI cards have specific tessellation units separate from the shader ALU's?

Tessellation is a math heavy algo, but nVidia cards even surpass ATI's in this case.

Is that because of tessellation requiring more iterations? Math isn't my strong suit, but I figure a probe in that direction might help.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: DiabloD3 on June 09, 2011, 10:39:13 PM

Quote from: LegitBit on June 09, 2011, 07:36:52 PM

Both ATI and Nvidia have fixed function hardware dedicated to tessellation. Nvidia 5xx performance on tess is about the same as Radeon 5xxx/68xx performance, which both are really inferior to 69xx performance.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: error on June 10, 2011, 01:48:35 AM

Oh goody, now I can mine not only on my huge stack of 5850s, but also on the motherboard's embedded HD3300!

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: derjanb on July 08, 2011, 01:44:49 PM

I've created a WebGL bitcoin miner: http://forum.bitcoin.org/index.php?topic=27056.0

Maybe the shader can be reused for this?!

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: teknohog on December 14, 2011, 10:22:43 PM

Seems like regular OpenCL is on its way to the opensource Radeon drivers:

http://www.phoronix.com/scan.php?page=news_item&px=MTAyNTg

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: shakaru on December 16, 2011, 12:36:34 AM

Holy necroposting batman!

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: daybyter on February 04, 2012, 02:06:28 AM

@xaci: thanks a lot for your sha256 code. Do you know the NVidia cg compiler? It has 32 bit ints AFAIK? I have a Geforce 7 card, that is supported by cgc, but has only OpenGL 2.1 support AFAIK at the moment (can't check it at the moment, since I'm at another machine). If I'd write a BrookGPU kernel and use the cgc compiler, I'd have 32 bit integers, right?

TIA,
Andreas

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: daybyter on February 07, 2012, 09:13:21 PM

Did anyone get the sha256 GLSL code to work?

So far I have been reading GLSL tutorials and hacked together a test app (from too many sources to recall all the authors... sorry :( ):

Code:

#include <stdio.h>                      //C standard IO
#include <stdlib.h>                     //C standard lib
#include <string.h>                     //C string lib

#include <GL/glew.h>                    //GLEW lib
#include <GL/glut.h>                    //GLUT lib


//Function from: http://www.evl.uic.edu/aej/594/code/ogl.cpp
//Read in a textfile (GLSL program)
// we need to pass it as a string to the GLSL driver
char *textFileRead(char *fn) {
  FILE *fp;
  char *content = NULL;
  
  int count=0;
  
  if (fn != NULL) {
    
    fp = fopen(fn,"rt");
    
    if (fp != NULL) {
      
      fseek(fp, 0, SEEK_END);
      count = ftell(fp);
      rewind(fp);
      
      if (count > 0) {
        content = (char *)malloc(sizeof(char) * (count+1));
        count = fread(content,sizeof(char),count,fp);
        content[count] = '\0';
      }
      fclose(fp);
      
    }
  }
  
  return content;
}

//Function from: http://www.evl.uic.edu/aej/594/code/ogl.cpp
//Write a string out to a textfile
// returns 1 on success, 0 on failure
int textFileWrite(char *fn, char *s) {
  FILE *fp;
  int status = 0;
  
  if (fn != NULL) {
    fp = fopen(fn,"w");
    
    if (fp != NULL) {                   
      if (fwrite(s,sizeof(char),strlen(s),fp) == strlen(s))
        status = 1;
      fclose(fp);
    }
  }
  return(status);
}

/**
 * Setup shaders
 */
void setShaders() {
  char *my_fragment_shader_source;
  // char * my_vertex_shader_source;
  GLenum error;

  GLenum my_program;
  // GLenum my_vertex_shader;
  GLenum my_fragment_shader;
  
  // Get Vertex And Fragment Shader Sources
  my_fragment_shader_source = textFileRead( "sha256.glsl");
  // my_vertex_shader_source = GetVertexShaderSource();

  // my_vertex_shader = glCreateShaderObjectARB(GL_VERTEX_SHADER_ARB);
  my_fragment_shader = glCreateShaderObjectARB(GL_FRAGMENT_SHADER_ARB);
 
  // Load Shader Sources
  // glShaderSourceARB(my_vertex_shader, 1, &my_vertex_shader_source, NULL);
  glShaderSourceARB( my_fragment_shader, 1, (const GLcharARB** )&my_fragment_shader_source, NULL);
 
  // Compile The Shaders
  // glCompileShaderARB(my_vertex_shader);
  glCompileShaderARB(my_fragment_shader);
  
  // Check for compile errors
  int compiled = 0;
  glGetObjectParameterivARB( my_fragment_shader, GL_OBJECT_COMPILE_STATUS_ARB, &compiled );

  if  ( !compiled ) {
    int maxLength;

    glGetShaderiv( my_fragment_shader, GL_INFO_LOG_LENGTH, &maxLength);
 
    /* The maxLength includes the NULL character */
    char *fragmentInfoLog = malloc( maxLength *sizeof(char));
    
    glGetShaderInfoLog( my_fragment_shader, maxLength, &maxLength, fragmentInfoLog);
 
    printf( "Compile error log: %s\n\n", fragmentInfoLog);

    /* Handle the error in an appropriate way such as displaying a message or writing to a log file. */
    /* In this simple program, we'll just leave */
    free( fragmentInfoLog);

    // printf( "compile error...\n" );
  }

  // Create Shader And Program Objects
  my_program = glCreateProgramObjectARB();

  if(( error=glGetError()) != GL_NO_ERROR) {
    exit( error);
  }

  // Attach The Shader Objects To The Program Object
  // glAttachObjectARB(my_program, my_vertex_shader);
  glAttachObjectARB(my_program, my_fragment_shader);
 
  // Link The Program Object
  glLinkProgramARB(my_program);
  
  // Use The Program Object Instead Of Fixed Function OpenGL
  glUseProgramObjectARB(my_program);
}

int main( int argc, char *argv[]) {

  glutInit(&argc, argv);
  //glutInitDisplayMode(GLUT_DEPTH | GLUT_DOUBLE | GLUT_RGBA);
  glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGBA);
  glutInitWindowPosition(100,100);
  glutInitWindowSize(320,320);
  glutCreateWindow("GPU");
  
  //  glutDisplayFunc(renderScene);
  // glutIdleFunc(renderScene);
  // glutReshapeFunc(changeSize);
  // glutKeyboardFunc(processNormalKeys);
  
  glewInit();
  if (glewIsSupported("GL_VERSION_2_1"))
    printf("Ready for OpenGL 2.1\n");
  else {
    printf("OpenGL 2.1 not supported\n");
    exit(1);
  }
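  // Note: GL_EXT_geometry_shader4 is a compile-time header macro, so it is always
  // true here; the runtime check would be GLEW_EXT_geometry_shader4.
  if (GLEW_ARB_vertex_shader && GLEW_ARB_fragment_shader && GL_EXT_geometry_shader4)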
    printf("Ready for GLSL - vertex, fragment, and geometry units\n");
  else {
    printf("Not totally ready :( \n");
    exit(1);
  }

  setShaders();
  
  glutMainLoop();
  
  // just for compatibility purposes
  return 0;

  // glDeleteObjectARB( my_program);
  // glDeleteObjectARB( my_fragment_shader);
}

There are lots of bugs in this code, but at the moment, I just want to compile the shader and start it to do further checks.

I also wrote a small makefile:

Code:

PROGRAM := glslminer

SOURCES := $(wildcard *.c)

CC = gcc
CCOPTS = 
LINKEROPTS = -lGL -lGLEW -lglut

.PHONY: all
all:
	$(CC) $(CCOPTS) $(SOURCES) $(LINKEROPTS) -o $(PROGRAM)

.PHONY: clean
clean:
	rm -f $(PROGRAM)
and when I compile and start the code as root (as a regular user, I don't get access to the nvidia card here), I get:

Code:

localhost glsl # ./glslminer 
Ready for OpenGL 2.1
Ready for GLSL - vertex, fragment, and geometry units
Compile error log: 0(232) : error C1031: swizzle mask element not present in operand "zw"
0(233) : error C1031: swizzle mask element not present in operand "zw"

which seems to mean that some of the <something>.zw operations fail (I don't know the line number yet, since the newlines seem to get lost in my shader source import).

Anyone with more luck?

Ciao,
Andreas

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: daybyter on February 07, 2012, 09:50:36 PM

After some more debugging, it seems I've found the problem:

In line 210, nonce is declared as a vec2, so it has only the two elements x and y. But lines 232 and 233 (IIRC) use nonce.zw in a computation, which cannot work because nonce has no z and w elements. When I change those expressions to nonce.xy, the code compiles and something even seems to start, although I get no output so far. I will have to investigate that further and fix more issues in the test code.

Any help is really appreciated!

Ciao,
Andreas

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: ThiagoCMC on February 11, 2012, 11:35:26 PM

subscribing... I love Linux and its new video memory management, called GEM + KMS...

Mining with purely open source tools and drivers will be awesome!!

I wanna test this out!!

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: daybyter on February 12, 2012, 02:25:09 PM

Getting the GLSL code to work properly is really tricky for me. Here's a tutorial that describes some of the issues:

http://www.mathematik.tu-dortmund.de/~goeddeke/gpgpu/tutorial.html

You have to render the GLSL output to a texture and read it back to the host.

At this point, I'm not really sure how Xaci wants to pass the header and the nonce to the shader. Is the header supposed to be variable in a way, too?

I'm trying to simplify things for myself a bit, so I translated some of the code to BrookGPU to get float2 streams. This might cost performance, since I'm not sure yet what kind of texture Brook generates and passes to the GPU (I found a posting that said it's a streamlength^2 * 4 * sizeof(float) texture, which would be really big).

So, to simplify things, I just assume the header is constant and pass an array of nonces to the kernel. The shader should then replace the header nonce with the current nonce and do the double sha256 computation. I guess I'll have to pass the decoded difficulty, too, but I'll see about that later...
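
A rough host-side sketch in C of how such an array of nonces could be prepared, assuming each 32-bit word is split so that .x holds the upper 16 bits and .y the lower 16 bits (the split used for the constants in the Brook code further down). The type `float2_host` and the function names are hypothetical, not part of Brook:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side mirror of a float2: x = upper 16 bits, y = lower 16 bits. */
typedef struct { float x, y; } float2_host;

/* Split a 32-bit word into two floats. Both halves are below 65536,
 * so they are represented exactly. */
float2_host pack_u32(uint32_t v)
{
    float2_host r;
    r.x = (float)(v >> 16);
    r.y = (float)(v & 0xffffu);
    return r;
}

/* Reassemble the 32-bit word from its two halves. */
uint32_t unpack_u32(float2_host f)
{
    return ((uint32_t)f.x << 16) | (uint32_t)f.y;
}

/* Fill out[0..count-1] with consecutive nonces starting at first_nonce. */
void fill_nonce_stream(float2_host *out, size_t count, uint32_t first_nonce)
{
    for (size_t i = 0; i < count; ++i)
        out[i] = pack_u32(first_nonce + (uint32_t)i);
}
```

For example, pack_u32(0x428a2f98) gives x = 17034 and y = 12184, and unpack_u32() turns a result read back from the card into an ordinary integer again.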

Ciao,
Andreas

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: daybyter on February 12, 2012, 04:09:18 PM

Completed 'and' and fixed a bug in the 'not' function. This is the Brook version, but it should be easy to port the change back to GLSL if wanted:

Code:

/**
 * Some utility functions to process integers represented as float2.
 */

/**
 * Add 2 integers represented as float2.
 *
 * Do not let overflow happen with this function, or use sum_c instead! 
 */
kernel float2 add( float2 a, float2 b) {
        float2 ret;

        ret.x = a.x + b.x;
        ret.y = a.y + b.y;

        if (ret.y >= 65536.0) {
                ret.y -= 65536.0;
                ret.x += 1.0;
        }

        if (ret.x >= 65536.0) {
                ret.x -= 65536.0;
	}

        return ret;
}

/**
 * Shift an integer represented as a float2 by log2(shift).
 * 
 * Note: shift should be a power of two, e.g. to shift 3 steps, use 2^3. 
 */
kernel float2 shiftr( float2 a, float shift) {
        float2 ret;
	
	ret.x = a.x / shift;

	ret.y = floor( a.y / shift) + frac( ret.x) * 65536.0;

	ret.x = floor( ret.x);

        return ret;
}

/**
 * Rotate an integer represented as a float2 by log2(shift).
 * 
 * Note: shift should be a power of two, e.g. to rotate 3 steps, use 2^3. 
 */
kernel float2 rotater( float2 a, float shift) {
        float2 ret;
	
	ret.x = a.x / shift;  // Shift words and keep fractions to shift those bits later.
	ret.y = a.y / shift;

	ret.y += frac( ret.x) * 65536.0;  // Shift low bits from x into y;
	ret.x += frac( ret.y) * 65536.0;  // Rotate low bits from y into x;

	ret.x = floor( ret.x);  // Cut shifted bits.
	ret.y = floor( ret.y);

        return ret;
}

/**
 * Xor half of an integer, represented as a float.
 */
kernel float xor16( float a<>, float b<>) {

        float ret = 0;
        float fact = 32768.0;

        while (fact >= 1.0) {  // 16 passes, one per bit; 'fact > 0' keeps halving far below 1
                if( ( ( a >= fact) || ( b >= fact)) && ( ( a < fact) || ( b < fact))) {
                  ret += fact;
		}

                if( a >= fact) {
                  a -= fact;
		}
                if (b >= fact) {
                  b -= fact;
		}

                fact /= 2.0;
        }
        return ret;
}

/**
 * Xor a complete integer represented as a float2.
 */
kernel float2 xor( float2 a<>, float2 b<>) {
       float2 ret = { xor16( a.x, b.x), xor16( a.y, b.y) };

       return ret;
}

/**
 * And operation on half of an integer, represented as a float.
 */
kernel float and16( float a<>, float b<>) {
        float ret = 0;
        float fact = 32768.0;

        while (fact >= 1.0) {  // 16 passes, one per bit; 'fact > 0' keeps halving far below 1
                if( ( a >= fact) && ( b >= fact)) {
                  ret += fact;
		}

                if( a >= fact) {
                  a -= fact;
		}
                if (b >= fact) {
                  b -= fact;
		}

                fact /= 2.0;
        }
        return ret;
}

/**
 * And operation on a full integer, represented as a float2.
 */
kernel float2 and( float2 a<>, float2 b<>) {
        float2 ret =  { and16( a.x, b.x), and16( a.y, b.y) };

        return ret;
}

/*
 * Logical complement ("not") 
 */
kernel float2 not( float2 a<>) {
       float2 ret = { 65535.0 - a.x, 65535.0 - a.y};

       return ret;
}

/**
 * Swap the 2 words of an int.
 */
kernel float2 swapw( float2 a) {
       float2 ret;

       ret.x = a.y;
       ret.y = a.x;

       return ret;
}

kernel float2 blend( float2 m16, float2 m15, float2 m07, float2 m02) {
        float2 s0 = xor( rotater( m15, 128.0), xor( rotater( swapw( m15), 4.0), shiftr( m15, 8.0)));
        float2 s1 = xor( rotater( swapw( m02), 2.0), xor( rotater( swapw( m02), 8.0), shiftr( m02, 1024.0)));

        return add( add( m16, s0), add( m07, s1));
}

kernel float2 e0( float2 a) {
        return xor( rotater( a, 4.0), xor( rotater( a, 8192.0), rotater( swapw( a), 64.0)));
}

kernel float2 e1( float2 a) {
        return xor( rotater( a, 64.0), xor( rotater( a, 2048.0), rotater( swapw( a), 512.0)));
}

kernel float2 ch( float2 a, float2 b, float2 c) {
        return xor( and( a, b), and( not( a), c));
}

kernel float2 maj( float2 a, float2 b, float2 c) {
        return xor( xor( and( a, b), and( a, c)), and( b, c));
}

This code compiles here, at least. I don't know if it actually works, since I don't have the actual sha256 code in Brook yet.
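
One way to find out without the GPU is to replay the helpers on the host and compare them with native integer operations. A minimal sketch in C, assuming x is the high and y the low 16-bit half; the `f2_*` names are made up for this check. It uses double so every intermediate sum is exact. With 24-bit significands (IEEE single precision), the intermediate ret.y in rotater needs more than 24 bits once shift is larger than 256, and this sketch does not cover that case:

```c
#include <stdint.h>

/* Host-side stand-in for float2: x = high 16 bits, y = low 16 bits. */
typedef struct { double x, y; } f2;

/* All values here are non-negative and small, so a cast truncates like floor(). */
static double floor_pos(double v) { return (double)(uint64_t)v; }
static double frac_part(double v) { return v - floor_pos(v); }

f2 f2_from_u32(uint32_t v)
{
    f2 r = { (double)(v >> 16), (double)(v & 0xffffu) };
    return r;
}

uint32_t f2_to_u32(f2 a)
{
    return ((uint32_t)a.x << 16) | (uint32_t)a.y;
}

/* Same steps as the add kernel: add halves, carry once, wrap at 2^32. */
f2 f2_add(f2 a, f2 b)
{
    f2 r = { a.x + b.x, a.y + b.y };
    if (r.y >= 65536.0) { r.y -= 65536.0; r.x += 1.0; }
    if (r.x >= 65536.0) { r.x -= 65536.0; }
    return r;
}

/* Same steps as the shiftr kernel; shift is a power of two. */
f2 f2_shiftr(f2 a, double shift)
{
    f2 r;
    r.x = a.x / shift;
    r.y = floor_pos(a.y / shift) + frac_part(r.x) * 65536.0;
    r.x = floor_pos(r.x);
    return r;
}

/* Same steps as the rotater kernel; shift is a power of two. */
f2 f2_rotater(f2 a, double shift)
{
    f2 r;
    r.x = a.x / shift;
    r.y = a.y / shift;
    r.y += frac_part(r.x) * 65536.0;
    r.x += frac_part(r.y) * 65536.0;
    r.x = floor_pos(r.x);
    r.y = floor_pos(r.y);
    return r;
}

static uint32_t rotr32(uint32_t v, unsigned n)
{
    return (v >> n) | (v << (32u - n));   /* valid for n in 1..31 */
}

/* Compare against native integer operations; returns the number of mismatches. */
int check_f2_helpers(void)
{
    static const uint32_t samples[] = {
        0x00000001u, 0x80000000u, 0x0000ffffu, 0xffff0000u,
        0x428a2f98u, 0xdeadbeefu, 0xffffffffu
    };
    const int count = (int)(sizeof samples / sizeof samples[0]);
    int bad = 0;

    for (int i = 0; i < count; ++i) {
        uint32_t v = samples[i];
        for (unsigned n = 1; n <= 15; ++n) {
            double shift = (double)(1u << n);
            if (f2_to_u32(f2_shiftr(f2_from_u32(v), shift)) != (v >> n)) ++bad;
            if (f2_to_u32(f2_rotater(f2_from_u32(v), shift)) != rotr32(v, n)) ++bad;
        }
        for (int j = 0; j < count; ++j) {
            uint32_t w = samples[j];
            if (f2_to_u32(f2_add(f2_from_u32(v), f2_from_u32(w))) != (uint32_t)(v + w)) ++bad;
        }
    }
    return bad;
}
```

A return value of 0 from check_f2_helpers() means the float arithmetic agrees with the integer operations for these samples when no rounding occurs. It says nothing yet about what the shader's 32-bit floats do with the same intermediates.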

Ciao,
Andreas

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: bulanula on February 12, 2012, 04:12:48 PM

Quote from: daybyter on February 12, 2012, 02:25:09 PM

If you can get this working you are my absolute HERO.

I absolutely DESPISE ATI and their proprietary BS drivers that always break. Once they fix X then Y comes up and once they fix Y then Z and X comes up etc.

It's a never ending cycle of desperation, at least for me.

Good luck !

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: Dusty on February 12, 2012, 04:29:26 PM

[ watching (mining on open source drivers would be awesome) ]

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: daybyter on February 12, 2012, 04:33:23 PM

Well, at the moment I'm only interested in getting something running on my old GeForce 7600 GS, which is not supported by CUDA or OpenCL. Just to give you an idea: my CPU mines at about 300 kilohashes per second at the moment, so 1 megahash per second would be an improvement for me... :)

But even if it works, you have to consider that the performance would be poor, since simple operations like bit shifting require quite a few float operations. And there's no way to really use the entire card, since Brook compiles only fragment shaders _or_ vertex shaders. The vertex profile even has limited integer support, but I have only 5 of those shaders; that's why I guess Xaci's system with the floats makes more sense.

So it's a great learning project, but I don't see any practical use for it, to be honest...
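
To put a rough number on that overhead: a 16-bit XOR done with floats has to visit every bit, where integer hardware needs a single instruction. A small C sketch in the style of the xor16 kernel posted above; `xor16_float` and its `steps` counter are made up for illustration, and the loop here stops at fact = 1, the lowest bit:

```c
#include <stddef.h>

/* XOR two 16-bit values stored in floats using only compares, adds and
 * subtracts. If steps is not NULL it receives the number of loop passes. */
float xor16_float(float a, float b, int *steps)
{
    float ret = 0.0f;
    int n = 0;

    for (float fact = 32768.0f; fact >= 1.0f; fact /= 2.0f) {
        int abit = a >= fact;     /* is the current bit set in a? */
        int bbit = b >= fact;     /* is the current bit set in b? */

        if (abit != bbit)
            ret += fact;          /* exactly one bit set: XOR gives 1 */
        if (abit)
            a -= fact;            /* strip the bit before moving on */
        if (bbit)
            b -= fact;
        ++n;
    }
    if (steps)
        *steps = n;
    return ret;
}
```

Each call makes 16 passes with two compares and up to three add/subtract operations per pass, and a full 32-bit XOR needs two such calls. Every sha256 round is built from several of these.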

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: daybyter on February 14, 2012, 10:44:03 PM

Current status:

- Code is incomplete and buggy, but compiles

- The kernel is not optimized and especially the stream transport of the nonces to the kernel is not really implemented.

- A few issues: the nonce is not in the first hash, and I think some info is not passed to the 2nd hash round. I also think the endianness of the hash vs. the difficulty is not the same.

- The block header and the difficulty are not set yet, since I'm testing other stuff now.

- BrookGPU runs into some sort of infinite loop, consumes up to 2 GB of memory and is then terminated (no clue why yet).

- I had tons of problems with arrays, since Brook wanted to convert array constants to Brook streams, which are not constant during the nonce yet. That's a no-go, so I just split the arrays into single vars and wrote scripts to generate the code (since all vars are passed as values and not pointers, it's not much of an issue for now).

- Arrays as local vars are causing trouble, since Brook wants to align them in a way that cgc doesn't like, so I split them up, too.
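
For the difficulty that still has to be set: a hedged C sketch of how the host could decode the compact "bits" field of a block header into a 256-bit target and split it into 16-bit halves for eight float2 words. It assumes the usual Bitcoin compact encoding (one exponent byte, three mantissa bytes) and ignores the sign bit. The function names and the word order (most significant word first) are assumptions, not something the kernel below prescribes:

```c
#include <stdint.h>
#include <string.h>

/* Decode compact "bits" into a target of 32 bytes, most significant byte
 * first. Returns 0 on success, -1 if the value does not fit this sketch. */
int decode_compact_target(uint32_t bits, uint8_t target[32])
{
    uint32_t exponent = bits >> 24;          /* size of the number in bytes */
    uint32_t mantissa = bits & 0x007fffffu;  /* top three bytes, sign bit dropped */

    memset(target, 0, 32);
    if (exponent > 32)
        return -1;

    /* The three mantissa bytes start 'exponent' bytes from the end;
     * bytes that would fall past the end are shifted out. */
    for (int i = 0; i < 3; ++i) {
        int pos = 32 - (int)exponent + i;
        if (pos >= 0 && pos < 32)
            target[pos] = (uint8_t)(mantissa >> (8 * (2 - i)));
    }
    return 0;
}

/* Split the target into eight 32-bit words, each as a high and a low
 * 16-bit half stored in a float (word 0 = most significant, by assumption). */
void target_to_halves(const uint8_t target[32], float hi[8], float lo[8])
{
    for (int w = 0; w < 8; ++w) {
        hi[w] = (float)((target[4 * w] << 8) | target[4 * w + 1]);
        lo[w] = (float)((target[4 * w + 2] << 8) | target[4 * w + 3]);
    }
}
```

With bits = 0x1d00ffff, the second word comes out as hi = 65535, lo = 0 and all other words are zero. Whether the kernel wants the words in this order or byte-swapped depends on how the endianness issue above gets resolved.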

Just to give you an idea of the strange-looking code:

Code:

/**
 * Aminer - a bitcoin miner for various platforms.
 *
 * Andreas Rueckert <a_rueckert@gmx.net>
 *
 * A good part of this code is based on the GLSL sha256 code of xaci: https://bitcointalk.org/index.php?topic=4618.msg191488#msg191488
 */

/*
#pragma optionNV looplimit 32768
*/



/**
 * Some utility functions to process integers represented as float2.
 */

/**
 * Add 2 integers represented as float2.
 *
 * Do not let overflow happen with this function, or use sum_c instead! 
 */
kernel float2 add( float2 a, float2 b) {
        float2 ret;

        ret.x = a.x + b.x;
        ret.y = a.y + b.y;

        if (ret.y >= 65536.0) {
                ret.y -= 65536.0;
                ret.x += 1.0;
        }

        if (ret.x >= 65536.0) {
                ret.x -= 65536.0;
	}

        return ret;
}

/**
 * Shift an integer represented as a float2 by log2(shift).
 * 
 * Note: shift should be a power of two, e.g. to shift 3 steps, use 2^3. 
 */
kernel float2 shiftr( float2 a, float shift) {
        float2 ret;
	
	ret.x = a.x / shift;

	ret.y = floor( a.y / shift) + frac( ret.x) * 65536.0;

	ret.x = floor( ret.x);

        return ret;
}

/**
 * Rotate an integer represented as a float2 by log2(shift).
 * 
 * Note: shift should be a power of two, e.g. to rotate 3 steps, use 2^3. 
 */
kernel float2 rotater( float2 a, float shift) {
        float2 ret;
	
	ret.x = a.x / shift;  // Shift words and keep fractions to shift those bits later.
	ret.y = a.y / shift;

	ret.y += frac( ret.x) * 65536.0;  // Shift low bits from x into y;
	ret.x += frac( ret.y) * 65536.0;  // Rotate low bits from y into x;

	ret.x = floor( ret.x);  // Cut shifted bits.
	ret.y = floor( ret.y);

        return ret;
}

/**
 * Xor half of an integer, represented as a float.
 */
kernel float xor16( float a<>, float b<>) {

        float ret = 0;
        float fact = 32768.0;

        while (fact >= 1.0) {  // 16 passes, one per bit; 'fact > 0' keeps halving far below 1
                if( ( ( a >= fact) || ( b >= fact)) && ( ( a < fact) || ( b < fact))) {
                  ret += fact;
		}

                if( a >= fact) {
                  a -= fact;
		}
                if (b >= fact) {
                  b -= fact;
		}

                fact /= 2.0;
        }
        return ret;
}

/**
 * Xor a complete integer represented as a float2.
 */
kernel float2 xor( float2 a<>, float2 b<>) {
       float2 ret = { xor16( a.x, b.x), xor16( a.y, b.y) };

       return ret;
}

/**
 * And operation on half of an integer, represented as a float.
 */
kernel float and16( float a<>, float b<>) {
        float ret = 0;
        float fact = 32768.0;

        while (fact >= 1.0) {  // 16 passes, one per bit; 'fact > 0' keeps halving far below 1
                if( ( a >= fact) && ( b >= fact)) {
                  ret += fact;
		}

                if( a >= fact) {
                  a -= fact;
		}
                if (b >= fact) {
                  b -= fact;
		}

                fact /= 2.0;
        }
        return ret;
}

/**
 * And operation on a full integer, represented as a float2.
 */
kernel float2 and( float2 a<>, float2 b<>) {
        float2 ret =  { and16( a.x, b.x), and16( a.y, b.y) };

        return ret;
}

/*
 * Logical complement ("not") 
 */
kernel float2 not( float2 a<>) {
       float2 ret = { 65535.0 - a.x, 65535.0 - a.y};

       return ret;
}

/**
 * Swap the 2 words of an int.
 */
kernel float2 swapw( float2 a) {
       float2 ret;

       ret.x = a.y;
       ret.y = a.x;

       return ret;
}

/**
 * Swap the 2 bytes in an 16-bit word.
 */
kernel float swapb( float a) {
       float ret = a / 256.0;

       ret += frac( ret) * 65536.0;

       return floor( ret);
}

/**
 * Swap the 4 bytes of a 4-byte integer;
 */
kernel float2 swapInt( float2 a) {
       float2 ret = swapw( a);

       ret.x = swapb( ret.x);
       ret.y = swapb( ret.y);

       return ret;
}

/**
 * Check if float2 integer a is smaller than float2 integer b.
 */
kernel float isSmaller( float2 a, float2 b) {
       if( ( a.x < b.x) || ( ( a.x == b.x) && ( a.y < b.y))) {
           return 1.0;
       } else {
           return 0.0;
       }
}

kernel float2 blend( float2 m16, float2 m15, float2 m07, float2 m02) {
        float2 s0 = xor( rotater( m15, 128.0), xor( rotater( swapw( m15), 4.0), shiftr( m15, 8.0)));
        float2 s1 = xor( rotater( swapw( m02), 2.0), xor( rotater( swapw( m02), 8.0), shiftr( m02, 1024.0)));

        return add( add( m16, s0), add( m07, s1));
}

kernel float2 s0( float2 a) {
        return xor( rotater( a, 4.0), xor( rotater( a, 8192.0), rotater( swapw( a), 64.0)));
}

kernel float2 s1( float2 a) {
        return xor( rotater( a, 64.0), xor( rotater( a, 2048.0), rotater( swapw( a), 512.0)));
}

kernel float2 ch( float2 a, float2 b, float2 c) {
        return xor( and( a, b), and( not( a), c));
}

kernel float2 maj( float2 a, float2 b, float2 c) {
        return xor( xor( and( a, b), and( a, c)), and( b, c));
}


/**
 * Let the kernel check a nonce for a given bitcoin block.
 * That's basically 2 rounds of sha256 and a difficulty check.
 * 
 * @param nonce The nonce to test for the given block
 * @param block_header* The block header as a set of ints, since brcc always converts constant arrays to brook::stream here.
 * @param difficulty The difficulty as a 256-bit superlong int.
 *
 * @return result Return the nonce, if it is valid. Return -nonce if not.
 */
kernel void kernelMinerCheckNonce( float2 nonce<>
      	    			   , float2 block_header0
				   , float2 block_header1
				   , float2 block_header2
				   , float2 block_header3 
				   , float2 block_header4
				   , float2 block_header5 
				   , float2 block_header6
				   , float2 block_header7
				   , float2 block_header8
				   , float2 block_header9
      	    			   , float2 block_header10
				   , float2 block_header11
				   , float2 block_header12
				   , float2 block_header13 
				   , float2 block_header14
				   , float2 block_header15 
				   , float2 block_header16
				   , float2 block_header17
				   , float2 block_header18
				   , float2 block_header19
				   , float2 decoded_difficulty0
				   , float2 decoded_difficulty1
				   , float2 decoded_difficulty2
				   , float2 decoded_difficulty3
				   , float2 decoded_difficulty4
				   , float2 decoded_difficulty5
				   , float2 decoded_difficulty6
				   , float2 decoded_difficulty7
				   , out float2 result<>) {

       // brcc seems to have problems with array alignment in fp40 model, so no arrays for now... :-(
       float2 k0,k1,k2,k3,k4,k5,k6,k7,k8,k9,k10,k11,k12,k13,k14,k15;
       float2 k16,k17,k18,k19,k20,k21,k22,k23,k24,k25,k26,k27,k28,k29,k30,k31;
       float2 k32,k33,k34,k35,k36,k37,k38,k39,k40,k41,k42,k43,k44,k45,k46,k47;
       float2 k48,k49,k50,k51,k52,k53,k54,k55,k56,k57,k58,k59,k60,k61,k62,k63;
       float2 h0,h1,h2,h3,h4,h5,h6,h7;
       float2 w0,w1,w2,w3,w4,w5,w6,w7,w8,w9,w10,w11,w12,w13,w14,w15;
       float2 w16,w17,w18,w19,w20,w21,w22,w23,w24,w25,w26,w27,w28,w29,w30,w31;
       float2 w32,w33,w34,w35,w36,w37,w38,w39,w40,w41,w42,w43,w44,w45,w46,w47;
       float2 w48,w49,w50,w51,w52,w53,w54,w55,w56,w57,w58,w59,w60,w61,w62,w63;
       float2 a,b,c,d,e,f,g,h;
       float2 t1, t2;

       // Initialize k
       // ( Code generated with the following c-program:
       /*
       #include <stdio.h>

       int main(int argc, char *argv[]) {
       	    unsigned int k[64] = { 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
	       	     	           0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
				   0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
				   0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
		        	   0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
			   	   0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
			      	   0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
			           0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2 };
	    int i;

   	    for( i = 0; i < 64; ++i) {
            	 printf( "k%d.x = %5u.0; k%d.y = %5u.0;\n", i, k[i] >> 16, i, k[i] & 0xffff);
            }
       }
       */
       
       k0.x = 17034.0; k0.y = 12184.0;
       k1.x = 28983.0; k1.y = 17553.0;
       k2.x = 46528.0; k2.y = 64463.0;
       k3.x = 59829.0; k3.y = 56229.0;
       k4.x = 14678.0; k4.y = 49755.0;
       k5.x = 23025.0; k5.y =  4593.0;
       k6.x = 37439.0; k6.y = 33444.0;
       k7.x = 43804.0; k7.y = 24277.0;
       k8.x = 55303.0; k8.y = 43672.0;
       k9.x =  4739.0; k9.y = 23297.0;
       k10.x =  9265.0; k10.y = 34238.0;
       k11.x = 21772.0; k11.y = 32195.0;
       k12.x = 29374.0; k12.y = 23924.0;
       k13.x = 32990.0; k13.y = 45566.0;
       k14.x = 39900.0; k14.y =  1703.0;
       k15.x = 49563.0; k15.y = 61812.0;
       k16.x = 58523.0; k16.y = 27073.0;
       k17.x = 61374.0; k17.y = 18310.0;
       k18.x =  4033.0; k18.y = 40390.0;
       k19.x =  9228.0; k19.y = 41420.0;
       k20.x = 11753.0; k20.y = 11375.0;
       k21.x = 19060.0; k21.y = 33962.0;
       k22.x = 23728.0; k22.y = 43484.0;
       k23.x = 30457.0; k23.y = 35034.0;
       k24.x = 38974.0; k24.y = 20818.0;
       k25.x = 43057.0; k25.y = 50797.0;
       k26.x = 45059.0; k26.y = 10184.0;
       k27.x = 48985.0; k27.y = 32711.0;
       k28.x = 50912.0; k28.y =  3059.0;
       k29.x = 54695.0; k29.y = 37191.0;
       k30.x =  1738.0; k30.y = 25425.0;
       k31.x =  5161.0; k31.y = 10599.0;
       k32.x = 10167.0; k32.y =  2693.0;
       k33.x = 11803.0; k33.y =  8504.0;
       k34.x = 19756.0; k34.y = 28156.0;
       k35.x = 21304.0; k35.y =  3347.0;
       k36.x = 25866.0; k36.y = 29524.0;
       k37.x = 30314.0; k37.y =  2747.0;
       k38.x = 33218.0; k38.y = 51502.0;
       k39.x = 37490.0; k39.y = 11397.0;
       k40.x = 41663.0; k40.y = 59553.0;
       k41.x = 43034.0; k41.y = 26187.0;
       k42.x = 49739.0; k42.y = 35696.0;
       k43.x = 51052.0; k43.y = 20899.0;
       k44.x = 53650.0; k44.y = 59417.0;
       k45.x = 54937.0; k45.y =  1572.0;
       k46.x = 62478.0; k46.y = 13701.0;
       k47.x =  4202.0; k47.y = 41072.0;
       k48.x =  6564.0; k48.y = 49430.0;
       k49.x =  7735.0; k49.y = 27656.0;
       k50.x = 10056.0; k50.y = 30540.0;
       k51.x = 13488.0; k51.y = 48309.0;
       k52.x = 14620.0; k52.y =  3251.0;
       k53.x = 20184.0; k53.y = 43594.0;
       k54.x = 23452.0; k54.y = 51791.0;
       k55.x = 26670.0; k55.y = 28659.0;
       k56.x = 29839.0; k56.y = 33518.0;
       k57.x = 30885.0; k57.y = 25455.0;
       k58.x = 33992.0; k58.y = 30740.0;
       k59.x = 36039.0; k59.y =   520.0;
       k60.x = 37054.0; k60.y = 65530.0;
       k61.x = 42064.0; k61.y = 27883.0;
       k62.x = 48889.0; k62.y = 41975.0;
       k63.x = 50801.0; k63.y = 30962.0;


       // Initialize h

       h0.x = 27145.0; h0.y = 58983.0;  // 0x6a09 0xe667
       h1.x = 47975.0; h1.y = 44677.0;  // 0xbb67 0xae85
       h2.x = 15470.0; h2.y = 62322.0;  // 0x3c6e 0xf372
       h3.x = 42319.0; h3.y = 62778.0;  // 0xa54f 0xf53a
       h4.x = 20750.0; h4.y = 21119.0;  // 0x510e 0x527f
       h5.x = 39685.0; h5.y = 26764.0;  // 0x9b05 0x688c
       h6.x =  8067.0; h6.y = 55723.0;  // 0x1f83 0xd9ab
       h7.x = 23520.0; h7.y = 52505.0;  // 0x5be0 0xcd19

       // For the following algorithm, see: http://en.wikipedia.org/wiki/SHA-2#Examples_of_SHA-2_variants

       // Initialize the first 16 w values
       // ToDo? Precompute this outside of the kernel, since the nonce is not in these 16 values?

       /*
        * Process the message in successive 512-bit chunks:
	* break message into 512-bit chunks
	* for each chunk
    	*   break chunk into sixteen 32-bit big-endian words w[0..15]
	*/
	// Implementation:
	w0 = swapInt( block_header0);
	w1 = swapInt( block_header1);
	w2 = swapInt( block_header2);
	w3 = swapInt( block_header3);
	w4 = swapInt( block_header4);
	w5 = swapInt( block_header5);
	w6 = swapInt( block_header6);
	w7 = swapInt( block_header7);
	w8 = swapInt( block_header8);
	w9 = swapInt( block_header9);
	w10 = swapInt( block_header10);
	w11 = swapInt( block_header11);
	w12 = swapInt( block_header12);
	w13 = swapInt( block_header13);
	w14 = swapInt( block_header14);
	w15 = swapInt( block_header15);

	/*
         * for( i = 16; i < 64; i++) {
       	 *   w[i] = blend( w[ i - 16], w[ i - 15], w[ i -7], w[ i - 2]);
       	 * }
	 */
	// Implementation 
	// (Generated with bash script: for i in {16..63}; do echo "w$i = blend( w$((i - 16)), w$((i - 15)), w$((i -7)), w$((i - 2)));"; done ):
	w16 = blend( w0, w1, w9, w14);
	w17 = blend( w1, w2, w10, w15);
	w18 = blend( w2, w3, w11, w16);
	w19 = blend( w3, w4, w12, w17);
	w20 = blend( w4, w5, w13, w18);
	w21 = blend( w5, w6, w14, w19);
	w22 = blend( w6, w7, w15, w20);
	w23 = blend( w7, w8, w16, w21);
	w24 = blend( w8, w9, w17, w22);
	w25 = blend( w9, w10, w18, w23);
	w26 = blend( w10, w11, w19, w24);
	w27 = blend( w11, w12, w20, w25);
	w28 = blend( w12, w13, w21, w26);
	w29 = blend( w13, w14, w22, w27);
	w30 = blend( w14, w15, w23, w28);
	w31 = blend( w15, w16, w24, w29);
	w32 = blend( w16, w17, w25, w30);
	w33 = blend( w17, w18, w26, w31);
	w34 = blend( w18, w19, w27, w32);
	w35 = blend( w19, w20, w28, w33);
	w36 = blend( w20, w21, w29, w34);
	w37 = blend( w21, w22, w30, w35);
	w38 = blend( w22, w23, w31, w36);
	w39 = blend( w23, w24, w32, w37);
	w40 = blend( w24, w25, w33, w38);
	w41 = blend( w25, w26, w34, w39);
	w42 = blend( w26, w27, w35, w40);
	w43 = blend( w27, w28, w36, w41);
	w44 = blend( w28, w29, w37, w42);
	w45 = blend( w29, w30, w38, w43);
	w46 = blend( w30, w31, w39, w44);
	w47 = blend( w31, w32, w40, w45);
	w48 = blend( w32, w33, w41, w46);
	w49 = blend( w33, w34, w42, w47);
	w50 = blend( w34, w35, w43, w48);
	w51 = blend( w35, w36, w44, w49);
	w52 = blend( w36, w37, w45, w50);
	w53 = blend( w37, w38, w46, w51);
	w54 = blend( w38, w39, w47, w52);
	w55 = blend( w39, w40, w48, w53);
	w56 = blend( w40, w41, w49, w54);
	w57 = blend( w41, w42, w50, w55);
	w58 = blend( w42, w43, w51, w56);
	w59 = blend( w43, w44, w52, w57);
	w60 = blend( w44, w45, w53, w58);
	w61 = blend( w45, w46, w54, w59);
	w62 = blend( w46, w47, w55, w60);
	w63 = blend( w47, w48, w56, w61);

	/*
	 * Initialize hash value for this chunk:
    	 * a := h0
	 * b := h1
    	 * c := h2
    	 * d := h3
    	 * e := h4
    	 * f := h5
    	 * g := h6
    	 * h := h7
	 */
	// Implementation:
	a = h0;
	b = h1;
	c = h2;
	d = h3;
	e = h4;
	f = h5;
	g = h6;
	h = h7;

	/*
	 * Main loop:
	 * for i from 0 to 63
         *     s0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
         *     maj := (a and b) xor (a and c) xor (b and c)
         *     t2 := s0 + maj
         *     s1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
         *     ch := (e and f) xor ((not e) and g)
         *     t1 := h + s1 + ch + k[i] + w[i]
	 *
	 *     h := g
         *     g := f
         *     f := e
         *     e := d + t1
         *     d := c
         *     c := b
         *     b := a
         *     a := t1 + t2
	*/
	// Implementation:
	// ( Generated with bash script: 
	// for i in {0..63}; do echo "t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k$i)), w$i);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);"; done
	// Note: the sums go through add(), so carries between the 16-bit halves are handled; a plain float2 '+' adds the halves independently.
	// )
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k0)), w0);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k1)), w1);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k2)), w2);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k3)), w3);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k4)), w4);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k5)), w5);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k6)), w6);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k7)), w7);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k8)), w8);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k9)), w9);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k10)), w10);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k11)), w11);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k12)), w12);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k13)), w13);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k14)), w14);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k15)), w15);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k16)), w16);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k17)), w17);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k18)), w18);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k19)), w19);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k20)), w20);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k21)), w21);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k22)), w22);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k23)), w23);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k24)), w24);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k25)), w25);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k26)), w26);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k27)), w27);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k28)), w28);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k29)), w29);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k30)), w30);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k31)), w31);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k32)), w32);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k33)), w33);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k34)), w34);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k35)), w35);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k36)), w36);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k37)), w37);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k38)), w38);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k39)), w39);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k40)), w40);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k41)), w41);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k42)), w42);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k43)), w43);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k44)), w44);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k45)), w45);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k46)), w46);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k47)), w47);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k48)), w48);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k49)), w49);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k50)), w50);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k51)), w51);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k52)), w52);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k53)), w53);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k54)), w54);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k55)), w55);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k56)), w56);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k57)), w57);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k58)), w58);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k59)), w59);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k60)), w60);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k61)), w61);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k62)), w62);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);
	t2 = add( s0( a), maj( a, b, c)); t1 = add( add( add( h, s1( e)), add( ch( e, f, g), k63)), w63);  h = g; g = f; f = e; e = add( d, t1); d = c; c = b; b = a; a = add( t1, t2);

	/*
	 * Add this chunk's hash to result so far:
	 * h0 := h0 + a
	 * h1 := h1 + b
	 * h2 := h2 + c
	 * h3 := h3 + d
	 * h4 := h4 + e
	 * h5 := h5 + f
	 * h6 := h6 + g
	 * h7 := h7 + h
	 */
	h0 = add( h0, a);
	h1 = add( h1, b);
	h2 = add( h2, c);
	h3 = add( h3, d);
	h4 = add( h4, e);
	h5 = add( h5, f);
	h6 = add( h6, g);
	h7 = add( h7, h);

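	// The result of the first SHA-256 pass should now be in h0..h7, as big-endian 32-bit words.

	// Use it as the message for the second pass: the digest fills w0..w7.

	// ToDo: check if the h-order should be reversed, like w0 = h7; w1 = h6; ...

	w0 = h0;
	w1 = h1;
	w2 = h2;
	w3 = h3;
	w4 = h4;
	w5 = h5;
	w6 = h6;
	w7 = h7;

	// Pad the 32-byte message to a full 64-byte block: a single 1 bit,
	// zeros, then the message length in bits (256 = 0x0100) in the last word.
	w8.x = 32768.0; w8.y = 0.0;  // 0x8000 0x0000
	w9.x = 0.0; w9.y = 0.0;
	w10 = w9;
	w11 = w9;
	w12 = w9;
	w13 = w9;
	w14 = w9;
	w15.x = 0.0; w15.y = 256.0;  // 0x0000 0x0100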

	// Re-initialize h0..h7 to the SHA-256 initial values for the second pass

	h0.x = 27145.0; h0.y = 58983.0;  // 0x6a09 0xe667
	h1.x = 47975.0; h1.y = 44677.0;  // 0xbb67 0xae85
	h2.x = 15470.0; h2.y = 62322.0;  // 0x3c6e 0xf372
	h3.x = 42319.0; h3.y = 62778.0;  // 0xa54f 0xf53a
	h4.x = 20750.0; h4.y = 21119.0;  // 0x510e 0x527f
	h5.x = 39685.0; h5.y = 26764.0;  // 0x9b05 0x688c
	h6.x =  8067.0; h6.y = 55723.0;  // 0x1f83 0xd9ab
	h7.x = 23520.0; h7.y = 52505.0;  // 0x5be0 0xcd19

	/*
	 * for( i = 16; i < 64; i++) {
	 *   w[i] = blend( w[ i - 16], w[ i - 15], w[ i - 7], w[ i - 2]);
	 * }
	 */
	// Implementation 
	// (Generated with bash script: for i in {16..63}; do echo "w$i = blend( w$((i - 16)), w$((i - 15)), w$((i -7)), w$((i - 2)));"; done ):
	w16 = blend( w0, w1, w9, w14);
	w17 = blend( w1, w2, w10, w15);
	w18 = blend( w2, w3, w11, w16);
	w19 = blend( w3, w4, w12, w17);
	w20 = blend( w4, w5, w13, w18);
	w21 = blend( w5, w6, w14, w19);
	w22 = blend( w6, w7, w15, w20);
	w23 = blend( w7, w8, w16, w21);
	w24 = blend( w8, w9, w17, w22);
	w25 = blend( w9, w10, w18, w23);
	w26 = blend( w10, w11, w19, w24);
	w27 = blend( w11, w12, w20, w25);
	w28 = blend( w12, w13, w21, w26);
	w29 = blend( w13, w14, w22, w27);
	w30 = blend( w14, w15, w23, w28);
	w31 = blend( w15, w16, w24, w29);
	w32 = blend( w16, w17, w25, w30);
	w33 = blend( w17, w18, w26, w31);
	w34 = blend( w18, w19, w27, w32);
	w35 = blend( w19, w20, w28, w33);
	w36 = blend( w20, w21, w29, w34);
	w37 = blend( w21, w22, w30, w35);
	w38 = blend( w22, w23, w31, w36);
	w39 = blend( w23, w24, w32, w37);
	w40 = blend( w24, w25, w33, w38);
	w41 = blend( w25, w26, w34, w39);
	w42 = blend( w26, w27, w35, w40);
	w43 = blend( w27, w28, w36, w41);
	w44 = blend( w28, w29, w37, w42);
	w45 = blend( w29, w30, w38, w43);
	w46 = blend( w30, w31, w39, w44);
	w47 = blend( w31, w32, w40, w45);
	w48 = blend( w32, w33, w41, w46);
	w49 = blend( w33, w34, w42, w47);
	w50 = blend( w34, w35, w43, w48);
	w51 = blend( w35, w36, w44, w49);
	w52 = blend( w36, w37, w45, w50);
	w53 = blend( w37, w38, w46, w51);
	w54 = blend( w38, w39, w47, w52);
	w55 = blend( w39, w40, w48, w53);
	w56 = blend( w40, w41, w49, w54);
	w57 = blend( w41, w42, w50, w55);
	w58 = blend( w42, w43, w51, w56);
	w59 = blend( w43, w44, w52, w57);
	w60 = blend( w44, w45, w53, w58);
	w61 = blend( w45, w46, w54, w59);
	w62 = blend( w46, w47, w55, w60);
	w63 = blend( w47, w48, w56, w61);

	/*
	 * Initialize hash value for this chunk:
	 * a := h0
	 * b := h1
	 * c := h2
	 * d := h3
	 * e := h4
	 * f := h5
	 * g := h6
	 * h := h7
	 */
	// Implementation:
	a = h0;
	b = h1;
	c = h2;
	d = h3;
	e = h4;
	f = h5;
	g = h6;
	h = h7;

	/*
	 * Main loop:
	 * for i from 0 to 63
	 *     s0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
	 *     maj := (a and b) xor (a and c) xor (b and c)
	 *     t2 := s0 + maj
	 *     s1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
	 *     ch := (e and f) xor ((not e) and g)
	 *     t1 := h + s1 + ch + k[i] + w[i]
	 *
	 *     h := g
	 *     g := f
	 *     f := e
	 *     e := d + t1
	 *     d := c
	 *     c := b
	 *     b := a
	 *     a := t1 + t2
	 */
	// Implementation:
	// ( Generated with bash script: 
	// for i in {0..63}; do echo "t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k$i + w$i;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;"; done
	// )
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k0 + w0;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k1 + w1;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k2 + w2;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k3 + w3;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k4 + w4;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k5 + w5;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k6 + w6;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k7 + w7;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k8 + w8;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k9 + w9;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k10 + w10;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k11 + w11;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k12 + w12;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k13 + w13;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k14 + w14;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k15 + w15;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k16 + w16;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k17 + w17;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k18 + w18;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k19 + w19;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k20 + w20;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k21 + w21;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k22 + w22;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k23 + w23;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k24 + w24;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k25 + w25;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k26 + w26;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k27 + w27;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k28 + w28;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k29 + w29;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k30 + w30;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k31 + w31;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k32 + w32;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k33 + w33;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k34 + w34;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k35 + w35;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k36 + w36;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k37 + w37;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k38 + w38;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k39 + w39;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k40 + w40;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k41 + w41;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k42 + w42;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k43 + w43;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k44 + w44;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k45 + w45;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k46 + w46;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k47 + w47;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k48 + w48;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k49 + w49;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k50 + w50;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k51 + w51;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k52 + w52;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k53 + w53;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k54 + w54;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k55 + w55;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k56 + w56;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k57 + w57;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k58 + w58;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k59 + w59;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k60 + w60;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k61 + w61;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k62 + w62;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;
	t2 = s0( a) + maj( a, b, c); t1 = h + s1( e) + ch( e, f, g) + k63 + w63;  h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2;

	/*
	 * Add this chunk's hash to result so far:
	 * h0 := h0 + a
	 * h1 := h1 + b
	 * h2 := h2 + c
	 * h3 := h3 + d
	 * h4 := h4 + e
	 * h5 := h5 + f
	 * h6 := h6 + g
	 * h7 := h7 + h
	 */
	h0 = add( h0, a);
	h1 = add( h1, b);
	h2 = add( h2, c);
	h3 = add( h3, d);
	h4 = add( h4, e);
	h5 = add( h5, f);
	h6 = add( h6, g);
	h7 = add( h7, h);	

	// Compare the resulting hash with the decoded difficulty target.

	// ToDo: check endianness!

	if( isSmaller( h7, decoded_difficulty7) 
	    + isSmaller( h6, decoded_difficulty6) 
	    + isSmaller( h5, decoded_difficulty5) 
	    + isSmaller( h4, decoded_difficulty4)
	    + isSmaller( h3, decoded_difficulty3) 
	    + isSmaller( h2, decoded_difficulty2) 
	    + isSmaller( h1, decoded_difficulty1) 
	    + isSmaller( h0, decoded_difficulty0) == 8.0) {
	    result = nonce;  // Found a valid nonce!
	} else {
	    result.x = -nonce.x;  // Return -nonce to indicate that this nonce was not valid.
	    result.y = -nonce.y;
	}

}
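
A note on the final check: assuming `isSmaller` returns 1.0 when its first argument is the smaller one, requiring the sum to be 8.0 accepts only hashes in which every word is below the matching target word. That is stricter than a 256-bit "less than". Below is a minimal sketch of the usual comparison, most significant word first. It is written in Python rather than GLSL so it can be run on its own. The function name `hash_below_target`, the choice of word 7 as the most significant word, and the sample values are assumptions for illustration, not part of the shader.

```python
def hash_below_target(hash_words, target_words):
    """Return True if the 256-bit value in hash_words is below target_words.

    Both arguments are sequences of eight 32-bit integers. Index 7 is
    treated as the most significant word (an assumption of this sketch).
    """
    for i in range(7, -1, -1):
        if hash_words[i] < target_words[i]:
            return True
        if hash_words[i] > target_words[i]:
            return False
    return False  # equal values are not "below"


# Made-up sample values: words 0..5 are equal, word 6 decides the result.
target = [0xFFFFFFFF] * 6 + [0x0000FFFF, 0x00000000]
candidate = [0xFFFFFFFF] * 6 + [0x00000001, 0x00000000]
print(hash_below_target(candidate, target))
```

The candidate here is below the target although six of its words are merely equal, which an "all eight words smaller" test would reject.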

Ciao,
Andreas
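
For readers following the number format in the shader above: each 32-bit word is kept as two 16-bit halves in the `.x` (high) and `.y` (low) float components, which is why `h0` is written as 27145.0 / 58983.0 for 0x6a09e667. The following is a rough Python sketch of that representation and of an addition that carries from the low half into the high half. The helper names `split`, `join` and `add_halves` are hypothetical; the shader's own `add()` is not reproduced here and may work differently.

```python
def split(word):
    """Split a 32-bit integer into (high, low) 16-bit halves stored as floats."""
    return (float(word >> 16), float(word & 0xFFFF))


def join(halves):
    """Rebuild the 32-bit integer from its (high, low) float halves."""
    return (int(halves[0]) << 16) | int(halves[1])


def add_halves(a, b):
    """Add two split words modulo 2**32, propagating the carry by hand."""
    low = a[1] + b[1]
    carry = 1.0 if low >= 65536.0 else 0.0
    low -= carry * 65536.0
    high = a[0] + b[0] + carry
    if high >= 65536.0:
        high -= 65536.0  # drop the overflow out of bit 31
    return (high, low)


print(split(0x6A09E667))
print(hex(join(add_halves(split(0xFFFFFFFF), split(1)))))
```

A plain component-wise `+` on the two floats would skip the carry step, so sums would drift out of the 16-bit range.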

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: K1773R on October 30, 2012, 02:56:24 PM

bounty still "alive"?

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: jgarzik on October 30, 2012, 04:34:25 PM

Quote from: K1773R on October 30, 2012, 02:56:24 PM

bounty still "alive"?

Yes. I'll personally keep it alive, matching the $subject pledge (200 BTC).

Updated OP.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (195 BTC pledged)
Post by: K1773R on November 01, 2012, 10:35:55 AM

EDIT: nvm ;)

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (15 BTC pledged)
Post by: PeanutPower on March 13, 2013, 05:36:18 PM

any updates on this project?

I have a nice old Radeon HD 2600 to play with

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (15 BTC pledged)
Post by: MoonShadow on March 13, 2013, 05:48:53 PM

I have an update. I'm withdrawing any pledges that I have made here. I no longer consider this project to be relevant or worthwhile.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (15 BTC pledged)
Post by: jgarzik on March 13, 2013, 09:39:04 PM

ACK. I left the 15 BTC pledge active, as I do consider the project still worthwhile... although of diminished importance in the FPGA/ASIC era.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (15 BTC pledged)
Post by: PeanutPower on March 27, 2013, 11:33:47 AM

$1200 eh ;)

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (15 BTC pledged)
Post by: Luke-Jr on April 12, 2013, 12:08:34 AM

Slashdot: Open Source Radeon Gallium3D OpenCL Stack Adds Bitcoin Mining (http://hardware.slashdot.org/story/13/04/11/1314214/open-source-radeon-gallium3d-opencl-stack-adds-bitcoin-mining)

In the process of working out the kinks now...

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (15 BTC pledged)
Post by: jenga on September 02, 2013, 06:55:17 PM

Hi everyone.
I'm probably reviving an old thread and I apologize for it.

I made a GLSL implementation of the sha256d script I posted here:
https://bitcointalk.org/index.php?topic=286532

Using OpenGL 3.3 and GLSL 1.30 shaders, which have built-in support for bitwise operations, I can almost reach cgminer performance.

As an example, one of my machines has a GeForce 9600GT. It is pretty bad for mining, but it shows very well how things turned out.

https://en.bitcoin.it/wiki/Mining_hardware_comparison
The chart says this GPU gets 15.66 Mh/s. Here is what a couple of benchmarks show:

9600GT -> ideally: 15.66 Mh/s
9600GT -> cgminer speed: 15.34 Mh/s
9600GT -> my GLSL script: 14.80 Mh/s
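
To put the three figures side by side, here is a quick calculation of each result relative to the others. The numbers are the ones quoted above; the helper name `percent_of` is only for illustration.

```python
def percent_of(value, reference):
    """Express value as a percentage of reference."""
    return 100.0 * value / reference


chart = 15.66    # Mh/s, from the hardware comparison chart
cgminer = 15.34  # Mh/s, measured with cgminer
glsl = 14.80     # Mh/s, measured with the GLSL shader

print(f"cgminer vs chart: {percent_of(cgminer, chart):.1f}%")
print(f"GLSL vs chart:    {percent_of(glsl, chart):.1f}%")
print(f"GLSL vs cgminer:  {percent_of(glsl, cgminer):.1f}%")
```

By this arithmetic the GLSL shader lands roughly 3.5% behind cgminer on this card.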

I'm not sure how much this is going to help. The shader is currently a naive translation of C code, and even if most of it were optimized for GLSL, it could at best match cgminer's performance, not a Kh/s more.

I considered using a combination of vertex/fragment/geometry shaders together, but this didn't work out (how do you invoke a vertex shader 1M times? With 1M GL_POINTS to be rasterized...).
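
One classic way to get many shader invocations without many vertices is to draw a single full-screen quad and let the rasterizer run the fragment shader once per pixel, deriving the nonce from the pixel coordinates. Here is a sketch of the index arithmetic in Python. The 1024x1024 framebuffer size, the `base_nonce` parameter and both function names are assumptions, and the shader posted in this message does not do this mapping itself.

```python
WIDTH = 1024   # assumed framebuffer width in pixels
HEIGHT = 1024  # assumed framebuffer height in pixels


def nonce_for_pixel(x, y, base_nonce=0):
    """Map integer pixel coordinates to the nonce that fragment would test."""
    return (base_nonce + y * WIDTH + x) & 0xFFFFFFFF


def pixel_for_nonce(nonce, base_nonce=0):
    """Inverse mapping, e.g. to recover the nonce from a green pixel."""
    offset = (nonce - base_nonce) & 0xFFFFFFFF
    return (offset % WIDTH, offset // WIDTH)


print(WIDTH * HEIGHT)                             # nonces covered by one draw call
print(nonce_for_pixel(WIDTH - 1, HEIGHT - 1))     # last nonce of the first batch
print(pixel_for_nonce(nonce_for_pixel(5, 7, base_nonce=1000), base_nonce=1000))
```

With this layout one draw call covers about 1M nonces, and the host would advance `base_nonce` by `WIDTH * HEIGHT` between calls.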

Btw, I'm posting the code here in case someone is interested.

Code:

#version 130
#pragma optionNV(unroll all)

uint ROTLEFT(in uint a, in int b) { return (a << b) | (a >> (32-b)); }
uint ROTRIGHT(in uint a, in int b) { return (a >> b) | (a << (32-b)); }

uint CH(in uint x,in uint y,in uint z) { return (x & y) ^ (~x & z); }
uint MAJ(in uint x,in uint y,in uint z) { return (x & y) ^ (x & z) ^ (y & z); }
uint EP0(in uint x) { return ROTRIGHT(x,2) ^ ROTRIGHT(x,13) ^ ROTRIGHT(x,22); }
uint EP1(in uint x) { return ROTRIGHT(x,6) ^ ROTRIGHT(x,11) ^ ROTRIGHT(x,25); }
uint SIG0(in uint x) { return ROTRIGHT(x,7) ^ ROTRIGHT(x,18) ^ (x >> 3); }
uint SIG1(in uint x) { return ROTRIGHT(x,17) ^ ROTRIGHT(x,19) ^ (x >> 10); }

// SHA-256 round constants. The U suffix keeps the literals unsigned:
// GLSL 1.30 has no implicit int-to-uint conversion, so strict compilers may reject plain hex here.
uint k[64] = uint[64](
	0x428a2f98U,0x71374491U,0xb5c0fbcfU,0xe9b5dba5U,0x3956c25bU,0x59f111f1U,0x923f82a4U,0xab1c5ed5U,
	0xd807aa98U,0x12835b01U,0x243185beU,0x550c7dc3U,0x72be5d74U,0x80deb1feU,0x9bdc06a7U,0xc19bf174U,
	0xe49b69c1U,0xefbe4786U,0x0fc19dc6U,0x240ca1ccU,0x2de92c6fU,0x4a7484aaU,0x5cb0a9dcU,0x76f988daU,
	0x983e5152U,0xa831c66dU,0xb00327c8U,0xbf597fc7U,0xc6e00bf3U,0xd5a79147U,0x06ca6351U,0x14292967U,
	0x27b70a85U,0x2e1b2138U,0x4d2c6dfcU,0x53380d13U,0x650a7354U,0x766a0abbU,0x81c2c92eU,0x92722c85U,
	0xa2bfe8a1U,0xa81a664bU,0xc24b8b70U,0xc76c51a3U,0xd192e819U,0xd6990624U,0xf40e3585U,0x106aa070U,
	0x19a4c116U,0x1e376c08U,0x2748774cU,0x34b0bcb5U,0x391c0cb3U,0x4ed8aa4aU,0x5b9cca4fU,0x682e6ff3U,
	0x748f82eeU,0x78a5636fU,0x84c87814U,0x8cc70208U,0x90befffaU,0xa4506cebU,0xbef9a3f7U,0xc67178f2U
);

uniform uint midstate[8];
uniform uint text[16];

void main() {

	uint a,b,c,d,e,f,g,h,t1,t2,m[64];
	uint ee,eee,eeee;
	int i;

	a = midstate[0];
	b = midstate[1];
	c = midstate[2];
	d = midstate[3];
	e = midstate[4];
	f = midstate[5];
	g = midstate[6];
	h = midstate[7];

	for (i = 0;  i < 16; i++) m[i] = text[i];

	for (; i < 64; i++) m[i] = SIG1(m[i-2]) + m[i-7] + SIG0(m[i-15]) + m[i-16];

	for (i = 0; i < 64; i++) {
		t1 = h + EP1(e) + CH(e,f,g) + k[i] + m[i];
		t2 = EP0(a) + MAJ(a,b,c);
		h = g;
		g = f;
		f = e;
		e = d + t1;
		d = c;
		c = b;
		b = a;
		a = t1 + t2;
	}

	m[0] = midstate[0] + a;
	m[1] = midstate[1] + b;
	m[2] = midstate[2] + c;
	m[3] = midstate[3] + d;
	m[4] = midstate[4] + e;
	m[5] = midstate[5] + f;
	m[6] = midstate[6] + g;
	m[7] = midstate[7] + h;

	a = 0x6a09e667U;
	b = 0xbb67ae85U;
	c = 0x3c6ef372U;
	d = 0xa54ff53aU;
	e = 0x510e527fU;
	f = 0x9b05688cU;
	g = 0x1f83d9abU;
	h = 0x5be0cd19U;

	m[8]  = 0x80000000U;
	m[9]  = 0x00U;
	m[10] = 0x00U;
	m[11] = 0x00U;
	m[12] = 0x00U;
	m[13] = 0x00U;
	m[14] = 0x00U;
	m[15] = 0x100U;

	for (i = 16; i < 64; i++) m[i] = SIG1(m[i-2]) + m[i-7] + SIG0(m[i-15]) + m[i-16];

	for (i = 0; i < 57; i++) {
		t1 = h + EP1(e) + CH(e,f,g) + k[i] + m[i];
		t2 = EP0(a) + MAJ(a,b,c);
		h = g;
		g = f;
		f = e;
		e = d + t1;
		d = c;
		c = b;
		b = a;
		a = t1 + t2;
	}

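	// Only the last hash word is needed for the check, so the final rounds are
	// cut short: rounds 57..60 compute just the new value of e, and the value
	// from round 60 is what ends up in h after round 63.
	eeee = d + h + EP1(e) + CH(e,f,g) + 0x78a5636fU + m[57];
	eee = c + g + EP1(eeee) + CH(eeee,e,f) + 0x84c87814U + m[58];
	ee = b + f + EP1(eee) + CH(eee,eeee,e) + 0x8cc70208U + m[59];
	h = a + e + EP1(ee) + CH(ee,eee,eeee) + 0x90befffaU + m[60];

	// Green fragment when the last hash word is zero, red otherwise.
	if (0x5be0cd19U + h == 0x00U) {
		gl_FragColor=vec4(0.0,1.0,0.0,1.0);
	} else { gl_FragColor=vec4(1.0,0.0,0.0,1.0); }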
}
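
For anyone who wants to check the shader's arithmetic on the CPU, here is a rough Python reference for the same computation. It is not part of the post above. It computes the double hash of an 80-byte header via a midstate, and also the shortened final rounds that produce only the last hash word. All function names are made up for this sketch, the test header is arbitrary bytes rather than a real block header, and the constants are the ones from the shader.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF
K = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
]
IV = [0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
      0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19]


def rotr(x, n):
    """Rotate a 32-bit value right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def expand(block):
    """Extend 16 message words to the 64-word schedule."""
    m = list(block)
    for i in range(16, 64):
        sig0 = rotr(m[i - 15], 7) ^ rotr(m[i - 15], 18) ^ (m[i - 15] >> 3)
        sig1 = rotr(m[i - 2], 17) ^ rotr(m[i - 2], 19) ^ (m[i - 2] >> 10)
        m.append((sig1 + m[i - 7] + sig0 + m[i - 16]) & MASK)
    return m


def run_rounds(state, m, count):
    """Run the first `count` rounds and return the working variables a..h."""
    a, b, c, d, e, f, g, h = state
    for i in range(count):
        ep1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + ep1 + ch + K[i] + m[i]) & MASK
        ep0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (ep0 + maj) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [a, b, c, d, e, f, g, h]


def compress(state, block):
    """One full SHA-256 compression: 64 rounds plus the feed-forward addition."""
    out = run_rounds(state, expand(block), 64)
    return [(s + o) & MASK for s, o in zip(state, out)]


def second_block(header):
    """Padded one-block message for the second hash (digest of the 80-byte header)."""
    midstate = compress(IV, struct.unpack(">16I", header[:64]))
    tail = list(struct.unpack(">4I", header[64:])) + [0x80000000] + [0] * 10 + [640]
    first = compress(midstate, tail)
    return first + [0x80000000] + [0] * 6 + [0x100]  # 256-bit message length


def sha256d(header):
    """Double SHA-256 of an 80-byte header, as eight 32-bit words."""
    return compress(IV, second_block(header))


def last_word_shortcut(header):
    """Last word of the double hash using 61 rounds instead of 64."""
    e_after_round_60 = run_rounds(IV, expand(second_block(header)), 61)[4]
    return (IV[7] + e_after_round_60) & MASK


header = bytes(range(80))  # arbitrary test input, not a real block header
reference = hashlib.sha256(hashlib.sha256(header).digest()).digest()
print(struct.pack(">8I", *sha256d(header)) == reference)
print(last_word_shortcut(header) == sha256d(header)[7])
```

The first line compares the sketch against `hashlib`. The second checks that stopping after round 60 gives the same last word as the full 64 rounds, which is the trick the shader relies on.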

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (445 BTC pledged)
Post by: teknohog on September 03, 2013, 11:59:39 AM

Quote from: teknohog on May 06, 2011, 09:45:08 AM

Here's another 10 BTC. With the recent USD price of bitcoins, I wouldn't even say "it's not much".

Retracting my offer, as GPUs are now useless for sha256 mining.

Title: Re: [BOUNTY] sha256 shader for Linux OSS video drivers (445 BTC pledged)
Post by: jgarzik on September 03, 2013, 02:14:55 PM

Quote from: teknohog on September 03, 2013, 11:59:39 AM

Quote from: teknohog on May 06, 2011, 09:45:08 AM

Here's another 10 BTC. With the recent USD price of bitcoins, I wouldn't even say "it's not much".

Retracting my offer, as GPUs are now useless for sha256 mining.

Yes, the OP was long ago updated to reflect current bounty status (15 BTC from me).

Bitcoin Forum

Other => CPU/GPU Bitcoin mining hardware => Topic started by: jgarzik on March 18, 2011, 08:41:45 PM