Phoenix - Efficient, fast, modular miner

jedi95 (OP)

Full Member

Offline

Activity: 219
Merit: 120

Re: Phoenix - Efficient, fast, modular miner

January 23, 2012, 08:05:55 AM

#1081

Quote from: Diapolo on January 23, 2012, 06:30:57 AM

Quote from: deepceleron on January 22, 2012, 02:11:05 AM

Just for curiosity I thought I'd see if there's any hashrate to be gained with compiling pyOpenCL in Visual Studio 2010 with Stream SDK 2.6 RC3:

Python 2.7.2
Base-12.1.1.win32-py2.7
numpy-MKL-1.6.1.win32-py2.7-2011-10-29
scipy-0.10.0.win32-py2.7
zope.interface-3.8.0.win32-py2.7
Twisted-11.1.0.win32-py2.7

I compiled boost_1_48_0 multithreaded in msvc-10.0, I now have boost_python-vc100-mt-1_48.dll. Compiled pyopencl-0.92 after doing the manifest tweaks and env variables to get it to work.

Results? Exactly the same 224.00 MHash/s as Phoenix 1.7.4 exe gives me. Yay. Three hours I won't get back... At least my python isn't slower than the exe's python any more.

I did some further tests with 1.7.3 and the latest pyOpenCL, too and have to say I never got a miner idle message nor other problems. So it would be nice if we were able to chose, which version pyOpenCL version we want to download jedi

.

Thanks,
Dia

The bugs that caused the miner to idle were all related to the RPC implementation. Using twisted.web for the RPC backend always caused problems so that's why we re-wrote it to use httplib instead for 1.7.0. The extra delay getting work won't be enough to cause the miner to idle thanks to the work queue. (which is around 10 seconds of stored work on a fast miner like a 5870)

For those who want to use other versions of PyOpenCL I suggest you run Phoenix from source rather than use the compiled binaries. Supporting more than one official binary is not something I want to deal with. If you want a binary with the newer versions you can always compile one yourself.

Phoenix Miner developer

Donations appreciated at:
1PHoenix9j9J3M6v3VQYWeXrHPPjf7y3rU

pooler

Hero Member

Offline

Activity: 838
Merit: 507

Re: Phoenix - Efficient, fast, modular miner

January 24, 2012, 12:01:37 PM

#1082

At least with versions 1.7.2 and 1.7.4, Phoenix doesn't quit when I hit Ctrl-C. I am using Gentoo Linux and GNU screen.
Version 1.6.4, however, doesn't have this problem. I haven't tried versions between 1.6.4 and 1.7.2.

BTC: 15MRTcUweNVJbhTyH5rq9aeSdyigFrskqE · LTC: LTCPooLqTK1SANSNeTR63GbGwabTKEkuS7

allinvain

Legendary

Offline

Activity: 3080
Merit: 1080

Re: Phoenix - Efficient, fast, modular miner

January 24, 2012, 02:31:59 PM

#1083

Quote from: pooler on January 24, 2012, 12:01:37 PM

At least with versions 1.7.2 and 1.7.4, Phoenix doesn't quit when I hit Ctrl-C. I am using Gentoo Linux and GNU screen.
Version 1.6.4, however, doesn't have this problem. I haven't tried versions between 1.6.4 and 1.7.2.

It's funny you mention that cause I actually liked it when it quit if I sent it Ctrl+C. For the longest time that was like sending it a BREAK signal. Now the new way to send it a break signal is Ctrl-\

Binance - where I trade. Funds are SAFU!

TurdHurdur

Full Member

Offline

Activity: 216
Merit: 100

Re: Phoenix - Efficient, fast, modular miner

January 26, 2012, 11:04:16 PM

#1084

While attempting to use p2pool I get this error:

Code:

/usr/local/src/phoenix-1.7.4/WorkQueue.py:80: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if self.lastBlock is not None and (wu.data[4:36] == self.lastBlock):

Any idea why/solution?

jedi95 (OP)

Full Member

Offline

Activity: 219
Merit: 120

Re: Phoenix - Efficient, fast, modular miner

January 26, 2012, 11:34:38 PM

#1085

Quote from: TurdHurdur on January 26, 2012, 11:04:16 PM

While attempting to use p2pool I get this error:

Code:

/usr/local/src/phoenix-1.7.4/WorkQueue.py:80: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if self.lastBlock is not None and (wu.data[4:36] == self.lastBlock):

Any idea why/solution?

Looks like I missed updating the "work from previous block" check when I added X-Work-Identifier. Basically the error occurs under P2Pool because self.lastblock is actually the P2Pool identifier from the previous 'block', and not the bitcoin previous block hash.

I have fixed this in 1.7.5 on GitHub, but I am currently away from my main computer to compile a Windows binary. (will be up later tonight) Let me know if this resolves the issue.

Phoenix Miner developer

Donations appreciated at:
1PHoenix9j9J3M6v3VQYWeXrHPPjf7y3rU

TurdHurdur

Full Member

Offline

Activity: 216
Merit: 100

Re: Phoenix - Efficient, fast, modular miner

January 27, 2012, 01:24:28 AM

#1086

Quote from: jedi95 on January 26, 2012, 11:34:38 PM

Looks like I missed updating the "work from previous block" check when I added X-Work-Identifier. Basically the error occurs under P2Pool because self.lastblock is actually the P2Pool identifier from the previous 'block', and not the bitcoin previous block hash.

I have fixed this in 1.7.5 on GitHub, but I am currently away from my main computer to compile a Windows binary. (will be up later tonight) Let me know if this resolves the issue.

It's working now.

d3m0n1q_733rz

Sr. Member

Offline

Activity: 378
Merit: 250

Re: Phoenix - Efficient, fast, modular miner

January 28, 2012, 10:12:08 AM

#1087

I'm working on some optimizations to the phatk2 kernel, but I hit a small hitch. Working out the kinks now. Grr...

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs. If you need it, we can get it. We have solutions for your computing conundrums. BTC accepted! 12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq

d3m0n1q_733rz

Sr. Member

Offline

Activity: 378
Merit: 250

Re: Phoenix - Efficient, fast, modular miner

January 28, 2012, 01:16:05 PM

#1088

Okay, I think I have the kinds worked out mostly. Give it a try and tell me what you think. It's phak2 with a few simple logic mods. I would try the AMD Kernel Analyzer tool, but it doesn't seem to work right for me. So, I just go off of hash rate and accepts or rejects/errors. So far, no errors. But I'm squeezing out a few more decimal place hashes.

Code:

// This file is taken and modified from the public-domain poclbm project, and
// I have therefore decided to keep it public-domain.


#ifdef VECTORS4
	typedef uint4 u;
#else 
	#ifdef VECTORS
		typedef uint2 u;
	#else
		typedef uint u;
	#endif
#endif

__constant uint K[64] = { 
	0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
	0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
	0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
	0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
	0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
	0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
	0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};

__constant uint ConstW[128] = {
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x80000000U, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000280U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,

0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x80000000U, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000100U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000
};

__constant uint H[8] = { 
	0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
};


#ifdef BITALIGN
	#pragma OPENCL EXTENSION cl_amd_media_ops : enable
	
	#define rot(x, y) amd_bitalign(x, x, (uint)(32 - y))
#else
	#define rot(x, y) rotate(x, (uint)y)
#endif

// This part is not from the stock poclbm kernel. It's part of an optimization
// added in the Phoenix Miner.

// Some AMD devices have the BFI_INT opcode, which behaves exactly like the
// SHA-256 Ch function, but provides it in exactly one instruction. If
// detected, use it for Ch. Otherwise, construct Ch out of simpler logical
// primitives.

#ifdef BFI_INT
	// Well, slight problem... It turns out BFI_INT isn't actually exposed to
	// OpenCL (or CAL IL for that matter) in any way. However, there is 
	// a similar instruction, BYTE_ALIGN_INT, which is exposed to OpenCL via
	// amd_bytealign, takes the same inputs, and provides the same output. 
	// We can use that as a placeholder for BFI_INT and have the application 
	// patch it after compilation.
	
	// This is the BFI_INT function
	#define Ch(x, y, z) amd_bytealign(x,y,z)
	// Ma can also be implemented in terms of BFI_INT...
	#define Ma(z, x, y) amd_bytealign(z^x,y,x)
#else
	#define Ch(x, y, z) bitselect(z,y,x)
	#define Ma(x, y, z) bitselect(x,y,(z^x))
#endif

//Various intermediate calculations for each SHA round
#define s0(n) (S0(Vals[(128 - (n)) % 8]))
#define S0(n) (rot(n, 30u)^rot(n, 19u)^rot(n,10u))

#define s1(n) (S1(Vals[(132 - (n)) % 8]))
#define S1(n) (rot(n, 26u)^rot(n, 21u)^rot(n, 7u))

#define ch(n) Ch(Vals[(132 - (n)) % 8],Vals[(133 - (n)) % 8],Vals[(134 - (n)) % 8])
#define maj(n) Ma(Vals[(129 - (n)) % 8],Vals[(130 - (n)) % 8],Vals[(128 - (n)) % 8])

//t1 calc when W is already calculated
#define t1(n) K[(n) % 64] + Vals[(135 - (n)) % 8] +  W[(n)] + s1(n) + ch(n) 

//t1 calc which calculates W
#define t1W(n) K[(n) % 64] + Vals[(135 - (n)) % 8] +  W(n) + s1(n) + ch(n)

//Used for constant W Values (the compiler optimizes out zeros)
#define t1C(n) (K[(n) % 64] + ConstW[(n)]) + Vals[(135 - (n)) % 8] + s1(n) + ch(n)

//t2 Calc
#define t2(n)  maj(n) + s0(n)

#define rotC(x,n) (x<<n | x >> (32-n))

//W calculation used for SHA round
#define W(n) (W[n] = P4(n) + P3(n) + P2(n) + P1(n))


//Partial W calculations (used for the begining where only some values are nonzero)
#define P1(n) ((rot(W[(n)-2],15u)^rot(W[(n)-2],13u)^((W[(n)-2])>>10U)))
#define P2(n) ((rot(W[(n)-15],25u)^rot(W[(n)-15],14u)^((W[(n)-15])>>3U)))
#define p1(x) ((rot(x,15u)^rot(x,13u)^((x)>>10U)))
#define p2(x) ((rot(x,25u)^rot(x,14u)^((x)>>3U)))
#define P3(n)  W[n-7]
#define P4(n)  W[n-16]

//Partial Calcs for constant W values
#define P1C(n) ((rotC(ConstW[(n)-2],15)^rotC(ConstW[(n)-2],13)^((ConstW[(n)-2])>>10U)))
#define P2C(n) ((rotC(ConstW[(n)-15],25)^rotC(ConstW[(n)-15],14)^((ConstW[(n)-15])>>3U)))
#define P3C(x)  ConstW[x-7]
#define P4C(x)  ConstW[x-16]

//SHA round with built in W calc
#define sharoundW(n) Barrier1(n);  Vals[(131 - (n)) % 8] += t1W(n); Vals[(135 - (n)) % 8] = t1W(n) + t2(n);  

//SHA round without W calc
#define sharound(n)  Barrier2(n); Vals[(131 - (n)) % 8] += t1(n); Vals[(135 - (n)) % 8] = t1(n) + t2(n);

//SHA round for constant W values
#define sharoundC(n)  Barrier3(n); Vals[(131 - (n)) % 8] += t1C(n); Vals[(135 - (n)) % 8] = t1C(n) + t2(n);

//The compiler is stupid... I put this in there only to stop the compiler from (de)optimizing the order

#define Barrier1(n) t1 = t1C((n+1))
#define Barrier2(n) t1 = t1C((n))
#define Barrier3(n) Barrier2(n)

__kernel
//removed this to allow detection of invalid work size
//__attribute__((reqd_work_group_size(WORKSIZE, 1, 1)))
void search(	const uint state0, const uint state1, const uint state2, const uint state3,
						const uint state4, const uint state5, const uint state6, const uint state7,
						const uint B1, const uint C1, const uint D1,
						const uint F1, const uint G1, const uint H1,
						const u base,
						const uint W16, const uint W17,
						const uint PreVal4, const uint PreVal0,
						const uint PreW31, const uint PreW32,
						const uint PreW19, const uint PreW20,
						__global uint * output)
{

	u W[124];
	u Vals[8];

//Dummy Variable to prevent compiler from reordering between rounds
	u t1;
	
	W[16] = W16;
	W[17] = W17;

#ifdef VECTORS4
	//Less dependencies to get both the local id and group id and then add them
	W[3] = base + (uint)(get_local_id(0)) * 4u + (uint)(get_group_id(0)) * (WORKSIZE * 4u);
	uint r = rot(W[3].x,25u)^rot(W[3].x,14u)^((W[3].x)>>3U);
	//Since only the 2 LSB is opposite between the nonces, we can save an instruction by flipping the 4 bits in W18 rather than the 1 bit in W3
	W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U};
#else
	#ifdef VECTORS
		W[3] = base + (uint)(get_local_id(0)) * 2u + (uint)(get_group_id(0)) * (WORKSIZE * 2u);
		uint r = rot(W[3].x,25u)^rot(W[3].x,14u)^((W[3].x)>>3U);
		W[18] = PreW20 + (u){r, r ^ 0x2004000U};
	#else
		W[3] = base + get_local_id(0) + get_group_id(0) * (WORKSIZE);
		u r = PreW20 + rot(W[3],25u)^rot(W[3],14u)^((W[3])>>3U);
		W[18] = PreW20 + r;
	#endif
#endif
	//the order of the W calcs and Rounds is like this because the compiler needs help finding how to order the instructions

	Vals[0]=PreVal0 + W[3];
	Vals[1]=B1;
	Vals[2]=C1;
	Vals[3]=D1;
	Vals[4]=PreVal4 + W[3];	
	Vals[5]=F1;
	Vals[6]=G1;
	Vals[7]=H1;

	sharoundC(4);
	W[19] = PreW19 + W[3];
	sharoundC(5);
	W[20] = P1(20) + P4C(20);
	sharoundC(6);
	W[21] = P1(21);
	sharoundC(7);
	W[22] = P1(22) + P3C(22);
	sharoundC(8);
	W[23] = W[16] + P1(23);
	sharoundC(9);
	W[24] = W[17] + P1(24);
	sharoundC(10);
	W[25] = P1(25) + P3(25);
	W[26] = P1(26) + P3(26);
	sharoundC(11);
	W[27] = P1(27) + P3(27);
	W[28] = P1(28) + P3(28);
	sharoundC(12);
	W[29] = P1(29) + P3(29);
	sharoundC(13);
	W[30] = P1(30) + P3(30) + P2C(30);
	W[31] = PreW31 + P1(31) + P3(31);
	sharoundC(14);
	W[32] = PreW32 + P1(32) + P3(32);
	sharoundC(15);
	sharound(16);
	sharound(17);
	sharound(18);
	sharound(19);
	sharound(20);
	sharound(21);
	sharound(22);
	sharound(23);
	sharound(24);
	sharound(25);
	sharound(26);
	sharound(27);
	sharound(28);
	sharound(29);
	sharound(30);
	sharound(31);
	sharound(32);
	sharoundW(33);
	sharoundW(34);
	sharoundW(35);
	sharoundW(36);
	sharoundW(37);
	sharoundW(38);
	sharoundW(39);
	sharoundW(40);
	sharoundW(41);
	sharoundW(42);
	sharoundW(43);
	sharoundW(44);
	sharoundW(45);
	sharoundW(46);
	sharoundW(47);
	sharoundW(48);
	sharoundW(49);
	sharoundW(50);
	sharoundW(51);
	sharoundW(52);
	sharoundW(53);
	sharoundW(54);
	sharoundW(55);
	sharoundW(56);
	sharoundW(57);
	sharoundW(58);
	sharoundW(59);
	sharoundW(60);
	sharoundW(61);
	sharoundW(62);
	sharoundW(63);

	W[64]=state0+Vals[0];
	W[65]=state1+Vals[1];
	W[66]=state2+Vals[2];
	W[67]=state3+Vals[3];
	W[68]=state4+Vals[4];
	W[69]=state5+Vals[5];
	W[70]=state6+Vals[6];
	W[71]=state7+Vals[7];

	const u Temp = (0xb0edbdd0U + K[0]) +  W[64];

//Vals 5 & 6 are not used but gives us a complete chunk instead of a partial

	Vals[0]=H[0];
	Vals[1]=H[1];
	Vals[2]=H[2];
	Vals[3]=Temp + 0xa54ff53aU;
	Vals[4]=H[4];
	Vals[5]=H[5];
	Vals[6]=H[6];
	Vals[7]=Temp + 0x08909ae5U;



//#define P124(n) P1(n) + P2(n) + P4(n)

	W[80] = P2(80) + P4(80);
	sharound(65);
	W[81] = P1C(81) + P2(81) + P4(81);
	sharound(66);
	W[82] = P1(82) + P2(82) + P4(82);
	sharound(67);
	W[83] = P1(83) + P2(83) + P4(83);
	sharound(68);
	W[84] = P1(84) + P2(84) + P4(84);
	sharound(69);
	W[85] = P1(85) + P2(85) + P4(85);
	sharound(70);
	W[86] = P1(86) + P2(86) + P4(86) + P3C(86);
	sharound(71);
	W[87] = P1(87) + P2C(87) + P3(87) + P4(87);
	sharoundC(72);
	W[88] = P1(88) + P3(88) + P4C(88);
	sharoundC(73);
	W[89] = P1(89) + P3(89);
	sharoundC(74);
	W[90] = P1(90) + P3(90);
	sharoundC(75);
	W[91] = P1(91) + P3(91);
	sharoundC(76);
	W[92] = P1(92) + P3(92);
	sharoundC(77);
	W[93] = P1(93) + P3(93);
	W[94] = P1(94) + P3(94) + P2C(94);
	sharoundC(78);
	W[95] = P1(95) + P2(95) + P3(95) + P4C(95);
	sharoundC(79);
	sharound(80);
	sharound(81);
	sharound(82);
	sharound(83);
	sharound(84);
	sharound(85);
	sharound(86);
	sharound(87);
	sharound(88);
	sharound(89);
	sharound(90);
	sharound(91);
	sharound(92);
	sharound(93);
	sharound(94);
	sharound(95);
	sharoundW(96);
	sharoundW(97);
	sharoundW(98);
	sharoundW(99);
	sharoundW(100);
	sharoundW(101);
	sharoundW(102);
	sharoundW(103);
	sharoundW(104);
	sharoundW(105);
	sharoundW(106);
	sharoundW(107);
	sharoundW(108);
	sharoundW(109);
	sharoundW(110);
	sharoundW(111);
	sharoundW(112);
	sharoundW(113);
	sharoundW(114);
	sharoundW(115);
	sharoundW(116);
	sharoundW(117);
	sharoundW(118);
	sharoundW(119);
	sharoundW(120);
	sharoundW(121);
	sharoundW(122);

	const u precomp = (Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123) + ch(123);

	u v = W[117] + W[108] + Vals[3] + Vals[7] + P2(124) + P1(124) + Ch((precomp),Vals[1],Vals[2]); 
	u g = -(K[60] + H[7]) - S1(precomp);

	uint nonce = 0;
#ifdef VECTORS4
	if (v.x == g.x)
	{
		nonce = W[3].x;
	}
	if (v.y == g.y)
	{
		nonce = W[3].y;
	}
	if (v.z == g.z)
	{
		nonce = W[3].z;
	}
	if (v.w == g.w)
	{
		nonce = W[3].w;
	}
#else
	#ifdef VECTORS
		if (v.x == g.x)
		{
			nonce = W[3].x;
		}
		if (v.y == g.y)
		{
			nonce = W[3].y;
		}
	#else
		if (v == g)
		{
			nonce = W[3];
		}
	#endif
#endif
	if(nonce)
	{
		//Faster to shift the nonce by 2 due to 4-DWORD addressing and does not add more collisions
		output[OUTPUT_SIZE] = nonce;
		output[get_local_id(0)] = nonce;
	}
}

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs. If you need it, we can get it. We have solutions for your computing conundrums. BTC accepted! 12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq

deepceleron

Legendary

Offline

Activity: 1512
Merit: 1025

Re: Phoenix - Efficient, fast, modular miner

January 28, 2012, 03:54:03 PM
Last edit: January 28, 2012, 04:33:36 PM by deepceleron

#1089

Quote from: d3m0n1q_733rz on January 28, 2012, 01:16:05 PM

Okay, I think I have the kinds worked out mostly. Give it a try and tell me what you think. It's phak2 with a few simple logic mods. I would try the AMD Kernel Analyzer tool, but it doesn't seem to work right for me. So, I just go off of hash rate and accepts or rejects/errors. So far, no errors. But I'm squeezing out a few more decimal place hashes.

Here's an online diff of original phatk2 and yours, I'll check it out on my card that I've already done five-sig-fig benchmarking on. The kernel is from Phateus, you might update his kernel thread with your findings.

edit: You've done it - you now have the fastest kernel. 341.19 ⇒ 341.60 on a 5830 using the same parameters and simply using the new kernel. phatk2 at the below settings was the highest possible Mhash/s that could be extracted from the card previously.

Sapphire 5830 -2L @ 1050/380 (benchmarking core speed; peak phatk2 output @ core speed)
System OpenCL 2.5.793.1, driver 11.11, WinXP
kernel parameters: VECTORS AGGRESSION=12 FASTLOOP=False WORKSIZE=256

jedi95 (OP)

Full Member

Offline

Activity: 219
Merit: 120

Re: Phoenix - Efficient, fast, modular miner

January 28, 2012, 06:28:34 PM

#1090

Quote from: d3m0n1q_733rz on January 28, 2012, 01:16:05 PM

Okay, I think I have the kinds worked out mostly. Give it a try and tell me what you think. It's phak2 with a few simple logic mods. I would try the AMD Kernel Analyzer tool, but it doesn't seem to work right for me. So, I just go off of hash rate and accepts or rejects/errors. So far, no errors. But I'm squeezing out a few more decimal place hashes.

Code:

// This file is taken and modified from the public-domain poclbm project, and
// I have therefore decided to keep it public-domain.


#ifdef VECTORS4
	typedef uint4 u;
#else 
	#ifdef VECTORS
		typedef uint2 u;
	#else
		typedef uint u;
	#endif
#endif

__constant uint K[64] = { 
	0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
	0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
	0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
	0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
	0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
	0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
	0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};

__constant uint ConstW[128] = {
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x80000000U, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000280U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,

0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x80000000U, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000100U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000
};

__constant uint H[8] = { 
	0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
};


#ifdef BITALIGN
	#pragma OPENCL EXTENSION cl_amd_media_ops : enable
	
	#define rot(x, y) amd_bitalign(x, x, (uint)(32 - y))
#else
	#define rot(x, y) rotate(x, (uint)y)
#endif

// This part is not from the stock poclbm kernel. It's part of an optimization
// added in the Phoenix Miner.

// Some AMD devices have the BFI_INT opcode, which behaves exactly like the
// SHA-256 Ch function, but provides it in exactly one instruction. If
// detected, use it for Ch. Otherwise, construct Ch out of simpler logical
// primitives.

#ifdef BFI_INT
	// Well, slight problem... It turns out BFI_INT isn't actually exposed to
	// OpenCL (or CAL IL for that matter) in any way. However, there is 
	// a similar instruction, BYTE_ALIGN_INT, which is exposed to OpenCL via
	// amd_bytealign, takes the same inputs, and provides the same output. 
	// We can use that as a placeholder for BFI_INT and have the application 
	// patch it after compilation.
	
	// This is the BFI_INT function
	#define Ch(x, y, z) amd_bytealign(x,y,z)
	// Ma can also be implemented in terms of BFI_INT...
	#define Ma(z, x, y) amd_bytealign(z^x,y,x)
#else
	#define Ch(x, y, z) bitselect(z,y,x)
	#define Ma(x, y, z) bitselect(x,y,(z^x))
#endif

//Various intermediate calculations for each SHA round
#define s0(n) (S0(Vals[(128 - (n)) % 8]))
#define S0(n) (rot(n, 30u)^rot(n, 19u)^rot(n,10u))

#define s1(n) (S1(Vals[(132 - (n)) % 8]))
#define S1(n) (rot(n, 26u)^rot(n, 21u)^rot(n, 7u))

#define ch(n) Ch(Vals[(132 - (n)) % 8],Vals[(133 - (n)) % 8],Vals[(134 - (n)) % 8])
#define maj(n) Ma(Vals[(129 - (n)) % 8],Vals[(130 - (n)) % 8],Vals[(128 - (n)) % 8])

//t1 calc when W is already calculated
#define t1(n) K[(n) % 64] + Vals[(135 - (n)) % 8] +  W[(n)] + s1(n) + ch(n) 

//t1 calc which calculates W
#define t1W(n) K[(n) % 64] + Vals[(135 - (n)) % 8] +  W(n) + s1(n) + ch(n)

//Used for constant W Values (the compiler optimizes out zeros)
#define t1C(n) (K[(n) % 64] + ConstW[(n)]) + Vals[(135 - (n)) % 8] + s1(n) + ch(n)

//t2 Calc
#define t2(n)  maj(n) + s0(n)

#define rotC(x,n) (x<<n | x >> (32-n))

//W calculation used for SHA round
#define W(n) (W[n] = P4(n) + P3(n) + P2(n) + P1(n))


//Partial W calculations (used for the begining where only some values are nonzero)
#define P1(n) ((rot(W[(n)-2],15u)^rot(W[(n)-2],13u)^((W[(n)-2])>>10U)))
#define P2(n) ((rot(W[(n)-15],25u)^rot(W[(n)-15],14u)^((W[(n)-15])>>3U)))
#define p1(x) ((rot(x,15u)^rot(x,13u)^((x)>>10U)))
#define p2(x) ((rot(x,25u)^rot(x,14u)^((x)>>3U)))
#define P3(n)  W[n-7]
#define P4(n)  W[n-16]

//Partial Calcs for constant W values
#define P1C(n) ((rotC(ConstW[(n)-2],15)^rotC(ConstW[(n)-2],13)^((ConstW[(n)-2])>>10U)))
#define P2C(n) ((rotC(ConstW[(n)-15],25)^rotC(ConstW[(n)-15],14)^((ConstW[(n)-15])>>3U)))
#define P3C(x)  ConstW[x-7]
#define P4C(x)  ConstW[x-16]

//SHA round with built in W calc
#define sharoundW(n) Barrier1(n);  Vals[(131 - (n)) % 8] += t1W(n); Vals[(135 - (n)) % 8] = t1W(n) + t2(n);  

//SHA round without W calc
#define sharound(n)  Barrier2(n); Vals[(131 - (n)) % 8] += t1(n); Vals[(135 - (n)) % 8] = t1(n) + t2(n);

//SHA round for constant W values
#define sharoundC(n)  Barrier3(n); Vals[(131 - (n)) % 8] += t1C(n); Vals[(135 - (n)) % 8] = t1C(n) + t2(n);

//The compiler is stupid... I put this in there only to stop the compiler from (de)optimizing the order

#define Barrier1(n) t1 = t1C((n+1))
#define Barrier2(n) t1 = t1C((n))
#define Barrier3(n) Barrier2(n)

__kernel
//removed this to allow detection of invalid work size
//__attribute__((reqd_work_group_size(WORKSIZE, 1, 1)))
void search(	const uint state0, const uint state1, const uint state2, const uint state3,
						const uint state4, const uint state5, const uint state6, const uint state7,
						const uint B1, const uint C1, const uint D1,
						const uint F1, const uint G1, const uint H1,
						const u base,
						const uint W16, const uint W17,
						const uint PreVal4, const uint PreVal0,
						const uint PreW31, const uint PreW32,
						const uint PreW19, const uint PreW20,
						__global uint * output)
{

	u W[124];
	u Vals[8];

//Dummy Variable to prevent compiler from reordering between rounds
	u t1;
	
	W[16] = W16;
	W[17] = W17;

#ifdef VECTORS4
	//Less dependencies to get both the local id and group id and then add them
	W[3] = base + (uint)(get_local_id(0)) * 4u + (uint)(get_group_id(0)) * (WORKSIZE * 4u);
	uint r = rot(W[3].x,25u)^rot(W[3].x,14u)^((W[3].x)>>3U);
	//Since only the 2 LSB is opposite between the nonces, we can save an instruction by flipping the 4 bits in W18 rather than the 1 bit in W3
	W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U};
#else
	#ifdef VECTORS
		W[3] = base + (uint)(get_local_id(0)) * 2u + (uint)(get_group_id(0)) * (WORKSIZE * 2u);
		uint r = rot(W[3].x,25u)^rot(W[3].x,14u)^((W[3].x)>>3U);
		W[18] = PreW20 + (u){r, r ^ 0x2004000U};
	#else
		W[3] = base + get_local_id(0) + get_group_id(0) * (WORKSIZE);
		u r = PreW20 + rot(W[3],25u)^rot(W[3],14u)^((W[3])>>3U);
		W[18] = PreW20 + r;
	#endif
#endif
	//the order of the W calcs and Rounds is like this because the compiler needs help finding how to order the instructions

	Vals[0]=PreVal0 + W[3];
	Vals[1]=B1;
	Vals[2]=C1;
	Vals[3]=D1;
	Vals[4]=PreVal4 + W[3];	
	Vals[5]=F1;
	Vals[6]=G1;
	Vals[7]=H1;

	sharoundC(4);
	W[19] = PreW19 + W[3];
	sharoundC(5);
	W[20] = P1(20) + P4C(20);
	sharoundC(6);
	W[21] = P1(21);
	sharoundC(7);
	W[22] = P1(22) + P3C(22);
	sharoundC(8);
	W[23] = W[16] + P1(23);
	sharoundC(9);
	W[24] = W[17] + P1(24);
	sharoundC(10);
	W[25] = P1(25) + P3(25);
	W[26] = P1(26) + P3(26);
	sharoundC(11);
	W[27] = P1(27) + P3(27);
	W[28] = P1(28) + P3(28);
	sharoundC(12);
	W[29] = P1(29) + P3(29);
	sharoundC(13);
	W[30] = P1(30) + P3(30) + P2C(30);
	W[31] = PreW31 + P1(31) + P3(31);
	sharoundC(14);
	W[32] = PreW32 + P1(32) + P3(32);
	sharoundC(15);
	sharound(16);
	sharound(17);
	sharound(18);
	sharound(19);
	sharound(20);
	sharound(21);
	sharound(22);
	sharound(23);
	sharound(24);
	sharound(25);
	sharound(26);
	sharound(27);
	sharound(28);
	sharound(29);
	sharound(30);
	sharound(31);
	sharound(32);
	sharoundW(33);
	sharoundW(34);
	sharoundW(35);
	sharoundW(36);
	sharoundW(37);
	sharoundW(38);
	sharoundW(39);
	sharoundW(40);
	sharoundW(41);
	sharoundW(42);
	sharoundW(43);
	sharoundW(44);
	sharoundW(45);
	sharoundW(46);
	sharoundW(47);
	sharoundW(48);
	sharoundW(49);
	sharoundW(50);
	sharoundW(51);
	sharoundW(52);
	sharoundW(53);
	sharoundW(54);
	sharoundW(55);
	sharoundW(56);
	sharoundW(57);
	sharoundW(58);
	sharoundW(59);
	sharoundW(60);
	sharoundW(61);
	sharoundW(62);
	sharoundW(63);

	W[64]=state0+Vals[0];
	W[65]=state1+Vals[1];
	W[66]=state2+Vals[2];
	W[67]=state3+Vals[3];
	W[68]=state4+Vals[4];
	W[69]=state5+Vals[5];
	W[70]=state6+Vals[6];
	W[71]=state7+Vals[7];

	const u Temp = (0xb0edbdd0U + K[0]) +  W[64];

//Vals 5 & 6 are not used but gives us a complete chunk instead of a partial

	Vals[0]=H[0];
	Vals[1]=H[1];
	Vals[2]=H[2];
	Vals[3]=Temp + 0xa54ff53aU;
	Vals[4]=H[4];
	Vals[5]=H[5];
	Vals[6]=H[6];
	Vals[7]=Temp + 0x08909ae5U;



//#define P124(n) P1(n) + P2(n) + P4(n)

	W[80] = P2(80) + P4(80);
	sharound(65);
	W[81] = P1C(81) + P2(81) + P4(81);
	sharound(66);
	W[82] = P1(82) + P2(82) + P4(82);
	sharound(67);
	W[83] = P1(83) + P2(83) + P4(83);
	sharound(68);
	W[84] = P1(84) + P2(84) + P4(84);
	sharound(69);
	W[85] = P1(85) + P2(85) + P4(85);
	sharound(70);
	W[86] = P1(86) + P2(86) + P4(86) + P3C(86);
	sharound(71);
	W[87] = P1(87) + P2C(87) + P3(87) + P4(87);
	sharoundC(72);
	W[88] = P1(88) + P3(88) + P4C(88);
	sharoundC(73);
	W[89] = P1(89) + P3(89);
	sharoundC(74);
	W[90] = P1(90) + P3(90);
	sharoundC(75);
	W[91] = P1(91) + P3(91);
	sharoundC(76);
	W[92] = P1(92) + P3(92);
	sharoundC(77);
	W[93] = P1(93) + P3(93);
	W[94] = P1(94) + P3(94) + P2C(94);
	sharoundC(78);
	W[95] = P1(95) + P2(95) + P3(95) + P4C(95);
	sharoundC(79);
	sharound(80);
	sharound(81);
	sharound(82);
	sharound(83);
	sharound(84);
	sharound(85);
	sharound(86);
	sharound(87);
	sharound(88);
	sharound(89);
	sharound(90);
	sharound(91);
	sharound(92);
	sharound(93);
	sharound(94);
	sharound(95);
	sharoundW(96);
	sharoundW(97);
	sharoundW(98);
	sharoundW(99);
	sharoundW(100);
	sharoundW(101);
	sharoundW(102);
	sharoundW(103);
	sharoundW(104);
	sharoundW(105);
	sharoundW(106);
	sharoundW(107);
	sharoundW(108);
	sharoundW(109);
	sharoundW(110);
	sharoundW(111);
	sharoundW(112);
	sharoundW(113);
	sharoundW(114);
	sharoundW(115);
	sharoundW(116);
	sharoundW(117);
	sharoundW(118);
	sharoundW(119);
	sharoundW(120);
	sharoundW(121);
	sharoundW(122);

	const u precomp = (Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123) + ch(123);

	u v = W[117] + W[108] + Vals[3] + Vals[7] + P2(124) + P1(124) + Ch((precomp),Vals[1],Vals[2]); 
	u g = -(K[60] + H[7]) - S1(precomp);

	uint nonce = 0;
#ifdef VECTORS4
	if (v.x == g.x)
	{
		nonce = W[3].x;
	}
	if (v.y == g.y)
	{
		nonce = W[3].y;
	}
	if (v.z == g.z)
	{
		nonce = W[3].z;
	}
	if (v.w == g.w)
	{
		nonce = W[3].w;
	}
#else
	#ifdef VECTORS
		if (v.x == g.x)
		{
			nonce = W[3].x;
		}
		if (v.y == g.y)
		{
			nonce = W[3].y;
		}
	#else
		if (v == g)
		{
			nonce = W[3];
		}
	#endif
#endif
	if(nonce)
	{
		//Faster to shift the nonce by 2 due to 4-DWORD addressing and does not add more collisions
		output[OUTPUT_SIZE] = nonce;
		output[get_local_id(0)] = nonce;
	}
}

Nice work! This kernel still uses 1356 ALU OPs (same as current phatk2) but it uses 1 less GPR (22 vs 23).

However, the above is only true for VLIW5. On VLIW4 (69xx) it uses 2 more GPRs (25 vs 23)

Phoenix Miner developer

Donations appreciated at:
1PHoenix9j9J3M6v3VQYWeXrHPPjf7y3rU

d3m0n1q_733rz

Sr. Member

Offline

Activity: 378
Merit: 250

Re: Phoenix - Efficient, fast, modular miner

January 28, 2012, 11:01:51 PM
Last edit: January 28, 2012, 11:16:20 PM by d3m0n1q_733rz

#1091

Quote from: jedi95 on January 28, 2012, 06:28:34 PM

Quote from: d3m0n1q_733rz on January 28, 2012, 01:16:05 PM

Okay, I think I have the kinds worked out mostly. Give it a try and tell me what you think. It's phak2 with a few simple logic mods. I would try the AMD Kernel Analyzer tool, but it doesn't seem to work right for me. So, I just go off of hash rate and accepts or rejects/errors. So far, no errors. But I'm squeezing out a few more decimal place hashes.

Code:

// This file is taken and modified from the public-domain poclbm project, and
// I have therefore decided to keep it public-domain.


#ifdef VECTORS4
	typedef uint4 u;
#else 
	#ifdef VECTORS
		typedef uint2 u;
	#else
		typedef uint u;
	#endif
#endif

__constant uint K[64] = { 
	0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
	0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
	0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
	0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
	0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
	0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
	0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};

__constant uint ConstW[128] = {
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x80000000U, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000280U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,

0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x80000000U, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000100U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000
};

__constant uint H[8] = { 
	0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
};


#ifdef BITALIGN
	#pragma OPENCL EXTENSION cl_amd_media_ops : enable
	
	#define rot(x, y) amd_bitalign(x, x, (uint)(32 - y))
#else
	#define rot(x, y) rotate(x, (uint)y)
#endif

// This part is not from the stock poclbm kernel. It's part of an optimization
// added in the Phoenix Miner.

// Some AMD devices have the BFI_INT opcode, which behaves exactly like the
// SHA-256 Ch function, but provides it in exactly one instruction. If
// detected, use it for Ch. Otherwise, construct Ch out of simpler logical
// primitives.

#ifdef BFI_INT
	// Well, slight problem... It turns out BFI_INT isn't actually exposed to
	// OpenCL (or CAL IL for that matter) in any way. However, there is 
	// a similar instruction, BYTE_ALIGN_INT, which is exposed to OpenCL via
	// amd_bytealign, takes the same inputs, and provides the same output. 
	// We can use that as a placeholder for BFI_INT and have the application 
	// patch it after compilation.
	
	// This is the BFI_INT function
	#define Ch(x, y, z) amd_bytealign(x,y,z)
	// Ma can also be implemented in terms of BFI_INT...
	#define Ma(z, x, y) amd_bytealign(z^x,y,x)
#else
	#define Ch(x, y, z) bitselect(z,y,x)
	#define Ma(x, y, z) bitselect(x,y,(z^x))
#endif

//Various intermediate calculations for each SHA round
#define s0(n) (S0(Vals[(128 - (n)) % 8]))
#define S0(n) (rot(n, 30u)^rot(n, 19u)^rot(n,10u))

#define s1(n) (S1(Vals[(132 - (n)) % 8]))
#define S1(n) (rot(n, 26u)^rot(n, 21u)^rot(n, 7u))

#define ch(n) Ch(Vals[(132 - (n)) % 8],Vals[(133 - (n)) % 8],Vals[(134 - (n)) % 8])
#define maj(n) Ma(Vals[(129 - (n)) % 8],Vals[(130 - (n)) % 8],Vals[(128 - (n)) % 8])

//t1 calc when W is already calculated
#define t1(n) K[(n) % 64] + Vals[(135 - (n)) % 8] +  W[(n)] + s1(n) + ch(n) 

//t1 calc which calculates W
#define t1W(n) K[(n) % 64] + Vals[(135 - (n)) % 8] +  W(n) + s1(n) + ch(n)

//Used for constant W Values (the compiler optimizes out zeros)
#define t1C(n) (K[(n) % 64] + ConstW[(n)]) + Vals[(135 - (n)) % 8] + s1(n) + ch(n)

//t2 Calc
#define t2(n)  maj(n) + s0(n)

#define rotC(x,n) (x<<n | x >> (32-n))

//W calculation used for SHA round
#define W(n) (W[n] = P4(n) + P3(n) + P2(n) + P1(n))


//Partial W calculations (used for the begining where only some values are nonzero)
#define P1(n) ((rot(W[(n)-2],15u)^rot(W[(n)-2],13u)^((W[(n)-2])>>10U)))
#define P2(n) ((rot(W[(n)-15],25u)^rot(W[(n)-15],14u)^((W[(n)-15])>>3U)))
#define p1(x) ((rot(x,15u)^rot(x,13u)^((x)>>10U)))
#define p2(x) ((rot(x,25u)^rot(x,14u)^((x)>>3U)))
#define P3(n)  W[n-7]
#define P4(n)  W[n-16]

//Partial Calcs for constant W values
#define P1C(n) ((rotC(ConstW[(n)-2],15)^rotC(ConstW[(n)-2],13)^((ConstW[(n)-2])>>10U)))
#define P2C(n) ((rotC(ConstW[(n)-15],25)^rotC(ConstW[(n)-15],14)^((ConstW[(n)-15])>>3U)))
#define P3C(x)  ConstW[x-7]
#define P4C(x)  ConstW[x-16]

//SHA round with built in W calc
#define sharoundW(n) Barrier1(n);  Vals[(131 - (n)) % 8] += t1W(n); Vals[(135 - (n)) % 8] = t1W(n) + t2(n);  

//SHA round without W calc
#define sharound(n)  Barrier2(n); Vals[(131 - (n)) % 8] += t1(n); Vals[(135 - (n)) % 8] = t1(n) + t2(n);

//SHA round for constant W values
#define sharoundC(n)  Barrier3(n); Vals[(131 - (n)) % 8] += t1C(n); Vals[(135 - (n)) % 8] = t1C(n) + t2(n);

//The compiler is stupid... I put this in there only to stop the compiler from (de)optimizing the order

#define Barrier1(n) t1 = t1C((n+1))
#define Barrier2(n) t1 = t1C((n))
#define Barrier3(n) Barrier2(n)

__kernel
//removed this to allow detection of invalid work size
//__attribute__((reqd_work_group_size(WORKSIZE, 1, 1)))
void search(	const uint state0, const uint state1, const uint state2, const uint state3,
						const uint state4, const uint state5, const uint state6, const uint state7,
						const uint B1, const uint C1, const uint D1,
						const uint F1, const uint G1, const uint H1,
						const u base,
						const uint W16, const uint W17,
						const uint PreVal4, const uint PreVal0,
						const uint PreW31, const uint PreW32,
						const uint PreW19, const uint PreW20,
						__global uint * output)
{

	u W[124];
	u Vals[8];

//Dummy Variable to prevent compiler from reordering between rounds
	u t1;
	
	W[16] = W16;
	W[17] = W17;

#ifdef VECTORS4
	//Less dependencies to get both the local id and group id and then add them
	W[3] = base + (uint)(get_local_id(0)) * 4u + (uint)(get_group_id(0)) * (WORKSIZE * 4u);
	uint r = rot(W[3].x,25u)^rot(W[3].x,14u)^((W[3].x)>>3U);
	//Since only the 2 LSB is opposite between the nonces, we can save an instruction by flipping the 4 bits in W18 rather than the 1 bit in W3
	W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U};
#else
	#ifdef VECTORS
		W[3] = base + (uint)(get_local_id(0)) * 2u + (uint)(get_group_id(0)) * (WORKSIZE * 2u);
		uint r = rot(W[3].x,25u)^rot(W[3].x,14u)^((W[3].x)>>3U);
		W[18] = PreW20 + (u){r, r ^ 0x2004000U};
	#else
		W[3] = base + get_local_id(0) + get_group_id(0) * (WORKSIZE);
		u r = PreW20 + rot(W[3],25u)^rot(W[3],14u)^((W[3])>>3U);
		W[18] = PreW20 + r;
	#endif
#endif
	//the order of the W calcs and Rounds is like this because the compiler needs help finding how to order the instructions

	Vals[0]=PreVal0 + W[3];
	Vals[1]=B1;
	Vals[2]=C1;
	Vals[3]=D1;
	Vals[4]=PreVal4 + W[3];	
	Vals[5]=F1;
	Vals[6]=G1;
	Vals[7]=H1;

	sharoundC(4);
	W[19] = PreW19 + W[3];
	sharoundC(5);
	W[20] = P1(20) + P4C(20);
	sharoundC(6);
	W[21] = P1(21);
	sharoundC(7);
	W[22] = P1(22) + P3C(22);
	sharoundC(8);
	W[23] = W[16] + P1(23);
	sharoundC(9);
	W[24] = W[17] + P1(24);
	sharoundC(10);
	W[25] = P1(25) + P3(25);
	W[26] = P1(26) + P3(26);
	sharoundC(11);
	W[27] = P1(27) + P3(27);
	W[28] = P1(28) + P3(28);
	sharoundC(12);
	W[29] = P1(29) + P3(29);
	sharoundC(13);
	W[30] = P1(30) + P3(30) + P2C(30);
	W[31] = PreW31 + P1(31) + P3(31);
	sharoundC(14);
	W[32] = PreW32 + P1(32) + P3(32);
	sharoundC(15);
	sharound(16);
	sharound(17);
	sharound(18);
	sharound(19);
	sharound(20);
	sharound(21);
	sharound(22);
	sharound(23);
	sharound(24);
	sharound(25);
	sharound(26);
	sharound(27);
	sharound(28);
	sharound(29);
	sharound(30);
	sharound(31);
	sharound(32);
	sharoundW(33);
	sharoundW(34);
	sharoundW(35);
	sharoundW(36);
	sharoundW(37);
	sharoundW(38);
	sharoundW(39);
	sharoundW(40);
	sharoundW(41);
	sharoundW(42);
	sharoundW(43);
	sharoundW(44);
	sharoundW(45);
	sharoundW(46);
	sharoundW(47);
	sharoundW(48);
	sharoundW(49);
	sharoundW(50);
	sharoundW(51);
	sharoundW(52);
	sharoundW(53);
	sharoundW(54);
	sharoundW(55);
	sharoundW(56);
	sharoundW(57);
	sharoundW(58);
	sharoundW(59);
	sharoundW(60);
	sharoundW(61);
	sharoundW(62);
	sharoundW(63);

	W[64]=state0+Vals[0];
	W[65]=state1+Vals[1];
	W[66]=state2+Vals[2];
	W[67]=state3+Vals[3];
	W[68]=state4+Vals[4];
	W[69]=state5+Vals[5];
	W[70]=state6+Vals[6];
	W[71]=state7+Vals[7];

	const u Temp = (0xb0edbdd0U + K[0]) +  W[64];

//Vals 5 & 6 are not used but gives us a complete chunk instead of a partial

	Vals[0]=H[0];
	Vals[1]=H[1];
	Vals[2]=H[2];
	Vals[3]=Temp + 0xa54ff53aU;
	Vals[4]=H[4];
	Vals[5]=H[5];
	Vals[6]=H[6];
	Vals[7]=Temp + 0x08909ae5U;



//#define P124(n) P1(n) + P2(n) + P4(n)

	W[80] = P2(80) + P4(80);
	sharound(65);
	W[81] = P1C(81) + P2(81) + P4(81);
	sharound(66);
	W[82] = P1(82) + P2(82) + P4(82);
	sharound(67);
	W[83] = P1(83) + P2(83) + P4(83);
	sharound(68);
	W[84] = P1(84) + P2(84) + P4(84);
	sharound(69);
	W[85] = P1(85) + P2(85) + P4(85);
	sharound(70);
	W[86] = P1(86) + P2(86) + P4(86) + P3C(86);
	sharound(71);
	W[87] = P1(87) + P2C(87) + P3(87) + P4(87);
	sharoundC(72);
	W[88] = P1(88) + P3(88) + P4C(88);
	sharoundC(73);
	W[89] = P1(89) + P3(89);
	sharoundC(74);
	W[90] = P1(90) + P3(90);
	sharoundC(75);
	W[91] = P1(91) + P3(91);
	sharoundC(76);
	W[92] = P1(92) + P3(92);
	sharoundC(77);
	W[93] = P1(93) + P3(93);
	W[94] = P1(94) + P3(94) + P2C(94);
	sharoundC(78);
	W[95] = P1(95) + P2(95) + P3(95) + P4C(95);
	sharoundC(79);
	sharound(80);
	sharound(81);
	sharound(82);
	sharound(83);
	sharound(84);
	sharound(85);
	sharound(86);
	sharound(87);
	sharound(88);
	sharound(89);
	sharound(90);
	sharound(91);
	sharound(92);
	sharound(93);
	sharound(94);
	sharound(95);
	sharoundW(96);
	sharoundW(97);
	sharoundW(98);
	sharoundW(99);
	sharoundW(100);
	sharoundW(101);
	sharoundW(102);
	sharoundW(103);
	sharoundW(104);
	sharoundW(105);
	sharoundW(106);
	sharoundW(107);
	sharoundW(108);
	sharoundW(109);
	sharoundW(110);
	sharoundW(111);
	sharoundW(112);
	sharoundW(113);
	sharoundW(114);
	sharoundW(115);
	sharoundW(116);
	sharoundW(117);
	sharoundW(118);
	sharoundW(119);
	sharoundW(120);
	sharoundW(121);
	sharoundW(122);

	const u precomp = (Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123) + ch(123);

	u v = W[117] + W[108] + Vals[3] + Vals[7] + P2(124) + P1(124) + Ch((precomp),Vals[1],Vals[2]); 
	u g = -(K[60] + H[7]) - S1(precomp);

	uint nonce = 0;
#ifdef VECTORS4
	if (v.x == g.x)
	{
		nonce = W[3].x;
	}
	if (v.y == g.y)
	{
		nonce = W[3].y;
	}
	if (v.z == g.z)
	{
		nonce = W[3].z;
	}
	if (v.w == g.w)
	{
		nonce = W[3].w;
	}
#else
	#ifdef VECTORS
		if (v.x == g.x)
		{
			nonce = W[3].x;
		}
		if (v.y == g.y)
		{
			nonce = W[3].y;
		}
	#else
		if (v == g)
		{
			nonce = W[3];
		}
	#endif
#endif
	if(nonce)
	{
		//Faster to shift the nonce by 2 due to 4-DWORD addressing and does not add more collisions
		output[OUTPUT_SIZE] = nonce;
		output[get_local_id(0)] = nonce;
	}
}

Nice work! This kernel still uses 1356 ALU OPs (same as current phatk2) but it uses 1 less GPR (22 vs 23).

However, the above is only true for VLIW5. On VLIW4 (69xx) it uses 2 more GPRs (25 vs 23)

Any idea where the extra GPRs are coming from on VLIW4? Also, we could just comment out Barrier3 completely and change it to Barrier2 instead since they're the same thing. We don't really need the extra #define. What are we looking at as far as cycles are concerned? And any chance of enabling -cl-fast-relaxed-math if it's not already and having it run properly?

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs. If you need it, we can get it. We have solutions for your computing conundrums. BTC accepted! 12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq

Diapolo

Hero Member

Offline

Activity: 769
Merit: 500

Re: Phoenix - Efficient, fast, modular miner

January 29, 2012, 10:21:17 AM
Last edit: January 29, 2012, 04:15:06 PM by Diapolo

#1092

The fast-math thing is only for floating point operations, should be pointless.

By the way, if you guys like to play around with new kernels you could have a look here: https://bitcointalk.org/index.php?topic=61406

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo

d3m0n1q_733rz

Sr. Member

Offline

Activity: 378
Merit: 250

Re: Phoenix - Efficient, fast, modular miner

January 29, 2012, 08:15:02 PM

#1093

Quote from: Diapolo on January 29, 2012, 10:21:17 AM

The fast-math thing is only for floating point operations, should be pointless.

By the way, if you guys like to play around with new kernels you could have a look here: https://bitcointalk.org/index.php?topic=61406

Dia

Ironically, I was going to be asking when the heck a GCN miner was going to come out. Granted, I don't have the hardware and I doubt many people will at this point. But, I'll see what I can accomplish with the HD5450. And thanks for the info on floating point ops. I'll see if I can "organize" the code if it isn't already and start editing things out. Granted, I can't find any decent free resources for OpenCL operations and accepted code, but I'm looking. It's not like assembly where I can just go to a reference guide and "Oh! That's what I'm looking for!"
Any resources would be helpful if you can provide me with them. I'm still at novice level, but I'm trying. I get confused on how to follow the rotations, xors, etc. I just use logic to edit out already used variables or cut down on the amount of work the GPU has to do and let you handle the bigger math until I figure out a way to remember it. : /

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs. If you need it, we can get it. We have solutions for your computing conundrums. BTC accepted! 12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq

d3m0n1q_733rz

Sr. Member

Offline

Activity: 378
Merit: 250

Re: Phoenix - Efficient, fast, modular miner

January 30, 2012, 07:04:51 AM

#1094

So umm...I was looking at my code and it seems that it's actually not faster. : /
If there was some way to combine some of the writes (around the Vals), then I think it could be sped up a little. But right now, I don't think I'm smarter than the compiler. I've been working with assembly too long and I find that the rules are backward. So thanks for taking the time.

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs. If you need it, we can get it. We have solutions for your computing conundrums. BTC accepted! 12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq

d3m0n1q_733rz

Sr. Member

Offline

Activity: 378
Merit: 250

Re: Phoenix - Efficient, fast, modular miner

February 02, 2012, 07:18:28 AM

#1095

Code:

// This file is taken and modified from the public-domain poclbm project, and
// I have therefore decided to keep it public-domain.
#ifdef VECTORS8 
	typedef uint8 u;
#else
#ifdef VECTORS4
	typedef uint4 u;
#else 
	#ifdef VECTORS
		typedef uint2 u;
	#else
		typedef uint u;
	#endif
#endif
#endif

__constant uint K[64] = { 
	0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
	0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
	0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
	0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
	0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
	0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
	0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};

__constant uint ConstW[128] = {
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x80000000U, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000280U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,

0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x80000000U, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000100U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000
};

__constant uint H[8] = { 
	0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
};


#ifdef BITALIGN
	#pragma OPENCL EXTENSION cl_amd_media_ops : enable
	#define rot(x, y) amd_bitalign(x, x, (uint)(32 - y))
#else
	#define rot(x, y) rotate(x, (uint)y)
#endif

// This part is not from the stock poclbm kernel. It's part of an optimization
// added in the Phoenix Miner.

// Some AMD devices have the BFI_INT opcode, which behaves exactly like the
// SHA-256 Ch function, but provides it in exactly one instruction. If
// detected, use it for Ch. Otherwise, construct Ch out of simpler logical
// primitives.

#ifdef BFI_INT
	// Well, slight problem... It turns out BFI_INT isn't actually exposed to
	// OpenCL (or CAL IL for that matter) in any way. However, there is 
	// a similar instruction, BYTE_ALIGN_INT, which is exposed to OpenCL via
	// amd_bytealign, takes the same inputs, and provides the same output. 
	// We can use that as a placeholder for BFI_INT and have the application 
	// patch it after compilation.
	
	// This is the BFI_INT function
	#define Ch(x, y, z) amd_bytealign(x,y,z)
	// Ma can also be implemented in terms of BFI_INT...
	#define Ma(z, x, y) amd_bytealign(z^x,y,x)
#else
	#define Ch(x, y, z) bitselect(z,y,x)
	#define Ma(x, y, z) bitselect(x,y,(z^x))
#endif

//Various intermediate calculations for each SHA round
#define s0(n) (S0(Vals[(0 + 128 - (n)) % 8]))
#define S0(n) (rot(n, 30u)^rot(n, 19u)^rot(n,10u))

#define s1(n) (S1(Vals[(4 + 128 - (n)) % 8]))
#define S1(n) (rot(n, 26u)^rot(n, 21u)^rot(n, 7u))

#define ch(n) Ch(Vals[(4 + 128 - (n)) % 8],Vals[(5 + 128 - (n)) % 8],Vals[(6 + 128 - (n)) % 8])
#define maj(n) Ma(Vals[(1 + 128 - (n)) % 8],Vals[(2 + 128 - (n)) % 8],Vals[(0 + 128 - (n)) % 8])

//t1 calc when W is already calculated
#define t1(n) K[(n) % 64] + Vals[(7 + 128 - (n)) % 8] +  W[(n)] + s1(n) + ch(n) 

//t1 calc which calculates W
#define t1W(n) K[(n) % 64] + Vals[(7 + 128 - (n)) % 8] +  W(n) + s1(n) + ch(n)

//Used for constant W Values (the compiler optimizes out zeros)
#define t1C(n) (K[(n) % 64]+ ConstW[(n)]) + Vals[(7 + 128 - (n)) % 8] + s1(n) + ch(n)

//t2 Calc
#define t2(n)  maj(n) + s0(n)

#define rotC(x,n) (x<<n | x >> (32-n))

//W calculation used for SHA round
#define W(n) (W[n] = P4(n) + P3(n) + P2(n) + P1(n))


//Partial W calculations (used for the begining where only some values are nonzero)
#define P1(n) ((rot(W[(n)-2],15u)^rot(W[(n)-2],13u)^((W[(n)-2])>>10U)))
#define P2(n) ((rot(W[(n)-15],25u)^rot(W[(n)-15],14u)^((W[(n)-15])>>3U)))
#define p1(x) ((rot(x,15u)^rot(x,13u)^((x)>>10U)))
#define p2(x) ((rot(x,25u)^rot(x,14u)^((x)>>3U)))
#define P3(n)  W[n-7]
#define P4(n)  W[n-16]

//Partial Calcs for constant W values
#define P1C(n) ((rotC(ConstW[(n)-2],15)^rotC(ConstW[(n)-2],13)^((ConstW[(n)-2])>>10U)))
#define P2C(n) ((rotC(ConstW[(n)-15],25)^rotC(ConstW[(n)-15],14)^((ConstW[(n)-15])>>3U)))
#define P3C(x)  ConstW[x-7]
#define P4C(x)  ConstW[x-16]

//SHA round with built in W calc
#define sharoundW(n) Barrier1(n);  Vals[(3 + 128 - (n)) % 8] += t1W(n); Vals[(7 + 128 - (n)) % 8] = t1W(n) + t2(n);  

//SHA round without W calc
#define sharound(n)  Barrier2(n); Vals[(3 + 128 - (n)) % 8] += t1(n); Vals[(7 + 128 - (n)) % 8] = t1(n) + t2(n);

//SHA round for constant W values
#define sharoundC(n)  Barrier2(n); Vals[(3 + 128 - (n)) % 8] += t1C(n); Vals[(7 + 128 - (n)) % 8] = t1C(n) + t2(n);

//The compiler is stupid... I put this in there only to stop the compiler from (de)optimizing the order
#define Barrier1(n) t1 = t1C((n+1))
#define Barrier2(n) t1 = t1C((n))
//#define Barrier3(n) Barrier2(n)

__kernel
//removed this to allow detection of invalid work size
//__attribute__((reqd_work_group_size(WORKSIZE, 1, 1)))
void search(	const uint state0, const uint state1, const uint state2, const uint state3,
						const uint state4, const uint state5, const uint state6, const uint state7,
						const uint B1, const uint C1, const uint D1,
						const uint F1, const uint G1, const uint H1,
						const u base,
						const uint W16, const uint W17,
						const uint PreVal4, const uint PreVal0,
						const uint PreW31, const uint PreW32,
						const uint PreW19, const uint PreW20,
						__global uint * output)
{

	u W[124];
	u Vals[8];

//Dummy Variable to prevent compiler from reordering between rounds
	u t1;
	
	//Vals[0]=state0;
	Vals[1]=B1;
	Vals[2]=C1;
	Vals[3]=D1;
	//Vals[4]=PreVal4;
	Vals[5]=F1;
	Vals[6]=G1;
	Vals[7]=H1;

	W[16] = W16;
	W[17] = W17;

#ifdef VECTORS8
	W[3] = base + (uint)(get_local_id(0)) * 8u + (uint)(get_group_id(0)) * (WORKSIZE * 8u);
	uint r = rot(W[3].x,25u)^rot(W[3].x,14u)^((W[3].x)>>3U);
	//Since only the 2 LSB is opposite between the nonces, we can save an instruction by flipping the 4 bits in W18 rather than the 1 bit in W3
	W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U, r ^ 0x8010000U, r ^ 0xA050000U, r ^ 0xC090000U, r ^ 0xE0D0000U};
#else


#ifdef VECTORS4
	//Less dependencies to get both the local id and group id and then add them
	W[3] = base + (uint)(get_local_id(0)) * 4u + (uint)(get_group_id(0)) * (WORKSIZE * 4u);
	uint r = rot(W[3].x,25u)^rot(W[3].x,14u)^((W[3].x)>>3U);
	//Since only the 2 LSB is opposite between the nonces, we can save an instruction by flipping the 4 bits in W18 rather than the 1 bit in W3
	W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U};
#else
	#ifdef VECTORS
		W[3] = base + (uint)(get_local_id(0)) * 2u + (uint)(get_group_id(0)) * (WORKSIZE * 2u);
		uint r = rot(W[3].x,25u)^rot(W[3].x,14u)^((W[3].x)>>3U);
		W[18] = PreW20 + (u){r, r ^ 0x2004000U};
	#else
		W[3] = base + get_local_id(0) + get_group_id(0) * (WORKSIZE);
		u r = rot(W[3],25u)^rot(W[3],14u)^((W[3])>>3U);
		W[18] = PreW20 + r;
	#endif
#endif
#endif

	//the order of the W calcs and Rounds is like this because the compiler needs help finding how to order the instructions

	Vals[4] = PreVal4 + W[3];
	Vals[0] = PreVal0 + W[3];

	sharoundC(4);
	W[19] = PreW19 + W[3];
	sharoundC(5);
	W[20] = P4C(20) + P1(20);
	sharoundC(6);
	W[21] = P1(21);
	sharoundC(7);
	W[22] = P3C(22) + P1(22);
	sharoundC(8);
	W[23] = W[16] + P1(23);
	sharoundC(9);
	W[24] = W[17] + P1(24);
	sharoundC(10);
	W[25] = P1(25) + P3(25);
	W[26] = P1(26) + P3(26);
	sharoundC(11);
	W[27] = P1(27) + P3(27);
	W[28] = P1(28) + P3(28);
	sharoundC(12);
	W[29] = P1(29) + P3(29);
	sharoundC(13);
	W[30] = P1(30) + P2C(30) + P3(30);
	W[31] = PreW31 + (P1(31) + P3(31));
	sharoundC(14);
	W[32] = PreW32 + (P1(32) + P3(32));
	sharoundC(15);
	sharound(16);
	sharound(17);
	sharound(18);
	sharound(19);
	sharound(20);
	sharound(21);
	sharound(22);
	sharound(23);
	sharound(24);
	sharound(25);
	sharound(26);
	sharound(27);
	sharound(28);
	sharound(29);
	sharound(30);
	sharound(31);
	sharound(32);
	sharoundW(33);
	sharoundW(34);
	sharoundW(35);
	sharoundW(36);
	sharoundW(37);
	sharoundW(38);
	sharoundW(39);
	sharoundW(40);
	sharoundW(41);
	sharoundW(42);
	sharoundW(43);
	sharoundW(44);
	sharoundW(45);
	sharoundW(46);
	sharoundW(47);
	sharoundW(48);
	sharoundW(49);
	sharoundW(50);
	sharoundW(51);
	sharoundW(52);
	sharoundW(53);
	sharoundW(54);
	sharoundW(55);
	sharoundW(56);
	sharoundW(57);
	sharoundW(58);
	sharoundW(59);
	sharoundW(60);
	sharoundW(61);
	sharoundW(62);
	sharoundW(63);

	W[64]=state0+Vals[0];
	W[65]=state1+Vals[1];
	W[66]=state2+Vals[2];
	W[67]=state3+Vals[3];
	W[68]=state4+Vals[4];
	W[69]=state5+Vals[5];
	W[70]=state6+Vals[6];
	W[71]=state7+Vals[7];

//const u Temp = (0xb0edbdd0U + K[0]) +  W[64];

	Vals[0]=H[0];
	Vals[1]=H[1];
	Vals[2]=H[2];
	Vals[3] = 0xa54ff53aU + (0xb0edbdd0U + K[0]) +  W[64];
//Vals[3]=H[3];
	Vals[4]=H[4];
	Vals[5]=H[5];
	Vals[6]=H[6];
	Vals[7] = 0x08909ae5U + (0xb0edbdd0U + K[0]) +  W[64];
//Vals[7]=H[7];

	
#define P124(n) P2(n) + P1(n) + P4(n)

	W[64 + 16] = + P2(64 + 16) + P4(64 + 16);
	sharound(64 + 1);
	W[64 + 17] = P1C(64 + 17) + P2(64 + 17) + P4(64 + 17);
	sharound(64 + 2);
	W[64 + 18] = P124(64 + 18);
	sharound(64 + 3);
	W[64 + 19] = P124(64 + 19);
	sharound(64 + 4);
	W[64 + 20] = P124(64 + 20);
	sharound(64 + 5);
	W[64 + 21] = P124(64 + 21);
	sharound(64 + 6);
	W[64 + 22] = P4(64 + 22) + P3C(64 + 22) + P2(64 + 22) + P1(64 + 22);
	sharound(64 + 7);
	W[64 + 23] = P4(64 + 23) + P3(64 + 23) + P2C(64 + 23) + P1(64 + 23);
	sharoundC(64 + 8);
	W[64 + 24] =   P1(64 + 24) + P4C(64 + 24) + P3(64 + 24);
	sharoundC(64 + 9);
	W[64 + 25] = P3(64 + 25) + P1(64 + 25);
	sharoundC(64 + 10);
	W[64 + 26] = P3(64 + 26) + P1(64 + 26);
	sharoundC(64 + 11);
	W[64 + 27] = P3(64 + 27) + P1(64 + 27);
	sharoundC(64 + 12);
	W[64 + 28] = P3(64 + 28) + P1(64 + 28);
	sharoundC(64 + 13);
	W[64 + 29] = P1(64 + 29) + P3(64 + 29);
	W[64 + 30] = P3(64 + 30) + P2C(64 + 30) + P1(64 + 30);
	sharoundC(64 + 14);
	W[64 + 31] = P4C(64 + 31) + P3(64 + 31) + P2(64 + 31) + P1(64 + 31);
	sharoundC(64 + 15);
	sharound(64 + 16);
	sharound(64 + 17);
	sharound(64 + 18);
	sharound(64 + 19);
	sharound(64 + 20);
	sharound(64 + 21);
	sharound(64 + 22);
	sharound(64 + 23);
	sharound(64 + 24);
	sharound(64 + 25);
	sharound(64 + 26);
	sharound(64 + 27);
	sharound(64 + 28);
	sharound(64 + 29);
	sharound(64 + 30);
	sharound(64 + 31);
	sharoundW(64 + 32);
	sharoundW(64 + 33);
	sharoundW(64 + 34);
	sharoundW(64 + 35);
	sharoundW(64 + 36);
	sharoundW(64 + 37);
	sharoundW(64 + 38);
	sharoundW(64 + 39);
	sharoundW(64 + 40);
	sharoundW(64 + 41);
	sharoundW(64 + 42);
	sharoundW(64 + 43);
	sharoundW(64 + 44);
	sharoundW(64 + 45);
	sharoundW(64 + 46);
	sharoundW(64 + 47);
	sharoundW(64 + 48);
	sharoundW(64 + 49);
	sharoundW(64 + 50);
	sharoundW(64 + 51);
	sharoundW(64 + 52);
	sharoundW(64 + 53);
	sharoundW(64 + 54);
	sharoundW(64 + 55);
	sharoundW(64 + 56);
	sharoundW(64 + 57);
	sharoundW(64 + 58);

	
	u g = -(K[60] + H[7]) - S1((Vals[0] + Vals[4]) + (K[59] + W(59+64)) + s1(64+59)+ ch(59+64));
	u v = W[117] + W[108] + Vals[3] + Vals[7] + P2(124) + P1(124) + Ch((Vals[0] + Vals[4]) + (K[59] + W(59+64)) + s1(64+59)+ ch(59+64),Vals[1],Vals[2]); 
	
	uint nonce = 0;

//#ifdef VECTORS8
//		if (v.x == g.x)
//	{
//		nonce = W[3].x;
//	}
//	if (v.y == g.y)
//	{
//		nonce = W[3].y;
//	}
//	if (v.z == g.z)
//	{
//		nonce = W[3].z;
//	}
//	if (v.w == g.w)
//	{
//		nonce = W[3].w;
//	}
#ifdef VECTORS4
	if (v.x == g.x)
	{
		nonce = W[3].x;
	}
	if (v.y == g.y)
	{
		nonce = W[3].y;
	}
	if (v.z == g.z)
	{
		nonce = W[3].z;
	}
	if (v.w == g.w)
	{
		nonce = W[3].w;
	}
#else
	#ifdef VECTORS
		if (v.x == g.x)
		{
			nonce = W[3].x;
		}
		if (v.y == g.y)
		{
			nonce = W[3].y;
		}
	#else
		if (v == g)
		{
			nonce = W[3];
		}
	#endif
#endif
//#endif
	if(nonce)
	{
		//Faster to shift the nonce by 2 due to 4-DWORD addressing and does not add more collisions
		output[OUTPUT_SIZE] = nonce;
		output[get_local_id(0)] = nonce;
	}
}

Would someone mind fixing the nonces for me? I'm modding it for 128-bit and seem to have hit a snag. And, just so you know, uint8 DOES work with a worksize of 32! However, without the nonces finished, I don't know how well. I would try uint16 as it works too with a worksize of 1, but the bitshifting is a bit more involved.
After this, I'm going to have to compute some different constants for a SHA-512/256 kernel. I'll need the fractional parts of the seventeenth through twenty-fourth prime numbers as stated here: http://eprint.iacr.org/2010/548.pdf
From that point, I'll attempt a 64-bit, 128-bit and 256-bit version of the SHA-512 algorithm (in one) and then truncate the results to SHA-256. ^_^
The problem is, I'm no math genius so I'm going to need loads of help with this little project.

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs. If you need it, we can get it. We have solutions for your computing conundrums. BTC accepted! 12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq

jedi95 (OP)

Full Member

Offline

Activity: 219
Merit: 120

Re: Phoenix - Efficient, fast, modular miner

February 02, 2012, 04:52:37 PM

#1096

Quote from: d3m0n1q_733rz on February 02, 2012, 07:18:28 AM

Would someone mind fixing the nonces for me? I'm modding it for 128-bit and seem to have hit a snag. And, just so you know, uint8 DOES work with a worksize of 32! However, without the nonces finished, I don't know how well. I would try uint16 as it works too with a worksize of 1, but the bitshifting is a bit more involved.
After this, I'm going to have to compute some different constants for a SHA-512/256 kernel. I'll need the fractional parts of the seventeenth through twenty-fourth prime numbers as stated here: http://eprint.iacr.org/2010/548.pdf
From that point, I'll attempt a 64-bit, 128-bit and 256-bit version of the SHA-512 algorithm (in one) and then truncate the results to SHA-256. ^_^
The problem is, I'm no math genius so I'm going to need loads of help with this little project.

The problem with uint8 is that you are going to use way more GPRs. Even if you get better efficiency per thread, the higher GPR use it going to reduce the maximum number of threads that can be processed at once. This problem can be minimized with high RAM clocks, but the higher power use might not be worth it.

Also, truncated SHA512 isn't the same as SHA256.

Phoenix Miner developer

Donations appreciated at:
1PHoenix9j9J3M6v3VQYWeXrHPPjf7y3rU

d3m0n1q_733rz

Sr. Member

Offline

Activity: 378
Merit: 250

Re: Phoenix - Efficient, fast, modular miner

February 02, 2012, 06:34:11 PM

#1097

Quote from: jedi95 on February 02, 2012, 04:52:37 PM

Quote from: d3m0n1q_733rz on February 02, 2012, 07:18:28 AM

Would someone mind fixing the nonces for me? I'm modding it for 128-bit and seem to have hit a snag. And, just so you know, uint8 DOES work with a worksize of 32! However, without the nonces finished, I don't know how well. I would try uint16 as it works too with a worksize of 1, but the bitshifting is a bit more involved.
After this, I'm going to have to compute some different constants for a SHA-512/256 kernel. I'll need the fractional parts of the seventeenth through twenty-fourth prime numbers as stated here: http://eprint.iacr.org/2010/548.pdf
From that point, I'll attempt a 64-bit, 128-bit and 256-bit version of the SHA-512 algorithm (in one) and then truncate the results to SHA-256. ^_^
The problem is, I'm no math genius so I'm going to need loads of help with this little project.

The problem with uint8 is that you are going to use way more GPRs. Even if you get better efficiency per thread, the higher GPR use it going to reduce the maximum number of threads that can be processed at once. This problem can be minimized with high RAM clocks, but the higher power use might not be worth it.

Also, truncated SHA512 isn't the same as SHA256.

Yeah, I found out that the SHA-512/256 algorithm doesn't output an actual SHA-256 hash.
As for the uint8, I was trying to figure out if the increased efficiency will make up for the lower hash speed IF it doesn't increase after the nonce values are fixed. If it does increase efficiency significantly, as it is processing twice as many vectors [with nonce values set], then the increased RAM clocks might not be that big of an issue. Though, I did find out that the GPRs are designed to handle 128-bit moves more efficiently than 64-bit or 256-bit. So this means that 4 vectors are optimum for the Evergreen (HD5xxx) GPUs. However, I'm not certain on the Northern Islands (HD6xxx) GPUs and beyond. Their memory interface is 256-bit which may mean that 8 vectors could be optimum. Unfortunately, I don't have the hardware to test the theory out. And that's all it is. Just a theory.

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs. If you need it, we can get it. We have solutions for your computing conundrums. BTC accepted! 12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq

FreeBit

Member

Offline

Activity: 106
Merit: 10

Re: Phoenix - Efficient, fast, modular miner

February 02, 2012, 06:52:36 PM

#1098

I want to use this miner.

How do I install it on an Ubunut 11.10 32bit system with one 5850?
I could not found any installation advice, readme or howto.

Thank you ...

jedi95 (OP)

Full Member

Offline

Activity: 219
Merit: 120

Re: Phoenix - Efficient, fast, modular miner

February 02, 2012, 09:51:04 PM

#1099

Quote from: FreeBit on February 02, 2012, 06:52:36 PM

I want to use this miner.

How do I install it on an Ubunut 11.10 32bit system with one 5850?
I could not found any installation advice, readme or howto.

Thank you ...

In general this is what you will need to do:

1. Install drivers + APP SDK
There are many guides for this around, so I won't go into the details here. In general SDK versions 2.1, 2.4, 2.5 are good, and 2.2, 2.3, 2.6 are bad.

2. Install prerequisites:
sudo apt-get install python-twisted python-numpy python-pyopencl

3. Download Phoenix
You can do this one of 2 ways:
A. Clone the Git repo using:
git clone git@github.com:jedi95/Phoenix-Miner.git /home/username/phoenix
Obviously you will want to modify the directory to suit your needs.
B. Download
https://github.com/jedi95/Phoenix-Miner/tarball/master
Extract to the directory of your choice

4. Start Phoenix:
./phoenix.py -u http://username:password@pool.com:8332 -k phatk2 AGGRESSION=6 WORKSIZE=128 VECTORS DEVICE=0
You will need modify the URL to connect to the pool of your choice. The above settings should be optimal for a 5850 running at normal memory clocks. (substitute WORKSIZE=256 if underclocking the memory)

Phoenix Miner developer

Donations appreciated at:
1PHoenix9j9J3M6v3VQYWeXrHPPjf7y3rU

pogi23

Member

Offline

Activity: 266
Merit: 10

Re: Phoenix - Efficient, fast, modular miner

February 03, 2012, 07:13:58 AM

#1100

Has anyone fried a 5770 using AGGRESSION=7?

The skinny: I was using poclbm, SDK2.1, and 5770 (stock clocks) when I thought I'd try out phoenix. I saw the 'Recommended' settings for High-End cards and thought they were merely guidelines. It would run for 5 minutes then it froze twice then it wouldn't even POST. Bummer, because it had a twin.

Just how much of a difference is there between AGGRESSION 5 and 7?