[ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

daeminium

Sr. Member

Offline

Activity: 448
Merit: 252

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 28, 2015, 04:45:37 PM

#2781

Quote from: cat77 on January 26, 2015, 07:48:52 PM

Here are my neoscrypt.cl and sgminer configurations for 7950 and 280x. I use the 14.6 driver, found no difference using 14.7.
My 7950 does 320KHs at GPU clock 1000 MHz, memclock 1250 MHz, 1.081V GPU core voltage
My 280X does 360KHs at GPU clock 1036 MHz, memclock 1500 MHz, 1.025V GPU core voltage

The XORBytesInPlace function needs to change depending on whether you run a 280X or a 7950. Each card hashes faster with one variant than with the other; I have not examined why. So, edit the neoscrypt.cl file and look at the XORBytesInPlace function. Switch the section to match the card you are using by changing what is commented out or in.
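
For readers who have not opened the kernel yet: the switch in XORBytesInPlace picks the access width from the low two bits of the buffer offset (bufidx & 0x03). Below is a minimal host-side C sketch of that idea, not the kernel itself. The helper names are made up for illustration, and it uses 32-bit and 16-bit words via memcpy, where the posted kernel uses OpenCL vector types (uint2 and uchar2, or ulong4 in the 7950 variant).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise reference: always valid, whatever the offset is. */
static void xor32_bytewise(uint8_t *buf, const uint8_t *src, uint8_t offset)
{
    assert(buf != NULL && src != NULL);
    for (size_t i = 0; i < 32; ++i)
        buf[offset + i] ^= src[i];
}

/* Hypothetical helper: XOR 32 bytes of src into buf + offset.
 * The access width is chosen from the low two bits of the offset,
 * mirroring the switch(bufidx & 0x03) in the posted kernel. */
static void xor32_by_alignment(uint8_t *buf, const uint8_t *src, uint8_t offset)
{
    assert(buf != NULL && src != NULL);
    uint8_t *dst = buf + offset;

    switch (offset & 0x03) {
    case 0: /* offset is a multiple of 4: eight 32-bit XORs */
        for (size_t i = 0; i < 32; i += 4) {
            uint32_t d, s;
            memcpy(&d, dst + i, sizeof d);
            memcpy(&s, src + i, sizeof s);
            d ^= s;
            memcpy(dst + i, &d, sizeof d);
        }
        break;
    case 2: /* offset is a multiple of 2: sixteen 16-bit XORs */
        for (size_t i = 0; i < 32; i += 2) {
            uint16_t d, s;
            memcpy(&d, dst + i, sizeof d);
            memcpy(&s, src + i, sizeof s);
            d = (uint16_t)(d ^ s);
            memcpy(dst + i, &d, sizeof d);
        }
        break;
    default: /* odd offset: fall back to single bytes */
        for (size_t i = 0; i < 32; ++i)
            dst[i] ^= src[i];
        break;
    }
}
```

All three paths produce the same bytes; only the number and width of memory operations differ, which is presumably why one layout can win on one card and lose on another.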

7950 config:
sgminer.exe -k neoscrypt --worksize 64 --rawintensity 4584 -g 4 -o stratum+tcp://stratum.ftc.theblocksfactory.com:3333 -u USER -p PASSWORD

280X config
sgminer.exe -k neoscrypt --worksize 64 --rawintensity 5120 -g 4 -o stratum+tcp://stratum.ftc.theblocksfactory.com:3333 -u USER -p PASSWORD

Feel free to criticize and / or offer up improvements. Basically, I have about 9 hours of OpenCL programming experience, I don't claim to be an expert by any means. I only claim to be 20% faster than the POS neoscrypt.cl file on Nicehash. Whoever posted that certainly wouldn't qualify as an expert either.

Last time I played with this stuff, I made other changes which make a 7950 run at 355KHs, 1000MHz GPU, 1250 Memclock. But that was a bit unstable, 3% HW errors. If I get around to playing with it and get it clean and stable, I will post up new code and config. Relatively speaking, this would then push the 280X to near 400Khs at 1036Mhz.
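
The "near 400Khs" figure is a proportional estimate. A small C sketch of that arithmetic, assuming the 280X gains by the same ratio as the 7950 (the function name is hypothetical, and nothing here is a measurement):

```c
#include <assert.h>

/* Hypothetical helper: scale one card's hashrate by the speedup
 * observed on another card (old_khs -> new_khs). */
static double scaled_hashrate(double base_khs, double old_khs, double new_khs)
{
    assert(old_khs > 0.0);
    return base_khs * (new_khs / old_khs);
}
```

With the numbers from the post, scaled_hashrate(360.0, 320.0, 355.0) comes out at roughly 399.4 KH/s for the 280X.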

Donations: 1D4yYxmH44Xg4J2GuQ5ppfUKS7ohiJaD21

Code:

/* NeoScrypt(128, 2, 1) with Salsa20/20 and ChaCha20/20 */
/* Adapted and improved for 14.x drivers by Wolf9466 (Wolf`) */

// Stupid AMD compiler ignores the unroll pragma in these two
#define SALSA_SMALL_UNROLL 3
#define CHACHA_SMALL_UNROLL 3

// If SMALL_BLAKE2S is defined, BLAKE2S_UNROLL is interpreted
// as the unroll factor; must divide cleanly into ten.
// Usually a bad idea.
// #define SMALL_BLAKE2S
// #define BLAKE2S_UNROLL 5

#define BLOCK_SIZE           64U
#define FASTKDF_BUFFER_SIZE 256U
#ifndef PASSWORD_LEN
#define PASSWORD_LEN         80U
#endif

#if !defined(cl_khr_byte_addressable_store)
#error "Device does not support unaligned stores"
#endif

void CopyBytes(void *restrict dst, const void *restrict src, uint len)
{
    for(int i = 0; i < len; ++i)
		((uchar *)dst)[i] = ((uchar *)src)[i];
}

void CopyBytes32(void *restrict dst, const void *restrict src)
{
   	#pragma unroll 4
    for(int i = 31; i > 0; i-=8) 
    {
    		((uchar *)dst)[i] = ((uchar *)src)[i];    
    		((uchar *)dst)[i-1] = ((uchar *)src)[i-1];
    		((uchar *)dst)[i-2] = ((uchar *)src)[i-2];
    		((uchar *)dst)[i-3] = ((uchar *)src)[i-3];
    		((uchar *)dst)[i-4] = ((uchar *)src)[i-4];
    		((uchar *)dst)[i-5] = ((uchar *)src)[i-5];
    		((uchar *)dst)[i-6] = ((uchar *)src)[i-6];    
    		((uchar *)dst)[i-7] = ((uchar *)src)[i-7];
    } 
}

void CopyBytes64(void *restrict dst, const void *restrict src)
{
	#pragma unroll 8
    for(int i = 63; i > 0; i-=8) 
    {
  		((uchar *)dst)[i] = ((uchar *)src)[i];    
  		((uchar *)dst)[i-1] = ((uchar *)src)[i-1];
  		((uchar *)dst)[i-2] = ((uchar *)src)[i-2];
  		((uchar *)dst)[i-3] = ((uchar *)src)[i-3];
  		((uchar *)dst)[i-4] = ((uchar *)src)[i-4];
  		((uchar *)dst)[i-5] = ((uchar *)src)[i-5];
  		((uchar *)dst)[i-6] = ((uchar *)src)[i-6];    
  		((uchar *)dst)[i-7] = ((uchar *)src)[i-7];
    } 
}


void XORBytesInPlace(void *restrict dst, const void *restrict src, uchar bufidx)
{

/*
// for 7950
  switch(bufidx & 0x03)
  {
  case 0:
    	  ((ulong4 *)dst)[0] ^= ((ulong4 *)src)[0]; 
      break;
// end for 7950
*/


// for 280X
  switch( bufidx & 0x03)
  {
  case 0:
    #pragma unroll 2
    for(int i = 0; i < 4; i+=2)
    {       
    	  ((uint2 *)dst)[i] ^= ((uint2 *)src)[i];       
    	  ((uint2 *)dst)[i+1] ^= ((uint2 *)src)[i+1]; 
    }
    break;  

  case 2:  
    #pragma unroll 8
    for(int i = 0; i < 16; i+=2)
    {
    	  ((uchar2 *)dst)[i] ^= ((uchar2 *)src)[i]; 
    	  ((uchar2 *)dst)[i+1] ^= ((uchar2 *)src)[i+1]; 
    }
    break;
//  end for 280X


  default:
  #pragma unroll 8
   for(int i = 0; i < 32; i+=4)
   {
  	  ((uchar *)dst)[i] ^= ((uchar *)src)[i];
  	  ((uchar *)dst)[i+1] ^= ((uchar *)src)[i+1];
  	  ((uchar *)dst)[i+2] ^= ((uchar *)src)[i+2];
  	  ((uchar *)dst)[i+3] ^= ((uchar *)src)[i+3];   
    }
  }
} 

void XORBytes(void *restrict dst, const void *restrict src1, const void *restrict src2, uint len)
{
	#pragma unroll 1
	for(int i = 0; i < len; ++i)
		((uchar *)dst)[i] = ((uchar *)src1)[i] ^ ((uchar *)src2)[i];
}


// Blake2S

#define BLAKE2S_BLOCK_SIZE    64U
#define BLAKE2S_OUT_SIZE      32U
#define BLAKE2S_KEY_SIZE      32U

static const __constant uint BLAKE2S_IV_1[16] =
{
    0x6B08C647, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
    0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19,
    0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
    0x510E523F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
};

static const __constant uint BLAKE2S_IV_2[8] =
{
    0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
    0x510E52FF, 0x9B05688C, 0xE07C2654, 0x5BE0CD19
};

static const __constant uchar BLAKE2S_SIGMA[10][16] =
{
    {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 } ,
    { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 } ,
    { 11,  8, 12,  0,  5,  2, 15, 13, 10, 14,  3,  6,  7,  1,  9,  4 } ,
    {  7,  9,  3,  1, 13, 12, 11, 14,  2,  6,  5, 10,  4,  0, 15,  8 } ,
    {  9,  0,  5,  7,  2,  4, 10, 15, 14,  1, 11, 12,  6,  8,  3, 13 } ,
    {  2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9 } ,
    { 12,  5,  1, 15, 14, 13,  4, 10,  0,  7,  6,  3,  9,  2,  8, 11 } ,
    { 13, 11,  7, 14, 12,  1,  3,  9,  5,  0, 15,  4,  8,  6,  2, 10 } ,
    {  6, 15, 14,  9, 11,  3,  0,  8, 12,  2, 13,  7,  1,  4, 10,  5 } ,
    { 10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13 , 0 } ,
};


#define BLAKE_G(idx0, idx1, a, b, c, d, key)	do { \
  for(int i=0; i< 2; ++i) {\
  a += b + key[BLAKE2S_SIGMA[idx0][idx1 + i]]; \
  d = rotate(d ^ a, ( i << 3 )+16U ); \
	c += d; \
	b = rotate(b ^ c, ( i + (i<<2))+20U) ; \
  }\
} while(0)


void Blake2S(uint *restrict inout, const uint *restrict inkey)
{
	uint16 V;
	uint8 tmpblock;
 
	// Load first block (IV into V.lo) and constants (IV into V.hi)
	V = vload16(0U, BLAKE2S_IV_1);
  tmpblock = V.lo; 

	// Compress state, using the key as the key

	#ifdef SMALL_BLAKE2S
	#pragma unroll BLAKE2S_UNROLL
	#else
	#pragma unroll 10
	#endif

      for(int x = 0; x < 10; ++x)
    	{
    		BLAKE_G(x, 0x00, V.s0, V.s4, V.s8, V.sc, inkey);
    		BLAKE_G(x, 0x02, V.s1, V.s5, V.s9, V.sd, inkey);
    		BLAKE_G(x, 0x04, V.s2, V.s6, V.sa, V.se, inkey);
    		BLAKE_G(x, 0x06, V.s3, V.s7, V.sb, V.sf, inkey);	
        BLAKE_G(x, 0x08, V.s0, V.s5, V.sa, V.sf, inkey);
    		BLAKE_G(x, 0x0A, V.s1, V.s6, V.sb, V.sc, inkey);
    		BLAKE_G(x, 0x0C, V.s2, V.s7, V.s8, V.sd, inkey);
    		BLAKE_G(x, 0x0E, V.s3, V.s4, V.s9, V.se, inkey);
     	}
    
    	// XOR low part of state with the high part,
    	// then with the original input block.
    	tmpblock = V.lo = V.lo ^ V.hi ^ tmpblock;
    
    	// Load constants (IV into V.hi)
    	V.hi = vload8(0U, BLAKE2S_IV_2);

	// Compress block, using the input as the key
	#ifdef SMALL_BLAKE2S
	#pragma unroll BLAKE2S_UNROLL
	#else
	#pragma unroll 10
	#endif
	for(int x = 0; x < 10; x++)
	{
		BLAKE_G(x, 0x00, V.s0, V.s4, V.s8, V.sc, inout);
		BLAKE_G(x, 0x02, V.s1, V.s5, V.s9, V.sd, inout);
		BLAKE_G(x, 0x04, V.s2, V.s6, V.sa, V.se, inout);
		BLAKE_G(x, 0x06, V.s3, V.s7, V.sb, V.sf, inout);
		BLAKE_G(x, 0x08, V.s0, V.s5, V.sa, V.sf, inout);
		BLAKE_G(x, 0x0A, V.s1, V.s6, V.sb, V.sc, inout);
		BLAKE_G(x, 0x0C, V.s2, V.s7, V.s8, V.sd, inout);
		BLAKE_G(x, 0x0E, V.s3, V.s4, V.s9, V.se, inout);
	}

	// Store result in input/output buffer
	vstore8(V.lo ^ V.hi ^ tmpblock, 0, inout);
}


/* FastKDF, a fast buffered key derivation function:
 * FASTKDF_BUFFER_SIZE must be a power of 2;
 * password_len, salt_len and output_len should not exceed FASTKDF_BUFFER_SIZE;
 * prf_output_size must be <= prf_key_size; */
void fastkdf(const uchar *restrict password, const uchar *restrict salt, const uint salt_len, uchar *restrict output, uint output_len)
{

	/*                    WARNING!
	 * This algorithm uses byte-wise addressing for memory blocks.
	 * Or in other words, trying to copy an unaligned memory region
	 * will significantly slow down the algorithm, when copying uses
	 * words or bigger entities. It even may corrupt the data, when
	 * the device does not support it properly.
	 * Therefore use byte copying, which will not be the fastest but will at
	 * least give reliable results. */

	// BLOCK_SIZE            64U
	// FASTKDF_BUFFER_SIZE  256U
	// BLAKE2S_BLOCK_SIZE    64U
	// BLAKE2S_KEY_SIZE      32U
	// BLAKE2S_OUT_SIZE      32U

  uchar bufidx = 0;
  uint8 Abuffer[9], Bbuffer[9] = { (uint8)(0) };
	uchar *A = (uchar *)Abuffer, *B = (uchar *)Bbuffer;
  uint i;
  
	// Initialize the password buffer
  #pragma unroll 5
  for( i = 0; i < 5; i++ )
      ((ulong2 *)A)[i] = ((ulong2 *)A)[i+5] = ((ulong2 *)A)[i+10] = ((ulong2 *)password)[i];   
  ((ulong2 *)A)[15] = ((ulong2 *)password)[0];
 
	((ulong8 *)(A + FASTKDF_BUFFER_SIZE))[0] = ((ulong8 *)password)[0];

	// Initialize the salt buffer
	if( !(salt_len ^ FASTKDF_BUFFER_SIZE))
	{
		((ulong16 *)B)[0] = ((ulong16 *)B)[2] = ((ulong16 *)salt)[0];
		((ulong16 *)B)[1] = ((ulong16 *)B)[3] = ((ulong16 *)salt)[1];
	}
	else
	{
		// salt_len is 80 bytes here

		#pragma unroll 5 	
    for( i = 0; i < 5; i++)
       ((ulong2 *)B)[i] = ((ulong2 *)B)[i+5] = ((ulong2 *)B)[i+10] = ((ulong2 *)salt)[i];
    ((ulong2 *)B)[15] = ((ulong2 *)salt)[0];

//		for(int i = 0; i < (FASTKDF_BUFFER_SIZE >> 3); ++i) ((ulong *)B)[i] = ((ulong *)salt)[i % 10];

		// Initialized the rest to zero earlier
      ((ulong8 *)(B + FASTKDF_BUFFER_SIZE))[0] = ((ulong8 *)salt)[0];
      ((ulong2 *)(B + FASTKDF_BUFFER_SIZE))[4] = ((ulong2 *)salt)[4];
	}

		// Make the key buffer twice the size of the key so it fits a Blake2S block
		// This way, we don't need a temp buffer in the Blake2S function.
		uchar input[BLAKE2S_BLOCK_SIZE], key[BLAKE2S_BLOCK_SIZE] = { 0 };    

    // The primary iteration
    #pragma unroll 1
    for(i = 0; i < 32; ++i)
    {   
    		// Copy input and key to their buffers
    		CopyBytes64(input, A + bufidx); 
        CopyBytes32(key, B + bufidx);

            // PRF
            Blake2S((uint *)input, (uint *)key);
    
            // Calculate the next buffer pointer
    
        bufidx = 0;   
        #pragma unroll 2
        for(int k = 0; k < 31; k+=16) {
          bufidx += input[k] + input[k+1] + input[k+2] + input[k+3] + input[k+4] + input[k+5] + input[k+6] + input[k+7];         
    			bufidx += input[k+8] + input[k+9] + input[k+10] + input[k+11] + input[k+12] + input[k+13] + input[k+14] + input[k+15]; 
        }    // Modify the salt buffer
    	      
        XORBytesInPlace(B + bufidx, input, bufidx );

    		if(  bufidx < BLAKE2S_KEY_SIZE )
    		{
    			// Head modified, tail updated
    			CopyBytes(B + FASTKDF_BUFFER_SIZE + bufidx, B + bufidx, BLAKE2S_KEY_SIZE - bufidx );
    		}
//    		else if( (FASTKDF_BUFFER_SIZE - bufidx ) < BLAKE2S_OUT_SIZE )
        else if ( bufidx > 224 )
    		{
    			// Tail modified, head updated
    			CopyBytes(B, B + FASTKDF_BUFFER_SIZE, bufidx - 224);
    		}
    }

    // Modify and copy into the output buffer

	if( (FASTKDF_BUFFER_SIZE - bufidx) < output_len)
	{
		XORBytes(output, B + bufidx, A, (FASTKDF_BUFFER_SIZE - bufidx));
		XORBytes(output + (FASTKDF_BUFFER_SIZE - bufidx), B, A + (FASTKDF_BUFFER_SIZE - bufidx), output_len - (FASTKDF_BUFFER_SIZE - bufidx));
	}
	else
      XORBytes(output, B + bufidx, A, output_len);    
}


#define SALSA_CORE(state) do { \
  state.s49e3 ^= rotate(state.s05af + state.sc16b, (uint4)( 7U, 7U, 7U, 7U)); \
  state.s8d27 ^= rotate(state.s49e3 + state.s05af, (uint4)( 9U, 9U, 9U, 9U)); \
  state.sc16b ^= rotate(state.s8d27 + state.s49e3, (uint4)( 13U, 13U, 13U, 13U)); \
  state.s05af ^= rotate(state.sc16b + state.s8d27, (uint4)( 18U, 18U, 18U, 18U)); \
  \
  state.s16bc ^= rotate(state.s05af + state.s349e, (uint4)( 7U, 7U, 7U, 7U)); \
  state.s278d ^= rotate(state.s16bc + state.s05af, (uint4)( 9U, 9U, 9U, 9U)); \
  state.s349e ^= rotate(state.s278d + state.s16bc, (uint4)( 13U, 13U, 13U, 13U)); \
  state.s05af ^= rotate(state.s349e + state.s278d, (uint4)( 18U, 18U, 18U, 18U)); \
} while(0)


uint16 salsa_small_scalar_rnd(uint16 X)
{
	uint16 st = X;

	#if SALSA_SMALL_UNROLL == 1

	for(int i = 0; i < 10; ++i)
	{
		SALSA_CORE(st);
	}

	#elif SALSA_SMALL_UNROLL == 2

	for(int i = 0; i < 5; ++i)
	{
		SALSA_CORE(st);
		SALSA_CORE(st);
	}

	#elif SALSA_SMALL_UNROLL == 3

//	for(int i = 0; i < 4; ++i)

  uint i = 4;
  while (i--) 
	{
		SALSA_CORE(st);
		if( !i ) break;
		SALSA_CORE(st);
		SALSA_CORE(st);
	} 

	#elif SALSA_SMALL_UNROLL == 4

	for(int i = 0; i < 3; ++i)
	{
		SALSA_CORE(st);
		SALSA_CORE(st);
		if(i == 2) break;
		SALSA_CORE(st);
		SALSA_CORE(st);
	}

	#else

	for(int i = 0; i < 2; ++i)
	{
		SALSA_CORE(st);
		SALSA_CORE(st);
		SALSA_CORE(st);
		SALSA_CORE(st);
		SALSA_CORE(st);
	}

	#endif

	return(X + st);
}

#define CHACHA_CORE_PARALLEL(state)	do { \
	state[0] += state[1]; state[3] = rotate(state[3] ^ state[0], (uint4)(16U, 16U, 16U, 16U)); \
	state[2] += state[3]; state[1] = rotate(state[1] ^ state[2], (uint4)(12U, 12U, 12U, 12U)); \
	state[0] += state[1]; state[3] = rotate(state[3] ^ state[0], (uint4)(8U, 8U, 8U, 8U)); \
	state[2] += state[3]; state[1] = rotate(state[1] ^ state[2], (uint4)(7U, 7U, 7U, 7U)); \
	\
	state[0] += state[1].yzwx; state[3].wxyz = rotate(state[3].wxyz ^ state[0], (uint4)(16U, 16U, 16U, 16U)); \
	state[2].zwxy += state[3].wxyz; state[1].yzwx = rotate(state[1].yzwx ^ state[2].zwxy, (uint4)(12U, 12U, 12U, 12U)); \
	state[0] += state[1].yzwx; state[3].wxyz = rotate(state[3].wxyz ^ state[0], (uint4)(8U, 8U, 8U, 8U)); \
	state[2].zwxy += state[3].wxyz; state[1].yzwx = rotate(state[1].yzwx ^ state[2].zwxy, (uint4)(7U, 7U, 7U, 7U)); \
} while(0)

uint16 chacha_small_parallel_rnd(uint16 X)
{
	uint4 st[4];

	((uint16 *)st)[0] = X;

	#if CHACHA_SMALL_UNROLL == 1

	for(int i = 0; i < 10; ++i)
	{
		CHACHA_CORE_PARALLEL(st);
	}

	#elif CHACHA_SMALL_UNROLL == 2

	for(int i = 0; i < 5; ++i)
	{
		CHACHA_CORE_PARALLEL(st);
		CHACHA_CORE_PARALLEL(st);
	}

	#elif CHACHA_SMALL_UNROLL == 3

//	for(int i = 0; i < 4; ++i)
  
  int i = 4;
  while (i--)
	{
		CHACHA_CORE_PARALLEL(st);
		if( !i ) break;
		CHACHA_CORE_PARALLEL(st);    
		CHACHA_CORE_PARALLEL(st);

	}

	#elif CHACHA_SMALL_UNROLL == 4

	for(int i = 0; i < 3; ++i)
	{
		CHACHA_CORE_PARALLEL(st);
		CHACHA_CORE_PARALLEL(st);
		if(i == 2) break;
		CHACHA_CORE_PARALLEL(st);
		CHACHA_CORE_PARALLEL(st);
	}

	#else

	for(int i = 0; i < 2; ++i)
	{
		CHACHA_CORE_PARALLEL(st);
		CHACHA_CORE_PARALLEL(st);
		CHACHA_CORE_PARALLEL(st);
		CHACHA_CORE_PARALLEL(st);
		CHACHA_CORE_PARALLEL(st);
	}

	#endif

	return(X + ((uint16 *)st)[0]);
}

void neoscrypt_blkmix(uint16 *XV, uint alg)
{
  uint16 TX;

    /* NeoScrypt flow:                   Scrypt flow:
         Xa ^= Xd;  M(Xa'); Ya = Xa";      Xa ^= Xb;  M(Xa'); Ya = Xa";
         Xb ^= Xa"; M(Xb'); Yb = Xb";      Xb ^= Xa"; M(Xb'); Yb = Xb";
         Xc ^= Xb"; M(Xc'); Yc = Xc";      Xa" = Ya;
         Xd ^= Xc"; M(Xd'); Yd = Xd";      Xb" = Yb;
         Xa" = Ya; Xb" = Yc;
         Xc" = Yb; Xd" = Yd; */
      
    if (!alg)
    {
    		XV[0] = salsa_small_scalar_rnd( XV[0] ^ XV[3] ); 
        TX =    salsa_small_scalar_rnd( XV[1] ^ XV[0] ); 
        XV[1] = salsa_small_scalar_rnd( XV[2] ^ TX ); 
        XV[3] = salsa_small_scalar_rnd( XV[3] ^ XV[1] ); 
    }
    else
    {
        XV[0] = chacha_small_parallel_rnd(XV[0] ^ XV[3] ); 
    		TX =    chacha_small_parallel_rnd(XV[1] ^ XV[0] ); 
    		XV[1] = chacha_small_parallel_rnd(XV[2] ^ TX); 
        XV[3] = chacha_small_parallel_rnd(XV[3] ^ XV[1] );      
    } 
    XV[2] = TX;      
}


void SMix(ulong16 *X, __global ulong16 *V, uint flag)
{
  uint idx;
  uint i = 0; 
    do {
       	V[i++]   = X[0];
 	      V[i++]   = X[1];           
        neoscrypt_blkmix(X, flag);
    }   while (i ^ 256);
    do {
        idx =  (((uint *)X)[48])<<1 & 0xFE;
        X[0] ^= V[idx++];
 	      X[1] ^= V[idx];    
        neoscrypt_blkmix(X, flag);    
        i-=2;
    } while (i);
}


__attribute__((reqd_work_group_size(WORKSIZE, 1, 1)))
__kernel void search(__global const uchar* restrict input, __global uint* restrict output, __global uchar *padcache, const uint target)
{
#define CONSTANT_N 128
#define CONSTANT_r 2
	// X = CONSTANT_r * 2 * BLOCK_SIZE(64); Z is a copy of X for ChaCha
	uint16 X[4], Z[4];
  bool flag = false; 
 
	/* V = CONSTANT_N * CONSTANT_r * 2 * BLOCK_SIZE */
	__global ulong16 *V = (__global ulong16 *)(padcache + ( (get_global_id(0) % MAX_GLOBAL_THREADS) << 15 ));
	uchar outbuf[32];
	uchar data[PASSWORD_LEN];

	((ulong8 *)data)[0] = ((__global const ulong8 *)input)[0];
	((ulong *)data)[8] = ((__global const ulong *)input)[8];
	((uint *)data)[18] = ((__global const uint *)input)[18];
	((uint *)data)[19] = get_global_id(0);

    // X = KDF(password, salt)
	fastkdf(data, data, PASSWORD_LEN, (uchar *)X, 256);

    // Process Salsa 1st (flag == false), ChaCha 2nd, then XOR them - run that through FastKDF
//    CopyBytes128(Z, X, 2);

		((ulong16 *)Z)[0] = ((ulong16 *)X)[0];
    ((ulong16 *)Z)[1] = ((ulong16 *)X)[1];

    // X = SMix(X); X & Z are swapped, repeat.

    for( ;; ++flag)
    {
      SMix(X, V, flag);
      if (flag) break;      
//   		SwapBytes128(X, Z, 256);  
   		((ulong16 *)X)[0] ^= ((ulong16 *)Z)[0];
  		((ulong16 *)Z)[0] ^= ((ulong16 *)X)[0];
  		((ulong16 *)X)[0] ^= ((ulong16 *)Z)[0];
  		((ulong16 *)X)[1] ^= ((ulong16 *)Z)[1];
  		((ulong16 *)Z)[1] ^= ((ulong16 *)X)[1];
  		((ulong16 *)X)[1] ^= ((ulong16 *)Z)[1];  
   	}
        
	// blkxor(X, Z)
	((ulong16 *)X)[0] ^= ((ulong16 *)Z)[0];
	((ulong16 *)X)[1] ^= ((ulong16 *)Z)[1];

	// output = KDF(password, X)
	fastkdf(data, (uchar *)X, FASTKDF_BUFFER_SIZE, outbuf, 32);
	if(((uint *)outbuf)[7] <= target) output[atomic_add(output + 0xFF, 1)] = get_global_id(0);
}
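
A note on the BLAKE_G macro in the listing above: its rotation amounts look unusual because OpenCL's rotate() rotates left, while BLAKE2s is specified with right rotations by 16, 12, 8 and 7. The following host-side C sketch (helper names are made up for illustration, not part of the posted kernel) shows that the loop-computed left rotations 16, 20, 24 and 25 perform the same mixing step:

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

static uint32_t rotr32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (32 - n));
}

/* Reference G of BLAKE2s, written with right rotations 16, 12, 8, 7.
 * v[0..3] play the roles of a, b, c, d. */
static void g_reference(uint32_t v[4], uint32_t m0, uint32_t m1)
{
    v[0] += v[1] + m0; v[3] = rotr32(v[3] ^ v[0], 16);
    v[2] += v[3];      v[1] = rotr32(v[1] ^ v[2], 12);
    v[0] += v[1] + m1; v[3] = rotr32(v[3] ^ v[0], 8);
    v[2] += v[3];      v[1] = rotr32(v[1] ^ v[2], 7);
}

/* Loop form in the style of the posted macro: two half-steps,
 * with left-rotation amounts computed from the loop counter. */
static void g_loop(uint32_t v[4], const uint32_t m[2])
{
    for (unsigned i = 0; i < 2; ++i) {
        v[0] += v[1] + m[i];
        v[3] = rotl32(v[3] ^ v[0], (i << 3) + 16U);       /* 16, then 24 */
        v[2] += v[3];
        v[1] = rotl32(v[1] ^ v[2], (i + (i << 2)) + 20U); /* 20, then 25 */
    }
}
```

A left rotation by n on a 32-bit word equals a right rotation by 32 - n, so 20, 24 and 25 to the left correspond to 12, 8 and 7 to the right.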

thanks, from 1150Mh/s to 1314Mh/s ;-)

cat77

Newbie

Offline

Activity: 18
Merit: 0

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 28, 2015, 07:34:24 PM

#2782

You're quite welcome, glad it's working.

Whoever is sending BTC coinage to me, thank you kindly.

platinum4

Sr. Member

Offline

Activity: 547
Merit: 250

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 31, 2015, 07:52:55 AM

#2783

cat77: is that for Tahiti only, or can it apply to Hawaii?

Eliovp

Legendary

Offline

Activity: 1050
Merit: 1293

Huh?

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 31, 2015, 11:36:38 AM

#2784

Quote from: platinum4 on January 31, 2015, 07:52:55 AM

cat77: is that for Tahiti only, or can it apply to Hawaii?

I tried it with a 290 and a 290X; it won't even compile with the most commonly used settings.

The only way I could get it to compile is with GPU threads at 1 and tc 8195.

Even then it only reaches 150 kH/s, so it's a no-go for Hawaii.

However, if you edit the kernel file it is possible, but it is still not as fast as the standard one.

Greetings.

Eliovp.com - By Miles The Best ROI in the industry!

semajjames

Hero Member

Offline

Activity: 528
Merit: 500

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 31, 2015, 08:09:20 PM

#2785

Quote from: platinum4 on January 31, 2015, 07:52:55 AM

cat77: is that for Tahiti only, or can it apply to Hawaii?

I am now using it on 290 and 290x

dhsc19

Member

Offline

Activity: 96
Merit: 10

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 31, 2015, 10:07:56 PM

#2786

Quote from: Wolf0 on January 27, 2015, 06:48:21 PM

Quote from: dhsc19 on January 27, 2015, 06:28:33 PM

Quote from: Wolf0 on January 27, 2015, 05:46:55 PM

Quote from: platinum4 on January 27, 2015, 05:41:33 PM

They should work, just rename the .bin file if you need to. I think his has l8 in the filename, some people need l4.

Right - it's safe to replace anyway.

Oh, I also see that you also made available the kernel cl files. Am I correct to assume that I can drop those in/replace the ones sgminer built when I compiled and sgminer should be able to generate a native bin file with the correct mods the next time I run it?

No, I didn't - those are very old - leaked with the bins. They don't have anywhere near the speed.

Hi Wolf0,

Can you direct me to where I can get your darkcoin-mod.cl source? I have sgminer start automatically from the /etc/rc.local file on my Ubuntu machine, and I cannot find where it creates the .bin file so that I can replace it. I've even tried running "find . -name *darkcoin*" from the filesystem root as the root user, and no .bin file comes up. My miner works fine, but I have no idea where those .bin files end up, so I can't replace anything. So I'd rather put in your modified darkcoin-mod.cl file and recompile sgminer with it. Thanks.

MaxDZ8

Hero Member

Offline

Activity: 672
Merit: 500

Re: what is the state of pool difficulty?

February 02, 2015, 03:56:24 PM

#2787

Stratum diff has always confused me. Some pools report it as a fractional number, sometimes much lower than 1; those are the ones I usually go for.
There are two sgminer issues, #312, later superseded by #317 regarding this.

Just to make sure we're talking about the same thing, I have been using sgminer-5.1-2014-12-20-win32.zip I just downloaded from nicehash and pointed it to Myriadcoin nonce-pool (stratum: mine1.myr.nonce-pool.com:3360).

Noncepool uses difficulty 2 (at least that's what's reported in workers control panel and that's also what goes off stratum).
Of course this is totally inappropriate for a small 1 MH/s miner such as mine.
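
To put a rough number on that, here is a minimal C sketch of the expected time between shares. It assumes the pool's difficulty follows the Bitcoin convention, where a difficulty-1 share takes 2^32 hashes on average; pools and algorithms that scale difficulty differently would need another value for hashes_per_diff1. The function and parameter names are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: average seconds between shares.
 * hashes_per_diff1 is the expected number of hashes for a
 * difficulty-1 share (4294967296.0 under the Bitcoin convention,
 * which is an assumption here, not something the pool states). */
static double expected_seconds_per_share(double difficulty,
                                         double hashes_per_diff1,
                                         double hashrate_hs)
{
    assert(difficulty > 0.0 && hashes_per_diff1 > 0.0 && hashrate_hs > 0.0);
    return difficulty * hashes_per_diff1 / hashrate_hs;
}
```

Under that assumption, expected_seconds_per_share(2.0, 4294967296.0, 1e6) is about 8590 seconds, or roughly 2.4 hours per share.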

The cool thing is that due to a bug in my miner's diff calculation (this time I'm really talking about mine) I've been submitting work anyway... and it gets accepted anyway (well, it used to). I am not sure if I'm more Cool

,

or

So, let's go back to sgminer.

In principle, I fully agree with badman and lukejr when they propose that the diff multiplier be deprecated and removed.

In practice, it seems to me the pool operators are taking it a bit too easy as I cannot believe those to be sensible settings.

On the pro side, p2pool nodes seem to work as expected so far... perhaps NOMP as well, I don't know. What's the state of this thing?

adaseb

Legendary

Offline

Activity: 3752
Merit: 1709

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

February 02, 2015, 05:23:06 PM

#2788

Are there any leaked wolf0 drivers for X11 for Pitcairn GPUs (the 270, the 270X and also the 7870)?

Just wondering what the hashrate is? I currently get 2.750Mhash/s

.BEST..CHANGE.

███████████████ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██

BEST
EXCHANGE
RATES

██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ███████████████

..BUY/ SELL CRYPTO..

adaseb

Legendary

Offline

Activity: 3752
Merit: 1709

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

February 02, 2015, 06:18:56 PM

#2789

Quote from: Wolf0 on February 02, 2015, 05:43:58 PM

Quote from: adaseb on February 02, 2015, 05:23:06 PM

Are there any leaked wolf0 drivers for X11 for Pitcairn GPUs (the 270, the 270X and also the 7870)?

Just wondering what the hashrate is? I currently get 2.750Mhash/s

Yeah, I released a Pitcairn bin.

Is there a website where it's posted? All I can find in this thread is a Tahiti bin.

bradli

Member

Offline

Activity: 82
Merit: 10

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

February 02, 2015, 07:01:22 PM

#2790

Quote from: adaseb on February 02, 2015, 06:18:56 PM

Quote from: Wolf0 on February 02, 2015, 05:43:58 PM

Quote from: adaseb on February 02, 2015, 05:23:06 PM

Are there any leaked wolf0 drivers for X11 for Pitcairn GPUs (the 270, the 270X and also the 7870)?

Just wondering what the hashrate is? I currently get 2.750Mhash/s

Yeah, I released a Pitcairn bin.

Is there a website where it's posted? All I can find in this thread is a Tahiti bin.

maybe you are looking for this
http://www.reddit.com/r/DRKCoin/comments/2o1yoz/rewritten_x11_binaries/

adaseb

Legendary

Offline

Activity: 3752
Merit: 1709

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

February 02, 2015, 07:17:36 PM

#2791

Quote from: bradli on February 02, 2015, 07:01:22 PM

Quote from: adaseb on February 02, 2015, 06:18:56 PM

Quote from: Wolf0 on February 02, 2015, 05:43:58 PM

Quote from: adaseb on February 02, 2015, 05:23:06 PM

Are there any leaked wolf0 drivers for X11 for Pitcairn GPUs (the 270, the 270X and also the 7870)?

Just wondering what the hashrate is? I currently get 2.750Mhash/s

Yeah, I released a Pitcairn bin.

Is there a website where it's posted? All I can find in this thread is a Tahiti bin.

maybe you are looking for this
http://www.reddit.com/r/DRKCoin/comments/2o1yoz/rewritten_x11_binaries/

Yes looks like that is it but the server is down. Can somebody post a copy?

Masked_Immortal

Member

Offline

Activity: 67
Merit: 10

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

February 03, 2015, 07:28:13 AM

#2792

I am using Wolf's bin for X11, but something is odd. On other algorithms, when a GPU finishes a piece of work, the result is shown in the bottom section as accepted /////// GPU#, and the A: value in the top section increases accordingly. In this case, A: increases very slowly compared with the number of accepted shares. Is something wrong with my settings?

SCAVO Technologies
SELF-SUSTAINING CRYPTO MINING FARM BY USING RENEWABLE ENERGY RESOURCES - LIFETIME CONTRACT
──────── Whitepaper ⬝ Twitter ⬝ Telegram ⬝ Facebook ⬝ Ann Thread ─────── Join our ICO: July. 15- Oct. 31

Tiger78

Member

Offline

Activity: 83
Merit: 10

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

February 03, 2015, 08:43:47 AM

#2793

Quote from: Wolf0 on February 02, 2015, 05:43:58 PM

Quote from: adaseb on February 02, 2015, 05:23:06 PM

Are there any leaked wolf0 drivers for X11 for Pitcairn GPUs (the 270, the 270X and also the 7870)?

Just wondering what the hashrate is? I currently get 2.750Mhash/s

Yeah, I released a Pitcairn bin.

Wolf0, please release one for the 7790/260x!

K1773R

Legendary

Offline

Activity: 1792
Merit: 1008

/dev/null

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

February 03, 2015, 05:03:31 PM

#2794

Quote from: Masked_Immortal on February 03, 2015, 07:28:13 AM

I am using Wolf's bin for X11, but something is odd. On other algorithms, when a GPU finishes a piece of work, the result is shown in the bottom section as accepted /////// GPU#, and the A: value in the top section increases accordingly. In this case, A: increases very slowly compared with the number of accepted shares. Is something wrong with my settings?

Some miners report A: as the number of accepted shares; others report it as the cumulative difficulty of the accepted shares.
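
A minimal C sketch of the two bookkeeping styles, assuming nothing about how any particular miner build actually stores its statistics (the struct and function names are invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters: one counts shares, the other sums their difficulty. */
struct share_stats {
    unsigned long accepted_count;      /* +1 per accepted share */
    double        accepted_difficulty; /* + share difficulty per accepted share */
};

static void record_accepted(struct share_stats *s, double share_difficulty)
{
    assert(s != NULL && share_difficulty > 0.0);
    s->accepted_count += 1;
    s->accepted_difficulty += share_difficulty;
}
```

With a share difficulty of 0.25, four accepted shares give a count of 4 but a cumulative difficulty of only 1.0, which would look like an A: value that climbs far more slowly than the accepted lines scroll by.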

[GPG Public Key]
BTC/DVC/TRC/FRC: 1K1773RbXRZVRQSSXe9N6N2MUFERvrdu6y ANC/XPM AK1773RTmRKtvbKBCrUu95UQg5iegrqyeA NMC: NK1773Rzv8b4ugmCgX789PbjewA9fL9Dy1 LTC: LKi773RBuPepQH8E6Zb1ponoCvgbU7hHmd EMC: EK1773RxUes1HX1YAGMZ1xVYBBRUCqfDoF BQC: bK1773R1APJz4yTgRkmdKQhjhiMyQpJgfN

bigblind

Sr. Member

Offline

Activity: 378
Merit: 252

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

February 04, 2015, 10:42:21 AM

#2795

Is there also any Wolf0-Like Lyra2RE.bin? Getting Around 1100Kh/s per 280x and around 1400 kh/s per 290.
Any comparison?

pallas

Legendary

Offline

Activity: 2716
Merit: 1094

Black Belt Developer

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

February 04, 2015, 10:52:06 AM

#2796

Quote from: bigblind on February 04, 2015, 10:42:21 AM

Is there also any Wolf0-Like Lyra2RE.bin? Getting Around 1100Kh/s per 280x and around 1400 kh/s per 290.
Any comparison?

I started working on Lyra2RE and got about 1780 Kh/s on r9 290, but there was no interest at all so I dropped the project.

Cryptonite (XCN): first mini-blockchain coin, innovative, running since 2014!

bigblind

Sr. Member

Offline

Activity: 378
Merit: 252

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

February 04, 2015, 11:27:04 AM

#2797

Quote from: pallas on February 04, 2015, 10:52:06 AM

Quote from: bigblind on February 04, 2015, 10:42:21 AM

Is there also any Wolf0-Like Lyra2RE.bin? Getting Around 1100Kh/s per 280x and around 1400 kh/s per 290.
Any comparison?

I started working on Lyra2RE and got about 1780 Kh/s on r9 290, but there was no interest at all so I dropped the project.

I'm interested, mate.

thevictimofuktyranny

Legendary

Offline

Activity: 1092
Merit: 1004

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

February 04, 2015, 06:12:43 PM
Last edit: February 04, 2015, 06:28:56 PM by thevictimofuktyranny

#2798

Quote from: pallas on February 04, 2015, 10:52:06 AM

Quote from: bigblind on February 04, 2015, 10:42:21 AM

Is there also any Wolf0-Like Lyra2RE.bin? Getting Around 1100Kh/s per 280x and around 1400 kh/s per 290.
Any comparison?

I started working on Lyra2RE and got about 1780 Kh/s on r9 290, but there was no interest at all so I dropped the project.

Yeah, I'd be interested in that as well.

What sort of BTC would you want?

Remember, some of us only have one AMD card, ironically an R9 290.

Would it be a crowdfunding campaign like djm34's?

https://bitcointalk.org/index.php?topic=916336.0

People who contribute get the sgminer mod early (you just have to set a minimum pledge, e.g. 125 pledges at 0.04 BTC equals 5 BTC), and when 5 BTC is reached, it is released to the community.

Mind you, Nvidia Maxwell cards have Lyra2 to themselves at 700 kH/s per card. To be competitive, it would need to hit around 2.1 MH/s on an R9 290.

Only joking.

pallas

Legendary

Offline

Activity: 2716
Merit: 1094

Black Belt Developer

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

February 05, 2015, 01:41:42 PM

#2799

here it is, let's see how it goes:

https://bitcointalk.org/index.php?topic=946655.0

(faster Lyra2RE kernel)

Cryptonite (XCN): first mini-blockchain coin, innovative, running since 2014!

ocminer

Legendary

Offline

Activity: 2660
Merit: 1240

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

February 05, 2015, 01:53:56 PM

#2800

Does anyone have experience with SSL-secured connections in sgminer?

AFAIK with ccminer you can put certificates using the -cert flag for securing the stratum connection, does sgminer allow that too ?

suprnova pools - reliable mining pools - #suprnova on freenet
https://www.suprnova.cc - FOLLOW us @ Twitter ! twitter.com/SuprnovaPools