[ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

djm34

Legendary

Offline

Activity: 1400
Merit: 1050

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 26, 2015, 03:38:00 PM

#2761

Quote from: Eastwind on January 26, 2015, 03:08:39 PM

Quote from: Wolf0 on January 26, 2015, 02:37:00 PM

Quote from: pallas on January 26, 2015, 02:28:04 PM

Quote from: BitmoreCoin on January 26, 2015, 01:53:32 PM

Quote from: Wolf0 on January 26, 2015, 01:39:58 PM

(number & (power_of_two - 1)) of course.

A bit more detail for non programmer?

Are you saying you want to program crypto algorithms in OpenCL without understanding "(number & (power_of_two - 1))"?

I kinda feel the same way, but I'll be a little gentler. BitmoreCoin, you may want to look up bitwise operators and things like that - an AND operation is far, far faster than modulus (modulus and division are ouch slow, as a general rule).

I think he means how to adjust the statements inside the case 0 and case 1.

Huh?
He means N % 2^n <=> N & (2^n - 1) for unsigned N. The result is the same, but the AND form is faster in the bad cases where the compiler does not make that substitution itself...
Similarly, N / 2^n <=> N >> n
and N * 2^n <=> N << n
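
A quick way to convince yourself of these three identities is a small host-side C check. This is only a sketch: it assumes unsigned 32-bit values (for negative signed numbers, % and / round toward zero and no longer match the mask and the shift), and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; valid for unsigned N and 0 <= n < 32 only. */
static uint32_t mod_pow2(uint32_t N, unsigned n) { return N & ((UINT32_C(1) << n) - 1u); }
static uint32_t div_pow2(uint32_t N, unsigned n) { return N >> n; }
static uint32_t mul_pow2(uint32_t N, unsigned n) { return N << n; }

/* Compares each bitwise form against the plain arithmetic operator. */
static void check_pow2_identities(void)
{
    for (uint32_t N = 0; N < 100000u; N += 7u) {
        for (unsigned n = 0; n < 12u; ++n) {
            uint32_t p = UINT32_C(1) << n;      /* p = 2^n */
            assert(mod_pow2(N, n) == N % p);
            assert(div_pow2(N, n) == N / p);
            assert(mul_pow2(N, n) == N * p);
        }
    }
}
```

Call check_pow2_identities() from your own main(); if it returns without tripping an assert, the forms agree over the tested range. An optimizing compiler will normally make these substitutions by itself when the divisor is a compile-time constant and the operand is unsigned.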

djm34 facebook page
BTC: 1NENYmxwZGHsKFmyjTc5WferTn5VTFb7Ze
Pledge for neoscrypt ccminer to that address: 16UoC4DmTz2pvhFvcfTQrzkPTrXkWijzXw

BitmoreCoin

Sr. Member

Offline

Activity: 406
Merit: 250

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 26, 2015, 03:46:52 PM

#2762

Quote from: cat77 on January 10, 2015, 04:16:31 AM

.....This is worth 20KH/s on my 280X......from 343KHs to 363KH/s at 1020MHz clock
.....now somebody needs to find 20KH/s more for me....

change the XORBytesInPlace call from

Code:

	XORBytesInPlace(B + bufidx, input, BLAKE2S_OUT_SIZE);

to

Code:

      XORBytesInPlace(B + bufidx, input, bufidx);

and change the function itself to perform some byte alignment checking

Code:

//
// a bit of byte alignment checking goes a long ways...
//
void XORBytesInPlace(void *restrict dst, const void *restrict src, uint mod)
{
  switch(mod % 4)
  {
  case 0:
    #pragma unroll 2
    for(int i = 0; i < 4; i+=2)
    {
    	  ((uint2 *)dst)[i]   ^= ((uint2 *)src)[i]; 
     	  ((uint2 *)dst)[i+1] ^= ((uint2 *)src)[i+1];    
    }
    break;    

  case 2:  
    #pragma unroll 8
    for(int i = 0; i < 16; i+=2)
    {
    	  ((uchar2 *)dst)[i] ^= ((uchar2 *)src)[i]; 
    	  ((uchar2 *)dst)[i+1] ^= ((uchar2 *)src)[i+1]; 
    }
    break;

  default:
  #pragma unroll 8
   for(int i = 0; i < 31; i+=4)
   {
  	  ((uchar *)dst)[i] ^= ((uchar *)src)[i];
  	  ((uchar *)dst)[i+1] ^= ((uchar *)src)[i+1];
  	  ((uchar *)dst)[i+2] ^= ((uchar *)src)[i+2];
  	  ((uchar *)dst)[i+3] ^= ((uchar *)src)[i+3];   
    }
  }
}

Later you said

Quote

Very interesting. I get about 2% gain on 7950 and need to use (mod % 2) with the case statements adjusted accordingly.
My 280X gains almost 6% as is, but the gain difference between (mod % 2) and (mod % 4) is pretty small, like 1-2 KHs

When you used (mod %2, same as mod &1), what are the case statements inside XORBytesInPlace(void *restrict dst, const void *restrict src, uint mod)?
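
For reference while waiting for an answer: with (mod & 1) only the values 0 and 1 can occur, so the "case 2" branch of the posted switch can never be taken, and an even offset only guarantees 2-byte alignment. The host-side C sketch below shows one possible two-way split under that assumption. It is not cat77's actual code (the mod % 2 variant has not been posted in this thread), and the names xor32_bytes, xor32_mod2 and xor32_mod2_selftest are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XOR_LEN 32u  /* the same 32 bytes the posted XORBytesInPlace handles */

/* Reference: plain byte-wise XOR, valid at any offset. */
static void xor32_bytes(uint8_t *dst, const uint8_t *src)
{
    for (unsigned i = 0; i < XOR_LEN; ++i)
        dst[i] ^= src[i];
}

/* Hypothetical two-way split on (offset & 1), i.e. offset % 2.
 * Even offset: 16 two-byte chunks (the uchar2 path in the kernel).
 * Odd offset: single bytes. memcpy keeps the host code free of
 * unaligned pointer casts. */
static void xor32_mod2(uint8_t *dst, const uint8_t *src, unsigned offset)
{
    switch (offset & 1u) {
    case 0:
        for (unsigned i = 0; i < XOR_LEN; i += 2u) {
            uint16_t d, s;
            memcpy(&d, dst + i, sizeof d);
            memcpy(&s, src + i, sizeof s);
            d = (uint16_t)(d ^ s);
            memcpy(dst + i, &d, sizeof d);
        }
        break;
    default:
        xor32_bytes(dst, src);
        break;
    }
}

/* Returns 1 if both paths agree for every offset into a 256+32 byte buffer. */
static int xor32_mod2_selftest(void)
{
    uint8_t a[288], b[288], in[XOR_LEN];
    for (unsigned i = 0; i < sizeof a; ++i)
        a[i] = b[i] = (uint8_t)(i * 31u + 7u);
    for (unsigned i = 0; i < XOR_LEN; ++i)
        in[i] = (uint8_t)(i * 13u + 1u);
    for (unsigned off = 0; off < 256u; ++off) {
        xor32_mod2(a + off, in, off);
        xor32_bytes(b + off, in);
        if (memcmp(a, b, sizeof a) != 0)
            return 0;
    }
    return 1;
}
```

Every branch has to produce the same bytes as the byte-wise loop; only the access width changes. Which width is actually fastest on a given card is something the sketch cannot tell you, that has to be measured on the GPU.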

revelacaogr

Legendary

Offline

Activity: 1316
Merit: 1021

2009 Alea iacta est

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 26, 2015, 06:20:44 PM

#2763

Has anyone tested the new AMD APP SDK 3.0, and if so, what about GPU speeds? Thanks.

thevictimofuktyranny

Legendary

Offline

Activity: 1092
Merit: 1004

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 26, 2015, 06:34:37 PM

#2764

Quote from: revelacaogr on January 26, 2015, 06:20:44 PM

Has anyone tested the new AMD APP SDK 3.0, and if so, what about GPU speeds? Thanks.

I don't think there is much point in testing it, as the fastest drivers are the 14.7 RC1 to RC3.

Those drivers were released last year and will be incompatible with AMD APP SDK 3.0.

Secondly, with 14.7 RC1 to RC3 it is better not to install the AMD APP SDK, because it does not make any speed difference at all.

scryptr

Legendary

Offline

Activity: 1793
Merit: 1028

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 26, 2015, 06:51:47 PM

#2765

Quote from: scryptr on January 25, 2015, 08:55:35 PM

MIXED DRIVER VERSIONS--

I have read various posts about putting the driver files "in the miner directory". How does a person do this?

Here is my situation. I have a uATX mother board with on-board graphics. The video chip will run with AMD drivers 13.x, but no later. The MB BIOS can be set to use additional video cards. What I want to do is install Ubuntu on the rig, use AMD 13.x drivers, and run the monitor with the installed AMD 13.x drivers. I then want to put AMD 14.6 drivers "in the mining directory", and compile and mine with the additional video cards.

How to do this? Any help? --scryptr

Talking to myself --

Sorry, don't mean to be a bother, but I have read several posts about installing one driver for the system, and then placing 14.x drivers in the mining directory for miner compilation. How does one do this? Is there a description or how-to on the web? I have been googling my eyes out... --scryptr

TIPS: BTC - 1Fs4uZ6a9ABYBTaHGUfqcwCQmeBRxkKRQT DASH - XrK81tW31SLsVvZ2WX9VhTjpT6GXJPLdbQ
SCRYPTR'S NOTEBOOK: https://bitcointalk.org/index.php?topic=5035515.msg46035530#msg46035530
GITHUB: "github.com/scryptr" MERIT is appreciated, also. Thanks!

thevictimofuktyranny

Legendary

Offline

Activity: 1092
Merit: 1004

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 26, 2015, 06:59:25 PM

#2766

Quote from: scryptr on January 26, 2015, 06:51:47 PM

Talking to myself --

Sorry, don't mean to be a bother, but I have read several posts about installing one driver for the system, and then placing 14.x drivers in the mining directory for miner compilation. How does one do this? Is there a description or how-to on the web? I have been googling my eyes out... --scryptr

Pretty easy: you install the driver (13.12, as an example) needed to create the bin file. Once the bin file has been created, you shut sgminer down and uninstall the 13.12 driver.

Then you install whichever GPU driver is fastest for mining or gaming, depending on your priorities.

You don't need to do this when using Wolf0's leaked x11 mod or x13 mod, each of which hashes 50% faster than the official sgminer release - the bin files are already made to work with a modded kernel.

scryptr

Legendary

Offline

Activity: 1793
Merit: 1028

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 26, 2015, 07:36:23 PM

#2767

Quote from: thevictimofuktyranny on January 26, 2015, 06:59:25 PM

Pretty easy: you install the driver (13.12, as an example) needed to create the bin file. Once the bin file has been created, you shut sgminer down and uninstall the 13.12 driver.

Then you install whichever GPU driver is fastest for mining or gaming, depending on your priorities.

You don't need to do this when using Wolf0's leaked x11 mod or x13 mod, each of which hashes 50% faster than the official sgminer release - the bin files are already made to work with a modded kernel.

THANK YOU --

Thank you for replying, but I want to retain 13.x drivers for the system. The on-board GPU will not run on anything higher. I have read posts about installing 14.x drivers "in the mining directory" for the purpose of compiling and running the miner.

The on-board GPU will be used for the monitor, not for mining. The hot 14.x drivers will only be used for compiling and running the miner. I am running Ubuntu Linux 14.04. I imagine that a creative PATH environment variable is involved.

UNLESS, you mean that Wolf's *.bin file can be used with my required 13.x drivers. If that is the case, I'll do just that. I suspect that some posters may be referring to the *.bin "binary" file as a "driver".

Again, thanks for the response. I hope someone can point me to the solution. --scryptr

dhsc19

Member

Offline

Activity: 96
Merit: 10

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 26, 2015, 07:39:37 PM

#2768

Does anyone know any significant advantages/disadvantages between these two Linux drivers?

linux-amd-14.41rc1-opencl2-sep19.zip

vs

linux-amd-catalyst-14.6-beta-v1.0-jul11.zip

Or is there a different recommended driver I need to look for best performance?

cat77

Newbie

Offline

Activity: 18
Merit: 0

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 26, 2015, 07:48:52 PM
Last edit: January 26, 2015, 10:48:28 PM by cat77

#2769

Here are my neoscrypt.cl and sgminer configurations for 7950 and 280x. I use the 14.6 driver, found no difference using 14.7.
My 7950 does 320KHs at GPU clock 1000 MHz, memclock 1250 MHz, 1.081V GPU core voltage
My 280X does 360KHs at GPU clock 1036 MHz, memclock 1500 MHz, 1.025V GPU core voltage

The XORBytesInPlace function needs to change depending on whether the card is a 280X or a 7950. Each card hashes faster with a different variant; I have not examined why. So, edit the neoscrypt.cl file and look at the XORBytesInPlace function. Change the section to match the card you are using by commenting the card-specific blocks in or out.

7950 config:
sgminer.exe -k neoscrypt --worksize 64 --rawintensity 4584 -g 4 -o stratum+tcp://stratum.ftc.theblocksfactory.com:3333 -u USER -p PASSWORD

280X config
sgminer.exe -k neoscrypt --worksize 64 --rawintensity 5120 -g 4 -o stratum+tcp://stratum.ftc.theblocksfactory.com:3333 -u USER -p PASSWORD

Feel free to criticize and / or offer up improvements. Basically, I have about 9 hours of OpenCL programming experience, I don't claim to be an expert by any means. I only claim to be 20% faster than the POS neoscrypt.cl file on Nicehash. Whoever posted that certainly wouldn't qualify as an expert either.

Last time I played with this stuff, I made other changes which make a 7950 run at 355KHs, 1000MHz GPU, 1250 Memclock. But that was a bit unstable, 3% HW errors. If I get around to playing with it and get it clean and stable, I will post up new code and config. Relatively speaking, this would then push the 280X to near 400Khs at 1036Mhz.

Donations: 1D4yYxmH44Xg4J2GuQ5ppfUKS7ohiJaD21

Code:

/* NeoScrypt(128, 2, 1) with Salsa20/20 and ChaCha20/20 */
/* Adapted and improved for 14.x drivers by Wolf9466 (Wolf`) */

// Stupid AMD compiler ignores the unroll pragma in these two
#define SALSA_SMALL_UNROLL 3
#define CHACHA_SMALL_UNROLL 3

// If SMALL_BLAKE2S is defined, BLAKE2S_UNROLL is interpreted
// as the unroll factor; must divide cleanly into ten.
// Usually a bad idea.
// #define SMALL_BLAKE2S
// #define BLAKE2S_UNROLL 5

#define BLOCK_SIZE           64U
#define FASTKDF_BUFFER_SIZE 256U
#ifndef PASSWORD_LEN
#define PASSWORD_LEN         80U
#endif

#if !defined(cl_khr_byte_addressable_store)
#error "Device does not support unaligned stores"
#endif

void CopyBytes(void *restrict dst, const void *restrict src, uint len)
{
    for(int i = 0; i < len; ++i)
		((uchar *)dst)[i] = ((uchar *)src)[i];
}

void CopyBytes32(void *restrict dst, const void *restrict src)
{
   	#pragma unroll 4
    for(int i = 31; i > 0; i-=8) 
    {
    		((uchar *)dst)[i] = ((uchar *)src)[i];    
    		((uchar *)dst)[i-1] = ((uchar *)src)[i-1];
    		((uchar *)dst)[i-2] = ((uchar *)src)[i-2];
    		((uchar *)dst)[i-3] = ((uchar *)src)[i-3];
    		((uchar *)dst)[i-4] = ((uchar *)src)[i-4];
    		((uchar *)dst)[i-5] = ((uchar *)src)[i-5];
    		((uchar *)dst)[i-6] = ((uchar *)src)[i-6];    
    		((uchar *)dst)[i-7] = ((uchar *)src)[i-7];
    } 
}

void CopyBytes64(void *restrict dst, const void *restrict src)
{
	#pragma unroll 8
    for(int i = 63; i > 0; i-=8) 
    {
  		((uchar *)dst)[i] = ((uchar *)src)[i];    
  		((uchar *)dst)[i-1] = ((uchar *)src)[i-1];
  		((uchar *)dst)[i-2] = ((uchar *)src)[i-2];
  		((uchar *)dst)[i-3] = ((uchar *)src)[i-3];
  		((uchar *)dst)[i-4] = ((uchar *)src)[i-4];
  		((uchar *)dst)[i-5] = ((uchar *)src)[i-5];
  		((uchar *)dst)[i-6] = ((uchar *)src)[i-6];    
  		((uchar *)dst)[i-7] = ((uchar *)src)[i-7];
    } 
}


void XORBytesInPlace(void *restrict dst, const void *restrict src, uchar bufidx)
{

/*
// for 7950
  switch(bufidx & 0x03)
  {
  case 0:
    	  ((ulong4 *)dst)[0] ^= ((ulong4 *)src)[0]; 
      break;
// end for 7950
*/


// for 280X
  switch( bufidx & 0x03)
  {
  case 0:
    #pragma unroll 2
    for(int i = 0; i < 4; i+=2)
    {       
    	  ((uint2 *)dst)[i] ^= ((uint2 *)src)[i];       
    	  ((uint2 *)dst)[i+1] ^= ((uint2 *)src)[i+1]; 
    }
    break;  

  case 2:  
    #pragma unroll 8
    for(int i = 0; i < 16; i+=2)
    {
    	  ((uchar2 *)dst)[i] ^= ((uchar2 *)src)[i]; 
    	  ((uchar2 *)dst)[i+1] ^= ((uchar2 *)src)[i+1]; 
    }
    break;
//  end for 280X


  default:
  #pragma unroll 8
   for(int i = 0; i < 32; i+=4)
   {
  	  ((uchar *)dst)[i] ^= ((uchar *)src)[i];
  	  ((uchar *)dst)[i+1] ^= ((uchar *)src)[i+1];
  	  ((uchar *)dst)[i+2] ^= ((uchar *)src)[i+2];
  	  ((uchar *)dst)[i+3] ^= ((uchar *)src)[i+3];   
    }
  }
} 

void XORBytes(void *restrict dst, const void *restrict src1, const void *restrict src2, uint len)
{
	#pragma unroll 1
	for(int i = 0; i < len; ++i)
		((uchar *)dst)[i] = ((uchar *)src1)[i] ^ ((uchar *)src2)[i];
}


// Blake2S

#define BLAKE2S_BLOCK_SIZE    64U
#define BLAKE2S_OUT_SIZE      32U
#define BLAKE2S_KEY_SIZE      32U

static const __constant uint BLAKE2S_IV_1[16] =
{
    0x6B08C647, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
    0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19,
    0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
    0x510E523F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
};

static const __constant uint BLAKE2S_IV_2[8] =
{
    0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
    0x510E52FF, 0x9B05688C, 0xE07C2654, 0x5BE0CD19
};

static const __constant uchar BLAKE2S_SIGMA[10][16] =
{
    {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 } ,
    { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 } ,
    { 11,  8, 12,  0,  5,  2, 15, 13, 10, 14,  3,  6,  7,  1,  9,  4 } ,
    {  7,  9,  3,  1, 13, 12, 11, 14,  2,  6,  5, 10,  4,  0, 15,  8 } ,
    {  9,  0,  5,  7,  2,  4, 10, 15, 14,  1, 11, 12,  6,  8,  3, 13 } ,
    {  2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9 } ,
    { 12,  5,  1, 15, 14, 13,  4, 10,  0,  7,  6,  3,  9,  2,  8, 11 } ,
    { 13, 11,  7, 14, 12,  1,  3,  9,  5,  0, 15,  4,  8,  6,  2, 10 } ,
    {  6, 15, 14,  9, 11,  3,  0,  8, 12,  2, 13,  7,  1,  4, 10,  5 } ,
    { 10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13 , 0 } ,
};


#define BLAKE_G(idx0, idx1, a, b, c, d, key)	do { \
  for(int i=0; i< 2; ++i) {\
  a += b + key[BLAKE2S_SIGMA[idx0][idx1 + i]]; \
  d = rotate(d ^ a, ( i << 3 )+16U ); \
	c += d; \
	b = rotate(b ^ c, ( i + (i<<2))+20U) ; \
  }\
} while(0)


void Blake2S(uint *restrict inout, const uint *restrict inkey)
{
	uint16 V;
	uint8 tmpblock;
 
	// Load first block (IV into V.lo) and constants (IV into V.hi)
	V = vload16(0U, BLAKE2S_IV_1);
  tmpblock = V.lo; 

	// Compress state, using the key as the key

	#ifdef SMALL_BLAKE2S
	#pragma unroll BLAKE2S_UNROLL
	#else
	#pragma unroll 10
	#endif

      for(int x = 0; x < 10; ++x)
    	{
    		BLAKE_G(x, 0x00, V.s0, V.s4, V.s8, V.sc, inkey);
    		BLAKE_G(x, 0x02, V.s1, V.s5, V.s9, V.sd, inkey);
    		BLAKE_G(x, 0x04, V.s2, V.s6, V.sa, V.se, inkey);
    		BLAKE_G(x, 0x06, V.s3, V.s7, V.sb, V.sf, inkey);	
        BLAKE_G(x, 0x08, V.s0, V.s5, V.sa, V.sf, inkey);
    		BLAKE_G(x, 0x0A, V.s1, V.s6, V.sb, V.sc, inkey);
    		BLAKE_G(x, 0x0C, V.s2, V.s7, V.s8, V.sd, inkey);
    		BLAKE_G(x, 0x0E, V.s3, V.s4, V.s9, V.se, inkey);
     	}
    
    	// XOR low part of state with the high part,
    	// then with the original input block.
    	tmpblock = V.lo = V.lo ^ V.hi ^ tmpblock;
    
    	// Load constants (IV into V.hi)
    	V.hi = vload8(0U, BLAKE2S_IV_2);

	// Compress block, using the input as the key
	#ifdef SMALL_BLAKE2S
	#pragma unroll BLAKE2S_UNROLL
	#else
	#pragma unroll 10
	#endif
	for(int x = 0; x < 10; x++)
	{
		BLAKE_G(x, 0x00, V.s0, V.s4, V.s8, V.sc, inout);
		BLAKE_G(x, 0x02, V.s1, V.s5, V.s9, V.sd, inout);
		BLAKE_G(x, 0x04, V.s2, V.s6, V.sa, V.se, inout);
		BLAKE_G(x, 0x06, V.s3, V.s7, V.sb, V.sf, inout);
		BLAKE_G(x, 0x08, V.s0, V.s5, V.sa, V.sf, inout);
		BLAKE_G(x, 0x0A, V.s1, V.s6, V.sb, V.sc, inout);
		BLAKE_G(x, 0x0C, V.s2, V.s7, V.s8, V.sd, inout);
		BLAKE_G(x, 0x0E, V.s3, V.s4, V.s9, V.se, inout);
	}

	// Store result in input/output buffer
	vstore8(V.lo ^ V.hi ^ tmpblock, 0, inout);
}


/* FastKDF, a fast buffered key derivation function:
 * FASTKDF_BUFFER_SIZE must be a power of 2;
 * password_len, salt_len and output_len should not exceed FASTKDF_BUFFER_SIZE;
 * prf_output_size must be <= prf_key_size; */
void fastkdf(const uchar *restrict password, const uchar *restrict salt, const uint salt_len, uchar *restrict output, uint output_len)
{

	/*                    WARNING!
	 * This algorithm uses byte-wise addressing for memory blocks.
	 * Or in other words, trying to copy an unaligned memory region
	 * will significantly slow down the algorithm, when copying uses
	 * words or bigger entities. It even may corrupt the data, when
	 * the device does not support it properly.
	 * Therefore use byte copying, which will not be the fastest but will
	 * at least give reliable results. */

	// BLOCK_SIZE            64U
	// FASTKDF_BUFFER_SIZE  256U
	// BLAKE2S_BLOCK_SIZE    64U
	// BLAKE2S_KEY_SIZE      32U
	// BLAKE2S_OUT_SIZE      32U

  uchar bufidx = 0;
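  // uint8 is 32 bytes: A is accessed up to byte 256 + 64 and B is written up
  // to byte 256 + 80 below, so 9 elements (288 bytes) are not enough.
  uint8 Abuffer[10], Bbuffer[11] = { (uint8)(0) };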
	uchar *A = (uchar *)Abuffer, *B = (uchar *)Bbuffer;
  uint i;
  
	// Initialize the password buffer
  #pragma unroll 5
  for( i = 0; i < 5; i++ )
      ((ulong2 *)A)[i] = ((ulong2 *)A)[i+5] = ((ulong2 *)A)[i+10] = ((ulong2 *)password)[i];   
  ((ulong2 *)A)[15] = ((ulong2 *)password)[0];
 
	((ulong8 *)(A + FASTKDF_BUFFER_SIZE))[0] = ((ulong8 *)password)[0];

	// Initialize the salt buffer
	if( !(salt_len ^ FASTKDF_BUFFER_SIZE))
	{
		((ulong16 *)B)[0] = ((ulong16 *)B)[2] = ((ulong16 *)salt)[0];
		((ulong16 *)B)[1] = ((ulong16 *)B)[3] = ((ulong16 *)salt)[1];
	}
	else
	{
		// salt_len is 80 bytes here

		#pragma unroll 5 	
    for( i = 0; i < 5; i++)
       ((ulong2 *)B)[i] = ((ulong2 *)B)[i+5] = ((ulong2 *)B)[i+10] = ((ulong2 *)salt)[i];
    ((ulong2 *)B)[15] = ((ulong2 *)salt)[0];

//		for(int i = 0; i < (FASTKDF_BUFFER_SIZE >> 3); ++i) ((ulong *)B)[i] = ((ulong *)salt)[i % 10];

		// Initialized the rest to zero earlier
      ((ulong8 *)(B + FASTKDF_BUFFER_SIZE))[0] = ((ulong8 *)salt)[0];
      ((ulong2 *)(B + FASTKDF_BUFFER_SIZE))[4] = ((ulong2 *)salt)[4];
	}

		// Make the key buffer twice the size of the key so it fits a Blake2S block
		// This way, we don't need a temp buffer in the Blake2S function.
		uchar input[BLAKE2S_BLOCK_SIZE], key[BLAKE2S_BLOCK_SIZE] = { 0 };    

    // The primary iteration
    #pragma unroll 1
    for(i = 0; i < 32; ++i)
    {   
    		// Copy input and key to their buffers
    		CopyBytes64(input, A + bufidx); 
        CopyBytes32(key, B + bufidx);

            // PRF
            Blake2S((uint *)input, (uint *)key);
    
            // Calculate the next buffer pointer
    
        bufidx = 0;   
        #pragma unroll 2
        for(int k = 0; k < 31; k+=16) {
          bufidx += input[k] + input[k+1] + input[k+2] + input[k+3] + input[k+4] + input[k+5] + input[k+6] + input[k+7];         
    			bufidx += input[k+8] + input[k+9] + input[k+10] + input[k+11] + input[k+12] + input[k+13] + input[k+14] + input[k+15]; 
        }    // Modify the salt buffer
    	      
        XORBytesInPlace(B + bufidx, input, bufidx );

    		if(  bufidx < BLAKE2S_KEY_SIZE )
    		{
    			// Head modified, tail updated
    			CopyBytes(B + FASTKDF_BUFFER_SIZE + bufidx, B + bufidx, BLAKE2S_KEY_SIZE - bufidx );
    		}
//    		else if( (FASTKDF_BUFFER_SIZE - bufidx ) < BLAKE2S_OUT_SIZE )
        else if ( bufidx > 224 )
    		{
    			// Tail modified, head updated
    			CopyBytes(B, B + FASTKDF_BUFFER_SIZE, bufidx - 224);
    		}
    }

    // Modify and copy into the output buffer

	if( (FASTKDF_BUFFER_SIZE - bufidx) < output_len)
	{
		XORBytes(output, B + bufidx, A, (FASTKDF_BUFFER_SIZE - bufidx));
		XORBytes(output + (FASTKDF_BUFFER_SIZE - bufidx), B, A + (FASTKDF_BUFFER_SIZE - bufidx), output_len - (FASTKDF_BUFFER_SIZE - bufidx));
	}
	else
      XORBytes(output, B + bufidx, A, output_len);    
}


#define SALSA_CORE(state) do { \
  state.s49e3 ^= rotate(state.s05af + state.sc16b, (uint4)( 7U,  7U,  7U,  7U)); \
  state.s8d27 ^= rotate(state.s49e3 + state.s05af, (uint4)( 9U,  9U,  9U,  9U)); \
  state.sc16b ^= rotate(state.s8d27 + state.s49e3, (uint4)(13U, 13U, 13U, 13U)); \
  state.s05af ^= rotate(state.sc16b + state.s8d27, (uint4)(18U, 18U, 18U, 18U)); \
  \
  state.s16bc ^= rotate(state.s05af + state.s349e, (uint4)( 7U,  7U,  7U,  7U)); \
  state.s278d ^= rotate(state.s16bc + state.s05af, (uint4)( 9U,  9U,  9U,  9U)); \
  state.s349e ^= rotate(state.s278d + state.s16bc, (uint4)(13U, 13U, 13U, 13U)); \
  state.s05af ^= rotate(state.s349e + state.s278d, (uint4)(18U, 18U, 18U, 18U)); \
} while(0)


uint16 salsa_small_scalar_rnd(uint16 X)
{
	uint16 st = X;

	#if SALSA_SMALL_UNROLL == 1

	for(int i = 0; i < 10; ++i)
	{
		SALSA_CORE(st);
	}

	#elif SALSA_SMALL_UNROLL == 2

	for(int i = 0; i < 5; ++i)
	{
		SALSA_CORE(st);
		SALSA_CORE(st);
	}

	#elif SALSA_SMALL_UNROLL == 3

//	for(int i = 0; i < 4; ++i)

  uint i = 4;
  while (i--) 
	{
		SALSA_CORE(st);
		if( !i ) break;
		SALSA_CORE(st);
		SALSA_CORE(st);
	} 

	#elif SALSA_SMALL_UNROLL == 4

	for(int i = 0; i < 3; ++i)
	{
		SALSA_CORE(st);
		SALSA_CORE(st);
		if(i == 2) break;
		SALSA_CORE(st);
		SALSA_CORE(st);
	}

	#else

	for(int i = 0; i < 2; ++i)
	{
		SALSA_CORE(st);
		SALSA_CORE(st);
		SALSA_CORE(st);
		SALSA_CORE(st);
		SALSA_CORE(st);
	}

	#endif

	return(X + st);
}

#define CHACHA_CORE_PARALLEL(state)	do { \
	state[0] += state[1]; state[3] = rotate(state[3] ^ state[0], (uint4)(16U, 16U, 16U, 16U)); \
	state[2] += state[3]; state[1] = rotate(state[1] ^ state[2], (uint4)(12U, 12U, 12U, 12U)); \
	state[0] += state[1]; state[3] = rotate(state[3] ^ state[0], (uint4)(8U, 8U, 8U, 8U)); \
	state[2] += state[3]; state[1] = rotate(state[1] ^ state[2], (uint4)(7U, 7U, 7U, 7U)); \
	\
	state[0] += state[1].yzwx; state[3].wxyz = rotate(state[3].wxyz ^ state[0], (uint4)(16U, 16U, 16U, 16U)); \
	state[2].zwxy += state[3].wxyz; state[1].yzwx = rotate(state[1].yzwx ^ state[2].zwxy, (uint4)(12U, 12U, 12U, 12U)); \
	state[0] += state[1].yzwx; state[3].wxyz = rotate(state[3].wxyz ^ state[0], (uint4)(8U, 8U, 8U, 8U)); \
	state[2].zwxy += state[3].wxyz; state[1].yzwx = rotate(state[1].yzwx ^ state[2].zwxy, (uint4)(7U, 7U, 7U, 7U)); \
} while(0)

uint16 chacha_small_parallel_rnd(uint16 X)
{
	uint4 st[4];

	((uint16 *)st)[0] = X;

	#if CHACHA_SMALL_UNROLL == 1

	for(int i = 0; i < 10; ++i)
	{
		CHACHA_CORE_PARALLEL(st);
	}

	#elif CHACHA_SMALL_UNROLL == 2

	for(int i = 0; i < 5; ++i)
	{
		CHACHA_CORE_PARALLEL(st);
		CHACHA_CORE_PARALLEL(st);
	}

	#elif CHACHA_SMALL_UNROLL == 3

//	for(int i = 0; i < 4; ++i)
  
  int i = 4;
  while (i--)
	{
		CHACHA_CORE_PARALLEL(st);
		if( !i ) break;
		CHACHA_CORE_PARALLEL(st);    
		CHACHA_CORE_PARALLEL(st);

	}

	#elif CHACHA_SMALL_UNROLL == 4

	for(int i = 0; i < 3; ++i)
	{
		CHACHA_CORE_PARALLEL(st);
		CHACHA_CORE_PARALLEL(st);
		if(i == 2) break;
		CHACHA_CORE_PARALLEL(st);
		CHACHA_CORE_PARALLEL(st);
	}

	#else

	for(int i = 0; i < 2; ++i)
	{
		CHACHA_CORE_PARALLEL(st);
		CHACHA_CORE_PARALLEL(st);
		CHACHA_CORE_PARALLEL(st);
		CHACHA_CORE_PARALLEL(st);
		CHACHA_CORE_PARALLEL(st);
	}

	#endif

	return(X + ((uint16 *)st)[0]);
}

void neoscrypt_blkmix(uint16 *XV, uint alg)
{
  uint16 TX;

    /* NeoScrypt flow:                   Scrypt flow:
         Xa ^= Xd;  M(Xa'); Ya = Xa";      Xa ^= Xb;  M(Xa'); Ya = Xa";
         Xb ^= Xa"; M(Xb'); Yb = Xb";      Xb ^= Xa"; M(Xb'); Yb = Xb";
         Xc ^= Xb"; M(Xc'); Yc = Xc";      Xa" = Ya;
         Xd ^= Xc"; M(Xd'); Yd = Xd";      Xb" = Yb;
         Xa" = Ya; Xb" = Yc;
         Xc" = Yb; Xd" = Yd; */
      
    if (!alg)
    {
    		XV[0] = salsa_small_scalar_rnd( XV[0] ^ XV[3] ); 
        TX =    salsa_small_scalar_rnd( XV[1] ^ XV[0] ); 
        XV[1] = salsa_small_scalar_rnd( XV[2] ^ TX ); 
        XV[3] = salsa_small_scalar_rnd( XV[3] ^ XV[1] ); 
    }
    else
    {
        XV[0] = chacha_small_parallel_rnd(XV[0] ^ XV[3] ); 
    		TX =    chacha_small_parallel_rnd(XV[1] ^ XV[0] ); 
    		XV[1] = chacha_small_parallel_rnd(XV[2] ^ TX); 
        XV[3] = chacha_small_parallel_rnd(XV[3] ^ XV[1] );      
    } 
    XV[2] = TX;      
}


void SMix(ulong16 *X, __global ulong16 *V, uint flag)
{
  uint idx;
  uint i = 0; 
    do {
       	V[i++]   = X[0];
 	      V[i++]   = X[1];           
        neoscrypt_blkmix(X, flag);
    }   while (i ^ 256);
    do {
        idx =  (((uint *)X)[48])<<1 & 0xFE;
        X[0] ^= V[idx++];
 	      X[1] ^= V[idx];    
        neoscrypt_blkmix(X, flag);    
        i-=2;
    } while (i);
}


__attribute__((reqd_work_group_size(WORKSIZE, 1, 1)))
__kernel void search(__global const uchar* restrict input, __global uint* restrict output, __global uchar *padcache, const uint target)
{
#define CONSTANT_N 128
#define CONSTANT_r 2
	// X = CONSTANT_r * 2 * BLOCK_SIZE(64); Z is a copy of X for ChaCha
	uint16 X[4], Z[4];
  bool flag = false; 
 
	/* V = CONSTANT_N * CONSTANT_r * 2 * BLOCK_SIZE */
	__global ulong16 *V = (__global ulong16 *)(padcache + ( (get_global_id(0) % MAX_GLOBAL_THREADS) << 15 ));
	uchar outbuf[32];
	uchar data[PASSWORD_LEN];

	((ulong8 *)data)[0] = ((__global const ulong8 *)input)[0];
	((ulong *)data)[8] = ((__global const ulong *)input)[8];
	((uint *)data)[18] = ((__global const uint *)input)[18];
	((uint *)data)[19] = get_global_id(0);

    // X = KDF(password, salt)
	fastkdf(data, data, PASSWORD_LEN, (uchar *)X, 256);

    // Process ChaCha 1st, Salsa 2nd and XOR them - run that through PBKDF2
//    CopyBytes128(Z, X, 2);

		((ulong16 *)Z)[0] = ((ulong16 *)X)[0];
    ((ulong16 *)Z)[1] = ((ulong16 *)X)[1];

    // X = SMix(X); X & Z are swapped, repeat.

    for( ;; ++flag)
    {
      SMix(X, V, flag);
      if (flag) break;      
//   		SwapBytes128(X, Z, 256);  
   		((ulong16 *)X)[0] ^= ((ulong16 *)Z)[0];
  		((ulong16 *)Z)[0] ^= ((ulong16 *)X)[0];
  		((ulong16 *)X)[0] ^= ((ulong16 *)Z)[0];
  		((ulong16 *)X)[1] ^= ((ulong16 *)Z)[1];
  		((ulong16 *)Z)[1] ^= ((ulong16 *)X)[1];
  		((ulong16 *)X)[1] ^= ((ulong16 *)Z)[1];  
   	}
        
	// blkxor(X, Z)
	((ulong16 *)X)[0] ^= ((ulong16 *)Z)[0];
	((ulong16 *)X)[1] ^= ((ulong16 *)Z)[1];

	// output = KDF(password, X)
	fastkdf(data, (uchar *)X, FASTKDF_BUFFER_SIZE, outbuf, 32);
	if(((uint *)outbuf)[7] <= target) output[atomic_add(output + 0xFF, 1)] = get_global_id(0);
}
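
The least obvious part of fastkdf in the kernel above is the buffer pointer: bufidx is the sum of the first 32 PRF output bytes modulo 256, and the 32 bytes past offset 256 of B mirror its first 32 bytes so that a read starting near the end does not need to wrap. The host-side C sketch below restates just that logic so it can be tested without a GPU. It is an illustration under stated assumptions, not a replacement for the kernel code: the function names are hypothetical, and B is sized at the 256 + 32 bytes the mirror needs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FASTKDF_BUFFER_SIZE 256u
#define BLAKE2S_KEY_SIZE     32u

/* Next buffer pointer: sum of the 32 PRF output bytes, reduced mod 256.
 * Keeping the sum in a uint8_t does the reduction for free, which is why
 * the kernel declares bufidx as uchar. */
static uint8_t next_bufidx(const uint8_t prf_out[32])
{
    uint8_t idx = 0;
    for (unsigned k = 0; k < 32u; ++k)
        idx = (uint8_t)(idx + prf_out[k]);
    return idx;
}

/* After 32 bytes at B + bufidx were modified, bring B[0..31] and its
 * mirror B[256..287] back in sync. */
static void sync_mirror(uint8_t *B, uint8_t bufidx)
{
    if (bufidx < BLAKE2S_KEY_SIZE) {
        /* Head modified: copy B[bufidx..31] into the tail mirror. */
        memcpy(B + FASTKDF_BUFFER_SIZE + bufidx, B + bufidx,
               BLAKE2S_KEY_SIZE - bufidx);
    } else if (bufidx > FASTKDF_BUFFER_SIZE - BLAKE2S_KEY_SIZE) {
        /* The write spilled past offset 256: copy those bytes to the head. */
        memcpy(B, B + FASTKDF_BUFFER_SIZE,
               bufidx - (FASTKDF_BUFFER_SIZE - BLAKE2S_KEY_SIZE));
    }
}

/* Returns 1 if the mirror stays consistent for every possible bufidx. */
static int mirror_selftest(void)
{
    uint8_t B[FASTKDF_BUFFER_SIZE + BLAKE2S_KEY_SIZE];
    for (unsigned i = 0; i < FASTKDF_BUFFER_SIZE; ++i)
        B[i] = (uint8_t)(i * 5u + 3u);
    memcpy(B + FASTKDF_BUFFER_SIZE, B, BLAKE2S_KEY_SIZE);
    for (unsigned idx = 0; idx < 256u; ++idx) {
        for (unsigned i = 0; i < 32u; ++i)      /* stand-in for the XOR step */
            B[idx + i] ^= (uint8_t)(idx + i + 1u);
        sync_mirror(B, (uint8_t)idx);
        if (memcmp(B, B + FASTKDF_BUFFER_SIZE, BLAKE2S_KEY_SIZE) != 0)
            return 0;
    }
    return 1;
}
```

The constant 224 in the kernel's "else if ( bufidx > 224 )" is FASTKDF_BUFFER_SIZE - BLAKE2S_OUT_SIZE written out, and "bufidx - 224" is the number of bytes that spilled past offset 256.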

Eastwind

Hero Member

Offline

Activity: 896
Merit: 1000

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 26, 2015, 10:50:01 PM

#2770

Quote from: cat77 on January 26, 2015, 07:48:52 PM

Here are my neoscrypt.cl and sgminer configurations for 7950 and 280x. I use the 14.6 driver, found no difference using 14.7.
My 7950 does 320KHs at GPU clock 1000 MHz, memclock 1250 MHz, 1.081V GPU core voltage
My 280X does 360KHs at GPU clock 1036 MHz, memclock 1500 MHz, 1.025V GPU core voltage

Thanks a lot. Working well!

chup

Sr. Member

Offline

Activity: 736
Merit: 262

Me, Myself & I

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 27, 2015, 08:19:59 AM

#2771

Quote from: scryptr on January 26, 2015, 07:36:23 PM

Thank you for replying, but I want to retain 13.x drivers for the system. The on-board GPU will not run on anything higher. I have read posts about installing 14.x drivers "in the mining directory" for the purpose of compiling and running the miner.

The on-board GPU will be used for the monitor, not for mining. The hot 14.x drivers will only be used for compiling and running the miner. I am running Ubuntu Linux 14.04. I imagine that a creative PATH environment variable is involved.

UNLESS, you mean that Wolf's *.bin file can be used with my required 13.x drivers. If that is the case, I'll do just that. I suspect that some posters may be referring to the *.bin "binary" file as a "driver".

Again, thanks for the response. I hope someone can point me to the solution. --scryptr

You only need to unpack (without installing) the driver installation package for the version you want, and copy just two OpenCL files from that package into the sgminer directory:
amd_opencl32.dll, or amd_opencl64.dll if the miner is x64
amdocl.dll, or amdocl64.dll if the miner is x64

ViperRUS

Newbie

Offline

Activity: 5
Merit: 0

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 27, 2015, 09:32:12 AM

#2772

What are the maximum hashrates for the free kernels X11 and X13 on R280x (1100/1500Mhz) now? My values are X11 - 6.4Mh/s, X13 - 3.5Mh/s.

dhsc19

Member

Offline

Activity: 96
Merit: 10

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 27, 2015, 05:13:15 PM

#2773

Quote from: Wolf0 on January 27, 2015, 09:43:24 AM

Quote from: ViperRUS on January 27, 2015, 09:32:12 AM

What are the maximum hashrates for the free kernels X11 and X13 on R280x (1100/1500Mhz) now? My values are X11 - 6.4Mh/s, X13 - 3.5Mh/s.

That's about right - my x13 hasn't been released to the general public yet.

Hi Wolf0,

I compiled sgminer from source (https://github.com/badman74/sgminer) for my Ubuntu 14.04 machine. I'm only getting about 3.5Mh/s for x11. Do I still need to import your modified kernel? Or am I doing something wrong?

dhsc19

Member

Offline

Activity: 96
Merit: 10

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 27, 2015, 05:40:08 PM

#2774

Quote from: Wolf0 on January 27, 2015, 05:25:07 PM

Quote from: dhsc19 on January 27, 2015, 05:13:15 PM

Quote from: Wolf0 on January 27, 2015, 09:43:24 AM

Quote from: ViperRUS on January 27, 2015, 09:32:12 AM

What are the maximum hashrates for the free X11 and X13 kernels on an R9 280X (1100/1500 MHz) now? My values are X11 - 6.4 MH/s, X13 - 3.5 MH/s.

That's about right - my x13 hasn't been released to the general public yet.

Hi Wolf0,

I compiled sgminer from source (https://github.com/badman74/sgminer) for my Ubuntu 14.04 machine. I'm only getting about 3.5Mh/s for x11. Do I still need to import your modified kernel? Or am I doing something wrong?

You need to replace your kernel bin file with mine, using the same name - 3.5MH/s is far too slow, assuming you're on a 280X.

Yes, 280X. Can the bin file be directly used on Linux sgminer? (I've never used bin files outside of what was compiled on my machine, so I'm a bit unfamiliar with the cross-platform compatibility).

platinum4

Sr. Member

Offline

Activity: 547
Merit: 250

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 27, 2015, 05:41:33 PM

#2775

They should work, just rename the .bin file if you need to. I think his has l8 in the filename, some people need l4.
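
A rough sketch of the "replace it under the same name" step, in Python. It assumes you already know the exact name of the bin that sgminer generated on your machine. The function name and the `.orig` backup suffix are made up for illustration, and the file names in the usage note are placeholders, not real sgminer bin names.

```python
import shutil
from pathlib import Path


def install_kernel_bin(downloaded_bin, local_bin):
    """Put downloaded_bin in place of local_bin, keeping local_bin's name."""
    downloaded_bin, local_bin = Path(downloaded_bin), Path(local_bin)
    if local_bin.exists():
        # Keep the locally built bin so the swap can be undone.
        backup = local_bin.with_name(local_bin.name + ".orig")
        shutil.copy2(local_bin, backup)
    # Copying onto the local name covers the l8/l4 rename in the same step.
    shutil.copy2(downloaded_bin, local_bin)
    return local_bin
```

For example, `install_kernel_bin("downloads/kernel_l8.bin", "sgminer/kernel_l4.bin")` would leave the downloaded file in place under the l4 name and keep the old one as `kernel_l4.bin.orig`.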

dhsc19

Member

Offline

Activity: 96
Merit: 10

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 27, 2015, 06:19:22 PM

#2776

Quote from: Wolf0 on January 27, 2015, 05:46:55 PM

Quote from: platinum4 on January 27, 2015, 05:41:33 PM

They should work, just rename the .bin file if you need to. I think his has l8 in the filename, some people need l4.

Right - it's safe to replace anyway.

Ok. Thanks for the help.

dhsc19

Member

Offline

Activity: 96
Merit: 10

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 27, 2015, 06:28:33 PM

#2777

Quote from: Wolf0 on January 27, 2015, 05:46:55 PM

Quote from: platinum4 on January 27, 2015, 05:41:33 PM

They should work, just rename the .bin file if you need to. I think his has l8 in the filename, some people need l4.

Right - it's safe to replace anyway.

Oh, I also see that you made the kernel cl files available. Am I correct to assume that I can drop those in to replace the ones sgminer built when I compiled, and that sgminer should then generate a native bin file with the correct mods the next time I run it?

platinum4

Sr. Member

Offline

Activity: 547
Merit: 250

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 27, 2015, 06:35:41 PM

#2778

Quote from: dhsc19 on January 27, 2015, 06:28:33 PM

Quote from: Wolf0 on January 27, 2015, 05:46:55 PM

Quote from: platinum4 on January 27, 2015, 05:41:33 PM

They should work, just rename the .bin file if you need to. I think his has l8 in the filename, some people need l4.

Right - it's safe to replace anyway.

Oh, I also see that you made the kernel cl files available. Am I correct to assume that I can drop those in to replace the ones sgminer built when I compiled, and that sgminer should then generate a native bin file with the correct mods the next time I run it?

I don't think those are 'updated' .cl files; I could be wrong.

dhsc19

Member

Offline

Activity: 96
Merit: 10

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 27, 2015, 06:56:09 PM

#2779

ok

Quote from: Wolf0 on January 27, 2015, 06:48:21 PM

Quote from: dhsc19 on January 27, 2015, 06:28:33 PM

Quote from: Wolf0 on January 27, 2015, 05:46:55 PM

Quote from: platinum4 on January 27, 2015, 05:41:33 PM

They should work, just rename the .bin file if you need to. I think his has l8 in the filename, some people need l4.

Right - it's safe to replace anyway.

Oh, I also see that you made the kernel cl files available. Am I correct to assume that I can drop those in to replace the ones sgminer built when I compiled, and that sgminer should then generate a native bin file with the correct mods the next time I run it?

No, I didn't - those are very old - leaked with the bins. They don't have anywhere near the speed.

Alright...noted.

scryptr

Legendary

Offline

Activity: 1793
Merit: 1028

Re: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner

January 28, 2015, 06:21:41 AM

#2780

Quote from: chup on January 27, 2015, 08:19:59 AM

Quote from: scryptr on January 26, 2015, 07:36:23 PM

Thank you for replying, but I want to retain 13.x drivers for the system. The on-board GPU will not run on anything higher. I have read posts about installing 14.x drivers "in the mining directory", solely for the purpose of compiling and running the miner.

The on-board GPU will be used for the monitor, not for mining. The hot 14.x drivers will only be used for compiling and running the miner. I am running Ubuntu Linux 14.04. I imagine that a creative PATH environment variable is involved.

UNLESS, you mean that Wolf's *.bin file can be used with my required 13.x drivers. If that is the case, I'll do just that. I suspect that some posters may be referring to the *.bin "binary" file as a "driver".

Again, thanks for the response. I hope someone can point me to the solution. --scryptr

You only need to unpack the installation package for the driver version you want (without installing it) and copy just two OpenCL files from that package into the sgminer directory:
amd_opencl32.dll or amd_opencl64.dll (if the miner is x64)
amdocl.dll or amdocl64.dll (if the miner is x64)

Thank you for the information. I will be trying the above solution shortly, or one like it. My miner will be on an Ubuntu 14.04 OS, with drivers for Linux. I'll try to adapt your solution. --scryptr
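
One possible shape for the Linux adaptation, as a hedged Python sketch: on Linux the dynamic linker searches the directories in `LD_LIBRARY_PATH` before the system defaults, so the miner could be started with the unpacked library directory prepended. The function name and the directory argument are hypothetical, and nothing in this thread confirms that 14.x user-space libraries will actually run against a 13.x system driver.

```python
import os


def env_with_library_dir(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir first in LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend so the unpacked libraries are found before system-wide ones.
    if existing:
        env["LD_LIBRARY_PATH"] = lib_dir + os.pathsep + existing
    else:
        env["LD_LIBRARY_PATH"] = lib_dir
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching the miner, which leaves the rest of the system on its installed drivers.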

TIPS: BTC - 1Fs4uZ6a9ABYBTaHGUfqcwCQmeBRxkKRQT DASH - XrK81tW31SLsVvZ2WX9VhTjpT6GXJPLdbQ
SCRYPTR'S NOTEBOOK: https://bitcointalk.org/index.php?topic=5035515.msg46035530#msg46035530
GITHUB: "github.com/scryptr" MERIT is appreciated, also. Thanks!