Bitcoin Forum
Author Topic: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner  (Read 877795 times)
damm315er
January 04, 2015, 01:02:18 AM
 #2681

I don't think elpida memory could account for such a wide discrepancy in the hash or the HW errors.  At the very least a 290x should be equal with a 290, the only difference is the number of shaders at least for a reference card.  If it was a hynix 290x hitting 330+ and an elpida 290x hitting 310, then maybe I could understand.  I've got a hynix 290x in my test rig and I forgot what the other one is but I can test.  I realize you spent hours on your config for your 290 but if you feel like showing it I can start from there.  I've got 14.6 rc2 installed on the test rig and I'm going to install 14.9 tonight and drop the 14.6 in the mining directory (I also use wolf0's builds).  Do you use Stilt's bios?  I couldn't get stilt's bios stable for the X coins but I wonder if it will work for neoscrypt.  Maybe Stilt on neoscrypt will let us find the right ratio of gpu to memory clock (if neoscrypt is anything like scrypt).  Sad thing is this really doesn't make too much of a difference in profit.  

Yeah, right from the get-go tuning the GPU's for neoscrypt (and tuning them for scrypt as well) using the exact same settings would typically get the 290's with Hynix more hash than the 290x's with elpida.  Then once that peaked I split them off in different directions for tuning.

There was a single setting where the 290x got more hash than the 290, but that was back in the 30 to 60 kh/s range and was never repeatable with the newer kernel and drivers, even with the same settings.

It may be more than just the hynix/elpida thing, it could be in the card hardware or bios (never tried stilts).  I never dug deeper than the memory after I figured out why the 290's were outpacing the 290x's mining scrypt.  But, when I was doing scrypt the hashrates were much closer, so it could also be that the bottleneck in the kernel affects the 290x worse.

And you have another good point..  At this time hashrate isn't good for much more than bragging rights unless you have a farm..  I heard via the rumor mill that there's a massive GPU farm getting built.  If that is true, then unless there's some attractive new coins to draw the hash, the GPU mine-able coins are all going to get diluted even further.  It would help if BTC weren't tanking, but there was a big hype bubble to recover from..

My elpida 290x's always outperformed my hynix 290s, especially with Stilt's bios.  Stilt's bios was stable for the 2 hynix 290x on my test rig mining neoscrypt but it didn't seem to make a difference in the max hash I could get.  With either bios I could squeeze out about 330kh/s by overclocking to 1070/1500.  There was no magic ratio that I could find but that may be because I'm running the stock kernel.  I have a feeling stilt's bios may help out with a better kernel, if not for performance then for energy savings.  The core clock doesn't influence hash that much which is something wolf0 and others have said, ie crank up the memory speed and downclock the core for energy savings.

Testing the different drivers didn't make a difference to me, in fact I saw a slight increase in just sticking with 14.6rc2, as opposed to using 14.9 and 14.6 ocl files.  Maybe you play games and 14.9 is better for that but I don't use these for gaming.  Testing different settings didn't really make too much of a difference either, TCs of 8192, 8448, 16384, 22500 (I used that for scrypt-n) and 22528 and different worksizes didn't produce a significant change.  The 290 kernel bottleneck is a problem.  

I wouldn't worry too much about a massive GPU farm.  I don't think it will matter too much; there will always be new farmers and some will also leave.  Now if it's wolf0's farm then maybe that would be something to worry about ;)  But thanks to his hawaii bin, x11 is much more profitable for me than neoscrypt.

Anyway

My 290x don't like over 1130 clocks and 1450 memory at all, they keep crashing the drivers or the rig.  But the 290's will run 1150 and 1500.  They might run more, but I haven't really pushed them as I had bad fans.  Now they are back with new fans, I will push further when I have the time to pay attention to it.

No gaming on this rig..  It's pretty capable of it with an AMD8350 and 16 gigs of ram, but 4 of the 5 GPU's only have a 1x extender so it would be playing with a single 280x.  I noticed about 5 to 10 mhs difference with the 14.6/14.9, odd that you don't.

Yes, there's a whole lotta blah settings in that bottlenecked area, I tried them all, and many a few times as I was running 4 gpu's and would set all 4 differently during the testing to cover as much ground as possible.
Prelude
January 04, 2015, 01:39:19 AM
 #2682

Hey, so I started testing neoscrypt configs, but I have no idea what I am aiming for.
Can anyone share the most you can get from a 7950? I am getting around 220kh/s. Is that good? Can I get more than that?

Cheers

Looks like you can get a little more..  

http://hw.neoscrypt.tk/index.php

I actually got them up to 260, but I am still wondering if I can get more out of them. I am using sgminer5.1-dev.

I would like to know the best anyone has got out of these cards, or from any cards actually.
I mean any optimized hidden super secret kernel etc. :)
Just wondering what the max is at this moment.

Around 600kh/s out of 290X.

That's insane! I want your kernel. ;)

Do you have any power figures VS the publicly available kernel that gets me ~315KH/s on 290 & 290X @975/1500?

Also, what clocks are you running? Would you mind sharing your TC and other relevant settings if they can be applied to the public kernel?
go6ooo1212
January 04, 2015, 07:53:22 AM
 #2683

..who doesn't want those wolf0's kernels :)
damm315er
January 04, 2015, 12:44:58 PM
 #2684

..who doesn't want those wolf0's kernels :)

True.
bobben2
January 06, 2015, 05:55:29 PM
 #2685

Here is a small neoscrypt kernel improvement for free, since I am mostly doing X11 anyway.
It gave me a 5.8% speedup on my reference R9 290 card (with Stilt bios),
from 290.2 to 307Kh/s  at 800/1500 core/mem freq on Ubuntu 12.04 with stock drivers.
I didn't try it on my R9 280x cards, so please post your results if you try this.

You will have to mod the kernel as per the code below.
The bottleneck in this kernel is the way it stores the 128 intermediate results of chacha and salsa in global memory.
By doing the change below you are reducing stalls/latency by not making read/writes to same/adjacent memory banks.

Change:
void ScratchpadStore(__global void *V, void *X, uchar idx)
{
   ((__global ulong16 *)V)[idx << 1] = ((ulong16 *)X)[0];
   ((__global ulong16 *)V)[(idx << 1) + 1] = ((ulong16 *)X)[1];
}

void ScratchpadMix(void *X, const __global void *V, uchar idx)
{
   ((ulong16 *)X)[0] ^= ((__global ulong16 *)V)[idx << 1];
   ((ulong16 *)X)[1] ^= ((__global ulong16 *)V)[(idx << 1) + 1];
}

To:
void ScratchpadStore(__global void *V, void *X, uchar idx)
{
   ((__global ulong16 *)V)[idx] = ((ulong16 *)X)[0];
   ((__global ulong16 *)V)[idx + 128] = ((ulong16 *)X)[1];
}
void ScratchpadMix(void *X, const __global void *V, uchar idx)
{
   ((ulong16 *)X)[0] ^= ((__global ulong16 *)V)[idx];
   ((ulong16 *)X)[1] ^= ((__global ulong16 *)V)[idx + 128];
}
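
For anyone who wants to sanity-check the index change before rebuilding the kernel, here is a rough host-side sketch in plain C (not OpenCL). It assumes what the post implies: 128 scratchpad entries per thread, each made of two ulong16 halves, so 256 slots in total. The helper names (`slot_interleaved`, `slot_split`, `covers_all_slots`, `check_layouts`) are made up for illustration and are not part of sgminer. It only confirms that both layouts use every slot exactly once, so the change moves where the halves live without changing what is stored.

```c
#include <assert.h>
#include <string.h>

#define ENTRIES 128            /* assumed: 128 intermediate results per thread */
#define SLOTS   (2 * ENTRIES)  /* each entry is two ulong16 halves */

/* Original layout: both halves of an entry sit next to each other. */
int slot_interleaved(unsigned char idx, int half)
{
    return (idx << 1) + half;
}

/* Modified layout: first halves in slots 0..127, second halves in 128..255. */
int slot_split(unsigned char idx, int half)
{
    return idx + half * ENTRIES;
}

/* Returns 1 if the mapping hits every slot exactly once, 0 otherwise. */
int covers_all_slots(int (*slot)(unsigned char, int))
{
    unsigned char seen[SLOTS];
    memset(seen, 0, sizeof seen);

    for (int idx = 0; idx < ENTRIES; idx++) {
        for (int half = 0; half < 2; half++) {
            int s = slot((unsigned char)idx, half);
            if (s < 0 || s >= SLOTS || seen[s])
                return 0;   /* out of range or slot used twice */
            seen[s] = 1;
        }
    }
    return 1;
}

/* Hypothetical self-check: aborts if either layout misses or reuses a slot. */
void check_layouts(void)
{
    assert(covers_all_slots(slot_interleaved));
    assert(covers_all_slots(slot_split));
}
```

Since ScratchpadStore and ScratchpadMix are changed together, they stay consistent with each other; whether the new layout is actually faster depends on the card, as the results later in the thread show.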

Fellow miners, get your thens and thans in order and help other forum readers understand what you are writing. Remember the grammar basics:  B larger THAN A (comparator operator). If something THEN ....
JuanHungLo
January 06, 2015, 06:50:34 PM
 #2686

@ bobben2, 280x with hynix 302Kh/s 1100/1600 x 4 is 990 watts at the wall

Bull markets are born on pessimism, grow on skepticism, mature on optimism, and die on euphoria. - John Templeton
bobben2
January 06, 2015, 07:17:33 PM
 #2687

@ bobben2, 280x with hynix 302Kh/s 1100/1600 x 4 is 990 watts at the wall

Those cards must be screaming :D
How much of a %age improvement did you get?

Fellow miners, get your thens and thans in order and help other forum readers understand what you are writing. Remember the grammar basics:  B larger THAN A (comparator operator). If something THEN ....
Zuikkis
January 06, 2015, 07:37:36 PM
 #2688

@ bobben2, 280x with hynix 302Kh/s 1100/1600 x 4 is 990 watts at the wall

That's not very good hashrate with the public kernel.. 1600 is not very good memclock, try lowering to 1500.

I had about 320khs with 1100/1500..
JuanHungLo
January 06, 2015, 07:47:59 PM
 #2689

@ bobben2, 280x with hynix 302Kh/s 1100/1600 x 4 is 990 watts at the wall

Those cards must be screaming :D
How much of a %age improvement did you get?

Sorry, 1600 was a typo, it is 1500
about 8.6% with the below config.  Note the undervoltage to 1000.

Quote
{
  "pools": [
    {
      "name": "FeatherCoin-neo Pool - WemineFTC",
      "nfactor": "10",
      "algorithm": "neoscrypt",
      "url": "stratum+tcp://stratum.wemineftc.com:4444",
      "user": "USER",
      "pass": "x"
    }
  ],
  "api-port": "4028",
  "gpu-engine": "1100",
  "gpu-memclock": "1500",
  "worksize": "256",
  "gpu-threads": "2",
  "api-listen": true,
  "api-allow": "W:127.0.0.1/32",
  "queue": "1",
  "algorithm": "neoscrypt",
  "device": "0,1,2,3",
  "xintensity": "3",
  "thread-concurrency": "8192",
  "gpu-vddc": "1.00",
  "scan-time": "1",
  "gpu-reorder": true,
  "temp-cutoff": "90",
  "temp-overheat": "82",
  "temp-target": "72",
  "gpu-platform": "0",
  "gpu-dyninterval": "7",
  "expiry": "1",
  "no-pool-disable": true,
  "no-client-reconnect": true,
  "log": "5",
  "no-submit-stale": true,
  "scrypt": true,
  "tcp-keepalive": "30",
  "temp-hysteresis": "3",
  "kernel-path": "/usr/local/bin",
  "powertune": "20"
}

Bull markets are born on pessimism, grow on skepticism, mature on optimism, and die on euphoria. - John Templeton
Zuikkis
January 06, 2015, 07:52:12 PM
 #2690

Try to remove thread-concurrency from config, so sgminer can calculate it from the xintensity.

Edit: Oh, and worksize 128  or even 64 is probably faster.
bobben2
January 06, 2015, 08:08:48 PM
 #2691

Hi again,
Now I tried the "improved" kernel on my own 280X rig.
3 cards, all running 1000/1500 core/mem.  550Watts at the wall 
Orig neoscrypt kernel (Kh/s)
  301
  296
  287
My "improved" kernel
  295
  289
  276
Yiikes!  I got worse performance on the 280X! 
Sorry guys.  This "improvement", as it stands, seems to come to the 290 only.
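
To put the numbers above side by side, here is a small sketch in plain C that works out the percentage change. The function and array names are only illustrative; the figures are the ones quoted in this thread (three 280X cards at 1000/1500, and the earlier 290 result of 290.2 to 307 Kh/s).

```c
#include <assert.h>

/* Relative change in percent; negative means the hashrate dropped. */
double percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* Sum of n per-card hashrates (Kh/s). */
double total_khs(const double *cards, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += cards[i];
    return sum;
}

/* Per-card figures from the 280X test above. */
static const double ORIG_KERNEL[3] = { 301.0, 296.0, 287.0 };
static const double MOD_KERNEL[3]  = { 295.0, 289.0, 276.0 };

/* Whole-rig change for the 280X test: 884 -> 860 Kh/s, roughly -2.7%. */
double rig_change_280x(void)
{
    return percent_change(total_khs(ORIG_KERNEL, 3), total_khs(MOD_KERNEL, 3));
}
```

By the same formula, `percent_change(290.2, 307.0)` comes out at about 5.8, which matches the speedup reported for the 290.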

Fellow miners, get your thens and thans in order and help other forum readers understand what you are writing. Remember the grammar basics:  B larger THAN A (comparator operator). If something THEN ....
JuanHungLo
January 06, 2015, 08:31:11 PM
 #2692

I concur.  After further testing with suggested settings I was getting HW errors.  Back to the drawing board...

Bull markets are born on pessimism, grow on skepticism, mature on optimism, and die on euphoria. - John Templeton
damm315er
January 07, 2015, 02:21:03 AM
 #2693

So wolf0 isn't the only one with a kernel mod that works better on the 290's than anything else..
Eastwind
January 07, 2015, 10:42:42 AM
 #2694

Hi again,
Now I tried the "improved" kernel on my own 280X rig.
3 cards, all running 1000/1500 core/mem.  550Watts at the wall  
Orig neoscrypt kernel (Kh/s)
  301
  296
  287
My "improved" kernel
  295
  289
  276
Yiikes!  I got worse performance on the 280X!  
Sorry guys.  This "improvement", as it stands, seems to come to the 290 only.


+1
on my 280x i get 308 with "improved" vs 317 before

The drop is about 5% compared to old kernel for 7970. Maybe this "improved kernel" works only for 290 which has larger memory and more cores.
KL0nLutiy
January 07, 2015, 11:41:18 AM
 #2695

Quote from bobben2 (#2685, the ScratchpadStore/ScratchpadMix index change):

thanks, increase from 317 to 324 on 290x

tccd
January 07, 2015, 11:56:49 AM
 #2696

Quote from bobben2 (#2685, the ScratchpadStore/ScratchpadMix index change):


Not working well with Wolf0's Hawaii mod. Hash rate dropped from 339kh/s to 320kh/s.
damm315er
January 09, 2015, 11:01:11 PM
Last edit: January 10, 2015, 02:36:06 PM by damm315er
 #2697

Quote from bobben2 (#2685, the ScratchpadStore/ScratchpadMix index change):


CORRECTION:

That made an 8 kh/s increase on my 290's.. from 341 to 349 kh/s. (dumb azz me, I forgot to delete the bin)
cat77
January 10, 2015, 04:16:31 AM
Last edit: January 10, 2015, 04:40:15 AM by cat77
 #2698

.....This is worth 20KH/s on my 280X......from 343KH/s to 363KH/s at 1020MHz clock
.....now somebody needs to find 20KH/s more for me....  :)

change the XORBytesInPlace call from
Code:
	XORBytesInPlace(B + bufidx, input, BLAKE2S_OUT_SIZE);
to
Code:
      XORBytesInPlace(B + bufidx, input, bufidx);
and change the function itself to perform some byte alignment checking
Code:
//
// a bit of byte alignment checking goes a long ways...
//
// Every branch XORs the same 32 bytes of src into dst. The third argument is
// now the buffer offset and only selects how wide each access can be.
void XORBytesInPlace(void *restrict dst, const void *restrict src, uint mod)
{
  switch(mod % 4)
  {
  case 0:
    // offset is a multiple of 4: 4 x uint2 (8 bytes each) = 32 bytes
    #pragma unroll 2
    for(int i = 0; i < 4; i+=2)
    {
      ((uint2 *)dst)[i]   ^= ((uint2 *)src)[i];
      ((uint2 *)dst)[i+1] ^= ((uint2 *)src)[i+1];
    }
    break;

  case 2:
    // offset is 2 mod 4: 16 x uchar2 (2 bytes each) = 32 bytes
    #pragma unroll 8
    for(int i = 0; i < 16; i+=2)
    {
      ((uchar2 *)dst)[i]   ^= ((uchar2 *)src)[i];
      ((uchar2 *)dst)[i+1] ^= ((uchar2 *)src)[i+1];
    }
    break;

  default:
    // odd offset: byte by byte; i = 0, 4, ..., 28 covers bytes 0..31
    #pragma unroll 8
    for(int i = 0; i < 31; i+=4)
    {
      ((uchar *)dst)[i]   ^= ((uchar *)src)[i];
      ((uchar *)dst)[i+1] ^= ((uchar *)src)[i+1];
      ((uchar *)dst)[i+2] ^= ((uchar *)src)[i+2];
      ((uchar *)dst)[i+3] ^= ((uchar *)src)[i+3];
    }
  }
}
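
The idea behind the change is that XORing 32 bytes in wide words gives the same result as XORing them one byte at a time, just with fewer memory accesses. Below is a rough host-side sketch in plain C (not the kernel itself) that checks this equivalence. The names `xor_bytes`, `xor_words` and `check_xor_variants` are hypothetical, and `XOR_LEN` is set to 32 on the assumption that this is the length every branch of the function above covers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XOR_LEN 32  /* assumed length: every branch above XORs 32 bytes */

/* Reference: one byte at a time, like the default branch. */
void xor_bytes(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < XOR_LEN; i++)
        dst[i] ^= src[i];
}

/* Wide variant: four 8-byte words, like the uint2 branch.
 * memcpy keeps this valid C for any alignment; a GPU kernel would
 * use direct vector loads and stores instead. */
void xor_words(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < XOR_LEN; i += 8) {
        uint64_t d, s;
        memcpy(&d, dst + i, sizeof d);
        memcpy(&s, src + i, sizeof s);
        d ^= s;
        memcpy(dst + i, &d, sizeof d);
    }
}

/* Returns 1 if both variants give identical output on a test pattern. */
int xor_variants_agree(void)
{
    uint8_t a[XOR_LEN], b[XOR_LEN], src[XOR_LEN];
    for (int i = 0; i < XOR_LEN; i++) {
        a[i] = b[i] = (uint8_t)(i * 7 + 1);
        src[i] = (uint8_t)(255 - i * 3);
    }
    xor_bytes(a, src);
    xor_words(b, src);
    return memcmp(a, b, XOR_LEN) == 0;
}

/* Hypothetical self-check: aborts if the two variants ever disagree. */
void check_xor_variants(void)
{
    assert(xor_variants_agree());
}
```

One caveat: OpenCL nominally expects a uint2 pointer to be 8-byte aligned, while the `case 0` branch only knows the offset is a multiple of 4, so whether that branch is safe on a given card is something to test (watch for HW errors) rather than assume.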
Eastwind
January 10, 2015, 09:02:03 AM
 #2699

Quote from cat77 (#2698, the XORBytesInPlace alignment change):


Did you change from the original kernel or after bobben2's change? Can you upload a revised kernel?
KL0nLutiy
January 10, 2015, 11:58:54 AM
 #2700

Quote from cat77 (#2698, the XORBytesInPlace alignment change):

What settings do you use?
