Come-from-Beyond (OP)
December 21, 2011, 04:58:05 PM
I'm talking about the algorithm that is used between the two SHA-256 calculations to make hashing GPU-unfriendly. Its name is Scrypt.
It's supposed to do calculations that can't be done in parallel and that require a lot of memory. The parameters used for LTC and similar forks (1024, 1 and 1) force us to use only 1 thread and at least 128 KB of memory to compute one hash value. I rewrote the algorithm so I could track the path of every bit. After that I used the mathematical methods for simplifying integrated circuits (it's like when you have 1000 NAND or NOR elements and wish to use as few of them as possible, say just 300). I failed to build the complete Scrypt this way because of memory constraints, but I managed to simplify small pieces of it. I'm afraid that with a lot of memory anyone could build the complete scheme and skip through thousands of Salsa steps. That would lead to a huge hashrate boost, e.g. 1 MH/s vs. 1 kH/s with "usual" Scrypt. No need to explain why that's very bad for LTC and the others.
Well, now my question: is there any theory that proves the quality of Scrypt?
PS: My modified Scrypt code does fewer calculations. It gives an approx. 20% boost compared to the original code of the Tarsnap project, due to sequential memory access. Imagine what will happen if someone gets rid of the redundant XORs of the 16 temporary memory cells in Salsa...
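For reference, a minimal sketch (not from my miner; the 128 * r * N bytes figure comes from the scrypt paper) of how those parameters translate into the memory requirement per hash:

#include <stdio.h>

/* scrypt's memory-hard core keeps N blocks of 128 * r bytes each, and runs
 * p independent instances.  For Litecoin's (N, r, p) = (1024, 1, 1) this is
 * 1024 * 128 * 1 = 131072 bytes = 128 KB of scratchpad per hash. */
int main(void)
{
    unsigned long N = 1024, r = 1, p = 1;
    unsigned long scratchpad_bytes = 128UL * r * N;

    printf("scratchpad per instance: %lu bytes, parallel instances: %lu\n",
           scratchpad_bytes, p);
    return 0;
}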
pooler
December 21, 2011, 05:04:31 PM
Quote from: Come-from-Beyond on December 21, 2011, 04:58:05 PM
I think it would be interesting to discuss this with the author of scrypt, Colin Percival. There is even a mailing list dedicated to scrypt here: http://www.tarsnap.com/lists.html
Come-from-Beyond (OP)
December 21, 2011, 05:25:58 PM
Good idea! Thank you.
Come-from-Beyond (OP)
December 21, 2011, 05:43:48 PM
Scrypt itself looks OK; the problem comes from the chosen parameters (1024, 1 and 1), I think. Maybe "problem" is too strong a word, but a really good algorithm is supposed to use every byte of the scratchpad with equal probability. Unfortunately, usage of the scratchpad is not symmetrical. For example, if you divide the memory into 64-byte chunks you'll notice that every even chunk is used 2 times while every odd one is used 1 time. There are also evident patterns in the jumps from chunk to chunk. And so on.
I think there is no need to discuss this topic further. We should just pray that no one solves the problem of how to simplify Salsa.
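For anyone who wants to count the accesses themselves, here is a minimal sketch of scrypt's BlockMix for r = 1, following the structure in the scrypt paper (this is not my miner code, and salsa20_8() is assumed to be a standard Salsa20/8 core, like the Salsa() routine posted later in this thread):

#include <string.h>

typedef unsigned int u32;

/* Assumed to exist elsewhere: the standard Salsa20/8 core on a 64-byte block. */
extern void salsa20_8(u32 X[16]);

/* BlockMix for r = 1: the input block B is two 64-byte (16-dword) chunks.
 * Note the asymmetry: B[1] (dwords 16..31) is read twice, B[0] only once. */
void blockmix_r1(const u32 B[32], u32 Y[32])
{
    u32 X[16];
    int k;

    memcpy(X, &B[16], 64);                            /* X = B[1]            */

    for (k = 0; k < 16; k++) X[k] ^= B[k];            /* only read of B[0]   */
    salsa20_8(X);
    memcpy(&Y[0], X, 64);                             /* Y[0]                */

    for (k = 0; k < 16; k++) X[k] ^= B[16 + k];       /* second read of B[1] */
    salsa20_8(X);
    memcpy(&Y[16], X, 64);                            /* Y[1]                */
}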
grue
December 22, 2011, 12:59:16 AM
Quote from: Come-from-Beyond
My modified Scrypt code does fewer calculations. It gives an approx. 20% boost compared to the original code of the Tarsnap project, due to sequential memory access. Imagine what will happen if someone gets rid of the redundant XORs of the 16 temporary memory cells in Salsa...
do share!
Come-from-Beyond (OP)
December 22, 2011, 05:40:06 AM
I'm going to publish the miner when I create a "pool" for it.
ssvb
January 19, 2012, 10:37:02 AM
Quote from: Come-from-Beyond
I think there is no need to discuss this topic further. We should just pray that no one solves the problem of how to simplify Salsa.

Praying is not very useful, and obscurity is definitely not the right way to do things here. Even without sharing the sources yet, do you have any statistics for the number of 32-bit XOR, ADD and ROTATE operations before and after your optimization? Even better would be statistics for the 128-bit SIMD operations (XOR, ADD, ROTATE, SHUFFLE).
Come-from-Beyond (OP)
January 19, 2012, 11:50:51 AM
Quote from: ssvb
Even without sharing the sources yet, do you have any statistics for the number of 32-bit XOR, ADD and ROTATE operations before and after your optimization? Even better would be statistics for the 128-bit SIMD operations (XOR, ADD, ROTATE, SHUFFLE).

I didn't count the statistics. I tried to simplify the algorithm using the usual NAND/NOR-logic optimization techniques. (Un)fortunately I ran out of memory (I had only 16 GB), so the optimization wasn't completed. If you wish, you can try to do it yourself using my implementation of [1024, 1, 1]-Scrypt. When I wrote it using SSE2 (calculating 4 hashes at once) I got 2500 h/s per GHz of an Intel CPU.

#define ROTR(x, n) ((x >> n) | (x << (32 - n)))
void Salsa(unsigned __int32 uiData[16])
{
    unsigned __int32 uiA[16], i, j;

    for (i = 0; i < 16; i++) uiA[i] = uiData[i];

    for (i = 0; i < 4; i++)
    {
        uiA[4] ^= ROTR((uiA[0] + uiA[12]), 25); uiA[9] ^= ROTR((uiA[5] + uiA[1]), 25); uiA[14] ^= ROTR((uiA[10] + uiA[6]), 25); uiA[3] ^= ROTR((uiA[15] + uiA[11]), 25);
        uiA[8] ^= ROTR((uiA[4] + uiA[0]), 23); uiA[13] ^= ROTR((uiA[9] + uiA[5]), 23); uiA[2] ^= ROTR((uiA[14] + uiA[10]), 23); uiA[7] ^= ROTR((uiA[3] + uiA[15]), 23);
        uiA[12] ^= ROTR((uiA[8] + uiA[4]), 19); uiA[1] ^= ROTR((uiA[13] + uiA[9]), 19); uiA[6] ^= ROTR((uiA[2] + uiA[14]), 19); uiA[11] ^= ROTR((uiA[7] + uiA[3]), 19);
        uiA[0] ^= ROTR((uiA[12] + uiA[8]), 14); uiA[5] ^= ROTR((uiA[1] + uiA[13]), 14); uiA[10] ^= ROTR((uiA[6] + uiA[2]), 14); uiA[15] ^= ROTR((uiA[11] + uiA[7]), 14);

        uiA[1] ^= ROTR((uiA[0] + uiA[3]), 25); uiA[6] ^= ROTR((uiA[5] + uiA[4]), 25); uiA[11] ^= ROTR((uiA[10] + uiA[9]), 25); uiA[12] ^= ROTR((uiA[15] + uiA[14]), 25);
        uiA[2] ^= ROTR((uiA[1] + uiA[0]), 23); uiA[7] ^= ROTR((uiA[6] + uiA[5]), 23); uiA[8] ^= ROTR((uiA[11] + uiA[10]), 23); uiA[13] ^= ROTR((uiA[12] + uiA[15]), 23);
        uiA[3] ^= ROTR((uiA[2] + uiA[1]), 19); uiA[4] ^= ROTR((uiA[7] + uiA[6]), 19); uiA[9] ^= ROTR((uiA[8] + uiA[11]), 19); uiA[14] ^= ROTR((uiA[13] + uiA[12]), 19);
        uiA[0] ^= ROTR((uiA[3] + uiA[2]), 14); uiA[5] ^= ROTR((uiA[4] + uiA[7]), 14); uiA[10] ^= ROTR((uiA[9] + uiA[8]), 14); uiA[15] ^= ROTR((uiA[14] + uiA[13]), 14);
    }

    for (i = 0; i < 16; i++) uiData[i] += uiA[i];
}

unsigned __int32 uiStorage[32768 + 32];

..........................

/* Pass 1: fill the 128 KB scratchpad sequentially. */
for (i = 0; i < 32768; i += 16)
{
    for (j = 0; j < 16; j++) uiStorage[i + 32 + j] = uiStorage[i + j] ^ uiStorage[i + 16 + j];
    Salsa(&uiStorage[i + 32]);
}

/* Pass 2: XOR each pair of adjacent 16-dword blocks in place (precomputation). */
for (i = 0; i < 32768; i += 32)
{
    for (j = 0; j < 16; j++) uiStorage[i + j] ^= uiStorage[i + 16 + j];
}

/* Pass 3: data-dependent reads from the scratchpad, mixed into the working block. */
for (i = 0; i < 32768; i += 16)
{
    j = ((uiStorage[32784] & 1023) << 5) + (i & 31);
    for (k = 0; k < 16; k++) uiStorage[32768 + (i & 31) + k] = uiStorage[32768 + k] ^ uiStorage[32784 + k] ^ uiStorage[j + k];
    Salsa(&uiStorage[32768 + (i & 31)]);
}

..........................
I can't post my semi-optimized solution here. It used some precomputation and required a lot of memory. Nothing special, but I decided to keep it in a cold, dark place.
ArtForz
January 19, 2012, 12:45:40 PM (last edit: January 19, 2012, 01:08:39 PM by ArtForz)
Let's play a game of counting XORs, shall we? Taking the three loops from the code you just posted:
First loop: 2048 iterations of 1 16-dword XOR.
Second loop: 1024 iterations of 1 16-dword XOR.
Third loop: 2048 iterations of 2 16-dword XORs (sorry, a 3-input XOR is still 2 XORs).

Total number of 16-dword XORs: 7168.

Compare that with a simple, stupid scrypt(1024, 1, 1) (using 16-dword vectors for X and V to keep things more readable):

for (i = 0; i < 1024; i++)
{
    V[i * 2 + 0] = X[0];
    V[i * 2 + 1] = X[1];
    X[0] ^= X[1];
    X[0] = salsa20(X[0]);
    X[1] ^= X[0];
    X[1] = salsa20(X[1]);
}

for (i = 0; i < 1024; i++)
{
    j = X[1].vec_element[0] & 1023;
    X[0] ^= V[j * 2 + 0];
    X[1] ^= V[j * 2 + 1];
    X[0] ^= X[1];
    X[0] = salsa20(X[0]);
    X[1] ^= X[0];
    X[1] = salsa20(X[1]);
}
Let's count again:

First loop: 1024 iterations of 2 16-dword XORs.
Second loop: 1024 iterations of 4 16-dword XORs.

Total number of 16-dword XORs: 6144.

Congratulations. Your implementation *increased* the total number of 16-dword XORs vs. reference scrypt by 1024... ?!?
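If anyone wants to double-check the arithmetic, here it is as a trivial runnable snippet (nothing beyond the totals above):

#include <stdio.h>

int main(void)
{
    /* Come-from-Beyond's three loops: iterations * 16-dword XORs per iteration */
    int cfb = 2048 * 1 + 1024 * 1 + 2048 * 2;   /* 7168 */
    /* reference scrypt(1024, 1, 1): two loops of 1024 iterations each */
    int ref = 1024 * 2 + 1024 * 4;              /* 6144 */

    printf("%d vs %d -> difference %d\n", cfb, ref, cfb - ref);
    return 0;
}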
fivebells
January 19, 2012, 01:03:05 PM
Your fear that someone might find a faster implementation by running one instance of it through some magical optimization software doesn't seem that plausible to me. Roughly speaking, scrypt is based on randomly mixing up the results of a large number of SHA256 hashes, so any substantial optimization which didn't also optimize SHA256 would be pretty surprising. Any accurate optimization would still need random access to all those hashes, which forces a very large circuit.
Come-from-Beyond (OP)
January 19, 2012, 01:07:59 PM
Quote from: ArtForz
Congratulations. Your implementation *increased* the total number of 16-dword XORs vs. reference scrypt by 1024... ?!?

Perhaps. I didn't check it myself, but I suppose your counting is correct. So?
Come-from-Beyond (OP)
January 19, 2012, 01:11:52 PM
Quote from: fivebells
Any accurate optimization would still need random access to all those hashes, which forces a very large circuit.

I'm afraid that this circuit won't be too large for modern FPGAs.
ArtForz
January 19, 2012, 01:15:18 PM
Quote from: Come-from-Beyond
I'm afraid that this circuit won't be too large for modern FPGAs.

... says the guy claiming +1024 is negative.
Come-from-Beyond (OP)
January 19, 2012, 01:32:18 PM
Quote from: ArtForz
... says the guy claiming +1024 is negative.

I get you. You assumed that the time to calculate a Scrypt-based hash is proportional to the number of operations. That's wrong. For example, i*4+i can be faster than i*5. My implementation of Scrypt is optimized for an Intel CPU and runs faster than your code. The first step I made was optimizing your miner, and I wasn't satisfied with the results, so I moved to the original Tarsnap code and then to the implementation posted here. I took into account a REAL CPU: one that has a limited cache for code and data, a not-very-clever branch predictor, register stalls, and the ability to process 3 instructions at once.
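To illustrate that last point, here is a hypothetical sketch (not the miner code) of why operation count and wall-clock time diverge on a superscalar CPU: two independent Salsa-style quarter-round chains interleaved so their adds, rotates and XORs can issue in the same cycles.

#include <stdint.h>

#define ROTR32(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

/* Two independent 4-word states, a[] and b[], processed in lockstep.
 * Twice the instructions of a single chain, but the critical path per
 * result stays the same, so throughput roughly doubles on a CPU that
 * can retire several instructions per cycle. */
void quarter_rounds_x2(uint32_t a[4], uint32_t b[4])
{
    a[1] ^= ROTR32(a[0] + a[3], 25);    b[1] ^= ROTR32(b[0] + b[3], 25);
    a[2] ^= ROTR32(a[1] + a[0], 23);    b[2] ^= ROTR32(b[1] + b[0], 23);
    a[3] ^= ROTR32(a[2] + a[1], 19);    b[3] ^= ROTR32(b[2] + b[1], 19);
    a[0] ^= ROTR32(a[3] + a[2], 14);    b[0] ^= ROTR32(b[3] + b[2], 14);
}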
ArtForz
January 19, 2012, 01:58:09 PM
Quote from: Come-from-Beyond
My implementation of Scrypt is optimized for an Intel CPU and runs faster than your code.

Reference C implementation slower than optimized asm, news at 11.

Quote from: Come-from-Beyond
The first step I made was optimizing your miner, and I wasn't satisfied with the results, so I moved to the original Tarsnap code and then to the implementation posted here. I took into account a REAL CPU: one that has a limited cache for code and data, a not-very-clever branch predictor, register stalls, and the ability to process 3 instructions at once.

So... exactly what pooler did, except he managed it without the big secrecy or the talk of boolean simplification and factor-1000 speedups "if only it had worked". Tip: statements like that make me think "oh dear, another crypto kook".
Come-from-Beyond (OP)
January 19, 2012, 02:30:58 PM
Quote from: ArtForz
Reference C implementation slower than optimized asm, news at 11.

As I wrote: "When I wrote it using SSE2 (calculating 4 hashes at once) I got 2500 h/s per GHz of an Intel CPU." See?

Quote from: ArtForz
Tip: statements like that make me think "oh dear, another crypto kook".

I don't need tips from a guy who is so infantile. I don't know why you think this topic is something personal against you. Read https://bitcointalk.org/index.php?topic=55670.0 to see the results of my implementation. The hashrate was proven by litecoinpool.org statistics, and the tests were made by other guys. Your miner is so slow that even simple optimizations give a good boost to performance (+200%).

Starting from this point I'm putting you on my ignore list, so you won't waste my time.
ArtForz
January 19, 2012, 02:46:29 PM
*Facepalm* Again, completely avoiding the topic of outrageous claims without *any* proof. Oh wait, the "proof" is writing a miner that's 20% slower than pooler's. Right. Ignored.
ssvb
January 19, 2012, 07:13:34 PM
OK, thanks for the clarifications. Now we can see that it was just scaremongering and/or handwaving. So much for the "fewer calculations" and "sequential memory access" promises. In reality the demonstrated code snippet has more XOR operations and makes three separate passes over the 128 KB buffer instead of two (which means more L1 cache misses, unless the hardware prefetcher saves the day). The pointless replacement of left rotations with right rotations also looks amusing.
Come-from-Beyond (OP)
January 19, 2012, 08:10:44 PM
Quote from: ssvb
In reality the demonstrated code snippet has more XOR operations and makes three separate passes over the 128 KB buffer instead of two (which means more L1 cache misses, unless the hardware prefetcher saves the day). The pointless replacement of left rotations with right rotations also looks amusing.

Well... There are some explanations. The code I posted here is a C++ edition of the algorithm (for easier reading); the CPU runs assembler code that looks completely different, and you should know this unless you are just a coin trader. Your post looks like an attempt to defend ArtForz's position, but everyone can see that he is just upset that there are other miners better than his (pooler's, mine, and I know of 2 other guys who have developed much better miners).

My implementation uses fewer calculations because the CPU processes 3 instructions at once, using a trick to avoid register stalls, so the extra loop between the other two is necessary for the "fewer calculations" and "sequential memory access". By the way, a CPU with 128 KB of L1 per core is quite... uncommon, and your statement about 3 passes is wrong.

Regarding the replacement of left rotations with right ones: I suspect you are NOT serious. Even a young programmer won't believe that this was done for optimization or anything like that. I did it because there was already a defined function for calculating SHA-256 that used right rotation.

PS: Ask more if you have questions.
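For what it's worth, the two rotation forms really are interchangeable: a left rotation by n equals a right rotation by 32 - n, which is why the posted code uses ROTR with 25, 23, 19 and 14 where the Salsa20 reference uses left rotations by 7, 9, 13 and 18. A minimal check (hypothetical test code, not from the miner):

#include <assert.h>
#include <stdint.h>

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))
#define ROTR32(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

int main(void)
{
    uint32_t x = 0x12345678u;

    /* Salsa20's left rotations 7, 9, 13, 18 equal right rotations 25, 23, 19, 14. */
    assert(ROTL32(x, 7)  == ROTR32(x, 25));
    assert(ROTL32(x, 9)  == ROTR32(x, 23));
    assert(ROTL32(x, 13) == ROTR32(x, 19));
    assert(ROTL32(x, 18) == ROTR32(x, 14));
    return 0;
}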
Roadhog2k5
January 19, 2012, 08:23:25 PM
u, ur, r, coz.