Bitcoin Forum
November 03, 2024, 11:47:24 PM *
News: Latest Bitcoin Core release: 28.0 [Torrent]
 
   Home   Help Search Login Register More  
Warning: One or more bitcointalk.org users have reported that they believe that the creator of this topic displays some red flags which make them high-risk. (Login to see the detailed trust ratings.) While the bitcointalk.org administration does not verify such claims, you should proceed with extreme caution.
Pages: [1] 2 »  All
  Print  
Author Topic: Scrypt based coins: Any proof that this algorithm is realy good?  (Read 6363 times)
Come-from-Beyond (OP)
Legendary
*
Offline Offline

Activity: 2142
Merit: 1010

Newbie


View Profile
December 21, 2011, 04:58:05 PM
 #1

I'm talking about algorithm that is used between two SHA256 calculations to make hashing GPU unfriendly. Its name is "Scrypt".

It's supposed to do calculations that can't be done in parralel mode and require a lot of memory. Parameters used for LTC and similar forks (1024, 1 and 1) make us to use only 1 thread and at least 128 Kb of memory to calc 1 hash value. I rewrote the algorithm so I could track a path of every bit. After that I used math methods for simplifying integrated circuits (it's like when u have 1000 NAND or NOR elements and wish to use as less as possible of them, just 300 for example). I failed to build complete Scrypt this way, coz of memory constraint, but I managed to simplify small pieces of it. I'm afraid that with a lot of memory anyone could build complete scheme and pass thousand SALSA steps. This will lead to huge hashrate boost, i.e. 1 Mh/s vs 1 Kh/s with "usual" Scrypt. No need to tell why it's very bad for LTC and the others.

Well, now my question: is there any theory that prove quality of Scrypt?

PS: My modified Scrypt code do less calculations. It gives approx. 20% boost when compared to original code of Tarnsnap project due to sequential memory access. Imagine what will happen if someone get rid of redundant XORs of 16 temporary memory cells in SALSA...
pooler
Hero Member
*****
Offline Offline

Activity: 842
Merit: 507


View Profile
December 21, 2011, 05:04:31 PM
 #2

I'm talking about algorithm that is used between two SHA256 calculations to make hashing GPU unfriendly. Its name is "Scrypt".

It's supposed to do calculations that can't be done in parralel mode and require a lot of memory. Parameters used for LTC and similar forks (1024, 1 and 1) make us to use only 1 thread and at least 128 Kb of memory to calc 1 hash value. I rewrote the algorithm so I could track a path of every bit. After that I used math methods for simplifying integrated circuits (it's like when u have 1000 NAND or NOR elements and wish to use as less as possible of them, just 300 for example). I failed to build complete Scrypt this way, coz of memory constraint, but I managed to simplify small pieces of it. I'm afraid that with a lot of memory anyone could build complete scheme and pass thousand SALSA steps. This will lead to huge hashrate boost, i.e. 1 Mh/s vs 1 Kh/s with "usual" Scrypt. No need to tell why it's very bad for LTC and the others.

Well, now my question: is there any theory that prove quality of Scrypt?

PS: My modified Scrypt code do less calculations. It gives approx. 20% boost when compared to original code of Tarnsnap project due to sequential memory access. Imagine what will happen if someone get rid of redundant XORs of 16 temporary memory cells in SALSA...

I think it would be interesting to discuss this with the author of scrypt, Colin Percival. There is even a mailing list dedicated to scrypt here:
http://www.tarsnap.com/lists.html

BTC: 15MRTcUweNVJbhTyH5rq9aeSdyigFrskqE · LTC: LTCPooLqTK1SANSNeTR63GbGwabTKEkuS7
Come-from-Beyond (OP)
Legendary
*
Offline Offline

Activity: 2142
Merit: 1010

Newbie


View Profile
December 21, 2011, 05:25:58 PM
 #3

Good idea! thank you
Come-from-Beyond (OP)
Legendary
*
Offline Offline

Activity: 2142
Merit: 1010

Newbie


View Profile
December 21, 2011, 05:43:48 PM
 #4

Scrypt looks OK, the problem comes from chosen parameters (1024, 1 and 1) I think. Maybe "problem" is too strong word, but realy good algorithm supposed to use every byte of scratchpad with equal probability. Unfortunately, usage of the scratchpad is not symmetrical. For example, if u divide memory into 64 byte chunks u'll notice that every even chunk used 2 times while odd one used 1 time. There are also evident patterns in jumps from chunk to chunk. So on.

I think no need to discuss this topic. We should just pray that noone will solve the problem how to simplify SALSA.
grue
Legendary
*
Offline Offline

Activity: 2058
Merit: 1446



View Profile
December 22, 2011, 12:59:16 AM
 #5

PS: My modified Scrypt code do less calculations. It gives approx. 20% boost when compared to original code of Tarnsnap project due to sequential memory access. Imagine what will happen if someone get rid of redundant XORs of 16 temporary memory cells in SALSA...
do share!

It is pitch black. You are likely to be eaten by a grue.

Adblock for annoying signature ads | Enhanced Merit UI
Come-from-Beyond (OP)
Legendary
*
Offline Offline

Activity: 2142
Merit: 1010

Newbie


View Profile
December 22, 2011, 05:40:06 AM
 #6

I'm going to publish the miner when I create a "pool" for it.
ssvb
Newbie
*
Offline Offline

Activity: 39
Merit: 0


View Profile
January 19, 2012, 10:37:02 AM
 #7

I think no need to discuss this topic. We should just pray that noone will solve the problem how to simplify SALSA.
Praying is not very useful and obscurity is definitely not the right way to do things here Smiley Even without sharing the sources yet, do you have any statistics for the number of 32-bit XOR, ADD and ROTATE operations before and after your optimization? Or even better would be the statistics for 128-bit SIMD operations (XOR, ADD, ROTATE, SHUFFLE).
Come-from-Beyond (OP)
Legendary
*
Offline Offline

Activity: 2142
Merit: 1010

Newbie


View Profile
January 19, 2012, 11:50:51 AM
 #8

I think no need to discuss this topic. We should just pray that noone will solve the problem how to simplify SALSA.
Praying is not very useful and obscurity is definitely not the right way to do things here Smiley Even without sharing the sources yet, do you have any statistics for the number of 32-bit XOR, ADD and ROTATE operations before and after your optimization? Or even better would be the statistics for 128-bit SIMD operations (XOR, ADD, ROTATE, SHUFFLE).

I didn't count the statistics. I tried to simplify the algo using usual NAND/NOR-logic optimization techniques. (Un)fortunatelly I ran out of memory (I had only 16 Gb) so optimization wasn't completed.

If u wish u could try to do it urself using my implementation of [1024, 1, 1]-Scrypt. When I wrote it using SSE2 (calc 4 hashes at once) I got 2500 h/s from each GHz of an Intel CPU.

Code:
#define ROTR(x, n) ((x>>n)|(x<<(32-n)))

void Salsa(unsigned __int32 uiData[16])
{
unsigned __int32 uiA[16], i, j;

for (i=0; i<16; i++) uiA[i]=uiData[i];

for (i=0; i<4; i++)
{
uiA[4]^=ROTR((uiA[0]+uiA[12]), 25);
uiA[9]^=ROTR((uiA[5]+uiA[1]), 25);
uiA[14]^=ROTR((uiA[10]+uiA[6]), 25);
uiA[3]^=ROTR((uiA[15]+uiA[11]), 25);

uiA[8]^=ROTR((uiA[4]+uiA[0]), 23);
uiA[13]^=ROTR((uiA[9]+uiA[5]), 23);
uiA[2]^=ROTR((uiA[14]+uiA[10]), 23);
uiA[7]^=ROTR((uiA[3]+uiA[15]), 23);

uiA[12]^=ROTR((uiA[8]+uiA[4]), 19);
uiA[1]^=ROTR((uiA[13]+uiA[9]), 19);
uiA[6]^=ROTR((uiA[2]+uiA[14]), 19);
uiA[11]^=ROTR((uiA[7]+uiA[3]), 19);

uiA[0]^=ROTR((uiA[12]+uiA[8]), 14);
uiA[5]^=ROTR((uiA[1]+uiA[13]), 14);
uiA[10]^=ROTR((uiA[6]+uiA[2]), 14);
uiA[15]^=ROTR((uiA[11]+uiA[7]), 14);

uiA[1]^=ROTR((uiA[0]+uiA[3]), 25);
uiA[6]^=ROTR((uiA[5]+uiA[4]), 25);
uiA[11]^=ROTR((uiA[10]+uiA[9]), 25);
uiA[12]^=ROTR((uiA[15]+uiA[14]), 25);

uiA[2]^=ROTR((uiA[1]+uiA[0]), 23);
uiA[7]^=ROTR((uiA[6]+uiA[5]), 23);
uiA[8]^=ROTR((uiA[11]+uiA[10]), 23);
uiA[13]^=ROTR((uiA[12]+uiA[15]), 23);

uiA[3]^=ROTR((uiA[2]+uiA[1]), 19);
uiA[4]^=ROTR((uiA[7]+uiA[6]), 19);
uiA[9]^=ROTR((uiA[8]+uiA[11]), 19);
uiA[14]^=ROTR((uiA[13]+uiA[12]), 19);

uiA[0]^=ROTR((uiA[3]+uiA[2]), 14);
uiA[5]^=ROTR((uiA[4]+uiA[7]), 14);
uiA[10]^=ROTR((uiA[9]+uiA[8]), 14);
uiA[15]^=ROTR((uiA[14]+uiA[13]), 14);
}

for (i=0; i<16; i++) uiData[i]+=uiA[i];
}

unsigned __int32 uiStorage[32768+32];
..........................
..........................
..........................

for (i=0; i<32768; i+=16)
{
for (j=0; j<16; j++) uiStorage[i+32+j]=uiStorage[i+j]^uiStorage[i+16+j];
Salsa(&uiStorage[i+32]);
}
for (i=0; i<32768; i+=32)
{
for (j=0; j<16; j++) uiStorage[i+j]^=uiStorage[i+16+j];
}
for (i=0; i<32768; i+=16)
{
j=((uiStorage[32784]&1023)<<5)+(i&31);
for (k=0; k<16; k++) uiStorage[32768+(i&31)+k]=uiStorage[32768+k]^uiStorage[32784+k]^uiStorage[j+k];
Salsa(&uiStorage[32768+(i&31)]);
}
..........................
..........................
..........................

I can't post here my semi-optimized solution. It used some precomputing and required a lot of memory. Nothing special, but I decided to keep it in a cold dark place.
ArtForz
Sr. Member
****
Offline Offline

Activity: 406
Merit: 257


View Profile
January 19, 2012, 12:45:40 PM
Last edit: January 19, 2012, 01:08:39 PM by ArtForz
 #9

Let's play a game of counting XORs, shall we?

Code:
for (i=0; i<32768; i+=16)
{
for (j=0; j<16; j++)
uiStorage[i+32+j]=uiStorage[i+j]^uiStorage[i+16+j];
Salsa(&uiStorage[i+32]);
}
for (i=0; i<32768; i+=32)
{
for (j=0; j<16; j++)
uiStorage[i+j]^=uiStorage[i+16+j];
}
for (i=0; i<32768; i+=16)
{
j=((uiStorage[32784]&1023)<<5)+(i&31);
for (k=0; k<16; k++)
uiStorage[32768+(i&31)+k]=uiStorage[32768+k]^uiStorage[32784+k]^uiStorage[j+k];
Salsa(&uiStorage[32768+(i&31)]);
}
first loop is 2048 times 1 16-dword XOR
2nd loop 1024 times 1 16-dword XOR
3rd loop 2048 times 2 16-dword XORs (sorry, a 3-input XOR is still 2 XORs)
total # of 16-dword XORs ... 7168

vs. a simple stupid scrypt(1024,1,1) (using 16-dword vectors for X and V to keep things more readable):
Code:
for (i = 0; i < 1024; i++) {
V[i*2+0] = X[0];
V[i*2+1] = X[1];
X[0] ^= X[1];
X[0] = salsa20(X[0]);
X[1] ^= X[0];
X[1] = salsa20(X[1]);
}
for (i = 0; i < 1024; i++) {
j = X[1].vec_element[0] & 1023;
X[0] ^= V[j * 2 + 0];
X[1] ^= V[j * 2 + 1];
X[0] ^= X[1]
X[0] = salsa20(X[0]);
X[1] ^= X[0];
X[1] = salsa20(X[1]);
}
let's count again...
1st loop 1024 times 2 16-dword XORs
2nd loop 1024 times 4 16-dword XORs
total # of 16-dword XORs ... 6144

Congratulations. Your implementation *increased* the total # of 16-dword XORs vs. reference scrypt by 1024... ?!?

bitcoin: 1Fb77Xq5ePFER8GtKRn2KDbDTVpJKfKmpz
i0coin: jNdvyvd6v6gV3kVJLD7HsB5ZwHyHwAkfdw
fivebells
Sr. Member
****
Offline Offline

Activity: 462
Merit: 250


View Profile
January 19, 2012, 01:03:05 PM
 #10

Your fear that someone might find a faster implementation by running one instance of it through some magical optimization software doesn't seem that plausible to me.  Roughly speaking, scrypt is based on randomly mixing up the results of a large number of SHA256 hashes, so any substantial optimization which didn't also optimize SHA256 would be pretty surprising.  Any accurate optimization would still need random access to all those hashes, which forces a very large circuit.
Come-from-Beyond (OP)
Legendary
*
Offline Offline

Activity: 2142
Merit: 1010

Newbie


View Profile
January 19, 2012, 01:07:59 PM
 #11

Let's play a game of counting XORs, shall we?

...

Congratulations. Your implementation *increased* the total # of 16-dword XORs vs. reference scrypt by 1024... ?!?

Perhaps. I didn't check it by myself but I suppose ur counting is correct. So?
Come-from-Beyond (OP)
Legendary
*
Offline Offline

Activity: 2142
Merit: 1010

Newbie


View Profile
January 19, 2012, 01:11:52 PM
 #12

Your fear that someone might find a faster implementation by running one instance of it through some magical optimization software doesn't seem that plausible to me.  Roughly speaking, scrypt is based on randomly mixing up the results of a large number of SHA256 hashes, so any substantial optimization which didn't also optimize SHA256 would be pretty surprising.  Any accurate optimization would still need random access to all those hashes, which forces a very large circuit.

I'm afraid that this circuit won't be large enough for modern FPGAs.
ArtForz
Sr. Member
****
Offline Offline

Activity: 406
Merit: 257


View Profile
January 19, 2012, 01:15:18 PM
 #13

Your fear that someone might find a faster implementation by running one instance of it through some magical optimization software doesn't seem that plausible to me.  Roughly speaking, scrypt is based on randomly mixing up the results of a large number of SHA256 hashes, so any substantial optimization which didn't also optimize SHA256 would be pretty surprising.  Any accurate optimization would still need random access to all those hashes, which forces a very large circuit.

I'm afraid that this circuit won't be large enough for modern FPGAs.
... says the guy claiming +1024 is negative.

bitcoin: 1Fb77Xq5ePFER8GtKRn2KDbDTVpJKfKmpz
i0coin: jNdvyvd6v6gV3kVJLD7HsB5ZwHyHwAkfdw
Come-from-Beyond (OP)
Legendary
*
Offline Offline

Activity: 2142
Merit: 1010

Newbie


View Profile
January 19, 2012, 01:32:18 PM
 #14

Your fear that someone might find a faster implementation by running one instance of it through some magical optimization software doesn't seem that plausible to me.  Roughly speaking, scrypt is based on randomly mixing up the results of a large number of SHA256 hashes, so any substantial optimization which didn't also optimize SHA256 would be pretty surprising.  Any accurate optimization would still need random access to all those hashes, which forces a very large circuit.

I'm afraid that this circuit won't be large enough for modern FPGAs.
... says the guy claiming +1024 is negative.

Smiley I got u.

U assumed that time of calculating of a Scrypt-based hash is proportionally to number of operations. It's wrong.

For example, i*4+i could be faster than i*5.

My implementation of Scrypt is optimized for an Intel cpu and runs faster than ur code. The 1st step I made was optimization of ur miner and I wasn't satisfied with results, so I moved to original Tarsnap code and then to implementation posted here. I took into account a REAL cpu that has limited cache for code and data, not very clever branch predictor, register stalls and that is able to process 3 instructions at once.
ArtForz
Sr. Member
****
Offline Offline

Activity: 406
Merit: 257


View Profile
January 19, 2012, 01:58:09 PM
 #15

My implementation of Scrypt is optimized for an Intel cpu and runs faster than ur code.
Reference C implementation slower than optimized asm, news at 11.

The 1st step I made was optimization of ur miner and I wasn't satisfied with results, so I moved to original Tarsnap code and then to implementation posted here. I took into account a REAL cpu that has limited cache for code and data, not very clever branch predictor, register stalls and that is able to process 3 instructions at once.
So... exactly what pooler did, except he managed it without big secrecy or talking about boolean simplification and factor 1000 speedups "if it only had worked".

Tip: Statements like that make me think "oh dear, another crypto kook".

bitcoin: 1Fb77Xq5ePFER8GtKRn2KDbDTVpJKfKmpz
i0coin: jNdvyvd6v6gV3kVJLD7HsB5ZwHyHwAkfdw
Come-from-Beyond (OP)
Legendary
*
Offline Offline

Activity: 2142
Merit: 1010

Newbie


View Profile
January 19, 2012, 02:30:58 PM
 #16

Reference C implementation slower than optimized asm, news at 11.

As I wrote:

"When I wrote it using SSE2 (calc 4 hashes at once) I got 2500 h/s from each GHz of an Intel CPU."

See?


Tip: Statements like that make me think "oh dear, another crypto kook".

I don't need tips from a guy who is so infantile. I don't know why u think that this topic has something personal against u.

Read https://bitcointalk.org/index.php?topic=55670.0 to see results of my implementation. Hashrate was proven by litecoinpool.org statistics and tests were made by other guys. Ur miner is so slow that even simple optimizations give good boost to performance (+200% bonus).

---
Starting from this point I put u into ignore list so u won't waste my time.
ArtForz
Sr. Member
****
Offline Offline

Activity: 406
Merit: 257


View Profile
January 19, 2012, 02:46:29 PM
 #17

Reference C implementation slower than optimized asm, news at 11.

As I wrote:

"When I wrote it using SSE2 (calc 4 hashes at once) I got 2500 h/s from each GHz of an Intel CPU."

See?


Tip: Statements like that make me think "oh dear, another crypto kook".

I don't need tips from a guy who is so infantile. I don't know why u think that this topic has something personal against u.

Read https://bitcointalk.org/index.php?topic=55670.0 to see results of my implementation. Hashrate was proven by litecoinpool.org statistics and tests were made by other guys. Ur miner is so slow that even simple optimizations give good boost to performance (+200% bonus).

---
Starting from this point I put u into ignore list so u won't waste my time.
*facepalm*
Again, completely avoiding the topic of outrageous claims without *any* proof.
Oh wait, the "proof" is writing a miner that's 20% slower than poolers. right.
Ignored.

bitcoin: 1Fb77Xq5ePFER8GtKRn2KDbDTVpJKfKmpz
i0coin: jNdvyvd6v6gV3kVJLD7HsB5ZwHyHwAkfdw
ssvb
Newbie
*
Offline Offline

Activity: 39
Merit: 0


View Profile
January 19, 2012, 07:13:34 PM
 #18

OK, thanks for the clarifications. Now we can see that it was just scaremongering and/or handwaving. So much for "do less calculations" and "sequential memory access" promises. In reality the demonstrated code snippet has more XOR operations and does three separate passes over 128K buffer instead of two (which means more L1 cache misses unless the hardware prefetcher saves the day). The pointless replacement of left rotations with right rotations also looks amusing.
Come-from-Beyond (OP)
Legendary
*
Offline Offline

Activity: 2142
Merit: 1010

Newbie


View Profile
January 19, 2012, 08:10:44 PM
 #19

OK, thanks for the clarifications. Now we can see that it was just scaremongering and/or handwaving. So much for "do less calculations" and "sequential memory access" promises. In reality the demonstrated code snippet has more XOR operations and does three separate passes over 128K buffer instead of two (which means more L1 cache misses unless the hardware prefetcher saves the day). The pointless replacement of left rotations with right rotations also looks amusing.

Well... There are some explanations:

Code I posted here is C++ edition of the algo (for better reading). CPU runs assembler code which looks completely different and u should know this unless u r just a coin trader. Ur post looks like an attempt to defend ArtForz' position. But everyone sees that he is just upset that there are other miners which are better then his one (pooler's, mine and I know 2 other guys who developed much better miners).

My implementation uses less calculations coz CPU process 3 instructions at once using a trick to avoid register stalls. So extra cycle between 2 others is necessary for "less calculations" and "sequential memory access". Btw, a CPU with 128K L1 per core is quite... uncommon and ur statement about 3 passes is wrong.

Regarding replacement of left rotations with right ones. I suspect that u r NOT serious. Even a young programmer won't belive that this was made for optimizations or something like that. I made it coz there was already defined function for calculating SHA-256 that used right rotation.

PS: Ask more if u have questions Smiley
Roadhog2k5
Full Member
***
Offline Offline

Activity: 131
Merit: 100



View Profile
January 19, 2012, 08:23:25 PM
 #20

u, ur, r, coz.

Pages: [1] 2 »  All
  Print  
 
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!