Scrypt based coins: Any proof that this algorithm is realy good?

Bitcoin Forum

October 02, 2024, 11:44:02 AM

Welcome, Guest. Please login or register.

News: Latest Bitcoin Core release: 27.1 [Torrent]

Home

Help

Search

Login

Register

More

Bitcoin Forum > Alternate cryptocurrencies > Altcoin Discussion > Scrypt based coins: Any proof that this algorithm is realy good?

Warning: One or more bitcointalk.org users have reported that they believe that the creator of this topic displays some red flags which make them high-risk. (Login to see the detailed trust ratings.) While the bitcointalk.org administration does not verify such claims, you should proceed with extreme caution.

Pages: « 1 [2] All

« previous topic next topic »

Author

Topic: Scrypt based coins: Any proof that this algorithm is realy good? (Read 6350 times)

ssvb

Newbie

Offline

Offline

Activity: 39
Merit: 0

Re: Scrypt based coins: Any proof that this algorithm is realy good?

January 19, 2012, 09:38:31 PM

#21

Quote from: Come-from-Beyond on January 19, 2012, 08:10:44 PM

Code I posted here is C++ edition of the algo (for better reading). CPU runs assembler code which looks completely different

So what was the purpose of dumping here this garbage C++ edition then? Why not using some pseudocode or intrinsics which would show the idea about the efficient SIMD implementation of scrypt?

Quote

My implementation uses less calculations coz CPU process 3 instructions at once using a trick to avoid register stalls. So extra cycle between 2 others is necessary for "less calculations" and "sequential memory access".

Better instructions scheduling on superscalar processors for making use of triple issue is not using "less calculations", it is more like just doing "the same calculations faster". That's an ordinary performance optimization and has nothing to do with the "quality of Scrypt" and does not help to "get rid of redundant XORs". I had a hope that you have something more impressive to show and was somewhat disappointed.

Quote

Btw, a CPU with 128K L1 per core is quite... uncommon and ur statement about 3 passes is wrong.

You have 3 loops looking like "for (i=0; i<32768; i+=16) { ... }" in your code. The first two of them are walking over the buffer sequentially one after another. And because the buffer is larger than L1 cache ("a CPU with 128K L1 per core is quite... uncommon" as you have noticed), you get L1 cache misses on each of these loops. Merging these first two loops into one would reduce the number of L1 cache misses and also save some load/store instructions. This would actually make the code look more like the reference implementation.

The thing that actually *might* make some sense is to have the data buffers for 4 hashes interleaved. So on the first loop you just run the SIMD code in exactly the same way as scalar code without the need for any shuffle instructions. However this data layout becomes problematic on the last loop because you need to access different memory areas for different hashes ('j' indexes are different for each of the simultaneously processed 4 hashes) and gather scattered data into different lanes of SIMD registers. Introducing the intermediate loop for deinterleaving the data in addition to XORing it would fix the problem. But the data needs to be interleaved again on the last loop. The disadvantage is extra interleaving/deinterleaving overhead. The advantage is no need for shuffles and possibly better instructions scheduling. Is this what your SSE2 code is actually trying to do?

Come-from-Beyond (OP)

Legendary

Offline

Offline

Activity: 2142
Merit: 1010

Newbie

Re: Scrypt based coins: Any proof that this algorithm is realy good?

January 20, 2012, 06:13:43 AM

#22

>> So what was the purpose of dumping here this garbage C++ edition then? Why not using some pseudocode or intrinsics which would show the idea about the efficient SIMD implementation of scrypt?

Coz I posted the code which could be simplified using NAND/NOR-optimizations. Assembler edition is very hard to read. I wasn't going to show efficient SIMD implementation.

>> Better instructions scheduling on superscalar processors for making use of triple issue is not using "less calculations", it is more like just doing "the same calculations faster".

English is not my mother tongue, sorry if I failed to explain my idea. In few words its like "CPU spends less time".

>> That's an ordinary performance optimization and has nothing to do with the "quality of Scrypt" and does not help to "get rid of redundant XORs". I had a hope that you have something more impressive to show and was somewhat disappointed.

Look at my function named Salsa(). Each cell from 16-vector uiA is XORed 4*2=8 times. My phrase was

"Imagine what will happen if someone get rid of redundant XORs of 16 temporary memory cells in SALSA..."

It's not about the posted code, it's about NAND/NOR-optimization. I didn't say that posted code eliminates any XORs.

>> You have 3 loops looking like "for (i=0; i<32768; i+=16) { ... }" in your code. The first two of them are walking over the buffer sequentially one after another. And because the buffer is larger than L1 cache ("a CPU with 128K L1 per core is quite... uncommon" as you have noticed), you get L1 cache misses on each of these loops.

I thought so before I tried both ways. CPU understands when code do sequential memory access and prefetches data into a cache before the code needs it.

>> Merging these first two loops into one would reduce the number of L1 cache misses and also save some load/store instructions. This would actually make the code look more like the reference implementation.

Merged loop accesses 3 different chunks of memory. This leads to more MOVDQA instructions and makes CPU to maintain 3 flows of prefetching. My edition needs only 2 flows which improve memory access.

>> The thing that actually *might* make some sense is to have the data buffers for 4 hashes interleaved. So on the first loop you just run the SIMD code in exactly the same way as scalar code without the need for any shuffle instructions. However this data layout becomes problematic on the last loop because you need to access different memory areas for different hashes ('j' indexes are different for each of the simultaneously processed 4 hashes) and gather scattered data into different lanes of SIMD registers. Introducing the intermediate loop for deinterleaving the data in addition to XORing it would fix the problem. But the data needs to be interleaved again on the last loop. The disadvantage is extra interleaving/deinterleaving overhead. The advantage is no need for shuffles and possibly better instructions scheduling. Is this what your SSE2 code is actually trying to do?

I got rid of PSHUFDs this way. I collect data from uiStorage to be XORed with 4 hashes into a register using 4 other registers as pointers (EAX, EBX, ECX, EDX). Disadvantage of PSHUFDs is much bigger than disadvantage of collecting data into a register. EAX, EBX, ECX and EDX r set in previous loop iteration, so when I realy need this data it's already in a cache.

ssvb

Newbie

Offline

Offline

Activity: 39
Merit: 0

Re: Scrypt based coins: Any proof that this algorithm is realy good?

January 20, 2012, 08:31:15 AM

#23

Quote from: Come-from-Beyond on January 20, 2012, 06:13:43 AM

Coz I posted the code which could be simplified using NAND/NOR-optimizations.

Could it really be simplified? That's the main question.

Quote

Assembler edition is very hard to read. I wasn't going to show efficient SIMD implementation.

Yes, yes. We already know that you decided not to show it, but "close researches in this direction". Assuming that the assembler edition was actually efficient and even existed in the first place...

Quote

Look at my function named Salsa(). Each cell from 16-vector uiA is XORed 4*2=8 times. My phrase was

"Imagine what will happen if someone get rid of redundant XORs of 16 temporary memory cells in SALSA..."

It's not about the posted code, it's about NAND/NOR-optimization.

So here we are again. Have you managed to do the alleged NAND/NOR-optimizations? If yes, then how many XORs does it actually eliminate?

Quote

I didn't say that posted code eliminates any XORs.

You lost me here. What are you trying to prove by posting some piece of useless code?

And what makes you think that "Scrypt is good, but [1024, 1, 1] was a bad choice"?

Come-from-Beyond (OP)

Legendary

Offline

Offline

Activity: 2142
Merit: 1010

Newbie

⇾ Re: Scrypt based coins: Any proof that this algorithm is realy good?

January 20, 2012, 09:18:05 AM

#24

>> Could it really be simplified? That's the main question.

Theoretically it could be. But like factorization of a very long number it's hard to do. And I hope noone will manage to do it at least while I'm holding a lot of LTC.

>> So here we are again. Have you managed to do the alleged NAND/NOR-optimizations? If yes, then how many XORs does it actually eliminate?

I failed to complete the optimization. Replaced it with modified salsa which does precached results lookup. Traded CPU for RAM. Qty of XORs were reduced to 50%.

>> You lost me here. What are you trying to prove by posting some piece of useless code?

It's usefull code. It's easier to optimize coz u could move it into 1st or 2nd cycle, or leave it alone. It also helps to access memory in a "symmetrical" way.

>> And what makes you think that "Scrypt is good, but [1024, 1, 1] was a bad choice"?

2nd part of Scrypt looks like jumps from one chunk of memory to other. If u count how often each chunk is accessed u'll notice that some of them rn't accessed at all and some - very often. Good parameters r supposed to make probability of access equal for all chunks. That's how GOOD random number generators work. With [1024, 1, 1] I always got BAD distribution of access probability. So [1024, 1, 1] is a BAD choice.

Pages: « 1 [2] All

Bitcoin Forum > Alternate cryptocurrencies > Altcoin Discussion > Scrypt based coins: Any proof that this algorithm is realy good?

« previous topic next topic »

Jump to:

Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines