Bitcoin Forum
April 27, 2024, 04:22:44 AM *
News: Latest Bitcoin Core release: 27.0 [Torrent]
 
   Home   Help Search Login Register More  
Warning: One or more bitcointalk.org users have reported that they believe that the creator of this topic displays some red flags which make them high-risk. (Login to see the detailed trust ratings.) While the bitcointalk.org administration does not verify such claims, you should proceed with extreme caution.
Pages: « 1 [2]  All
  Print  
Author Topic: Scrypt based coins: Any proof that this algorithm is realy good?  (Read 6323 times)
ssvb
Newbie
*
Offline Offline

Activity: 39
Merit: 0


View Profile
January 19, 2012, 09:38:31 PM
 #21

Code I posted here is C++ edition of the algo (for better reading). CPU runs assembler code which looks completely different
So what was the purpose of dumping here this garbage C++ edition then? Why not using some pseudocode or intrinsics which would show the idea about the efficient SIMD implementation of scrypt?

Quote
My implementation uses less calculations coz CPU process 3 instructions at once using a trick to avoid register stalls. So extra cycle between 2 others is necessary for "less calculations" and "sequential memory access".
Better instructions scheduling on superscalar processors for making use of triple issue is not using "less calculations", it is more like just doing "the same calculations faster". That's an ordinary performance optimization and has nothing to do with the "quality of Scrypt" and does not help to "get rid of redundant XORs". I had a hope that you have something more impressive to show and was somewhat disappointed.

Quote
Btw, a CPU with 128K L1 per core is quite... uncommon and ur statement about 3 passes is wrong.
You have 3 loops looking like "for (i=0; i<32768; i+=16) { ... }" in your code. The first two of them are walking over the buffer sequentially one after another. And because the buffer is larger than L1 cache ("a CPU with 128K L1 per core is quite... uncommon" as you have noticed), you get L1 cache misses on each of these loops. Merging these first two loops into one would reduce the number of L1 cache misses and also save some load/store instructions. This would actually make the code look more like the reference implementation.

The thing that actually *might* make some sense is to have the data buffers for 4 hashes interleaved. So on the first loop you just run the SIMD code in exactly the same way as scalar code without the need for any shuffle instructions. However this data layout becomes problematic on the last loop because you need to access different memory areas for different hashes ('j' indexes are different for each of the simultaneously processed 4 hashes) and gather scattered data into different lanes of SIMD registers. Introducing the intermediate loop for deinterleaving the data in addition to XORing it would fix the problem. But the data needs to be interleaved again on the last loop. The disadvantage is extra interleaving/deinterleaving overhead. The advantage is no need for shuffles and possibly better instructions scheduling. Is this what your SSE2 code is actually trying to do?
1714191764
Hero Member
*
Offline Offline

Posts: 1714191764

View Profile Personal Message (Offline)

Ignore
1714191764
Reply with quote  #2

1714191764
Report to moderator
1714191764
Hero Member
*
Offline Offline

Posts: 1714191764

View Profile Personal Message (Offline)

Ignore
1714191764
Reply with quote  #2

1714191764
Report to moderator
1714191764
Hero Member
*
Offline Offline

Posts: 1714191764

View Profile Personal Message (Offline)

Ignore
1714191764
Reply with quote  #2

1714191764
Report to moderator
Whoever mines the block which ends up containing your transaction will get its fee.
Advertised sites are not endorsed by the Bitcoin Forum. They may be unsafe, untrustworthy, or illegal in your jurisdiction.
1714191764
Hero Member
*
Offline Offline

Posts: 1714191764

View Profile Personal Message (Offline)

Ignore
1714191764
Reply with quote  #2

1714191764
Report to moderator
Come-from-Beyond (OP)
Legendary
*
Offline Offline

Activity: 2142
Merit: 1009

Newbie


View Profile
January 20, 2012, 06:13:43 AM
 #22

>> So what was the purpose of dumping here this garbage C++ edition then? Why not using some pseudocode or intrinsics which would show the idea about the efficient SIMD implementation of scrypt?

Coz I posted the code which could be simplified using NAND/NOR-optimizations. Assembler edition is very hard to read. I wasn't going to show efficient SIMD implementation.

>> Better instructions scheduling on superscalar processors for making use of triple issue is not using "less calculations", it is more like just doing "the same calculations faster".

English is not my mother tongue, sorry if I failed to explain my idea. In few words its like "CPU spends less time".

>> That's an ordinary performance optimization and has nothing to do with the "quality of Scrypt" and does not help to "get rid of redundant XORs". I had a hope that you have something more impressive to show and was somewhat disappointed.

Look at my function named Salsa(). Each cell from 16-vector uiA is XORed 4*2=8 times. My phrase was

"Imagine what will happen if someone get rid of redundant XORs of 16 temporary memory cells in SALSA..."

It's not about the posted code, it's about NAND/NOR-optimization. I didn't say that posted code eliminates any XORs.

>> You have 3 loops looking like "for (i=0; i<32768; i+=16) { ... }" in your code. The first two of them are walking over the buffer sequentially one after another. And because the buffer is larger than L1 cache ("a CPU with 128K L1 per core is quite... uncommon" as you have noticed), you get L1 cache misses on each of these loops.

I thought so before I tried both ways. CPU understands when code do sequential memory access and prefetches data into a cache before the code needs it.

>> Merging these first two loops into one would reduce the number of L1 cache misses and also save some load/store instructions. This would actually make the code look more like the reference implementation.

Merged loop accesses 3 different chunks of memory. This leads to more MOVDQA instructions and makes CPU to maintain 3 flows of prefetching. My edition needs only 2 flows which improve memory access.

>> The thing that actually *might* make some sense is to have the data buffers for 4 hashes interleaved. So on the first loop you just run the SIMD code in exactly the same way as scalar code without the need for any shuffle instructions. However this data layout becomes problematic on the last loop because you need to access different memory areas for different hashes ('j' indexes are different for each of the simultaneously processed 4 hashes) and gather scattered data into different lanes of SIMD registers. Introducing the intermediate loop for deinterleaving the data in addition to XORing it would fix the problem. But the data needs to be interleaved again on the last loop. The disadvantage is extra interleaving/deinterleaving overhead. The advantage is no need for shuffles and possibly better instructions scheduling. Is this what your SSE2 code is actually trying to do?

I got rid of PSHUFDs this way. I collect data from uiStorage to be XORed with 4 hashes into a register using 4 other registers as pointers (EAX, EBX, ECX, EDX). Disadvantage of PSHUFDs is much bigger than disadvantage of collecting data into a register. EAX, EBX, ECX and EDX r set in previous loop iteration, so when I realy need this data it's already in a cache.
ssvb
Newbie
*
Offline Offline

Activity: 39
Merit: 0


View Profile
January 20, 2012, 08:31:15 AM
 #23

Coz I posted the code which could be simplified using NAND/NOR-optimizations.
Could it really be simplified? That's the main question.

Quote
Assembler edition is very hard to read. I wasn't going to show efficient SIMD implementation.
Yes, yes. We already know that you decided not to show it, but "close researches in this direction". Assuming that the assembler edition was actually efficient and even existed in the first place...

Quote
Look at my function named Salsa(). Each cell from 16-vector uiA is XORed 4*2=8 times. My phrase was

"Imagine what will happen if someone get rid of redundant XORs of 16 temporary memory cells in SALSA..."

It's not about the posted code, it's about NAND/NOR-optimization.
So here we are again. Have you managed to do the alleged NAND/NOR-optimizations? If yes, then how many XORs does it actually eliminate?

Quote
I didn't say that posted code eliminates any XORs.
You lost me here. What are you trying to prove by posting some piece of useless code?

And what makes you think that "Scrypt is good, but [1024, 1, 1] was a bad choice"?
Come-from-Beyond (OP)
Legendary
*
Offline Offline

Activity: 2142
Merit: 1009

Newbie


View Profile
January 20, 2012, 09:18:05 AM
 #24

>> Could it really be simplified? That's the main question.

Theoretically it could be. But like factorization of a very long number it's hard to do. And I hope noone will manage to do it at least while I'm holding a lot of LTC.

>> So here we are again. Have you managed to do the alleged NAND/NOR-optimizations? If yes, then how many XORs does it actually eliminate?

I failed to complete the optimization. Replaced it with modified salsa which does precached results lookup. Traded CPU for RAM. Qty of XORs were reduced to 50%.

>> You lost me here. What are you trying to prove by posting some piece of useless code?

It's usefull code. It's easier to optimize coz u could move it into 1st or 2nd cycle, or leave it alone. It also helps to access memory in a "symmetrical" way.

>> And what makes you think that "Scrypt is good, but [1024, 1, 1] was a bad choice"?

2nd part of Scrypt looks like jumps from one chunk of memory to other. If u count how often each chunk is accessed u'll notice that some of them rn't accessed at all and some - very often. Good parameters r supposed to make probability of access equal for all chunks. That's how GOOD random number generators work. With [1024, 1, 1] I always got BAD distribution of access probability. So [1024, 1, 1] is a BAD choice.
Pages: « 1 [2]  All
  Print  
 
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!