BTW, I am annoyed by GPU-mineable coins, as they clearly give an advantage to riggers over ordinary people.
I tell ya; the mining business is rigged...
|
|
|
Can you sum up for not_so_smart people? Is Bytecoin super ASIC proof or ?
We can only talk about ASIC resistance. To be "ASIC-proof" would mean that no ASIC implementation is possible, because the algorithm requires way more memory than can fit on an ASIC in the foreseeable future. Even then, an ASIC solution is possible, but it just wouldn't be self-contained and would still need to be hooked up to some DIMMs. So I would say Bytecoin is ASIC-resistant, and significantly more so than scrypt. I'm not a big fan of the word super, which gets abused a lot. Like all these hash-mish-mash algorithms being called "super-secure". That's just "super-laughable"...
|
|
|
This gives X a great opportunity to implement hidden features in his compiled wallet. X can basically do/take whatever he/she wants from your computer without you ever knowing about it. It does not matter if you encrypt your other wallets; the infected wallet might log everything that is typed, thereby revealing the password to X. This is just an example of what X is capable of.
The beauty of this scam is that the coin itself works. Those who compile their wallets are not affected, while those who downloaded the compiled wallet may be exposed to theft.
I do not know if this issue has been raised by anyone earlier, but I hope this raises some awareness of the danger associated with jumping on that new coin and downloading a newly released wallet.
If you insist on running precompiled binaries from anonymous coin developers, then install it on a machine from which all sensitive data has been wiped, let the miner run for a few days while studying the wallet source code. When profitability drops off, save just the wallet private keys; wipe the machine again, and compile the now verified wallet source code to move the mining proceeds away to a safe newly generated address. Nah; that sounds like way too much trouble. Let's just trust the developer instead...
|
|
|
1) Cache sizes slowly grow over time (Moore's law). Currently, high-end x86 has 2.5MB / core. The proof-of-work should use as much of this as possible. You want each core running its own instance while fully utilizing its cache. Ideally, the dynamic difficulty adjustment should be able to increase the memory requirement, so as to keep up with hardware improvements.
It's interesting though. RAM is not that much slower than L3 cache,* and in fact this algorithm runs faster with two instances per core (with hyperthreading), suggesting that cache doesn't really matter here. *RAM = 60 cycles, unshared L3 = 40 cycles. See page 22 of https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf

Thanks for the reference. That's a much smaller difference than I thought. I just ran some tests with the simple cuckoo miner; here's how the runtime scales with memory on a Xeon with a total of 12MB L3 cache:

1M    0.075s
2M    0.15s
4M    0.32s
8M    0.7s
16M   1.9s
32M   5.5s
64M   13.7s
128M  31s

So we do see an extra slowdown when it crosses the L3 cache size, but it's indeed not very dramatic. This suggests that a memory-bound proof-of-work should just try to use as much memory as possible to frustrate GPUs/ASICs rather than try to optimize for L3 cache size.
|
|
|
Arguing for ASIC-resistance due to perceived high cost of 1MB is silly though. (gridseed LTC ASICs already have 512KB on board) The original claim doesn't even mention cost. It should elaborate on what they mean by ASIC pipeline though, as that has me confused...
There is something to be said for a cache-oriented proof-of-work though, of which there are not many examples. There are several concerns:

1) Cache sizes slowly grow over time (Moore's law). Currently, high-end x86 has 2.5MB / core. The proof-of-work should use as much of this as possible. You want each core running its own instance while fully utilizing its cache. Ideally, the dynamic difficulty adjustment should be able to increase the memory requirement, so as to keep up with hardware improvements.

2) There must be no easy memory-time trade-off (like scrypt has). Memory accesses must depend on earlier ones, so that they must necessarily be done in sequential order. Otherwise, a GPU/FPGA/ASIC will gain a lot from increased parallelism.

3) The amount of computation per memory access must be minimized in order for the core to sustain a very high frequency of (cache) memory accesses. The point is that ASICs can always do computation much faster, but they cannot do random memory access much faster.

If you get these things right, then a GPU will not be competitive. Certainly, CryptoNight is an improvement on scrypt on all 3 counts above (for item 3, mostly due to its smaller block size, 16 bytes vs 128 bytes).
|
|
|
what the hell is cum_difficulty
Must be cumulative difficulty...
|
|
|
Myriad. It just works. Everybody can mine, equal, no forks, no 51% attacks, no IPO, no premine, good devs, projects for the future. What more can you ask from a coin?
They had the option of creating some serious variety of algorithms, but mostly wasted it by using 4 almost interchangeable Bitcoin-like hash functions, plus the altcoin default, scrypt. I look forward to more variety in the future, perhaps with some CPU-friendly choice as well...
|
|
|
Cache is by far the largest part of the die of any modern CPU.
Not by far. Cache area is usually close in size to Core area.
|
|
|
I'll stick with CryptoNight vs Cuckoo Cycle, assuming a size 2^{32} for the latter:

                 CryptoNight     Cuckoo Cycle
memory           2MB             4GB
memory-hardness  solid           can run twice as slow in 3GB, or 50% faster
                                 in 6GB; hardness hypothesized, not provable
parallelizable   not             at least 20+ threads (probably less than 100)
access pattern   random          random
latency          low (cache)     high (main memory)
verification     slow            instant
%computation     probably > 90%  33%
%memory          probably < 10%  67%
Can you explain what latency means in this particular case? I mean I know it's some kind of delay between something but what stands for "something" here?

Latency is the time between the start of a memory access instruction and its end, when the value has been retrieved and is available for use.

And from your numbers I assume that cache is much faster compared to main memory?

Yes; cache is at least an order of magnitude faster than main memory. Note that CryptoNight also does a lot more computation per random memory access (AES encryption) than Cuckoo Cycle (siphash-2-4), further contributing to its high computational load.
|
|
|
Arguing for ASIC-resistance due to perceived high cost of 1MB is silly though. (gridseed LTC ASICs already have 512MB on board) The original claim doesn't even mention cost. It should elaborate on what they mean by ASIC pipeline though, as that has me confused...
Did gridseed announce the cost of this ASIC? Also, will CPUs get bigger (in physical size) in order to have 512MB cache on board? Sorry for annoying you, I really don't know much about it. Curiosity.

Sorry; I meant 512KB, not MB:) 512KB cache is very modest for a modern CPU, but could be a large fraction of a scrypt ASIC.
|
|
|
2. A megabyte of internal memory is an almost unacceptable size for a modern ASIC pipeline;

Does it really? If we assume that the scrypt-ASIC era is not far from now, maybe 1MB of internal memory is not too much. Questionable ASIC resistance.

If Intel can put 37.5MB of cache on a Xeon, or more than 1MB on a cheap Celeron, then 1MB can hardly be called "almost unacceptable".

As far as I know cache is the most expensive part of the CPU. (correct me if I'm wrong) So even a 1MB increase of cache is $$$. And ASIC developers are not wealthy enough to cover these expenses.

There is no "most expensive part of the CPU". Cost is dominated by die size and fault rate. It's true that 37.5MB takes a lot of die space, so that certainly contributes to cost. Arguing for ASIC-resistance due to perceived high cost of 1MB is silly though (gridseed LTC ASICs already have 512MB on board). The original claim doesn't even mention cost. It should elaborate on what they mean by ASIC pipeline though, as that has me confused...
|
|
|
I'll stick with CryptoNight vs Cuckoo Cycle, assuming a size 2^{32} for the latter:

                 CryptoNight     Cuckoo Cycle
memory           2MB             4GB
memory-hardness  solid           can run twice as slow in 3GB, or 50% faster
                                 in 6GB; hardness hypothesized, not provable
parallelizable   not             at least 20+ threads (probably less than 100)
access pattern   random          random
latency          low (cache)     high (main memory)
verification     slow            instant
%computation     probably > 90%  33%
%memory          probably < 10%  67%
Can you explain what latency means in this particular case? I mean I know it's some kind of delay between something but what stands for "something" here?

Latency is the time between the start of a memory access instruction and its end, when the value has been retrieved and is available for use.
|
|
|
2. A megabyte of internal memory is an almost unacceptable size for a modern ASIC pipeline;

Does it really? If we assume that the scrypt-ASIC era is not far from now, maybe 1MB of internal memory is not too much. Questionable ASIC resistance.

If Intel can put 37.5MB of cache on a Xeon, or more than 1MB on a cheap Celeron, then 1MB can hardly be called "almost unacceptable".
|
|
|
As it's said in their whitepaper, CryptoNight is "a new memory-bound algorithm for the proof-of-work pricing function. It relies on random access to a slow memory and emphasizes latency dependence. As opposed to scrypt every new block (64 bytes in length) depends on all the previous blocks. As a result a hypothetical "memory-saver" should increase his calculation speed exponentially". More info you can get by yourself here. So scrypt gives us ASIC-resistance, but this algo is much better in this field.

From 128KB (scrypt) to 2MB (CryptoNight) is a factor 16 increase in memory usage. To improve ASIC-resistance further, you'd like to require hundreds of MB, but that would make verification even slower than it already is. Other memory-bound proofs-of-work avoid this problem (at the cost of introducing more parallelism) by having an asymmetry between computation and verification. E.g. Momentum, memory-coin 2, my own Cuckoo Cycle, or Coelho's scheme ("An (Almost) Constant-Effort Solution-Verification Proof-of-Work Protocol based on Merkle Trees", AfriCrypt 2008, Fabien Coelho).

Nice to see you took note of this, that's good news! Would you mind going over some of the parallels/differences between some of these PoWs compared to this one? I'd especially like a comparison between this one and yours, as I've read some of your posts about it and have a little bit of an understanding. I'm not really familiar with MMC2 (I know it uses AES, but not much more) or Coelho's scheme though.

I'll stick with CryptoNight vs Cuckoo Cycle, assuming a size 2^{32} for the latter:

                 CryptoNight     Cuckoo Cycle
memory           2MB             512MB
memory-hardness  solid           can run twice as slow in 384MB, or 50% faster
                                 in 768MB; hardness hypothesized, not proven
parallelizable   not             at least 20+ threads (probably less than 100)
access pattern   random          random
latency          low (cache)     high (main memory)
verification     slow            instant
%computation     probably > 90%  33%
%memory          probably < 10%  67%
|
|
|
As it's said in their whitepaper, CryptoNight is "a new memory-bound algorithm for the proof-of-work pricing function. It relies on random access to a slow memory and emphasizes latency dependence. As opposed to scrypt every new block (64 bytes in length) depends on all the previous blocks. As a result a hypothetical "memory-saver" should increase his calculation speed exponentially". More info you can get by yourself here. So scrypt gives us ASIC-resistance, but this algo is much better in this field.

From 128KB (scrypt) to 2MB (CryptoNight) is a factor 16 increase in memory usage. To improve ASIC-resistance further, you'd like to require hundreds of MB, but that would make verification even slower than it already is. Other memory-bound proofs-of-work avoid this problem (at the cost of introducing more parallelism) by having an asymmetry between computation and verification. E.g. Momentum, memory-coin 2, my own Cuckoo Cycle, or Coelho's scheme ("An (Almost) Constant-Effort Solution-Verification Proof-of-Work Protocol based on Merkle Trees", AfriCrypt 2008, Fabien Coelho).
|
|
|
Yo folks! Where can I read more about the ring signature? Besides the CryptoNote website. I meant more technical details; you can find a lot of basic info on CryptoNote.org but not the details. Or I'm just blind.

Have you tried the white paper and source code? This is all that the whitepaper has on the CryptoNight proof-of-work:

5.2 The proposed algorithm

We propose a new memory-bound algorithm for the proof-of-work pricing function. It relies on random access to a slow memory and emphasizes latency dependence. As opposed to scrypt every new block (64 bytes in length) depends on all the previous blocks. As a result a hypothetical “memory-saver” should increase his calculation speed exponentially. Our algorithm requires about 2 Mb per instance for the following reasons:

1. It fits in the L3 cache (per core) of modern processors, which should become mainstream in a few years;

2. A megabyte of internal memory is an almost unacceptable size for a modern ASIC pipeline;

3. GPUs may run hundreds of concurrent instances, but they are limited in other ways: GDDR5 memory is slower than the CPU L3 cache and remarkable for its bandwidth, not random access speed.

4. Significant expansion of the scratchpad would require an increase in iterations, which in turn implies an overall time increase. “Heavy” calls in a trust-less p2p network may lead to serious vulnerabilities, because nodes are obliged to check every new block’s proof-of-work. If a node spends a considerable amount of time on each hash evaluation, it can be easily DDoSed by a flood of fake objects with arbitrary work data (nonce values).
In item 4 they admit that this scheme, like scrypt-N, has a major drawback in being rather slow to verify. But in the preceding section, reproduced below, they point out a weakness in scrypt that the new algorithm should have remedied:

Moreover, the scrypt construction itself allows a linear trade-off between memory size and CPU speed due to the fact that every block in the scratchpad is derived only from the previous. For example, you can store every second block and recalculate the others in a lazy way, i.e. only when it becomes necessary. The pseudo-random indexes are assumed to be uniformly distributed, hence the expected value of the additional blocks’ recalculations is ½·N, where N is the number of iterations. The overall computation time increases less than by half because there are also time independent (constant time) operations such as preparing the scratchpad and hashing on every iteration. ... This in turn implies that a machine with a CPU 2200 times faster than the modern chips can store only 320 bytes of the scratchpad.
If you really want details, you'll have to consult the code:

#define MEMORY         (1 << 21) /* 2 MiB */
#define ITER           (1 << 20)
#define AES_BLOCK_SIZE 16
#define AES_KEY_SIZE   32 /*16*/
#define INIT_SIZE_BLK  8
#define INIT_SIZE_BYTE (INIT_SIZE_BLK * AES_BLOCK_SIZE)
...
void cn_slow_hash(const void *data, size_t length, char *hash) {
  uint8_t long_state[MEMORY];
  union cn_slow_hash_state state;
  uint8_t text[INIT_SIZE_BYTE];
  uint8_t a[AES_BLOCK_SIZE];
  uint8_t b[AES_BLOCK_SIZE];
  uint8_t c[AES_BLOCK_SIZE];
  uint8_t d[AES_BLOCK_SIZE];
  size_t i, j;
  uint8_t aes_key[AES_KEY_SIZE];
  OAES_CTX* aes_ctx;
  hash_process(&state.hs, data, length);
  memcpy(text, state.init, INIT_SIZE_BYTE);
  memcpy(aes_key, state.hs.b, AES_KEY_SIZE);
  aes_ctx = oaes_alloc();
  for (i = 0; i < MEMORY / INIT_SIZE_BYTE; i++) {
    for (j = 0; j < INIT_SIZE_BLK; j++) {
      oaes_key_import_data(aes_ctx, aes_key, AES_KEY_SIZE);
      oaes_pseudo_encrypt_ecb(aes_ctx, &text[AES_BLOCK_SIZE * j]);
      /*memcpy(aes_key, &text[AES_BLOCK_SIZE * j], AES_KEY_SIZE);*/
      memcpy(aes_key, state.hs.b, AES_KEY_SIZE);
    }
    memcpy(&long_state[i * INIT_SIZE_BYTE], text, INIT_SIZE_BYTE);
  }
  for (i = 0; i < 16; i++) {
    a[i] = state.k[     i] ^ state.k[32 + i];
    b[i] = state.k[16 + i] ^ state.k[48 + i];
  }
  for (i = 0; i < ITER / 2; i++) {
    /* Dependency chain: address -> read value ------+
     * written value <-+ hard function (AES or MUL) <+
     * next address  <-+
     */
    /* Iteration 1 */
    j = e2i(a, MEMORY / AES_BLOCK_SIZE);
    copy_block(c, &long_state[j * AES_BLOCK_SIZE]);
    oaes_encryption_round(a, c);
    xor_blocks(b, c);
    swap_blocks(b, c);
    copy_block(&long_state[j * AES_BLOCK_SIZE], c);
    assert(j == e2i(a, MEMORY / AES_BLOCK_SIZE));
    swap_blocks(a, b);
    /* Iteration 2 */
    j = e2i(a, MEMORY / AES_BLOCK_SIZE);
    copy_block(c, &long_state[j * AES_BLOCK_SIZE]);
    mul(a, c, d);
    sum_half_blocks(b, d);
    swap_blocks(b, c);
    xor_blocks(b, c);
    copy_block(&long_state[j * AES_BLOCK_SIZE], c);
    assert(j == e2i(a, MEMORY / AES_BLOCK_SIZE));
    swap_blocks(a, b);
  }
  memcpy(text, state.init, INIT_SIZE_BYTE);
  for (i = 0; i < MEMORY / INIT_SIZE_BYTE; i++) {
    for (j = 0; j < INIT_SIZE_BLK; j++) {
      /*oaes_key_import_data(aes_ctx, &long_state[i * INIT_SIZE_BYTE + j * AES_BLOCK_SIZE], AES_KEY_SIZE);*/
      oaes_key_import_data(aes_ctx, &state.hs.b[32], AES_KEY_SIZE);
      xor_blocks(&text[j * AES_BLOCK_SIZE], &long_state[i * INIT_SIZE_BYTE + j * AES_BLOCK_SIZE]);
      oaes_pseudo_encrypt_ecb(aes_ctx, &text[j * AES_BLOCK_SIZE]);
    }
  }
  memcpy(state.init, text, INIT_SIZE_BYTE);
  hash_permutation(&state.hs);
  /*memcpy(hash, &state, 32);*/
  extra_hashes[state.hs.b[0] & 3](&state, 200, hash);
  oaes_free(&aes_ctx);
}
|
|
|
Bucketing at the size you're doing isn't about page faults, it's more about cache misses. A cache miss causes a stalled cycle.
Also, you don't mean page faults: You mean TLB misses. Page faults are huge and rare. A TLB miss, at various levels of the hierarchy, is common.
You're right about the TLBs. Bucketing reduces TLB load misses from 2.8 billion to 0.56 billion (though it increases TLB store misses from 1.7 million to 467 million). My stalled cycles do not appear to be due to cache misses though: the bucketed version has slightly more cache misses, 3.51 billion versus 3.14 billion. Maybe the stalled cycles are also due to the TLB misses?!
|
|
|
Sure, it's a 1024-bit word. I think of it that way because it's the word size of the memory on the ideal hardware for scrypt(N=1).
So, therefore, make the memory have 2^N words where N is the number of bits read or written in each memory transaction. So, for example, 2^32 entries each being 32 bits. Or 2^16 entries of 16 bits each. Then the "backward pointers" cost as much space as they save.
Hmm, I see what you mean. Maybe you don't even need separate filling-memory and reading-memory phases. Something like (assuming a hash function mapping 1024 bits to 1024 bits):

state = hash(header)
do 2^32 times {
  interpret state as 16 pairs (i,v) of 32-bit values; for each pair set mem[i] := v
  state = hash(state)
  interpret state as a vector of 32-bit indices; mix in all 32 mem[i]
  state = hash(state)
}
Hmm; that still needs an initialization phase, unless we assume the memory starts out all 0. But I still don't see these schemes as viable PoWs due to the lack of quick verifiability...
|
|
|
Worse, N/(lookup gap), so for lookup gap 4 this map will take only 256 bytes.
What is a lookup gap? That was smolen's remark, not mine:-(
|
|
|
Okay, I was able to make a first pass. I want to call out three really awesome contributions you've made:
1. Identifying cycles as the structure to be sought (this is a big deal, more on this later)
2. Making the memory size be (at least the high-order part of) the difficulty. This avoids having to pick magic numbers like "16GB" out of the air.
3. Using siphash() -- very cool, I've updated my initial post to mention this.
Typo: second page, "proof-of-work system. which". I think the period should be a comma.
I think cycles could be substituted by other structures, like cliques. Obviously you'd need to use plain instead of bipartite graphs, and a larger fraction of edges, to get something like a 10-clique occurring with non-negligible probability. I originally used sha256 to generate edges. I was very happy when I later realized I could use this lightweight function instead. Being keyed made its use even more natural. I introduced several typos when hastily rewriting parts of the paper to reflect the shift of focus from memory-hardness (since it's not as tmto-proof as I thought it was) to being graph-theoretical. So please just gloss over the obvious grammatical mistakes. They'll get fixed in the weeks ahead.

Disclaimer: I didn't read the paper on cuckoo hash tables because your description of the proof-of-work (an undirected cycle of length >L in the graph of the converse composition of two hash functions with disjoint ranges) made sense right away. Let me know if I missed anything as a result. Or if I misunderstood the POW, but I think I got it.
I don't know what you mean by converse composition, given that the two hash functions are not injective. Also, I look for cycles of length exactly L rather than > L. One thing that nagged me was that the size of the proof scales with L. Since the proof has to go in the block headers, which even SPV clients need, it's the most space-sensitive part of a cryptocoin. Could one look for a cycle instead in the graph of a hash function postcomposed with "modulo N" where N is the targeted memory size? Then you can prove a cycle of length L by giving any element in the cycle, and a verifier can check your proof with O(log N)-memory and O(L)-time. This is partly why I'm excited about cycles as the proof. Also, in a sense this is what scrypt is, except that it asks for a path rather than a cycle and the hash function is H(n)=salsa^n(0) (where salsa^0(x)=x and salsa^(n+1)(x)=salsa(salsa^n(x))) -- the write-phase is a sort of Ackermann construction.
Yes, proof size of 168 bytes, or Lx4 bytes in general, is a downside. If you're suggesting looking for cycles in a hash function with equal domain and range, that doesn't need any memory. You can use either Floyd's or Brent's cycle detection algorithm in constant space.
|
|
|
|