neuroMode
|
|
April 03, 2014, 07:56:32 AM |
|
Us boys over at MyriadCoin will keep an eye on this algorithm
|
|
|
|
tromp (OP)
Legendary
Offline
Activity: 990
Merit: 1110
|
|
April 07, 2014, 03:19:14 AM |
|
Dave has shown my algorithm is not as tmto-hard as I thought it was. Excellent work, Dave! For now, I amended my README at https://github.com/tromp/cuckoo with the following UPDATE: Dave Anderson proposed an alternative algorithm on his blog http://da-data.blogspot.com/2014/03/a-public-review-of-cuckoo-cycle.html that uses significantly less memory than mine at potentially less than an order of magnitude slowdown. I hope to soon implement his algorithm and quantify the "tomato" (his pronounceable spelling of tmto, or time-memory trade-off).
Baseline: My code uses 10x less memory and runs 2x slower (on a laptop instead of an i7). At 6x less memory, it should have speed parity. Could be optimized substantially - I sent you some very hacked up code. Happy to post it publicly if anyone's interested, but it's trashy - I just wrote it as a proof of concept for the pre-filtering before invoking the cycle finder. But it gets the point across.
A complete working implementation of Dave's scheme is now available on my github. Makefile targets cuckooNN still use the old memory-hungry algorithm, while targets tomatoNN use his edge trimming (12 rounds by default). I haven't figured out how to achieve speed parity :-( Suggestions for further improvement are more than welcome!
-John
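For readers following along, here is a minimal sketch of the edge-trimming idea (not the actual tomatoNN code; sipnode below is only a stand-in for the siphash-based endpoint function, and the sizes are illustrative): count how many live edges touch each node on one side, then discard edges whose endpoint has degree 1, since such edges cannot lie on a cycle.

```cpp
// Minimal sketch of one edge-trimming round; NOT the actual tomatoNN code.
// sipnode() stands in for the siphash-2-4 based endpoint function, and the
// sizes are illustrative only.
#include <cstdint>
#include <vector>

const uint32_t NEDGES = 1 << 20;
const uint32_t NNODES = 1 << 20;

// Stand-in for the real siphash-based mapping of an edge to one of its endpoints.
uint32_t sipnode(uint32_t nonce, int uorv) {
  uint64_t x = (uint64_t)nonce * 2654435761u + (uint32_t)uorv;
  x ^= x >> 29; x *= 0x9E3779B97F4A7C15ull; x ^= x >> 32;
  return (uint32_t)(x % NNODES);
}

// One trimming round on side uorv (0 = U endpoints, 1 = V endpoints).
void trim_round(std::vector<bool>& alive, int uorv) {
  std::vector<uint8_t> degree(NNODES, 0);
  // Pass 1: count live edges per node (saturating at 2 is enough).
  for (uint32_t e = 0; e < NEDGES; e++)
    if (alive[e]) {
      uint8_t& d = degree[sipnode(e, uorv)];
      if (d < 2) d++;
    }
  // Pass 2: an edge whose endpoint has degree 1 cannot be part of a cycle.
  for (uint32_t e = 0; e < NEDGES; e++)
    if (alive[e] && degree[sipnode(e, uorv)] < 2)
      alive[e] = false;
}
```

Repeating this alternately on the U and V sides (12 rounds by default, per the post above) leaves only a small fraction of edges for the cycle finder to process.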
|
|
|
|
tromp (OP)
Legendary
Offline
Activity: 990
Merit: 1110
|
|
April 07, 2014, 04:20:28 PM |
|
I haven't figured out how to achieve speed parity :-(
Suggestions for further improvement are more than welcome!
Next on the to-do list is another of Dave's suggestions: to bucketize the array of nodes to work on, to improve main-memory locality. That will help the edge trimming, which works on only one edge endpoint at a time, but won't help the original algorithm, which needs to work on both edge endpoints at the same time. So that may well achieve speed-parity...
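A rough sketch of what that bucketing could look like (illustrative names and sizes; not the code from the cuckoo repo): group node indices by their high bits, so the subsequent counting pass walks a small, cache- and TLB-friendly slice of the counter array at a time.

```cpp
// Illustrative sketch of bucketizing node indices by their high bits; not the
// code from https://github.com/tromp/cuckoo. Sizes are arbitrary examples.
#include <cstdint>
#include <vector>

const int NODEBITS   = 20;                 // nodes are 20-bit indices here
const int BUCKETBITS = 8;                  // 256 buckets
const uint32_t NBUCKETS = 1u << BUCKETBITS;

void bucketize(const std::vector<uint32_t>& nodes,
               std::vector<std::vector<uint32_t>>& buckets) {
  buckets.assign(NBUCKETS, {});
  for (uint32_t u : nodes)
    // Nodes with the same high bits land in the same bucket, so processing one
    // bucket at a time touches only a narrow range of the degree-counter array.
    buckets[u >> (NODEBITS - BUCKETBITS)].push_back(u);
}
```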
|
|
|
|
tromp (OP)
Legendary
Offline
Activity: 990
Merit: 1110
|
|
April 08, 2014, 03:21:30 PM |
|
I haven't figured out how to achieve speed parity :-(
Suggestions for further improvement are more than welcome!
Next on the to-do list is another of Dave's suggestions: to bucketize the array of nodes to work on, to improve main-memory locality. That will help the edge trimming, which works on only one edge endpoint at a time, but won't help the original algorithm, which needs to work on both edge endpoints at the same time. So that may well achieve speed-parity...
And so it did! Reducing the number of page faults is a big gain. Edge trimming is 33% faster single-threaded, but the multi-threading speedups are very erratic: 7-11 threads are all worse than 6 threads, 12 is much faster, things slow down again through 19 threads, and finally the speedup more than doubles at 20 threads. The new code is up at https://github.com/tromp/cuckoo, as Makefile targets cuckooNN (for various values of NN). The old code is now only useful for small instances and relative easiness > 50% (at which edge trimming fails).
|
|
|
|
tromp (OP)
Legendary
Offline
Activity: 990
Merit: 1110
|
|
April 08, 2014, 06:06:48 PM |
|
The new code is up at https://github.com/tromp/cuckoo, as Makefile targets cuckooNN (for various values of NN). The old code is now only useful for small instances and relative easiness > 50% (at which edge trimming fails).
Apparently this is due to the use of C++ atomic datatypes. Experiments using plain non-atomic datatypes instead show negligible differences, so that is now the default. Current performance is listed in the README (item 6): running time on high-end x86 is 9min/GB single-threaded, and 1min/GB for 20 threads. The big open question is still how well a GPU can perform Cuckoo Cycle. Most main memory accesses in (the current algorithm for) Cuckoo Cycle are to different 64-byte segments (the GPU cache line size), making latency of prime importance. And that's where GPUs fare worse than CPUs...
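To illustrate the atomic-versus-plain change (an assumption-laden sketch, not the repo's actual counter layout): the only difference is whether concurrent degree increments use an atomic read-modify-write or a plain, possibly racy, update.

```cpp
// Sketch of the two counter variants; the real repo packs its counters differently.
#include <atomic>
#include <cstdint>

const uint32_t NNODES = 1 << 20;             // illustrative size

std::atomic<uint32_t> atomic_count[NNODES];  // atomic: locked read-modify-write
uint32_t plain_count[NNODES];                // plain: racy increments may be lost

void bump_atomic(uint32_t node) {
  atomic_count[node].fetch_add(1, std::memory_order_relaxed);
}
void bump_plain(uint32_t node) {
  // A lost increment can at worst under-count a node's degree and trim an edge
  // that should have stayed; the post above reports the observed differences
  // as negligible.
  plain_count[node]++;
}
```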
|
|
|
|
tromp (OP)
Legendary
Offline
Activity: 990
Merit: 1110
|
|
April 11, 2014, 03:31:07 PM |
|
Next on the to-do list is another of Dave's suggestions: to bucketize the array of nodes to work on, to improve main-memory locality. That will help the edge trimming, which works on only one edge endpoint at a time, but won't help the original algorithm, which needs to work on both edge endpoints at the same time. So that may well achieve speed-parity...
And so it did! Reducing the number of page faults is a big gain.
Actually, the bucketing doesn't affect page faults. Profiling shows that the large speedup is due to big reductions in stalled-cycles-frontend and stalled-cycles-backend. Somehow alternating siphash computations with main memory accesses causes lots of stalls, while doing a bunch of hashes followed by a bunch of main memory accesses is much better. Damn; performance optimization is hard...
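A sketch of that batching pattern (names are illustrative; the stand-in hash merely takes the place of siphash): compute a block of endpoints first, then do all the scattered counter updates, so the hash arithmetic and the cache/TLB misses don't stall each other.

```cpp
// Illustrative sketch of batching hash computation and memory accesses; not
// the actual cuckoo miner code. The hash below stands in for siphash-2-4.
#include <cstdint>

const uint32_t NNODES = 1 << 20;
const int BATCH = 64;

uint8_t degree[NNODES];

uint32_t sipnode(uint32_t nonce, int uorv) {   // stand-in endpoint function
  uint64_t x = (uint64_t)nonce * 2654435761u + (uint32_t)uorv;
  x ^= x >> 29; x *= 0x9E3779B97F4A7C15ull; x ^= x >> 32;
  return (uint32_t)(x % NNODES);
}

void count_batched(uint32_t nonce0, int uorv) {
  uint32_t endpoints[BATCH];
  for (int i = 0; i < BATCH; i++)              // phase 1: pure arithmetic,
    endpoints[i] = sipnode(nonce0 + i, uorv);  // keeps the execution units busy
  for (int i = 0; i < BATCH; i++)              // phase 2: memory traffic only,
    degree[endpoints[i]]++;                    // independent misses can overlap
}
```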
|
|
|
|
dga
|
|
April 13, 2014, 09:19:27 PM |
|
Next on the to-do list is another of Dave's suggestions: to bucketize the array of nodes to work on, to improve main-memory locality. That will help the edge trimming, which works on only one edge endpoint at a time, but won't help the original algorithm, which needs to work on both edge endpoints at the same time. So that may well achieve speed-parity...
And so it did! Reducing the number of page faults is a big gain. Actually, the bucketing doesn't affect page faults. Profiling shows that the large speedup is due to big reductions in stalled-cycles-frontend and stalled-cycles-backend. Somehow alternating siphash computations with main memory accesses causes lots of stalls, while doing a bunch of hashes followed by a bunch of main memory accesses is much better. Damn; performance optimization is hard...
Bucketing at the size you're doing isn't about page faults; it's more about cache misses. A cache miss causes a stalled cycle. Also, you don't mean page faults: you mean TLB misses. Page faults are huge and rare. A TLB miss, at various levels of the hierarchy, is common. You can eliminate the TLB misses as a factor by allocating your data structures in hugepages. See, for example, http://lxr.free-electrons.com/source/tools/testing/selftests/vm/map_hugetlb.c
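In the spirit of the selftest linked above, a minimal sketch of backing a large buffer with hugepages on Linux via mmap's MAP_HUGETLB flag (this assumes hugepages have been reserved in /proc/sys/vm/nr_hugepages), falling back to normal pages if none are available:

```cpp
// Minimal hugepage-allocation sketch in the spirit of map_hugetlb.c; Linux only.
// Assumes the hugepage pool has been configured (/proc/sys/vm/nr_hugepages) and
// that len is a multiple of the hugepage size (typically 2 MB).
#include <cstddef>
#include <sys/mman.h>

void* alloc_big(std::size_t len) {
  void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {
    // No hugepages available: fall back to normal 4 KB pages.
    p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  }
  return p == MAP_FAILED ? nullptr : p;
}
```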
|
|
|
|
tromp (OP)
Legendary
Offline
Activity: 990
Merit: 1110
|
|
April 13, 2014, 10:24:29 PM |
|
Bucketing at the size you're doing isn't about page faults, it's more about cache misses. A cache miss causes a stalled cycle.
Also, you don't mean page faults: You mean TLB misses. Page faults are huge and rare. A TLB miss, at various levels of the hierarchy, is common.
You're right about the TLBs. Bucketing reduces TLB load misses from 2.8 billion to 0.56 billion (even while increasing TLB store misses from 1.7 million to 467 million). My stalled cycles do not appear to be due to cache misses, though: the bucketed version has slightly more cache misses, 3.51 billion versus 3.14 billion. Maybe the stalled cycles are also due to the TLB misses?!
|
|
|
|
tromp (OP)
Legendary
Offline
Activity: 990
Merit: 1110
|
|
April 30, 2014, 10:11:40 PM |
|
So while your statement is true, it's not necessarily the right truth. Is Cuckoo Cycle better or worse than adaptive scrypt-N? Is it better than a variant of Invictus's Momentum PoW that has an adaptive N factor?
Neither of us knows the answer to this question. There is a chance that it's stronger. But there's also a very real chance that the requirement to find a cycle, instead of a simple collision, will mean that the approach can yield to something with sublinear memory requirements and, in fact, underperform compared to Momentum in its memory requirements.
It cannot "underperform compared to Momentum in its memory requirements" because Cuckoo Cycle generalizes Momentum. Momentum looks for collisions of a hash function mapping 2^{26} nonces to 50-bit outputs. Each nonce corresponds to an edge in a bi-partite graph on nodes M union L, where M is all possible 25-most-significant bits of hash output, and L is all possible 25-least-significant bits of hash output, and so each collision is a 2-cycle in this graph. Thus, it is more likely that Momentum, being a special case of Cuckoo Cycle, underperforms in its memory requirements.
|
|
|
|
|
tromp (OP)
Legendary
Offline
Activity: 990
Merit: 1110
|
|
May 24, 2014, 09:49:08 PM |
|
any coins using this algo atm?
Nope...
|
|
|
|
optimiz3
Newbie
Offline
Activity: 37
Merit: 0
|
|
June 23, 2014, 10:24:29 PM |
|
tromp - after our discussion on k-SUM, I went back and tried to categorize why I was uncomfortable with Cuckoo Cycle in its present state. Basically it comes down to:
- It seems like the selection of 42 for the cycle length was based on experimentation, but there doesn't seem to be a hard proof for this (dga mentions this). I was originally looking at k-SUM because the difficulty of k-SUM is well known. In the case of cycle finding, I'm not sure the difficulty is proven.
- One way of proving this might be to instead establish the probability of a cycle of length k occurring in a randomly generated graph, and then tweak parameters so that the expectation is that only 1 cycle is found.
- Separately, might it make sense to look deeper at k-cliques? Their probability seems to be much better understood, and might provide a more provably secure basis. Quick starting point (written out below): assume G(n,p), an Erdős–Rényi graph; let pK = the probability that k given nodes form a clique = p^(k choose 2); let pNK = the probability that no k-clique exists = (1-pK)^(n choose k); then the probability that at least 1 k-clique exists = 1 - pNK.
Thoughts?
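Writing that starting point out (a back-of-envelope heuristic, since treating the candidate cliques as independent is an approximation):

```latex
% optimiz3's estimate, written out (requires amsmath for \binom). Treating the
% \binom{n}{k} candidate cliques as independent makes this a heuristic, not an
% exact probability.
\[
  p_K = p^{\binom{k}{2}}, \qquad
  p_{NK} = \bigl(1 - p_K\bigr)^{\binom{n}{k}}, \qquad
  \Pr\!\left[\text{at least one } k\text{-clique in } G(n,p)\right] \approx 1 - p_{NK}.
\]
```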
|
|
|
|
tromp (OP)
Legendary
Offline
Activity: 990
Merit: 1110
|
|
June 24, 2014, 06:45:23 PM |
|
tromp - after our discussion on k-SUM, I went back and tried to categorize why I was uncomfortable with Cuckoo Cycle in its present state. Basically it comes down to: - It seems like the selection of 42 for the cycle length was based on experimentation
You can find the motivation for the choice of cycle length in the whitepaper; I devoted a section to it.
but there doesn't seem to be a hard proof
Proofs are for theorems, not for parameter choices.
I was originally looking at k-SUM because the difficulty of k-SUM is well known. In the case of cycle finding, I'm not sure the difficulty is proven.
There is more literature on cycle detection in graphs than on k-SUM, and very little on k-SUM in cyclic groups as you consider.
- One way of proving this might be to instead establish the probability of a cycle of length k occurring in a randomly generated graph, and then tweak parameters so that the expectation is that only 1 cycle is found.
My whitepaper has sufficient data on the probability of cycles occurring. You don't need the expectation to equal 1; any non-negligible constant less than 1 works just as well.
|
|
|
|
optimiz3
Newbie
Offline
Activity: 37
Merit: 0
|
|
June 24, 2014, 07:43:15 PM |
|
You can find the motivation for the choice of cycle length in the whitepaper; I devoted a section to it.
While the Cuckoo Cycle paper does devote a section to this, the section lacks any hard guarantees on whether the parameter choices are correct or optimal. The graphs in the paper appear to be derived from experimentation, but the problem with such an approach is that it leaves open the possibility that more efficient and effective attacks exist. There need to be provable formulae establishing the relationships between parameter choices, the likelihood of cycles, and in turn the theoretical bounds that any attacking implementation could have. Without this, it leaves open the possibility of novel attacks that may not have been identified yet, which is no basis for a currency.
There is more literature on cycle detection in graphs than on k-SUM, and very little on k-SUM in cyclic groups as you consider.
I'm not advocating k-SUM, as I think we've established it is a dead end. That said, while there is much literature on cycle detection in graphs, I haven't found much in terms of proofs that concretely establish the probability of such length-k cycles existing in a G(n,p) graph. I would certainly appreciate information here if you are familiar with work in this area, and I think it would go a long way towards strengthening the fundamentals of Cuckoo Cycle as proposed.
|
|
|
|
tromp (OP)
Legendary
Offline
Activity: 990
Merit: 1110
|
|
June 25, 2014, 08:06:19 AM |
|
You can find the motivation for the choice of cycle length in the whitepaper; I devoted a section to it.
While the Cuckoo Cycle paper does devote a section to this, the section lacks any hard guarantees on whether the parameter choices are correct or optimal.
Remind me again: which other proofs-of-work have guarantees of parameter optimality, and where can I find proofs of these?
|
|
|
|
optimiz3
Newbie
Offline
Activity: 37
Merit: 0
|
|
June 25, 2014, 01:57:59 PM Last edit: June 25, 2014, 03:54:23 PM by optimiz3 |
|
Momentum is trivial to prove for any parameter selection because it is based on a well known problem.
Cuckoo Cycle is based on a novel graph-theoretic approach which may or may not have vulnerabilities, but without a proof of difficulty for a particular set of parameters, there is no way to assess the vulnerability to future attacks.
Optimality comes through applying engineering requirements to a defined set of provable algorithm characteristics. We don't have that (yet) for Cuckoo Cycle.
EDIT #2: Grammar; also, some searching turned up this recent (2009) paper: "Deviation inequality for the number of k-cycles in a random graph" by Wang and Gao. Perhaps it may be of use?
|
|
|
|
tromp (OP)
Legendary
Offline
Activity: 990
Merit: 1110
|
|
June 25, 2014, 10:13:52 PM |
|
Momentum is trivial to prove for any parameter selection because it is based on a well-known problem. Cuckoo Cycle is based on a novel graph-theoretic approach which may or may not have vulnerabilities, but without a proof of difficulty for a particular set of parameters, there is no way to assess the vulnerability to future attacks.
I'm afraid you're not making much sense. If we're going to continue this discussion, you should first define your terminology. What mathematical statement about Momentum is trivial to prove? What constitutes a vulnerability of a proof-of-work? What is a proof of difficulty? And how do you reconcile the above with my Apr 30 observation that Momentum is a special case of Cuckoo Cycle where the cycle length equals 2, a choice which leads it to be particularly memory-easy?
|
|
|
|
optimiz3
Newbie
Offline
Activity: 37
Merit: 0
|
|
June 26, 2014, 12:26:18 AM |
|
What doesn't make sense?
What mathematical statement about Momentum is trivial to prove?
It is an instance of the birthday problem, whose complexity is understood.
What constitutes a vulnerability of a proof-of-work?
The possibility that implementations exist which exploit unaccounted-for (lower and amortized) bounds on memory and/or execution time.
What is a proof of difficulty?
A proof that the amortized limits on memory and execution time cannot be less than those of the proposed implementation for a PoW.
And how do you reconcile the above with my Apr 30 observation that Momentum is a special case of Cuckoo Cycle where the cycle length equals 2, a choice which leads it to be particularly memory-easy?
Cycle lengths of 2 and 3 are special cases because they can be restated in terms of the birthday problem and 3-cliques respectively. They do not generalize to k-cycles (nor to 42-cycles), which is the concern.
|
|
|
|
tromp (OP)
Legendary
Offline
Activity: 990
Merit: 1110
|
|
June 26, 2014, 08:15:14 AM |
|
What mathematical statement about Momentum is trivial to prove?
It is an instance of the birthday problem, whose complexity is understood.
I don't think this is understood.
What constitutes a vulnerability of a proof-of-work?
The possibility that implementations exist which exploit unaccounted-for (lower and amortized) bounds on memory and/or execution time.
What is a proof of difficulty?
A proof that the amortized limits on memory and execution time cannot be less than those of the proposed implementation for a PoW.
What is the minimum memory usage for a Momentum implementation?
|
|
|
|
optimiz3
Newbie
Offline
Activity: 37
Merit: 0
|
|
June 26, 2014, 02:07:47 PM |
|
Quote from: optimiz3 on Today at 12:26:18 AM
Quote: What mathematical statement about Momentum is trivial to prove? It is an instance of the birthday problem, whose complexity is understood.
I don't think this is understood.
Why don't you think the complexity of Momentum is understood/proven? From your Cuckoo Cycle paper: "For example, for L = 2 the problem reduces to finding a birthday collision as in the Momentum proof-of-work."
|
|
|
|
|