Print Page - Cuckoo Cycle Speed Challenge; $2500 in bounties

Cuckoo Cycle is the first graph-theoretic proof-of-work.

Proofs take the form of a length 42-cycle in a bipartite graph
with N nodes and N/2 edges, with N scalable from millions to billions and beyond.
This makes verification trivial: compute the 42*2 edge endpoints
with one initialising sha256 and 84 siphash24 computations, check that
each endpoint occurs twice, and that you come back to the
starting point only after traversing 42 edges.
A final sha256 can check whether the 42-cycle meets a difficulty target.

With siphash24 being a very lightweight hash function, this makes for
practically instant verification.

This is implemented in just 157 lines of C code (files cuckoo.h and cuckoo.c) at
https://github.com/tromp/cuckoo, where you can also find a whitepaper.

From this point of view, Cuckoo Cycle is a rather simple PoW.

Finding a 42-cycle, on the other hand, is far from trivial,
requiring considerable resources, and some luck
(for a given header, the odds of its graph having a 42-cycle are about 2.5%).

The algorithm implemented in cuckoo_miner.h runs in time linear in N.
(Note that running in sub-linear time is out of the question, as you could
only compute a fraction of all edges, and the odds of all 42 edges of a cycle
occurring in this fraction are astronomically small).
Memory-wise, it uses N/2 bits to maintain a subset of all edges (potential cycle edges)
and N additional bits (or N/2^k bits with corresponding slowdown)
to trim the subset in a series of edge trimming rounds.
Once the subset is small enough, an algorithm inspired by Cuckoo Hashing
is used to recognise all cycles, and recover those of the right length.

I'd like to claim that this implementation is a reasonably optimal Cuckoo miner,
and that trading off memory for running time, as implemented in tomato_miner.h,
incurs at least one order of magnitude extra slowdown.
I'd further like to claim that GPUs cannot achieve speed parity with server CPUs.

To that end, I offer the following bounties:

Speedup Bounty:

$500 for an open source implementation that finds 42-cycles twice as fast (disregarding memory use).

Linear Time-Memory Trade-Off Bounty:

$500 for an open source implementation that uses at most N/k bits while running up to 15*k times slower, for any k>=2.

Both of these bounties require N ranging over {2^28,2^30,2^32} and #threads ranging over {1,2,4,8},
and further assume a high-end Intel Core i7 or Xeon and recent gcc compiler with regular flags as in my Makefile.

GPU Speed Parity Bounty:

$1000 for an open source implementation for an AMD R9 280X or nVidia GeForce GTX 770 (or similar high-end GPU)
that is as fast as a high-end Intel Xeon running 16 threads. Again with N ranging over {2^28,2^30,2^32}.

Note that there is already a cuda_miner.cu, my attempted port of the edge trimming part
of cuckoo_miner.c to CUDA, but while it seems to run ok for medium graph sizes,
it crashes my computer at larger sizes, and I haven't had the chance to debug the issue yet
(I prefer to let my computer work on solving 8x8 connect-4:-)
Anyway, this could make a good starting point.

These bounties are to expire at the end of this year. They are admittedly modest in size, but then
claiming them might only require one or two insightful tweaks to my existing implementations.

I invite anyone who'd like to see my claims refuted to extend any of these bounties
with amounts of your crypto-currency of choice.

(I don't have any crypto-currencies to offer myself, but if need be, I might be able to convert
a bounty through a trusted 3rd party)

Happy bounty hunting!

-John

Could you specify the exact program / command line arguments that must be compared against for each of those categories?
And how you expect the time to be measured, etc., etc.?
What operating system will the test be conducted on? What CPU(s) are acceptable for demonstrating the speedup?
Is a 2x speedup on some but not all CPUs adequate?

For GPUs, I'd suggest a way of putting it that allows a little more flexibility in the hardware.
Perhaps cycles / second / hardware dollar?

Also - am I correct that you've updated your algorithm to use the faster:

Code:

  2*nonce + {left or right side}  % HALFSIZE

where HALFSIZE is a power of two (expressed as a mask) instead of the previous %N0 and %N1 where N0 and N1 were not powers of two?

Quote from: dga on July 29, 2014, 02:59:47 PM

Could you specify the exact program / command line arguments that must be compared against for each of those categories?
And how you expect the time to be measured, etc., etc.?

I will add relevant bounty Makefile targets that encompasses all the required runs, and reports total running time
for both my cuckoo_miner source and the alternate bounty_miner source.
Please allow me a day or so to update the repository.

Quote

What operating system will the test be conducted on? What CPU(s) are acceptable for demonstrating the speedup?
Is a 2x speedup on some but not all CPUs adequate?

Tests are to be conducted on a common Linux distribution. Recent Intel Core i7 and Xeons are acceptable.
A 2x speedup on either of those is adequate.

Quote

For GPUs, I'd suggest a way of putting it that allows a little more flexibility in the hardware.
Perhaps cycles / second / hardware dollar?

I specified Xeon's because Core i7 doesn't offer that many threads. But those Xeons are rather prohibitively
priced, so I want to keep dollars out of that equation. However, we can consider your metric if a factor of 2x
or more is demonstrated for a GPU compared to a price/performance leading Core i7 running at 8 threads.

Quote

Also - am I correct that you've updated your algorithm to use the faster:

Code:

  2*nonce + {left or right side}  % HALFSIZE

where HALFSIZE is a power of two (expressed as a mask) instead of the previous %N0 and %N1 where N0 and N1 were not powers of two?

The main reason for that change is to avoid limiting Cuckoo Cycle to sizes of 2^32.
It is now well-defined for sizes up to 2^64.
I also wanted to get rid of the modulo operations which are not essential, cause some architectural bias,
slow down the critical path, and make the bipartite graph imperfectly balanced.
The goal is to make everything as simple and uniform as possible.

Quote from: tromp on July 29, 2014, 04:03:58 PM

Quote from: dga on July 29, 2014, 02:59:47 PM

Could you specify the exact program / command line arguments that must be compared against for each of those categories?
And how you expect the time to be measured, etc., etc.?

I will add relevant bounty Makefile targets that encompasses all the required runs, and reports total running time
for both my cuckoo_miner source and the alternate bounty_miner source.

The latest Makefile includes targets

Code:

bounty28:       cuckoo.h bounty_miner.h bounty_miner.cpp Makefile
        $(GPP) -o bounty28 -DSIZESHIFT=28 bounty_miner.cpp

bounty30:       cuckoo.h bounty_miner.h bounty_miner.cpp Makefile
        $(GPP) -o bounty30 -DSIZESHIFT=30 bounty_miner.cpp

bounty32:       cuckoo.h bounty_miner.h bounty_miner.cpp Makefile
        $(GPP) -o bounty32 -DSIZESHIFT=32 bounty_miner.cpp

cuda28: cuda_miner.cu Makefile
        nvcc -o cuda28 -DSIZESHIFT=28 -arch sm_20 cuda_miner.cu -lcrypto

cuda30: cuda_miner.cu Makefile
        nvcc -o cuda30 -DSIZESHIFT=30 -arch sm_20 cuda_miner.cu -lcrypto

cuda32: cuda_miner.cu Makefile
        nvcc -o cuda32 -DSIZESHIFT=32 -arch sm_20 cuda_miner.cu -lcrypto

cpubounty:      cuckoo28 bounty28 cuckoo30 bounty30 cuckoo32 bounty32 Makefile
        for c in 28 30 32; do for t in 1 2 4 8; do time for h in {0..9}; do ./cuckoo$$c -t $$t -h $$h; done; time for h in {0..9}; do ./bounty$$c -t $$t -h $$h; done;done; done

gpubounty:      cuckoo28 cuda28 cuckoo30 cuda30 cuckoo32 cuda32 Makefile
        for c in 28 30 32; do time for h in {0..9}; do ./cuckoo$$c -t 16 -h $$h; done; time for h in {0..9}; do ./cuda$$c -h $$h; done; done

Target cpubounty runs the required 3x4=12 time comparison tests for the first two bounties while target gpubounty runs the required 3 time comparison tests for the third bounty.

The output should indicate what cycles are found. Long cycles (length > 64) may be ignored. If your program doesn't find all short cycles, then the short cycle finding rate should be determined somehow and used to discount the speedup.

Quote from: tromp on July 30, 2014, 06:57:10 PM

Quote from: dga on July 30, 2014, 05:46:31 PM

Just an fyi:

Code:

unsigned char *SHA256(const unsigned char *d, size_t n,unsigned char *md) DEPRECATED_IN_MAC_OS_X_VERSION_10_7_AND_LATER;

Where do I have that code? I don't see that in my repository?!

Quote

And does not compile on Ubuntu 14.04.

I've put a quick hack patch for making it work under Linux -

http://www.cs.cmu.edu/~dga/crypto/cuckoo-linux.patch

but didn't test it again under macos. :)

I can compile fine on my SUSE Linux box. But of course, I'd like it to work on all Linux distributions. Before applying your patch though, I'd like to identify the exact issue you're having...

That SHA256 is defined in openssl/sha.h for MacOS - I wa smostly pointing out the deprecation.

For Ubuntu, I had to define a little macro to do the init/update/final.

Otherwise, I get a link error linking against SHA256.

I also updated the makefile to separate the LIBS, because the -l should appear *after* the .o files that reference them, and that was needed when I added -lssl for compiling on ubuntu.

Quote from: itod on July 31, 2014, 11:44:30 AM

Quote from: fluffypony on July 31, 2014, 10:45:22 AM

Quote from: mtve on July 31, 2014, 10:20:58 AM

Just a wild idea - wouldn't it be easier to start one more altcoin and see market reaction?

The market reaction is irrelevant, this thread revolves around deep technical tests.

I agree with you that this is not about the "market", but you may have misunderstood the question. I believe he asked if launching the altcoin will bring far greater incentive than $2500 for skilled developers to work on this, since making the GPU miner while all others mine on CPU can get you reward a few orders of magnitude bigger than $2500. That way these technical test could be done much, much quicker. Not a bad idea.

It would. It's what most other proof-of-works have done.

But John is doing it the ethical way by trying to have a firm foundation for it established in a way that doesn't let someone (ahem - like me) come along and make a huge profit first. (And I don't think he wants to run a coin or have his name associated with yet-another-ripoff coin!)

That's the *right* way to do it if you believe that mining should be relatively equitable.

Quote from: dga on July 31, 2014, 11:55:13 AM

Quote from: itod on July 31, 2014, 11:44:30 AM

Quote from: fluffypony on July 31, 2014, 11:56:32 AM

The downside to that is that the GPU miner will be kept on the DL for weeks or even months until the coin has been bled dry. This approach is far more ethical, and it means that as and when an existing cryptocurrency adopts Cuckoo Cycle it will already have reasonably performant CPU and GPU miners.

Don't want to troll this wonderful topic, so my last post just to clarify the point of view: If there's already a bounty on the bitcointalk.org, then this is not quite an "academic only" attempt. There's no harm in making the bounty much, much bigger, and there is no scam in launching such a coin with this background. I also believe that tromp is expecting this challenge to be unclaimed, then why not making it more explicit by involving the highest quality coders, which will not make a move for $2500?

If he doesn't want to mess with launching the coin, there are more than enough people who would like to help him doing it. I would help without expecting any monetary reward, simply because I believe this algo is the future and we really need the coin which is CPU only for cryptocurrencies to be widely adopted. Tromp just has to ask and the team around the coin would be gathered in no time.

Quote from: itod on July 31, 2014, 12:23:47 PM

I think at this stage the bounty is more to refine Cuckoo Cycle at a working-PoC level (for want of a better term). If, thereafter, a cryptocurrency were to adopt it, you can bet they'd discuss it with John and would put paid-for resources on cryptanalysis and code optimisation of the PoW, which would negate the need to launch a separate cryptocurrency merely as a vehicle for this.

Quote from: tromp on July 31, 2014, 02:29:26 PM

This thread in Bitcoin Development & Technical Discussion may also be of interest:

https://bitcointalk.org/index.php?topic=717267.0

For the record, I have no interest in developing my own coin. As a computer scientist, proof-of-work algorithms interest me, and Cuckoo Cycle has become one of my hobbies. Beyond the proof-of-work, crypto currencies have too many complexities (e.g. key management, networking issues) that I don't want to have to deal with.

Cuckoo Cycle will be adopted eventually. I suggested that the Ethereum guys take a serious look at it. Let's see what they come up with and how it compares...

Hi, John -

One more follow-up. I posted in my public review that, beyond the speedups I suggested to you, I was somewhat nervous about algorithmic speedups to the problem still being possible, and I can't shake that nervousness yet. (But I haven't been able to exploit it either, but my failure to do so doesn't mean it's not possible, just that it's not immediately obvious).

One thing I'm curious about: your current union-find solution seems to echo that of Monien. Before wasting my time thinking about it more if you've already done the work, did you attempt to apply Yuster & Zwick's algorithm (in "Finding Even Cycles Even Faster")? The only difference is a factor of inverse-of-ackerman's in (V), but it's mostly intriguing because it's simpler from a data structure perspective, which might lead to a faster real-world implementation. Or not. :)

My nervousness remains because of the unique structure of your graphs - as you note, they're like GNMp graphs - and it's not clear that there are specific tricks that can be used here. I'm a little less worried about this than I was, but I've still got this concern that the algorithm could be vulnerable to non-breakthrough-level theory developments (i.e., things easier than breaking discrete logs. :-). Still trying to wrap my head around a good formulation of it, and people shouldn't take me too seriously on this, because it's a tickling hunch, not a "gosh, I'm sure" kind of feeling.

I still compare it in my head to Momentum, where there's less structure sitting around, and therefore I'm more confident in the analysis, which I view as a good property in a PoW. But we know that Momentum, even if it were modified to use SIPhash, basically turns into a DRAM bandwidth problem. That's not bad from an ASIC standpoint, but it is comparatively GPU-friendly.

Quote from: dga on July 31, 2014, 02:48:36 PM

Quote from: tromp on July 31, 2014, 02:29:26 PM

Please stay nervous:-) As you noted in the past, losing my nervousness is one of my problems...

Quote

One thing I'm curious about: your current union-find solution seems to echo that of Monien. Before wasting my time thinking about it more if you've already done the work, did you attempt to apply Yuster & Zwick's algorithm (in "Finding Even Cycles Even Faster")? The only difference is a factor of inverse-of-ackerman's in (V), but it's mostly intriguing because it's simpler from a data structure perspective, which might lead to a faster real-world implementation. Or not. :)

I thought their approaches are quite different. Can you point to a specific paper and page of Monien describing a union-find like algorithm? My paper has a paragraph on both the similarity to and crucial difference with union-find
which explains why I can't use path-compression (which leads to the Ackermann function). My algorithm also crucially depends on the edge to vertex ratio being at most one half.
But any speedup of this part is mostly irrelevant anyway, as the running time of Cuckoo Cycle is entirely dominated by the edge trimming phase.

Quote

My nervousness remains because of the unique structure of your graphs - as you note, they're like GNMp graphs - and it's not clear that there are specific tricks that can be used here. I'm a little less worried about this than I was, but I've still got this concern that the algorithm could be vulnerable to non-breakthrough-level theory developments (i.e., things easier than breaking discrete logs. :-). Still trying to wrap my head around a good formulation of it, and people shouldn't take me too seriously on this, because it's a tickling hunch, not a "gosh, I'm sure" kind of feeling.

I still compare it in my head to Momentum, where there's less structure sitting around, and therefore I'm more confident in the analysis, which I view as a good property in a PoW. But we know that Momentum, even if it were modified to use SIPhash, basically turns into a DRAM bandwidth problem. That's not bad from an ASIC standpoint, but it is comparatively GPU-friendly.

Momentum suffers from a having a linear time-memory trade-off though. It's essentially identical to Cuckoo Cycle with N=2^26 vertices, cycle length 2, and #edges = N rather than N/2. Note that the best known implementations (entirely or partly due to you) are also essentially identical. I believe that by generalizing cycle length, Cuckoo Cycle leaves less room for trade-offs, as witnessed by the fact that Momentum's linear tmto does not carry over...

Quote from: tromp on July 31, 2014, 05:09:34 PM

Quote

"the complexity of determining a shortest cycle of even length" (I couldn't find a non-paywalled PDF, but could send you a copy if you want). A key difference is the shortest aspect, however. And asymptotically, the V^2 runtime is worse than the |V||E| runtime of the simpler BFS-based approach on the graph you choose for cuckoo.

(Note: I wasn't suggesting it's better -- I mostly wanted to avoid chasing down the Yuster algorithm if you'd already evaluated it. ;-)

Quote

What's the linear TMTO in momentum? It seems a lot like the edge trimming step in Cuckoo (and you're right - I just repurposed my code from momentum for that step in Cuckoo). But isn't the tmto quadratic in momentum as well? E.g., split the nonce set in two halves n_1, n_2, and then test n_1 x n_1, n_1 x n_2, n_2 x n_2? Or is there something I missed?

Duh, nevermind. Right - rip through and save all outputs with high order bit 0. Repeat again for high order bit 1. Straightforward. Sorry, brain has been in too many meetings today.

Quote from: dga on July 31, 2014, 06:44:59 PM

"the complexity of determining a shortest cycle of even length" (I couldn't find a non-paywalled PDF, but could send you a copy if you want). A key difference is the shortest aspect, however. And asymptotically, the V^2 runtime is worse than the |V||E| runtime of the simpler BFS-based approach on the graph you choose for cuckoo.
(Note: I wasn't suggesting it's better -- I mostly wanted to avoid chasing down the Yuster algorithm if you'd already evaluated it. ;-)

Please email me a copy of that Monien paper. Btw, my runtime is O(|E|) = O(|V|), not |V| |E|. I didn't check much graph theory literature because my setting is pretty specific:

the graph is (pseudo-)random.
it is very sparse; in fact a forest, apart from a constant expected number of additional cycle inducing edges.
it is only given implicitly, rather than some adjacency matrix or set of adjacency lists.

That's why the cuckoo hashing literature seemed more relevant, since these graph arise in that setting.

Quote from: tromp on July 31, 2014, 08:04:49 PM

Quote from: dga on July 31, 2014, 06:44:59 PM

"the complexity of determining a shortest cycle of even length" (I couldn't find a non-paywalled PDF, but could send you a copy if you want). A key difference is the shortest aspect, however. And asymptotically, the V^2 runtime is worse than the |V||E| runtime of the simpler BFS-based approach on the graph you choose for cuckoo.
(Note: I wasn't suggesting it's better -- I mostly wanted to avoid chasing down the Yuster algorithm if you'd already evaluated it. ;-)

Agreed, and emailed. Just trying to figure out where you've gone in looking for solutions. :) The sparsity is very interesting (and what makes edge pruning effective). The nature of the oracle is also fun. I tried yesterday to find a theory postdoc or someone to dangle the problem in front of, but didn't get any bites, just forward pointers.

Quote from: dga on August 01, 2014, 02:03:12 AM

Would you consider extending the applicability of your time/memory tradeoff bounty to a theoretical improvement to the asymptotic bounds for the time-memory tradeoff, with a low-speed demonstrator, but not an insanely tuned implementation, proving the feasibility of a sub-quadratic TMTO (but superlinear - i'm guessing some extra log(n) factor) for the edge pruning component of Cuckoo Cycle?

Are you asking for a bounty for using N/k bits and an o(k^2) slowdown?
The problem with asymptotic running times is that they're hard to verify:-(

I'd be happy to generalize the bounty as follows:

$1000/E for an open source implementation that uses at most N/k bits
while running up to 1.5*k^E times slower, for any k>=2 and E>=1.

Or is that still too strict for your taste?

Quote from: tromp on August 01, 2014, 02:46:57 AM

Quote from: dga on August 01, 2014, 02:03:12 AM

Yes, but N/k bits with an O(k log N) slowdown. Proof-of-concept implementation with no mind at all paid to efficiency, but showing clearly the attack vector and its resulting complexity.

In other words - close to linear, but not quite.

Quote from: tromp on August 01, 2014, 01:22:04 PM

Quote from: dga on August 01, 2014, 10:26:55 AM

Quote from: tromp on August 01, 2014, 02:46:57 AM

I'd be happy to generalize the bounty as follows:

$1000/E for an open source implementation that uses at most N/k bits
while running up to 1.5*k^E times slower, for any k>=2 and E>=1.

Or is that still too strict for your taste?

I see. That would indeed be very interesting, even if not a practical attack. But to be precise, is there a dependence on k in the hidden constant? Can you set k equal to log(N), or sqrt(N)?

k must be > log(n), or the constants lose out. Anything >= log^2(n) is fine.

Obviously, because it's only acting on edge trimming, there's a lower bound on the size requirement determined by the cycles. In essence, it boils down to #nodes-involved-in-paths-of-length-L * log(N), where L is the number of edge trimming steps you're willing to pay for. From our prior discussion about that, that's in the tens of thousands of nodes for a N=2^26.

Here's a pretty concrete example:

- Using the equivalent of 7 iterations of edge trimming
- N= 2^28, E=2^27
- Using k=sqrt(N), requiring roughly 2^14 *words* of memory (so sqrt(N) * log(N))
- Processes in O(2^28 * 2^14) steps
- Reduces the graph down to about 2.5% of its original size. <-- requires 0.025N\logN words to represent, of course...

The way in which it's not a perfectly linear TMTO is that I have to go to a word representation of the graph, not the bitvector I introduced earlier. It's a little more nuanced than that, but this is the core.

I'm writing the proof of concept in Julia because I wanted to learn a new language, and so it's glacially slow because I'm a newbie to it. I can discuss high-performance implementation strategies for it, of course, and I believe it's a pretty ASIC-friendly algorithm.

I'll write the whole thing up, share my PoC source code, and if you feel like throwing me some of the bounty, cool.

... wish there were some venue to publish this stuff in. :)

Quote from: dga on August 01, 2014, 02:47:43 PM

- Using the equivalent of 7 iterations of edge trimming
- N= 2^28, E=2^27
- Using k=sqrt(N), requiring roughly 2^14 *words* of memory (so sqrt(N) * log(N))
- Processes in O(2^28 * 2^14) steps
- Reduces the graph down to about 2.5% of its original size. <-- requires 0.025N\logN words to represent, of course...

The current code already trims the edges down to about 1.6%, in order to allow the cycle finding to run in the same memory footprint as the edge trimming. To run the cycle finding in only 2^14 words, you'd have to trim WAY more?!

Quote

The way in which it's not a perfectly linear TMTO is that I have to go to a word representation of the graph, not the bitvector I introduced earlier. It's a little more nuanced than that, but this is the core.

That would make the situation very similar to that of Momentum, where the linear TMTO applies to the original implementatoin using N=2^26 words, but not to your trimming version using N+2N/k bits (assuming it recomputes SHA hashes instead of storing them).

Quote

I'll write the whole thing up, share my PoC source code, and if you feel like throwing me some of the bounty, cool.

For a linear TMTO of my original non-trimming Cuckoo Cycle implementation I will offer a $666 bounty.

Quote

... wish there were some venue to publish this stuff in. :)

You could do as I did and publish on the Cryptology ePrint Archive. That's good for exposure, but, lacking peer-review, not so good for academic merit:-(

Quote from: fluffypony on August 01, 2014, 09:06:17 PM

Quote from: tromp on August 01, 2014, 09:00:28 PM

Quote

... wish there were some venue to publish this stuff in. :)

You could do as I did and publish on the Cryptology ePrint Archive. That's good for exposure, but, lacking peer-review, not so good for academic merit:-(

The problem is there's nothing like Nature when it comes to cryptography (and certainly nothing like Pubmed). Then again, compared to medical sciences, this is an industry that is still very much in its infancy.

Well - there are lots of academic conferences. I may toss it to one, but I suspect I'll throw it to ArXiv or crypto ePrint.

The challenge that I see with it is that the CC paper wasn't published in a peer-reviewed spot, so it makes it harder to publish something following on to it. The reason I jumped back on CC is because people in the cryptocurrency space are expressing a lot of interest in it, and that raises the importance of reviewing it, but there aren't too many academics who follow bitcointalk or think that when major devs for bitcoin are closely following something, it becomes important. Ahh well. It's fun stuff in any event. :)

http://www.cs.cmu.edu/~dga/crypto/cuckoo/analysis.pdf

Very hackish proof-of-concept:

http://www.cs.cmu.edu/~dga/crypto/cuckoo/partitioned.jl

I'm releasing this well in advance of what I would normally do for academic work because I think it's worth pushing the discussion of proof-of-work functions forward fast -- crypto moves too fast these days -- but please be aware that it's not done yet and requires a lot more polish. The bottom line is as I suggested above: It successfully TMTO's but still requires something like 1-3% of the "full" graph space, because that's what gets passed to the solver after edge trimming. I don't think this is the final word in optimizing for cuckoo cycle -- I believe there are some further nifty optimizations possible, though I think this gets the first primary attack vector down.

John, I'd love your feedback on whether this is clear and/or needs help, or if you find some of the handwavy O(N) parts too vague to constitute a real threat. I plan to clean it up more, but, as noted, I figured that it's better to swallow my perfectionism and push the crappy version out there faster to get the dialogue rolling, since I don't have coauthors on it to push it in the right direction. :)

Feedback appreciated!

(The weakest part of the analysis, btw, is the growth rate of the dictionaries used to track the other live nodes. With 7 iterations of edge trimming, for example, they actually grow slightly larger than the original node set in a partition, but less than 2x its size in some spot checks. I need to think more carefully about how that affects some of the asymptotic factors.)

As an example of the latter, with a partition initially containing 16384 nodes:

Code:

609 live nodes at start of iteration 6
Size of hop 2 dictionary: 3140
Size of hop 3 dictionary: 4940
Size of hop 4 dictionary: 6841
Size of hop 5 dictionary: 8873

That's 23794 total dictionary entries, or 1.45x the initial partition size, and at iteration 7, it's grown to 26384, or 1.61x. It's not an exponential explosion, so I'm not worried about it invalidating the major part of the result, but it's the place where I or someone else should focus some algorithmic attention to reduce.

update 2: To run the julia code, install Julia, and then type:

Code:

include("partitioned.jl")

It's glacially slow. For the impatient, reduce the problem size from 2^27 to something smaller first, like 2^20, and the partition accordingly. (Note: I use "N" to mean the size of one half of the bipartite graph, whereas John's formulation uses it to include both, so n=2^27 is equivalent to john's 2^28)

Update 3: Herp derp. That dictionary growth was my stupidity - I forgot to not include the edge back to the node that caused the insertion in the first place, so it's got a little bit of accidental exponential growth. I'll fix that tomorrow. That should get it back to closer to the linear scaling I'd expected.

Update 4: Fixed that above silliness with adding back inbound edges. Much improved dictionary size, now in line with what it should have been:

Code:

647 live nodes at start of iteration 6
Size of hop 2 dictionary: 2803
Size of hop 3 dictionary: 3554
Size of hop 4 dictionary: 4152
Size of hop 5 dictionary: 4404

By the end of iteration 7, the sum of all dictionaries (for that run) was 15797, slightly less than the number of nodes in the original partition, so at least through 7 iterations, the space for dictionaries remains O(|P| log N). Empirically speaking only, of course, since I haven't really done that analysis as it needs to be.
Julia code file on the website updated.

Quote from: dga on August 02, 2014, 12:38:28 AM

http://www.cs.cmu.edu/~dga/crypto/cuckoo/analysis.pdf

Very hackish proof-of-concept:

http://www.cs.cmu.edu/~dga/crypto/partitioned.jl

I'm releasing this well in advance of what I would normally do for academic work because I think it's worth pushing the discussion of proof-of-work functions forward fast -- crypto moves too fast these days -- but please be aware that it's not done yet and requires a lot more polish. The bottom line is as I suggested above: It successfully TMTO's but still requires something like 1-3% of the "full" graph space, because that's what gets passed to the solver after edge trimming. I don't think this is the final word in optimizing for cuckoo cycle -- I believe there are some further nifty optimizations possible, though I think this gets the first primary attack vector down.

John, I'd love your feedback on whether this is clear and/or needs help, or if you find some of the handwavy O(N) parts too vague to constitute a real threat. I plan to clean it up more, but, as noted, I figured that it's better to swallow my perfectionism and push the crappy version out there faster to get the dialogue rolling, since I don't have coauthors on it to push it in the right direction. :)

Feedback appreciated!

Thanks, Dave. I appreciate the quick release. Will start going over it tonight.

Quote from: tromp on August 02, 2014, 12:57:00 AM

Thanks, Dave. I appreciate the quick release. Will start going over it tonight.

You're on to something here!

In fact, l think you may be able to avoid the original union-find like algorithm,
and try to recognize 42 cycles starting from any D21 set.
If a node in D21 has 2 or more neighbours in D20, then you can work
your way back and try to find disjoint paths back to the same starting node in P.
(it suffices to check disjointness within each Di)

This approach sounds pretty similar to what Monien and Yuster/Zwick were doing,
so I'm going to have to go back and study those papers in detail.

Some more notes:

If you have for instance |P| = N/1000, then it's not wise to try all 1000 subsets P.
If there is a 42 cycle then you have a good chance of having one of its nodes in
one of the first 100 subsets.

It's going to be interesting to analyze whether a bunch of ASICs implementing this approach
is going to outperform the reference implementation on an FPGA plus a few hundreds DRAM chips, taking into account both fabrication costs and power usage of all components involved.

I remain hopeful that the constant factor overhead of this TMTO may preserve CC's ASIC resistance.

Quote from: tromp on August 02, 2014, 02:37:55 AM

Quote from: tromp on August 02, 2014, 12:57:00 AM

Thanks, Dave. I appreciate the quick release. Will start going over it tonight.

Must zzz, but yeah - that's kind of the approach I was trying to figure out how to express/implement when I talked about a sampling based approach working for this problem. (See the last paragraph of my initial review post.) It took me a long time to wrap my head around it and I think you just put it better than I was able, but it's that core idea of being able to start from a subset, prune it down, and then expand out and try to find cycles that participate in that subset.

If your subset is 20% of the nodes, for example, you're pretty likely to find the 42 cycle if it exists.

Now that I've had a bit of time to digest it, it strikes me that a better way of phrasing my algorithm might be this:

Select |P| initial nodes.

Begin a breadth-first-search from each node n in P.

Upon reaching a terminal node (one with only one incident edge), remove the edge to that node, recursing as needed if that removal causes another node to become terminal, and so on. If a node in p becomes itself terminal, remove it from the set and remove the outbound BFS chain from it.

For each level of the breadth, use one O(N) pass through the entire edge set by generating the edges and matching them against the leading edge nodes in the BFS tree.

This description alone is effective for edge trimming, but building on this for directly detecting 42-cycles requires more careful specification of how to handle the BFS graph representation and what to do about cycles in it.

btw - I wasn't actually clear on the definition of the CC PoW in one way: Is a valid proof any sequence of unique edges that form a cycle, or must the cycles be completely node-disjoint as well?

Quote from: dga on August 02, 2014, 06:34:17 PM

This description alone is effective for edge trimming, but building on this for directly detecting 42-cycles requires more careful specification of how to handle the BFS graph representation and what to do about cycles in it.

Indeed. I checked that the Yuster/Zwick paper doesn't add anything relevant for us over the Monien paper (it extends results for dense graphs), so I'm studying Monien's "how to find long paths efficiently" paper now...

Quote

btw - I wasn't actually clear on the definition of the CC PoW in one way: Is a valid proof any sequence of unique edges that form a cycle, or must the cycles be completely node-disjoint as well?

As the OP mentions, the verification checks that each node is incident to exactly two edges, so yes, the cycle must be node disjoint.

Quote from: dga on August 02, 2014, 06:34:17 PM

For each level of the breadth, use one O(N) pass through the entire edge set by generating the edges and matching them against the leading edge nodes in the BFS tree.

This description alone is effective for edge trimming, but building on this for directly detecting 42-cycles requires more careful specification of how to handle the BFS graph representation and what to do about cycles in it.

Hmm, seems we were overlooking the obvious.

You can just feed all the edges generated in each pass (having one endpoint already present in subset P or as a key in cuckoo_hash) to my cycle finder (lines 357-390 of cuckoo_miner.h, tweaked to ignore duplicates/2-cycles). It doesn't care where the edges come from or how they are located between successive D_i layers. It will even happily report cycles that bounce back and forth multiple times between layers.

This should be relatively easy to code...

Quote from: tromp on August 01, 2014, 09:00:28 PM

Quote from: dga on August 01, 2014, 02:47:43 PM

Quote

I'll write the whole thing up, share my PoC source code, and if you feel like throwing me some of the bounty, cool.

For a linear TMTO of my original non-trimming Cuckoo Cycle implementation I will offer a $666 bounty.

I'll take you up on that. :) (BTC address is in my signature). It seemed worth doing the theory bits more solidly before trying to optimize a potentially asymptotically-wrong, or even big-constants-wrong, algorithm. And I appreciate you extending the bounty to include that.

While nothing's perfect, there are a lot of coins that could learn from the way you're approaching the development of your PoW ideas. Or maybe they shouldn't -- there are a few CPU/GPU/etc. hackers who might go out of business. ;-)

(btw, I re-tested with 8 iterations, and it, like basic edge trimming, reduces the set to about 1.8%. I'm on battery, so I didn't want to test 9. Dunno what i was thinking coding this in Julia. So this is an effective TMTO down to 1.8% compared to the original. Not perfect, but not too bad, either -- certainly enough to let it run in SRAM.)

Quote from: tromp on August 02, 2014, 11:05:20 PM

Quote from: dga on August 02, 2014, 06:34:17 PM

Having a rough version done, it looks like the slowdown for saving a factor k in memory is roughly k times L, where L is cycle length. So, to use only half the memory of the edge-trimming implementation causes two orders of magnitude slowdown. If this is the best one can do, then Cuckoo Cycle is still memory-hard in a practical sense, with cycle-length becoming a memory-hardness parameter.

I'll have more detailed numbers after fine-tuning the implementation...

Quote from: tromp on August 04, 2014, 02:29:26 PM

This miner will be available as tomato_miner.h / tomato_miner.cpp
I will experiment with this some more before updating the TMTO bounty...

The tomato miner is now up, and bounty miner is restored as a copy of cuckoo miner.

The TMTO bounty in the OP has been updated, allowing an extra factor 10 of slowdown.

Below is the slightly improved output of the tomato miner, shown running almost twice as fast by using 2 threads.

Happy continued bounty hunting!

-John

Code:

> time ./tomato25 -t 2 -h 157
Looking for 42-cycle on cuckoo25("157") with 50% edges, 1/64 memory, 24/512 parts, 2 threads
Using 4MB node memory.
vpart 0 depth 21 load 78%
   4-cycle found at 1:5
   4-cycle found at 1:6
   4-cycle found at 1:7
   4-cycle found at 1:8
   4-cycle found at 1:9
   4-cycle found at 1:10
   4-cycle found at 1:11
   4-cycle found at 1:12
   4-cycle found at 1:13
   4-cycle found at 1:14
   4-cycle found at 1:15
   4-cycle found at 1:16
   4-cycle found at 1:17
   4-cycle found at 1:18
   4-cycle found at 1:19
   4-cycle found at 1:20
vpart 1 depth 21 load 78%
vpart 2 depth 21 load 78%
vpart 3 depth 21 load 75%
vpart 4 depth 21 load 77%
vpart 5 depth 21 load 76%
vpart 6 depth 21 load 78%
vpart 7 depth 21 load 76%
vpart 8 depth 21 load 77%
vpart 9 depth 21 load 79%
vpart 10 depth 21 load 78%
vpart 11 depth 21 load 76%
vpart 12 depth 21 load 76%
vpart 13 depth 21 load 77%
vpart 14 depth 21 load 76%
vpart 15 depth 21 load 78%
vpart 16 depth 21 load 78%
vpart 17 depth 21 load 78%
vpart 18 depth 21 load 78%
vpart 19 depth 21 load 75%
vpart 20 depth 21 load 77%
vpart 21 depth 21 load 75%
vpart 22 depth 21 load 76%
  42-cycle found at 1:20
vpart 23 depth 21 load 79%
Solution 223e7 56b1b 86b2f cb0a2 106629 233b7e 253ee1 290e4e 2e15cc 2efe4b 32bc8e 356dfd 35ef96 3dd854 534d37 57ee85 5abaa1 5d7d49 60ca66 662933 718651 78554b 7d8e01 7f7360 886c55 8a6448 8e4fd2 9674a2 98e431 b00d71 b16050 b1b561 b362d4 b9e539 ca3ab0 cb4f28 d2a53b d53bc3 e213cc e40bf1 ed5b2a faef36

real    7m5.287s
user    12m27.616s
sys     0m2.345s

Quote from: tromp on August 05, 2014, 03:01:42 AM

Quote from: tromp on August 04, 2014, 02:29:26 PM

This miner will be available as tomato_miner.h / tomato_miner.cpp
I will experiment with this some more before updating the TMTO bounty...

Weird, the post I was quoting has disappeared entirely.

In that post I showed a comparison between the reference and tmto implementation at memory parity, which showed a factor 70 slowdown and a factor 1-1/e reduced solution finding rate (due to sampling a fraction 1/42 of all vertices as starting points for the depth 21 breadth first search).
Discounting for the lower solution finding rate, the tmto implementation was 111 times slower.

For completeness, I include below the output of the reference implementation running on 2 threads,
showing a factor 61 difference (rather than the 70 single threaded).

-John

Code:

> time ./cuckoo25.1 -t 2  -h 157
Looking for 42-cycle on cuckoo25("157") with 50% edges, 11 trims, 2 threads
Using 2MB edge and 2MB node memory.
initial load 6400%
round 1 part U0 load 5222%
round 1 part U1 load 4045%
round 1 part V0 load 2970%
round 1 part V1 load 1895%
round 2 part U0 load 1508%
round 2 part U1 load 1121%
round 2 part V0 load 934%
round 2 part V1 load 746%
round 3 part U0 load 641%
round 3 part U1 load 535%
round 3 part V0 load 469%
round 3 part V1 load 403%
round 4 part U0 load 359%
round 4 part U1 load 315%
round 4 part V0 load 284%
round 4 part V1 load 253%
round 5 part U0 load 230%
round 5 part U1 load 208%
round 5 part V0 load 191%
round 5 part V1 load 174%
round 6 part U0 load 161%
round 6 part U1 load 147%
round 6 part V0 load 137%
round 6 part V1 load 127%
round 7 part U0 load 118%
round 7 part U1 load 110%
round 7 part V0 load 103%
round 7 part V1 load 96%
round 8 part U0 load 91%
round 8 part U1 load 85%
round 8 part V0 load 80%
round 8 part V1 load 76%
round 9 part U0 load 72%
round 9 part U1 load 68%
round 9 part V0 load 64%
round 9 part V1 load 61%
round 10 part U0 load 58%
round 10 part U1 load 55%
round 10 part V0 load 53%
round 10 part V1 load 50%
round 11 part U0 load 48%
round 11 part U1 load 46%
round 11 part V0 load 44%
round 11 part V1 load 42%
   4-cycle found at 1:90%
 166-cycle found at 0:98%
  42-cycle found at 1:98%
  76-cycle found at 1:99%
Solution 223e7 56b1b 86b2f cb0a2 106629 233b7e 253ee1 290e4e 2e15cc 2efe4b 32bc8e 356dfd 35ef96 3dd854 534d37 57ee85 5abaa1 5d7d49 60ca66 662933 718651 78554b 7d8e01 7f7360 886c55 8a6448 8e4fd2 9674a2 98e431 b00d71 b16050 b1b561 b362d4 b9e539 ca3ab0 cb4f28 d2a53b d53bc3 e213cc e40bf1 ed5b2a faef36

real    0m6.957s
user    0m10.709s
sys     0m0.077s

Quote from: tromp on August 06, 2014, 07:31:53 PM

In that post I showed a comparison between the reference and tmto implementation at memory parity, which showed a factor 70 slowdown and a factor 1-1/e reduced solution finding rate (due to sampling a fraction 1/42 of all vertices as starting points for the depth 21 breadth first search).
Discounting for the lower solution finding rate, the tmto implementation was 111 times slower.

I tried another approach, which turns out to be superior. First of all, assume that the graph we're working on has a unique 42 cycle, and we hope to find it by sampling a small fraction of nodes, and exploring their local neighbourhood with a depth-limited breadth-first-search (BFS).

Instead of doing a depth 21 BFS , which (roughly) requires starting from a node on the 42-cycle, we can do a much deeper search, say depth 42. Then we'll find the 42-cycle as long as we start from any node with distance at most 21 to the 42-cycle, which seems much more likely.

Apart from doubling the BFS depth, there is another difference. In the former approach, we only care about nodes on the cycle, and can eliminate any that are known to be off the cycle, such as those with only one incident edge. In the latter approach, we cannot eliminate these. So the BFS tree will be much larger in size relative to the starting set, not only due to doubled depth, but to less elimination as well.

To keep space usage down, we thus need to shrink the starting sets even more.
I ran some experiments which indicate that the increased likelihood of finding the cycle more than makes up for having smaller starting sets.

The updated tomato_miner.h can now use k times less memory with a (discounted) slowdown of roughly 25*k. Further details will appear in the whitepaper, in a week or two.

With this progress on the time-memory trade-off, the partial bounty payout, and the relaxation of the slowdown, I'm reducing the TMTO bounty from $1000 to $500.

Bitcoin Forum

Alternate cryptocurrencies => Altcoin Discussion => Topic started by: tromp on July 24, 2014, 03:49:32 AM