iamnotback


June 08, 2016, 06:18:30 AM 

"What is coming?", says finitemaz, "please tell me".
Last post for as long as I can resist the temptation to respond to the BS on BCT: Re: Will Bitcoin be replaced by another cryptocurrency?
Yes, it could happen, but I doubt very much it will happen in the short to medium term; to actually make Bitcoin obsolete would be an achievement.
It would have to offer something very important for mass adoption which Bitcoin doesn't offer. And that something would need viral uptake. I am planning to make that Bitcoin killer. And yes, it is very difficult to achieve, and implementation is a Herculean undertaking that is more likely to not be achieved quickly enough. Although Bitcoin can copy any innovation...
Bitcoin can't copy every innovation, especially if that innovation turns all the existing mining farms into useless doorstops. The vested interests in Bitcoin could never agree to copy a radical block chain overhaul. Also, that new design would already have the first-mover advantage with millions more users than Bitcoin. Bitcoin can't scale, so it won't have those millions of users to start with. ...yes it could but at this rate of popularity its going to be pretty hard for some other crypto to take over...
What popularity? Do you have any clue that Bitcoin has fewer users than any one of dozens of minority-marketshare social networks? Virtually no one uses Bitcoin for commerce. The main use of Bitcoin is for avoiding capital controls, speculation, gambling, and "nefarious" activities. Bitcoin isn't close to becoming mainstream. Merchants accept Bitcoin not because they get a lot of sales from it, but because it is free geek-cool promotion to the geek demographic. Note I added "Edit#3" for tromp to read.






ravens
Member
Offline
Activity: 98
Merit: 10


June 08, 2016, 06:46:41 AM 

Ethereum is better than bitcoin. End of story
Nope, bitcoin is better than eth. Eth is still new (compared to bit/doge/litecoin). The only thing Eth has going for it is its contracts (which I admit are useful), that's it. End of story.




Washika


June 08, 2016, 04:48:20 PM 

Ethereum is better than bitcoin. End of story
Nope, bitcoin is better than eth. Eth is still new (compared to bit/doge/litecoin). The only thing Eth has going for it is its contracts (which I admit are useful), that's it. End of story. So far, I prefer bitcoin, as bitcoin has been tested by time and is under active development. But things may change.





MyownBoss


June 08, 2016, 10:04:18 PM 

eth still scam? Yes.
BTC is the experiment chasing the flawed central banks. ETH is the flawed experiment chasing THIS experiment. Iamnotback might argue THIS is already flawed, and he might be right with some CORE. In short: BTC is the flawed experiment chasing the flawed central banks; ETH is the flawed experiment chasing THIS flawed experiment.

Rule #1: Never lose money. Rule #2: Never forget Rule #1.



tromp


June 09, 2016, 07:09:40 AM 

Cuckoo Cycle insecure use of Siphash.
Cuckoo Cycle doesn't rely on cryptographic security of its underlying hash function, and Dan Bernstein himself attested to its use in Cuckoo Cycle being perfectly sound. I also expect that given Cuckoo Cycle is parallelizable
No need to expect. This is clearly stated in the white paper. necessary to make Cuckoo Cycle's proving time reasonable for instant transactions
The white paper also explains that Cuckoo Cycle is not suitable for short block intervals, and works best for intervals even longer than bitcoin's. Instant transactions are best handled with off-chain payment channels, reserving the main chain for settlement. tromp thinks that Cuckoo Cycle has very low computation relative to memory latency. But this depends on only one 32-bit bucket being read in each memory page/row. With a massive number of h/w threads, can get them all synced up (or sorted) so that we read 1024 of the 32-bit buckets in one memory latency cycle
Cuckoo Cycle randomly accesses 2-bit counters, not 32-bit buckets. In one round of edge trimming, Cuckoo Cycle typically updates 2^29 counters spread over 2^8 memory banks. Each memory bank thus holds 2^21 counters, only 2^14 of which typically fit in a single row. This is where the latency comes in. To avoid latency, you could increase the number of memory banks by a factor of 2^7, but this similarly increases die area, hardware costs, and energy usage. Alternatively, the algorithm parameter PART_BITS allows for a reduction in the number of counters in use at the same time, which is what your proposal essentially amounts to. Setting this to 21 will require only 2^8 counters, one per memory bank. But now your hash computations have increased by a factor of 2^21, over 2 million. Both of these approaches are rather cost prohibitive. You might as well avoid DRAM latency by using SRAM instead. But that is also two orders of magnitude more expensive, while the performance increase is closer to one order of magnitude. Cuckoo Cycle may not be ASIC-proof, but it certainly seems to resist...
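The counting scheme tromp describes can be sketched in C. This is a simplified illustration, not tromp's reference code: the sizes are toy values, and `toy_hash` is a stand-in mixer for siphash; only the 2-bit saturating counters and the edge-survives-if-endpoint-seen-twice rule follow the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NEDGES (1u << 20)   /* toy size; the discussion above uses ~2^29 */
#define NNODES (NEDGES)     /* nodes on one side of the bipartite graph */

/* Stand-in for siphash: maps (nonce, side) to a node index. */
static uint32_t toy_hash(uint32_t nonce, uint32_t side) {
    uint32_t x = nonce * 2u + side;
    x ^= x >> 16; x *= 0x7feb352dU;
    x ^= x >> 15; x *= 0x846ca68bU;
    x ^= x >> 16;
    return x % NNODES;
}

/* 2-bit saturating counters packed 16 per uint32_t. */
static uint32_t get2(const uint32_t *ctr, uint32_t i) {
    return (ctr[i >> 4] >> (2 * (i & 15))) & 3u;
}
static void inc2(uint32_t *ctr, uint32_t i) {
    if (get2(ctr, i) < 2)                       /* saturate at 2 = "seen twice+" */
        ctr[i >> 4] += 1u << (2 * (i & 15));
}

/* One trimming round on side `side`: count endpoint occurrences, then keep
   only edges whose endpoint was seen at least twice. Returns edges kept. */
static uint32_t trim_round(const uint32_t *edges, uint32_t nedges,
                           uint32_t *out, uint32_t side) {
    uint32_t *ctr = calloc(NNODES / 16 + 1, sizeof *ctr);
    for (uint32_t i = 0; i < nedges; i++)
        inc2(ctr, toy_hash(edges[i], side));    /* random access: the latency-bound step */
    uint32_t kept = 0;
    for (uint32_t i = 0; i < nedges; i++)
        if (get2(ctr, toy_hash(edges[i], side)) >= 2)
            out[kept++] = edges[i];
    free(ctr);
    return kept;
}
```

Repeated rounds alternate sides, and at the M/N loads tromp mentions they typically leave only on the order of a percent of edges alive.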




iamnotback


June 09, 2016, 07:41:57 AM 

Cuckoo Cycle insecure use of Siphash.
Cuckoo Cycle doesn't rely on cryptographic security of its underlying hash function, and Dan Bernstein himself attested to its use in Cuckoo Cycle being perfectly sound. Please provide me a copy of that discussion, because I believe I have already shown it is potentially unsound in some logic I have written down and not yet revealed. I think perhaps you have misunderstood Dan or (more unlikely given how smart he is) Dan has misunderstood something. It can happen. I also expect that given Cuckoo Cycle is parallelizable
No need to expect. This is clearly stated in the white paper. I presume, based on your birthplace, that English is not your native language. Please don't ignore the word 'given'. The 'expect' is referring to the remainder of the statement that you elided. necessary to make Cuckoo Cycle's proving time reasonable for instant transactions
The white paper also explains that Cuckoo Cycle is not suitable for short block intervals, and works best for intervals even longer than bitcoin's. Yes but you have interleaved other reasons for that, afair not mentioning the reason that I added. So I am just adding analysis. By no means was I indicating your work is worthless. There is something very important that will come from this. Instant transactions are best handled with offchain payment channels, reserving the main chain for settlement.
Technically incorrect. tromp thinks that Cuckoo Cycle has very low computation relative to memory latency. But this depends on only one 32-bit bucket being read in each memory page/row. With a massive number of h/w threads, can get them all synced up (or sorted) so that we read 1024 of the 32-bit buckets in one memory latency cycle
Cuckoo Cycle randomly accesses 2-bit counters, not 32-bit buckets. Are you referring to the latest version of your code? Because I am referring to the code that was in the Appendix of your February 1, 2014 white paper, which references int buckets. Is this edge trimming coming from a suggestion from Dave Andersen? In one round of edge trimming, Cuckoo Cycle typically updates 2^29 counters spread over 2^8 memory banks. Each memory bank thus holds 2^21 counters, only 2^14 of which typically fit in a single row. This is where the latency comes in. To avoid latency, you could increase the number of memory banks by a factor of 2^7, but this similarly increases die area, hardware costs, and energy usage.
Or increase the number of h/w threads to 2^15 (2^8 multiplied by 2^21 bits for 2-bit counters divided by 2^14 bits per row) which have some efficient h/w mechanism for waiting to be synchronized on a memory page/row. Alternatively, the algorithm parameter PART_BITS allows for a reduction in the number of counters in use at the same time, which is what your proposal essentially amounts to. Setting this to 21 will require only 2^8 counters, one per memory bank. But now your hash computations have increased by a factor 2^21, over 2 million.
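The bank/row arithmetic being debated here can be made concrete with a toy address decomposition. The field widths below mirror the 2^29-counter, 2^8-bank, 2^14-counters-per-row figures quoted above; the striping order is an assumption for illustration, not a real DRAM address map.

```c
#include <assert.h>
#include <stdint.h>

#define COUNTER_BITS 29   /* 2^29 counters in play per trimming round */
#define BANK_BITS     8   /* 2^8 memory banks */
#define ROW_COUNTERS 14   /* 2^14 counters fit in one row */

/* Decompose a counter index into (bank, row-within-bank, offset-within-row),
   assuming counters are striped across banks low-bits-first. */
typedef struct { uint32_t bank, row, offset; } addr_t;

static addr_t decompose(uint32_t idx) {
    addr_t a;
    a.bank = idx & ((1u << BANK_BITS) - 1);
    uint32_t in_bank = idx >> BANK_BITS;            /* 2^21 counters per bank */
    a.offset = in_bank & ((1u << ROW_COUNTERS) - 1);
    a.row = in_bank >> ROW_COUNTERS;                /* 2^7 rows per bank */
    return a;
}
```

The leftover 29 − 8 − 14 = 7 bits are rows per bank, which is exactly the "factor of 2^7" tromp says you would need to multiply the bank count by to hide the latency.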
No, that is not equivalent to increasing the number of h/w threads and syncing them to pause until 2^13 of them are queued up to read a shared memory page/row. Both of these approaches are rather cost prohibitive. You might as well avoid DRAM latency by using SRAM instead. But that is also two orders of magnitude more expensive, while the performance increase is closer to one order of magnitude.
Btw, electrical power cost of SRAM is only a one order-of-magnitude increase: Magnus Moreau, "Estimating the Energy Consumption of Emerging Random Access Memory Technologies". Cuckoo Cycle may not be ASIC-proof, but it certainly seems to resist...
So far, I disagree. I appreciate your discussion, because I want to make sure my analysis is correct. I also have to strike a balance between full publishing and secrecy, because unlike you, I have objectives around releasing a Bitcoin killer altcoin.





tromp


June 09, 2016, 01:03:26 PM 

Cuckoo Cycle doesn't rely on cryptographic security of its underlying hash function, and Dan Bernstein himself attested to its use in Cuckoo Cycle being perfectly sound.
Please provide me a copy of that discussion, because I believe I have already shown it is potentially unsound in some logic I have written down and not yet revealed. I spoke to him in person at BITCOIN'15 where I presented Cuckoo Cycle. Are you referring to the latest version of your code, because I am referring to the code that was in the Appendix of your December 31, 2014 white paper which references int buckets.
The bucketing in an earlier version was just something inessential to the algorithm that was found to be beneficial to performance. Last year I replaced it by prefetching in commit https://github.com/tromp/cuckoo/commit/5aef59659e77c599c730ece6a42a7a2597de80da which turned out to be more beneficial. Is this edge trimming coming from a suggestion from Dave Andersen?
Yes, implemented back in April 2014. In one round of edge trimming, Cuckoo Cycle typically updates 2^29 counters spread over 2^8 memory banks. Each memory bank thus holds 2^21 counters, only 2^14 of which typically fit in a single row. This is where the latency comes in. To avoid latency, you could increase the number of memory banks by a factor of 2^7, but this similarly increases die area, hardware costs, and energy usage.
Or increase the number of h/w threads to 2^15 (2^8 multiplied by 2^21 bits for 2-bit counters divided by 2^14 bits per row) which have some efficient h/w mechanism for waiting to be synchronized on a memory page/row. I don't see why you multiply with the number of rows that Cuckoo uses per memory bank. Your 2^15 threads will each hash some nonce and then most will want to access a unique row. Then what? There's little to be synchronized. Alternatively, the algorithm parameter PART_BITS allows for a reduction in the number of counters in use at the same time, which is what your proposal essentially amounts to. Setting this to 21 will require only 2^8 counters, one per memory bank. But now your hash computations have increased by a factor 2^21, over 2 million.
No, that is not equivalent to increasing the number of h/w threads and syncing them to pause until 2^13 of them are queued up to read a shared memory page/row. In my code a thread is something that processes a sequence of nonces, hashing each one, and then atomically updating a corresponding counter, without any coordination. Are you proposing a thread for each nonce? Half a billion threads? I really don't understand what you are proposing and how this syncing is supposed to be implemented. Btw, electrical power cost of SRAM is only one order-of-magnitude increase:
I'm talking about the dollar cost of SRAM vs DRAM.




iamnotback


June 09, 2016, 01:31:29 PM 

Cuckoo Cycle doesn't rely on cryptographic security of its underlying hash function, and Dan Bernstein himself attested to its use in Cuckoo Cycle being perfectly sound.
Please provide me a copy of that discussion, because I believe I have already shown it is potentially unsound in some logic I have written down and not yet revealed. I spoke to him in person at BITCOIN'15 where I presented Cuckoo Cycle. Did he consider (especially parallelized) multicollision attacks and their impact on the potential to find an algorithm that can find cycles faster? Can you paraphrase what he said or the gist of his reasoning? The basic issue is whether we can guarantee that patterns will not leak through Siphash, given that the key is controllable by the attacker. By not using a general cryptographic hash function, there is not the setup overhead of randomizing the impact (or lack thereof) of a key, but I think this is very risky. If it is not necessary to take this risk, then why do it? I understand you think you need to minimize computation for ASIC resistance, but I don't think that is the wise place to remove computation from the assurances on the Cuckoo Cycle PoW. For example, look at the potential cryptanalysis break claimed just recently, which would break 256-bit random numbers by reducing the search space by eliminating certain patterns (although this is probably a hoax): https://bitcointalk.org/index.php?topic=1504710.msg15138928#msg15138928 If the PoW function is broken, the entire block chain can be rewritten from the genesis block. This is not a trivial decision. We need large margin-of-error assurances. Are you referring to the latest version of your code, because I am referring to the code that was in the Appendix of your December 31, 2014 white paper which references int buckets.
The bucketing in an earlier version was just something inessential to the algorithm that was found to be beneficial to performance. Last year I replaced it by prefetching in commit https://github.com/tromp/cuckoo/commit/5aef59659e77c599c730ece6a42a7a2597de80da which turned out to be more beneficial. I think you are referring to something entirely unrelated to what I was referring to. I see int here in the code I am referring to: https://github.com/tromp/cuckoo/blob/gh-pages/cuckoo.c#L10 I am intentionally referring to the earlier version of your algorithm, to avoid all the source code noise implementing multithreading. Also, apparently you added some TMTO tradeoffs in the later version of your code, which I presume adds more complexity to the code. Once we establish what we are referring to, then I can better respond to your other points. I will await first your reply. Btw, electrical power cost of SRAM is only one order-of-magnitude increase:
I'm talking about the dollar cost of SRAM vs DRAM. I know you were. And I think electrical power consumption is what matters most for mining and anti-DDoS economics.
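For readers without the papers at hand, the way Cuckoo Cycle keys siphash per edge is roughly as follows. This is sketched from memory of tromp's published code: the v0..v3 initialization constants are the standard SipHash ones, but the key schedule, the single-word message handling, and the masking details here are illustrative and may differ from the actual repository.

```c
#include <assert.h>
#include <stdint.h>

#define ROTL(x,b) ((uint64_t)(((x) << (b)) | ((x) >> (64 - (b)))))

#define SIPROUND do { \
    v0 += v1; v1 = ROTL(v1,13); v1 ^= v0; v0 = ROTL(v0,32); \
    v2 += v3; v3 = ROTL(v3,16); v3 ^= v2;                    \
    v0 += v3; v3 = ROTL(v3,21); v3 ^= v0;                    \
    v2 += v1; v1 = ROTL(v1,17); v1 ^= v2; v2 = ROTL(v2,32); \
} while (0)

/* Keyed hash of one 64-bit nonce, siphash-2-4 style (2 compression rounds,
   4 finalization rounds). Unlike full SipHash there is no length padding
   here, since the input is a single word. */
static uint64_t siphash24(uint64_t k0, uint64_t k1, uint64_t nonce) {
    uint64_t v0 = k0 ^ 0x736f6d6570736575ULL;
    uint64_t v1 = k1 ^ 0x646f72616e646f6dULL;
    uint64_t v2 = k0 ^ 0x6c7967656e657261ULL;
    uint64_t v3 = k1 ^ 0x7465646279746573ULL ^ nonce;
    SIPROUND; SIPROUND;
    v0 ^= nonce;
    v2 ^= 0xff;
    SIPROUND; SIPROUND; SIPROUND; SIPROUND;
    return v0 ^ v1 ^ v2 ^ v3;
}

#define EDGEBITS 20                      /* toy graph size: 2^20 nodes per side */
#define EDGEMASK ((1u << EDGEBITS) - 1)

/* Endpoint of edge `nonce` on side `uorv` (0 or 1) of the bipartite graph.
   The key (k0,k1) is derived from the block header, which is exactly why an
   attacker-controllable key is the concern raised in this exchange. */
static uint32_t sipnode(uint64_t k0, uint64_t k1, uint32_t nonce, uint32_t uorv) {
    return (uint32_t)siphash24(k0, k1, 2ULL * nonce + uorv) & EDGEMASK;
}
```

The debate above is whether this keyed-per-header construction needs full cryptographic strength, or only the graph-level randomness tromp claims suffices.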




tromp


June 09, 2016, 04:00:22 PM 

Cuckoo Cycle doesn't rely on cryptographic security of its underlying hash function, and Dan Bernstein himself attested to its use in Cuckoo Cycle being perfectly sound.
Please provide me a copy of that discussion, because I believe I have already shown it is potentially unsound in some logic I have written down and not yet revealed. I spoke to him in person at BITCOIN'15 where I presented Cuckoo Cycle. Did he consider (especially parallelized) multicollision attacks and their impact on the potential to find an algorithm that can find cycles faster? Can you paraphrase what he said or the gist of his reasoning? Attacks on hash functions are mostly about doing tons of computation (but less than trying all inputs) to try to invert its outputs. Siphash may not stand up to that. But to think that a weakness in inversion resistance can make finding cycles of length 42 in a graph defined by millions or billions of siphash outputs easier is just... utterly inconceivable. That's the gist of it. That's the basic algorithm, not the reference miner, which is in https://github.com/tromp/cuckoo/blob/master/src/cuckoo_miner.h for CPUs and https://github.com/tromp/cuckoo/blob/master/src/cuda_miner.cu for GPUs. These spend most time doing edge trimming, so that's what this discussion should focus on.




iamnotback


June 09, 2016, 05:33:30 PM 

Cuckoo Cycle doesn't rely on cryptographic security of its underlying hash function, and Dan Bernstein himself attested to its use in Cuckoo Cycle being perfectly sound.
Please provide me a copy of that discussion, because I believe I have already shown it is potentially unsound in some logic I have written down and not yet revealed. I spoke to him in person at BITCOIN'15 where I presented Cuckoo Cycle. Did he consider (especially parallelized) multicollision attacks and their impact on the potential to find an algorithm that can find cycles faster? Can you paraphrase what he said or the gist of his reasoning? Attacks on hash functions are mostly about doing tons of computation (but less than trying all inputs) to try to invert its outputs. Siphash may not stand up to that. But to think that a weakness in inversion resistance can make finding cycles of length 42 in a graph defined by millions or billions of siphash outputs easier is just... utterly inconceivable. That's the gist of it. Please distinguish preimage attacks from multicollision attacks. If we can find certain characteristics in the hash function which are not perfectly random and have a repeatable relationship with certain changes of bits in the key from hash to hash, then we can perhaps impact the cycling in the Cuckoo graph. Cycle forming, or the way we optimize detecting cycles, may be impacted by certain patterning away from a perfect Random Oracle. Think of a differential cryptanalysis where we look for the way a hash changes over differentials from one hash to the next. And also think of the parallelized rho attack. The weakness in Cuckoo Cycle w.r.t. hashing is that it isn't doing one hash but millions of them, and so the dynamic randomness of correlation over hash differentials also has to be considered. I am not comfortable with cheating margins of security on the hash function. Sorry. I think Zooko @ Zcash was correct to be concerned even though he didn't express what sort of attack might concern him. And remember he worked on the Blake2s cryptographic hash function, but I understand he isn't a cryptographer per se (and neither am I). 
Cryptocurrency is also about being able to trust that the cryptography is strong and sound. I don't see the advantage of cheating on the hash function used in the Cuckoo Cycle, especially given I don't think you attain ASIC resistance anyway, which was the entire motivation for replacing SHA256 with Siphash in order to lower the computation-to-memory-consumption ratio. You are apparently thinking of inversion and not about what matters. I doubt Dan really thought deeply about what he was commenting on off-the-cuff. Btw, in our past discussions and also in your white paper (or blog) you write about 1/3 computation and 2/3 memory accesses, but this is execution time, not energy cost. The power cost of loading a row buffer on DRAM is very cheap because the circuits are much more dense and the page is localized. The cost of computation on a CPU is very costly because the circuits are spread out all over the chip and not as dense as 1 transistor per bit of DRAM; thus the parasitic capacitance and resistance of the interconnects is an order of magnitude or more greater. I think it is a losing strategy to say that 1 hash per 1 memory access (meaning you expect 1 memory access to load a new memory row) is going to result in only a 1/3 speedup for the ASIC. And besides, below I will explain how I think the assumption of 1 hash per memory row buffer load can be avoided by the ASIC running many h/w threads. I was referring to the "simple array in the basic algorithm" which was from the February 1, 2014 version of your white paper. I note you mentioned the edge trimming in the December 31, 2014 update. I was avoiding studying the edge trimming code and was focusing on the "basic algorithm". 
Anyway, if the edge trimming is done after enumerating all possible nonces, then it is not a heuristic; it is an optimization to reduce memory consumption by eliminating all the edges which can't be in a cycle, and then I presume you re-enumerate the nonces on the "basic algorithm" but use a hash table to represent a sparse array so you don't have to store all the buckets from the more naive version of the basic algorithm. So it is reducing memory consumption and, I presume, eliminating the wasted code paths for walking leaf edges in the final basic algorithm. Btw, you may not remember that AnonyMint or TPTB_need_war wrote to you a long time ago (and I think perhaps before Dave Andersen did) that you could compress and count with bit flags. So you should credit Shelby too, lol (but Dave wrote a much more detailed exposition). Do I need to go find that post on BCT where AnonyMint or TPTB_need_war told you that? I remember reading it again this week when I was doing my deeper research on Cuckoo Cycle. In any case, it seems it doesn't change my theory about the lack of ASIC resistance. Rather it just reduces the size of the data being randomly read/written from 32 bits to 2 bits for the majority of the running time. Also you do note that the edge trimming is applicable for the case where many/most of the edges will be leaf edges, which is the case where M/N < 1/2. You may think that is acceptable, because your intended usage is apparently to run Cuckoo at that load. I note it for reasons I will not reveal now. Okay, so now we are on the same page, so I can proceed to reply to what you had written before... In one round of edge trimming, Cuckoo Cycle typically updates 2^29 counters spread over 2^8 memory banks. Each memory bank thus holds 2^21 counters, only 2^14 of which typically fit in a single row. This is where the latency comes in. 
To avoid latency, you could increase the number of memory banks by a factor of 2^7, but this similarly increases die area, hardware costs, and energy usage.
Or increase the number of h/w threads to 2^15 (2^8 multiplied by 2^21 bits for 2-bit counters divided by 2^14 bits per row) which have some efficient h/w mechanism for waiting to be synchronized on a memory page/row. I don't see why you multiply with the number of rows that Cuckoo uses per memory bank. Your 2^15 threads will each hash some nonce and then most will want to access a unique row. Then what? There's little to be synchronized. The point is that if you have enough threads running, the likelihood is that there are 2^12 threads wanting to read from the same row; thus your algorithm is no longer bounded by memory latency but rather by memory bandwidth, and moreover there isn't 1 hash per memory row buffer load but instead 2^12 hashes, and thus actually your algorithm is computation bound and will be thousands of times more power efficient on the ASIC. I suspect if the GPU is running at memory bandwidth, then it may be getting multiple hashes per row buffer load and thus be more power efficient than the CPU with only 8 threads/cores. CPU h/w threads may not be coalesced to sync on row buffer loads either (and I am not sure if they are on the GPU). Alternatively, the algorithm parameter PART_BITS allows for a reduction in the number of counters in use at the same time, which is what your proposal essentially amounts to. Setting this to 21 will require only 2^8 counters, one per memory bank. But now your hash computations have increased by a factor 2^21, over 2 million.
No, that is not equivalent to increasing the number of h/w threads and syncing them to pause until 2^13 of them are queued up to read a shared memory page/row. In my code a thread is something that processes a sequence of nonces, hashing each one, and then atomically updating a corresponding counter, without any coordination. Are you proposing a thread for each nonce? Half a billion threads? I really don't understand what you are proposing and how this syncing is supposed to be implemented. It doesn't matter that the thread proceeds to do something else after reading and updating the 2-bit count corresponding to a nonce. It only matters that enough h/w threads are running to enable syncing many reads on the same row buffer load without stalling all the threads and gridlock. The ASIC will need some custom timing and syncing logic. I presume stalled threads (waiting for the other threads to coalesce on each row buffer load) don't cost much in terms of power consumption when implemented with custom circuitry on the ASIC. I think you really didn't understand what Bill Cox was trying to explain.




finitemaz
Member
Offline
Activity: 64
Merit: 10


June 09, 2016, 05:35:51 PM 

(quoting iamnotback's reply to tromp, above, in full) Wasting time! I want to see the release!




tromp


June 09, 2016, 06:47:12 PM 

Attacks on hash functions are mostly about doing tons of computation (but less than trying all inputs) to try to invert their outputs. Siphash may not stand up to that. But to think that a weakness in inversion resistance can make finding cycles of length 42 in a graph defined by millions or billions of siphash outputs easier is just... utterly inconceivable. That's the gist of it.
Please distinguish preimage attacks from multicollision attacks. If we can find certain characteristics in the hash function which are not perfectly random and have a repeatable relationship with certain changes of bits in the key from hash to hash, then we can perhaps impact the cycling in the Cuckoo graph. Cycle formation, or the way we optimize detecting cycles, may be impacted by certain patterning away from a perfect Random Oracle. My comment applies similarly to multicollision attacks. I believe detecting deviations from perfect randomness to be not only way too expensive but, more importantly, to die out exponentially fast as the cycle grows. I am not comfortable with cheating margins of security on the hash function. Sorry.
I know you and Zooko disagree with me and Dan on this. You feel such attacks are conceivable. I feel such attacks are quite inconceivable. Clearly we're not going to convince each other, so let's just agree to disagree. People are more than welcome to run Cuckoo Cycle with blake2b replacing siphash if they feel that trade-off is worthwhile. It's just not something I recommend. Anyway, if the edge trimming is done after enumerating all possible nonces, then it is not a heuristic but an optimization to reduce memory consumption, eliminating all the edges which can't be in a cycle; then I presume you re-enumerate the nonces in the "basic algorithm" but use a hash table to represent a sparse array so you don't have to store all the buckets from the more naive version of the basic algorithm.
The basic algorithm doesn't use buckets, just a cuckoo hash table. In the edge-trimming case, only on the order of 1% of edges remain, and that table is a little more complicated in order to translate the sparseness into memory savings. But still no buckets. Btw, you may not remember that AnonyMint or TPTB_need_war wrote to you a long time ago (and I think perhaps before Dave Andersen did) that you could compress and count with bit flags. So you should credit Shelby too, lol (but Dave wrote a much more detailed exposition). Do I need to go find that post on BCT where AnonyMint or TPTB_need_war told you that? I remember reading it again this week when I was doing my deeper research on Cuckoo Cycle.
Sure, a URL would help. Also, you do note that the edge trimming is applicable for the case where many/most of the edges will be leaf nodes, which is the case where M/N < 1/2. You may think that is acceptable, because your intended usage is apparently to run Cuckoo at that load. I note it for reasons I will not reveal now. A much earlier version used edge fractions lower than 1/2 as a difficulty control, but I abandoned it since the relationship between edge fraction and cycle probability is strongly nonlinear, making dynamic difficulty control practically impossible. Okay, so now we are on the same page, so I can proceed to reply to what you had written before...
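The trimming step being discussed can be sketched in a few lines. This is a toy illustration only: the graph is tiny, blake2b (mentioned in this thread as a possible siphash replacement) stands in for siphash, and the `endpoint` helper and all sizes are assumptions for the sketch, not Cuckoo Cycle's actual code:

```python
import hashlib

# Minimal sketch of edge trimming: count node degrees with saturating
# 2-bit counters, then discard every edge touching a leaf node
# (degree < 2), since no cycle can pass through a leaf.

NUM_NODES = 1 << 12          # nodes on each side of the bipartite graph
NUM_EDGES = NUM_NODES // 2   # edge fraction M/N = 1/2, as in the thread

def endpoint(key: bytes, nonce: int, uorv: int) -> int:
    """Map (nonce, side) to a pseudorandom node on that side."""
    data = (2 * nonce + uorv).to_bytes(8, "little")
    h = hashlib.blake2b(data, key=key, digest_size=8).digest()
    return int.from_bytes(h, "little") % NUM_NODES

def trim_round(key: bytes, live_nonces: list, uorv: int) -> list:
    deg = [0] * NUM_NODES
    for n in live_nonces:                  # pass 1: count degrees
        v = endpoint(key, n, uorv)
        deg[v] = min(deg[v] + 1, 2)        # saturating 2-bit counter
    # pass 2: keep only edges whose endpoint on this side is not a leaf
    return [n for n in live_nonces if deg[endpoint(key, n, uorv)] >= 2]

key = b"header-hash"                       # stand-in for the hashed header
live = list(range(NUM_EDGES))
for _ in range(4):                         # a few alternating rounds
    live = trim_round(key, live, 0)        # trim on the U side
    live = trim_round(key, live, 1)        # trim on the V side
print(f"{len(live)} of {NUM_EDGES} edges survive trimming")
```

At load M/N = 1/2 most nodes are leaves, so repeated rounds remove the vast majority of edges, which is why trimming saves so much memory at that load and fails above it.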
Gonna visit the local chess club now. Will read and respond to that later...




iamnotback


June 09, 2016, 07:09:30 PM 

Attacks on hash functions are mostly about doing tons of computation (but less than trying all inputs) to try to invert their outputs. Siphash may not stand up to that. But to think that a weakness in inversion resistance can make finding cycles of length 42 in a graph defined by millions or billions of siphash outputs easier is just... utterly inconceivable. That's the gist of it.
Please distinguish preimage attacks from multicollision attacks. If we can find certain characteristics in the hash function which are not perfectly random and have a repeatable relationship with certain changes of bits in the key from hash to hash, then we can perhaps impact the cycling in the Cuckoo graph. Cycle formation, or the way we optimize detecting cycles, may be impacted by certain patterning away from a perfect Random Oracle. My comment applies similarly to multicollision attacks. I believe detecting deviations from perfect randomness to be not only way too expensive but, more importantly, to die out exponentially fast as the cycle grows. Except if the cycle is growing in a patterned way and that is part of the algorithm of the attack, i.e. it deviates from the algorithm you use but is still able to provide the cycles within the maximum nonce count (and does so with some performance advantage, obviously, else there is no point to it). I think it is naive for you to assume that a random walk becomes more convoluted as cycle length grows, when the entire risk is about whether the uniform randomness could be subverted. I have a very creative imagination, which is why I am often able to come up with things that others don't think of. There are so many permutations of possibilities that you sweep away with a generalized notion of random walking, which may not hold when the random oracle assumption is subverted. It may be that there is some differential analysis or distinguisher (even just a bias in a single bit relating outputs to inputs across multiple hashes) that can be precomputed and amortized over all the hashes, because it is a table of differentials of the way the hash behaves over a huge number of invocations, where the spread across the enumerated space is smaller due to the large number of hashes computed.
There are many different angles to analyse, because we are not considering just one hash but millions of them for one Cuckoo Cycle solution. Of course I respect Dan Bernstein as a much smarter cryptographer than I am, but unless I am reading what he has written, I am hesitant to be convinced he has considered all of what I am getting at just from your second-hand summary of what he said verbally at a conference. I am not comfortable with cheating margins of security on the hash function. Sorry.
I know you and Zooko disagree with me and Dan on this. You feel such attacks are conceivable. I feel such attacks are quite inconceivable. Clearly we're not going to convince each other, so let's just agree to disagree. People are more than welcome to run Cuckoo Cycle with blake2b replacing siphash if they feel that trade-off is worthwhile. It's just not something I recommend. I am just arguing: why not be safe rather than sorry when so much is at stake? Why would you recommend adding unnecessary risk, even where you think you are omniscient and there is no risk? Anyway, if the edge trimming is done after enumerating all possible nonces, then it is not a heuristic but an optimization to reduce memory consumption, eliminating all the edges which can't be in a cycle; then I presume you re-enumerate the nonces in the "basic algorithm" but use a hash table to represent a sparse array so you don't have to store all the buckets from the more naive version of the basic algorithm.
The basic algorithm doesn't use buckets, just a cuckoo hash table. In the edge-trimming case, only on the order of 1% of edges remain, and that table is a little more complicated in order to translate the sparseness into memory savings. But still no buckets. I don't see what you wrote that is different from what I wrote. You focus too literally on the use of the word 'buckets'. What I am saying is that the edge trimming eliminates most of the nodes from consideration, and then you have a hash table representing a sparse array for the remaining nodes which aren't on the pruned/trimmed leaf edges. Those remaining nodes were "buckets" in a regular array in the former untrimmed basic algorithm. Btw, you may not remember that AnonyMint or TPTB_need_war wrote to you a long time ago (and I think perhaps before Dave Andersen did) that you could compress and count with bit flags. So you should credit Shelby too, lol (but Dave wrote a much more detailed exposition). Do I need to go find that post on BCT where AnonyMint or TPTB_need_war told you that? I remember reading it again this week when I was doing my deeper research on Cuckoo Cycle.
Sure, a URL would help. Here it was: https://bitcointalk.org/index.php?topic=1219023.msg13721473#msg13721473 Note that was quoting a document that AnonyMint wrote, I believe in late 2013 or early 2014, but he didn't publish all of that document until January 2016. Note he had published excerpts of that document on BCT when Monero was first released and AnonyMint was having an argument with the guy who optimized Monero's hash function, so we could correlate that if we wanted to prove it. Anyway, I am fine with David getting the credit, as he developed the concept more than AnonyMint did. Also, you do note that the edge trimming is applicable for the case where many/most of the edges will be leaf nodes, which is the case where M/N < 1/2. You may think that is acceptable, because your intended usage is apparently to run Cuckoo at that load. I note it for reasons I will not reveal now. A much earlier version used edge fractions lower than 1/2 as a difficulty control, but I abandoned it since the relationship between edge fraction and cycle probability is strongly nonlinear, making dynamic difficulty control practically impossible. Well, my wink is about using maximum nonce counts, i.e. edge fraction M/N > 1/2. Yeah, I can see that dynamic difficulty control gets messed up if you deviate too far from 1/2. But difficulty can also be controlled externally by hashing the result with SHA256, which you also noted in your paper.
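The external difficulty control mentioned here can be sketched directly: instead of tuning the edge fraction, hash the 42-cycle proof with SHA-256 and accept the solution only if the digest falls below a target. The proof encoding and the placeholder proof below are assumptions for the sketch:

```python
import hashlib

# External difficulty: a found cycle is only a valid block solution if
# SHA-256 of its (sorted) nonce list falls below the difficulty target.

PROOFSIZE = 42  # Cuckoo Cycle looks for 42-cycles

def meets_difficulty(proof_nonces, target: int) -> bool:
    assert len(proof_nonces) == PROOFSIZE
    blob = b"".join(n.to_bytes(4, "little") for n in sorted(proof_nonces))
    digest = hashlib.sha256(blob).digest()
    return int.from_bytes(digest, "big") < target

fake_proof = list(range(PROOFSIZE))   # made-up placeholder, not a real cycle
# A target of 2^256 accepts every proof; halving the target halves the
# acceptance rate, giving fine-grained difficulty without touching M/N.
print(meets_difficulty(fake_proof, 1 << 256))   # True
```

This keeps the graph parameters fixed at whatever load the cycle-finding algorithm handles best, which is why it sidesteps the nonlinearity problem described above.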




tromp


June 09, 2016, 10:55:11 PM 

The point is that if you have enough threads running, the likelihood is there are 2^12 threads wanting to read from the same row, thus your algorithm is no longer bound by memory latency but rather by memory bandwidth; moreover, there isn't 1 hash per memory row buffer load but instead 2^12 hashes, and thus your algorithm is actually computation bound and will be thousands of times more power efficient on the ASIC.
I stated earlier: To avoid latency, you could increase the number of memory banks by a factor of 2^7, but this similarly increases die area, hardware costs, and energy usage.
Your above proposal of simultaneously running millions of hardware threads and needing to sync them incurs much worse increases in die area (to the point of not fitting on any single die), hardware costs, and energy usage.




iamnotback


June 09, 2016, 11:18:37 PM 

The point is that if you have enough threads running, the likelihood is there are 2^12 threads wanting to read from the same row, thus your algorithm is no longer bound by memory latency but rather by memory bandwidth; moreover, there isn't 1 hash per memory row buffer load but instead 2^12 hashes, and thus your algorithm is actually computation bound and will be thousands of times more power efficient on the ASIC.
I stated earlier: To avoid latency, you could increase the number of memory banks by a factor of 2^7, but this similarly increases die area, hardware costs, and energy usage.
Your above proposal of simultaneously running millions of hardware threads and needing to sync them incurs much worse increases in die area (to the point of not fitting on any single die), hardware costs, and energy usage. My calculation was 2^15; that isn't a million. Did you really design these threads already? I bet they can use a minimal number of transistors each, by reusing some components for threads that are stalled. Multiple CPUs can access the same memory; perhaps the same could be done if multiple dies are needed. Afaik, the hardware cost is rather insignificant relative to the power-consumption efficiency cost when it comes to mining economics for ASIC farms.




tromp


June 10, 2016, 09:32:15 AM 

I am just arguing: why not be safe rather than sorry when so much is at stake? Why would you recommend adding unnecessary risk, even where you think you are omniscient and there is no risk?
I only need good statistical properties of the hash function, not cryptographic security. In fact siphash24 is already overkill for my purposes, with siphash48 being close to cryptographically secure. Your attack is about as conceivable to me as P being equal to NP, in which case there is no cryptographically secure hash function anyway. What I am saying is that the edge trimming eliminates most of the nodes from consideration, and then you have a hash table representing a sparse array for the remaining nodes which aren't on the pruned/trimmed leaf edges.
Yes, totally correct. But in my view the term "bucket" denotes a fixed-size unordered container, and my current algorithms use no such thing. Well, my wink is about using maximum nonce counts, i.e. edge fraction M/N > 1/2.
My algorithms break down in that case, since the expected number of cycles becomes non-constant, causing the basic algorithm to overlook an increasing fraction of them, and edge trimming fails altogether to remove a large fraction of edges.




tromp


June 10, 2016, 09:40:21 AM 

My calculation was 2^15; that isn't a million.
You need millions to find 2^12 of them wanting to access a single row. With 2^15 threads you expect 1 access per row.
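The arithmetic behind this reply can be spelled out. The row count below assumes 2^29 two-bit counters at 2^14 counters per row (figures quoted later in this thread), giving 2^15 rows:

```python
# With uniformly random row targets, the expected number of accesses per
# row-buffer activation is simply threads / rows.

NUM_ROWS = 1 << 15   # assumed: 2^29 counters / 2^14 counters per row

def expected_accesses_per_row(threads: int) -> float:
    return threads / NUM_ROWS

print(expected_accesses_per_row(1 << 15))   # 1.0 -- tromp's point
# Averaging 2^12 accesses per row needs 2^12 * 2^15 = 2^27 threads:
print((1 << 27) // 10 ** 6, "million-odd threads")  # 134
```

So "millions" here is literal: roughly 134 million in-flight threads would be needed to average 2^12 coalesced accesses per row under these assumptions.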




iamnotback


June 10, 2016, 10:20:05 AM 

I am just arguing: why not be safe rather than sorry when so much is at stake? Why would you recommend adding unnecessary risk, even where you think you are omniscient and there is no risk?
I only need good statistical properties of the hash function, not cryptographic security. Pseudorandomness ("good statistical properties") is tied into the cryptographic security: "BLAKE2 aims to provide the highest security level, be it in terms of classical notions as (second) preimage or collision resistance, or of theoretical notions as pseudorandomness (a.k.a. indistinguishability) or indifferentiability." The relevant terms are 'indistinguishability' and 'indifferentiability'. Siphash is sufficient for the use case it was designed for, because there the attacker doesn't know the key to the hash function. It is not the case that Siphash has been cryptanalysed for your application of it in Cuckoo Cycle, wherein the attacker knows both the input and output of the hash function. Note that every cryptographic hash submission Dan Bernstein has made has been rejected as broken; he appears not to excel in that one area. His ARX rounds (from the ChaCha symmetric cipher) have been redeployed in BLAKE2 by those who know how to make them cryptographically secure for the different application of a hash function. I admire Dan a lot; he knows a hell of a lot more than me about everything involving higher maths and cryptography. In fact siphash24 is already overkill for my purposes, with siphash48 being close to cryptographically secure. Your attack is about as conceivable to me as P being equal to NP, in which case there is no cryptographically secure hash function anyway.
Fact? Based on what published cryptanalysis? What I am saying is that the edge trimming eliminates most of the nodes from consideration, and then you have a hash table representing a sparse array for the remaining nodes which aren't on the pruned/trimmed leaf edges.
Yes, totally correct. But in my view the term "bucket" denotes a fixed-size unordered container, and my current algorithms use no such thing. Agreed, a 'bucket' would not be an ordered array element; that is a more precise use of the term. We programmers who started hacking in the 1970s and 1980s may be a little looser with our terminology (employing the general English definition, using 'bucket' to refer to an array element) compared with someone who came up predominantly in an academic setting, especially someone for whom technical English is not their native language, who would likely be stricter about definitions. I stand corrected on the more precise terminology. Well, my wink is about using maximum nonce counts, i.e. edge fraction M/N > 1/2.
My algorithms break down in that case, since the expected number of cycles becomes non-constant, causing the basic algorithm to overlook an increasing fraction of them, and edge trimming fails altogether to remove a large fraction of edges. Exactly. In a few months you'll know what I was referring to, when it is published. My calculation was 2^15; that isn't a million.
You need millions to find 2^12 of them wanting to access a single row. With 2^15 threads you expect 1 access per row. Correct on the 2^15, but not on the conclusion. We both agree the number of memory banks is irrelevant for this concern; memory banks only determine how many row buffers are active simultaneously, which impacts maximum speed and whether we are memory-bandwidth or memory-latency bound on speed of execution. So for the power consumption issue, if we have 2^29 counters (2 bits each) and 2^14 counters per memory page/row, then 2^16 (65536) h/w threads means we can stall threads (perhaps not permanently, so as to account for variance outliers) and be statistically assured of roughly 2 accesses per row. That already doubles the hash computation per memory page row. Thus we don't need millions of h/w threads to convert the algorithm to being computation bound on power consumption, and thus making it not ASIC resistant. At 2^18 (262144) h/w threads we have 8 times more computation per memory page row than the CPU (and the computation will be orders of magnitude more power efficient on the ASIC). A CUDA GPU can apparently run 671 million threads, although these are probably not synced on memory row buffers as we would need here; the statistical spread might be sufficient without syncing, though. I think if you had the Kill A Watt meter, what you would have observed is that by increasing threads, the speed remained topped out at memory bandwidth, but the power consumption of the GPU decreases as threads are increased beyond a certain threshold (assuming GPU threads are very efficient, which might not be the case). I don't think it is safe to assume that ASICs can't be designed to support millions of very efficient threads for this very customized computation, again sharing the computation transistors amongst only the threads that aren't stalled, thus not needing millions of instances of the compute units.
The GPU may already be doing this, although maybe not, since it is probably optimized for performance rather than power consumption where there is a choice. Once it has hit the memory-bandwidth bound, one might hope the GPU would minimize power consumption by maximizing row buffer coalescing, but this is perhaps too costly to implement in the general case of a GPU (though I am arguing not in the case of an ASIC with a specific custom computation circuit). Apparently GPU memory coalescing is not powerful enough to do what we require. Also remember I need a very fast proving time for instant transactions, which thus means at 2^20 counters it could be trivially parallelized, losing ASIC resistance with only thousands of threads. Actually, how ironic that the 2-bit counters, instead of the original basic algorithm, make the ASIC resistance worse, because more counters fit in a row. Remember I wrote this about a "bit array" in my 2013 rough-draft paper on memory-hard proof-of-work.
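The power-efficiency claim running through this exchange can be illustrated with a toy energy model. The per-operation energy figures below are invented assumptions purely for illustration (reflecting only that a DRAM row activation costs far more than one siphash evaluated in custom ASIC logic); the point is the shape of the trade-off, not the absolute numbers:

```python
# Toy model: average energy per hash when one DRAM row activation is
# amortized over k coalesced counter updates. Both constants are
# assumptions, not measurements.

E_ROW_ACTIVATE = 1000.0  # assumed units per DRAM row activation
E_HASH_ASIC = 1.0        # assumed units per siphash in custom logic

def energy_per_hash(hashes_per_activation: int) -> float:
    return E_ROW_ACTIVATE / hashes_per_activation + E_HASH_ASIC

for k in (1, 2, 8, 4096):
    print(k, energy_per_hash(k))
# With no coalescing (k=1) the row activation dominates; as k grows, the
# cost approaches that of the hash alone, i.e. computation bound.
```

Under this model, even the modest coalescing factors of 2 or 8 argued for above cut the memory share of the energy budget substantially, which is the crux of the ASIC-resistance dispute.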




