azwccc


October 29, 2013, 05:42:53 PM 

What exactly is the "memory bus speed"? Is it the RAM speed (like PC-12800), or does the QPI become the bottleneck?
Would a larger CPU cache size help performance a little?

Bitrated user: azwccc.













bytemaster


October 29, 2013, 05:50:47 PM 

The memory access pattern is 'random', so the CPU cache does not improve the locality of memory access.






smolen


November 06, 2013, 05:41:21 PM 

Reposting here, just in case: hash the nonce space into birthday space in parallel, getting an array of (nonce, birthday) pairs; sort it with a bitonic sort; find duplicates. Bonus: don't store the birthdays at all; compute them ad hoc during the sort. This is not an attack on the algorithm itself, just a way to fit the existing algorithm onto a GPU.
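As a rough CPU-side sketch of the sort-and-scan recipe above: the `birthday` function below is a toy stand-in for the real BirthdayHash (which hashes nonce+H with SHA-512), and `std::sort` stands in for the bitonic sort a GPU would use. Assumptions of mine, not actual miner code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Toy stand-in for BirthdayHash: maps a nonce into a small birthday space.
// (The real scheme uses SHA-512 over nonce+H; this is illustration only.)
uint32_t birthday(uint32_t nonce, uint32_t space) {
    uint64_t x = nonce * 0x9E3779B97F4A7C15ull;
    x ^= x >> 29;
    x *= 0xBF58476D1CE4E5B9ull;
    x ^= x >> 32;
    return static_cast<uint32_t>(x % space);
}

// Hash every nonce, sort the (birthday, nonce) pairs, then report duplicates
// among adjacent entries. On a GPU, std::sort would be a bitonic sort.
std::vector<std::pair<uint32_t, uint32_t>>
find_collisions(uint32_t nonce_count, uint32_t space) {
    std::vector<std::pair<uint32_t, uint32_t>> pairs;   // (birthday, nonce)
    pairs.reserve(nonce_count);
    for (uint32_t n = 0; n < nonce_count; ++n)
        pairs.emplace_back(birthday(n, space), n);
    std::sort(pairs.begin(), pairs.end());
    std::vector<std::pair<uint32_t, uint32_t>> out;     // colliding nonce pairs
    for (size_t i = 1; i < pairs.size(); ++i)
        if (pairs[i].first == pairs[i - 1].first)
            out.emplace_back(pairs[i - 1].second, pairs[i].second);
    return out;
}
```

With more nonces than birthdays, the pigeonhole principle guarantees at least one collision, which is what the test below relies on.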

Of course I gave you bad advice. Good one is way out of your price range.



gigawatt


November 07, 2013, 11:56:28 PM 

I've managed to make a huge step forward in showing that Momentum is not nearly as good a proof-of-work as intended. The magic lies in using a Bloom filter to store the intermediate hashes. As a result, instead of using 12 bytes per hash/nonce in a semi-ordered map (which results in ~750 MB of memory), the required memory is -(2^26 * ln(0.01)) / (ln(2)^2) bits, or about ~76 MB. This number can be reduced arbitrarily if we're willing to accept a false positive rate greater than 1%. For example, if we allowed up to a 50% chance of a false positive, the memory requirement drops to ~11 MB. Here's an overview of how the algorithm works:
 Make a "main" Bloom filter of size 2^26 with a false positive rate of 1%: ~76 MB
 Make a tiny "clash" Bloom filter of size 2^16 with a false positive rate of 2^-26: ~0.7 MB
 Make a vector of pairs <hash, nonce> to store candidate birthday collisions.
 For each nonce in the search space, check whether its hash exists in the "main" Bloom filter. If it does, add its entry to the "clash" Bloom filter.
 The "main" Bloom filter is no longer required and can be discarded.
 For each nonce in the search space, check whether its hash exists in the "clash" Bloom filter. If it does, add <hash, nonce> to a candidate list for investigation.
 Sort the list of candidates by hash.
 For each pair in the candidate list, see if the previous element has the same hash. If it does, add it to the output list. This step removes false positives by comparing the actual hash instead of the Bloom filter's idea of a hash.
 Return the output list as normal.
For your testing pleasure, I also built a working proof of concept. (Most of the magic is right here. The Bloom filter is a modified version of "bloom.cpp" called "bigbloom.cpp".) Unmodified source: http://i.imgur.com/k2cNrmd.png Source using Bloom filters: http://i.imgur.com/w8Enf9e.png In exchange for lowering the memory requirements by a factor of ~10, the algorithm runs at about 1/4 speed, mainly due to the doubled calls to SHA512 and the hash calls within the Bloom filters. The overall result is a net efficiency gain of approximately 2.5x. The reduction in memory requirement means that if we could fit N instances of Momentum in GPU memory, we can instead fit 10*N instances. If we raise the false positive rate in exchange for more time spent sorting, we can achieve ratios of up to 70*N. Given that Bloom filters, SHA512, and sorting are all parallel/GPU-friendly, we can conclude that Momentum as a proof-of-work isn't nearly as GPU-hard as initially intended.
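As an illustration, the two-pass pipeline described above might look like the following, with a toy hash and tiny filter sizes standing in for SHA512 and the 2^26-entry filters of the actual PoC. This is a sketch of the idea, not the code from bigbloom.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Minimal Bloom filter: k derived bit positions per item in an m-bit array.
// (The PoC uses a modified bloom.cpp; this simplified stand-in is ours.)
struct Bloom {
    std::vector<uint64_t> bits;
    uint64_t m;
    int k;
    Bloom(uint64_t m_bits, int k_hashes)
        : bits((m_bits + 63) / 64, 0), m(m_bits), k(k_hashes) {}
    static uint64_t mix(uint64_t x, uint64_t seed) {
        x += (seed + 1) * 0x9E3779B97F4A7C15ull;
        x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ull;
        x ^= x >> 27; x *= 0x94D049BB133111EBull;
        return x ^ (x >> 31);
    }
    void add(uint64_t h) {
        for (int i = 0; i < k; ++i) {
            uint64_t b = mix(h, i) % m;
            bits[b / 64] |= 1ull << (b % 64);
        }
    }
    bool maybe(uint64_t h) const {          // false negatives are impossible
        for (int i = 0; i < k; ++i) {
            uint64_t b = mix(h, i) % m;
            if (!((bits[b / 64] >> (b % 64)) & 1)) return false;
        }
        return true;
    }
};

// Two-pass pipeline: the "main" filter flags repeated hashes, the small
// "clash" filter remembers them, and a final sort over the (much shorter)
// candidate list removes false positives by comparing real hashes.
std::vector<std::pair<uint64_t, uint32_t>>
momentum_bloom(uint32_t nonce_count, uint64_t birthday_space) {
    auto hash_of = [&](uint32_t n) { return Bloom::mix(n, 12345) % birthday_space; };
    Bloom main_f(1 << 20, 7), clash(1 << 14, 7);
    for (uint32_t n = 0; n < nonce_count; ++n) {   // pass 1
        uint64_t h = hash_of(n);
        if (main_f.maybe(h)) clash.add(h);         // probably seen before
        main_f.add(h);
    }
    // main_f could be discarded here; only the tiny clash filter is needed.
    std::vector<std::pair<uint64_t, uint32_t>> cand;
    for (uint32_t n = 0; n < nonce_count; ++n)     // pass 2
        if (clash.maybe(hash_of(n))) cand.emplace_back(hash_of(n), n);
    std::sort(cand.begin(), cand.end());
    std::vector<std::pair<uint64_t, uint32_t>> out;
    for (size_t i = 1; i < cand.size(); ++i)       // adjacent equal = real collision
        if (cand[i].first == cand[i - 1].first) {
            out.push_back(cand[i - 1]);
            out.push_back(cand[i]);
        }
    return out;
}
```

Because the check-then-add in pass 1 only fires on the second occurrence of a hash, the second scan against the clash filter is what recovers both partners of a collision.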




bytemaster


November 08, 2013, 12:26:40 AM 

gigawatt, thank you for providing the first innovative algorithm for reducing the memory requirements. Let me attempt to post mitigating factors.
From a CPU miner's perspective, your reduction in memory comes at the expense of performance, so it does not break the algorithmic complexity of the algorithm.
From a GPU perspective, you have to populate a Bloom filter with 2^26 results... based on my understanding of how Bloom filters operate, this would require updating a common data structure from every thread, and the resulting memory race conditions could create false negatives. If you have to do this step sequentially, then you might as well use a CPU with memory.
So, do you have any solid algorithm that can populate a Bloom filter with 2^26 results in parallel?
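For what it's worth, one standard answer to this question is that Bloom-filter insertion is an idempotent bitwise OR: concurrent atomic ORs may duplicate work, but they can never clear a bit, so races cannot create false negatives. A hedged CPU-side sketch (std::thread here; a GPU kernel would use atomicOr), with a toy mixing hash of my own rather than the PoC's:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Bloom filter whose bit array is made of atomics, so many threads can
// insert into one shared filter concurrently. fetch_or is an idempotent
// set-bit: lost races only repeat work, they never lose an insert.
struct AtomicBloom {
    std::vector<std::atomic<uint64_t>> bits;
    uint64_t m;
    int k;
    AtomicBloom(uint64_t m_bits, int k_hashes)
        : bits((m_bits + 63) / 64), m(m_bits), k(k_hashes) {}
    static uint64_t mix(uint64_t x, uint64_t seed) {
        x += (seed + 1) * 0x9E3779B97F4A7C15ull;
        x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ull;
        x ^= x >> 27;
        return x ^ (x >> 31);
    }
    void add(uint64_t h) {
        for (int i = 0; i < k; ++i) {
            uint64_t b = mix(h, i) % m;
            bits[b / 64].fetch_or(1ull << (b % 64), std::memory_order_relaxed);
        }
    }
    bool maybe(uint64_t h) const {
        for (int i = 0; i < k; ++i) {
            uint64_t b = mix(h, i) % m;
            if (!((bits[b / 64].load(std::memory_order_relaxed) >> (b % 64)) & 1))
                return false;
        }
        return true;
    }
};

// Populate one shared filter from several threads at once, striding over
// the input so every value 0..total-1 is inserted by exactly one thread.
void parallel_fill(AtomicBloom& f, uint64_t total, int threads) {
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&f, t, total, threads] {
            for (uint64_t h = t; h < total; h += static_cast<uint64_t>(threads))
                f.add(h);
        });
    for (auto& th : pool) th.join();
}
```

After a parallel fill, every inserted value must still test positive; that is the no-false-negatives property the test below checks.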




bytemaster


November 08, 2013, 12:30:08 AM 

I suppose the performance of your algorithm would also suffer if SCRYPT(X) were used instead of SHA512(X), because the cost of doing this step twice would be much higher and less GPU-friendly.




bytemaster


November 08, 2013, 12:34:41 AM 

I wonder what would happen if we used NESTED Momentum proof-of-work?
Change the nonce-space of the outer proof-of-work to a pair of 16-bit numbers that result in an X-bit collision?
Now you have a more memory-intensive inner hash that is still quick to validate, but would significantly complicate GPU miners.




Etlase2


November 08, 2013, 12:46:01 AM 

I wonder what would happen if we used NESTED Momentum proof-of-work?
Change the nonce-space of the outer proof-of-work to a pair of 16-bit numbers that result in an X-bit collision?
Now you have a more memory-intensive inner hash that is still quick to validate, but would significantly complicate GPU miners.
AnonyMint suggested this very thing, but with scrypt, at one point. However, I think it would have made verification even more difficult. He would know better. He would probably also be a reasonable judge of this corollary to that idea.




bytemaster


November 08, 2013, 12:51:22 AM 

All further discussion in this thread should be moved here: http://bitsharestalk.org/index.php?topic=22.0
I do not have time to follow two threads or repeat myself... :) gigawatt: join us over at bitsharestalk, I will pay you a 0.5 BTC tip for your contribution.




gigawatt


November 08, 2013, 06:33:29 PM 

All further discussion in this thread should be moved here: http://bitsharestalk.org/index.php?topic=22.0
I do not have time to follow two threads or repeat myself... :) gigawatt: join us over at bitsharestalk, I will pay you a 0.5 BTC tip for your contribution.
I scooted my conversation. My address is in my signature.




smolen


November 13, 2013, 04:36:59 AM 

The magic lies in using a Bloom filter to store the intermediate hashes.
For the sake of history: Bloom filters were first mentioned in this thread in a mysteriously disappeared message by iruu, right before your post. Algorithm: we have p_w parallel workers and one m-bit Bloom filter, initialized to zero.




bytemaster


November 16, 2013, 07:01:51 AM 

The magic lies in using a Bloom filter to store the intermediate hashes.
For the sake of history: Bloom filters were first mentioned in this thread in a mysteriously disappeared message by iruu, right before your post.
This bounty has been awarded to gigawatt for his excellent work. Great progress, all!




tromp


January 09, 2014, 02:53:08 PM 

bytemaster,
First, I think the idea is quite interesting and novel. Let me summarize what you are proposing to make sure we’re on the same page. Then, I will raise my concerns with the current proposal (partly in hopes of collecting a bounty and partly because it’s a really cool idea that could be the grounds for some interesting new cryptocurrencies).
The question is whether Momentum or scrypt gives specialized hardware the larger performance advantage over the commodity hardware already in the hands of users.
Paraphrase of section 3 of your paper, with some notes: take data D (transactions, previous block hash, etc.) and calculate H = Hash(D). Find nonce A and nonce B such that BirthdayHash(A+H) == BirthdayHash(B+H) (in practice, a nonce C, D, etc. could also be required to match). If Hash(H+A+B) < TargetDifficulty, then you win.
My argument against Momentum compared to scrypt: the way you describe the algorithm and present it in the paper is all in terms of finding pairs of matching birthdays from a large set and the probability of finding a match. While this is the intuition behind how you can require lots of RAM while still being able to verify a correct answer quickly, it does not reflect the real goal of a miner. A miner does not want to find one match, but as many matches as possible, because each match is another chance at finding a hash under the difficulty. When looking at the number of matches, we should talk in terms of expected value.
Note that since Hash() is supposed to be trivially fast and is called far less often than the other functions, I'll just assume, for clarity, that we have an oracle giving a win-or-loss answer for each match, even though technically each match has to be hashed as Hash(H + A + B) to determine whether it's a winner. For the rest of this explanation I'll talk in terms of expected matches.
Let B = the number of 'birthdays' (the hash space of BirthdayHash). Let S = the number of BirthdayHash(nonce+H) results stored in RAM. Let M = the maximum value S can take given a particular system's RAM.
If we compute a new BirthdayHash(A + H), the expected number of matches (BirthdayHash(A+H) == BirthdayHash(B+H)) is S / B.
Now we calculate the total number of expected matches during a time frame (the average block creation time, for instance). Let N be the number of BirthdayHash() computations in that time. S will increase from 0 to N unless M < N, meaning we do not have enough RAM to store N results.
Total expected matches = N * Mean(S) / B.
If M >= N: Mean(S) = (1 + 2 + 3 + ... + N)/N = (N+1)/2.
If N > M: Mean(S) = ((1 + 2 + 3 + ... + M) + M(N-M))/N, which approaches M when N >> M.
If M >= N: the system scales the same way as scrypt, simply with the number of hashes, so I would build scrypt-style ASICs until N > M; up to that point the two are basically the same.
If N > M, but not by much: keep deploying scrypt-style ASICs; there is a minor hit once RAM is full, but it doesn't hurt too badly yet. If the goal is to land in this small sliver, it will be very hard to predict where it will be ahead of time, and it will likely vary drastically between CPU/GPU/ASIC, so hitting it exactly for one may miss the others.
If N >> M: this is where the lack of RAM rules out straight scrypt ASICs, so I assume this is the scenario Momentum is designed (and would be tweaked) to stay in. Now the total expected matches are N * M / B.
This means that two variables affect the total amount of mining: N, the number of BirthdayHash() computed per unit time, and M, which is basically the amount of RAM. The fundamental problem I see is that we now have quadratic scaling with increased resources rather than linear. For example: a standard system performs N BirthdayHash() and has M RAM, and so expects N*M/B matches. Someone who buys a system that costs five times as much now has 5N*5M/B matches, or 25 times the output, while paying only 5 times as much.
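The quadratic-scaling argument can be checked with a few lines of arithmetic. The numbers below are purely illustrative, not measurements of any system.

```cpp
#include <cassert>

// Expected matches per time frame in the N >> M regime: N * M / B.
// A machine c times as capable in both hash rate (N) and RAM (M) yields
// c^2 times the matches, which is the quadratic scaling described above.
double expected_matches(double n_hashes, double m_stored, double birthdays) {
    return n_hashes * m_stored / birthdays;
}
```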
With off-the-shelf hardware there will be a natural limit on RAM, perhaps such that cost rises dramatically once people have to start buying high-end servers; but custom ASICs can use standard commodity RAM, and while they would no doubt face some scaling and bus issues, even achieving a few multiples of the RAM normal users have would multiply the hashing power accordingly. At the same time, even if RAM isn't increased past commodity levels, ASIC makers who can use just as much RAM as commodity systems can still add the same multiplier as a straight scrypt ASIC would, since increasing the number of hashes per second alone still scales the number of matches linearly. For this reason I believe Momentum is worse than scrypt against customized hardware in the long run, although in the short term such hardware would no doubt be harder to design than straight scrypt ASICs.
Another interesting piece of the algorithm is the number of nonces that must have matching birthdays. Scaling this up to 3 or 4+ shouldn't affect the fundamental hashes-per-second times RAM scaling (I don't have time to do the math, but each case should rely on pair matching enough to behave almost identically), although it will affect the storage strategy and how you use the RAM. In fact, once the number of birthdays that must all match is greater than 2, pools could use this to gain an advantage. Consider a pool whose members, in addition to hashing on the same block, share any partial matches (when BirthdayHash(A+H) == BirthdayHash(B+H) and you still need C and D to match as well). Each member can then improve their odds by saving partial matches and discarding some single results. In this scenario it makes sense to join the largest pool, which gives you the best partial matches, so your expected matches go up, increasing your contribution. This is a dangerous incentive: even with BTC we have seen some large pools form without any particular incentive beyond lower variance.
All that said, it is an interesting idea and I’ll keep thinking about ways to utilize/improve it. It’s also quite possible that I’ve misunderstood your algorithm so let me know what you think. These are the kinds of ideas that we need to share and discuss in order to build new innovative cryptocurrencies and improve upon existing ones.
Let me know if anything is unclear. I'm not sure how this board does math notation.
Thanks,
Nathaniel
The new Cuckoo Cycle proof-of-work, described and implemented at https://github.com/tromp/cuckoo, doesn't suffer these drawbacks, since 1) at the default or any higher difficulty setting, the expected number of solutions per instance is less than one, so on failure you change the header (e.g. the timestamp field) and try a new instance, and 2) giving it much more memory than the standard algorithm uses doesn't help find a solution any faster.




BootstrapCoinDev


November 23, 2014, 08:38:15 PM 

I've managed to make a huge step forward in proving Momentum being not nearly as good for proof-of-work as intended. The magic lies in using a Bloom filter to store the intermediate hashes. [...]
Yes, very interesting concept. With so many problems with people pushing unfair bots into this right now, it looks pretty bad. There are so many coins right now that are easy and inexpensive to set up and get running fast, making money and lifting up this whole ecosystem. With the money you make, you can help others learn what can be done. As long as you feel comfortable with it, go for it!




p2p


January 29, 2016, 12:36:33 PM 

What's the status of Momentum POW? Was it used in Bitshares? Is it used in any coin?

Research coins on coinwik.org. Code is Nakamoto.



tromp


January 29, 2016, 01:53:40 PM 

What's the status of Momentum POW? Was it used in Bitshares? Is it used in any coin?
It is broken beyond repair, according to https://eprint.iacr.org/2015/946.pdf
Their Proposition 9 states a sublinear time-memory tradeoff: for instance, collisions can be found in memory O(sqrt(N)) and time O(N^(5/4)), instead of memory and time O(N). The Proposition is based on the golden collision search technique described in Section 4.2 of http://people.scs.carleton.ca/~paulv/papers/JoC97.pdf




