I don't know if anyone is still interested, and I hope I am not annoying readers. I will try to describe now the GPU vulnerability with education on technical concepts so that hopefully everyone can readily understand, so there hopefully won't be any doubt for those who desire to know.
First of all, let me describe what I believe the Momentum algorithm is, so we that we can see if we are all discussing the same thing. I will simplify some of the details which don't matter to my point. The Birthday hash algorithm generates a sequence of numbers, with each number being randomly generated (actually pseudo-random but never mind the slight distinction for now) and uniformly distributed over the number range specified.
In other words, we start with some number, then we use a SHA256 hash function on that number to get the next number in the sequence. The SHA256 hash function is applied to each output number over and over to generate a sequence of numbers.
We are looking for a pair of output numbers which are the same. We when find it, then that is the output of the Birthday hash algorithm.
So what is the fastest way to implement this? We must save the numbers in memory as we compute them (from repeated applications of the SHA256), then when we find we have already generated that number because it is already saved, then the hash solution is found.
Our options for saving the numbers as we compute them are to store them in the memory at an array index which is the number, or by storing them in a hash table where the key of the hash table is the number. The latter will use less memory. I will not explain hash tables here, because it is irrelevant to my point below.
The speed of the hash is thus determined by how fast SHA256 is and how fast we can store and lookup numbers from memory (either full random or hash table).
We know SHA256 can be parallelized and accelerated, in fact even some CPUs have accelerated custom logic for doing so. So asymptotically SHA256 will always be faster than the memory accesses in terms of the optimum hardware that can be made given the whole point is to force a use of large memory. I don't think anyone argues this point nor any point I have made above.
Thus the relative speed of the CPU and the GPU (or other custom hardware) is going to depend on the relative speed of the memory access and the level of parallelism that can be applied to the storage and lookup of numbers.
We have two choices for the algorithm. Either we can specify that only the first number of the sequence that is the same is allowed as the solution. In this case, verification can no longer be fast, because verification will also have to store and lookup all the numbers. So if we specify that any number of the sequence that matches is allowed as a solution, then verification can be much faster than when we search for a match, because verification doesn't need to store all the numbers that don't match the number we've told verification will match.
So obviously bytemaster chose the second option, as he needs fast verification. I don't think anyone disputes any of the above.
Let us assume that we can run SHA256 so fast that we can compute 100 candidate numbers asymptotically fast relative to the memory accesses.
However, the second option choice that provides fast verification, also allows us to run numerous simultaneous hardware threads (cores) each will simultaneously store and lookup a different number from the same sequence in the same memory simultaneously.
Now you need to understand the way this works on hardware. While waiting on memory the core will block (meaning it will sleep) and let another core run that isn't waiting on memory.
Random access latency on main memory for both CPU and GPU is very high on the order of several hundred clock cycles. That is an egregious amount of time that a single thread would be sleeping doing nothing if you didn't run simultaneously threads on the same sequence and same memory (i.e. within the same hash instance).
However, if you mix a multi-threaded (parallelized) computation of the SHA256 sequence with simultaneous threads for the memory stores and lookups, then you can asymptotically reduce that computation to 0 time. As well if your simultaneous memory access have collisions in the cache line and memory bus sizes (and don't forget that caches are multi-way mapped into memory so this doesn't just require sequential memory locations), then what happens if that the more hardware threads (cores) you run simultaneously, then the memory latency gets masked away and the underlying memory bandwidth becomes the limit of the speed. This is a fundamental quality of the GPU, and it was designed to have this advantage. If you don't design to defeat it, you won't. Momemtum algorithm does nothing to defeat it.
Since the i7 CPU has at most 8 hardware threads and the GPU has 100s or 1000s of hardware threads, then the GPU can reach its memory bandwidth and the CPU will not even get close to its 20 GBps bandwidth because the CPU has a statistically much lower chance of coherence (collisions) on the accesses. Additionally the GPU main memory bandwidth is up to 10x faster (e.g. 260 GBps for Radeon Tahiti versus 20 GBps for i7).
So the ratio of speed between the GPU and the CPU on this coin is going to be on the order of 260 GBps versus 1 GBps, i.e. two orders-of-magnitude. This is a major fail, since Litecoin is only one order-of-magnitude.
Now someone might counter that GPU cores are not as powerful as i7 cores. True but irrelevant, because the memory latency is the bound here not the computational power per core.
I am open to discussion, questions, and refutations.
.... but by making verification faster than finding a hash, I assert he opened a vulnerability that makes it easy to parallelize the finding of a hash and thus not CPU-only ...
Having a cheap verify function is a must for a proof of work, otherwise loading the blockchain would take forever. BTW, I have yet to see any evidence or proof of concept of your claims. Ranting is ok, especially in this subforum, but bear in mind that for this very same reason, people in here are used to start taking things seriously only when an evidence is provided, not before.
I have already described the algorithm as linked in my prior post. If there are any questions, I will answer when asked.
Anyone who is interested can code it and test it. I am not particularly interested to do so, because I am quite happy if bytemaster gets totally dependent on a non-CPU-only PoW. I have studied all the major PoW altcoins and none of them are CPU-only in my analysis. I don't have time to go proving them all, by implementing them all on GPUs. But I do suppose I will pay someone to do that soon (within months), so it benefits me to let these altcoins waste people's time so that they will then believe me when I later pay the someone like the author of cpuminer (if he is willing) to help me.
Indeed in my deep study, I have come to the conclusion that no coin can be CPU-only and have fast verification. Thus this presents a major problem for both verifying a long blockchain and for dealing with denial-of-service attacks (since the rate of determining if a received transaction or block solution is valid thus decreases). It is possible this is exacerbated for BitShares, because they attempt to do so much more within a block with their market clearing of bid/ask, but I would need to study their code to verify.
I would be happy to describe the algorithm for the GPU attack in more detail after I get back from eating dinner.
Genuine question about the protocol, correct me if I'm wrong anywhere: Bitcoin works well because it takes a huge amount of work to find a block, but only takes one hash to verify that a block is valid. So a wallet can verify thousands of blocks fairly quickly. From what I understand Protoshares works by the same concept, except one Protoshares hash takes many orders of magnitude more work to perform. Does the Protoshares wallet need to perform one hash to verify each block in the network? Seeing how most computers get ~5 hashes per min, wouldn't that take an outrageously long time to verify the block chain once it gets bigger?
bytemaster apparently designed Momentum PoW and didn't use Scrypt for PoW for precisely that reason, but by making verification faster than finding a hash, I assert he opened a vulnerability that makes it easy to parallelize the finding of a hash and thus not CPU-only. See
the link to my claim on the bounty and judge for yourself.
And when (if) someone does implement a faster GPU miner for this coin, please kindly acknowledge I was correct. You don't have to tip me anything, the acknowledgement would be more than enough.