Probably due to having
twice as many fat registers in 64-bit mode, which means, among other possibilities, you can keep roughly twice as much work in flight in the pipeline (although a hyperthreaded CPU should compensate if your pipelining is already fully saturated with 16 fat registers).
I've posted an informal summary of my analysis of the CryptoNight algorithm earlier in this thread with respect to GPU balance and its likely eventual balance with ASICs.
It's good.
Link please?
The L3 cache by itself is almost half of the chip.
I looked at an image of the Haswell die and it appears to be less than 20%. The APU (GPU) takes up more space on the consumer models. On the server models there is no GPU, and the cache is probably a higher percentage of the die.
There is also a 64-bit multiply, which I'm told is non-trivial. Once you combine that with your observation about Intel having a (likely persistent) process advantage (and also the inherent average unit-cost advantage of a widely used general-purpose device), there just isn't much, if anything, left for an ASIC maker to work with.
So no, I don't think the point is really valid. You won't be able to get thousands of times anything with a straightforward ASIC design here. There may be back doors, though; we don't know. The point about the lack of a clear writeup and peer review is valid.
The CPU has an inherent disadvantage in that it is designed to be a general purpose computing device so it can't be as specialized at any one computation as an ASIC can be.
This is obviously going to be true, but the scope of the task here is very different. Thousands of copies will not work.
I believe that is wrong. I suspect an ASIC can be designed that vastly outperforms the CPU (at least on a power-efficiency basis), and one of the reasons is that the algorithm is so complex, thus it probably has many ways to be optimized with specific circuitry instead of generalized circuitry. My point is that isolating a simpler ("enveloped") instruction such as aesenc would be a superior strategy (and embrace USB-pluggable ASICs and get them spread out to the consumer).
Also, I had noted (find my post in my thread from a couple of months ago) that because of the way AES is incorrectly employed as a random oracle (as the index for the lookup in the memory table), the algorithm is very likely subject to some reduced solution space. This is perhaps Claymore's advantage (I could probably figure it out if I were inclined to spend sufficient time on it).
There is no cryptographic analysis of the hash. It might have impossible images, collisions, etc..
I strongly disagree.
The algorithm is *not* complex, it's very simple. Grab a random-indexed 128 bit value from the big lookup table. Mix it using a single round of AES. Store part of the result back. Use that to index the next item. Mix that with a 64 bit multiply. Store back. Repeat. It's intellectually very close to scrypt, with a few tweaks to take advantage of things that are fast on modern CPUs.
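To make that concrete, here is a rough sketch in C of that loop structure using the AES-NI intrinsic. This is not the exact CryptoNight specification; the key schedule, the index masking, and how the multiply result is folded back in are simplified for illustration only.

```c
#include <stdint.h>
#include <string.h>
#include <immintrin.h>   /* _mm_aesenc_si128 (AES-NI), _mm_cvtsi128_si64 */

#define PAD_BYTES  (2u << 20)          /* 2 MB scratchpad               */
#define PAD_LINES  (PAD_BYTES / 16)    /* 128-bit (16-byte) entries     */

/* Sketch of the mixing loop: load a random-indexed 128-bit value, mix it
 * with one AES round, store it back, then derive the next index with a
 * 64-bit multiply.  Details deliberately simplified. */
static void mix_loop_sketch(__m128i *pad, __m128i round_key, uint32_t iters)
{
    uint64_t idx = 0;
    for (uint32_t i = 0; i < iters; i++) {
        __m128i v = pad[idx % PAD_LINES];      /* random-indexed load     */
        v = _mm_aesenc_si128(v, round_key);    /* one AES round           */
        pad[idx % PAD_LINES] = v;              /* store the result back   */

        uint64_t a = (uint64_t)_mm_cvtsi128_si64(v);   /* low 64 bits     */
        uint64_t b;
        memcpy(&b, &pad[a % PAD_LINES], sizeof b);     /* next table value */
        idx = a * b;   /* 64-bit multiply selects the next index
                          (store-back of the product is omitted here)      */
    }
}
```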
I know more about this because I independently developed a very similar algorithm several months ago, which I named L3crypt. I also solved several problems that are not solved in CryptoNight, such as the speed.
The concept of the lookup gap can still be utilized to trade computation for space. I will quote from my white paper below, which also explains why I abandoned it (at least until I can do some real-world testing on GPUs) once I realized that the coalescing of memory accesses is likely much more sophisticated on the GPU.
However, since the first loop of L3crypt overwrites values in random order instead of sequentially, a more
complex data structure is required for implementing the "lookup gap" than was the case for Scrypt. For every element,
store in an elements table the index of a values table, the index of the stored value in that values table, and the
number of iterations of H required on the stored value. Each time an element in the values table needs to be
overwritten, an additional values table must be created, because other elements may reference the existing stored
value.
Thus for example, reducing the memory required by up to half (if no element is overwritten) doubles the number of H
computed for the input of FH as follows, because there is a recursive 50% chance to recompute H before reaching a
stored value.
1 + 1/2 + 1/4 + 1/8 + 1/16 + ... = 2
Storing only every third V[j], reducing the memory required by up to two-thirds (if no element is overwritten), trebles the
number of H computed for the input of FH as follows, because there is a recursive 2/3 chance to recompute H before
reaching a stored value.
1 + 2/3 + 4/9 + 8/27 + 16/81 + ... = 3
The increased memory required due to overwritten 512B elements is approximately the factor n. With n = 8, reducing the
1MB memory footprint to 256KB would require 32X more computation of H if the second loop isn't also overwriting elements.
Optionally, given that the second loop overwrites twice as many 32B elements, reducing the 1MB memory footprint to 256KB
would also require 4X more computation of FH.
However, since the execution time of the first loop, which is bounded by latency, can be significantly reduced by trading
recomputation of H for lower memory requirements if FLOPs exceed the CPU's by significantly more than a factor of 8, it is
a desirable precaution to make the latency bound of the second loop a significant portion of the execution time, so that
L3crypt remains latency bound in that case.
Even without employing "lookup gap", the GPU could potentially execute more than 200 concurrent instances of L3crypt to
leverage its superior FLOPs and offset the 25x slower main memory latency and the CPU's 8 hyperthreads.
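For concreteness, here is a minimal C sketch of the bookkeeping the quoted passage describes. The type and field names are mine, not from the L3crypt paper, and H and the values tables are only stubbed out.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define VALUE_LEN 16   /* bytes per stored value (assumed) */

typedef void (*hash_fn)(uint8_t out[VALUE_LEN], const uint8_t in[VALUE_LEN]);

/* One entry of the elements table: which values table holds the nearest
 * stored value, where it sits in that table, and how many iterations of H
 * must be applied to it to recover this element.  A fresh values table is
 * appended whenever a stored value would otherwise be overwritten, since
 * other elements may still reference it. */
typedef struct {
    uint32_t values_table;   /* index of a values table                      */
    uint32_t value_index;    /* index of the stored value in that table      */
    uint32_t h_iterations;   /* iterations of H required on the stored value */
} element_ref;

/* Recover an element by re-applying H to its stored ancestor:
 * memory saved is paid for with recomputation. */
static void recover_element(uint8_t out[VALUE_LEN], const element_ref *e,
                            const uint8_t *const *values_tables, hash_fn H)
{
    memcpy(out,
           values_tables[e->values_table] + (size_t)e->value_index * VALUE_LEN,
           VALUE_LEN);
    for (uint32_t k = 0; k < e->h_iterations; k++)
        H(out, out);
}
```

With a lookup gap of g, the expected extra H per lookup is the geometric series quoted above (2 for g = 2, 3 for g = 3); that is precisely the computation-for-space dial a hardware designer gets to turn.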
So if you can trade computation for space, then an ASIC can potentially clobber the CPU. The GPU would be beating the CPU, except for the inclusion of the AES instructions which the GPU doesn't have. An ASIC won't have this limitation.
Also, the use of AES as a random oracle to generate the lookup index into the table is a potential major snafu, because AES is not designed to be a hash function. Thus it is possible that certain non-randomized patterns exist which can be exploited. I
covered this in a post in my thread, which the Monero developers were made aware of but apparently decided to conveniently ignore(?).
Adding all of these together, it may also be possible to utilize a more efficient caching design on the ASIC that is tailored to the access profile of this algorithm. There are SRAM caches for ASICs (e.g. Toshiba), and there is a lot of leeway in terms of parameters such as set associativity, etc.
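As a toy illustration (all sizes and names here are hypothetical, not from any real design), the index/tag split and associativity of an on-chip SRAM cache are entirely free parameters for an ASIC designer and can be sized against the scratchpad's access profile:

```c
#include <stdint.h>

#define LINE_BYTES 16            /* one 128-bit scratchpad entry per line  */
#define WAYS       4             /* set associativity: a free parameter    */
#define SETS       (1u << 13)    /* 8K sets x 4 ways x 16 B = 512 KB SRAM  */

/* Which set a scratchpad address maps to, and the tag stored alongside it. */
static inline uint32_t set_index(uint32_t addr) { return (addr / LINE_BYTES) % SETS; }
static inline uint32_t tag_bits (uint32_t addr) { return (addr / LINE_BYTES) / SETS; }
```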
So sorry, I don't think you know in depth what you are talking about. And I think I do.
Remember that there are two ways to implement the CryptoNight algorithm:
(1) Try to fit a few copies in cache and pound the hell out of them;
(2) Fit a lot of copies in DRAM and use a lot of bandwidth.
Approach (1) is what's being done on CPUs. Approach (2) is what's being done on GPUs.
And the coalescing of memory accesses for #2 is precisely what I meant. It is only the AES instructions that are impeding the GPU from clobbering the CPU.
I tried implementing #2 on CPU and couldn't get it to perform as well as my back-of-the-envelope analysis suggests it should, but it's possible it could outperform the current CPU implementations by about 20%. (I believe yvg1900 tried something similar and came to the same conclusion I did).
No, because external memory access performance declines as the number of threads simultaneously accessing it increases (less so for the high-end server CPUs). So what you saw is what I expected. If you need me to cite a reference I can go dig it up.
An ASIC approach might well be better off with #2; however, that simply moves the bottleneck to the memory controller, and it's a hard engineering job compared to building an AES unit, a 64-bit multiplier, and 2MB of DRAM. But that 2MB of DRAM area limits you in a big way.
Computation can be traded for space in order to use fast caches; see what I wrote up-post. And/or you could design an ASIC to drop into an existing GPU memory-controller setup, etc. There are numerous options. Yes, it is a more difficult engineering job, which is worse for CryptoNight, because it means whoever is first to solve it will limit supply and give an incredible advantage to a few, which is what plagued Bitcoin in 2013 until the ASICs became ubiquitous. This proprietary advantage might linger for a much longer duration.
In my best professional opinion, barring funky weaknesses lingering within the single round of AES, CryptoNight is a very solid PoW. Its only real disadvantage is comparatively slow verification time, which really hurts the time to download and verify the blockchain.
In my professional opinion, I think you lack depth of understanding. What gives?