in curve25519 field the fmul is the fast operator, millions per second, and is typically called the "addition". both addition and multiply are order independent with unit element and I avoid the zero, so no issues with additive zero vs multiplicative zero
the "multiply" is cmult (complex multiply with montgomery)
even with cmult it does hundreds of thousands per second.
also, by keeping at the 320 bit polynomial form for the fmul, there is only the one fexpand per field element and one fcontract at the end
My guess is that the fmul will be speed competitive with sha256 of the merkle leaves
Blake2 can do apparently
several hundred thousand hashes per second per thread roughly guesstimated for an Intel Xeon.
Since you have to hash each leaf before you multiply, then the multiplies need to be faster then the rest of the Merkel tree other than leaves. Also my other point (you quoted me before I edited my prior post) is that for the intermerdiary and recipient, the Merkel tree is more space efficient because not all the leaves have to provided with the proof.
So seems roughly on par speed wise but Merkel is more space efficient when only a portion of the leaves (i.e. the chunks) are being proved/provided to the recipient.
Even though I am pretty good at estimating speed of algorithms, I prefer to benchmark based on actual code running vs hypotheticals. It is very easy to overlook effects of the various levels of caching that happens in CPU and system and HDD.
without compiler optimizations:
10000000 expand/fmuls in 2559.064 milliseconds
10000000 sha256(32bytes) in 23867.496 milliseconds
with -O2
10000000 expand/fmuls in 1067.811 milliseconds, about 100 nanoseconds each
10000000 sha256(32bytes) in 7372.751 milliseconds, about 737 nanoseconds
1000 sha256(1MBbytes) in 8808.173 milliseconds, about 8.8 million nanoseconds
With -O2 optimizations, I am getting almost 1.4 million sha256 per second on my blazing fast 1.4Ghz i5. The sha256 of sblock takes almost 9 milliseconds, so just a bit more than 100/second
and
10 million expand/fmul's per second
"seems roughly on par speed wise" not sure what you mean by on par, but 7x to 10x faster seems a different game of golf. And it is literally 4 orders of magnitude faster than the sha256 of 1MB, so clearly some faster hash function is needed for doing TB datasets, if the blake is much faster and is cryptographically strong enough regarding collisions, sounds like a good improvment. But I just cant see anything approaching the 100 nanoseconds speed.
As I estimated, the cost for doing all the field mults is insignificant compared to the sha256 of only 32 bytes(!) or even the HDD data transfer.
As usual it is a speed vs time tradeoff. If you have plenty of time and not much HDD space and dont mind things 10x slower, then merkle tree. [merkle trees add a layer of complication and chance for bugs and also make it harder to explain] If you dont mind using 32 bytes per sblock (1MB) and want the 10 million ops per second, then the fmuls. Keep in mind this was just a lunch break first pass. If space is really so important, then I think I can make it smaller, but really at this point need to hear from OP if this even solves the problem at hand and being 10x faster than sha256 of 32 bytes, I dont think anybody can say the multiplies are too slow.
This is the problem with assuming things that everybody "knows". Sure, use that as a guideline, but I believe my experimental results far more than what is supposed to happen.
That is how I achieve 30 minute sync time for BTC blockchain from scratch. Not a week, not a day, just 30 minutes. It is on a 32GB VPS server with 1gbps connection, but still I think it is a very good result. I used to think BTC blockchain size was a very big problem, now it seems to be a long term issue with time enough to solve. Moore's law wont help, but time to create more efficient solutions will.
At first I thought some sort of ZK with pederson commitments was needed for this (probably since that is what I was looking at for something else). But since I only had half hour for lunch, I tried to come up with something I can code in that time, and a simple set of multiplies was my limit. Hopefully it works. For now I dont have any more time for this, at least till I hear from the OP
James
http://docs.supernet.org/ has first pass docs on the SuperNET API, I used the code from it to create the above solution. It is still a work in progress, but gives a good overview of the scope