Poll
Question: Viᖚes (social currency unit)?
like - 27 (27.6%)
might work - 10 (10.2%)
dislike - 17 (17.3%)
prefer tech name, e.g. factom, ion, ethereum, iota, epsilon - 15 (15.3%)
prefer explicit currency name, e.g. net⚷eys, neㄘcash, ᨇcash, mycash, bitoken, netoken, cyberbit, bitcash - 2 (2%)
problematic - 2 (2%)
offending / repulsive - 4 (4.1%)
project objectives unrealistic or incorrect - 10 (10.2%)
biased against lead dev or project ethos - 11 (11.2%)
Total Voters: 98

Author Topic: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin?  (Read 95272 times)
trollercoaster (Legendary)
Activity: 1050, Merit: 1001
January 27, 2016, 01:42:06 AM
#801

TPTB_need_war was banned for 3 days for writing in big red letters that "Ethereum is broken and can't be fixed" and then proceeding to defend that point with facts.

And so the mods have now demonstrated they are involved in the pump of Ethereum.

So much for the objectivity of this forum.

They allow excessive trolling and scams no problem though.
Why can't he just leave us all alone to enjoy our circle jerk.

Fixed that for you.
altcoinUK (Hero Member)
Activity: 784, Merit: 1000
January 27, 2016, 12:23:44 PM
#802


And now the narcissistic loony begs for empathy about his faked illness over at Reddit. It is really a class act.

Yes, he is at Reddit, and it says a lot about this mostly wannabe-rich, greedy, absolutely clueless, high-school-lunch-money-sized, average 20-25 year old "crypto community" that they have been benefiting from TPTB_need_war's knowledge, yet nobody is there to support him. The key word is "mostly": there are a few good people here, but the majority fits into those categories.
TPTB_need_war (OP, Sr. Member)
Activity: 420, Merit: 262
January 29, 2016, 09:51:26 AM
Last edit: January 29, 2016, 11:34:02 AM by TPTB_need_war
#803

When I was at Fractal Design Corp briefly in 1993 and 1995, working as a programmer on Painter X2 and 3 closely with the two founders of the company (who were also programmers), on my own initiative I took on the role of communicating with Tech Support. I built some friendships there and served as a liaison, helping articulate technical issues in ways they could explain to users. One of the guys seemed capable of becoming a programmer himself, and I told him that.

Well, today I come to find out he is the CTO and co-founder of a major social media publishing company focused on the music realm. God, or damn good luck, kissed me on the shoulder in terms of plans for a Bitcoin killer.

I will mostly shut up now. I have said enough in these forums.

I do want to do some technical analysis of Synereo and will be posting about that on Reddit. I have said all I need to say otherwise. Cheerio.



P.S. Here is my LinkedIn account with my photo from January 2015, career history and education.

TPTB_need_war (OP, Sr. Member)
Activity: 420, Merit: 262
January 30, 2016, 02:04:41 AM
Last edit: January 30, 2016, 02:16:51 AM by TPTB_need_war
#804

And they insulted me about doing a 30-second rough sketch of a logo (just to see how it would appear, not because I thought it was a great result):



Saint Patrick's Day  Huh  Roll Eyes

4-leaf clover or a bowtie.

Smooth, you are out of your experience zone.

The above is apparently after "many iterations", whereas my ideas below were quick one-offs:



Very roughly (the proportions are not balanced in this rapidly hacked up mockup) what I am thinking about for a Sync logo, if we choose the Sync name for the consensus network.



I am still considering 'ion' as well. Perhaps people can learn to think of it as the "electric charge" (fuel) they need to do actions in virtual reality. The logo would probably be a lightning bolt and/or positively charged ions floating in a cloud. If 'ion' wins out, I won't be upset. I just want to make sure all perspectives have been weighed by voters.

Here were some refined logos I did:

Here are various logos I designed from 1999 to 2002:





OROBTC (Legendary)
Activity: 2926, Merit: 1863
January 30, 2016, 02:25:12 AM
#805

...

Welcome back, TPTB.

Now act nice, ya hear?   :p   Smiley
TPTB_need_war (OP, Sr. Member)
Activity: 420, Merit: 262
January 30, 2016, 08:42:50 AM
Last edit: January 30, 2016, 02:49:33 PM by TPTB_need_war
#806

I decided to publish a section of my research on memory-hard PoW hash functions from 2013. This is only the first section, where I analyzed Scrypt. I may have published this section before, when Monero was first announced and I was publicly debating with one of the Monero dudes about Cryptonote's PoW hash Cryptonite (yet another discussion that turned condescending and foul-mouthed). Note there are 23 references cited in the complete version of this paper. This explains why Cryptonite employs AES-NI instructions to defeat the GPU's FLOPS and superior memory bandwidth advantage.

I claim my analysis of tromp's Cuckoo Cycle below predated this review, as I added it to my paper immediately after tromp posted about his new paper:

David Andersen. A public review of cuckoo cycle. http://www.cs.cmu.edu/dga/crypto/cuckoo/analysis.pdf, 2014
http://da-data.blogspot.com/2014/03/a-public-review-of-cuckoo-cycle.html



Scrypt ROMix

   1: X <- B
   2: for i = 0 to N-1 do
   3:    V[i] <- X
   4:    X <- H(X)
   5: end for
   6: for i = 0 to N-1 do
   7:    j <- Integerify(X) mod N
   8:    X <- H(X ^ V[j])
   9: end for
  10: B' <- X
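
For concreteness, here is a minimal runnable sketch of ROMix in Python, with SHA-256 standing in for scrypt's BlockMix-based H (an illustrative simplification, not scrypt itself):

    import hashlib

    def H(x: bytes) -> bytes:
        # Stand-in for scrypt's BlockMix; any fixed-width hash illustrates the flow.
        return hashlib.sha256(x).digest()

    def integerify(x: bytes, n: int) -> int:
        # Interpret the hash output as an integer mod N (line 7 above).
        return int.from_bytes(x, "little") % n

    def romix(b: bytes, n: int) -> bytes:
        x = b
        v = []
        for _ in range(n):        # lines 2-5: fill memory sequentially
            v.append(x)
            x = H(x)
        for _ in range(n):        # lines 6-9: data-dependent random reads
            j = integerify(x, n)
            x = H(bytes(a ^ c for a, c in zip(x, v[j])))
        return x                  # line 10

    print(romix(b"\x00" * 32, 1 << 10).hex())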

Without parallelism the execution time of ROMix is bounded by that of the hash function H or the random access memory latency to read V[j].

Without parallelism, the memory bandwidth to write V[i] cannot be a significant factor: even if the execution time of the first loop were bound by memory bandwidth rather than by H (it is not bound by random access latency, since the writes are sequential), the random access latency to read V[j] in the second loop is still slower than the memory bandwidth in the first loop.

Percival's sequential memory-hard proof [1] states that redundant recomputation of H in exchange for a reduced memory footprint will not be asymptotically faster. For example, even if H is asymptotically (as N goes to infinity) perfectly distributed in the Random Oracle model, so that the second loop accesses each V[j] at most once, reducing the memory footprint by ⅔ (by recomputing the H for every second and third element in the second loop instead of retrieving them from memory) means the H for every second element is computed twice.

Percival's proof fails if the random access latency is not insignificant compared to the execution time of H, because the execution time of H in a parallel thread is free: it is masked by the other thread, which is stalled for the duration of the latency. This is why BlockMix is employed to increase the execution time of H.

Consider the example where the execution of H is twice as fast as the random access memory latency, i.e. H executes in ½ the delay of each random access. Analogous to cpuminer's "lookup gap" of 3, the computation of H for every second and third element of V[j] can be repeated in the second loop instead of retrieved from memory. The result is ⅓ the memory requirements, an average execution time equal to the latency (the computed value of 1 below), and only a ½ × ⅓ = 1/6 average increase in computational cost for accessing the third elements, which is masked by the ½ of the latency not offset by H in line 8. Each first, second, and third element of V[j] has a ⅓ probability of being accessed, so the relative execution time is computed as follows.

   ½ × ⅓ + (½ + ½) × ⅓ + (½ + ½ + ½) × ⅓ = 1
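
A quick check of that arithmetic: an element stored k-1 steps back costs k computations of H (including line 8's own H), each costing h units of the random access latency, and each residue class of j is equally likely.

    # g = lookup gap, h = cost of H in units of the random access latency
    def relative_time(g: int, h: float) -> float:
        return sum(k * h for k in range(1, g + 1)) / g

    assert relative_time(3, 0.5) == 1.0   # reproduces the value of 1 above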

Since GPU DDR main memory has nearly an order-of-magnitude slower random access latency than the CPU's DRAM, the GPU employs a "lookup gap" to reduce the memory footprint, allowing more parallel instances of Scrypt to execute simultaneously, up to the available number of threads and memory bandwidth. The GPU's order-of-magnitude faster memory bandwidth allows running more parallel instances of the first loop. Thus the superior FLOPS of the GPU are fully utilized, making it faster than the CPU.

L3crypt

[...]

Even without employing "lookup gap", the GPU could potentially execute more than 200 concurrent instances of L3crypt to leverage its superior FLOPS and offset its 25x slower main memory latency and the CPU's 8 hyperthreads. To defeat this, the output of L3crypt should be hashed with a cryptographic hash that leverages the CPU's AES-NI instructions, with enough rounds to roughly equal the computation time of L3crypt. GPUs are roughly at parity with AES-NI in hashes per watt [24].
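
One possible shape for that wrapping step, sketched in Python with the `cryptography` package (which uses AES-NI via OpenSSL); the key handling, block size, and round count are illustrative assumptions, not the construction from the paper:

    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def aes_wrap(x: bytes, key: bytes, rounds: int) -> bytes:
        # Iterate hardware AES over the first 16 bytes of the (hypothetical)
        # L3crypt output; `key` must be 16/24/32 bytes. Tune `rounds` so the
        # wrap costs about as much CPU time as L3crypt itself, denying the
        # GPU a cheap place to win back its FLOPS advantage.
        enc = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
        block = x[:16]
        for _ in range(rounds):
            block = enc.update(block)
        return block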

[...]

Asymmetric Validation

Verification—that a hash output is valid for a given input—should be orders-of-magnitude more efficient than computing the hash.

The validation ratio when used in a proof-of-work system also depends on how fast the hash is, because validation only requires one hash whereas solving proof-of-work requires as many hashes as the collective difficulty or more importantly the pool's minimum share difficulty.
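
A toy illustration of that asymmetry for a plain hash-preimage PoW, with SHA-256 as an arbitrary stand-in hash: solving takes on the order of 2^bits evaluations, verifying exactly one.

    import hashlib, itertools

    def valid(header: bytes, nonce: int, bits: int) -> bool:
        # Verification: one hash, check for `bits` leading zero bits.
        h = hashlib.sha256(header + nonce.to_bytes(8, "little")).digest()
        return int.from_bytes(h, "big") >> (256 - bits) == 0

    def solve(header: bytes, bits: int) -> int:
        # Solving: ~2^bits hashes on average for the same one-hash check.
        return next(n for n in itertools.count() if valid(header, n, bits))

    nonce = solve(b"example-header", 16)        # ~65536 hashes
    assert valid(b"example-header", nonce, 16)  # 1 hash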

The Cuckoo Cycle hash [19] significantly increases the asymmetric validation ratio, but has unprovable security because its algorithm is based on permutated orderings which do not incorporate diffusion and confusion [20], and thus it is not provably secure in the Random Oracle model, i.e. we can't prove there aren't algorithms to speed up the solutions. Although its entropy is large, i.e. the factorial permutation of all buckets in the hash space, it doesn't require that its entire memory space be accessed, thus possibly bit flags could be employed to reduce the memory used and make it faster.

The Cuckoo Cycle hash isn't a personal-computer-only hash, because it requires only a minimal CPU and is non-sequentially latency bound: it can be parallelized over the same memory space, masking much of the latency. Thus, for example, Tilera server CPUs will outperform Intel's personal computer CPUs, because Tilera has 3X to 12X more hardware threads per watt and isn't plagued by the GPU's slow main memory latency. For L3crypt, by contrast, no further parallelization is possible, so even though, compared to the 8-core Xeon E5 or Haswell-E, the Tilera has the same L3 cache [21] and 50% to 67% of the power consumption, its latency is 2X greater [22] and each clock cycle is 2X to 3X slower. Although parallelization can be applied to computing H, to try to match Intel's AVX2 acceleration, L3crypt is sequentially memory latency bound.

[...]

Future Proof


CPU memory bandwidth is doubling approximately every four years [7], with up to a 50% improvement expected by 2015 [6], and memory size is doubling approximately every two years; combined, that is a doubling roughly every 16 months, close to the 18-month performance doubling Moore's Law expects [8], computed as follows.

   2^(years/2) × 2^(years/4) = 2^(3 × years/4)
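
The same arithmetic as a one-liner:

    # size doubles every 2 years, bandwidth every 4: exponents add to 3/4,
    # so the product doubles every 1/(1/2 + 1/4) = 4/3 years (~16 months)
    print(1 / (1 / 2 + 1 / 4))   # 1.333...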

However Percival noted that memory latency is not following Moore's Law [1].

References

[1] Percival, Colin. Stronger key derivation via sequential memory-hard functions.
    BSDCan'09, May 2009. http://www.tarsnap.com/scrypt/scrypt.pdf

[...]

[6] http://www.pcworld.com/article/2050260/hefty-price-premium-awaits-early-ddr4-memory-adopters.html

[7] http://www.timeline-help.com/computer-memory-timeline-4.html

[8] http://en.wikipedia.org/wiki/Moore's_law#cite_note-IntelInterview-2

[...]

[19] https://github.com/tromp/cuckoo/blob/master/cuckoo.pdf

[20] http://en.wikipedia.org/wiki/Confusion_and_diffusion
     http://www.theamazingking.com/crypto-block.php

[21] http://www.tilera.com/sites/default/files/productbriefs/TILE-Gx8072_PB041-03_WEB.pdf

[22] http://www.tilera.com/scm/docs/UG101-User-Architecture-Reference.pdf#page=369

[...]

[24] https://www.iacr.org/workshops/ches/ches2010/presentations/CHES2010_Session06_Talk03.pdf#page=16

tromp (Legendary)
Activity: 990, Merit: 1110
January 30, 2016, 03:41:20 PM
#807

Quote
Verification—that a hash output is valid for a given input—should be orders-of-magnitude more efficient than computing the hash.

The validation ratio when used in a proof-of-work system also depends on how fast the hash is, because validation only requires one hash whereas solving proof-of-work requires as many hashes as the collective difficulty or more importantly the pool's minimum share difficulty.

Cuckoo Cycle is not a hash.

Quote
The Cuckoo Cycle hash [19] significantly increases the asymmetric validation ratio, but has unprovable security because its algorithm is based on permutated orderings which do not incorporate diffusion and confusion [20]

Diffusion and confusion are properties of hash functions. Cuckoo Cycle is not a hash function.
You need to get your basics straight.

Quote
Although its entropy is large, i.e. the factorial permutation of all buckets in the hash space, it doesn't require that its entire memory space be accessed, thus possibly bit flags could be employed to reduce the memory used and make it faster.

There is no "entire memory space". Cuckoo Cycle uses memory for data structures, not for filling up with random garbage like scrypt does.

I agree that Cuckoo Cycle can be parallelized. In fact the GPU solver works best with 16384 threads. But it's only marginally faster than 4096 threads, because they're saturating the random access memory bandwidth.
coins101 (Legendary)
Activity: 1456, Merit: 1000
January 30, 2016, 04:25:49 PM
#808

Proving data is stored on a decentralised node is something of an ongoing project. 

Others are looking at the issue:

https://bitslog.wordpress.com/2015/09/16/proof-of-unique-blockchain-storage-revised/

So far I think PoW for nodes to validate blocks or data they contain is an interesting approach.
TPTB_need_war (OP, Sr. Member)
Activity: 420, Merit: 262
January 30, 2016, 06:52:44 PM
Last edit: January 31, 2016, 04:27:11 AM by TPTB_need_war
#809

Quote
Verification—that a hash output is valid for a given input—should be orders-of-magnitude more efficient than computing the hash.

The validation ratio when used in a proof-of-work system also depends on how fast the hash is, because validation only requires one hash whereas solving proof-of-work requires as many hashes as the collective difficulty or more importantly the pool's minimum share difficulty.

Quote
Cuckoo Cycle is not a hash.

I was using the term 'hash' to mean a proof-of-work problem with a verifier, in the sense defined here:

https://eprint.iacr.org/2015/946.pdf#page=2

I wasn't focused on defining terms (in an informal internal paper for my own analysis that I never prepared for publishing), but rather on analyzing the issues pertaining to memory hardness, such as resistance to parallelization and/or trading computation for space, as should be obvious from my analysis of Scrypt.

Quote
The Cuckoo Cycle hash [19] significantly increases the asymmetric validation ratio, but has unprovable security because its algorithm is based on permutated orderings which do not incorporate diffusion and confusion [20]

Quote
Diffusion and confusion are properties of hash functions. Cuckoo Cycle is not a hash function. You need to get your basics straight.

Careful with that condescending attitude. You are starting to act like Shen-noether, Gregory Maxwell, and other condescending jerks who have come before you.

What I am saying is that the entropy of your problem space is large but limited, which is indeed because the confusion and diffusion injected into the memory space is not entirely randomized over the entire memory space allocated to the PoW computation. Duh. Which is precisely what Andersen discovered when he broke your Cuckoo Cycle as I warned you would be the case. Quoting from the above paper:


Quote
A more promising scheme was proposed by Tromp [37] as the Cuckoo-cycle PoW. The prover must find a cycle of certain length in a directed bipartite graph with N vertices and O(N) edges. It is reasonably efficient (only 10 seconds to fill 1 GB of RAM with 4 threads) and allows very fast verification. The author claimed prohibitive time-memory tradeoffs. However, the original scheme was broken by Andersen [6]: a prover can reduce the memory by the factor of 50 with time increase by the factor of 2 only. Moreover, Andersen demonstrated a simple time-memory tradeoff, which allows for the constant time-memory product (reduced by the factor of 25 compared to the original). Thus the actual performance is closer to 3-4 minutes per GB¹. Apart from Andersen's analysis, no other tradeoffs were explored for the problem in [37], there is no evidence that the proposed cycle-finding algorithm is optimal, and its amortization properties are unknown.



Quote
Although its entropy is large, i.e. the factorial permutation of all buckets in the hash space, it doesn't require that its entire memory space be accessed, thus possibly bit flags could be employed to reduce the memory used and make it faster.

Quote
There is no "entire memory space". Cuckoo Cycle uses memory for data structures, not for filling up with random garbage like scrypt does.

You lack imagination and abstract conceptualization skills. Think about the algorithm from the analogy that applies, and discover new insights.

Think about the entropy of the memory allocated for the data structures. If the entropy is not maximal, then in theory one can find a way to shrink that memory space by trading computation for memory, and your algorithm is no longer memory hard. Likewise, if one can parallelize computation within the same memory structures, it is no longer memory hard. And if the entropy is not maximized, there is no evidence that a faster algorithm can't be found.

The "random garbage" is precisely necessary to maximize entropy and eliminate the possibility to trade memory for computation and/or parallelize the computation. Even the above paper's proposal seems to give up parallelization to the GPU (although they claim only by an order-of-magnitude and that is not including the GPU has multi-GB of memory and can run more than one instance if not memory bandwidth bounded).

Quote
I agree that Cuckoo Cycle can be parallelized. In fact the GPU solver works best with 16384 threads. But it's only marginally faster than 4096 threads, because they're saturating the random access memory bandwidth.

Not just parallelization, but also the memory-for-computation-time tradeoff, as I predicted and as is now confirmed:

"Andersen [6]: a prover can reduce the memory by the factor of 50 with time increase by the factor of 2 only."

tromp (Legendary)
Activity: 990, Merit: 1110
January 30, 2016, 07:07:21 PM
Last edit: January 30, 2016, 07:22:28 PM by tromp
#810

Quote
What I am saying is that the entropy of your problem space is large but limited, which is indeed because the confusion and diffusion injected into the memory space is not entirely randomized over the entire memory space allocated to the PoW computation. Duh. Which is precisely what Andersen discovered when he broke your Cuckoo Cycle as I warned you would be the case. Quoting from the above paper:

You're living in the past quoting 2 papers that both focus on an early 2014 version of Cuckoo Cycle.

What David Andersen did in March 2014 is show how to reduce memory consumption by an order of magnitude, which has been part of the reference miner since April 2014, well before my Cuckoo Cycle paper was published at the January 2015 BITCOIN workshop. Please focus your criticism on that version.

But you don't need to read that paper to learn of the linear time memory trade off, which is right on the project page:

"I claim that trading off memory for running time, as implemented in tomato_miner.h, incurs at least one order of magnitude extra slowdown".

Btw, there is maximum entropy in the bitmap of alive edges once 50% of them have been eliminated.
TPTB_need_war (OP, Sr. Member)
Activity: 420, Merit: 262
January 30, 2016, 07:14:57 PM
Last edit: January 30, 2016, 07:53:22 PM by TPTB_need_war
#811

Quote
What I am saying is that the entropy of your problem space is large but limited, which is indeed because the confusion and diffusion injected into the memory space is not entirely randomized over the entire memory space allocated to the PoW computation. Duh. Which is precisely what Andersen discovered when he broke your Cuckoo Cycle as I warned you would be the case. Quoting from the above paper:

Quote
You're living in the past quoting 2 papers that both focus on an early 2014 version of Cuckoo Cycle.

What David Andersen did in april 2014 is to reduce memory consumption by a factor of 32, which has become part of the reference miner in may 2014, well in the past when my Cuckoo Cycle paper was published in BITCOIN 2015.

The paper says:

Quote
¹ The project webpage [37] claims Andersen's optimizations be integrated into the miner, but the performance numbers are mainly unchanged since before the cryptanalysis appeared


What is the performance per Watt and performance per $ hardware comparing CPU and GPU now for reference miners?

It should be possible to use the superior FLOPS of the GPU to trade less memory for more computation and/or parallelization, thus giving the GPU an advantage over the CPU. Maintaining parity for the CPU was the entire point of a memory hard PoW algorithm. Also, the more parallelization, the lower the GPU's effective memory latency, because latency gets masked by computation proceeding in parallel, up to the limit of memory bandwidth (which is very high on the GPU, as you know).

Edit: I will need to study this when I am not so sleepy, to give it proper thought.

Edit#2: you added the following to your post after I replied to it:

Quote
But you don't need to read that paper to learn of the linear time memory trade off, which is right on the project page:

"I claim that trading off memory for running time, as implemented in tomato_miner.h, incurs at least one order of magnitude extra slowdown".

Btw, there is maximum entropy in the bitmap of alive edges once 50% of them have been eliminated.

But the GPU gets that computation for free because it is masked by the latency of the random accesses. That is why I asked for some performance figures above comparing CPU and GPU. I haven't looked at your project page in 2+ years.

Edit#3: so apparently about a 3X advantage in rate per watt, since the TDP of the GPU you cited is 165W (including 4 GB of RAM) and the i7's afair is about 125W:

https://github.com/tromp/cuckoo#1-memory-bank--1-virtual-core--1-vote

Was that running all threads of the i7 versus all compute units of the GPU? Did you maximize the # of instances each could do with its available compute units, i.e. normalize for if the GPU has 8 or 16GB as necessary to max out its FLOPS? I see you said you maxed out memory bandwidth, but what about trading some memory for 10X more computation until the memory bandwidth bound and computation bound (FLOPS) are matched?

tromp (Legendary)
Activity: 990, Merit: 1110
January 30, 2016, 09:36:32 PM
#812

The paper says:

Quote
¹ The project webpage [37] claims Andersen's optimizations be integrated into the miner, but the performance numbers are mainly unchanged since before the cryptanalysis appeared


All my code is right there to inspect and ready to be run if the authors don't trust my stated numbers.

Quote
What is the performance per Watt and performance per $ hardware comparing CPU and GPU now for reference miners?


In another thread I commented:
Quote
I have very limited data. A GTX980 with a TDP of around 160W was 5x faster than an i7-4790K with a TDP of around 80W. So the GPU appears to be between 2x and 3x more efficient in this case.
But since Cuckoo Cycle is memory bound, neither CPU nor GPU is going to be near their TDP.
So we really need to bench it with power measuring tools, which I'm lacking...
I don't know how to measure power usage of CPU and GPU. What I can measure is that the GPU is about 5x faster.

Quote
Was that running all threads of the i7 versus all compute units of the GPU? Did you maximize the # of instances each could do with its available compute units, i.e. normalize for if the GPU has 8 or 16GB as necessary to max out its FLOPS?

That's with maxing out either cores (CPU) or memory bandwidth (GPU).

Quote
I see you said you maxed out memory bandwidth, but what about trading some memory for 10X more computation until the memory bandwidth bound and computation bound (FLOPS) are matched?

The only known trade-off uses k times less memory but 15k times more computation *and* memory accesses.
smooth (Legendary)
Activity: 2968, Merit: 1198
January 30, 2016, 10:41:47 PM
#813

Quote
I don't know how to measure power usage of CPU and GPU. What I can measure is that the GPU is about 5x faster.

The easiest is probably to just measure at the wall using a Kill-A-Watt (around $20 for a cheap version).

TDP specs are kind of useless for anything, except as a very rough indicator.
TPTB_need_war (OP, Sr. Member)
Activity: 420, Merit: 262
January 31, 2016, 02:45:13 AM
Last edit: January 31, 2016, 04:08:48 AM by TPTB_need_war
#814

tromp, the bottom line is that when I wrote that document I was trying to find a way to make a PoW where CPUs would be as power efficient as GPUs and ASICs, so that a professional miner wouldn't be more profitable than a CPU miner. That was back when I hadn't yet contemplated the solution of making mining unprofitable. And in the portion of the paper I omitted, I concluded that I had failed (even with fiddling around with the various SRAM caches on Intel CPUs). Even if one makes a Scrypt-like random walk through memory entirely latency bound, the GPU can run multiple instances until the latency is masked by computation and the task becomes computation bound or memory bandwidth bound; and I believe in both cases the GPU will then be more efficient at the computation being performed.
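
The latency-masking arithmetic behind that claim, as a toy model with invented numbers (the real figures would have to be measured):

    # Roughly latency / compute_time instances keep the memory pipeline busy;
    # the numbers below are hypothetical, for illustration only.
    GPU_LATENCY_NS = 400.0   # assumed GPU main-memory random access latency
    H_COMPUTE_NS = 20.0      # assumed compute time of H per access

    instances_to_mask = GPU_LATENCY_NS / H_COMPUTE_NS
    print(instances_to_mask)  # ~20 instances before the latency is hidden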

What Cryptonote's Cryptonite PoW hash apparently does is make it impossible to run enough instances on a typical GPU (with its limited memory of say 6GB, unless one were to customize one) to overcome the AES-NI instructions incorporated into the memory hard algorithm, since the GPU is apparently only at par in computational efficiency on AES. Or, what I think is really going on (but I haven't confirmed it) is that Cryptonite is AES-NI bound, so the GPU remains at parity. Which is exactly the direction I investigated next in 2014, after abandoning memory hard PoW (even an NP complexity class asymmetric variant such as yours). Also CN attempts to fiddle with the sizes of the various SRAM caches, but that can be a pitfall in scenarios such as ASICs or Tilera or other hardware competitors.

So that is why I had abandoned memory hard PoW and investigated a specific instruction in the AES-NI instruction set which appears to have the highest level of optimization in terms of power efficiency, as far as I can estimate. This also meant the PoW hash could be very fast (noting Cryptonite is slow), which would help with asymmetric validation of PoW shares in the DDoS scenario (although in my latest 2015 coin design I can verify Merkle signatures orders-of-magnitude faster than any memory hard PoW hash, so the point becomes irrelevant).

I have recently entertained the thought that the only way to make them (nearly) equal with a memory hard approach would be for there to be no computation, or for the computation to be so small as to require an inordinate amount of total RAM, or for the memory bandwidth bound to limit the computation's latency masking to an insignificant portion of total power consumed. But this may not be easy to accomplish, because DRAM is so power efficient. I also noticed an error in my earlier thought process in my rough draft paper, where I hadn't contemplated another way to force a serial random walk that much more strongly resists the memory vs. computation tradeoff, and for which the computation would be very tiny relative to memory latency. And now my goal is no longer for them to be equal (besides, even if they were equal, the mining farms have up to an order-of-magnitude cheaper electricity and more efficient amortization of power supplies), but just to be within say an order-of-magnitude, because I am targeting unprofitable mining, and that ratio dictates the ratio of unprofitable miners to miners who desire to be profitable that is required for all mining to be unprofitable. This approach might be superior to the specific AES-NI instruction I had designed a PoW hash around in 2014.

But the main reason I revisited memory hard PoW is that I can't optimize an AES-NI instruction PoW hash from a browser (no native assembly code; and with WebGL on mobile phones, a GPU or ASIC is orders of magnitude more power efficient and hardware cost efficient), which impacted a marketing strategy I was investigating. However, I concluded last night that the marketing strategy I was contemplating is flawed, because there isn't enough value in the electricity (and memory cost) consumed by the PoW hash to give sufficient value to computing the hash for transferred income (even if unprofitable) on a mobile phone.

It turns out that marketing is a much more important problem to solve for crypto currency than PoW. The income transfer would make music download bandwidth profitable, but that is peanuts compared to the value generated by social media advertising and user expenditures. I am starting to bump up against some fundamental marketing barriers, e.g. microtransactions are useless in most every scenario (even music!), mobile is the future, and there isn't enough electricity to monetize anything from PoW (besides, competing on electricity consumed by computation is a losing strategy w.r.t. GPUs and ASICs). The money to be made in social media isn't from monetizing the CPU/DRAM nor the users' Likes as Synereo is attempting, but from creating value for the users (then profiting on the advertising and/or the users' expenditures on what they value). This unfortunately has nothing to do with crypto currency, although it is very interesting to me and requires a lot of fun programming & design, so I am starting to get frustrated with crypto currency as an enormous time waster for me. Three years of researching and still not finding a sure project to work on in crypto currency that doesn't just devolve into a P&D to speculators, because crypto currency has no significant adoption markets (subject to change as I do my final thinking on these fundamental questions before I quit crypto).

Bottom line is that your Cuckoo Cycle PoW can be parallelized, so the GPU can employ more computation to mask some of its slower main memory latency, up to the memory bandwidth bound, with a guesstimated power efficiency advantage of perhaps 2 - 3X, although you have only estimated that from TDP and not actually measured it. It behooves you to measure it, as it is a very important metric for considering whether to deploy your PoW. The reason it can be parallelized is that the entropy of the data structures is not a serial random walk (which is the point I was trying to make in my poorly worded text from the rough draft of an unpublished paper). Also I am wondering if you tried instead of hitting the memory bound with one instance, run multiple instances on the GPU since I assume you are not using 6GB for each instance? In this case the tradeoff between increased computation (thus masking memory latency) and memory bandwidth may be more favorable to the GPU?

Note that the other paper's proposed PoW can also be parallelized up to the memory bandwidth bound, but afaics they didn't measure relative power efficiency:


Quote
thus the total advantage of GPU over CPU is about the factor of 4, which is even smaller than bandwidth ratio (134 GB/s in GTX480 vs 17 GB/s for DDR3). This supports our assumption of very limited parallelism advantage due to restrictions of memory bandwidth


tromp, I think you will remember the discussion we had in 2014 about ASICs, where I was claiming it could be parallelized and you were pointing out the bandwidth limitation at the interconnection between IC chips, due to the fact that the memory can't fit on the same silicon die as the computational logic (or that not all the memory can fit on one chip).

So what I am really saying above is that afaics the fundamentally important invention of lasting value that I have found for crypto is unprofitable mining. I haven't decided yet whether to pursue that to implementation.

TPTB_need_war (OP, Sr. Member)
Activity: 420, Merit: 262
January 31, 2016, 03:48:22 AM
Last edit: January 31, 2016, 03:59:20 AM by TPTB_need_war
#815

Quote
Was that running all threads of the i7 versus all compute units of the GPU? Did you maximize the # of instances each could do with its available compute units, i.e. normalize for if the GPU has 8 or 16GB as necessary to max out its FLOPS?

Quote
That's with maxing out either cores (CPU) or memory bandwidth (GPU).

But not necessarily maxing out CPU computation, since the NP problem is memory latency bound, i.e. cores spend a lot of time idle. This is why TDP isn't a reliable estimate of what is going on. I assume you know the CPU has 25 times lower (faster) main memory latency than the GPU. So the CPU may be sucking a lot more electricity than the GPU, which is hitting its bound on memory bandwidth (and not latency, thus not maximally masking computation with parallelism).

I am shocked that you haven't purchased a $20 Kill-A-Watt meter given you have invested so much of your time in this. However, I believe these may not be available in some countries in Europe. In the USA, we can order one from Amazon.

Quote
I see you said you maxed out memory bandwidth, but what about trading some memory for 10X more computation until the memory bandwidth bound and computation bound (FLOPS) are matched?

Quote
The only known trade-off uses k times less memory but 15k times more computation *and* memory accesses.

Perhaps there might be a probabilistic approach. I really haven't studied the NP problem you are applying. It worries me that you assume there aren't other tradeoffs, when this is not a serial random walk. Your claims are graph-theoretic only, not information-theoretic (as would be the case for a perfectly uniformly distributed serial random walk). How extensively has this specific NP problem been studied?

tromp (Legendary)
Activity: 990, Merit: 1110
January 31, 2016, 03:58:26 AM
#816

Quote
Also I am wondering if you tried instead of hitting the memory bound with one instance, run multiple instances on the GPU since I assume you are not using 6GB for each instance? In this case the tradeoff between increased computation (thus masking memory latency) and memory bandwidth may be more favorable to the GPU?

Cuckoo Cycle, even using smaller amounts of memory, like 512MB, consumes practically all the memory bandwidth of a GPU, so the GPU having 4GB or 6GB is irrelevant. If you run two 512MB Cuckoo instances, each runs at about half the speed due to memory contention. What happens is that the memory used by Cuckoo Cycle is spread across all the memory banks (using, let's say, 1/8th of each), as this allows the maximum throughput. Each memory bank suffers a row switching delay on every memory access, forming the bottleneck. Running two instances means they have to take turns getting results from each memory bank, so there's no increase in throughput. No amount of extra computational resources changes that.

I've measured the same phenomenon on CPUs.
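
A toy model of that bottleneck, with invented numbers (the bank count and delay are assumptions, not measurements): the total random access rate is capped at banks/delay, so splitting it across instances just divides the same budget.

    BANKS = 8               # hypothetical number of memory banks
    ROW_SWITCH_NS = 50.0    # hypothetical row-switch delay per access

    peak_rate = BANKS / (ROW_SWITCH_NS * 1e-9)   # total random accesses/sec
    for instances in (1, 2, 4):
        # each instance takes turns at every bank, so per-instance rate divides
        print(instances, "instance(s):", peak_rate / instances, "accesses/s each")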

We cannot say what the performance-per-watt advantage of GPU over CPU is until we measure it. Note that I don't even have physical access to the machine that I bench, so I'm hoping other people can do the measurement.

TPTB_need_war (OP, Sr. Member)
Activity: 420, Merit: 262
January 31, 2016, 04:01:26 AM
Last edit: January 31, 2016, 04:16:41 AM by TPTB_need_war
#817

Quote
Also I am wondering if you tried instead of hitting the memory bound with one instance, run multiple instances on the GPU since I assume you are not using 6GB for each instance? In this case the tradeoff between increased computation (thus masking memory latency) and memory bandwidth may be more favorable to the GPU?

Quote
Cuckoo Cycle, even using smaller amounts of memory, like 512MB, consumes practically all the memory bandwidth of a GPU, so the GPU having 4GB or 6GB is irrelevant. If you run two 512MB Cuckoo instances, each runs at about half the speed due to memory contention. What happens is that the memory used by Cuckoo Cycle is spread across all the memory banks (using, let's say, 1/8th of each), as this allows the maximum throughput. Each memory bank suffers a row switching delay on every memory access, forming the bottleneck. Running two instances means they have to take turns getting results from each memory bank, so there's no increase in throughput. No amount of extra computational resources changes that.

I've measured the same phenomenon on CPUs.

I know that, of course. What I was getting at is whether parallelization within one instance was more costly in terms of memory accesses than running a separate instance. It has been 2 years since I studied that Cuckoo Cycle graph NP problem briefly (and I have no available time or brain bandwidth to delve into it today), so I didn't know the answer. See also my other reply above (which I even edited while you were replying).

Note that the 5.5 minutes per GB proving time for your Cuckoo Cycle PoW (and slower on mobile devices, I presume) is entirely unacceptable for my design for decentralizing Satoshi's PoW by having each payer include a PoW share. Even one minute is too slow, and then you aren't monetizing more than a fraction of the CPU's main memory, which is what the performance per $ of hardware economics comes down to.

Designing a holistic PoW system for crypto currency has many variables, including marketing strategy.

tromp (Legendary)
Activity: 990, Merit: 1110
January 31, 2016, 04:31:32 AM
#818

Quote
Note that the 5.5 minutes per GB proving time for your Cuckoo Cycle PoW (and slower on mobile devices, I presume) is entirely unacceptable for my design for decentralizing Satoshi's PoW by having each payer include a PoW share. Even one minute is too slow, and then you aren't monetizing more than a fraction of the CPU's main memory, which is what the performance per $ of hardware economics comes down to.

Less time translates to less memory. A graph size of 2^26 takes about 8MB of memory and 5 sec runtime single-threaded. But you might need several runs to find a cycle of the required length. You'd probably want to accept a whole range of lengths, like 16-64, to reduce the variance.
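
One plausible reading of the 8MB figure, assuming it is the one-bit-per-edge "bitmap of alive edges" tromp mentions upthread (that mapping is an assumption, not something stated here):

    # 2^26 edges at one bit each
    edges = 2 ** 26
    bitmap_bytes = edges // 8
    print(bitmap_bytes / 2 ** 20, "MiB")   # 8.0 MiB
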
TPTB_need_war (OP, Sr. Member)
Activity: 420, Merit: 262
January 31, 2016, 04:35:45 AM
Last edit: January 31, 2016, 04:55:43 AM by TPTB_need_war
#819

Quote
Note that the 5.5 minutes per GB proving time for your Cuckoo Cycle PoW (and slower on mobile devices, I presume) is entirely unacceptable for my design for decentralizing Satoshi's PoW by having each payer include a PoW share. Even one minute is too slow, and then you aren't monetizing more than a fraction of the CPU's main memory, which is what the performance per $ of hardware economics comes down to.

Quote
Less time translates to less memory. A graph size of 2^26 takes about 8MB of memory and 5 sec runtime single-threaded. But you might need several runs to find a cycle of the required length. You'd probably want to accept a whole range of lengths, like 16-64, to reduce the variance.

Yeah, but my point is that then we haven't monetized the DRAM of the computer relative to the professional mining farm. The user paid the entire purchase price of the computer, not just for 8MB of his 2+GB of DRAM. As I wrote upthread, this impacts the ratio of unprofitable miners to those who desire to be profitable that is required to make all mining unprofitable in my design (it also has impacts in a profitable PoW mining design such as Satoshi's). The economics of mining is electricity and hardware amortization; in a memory hard scenario, afaics the DRAM amortization becomes a more significant factor in that equation.

I figure I can probably do 0.5 GB in roughly 5 seconds or less (perhaps under a second; I need to restudy my late 2013/early 2014 Shazam 512B variant of Chacha work in a new context). I lose your asymmetric verification, but gain much better proving times relative to the memory amortized (and this wouldn't be vulnerable to probabilistic graph-theoretic breakage of your NP problem). Nevertheless, afaics (thus far) it doesn't help me with the music marketing strategy.  Undecided  It may help with my decentralized PoW design; I need to study that aspect. I am more focused on marketing strategy today.

TPTB_need_war (OP, Sr. Member)
Activity: 420, Merit: 262
January 31, 2016, 04:41:06 AM
Last edit: January 31, 2016, 06:27:44 AM by TPTB_need_war
#820

Quote
Proving data is stored on a decentralised node is something of an ongoing project.

Others are looking at the issue:

https://bitslog.wordpress.com/2015/09/16/proof-of-unique-blockchain-storage-revised/

So far I think PoW for nodes to validate blocks or data they contain is an interesting approach.

Afaics, all of these proof-of-storage/retrievability designs (MaidSafe, Storj, Sia, etc.) are fundamentally flawed and can't ever be fixed.

They try to use network latency to prevent centralized outsourcing, but ubiquitously consistent network latency is not a reliable commodity. Duh. Sorry!

Sergio is also the one who started the idea of a DAG, which I have explained is fundamentally flawed (and afaics his DagCoin white paper even had an egregious, fundamental error). Find that info in the thread linked above.
