Bitcoin Forum

Bitcoin => Development & Technical Discussion => Topic started by: gmaxwell on October 13, 2013, 11:28:40 PM



Title: Proof of Storage to make distributed resource consumption costly.
Post by: gmaxwell on October 13, 2013, 11:28:40 PM
Motivation:
Consider a mutually distrusting decentralized network where each node has some finite capacity, like Bitcoin's P2P network, with 100,000 nodes.

An attacker wants to soak up all the connection capacity of all the nodes in order to funnel new peers to malicious nodes the attacker controls.  Each node can perform actions to make resource usage costly. For example, each node could reject additional connections from single IPs or subnets, requiring the attacker to have access to more address space (a scarce resource) in order to use up that node's capacity. Or they could prioritize peers that present a costly Bitcoin bond (e.g. a Fidelity bond/SIN).

The problem there is that the attacker's effectiveness scales with the number of victims— e.g. each IP or bond or whatever lets them use 100k sockets— since the nodes are mutually distrusting and thus can't collaborate to reject an attacker who makes one connection per IP to every single node in the network.

The network could respond by requiring a scarce resource to connect— e.g. a POW or a Bitcoin fee. But an on-connect fee doesn't really map well to an ongoing resource utilization.  You could require a periodic fee (e.g. intermittent POW), but setting the rates so that it doesn't favor attackers is hard (e.g. an attacker might not mind spending half a bitcoin to attack the whole network, but regular users might not tolerate a 0.0001-per-connect cost).

Also, most POW schemes are subject to huge speedups by more efficient implementations. E.g. an FPGA or even GPU implementation will be enormously faster than a Java one used in most clients.  A memory hard function may narrow the gap, but common memory hard functions are just as memory hard to validate as they are to produce (though it's possible to do otherwise), and increasing the server's resource utilization is not a great way to address a resource exhaustion attack.

What might be useful is a POW function which maps well to the kind of resource usage that holding a connection represents and is also less vulnerable to speedup for specialized attackers.

One idea that has been suggested for this would be to have a storage hard function: Some function that forces you to store a bunch of data and continue to prove you're still storing it while the connection is up.  Making one that is storage hard for the client but not the server seemed pretty tricky and eluded a number of people for some time. I had a prior idea, which required fully homomorphic encryption to create a sequential function with a trap-door property to allow random access... which might have worked but it was so complicated that no one would ever bother with it. Fortunately I've since had a better idea:

A proof of storage system:

Take some fast cryptographically strong pseudorandom function, e.g. like Blake2 or another hash function H().

The server gives the client a random seed and a target size. The client repeatedly applies H() in a tree, e.g.  {Left, Right} = H(seed) and so on, to build up the target size worth of data.  The client takes the final leaves of this tree-structured pseudo-random function and appends their index and sorts the data (or, equivalently, stores it in a binary search tree).

The server then can periodically pick a random index and perform log2(size) hash operations to determine the value at that index and then can challenge the client to provide the corresponding index for that value.

The tree structuring of the function means that you can cheaply lookup the value for a given index,  but looking up the index for a given value requires computing the whole thing (and storing it, if you want to efficiently perform multiple lookups).

(There are some possible optimizations in the client, like common prefix compression, but they're only small optimizations because the function is strongly pseudorandom)
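
For concreteness, here is a rough sketch of the scheme in Python (BLAKE2b standing in for H(), a tiny tree, and names and sizes that are just illustrative):

Code:
import hashlib, os, bisect

def H(data):
    return hashlib.blake2b(data, digest_size=32).digest()

def expand(node, depth):
    # Tree-structured PRF: each node's hash is split into a left and a right child.
    if depth == 0:
        yield node
        return
    h = H(node)
    yield from expand(h[:16], depth - 1)   # left half seeds the left subtree
    yield from expand(h[16:], depth - 1)   # right half seeds the right subtree

def client_build_table(seed, depth):
    # Client: materialize every leaf once, then sort so value->index lookups are cheap.
    leaves = list(expand(seed, depth))
    return sorted((leaf, i) for i, leaf in enumerate(leaves))

def server_leaf_at(seed, depth, index):
    # Server: walk one root-to-leaf path, i.e. log2(size) hashes, to get the value at an index.
    node = seed
    for level in reversed(range(depth)):
        h = H(node)
        node = h[:16] if (index >> level) & 1 == 0 else h[16:]
    return node

seed, depth = os.urandom(16), 10                 # toy size: 2**10 leaves
table = client_build_table(seed, depth)          # the client's stored, sorted table
target = server_leaf_at(seed, depth, 421)        # server picks an index, computes its value
pos = bisect.bisect_left(table, (target,))       # client answers with the matching index
assert table[pos][1] == 421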

A gigabyte table wouldn't be too insanely burdensome for many kinds of clients, but would require a terabyte for an attacker to use resources on 1000 nodes. Such a scheme could be offered optionally to clients (e.g. if you perform a storage proof you get a preferred connection slot) and so it might provide some value in making some attacks ineffective.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: justusranvier on October 13, 2013, 11:38:32 PM
So each node requires a gigabyte for their own storage table, plus one gigabyte for each peer they connect to?

Would a node only require a storage proof for new nodes and drop the requirement after they have established a certain amount of good behaviour?


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: gmaxwell on October 14, 2013, 12:07:04 AM
So each node requires a gigabyte for their own storage table, plus one gigabyte for each peer they connect to?
One gigabyte (assuming that security level, it was a random number) per peer that they are using a storage proof to get priority access to. So they don't necessarily need to have one per outbound peer.  The idea would be that your peers would prioritize inbound connections that were providing a storage proof; everyone else's connectivity would depend on other criteria.

There is no "their own" storage requirement the whole basis of this idea is to be memorless on the server.

Nodes could pick their favorite couple of peers to use storage proofs on, and if they get bumped off other peers during a DoS attack, it's not the end of the world.

Quote
Would a node only require a storage proof for new nodes and drop the requirement after they have established a certain amount of good behaviour?
The problem there is that an attacker could just slowly build up proof-free connections to the whole network and still saturate things.

Long term I expect our behavior with inbound connections to look something like:  When a new connection comes in and we're full, randomly drop a peer (including, possibly, the new one) based on a priority scheme.  The priority scheme could reserve a number of slots for each of several different kinds of priority.  For example, some slots should be reserved for the longest-connected nodes— since having a long uptime connection is a kind of scarce resource and some attackers come in bursts.  Other slots should be reserved for peers which have the lowest minimum round trip time (because being close to you on the network can't be faked), come from the same subnet as you, come from many distinct subnets, are often the first to relay you new blocks, have an IP that when hashed with a secret is a low value, etc.  You can list out a bunch of different groups to protect... then randomly (weighted by goodness or whatever) kick nodes that don't make the cut.
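
For illustration, a rough sketch of the shape such an eviction policy might take (the criteria, slot counts, and fields here are placeholders, not a worked-out policy):

Code:
import random
from dataclasses import dataclass

@dataclass
class Peer:
    uptime: float         # seconds connected
    min_rtt: float        # lowest round-trip time observed
    netgroup_hash: bytes  # peer's subnet hashed with a local secret
    goodness: float       # e.g. how often it was first to relay a new block

def select_peer_to_evict(peers):
    protected = set()
    # Reserve a few slots for each protected class.
    for key, slots in ((lambda p: -p.uptime, 4),        # longest-connected
                       (lambda p: p.min_rtt, 4),        # closest on the network
                       (lambda p: p.netgroup_hash, 4)): # lowest keyed subnet hash
        for p in sorted(peers, key=key)[:slots]:
            protected.add(id(p))
    candidates = [p for p in peers if id(p) not in protected]
    if not candidates:
        return None   # everyone made a cut; keep them all
    # Randomly kick among the rest, weighted toward the least useful peers.
    weights = [1.0 / (1.0 + p.goodness) for p in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]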

One problem with that kind of litany of criteria is that you'll still end up with clients that just don't meet any of the protected classes— they won't appear to be highly useful good nodes, because they're not highly useful (esp SPV clients). They may not be near any of their peers on the network, they may be freshly started and so have low uptime.  So it would be good to have a way for nodes to opt into priority without creating an easy avenue for attackers.... thus ideas like a proof of storage.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Sergio_Demian_Lerner on October 15, 2013, 01:36:36 PM
Gmaxwell: Your idea is excellent, but needs some changes to prevent an obvious query forwarding attack.

Suppose client C wants to prove to the server S that he keeps the data. When client C connects to server S, it also connects to some client D, acting as a server. Then server S sends the seed to client C, and client C forwards the same seed to client D. Also each query is forwarded.
S cannot detect that C is cheating, because he is in fact checking D's storage. Therefore the seed must be chosen by a fair coin-flip or similar protocol.

New proposed Setup Protocol

C chooses a seed c and sends it to S.
S chooses a seed s and sends it to C.
Both agree that seed = H (c | s)
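
For illustration, a minimal sketch of that setup exchange (SHA-256 standing in for H; the variable names are just illustrative):

Code:
import hashlib, os

c = os.urandom(32)                        # client's contribution, sent to S
s = os.urandom(32)                        # server's contribution, sent to C
seed = hashlib.sha256(c + s).digest()     # both sides compute seed = H(c | s)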

Also it's important to note that C may not store all the tree, but only a precomputed top. To effectively detect if C is in fact storing almost all the tree or not, S would need to make many time measurements of C responses, averaging the elapsed time.



Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: d'aniel on October 15, 2013, 10:15:21 PM
Also it's important to note that C may not store all the tree, but only a precomputed top. To effectively detect if C is in fact storing almost all the tree or not, S would need to make many time measurements of C responses, averaging the elapsed time.
IIUC, for an n level tree, you'd save yourself from storing

(1 - 2^-n) / (2 - 2^-n)   ~  1/2

of the nodes in the tree.  So could you just assume everybody is cheating and require an extra level?


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: gmaxwell on October 15, 2013, 10:28:24 PM
Also it's important to note that C may not store all the tree, but only a precomputed top. To effectively detect if C is in fact storing almost all the tree or not, S would need to make many time measurements of C responses, averaging the elapsed time.
IIUC, for an n level tree, you'd save yourself from storing

(1 - 2^-n) / (2 - 2^-n)   ~  1/2

of the nodes in the tree.  So could you just assume everybody is cheating and require an extra level?

Yea this was what I was referring to with "There are some possible optimizations in the client, like common prefix compression, but they're only small optimizations because the function is strongly pseudorandom".

The timing comment leaves me feeling a bit confused.  I don't hope to prevent outsourcing: if someone is storing the data then that creates the desired scarcity.  If the client instead wants to store no data but perform a billion computations (or what have you) per query, then I think that's okay:  The queries are super cheap for the server and honest clients so they can be performed frequently. I think the parameters can be chosen so that the storage-less solution of recomputing most of the answer space for every query is pretty costly (which would also achieve the goal of limiting abuse).


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Peter Todd on October 15, 2013, 11:03:39 PM
The timing comment leaves me feeling a bit confused.  I don't hope to prevent outsourcing: if someone is storing the data then that creates the desired scarcity.  If the client instead wants to store no data but perform a billion computations (or what have you) per query, then I think that's okay:  The queries are super cheap for the server and honest clients so they can be performed frequently. I think the parameters can be chosen so that the storage-less solution of recomputing most of the answer space for every query is pretty costly (which would also achieve the goal of limiting abuse).

I think Sergio means the client could outsource the computation to another client.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: gmaxwell on October 15, 2013, 11:15:13 PM
I think Sergio means the client could outsource the computation to another client.
Okay, I thought perhaps. But Sergio's protocol improvement prevents unwitting outsourcing.  Voluntarily outsourcing is okay I think, as a scarce resource is still being consumed by someone involved in the activity.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Gavin Andresen on October 16, 2013, 03:32:51 AM
Great idea!

If implemented, it would probably make sense to create a little protocol for the server and client to negotiate the amount of storage needed to "rent" a connection slot. Maybe the server reports something like "I've got 0 slots free, min/median/max proof-of-storage for my other connected nodes is 0/1MB/11MB."


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Sergio_Demian_Lerner on October 16, 2013, 12:49:16 PM
side-topic: Also a node could prioritize nodes which store the full block-chain. This can be done with a variety of protocols known as Proof of Storage and Proof of Retrievability.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: jgarzik on October 17, 2013, 08:17:32 AM
Related:

Similar to each bitcoin P2P peer's nonce value, which serves as a node's unique id, I've occasionally desired for each side of the P2P protocol to offer a random seed in "version" and "verack."

(a DH exchange is probably too much to wish for, in the context of bitcoin P2P, heh heh)



Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: AnonyMint on October 17, 2013, 08:27:58 AM
Note I invented Proof-of-Harddrive in my Bitcoin : The Digital Kill Switch thread several months prior. Thanks for finding a useful application of it, even if you may prefer DRAM. I had to abandon it as a substitute for PoW because it doesn't provide the entropy for selecting which peer is next.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Sergio_Demian_Lerner on October 17, 2013, 12:53:53 PM
Note I invented Proof-of-Harddrive in my Bitcoin : The Digital Kill Switch thread several months prior. Thanks for finding a useful application of it, even if you may prefer DRAM. I had to abandon it as a substitute for PoW because it doesn't provide the entropy for selecting which peer is next.

Could you provide a link to the exact post where you elaborate the idea of Proof-of-Harddrive ?


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: ShadowOfHarbringer on October 19, 2013, 11:23:30 PM
A proof of storage system:

Take some fast cryptographically strong pseudorandom function, e.g. like Blake2 or another hash function H().

The server gives the client a random seed and a target size. The client repeatedly applies H() in a tree, e.g.  {Left, Right} = H(seed) and so on, to build up the target size worth of data.  The client takes the final leaves of the tree and appends their index and sorts the data (or, equivalently, stores it in a binary search tree).

The server then can periodically pick a random index and perform log2(size) hash operations to determine the value at that index and then can challenge the client to provide the corresponding index for that value.

The tree structuring of the function means that you can cheaply lookup the value for a given index,  but looking up the index for a given value requires computing the whole thing (and storing it, if you want to efficiently perform multiple lookups).

I have just found time to read this post thoughtfully, and...
Mother of Hash... Another genius invention.

I mean how brilliant can the Bitcoin community be ? I am now convinced that most (if not all) of current humanity's problems can be solved by genius mathematical algorithms.

PS.
I wonder if this couldn't somehow be used in file sharing by the way ?


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: d'aniel on October 20, 2013, 01:28:29 AM
I have just found time to read this post thoughtfully, and...
Mother of Hash... Another genius invention.

I mean how brilliant can the Bitcoin community be ?
Hey, give credit where it's due :)  This guy in particular oozes genius ideas.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Sukrim on October 20, 2013, 02:27:34 AM
PS.
I wonder if this couldn't somehow be used in file sharing by the way ?
Likely not, as you are probably more interested in having counterparties that also share files with you than ones that provably waste hard drive space.

I initially thought "Proof of Storage" would mean that A has a large file that B previously had, then A periodically proves to B that she still has the full file available.

The current idea is rather along the lines of "Proof of wasted HDD-space" (until CPUs catch up and it becomes faster to just generate the tables on the fly than to look them up from disk) and is intended to be used for deciding which connections to drop based on how "invested" other nodes are in you. This likely needs to be a longer-term relationship than the usual file sharing time frame to pay off.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: gmaxwell on October 20, 2013, 02:34:48 AM
I initially thought "Proof of Storage" would mean that A has a large file that B previously had, then A periodically proves to B that she still has the full file available.
It's utterly trivial to do that. Just ask that they tell you parts of the file from time to time. You can even do this without remembering the file yourself if you just compute a hashtree over it, and remember the root, and they give you the hash fragments.
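
For illustration, a minimal sketch of that root-only variant (SHA-256 and a power-of-two number of fixed-size chunks, just to keep the sketch short):

Code:
import hashlib, os, random

H = lambda b: hashlib.sha256(b).digest()

def merkle_root(chunks):
    layer = [H(c) for c in chunks]
    while len(layer) > 1:
        layer = [H(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

def merkle_path(chunks, idx):
    # Sibling hashes from the challenged chunk up to the root.
    layer, path = [H(c) for c in chunks], []
    while len(layer) > 1:
        path.append(layer[idx ^ 1])
        layer = [H(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
        idx //= 2
    return path

def verify(root, chunk, idx, path):
    h = H(chunk)
    for sib in path:
        h = H(h + sib) if idx % 2 == 0 else H(sib + h)
        idx //= 2
    return h == root

chunks = [os.urandom(1024) for _ in range(8)]   # the file, held only by the prover
root = merkle_root(chunks)                      # all the challenger has to remember
i = random.randrange(len(chunks))               # periodic challenge: "show me chunk i"
assert verify(root, chunks[i], i, merkle_path(chunks, i))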

But that wouldn't be very useful as a decentralized anti-dos mechanism: one copy of the file would allow you to prove to as many concurrent peers as you want.  If, instead, the peers ask you to store some unique file for them to address that issue— they have to undertake the cost of actually sending you that file which is contrary to the goal of using this as anti-dos (and also you have the risk that Sergio addressed of someone doing that in order to invisibly delegate their proof work to someone else).

What this adds to that idea is the ability to send a few bytes that the peer can use to make a file of arbitrary size, and then they can convince you that they have actually done so... without you having to do non-trivial communication, computation, or storage. The price is, indeed, that the information stored isn't terribly useful. (You could perhaps replace the hash with some other kind of abstractly useful function "folding proteins" or something— but in general double-use for proof-of-whatever subsidizes the cost of attacking, since now your attack can be paid for by the secondary benefit of the barrier function, and other useful functions have some risk of surprising optimizations that strong cryptographic hashes lack).


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: super3 on October 20, 2013, 02:47:50 AM
A proof of storage system:

Take some fast cryptographically strong pseudorandom function, e.g. like Blake2 or another hash function H().

The server gives the client a random seed and a target size. The client repeatedly applies H() in a tree, e.g.  {Left, Right} = H(seed) and so on, to build up the target size worth of data.  The client takes the final leaves of the tree and appends their index and sorts the data (or, equivalently, stores it in a binary search tree).

The server then can periodically pick a random index and perform log2(size) hash operations to determine the value at that index and then can challenge the client to provide the corresponding index for that value.

The tree structuring of the function means that you can cheaply lookup the value for a given index,  but looking up the index for a given value requires computing the whole thing (and storing it, if you want to efficiently perform multiple lookups).

I have just found time to read this post thoughtfully, and...
Mother of Hash... Another genius invention.

I mean how brilliant can the Bitcoin community be ? I am now convinced that most (if not all) of current humanity's problems can be solved by genius mathematical algorithms.

PS.
I wonder if this couldn't somehow be used in file sharing by the way ?
This could provide a basis for some sort of distributed file sharing, but you still need much more to make it work. 


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Sukrim on October 20, 2013, 12:37:23 PM
This could provide a basis for some sort of distributed file sharing, but you still need much more to make it work. 
How?

This is only useful to rank/score peers; the data generated is quite random... you might be able (by bruteforcing it) to generate some meaningful data from time to time (then this would be kind of a compression mechanism) but the content of the data definitely is NOT the focus here, rather the guaranteed waste of resources, thus being "invested" into the relationship with you. This could help with bootstrapping issues, as a completely new peer has no data to offer and will have to leech initially. With Proof of Storage as described here, he can make sure that he will stay connected longer, as it otherwise would be quite a waste of resources for him to leech + leave.

Doing a hash tree over a file would then mean that the construction of my tree needs to be randomized too; otherwise the counterparty would compute the same tree and throw away the file. I thought that you want a mechanism to make sure that a random client actually DOES have the full block chain stored... This proposal is something different and more along the lines of the NashX proposal, where both parties show something of high value before proceeding further. Before it was resources like data on one end, nothing for a new peer on the other end. Now you can at least "invest" resource exhaustion with this approach on the other end of the deal which shows that you are committed to this relationship.

I'm not too sure if that can beat Tit-for-Tat though, as after some time you probably want to transfer from Proof of Storage (exhaustion) to something like seed ratios or similar.

Also I'm not too sure that you can infer from the willingness to waste resources the willingness to play fair.
Imagine this scheme in BitTorrent: Seeds accept new leechers only if they either have at least 10% of a file available or agree to waste 1 GB (random numbers). Now I as a new node can connect to a handful of seeds and load my first 10%. This does not mean that I'm willing at all to contribute anything back to the swarm in the sense of upload. With tit-for-tat I'd get the whole file eventually even without uploading anything but I'll be severely choked by any peer out there quite fast. With resource exhaustion, I just need to lie that I am a new fresh node and will get anything. My cost is the disk space + energy used for the calculations, BUT there is no gain on the seeder's end and also not for the network! The only way to deter me from doing these things would be to make it more expensive to run this attack than to participate. Given current electricity prices and HDD prices, there will be a lot of controversy about the "correct" size + computational requirements (you could build a tree where you have a much more "difficult" hash).


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: super3 on October 20, 2013, 04:53:22 PM
This could provide a basis for some sort of distributed file sharing, but you still need much more to make it work. 
How?

This is only useful to rank/score peers; the data generated is quite random... you might be able (by bruteforcing it) to generate some meaningful data from time to time (then this would be kind of a compression mechanism) but the content of the data definitely is NOT the focus here, rather the guaranteed waste of resources, thus being "invested" into the relationship with you. This could help with bootstrapping issues, as a completely new peer has no data to offer and will have to leech initially. With Proof of Storage as described here, he can make sure that he will stay connected longer, as it otherwise would be quite a waste of resources for him to leech + leave.

Doing a hash tree over a file would then mean that the construction of my tree needs to be randomized too; otherwise the counterparty would compute the same tree and throw away the file. I thought that you want a mechanism to make sure that a random client actually DOES have the full block chain stored... This proposal is something different and more along the lines of the NashX proposal, where both parties show something of high value before proceeding further. Before it was resources like data on one end, nothing for a new peer on the other end. Now you can at least "invest" resource exhaustion with this approach on the other end of the deal which shows that you are committed to this relationship.

I'm not too sure if that can beat Tit-for-Tat though, as after some time you probably want to transfer from Proof of Storage (exhaustion) to something like seed ratios or similar.

Also I'm not too sure that you can infer from the willingness to waste resources the willingness to play fair.
Imagine this scheme in BitTorrent: Seeds accept new leechers only if they either have at least 10% of a file available or agree to waste 1 GB (random numbers). Now I as a new node can connect to a handful of seeds and load my first 10%. This does not mean that I'm willing at all to contribute anything back to the swarm in the sense of upload. With tit-for-tat I'd get the whole file eventually even without uploading anything but I'll be severely choked by any peer out there quite fast. With resource exhaustion, I just need to lie that I am a new fresh node and will get anything. My cost is the disk space + energy used for the calculations, BUT there is no gain on the seeder's end and also not for the network! The only way to deter me from doing these things would be to make it more expensive to run this attack than to participate. Given current electricity prices and HDD prices, there will be a lot of controversy about the "correct" size + computational requirements (you could build a tree where you have a much more "difficult" hash).
Could the seed of the tree perhaps be a file or the hash of the file that you want to verify is stored?

Indeed "correct" size will be very difficult to find. My desktop computer has 3x the hard drive space of my laptop. A large size might be negligible for my desktop, but costly for my laptop. Not to mention those portables with small SSD drives. Unlike GPUs and ASICs, cloud storage could be rented on demand for very cheap. As long as its not a sustained attack you could easily get significant space for pennies.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Cryddit on October 31, 2013, 03:47:46 AM
So let me see if I get this right.  We suppose that Bob wants to connect to Alice's server.


Alice picks a nonce N0, adds Bob and Alice's Server IDs to it,
   hashes it, and communicates the hash H0 to Bob.
Bob picks a nonce N1, adds Alice and Bob's Server IDs to it,
   hashes it, and communicates the hash  H1 to Alice.

Alice reveals her nonce, so Bob can verify that the constructed
    value is the preimage to the hash H0.
Bob reveals his nonce, so Alice can verify that the constructed
    value is the preimage to the hash H1.

Now both of them can calculate N = N0 ^ N1.

Alice picks an address A (say) 32 bits long, then performs the following:

    initialize V (a hash-wide value) with N
    for each bit B in A
        V = hash (V,B);

and asks Bob, "What is the address of V_final?"

And now Bob must exhaustively compute (or have stored) the
hash tree in order to find the address.  


My immediate thought is that the burden on Bob is asymmetric, but not by as much as you think.  

If I am Bob, I will set up a Bloom Filter Tree.  Bob needs to exhaustively compute the hash tree once in order to initialize the filters, but can then use a set of filters much smaller than the whole tree to efficiently narrow the address search.  He can spend less than one percent of the memory you thought you were requiring and still find the address with less than ten thousand hashes.



Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: gmaxwell on October 31, 2013, 04:03:40 AM
No, go review it again. That's not what's being described here. The PRNG has to have a special tree structure in order to make Alice's cost in producing an exact match query logarithmic in the tree size. You can trade off additional computation for space, but because the function is random this isn't terribly effective.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Crowex on October 31, 2013, 12:20:17 PM
is it possible to use this in a decentralised system where the nodes exchange random numbers with each other and store lots of smaller hash trees of the numbers, each hash tree corresponding to an individual node they want to communicate with?
or are there ways of cheating this?


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: hannesnaude on October 31, 2013, 02:01:13 PM
No, go review it again. That's not what's being described here. The PRNG has to have a special tree structure in order to make Alice's cost in producing an exact match query logarithmic in the tree size. You can trade off additional computation for space, but because the function is random this isn't terribly effective.

Apparently I don't understand it either  :-[.

Despite my ignorance, I want to make the following suggestion. Perhaps, if you could point out the weakness in this scheme it would help me see what I missed.

I think Alice's cost in producing the query can be made independent of the tree size. Simply require Bob to fill a table with the values of
H(seed + index) where index ranges from 0 to N and H is a hash function (preferably a memory hard one). He can then sort these results as before.
 
Alice can easily choose a random index i and compute H(seed+i) for a constant cost and then challenge Bob to produce i given H(seed+i). If Bob stored the sorted table a binary search allows him to find the correct entry in O(log N) time. If not he needs to compute at O(N) cost.
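
For illustration, a rough sketch of this flat-table variant (SHA-256 standing in for H, toy sizes; a real H would preferably be memory hard, as noted above):

Code:
import hashlib, os, random, bisect

def H(seed, i):
    return hashlib.sha256(seed + i.to_bytes(8, 'big')).digest()

N = 1 << 16                                     # toy table size
seed = os.urandom(32)

# Bob: O(N) hashes once, then O(log N) per challenge against the sorted table.
table = sorted((H(seed, i), i) for i in range(N))

# Alice: O(1) per challenge, independent of N.
i = random.randrange(N)
challenge = H(seed, i)

# Bob answers with the index; without the table he'd be back to O(N) work.
pos = bisect.bisect_left(table, (challenge,))
assert table[pos][1] == i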

As far as I can see, the Bloom Filter Tree optimization that cryddit described will still work against this, but perhaps the fact that Alice's cost is now O(1) means that we can simply assume that legitimate clients all use this optimization and demand 100GB of nominal storage knowing that only ~1GB of actual storage (or whatever the true ratio is) needs to be consumed in order to be able to respond to her queries in a timely manner.

Secondly, I'm not convinced that the query forwarding attack that Sergio describes is actually an attack. If it costs me 1 GB of storage to be able to connect to X and Y wants to connect to me at a cost of 1 GB, why shouldn't I be able to offload the cost? If we don't setup in a secure manner as outlined by Sergio, a timely response to a Proof of storage query essentially proves either that I have stored the data OR that I am maintaining an open connection to someone else that has. Both of these are costly. So, if I want to open 5000 connections using this attack, I first need to convince 5000 clients to connect to me and I have to maintain those connections.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Crowex on October 31, 2013, 02:52:54 PM
is it possible to use this in a decentralised system where the nodes exchange random numbers with each other and store lots of smaller hash trees of the numbers, each hash tree corresponding to an individual node they want to communicate with?
or are there ways of cheating this?


Oh, sorry, I think this is what you are describing?


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: socrates1024 on October 31, 2013, 07:29:37 PM
I implemented this in a few lines of python, mostly to convince myself it works https://gist.github.com/amiller/7131876
The test runs in codepad: http://codepad.org/Nzp2Vdsk


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: SlipperySlope on April 23, 2014, 06:21:09 PM
Could someone contrast Proof of Storage with Proof of Stake, in particular with regard to physical resource consumption?

For Bitcoin, I am becoming interested in coding, testing, and documenting various alternatives to Proof of Work, which I believe is unsustainable, e.g. ultimately in conflict with the Energy Charter Treaty's Article 19. http://en.wikipedia.org/wiki/Energy_Charter_Treaty#Energy_efficiency


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: TierNolan on April 24, 2014, 11:13:23 AM
Could someone contrast Proof of Storage with Proof of Stake, in particular with regard to physical resource consumption?

The proposal here is that proof of storage allows peers to prove that they have expended some resources on the current connection.  It isn't like POW for determining which fork has the most support.

Each full node only accepts a maximum number of incoming connections.  If each node limited the number of incoming connections to 100, then a node with 100 unique IP addresses could consume all the spare connections for every node.  Each node would see 100 connection attempts from 100 different IPs and would accept them all.

Proof of storage means that connecting to a node requires that you allocate drive space to every connection.  The random data you have to store is different for each connection, so you can't reuse it.

It doesn't even need to be that much.  If an attacker wants to connect to 10,000 open nodes and make 100 connections to each, then it needs 1 million times the per-connection storage requirement.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: TierNolan on April 24, 2014, 11:25:49 AM
I am not clear why a tree is required.  "index maps to hash(seed | index)" seems to work too?  That is fast and just as irreversible and it allows the server to compute entries in linear time, rather than log time.

The client then stores a value to index map.  Each sub-file would contain values that have the same prefix.

It wouldn't even necessarily have to do sorting, since the files could just act as bins.

The process could be

<client connects>
<client sends seed (S2)>
<server sends seed (S1) and MAX_INDEX>
<server randomly selects an index (0 <= index < MAX_INDEX) and sends value (V) of Hash(S1 | S2 | index)>

<server disconnects client>
// If the server doesn't disconnect, then it defeats the purpose, since an attacker can stall at this point
// The server should also disconnect the client if it takes more than a few seconds to send S2

<client computes Hash(S1 | S2 | index) for all possible indexes>
<client sorts the results and puts them into sub-files>

<client reconnects>
<client sends S1 and S2>
<server retrieves S1/S2 session to give the index and expected IP address>
<client sends the index that matches the value>

Periodically

<server sends value>
<client replies with matching index>

The key is to make it so that it is faster to read the hard drive than to re-compute the table for each request.

This means that the hash function should take a while, but that converts it into a proof of work system.

The timeout between requesting a lookup and the reply would have to be tweaked.  

The more often the lookup is requested by the server, the higher the incentive to not just recompute.

The server might give a "well behaved" client a few chances before disconnecting.
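
For concreteness, a rough sketch of the table construction and lookup described above (SHA-256 standing in for the hash, toy sizes, and in-memory lists standing in for the sub-files):

Code:
import hashlib, os, random
from collections import defaultdict

def H(s1, s2, i):
    return hashlib.sha256(s1 + s2 + i.to_bytes(8, 'big')).digest()

MAX_INDEX = 1 << 16                       # toy size
s1, s2 = os.urandom(16), os.urandom(16)   # server's and client's seeds

# Client: one pass to fill the bins; each bin would be its own sub-file on disk,
# keyed by the 16-bit prefix of the value.
bins = defaultdict(list)
for i in range(MAX_INDEX):
    v = H(s1, s2, i)
    bins[v[:2]].append((v, i))

# Periodic challenge: the server computes one hash, the client reads one bin.
i = random.randrange(MAX_INDEX)
value = H(s1, s2, i)
answer = next(idx for v, idx in bins[value[:2]] if v == value)
assert answer == i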


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: gmaxwell on April 24, 2014, 05:20:53 PM
I am not clear why a tree is required.  "index maps to hash(seed | index)" seems to work too?  That is fast and just as irreversible and it allows the server to compute entries in linear time, rather than log time.
Constant, not linear.

I'm not sure now if there was a reason I thought I needed the tree-structured PRNG, or if I'd just worked myself into a design corner before coming up with the working idea. If there is, I don't see it now. My original goal had been to find a function that was linear-time and mostly-sequential (e.g. not fast on PRAM) the first time for the client, but fast the second time... and always fast (log or better) for the server. The tree imposes some upper bounds on the non-precomputed search parallelism, but fairly weak ones— the non-precomputed search must do 2x the work and that work has geometrically increasing parallelism.

The reason to reduce parallelism in the non-precomputed search is, of course, based on the notion that any ridiculously fast hardware for bypassing this would likely work by searching most of the space concurrently.

An interesting question is whether it can be changed in the other direction— rather than simplified (thanks!), slightly complexified to decrease the parallelism in the client further?— I don't think the log() behavior in the server hurts one way or another, though I agree if it's not doing anything important simpler is better!


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: TierNolan on April 24, 2014, 05:58:01 PM
Constant, not linear.

Right.

Quote
The reason to reduce parallelism in the non-precomputed search is, of course, based on the notion that any ridiculously fast hardware for bypassing this would likely work by searching most of the space concurrently.

Ah, I see.  It is still burning server resources though, so isn't completely ineffective.

Quote
An interesting question is can it be changed in the other direction— rather than simplified (thanks!), slightly complexified to decrease the parallelism in the client further?— I don't think the log() behavior in the server hurts one way or another, though I agree if its not doing anything important simpler is better!

Maybe the reply could include multiple indexes.

The server sends a value and then the client has to find the matching exact index, but also some inexact matches.  The client would sort the values into bins which all have the same 16 bit prefix.  The server sends the full value and the client replies with the set of matches and also the index that has the exact match.

The average of the set size would have to be kept higher than some threshold.

Each set could be stored in a different file.  This would favour disk storage, since they could all be read quickly as a unit.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Cryddit on April 24, 2014, 07:45:31 PM
Use Rivest's time lock puzzle.  It's asymmetric in that the person with the key can random-access it but the person without the key has to compute it linearly.  Here's a linky:

http://people.csail.mit.edu/rivest/lcs35-puzzle-description.txt

Mathematically, it works like this:  You're trying to compute R=(2^(2^t)) mod n for some arbitrary 't' and an 'n' which is a product of 2 large primes.  If you actually know the primes, you can do this quickly; if you don't, the fastest way to get the result is to either (a) compute 't' successive squarings modulo n, or (b) look it up in a dictionary because you already computed those successive squarings.  This is important because the successive-squarings operation cannot be done in parallel. It is strictly linear computation.
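
For illustration, a toy sketch of that asymmetry (small well-known primes; real parameters would be enormously larger):

Code:
# p and q are small well-known primes (the 10,000th and 100,000th); a real
# instance would use large secret primes.
p, q = 104729, 1299709
n = p * q
t = 50000

# Alice (knows p and q): reduce the exponent mod (p-1)(q-1), fast for any t.
e = pow(2, t, (p - 1) * (q - 1))
R_fast = pow(2, e, n)

# Bob (knows only n): t successive squarings, inherently sequential.
R_slow = 2
for _ in range(t):
    R_slow = (R_slow * R_slow) % n

assert R_fast == R_slow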

Your storage-requirement then looks like this:  Alice transmits 'n' to Bob.  Bob builds a table of (t, R) pairs and an index by 'R'.  Whenever Alice wants proof that Bob is still holding this table, she picks a 't' at random, calculates the corresponding 'R' which takes her constant time, then transmits 'R' and asks Bob what 't' it corresponds to.  Bob must either look up the 't' value in his table, or recompute the table until he hits it.

With that description played straight, Bob could build a rainbow table to trade off space for computation: if he stores every millionth (t, R) pair, he can just start squaring when he gets Alice's 'R' and within a million squaring operations he'll get a (t, R) pair that's in his table, whereupon he can subtract the number of squaring operations he did to find the 't' that Alice started with.  You can reduce Bob's benefit from rainbow tables by asking for many different 't', but  the annoying part of this is that the rainbow table approach *is* parallelizable.  If Alice is asking for fifteen different 't', Bob can have fifteen different threads all being simultaneously useful in searches for a 't'.  

If you don't play it straight, there is probably a way to use a hash function or 'salt' to defeat rainbow tables; I'll have to think about it.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: gmaxwell on April 24, 2014, 11:37:08 PM
If you don't play it straight, there is probably a way to use a hash function or 'salt' to defeat rainbow tables; I'll have to think about it.
Right. The way to do that is to basically use the idea here.

You don't ask them for (2^(2^t)) (mod n),  you give them H((2^(2^t)) (mod n)) and ask them for t.  I don't see how to construct a rainbow table out of the hashed form because there is no useful function H(R) -> R  that maps to R's along your solution path. You could still extract some parallelism by first computing T sequentially and then saving way-points along the way, though that requires you to store elements of n which are large compared to the hashes... but probably reasonable enough for extracting some parallelism. Hm.

Downside of this is you can't do a publicly verifiable version, e.g. where you have just one piece of storage per identity rather than one piece per identity,peer pair (which was what I wanted in the motivation for this post; but there are other applications, e.g. like making a hidden service identity costly).


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Cryddit on April 25, 2014, 02:23:21 AM
If you don't play it straight, there is probably a way to use a hash function or 'salt' to defeat rainbow tables; I'll have to think about it.

You don't ask them for (2^(2^t)) (mod n),  you give them H((2^(2^t)) (mod n)) and ask them for t.  

You're absolutely right.   Instead of having Alice give Bob R and ask for t, you want her to give him H(R) and ask for t.  When Bob doesn't have R he can't start successively squaring R -  therefore he has no successor function for R, thus no way to construct a rainbow table.


You could still extract some parallelism by first computing T sequentially and then saving way-points along the way, though that requires you to store elements of n which are large compared to the hashes... but probably reasonable enough for extracting some parallelism. Hm.


I see it.  Say you've got a nice GPU that lets you do a thousand operations in parallel.  So instead of storing the _whole_ table you store a thousand widely separated starting points for your search.  Then instead of searching a million-element table when you get a request, you have a thousand threads that each work on building a thousand-element table until one of them finds the answer.   It uses a thousand times the processing power of a straight rainbow-table implementation, but it can be just as fast as a thousand-element rainbow table. 

But here is a good way to defeat that.

We're dealing with arbitrary 't', right? What if Alice starts out by saying that she will never ask for any 't' except a 't' that's evenly divisible by some factor X?  You can skip storing the result of all but one out of every X squaring operations. The result is that constructing the million-element table now costs a factor of 'X' more nonparallelizable operations.  Say X is 4300.  Now Bob can save a thousand starting points in his million-element table, but each thread is going to have to do 4300 squarings to get the 'next' element in its thousand-element subtable.  You can favor lookups over table construction by essentially any work factor. 

Of course this makes it compute-expensive to initiate a connection as well as storage-expensive to keep one going.  But if the connections last longer than an hour, that probably isn't a problem.



Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Cryddit on April 25, 2014, 02:38:25 AM
Downside of this is you can't do a publicly verifiable version, e.g. where you have just one piece of storage per identity rather than one piece per identity,peer pair (which was what I wanted in the motivation for this post; but there are other applications, e.g. like making a hidden service identity costly).


Wait...  I thought that was exactly what we were talking about. 

Could you explain in more detail what you mean by a 'publicly verifiable version'?

I was thinking of a system where each peer has to keep one table in memory per peer that it's connected to.  The table, in turn, is constructed from a 'seed' which lasts as long as that peer doesn't change its key.  Now, I had assumed that Alice's key for each connection would be different, but it doesn't have to be...  Alice doesn't have to use a different key for each connection; would it suit your purposes if she didn't?  In principle, Alice could publish 'n' while keeping its prime factors secret.  People would still have to construct an 'Alice' table in order to bring up a connection to the 'Alice' node. 



Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Cryddit on April 25, 2014, 02:45:07 AM

You don't ask them for (2^(2^t)) (mod n),  you give them H((2^(2^t)) (mod n)) and ask them for t.  

You're absolutely right.   Instead of having Alice give Bob R and ask for t, you want her to give him H(R) and ask for t.  When Bob doesn't have R he can't start successively squaring R -  therefore he has no successor function for R, thus no way to construct a rainbow table.


Nope, darn.  There's still a Bloom filter shortcut.  Bob can still construct a rainbow table if he uses Bloom filters rather than successor functions to select the band.

Here's how it works; each band of the table stores a bloom filter that selects hashes which result from 't' in that band, plus the starting (t, R) pair.  When Bob gets H(R), he goes to see which bloom filter it matches, then starts on that band with the starting (t, R) pair and does successive squares until he gets R such that H(R) matches - at which point he knows the corresponding t.   



Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: TierNolan on April 25, 2014, 02:10:34 PM
Here's how it works; each band of the table stores a bloom filter that selects hashes which result from 't' in that band, plus the starting (t, R) pair.  When Bob gets H(R), he goes to see which bloom filter it matches, then starts on that band with the starting (t, R) pair and does successive squares until he gets R such that H(R) matches - at which point he knows the corresponding t.   

That also breaks the basic version too.

Hash(seed | index) could also be spread over bands.

A tree of Bloom filters would be pretty fast.  The first Bloom filter would tell you which group of Bloom filters to check.

That means that you need only a few bits per entry.
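
For illustration, a rough sketch of that banding trade-off against the flat Hash(seed | index) table (SHA-256 standing in for the hash and a deliberately minimal Bloom filter; all sizes are toy values):

Code:
import hashlib, os, random

def H(seed, i):
    return hashlib.sha256(seed + i.to_bytes(8, 'big')).digest()

class Bloom:
    def __init__(self, m_bits=4096, k=4):
        self.m, self.k, self.bits = m_bits, k, 0
    def _positions(self, item):
        for j in range(self.k):
            d = hashlib.sha256(bytes([j]) + item).digest()
            yield int.from_bytes(d[:4], 'big') % self.m
    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos
    def __contains__(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

N, BAND = 1 << 14, 1 << 8
seed = os.urandom(16)

# Cheater: one full pass to fill the filters, then only a few bits per entry are kept.
filters = []
for b in range(N // BAND):
    f = Bloom()
    for i in range(b * BAND, (b + 1) * BAND):
        f.add(H(seed, i))
    filters.append(f)

# On a challenge: find a band whose filter claims the value, recompute only that band.
i = random.randrange(N)
value = H(seed, i)
hit = None
for b, f in enumerate(filters):
    if value in f:   # a false positive just costs one extra band recomputation
        hit = next((j for j in range(b * BAND, (b + 1) * BAND) if H(seed, j) == value), None)
        if hit is not None:
            break
assert hit == i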


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: gmaxwell on April 25, 2014, 07:12:39 PM
Wait...  I thought that was exactly what we were talking about.  
Could you explain in more detail what you mean by a 'publicly verifiable version'?
It was, you're not crazy.

The reason I mentioned it is because someone recently contacted me about using this kind of technique to make things like having tor hidden service ID's concurrently costly... e.g. you could still connect to as many peers as you want, but you'd still need to use one identity across all of them. It's a different application, but would work with the original proposal just by using a constant value for the peer challenge.

That also breaks the basic version too.
Yes, that's not news though. All of these so far allow for a choice of time/memory tradeoff; it's still maximally fast if you keep everything. The invocation of the sequential function was to try to diminish the available level of choice.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Taek on September 17, 2014, 02:59:11 AM
An idea for a publicly verifiable way to do this:

Start with a random seed, and build a large set by taking H(index | seed) until the set is the target size. Have the client sort and store the set.

Instead of presenting a hash and requiring an index, you present H(new random seed), and have the client present the entry from its set that is numerically closest to the presented hash.

The server can easily verify that the hash is in the set, but has no way of knowing if the client has produced the entry that's actually numerically closest.

But, statistically, the server can tell if the client is close to what would be expected. If you repeat the challenge a dozen or a hundred times (just hash the new random number repeatedly to get more challenges), the server can do statistical analysis and get a very good idea of whether the client is checking close to 100% of the values or not, with it being extremely unlikely that an honest client will be a victim of this statistical analysis due to bad luck. IE, if you are storing 100 indices and the search space is 1000, you'd expect to be about 10 off with each result, and you can put bounds on a standard deviation. Of course here the search space is 256 bits or whatever the hash size is.

Then, if the client is doing massive hashing instead of storing the file, you can make the challenge sequential. The client finds the closest value, takes H(challenge|closest hash) and uses that as the next target value. If you're not storing the hashes, you have to recompute the table again for each challenge. Still doesn't stop you from using massive parallelism to compute the table, but at least now you have to compute the table multiple times per challenge.

edit: haven't discussed where these random seeds are coming from; if you can't get a secure source of random numbers then someone could potentially manipulate the seeds such that they need significantly less computation or storage to pass the statistical analysis checks. Or, if they are trying to make an honest client fail, they could manipulate the random numbers such that the client is guaranteed to get a statistically unlikely result.


===========


edit2: I'm attempting to explain this process more clearly.

The goal is to force the client to store some volume of data in a publicly verifiable way, such that the client's responses can be verified with a trivial amount of computation.

This method assumes access to secure public random numbers, which can be considered a weakness.

The client creates a data set d_i = H(i|data_seed) for i = 1...N, such that the data set is of the desired size (1 GB, for example). The client then sorts the data set by the numerical value of the hashes.
The server can then create a publicly verifiable challenge c = H(query_seed). The client is to return the data d_i which is numerically closest to c.

The server and the third party cannot verify that the client has returned the d_i which is actually numerically closest to c; however, they can verify that the presented d_i is valid, and they can calculate how numerically close the solution is to c.
Statistically, c has an expected distance from the closest d_i.
Over the course of many challenges, the average distance of the client's responses should converge to this expected distance.
If the client is not storing the entire data set, or is recalculating only a small portion of the data set each query, the average distance of its responses will converge to a distance greater than the expected distance, and this can be used to detect dishonesty.

So, a client will always be able to provide a response; however, without checking the entire search space d_0..N before each response, the client will have responses that are not as good as they should be. A third party can use statistical analysis to detect this after a calculable number of queries (within a calculable certainty).
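
For illustration, a rough sketch of the statistical check described above (SHA-256 standing in for H, toy sizes, and an honest client; a cheating client storing only part of the set would converge to a larger average distance):

Code:
import hashlib, os, random, bisect

def H(b):
    return hashlib.sha256(b).digest()

N = 1 << 14                                 # toy set size
SPACE = 1 << 256                            # size of the hash output space
data_seed = os.urandom(32)

# Client: d_i = H(i | data_seed), stored sorted by numerical value.
data = sorted(int.from_bytes(H(i.to_bytes(8, 'big') + data_seed), 'big') for i in range(N))

def closest(sorted_vals, target):
    p = bisect.bisect_left(sorted_vals, target)
    candidates = sorted_vals[max(0, p - 1):p + 1]
    return min(candidates, key=lambda v: abs(v - target))

# Verifier: over many challenges the average distance of an honest client's
# answers should be near SPACE / (2 * N); storing less converges higher.
distances = []
for _ in range(200):
    c = int.from_bytes(H(os.urandom(32)), 'big')   # publicly derivable in the real scheme
    answer = closest(data, c)                      # honest client: true nearest value
    distances.append(abs(answer - c))
avg = sum(distances) / len(distances)
print(avg / (SPACE / (2 * N)))                     # close to 1 for an honest client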


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: blue eyes on September 19, 2014, 09:55:13 AM
Could someone contrast Proof of Storage with Proof of Stake, in particular with regard to physical resource consumption?

The proposal here is that proof of storage allows peers to prove that they have expended some resources on the current connection.  It isn't like POW for determining which fork has the most support.

Each full node only accepts a maximum number of incoming connections.  If each node limited the number of incoming connections to 100, then a node with 100 unique IP addresses could consume all the spare connections for every node.  Each node would see 100 connection attempts from 100 different IPs and would accept them all.

Proof of storage means that connecting to a node requires that you allocate drive space to every connection.  The random data you have to store is different for each connection, so you can't reuse it.

It doesn't even need to be that much.  If an attacker wants to connect to 10,000 open nodes and make 100 connections to each, then it needs 1 million times the per-connection storage requirement.

thx, very specific.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Razerglass on October 24, 2014, 02:57:55 PM
https://bitcointalk.org/index.php?topic=731923.0

Burst coin PoC


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: fivebells on October 25, 2014, 04:56:06 AM
FWIW, I asked the burstcoin dev about this thread, and he said he hadn't seen it before.


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: TierNolan on October 31, 2014, 10:05:14 PM
It looks like the burst system is resistant to the Bloom filter attack?

They generate "plots".  These are 256kB blocks of data generated using hash(account-id | plot-nonce) as seed and hashing it over and over to expand to the 256kB.  They also mix the array with the last 32 bytes of the array.

Then, they are broken down into 4096 "scoops" of 32 bytes each.

Based on the hash(previous-block | block height), a scoop index is selected (0 - 4095).

The miner has to scan through all scoops of the same index.  I assume they store all scoops with the same index together.

The scoop is run through hash(previous-block | scoop) and then divided by the target.

This gives the number of seconds to wait until the next block (lower is better).  If you get a result of 30 seconds, then you can create a block 30 seconds after the previous block.

The more plots you have, the more likely you will get the lowest result.

This means that miners have to read 1/4096 of the data stored on the disk in order to extract the scoops.

There is no way to Bloom filter, since all scoops are concatenated with info from the previous block before being passed through a hash function.  This means the value of each scoop changes on a per-block basis.
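
For illustration, a loose sketch of that selection logic (this is not the real Burst plot format; SHA-256 stands in for its hash and all sizes are toy values):

Code:
import hashlib, os

def H(b):
    return hashlib.sha256(b).digest()

SCOOPS = 4096

def make_plot(account_id, nonce):
    # Expand a per-plot seed into SCOOPS scoops of 32 bytes each (simplified).
    scoops, cur = [], H(account_id + nonce.to_bytes(8, 'big'))
    for _ in range(SCOOPS):
        scoops.append(cur)
        cur = H(cur)
    return scoops

account = os.urandom(8)
# In practice plots are generated once and kept on disk; they are regenerated
# here only to keep the sketch self-contained.
plots = [make_plot(account, nonce) for nonce in range(100)]   # 100 toy plots

prev_block, height = os.urandom(32), 123456
scoop_index = int.from_bytes(H(prev_block + height.to_bytes(4, 'big')), 'big') % SCOOPS
target = 2 ** 60                                 # illustrative difficulty target

# Miner: read only the selected scoop from each stored plot; lowest deadline wins.
deadlines = []
for plot in plots:
    quality = int.from_bytes(H(prev_block + plot[scoop_index])[:8], 'big')
    deadlines.append(quality // target)          # "seconds to wait" before forging
print(min(deadlines))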


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: work2heat on November 02, 2014, 10:35:23 PM
This seems like a great idea. Has there been any work towards implementing it in core or elsewhere?

One implication though is that honest services which provide monitoring and statistics of the network as a whole will become much more costly. Solutions might be public key authentication of such monitoring services or else for servers to accept "read only" connections (could be http-like or normal tcp socket that is never read from).


Title: Re: Proof of Storage to make distributed resource consumption costly.
Post by: Taek on November 05, 2014, 07:51:57 PM
How do you use that method to tell that a host is storing, say, 2^12 plots (1 GB)? There's no way for you to verify that the value presented by the host is the optimal value within the set of data that they are storing, except for you to somehow incentivize their finding of the optimal value. And then you have to do probabilistic analysis of repeated solutions to determine if they're scanning over the whole set each time.

As a method of consensus, this seems vulnerable because H(previous-block | block height) is something that the miner can control. If the miner is controlling 10% of the plots, but has enough time each time a block is found to grind over 100 potential values for `previous-block`, then the miner is very likely to find at least one value of `previous-block` that results in a scoop in one of the miners plots having an unexpectedly low time to produce the next block. Depending on how long it takes to scan the plots for the lowest value scoop, a miner could potentially try an enormous set of values for `previous-block`, until something is found that heavily favors the miner's plots.

The miner is then able to artificially inflate the number of plots that it appears to be storing, but only for blocks where it controls the value of `previous-block`. This means that a miner never has any incentive to use a chain produced by someone else, because they will not be able to grind over the value of `previous-block` and will be mining at a significant disadvantage compared to mining on a history composed of exclusively their blocks.