Bitcoin Forum
November 01, 2024, 01:35:50 PM *
News: Latest Bitcoin Core release: 28.0 [Torrent]
 
   Home   Help Search Login Register More  
Pages: « 1 [2] 3 »  All
  Print  
Author Topic: Proof of Storage to make distributed resource consumption costly.  (Read 24042 times)
Cryddit
Legendary
*
Offline Offline

Activity: 924
Merit: 1132


View Profile
October 31, 2013, 03:47:46 AM
 #21

So let me see if I get this right.  We suppose that Bob wants to connect to Alice's server.


Alice picks a nonce N0, adds Bob and Alice's Server IDs to it,
   hashes it, and communicates the hash H0 to Bob.
Bob picks a nonce N1, adds Alice and Bob's Server IDs to it,
   hashes it, and communicates the hash  H1 to Alice.

Alice reveals her nonce, so Bob can verify that the constructed
    value is the preimage to the hash H0.
Bob reveals his nonce, so Alice can verify that the constructed
    value is the preimage to the hash H1.

Now both of them can calculate N = N0 ^ N1.

Alice picks an address A (say) 32 bits long, then performs the following:

    initialize V (a hash-wide value) with N
    for each bit B in A
        V = hash (V,B);

and asks Bob, "What is the address of Vfinal?"

And now Bob must exhaustively compute (or have stored) the
hash tree in order to find the address.  


My immediate thought is that the burden on Bob is asymmetric, but not by as much as you think.  

If I am Bob, I will set up a Bloom Filter Tree.  Bob needs to exhaustively compute the hash tree once in order to initialize the filters, but can then use a set of filters much smaller than the whole tree to efficiently narrow the address search.  He can spend less than one percent of the memory you thought you were requiring and still find the address with less than ten thousand hashes.

gmaxwell (OP)
Moderator
Legendary
*
expert
Offline Offline

Activity: 4270
Merit: 8805



View Profile WWW
October 31, 2013, 04:03:40 AM
 #22

No, go review it again. Thats not whats being described here. The PRNG has to have a special tree structure in order to make Alice's cost in producing an exact match query logarithmic in the tree size. You can trade off additional computation for space, but because the function is random this isn't terribly effective.
Crowex
Member
**
Offline Offline

Activity: 111
Merit: 10


View Profile
October 31, 2013, 12:20:17 PM
 #23

is it possible to use this in a decentralised system where the nodes exchange random numbers with each other and store lots of smaller hash trees of the numbers, each hash tree corresponding to an individual node they want to communicate with?
or are there ways of cheating this?
hannesnaude
Full Member
***
Offline Offline

Activity: 169
Merit: 100

Firstbits : 1Hannes


View Profile
October 31, 2013, 02:01:13 PM
 #24

No, go review it again. Thats not whats being described here. The PRNG has to have a special tree structure in order to make Alice's cost in producing an exact match query logarithmic in the tree size. You can trade off additional computation for space, but because the function is random this isn't terribly effective.

Apparently I don't understand it either  Embarrassed.

Despite my ignorance, I want to make the following suggestion. Perhaps, if you could point out the weakness in this scheme it would help me see what I missed.

I think Alice's cost in producing the query can be made independent of the tree size. Simply require Bob to fill a table with the values of
H(seed + index) where index ranges from 0 to N and H is a hash function (preferably a memory hard one). He can then sort these results as before.
 
Alice can easily choose a random index i and compute H(seed+i) for a constant cost and then challenge Bob to produce i given H(seed+i). If Bob stored the sorted table a binary search allows him to find the correct entry in O(log N) time. If not he needs to compute at O(N) cost.

As far as I can see, the Bloom Filter Tree optimization that cryddit described will still work against this, but perhaps the fact that Alice's cost is now O(1) means that we can simply assume that legitimate clients all use this optimization and demand 100GB of nominal storage knowing that only ~1GB of actual storage (or whatever the true ratio is) needs to be consumed in order to be able to respond to her queries in a timely manner.

Secondly, I'm not convinced that the query forwarding attack that Sergio describes is actually an attack. If it costs me 1 GB of storage to be able to connect to X and Y wants to connect to me at a cost of 1 GB, why shouldn't I be able to offload the cost? If we don't setup in a secure manner as outlined by Sergio, a timely response to a Proof of storage query essentially proves either that I have stored the data OR that I am maintaining an open connection to someone else that has. Both of these are costly. So, if I want to open 5000 connections using this attack, I first need to convince 5000 clients to connect to me and I have to maintain those connections.
Crowex
Member
**
Offline Offline

Activity: 111
Merit: 10


View Profile
October 31, 2013, 02:52:54 PM
 #25

is it possible to use this in a decentralised system where the nodes exchange random numbers with each other and store lots of smaller hash trees of the numbers, each hash tree corresponding to an individual node they want to communicate with?
or are there ways of cheating this?


Oh, sorry, I think this what you are describing?
socrates1024
Full Member
***
Offline Offline

Activity: 126
Merit: 110


Andrew Miller


View Profile
October 31, 2013, 07:29:37 PM
 #26

I implemented this in a few lines of python, mostly to convince myself it works https://gist.github.com/amiller/7131876
The test runs in codepad: http://codepad.org/Nzp2Vdsk

amiller on freenode / 19G6VFcV1qZJxe3Swn28xz3F8gDKTznwEM
[my twitter] [research@umd]
I study Merkle trees, credit networks, and Byzantine Consensus algorithms.
SlipperySlope
Hero Member
*****
Offline Offline

Activity: 686
Merit: 501

Stephen Reed


View Profile
April 23, 2014, 06:21:09 PM
 #27

Could someone contrast Proof of Storage with Proof of Stake, in particular with regard to physical resource consumption?

For Bitcoin, I am becoming interested in coding, testing, and documenting various alternatives to Proof of Work, which I believe is unsustainable, e.g. ultimately in conflict with the Energy Charter Treaty's Article 19. http://en.wikipedia.org/wiki/Energy_Charter_Treaty#Energy_efficiency
TierNolan
Legendary
*
Offline Offline

Activity: 1232
Merit: 1104


View Profile
April 24, 2014, 11:13:23 AM
 #28

Could someone contrast Proof of Storage with Proof of Stake, in particular with regard to physical resource consumption?

The proposal here is that proof of storage allows peers to prove that they have expended some resources on the current connection.  It isn't like POW for determining which fork has the most support.

Each full node only accepts a maximum number of incoming connections.  If each node limited the number of incoming connections to 100, then a node with 100 unique IP addresses could consume all the spare connections for every node.  Each node would see 100 connection attempts from 100 different IPs and would accept them all.

Proof of storage means that connecting to a node requires that you allocate drive space to every connection.  The random data you have to store is different for each connection, so you can't reuse it.

It doesn't even need to be that much.  If an attacker wants to connect to 10,000 open nodes and make 100 connections to each, then it needs 1 million times the per-connection storage requirement.

1LxbG5cKXzTwZg9mjL3gaRE835uNQEteWF
TierNolan
Legendary
*
Offline Offline

Activity: 1232
Merit: 1104


View Profile
April 24, 2014, 11:25:49 AM
 #29

I am not clear why a tree is required.  "index maps to hash(seed | index)" seems to work too?  That is fast and just as irreversible and it allows the server to compute entries in linear time, rather than log time.

The client then stores a value to index map.  Each sub-file would contain values that have the same prefix.

It wouldn't even necessarily have to do sorting, since the files could just act as bins.

The process could be

<client connects>
<client sends seed (S2)>
<server sends seed (S1) and MAX_INDEX>
<server randomly selects an index (0 <= index < MAX_INDEX) and sends value (V) of Hash(S1 | S2 | index)>

<server disconnects client>
// If the server doesn't disconnect, then it defeats the purpose, since an attacker can stall at this point
// The server should also disconnect the client if it takes more than a few seconds to send S2

<client computes Hash(S1 | S2 | index) for all possible indexes>
<client sorts the results and puts them into sub-files>

<client reconnects>
<client sends S1 and S2>
<server retrieves S1/S2 session to give the index and expected IP address>
<client sends the index that matches the value>

Periodically

<server sends value>
<client replies with matching index>

The key is to make it so that it is faster to read the hard drive than to re-compute the table for each request.

This means that the hash function should take a while, but that converts it into a proof of work system.

The timeout between requesting a lookup and the reply would have to be tweaked.  

The more often the lookup is requested by the server, the higher the incentive to not just recompute.

The server might give a "well behaved" client a few chances before disconnecting.

1LxbG5cKXzTwZg9mjL3gaRE835uNQEteWF
gmaxwell (OP)
Moderator
Legendary
*
expert
Offline Offline

Activity: 4270
Merit: 8805



View Profile WWW
April 24, 2014, 05:20:53 PM
 #30

I am not clear why a tree is required.  "index maps to hash(seed | index)" seems to work too?  That is fast and just as irreversible and it allows the server to compute entries in linear time, rather than log time.
Constant, not linear.

I'm not sure now if there was a reason I thought I needed the tree-structured PRNG, or if I'd just worked myself into a design corner before coming up with the working idea. If there is, I don't see it now. My original goal had been to find a function that was linear-time and mostly-sequential (e.g. not fast on PRAM) the first time for the client, but fast the second time... and always fast (log or better) for the server. The tree imposes some upper bounds on the non-precomputed search parallelism, but fairly weak ones— the non-precomputed search must do 2x the work and that work has geometrically increasing parallelism.

The reason to reduce parallelism in the non-precomputed search is, of course, based on the notion that any ridiculously fast hardware for bypassing this would likely work by searching most of the space concurrently.

An interesting question is can it be changed in the other direction— rather than simplified (thanks!), slightly complexified to decrease the parallelism in the client further?— I don't think the log() behavior in the server hurts one way or another, though I agree if its not doing anything important simpler is better!
TierNolan
Legendary
*
Offline Offline

Activity: 1232
Merit: 1104


View Profile
April 24, 2014, 05:58:01 PM
 #31

Constant, not linear.

Right.

Quote
The reason to reduce parallelism in the non-precomputed search is, of course, based on the notion that any ridiculously fast hardware for bypassing this would likely work by searching most of the space concurrently.

Ah, I see.  It is still burning server resources though, so isn't completely ineffective.

Quote
An interesting question is can it be changed in the other direction— rather than simplified (thanks!), slightly complexified to decrease the parallelism in the client further?— I don't think the log() behavior in the server hurts one way or another, though I agree if its not doing anything important simpler is better!

Maybe the reply could include multiple indexes.

The server sends a value and then the client has to find the matching exact index, but also some inexact matches.  The client would sort the values into bins which all have the same 16 bit prefix.  The server sends the full value and the client replies with the set of matches and also the index that has the exact match.

The average of the set size would have to be kept higher than some threshold.

Each set could be stored in a different file.  This would favour disk storage, since they could all be read quickly as a unit.

1LxbG5cKXzTwZg9mjL3gaRE835uNQEteWF
Cryddit
Legendary
*
Offline Offline

Activity: 924
Merit: 1132


View Profile
April 24, 2014, 07:45:31 PM
Last edit: April 24, 2014, 07:59:57 PM by Cryddit
 #32

Use Rivest's time lock puzzle.  It's asymmetric in that the person with the key can random-access it but the person without the key has to compute it linearly.  Here's a linky:

http://people.csail.mit.edu/rivest/lcs35-puzzle-description.txt

Mathematically, it works like this:  You're trying to compute R=(2^(2^t)) mod n for some arbitrary 't' and an 'n' which is a product of 2 large primes.  If you actually know the primes, you can do this quickly; if you don't, the fastest way to get the result is to either (a) compute 't' successive squarings modulo n, or (b) look it up in a dictionary because you already computed those successive squarings.  This is important because the successive-squarings operation cannot be done in parallel. It is strictly linear computation.

Your storage-requirement then looks like this:  Alice transmits 'n' to Bob.  Bob builds a table of (t, R) pairs and an index by 'R'.  Whenever Alice wants proof that Bob is still holding this table, she picks a 't' at random, calculates the corresponding 'R' which takes her constant time, then transmits 'R' and asks Bob what 't' it corresponds to.  Bob must either look up the 't' value in his table, or recompute the table until he hits it.

With that description played straight, Bob could build a rainbow table to trade off space for computation: if he stores every millionth (t, R) pair, he can just start squaring when he gets Alice's 'R' and within a million squaring operations he'll get a (t, R) pair that's in his table, whereupon he can subtract the number of squaring operations he did to find the 't' that Alice started with.  You can reduce Bob's benefit from rainbow tables by asking for many different 't', but  the annoying part of this is that the rainbow table approach *is* parallelizable.  If Alice is asking for fifteen different 't', Bob can have fifteen different threads all being simultaneously useful in searches for a 't'.  

If you don't play it straight, there is probably a way to use a hash function or 'salt' to defeat rainbow tables; I'll have to think about it.
gmaxwell (OP)
Moderator
Legendary
*
expert
Offline Offline

Activity: 4270
Merit: 8805



View Profile WWW
April 24, 2014, 11:37:08 PM
Last edit: April 24, 2014, 11:49:25 PM by gmaxwell
 #33

If you don't play it straight, there is probably a way to use a hash function or 'salt' to defeat rainbow tables; I'll have to think about it.
Right. The way to do that is to basically use the idea here.

You don't ask them for (2^(2^t)) (mod n),  you give them H((2^(2^t)) (mod n)) and ask them for t.  I don't see how to construct a rainbow table out of the hashed form because there is no useful function H(R) -> R  that maps to R's along your solution path. You could still extract some parallelism by first computing T sequentially and then saving way-points along the way, though that requires you to store elements of n which are large compared to the hashes... but probably reasonable enough for extracting some parallelism. Hm.

Downside of this is you can't do a publicly verifiable version, e.g. where you have just one piece of storage per identity but one piece per identity,peer pair (which was what I wanted in the motivation for this post; but there are other applications, e.g. like making a hidden service identity costly).
Cryddit
Legendary
*
Offline Offline

Activity: 924
Merit: 1132


View Profile
April 25, 2014, 02:23:21 AM
 #34

If you don't play it straight, there is probably a way to use a hash function or 'salt' to defeat rainbow tables; I'll have to think about it.

You don't ask them for (2^(2^t)) (mod n),  you give them H((2^(2^t)) (mod n)) and ask them for t.  

You're absolutely right.   Instead of having Alice give Bob R and ask for t, you want her to give him H(R) and ask for t.  When Bob doesn't have R he can't start successively squaring R -  therefore he has no successor function for R, thus no way to construct a rainbow table.


You could still extract some parallelism by first computing T sequentially and then saving way-points along the way, though that requires you to store elements of n which are large compared to the hashes... but probably reasonable enough for extracting some parallelism. Hm.


I see it.  Say you've got a nice GPU that lets you do a thousand operations in parallel.  So instead of storing the _whole_ table you store a thousand widely separated starting points for your search.  Then instead of searching a million-element table when you get a request, you have a thousand threads that each work on building a thousand-element table until one of them finds the answer.   It uses a thousand times the processing power of a straight rainbow-table implementation, but it can be just as fast as a thousand-element rainbow table. 

But here is a good way to defeat that.

We're dealing with arbitrary 't', right? What if Alice starts out by saying that she will never ask for any 't' except a 't' that's evenly divisible by some factor X?  You can skip storing the result of all but one out of every X squaring operations. The result is that constructing the million-element table now costs a factor of 'X' more nonparallelizable operations.  Say X is 4300.  Now Bob can save a thousand starting points in his million-element table, but each thread is going to have to do 4300 squarings to get the 'next' element in its thousand-element subtable.  You can favor lookups over table construction by essentially any work factor. 

Of course this makes it compute-expensive to initiate a connection as well as storage-expensive to keep one going.  But if the connections last longer than an hour, that probably isn't a problem.

Cryddit
Legendary
*
Offline Offline

Activity: 924
Merit: 1132


View Profile
April 25, 2014, 02:38:25 AM
 #35

Downside of this is you can't do a publicly verifiable version, e.g. where you have just one piece of storage per identity but one piece per identity,peer pair (which was what I wanted in the motivation for this post; but there are other applications, e.g. like making a hidden service identity costly).


Wait...  I thought that was exactly what we were talking about. 

Could you explain in more detail what you mean by a 'publicly verifiable version'?

I was thinking of a system where each peer has to keep one table in memory per peer that it's connected to.  The table, in turn, is constructed from a 'seed' which lasts as long as that peer doesn't change its key.  Now, I had assumed that Alice's key for each connection would be different, but it doesn't have to be...  Alice doesn't have to use a different key for each connection; would it suit your purposes if she didn't?  In principle, Alice could publish 'n' while keeping its prime factors secret.  People would still have to construct an 'Alice' table in order to bring up a connection to the 'Alice' node. 

Cryddit
Legendary
*
Offline Offline

Activity: 924
Merit: 1132


View Profile
April 25, 2014, 02:45:07 AM
 #36


You don't ask them for (2^(2^t)) (mod n),  you give them H((2^(2^t)) (mod n)) and ask them for t.  

You're absolutely right.   Instead of having Alice give Bob R and ask for t, you want her to give him H(R) and ask for t.  When Bob doesn't have R he can't start successively squaring R -  therefore he has no successor function for R, thus no way to construct a rainbow table.


Nope, darn.  There's still a Bloom filter shortcut.  Bob can still construct a rainbow table if he uses Bloom filters rather than successor functions to select the band.

Here's how it works; each band of the table stores a bloom filter that selects hashes which result from 't' in that band, plus the starting (t, R) pair.  When Bob gets H(R), he goes to see which bloom filter it matches, then starts on that band with the starting (t, R) pair and does successive squares until he gets R such that H(R) matches - at which point he knows the corresponding t.   

TierNolan
Legendary
*
Offline Offline

Activity: 1232
Merit: 1104


View Profile
April 25, 2014, 02:10:34 PM
 #37

Here's how it works; each band of the table stores a bloom filter that selects hashes which result from 't' in that band, plus the starting (t, R) pair.  When Bob gets H(R), he goes to see which bloom filter it matches, then starts on that band with the starting (t, R) pair and does successive squares until he gets R such that H(R) matches - at which point he knows the corresponding t.   

That also breaks the basic version too.

Hash(seed | index) could also be spread over bands.

A tree of Bloom filters would be pretty fast.  The first Bloom filter would tell you which group of Bloom filters to check.

That means that you need only a few bits per entry.

1LxbG5cKXzTwZg9mjL3gaRE835uNQEteWF
gmaxwell (OP)
Moderator
Legendary
*
expert
Offline Offline

Activity: 4270
Merit: 8805



View Profile WWW
April 25, 2014, 07:12:39 PM
 #38

Wait...  I thought that was exactly what we were talking about.  
Could you explain in more detail what you mean by a 'publicly verifiable version'?
It was, you're not crazy.

The reason I mentioned it is because someone recently contacted me about using this kind of technique to make things like having tor hidden service ID's concurrently costly... e.g. you could still connect to as many peers as you want, but you'd still need to use one identity across all of them. It's a different application, but would work with the original proposal just by using a constant value for the peer challenge.

That also breaks the basic version too.
Yes, thats not news though. All of these so far allow for a choice of time/memory tradeoff, it's still maximally fast if you keep everything. The invocation of the sequential function was to try to diminish the available level of choice.
Taek
Hero Member
*****
Offline Offline

Activity: 543
Merit: 501



View Profile
September 17, 2014, 02:59:11 AM
Last edit: September 17, 2014, 07:24:33 PM by Taek
 #39

An idea for a publicly verifiable way to do this:

Start with a random seed, and build a large set by taking H(index | seed) until the set is the target size. Have the client sort and store the set.

Instead of presenting a hash and requiring an index, you present H(new random seed), and have the client present the index that is numerically closest to the presented hash.

The server can easily verify that the hash is in the set, but has no way of knowing if the client has produced the index that's actually numerically closest.

But, statistically, the server can tell if the client is close to what would be expected. If you repeat the challenge a dozen or a hundred times (just hash the new random number repeatedly to get more challenges), the server can do statistical analysis and get a very good idea of whether the client is checking close to 100% of the values or not, with it being extremely unlikely that an honest client will be a victim of this statistical analysis due to bad luck. IE, if you are storing 100 indices and the search space is 1000, you'd expect to be about 10 off with each result, and you can put bounds on a standard deviation. Of course here the search space is 256 bits or whatever the hash size is.

Then, if the client is doing massive hashing instead of storing the file, you can make the challenge sequential. The client finds the closest value, takes H(challenge|closest hash) and uses that as the next target value. If you're not storing the hashing, you have to recompute the table again for each challenge. Still doesn't stop you from using massive parallelism to compute the table, but at least now you have to compute the table multiple times per challenge.

edit: haven't discussed where these random seeds are coming from, if you can't get a secure source of random numbers then someone could potential manipulate the seeds such that they need significantly less computation or storage to pass the statistical analysis checks. Or, if they are trying to make an honest client fail, they could manipulate the random numbers such that the client is guaranteed to get a statistically unlikely result.


===========


edit2: I'm attempting to explain this process more clearly.

The goal is to force the client to store some volume of data in a publicly verifiable way, such that the client's responses can be verified with a trivial amount of computation.

This method assumes access to secure public random numbers, which can be considered a weakness.

The client creates a data set d_i = H(i|data_seed) for i...N, such that the data set is of the desired size. (1 GB for example). The client then sorts the data set by the numerical value of the hashes.
The server can then create a publicly verifiable challenge c = H(query_seed). The client is to return the data d_i which is numerically closest to c.

The server and the third party cannot verify that the client has returned the d_i which is actually numerically closet to c, however they can verify that the presented d_i is valid, and they can calculate how numerically close the solution is to c.
Statistically, c has an expected distance from the closest d_i.
Over the course of many challenges, the average distance of the client's responses should converge to this expected distance.
If the client is not storing the entire data set, or is recalulating only a small portion of the data set each query, the average distance of its responses will converge to a distance greater than the expected distance, and this can be used to detect dishonesty.

So, a client will always be able to provide a response, however without checking the entire search space d_0..N before each response, the client will have responses that are not as good as they should be. A third party can use statistical analysis to detect this after a calculatable number of queries (within a calculatable certainty).
blue eyes
Newbie
*
Offline Offline

Activity: 25
Merit: 0


View Profile
September 19, 2014, 09:55:13 AM
 #40

Could someone contrast Proof of Storage with Proof of Stake, in particular with regard to physical resource consumption?

The proposal here is that proof of storage allows peers to prove that they have expended some resources on the current connection.  It isn't like POW for determining which fork has the most support.

Each full node only accepts a maximum number of incoming connections.  If each node limited the number of incoming connections to 100, then a node with 100 unique IP addresses could consume all the spare connections for every node.  Each node would see 100 connection attempts from 100 different IPs and would accept them all.

Proof of storage means that connecting to a node requires that you allocate drive space to every connection.  The random data you have to store is different for each connection, so you can't reuse it.

It doesn't even need to be that much.  If an attacker wants to connect to 10,000 open nodes and make 100 connections to each, then it needs 1 million times the per-connection storage requirement.

thx, very specific.
Pages: « 1 [2] 3 »  All
  Print  
 
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!