SQL block database: about nested sets

grondilu (OP)

Legendary

Offline

Activity: 1288
Merit: 1076

SQL block database: about nested sets

June 22, 2012, 01:02:12 PM

#1

As I try to implement a SQL bitcoin database for my perl library, I was looking for a way to efficiently store the tree block structure. So I found out about an idea which is kind of neat: nested sets.

Here is one document which explains it (it's not the one I actually used as the one I used is not in english, so I had to find an other one for you): https://communities.bmc.com/communities/docs/DOC-9902

EDIT: the wikipedia page seems fine also.

I think it's quite a cleaver method, but a bit tough to grasp.

Here is the mysql code I wrote so far (I didn't put here some views, but I did put the main tables, even if they are not related to the subject of this thread):

Code:

-- Table "blocks" actually only contains block headers
CREATE TABLE blocks (
    hash                char(32) binary primary key,

    version             integer default 1,
    hashPrev            char(32) binary not null,
    hashMerkleRoot      char(32) binary not null,
    nTime               integer unsigned not null,
    nBits               integer unsigned not null,
    nNonce              integer unsigned not null,

    key (hashMerkleRoot),
    key (hashPrev)
);

-- We'll insert the genesis block here as some triggers won't behave well
-- with an empty 'blocks' table.
INSERT INTO blocks values (
    unhex("6FE28C0AB6F1B372C1A6A246AE63F74F931E8365E15A089C68D6190000000000"),

    1,
    unhex("0000000000000000000000000000000000000000000000000000000000000000"),
    unhex("3BA3EDFD7A7B12B27AC72C3E67768F617FC81BC3888A51323A9FB8AA4B1E5E4A"),
    1231006505,
    486604799,
    2083236893
);

-- A view of orphan blocks
-- (notice that we don't use any "view" prefix in the name here)
CREATE VIEW orphan_blocks AS
SELECT a.*
FROM blocks a LEFT JOIN blocks b
ON a.hashPrev = b.hash
WHERE b.hash IS NULL;

-- Merkle transaction trees are stored in their own table
CREATE TABLE Merkle_trees (
    root                char(32) binary not null,
    idx                 integer unsigned not null,
    hash		char(32) binary,
    primary key (root, idx),
    key (root)
);

-- Transactions
CREATE TABLE transactions (
    hash                char(32) binary primary key,
    version             integer,
    lockTime            integer unsigned,
);

-- Transaction inputs
CREATE TABLE tx_in (
    hash                char(32) binary,
    prevout_hash        char(32) binary,
    prevout_n           integer unsigned,
    scriptSig           blob,
    sequence            integer unsigned,

    primary key (hash, prevout_hash, prevout_n),
    key(hash)
)

-- Transaction outputs
CREATE TABLE tx_out (
    tx_out_id           integer unsigned primary key auto_increment,
    hash                char(32) binary,
    value               integer,
    scriptPubKey        blob,

    key (hash)
);

-- A function to compute target from nBits
CREATE FUNCTION target (bits float)
RETURNS REAL DETERMINISTIC
RETURN mod(bits, 0x1000000) * pow( 256, bits div 0x1000000 - 3 );

-- To create the block tree structure,
-- we'll use the interval model.
-- Each node (i.e. each block) will have a left
-- and a right edge.  We must ensure that descending
-- blocks have edges inside its parent's edges.

CREATE TABLE block_tree (
    node	char(32) binary primary key,
    L           integer unsigned not null,
    R		integer unsigned not null check (R > L),
    height	integer unsigned not null
);

-- We insert the genesis node manually.
-- Left edge is 0, right edge is 1, height is 0.
INSERT INTO block_tree values (
    unhex("6FE28C0AB6F1B372C1A6A246AE63F74F931E8365E15A089C68D6190000000000"),
    0,
    1,
    0
);

CREATE TRIGGER add_block_in_tree AFTER INSERT ON blocks
FOR EACH ROW
BEGIN
    UPDATE block_tree t, block_tree r
    SET t.L=t.L+2
    WHERE r.node = new.hashPrev
    AND t.L >= r.D;

    UPDATE block_tree t, block_tree r
    SET t.D=t.D+2
    WHERE r.node = new.hashPrev
    AND t.D >= r.D;

    INSERT INTO block_tree (node, L, R, height)
    SELECT new.hash, r.D, r.D + 1, r.height + 1
    FROM block_tree r
    WHERE r.node = new.hashPrev;
END;

The trigger "add_block_in_tree" should work when a block whose parent is known is inserted. But it will not work if block is inserted before its parent is. It should be possible to code this but it is a bit tricky.

Anyway if anyone is also interested in this model, I'd gladly hear suggestions.

Check out my bitcoin bash library

pc

Sr. Member

Offline

Activity: 253
Merit: 250

Re: SQL block database: about nested sets

June 22, 2012, 03:08:59 PM

#2

I regularly use the nested set model of trees in SQL for my day job (which is completely unrelated to bitcoin). I think Joe Celko was one of the originators (or at least an advocate) for using SQL in this way, and his book "SQL for Smarties" is one of the best books I've ever used professionally. It handles a lot of "clever" uses of SQL of a variety of things, including nested sets, and I highly recommend it.

maaku

Legendary

Offline

Activity: 905
Merit: 1011

Re: SQL block database: about nested sets

June 22, 2012, 05:20:13 PM

#3

I wrote/improved-upon a nested-set implementation in Python using SQLAlchemy for this exact purpose. It's hosted on github.

It'll probably not be worth the effort to integrate it into your perl library directly (being in Python and on top of SQLAlchemy, etc.), but you can port it, or maybe you'll just find it a useful reference as you write your own.

EDIT: I suppose the codebase might be intimidating to someone who not versed in Python or the SQLAlchemy framework. You'll find the insertion/update/deletion code in 'sqlalchemy_tree/orm.py', and the higher-level methods for manipulating the tree in 'sqlalchemy_tree/manager/{class_,instance}.py'.

I'm an independent developer working on bitcoin-core, making my living off community donations.
If you like my work, please consider donating yourself: 13snZ4ZyCzaL7358SmgvHGC9AxskqumNxP

grondilu (OP)

Legendary

Offline

Activity: 1288
Merit: 1076

Re: SQL block database: about nested sets

June 27, 2012, 08:26:52 AM

#4

Quote from: maaku on June 22, 2012, 05:20:13 PM

I wrote/improved-upon a nested-set implementation in Python using SQLAlchemy for this exact purpose. It's hosted on github.

It'll probably not be worth the effort to integrate it into your perl library directly (being in Python and on top of SQLAlchemy, etc.), but you can port it, or maybe you'll just find it a useful reference as you write your own.

EDIT: I suppose the codebase might be intimidating to someone who not versed in Python or the SQLAlchemy framework. You'll find the insertion/update/deletion code in 'sqlalchemy_tree/orm.py', and the higher-level methods for manipulating the tree in 'sqlalchemy_tree/manager/{class_,instance}.py'.

I'll have a look at it someday. But meanwhile I've been working on neat stuff and I'd like to share it to everyone. It's not implemented in SQL right now, but I should do it soon.

Here it is.

I've spent quite some time reading the work of a guy named Vadim Tropashko on
this subject. It's quite fascinating, and I think his technique would be
perfect for a SQL model of the bitcoin block structure.

Tropashko uses rational values as borders for his intervals. One of the main
advantage is that you can work on a fix global interval (0 to 1 for instance):
you don't have to shift half of the block tree each time you insert a new node.

One cool aspect of this technique is the relation with continued fractions.
As you know any fraction can be written as a finite continued fraction:

Code:

p          1      
- = ---------------
q            1  
    a + -----------
               1 
        b + -------
            
            c + ...

The idea is that the a, b, c, ... sequence provides a natural path for the
node's branch.

For instance, a sequence such as 1, 1, 2, 1 would mean: "take the second branch
at the third node from the root".

Now, nested intervals are all about intervals, so we can not be happy with just
one rational for each node. We need an upper limit, assuming the rational
described above is the lower limit (it is not necessarly the case though, but
we'll deal with this problem later).

The idea is to associate a Möbius function of a variable x ranging from 0 to 1
(same as our global interval):

Code:

              1      
M(x) = ---------------
                1  
       a + -----------
                  1 
           b + -------
    
               c  +  x

Then we have p/q = M(0). The parent of f/p is then M(infinity).

This is actually already enough to understand that M(x) can be written:

Code:

        p + r x
M(x) =  -------
        q + s x

Where r/s is the parent node written as a rational.

I have no exact proof for that, but considerations on limits at infinity
and value at 0 clearly indicates it should be true.

Without the possibility to write M(x) like this, it would be pretty tough
to compute any M(x) value in SQL.

Anyway, what is important to remember is that to compute M(x) we only need
two rational numbers: the one from the actual lowest edge of the node, and
the one from its direct parent.

The only thing which is lacking is a way to get the first child. M(1) wouldn't
do as this is actually the first 'sibling'.

Tropashko then gets into quite a lot of complicated stuff and I had a lot of
difficulties grasping them, and when I managed to understand them I was
wondering if they would be useful anyway. Yet there is one remark that he
makes at the very end of his last article, a remark that can make, in my
humble opinion, the whole thing much easier.

Tropashko remarks that simple continued fractions are notoriously known
not to be monotonuous. Depending of the parity of the position of one of the
number a, b, c..., sometimes the whole value increases with the number,
sometimes it decreases.

He then proposes to use positive continued fractions
(mind the minus signs):

Code:

                 1
M(x) = ---------------------
                     1  
       a + 1 - -------------
                       1 
           b + 1 - ---------
    
                   c + 1 - x

Althouth this expression seems more complicated at first glance,
it is actually much easier to visualize and to work with.

The main reason for this is that it's easy to convince yourself that this
function of x is growing (I mean: x > y => M(x) > M(y) ).

This changes everything. We now can be sure that the lowest
edge for an interval will be M(0), and the highest edge will be M(1).
We'll usually identify a node with M(0).

Parent is still M(infinity), so we can still write the function as:

Code:

        p - r x
M(x) =  -------
        q - s x

The only difference are the minus signs, but they are not a problem.

The first child of M(0) is easy to spot: it's M(1/2). The second child (first
sibling of M(1/2)) is M(1/3), the third is M(1/4), and so on.

The first sibling of M(0) is M(-1), second sibling is M(-2), and so on.

Here is a small part of a genealogy tree, to make things clearer:

Code:

M(infinity)-->M(0)-------->M(1/2)--> M(1/(2-1/2)) --> ...asymptotically ends at M(1)
        | |          |  |
        | \-->M(-1)  |  \->M(1/3)----> M(1/(3-1/2)) --> ...asymptotically ends at M(1/2)
        |            |
        \---->M(-2)   \->M(1/4)---> M(1/(4-1/2))--> ...asymptotically ends at M(1/3)
                                |
                                \-> M(1/(4-1/3))--> ...asymptotically ends at M(1/(4-1/2))
                                       .
                                       .
                                       .

Then there is still a catch: how do you compute the parent from the lower edge
of the interval? Normally, as Tropashko explains it, you'd have to use the
extended euclidian algorithm, but we actually don't need to do that for a
bitcoin block tree, since each parent is present already in the database
anyway. We can therefore retrieve the parent rational value, and use it to
compute the Möbius function, and thus the rest of the tree as we need it.

So, to summarize, it's pretty easy. Each node tree will have two rational
numbers, defining an interval. If the lower edge of node A is inside the
interval of node B, then B is an ancestor of node A. To add a child to a node,
we just need to get the parent and compute the Möbius function for x = 1/(n+2),
where n is the number of already existing children. It shouldn't be difficult
to get n, we basically just need to make a loop and test for the existence of M(1/(k+2))
in the database.

Moving a whole branch should not be too difficult either. Imagine we have an
orphan block with children, and then suddenly the parent block of the orphan is
added to the database. To patch the new branch to the database, we need to
change intervals for each node of the previously orphan branch. It sure will
cost more than adding a single child to a leaf, but hopefully this situation
does not occur too often anyway.

One cool thing with the fact that positive continued fractions are increasing
functions, is that we will not have to store the ordinal value (height, depth,
however you name it) of each node. If we sort our queries by rational lower
edges, then we'll automatically get the sequencial order of our chain.

Check out my bitcoin bash library