You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@couchdb.apache.org by "Davies, Owain" <Ow...@baesystemsdetica.com> on 2012/03/13 18:26:31 UTC

Fundamentals Question on CoucheDB's append only b+tree.

Hi,

I have been reading the couchDB guide, particularly the appendix on the
power of b-trees. As I understand it couchdb uses a b+tree, in which the
leaf nodes have pointers to their siblings to facilitate quick in-order
and reverse in-order enumeration of the documents in the b+tree.

When a document is modified (or added), then the document is appended to
the file, the leaf node is cloned and updated to point to the new
document. However in order to maintain the linked list between the
sibling leaf nodes, you would need to update the adjacent link nodes
(creating new leaf nodes appended onto the end of the file), this will
eventually require that the whole tree is replicated, not just the node
on the path to the effected leaf node.

I do not think this can be the case, could somebody point out where I
have made the mistake? 

Thanks,

Owain

Please consider the environment before printing this email.
 
This message should be regarded as confidential. If you have received this email in error please notify the sender and destroy it immediately.
 
Statements of intent shall only become binding when confirmed in hard copy by an authorised signatory. 
 
The contents of this email may relate to dealings with other companies under the control of BAE Systems plc details of which can be found at http://www.baesystems.com/Businesses/index.htm.
 
Detica Limited is a BAE Systems company trading as BAE Systems Detica.
Detica Limited is registered in England and Wales under No: 1337451.
Registered office: Surrey Research Park, Guildford, Surrey, GU2 7YP, England.


Re: Fundamentals Question on CoucheDB's append only b+tree.

Posted by Robert Newson <rn...@apache.org>.
what Randall said :P


On 13 March 2012 19:31, Randall Leeds <ra...@gmail.com> wrote:
> On Tue, Mar 13, 2012 at 10:40, Robert Newson <rn...@apache.org> wrote:
>
>> There's no linked list running between the leafs. A b+tree doesn't
>> require one, though it's a common addition. The b+tree algorithm is a
>> revision over a binary tree (where inner nodes point strictly at one
>> left and one right item). To be a b+tree you need to hold many
>> pointers on an inner node.
>>
>
> I thought being a B-tree was to have many pointers in inner nodes, being a
> B+tree was to have *only* pointers in inner nodes and the values all at the
> leaf nodes.
> Whatever it's called, the latter is what CouchDB has.

Re: Fundamentals Question on CoucheDB's append only b+tree.

Posted by Bob Dionne <di...@dionne-associates.com>.
the literature is not quite clear on this point[1].

Also, to Paul's earlier point, yes, technically couchdb's btree is a b+tree as the reductions aren't mapped to the keys, though this impacts the ability to get as many keys as possible into RAM with each read.

couch_btree is a pretty clever piece of erlang. The order (as defined in Knuth's sense[1]) is determined by a chunk size. In some cases you can improve performance by imposing a minimum but by and large no one to date has demonstrated any major improvements in it. Which is fine as that's probably not where the bottlenecks are.

I think storing the reductions with the inner nodes is a place where an optimization could be performed. The problem being that they are essential to the incremental feature of MR.


[1] http://en.wikipedia.org/wiki/B-tree


On Mar 14, 2012, at 12:37 AM, Paul Davis wrote:

> On Tue, Mar 13, 2012 at 11:33 PM, Randall Leeds <ra...@gmail.com> wrote:
>> On Mar 13, 2012 7:11 PM, "Paul Davis" <pa...@gmail.com> wrote:
>>> 
>>> On Tue, Mar 13, 2012 at 5:50 PM, Bob Dionne
>>> <di...@dionne-associates.com> wrote:
>>>> That's largely correct, though couchdb's btree is neither fish nor fowl.
>>>> 
>>>> It doesn't have an order (B-trees generally do) and it does store
>> values in the inner nodes (B+trees do not)
>>>> 
>>> 
>>> The lack of order is the big thing.
>> 
>> What is meant by order in this context?
> 
> Maximum number of elements in a node. Ie, the whole "every node except
> the root must have between N/2 and N elements at all times" thing.


Re: Fundamentals Question on CoucheDB's append only b+tree.

Posted by Paul Davis <pa...@gmail.com>.
On Tue, Mar 13, 2012 at 11:33 PM, Randall Leeds <ra...@gmail.com> wrote:
> On Mar 13, 2012 7:11 PM, "Paul Davis" <pa...@gmail.com> wrote:
>>
>> On Tue, Mar 13, 2012 at 5:50 PM, Bob Dionne
>> <di...@dionne-associates.com> wrote:
>> > That's largely correct, though couchdb's btree is neither fish nor fowl.
>> >
>> > It doesn't have an order (B-trees generally do) and it does store
> values in the inner nodes (B+trees do not)
>> >
>>
>> The lack of order is the big thing.
>
> What is meant by order in this context?

Maximum number of elements in a node. Ie, the whole "every node except
the root must have between N/2 and N elements at all times" thing.

Re: Fundamentals Question on CoucheDB's append only b+tree.

Posted by Randall Leeds <ra...@gmail.com>.
On Mar 13, 2012 7:11 PM, "Paul Davis" <pa...@gmail.com> wrote:
>
> On Tue, Mar 13, 2012 at 5:50 PM, Bob Dionne
> <di...@dionne-associates.com> wrote:
> > That's largely correct, though couchdb's btree is neither fish nor fowl.
> >
> > It doesn't have an order (B-trees generally do) and it does store
values in the inner nodes (B+trees do not)
> >
>
> The lack of order is the big thing.

What is meant by order in this context?

Re: Fundamentals Question on CoucheDB's append only b+tree.

Posted by Paul Davis <pa...@gmail.com>.
On Tue, Mar 13, 2012 at 5:50 PM, Bob Dionne
<di...@dionne-associates.com> wrote:
> That's largely correct, though couchdb's btree is neither fish nor fowl.
>
> It doesn't have an order (B-trees generally do) and it does store values in the inner nodes (B+trees do not)
>

The lack of order is the big thing. Though I'd slightly disagree with
the no values in inner nodes assessment. Strictly speaking, I would
say they do not as to get a key's value you have to read through to a
leaf node. I would guess that you're referring to the reduction values
stored on internal nodes, and while they are "values" technically,
they aren't the "value" of the "key/value" type. Specifically, they
don't have a 1:1 mapping to keys.

IOW, the internal storage of reductions directly in the inner-nodes is
an extension to the b+tree algorithm that could be done perfectly
validly in a textbook example of b+trees.

OTOH, I definitely agree about the lack of order. Sadly each time I've
tried to enforce that I end up making the whole thing considerably
slower so in this case engineering pragmatism wins over theoretical
purity.

> As Paul might say it's a B~tree
>
>
> On Mar 13, 2012, at 3:31 PM, Randall Leeds wrote:
>
>> On Tue, Mar 13, 2012 at 10:40, Robert Newson <rn...@apache.org> wrote:
>>
>>> There's no linked list running between the leafs. A b+tree doesn't
>>> require one, though it's a common addition. The b+tree algorithm is a
>>> revision over a binary tree (where inner nodes point strictly at one
>>> left and one right item). To be a b+tree you need to hold many
>>> pointers on an inner node.
>>>
>>
>> I thought being a B-tree was to have many pointers in inner nodes, being a
>> B+tree was to have *only* pointers in inner nodes and the values all at the
>> leaf nodes.
>> Whatever it's called, the latter is what CouchDB has.
>

Re: Fundamentals Question on CoucheDB's append only b+tree.

Posted by Bob Dionne <di...@dionne-associates.com>.
That's largely correct, though couchdb's btree is neither fish nor fowl.

It doesn't have an order (B-trees generally do) and it does store values in the inner nodes (B+trees do not)

As Paul might say it's a B~tree


On Mar 13, 2012, at 3:31 PM, Randall Leeds wrote:

> On Tue, Mar 13, 2012 at 10:40, Robert Newson <rn...@apache.org> wrote:
> 
>> There's no linked list running between the leafs. A b+tree doesn't
>> require one, though it's a common addition. The b+tree algorithm is a
>> revision over a binary tree (where inner nodes point strictly at one
>> left and one right item). To be a b+tree you need to hold many
>> pointers on an inner node.
>> 
> 
> I thought being a B-tree was to have many pointers in inner nodes, being a
> B+tree was to have *only* pointers in inner nodes and the values all at the
> leaf nodes.
> Whatever it's called, the latter is what CouchDB has.


Re: Fundamentals Question on CoucheDB's append only b+tree.

Posted by Randall Leeds <ra...@gmail.com>.
On Tue, Mar 13, 2012 at 10:40, Robert Newson <rn...@apache.org> wrote:

> There's no linked list running between the leafs. A b+tree doesn't
> require one, though it's a common addition. The b+tree algorithm is a
> revision over a binary tree (where inner nodes point strictly at one
> left and one right item). To be a b+tree you need to hold many
> pointers on an inner node.
>

I thought being a B-tree was to have many pointers in inner nodes, being a
B+tree was to have *only* pointers in inner nodes and the values all at the
leaf nodes.
Whatever it's called, the latter is what CouchDB has.

RE: Fundamentals Question on CoucheDB's append only b+tree.

Posted by "Davies, Owain" <Ow...@baesystemsdetica.com>.
Thanks,
That clears it all up, and makes more sense now.

Owain 

-----Original Message-----
From: Robert Newson [mailto:rnewson@apache.org] 
Sent: 13 March 2012 17:41
To: dev@couchdb.apache.org
Subject: Re: Fundamentals Question on CoucheDB's append only b+tree.

There's no linked list running between the leafs. A b+tree doesn't
require one, though it's a common addition. The b+tree algorithm is a
revision over a binary tree (where inner nodes point strictly at one
left and one right item). To be a b+tree you need to hold many pointers
on an inner node.

B.

On 13 March 2012 17:26, Davies, Owain
<Ow...@baesystemsdetica.com> wrote:
> Hi,
>
> I have been reading the couchDB guide, particularly the appendix on 
> the power of b-trees. As I understand it couchdb uses a b+tree, in 
> which the leaf nodes have pointers to their siblings to facilitate 
> quick in-order and reverse in-order enumeration of the documents in
the b+tree.
>
> When a document is modified (or added), then the document is appended 
> to the file, the leaf node is cloned and updated to point to the new 
> document. However in order to maintain the linked list between the 
> sibling leaf nodes, you would need to update the adjacent link nodes 
> (creating new leaf nodes appended onto the end of the file), this will

> eventually require that the whole tree is replicated, not just the 
> node on the path to the effected leaf node.
>
> I do not think this can be the case, could somebody point out where I 
> have made the mistake?
>
> Thanks,
>
> Owain
>
> Please consider the environment before printing this email.
>
> This message should be regarded as confidential. If you have received
this email in error please notify the sender and destroy it immediately.
>
> Statements of intent shall only become binding when confirmed in hard
copy by an authorised signatory.
>
> The contents of this email may relate to dealings with other companies
under the control of BAE Systems plc details of which can be found at
http://www.baesystems.com/Businesses/index.htm.
>
> Detica Limited is a BAE Systems company trading as BAE Systems Detica.
> Detica Limited is registered in England and Wales under No: 1337451.
> Registered office: Surrey Research Park, Guildford, Surrey, GU2 7YP,
England.
>
Please consider the environment before printing this email.
 
This message should be regarded as confidential. If you have received this email in error please notify the sender and destroy it immediately.
 
Statements of intent shall only become binding when confirmed in hard copy by an authorised signatory. 
 
The contents of this email may relate to dealings with other companies under the control of BAE Systems plc details of which can be found at http://www.baesystems.com/Businesses/index.htm.
 
Detica Limited is a BAE Systems company trading as BAE Systems Detica.
Detica Limited is registered in England and Wales under No: 1337451.
Registered office: Surrey Research Park, Guildford, Surrey, GU2 7YP, England.


Re: Fundamentals Question on CoucheDB's append only b+tree.

Posted by Robert Newson <rn...@apache.org>.
There's no linked list running between the leafs. A b+tree doesn't
require one, though it's a common addition. The b+tree algorithm is a
revision over a binary tree (where inner nodes point strictly at one
left and one right item). To be a b+tree you need to hold many
pointers on an inner node.

B.

On 13 March 2012 17:26, Davies, Owain <Ow...@baesystemsdetica.com> wrote:
> Hi,
>
> I have been reading the couchDB guide, particularly the appendix on the
> power of b-trees. As I understand it couchdb uses a b+tree, in which the
> leaf nodes have pointers to their siblings to facilitate quick in-order
> and reverse in-order enumeration of the documents in the b+tree.
>
> When a document is modified (or added), then the document is appended to
> the file, the leaf node is cloned and updated to point to the new
> document. However in order to maintain the linked list between the
> sibling leaf nodes, you would need to update the adjacent link nodes
> (creating new leaf nodes appended onto the end of the file), this will
> eventually require that the whole tree is replicated, not just the node
> on the path to the effected leaf node.
>
> I do not think this can be the case, could somebody point out where I
> have made the mistake?
>
> Thanks,
>
> Owain
>
> Please consider the environment before printing this email.
>
> This message should be regarded as confidential. If you have received this email in error please notify the sender and destroy it immediately.
>
> Statements of intent shall only become binding when confirmed in hard copy by an authorised signatory.
>
> The contents of this email may relate to dealings with other companies under the control of BAE Systems plc details of which can be found at http://www.baesystems.com/Businesses/index.htm.
>
> Detica Limited is a BAE Systems company trading as BAE Systems Detica.
> Detica Limited is registered in England and Wales under No: 1337451.
> Registered office: Surrey Research Park, Guildford, Surrey, GU2 7YP, England.
>

Re: Fundamentals Question on CoucheDB's append only b+tree.

Posted by Bob Dionne <di...@dionne-associates.com>.
On Mar 13, 2012, at 1:26 PM, Davies, Owain wrote:

> Hi,
> 
> I have been reading the couchDB guide, particularly the appendix on the
> power of b-trees. As I understand it couchdb uses a b+tree, in which the
> leaf nodes have pointers to their siblings to facilitate quick in-order
> and reverse in-order enumeration of the documents in the b+tree.
> 
> When a document is modified (or added), then the document is appended to
> the file, the leaf node is cloned and updated to point to the new
> document. However in order to maintain the linked list between the
> sibling leaf nodes, you would need to update the adjacent link nodes
> (creating new leaf nodes appended onto the end of the file), this will
> eventually require that the whole tree is replicated, not just the node
> on the path to the effected leaf node.
> 
> I do not think this can be the case, could somebody point out where I
> have made the mistake? 

yes, the leaf nodes in couchdb are not linked

> 
> Thanks,
> 
> Owain
> 
> Please consider the environment before printing this email.
> 
> This message should be regarded as confidential. If you have received this email in error please notify the sender and destroy it immediately.
> 
> Statements of intent shall only become binding when confirmed in hard copy by an authorised signatory. 
> 
> The contents of this email may relate to dealings with other companies under the control of BAE Systems plc details of which can be found at http://www.baesystems.com/Businesses/index.htm.
> 
> Detica Limited is a BAE Systems company trading as BAE Systems Detica.
> Detica Limited is registered in England and Wales under No: 1337451.
> Registered office: Surrey Research Park, Guildford, Surrey, GU2 7YP, England.
>