Posted to dev@couchdb.apache.org by Riyad Kalla <rk...@gmail.com> on 2011/12/20 19:24:13 UTC

Understanding the CouchDB file format

I've been reading everything I can find on the CouchDB file format[1] and
am getting bits and pieces here and there, but not a great, concrete,
step-by-step explanation of the process.

I'm clear on the use of B+ trees and after reading a few papers on the
benefits of log-structured file formats, I understand the benefits of
inlining the B+ tree indices directly into the data file as well (locality
+ sequential I/O)... what I'm flummoxed about is how much of the B+ tree's
index is rewritten after every modified document.

Consider a CouchDB file that looks more or less like this:

[idx/header][doc1, rev1][idx/header][doc1, rev2]....

After each revised doc is written and the "b-tree root" is rewritten after
that, is that just a modified root node of the B+ tree or the entire B+
tree?

The reason I ask is because regardless of the answer to my previous
question, for a *huge* database with millions of records, that seems like
an enormous amount of data to rewrite after every modification. Say the
root node had a fan-out factor of 133; that would still be a lot of data
to rewrite.
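
For scale, here's the back-of-the-envelope arithmetic behind my worry (a
rough Python sketch; the document count, fan-out and node size are
illustrative guesses, not CouchDB's actual parameters):

    import math

    num_docs = 10_000_000   # documents in the database (illustrative)
    fanout = 133            # children per B+ tree node (illustrative)
    node_size = 1200        # rough bytes per serialized node (illustrative)

    # Depth of the tree from root to leaf.
    depth = math.ceil(math.log(num_docs, fanout))             # -> 4

    # Cost if only the root-to-leaf path is rewritten per update:
    path_cost = depth * node_size                             # ~4.8 KB

    # Cost if the *entire* tree were rewritten per update:
    total_nodes = sum(math.ceil(num_docs / fanout ** i)
                      for i in range(1, depth + 1))
    full_cost = total_nodes * node_size                       # ~90 MB

The gap between those two numbers is exactly what I'm trying to pin down.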

I am certain I am missing the boat on this; if anyone can pull me out of
the water and point me to dry land I'd appreciate it.

Best,
R



[1]
-- http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
-- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
-- http://blog.kodekabuki.com/post/132952897/couchdb-naked
-- http://guide.couchdb.org/editions/1/en/btree.html
-- http://ayende.com/blog/* (Over my head)

Re: Understanding the CouchDB file format

Posted by Riyad Kalla <rk...@gmail.com>.
Thank you Robert, fixed.

On Wed, Dec 21, 2011 at 1:42 PM, Robert Dionne <dionne@dionne-associates.com
> wrote:

> Riyad,
>
> You're welcome. At a quick glance your post has one error: internal nodes do
> contain values (from the reductions). The appendix in the CouchDB book also
> makes this error[1], which I've opened a ticket for.
>
> Cheers,
>
> Bob
>
>
> [1] https://github.com/oreilly/couchdb-guide/issues/450
>
>
>
>
> On Dec 21, 2011, at 3:28 PM, Riyad Kalla wrote:
>
> > Bob,
> >
> > Really appreciate the link; Rick has a handful of articles that helped a
> > lot.
> >
> > Alongside all the CouchDB reading I've been looking at SSD-optimized
> data
> > storage mechanisms and tried to coalesce all of this information into
> this
> > post on Couch's file storage format:
> > https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv
> >
> > It is uncanny how many things Couch seems to have gotten right with
> regard
> > to existing storage systems and future flash-based storage systems. I'd
> > appreciate any corrections, additions or feedback to the post for anyone
> > interested.
> >
> > Best,
> > R
> >
> > On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne <
> > dionne@dionne-associates.com> wrote:
> >
> >> I think this is largely correct Riyad, I dug out an old article[1] by
> Rick
> >> Ho that you may also find helpful though it might be slightly dated.
> >> Generally the best performance will be had if the ids are sequential and
> >> updates are done in bulk. Write heavy applications will eat up a lot of
> >> space and require compaction. At the leaf nodes what are stored are
> either
> >> full_doc_info records or doc_info records which store pointers to the
> data
> >> so the main things that impact the branching at each level are the key
> size
> >> and in the case of views the sizes of the reductions as these are stored
> >> with the intermediate nodes.
> >>
> >> All in all it works pretty well but as always you need to test and
> >> evaluate it for your specific case to see what the limits are.
> >>
> >> Regards,
> >>
> >> Bob
> >>
> >>
> >> [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html
> >>
> >>
> >>
> >>
> >> On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:
> >>
> >>> Adding to this conversation, I found this set of slides by Chris
> >> explaining
> >>> the append-only index update format:
> >>> http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
> >>>
> >>> Specifically slides 16, 17 and 18.
> >>>
> >>> Using this example tree, rewriting the updated path (in reverse order)
> >>> appended to the end of the file makes sense... you see how index
> queries
> >>> can simply read backwards from the end of the file and not only find
> the
> >>> latest revisions of docs, but also every other doc that wasn't touched
> >> (it
> >>> will just seek into the existing inner nodes of the b+ tree for
> >> searching).
> >>>
> >>> What I am hoping for clarification on is the following pain-point that
> I
> >>> perceive with this approach:
> >>>
> >>> 1. In a sufficiently shallow B+ tree (like CouchDB), the paths
> themselves
> >>> to elements are short (typically no more than 3 to 5 levels deep) as
> >>> opposed to a trie or some other construct that would have much longer
> >> paths
> >>> to elements.
> >>>
> >>> 2. Because the depth of the tree is so shallow, the breadth of it
> becomes
> >>> large to compensate... more specifically, each internal node can have
> >> 100s,
> >>> 1000s or more children. Using the example slides, consider the nodes
> >>> [A...M] and [R...Z] -- in a good sized CouchDB database, those internal
> >>> index nodes would have 100s (or more) elements in them pointing at
> deeper
> >>> internal nodes that themselves had thousands of elements; instead of
> the
> >> 13
> >>> or so as implied by [A...M].
> >>>
> >>> 3. Looking at slide 17 and 18, where you see the direct B+ tree path to
> >> the
> >>> update node getting appended to the end of the file after the revision
> is
> >>> written (leaf to root ordering: [J' M] -> [A M] -> [A Z]) it implies
> that
> >>> those internal nodes with *all* their child elements are getting
> >> rewritten
> >>> as well.
> >>>
> >>> In this example tree, it isn't such a big issue... but in a
> >> sufficiently
> >>> large CouchDB database, these nodes denoted by [A...M] and [A...Z]
> could
> >> be
> >>> quite large... I don't know the format of the node elements in the B+
> >> tree,
> >>> but it would be whatever the size of a node is times however many
> >> elements
> >>> are contained at each level (1 for root, say 100 for level 2, 1000 for
> >>> level 3 and 10,000 for level 4 -- there is a lot of hand-waving going
> on
> >>> here, of course it depends on the size of the data store).
> >>>
> >>> Am I missing something or is CouchDB really rewriting that much index
> >>> information between document revisions on every update?
> >>>
> >>> What was previously confusing me is I thought it was *only* rewriting a
> >>> direct path to the updated revision, like [B]>[E]>[J'] and Couch was
> >>> some-how patching in that updated path info to the B+ index at runtime.
> >>>
> >>> If couch is rewriting entire node paths with all their elements then I
> am
> >>> no longer confused about the B+ index updates, but am curious about the
> >>> on-disk cost of this.
> >>>
> >>> In my own rough insertion testing, that would explain why I see my
> >>> collections absolutely explode in size until they are compacted (not
> >> using
> >>> bulk insert, but intentionally doing single inserts for millions of
> >>> docs to see what kind of cost the index path duplication would be
> like).
> >>>
> >>> Can anyone confirm/deny/correct this assessment? I want to make sure I
> am
> >>> on the right track understanding this.
> >>>
> >>> Best wishes,
> >>> Riyad
> >>>
> >>> On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <rk...@gmail.com> wrote:
> >>>
> >>>> @Filipe - I was just not clear on how CouchDB operated; you and Robert
> >>>> cleared that up for me. Thank you.
> >>>>
> >>>> @Robert - The writeup is excellent so far (I am not familiar with
> >> erlang,
> >>>> so there is a bit of stickiness there), thank you for taking the time
> to
> >>>> put this together!
> >>>>
> >>>> At this point I am curious how the _id and _seq indices are read as
> >> their
> >>>> data is continually appended to the end of the data file in small
> >>>> diff-trees for every updated doc.
> >>>>
> >>>> If CouchDB kept all the indices in-memory and simply patched-in the
> >>>> updated paths at runtime (maybe something akin to memory-mapped
> indices
> >> in
> >>>> MongoDB) I would be fairly clear on the operation... but as I
> understand
> >>>> it, CouchDB keeps such a small memory footprint by doing no in-memory
> >>>> caching and relying on the intelligence of the OS and filesystem
> (and/or
> >>>> drives) to cache frequently accessed data.
> >>>>
> >>>> I am trying to understand the logic used by CouchDB to answer a query
> >>>> using the index once updates to the tree have been appended to the
> data
> >>>> file... for example, consider a CouchDB datastore like the one Filipe
> >>>> has... 10 million documents and let's say it is freshly compacted.
> >>>>
> >>>> If I send in a request to that Couch instance, it hits the header of
> the
> >>>> data file along with the index and walks the B+ tree to the leaf node,
> >>>> where it finds the offset into the data file where the actual doc
> >> lives...
> >>>> let's say 1,000,000 bytes away.
> >>>>
> >>>> These B+ trees are shallow, so it might look something like this:
> >>>>
> >>>> Level 1: 1 node, root node.
> >>>> Level 2: 100 nodes, inner child nodes
> >>>> Level 3: 10,000 nodes, inner child nodes
> >>>> Level 4: 1,000,000 leaf nodes... all with pointers to the data
> offsets
> >> in
> >>>> the data file.
> >>>>
> >>>> Now let's say I write 10 updates to documents in that file. There are
> 10
> >>>> new revisions appended to the end of the data file *each one*
> separated
> >> by
> >>>> a rewritten B+ path to a leaf node with its new location at the end
> of
> >> the
> >>>> file. Each of those paths written between each doc revision (say
> >> roughly 2k
> >>>> like Filipe mentioned) are just 4 item paths... root -> level1 ->
> >> level2 ->
> >>>> level3 --> level4... showing the discrete path from the root to that
> >>>> individual updated doc. The intermediary levels (l1, 2, 3) are not
> fully
> >>>> flushed out with all the OTHER children from the original b+ tree
> index.
> >>>>
> >>>> [[ is this correct so far? If not, please point out my mistakes...]
> >>>>
> >>>> Now I issue a query for a document that WAS NOT updated...
> >>>>
> >>>> **** this is where I get confused on the logic ***
> >>>>
> >>>> this would mean I need to access the original B+ tree index at the
> root
> >> of
> >>>> the data file, because the revised B+ paths that are written between
> >> each
> >>>> of the updated doc revisions at the end of the file are not full
> >> indices.
> >>>>
> >>>> NOW consider I want to query for one of the changed docs... now I
> >> suddenly
> >>>> need to scan backwards from the data file's end to find the updated
> >> path to
> >>>> the new revision of that document.
> >>>>
> >>>> (obviously) this isn't what Couch is actually doing... it's doing
> >>>> something more elegant, I just can't figure out what or how and that
> is
> >>>> what I was hoping for help with.
> >>>>
> >>>> Much thanks guys, I know this is a heavy question to ask.
> >>>>
> >>>> Best wishes,
> >>>> R
> >>>>
> >>>>
> >>>> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne <
> >>>> dionne@dionne-associates.com> wrote:
> >>>>
> >>>>>
> >>>>> Robert Dionne
> >>>>> Computer Programmer
> >>>>> dionne@dionne-associates.com
> >>>>> 203.231.9961
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
> >>>>>
> >>>>>> Filipe,
> >>>>>>
> >>>>>> Thank you for the reply.
> >>>>>>
> >>>>>> Maybe I am misunderstanding exactly what couch is writing out; the
> >> docs
> >>>>>> I've read say that it "rewrites the root node" -- I can't tell if
> the
> >>>>> docs
> >>>>>> mean the parent node of the child doc that was changed (as one of
> the
> >> b+
> >>>>>> leaves) or if it means the direct path, from the root, to that
> >>>>> individual
> >>>>>> doc... or if it means the *entire* index...
> >>>>>>
> >>>>>> In the case of even rewriting the single parent, with such a shallow
> >>>>> tree,
> >>>>>> each internal node will have a huge fan of children; let's say 1-10k
> in a
> >>>>>> decent sized data set.
> >>>>>>
> >>>>>> If you are seeing a few K of extra written out after each changed
> doc
> >>>>> then
> >>>>>> that cannot be right... I almost assumed my understanding was wrong
> >>>>> because
> >>>>>> the sheer volume of data would make performance abysmal if it was
> >> true.
> >>>>>>
> >>>>>> Given that... is it just the changed path, from the root to the new
> >> leaf
> >>>>>> that is rewritten?
> >>>>>
> >>>>> Hi Riyad,
> >>>>>
> >>>>> You are correct, it's only the changed path. Interestingly I've just
> >>>>> started to document all these internals[1] along with links to the
> >> code and
> >>>>> other references available.
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Bob
> >>>>>
> >>>>>
> >>>>> [1] http://bdionne.github.com/couchdb/
> >>>>>
> >>>>>> That makes me all sorts of curious as to how Couch
> >>>>>> updates/searches the new modified index with the small diff that is
> >>>>> written
> >>>>>> out.
> >>>>>>
> >>>>>> Any pointers to reading that will help me dig down on this (even
> early
> >>>>> bugs
> >>>>>> in JIRA?) would be appreciated. I've tried skimming back in 2007/08
> on
> >>>>>> Damien's blog to see if he wrote about it in depth and so far
> haven't
> >>>>> found
> >>>>>> anything as detailed as I am hoping for on this architecture.
> >>>>>>
> >>>>>> Best,
> >>>>>> Riyad
> >>>>>>
> >>>>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <
> >>>>> fdmanana@apache.org> wrote:
> >>>>>>
> >>>>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rk...@gmail.com>
> >> wrote:
> >>>>>>>> I've been reading everything I can find on the CouchDB file
> >> format[1]
> >>>>> and
> >>>>>>>> am getting bits and pieces here and there, but not a great,
> >> concrete,
> >>>>>>>> step-by-step explanation of the process.
> >>>>>>>>
> >>>>>>>> I'm clear on the use of B+ trees and after reading a few papers on
> >> the
> >>>>>>>> benefits of log-structured file formats, I understand the benefits
> >> of
> >>>>>>>> inlining the B+ tree indices directly into the data file as well
> >>>>>>> (locality
> >>>>>>>> + sequential I/O)... what I'm flummoxed about is how much of the
> B+
> >>>>>>> tree's
> >>>>>>>> index is rewritten after every modified document.
> >>>>>>>>
> >>>>>>>> Consider a CouchDB file that looks more or less like this:
> >>>>>>>>
> >>>>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
> >>>>>>>>
> >>>>>>>> After each revised doc is written and the "b-tree root" is
> rewritten
> >>>>>>> after
> >>>>>>>> that, is that just a modified root node of the B+ tree or the
> entire
> >>>>> B+
> >>>>>>>> tree?
> >>>>>>>>
> >>>>>>>> The reason I ask is because regardless of the answer to my
> previous
> >>>>>>>> question, for a *huge* database with millions of records, that
> seems
> >>>>> like
> >>>>>>>> an enormous amount of data to rewrite after every modification.
> Say
> >>>>> the
> >>>>>>>> root node had a fan-out factor of 133; that would still be a lot of
> >>>>> data
> >>>>>>> to
> >>>>>>>> rewrite.
> >>>>>>>
> >>>>>>> Hi Riyad,
> >>>>>>>
> >>>>>>> Have you observed that in practice?
> >>>>>>>
> >>>>>>> Typically the depth of database btrees is not that high even for
> >>>>>>> millions of documents. For example I have one around with about 10
> >>>>>>> million documents which doesn't have more than 5 or 6 levels if I
> >>>>>>> recall correctly.
> >>>>>>>
> >>>>>>> So updating a doc, for that particular case, means rewriting 5 or 6
> >>>>>>> new nodes plus the document itself. Each node is normally not much
> >>>>>>> bigger than 1.2 KB.
> >>>>>>>
> >>>>>>> I once wrote a tool to analyze database files which reports
> >> btree
> >>>>>>> depths; however, it hasn't been updated to work with recent changes on
> >>>>>>> master/1.2.x such as snappy compression and btree sizes:
> >>>>>>>
> >>>>>>> https://github.com/fdmanana/couchfoo
> >>>>>>>
> >>>>>>> It should work with CouchDB 1.1 (and older) database files.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> I am certain I am missing the boat on this; if anyone can pull me
> >> out
> >>>>> of
> >>>>>>>> the water and point me to dry land I'd appreciate it.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> R
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> [1]
> >>>>>>>> -- http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
> >>>>>>>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
> >>>>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
> >>>>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html
> >>>>>>>> -- http://ayende.com/blog/* (Over my head)
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Filipe David Manana,
> >>>>>>>
> >>>>>>> "Reasonable men adapt themselves to the world.
> >>>>>>> Unreasonable men adapt the world to themselves.
> >>>>>>> That's why all progress depends on unreasonable men."
> >>>>>>>
> >>>>>
> >>>>>
> >>>>
> >>
> >>
>
>

Re: Understanding the CouchDB file format

Posted by Robert Dionne <di...@dionne-associates.com>.
Riyad,

You're welcome. At a quick glance your post has one error: internal nodes do contain values (from the reductions). The appendix in the CouchDB book also makes this error[1], which I've opened a ticket for.

Cheers,

Bob


[1] https://github.com/oreilly/couchdb-guide/issues/450
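
To illustrate the distinction, here is a rough Python sketch of the two
node shapes (loosely modeled on the kp_node/kv_node terms in couch_btree;
the keys, offsets and reduction values below are made up):

    # Inner ("key/pointer") node: each entry carries a key, the file
    # offset of the child subtree, AND the reduction value computed over
    # that subtree -- so internal nodes do hold values.
    inner_node = ("kp_node", [
        # (last_key_in_subtree, child_file_offset, reduction)
        ("doc0499", 1048576, {"count": 500}),
        ("doc0999", 2097152, {"count": 500}),
    ])

    # Leaf ("key/value") node: keys paired with the actual values, e.g.
    # doc_info records pointing at the document bodies on disk.
    leaf_node = ("kv_node", [
        ("doc0001", {"offset": 4096, "rev": "1-abc"}),
        ("doc0002", {"offset": 8192, "rev": "1-def"}),
    ])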




On Dec 21, 2011, at 3:28 PM, Riyad Kalla wrote:

> Bob,
> 
> Really appreciate the link; Rick has a handful of articles that helped a
> lot.
> 
> Alongside all the CouchDB reading I've been looking at SSD-optimized data
> storage mechanisms and tried to coalesce all of this information into this
> post on Couch's file storage format:
> https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv
> 
> It is uncanny how many things Couch seems to have gotten right with regard
> to existing storage systems and future flash-based storage systems. I'd
> appreciate any corrections, additions or feedback to the post for anyone
> interested.
> 
> Best,
> R
> 
> On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne <
> dionne@dionne-associates.com> wrote:
> 
>> I think this is largely correct Riyad, I dug out an old article[1] by Rick
>> Ho that you may also find helpful though it might be slightly dated.
>> Generally the best performance will be had if the ids are sequential and
>> updates are done in bulk. Write heavy applications will eat up a lot of
>> space and require compaction. At the leaf nodes what are stored are either
>> full_doc_info records or doc_info records which store pointers to the data
>> so the main things that impact the branching at each level are the key size
>> and in the case of views the sizes of the reductions as these are stored
>> with the intermediate nodes.
>> 
>> All in all it works pretty well but as always you need to test and
>> evaluate it for your specific case to see what the limits are.
>> 
>> Regards,
>> 
>> Bob
>> 
>> 
>> [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>> 
>> 
>> 
>> 
>> On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:
>> 
>>> Adding to this conversation, I found this set of slides by Chris
>> explaining
>>> the append-only index update format:
>>> http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
>>> 
>>> Specifically slides 16, 17 and 18.
>>> 
>>> Using this example tree, rewriting the updated path (in reverse order)
>>> appended to the end of the file makes sense... you see how index queries
>>> can simply read backwards from the end of the file and not only find the
>>> latest revisions of docs, but also every other doc that wasn't touched
>> (it
>>> will just seek into the existing inner nodes of the b+ tree for
>> searching).
>>> 
>>> What I am hoping for clarification on is the following pain-point that I
>>> perceive with this approach:
>>> 
>>> 1. In a sufficiently shallow B+ tree (like CouchDB), the paths themselves
>>> to elements are short (typically no more than 3 to 5 levels deep) as
>>> opposed to a trie or some other construct that would have much longer
>> paths
>>> to elements.
>>> 
>>> 2. Because the depth of the tree is so shallow, the breadth of it becomes
>>> large to compensate... more specifically, each internal node can have
>> 100s,
>>> 1000s or more children. Using the example slides, consider the nodes
>>> [A...M] and [R...Z] -- in a good sized CouchDB database, those internal
>>> index nodes would have 100s (or more) elements in them pointing at deeper
>>> internal nodes that themselves had thousands of elements; instead of the
>> 13
>>> or so as implied by [A...M].
>>> 
>>> 3. Looking at slide 17 and 18, where you see the direct B+ tree path to
>> the
>>> update node getting appended to the end of the file after the revision is
>>> written (leaf to root ordering: [J' M] -> [A M] -> [A Z]) it implies that
>>> those internal nodes with *all* their child elements are getting
>> rewritten
>>> as well.
>>> 
>>> In this example tree, it isn't such a big issue... but in a
>> sufficiently
>>> large CouchDB database, these nodes denoted by [A...M] and [A...Z] could
>> be
>>> quite large... I don't know the format of the node elements in the B+
>> tree,
>>> but it would be whatever the size of a node is times however many
>> elements
>>> are contained at each level (1 for root, say 100 for level 2, 1000 for
>>> level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on
>>> here, of course it depends on the size of the data store).
>>> 
>>> Am I missing something or is CouchDB really rewriting that much index
>>> information between document revisions on every update?
>>> 
>>> What was previously confusing me is I thought it was *only* rewriting a
>>> direct path to the updated revision, like [B]>[E]>[J'] and Couch was
>>> some-how patching in that updated path info to the B+ index at runtime.
>>> 
>>> If couch is rewriting entire node paths with all their elements then I am
>>> no longer confused about the B+ index updates, but am curious about the
>>> on-disk cost of this.
>>> 
>>> In my own rough insertion testing, that would explain why I see my
>>> collections absolutely explode in size until they are compacted (not
>> using
>>> bulk insert, but intentionally doing single inserts for millions of
>>> docs to see what kind of cost the index path duplication would be like).
>>> 
>>> Can anyone confirm/deny/correct this assessment? I want to make sure I am
>>> on the right track understanding this.
>>> 
>>> Best wishes,
>>> Riyad
>>> 
>>> On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <rk...@gmail.com> wrote:
>>> 
>>>> @Filipe - I was just not clear on how CouchDB operated; you and Robert
>>>> cleared that up for me. Thank you.
>>>> 
>>>> @Robert - The writeup is excellent so far (I am not familiar with
>> erlang,
>>>> so there is a bit of stickiness there), thank you for taking the time to
>>>> put this together!
>>>> 
>>>> At this point I am curious how the _id and _seq indices are read as
>> their
>>>> data is continually appended to the end of the data file in small
>>>> diff-trees for every updated doc.
>>>> 
>>>> If CouchDB kept all the indices in-memory and simply patched-in the
>>>> updated paths at runtime (maybe something akin to memory-mapped indices
>> in
>>>> MongoDB) I would be fairly clear on the operation... but as I understand
>>>> it, CouchDB keeps such a small memory footprint by doing no in-memory
>>>> caching and relying on the intelligence of the OS and filesystem (and/or
>>>> drives) to cache frequently accessed data.
>>>> 
>>>> I am trying to understand the logic used by CouchDB to answer a query
>>>> using the index once updates to the tree have been appended to the data
>>>> file... for example, consider a CouchDB datastore like the one Filipe
>>>> has... 10 million documents and let's say it is freshly compacted.
>>>> 
>>>> If I send in a request to that Couch instance, it hits the header of the
>>>> data file along with the index and walks the B+ tree to the leaf node,
>>>> where it finds the offset into the data file where the actual doc
>> lives...
>>>> let's say 1,000,000 bytes away.
>>>> 
>>>> These B+ trees are shallow, so it might look something like this:
>>>> 
>>>> Level 1: 1 node, root node.
>>>> Level 2: 100 nodes, inner child nodes
>>>> Level 3: 10,000 nodes, inner child nodes
>>>> Level 4: 1,000,000 leaf nodes... all with pointers to the data offsets
>> in
>>>> the data file.
>>>> 
>>>> Now let's say I write 10 updates to documents in that file. There are 10
>>>> new revisions appended to the end of the data file *each one* separated
>> by
>>>> a rewritten B+ path to a leaf node with its new location at the end of
>> the
>>>> file. Each of those paths written between each doc revision (say
>> roughly 2k
>>>> like Filipe mentioned) are just 4 item paths... root -> level1 ->
>> level2 ->
>>>> level3 --> level4... showing the discrete path from the root to that
>>>> individual updated doc. The intermediary levels (l1, 2, 3) are not fully
>>>> flushed out with all the OTHER children from the original b+ tree index.
>>>> 
>>>> [[ is this correct so far? If not, please point out my mistakes...]
>>>> 
>>>> Now I issue a query for a document that WAS NOT updated...
>>>> 
>>>> **** this is where I get confused on the logic ***
>>>> 
>>>> this would mean I need to access the original B+ tree index at the root
>> of
>>>> the data file, because the revised B+ paths that are written between
>> each
>>>> of the updated doc revisions at the end of the file are not full
>> indices.
>>>> 
>>>> NOW consider I want to query for one of the changed docs... now I
>> suddenly
>>>> need to scan backwards from the data file's end to find the updated
>> path to
>>>> the new revision of that document.
>>>> 
>>>> (obviously) this isn't what Couch is actually doing... it's doing
>>>> something more elegant, I just can't figure out what or how and that is
>>>> what I was hoping for help with.
>>>> 
>>>> Much thanks guys, I know this is a heavy question to ask.
>>>> 
>>>> Best wishes,
>>>> R
>>>> 
>>>> 
>>>> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne <
>>>> dionne@dionne-associates.com> wrote:
>>>> 
>>>>> 
>>>>> Robert Dionne
>>>>> Computer Programmer
>>>>> dionne@dionne-associates.com
>>>>> 203.231.9961
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
>>>>> 
>>>>>> Filipe,
>>>>>> 
>>>>>> Thank you for the reply.
>>>>>> 
>>>>>> Maybe I am misunderstanding exactly what couch is writing out; the
>> docs
>>>>>> I've read say that it "rewrites the root node" -- I can't tell if the
>>>>> docs
>>>>>> mean the parent node of the child doc that was changed (as one of the
>> b+
>>>>>> leaves) or if it means the direct path, from the root, to that
>>>>> individual
>>>>>> doc... or if it means the *entire* index...
>>>>>> 
>>>>>> In the case of even rewriting the single parent, with such a shallow
>>>>> tree,
>>>>>> each internal node will have a huge fan of children; let's say 1-10k in a
>>>>>> decent sized data set.
>>>>>> 
>>>>>> If you are seeing a few K of extra written out after each changed doc
>>>>> then
>>>>>> that cannot be right... I almost assumed my understanding was wrong
>>>>> because
>>>>>> the sheer volume of data would make performance abysmal if it was
>> true.
>>>>>> 
>>>>>> Given that... is it just the changed path, from the root to the new
>> leaf
>>>>>> that is rewritten?
>>>>> 
>>>>> Hi Riyad,
>>>>> 
>>>>> You are correct, it's only the changed path. Interestingly I've just
>>>>> started to document all these internals[1] along with links to the
>> code and
>>>>> other references available.
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Bob
>>>>> 
>>>>> 
>>>>> [1] http://bdionne.github.com/couchdb/
>>>>> 
>>>>>> That makes me all sorts of curious as to how Couch
>>>>>> updates/searches the new modified index with the small diff that is
>>>>> written
>>>>>> out.
>>>>>> 
>>>>>> Any pointers to reading that will help me dig down on this (even early
>>>>> bugs
>>>>>> in JIRA?) would be appreciated. I've tried skimming back in 2007/08 on
>>>>>> Damien's blog to see if he wrote about it in depth and so far haven't
>>>>> found
>>>>>> anything as detailed as I am hoping for on this architecture.
>>>>>> 
>>>>>> Best,
>>>>>> Riyad
>>>>>> 
>>>>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <
>>>>> fdmanana@apache.org> wrote:
>>>>>> 
>>>>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rk...@gmail.com>
>> wrote:
>>>>>>>> I've been reading everything I can find on the CouchDB file
>> format[1]
>>>>> and
>>>>>>>> am getting bits and pieces here and there, but not a great,
>> concrete,
>>>>>>>> step-by-step explanation of the process.
>>>>>>>> 
>>>>>>>> I'm clear on the use of B+ trees and after reading a few papers on
>> the
>>>>>>>> benefits of log-structured file formats, I understand the benefits
>> of
>>>>>>>> inlining the B+ tree indices directly into the data file as well
>>>>>>> (locality
>>>>>>>> + sequential I/O)... what I'm flummoxed about is how much of the B+
>>>>>>> tree's
>>>>>>>> index is rewritten after every modified document.
>>>>>>>> 
>>>>>>>> Consider a CouchDB file that looks more or less like this:
>>>>>>>> 
>>>>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
>>>>>>>> 
>>>>>>>> After each revised doc is written and the "b-tree root" is rewritten
>>>>>>> after
>>>>>>>> that, is that just a modified root node of the B+ tree or the entire
>>>>> B+
>>>>>>>> tree?
>>>>>>>> 
>>>>>>>> The reason I ask is because regardless of the answer to my previous
>>>>>>>> question, for a *huge* database with millions of records, that seems
>>>>> like
>>>>>>>> an enormous amount of data to rewrite after every modification. Say
>>>>> the
>>>>>>>> root node had a fan-out factor of 133; that would still be a lot of
>>>>> data
>>>>>>> to
>>>>>>>> rewrite.
>>>>>>> 
>>>>>>> Hi Riyad,
>>>>>>> 
>>>>>>> Have you observed that in practice?
>>>>>>> 
>>>>>>> Typically the depth of database btrees is not that high even for
>>>>>>> millions of documents. For example I have one around with about 10
>>>>>>> million documents which doesn't have more than 5 or 6 levels if I
>>>>>>> recall correctly.
>>>>>>> 
>>>>>>> So updating a doc, for that particular case, means rewriting 5 or 6
>>>>>>> new nodes plus the document itself. Each node is normally not much
>>>>>>> bigger than 1.2 KB.
>>>>>>> 
>>>>>>> I once wrote a tool to analyze database files which reports
>> btree
>>>>>>> depths; however, it hasn't been updated to work with recent changes on
>>>>>>> master/1.2.x such as snappy compression and btree sizes:
>>>>>>> 
>>>>>>> https://github.com/fdmanana/couchfoo
>>>>>>> 
>>>>>>> It should work with CouchDB 1.1 (and older) database files.
>>>>>>> 
>>>>>>>> 
>>>>>>>> I am certain I am missing the boat on this; if anyone can pull me
>> out
>>>>> of
>>>>>>>> the water and point me to dry land I'd appreciate it.
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> R
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> [1]
>>>>>>>> -- http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
>>>>>>>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>>>>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
>>>>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html
>>>>>>>> -- http://ayende.com/blog/* (Over my head)
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Filipe David Manana,
>>>>>>> 
>>>>>>> "Reasonable men adapt themselves to the world.
>>>>>>> Unreasonable men adapt the world to themselves.
>>>>>>> That's why all progress depends on unreasonable men."
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 
>> 


Re: Understanding the CouchDB file format

Posted by Paul Davis <pa...@gmail.com>.
On Thu, Dec 22, 2011 at 8:53 PM, Riyad Kalla <rk...@gmail.com> wrote:
> Randall,
>
> Spot on; we are on the same page now. I'll go through the post again
> tomorrow morning and reword it a bit so the thoughts aren't so fragmented.
>
> Best,
> Riyad
>
> P.S.> Are there discussions anywhere on alternative file formats for the
> indices that the Couch community has considered in the past? (old JIRA or
> mailing list discussions) or has this file format been the only approach
> from the initial work started in 05/06?
>
> The reason I ask is because of the index fragmentation you mentioned that
> can occur over time... I'd be curious if an in-place-edited format for the
> indices as separate file(s) from the append-only document revisions would
> yield better performance over time. The splits would still be expensive,
> but you'd be able to pre-allocate your blocks for each node on each split
> to help a bit. Even leveraging an append-only, pre-allocated approach to
> the index might work where split nodes are appended to the end of the index
> file with the full node size (e.g. 133 elements) pre-allocated and the
> index marked ready for compaction or something.
>
> So it would be a hybrid approach... when splits aren't needed, you do an
> in-place edit to the index on-disk. If a split is needed, you use a
> technique similar to the one now where the affected node hierarchy is
> appended to the end of the file.
>
> You still run the risk of costly index rebuilds in the case of a failed
> in-place edit, but you wouldn't run the risk of any data loss... I am just
> wondering for large data sets if this would yield some significant benefits
> putting Couch somewhere between Mongo and today's Couch (performance-wise).
>
> Just pie-in-the-sky thinking... I am sure the senior team here has talked
> this stuff to death years ago. My apologies if this is re-treading road
> covered in beaten horse bodies ;)
>
> I just find this category of problems interesting.
>

The goal of the view engine refactoring was to encourage people to
start asking and answering exactly these types of questions. I got a
bit distracted by work stuff since but I've still got a plan and some
notes on a blog post about how to write new indexers. Hopefully in
time we'll have a wider array of secondary index types than just the
m/r style we have currently.

>
> On Thu, Dec 22, 2011 at 6:10 PM, Randall Leeds <ra...@gmail.com> wrote:
>
>> On Thu, Dec 22, 2011 at 12:11, Riyad Kalla <rk...@gmail.com> wrote:
>> > On Thu, Dec 22, 2011 at 12:38 PM, Robert Newson <rn...@apache.org>
>> wrote:
>> >
>> >> There are a
>> >> few parts of the article that are inaccurate (like the assertion we
>> >> have good locality for the id and seq trees. If this were true we
>> >> wouldn't have seen such a huge improvement in compaction by
>> >> temporarily separating them).
>> >
>> >
>> > I'd look forward to more detail on this... it was my understanding the
>> > updates were appended out in the [doc rev][_id idx diff][_seq idx diff]
>> > format at the end of the data file. Sounds like I may have misunderstood
>> > that?
>> >
>>
>> Riyad, as you pointed out earlier, all the inner nodes are rewritten
>> up to the root. The two btrees are not written in parallel, though,
>> which means that for deep trees all the updated nodes are written
>> before the other tree's nodes are written. Also remember that the
>> trees themselves end up pretty fragmented since older nodes that
>> haven't changed are back toward the beginning of the file. In general,
>> I'm not sure there's much that's useful to mention about locality in
>> the trees. Also, updating these trees requires reading the old values,
>> so there is still seeking that occurs (if the pages aren't cached by
>> the OS).
>>
>> >
>> >> The 'COMPETE recreation' paragraph also
>> >> strikes me as factually awry.
>> >>
>> >
>> > I'd appreciate a detailed correction on this if it is wrong; all the
>> > digging I've done (in this thread and other partial resources) suggests
>> > that the path from the changed doc ref back to the root (including a copy
> > of all internal nodes and all of their child references) is written so
> > that the index can be read back quickly from the tail of the data
> > file... specifically slides 17, 18 and 19 from this slidedeck (
>> > http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed) -- note
>> that
>> > the interim nodes [A..M] and [A..Z] are rewritten (including any and all
>> > child pointers they contained).
>> >
>> > This is what I was referring to; I should either clean up my wording
>> > (confusing) or I got it wrong in which case I'd appreciate any and all
>> > corrections.
>>
>> Right. It mostly seems a bit confusing to me.
>> "it DOES NOT just rewrite the nodes pathing from the leaf to the node
>> and ONLY the connections for that single document"
>> That doesn't sound quite right, but I can tell what you're trying to
>> say is accurate. If I'm right, you mean that every changed inner node
>> is rewritten in its entirety rather than having a single pointer to
>> the new child updated in place.
>>
>> Cheers. Thanks for taking the time to write this up.
>>
>> -Randall
>>

Re: Understanding the CouchDB file format

Posted by Riyad Kalla <rk...@gmail.com>.
Randall,

Spot on; we are on the same page now. I'll go through the post again
tomorrow morning and reword it a bit so the thoughts aren't so fragmented.

Best,
Riyad

P.S.> Are there discussions anywhere on alternative file formats for the
indices that the Couch community has considered in the past? (old JIRA or
mailing list discussions) or has this file format been the only approach
from the initial work started in 05/06?

The reason I ask is because of the index fragmentation you mentioned that
can occur over time... I'd be curious if an in-place-edited format for the
indices as separate file(s) from the append-only document revisions would
yield better performance over time. The splits would still be expensive,
but you'd be able to pre-allocate your blocks for each node on each split
to help a bit. Even leveraging an append-only, pre-allocated approach to
the index might work where split nodes are appended to the end of the index
file with the full node size (e.g. 133 elements) pre-allocated and the
index marked ready for compaction or something.

So it would be a hybrid approach... when splits aren't needed, you do an
in-place edit to the index on-disk. If a split is needed, you use a
technique similar to the one now where the affected node hierarchy is
appended to the end of the file.
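
Very roughly, the update rule I'm picturing (a hypothetical Python sketch;
none of this is what Couch does today, and the capacities, sizes and names
are made up):

    import os

    NODE_CAPACITY = 133   # pre-allocated element slots per node (made up)
    NODE_BYTES = 4096     # fixed on-disk slot size per node (made up)

    def write_node(f, offset, entries):
        """Serialize entries into a fixed-size, pre-allocated slot."""
        payload = repr(entries).encode().ljust(NODE_BYTES, b"\0")
        f.seek(offset)
        f.write(payload[:NODE_BYTES])

    def update_node(f, offset, entries, new_entry):
        """Hybrid rule: edit in place unless the node has to split."""
        if len(entries) < NODE_CAPACITY:
            entries.append(new_entry)     # no split: overwrite in place
            write_node(f, offset, entries)
            return [offset]
        mid = len(entries) // 2           # split: append two fresh nodes,
        offsets = []                      # each with a full empty slot
        for half in (entries[:mid], entries[mid:] + [new_entry]):
            f.seek(0, os.SEEK_END)
            offsets.append(f.tell())
            write_node(f, offsets[-1], half)
        return offsets                    # old node awaits compaction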

You still run the risk of costly index rebuilds in the case of a failed
in-place edit, but you wouldn't run the risk of any data loss... I am just
wondering for large data sets if this would yield some significant benefits
putting Couch somewhere between Mongo and today's Couch (performance-wise).

Just pie-in-the-sky thinking... I am sure the senior team here has talked
this stuff to death years ago. My apologies if this is re-treading road
covered in beaten horse bodies ;)

I just find this category of problems interesting.


On Thu, Dec 22, 2011 at 6:10 PM, Randall Leeds <ra...@gmail.com> wrote:

> On Thu, Dec 22, 2011 at 12:11, Riyad Kalla <rk...@gmail.com> wrote:
> > On Thu, Dec 22, 2011 at 12:38 PM, Robert Newson <rn...@apache.org>
> wrote:
> >
> >> There are a
> >> few parts of the article that are inaccurate (like the assertion we
> >> have good locality for the id and seq trees. If this were true we
> >> wouldn't have seen such a huge improvement in compaction by
> >> temporarily separating them).
> >
> >
> > I'd look forward to more detail on this... it was my understanding the
> > updates were appended out in the [doc rev][_id idx diff][_seq idx diff]
> > format at the end of the data file. Sounds like I may have misunderstood
> > that?
> >
>
> Riyad, as you pointed out earlier, all the inner nodes are rewritten
> up to the root. The two btrees are not written in parallel, though,
> which means that for deep trees all the updated nodes are written
> before the other tree's nodes are written. Also remember that the
> trees themselves end up pretty fragmented since older nodes that
> haven't changed are back toward the beginning of the file. In general,
> I'm not sure there's much that's useful to mention about locality in
> the trees. Also, updating these trees requires reading the old values,
> so there is still seeking that occurs (if the pages aren't cached by
> the OS).
>
> >
> >> The 'COMPETE recreation' paragraph also
> >> strikes me as factually awry.
> >>
> >
> > I'd appreciate a detailed correction on this if it is wrong; all the
> > digging I've done (in this thread and other partial resources) suggests
> > that the path from the changed doc ref back to the root (including a copy
> > of all internal nodes and all of their child references) is written so
> > that the index can be read back quickly from the tail of the data
> > file... specifically slides 17, 18 and 19 from this slidedeck (
> > http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed) -- note
> that
> > the interim nodes [A..M] and [A..Z] are rewritten (including any and all
> > child pointers they contained).
> >
> > This is what I was referring to; I should either clean up my wording
> > (confusing) or I got it wrong in which case I'd appreciate any and all
> > corrections.
>
> Right. It mostly seems a bit confusing to me.
> "it DOES NOT just rewrite the nodes pathing from the leaf to the node
> and ONLY the connections for that single document"
> That doesn't sound quite right, but I can tell what you're trying to
> say is accurate. If I'm right, you mean that every changed inner node
> is rewritten in its entirety rather than having a single pointer to
> the new child updated in place.
>
> Cheers. Thanks for taking the time to write this up.
>
> -Randall
>

Re: Understanding the CouchDB file format

Posted by Randall Leeds <ra...@gmail.com>.
On Thu, Dec 22, 2011 at 12:11, Riyad Kalla <rk...@gmail.com> wrote:
> On Thu, Dec 22, 2011 at 12:38 PM, Robert Newson <rn...@apache.org> wrote:
>
>> There are a
>> few parts of the article that are inaccurate (like the assertion we
>> have good locality for the id and seq trees. If this were true we
>> wouldn't have seen such a huge improvement in compaction by
>> temporarily separating them).
>
>
> I'd look forward to more detail on this... it was my understanding the
> updates were appended out in the [doc rev][_id idx diff][_seq idx diff]
> format at the end of the data file. Sounds like I may have misunderstood
> that?
>

Riyad, as you pointed out earlier, all the inner nodes are rewritten
up to the root. The two btrees are not written in parallel, though,
which means that for deep trees all the updated nodes are written
before the other tree's nodes are written. Also remember that the
trees themselves end up pretty fragmented since older nodes that
haven't changed are back toward the beginning of the file. In general,
I'm not sure there's much that's useful to mention about locality in
the trees. Also, updating these trees requires reading the old values,
so there is still seeking that occurs (if the pages aren't cached by
the OS).
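
A minimal sketch of that path-copy rule (toy Python; real CouchDB nodes
are Erlang terms that also carry reductions and size bookkeeping, and the
"offsets" here are just list indices):

    # The .couch file modeled as an append-only list of nodes.
    log = [{"type": "leaf", "entries": [("a", 1), ("b", 2)]}]

    def append(node):
        log.append(node)
        return len(log) - 1               # the node's "file offset"

    def update(root_off, key, value):
        """Append-only update: returns the offset of the NEW root."""
        node = log[root_off]
        if node["type"] == "leaf":
            entries = dict(node["entries"])
            entries[key] = value
            return append({"type": "leaf",
                           "entries": sorted(entries.items())})
        # An inner node is rewritten whole, but only the ONE child pointer
        # on the updated path changes; every other child is reused at its
        # old offset, so old file regions stay live and the tree fragments.
        children = list(node["children"])  # [(last_key, child_off), ...]
        for i, (last_key, child_off) in enumerate(children):
            if key <= last_key or i == len(children) - 1:
                children[i] = (max(key, last_key),
                               update(child_off, key, value))
                break
        return append({"type": "inner", "children": children})

    root = 0
    root = update(root, "c", 3)            # every update yields a new root

Each update leaves a new root at the file's tail while untouched subtrees
stay at their old offsets earlier in the file, which is the fragmentation
described above.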

>
>> The 'COMPETE recreation' paragraph also
>> strikes me as factually awry.
>>
>
> I'd appreciate a detailed correction on this if it is wrong; all the
> digging I've done (in this thread and other partial resources) suggests
> that the path from the changed doc ref back to the root (including a copy
> of all internal nodes and all of their child references) is written so
> that the index can be read back quickly from the tail of the data
> file... specifically slides 17, 18 and 19 from this slidedeck (
> http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed) -- note that
> the interim nodes [A..M] and [A..Z] are rewritten (including any and all
> child pointers they contained).
>
> This is what I was referring to; I should either clean up my wording
> (confusing) or I got it wrong in which case I'd appreciate any and all
> corrections.

Right. It mostly seems a bit confusing to me.
"it DOES NOT just rewrite the nodes pathing from the leaf to the node
and ONLY the connections for that single document"
That doesn't sound quite right, but I can tell what you're trying to
say is accurate. If I'm right, you mean that every changed inner node
is rewritten in its entirety rather than having a single pointer to
the new child updated in place.

Cheers. Thanks for taking the time to write this up.

-Randall

Re: Understanding the CouchDB file format

Posted by Riyad Kalla <rk...@gmail.com>.
On Thu, Dec 22, 2011 at 12:38 PM, Robert Newson <rn...@apache.org> wrote:

> It reads well as an article but needs some polish before it could be a
> great wiki page. I suggest that if it does go up, it is clearly marked
> as a draft, and we all chip in to sculpt it into shape.
>

Great idea. Agreed that it is nowhere near as clean/factual as it would need to
be for an appropriate wiki/ref-doc-style entry, but if it can be massaged
into that, it would hopefully make a great resource.


> There are a
> few parts of the article that are inaccurate (like the assertion we
> have good locality for the id and seq trees. If this were true we
> wouldn't have seen such a huge improvement in compaction by
> temporarily separating them).


I'd look forward to more detail on this... it was my understanding the
updates were appended out in the [doc rev][_id idx diff][_seq idx diff]
format at the end of the data file. Sounds like I may have misunderstood
that?


> The 'COMPETE recreation' paragraph also
> strikes me as factually awry.
>

I'd appreciate a detailed correction on this if it is wrong; all the
digging I've done (in this thread and other partial resources) suggests
that the path from the changed doc ref back to the root (including a copy
of all internal nodes and all of their child references) is written so
that the index can be read back quickly from the tail of the data
file... specifically slides 17, 18 and 19 from this slidedeck (
http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed) -- note that
the interim nodes [A..M] and [A..Z] are rewritten (including any and all
child pointers they contained).

This is what I was referring to; I should either clean up my wording
(confusing) or I got it wrong in which case I'd appreciate any and all
corrections.
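
To make the read-back part concrete, here's my mental model of the open
path as a toy Python sketch (CouchDB aligns headers on 4 KB block
boundaries and scans backwards from the end of the file for the newest
intact one; the byte-level details below are simplified guesses):

    BLOCK = 4096  # headers sit on 4 KB block boundaries

    def find_latest_header(f):
        """Scan backwards from EOF for the newest intact header block."""
        f.seek(0, 2)
        pos = (f.tell() // BLOCK) * BLOCK       # last block boundary
        while pos >= 0:
            f.seek(pos)
            block = f.read(BLOCK)
            # A leading marker byte tags header blocks vs. data blocks
            # (simplified; the real format also checksums the header
            # before trusting it).
            if block[:1] == b"\x01" and header_is_valid(block):
                return pos   # the btree roots are recorded in this header
            pos -= BLOCK     # torn/partial write: step back one block
        raise IOError("no valid header found")

    def header_is_valid(block):
        return True          # placeholder for the real checksum check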

Thanks for helping move this forward and the feedback Robert.

-Riyad

Re: Understanding the CouchDB file format

Posted by Robert Newson <rn...@apache.org>.
It reads well as an article but needs some polish before it could be a
great wiki page. I suggest that if it does go up, it is clearly marked
as a draft, and we all chip in to sculpt it into shape.

Particularly, the author is very enthusiastic but this mars the
article (the all-caps, the ad-hoc methods of emphasis). There are a
few parts of the article that are inaccurate (like the assertion we
have good locality for the id and seq trees. If this were true we
wouldn't have seen such a huge improvement in compaction by
temporarily separating them). The 'COMPETE recreation' paragraph also
strikes me as factually awry.

B.

On 22 December 2011 18:52, Riyad Kalla <rk...@gmail.com> wrote:
> Jan,
>
> Thank you and yes, I'd be happy to contribute this to the wiki.
>
> I made some edits early this morning after some feedback. If a few folks
> in-the-know could give it a quick read-through to make sure I didn't get
> anything wrong then I'm happy to work it up on the wiki as well (or send it
> along to the wiki's editor for inclusion).
>
> Just point the way.
>
> Best,
> Riyad
>
> On Thu, Dec 22, 2011 at 2:21 AM, Jan Lehnardt <ja...@apache.org> wrote:
>
>> Good writeup! Would you consider contributing it to the CouchDB Wiki?
>>
>>  http://wiki.apache.org/couchdb/
>>
>> Cheers
>> Jan
>> --
>>
>> On Dec 21, 2011, at 21:28 , Riyad Kalla wrote:
>>
>> > Bob,
>> >
>> > Really appreciate the link; Rick has a handful of articles that helped a
>> > lot.
>> >
>> > Along side all the CouchDB reading I've been looking at SSD-optimized
>> data
>> > storage mechanisms and tried to coalesce all of this information into
>> this
>> > post on Couch's file storage format:
>> > https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv
>> >
>> > It is uncanny how many things Couch seems to have gotten right with
>> regard
>> > to existing storage systems and future flash-based storage systems. I'd
>> > appreciate any corrections, additions or feedback to the post for anyone
>> > interested.
>> >
>> > Best,
>> > R
>> >
>> > On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne <
>> > dionne@dionne-associates.com> wrote:
>> >
>> >> I think this is largely correct Riyad, I dug out an old article[1] by
>> Rick
>> >> Ho that you may also find helpful though it might be slightly dated.
>> >> Generally the best performance will be had if the ids are sequential and
>> >> updates are done in bulk. Write heavy applications will eat up a lot of
>> >> space and require compaction. At the leaf nodes what are stored are
>> either
>> >> full_doc_info records or doc_info records which store pointers to the
>> data
>> >> so the main things that impact the branching at each level are the key
>> size
>> >> and in the case of views the sizes of the reductions as these are stored
>> >> with the intermediate nodes.
>> >>
>> >> All in all it works pretty well but as always you need to test and
>> >> evaluate it for your specific case to see what the limits are.
>> >>
>> >> Regards,
>> >>
>> >> Bob
>> >>
>> >>
>> >> [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>> >>
>> >>
>> >>
>> >>
>> >> On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:
>> >>
>> >>> Adding to this conversation, I found this set of slides by Chris
>> >> explaining
>> >>> the append-only index update format:
>> >>> http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
>> >>>
>> >>> Specifically slides 16, 17 and 18.
>> >>>
>> >>> Using this example tree, rewriting the updated path (in reverse order)
>> >>> appended to the end of the file makes sense... you see how index
>> queries
>> >>> can simply read backwards from the end of the file and not only find
>> the
>> >>> latest revisions of docs, but also every other doc that wasn't touched
>> >> (it
>> >>> will just seek into the existing inner nodes of the b+ tree for
>> >> searching).
>> >>>
>> >>> What I am hoping for clarification on is the following pain-point that
>> I
>> >>> perceive with this approach:
>> >>>
>> >>> 1. In a sufficiently shallow B+ tree (like CouchDB), the paths
>> themselves
>> >>> to elements are short (typically no more than 3 to 5 levels deep) as
>> >>> opposed to a trie or some other construct that would have much longer
>> >> paths
>> >>> to elements.
>> >>>
>> >>> 2. Because the depth of the tree is so shallow, the breadth of it
>> becomes
>> >>> large to compensate... more specifically, each internal node can have
>> >> 100s,
>> >>> 1000s or more children. Using the example slides, consider the nodes
>> >>> [A...M] and [R...Z] -- in a good sized CouchDB database, those internal
>> >>> index nodes would have 100s (or more) elements in them pointing at
>> deeper
>> >>> internal nodes that themselves had thousands of elements; instead of
>> the
>> >> 13
>> >>> or so as implied by [A...M].
>> >>>
>> >>> 3. Looking at slide 17 and 18, where you see the direct B+ tree path to
>> >> the
>> >>> update node getting appended to the end of the file after the revision
>> is
>> >>> written (leaf to root ordering: [J' M] -> [A M] -> [A Z]) it implies
>> that
>> >>> those internal nodes with *all* their child elements are getting
>> >> rewritten
>> >>> as well.
>> >>>
>> >>> In this example tree, it isn't such a big issue... but in a
>> >> sufficiently
>> >>> large CouchDB database, these nodes denoted by [A...M] and [A...Z]
>> could
>> >> be
>> >>> quite large... I don't know the format of the node elements in the B+
>> >> tree,
>> >>> but it would be whatever the size of a node is times however many
>> >> elements
>> >>> are contained at each level (1 for root, say 100 for level 2, 1000 for
>> >>> level 3 and 10,000 for level 4 -- there is a lot of hand-waving going
>> on
>> >>> here, of course it depends on the size of the data store).
>> >>>
>> >>> Am I missing something or is CouchDB really rewriting that much index
>> >>> information between document revisions on every update?
>> >>>
>> >>> What was previously confusing me is I thought it was *only* rewriting a
>> >>> direct path to the updated revision, like [B]>[E]>[J'] and Couch was
>> >>> some-how patching in that updated path info to the B+ index at runtime.
>> >>>
>> >>> If couch is rewriting entire node paths with all their elements then I
>> am
>> >>> no longer confused about the B+ index updates, but am curious about the
>> >>> on-disk cost of this.
>> >>>
>> >>> In my own rough insertion testing, that would explain why I see my
>> >>> collections absolutely explode in size until they are compacted (not
>> >> using
>> >>> bulk insert, but intentionally doing single inserts for millions of
>> >>> docs to see what kind of cost the index path duplication would be
>> like).
>> >>>
>> >>> Can anyone confirm/deny/correct this assessment? I want to make sure I
>> am
>> >>> on the right track understanding this.
>> >>>
>> >>> Best wishes,
>> >>> Riyad
>> >>>
>> >>> On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <rk...@gmail.com> wrote:
>> >>>
>> >>>> @Filipe - I was just not clear on how CouchDB operated; you and Robert
>> >>>> cleared that up for me. Thank you.
>> >>>>
>> >>>> @Robert - The writeup is excellent so far (I am not familiar with
>> >> erlang,
>> >>>> so there is a bit of stickiness there), thank you for taking the time
>> to
>> >>>> put this together!
>> >>>>
>> >>>> At this point I am curious how the _id and _seq indices are read as
>> >> their
>> >>>> data is continually appended to the end of the data file in small
>> >>>> diff-trees for every updated doc.
>> >>>>
>> >>>> If CouchDB kept all the indices in-memory and simply patched-in the
>> >>>> updated paths at runtime (maybe something akin to memory-mapped
>> indices
>> >> in
>> >>>> MongoDB) I would be fairly clear on the operation... but as I
>> understand
>> >>>> it, CouchDB keeps such a small memory footprint by doing no in-memory
>> >>>> caching and relying on the intelligence of the OS and filesystem
>> (and/or
>> >>>> drives) to cache frequently accessed data.
>> >>>>
>> >>>> I am trying to understand the logic used by CouchDB to answer a query
>> >>>> using the index once updates to the tree have been appended to the
>> data
>> >>>> file... for example, consider a CouchDB datastore like the one Filipe
>> >>>> has... 10 million documents and let's say it is freshly compacted.
>> >>>>
>> >>>> If I send in a request to that Couch instance, it hits the header of
>> the
>> >>>> data file along with the index and walks the B+ tree to the leaf node,
>> >>>> where it finds the offset into the data file where the actual doc
>> >> lives...
>> >>>> let's say 1,000,000 bytes away.
>> >>>>
>> >>>> These B+ trees are shallow, so it might look something like this:
>> >>>>
>> >>>> Level 1: 1 node, root node.
>> >>>> Level 2: 100 nodes, inner child nodes
>> >>>> Level 3: 10,000 nodes, inner child nodes
>> >>>> Level 4: 1,000,000 leaf nodes... all with pointers to the data
>> offsets
>> >> in
>> >>>> the data file.
>> >>>>
>> >>>> Now let's say I write 10 updates to documents in that file. There are
>> 10
>> >>>> new revisions appended to the end of the data file *each one*
>> separated
>> >> by
>> >>>> a rewritten B+ path to a leaf node with its new location at the end
>> of
>> >> the
>> >>>> file. Each of those paths written between each doc revision (say
>> >> roughly 2k
>> >>>> like Filipe mentioned) are just 4 item paths... root -> level1 ->
>> >> level2 ->
>> >>>> level3 --> level4... showing the discrete path from the root to that
>> >>>> individual updated doc. The intermediary levels (l1, 2, 3) are not
>> fully
>> >>>> flushed out with all the OTHER children from the original b+ tree
>> index.
>> >>>>
>> >>>> [[ is this correct so far? If not, please point out my mistakes...]
>> >>>>
>> >>>> Now I issue a query for a document that WAS NOT updated...
>> >>>>
>> >>>> **** this is where I get confused on the logic ***
>> >>>>
>> >>>> this would mean I need to access the original B+ tree index at the
>> root
>> >> of
>> >>>> the data file, because the revised B+ paths that are written between
>> >> each
>> >>>> of the updated doc revisions at the end of the file are not full
>> >> indices.
>> >>>>
>> >>>> NOW consider I want to query for one of the changed docs... now I
>> >> suddenly
>> >>>> need to scan backwards from the data file's end to find the updated
>> >> path to
>> >>>> the new revision of that document.
>> >>>>
>> >>>> (obviously) this isn't what Couch is actually doing... it's doing
>> >>>> something more elegant, I just can't figure out what or how and that
>> is
>> >>>> what I was hoping for help with.
>> >>>>
>> >>>> Much thanks guys, I know this is a heavy question to ask.
>> >>>>
>> >>>> Best wishes,
>> >>>> R
>> >>>>
>> >>>>
>> >>>> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne <
>> >>>> dionne@dionne-associates.com> wrote:
>> >>>>
>> >>>>>
>> >>>>> Robert Dionne
>> >>>>> Computer Programmer
>> >>>>> dionne@dionne-associates.com
>> >>>>> 203.231.9961
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
>> >>>>>
>> >>>>>> Filipe,
>> >>>>>>
>> >>>>>> Thank you for the reply.
>> >>>>>>
>> >>>>>> Maybe I am misunderstanding exactly what couch is writing out; the
>> >> docs
>> >>>>>> I've read say that it "rewrites the root node" -- I can't tell if
>> the
>> >>>>> docs
>> >>>>>> mean the parent node of the child doc that was changed (as one of
>> the
>> >> b+
>> >>>>>> leaves) or if it means the direct path, from the root, to that
>> >>>>> individual
>> >>>>>> doc... or if it means the *entire* index...
>> >>>>>>
>> >>>>>> In the case of even rewriting the single parent, with such a shallow
>> >>>>> tree,
>> >>>>>> each internal node will have a huge fan of children; let's say 1-10k
>> in a
>> >>>>>> decent sized data set.
>> >>>>>>
>> >>>>>> If you are seeing a few K of extra written out after each changed
>> doc
>> >>>>> then
>> >>>>>> that cannot be right... I almost assumed my understanding was wrong
>> >>>>> because
>> >>>>>> the sheer volume of data would make performance abysmal if it was
>> >> true.
>> >>>>>>
>> >>>>>> Given that... is it just the changed path, from the root to the new
>> >> leaf
>> >>>>>> that is rewritten?
>> >>>>>
>> >>>>> Hi Riyad,
>> >>>>>
>> >>>>> You are correct, it's only the changed path. Interestingly I've just
>> >>>>> started to document all these internals[1] along with links to the
>> >> code and
>> >>>>> other references available.
>> >>>>>
>> >>>>> Cheers,
>> >>>>>
>> >>>>> Bob
>> >>>>>
>> >>>>>
>> >>>>> [1] http://bdionne.github.com/couchdb/
>> >>>>>
>> >>>>>> That makes me all sorts of curious as to how Couch
>> >>>>>> updates/searches the new modified index with the small diff that is
>> >>>>> written
>> >>>>>> out.
>> >>>>>>
>> >>>>>> Any pointers to reading that will help me dig down on this (even
>> early
>> >>>>> bugs
>> >>>>>> in JIRA?) would be appreciated. I've tried skimming back in 2007/08
>> on
>> >>>>>> Damien's blog to see if he wrote about it in depth and so far
>> haven't
>> >>>>> found
>> >>>>>> anything as detailed as I am hoping for on this architecture.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Riyad
>> >>>>>>
>> >>>>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <
>> >>>>> fdmanana@apache.org>wrote:
>> >>>>>>
>> >>>>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rk...@gmail.com>
>> >> wrote:
>> >>>>>>>> I've been reading everything I can find on the CouchDB file
>> >> format[1]
>> >>>>> and
>> >>>>>>>> am getting bits and pieces here and there, but not a great,
>> >> concrete,
>> >>>>>>>> step-by-step explanation of the process.
>> >>>>>>>>
>> >>>>>>>> I'm clear on the use of B+ trees and after reading a few papers on
>> >> the
>> >>>>>>>> benefits of log-structured file formats, I understand the benefits
>> >> of
>> >>>>>>>> inlining the B+ tree indices directly into the data file as well
>> >>>>>>> (locality
>> >>>>>>>> + sequential I/O)... what I'm flummoxed about is how much of the
>> B+
>> >>>>>>> tree's
>> >>>>>>>> index is rewritten after every modified document.
>> >>>>>>>>
>> >>>>>>>> Consider a CouchDB file that looks more or less like this:
>> >>>>>>>>
>> >>>>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
>> >>>>>>>>
>> >>>>>>>> After each revised doc is written and the "b-tree root" is
>> rewritten
>> >>>>>>> after
>> >>>>>>>> that, is that just a modified root node of the B+ tree or the
>> entire
>> >>>>> B+
>> >>>>>>>> tree?
>> >>>>>>>>
>> >>>>>>>> The reason I ask is because regardless of the answer to my
>> previous
>> >>>>>>>> question, for a *huge* database with millions of records, that
>> seems
>> >>>>> like
>> >>>>>>>> an enormous amount of data to rewrite after every modification.
>> Say
>> >>>>> the
>> >>>>>>>> root node had a fanning factor of 133; that would still be a lot of
>> >>>>> data
>> >>>>>>> to
>> >>>>>>>> rewrite.
>> >>>>>>>
>> >>>>>>> Hi Riyad,
>> >>>>>>>
>> >>>>>>> Have you observed that in practice?
>> >>>>>>>
>> >>>>>>> Typically the depth of database btrees is not that high even for
>> >>>>>>> millions of documents. For example I have one around with about 10
>> >>>>>>> million documents which doesn't have more than 5 or 6 levels if I
>> >>>>>>> recall correctly.
>> >>>>>>>
>> >>>>>>> So updating a doc, for that particular case, means rewriting 5 or 6
>> >>>>>>> new nodes plus the document itself. Each node is normally not much
>> >>>>>>> bigger than 1.2Kb.
>> >>>>>>>
>> >>>>>>> I once wrote a tool to analyze database files which reports
>> >> btree
>> >>>>>>> depths, however it's not updated to work with recent changes on
>> >>>>>>> master/1.2.x such as snappy compression and btree sizes:
>> >>>>>>>
>> >>>>>>> https://github.com/fdmanana/couchfoo
>> >>>>>>>
>> >>>>>>> It should work with CouchDB 1.1 (and older) database files.
>> >>>>>>>
>> >>>>>>>>
>> >>>>>>>> I am certain I am missing the boat on this; if anyone can pull me
>> >> out
>> >>>>> of
>> >>>>>>>> the water and point me to dry land I'd appreciate it.
>> >>>>>>>>
>> >>>>>>>> Best,
>> >>>>>>>> R
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> [1]
>> >>>>>>>> --
>> >>>>>>>>
>> >>>>>>>
>> >>>>>
>> >>
>> http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
>> >>>>>>>> --
>> http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>> >>>>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
>> >>>>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html
>> >>>>>>>> -- http://ayende.com/blog/* (Over my head)
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Filipe David Manana,
>> >>>>>>>
>> >>>>>>> "Reasonable men adapt themselves to the world.
>> >>>>>>> Unreasonable men adapt the world to themselves.
>> >>>>>>> That's why all progress depends on unreasonable men."
>> >>>>>>>
>> >>>>>
>> >>>>>
>> >>>>
>> >>
>> >>
>>
>>

Re: Understanding the CouchDB file format

Posted by Riyad Kalla <rk...@gmail.com>.
Jan,

Thank you and yes, I'd be happy to contribute this to the wiki.

I made some edits early this morning after some feedback. If a few folks
in-the-know could give it a quick read-through to make sure I didn't get
anything wrong then I'm happy to work it up on the wiki as well (or send it
along to the wiki's editor for inclusion).

Just point the way.

Best,
Riyad

On Thu, Dec 22, 2011 at 2:21 AM, Jan Lehnardt <ja...@apache.org> wrote:

> Good writeup! Would you consider contributing it to the CouchDB Wiki?
>
>  http://wiki.apache.org/couchdb/
>
> Cheers
> Jan
> --
>
> On Dec 21, 2011, at 21:28 , Riyad Kalla wrote:
>
> > Bob,
> >
> > Really appreciate the link; Rick has a handful of articles that helped a
> > lot.
> >
> > Alongside all the CouchDB reading I've been looking at SSD-optimized
> data
> > storage mechanisms and tried to coalesce all of this information into
> this
> > post on Couch's file storage format:
> > https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv
> >
> > It is uncanny how many things Couch seems to have gotten right with
> regard
> > to existing storage systems and future flash-based storage systems. I'd
> > appreciate any corrections, additions or feedback to the post for anyone
> > interested.
> >
> > Best,
> > R
> >
> > On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne <
> > dionne@dionne-associates.com> wrote:
> >
> >> I think this is largely correct Riyad, I dug out an old article[1] by
> Rick
> >> Ho that you may also find helpful though it might be slightly dated.
> >> Generally the best performance will be had if the ids are sequential and
> >> updates are done in bulk. Write heavy applications will eat up a lot of
> >> space and require compaction. At the leaf nodes what are stored are
> either
> >> full_doc_info records or doc_info records which store pointers to the
> data
> >> so the main thing that impacts the branching at each level are the key
> size
> >> and in the case of views the sizes of the reductions as these are stored
> >> with the intermediate nodes.
> >>
> >> All in all it works pretty well but as always you need to test and
> >> evaluate it for your specific case to see what the limits are.
> >>
> >> Regards,
> >>
> >> Bob
> >>
> >>
> >> [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html
> >>
> >>
> >>
> >>
> >> On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:
> >>
> >>> Adding to this conversation, I found this set of slides by Chris
> >> explaining
> >>> the append-only index update format:
> >>> http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
> >>>
> >>> Specifically slides 16, 17 and 18.
> >>>
> >>> Using this example tree, rewriting the updated path (in reverse order)
> >>> appended to the end of the file makes sense... you see how index
> queries
> >>> can simply read backwards from the end of the file and not only find
> the
> >>> latest revisions of docs, but also every other doc that wasn't touched
> >> (it
> >>> will just seek into the existing inner nodes of the b+ tree for
> >> searching).
> >>>
> >>> What I am hoping for clarification on is the following pain-point that
> I
> >>> perceive with this approach:
> >>>
> >>> 1. In a sufficiently shallow B+ tree (like CouchDB), the paths
> themselves
> >>> to elements are short (typically no more than 3 to 5 levels deep) as
> >>> opposed to a trie or some other construct that would have much longer
> >> paths
> >>> to elements.
> >>>
> >>> 2. Because the depth of the tree is so shallow, the breadth of it
> becomes
> >>> large to compensate... more specifically, each internal node can have
> >> 100s,
> >>> 1000s or more children. Using the example slides, consider the nodes
> >>> [A...M] and [R...Z] -- in a good sized CouchDB database, those internal
> >>> index nodes would have 100s (or more) elements in them pointing at
> deeper
> >>> internal nodes that themselves had thousands of elements; instead of
> the
> >> 13
> >>> or so as implied by [A...M].
> >>>
> >>> 3. Looking at slide 17 and 18, where you see the direct B+ tree path to
> >> the
> >>> update node getting appended to the end of the file after the revision
> is
> >>> written (leaf to root ordering: [J' M] -> [A M] -> [A Z]) it implies
> that
> >>> those internal nodes with *all* their child elements are getting
> >> rewritten
> >>> as well.
> >>>
> >>> In this example tree, it isn't such a big issue... but in a
> >> sufficiently
> >>> large CouchDB database, these nodes denoted by [A...M] and [A...Z]
> could
> >> be
> >>> quite large... I don't know the format of the node elements in the B+
> >> tree,
> >>> but it would be whatever the size of a node is times however many
> >> elements
> >>> are contained at each level (1 for root, say 100 for level 2, 1000 for
> >>> level 3 and 10,000 for level 4 -- there is a lot of hand-waving going
> on
> >>> here, of course it depends on the size of the data store).
> >>>
> >>> Am I missing something or is CouchDB really rewriting that much index
> >>> information between document revisions on every update?
> >>>
> >>> What was previously confusing me is I thought it was *only* rewriting a
> >>> direct path to the updated revision, like [B]>[E]>[J'] and Couch was
> >>> somehow patching in that updated path info to the B+ index at runtime.
> >>>
> >>> If couch is rewriting entire node paths with all their elements then I
> am
> >>> no longer confused about the B+ index updates, but am curious about the
> >>> on-disk cost of this.
> >>>
> >>> In my own rough insertion testing, that would explain why I see my
> >>> collections absolutely explode in size until they are compacted (not
> >> using
> >>> bulk insert, but intentionally doing single inserts for millions of
> >>> docs to see what kind of cost the index path duplication would be
> like).
> >>>
> >>> Can anyone confirm/deny/correct this assessment? I want to make sure I
> am
> >>> on the right track understanding this.
> >>>
> >>> Best wishes,
> >>> Riyad
> >>>
> >>> On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <rk...@gmail.com> wrote:
> >>>
> >>>> @Filipe - I was just not clear on how CouchDB operated; you and Robert
> >>>> cleared that up for me. Thank you.
> >>>>
> >>>> @Robert - The writeup is excellent so far (I am not familiar with
> >> Erlang,
> >>>> so there is a bit of stickiness there), thank you for taking the time
> to
> >>>> put this together!
> >>>>
> >>>> At this point I am curious how the _id and _seq indices are read as
> >> their
> >>>> data is continually appended to the end of the data file in small
> >>>> diff-trees for every updated doc.
> >>>>
> >>>> If CouchDB kept all the indices in-memory and simply patched-in the
> >>>> updated paths at runtime (maybe something akin to memory-mapped
> indices
> >> in
> >>>> MongoDB) I would be fairly clear on the operation... but as I
> understand
> >>>> it, CouchDB keeps such a small memory footprint by doing no in-memory
> >>>> caching and relying on the intelligence of the OS and filesystem
> (and/or
> >>>> drives) to cache frequently accessed data.
> >>>>
> >>>> I am trying to understand the logic used by CouchDB to answer a query
> >>>> using the index once updates to the tree have been appended to the
> data
> >>>> file... for example, consider a CouchDB datastore like the one Filipe
> >>>> has... 10 million documents and let's say it is freshly compacted.
> >>>>
> >>>> If I send in a request to that Couch instance, it hits the header of
> the
> >>>> data file along with the index and walks the B+ tree to the leaf node,
> >>>> where it finds the offset into the data file where the actual doc
> >> lives...
> >>>> let's say 1,000,000 bytes away.
> >>>>
> >>>> These B+ trees are shallow, so it might look something like this:
> >>>>
> >>>> Level 1: 1 node, root node.
> >>>> Level 2: 100 nodes, inner child nodes
> >>>> Level 3: 10,000 nodes, inner child nodes
> >>>> Level 4: 1,000,000 leaf nodes... all with pointers to the data
> offsets
> >> in
> >>>> the data file.
> >>>>
> >>>> Now let's say I write 10 updates to documents in that file. There are
> 10
> >>>> new revisions appended to the end of the data file *each one*
> separated
> >> by
> >>>> a rewritten B+ path to a leaf node with its new location at the end
> of
> >> the
> >>>> file. Each of those paths written between each doc revision (say
> >> roughly 2k
> >>>> like Filipe mentioned) are just 4 item paths... root -> level1 ->
> >> level2 ->
> >>>> level3 --> level4... showing the discrete path from the root to that
> >>>> individual updated doc. The intermediary levels (l1, 2, 3) are not
> fully
> >>>> flushed out with all the OTHER children from the original b+ tree
> index.
> >>>>
> >>>> [[ is this correct so far? If not, please point out my mistakes...]
> >>>>
> >>>> Now I issue a query for a document that WAS NOT updated...
> >>>>
> >>>> **** this is where I get confused on the logic ***
> >>>>
> >>>> this would mean I need to access the original B+ tree index at the
> root
> >> of
> >>>> the data file, because the revised B+ paths that are written between
> >> each
> >>>> of the updated doc revisions at the end of the file are not full
> >> indices.
> >>>>
> >>>> NOW consider I want to query for one of the changed docs... now I
> >> suddenly
> >>>> need to scan backwards from the data file's end to find the updated
> >> path to
> >>>> the new revision of that document.
> >>>>
> >>>> (obviously) this isn't what Couch is actually doing... it's doing
> >>>> something more elegant, I just can't figure out what or how and that
> is
> >>>> what I was hoping for help with.
> >>>>
> >>>> Much thanks guys, I know this is a heavy question to ask.
> >>>>
> >>>> Best wishes,
> >>>> R
> >>>>
> >>>>
> >>>> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne <
> >>>> dionne@dionne-associates.com> wrote:
> >>>>
> >>>>>
> >>>>> Robert Dionne
> >>>>> Computer Programmer
> >>>>> dionne@dionne-associates.com
> >>>>> 203.231.9961
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
> >>>>>
> >>>>>> Filipe,
> >>>>>>
> >>>>>> Thank you for the reply.
> >>>>>>
> >>>>>> Maybe I am misunderstanding exactly what couch is writing out; the
> >> docs
> >>>>>> I've read say that it "rewrites the root node" -- I can't tell if
> the
> >>>>> docs
> >>>>>> mean the parent node of the child doc that was changed (as one of
> the
> >> b+
> >>>>>> leaves) or if it means the direct path, from the root, to that
> >>>>> individual
> >>>>>> doc... or if it means the *entire* index...
> >>>>>>
> >>>>>> In the case of even rewriting the single parent, with such a shallow
> >>>>> tree,
> >>>>>> each internal node will have a huge fan of children; let's say 1-10k
> in a
> >>>>>> decent sized data set.
> >>>>>>
> >>>>>> If you are seeing a few K of extra written out after each changed
> doc
> >>>>> then
> >>>>>> that cannot be right... I almost assumed my understanding was wrong
> >>>>> because
> >>>>>> the sheer volume of data would make performance abysmal if it was
> >> true.
> >>>>>>
> >>>>>> Given that... is it just the changed path, from the root to the new
> >> leaf
> >>>>>> that is rewritten?
> >>>>>
> >>>>> Hi Riyad,
> >>>>>
> >>>>> You are correct, it's only the changed path. Interestingly I've just
> >>>>> started to document all these internals[1] along with links to the
> >> code and
> >>>>> other references available.
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Bob
> >>>>>
> >>>>>
> >>>>> [1] http://bdionne.github.com/couchdb/
> >>>>>
> >>>>>> That makes me all sorts of curious as to how Couch
> >>>>>> updates/searches the new modified index with the small diff that is
> >>>>> written
> >>>>>> out.
> >>>>>>
> >>>>>> Any pointers to reading that will help me dig down on this (even
> early
> >>>>> bugs
> >>>>>> in JIRA?) would be appreciated. I've tried skimming back in 2007/08
> on
> >>>>>> Damien's blog to see if he wrote about it in depth and so far
> haven't
> >>>>> found
> >>>>>> anything as detailed as I am hoping for on this architecture.
> >>>>>>
> >>>>>> Best,
> >>>>>> Riyad
> >>>>>>
> >>>>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <
> >>>>> fdmanana@apache.org>wrote:
> >>>>>>
> >>>>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rk...@gmail.com>
> >> wrote:
> >>>>>>>> I've been reading everything I can find on the CouchDB file
> >> format[1]
> >>>>> and
> >>>>>>>> am getting bits and pieces here and there, but not a great,
> >> concrete,
> >>>>>>>> step-by-step explanation of the process.
> >>>>>>>>
> >>>>>>>> I'm clear on the use of B+ trees and after reading a few papers on
> >> the
> >>>>>>>> benefits of log-structured file formats, I understand the benefits
> >> of
> >>>>>>>> inlining the B+ tree indices directly into the data file as well
> >>>>>>> (locality
> >>>>>>>> + sequential I/O)... what I'm flummoxed about is how much of the
> B+
> >>>>>>> tree's
> >>>>>>>> index is rewritten after every modified document.
> >>>>>>>>
> >>>>>>>> Consider a CouchDB file that looks more or less like this:
> >>>>>>>>
> >>>>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
> >>>>>>>>
> >>>>>>>> After each revised doc is written and the "b-tree root" is
> rewritten
> >>>>>>> after
> >>>>>>>> that, is that just a modified root node of the B+ tree or the
> entire
> >>>>> B+
> >>>>>>>> tree?
> >>>>>>>>
> >>>>>>>> The reason I ask is because regardless of the answer to my
> previous
> >>>>>>>> question, for a *huge* database with millions of records, that
> seems
> >>>>> like
> >>>>>>>> an enormous amount of data to rewrite after every modification.
> Say
> >>>>> the
> >>>>>>>> root node had a fanning factor of 133; that would still be a lot of
> >>>>> data
> >>>>>>> to
> >>>>>>>> rewrite.
> >>>>>>>
> >>>>>>> Hi Riyad,
> >>>>>>>
> >>>>>>> Have you observed that in practice?
> >>>>>>>
> >>>>>>> Typically the depth of database btrees is not that high even for
> >>>>>>> millions of documents. For example I have one around with about 10
> >>>>>>> million documents which doesn't have more than 5 or 6 levels if I
> >>>>>>> recall correctly.
> >>>>>>>
> >>>>>>> So updating a doc, for that particular case, means rewriting 5 or 6
> >>>>>>> new nodes plus the document itself. Each node is normally not much
> >>>>>>> bigger than 1.2Kb.
> >>>>>>>
> >>>>>>> I once wrote a tool to analyze database files which reports
> >> btree
> >>>>>>> depths, however it's not updated to work with recent changes on
> >>>>>>> master/1.2.x such as snappy compression and btree sizes:
> >>>>>>>
> >>>>>>> https://github.com/fdmanana/couchfoo
> >>>>>>>
> >>>>>>> It should work with CouchDB 1.1 (and older) database files.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> I am certain I am missing the boat on this; if anyone can pull me
> >> out
> >>>>> of
> >>>>>>>> the water and point me to dry land I'd appreciate it.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> R
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> [1]
> >>>>>>>> --
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>
> http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
> >>>>>>>> --
> http://horicky.blogspot.com/2008/10/couchdb-implementation.html
> >>>>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
> >>>>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html
> >>>>>>>> -- http://ayende.com/blog/* (Over my head)
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Filipe David Manana,
> >>>>>>>
> >>>>>>> "Reasonable men adapt themselves to the world.
> >>>>>>> Unreasonable men adapt the world to themselves.
> >>>>>>> That's why all progress depends on unreasonable men."
> >>>>>>>
> >>>>>
> >>>>>
> >>>>
> >>
> >>
>
>

Re: Understanding the CouchDB file format

Posted by Jan Lehnardt <ja...@apache.org>.
Good writeup! Would you consider contributing it to the CouchDB Wiki?

  http://wiki.apache.org/couchdb/

Cheers
Jan
-- 

On Dec 21, 2011, at 21:28 , Riyad Kalla wrote:

> Bob,
> 
> Really appreciate the link; Rick has a handful of articles that helped a
> lot.
> 
> Alongside all the CouchDB reading I've been looking at SSD-optimized data
> storage mechanisms and tried to coalesce all of this information into this
> post on Couch's file storage format:
> https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv
> 
> It is uncanny how many things Couch seems to have gotten right with regard
> to existing storage systems and future flash-based storage systems. I'd
> appreciate any corrections, additions or feedback to the post for anyone
> interested.
> 
> Best,
> R
> 
> On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne <
> dionne@dionne-associates.com> wrote:
> 
>> I think this is largely correct Riyad, I dug out an old article[1] by Rick
>> Ho that you may also find helpful though it might be slightly dated.
>> Generally the best performance will be had if the ids are sequential and
>> updates are done in bulk. Write heavy applications will eat up a lot of
>> space and require compaction. At the leaf nodes what are stored are either
>> full_doc_info records or doc_info records which store pointers to the data
>> so the main thing that impacts the branching at each level are the key size
>> and in the case of views the sizes of the reductions as these are stored
>> with the intermediate nodes.
>> 
>> All in all it works pretty well but as always you need to test and
>> evaluate it for your specific case to see what the limits are.
>> 
>> Regards,
>> 
>> Bob
>> 
>> 
>> [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>> 
>> 
>> 
>> 
>> On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:
>> 
>>> Adding to this conversation, I found this set of slides by Chris
>> explaining
>>> the append-only index update format:
>>> http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
>>> 
>>> Specifically slides 16, 17 and 18.
>>> 
>>> Using this example tree, rewriting the updated path (in reverse order)
>>> appended to the end of the file makes sense... you see how index queries
>>> can simply read backwards from the end of the file and not only find the
>>> latest revisions of docs, but also every other doc that wasn't touched
>> (it
>>> will just seek into the existing inner nodes of the b+ tree for
>> searching).
>>> 
>>> What I am hoping for clarification on is the following pain-point that I
>>> perceive with this approach:
>>> 
>>> 1. In a sufficiently shallow B+ tree (like CouchDB), the paths themselves
>>> to elements are short (typically no more than 3 to 5 levels deep) as
>>> opposed to a trie or some other construct that would have much longer
>> paths
>>> to elements.
>>> 
>>> 2. Because the depth of the tree is so shallow, the breadth of it becomes
>>> large to compensate... more specifically, each internal node can have
>> 100s,
>>> 1000s or more children. Using the example slides, consider the nodes
>>> [A...M] and [R...Z] -- in a good sized CouchDB database, those internal
>>> index nodes would have 100s (or more) elements in them pointing at deeper
>>> internal nodes that themselves had thousands of elements; instead of the
>> 13
>>> or so as implied by [A...M].
>>> 
>>> 3. Looking at slide 17 and 18, where you see the direct B+ tree path to
>> the
>>> update node getting appended to the end of the file after the revision is
>>> written (leaf to root ordering: [J' M] -> [A M] -> [A Z]) it implies that
>>> those internal nodes with *all* their child elements are getting
>> rewritten
>>> as well.
>>> 
>>> In this example tree, it isn't such a big issue... but in a
>> sufficiently
>>> large CouchDB database, these nodes denoted by [A...M] and [A...Z] could
>> be
>>> quite large... I don't know the format of the node elements in the B+
>> tree,
>>> but it would be whatever the size of a node is times however many
>> elements
>>> are contained at each level (1 for root, say 100 for level 2, 1000 for
>>> level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on
>>> here, of course it depends on the size of the data store).
>>> 
>>> Am I missing something or is CouchDB really rewriting that much index
>>> information between document revisions on every update?
>>> 
>>> What was previously confusing me is I thought it was *only* rewriting a
>>> direct path to the updated revision, like [B]>[E]>[J'] and Couch was
>>> somehow patching in that updated path info to the B+ index at runtime.
>>> 
>>> If couch is rewriting entire node paths with all their elements then I am
>>> no longer confused about the B+ index updates, but am curious about the
>>> on-disk cost of this.
>>> 
>>> In my own rough insertion testing, that would explain why I see my
>>> collections absolutely explode in size until they are compacted (not
>> using
>>> bulk insert, but intentionally doing single inserts for millions of
>>> docs to see what kind of cost the index path duplication would be like).
>>> 
>>> Can anyone confirm/deny/correct this assessment? I want to make sure I am
>>> on the right track understanding this.
>>> 
>>> Best wishes,
>>> Riyad
>>> 
>>> On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <rk...@gmail.com> wrote:
>>> 
>>>> @Filipe - I was just not clear on how CouchDB operated; you and Robert
>>>> cleared that up for me. Thank you.
>>>> 
>>>> @Robert - The writeup is excellent so far (I am not familiar with
>> Erlang,
>>>> so there is a bit of stickiness there), thank you for taking the time to
>>>> put this together!
>>>> 
>>>> At this point I am curious how the _id and _seq indices are read as
>> their
>>>> data is continually appended to the end of the data file in small
>>>> diff-trees for every updated doc.
>>>> 
>>>> If CouchDB kept all the indices in-memory and simply patched-in the
>>>> updated paths at runtime (maybe something akin to memory-mapped indices
>> in
>>>> MongoDB) I would be fairly clear on the operation... but as I understand
>>>> it, CouchDB keeps such a small memory footprint by doing no in-memory
>>>> caching and relying on the intelligence of the OS and filesystem (and/or
>>>> drives) to cache frequently accessed data.
>>>> 
>>>> I am trying to understand the logic used by CouchDB to answer a query
>>>> using the index once updates to the tree have been appended to the data
>>>> file... for example, consider a CouchDB datastore like the one Filipe
>>>> has... 10 million documents and let's say it is freshly compacted.
>>>> 
>>>> If I send in a request to that Couch instance, it hits the header of the
>>>> data file along with the index and walks the B+ tree to the leaf node,
>>>> where it finds the offset into the data file where the actual doc
>> lives...
>>>> let's say 1,000,000 bytes away.
>>>> 
>>>> These B+ trees are shallow, so it might look something like this:
>>>> 
>>>> Level 1: 1 node, root node.
>>>> Level 2: 100 nodes, inner child nodes
>>>> Level 3: 10,000 nodes, inner child nodes
>>>> Level 4: 1,000,000 leaf nodes... all with pointers to the data offsets
>> in
>>>> the data file.
>>>> 
>>>> Now let's say I write 10 updates to documents in that file. There are 10
>>>> new revisions appended to the end of the data file *each one* separated
>> by
>>>> a rewritten B+ path to a leaf node with its new location at the end of
>> the
>>>> file. Each of those paths written between each doc revision (say
>> roughly 2k
>>>> like Filipe mentioned) are just 4 item paths... root -> level1 ->
>> level2 ->
>>>> level3 --> level4... showing the discrete path from the root to that
>>>> individual updated doc. The intermediary levels (l1, 2, 3) are not fully
>>>> flushed out with all the OTHER children from the original b+ tree index.
>>>> 
>>>> [[ is this correct so far? If not, please point out my mistakes...]
>>>> 
>>>> Now I issue a query for a document that WAS NOT updated...
>>>> 
>>>> **** this is where I get confused on the logic ***
>>>> 
>>>> this would mean I need to access the original B+ tree index at the root
>> of
>>>> the data file, because the revised B+ paths that are written between
>> each
>>>> of the updated doc revisions at the end of the file are not full
>> indices.
>>>> 
>>>> NOW consider I want to query for one of the changed docs... now I
>> suddenly
>>>> need to scan backwards from the data file's end to find the updated
>> path to
>>>> the new revision of that document.
>>>> 
>>>> (obviously) this isn't what Couch is actually doing... it's doing
>>>> something more elegant, I just can't figure out what or how and that is
>>>> what I was hoping for help with.
>>>> 
>>>> Much thanks guys, I know this is a heavy question to ask.
>>>> 
>>>> Best wishes,
>>>> R
>>>> 
>>>> 
>>>> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne <
>>>> dionne@dionne-associates.com> wrote:
>>>> 
>>>>> 
>>>>> Robert Dionne
>>>>> Computer Programmer
>>>>> dionne@dionne-associates.com
>>>>> 203.231.9961
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
>>>>> 
>>>>>> Filipe,
>>>>>> 
>>>>>> Thank you for the reply.
>>>>>> 
>>>>>> Maybe I am misunderstanding exactly what couch is writing out; the
>> docs
>>>>>> I've read say that it "rewrites the root node" -- I can't tell if the
>>>>> docs
>>>>>> mean the parent node of the child doc that was changed (as one of the
>> b+
>>>>>> leaves) or if it means the direct path, from the root, to that
>>>>> individual
>>>>>> doc... or if it means the *entire* index...
>>>>>> 
>>>>>> In the case of even rewriting the single parent, with such a shallow
>>>>> tree,
>>>>>> each internal node will have a huge fan of children; let's say 1-10k in a
>>>>>> decent sized data set.
>>>>>> 
>>>>>> If you are seeing a few K of extra written out after each changed doc
>>>>> then
>>>>>> that cannot be right... I almost assumed my understanding was wrong
>>>>> because
>>>>>> the sheer volume of data would make performance abysmal if it was
>> true.
>>>>>> 
>>>>>> Given that... is it just the changed path, from the root to the new
>> leaf
>>>>>> that is rewritten?
>>>>> 
>>>>> Hi Riyad,
>>>>> 
>>>>> You are correct, it's only the changed path. Interestingly I've just
>>>>> started to document all these internals[1] along with links to the
>> code and
>>>>> other references available.
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Bob
>>>>> 
>>>>> 
>>>>> [1] http://bdionne.github.com/couchdb/
>>>>> 
>>>>>> That makes me all sorts of curious as to how Couch
>>>>>> updates/searches the new modified index with the small diff that is
>>>>> written
>>>>>> out.
>>>>>> 
>>>>>> Any pointers to reading that will help me dig down on this (even early
>>>>> bugs
>>>>>> in JIRA?) would be appreciated. I've tried skimming back in 2007/08 on
>>>>>> Damien's blog to see if he wrote about it in depth and so far haven't
>>>>> found
>>>>>> anything as detailed as I am hoping for on this architecture.
>>>>>> 
>>>>>> Best,
>>>>>> Riyad
>>>>>> 
>>>>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <
>>>>> fdmanana@apache.org>wrote:
>>>>>> 
>>>>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rk...@gmail.com>
>> wrote:
>>>>>>>> I've been reading everything I can find on the CouchDB file
>> format[1]
>>>>> and
>>>>>>>> am getting bits and pieces here and there, but not a great,
>> concrete,
>>>>>>>> step-by-step explanation of the process.
>>>>>>>> 
>>>>>>>> I'm clear on the use of B+ trees and after reading a few papers on
>> the
>>>>>>>> benefits of log-structured file formats, I understand the benefits
>> of
>>>>>>>> inlining the B+ tree indices directly into the data file as well
>>>>>>> (locality
>>>>>>>> + sequential I/O)... what I'm flummoxed about is how much of the B+
>>>>>>> tree's
>>>>>>>> index is rewritten after every modified document.
>>>>>>>> 
>>>>>>>> Consider a CouchDB file that looks more or less like this:
>>>>>>>> 
>>>>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
>>>>>>>> 
>>>>>>>> After each revised doc is written and the "b-tree root" is rewritten
>>>>>>> after
>>>>>>>> that, is that just a modified root node of the B+ tree or the entire
>>>>> B+
>>>>>>>> tree?
>>>>>>>> 
>>>>>>>> The reason I ask is because regardless of the answer to my previous
>>>>>>>> question, for a *huge* database with millions of records, that seems
>>>>> like
>>>>>>>> an enormous amount of data to rewrite after every modification. Say
>>>>> the
>>>>>>>> root node had a fanning factor of 133; that would still be a lot of
>>>>> data
>>>>>>> to
>>>>>>>> rewrite.
>>>>>>> 
>>>>>>> Hi Riyad,
>>>>>>> 
>>>>>>> Have you observed that in practice?
>>>>>>> 
>>>>>>> Typically the depth of database btrees is not that high even for
>>>>>>> millions of documents. For example I have one around with about 10
>>>>>>> million documents which doesn't have more than 5 or 6 levels if I
>>>>>>> recall correctly.
>>>>>>> 
>>>>>>> So updating a doc, for that particular case, means rewriting 5 or 6
>>>>>>> new nodes plus the document itself. Each node is normally not much
>>>>>>> bigger than 1.2Kb.
>>>>>>> 
>>>>>>> I once wrote a tool to analyze database files which reports
>> btree
>>>>>>> depths, however it's not updated to work with recent changes on
>>>>>>> master/1.2.x such as snappy compression and btree sizes:
>>>>>>> 
>>>>>>> https://github.com/fdmanana/couchfoo
>>>>>>> 
>>>>>>> It should work with CouchDB 1.1 (and older) database files.
>>>>>>> 
>>>>>>>> 
>>>>>>>> I am certain I am missing the boat on this; if anyone can pull me
>> out
>>>>> of
>>>>>>>> the water and point me to dry land I'd appreciate it.
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> R
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> [1]
>>>>>>>> --
>>>>>>>> 
>>>>>>> 
>>>>> 
>> http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
>>>>>>>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>>>>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
>>>>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html
>>>>>>>> -- http://ayende.com/blog/* (Over my head)
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Filipe David Manana,
>>>>>>> 
>>>>>>> "Reasonable men adapt themselves to the world.
>>>>>>> Unreasonable men adapt the world to themselves.
>>>>>>> That's why all progress depends on unreasonable men."
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 
>> 


Re: Understanding the CouchDB file format

Posted by Riyad Kalla <rk...@gmail.com>.
Bob,

Really appreciate the link; Rick has a handful of articles that helped a
lot.

Alongside all the CouchDB reading I've been looking at SSD-optimized data
storage mechanisms and tried to coalesce all of this information into this
post on Couch's file storage format:
https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv

It is uncanny how many things Couch seems to have gotten right with regard
to existing storage systems and future flash-based storage systems. I'd
appreciate any corrections, additions or feedback to the post for anyone
interested.

Best,
R

On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne <
dionne@dionne-associates.com> wrote:

> I think this is largely correct Riyad, I dug out an old article[1] by Rick
> Ho that you may also find helpful though it might be slightly dated.
> Generally the best performance will be had if the ids are sequential and
> updates are done in bulk. Write heavy applications will eat up a lot of
> space and require compaction. At the leaf nodes what are stored are either
> full_doc_info records or doc_info records which store pointers to the data
> so the main thing that impacts the branching at each level are the key size
> and in the case of views the sizes of the reductions as these are stored
> with the intermediate nodes.
>
> All in all it works pretty well but as always you need to test and
> evaluate it for your specific case to see what the limits are.
>
> Regards,
>
> Bob
>
>
> [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>
>
>
>
> On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:
>
> > Adding to this conversation, I found this set of slides by Chris
> explaining
> > the append-only index update format:
> > http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
> >
> > Specifically slides 16, 17 and 18.
> >
> > Using this example tree, rewriting the updated path (in reverse order)
> > appended to the end of the file makes sense... you see how index queries
> > can simply read backwards from the end of the file and not only find the
> > latest revisions of docs, but also every other doc that wasn't touched
> (it
> > will just seek into the existing inner nodes of the b+ tree for
> searching).
> >
> > What I am hoping for clarification on is the following pain-point that I
> > perceive with this approach:
> >
> > 1. In a sufficiently shallow B+ tree (like CouchDB), the paths themselves
> > to elements are short (typically no more than 3 to 5 levels deep) as
> > opposed to a trie or some other construct that would have much longer
> paths
> > to elements.
> >
> > 2. Because the depth of the tree is so shallow, the breadth of it becomes
> > large to compensate... more specifically, each internal node can have
> 100s,
> > 1000s or more children. Using the example slides, consider the nodes
> > [A...M] and [R...Z] -- in a good sized CouchDB database, those internal
> > index nodes would have 100s (or more) elements in them pointing at deeper
> > internal nodes that themselves had thousands of elements; instead of the
> 13
> > or so as implied by [A...M].
> >
> > 3. Looking at slide 17 and 18, where you see the direct B+ tree path to
> the
> > update node getting appended to the end of the file after the revision is
> > written (leaf to root ordering: [J' M] -> [A M] -> [A Z]) it implies that
> > those internal nodes with *all* their child elements are getting
> rewritten
> > as well.
> >
> > In this example tree, it isn't such a big issue... but in a
> sufficiently
> > large CouchDB database, these nodes denoted by [A...M] and [A...Z] could
> be
> > quite large... I don't know the format of the node elements in the B+
> tree,
> > but it would be whatever the size of a node is times however many
> elements
> > are contained at each level (1 for root, say 100 for level 2, 1000 for
> > level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on
> > here, of course it depends on the size of the data store).
> >
> > Am I missing something or is CouchDB really rewriting that much index
> > information between document revisions on every update?
> >
> > What was previously confusing me is I thought it was *only* rewriting a
> > direct path to the updated revision, like [B]>[E]>[J'] and Couch was
> > somehow patching in that updated path info to the B+ index at runtime.
> >
> > If couch is rewriting entire node paths with all their elements then I am
> > no longer confused about the B+ index updates, but am curious about the
> > on-disk cost of this.
> >
> > In my own rough insertion testing, that would explain why I see my
> > collections absolutely explode in size until they are compacted (not
> using
> > bulk insert, but intentionally doing single inserts for millions of
> > docs to see what kind of cost the index path duplication would be like).
> >
> > Can anyone confirm/deny/correct this assessment? I want to make sure I am
> > on the right track understanding this.
> >
> > Best wishes,
> > Riyad
> >
> > On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <rk...@gmail.com> wrote:
> >
> >> @Filipe - I was just not clear on how CouchDB operated; you and Robert
> >> cleared that up for me. Thank you.
> >>
> >> @Robert - The writeup is excellent so far (I am not familiar with
> Erlang,
> >> so there is a bit of stickiness there), thank you for taking the time to
> >> put this together!
> >>
> >> At this point I am curious how the _id and _seq indices are read as
> their
> >> data is continually appended to the end of the data file in small
> >> diff-trees for every updated doc.
> >>
> >> If CouchDB kept all the indices in-memory and simply patched-in the
> >> updated paths at runtime (maybe something akin to memory-mapped indices
> in
> >> MongoDB) I would be fairly clear on the operation... but as I understand
> >> it, CouchDB keeps such a small memory footprint by doing no in-memory
> >> caching and relying on the intelligence of the OS and filesystem (and/or
> >> drives) to cache frequently accessed data.
> >>
> >> I am trying to understand the logic used by CouchDB to answer a query
> >> using the index once updates to the tree have been appended to the data
> >> file... for example, consider a CouchDB datastore like the one Filipe
> >> has... 10 million documents and let's say it is freshly compacted.
> >>
> >> If I send in a request to that Couch instance, it hits the header of the
> >> data file along with the index and walks the B+ tree to the leaf node,
> >> where it finds the offset into the data file where the actual doc
> lives...
> >> let's say 1,000,000 bytes away.
> >>
> >> These B+ trees are shallow, so it might look something like this:
> >>
> >> Level 1: 1 node, root node.
> >> Level 2: 100 nodes, inner child nodes
> >> Level 3: 10,000 nodes, inner child nodes
> >> Level 4: 1,000,000 leaf nodes... all with pointers to the data offsets
> in
> >> the data file.
> >>
> >> Now let's say I write 10 updates to documents in that file. There are 10
> >> new revisions appended to the end of the data file *each one* separated
> by
> >> a rewritten B+ path to a leaf node with its new location at the end of
> the
> >> file. Each of those paths written between each doc revision (say
> roughly 2k
> >> like Filipe mentioned) are just 4 item paths... root -> level1 ->
> level2 ->
> >> level3 --> level4... showing the discrete path from the root to that
> >> individual updated doc. The intermediary levels (l1, 2, 3) are not fully
> >> flushed out with all the OTHER children from the original b+ tree index.
> >>
> >> [[ is this correct so far? If not, please point out my mistakes...]
> >>
> >> Now I issue a query for a document that WAS NOT updated...
> >>
> >> **** this is where I get confused on the logic ***
> >>
> >> this would mean I need to access the original B+ tree index at the root
> of
> >> the data file, because the revised B+ paths that are written between
> each
> >> of the updated doc revisions at the end of the file are not full
> indices.
> >>
> >> NOW consider I want to query for one of the changed docs... now I
> suddenly
> >> need to scan backwards from the data file's end to find the updated
> path to
> >> the new revision of that document.
> >>
> >> (obviously) this isn't what Couch is actually doing... it's doing
> >> something more elegant, I just can't figure out what or how and that is
> >> what I was hoping for help with.
> >>
> >> Much thanks guys, I know this is a heavy question to ask.
> >>
> >> Best wishes,
> >> R
> >>
> >>
> >> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne <
> >> dionne@dionne-associates.com> wrote:
> >>
> >>>
> >>> Robert Dionne
> >>> Computer Programmer
> >>> dionne@dionne-associates.com
> >>> 203.231.9961
> >>>
> >>>
> >>>
> >>>
> >>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
> >>>
> >>>> Filipe,
> >>>>
> >>>> Thank you for the reply.
> >>>>
> >>>> Maybe I am misunderstanding exactly what couch is writing out; the
> docs
> >>>> I've read say that it "rewrites the root node" -- I can't tell if the
> >>> docs
> >>>> mean the parent node of the child doc that was changed (as one of the
> b+
> >>>> leaves) or if it means the direct path, from the root, to that
> >>> individual
> >>>> doc... or if it means the *entire* index...
> >>>>
> >>>> In the case of even rewriting the single parent, with such a shallow
> >>> tree,
> >>>> each internal node will have a huge fan of children; let's say 1-10k in a
> >>>> decent sized data set.
> >>>>
> >>>> If you are seeing a few K of extra written out after each changed doc
> >>> then
> >>>> that cannot be right... I almost assumed my understanding was wrong
> >>> because
> >>>> the sheer volume of data would make performance abysmal if it was
> true.
> >>>>
> >>>> Given that... is it just the changed path, from the root to the new
> leaf
> >>>> that is rewritten?
> >>>
> >>> Hi Riyad,
> >>>
> >>> You are correct, it's only the changed path. Interestingly I've just
> >>> started to document all these internals[1] along with links to the
> code and
> >>> other references available.
> >>>
> >>> Cheers,
> >>>
> >>> Bob
> >>>
> >>>
> >>> [1] http://bdionne.github.com/couchdb/
> >>>
> >>>> That makes me all sorts of curious as to how Couch
> >>>> updates/searches the new modified index with the small diff that is
> >>> written
> >>>> out.
> >>>>
> >>>> Any pointers to reading that will help me dig down on this (even early
> >>> bugs
> >>>> in JIRA?) would be appreciated. I've tried skimming back in 2007/08 on
> >>>> Damien's blog to see if he wrote about it in depth and so far haven't
> >>> found
> >>>> anything as detailed as I am hoping for on this architecture.
> >>>>
> >>>> Best,
> >>>> Riyad
> >>>>
> >>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <
> >>> fdmanana@apache.org>wrote:
> >>>>
> >>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rk...@gmail.com>
> wrote:
> >>>>>> I've been reading everything I can find on the CouchDB file
> format[1]
> >>> and
> >>>>>> am getting bits and pieces here and there, but not a great,
> concrete,
> >>>>>> step-by-step explanation of the process.
> >>>>>>
> >>>>>> I'm clear on the use of B+ trees and after reading a few papers on
> the
> >>>>>> benefits of log-structured file formats, I understand the benefits
> of
> >>>>>> inlining the B+ tree indices directly into the data file as well
> >>>>> (locality
> >>>>>> + sequential I/O)... what I'm flummoxed about is how much of the B+
> >>>>> tree's
> >>>>>> index is rewritten after every modified document.
> >>>>>>
> >>>>>> Consider a CouchDB file that looks more or less like this:
> >>>>>>
> >>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
> >>>>>>
> >>>>>> After each revised doc is written and the "b-tree root" is rewritten
> >>>>> after
> >>>>>> that, is that just a modified root node of the B+ tree or the entire
> >>> B+
> >>>>>> tree?
> >>>>>>
> >>>>>> The reason I ask is because regardless of the answer to my previous
> >>>>>> question, for a *huge* database with millions of records, that seems
> >>> like
> >>>>>> an enormous amount of data to rewrite after every modification. Say
> >>> the
> >>>>>> root node had a fanning factor of 133; that would still be a lot of
> >>> data
> >>>>> to
> >>>>>> rewrite.
> >>>>>
> >>>>> Hi Riyad,
> >>>>>
> >>>>> Have you observed that in practice?
> >>>>>
> >>>>> Typically the depth of database btrees is not that high even for
> >>>>> millions of documents. For example I have one around with about 10
> >>>>> million documents which doesn't have more than 5 or 6 levels if I
> >>>>> recall correctly.
> >>>>>
> >>>>> So updating a doc, for that particular case, means rewriting 5 or 6
> >>>>> new nodes plus the document itself. Each node is normally not much
> >>>>> bigger than 1.2Kb.
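
(Back-of-the-envelope from those numbers: about 6 nodes x 1.2 KB each is
roughly 7 KB of rewritten index per update, plus the document body itself,
no matter how many million documents the tree holds.)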
> >>>>>
> >>>>> I once wrote a tool to analyze database files which reports
> btree
> >>>>> depths, however it's not updated to work with recent changes on
> >>>>> master/1.2.x such as snappy compression and btree sizes:
> >>>>>
> >>>>> https://github.com/fdmanana/couchfoo
> >>>>>
> >>>>> It should work with CouchDB 1.1 (and older) database files.
> >>>>>
> >>>>>>
> >>>>>> I am certain I am missing the boat on this; if anyone can pull me
> out
> >>> of
> >>>>>> the water and point me to dry land I'd appreciate it.
> >>>>>>
> >>>>>> Best,
> >>>>>> R
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> [1]
> >>>>>> --
> >>>>>>
> >>>>>
> >>>
> http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
> >>>>>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
> >>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
> >>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html
> >>>>>> -- http://ayende.com/blog/* (Over my head)
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Filipe David Manana,
> >>>>>
> >>>>> "Reasonable men adapt themselves to the world.
> >>>>> Unreasonable men adapt the world to themselves.
> >>>>> That's why all progress depends on unreasonable men."
> >>>>>
> >>>
> >>>
> >>
>
>

Re: Understanding the CouchDB file format

Posted by Robert Dionne <di...@dionne-associates.com>.
I think this is largely correct, Riyad. I dug out an old article[1] by Rick
Ho that you may also find helpful, though it might be slightly dated.
Generally the best performance will be had if the ids are sequential and
updates are done in bulk. Write-heavy applications will eat up a lot of
space and require compaction. At the leaf nodes what are stored are either
full_doc_info records or doc_info records, which hold pointers to the data,
so the main things that impact the branching at each level are the key size
and, in the case of views, the sizes of the reductions, since these are
stored with the intermediate nodes.

All in all it works pretty well, but as always you need to test and
evaluate it for your specific case to see what the limits are.
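
As a crude illustration of how key and reduction sizes drive the branching
(all numbers here are invented, nothing measured from CouchDB):

    # Back-of-the-envelope only, not CouchDB code: if an interior node is
    # kept near node_budget bytes and each entry carries a key, a child
    # file offset, and a serialized reduction, the fanout is roughly:
    def approx_fanout(node_budget=1200, key_bytes=32, offset_bytes=8,
                      reduction_bytes=16):
        return node_budget // (key_bytes + offset_bytes + reduction_bytes)

    print(approx_fanout())             # ~21 children per node
    print(approx_fanout(key_bytes=8))  # ~37 with short sequential ids

Fatter keys or fatter reductions mean fewer children per node, a deeper
tree, and one more ~1 KB node rewritten on every update.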

Regards,

Bob


[1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html




On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:

> Adding to this conversation, I found this set of slides by Chris explaining
> the append-only index update format:
> http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
> 
> Specifically slides 16, 17 and 18.
> 
> Using this example tree, rewriting the updated path (in reverse order)
> appended to the end of the file makes sense... you see how index queries
> can simply read backwards from the end of the file and not only find the
> latest revisions of docs, but also every other doc that wasn't touched (it
> will just seek into the existing inner nodes of the b+ tree for searching).
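
That's the right picture. To make it concrete, here is a toy append-only
tree in Python (purely illustrative; the real implementation is the Erlang
module couch_btree.erl, and this sketch leaves out node splitting, fanout
limits, reductions, and the on-disk header):

    import pickle

    class AppendOnlyFile:
        """Grow-only byte buffer standing in for the .couch file."""
        def __init__(self):
            self.data = bytearray()

        def append(self, node):
            offset = len(self.data)
            blob = pickle.dumps(node)
            self.data += len(blob).to_bytes(4, 'big') + blob
            return offset                  # nodes are named by file offset

        def read(self, offset):
            size = int.from_bytes(self.data[offset:offset + 4], 'big')
            return pickle.loads(bytes(self.data[offset + 4:offset + 4 + size]))

    # A node is ('leaf', [(key, value), ...]) or
    # ('inner', [(last_key_in_child, child_offset), ...]), keys kept sorted.

    def insert(f, root_offset, key, value):
        """Append a new leaf-to-root path; untouched siblings keep old offsets."""
        kind, entries = f.read(root_offset)
        if kind == 'leaf':
            entries = sorted([e for e in entries if e[0] != key] + [(key, value)])
            return f.append(('leaf', entries))
        i = next((j for j, (k, _) in enumerate(entries) if key <= k),
                 len(entries) - 1)
        new_child = insert(f, entries[i][1], key, value)  # rewrite path below
        entries = (entries[:i]
                   + [(max(key, entries[i][0]), new_child)]
                   + entries[i + 1:])
        return f.append(('inner', entries))  # one new node per level, no more

    f = AppendOnlyFile()
    root = f.append(('leaf', []))            # empty tree
    for k in 'abc':
        root = insert(f, root, k, k.upper()) # every update yields a new root

Every insert strands the previous root (and the nodes only it references)
earlier in the file; that stranded garbage is what compaction reclaims, and
it is why the file grows so fast under single-document updates.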
> 
> What I am hoping for clarification on is the following pain-point that I
> perceive with this approach:
> 
> 1. In a sufficiently shallow B+ tree (like CouchDB), the paths themselves
> to elements are short (typically no more than 3 to 5 levels deep) as
> opposed to a trie or some other construct that would have much longer paths
> to elements.
> 
> 2. Because the depth of the tree is so shallow, the breadth of it becomes
> large to compensate... more specifically, each internal node can have 100s,
> 1000s or more children. Using the example slides, consider the nodes
> [A...M] and [R...Z] -- in a good sized CouchDB database, those internal
> index nodes would have 100s (or more) elements in them pointing at deeper
> internal nodes that themselves had thousands of elements; instead of the 13
> or so as implied by [A...M].
> 
> 3. Looking at slide 17 and 18, where you see the direct B+ tree path to the
> update node getting appended to the end of the file after the revision is
> written (leaf to root ordering: [J' M] -> [A M] -> [A Z]) it implies that
> those internal nodes with *all* their child elements are getting rewritten
> as well.
> 
> In this example tree, it isn't such a big issue... but in a sufficiently
> large CouchDB database, these nodes denoted by [A...M] and [A...Z] could be
> quite large... I don't know the format of the node elements in the B+ tree,
> but it would be whatever the size of a node is times however many elements
> are contained at each level (1 for root, say 100 for level 2, 1000 for
> level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on
> here, of course it depends on the size of the data store).
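
One correction to the mental model here: a rewritten inner node carries
only its own entry list, one key and pointer per immediate child, never the
contents of the levels underneath it. So the per-update rewrite is bounded
by depth times node size, not by the number of descendants. Rough
arithmetic with Filipe's ~1.2 KB nodes and a 4-level tree: 4 x 1.2 KB, so
around 5 KB of rewritten index per updated document, whether the database
holds ten thousand docs or ten million.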
> 
> Am I missing something or is CouchDB really rewriting that much index
> information between document revisions on every update?
> 
> What was previously confusing me is I thought it was *only* rewriting a
> direct path to the updated revision, like [B]>[E]>[J'] and Couch was
> somehow patching in that updated path info to the B+ index at runtime.
> 
> If couch is rewriting entire node paths with all their elements then I am
> no longer confused about the B+ index updates, but am curious about the
> on-disk cost of this.
> 
> In my own rough insertion testing, that would explain why I see my
> collections absolutely explode in size until they are compacted (not using
> bulk insert, but intentionally doing single inserts for millions of
> docs to see what kind of cost the index path duplication would be like).
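
That is exactly it. Continuing the arithmetic above: a million individual
inserts, each appending on the order of 5 KB of fresh path nodes, is
several gigabytes of superseded index before compaction ever runs. It is
also why batching through _bulk_docs helps so much: one path rewrite gets
amortized across every document in the batch that shares it.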
> 
> Can anyone confirm/deny/correct this assessment? I want to make sure I am
> on the right track understanding this.
> 
> Best wishes,
> Riyad
> 
> On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <rk...@gmail.com> wrote:
> 
>> @Filipe - I was just not clear on how CouchDB operated; you and Robert
>> cleared that up for me. Thank you.
>> 
>> @Robert - The writeup is excellent so far (I am not familiar with Erlang,
>> so there is a bit of stickiness there), thank you for taking the time to
>> put this together!
>> 
>> At this point I am curious how the _id and _seq indices are read as their
>> data is continually appended to the end of the data file in small
>> diff-trees for every updated doc.
>> 
>> If CouchDB kept all the indices in-memory and simply patched-in the
>> updated paths at runtime (maybe something akin to memory-mapped indices in
>> MongoDB) I would be fairly clear on the operation... but as I understand
>> it, CouchDB keeps such a small memory footprint by doing no in-memory
>> caching and relying on the intelligence of the OS and filesystem (and/or
>> drives) to cache frequently accessed data.
>> 
>> I am trying to understand the logic used by CouchDB to answer a query
>> using the index once updates to the tree have been appended to the data
>> file... for example, consider a CouchDB datastore like the one Filipe
>> has... 10 million documents and let's say it is freshly compacted.
>> 
>> If I send in a request to that Couch instance, it hits the header of the
>> data file along with the index and walks the B+ tree to the leaf node,
>> where it finds the offset into the data file where the actual doc lives...
>> let's say 1,000,000 bytes away.
>> 
>> These B+ trees are shallow, so it might look something like this:
>> 
>> Level 1: 1 node, root node.
>> Level 2: 100 nodes, inner child nodes
>> Level 3: 10,000 nodes, inner child nodes
>> Level 4: 1,000,000 leaf nodes... all with pointers to the data offsets in
>> the data file.
>> 
>> Now let's say I write 10 updates to documents in that file. There are 10
>> new revisions appended to the end of the data file *each one* separated by
>> a rewritten B+ path to a leaf node with its new location at the end of the
>> file. Each of those paths written between each doc revision (say roughly 2k
>> like Filipe mentioned) are just 4 item paths... root -> level1 -> level2 ->
>> level3 --> level4... showing the discrete path from the root to that
>> individual updated doc. The intermediary levels (l1, 2, 3) are not fully
>> flushed out with all the OTHER children from the original b+ tree index.
>> 
>> [[ is this correct so far? If not, please point out my mistakes... ]]
>> 
>> Now I issue a query for a document that WAS NOT updated...
>> 
>> **** this is where I get confused on the logic ***
>> 
>> this would mean I need to access the original B+ tree index at the root of
>> the data file, because the revised B+ paths that are written between each
>> of the updated doc revisions at the end of the file are not full indices.
>> 
>> NOW consider I want to query for one of the changed docs... now I suddenly
>> need to scan backwards from the data file's end to find the updated path to
>> the new revision of that document.
>> 
>> (obviously) this isn't what Couch is actually doing... it's doing
>> something more elegant; I just can't figure out what or how, and that is
>> what I was hoping for help with.
>> 
>> Much thanks guys, I know this is a heavy question to ask.
>> 
>> Best wishes,
>> R
>> 
>> 
>> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne <
>> dionne@dionne-associates.com> wrote:
>> 
>>> 
>>> Robert Dionne
>>> Computer Programmer
>>> dionne@dionne-associates.com
>>> 203.231.9961
>>> 
>>> 
>>> 
>>> 
>>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
>>> 
>>>> Filipe,
>>>> 
>>>> Thank you for the reply.
>>>> 
>>>> Maybe I am misunderstanding exactly what couch is writing out; the docs
>>>> I've read say that it "rewrites the root node" -- I can't tell if the
>>> docs
>>>> mean the parent node of the child doc that was changed (as one of the b+
>>>> leaves) or if it means the direct path, from the root, to that
>>> individual
>>>> doc... or if it means the *entire* index...
>>>> 
>>>> In the case of even rewriting the single parent, with such a shallow
>>> tree,
>>>> each internal node will have a huge fan-out; let's say 1-10k children in a
>>>> decent-sized data set.
>>>> 
>>>> If you are seeing a few K of extra written out after each changed doc
>>> then
>>>> that cannot be right... I almost assumed my understanding was wrong
>>> because
>>>> the sheer volume of data would make performance abysmal if it was true.
>>>> 
>>>> Given that... is it just the changed path, from the root to the new leaf
>>>> that is rewritten?
>>> 
>>> Hi Riyad,
>>> 
>>> You are correct, it's only the changed path. Interestingly I've just
>>> started to document all these internals[1] along with links to the code and
>>> other references available.
>>> 
>>> Cheers,
>>> 
>>> Bob
>>> 
>>> 
>>> [1] http://bdionne.github.com/couchdb/
>>> 
>>>> That makes me all sorts of curious as to how Couch
>>>> updates/searches the new modified index with the small diff that is
>>> written
>>>> out.
>>>> 
>>>> Any pointers to reading that will help me dig down on this (even early
>>> bugs
>>>> in JIRA?) would be appreciated. I've tried skimming back in 2007/08 on
>>>> Damien's blog to see if he wrote about it in depth and so far haven't
>>> found
>>>> anything as detailed as I am hoping for on this architecture.
>>>> 
>>>> Best,
>>>> Riyad
>>>> 
>>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <
>>> fdmanana@apache.org>wrote:
>>>> 
>>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rk...@gmail.com> wrote:
>>>>>> I've been reading everything I can find on the CouchDB file format[1]
>>> and
>>>>>> am getting bits and pieces here and there, but not a great, concrete,
>>>>>> step-by-step explanation of the process.
>>>>>> 
>>>>>> I'm clear on the use of B+ trees and after reading a few papers on the
>>>>>> benefits of log-structured file formats, I understand the benefits of
>>>>>> inlining the B+ tree indices directly into the data file as well
>>>>> (locality
>>>>>> + sequential I/O)... what I'm flummoxed about is how much of the B+
>>>>> tree's
>>>>>> index is rewritten after every modified document.
>>>>>> 
>>>>>> Consider a CouchDB file that looks more or less like this:
>>>>>> 
>>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
>>>>>> 
>>>>>> After each revised doc is written and the "b-tree root" is rewritten
>>>>> after
>>>>>> that, is that just a modified root node of the B+ tree or the entire
>>> B+
>>>>>> tree?
>>>>>> 
>>>>>> The reason I ask is because regardless of the answer to my previous
>>>>>> question, for a *huge* database with millions of records, that seems
>>> like
>>>>>> an enormous amount of data to rewrite after every modification. Say
>>> the
>>>>>> root node had a fanning factor of 133; that would still be a lot of
>>> data
>>>>> to
>>>>>> rewrite.
>>>>> 
>>>>> Hi Riyad,
>>>>> 
>>>>> Have you observed that in practice?
>>>>> 
>>>>> Typically the depth of database btrees is not that high even for
>>>>> millions of documents. For example I have one around with about 10
>>>>> million documents which doesn't have more than 5 or 6 levels if I
>>>>> recall correctly.
>>>>> 
>>>>> So updating a doc, for that particular case, means rewriting 5 or 6
>>>>> new nodes plus the document itself. Each node is normally not much
>>>>> bigger than 1.2Kb.
>>>>> 
>>>>> I once wrote a tool to analyze database files which reports btree
>>>>> depths; however, it's not updated to work with recent changes on
>>>>> master/1.2.x such as snappy compression and btree sizes:
>>>>> 
>>>>> https://github.com/fdmanana/couchfoo
>>>>> 
>>>>> It should work with CouchDB 1.1 (and older) database files.
>>>>> 
>>>>>> 
>>>>>> I am certain I am missing the boat on this; if anyone can pull me out
>>> of
>>>>>> the water and point me to dry land I'd appreciate it.
>>>>>> 
>>>>>> Best,
>>>>>> R
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> [1]
>>>>>> --
>>>>>> 
>>>>> 
>>> http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
>>>>>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
>>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html
>>>>>> -- http://ayende.com/blog/* (Over my head)
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Filipe David Manana,
>>>>> 
>>>>> "Reasonable men adapt themselves to the world.
>>>>> Unreasonable men adapt the world to themselves.
>>>>> That's why all progress depends on unreasonable men."
>>>>> 
>>> 
>>> 
>> 


Re: Understanding the CouchDB file format

Posted by Riyad Kalla <rk...@gmail.com>.
Adding to this conversation, I found this set of slides by Chris explaining
the append-only index update format:
http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed

Specifically slides 16, 17 and 18.

Using this example tree, rewriting the updated path (in reverse order)
appended to the end of the file makes sense... you see how index queries
can simply read backwards from the end of the file and not only find the
latest revisions of docs, but also every other doc that wasn't touched (it
will just seek into the existing inner nodes of the b+ tree for searching).
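
To make that concrete for myself, here is a rough Python sketch of the
"scan backwards for the newest root" idea (the block size, header marker,
and layout here are hypothetical stand-ins, not CouchDB's actual on-disk
format):

    import io

    BLOCK = 4096          # hypothetical block size
    MAGIC = b"HDR1"       # hypothetical marker identifying a header block

    def find_latest_root(f):
        """Scan backwards from EOF for the newest header and return the
        root-node offset it records; in an append-only file, the last
        valid header always describes a complete, consistent tree."""
        f.seek(0, io.SEEK_END)
        pos = (f.tell() - 1) // BLOCK * BLOCK      # last block boundary
        while pos >= 0:
            f.seek(pos)
            block = f.read(BLOCK)
            if block.startswith(MAGIC):            # found a header block
                return int.from_bytes(block[4:12], "big")  # root offset
            pos -= BLOCK                           # torn tail write? step back
        raise IOError("no valid header found")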

What I am hoping for clarification on is the following pain-point that I
perceive with this approach:

1. In a sufficiently shallow B+ tree (like CouchDB), the paths themselves
to elements are short (typically no more than 3 to 5 levels deep) as
opposed to a trie or some other construct that would have much longer paths
to elements.

2. Because the depth of the tree is so shallow, the breadth of it becomes
large to compensate... more specifically, each internal node can have 100s,
1000s or more children. Using the example slides, consider the nodes
[A...M] and [R...Z] -- in a good sized CouchDB database, those internal
index nodes would have 100s (or more) elements in them pointing at deeper
internal nodes that themselves had thousands of elements; instead of the 13
or so as implied by [A...M].

3. Looking at slide 17 and 18, where you see the direct B+ tree path to the
update node getting appended to the end of the file after the revision is
written (leaf to root ordering: [J' M] -> [A M] -> [A Z]) it implies that
those internal nodes with *all* their child elements are getting rewritten
as well.

In this example tree, it isn't such a big issue... but in a sufficiently
large CouchDB database, these nodes denoted by [A...M] and [A...Z] could be
quite large... I don't know the format of the node elements in the B+ tree,
but it would be whatever the size of a node is times however many elements
are contained at each level (1 for root, say 100 for level 2, 1000 for
level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on
here, of course it depends on the size of the data store).
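
To put rough numbers on the two readings (back-of-the-envelope arithmetic
only, using the ~1.2 KB node size Filipe mentions below):

    node_size = 1200                  # ~1.2 KB per node
    fanout, depth = 100, 4            # hypothetical shallow tree

    # Reading 1: every node on every level is rewritten per update
    all_nodes = sum(fanout ** level for level in range(depth))
    print(all_nodes * node_size)      # ~1.2 GB per update -- impossible

    # Reading 2: only the one node per level on the changed path
    print(depth * node_size)          # ~4.8 KB per update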

Am I missing something or is CouchDB really rewriting that much index
information between document revisions on every update?

What was previously confusing me is I thought it was *only* rewriting a
direct path to the updated revision, like [B]>[E]>[J'] and Couch was
somehow patching in that updated path info to the B+ index at runtime.

If couch is rewriting entire node paths with all their elements then I am
no longer confused about the B+ index updates, but am curious about the
on-disk cost of this.

In my own rough insertion testing, that would explain why I see my
collections absolutely explode in size until they are compacted (not using
bulk insert, but intentionally doing single inserts for a million(s) of
docs to see what kind of cost the index path duplication would be like).
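
If the per-update cost really is a full root-to-leaf path, that growth is
easy to reproduce on paper (illustrative arithmetic only):

    docs = 1_000_000                   # single inserts, no _bulk_docs
    path_overhead = 4 * 1200           # ~4 path nodes x ~1.2 KB each
    print(docs * path_overhead / 1e9)  # ~4.8 GB of superseded index nodes,
                                       # reclaimed only by compaction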

Can anyone confirm/deny/correct this assessment? I want to make sure I am
on the right track understanding this.

Best wishes,
Riyad

On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <rk...@gmail.com> wrote:

> @Filipe - I was just not clear on how CouchDB operated; you and Robert
> cleared that up for me. Thank you.
>
> @Robert - The writeup is excellent so far (I am not familiar with Erlang,
> so there is a bit of stickiness there), thank you for taking the time to
> put this together!
>
> At this point I am curious how the _id and _seq indices are read as their
> data is continually appended to the end of the data file in small
> diff-trees for every updated doc.
>
> If CouchDB kept all the indices in-memory and simply patched-in the
> updated paths at runtime (maybe something akin to memory-mapped indices in
> MongoDB) I would be fairly clear on the operation... but as I understand
> it, CouchDB keeps such a small memory footprint by doing no in-memory
> caching and relying on the intelligence of the OS and filesystem (and/or
> drives) to cache frequently accessed data.
>
> I am trying to understand the logic used by CouchDB to answer a query
> using the index once updates to the tree have been appended to the data
> file... for example, consider a CouchDB datastore like the one Filipe
> has... 10 million documents and let's say it is freshly compacted.
>
> If I send in a request to that Couch instance, it hits the header of the
> data file along with the index and walks the B+ tree to the leaf node,
> where it finds the offset into the data file where the actual doc lives...
> let's say 1,000,000 bytes away.
>
> These B+ trees are shallow, so it might look something like this:
>
> Level 1: 1 node, root node.
> Level 2: 100 nodes, inner child nodes
> Level 3: 10,000 nodes, inner child nodes
> Level 4: 1,000,000 leaf nodes... all with pointers to the data offsets in
> the data file.
>
> Now let's say I write 10 updates to documents in that file. There are 10
> new revisions appended to the end of the data file *each one* separated by
> a rewritten B+ path to a leaf node with its new location at the end of the
> file. Each of those paths written between each doc revision (say roughly 2k
> like Filipe mentioned) are just 4 item paths... root -> level1 -> level2 ->
> level3 --> level4... showing the discrete path from the root to that
> individual updated doc. The intermediary levels (l1, 2, 3) are not fully
> flushed out with all the OTHER children from the original b+ tree index.
>
> [[ is this correct so far? If not, please point out my mistakes... ]]
>
> Now I issue a query for a document that WAS NOT updated...
>
> **** this is where I get confused on the logic ***
>
> this would mean I need to access the original B+ tree index at the root of
> the data file, because the revised B+ paths that are written between each
> of the updated doc revisions at the end of the file are not full indices.
>
> NOW consider I want to query for one of the changed docs... now I suddenly
> need to scan backwards from the data file's end to find the updated path to
> the new revision of that document.
>
> (obviously) this isn't what Couch is actually doing... it's doing
> something more elegant; I just can't figure out what or how, and that is
> what I was hoping for help with.
>
> Much thanks guys, I know this is a heavy question to ask.
>
> Best wishes,
> R
>
>
> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne <
> dionne@dionne-associates.com> wrote:
>
>>
>> Robert Dionne
>> Computer Programmer
>> dionne@dionne-associates.com
>> 203.231.9961
>>
>>
>>
>>
>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
>>
>> > Filipe,
>> >
>> > Thank you for the reply.
>> >
>> > Maybe I am misunderstanding exactly what couch is writing out; the docs
>> > I've read say that it "rewrites the root node" -- I can't tell if the
>> docs
>> > mean the parent node of the child doc that was changed (as one of the b+
>> > leaves) or if it means the direct path, from the root, to that
>> individual
>> > doc... or if it means the *entire* index...
>> >
>> > In the case of even rewriting the single parent, with such a shallow
>> tree,
>> > each internal node will have a huge fan-out; let's say 1-10k children in a
>> > decent-sized data set.
>> >
>> > If you are seeing a few K of extra written out after each changed doc
>> then
>> > that cannot be right... I almost assumed my understanding was wrong
>> because
>> > the sheer volume of data would make performance abysmal if it was true.
>> >
>> > Given that... is it just the changed path, from the root to the new leaf
>> > that is rewritten?
>>
>> Hi Riyad,
>>
>> You are correct, it's only the changed path. Interestingly I've just
>> started to document all these internals[1] along with links to the code and
>> other references available.
>>
>> Cheers,
>>
>> Bob
>>
>>
>> [1] http://bdionne.github.com/couchdb/
>>
>> > That makes me all sorts of curious as to how Couch
>> > updates/searches the new modified index with the small diff that is
>> written
>> > out.
>> >
>> > Any pointers to reading that will help me dig down on this (even early
>> bugs
>> > in JIRA?) would be appreciated. I've tried skimming back in 2007/08 on
>> > Damien's blog to see if he wrote about it in depth and so far haven't
>> found
>> > anything as detailed as I am hoping for on this architecture.
>> >
>> > Best,
>> > Riyad
>> >
>> > On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <
>> fdmanana@apache.org>wrote:
>> >
>> >> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rk...@gmail.com> wrote:
>> >>> I've been reading everything I can find on the CouchDB file format[1]
>> and
>> >>> am getting bits and pieces here and there, but not a great, concrete,
>> >>> step-by-step explanation of the process.
>> >>>
>> >>> I'm clear on the use of B+ trees and after reading a few papers on the
>> >>> benefits of log-structured file formats, I understand the benefits of
>> >>> inlining the B+ tree indices directly into the data file as well
>> >> (locality
>> >>> + sequential I/O)... what I'm flummoxed about is how much of the B+
>> >> tree's
>> >>> index is rewritten after every modified document.
>> >>>
>> >>> Consider a CouchDB file that looks more or less like this:
>> >>>
>> >>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
>> >>>
>> >>> After each revised doc is written and the "b-tree root" is rewritten
>> >> after
>> >>> that, is that just a modified root node of the B+ tree or the entire
>> B+
>> >>> tree?
>> >>>
>> >>> The reason I ask is because regardless of the answer to my previous
>> >>> question, for a *huge* database with millions of records, that seems
>> like
>> >>> an enormous amount of data to rewrite after every modification. Say
>> the
>> >>> root node had a fanning factor of 133; that would still be a lot of
>> data
>> >> to
>> >>> rewrite.
>> >>
>> >> Hi Riyad,
>> >>
>> >> Have you observed that in practice?
>> >>
>> >> Typically the depth of database btrees is not that high even for
>> >> millions of documents. For example I have one around with about 10
>> >> million documents which doesn't have more than 5 or 6 levels if I
>> >> recall correctly.
>> >>
>> >> So updating a doc, for that particular case, means rewriting 5 or 6
>> >> new nodes plus the document itself. Each node is normally not much
>> >> bigger than 1.2Kb.
>> >>
>> >> I once wrote a tool to analyze database files which reports btree
>> >> depths; however, it's not updated to work with recent changes on
>> >> master/1.2.x such as snappy compression and btree sizes:
>> >>
>> >> https://github.com/fdmanana/couchfoo
>> >>
>> >> It should work with CouchDB 1.1 (and older) database files.
>> >>
>> >>>
>> >>> I am certain I am missing the boat on this; if anyone can pull me out
>> of
>> >>> the water and point me to dry land I'd appreciate it.
>> >>>
>> >>> Best,
>> >>> R
>> >>>
>> >>>
>> >>>
>> >>> [1]
>> >>> --
>> >>>
>> >>
>> http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
>> >>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>> >>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
>> >>> -- http://guide.couchdb.org/editions/1/en/btree.html
>> >>> -- http://ayende.com/blog/* (Over my head)
>> >>
>> >>
>> >>
>> >> --
>> >> Filipe David Manana,
>> >>
>> >> "Reasonable men adapt themselves to the world.
>> >> Unreasonable men adapt the world to themselves.
>> >> That's why all progress depends on unreasonable men."
>> >>
>>
>>
>

Re: Understanding the CouchDB file format

Posted by Riyad Kalla <rk...@gmail.com>.
@Filipe - I was just not clear on how CouchDB operated; you and Robert
cleared that up for me. Thank you.

@Robert - The writeup is excellent so far (I am not familiar with Erlang,
so there is a bit of stickiness there), thank you for taking the time to
put this together!

At this point I am curious how the _id and _seq indices are read as their
data is continually appended to the end of the data file in small
diff-trees for every updated doc.

If CouchDB kept all the indices in-memory and simply patched-in the updated
paths at runtime (maybe something akin to memory-mapped indices in MongoDB)
I would be fairly clear on the operation... but as I understand it, CouchDB
keeps such a small memory footprint by doing no in-memory caching and
relying on the intelligence of the OS and filesystem (and/or drives) to
cache frequently accessed data.

I am trying to understand the logic used by CouchDB to answer a query using
the index once updates to the tree have been appended to the data file...
for example, consider a CouchDB datastore like the one Filipe has... 10
million documents and let's say it is freshly compacted.

If I send in a request to that Couch instance, it hits the header of the
data file along with the index and walks the B+ tree to the leaf node,
where it finds the offset into the data file where the actual doc lives...
let's say 1,000,000 bytes away.

These B+ trees are shallow, so it might look something like this:

Level 1: 1 node, root node.
Level 2: 100 nodes, inner child nodes
Level 3: 10,000 nodes, inner child nodes
Level 4: 1,000,000 leaf nodes... all with pointers to the data offsets in
the data file.
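
As a mental model, that root-to-leaf walk might look like the Python
sketch below (hypothetical structures for illustration; the real logic is
the Erlang in couch_btree.erl):

    import bisect

    class Node:
        def __init__(self, keys, children, is_leaf=False):
            self.keys = keys          # sorted keys / key separators
            self.children = children  # child Nodes, or doc offsets at leaves
            self.is_leaf = is_leaf

    def lookup(root, key):
        """Walk root -> leaf and return the doc's offset in the data file."""
        node = root
        while not node.is_leaf:
            i = bisect.bisect_right(node.keys, key)
            node = node.children[i]   # descend exactly one level
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.children[i]   # file offset of the document
        raise KeyError(key)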

Now let's say I write 10 updates to documents in that file. There are 10
new revisions appended to the end of the data file *each one* separated by
a rewritten B+ path to a leaf node with its new location at the end of the
file. Each of those paths written between each doc revision (say roughly 2k
like Filipe mentioned) are just 4 item paths... root -> level1 -> level2 ->
level3 --> level4... showing the discrete path from the root to that
individual updated doc. The intermediary levels (l1, 2, 3) are not fully
flushed out with all the OTHER children from the original b+ tree index.
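
Schematically, sticking with the earlier bracket notation, the file after
one such update would look something like this (one updated doc shown;
whether a header lands after every single write or only at each commit
point is a commit-interval detail I'm unsure of):

    [header][ ...compacted tree + docs... ][doc', rev2][l4'][l3'][l2'][l1'/root'][header']

Each update appends the new revision, then the handful of rewritten path
nodes from the leaf back up to a brand-new root, then a header pointing
at that root.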

[[ is this correct so far? If not, please point out my mistakes... ]]

Now I issue a query for a document that WAS NOT updated...

**** this is where I get confused on the logic ***

this would mean I need to access the original B+ tree index at the root of
the data file, because the revised B+ paths that are written between each
of the updated doc revisions at the end of the file are not full indices.

NOW consider I want to query for one of the changed docs... now I suddenly
need to scan backwards from the data file's end to find the updated path to
the new revision of that document.

(obviously) this isn't what Couch is actually doing... it's doing something
more elegant; I just can't figure out what or how, and that is what I was
hoping for help with.

Much thanks guys, I know this is a heavy question to ask.

Best wishes,
R

On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne <dionne@dionne-associates.com
> wrote:

>
> Robert Dionne
> Computer Programmer
> dionne@dionne-associates.com
> 203.231.9961
>
>
>
>
> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
>
> > Filipe,
> >
> > Thank you for the reply.
> >
> > Maybe I am misunderstanding exactly what couch is writing out; the docs
> > I've read say that it "rewrites the root node" -- I can't tell if the
> docs
> > mean the parent node of the child doc that was changed (as one of the b+
> > leaves) or if it means the direct path, from the root, to that individual
> > doc... or if it means the *entire* index...
> >
> > In the case of even rewriting the single parent, with such a shallow
> tree,
> > each internal node will have a huge fan-out; let's say 1-10k children in a
> > decent-sized data set.
> >
> > If you are seeing a few K of extra written out after each changed doc
> then
> > that cannot be right... I almost assumed my understanding was wrong
> because
> > the sheer volume of data would make performance abysmal if it was true.
> >
> > Given that... is it just the changed path, from the root to the new leaf
> > that is rewritten?
>
> Hi Riyad,
>
> You are correct, it's only the changed path. Interestingly I've just
> started to document all these internals[1] along with links to the code and
> other references available.
>
> Cheers,
>
> Bob
>
>
> [1] http://bdionne.github.com/couchdb/
>
> > That makes me all sorts of curious as to how Couch
> > updates/searches the new modified index with the small diff that is
> written
> > out.
> >
> > Any pointers to reading that will help me dig down on this (even early
> bugs
> > in JIRA?) would be appreciated. I've tried skimming back in 2007/08 on
> > Damien's blog to see if he wrote about it in depth and so far haven't
> found
> > anything as detailed as I am hoping for on this architecture.
> >
> > Best,
> > Riyad
> >
> > On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <
> fdmanana@apache.org>wrote:
> >
> >> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rk...@gmail.com> wrote:
> >>> I've been reading everything I can find on the CouchDB file format[1]
> and
> >>> am getting bits and pieces here and there, but not a great, concrete,
> >>> step-by-step explanation of the process.
> >>>
> >>> I'm clear on the use of B+ trees and after reading a few papers on the
> >>> benefits of log-structured file formats, I understand the benefits of
> >>> inlining the B+ tree indices directly into the data file as well
> >> (locality
> >>> + sequential I/O)... what I'm flummoxed about is how much of the B+
> >> tree's
> >>> index is rewritten after every modified document.
> >>>
> >>> Consider a CouchDB file that looks more or less like this:
> >>>
> >>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
> >>>
> >>> After each revised doc is written and the "b-tree root" is rewritten
> >> after
> >>> that, is that just a modified root node of the B+ tree or the entire B+
> >>> tree?
> >>>
> >>> The reason I ask is because regardless of the answer to my previous
> >>> question, for a *huge* database with millions of records, that seems
> like
> >>> an enormous amount of data to rewrite after every modification. Say the
> >>> root node had a fanning factor of 133; that would still be a lot of data
> >> to
> >>> rewrite.
> >>
> >> Hi Riyad,
> >>
> >> Have you observed that in practice?
> >>
> >> Typically the depth of database btrees is not that high even for
> >> millions of documents. For example I have one around with about 10
> >> million documents which doesn't have more than 5 or 6 levels if I
> >> recall correctly.
> >>
> >> So updating a doc, for that particular case, means rewriting 5 or 6
> >> new nodes plus the document itself. Each node is normally not much
> >> bigger than 1.2Kb.
> >>
> >> I once wrote a tool to analyze database files which reports btree
> >> depths; however, it's not updated to work with recent changes on
> >> master/1.2.x such as snappy compression and btree sizes:
> >>
> >> https://github.com/fdmanana/couchfoo
> >>
> >> It should work with CouchDB 1.1 (and older) database files.
> >>
> >>>
> >>> I am certain I am missing the boat on this; if anyone can pull me out
> of
> >>> the water and point me to dry land I'd appreciate it.
> >>>
> >>> Best,
> >>> R
> >>>
> >>>
> >>>
> >>> [1]
> >>> --
> >>>
> >>
> http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
> >>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
> >>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
> >>> -- http://guide.couchdb.org/editions/1/en/btree.html
> >>> -- http://ayende.com/blog/* (Over my head)
> >>
> >>
> >>
> >> --
> >> Filipe David Manana,
> >>
> >> "Reasonable men adapt themselves to the world.
> >> Unreasonable men adapt the world to themselves.
> >> That's why all progress depends on unreasonable men."
> >>
>
>

Re: Understanding the CouchDB file format

Posted by Robert Dionne <di...@dionne-associates.com>.
Robert Dionne
Computer Programmer
dionne@dionne-associates.com
203.231.9961




On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:

> Filipe,
> 
> Thank you for the reply.
> 
> Maybe I am misunderstanding exactly what couch is writing out; the docs
> I've read say that it "rewrites the root node" -- I can't tell if the docs
> mean the parent node of the child doc that was changed (as one of the b+
> leaves) or if it means the direct path, from the root, to that individual
> doc... or if it means the *entire* index...
> 
> In the case of even rewriting the single parent, with such a shallow tree,
> each internal node will have a huge fan-out; let's say 1-10k children in a
> decent-sized data set.
> 
> If you are seeing a few K of extra written out after each changed doc then
> that cannot be right... I almost assumed my understanding was wrong because
> the sheer volume of data would make performance abysmal if it was true.
> 
> Given that... is it just the changed path, from the root to the new leaf
> that is rewritten?

Hi Riyad,

You are correct, it's only the changed path. Interestingly I've just started to document all these internals[1] along with links to the code and other references available. 

Cheers,

Bob


[1] http://bdionne.github.com/couchdb/

> That makes me all sorts of curious as to how Couch
> updates/searches the new modified index with the small diff that is written
> out.
> 
> Any pointers to reading that will help me dig down on this (even early bugs
> in JIRA?) would be appreciated. I've tried skimming back in 2007/08 on
> Damien's blog to see if he wrote about it in depth and so far haven't found
> anything as detailed as I am hoping for on this architecture.
> 
> Best,
> Riyad
> 
> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <fd...@apache.org>wrote:
> 
>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rk...@gmail.com> wrote:
>>> I've been reading everything I can find on the CouchDB file format[1] and
>>> am getting bits and pieces here and there, but not a great, concrete,
>>> step-by-step explanation of the process.
>>> 
>>> I'm clear on the use of B+ trees and after reading a few papers on the
>>> benefits of log-structured file formats, I understand the benefits of
>>> inlining the B+ tree indices directly into the data file as well
>> (locality
>>> + sequential I/O)... what I'm flummoxed about is how much of the B+
>> tree's
>>> index is rewritten after every modified document.
>>> 
>>> Consider a CouchDB file that looks more or less like this:
>>> 
>>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
>>> 
>>> After each revised doc is written and the "b-tree root" is rewritten
>> after
>>> that, is that just a modified root node of the B+ tree or the entire B+
>>> tree?
>>> 
>>> The reason I ask is because regardless of the answer to my previous
> >>> question, for a *huge* database with millions of records, that seems like
>>> an enormous amount of data to rewrite after every modification. Say the
> >>> root node had a fanning factor of 133; that would still be a lot of data
>> to
>>> rewrite.
>> 
>> Hi Riyad,
>> 
>> Have you observed that in practice?
>> 
>> Typically the depth of database btrees is not that high even for
>> millions of documents. For example I have one around with about 10
>> million documents which doesn't have more than 5 or 6 levels if I
>> recall correctly.
>> 
>> So updating a doc, for that particular case, means rewriting 5 or 6
>> new nodes plus the document itself. Each node is normally not much
>> bigger than 1.2Kb.
>> 
>> I once wrote a tool to analyze database files which reports btree
>> depths; however, it's not updated to work with recent changes on
>> master/1.2.x such as snappy compression and btree sizes:
>> 
>> https://github.com/fdmanana/couchfoo
>> 
>> It should work with CouchDB 1.1 (and older) database files.
>> 
>>> 
>>> I am certain I am missing the boat on this; if anyone can pull me out of
>>> the water and point me to dry land I'd appreciate it.
>>> 
>>> Best,
>>> R
>>> 
>>> 
>>> 
>>> [1]
>>> --
>>> 
>> http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
>>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
>>> -- http://guide.couchdb.org/editions/1/en/btree.html
>>> -- http://ayende.com/blog/* (Over my head)
>> 
>> 
>> 
>> --
>> Filipe David Manana,
>> 
>> "Reasonable men adapt themselves to the world.
>> Unreasonable men adapt the world to themselves.
>> That's why all progress depends on unreasonable men."
>> 


Re: Understanding the CouchDB file format

Posted by Filipe David Manana <fd...@apache.org>.
On Tue, Dec 20, 2011 at 8:27 PM, Riyad Kalla <rk...@gmail.com> wrote:
> Filipe,
>
> Thank you for the reply.
>
> Maybe I am misunderstanding exactly what couch is writing out; the docs
> I've read say that it "rewrites the root node" -- I can't tell if the docs
> mean the parent node of the child doc that was changed (as one of the b+
> leaves) or if it means the direct path, from the root, to that individual
> doc... or if it means the *entire* index...
>
> In the case of even rewriting the single parent, with such a shallow tree,
> each internal node will have a huge fan-out; let's say 1-10k children in a
> decent-sized data set.

The leaf nodes don't contain the documents, but rather pointers (file
offsets) to where the documents are stored in the file.
Non-leaf nodes contain a list of pointers to child nodes.

Updating a document means changing a value in its leaf node, writing
the modified version to disk, then updating its parent node to refer to
the offset of the new leaf node, then the parent of the parent, and so
on until you write a new root node.

Maybe I misunderstood your question, but that doesn't seem like a lot to me
(remember each node is typically about 1.2K).

However a doc update implies updating 2 btrees (seq and id trees).
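
In pseudo-Python, the whole update is a copy-on-write walk back up the
tree (a sketch under assumed structures, not the actual Erlang in
couch_btree.erl):

    import bisect
    from collections import namedtuple

    # hypothetical in-memory node shape, for illustration only
    Node = namedtuple("Node", "keys children is_leaf")

    def cow_update(node, key, value, append):
        """Return a new node for the changed path; subtrees off the path
        are shared with the old tree, never rewritten."""
        if node.is_leaf:
            new = Node(list(node.keys), list(node.children), True)
            i = bisect.bisect_left(new.keys, key)    # assumes key exists
            new.children[i] = value                  # new doc's file offset
        else:
            i = bisect.bisect_right(node.keys, key)  # pick the one child
            new = Node(list(node.keys), list(node.children), False)
            new.children[i] = cow_update(node.children[i], key, value, append)
        append(new)    # leaf hits the file first, then parents, then root
        return new     # the final return value is the brand-new root

Run once against the id tree and once against the seq tree, that is the
whole per-update rewrite.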

>
> If you are seeing a few K of extra written out after each changed doc then
> that cannot be right... I almost assumed my understanding was wrong because
> the sheer volume of data would make performance abysmal if it was true.
>
> Given that... is it just the changed path, from the root to the new leaf
> that is rewritten? That makes me all sorts of curious as to how Couch
> updates/searches the new modified index with the small diff that is written
> out.
>
> Any pointers to reading that will help me dig down on this (even early bugs
> in JIRA?) would be appreciated. I've tried skimming back in 2007/08 on
> Damien's blog to see if he wrote about it in depth and so far haven't found
> anything as detailed as I am hoping for on this architecture.

Just read couch_btree.erl, couch_file.erl and couch_db_updater.erl.

>
> Best,
> Riyad
>
> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <fd...@apache.org>wrote:
>
>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rk...@gmail.com> wrote:
>> > I've been reading everything I can find on the CouchDB file format[1] and
>> > am getting bits and pieces here and there, but not a great, concrete,
>> > step-by-step explanation of the process.
>> >
>> > I'm clear on the use of B+ trees and after reading a few papers on the
>> > benefits of log-structured file formats, I understand the benefits of
>> > inlining the B+ tree indices directly into the data file as well
>> (locality
>> > + sequential I/O)... what I'm flummoxed about is how much of the B+
>> tree's
>> > index is rewritten after every modified document.
>> >
>> > Consider a CouchDB file that looks more or less like this:
>> >
>> > [idx/header][doc1, rev1][idx/header][doc1, rev2]....
>> >
>> > After each revised doc is written and the "b-tree root" is rewritten
>> after
>> > that, is that just a modified root node of the B+ tree or the entire B+
>> > tree?
>> >
>> > The reason I ask is because regardless of the answer to my previous
>> > question, for a *huge* database with millions of records, that seems like
>> > an enormous amount of data to rewrite after every modification. Say the
>> > root node had a fanning factor of 133; that would still be a lot of data
>> to
>> > rewrite.
>>
>> Hi Riyad,
>>
>> Have you observed that in practice?
>>
>> Typically the depth of database btrees is not that high even for
>> millions of documents. For example I have one around with about 10
>> million documents which doesn't have more than 5 or 6 levels if I
>> recall correctly.
>>
>> So updating a doc, for that particular case, means rewriting 5 or 6
>> new nodes plus the document itself. Each node is normally not much
>> bigger than 1.2Kb.
>>
> >> I once wrote a tool to analyze database files which reports btree
> >> depths; however, it's not updated to work with recent changes on
>> master/1.2.x such as snappy compression and btree sizes:
>>
>> https://github.com/fdmanana/couchfoo
>>
>> It should work with CouchDB 1.1 (and older) database files.
>>
>> >
>> > I am certain I am missing the boat on this; if anyone can pull me out of
>> > the water and point me to dry land I'd appreciate it.
>> >
>> > Best,
>> > R
>> >
>> >
>> >
>> > [1]
>> > --
>> >
>> http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
>> > -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>> > -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
>> > -- http://guide.couchdb.org/editions/1/en/btree.html
>> > -- http://ayende.com/blog/* (Over my head)
>>
>>
>>
>> --
>> Filipe David Manana,
>>
>> "Reasonable men adapt themselves to the world.
>>  Unreasonable men adapt the world to themselves.
>>  That's why all progress depends on unreasonable men."
>>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

Re: Understanding the CouchDB file format

Posted by Riyad Kalla <rk...@gmail.com>.
Filipe,

Thank you for the reply.

Maybe I am misunderstanding exactly what couch is writing out; the docs
I've read say that it "rewrites the root node" -- I can't tell if the docs
mean the parent node of the child doc that was changed (as one of the b+
leaves) or if it means the direct path, from the root, to that individual
doc... or if it means the *entire* index...

In the case of even rewriting the single parent, with such a shallow tree,
each internal node will have a huge fan-out; let's say 1-10k children in a
decent-sized data set.

If you are seeing a few K of extra written out after each changed doc then
that cannot be right... I almost assumed my understanding was wrong because
the sheer volume of data would make performance abysmal if it was true.

Given that... is it just the changed path, from the root to the new leaf
that is rewritten? That makes me all sorts of curious as to how Couch
updates/searches the new modified index with the small diff that is written
out.

Any pointers to reading that will help me dig down on this (even early bugs
in JIRA?) would be appreciated. I've tried skimming back in 2007/08 on
Damien's blog to see if he wrote about it in depth and so far haven't found
anything as detailed as I am hoping for on this architecture.

Best,
Riyad

On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <fd...@apache.org>wrote:

> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rk...@gmail.com> wrote:
> > I've been reading everything I can find on the CouchDB file format[1] and
> > am getting bits and pieces here and there, but not a great, concrete,
> > step-by-step explanation of the process.
> >
> > I'm clear on the use of B+ trees and after reading a few papers on the
> > benefits of log-structured file formats, I understand the benefits of
> > inlining the B+ tree indices directly into the data file as well
> (locality
> > + sequential I/O)... what I'm flummoxed about is how much of the B+
> tree's
> > index is rewritten after every modified document.
> >
> > Consider a CouchDB file that looks more or less like this:
> >
> > [idx/header][doc1, rev1][idx/header][doc1, rev2]....
> >
> > After each revised doc is written and the "b-tree root" is rewritten
> after
> > that, is that just a modified root node of the B+ tree or the entire B+
> > tree?
> >
> > The reason I ask is because regardless of the answer to my previous
> > question, for a *huge* database with millions of records, that seems like
> > an enormous amount of data to rewrite after every modification. Say the
> > root node had a fanning factor of 133; that would still be a lot of data
> to
> > rewrite.
>
> Hi Riyad,
>
> Have you observed that in practice?
>
> Typically the depth of database btrees is not that high even for
> millions of documents. For example I have one around with about 10
> million documents which doesn't have more than 5 or 6 levels if I
> recall correctly.
>
> So updating a doc, for that particular case, means rewriting 5 or 6
> new nodes plus the document itself. Each node is normally not much
> bigger than 1.2Kb.
>
> I once wrote a tool to analyze database files which reports btree
> depths; however, it's not updated to work with recent changes on
> master/1.2.x such as snappy compression and btree sizes:
>
> https://github.com/fdmanana/couchfoo
>
> It should work with CouchDB 1.1 (and older) database files.
>
> >
> > I am certain I am missing the boat on this; if anyone can pull me out of
> > the water and point me to dry land I'd appreciate it.
> >
> > Best,
> > R
> >
> >
> >
> > [1]
> > --
> >
> http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
> > -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
> > -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
> > -- http://guide.couchdb.org/editions/1/en/btree.html
> > -- http://ayende.com/blog/* (Over my head)
>
>
>
> --
> Filipe David Manana,
>
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."
>

Re: Understanding the CouchDB file format

Posted by Filipe David Manana <fd...@apache.org>.
On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rk...@gmail.com> wrote:
> I've been reading everything I can find on the CouchDB file format[1] and
> am getting bits and pieces here and there, but not a great, concrete,
> step-by-step explanation of the process.
>
> I'm clear on the use of B+ trees and after reading a few papers on the
> benefits of log-structured file formats, I understand the benefits of
> inlining the B+ tree indices directly into the data file as well (locality
> + sequential I/O)... what I'm flummoxed about is how much of the B+ tree's
> index is rewritten after every modified document.
>
> Consider a CouchDB file that looks more or less like this:
>
> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
>
> After each revised doc is written and the "b-tree root" is rewritten after
> that, is that just a modified root node of the B+ tree or the entire B+
> tree?
>
> The reason I ask is because regardless of the answer to my previous
> question, for a *huge* database with millions of records, that seems like
> an enormous amount of data to rewrite after every modification. Say the
> root node had a fanning factor of 133; that would still be a lot of data to
> rewrite.

Hi Riyad,

Have you observed that in practice?

Typically the depth of database btrees is not that high even for
millions of documents. For example I have one around with about 10
million documents which doesn't have more than 5 or 6 levels if I
recall correctly.

So updating a doc, for that particular case, means rewriting 5 or 6
new nodes plus the document itself. Each node is normally not much
bigger than 1.2Kb.
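
Rough arithmetic for that case:

    levels, node_size = 6, 1200
    print(levels * node_size)   # ~7.2 KB of new index nodes per update,
                                # per btree (an update touches both the
                                # by-id and by-seq trees)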

I once wrote a tool to analyze database files which reports btree
depths; however, it's not updated to work with recent changes on
master/1.2.x such as snappy compression and btree sizes:

https://github.com/fdmanana/couchfoo

It should work with CouchDB 1.1 (and older) database files.

>
> I am certain I am missing the boat on this; if anyone can pull me out of
> the water and point me to dry land I'd appreciate it.
>
> Best,
> R
>
>
>
> [1]
> --
> http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
> -- http://guide.couchdb.org/editions/1/en/btree.html
> -- http://ayende.com/blog/* (Over my head)



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."