You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@couchdb.apache.org by Andreas Pavlogiannis <pa...@gmail.com> on 2009/11/16 04:14:19 UTC

Storing Hierarchical Data

Greetings,

I recently started exploring the capabilities of couchdb and although I 
find it really interesting and flexible, I am experiencing some 
difficulties:

Is there any recommended way to store hierarchical data? Consider for 
example the case of a file system with multiple directories. I can think 
of some possible scenarios each with different capabilities and limitations:
    *   Each file and each folder is represented by a single document, 
with each folder document containing a "contents" list that has the ids 
of the subdocuments under the specific folder (the usual tree 
structure). In this case, deleting a file would require updating more 
than one document (the file for deletion and the parent folder for the 
"contents" attribute) which seems dangerous considering the absence of 
transactional operations (what about deleting a whole folder?). 
Moreover, accessing the file "foo/bar/cow" would require a conventional 
pathname translation which adds overhead (cut the pathname in chunks, 
request the "foo" folder, retrieve the ids of its contents, find which 
one corresponds to the "bar" folder  etc..)
    *    Each file and each folder is represented by a single document, 
with each file having an attribute "parent id" that contains the id of 
its parent folder(reverse tree structure). In this case deleting the 
file requires only one operation and seems more robust.   However 
pathname translation gets fuzzy and seems to add a lot of overhead 
(retrieve id of folder, find documents having this "parent id" 
attribute, find the one you want among them...)
     *   Each file is represented by a single document that has a "path" 
attribute that indicates the directory that is being stored to. This 
gives the advantage of avoiding conventional pathname translation and 
retrieving the correct document immediately. However, operations such as 
renaming a folder require updating many documents and should be avoided.
    * Keep the whole file system in a single document. Ouch!

I am aware of the bulk update technique with the "all or nothing" 
attribute, but it is to my understanding that it should be avoided, 
especially when dealing with clustering and replication. In addition, 
things seem to get more obscure when considering file sharing 
possibilities between the users of the file system.

I would be glad if you could provide me some pointers on how to 
circumvent the disadvantages of each of the methods above.

In general, do you thing that since dealing with documents is so 
flexible and provided the absence of transactional operations one should 
try to organize his data as decoupled as possible?

Thank you for your time ,

Andreas

Re: Storing Hierarchical Data

Posted by Mark Hammond <sk...@gmail.com>.

On 16/11/2009 2:14 PM, Andreas Pavlogiannis wrote:
> * Each file is represented by a single document that has a "path"
> attribute that indicates the directory that is being stored to. This
> gives the advantage of avoiding conventional pathname translation and
> retrieving the correct document immediately.

This sounds like the sanest approach.  Storing the path as a list would 
give a nice interface to the view engine for locating all items in a 
"folder", etc.

> However, operations such as
> renaming a folder require updating many documents and should be avoided.

This just might have to be the price you pay.  A "rename" operation 
can't really be made atomic, especially in the face of clustering, 
replication etc.

HTH,

Mark

Re: Storing Hierarchical Data

Posted by Adam Wolff <aw...@gmail.com>.

Ok, for some apps it's a no-go. If this is a
highly concurrent server app, you'll orphan data if you start two
rename updates at the same time,

a

On Monday, November 16, 2009, Jan Lehnardt <ja...@apache.org> wrote:
>
> On 16 Nov 2009, at 05:28, Adam Wolff wrote:
>
>> There isn't a great way to store hierarchical data in couch. If you want to
>> actually move stuff around, the full pathname is a no-go, since there are no
>> bulk updates. The only other trick here, if you have meaningful roots or
>> branch points, is to store a reference to those in addition to the specific
>> parent node in the graph.
>
> It is not a no-go, renames just can't be atomic :)
>
> Cheers
> Jan
> --
>
>>
>> In any case, it seems better to me to store references from child to parent,
>> rather than the other way around. The child document makes a more natural
>> concurrency boundary.
>>
>>
>> A
>>
>>
>> On Sun, Nov 15, 2009 at 7:14 PM, Andreas Pavlogiannis <
>> paulogiann.couchdb@gmail.com> wrote:
>>
>>> Greetings,
>>>
>>> I recently started exploring the capabilities of couchdb and although I
>>> find it really interesting and flexible, I am experiencing some
>>> difficulties:
>>>
>>> Is there any recommended way to store hierarchical data? Consider for
>>> example the case of a file system with multiple directories. I can think of
>>> some possible scenarios each with different capabilities and limitations:
>>>  *   Each file and each folder is represented by a single document, with
>>> each folder document containing a "contents" list that has the ids of the
>>> subdocuments under the specific folder (the usual tree structure). In this
>>> case, deleting a file would require updating more than one document (the
>>> file for deletion and the parent folder for the "contents" attribute) which
>>> seems dangerous considering the absence of transactional operations (what
>>> about deleting a whole folder?). Moreover, accessing the file "foo/bar/cow"
>>> would require a conventional pathname translation which adds overhead (cut
>>> the pathname in chunks, request the "foo" folder, retrieve the ids of its
>>> contents, find which one corresponds to the "bar" folder  etc..)
>>>  *    Each file and each folder is represented by a single document, with
>>> each file having an attribute "parent id" that contains the id of its parent
>>> folder(reverse tree structure). In this case deleting the file requires only
>>> one operation and seems more robust.   However pathname translation gets
>>> fuzzy and seems to add a lot of overhead (retrieve id of folder, find
>>> documents having this "parent id" attribute, find the one you want among
>>> them...)
>>>   *   Each file is represented by a single document that has a "path"
>>> attribute that indicates the directory that is being stored to. This gives
>>> the advantage of avoiding conventional pathname translation and retrieving
>>> the correct document immediately. However, operations such as renaming a
>>> folder require updating many documents and should be avoided.
>>>  * Keep the whole file system in a single document. Ouch!
>>>
>>> I am aware of the bulk update technique with the "all or nothing"
>>> attribute, but it is to my understanding that it should be avoided,
>>> especially when dealing with clustering and replication. In addition, things
>>> seem to get more obscure when considering file sharing possibilities between
>>> the users of the file system.
>>>
>>> I would be glad if you could provide me some pointers on how to circumvent
>>> the disadvantages of each of the methods above.
>>>
>>> In general, do you thing that since dealing with documents is so flexible
>>> and provided the absence of transactional operations one should try to
>>> organize his data as decoupled as possible?
>>>
>>> Thank you for your time ,
>>>
>>> Andreas
>>>
>
>

Re: Storing Hierarchical Data

Posted by Jan Lehnardt <ja...@apache.org>.

On 16 Nov 2009, at 05:28, Adam Wolff wrote:

> There isn't a great way to store hierarchical data in couch. If you want to
> actually move stuff around, the full pathname is a no-go, since there are no
> bulk updates. The only other trick here, if you have meaningful roots or
> branch points, is to store a reference to those in addition to the specific
> parent node in the graph.

It is not a no-go, renames just can't be atomic :)

Cheers
Jan
--

> 
> In any case, it seems better to me to store references from child to parent,
> rather than the other way around. The child document makes a more natural
> concurrency boundary.
> 
> 
> A
> 
> 
> On Sun, Nov 15, 2009 at 7:14 PM, Andreas Pavlogiannis <
> paulogiann.couchdb@gmail.com> wrote:
> 
>> Greetings,
>> 
>> I recently started exploring the capabilities of couchdb and although I
>> find it really interesting and flexible, I am experiencing some
>> difficulties:
>> 
>> Is there any recommended way to store hierarchical data? Consider for
>> example the case of a file system with multiple directories. I can think of
>> some possible scenarios each with different capabilities and limitations:
>>  *   Each file and each folder is represented by a single document, with
>> each folder document containing a "contents" list that has the ids of the
>> subdocuments under the specific folder (the usual tree structure). In this
>> case, deleting a file would require updating more than one document (the
>> file for deletion and the parent folder for the "contents" attribute) which
>> seems dangerous considering the absence of transactional operations (what
>> about deleting a whole folder?). Moreover, accessing the file "foo/bar/cow"
>> would require a conventional pathname translation which adds overhead (cut
>> the pathname in chunks, request the "foo" folder, retrieve the ids of its
>> contents, find which one corresponds to the "bar" folder  etc..)
>>  *    Each file and each folder is represented by a single document, with
>> each file having an attribute "parent id" that contains the id of its parent
>> folder(reverse tree structure). In this case deleting the file requires only
>> one operation and seems more robust.   However pathname translation gets
>> fuzzy and seems to add a lot of overhead (retrieve id of folder, find
>> documents having this "parent id" attribute, find the one you want among
>> them...)
>>   *   Each file is represented by a single document that has a "path"
>> attribute that indicates the directory that is being stored to. This gives
>> the advantage of avoiding conventional pathname translation and retrieving
>> the correct document immediately. However, operations such as renaming a
>> folder require updating many documents and should be avoided.
>>  * Keep the whole file system in a single document. Ouch!
>> 
>> I am aware of the bulk update technique with the "all or nothing"
>> attribute, but it is to my understanding that it should be avoided,
>> especially when dealing with clustering and replication. In addition, things
>> seem to get more obscure when considering file sharing possibilities between
>> the users of the file system.
>> 
>> I would be glad if you could provide me some pointers on how to circumvent
>> the disadvantages of each of the methods above.
>> 
>> In general, do you thing that since dealing with documents is so flexible
>> and provided the absence of transactional operations one should try to
>> organize his data as decoupled as possible?
>> 
>> Thank you for your time ,
>> 
>> Andreas
>>

Re: Storing Hierarchical Data

Posted by Adam Wolff <aw...@gmail.com>.

There isn't a great way to store hierarchical data in couch. If you want to
actually move stuff around, the full pathname is a no-go, since there are no
bulk updates. The only other trick here, if you have meaningful roots or
branch points, is to store a reference to those in addition to the specific
parent node in the graph.

In any case, it seems better to me to store references from child to parent,
rather than the other way around. The child document makes a more natural
concurrency boundary.


A


On Sun, Nov 15, 2009 at 7:14 PM, Andreas Pavlogiannis <
paulogiann.couchdb@gmail.com> wrote:

> Greetings,
>
> I recently started exploring the capabilities of couchdb and although I
> find it really interesting and flexible, I am experiencing some
> difficulties:
>
> Is there any recommended way to store hierarchical data? Consider for
> example the case of a file system with multiple directories. I can think of
> some possible scenarios each with different capabilities and limitations:
>   *   Each file and each folder is represented by a single document, with
> each folder document containing a "contents" list that has the ids of the
> subdocuments under the specific folder (the usual tree structure). In this
> case, deleting a file would require updating more than one document (the
> file for deletion and the parent folder for the "contents" attribute) which
> seems dangerous considering the absence of transactional operations (what
> about deleting a whole folder?). Moreover, accessing the file "foo/bar/cow"
> would require a conventional pathname translation which adds overhead (cut
> the pathname in chunks, request the "foo" folder, retrieve the ids of its
> contents, find which one corresponds to the "bar" folder  etc..)
>   *    Each file and each folder is represented by a single document, with
> each file having an attribute "parent id" that contains the id of its parent
> folder(reverse tree structure). In this case deleting the file requires only
> one operation and seems more robust.   However pathname translation gets
> fuzzy and seems to add a lot of overhead (retrieve id of folder, find
> documents having this "parent id" attribute, find the one you want among
> them...)
>    *   Each file is represented by a single document that has a "path"
> attribute that indicates the directory that is being stored to. This gives
> the advantage of avoiding conventional pathname translation and retrieving
> the correct document immediately. However, operations such as renaming a
> folder require updating many documents and should be avoided.
>   * Keep the whole file system in a single document. Ouch!
>
> I am aware of the bulk update technique with the "all or nothing"
> attribute, but it is to my understanding that it should be avoided,
> especially when dealing with clustering and replication. In addition, things
> seem to get more obscure when considering file sharing possibilities between
> the users of the file system.
>
> I would be glad if you could provide me some pointers on how to circumvent
> the disadvantages of each of the methods above.
>
> In general, do you thing that since dealing with documents is so flexible
> and provided the absence of transactional operations one should try to
> organize his data as decoupled as possible?
>
> Thank you for your time ,
>
> Andreas
>

Re: Storing Hierarchical Data

Posted by Ilya Sterin <st...@gmail.com>.

I actually am using CouchDB for storing hierarchical filesystem-like
data.  A good way of thinking about it is the same as amazon s3.  It's
really a key/value store, but hierarchy is inferred by key names.

One food for thought is, if say you create keys to contain the full
name and then infer the names based on the path, /some/path/to/file
...  You can then use views to get a consistent view of the directory.
 So basically files are documents, directories are views.  Deleting
files is just a matter of deleting that document, since views are not
materialized, it will reflect point in time contents.  Also if the
directory content is highly volatile, you might run into issues of
having orphaned views, which (since I believe they are indexed) use up
unnecessary space.  You can have some background process clean up
orphaned directory views, if you actually want to remove them.

Ilya

Re: Storing Hierarchical Data

Posted by Brian Candler <B....@pobox.com>.

On Mon, Nov 16, 2009 at 05:14:19AM +0200, Andreas Pavlogiannis wrote:
>     *   Each file is represented by a single document that has a "path"  
> attribute that indicates the directory that is being stored to. This  
> gives the advantage of avoiding conventional pathname translation and  
> retrieving the correct document immediately.

This is the form I'd suggest. It has a number of retrieval benefits:

* If you have a view which splits the path into an array of path components
and emits that as the key in a view, then the key ordering is such that a
parent is always immediately followed by its children. It is very cheap to
retrieve a specific directory and all its descendants in one query using
startkey and endkey. You can naturally serialise the data into a stream
(e.g. as XML).

* In another view, if you emit all path components except the last, then you
have an index by immediate parent. This makes it very cheap to get all the
children (first-level descendants) of a directory in a single query.

Also, you may be able to work without explicit 'directory' objects at all,
just the files themselves.

> However, operations such as  
> renaming a folder require updating many documents

True.

> and should be avoided.

Well, it's a more complex/expensive operation, but unless you expect this to
be happening frequently it's probably OK.

> I am aware of the bulk update technique with the "all or nothing"  
> attribute, but it is to my understanding that it should be avoided,  
> especially when dealing with clustering and replication.

On the contrary. The "all or nothing" updates give you more or less the
*same* semantics as you'd get with a multi-master clustered scenario. If you
use this then your application will be simpler (as it never has to deal with
HTTP 409 conflicts), and it will work the same with a single master or in
a multi-master cluster.

Unfortunately, the "all or nothing"-ness isn't carried across through
replication. For example, if you want to reparent a directory, you might
rename 100 documents using a single all-or-nothing bulk update. On the node
where this is first applied, you are guaranteed atomicity. But after
replication, you may find that some documents are renamed and some are not.
Of course, eventual consistency means that barring some administrative
block, the updates should eventually get through, but there could be a
period of time where the files aren't all together.

Regards,

Brian.