Posted to user@couchdb.apache.org by Nicolas Peeters <ni...@gmail.com> on 2011/02/23 18:25:30 UTC

Large document design question (updated)

Hi CouchDB community,

(Sorry, the previous email was sent too quickly.)
I have a design "best practices" question. We are using CouchDB to
store crawled web content. The document is pretty self-explanatory: the id
is the URL and there's a "pages" array that contains all the text from the
web pages.
Potentially, this document can grow very quickly to a large size (> 20 MB).
It seems that we run into issues (
https://issues.apache.org/jira/browse/COUCHDB-893) when creating a view with
objects that are larger than 9 MB (in our case).

{
   "_id": "http://www.website.com/",
   "_rev": "1-33c75795126ff81b0125156b88593cc0",
   "metadata1": "blabla",
   "metadata2": "blabla",
   "pages": [
       {
           "description": "",
           "text": "A lot of text comes here....:",
           "url": "http://www.website.com/",
           "title": "The title of this website /",
           "keywords": ""
       },
       {
           "description": "",
           "text": "A lot of text comes here....:",
           "url": "http://www.website.com/contact/",
           "title": "Contact Page",
           "keywords": ""
       }

       // MANY other pages here
   ],
   "crawlDate": "2011-02-10T12:30:07.416+01:00"
}

This document structure is not working very well for us. We are thinking
about the following alternatives, and we would really appreciate expert
modelling advice.

- Alternative 1) Create a "page" document for each page (description, text,
url, ..., plus a parent_url that would be the _id of the original doc). The
rest of the data contained in the original doc would be
duplicated/denormalized. We could then create a view that "assembles" all
the pages for a given parent_url (which in essence would have the same
effect as the original implementation).

- Alternative 2) Model in one-to-many fashion as described here:
http://wiki.apache.org/couchdb/EntityRelationship

- Alternative 3) Keep the design as is, but store the "page" content as an
attachment when we store the object. (Subquestion: would that influence the
size of the doc?)

- Alternative 4) Keep the design as is and change some settings in the
configuration that I don't know about.
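
A minimal sketch of what Alternative 1 could look like, for illustration only. The helper splitCrawlDoc, the "type" field, the "page:" id prefix and the view name pagesByParent are all assumptions, not part of the original design. The emit() stand-in is only there so the sketch runs under Node.js; CouchDB supplies its own emit() to map functions.

```javascript
// Stand-in for the emit() that CouchDB provides to map functions,
// so the sketch can be run locally with Node.js.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Hypothetical helper: turn one large crawl document into one small
// document per page. "type" and the "page:" id prefix are assumptions.
function splitCrawlDoc(crawlDoc) {
  return crawlDoc.pages.map(function (page) {
    return {
      _id: "page:" + page.url,
      type: "page",
      parent_url: crawlDoc._id,
      crawlDate: crawlDoc.crawlDate, // denormalized from the original doc
      url: page.url,
      title: page.title,
      description: page.description,
      keywords: page.keywords,
      text: page.text
    };
  });
}

// Map function as it might appear in a design document: one row per
// page, keyed by the parent URL. The large "text" field is left out
// of the emitted value on purpose.
function pagesByParent(doc) {
  if (doc.type === "page" && doc.parent_url) {
    emit(doc.parent_url, { url: doc.url, title: doc.title });
  }
}

// Local usage example with a tiny crawl document.
var pageDocs = splitCrawlDoc({
  _id: "http://www.website.com/",
  crawlDate: "2011-02-10T12:30:07.416+01:00",
  pages: [
    { url: "http://www.website.com/", title: "The title of this website /",
      description: "", keywords: "", text: "A lot of text comes here" },
    { url: "http://www.website.com/contact/", title: "Contact Page",
      description: "", keywords: "", text: "A lot of text comes here" }
  ]
});
pageDocs.forEach(pagesByParent);
```

Querying such a view with the key parameter set to the parent URL would return one row per page of that site, which is the "assembling" step described in Alternative 1.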
Subquestion: is there any particular design reason why this issue (
https://issues.apache.org/jira/browse/COUCHDB-893) is occurring? Is there a
good workaround (apart from recompilation!)? Any ETA for when this will be
fixed in a release version?

Thank you for your help and advice.

Nicolas

PS: The reason that we need a view is that we are using a Document Update
handler <http://wiki.apache.org/couchdb/Document_Update_Handlers> to do
incremental updates, which requires some kind of view. The incremental
updates work fine for normal-sized documents.
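
For readers unfamiliar with update handlers, here is a rough sketch of how an incremental "add one page" handler could be written. The function name addPage and the assumption that the request body holds a single page object as JSON are illustrative only; the thread does not show the actual handler.

```javascript
// Sketch of an update function. CouchDB calls it with the current
// document (or null if none exists) and the request object, and it
// returns [documentToSave, responseBody].
function addPage(doc, req) {
  // Assumption: the request body is one page object encoded as JSON.
  var page = JSON.parse(req.body);
  if (!doc) {
    if (!req.id) {
      // No document and no id in the URL: save nothing.
      return [null, "no document id given"];
    }
    doc = { _id: req.id, pages: [] };
  }
  doc.pages = doc.pages || [];
  doc.pages.push(page);
  return [doc, "added " + page.url];
}

// Local usage example, calling the function the way CouchDB would.
var result = addPage(null, {
  id: "http://www.website.com/",
  body: JSON.stringify({
    url: "http://www.website.com/contact/",
    title: "Contact Page",
    text: "A lot of text comes here"
  })
});
```

Note that a handler of this shape still receives and rewrites the whole document on every call, so it does not by itself keep the document small.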

Re: Large document design question (updated)

Posted by Nicolas Peeters <pe...@gmail.com>.

Thanks.

The update scenarios are as follows:
- Either we crawl a whole website and add a new document for each page (Alt. 1),
- or we crawl the whole site and add one document that represents the "crawl" (with some metadata), with references to the page documents, updated for every new page document that is added. This would be very similar to the traditional http://wiki.apache.org/couchdb/EntityRelationship model... (probably not ideal)

Thinking about it, I still think that Alternative 1 might be best. Any thoughts?

Also, storing the text as attachments seems a bit cumbersome... (If we use the model described in Alt. 1, there's not so much text after all in the "page" doc.)

On Thursday, February 24, 2011 at 4:51 PM, Zachary Zolton wrote:
> Nicolas,
> 
> Storing that much text in your documents will add a lot of overhead to
> your view functions—or any of the other JavaScript design doc
> functions you may want to use.
> 
> Therefore, if you don't need to access the raw text of each page to
> create your views, you may want to try storing them as attachments to
> your web site document. This will result in smaller JSON strings
> getting marshalled over to the JavaScript view server, needing to be
> parsed.
> 
> As for answering what the "best practice" is for how to model
> one-to-many relationship, it totally depends on what kind of update
> scenarios and methods of access your application requires.
> 
> 
> Cheers,
> 
> Zach
> 

Re: Large document design question (updated)

Posted by Zachary Zolton <za...@gmail.com>.

Nicolas,

Storing that much text in your documents will add a lot of overhead to
your view functions—or any of the other JavaScript design doc
functions you may want to use.

Therefore, if you don't need to access the raw text of each page to
create your views, you may want to try storing them as attachments to
your web site document. This will result in smaller JSON strings
getting marshalled over to the JavaScript view server, needing to be
parsed.
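
A rough sketch of the attachment approach, under stated assumptions: the helper toAttachmentDoc and the "page-N.txt" attachment names are hypothetical, and Buffer is specific to Node.js. The "_attachments" layout with "content_type" and base64 "data" is the inline-attachment form CouchDB accepts when a document is saved.

```javascript
// Hypothetical helper: move each page's text into an inline attachment.
// Buffer is Node.js-specific; the "page-N.txt" names are an assumption.
function toAttachmentDoc(crawlDoc) {
  var attachments = {};
  var pages = crawlDoc.pages.map(function (page, i) {
    var name = "page-" + i + ".txt";
    attachments[name] = {
      content_type: "text/plain",
      data: Buffer.from(page.text, "utf8").toString("base64")
    };
    // Keep only the small fields, plus a pointer to the attachment.
    return { url: page.url, title: page.title, attachment: name };
  });
  return {
    _id: crawlDoc._id,
    crawlDate: crawlDoc.crawlDate,
    pages: pages,
    _attachments: attachments
  };
}

// Local usage example.
var siteDoc = toAttachmentDoc({
  _id: "http://www.website.com/",
  crawlDate: "2011-02-10T12:30:07.416+01:00",
  pages: [
    { url: "http://www.website.com/contact/", title: "Contact Page",
      text: "A lot of text comes here" }
  ]
});
```

The JSON body of the document then holds only URLs, titles and attachment names, which is what keeps the strings handed to the view server small.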

As for answering what the "best practice" is for how to model
one-to-many relationship, it totally depends on what kind of update
scenarios and methods of access your application requires.


Cheers,

Zach

On Thu, Feb 24, 2011 at 2:27 AM, Nicolas Peeters <ni...@gmail.com> wrote:
> Thanks for your reply. Actually it's either Alt. 1 or Alt. 2, I guess. I
> don't see why I should be combining them. I'm really wondering what the best
> practice is (I'm leaning toward Alt. 1, by the way). It seems like Alt. 2 is
> more like hacking the document model to make it look and behave like a
> relational model!
>
> Hoping to get some more advice from the experts out there!
>
> Cheers,
>
> Nicolas
>

Re: Large document design question (updated)

Posted by Nicolas Peeters <ni...@gmail.com>.

Thanks for your reply. Actually it's either Alt. 1 or Alt. 2, I guess. I
don't see why I should be combining them. I'm really wondering what the best
practice is (I'm leaning toward Alt. 1, by the way). It seems like Alt. 2 is
more like hacking the document model to make it look and behave like a
relational model!

Hoping to get some more advice from the experts out there!

Cheers,

Nicolas

On Wed, Feb 23, 2011 at 7:14 PM, Javier Julio <jj...@gmail.com> wrote:

> Nicolas,
>
> Great question. From what I've learned reading the guide and the wiki, I
> think what you want here is a combination of Alternatives 1 and 2. While it
> is suggested to do what you have done, there are limits, and since you are
> hitting those limits, that's when the alternative approaches come in and are
> usually best, I would think. You might not know whether, at a later point,
> what you're storing will get too big or whether multiple users will work
> with it (think comments for a blog post). So Alternatives 1 and 2 would be a
> great place to start.
>
> So basically you can break it down into 2 different document "types": one
> document with a type of, say, "website" that just contains the general site
> info, and a second document with a type of "page" that has the page
> content as well as the website id, whether that's a URL or one of the ids
> CouchDB generates.
>
> Storing the pages as attachments (Alternative 3) is an interesting idea.
> I have no idea whether it would be beneficial to you, so I will let others
> comment on that.
>
> Hope this helps.
>
> Ciao!
> Javi
>

Re: Large document design question (updated)

Posted by Javier Julio <jj...@gmail.com>.

Nicolas,

Great question. From what I've learned reading the guide and the wiki, I think what you want here is a combination of Alternatives 1 and 2. While it is suggested to do what you have done, there are limits, and since you are hitting those limits, that's when the alternative approaches come in and are usually best, I would think. You might not know whether, at a later point, what you're storing will get too big or whether multiple users will work with it (think comments for a blog post). So Alternatives 1 and 2 would be a great place to start.

So basically you can break it down into 2 different document "types": one document with a type of, say, "website" that just contains the general site info, and a second document with a type of "page" that has the page content as well as the website id, whether that's a URL or one of the ids CouchDB generates.
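
As a hedged illustration of the two document types, the sketch below uses a single map function that emits both, so that a site and its pages sort next to each other in the view. The view name siteWithPages, the field name website_id and the sample documents are assumptions made for this example; the emit() stand-in only exists so the code runs under Node.js.

```javascript
// Stand-in for CouchDB's emit(), so the sketch runs locally.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function covering both document types. The array keys make the
// "website" row sort directly before the rows of its pages.
function siteWithPages(doc) {
  if (doc.type === "website") {
    emit([doc._id, 0], { crawlDate: doc.crawlDate });
  } else if (doc.type === "page") {
    emit([doc.website_id, 1, doc.url], { title: doc.title });
  }
}

// Local usage example: one site document and one page document.
var docs = [
  { _id: "http://www.website.com/", type: "website",
    crawlDate: "2011-02-10T12:30:07.416+01:00" },
  { _id: "page:http://www.website.com/contact/", type: "page",
    website_id: "http://www.website.com/",
    url: "http://www.website.com/contact/",
    title: "Contact Page", text: "A lot of text comes here" }
];
docs.forEach(siteWithPages);
```

Querying such a view with startkey=["http://www.website.com/"] and endkey=["http://www.website.com/", {}] would return the site row followed by its page rows in one request.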

Storing the pages as attachments (Alternative 3) is an interesting idea. I have no idea whether it would be beneficial to you, so I will let others comment on that.

Hope this helps.

Ciao!
Javi
