Posted to user@couchdb.apache.org by Chris Anderson <jc...@grabb.it> on 2008/08/01 00:14:30 UTC

Re: Is it possible to evaluate a view on a 20.000 documents database?

If your view is complex, and you have many (100k+) records (and the
emitted row size is large) views could take hours to generate on a
Core Duo MacBook. Let them generate overnight, and in the morning the
queries will be very fast.

On Thu, Jul 31, 2008 at 2:44 PM, Ed Finkler <fu...@gmail.com> wrote:
> I have been working with a very similar problem, actually. A large set of
> records (40k+), building views from scratch.
>
> My experience was that I just needed to let couchdb build the view. It can
> take several minutes, and the CPU usage will be high. You should see both
> the beam and couchjs processes working while the view is building. If you're
> accessing a view via Futon, it's likely the browser will time-out the
> request before the build is finished. The build process *will* continue on
> the server side, though. If you let the build finish, the next time you
> query the view, it will return the data immediately.
>
> To mitigate this problem, I'm now updating the view every time I do an
> insert (I bulk-add 20 records per minute). This only requires that the new
> data be added to the view, so building at this point is a short process.
>
> (big thanks to the folks on #couchdb for helping me with this problem!)
>
> --
> Ed Finkler
> http://funkatron.com
> AIM: funka7ron
> ICQ: 3922133
> Skype: funka7ron
>
>
> On Jul 31, 2008, at 5:10 PM, Demetrius Nunes wrote:
>
>> Hi there,
>>
>> I was having a great time playing around with CouchDB. It seems like a
>> perfect fit for a future system that we'll be building fairly soon.
>>
>> But then, I've just created a CouchDB database, importing 20.000 records
>> from an old relational database into it.
>>
>> When I go into Futon, I can see the database is there, with 20.899
>> documents
>> and 125.2 MB in size.
>>
>> Clicking on it, I can navigate thru the "All Documents" pretty quickly (10
>> documents per page).
>>
>> The problem is when I try to create a custom view. Just as I enter the
>> custom view page in Futon, the server hangs and locks up my CPU at 90%
>> usage. I waited several minutes for it to cool off but the process was
>> still
>> there and I had no response at all.
>>
>> I then tried to create a view programmatically, using REST/JSON and I get
>> the
>> same result.
>>
>> I am running CouchDB 0.8.0 on Ubuntu 8.04.
>>
>> Is CouchDB not ready for a dataset of this size yet?
>>
>> Thanks and best regards,
>> Dema
>>
>> --
>> ____________________________
>> http://www.demetriusnunes.com
>
>



-- 
Chris Anderson
http://jchris.mfdz.com

Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Paul Bonser <mi...@gmail.com>.

On 7/31/08, Demetrius Nunes <de...@gmail.com> wrote:
> The view I am trying to create is really simple:
>
>  function(doc) {
>   if
>  (doc.classe_id.match(/8a8090a20075ffba010075ffbed600028a8090a20075ffba010075ffbf7200c48a8090a20075ffba010075ffbf7200d9/))
>     emit(doc.id, doc);
>  }

Any reason you're using a regex rather than just using string comparison?

-- 
Paul Bonser
http://blog.paulbonser.com

Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Demetrius Nunes <de...@gmail.com>.

Thanks for all the advice. I'll do that and get back to you about how it
went.
Cheers.




-- 
____________________________
http://www.demetriusnunes.com

Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Paul Davis <pa...@gmail.com>.

On Fri, Aug 1, 2008 at 12:08 PM, Michael Hendricks <mi...@ndrix.org> wrote:
> On Thu, Jul 31, 2008 at 07:38:03PM -0300, Demetrius Nunes wrote:
>> The view I am trying to create is really simple:
>>
>> function(doc) {
>>   if
>> (doc.classe_id.match(/8a8090a20075ffba010075ffbed600028a8090a20075ffba010075ffbf7200c48a8090a20075ffba010075ffbf7200d9/))
>>     emit(doc.id, doc);
>> }
>
> You might try changing your emit() to
>
>    emit(doc.id, null);
>
> I seem to recall some discussion on the mailing list that including the
> document in the emitted value (especially for large documents) can
> significantly affect view performance.
>

Whatever you emit is stored in the view. So if you're emitting an
entire doc, the entire doc is going to be stored twice (once in the
db, once in the view). I'm not sure that there's any extra overhead
other than storing the doc, although storing the doc is going to
require a second JSON <-> Erlang conversion. I'm not sure how expensive
that really is, but I seem to remember chatter about the conversion
being noticeable.

> --
> Michael
>

Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Jan Lehnardt <ja...@apache.org>.

On Aug 1, 2008, at 19:06, Demetrius Nunes wrote:

> What would be the best way to try out new views and not suffer with  
> the long
> computation times on this big dataset? Should I create a "development"
> database with only a subset (say a couple of hundred documents) of  
> the data
> and work there until I have all the views I want and then port those
> to the
> "real" big database?

That would be an option. Possibly the route I'd take.

Cheers
Jan
--


Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Chris Anderson <jc...@grabb.it>.

On Fri, Aug 1, 2008 at 11:06 AM, Ed Finkler <fu...@gmail.com> wrote:
>
> On Aug 1, 2008, at 2:05 PM, Chris Anderson wrote:
>
>> That's how I do it. I've got a bunch of Ruby scripts that do things
>> like replicate 1% of the database, clone a map index to documents in
>> another db, etc. Maybe a cleanup-for-release day is in order.
>
> Yes please! 8)

I cleaned up subset.rb and remap.rb for release - you can find them here:

http://github.com/jchris/couchrest/tree/master/utils

subset.rb takes a source and a target database, and randomly
replicates some percentage of the documents from the source to the
target.

remap.rb iterates over a map view, yielding an array of the values
associated with each unique key. It creates a document in a separate
target db for each key.
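
For anyone who wants the idea without Ruby, the sampling step can be
sketched roughly like this. It is not the actual subset.rb code: the
documents live in a plain array instead of a database, and makeRng and
copySubset are names made up for this example.

```javascript
// Small seeded generator so the sketch is repeatable; Math.random works too.
function makeRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0; // linear congruential step
    return state / 4294967296;
  };
}

// Keeps each source document with probability `fraction`.
function copySubset(sourceDocs, fraction, rng) {
  return sourceDocs.filter(() => rng() < fraction);
}

// Made-up source "database" of 20000 documents.
const source = [];
for (let i = 0; i < 20000; i++) source.push({ _id: "doc" + i });

// Copy roughly 1% of the documents into the target.
const target = copySubset(source, 0.01, makeRng(42));
console.log(target.length + " of " + source.length + " documents copied");
```

Sampling each document independently means the target size is only
approximately the requested percentage, which is usually fine for a
development database.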

-- 
Chris Anderson
http://jchris.mfdz.com

Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Ed Finkler <fu...@gmail.com>.

On Aug 1, 2008, at 2:05 PM, Chris Anderson wrote:

> That's how I do it. I've got a bunch of Ruby scripts that do things
> like replicate 1% of the database, clone a map index to documents in
> another db, etc. Maybe a cleanup-for-release day is in order.

Yes please! 8)

--
Ed Finkler
http://funkatron.com
AIM: funka7ron
ICQ: 3922133
Skype: funka7ron

Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Chris Anderson <jc...@grabb.it>.

On Fri, Aug 1, 2008 at 10:06 AM, Demetrius Nunes
<de...@gmail.com> wrote:
> Should I create a "development"
> database with only a subset (say a couple of hundred documents) of the data
> and work there until I have all the views I want and then port those to the
> "real" big database?

That's how I do it. I've got a bunch of Ruby scripts that do things
like replicate 1% of the database, clone a map index to documents in
another db, etc. Maybe a cleanup-for-release day is in order.

-- 
Chris Anderson
http://jchris.mfdz.com

Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Demetrius Nunes <de...@gmail.com>.

What would be the best way to try out new views and not suffer with the long
computation times on this big dataset? Should I create a "development"
database with only a subset (say a couple of hundred documents) of the data
and work there until I have all the views I want and then port those to the
"real" big database?


-- 
____________________________
http://www.demetriusnunes.com

Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Jan Lehnardt <ja...@apache.org>.

On Aug 1, 2008, at 18:08, Michael Hendricks wrote:

> On Thu, Jul 31, 2008 at 07:38:03PM -0300, Demetrius Nunes wrote:
>> The view I am trying to create is really simple:
>>
>> function(doc) {
>>   if (doc.classe_id.match(/8a8090a20075ffba010075ffbed600028a8090a20075ffba010075ffbf7200c48a8090a20075ffba010075ffbf7200d9/))
>>     emit(doc.id, doc);
>> }
>
> You might try changing your emit() to
>
>    emit(doc.id, null);
>
> I seem to recall some discussion on the mailing list that including  
> the
> document in the emitted value (especially for large documents) can
> significantly affect view performance.

Also, the doc id is always automatically included with each row, so
emit(null, null); does the trick as well :) For pagination, use
startkey_docid & endkey_docid.

Cheers
Jan
--

Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Michael Hendricks <mi...@ndrix.org>.

On Thu, Jul 31, 2008 at 07:38:03PM -0300, Demetrius Nunes wrote:
> The view I am trying to create is really simple:
> 
> function(doc) {
>   if
> (doc.classe_id.match(/8a8090a20075ffba010075ffbed600028a8090a20075ffba010075ffbf7200c48a8090a20075ffba010075ffbf7200d9/))
>     emit(doc.id, doc);
> }

You might try changing your emit() to

    emit(doc.id, null);

I seem to recall some discussion on the mailing list that including the
document in the emitted value (especially for large documents) can
significantly affect view performance.
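
A rough way to see the difference is to compare how much data each
variant emits. This is only a sketch run outside CouchDB: runMap and the
sample documents are made up for illustration, emit is passed in as a
callback, and _id is the field CouchDB stores the document id in.

```javascript
// Collects emitted rows so the map functions can run outside CouchDB.
function runMap(mapFn, docs) {
  const rows = [];
  docs.forEach((doc) =>
    mapFn(doc, (key, value) => rows.push({ id: doc._id, key, value }))
  );
  return rows;
}

// Variant 1: the whole document is emitted as the value.
const mapWithDoc = (doc, emit) => emit(doc._id, doc);
// Variant 2: only the key is emitted; the value is null.
const mapWithNull = (doc, emit) => emit(doc._id, null);

// Made-up documents with a large field to stand in for big imported records.
const docs = [1, 2, 3].map((n) => ({ _id: "doc" + n, body: "x".repeat(1000) }));

// Serialized size of the emitted rows, as a crude proxy for index size.
const bytesWithDoc = JSON.stringify(runMap(mapWithDoc, docs)).length;
const bytesWithNull = JSON.stringify(runMap(mapWithNull, docs)).length;
console.log({ bytesWithDoc, bytesWithNull });
```

Each row still carries the document id, so the full document can be
fetched separately when it is actually needed.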

-- 
Michael

Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Damien Katz <da...@apache.org>.

I'm guessing you are running an old version of Erlang (R11) known to
have performance issues. Upgrade to the latest (R12B-3, available from
erlang.org). The stuff in the packages (apt, MacPorts, etc.) is usually
outdated.

-Damien

On Jul 31, 2008, at 6:38 PM, Demetrius Nunes wrote:

> The view I am trying to create is really simple:
>
> function(doc) {
>   if (doc.classe_id.match(/8a8090a20075ffba010075ffbed600028a8090a20075ffba010075ffbf7200c48a8090a20075ffba010075ffbf7200d9/))
>     emit(doc.id, doc);
> }
>
> It's being applied to a 20.000 documents dataset and I've already  
> waited
> several minutes until the CPU cooled off, but to my surprise, the  
> view is
> still taking a long time to respond when I try to run it. Ive never  
> actually
> got a result out of it...
>
> Am I doing something wrong?
>
> Also, what are the performance goals for view-related operations  
> like these
> on bigger datasets (I consider a 20.000 document dataset fairly  
> small) for
> CouchDB? What shoud we expect for 1.0 ?
>
> If it's not possible to evaluate views on these kinds of datasets in  
> a few
> seconds, then it would be huge deal-breaker for me. And I'd have to  
> consider
> using something like Sesame RDF database, but I really like CouchDB  
> much
> better.
>
> Cheers,
> Dema
>

Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Jan Lehnardt <ja...@apache.org>.

On Aug 1, 2008, at 18:50, Adam Jacob wrote:

> On Fri, Aug 1, 2008 at 12:37 AM, Johan Liseborn
> <jo...@gmail.com> wrote:
>> AFAIU, when you add new documents and then evaluate a view including
>> those documents, indexing will happen, but only for the newly added
>> documents (i.e. already indexed documents will not be re-indexed). I
>> believe this means that the time to index will be, in some way,
>> proportional to the number of *new* documents. I believe I have  
>> seen a
>> big-O "number" for this somewhere, but I don't remember right now if
>> it is O(n), O(log n), or something else (I am sure someone else on  
>> the
>> list can answer that :-).
>>
>> As can be seen from the results, when CouchDB had to index the 10.000
>> new documents, it took about 13 minutes to get the result, but when
>> all the documents had been indexed, the answer came back in 0.7
>> seconds. Having to index 10 documents did not take that long, giving
>> an answer in 1.2 seconds.
>
> I did some totally off-the-cuff benchmarking with Varnish (a caching
> HTTP reverse proxy) in front of CouchDB.  It works entirely as
> expected - Varnish returns lightning fast (<30ms) very consistently
> (20ms in the 50%, 35ms in the longest.)
>
> With Varnish, you can also add PURGE support for a given URL.
>
> This means that, if those compute times are egregious, and you can
> suffer your data being a bit out of date, it would be trivial to
> create a system that does:
>
> Reads...
> User -> Varnish -> CouchDB
>
> Writes...
> User->Varnish->CouchDB->PURGE for entry->Queue for view
> re-indexing->PURGE for views
>
> Essentially ensuring that all your reads always return as quickly as
> possible, masking any possible delay in index queries on bulk updates
> (assuming stale data for 10 minutes is ok.)

?update=false would be the CouchDB-only solution for that.

Cheers
Jan
--

Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Adam Jacob <ad...@hjksolutions.com>.

On Fri, Aug 1, 2008 at 12:37 AM, Johan Liseborn
<jo...@gmail.com> wrote:
> AFAIU, when you add new documents and then evaluate a view including
> those documents, indexing will happen, but only for the newly added
> documents (i.e. already indexed documents will not be re-indexed). I
> believe this means that the time to index will be, in some way,
> proportional to the number of *new* documents. I believe I have seen a
> big-O "number" for this somewhere, but I don't remember right now if
> it is O(n), O(log n), or something else (I am sure someone else on the
> list can answer that :-).
>
> As can be seen from the results, when CouchDB had to index the 10.000
> new documents, it took about 13 minutes to get the result, but when
> all the documents had been indexed, the answer came back in 0.7
> seconds. Having to index 10 documents did not take that long, giving
> an answer in 1.2 seconds.

I did some totally off-the-cuff benchmarking with Varnish (a caching
HTTP reverse proxy) in front of CouchDB. It works entirely as
expected: Varnish returns lightning fast (<30ms) and very consistently
(20ms at the 50th percentile, 35ms for the slowest request).

With Varnish, you can also add PURGE support for a given URL.

This means that, if those compute times are egregious, and you can
suffer your data being a bit out of date, it would be trivial to
create a system that does:

Reads...
User -> Varnish -> CouchDB

Writes...
User->Varnish->CouchDB->PURGE for entry->Queue for view
re-indexing->PURGE for views

Essentially ensuring that all your reads always return as quickly as
possible, masking any possible delay in index queries on bulk updates
(assuming stale data for 10 minutes is ok.)
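
The read and write paths above could be modelled roughly as follows.
This is a toy in-memory stand-in, not Varnish configuration:
PurgeableCache, the fetch callback, and the view path are all invented
for the sketch.

```javascript
// Toy read-through cache with explicit purge, standing in for the proxy layer.
class PurgeableCache {
  constructor(fetchFn) {
    this.fetchFn = fetchFn; // called only on a cache miss
    this.entries = new Map();
  }
  get(url) {
    if (!this.entries.has(url)) this.entries.set(url, this.fetchFn(url));
    return this.entries.get(url);
  }
  purge(url) {
    this.entries.delete(url);
  }
}

// Hypothetical backend: counts how often the slow view is really computed.
let backendCalls = 0;
let docCount = 10;
const cache = new PurgeableCache((url) => {
  backendCalls += 1;
  return { url, total: docCount };
});

const VIEW_URL = "/db/_view/tasks/count"; // made-up path
const first = cache.get(VIEW_URL);   // miss: hits the backend
const second = cache.get(VIEW_URL);  // hit: no backend call
docCount = 11;                       // a write happens
const stale = cache.get(VIEW_URL);   // still the old answer until purged
cache.purge(VIEW_URL);               // issued once re-indexing has finished
const fresh = cache.get(VIEW_URL);   // miss again: sees the new data
console.log({ backendCalls, stale: stale.total, fresh: fresh.total });
```

Readers only ever wait for the backend right after a purge, which is the
trade of slightly stale data for consistently fast reads.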

Regards,
Adam

-- 
HJK Solutions - We Launch Startups - http://www.hjksolutions.com
Adam Jacob, Senior Partner
T: (206) 508-4759 E: adam@hjksolutions.com

Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Dean Landolt <de...@deanlandolt.com>.

>
> So, I am
> just checking if the classe_id field starts with a certain string. Is regex
> comparison a heavy operation in Javascript?


I'm going to go out on a limb and say it's *way* more heavyweight than a
doc.classe_id.indexOf('...') == 0, which from the sounds of it would give
you the same results.
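
A minimal sketch of that check, assuming the goal is a starts-with test.
The prefix and sample documents are placeholders, hasPrefix and
mapByPrefix are made-up names, and emit is passed in as a parameter only
so that it runs outside CouchDB.

```javascript
// Hypothetical prefix; the value in the thread is a much longer hex string.
const PREFIX = "ABC";

// Plain string comparison: true only when the value starts with the prefix.
function hasPrefix(value, prefix) {
  return typeof value === "string" && value.indexOf(prefix) === 0;
}

// Map function in the style used in the thread. Inside CouchDB, emit is
// provided by the view server instead of being passed in.
function mapByPrefix(doc, emit) {
  if (hasPrefix(doc.classe_id, PREFIX)) {
    emit(doc._id, null);
  }
}

const rows = [];
const docs = [
  { _id: "1", classe_id: "ABCDEF" },
  { _id: "2", classe_id: "XYZABC" }, // contains the prefix, but not at the start
  { _id: "3" },                      // no classe_id field at all
];
docs.forEach((doc) => mapByPrefix(doc, (key, value) => rows.push({ key, value })));
console.log(rows);
```

Note that an unanchored regex such as /ABC/ would also match the second
document, and calling .match on a missing field throws, so the explicit
check is stricter as well as simpler.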

Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Demetrius Nunes <de...@gmail.com>.

Johan, this was excellent stuff. Thanks for the enlightenment.

As for why I am using a regex instead of a plain string comparison: the
"classe_id" field is structured so that it starts with a string (say ABC)
but can end in several other ways, so the documents may have field values
of ABCDEF, ABCGHI, ABCRST, etc. So, I am
just checking if the classe_id field starts with a certain string. Is regex
comparison a heavy operation in Javascript?

rgds,
Demetrius




-- 
____________________________
http://www.demetriusnunes.com

Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Johan Liseborn <jo...@gmail.com>.

On Fri, Aug 1, 2008 at 00:38, Demetrius Nunes <de...@gmail.com> wrote:
> The view I am trying to create is really simple:
>
> function(doc) {
>  if
> (doc.classe_id.match(/8a8090a20075ffba010075ffbed600028a8090a20075ffba010075ffbf7200c48a8090a20075ffba010075ffbf7200d9/))
>    emit(doc.id, doc);
> }
>
> It's being applied to a 20.000 documents dataset and I've already waited
> several minutes until the CPU cooled off, but to my surprise, the view is
> still taking a long time to respond when I try to run it. Ive never actually
> got a result out of it...
>
> Am I doing something wrong?

I guess you have already gotten a number of answers, but just to give
you some additional input (which points in the same direction), here
is some data from a little experiment I just did:

I have a database consisting of documents that describe "projects";
each document has a number of fields, including fields for project
manager, due date, an array of project activities (which in turn have
descriptions, an array of assigned workers, etc.), an array of notes,
and a field giving the priority. The point is that the documents are
"semi-complex", or at least I *think* they could be considered so. I
am not sure how much this matters, but it seems to matter a little, at
least when the document itself is part of the output of the view
(which it *isn't* in my example below, but anyway...).

I am running this on a second generation MacBook (core 2 duo) with
Erlang R12B-3, SMP enabled.

Now, I have a view which gives me the number of projects per priority
level. The view consists of the following map and reduce functions
(mind you, I am not sure that I am doing this entirely correctly; I am
pretty new to CouchDB (my third day of playing with it, actually),
and I am still figuring the map/reduce stuff out; the result of the
view seems to be correct, though):

map: function(doc) { if (doc.type == 'task') emit(doc.priority, 1); }

reduce: function(keys, values) { return sum(values); }
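
To see what these two functions produce, here is a small simulation
outside CouchDB. It assumes emit is passed in as a callback and replaces
the built-in sum() with a plain reduction; queryGrouped is a made-up
helper that only imitates what group=true returns.

```javascript
// The map and reduce bodies from above, with emit passed in explicitly.
function map(doc, emit) {
  if (doc.type == "task") emit(doc.priority, 1);
}
function reduce(keys, values) {
  return values.reduce((a, b) => a + b, 0); // stands in for CouchDB's sum()
}

// Toy stand-in for a grouped view query: one reduced value per distinct key.
function queryGrouped(docs) {
  const groups = new Map();
  docs.forEach((doc) =>
    map(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    })
  );
  return [...groups.keys()]
    .sort((a, b) => a - b)
    .map((key) => ({ key, value: reduce(null, groups.get(key)) }));
}

// Made-up sample documents.
const docs = [
  { type: "task", priority: 2 },
  { type: "task", priority: 1 },
  { type: "task", priority: 2 },
  { type: "note", priority: 1 }, // skipped by the map function
];
const result = { rows: queryGrouped(docs) };
console.log(JSON.stringify(result));
```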

I just ran a test where I had a database already consisting of 42.000
project documents (the view had already been indexed on these
documents). I added an additional 10.000 documents, and then ran the
view above like so:

Johans-MacBook% time curl
'localhost:5984/test-001/_view/tasks/per_prio_count?group=true'

The result I got back was:

{"rows":[{"key":1,"value":10391},{"key":2,"value":10399},{"key":3,"value":10482},{"key":4,"value":10320},{"key":5,"value":10408}]}
curl 'localhost:5984/test-001/_view/tasks/per_prio_count2?group=true'
0.01s user 0.03s system 0% cpu 13:32.02 total

Running the view again gave the following result:

{"rows":[{"key":1,"value":10391},{"key":2,"value":10399},{"key":3,"value":10482},{"key":4,"value":10320},{"key":5,"value":10408}]}
curl 'localhost:5984/test-001/_view/tasks/per_prio_count2?group=true'
0.00s user 0.00s system 0% cpu 0.703 total

As the last part, I added an additional 10 documents and then re-ran
the view, giving the following result:

{"rows":[{"key":1,"value":10392},{"key":2,"value":10400},{"key":3,"value":10487},{"key":4,"value":10322},{"key":5,"value":10409}]}
curl 'localhost:5984/test-001/_view/tasks/per_prio_count2?group=true'
0.00s user 0.00s system 0% cpu 1.207 total

AFAIU, when you add new documents and then evaluate a view including
those documents, indexing will happen, but only for the newly added
documents (i.e. already indexed documents will not be re-indexed). I
believe this means that the time to index will be, in some way,
proportional to the number of *new* documents. I believe I have seen a
big-O "number" for this somewhere, but I don't remember right now if
it is O(n), O(log n), or something else (I am sure someone else on the
list can answer that :-).

As can be seen from the results, when CouchDB had to index the 10.000
new documents, it took about 13 minutes to get the result, but when
all the documents had been indexed, the answer came back in 0.7
seconds. Having to index 10 documents did not take that long, giving
an answer in 1.2 seconds.
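
The behaviour described here can be pictured with a toy index that
remembers how many documents it has already processed. This is an
assumption-laden sketch, not how CouchDB is implemented: it only handles
appended documents, and IncrementalView is a name invented for the
example.

```javascript
// Toy incremental index: remembers how far it has read and only maps newer docs.
class IncrementalView {
  constructor(mapFn) {
    this.mapFn = mapFn;
    this.rows = [];
    this.indexedUpTo = 0; // number of documents already processed
  }

  // Brings the index up to date and reports how many documents were mapped.
  update(docs) {
    const pending = docs.slice(this.indexedUpTo);
    pending.forEach((doc) =>
      this.mapFn(doc, (key, value) => this.rows.push({ key, value }))
    );
    this.indexedUpTo = docs.length;
    return pending.length;
  }
}

// Made-up append-only "database".
const db = [];
for (let i = 0; i < 1000; i++) db.push({ type: "task", priority: i % 5 });

const view = new IncrementalView((doc, emit) => emit(doc.priority, 1));
const firstBuild = view.update(db);   // maps all 1000 documents
const noChange = view.update(db);     // nothing new, maps 0 documents
db.push({ type: "task", priority: 3 });
const afterInsert = view.update(db);  // maps only the 1 new document
console.log({ firstBuild, noChange, afterInsert });
```

The work per query depends on the number of documents added since the
last query, which matches the timings above: slow after 10.000 new
documents, fast after 10.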

Hope this helps in some way.

Cheers,

johan

P.S.

I am really excited about CouchDB; kudos to Damien and everyone else
involved (sorry, I don't know all of your names yet :-)

-- 
Johan Liseborn

Re: Is it possible to evaluate a view on a 20.000 documents database?

Posted by Demetrius Nunes <de...@gmail.com>.

The view I am trying to create is really simple:

function(doc) {
  if (doc.classe_id.match(/8a8090a20075ffba010075ffbed600028a8090a20075ffba010075ffbf7200c48a8090a20075ffba010075ffbf7200d9/))
    emit(doc.id, doc);
}

It's being applied to a 20.000-document dataset and I've already waited
several minutes until the CPU cooled off, but to my surprise, the view is
still taking a long time to respond when I try to run it. I've never
actually gotten a result out of it...

Am I doing something wrong?

Also, what are the performance goals for view-related operations like these
on bigger datasets (I consider a 20.000-document dataset fairly small) for
CouchDB? What should we expect for 1.0?

If it's not possible to evaluate views on these kinds of datasets in a few
seconds, then it would be a huge deal-breaker for me. And I'd have to
consider using something like the Sesame RDF database, but I really like
CouchDB much better.

Cheers,
Dema




-- 
____________________________
http://www.demetriusnunes.com