Posted to user@couchdb.apache.org by Brad King <br...@gmail.com> on 2008/07/01 15:26:44 UTC

Re: view index build time

Thanks for the tips. I'll start scaling back the data I'm returning
and see if it improves. The largest field is an html description of an
inventory item, which seems like a good candidate for a binary
attachment, but I need to be able to do full text searches on this
data eventually (hopefully with the Lucene integration) so I'll
probably try just not including the document data in the views first.
We've had some success with Lucene independent of couchdb, so I'm
pleased you guys are integrating this.
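
For anyone following along, here is a minimal sketch of what "not
including the document data in the views" could look like. The field path
doc.entityobject.SKU comes from the view quoted further down; the emit stub
and the runMap helper are hypothetical stand-ins so the function can be
tried outside CouchDB, and are not part of its API.

```javascript
// Map function that indexes the SKU but stores no copy of the document.
// Inside CouchDB, emit() is supplied by the view server.
function skuMap(doc) {
  if (doc.entityobject && doc.entityobject.SKU !== undefined) {
    emit(doc.entityobject.SKU, null);
  }
}

// Hypothetical local harness: collects what a map function emits for one doc.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}
function runMap(mapFn, doc) {
  rows = [];
  mapFn(doc);
  return rows.map(function (row) {
    return { id: doc._id, key: row.key, value: row.value };
  });
}

var sample = {
  _id: "item-1",
  entityobject: { SKU: "ABC-123", description: "<p>long html</p>" }
};
console.log(JSON.stringify(runMap(skuMap, sample)));
```

Each view row still carries the document id, so the full document can be
fetched separately by id once the matching row is known.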

On Sat, Jun 21, 2008 at 8:39 AM, Damien Katz <da...@gmail.com> wrote:
> Part of the problem is you are storing copies of the documents into the
> btree. If the documents are big, it takes longer to compute on them, and if
> the results (emit(...)) are big or numerous, then you'll be spending most of
> your time in I/O.
>
> My advice is to not emit the document into the view, and if you can, get the
> documents smaller in general. If the data can stored as an binary
> attachment, then that too will give you a performance improvement.
>
> -Damien
>
> On Jun 20, 2008, at 4:51 PM, Brad King wrote:
>
>> Thanks, yes its currently at 357M and growing!
>>
>> On Fri, Jun 20, 2008 at 4:49 PM, Chris Anderson <jc...@grabb.it> wrote:
>>>
>>> Brad,
>>>
>>> You can look at
>>>
>>> ls -lha /usr/local/var/lib/couchdb/.my-dbname_design/
>>>
>>> to see the view size growing...
>>>
>>> It won't tell you when it's done but it will give you hope that the
>>> progress is happening.
>>>
>>> Chris
>>>
>>> On Fri, Jun 20, 2008 at 1:45 PM, Brad King <br...@gmail.com> wrote:
>>>>
>>>> I have about 350K documents in a database. typically around 5K each. I
>>>> created and saved a view which simply looks at one field in the
>>>> document. I called the view for the first time with a key that should
>>>> only match one document, and its been awaiting a response for about 45
>>>> minutes now.
>>>>
>>>> {
>>>>  "sku": {
>>>>     "map": "function(doc) { emit(doc.entityobject.SKU, doc); }"
>>>>  }
>>>> }
>>>>
>>>> Is this typical, or is there some optimizing to be done on either my
>>>> view or the server? I'm also running on a VM so this may have some
>>>> effects, but smaller databases seem to be performing pretty well.
>>>> Insert times to set this up were actually really good I thought, at
>>>> 4000 to 5000 documents per minute running from my laptop.
>>>>
>>>
>>>
>>>
>>> --
>>> Chris Anderson
>>> http://jchris.mfdz.com
>>>
>
>

Re: view index build time

Posted by Dean Landolt <de...@deanlandolt.com>.

On Sat, Sep 6, 2008 at 9:11 AM, Brad King <br...@gmail.com> wrote:

> When I talk about CouchDB to other developers here, the first question
> I get is if the data can be distributed across multiple nodes or not
> (this is usually after the shock of how cool couchdb is wears off a
> little). Without this we have many of the same constraints we have
> today with relational databases. Yeah JSon storage is super cool, but
> in the end capacity and performance will win over ease of use. This
> where we see couchdb becoming something serious to consider for
> enterprise computing. We just have to be able to drop 3 or 4 million
> documents into this thing and not worry at all about index time,
> reliability, etc.

Just this morning I dropped just a smidgen of couch's coolness on my "cloud
computing" professor and he was definitely impressed, but of course this is
the first thing he came back to me with. Damn...

> I'm sure we'd even pay for commercial licensing if
> that were available. Faster Damien! :-)

Thank goodness for Apache. I'm pretty confident this isn't in couch's
future, but even still, why go down that route? As an open source project, a
concrete bounty would likely be a much faster way to spur a feature you're
willing to pay for.

Re: view index build time

Posted by Jan Lehnardt <ja...@apache.org>.

On Sep 6, 2008, at 15:11, Brad King wrote:

> When I talk about CouchDB to other developers here, the first question
> I get is if the data can be distributed across multiple nodes or not
> (this is usually after the shock of how cool couchdb is wears off a
> little). Without this we have many of the same constraints we have
> today with relational databases. Yeah JSon storage is super cool, but
> in the end capacity and performance will win over ease of use. This
> where we see couchdb becoming something serious to consider for
> enterprise computing. We just have to be able to drop 3 or 4 million
> documents into this thing and not worry at all about index time,

Trigger index updates every X-thousand records on bulk-insert.
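
A rough sketch of that idea, under stated assumptions: postBulk and
queryView are hypothetical callbacks supplied by the caller (for example,
wrappers that send a batch of documents to the database and read the view).
Nothing here is CouchDB's own client API.

```javascript
// Insert docs in batches and touch the view every `indexEvery` documents, so
// the index is built in increments rather than all at once on the first query.
// postBulk(batch) and queryView() are hypothetical caller-supplied callbacks.
function bulkLoad(docs, batchSize, indexEvery, postBulk, queryView) {
  var sinceLastIndex = 0;
  var indexRuns = 0;
  for (var start = 0; start < docs.length; start += batchSize) {
    var batch = docs.slice(start, start + batchSize);
    postBulk(batch);
    sinceLastIndex += batch.length;
    if (sinceLastIndex >= indexEvery) {
      queryView();            // reading the view lets the index catch up
      indexRuns += 1;
      sinceLastIndex = 0;
    }
  }
  if (sinceLastIndex > 0) {
    queryView();              // index whatever is left over
    indexRuns += 1;
  }
  return indexRuns;
}
```

In a real client both callbacks would be HTTP requests and most likely
asynchronous; the synchronous form just keeps the sketch short.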


> reliability, etc. I'm sure we'd even pay for commercial licensing if
> that were available. Faster Damien! :-)
>
> On Thu, Sep 4, 2008 at 9:12 AM, Paul C. Nendick <paul.nendick@gmail.com 
> > wrote:
>> 2008/9/4 Luke Galea <ga...@ideaforge.org>:
>>> Anyone ever managed to write a Disco map/reduce functions in  
>>> erlang instead
>>> of python?
>>
>> Disco has been publicly available for 2 days now, so I doubt it :D
>>
>> /p
>>
>

Re: view index build time

Posted by Brad King <br...@gmail.com>.

When I talk about CouchDB to other developers here, the first question
I get is if the data can be distributed across multiple nodes or not
(this is usually after the shock of how cool couchdb is wears off a
little). Without this we have many of the same constraints we have
today with relational databases. Yeah, JSON storage is super cool, but
in the end capacity and performance will win over ease of use. This is
where we see couchdb becoming something serious to consider for
enterprise computing. We just have to be able to drop 3 or 4 million
documents into this thing and not worry at all about index time,
reliability, etc. I'm sure we'd even pay for commercial licensing if
that were available. Faster Damien! :-)

On Thu, Sep 4, 2008 at 9:12 AM, Paul C. Nendick <pa...@gmail.com> wrote:
> 2008/9/4 Luke Galea <ga...@ideaforge.org>:
>> Anyone ever managed to write a Disco map/reduce functions in erlang instead
>> of python?
>
> Disco has been publicly available for 2 days now, so I doubt it :D
>
> /p
>

Re: view index build time

Posted by "Paul C. Nendick" <pa...@gmail.com>.

2008/9/4 Luke Galea <ga...@ideaforge.org>:
> Anyone ever managed to write a Disco map/reduce functions in erlang instead
> of python?

Disco has been publicly available for 2 days now, so I doubt it :D

/p

Re: view index build time

Posted by Luke Galea <ga...@ideaforge.org>.

Nice. Disco sounds great.

Anyone ever managed to write Disco map/reduce functions in Erlang
instead of Python?

The external interface is low-level C, and the API is Python, so it
seems like it might require reaching under the covers to just stay in
Erlang.

-- Luke Galea

On 4-Sep-08, at 7:52 AM, Paul C. Nendick wrote:

> 2008/7/3 Paul Davis <pa...@gmail.com>:
>> Are there any plans on making this parallel in the future?
>
> deus ex machina:
>
> http://discoproject.org
>
> /p

Re: view index build time

Posted by "Paul C. Nendick" <pa...@gmail.com>.

2008/7/3 Paul Davis <pa...@gmail.com>:
> Are there any plans on making this parallel in the future?

deus ex machina:

http://discoproject.org

/p

Re: view index build time

Posted by Jan Lehnardt <ja...@apache.org>.

On Jul 3, 2008, at 02:03, Chris Anderson wrote:

> On Wed, Jul 2, 2008 at 4:19 PM, Paul Davis <paul.joseph.davis@gmail.com 
> > wrote:
>> Also, where do the javascript conversions happen? Wouldn't that be in
>> the beam process with mochiweb?
>
> I wonder how the time in beam breaks down between JSON and sorting...
> I wish I had time to learn Erlang profiling techniques.

On the Erlang console (comes when you start CouchDB on the
console with the -i option):

1> couch_server:stop().
...
2> cprof:start().
3> couch_server:start().
...use CouchDB...
4> cprof:pause().
5> cprof:analyse(couch).
...analyse output...
6> cprof:stop().

All this untested (and stolen from Joe's book).

Cheers
Jan
--

Re: view index build time

Posted by Chris Anderson <jc...@grabb.it>.

On Wed, Jul 2, 2008 at 4:19 PM, Paul Davis <pa...@gmail.com> wrote:
> Also, where do the javascript conversions happen? Wouldn't that be in
> the beam process with mochiweb?

Yes - the JavaScript views communicate with the CouchDB mothership
over the JSON line protocol, so with each emit() JavaScript objects
are converted to JSON strings in SpiderMonkey, which are then parsed
from JSON in Erlang. Likewise, when CouchDB sends a doc to the view
server, it must convert it to a JSON string. The view server uses
eval() to parse that JSON into objects, which one hopes is decently
fast.
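
To make the round trip concrete, here is a small sketch of one exchange on
the view server side. The ["map_doc", doc] message shape and the handleLine
name are assumptions for illustration, emit is passed in as an argument
rather than being a global, and JSON.parse stands in for the eval() step
mentioned above.

```javascript
// One request/response of a line-based JSON view protocol.
// The ["map_doc", doc] message shape is an assumption for illustration.
function handleLine(line, mapFns) {
  var message = JSON.parse(line);            // JSON string -> objects
  if (message[0] !== "map_doc") {
    throw new Error("unsupported command: " + message[0]);
  }
  var doc = message[1];
  var results = mapFns.map(function (mapFn) {
    var emitted = [];
    mapFn(doc, function emit(key, value) {
      emitted.push([key, value]);
    });
    return emitted;
  });
  return JSON.stringify(results);            // objects -> JSON string
}

var reply = handleLine(
  '["map_doc", {"_id": "item-1", "sku": "ABC-123"}]',
  [function (doc, emit) { emit(doc.sku, null); }]
);
console.log(reply);
```

Every document pays for one parse on the way in and one stringify on the
way out, and the Erlang side does the mirror image of both, which is the
conversion cost being discussed here.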

I wonder how the time in beam breaks down between JSON and sorting...
I wish I had time to learn Erlang profiling techniques.

-- 
Chris Anderson
http://jchris.mfdz.com

Re: view index build time

Posted by Brad King <br...@gmail.com>.

I guess that crosses the line somewhat in my opinion for usability.
JSON is great because there are libraries to support it in pretty much
every coding environment. I can't say the same about Erlang :-),
superior though it may be. Of course you could always wrap it in an
API, sort of like what Google does for AppEngine in Google Query
Language (GQL).

On Thu, Jul 3, 2008 at 10:02 PM, David King <dk...@ketralnis.com> wrote:
>> Another interesting avenue to pursue vis-a-vis faster views would be
>> an Erlang view server. [...]
>> Feasible?
>
> I hope so. that would be incredibly useful.
>
>

Re: view index build time

Posted by David King <dk...@ketralnis.com>.

> Another interesting avenue to pursue vis-a-vis faster views would be
> an Erlang view server. [...]
> Feasible?

I hope so. That would be incredibly useful.

Re: view index build time

Posted by Chris Anderson <jc...@grabb.it>.

Another interesting avenue to pursue vis-a-vis faster views would be
an Erlang view server. I don't think it would be practical for general
use (running untrusted code in a privileged environment), but it would
be an easy way to see how fast things could be if the JSON translation
costs disappear. And if it is fast enough, maybe it would be useful
for high-demand applications.

Writing one would involve making an alternate version of
couch_query_servers.erl that skipped all the JSON stuff and just ran
Erlang functions inside itself.

Feasible?

-- 
Chris Anderson
http://jchris.mfdz.com

Re: view index build time

Posted by Jan Lehnardt <ja...@apache.org>.

On Jul 3, 2008, at 20:00, Paul C. Nendick wrote:

> 2008/7/3 Jan Lehnardt <ja...@apache.org>:
>> Of course, we do want to have that. :)
>>
>> Cheers
>> Jan
>
> If I've ever heard an invitation to learn Erlang, that's it...

Yes please, patches are the best way to get things moving!

Cheers
Jan
--

Re: view index build time

Posted by "Paul C. Nendick" <pa...@gmail.com>.

2008/7/3 Jan Lehnardt <ja...@apache.org>:
> Of course, we do want to have that. :)
>
> Cheers
> Jan

If I've ever heard an invitation to learn Erlang, that's it...

/p

Re: view index build time

Posted by Jan Lehnardt <ja...@apache.org>.

On Jul 3, 2008, at 17:00, Paul C. Nendick wrote:

> 2008/7/3 Paul Davis <pa...@gmail.com>:
>> Are there any plans on making this parallel in the future? Splitting
>> up the docs amongst a set of processes and having them sort local
>> copies before doing a merge sort back into the main index file  
>> doesn't
>> seem conceptually hard.
>
> Hi all, I'm quite new to couchdb and this list. I've been
> investigating both for a few days now; putting some ideas of mine
> through couchdb and catching up on the list archives.. Given what I've
> seen thusfar, I'd have to rate Paul's query above my number one
> consideration remaining. You see, I'm prototyping solutions to a Very
> Big and Important (tm) project I'm on and couchdb has sparked much
> interest in our evaluation phase.
>
> Certainly distributed computation is one of the reasons Erlang was
> chosen for couchdb. Could this be on the Road Map someday?

Of course, we do want to have that. :)

Cheers
Jan
--

Re: view index build time

Posted by "Paul C. Nendick" <pa...@gmail.com>.

2008/7/3 Paul Davis <pa...@gmail.com>:
> Are there any plans on making this parallel in the future? Splitting
> up the docs amongst a set of processes and having them sort local
> copies before doing a merge sort back into the main index file doesn't
> seem conceptually hard.

Hi all, I'm quite new to couchdb and this list. I've been
investigating both for a few days now; putting some ideas of mine
through couchdb and catching up on the list archives. Given what I've
seen thus far, I'd have to rate Paul's query above my number one
consideration remaining. You see, I'm prototyping solutions to a Very
Big and Important (tm) project I'm on and couchdb has sparked much
interest in our evaluation phase.

Certainly distributed computation is one of the reasons Erlang was
chosen for couchdb. Could this be on the Road Map someday?

regards,

/p

Re: view index build time

Posted by Paul Davis <pa...@gmail.com>.

Are there any plans on making this parallel in the future? Splitting
up the docs amongst a set of processes and having them sort local
copies before doing a merge sort back into the main index file doesn't
seem conceptually hard.

Also, where do the javascript conversions happen? Wouldn't that be in
the beam process with mochiweb?

On Wed, Jul 2, 2008 at 6:28 PM, Chris Anderson <jc...@grabb.it> wrote:
> On Wed, Jul 2, 2008 at 3:08 PM, Paul Davis <pa...@gmail.com> wrote:
>> I'd have to go back and double check, but off the top of my head 25
>> min for 300K docs seems about like what I was getting. Ie, not orders
>> of magnitude slower or anything.
>
> In my experience, views generate about 1/2 as fast as that, if not
> more slowly. My views are often quite complex with a lot of internal
> looping and multiple emits, so that probably explains it. In short,
> the times you're reporting seem reasonable.
>
> The bottleneck (based on my extremely unscientific use of top) doesn't
> seem to be the view server, but rather CouchDB's beam process, which
> as I understand it, is busy sorting the results as they come back from
> the view server. So the quickest route to parallelizing this may be to
> manually partition your data across CouchDB instances, generate the
> views, and query them in parallel, merging the results in your
> application.
>
> I don't actually plan to do all that work until my insert rate
> eclipses CouchDB's view generation speed. :)
>
> Once upon a time there was a feature to return the available results
> of a view, even while generation is still occurring. The feature has
> fallen by the wayside, and it would be non-trivial to turn it back on,
> according to Damien on IRC. Maybe if it would be useful to enough
> people, we'll see it again.
>
> --
> Chris Anderson
> http://jchris.mfdz.com
>

Re: view index build time

Posted by Joseph Liu <fr...@gmail.com>.

Late to the discussion but here's my 2 cents:

Depending on your virtualization software, disk accesses can suck. On
a "hosted" hypervisor, you have to rely on the host to schedule
your disk accesses. Disk I/O is scheduled in the guest, potentially goes
through an emulation layer in the hypervisor, and is then scheduled in
the host. Furthermore, there can be significant latency switching
between the host and the guest. If the disk accesses are small and
random, this can cause the slowdown you are observing. Finally, your
guest is not always scheduled in, since it's just like any other
process to the host, so the actual amount of CPU time in the guest
is less than you would normally have, which affects the total wall-clock
time of the computation.

I'm not saying that virtualization sucks as it has many important uses
(e.g. VMotion), and some of these issues may be mitigated with proper
paravirtualization, but in the end you should still run benchmarks to
see if your workload is suited for the hypervisor you are considering.

On Tue, Jul 8, 2008 at 6:53 AM, Brad King <br...@gmail.com> wrote:
> Following up on this. After moving to real hardware my view index time
> for the same data set dropped from 25 minutes to 6 minutes, so
> definitely was a factor. If there any other optimizations I can make
> I'd love to know what they are. Thanks.
>
> On Thu, Jul 3, 2008 at 9:35 AM, Brad King <br...@gmail.com> wrote:
>> That would be fantastic, but it sounds like other users are seeing
>> performance similar to what I see. When you say tuning and
>> optimizations, are you talking about code changes in future versions
>> of couchdb or parameters we can change now? VM is definitely a
>> variable. I probably should try this out on real hardware too and
>> compare.
>>
>> On Wed, Jul 2, 2008 at 7:30 PM, Damien Katz <da...@gmail.com> wrote:
>>> This sounds really slow, like somethings wrong. 25 minutes to process 300k
>>> means ~500 docs sec, or each document takes 2ms. That's a really long time
>>> CPU wise.
>>>
>>> Assuming it's not another VM bug, we should be able about to get that down
>>> to under minute with some tuning, and probably closer to 10 secs after
>>> serious optimizations.
>>>
>>> -Damien
>>>
>>>
>>> On Jul 2, 2008, at 6:28 PM, Chris Anderson wrote:
>>>
>>>> On Wed, Jul 2, 2008 at 3:08 PM, Paul Davis <pa...@gmail.com>
>>>> wrote:
>>>>>
>>>>> I'd have to go back and double check, but off the top of my head 25
>>>>> min for 300K docs seems about like what I was getting. Ie, not orders
>>>>> of magnitude slower or anything.
>>>>
>>>> In my experience, views generate about 1/2 as fast as that, if not
>>>> more slowly. My views are often quite complex with a lot of internal
>>>> looping and multiple emits, so that probably explains it. In short,
>>>> the times you're reporting seem reasonable.
>>>>
>>>> The bottleneck (based on my extremely unscientific use of top) doesn't
>>>> seem to be the view server, but rather CouchDB's beam process, which
>>>> as I understand it, is busy sorting the results as they come back from
>>>> the view server. So the quickest route to parallelizing this may be to
>>>> manually partition your data across CouchDB instances, generate the
>>>> views, and query them in parallel, merging the results in your
>>>> application.
>>>>
>>>> I don't actually plan to do all that work until my insert rate
>>>> eclipses CouchDB's view generation speed. :)
>>>>
>>>> Once upon a time there was a feature to return the available results
>>>> of a view, even while generation is still occurring. The feature has
>>>> fallen by the wayside, and it would be non-trivial to turn it back on,
>>>> according to Damien on IRC. Maybe if it would be useful to enough
>>>> people, we'll see it again.
>>>>
>>>> --
>>>> Chris Anderson
>>>> http://jchris.mfdz.com
>>>
>>>
>>
>

Re: view index build time

Posted by Brad King <br...@gmail.com>.

Following up on this. After moving to real hardware my view index time
for the same data set dropped from 25 minutes to 6 minutes, so
the VM definitely was a factor. If there are any other optimizations I can make
I'd love to know what they are. Thanks.

On Thu, Jul 3, 2008 at 9:35 AM, Brad King <br...@gmail.com> wrote:
> That would be fantastic, but it sounds like other users are seeing
> performance similar to what I see. When you say tuning and
> optimizations, are you talking about code changes in future versions
> of couchdb or parameters we can change now? VM is definitely a
> variable. I probably should try this out on real hardware too and
> compare.
>
> On Wed, Jul 2, 2008 at 7:30 PM, Damien Katz <da...@gmail.com> wrote:
>> This sounds really slow, like somethings wrong. 25 minutes to process 300k
>> means ~500 docs sec, or each document takes 2ms. That's a really long time
>> CPU wise.
>>
>> Assuming it's not another VM bug, we should be able about to get that down
>> to under minute with some tuning, and probably closer to 10 secs after
>> serious optimizations.
>>
>> -Damien
>>
>>
>> On Jul 2, 2008, at 6:28 PM, Chris Anderson wrote:
>>
>>> On Wed, Jul 2, 2008 at 3:08 PM, Paul Davis <pa...@gmail.com>
>>> wrote:
>>>>
>>>> I'd have to go back and double check, but off the top of my head 25
>>>> min for 300K docs seems about like what I was getting. Ie, not orders
>>>> of magnitude slower or anything.
>>>
>>> In my experience, views generate about 1/2 as fast as that, if not
>>> more slowly. My views are often quite complex with a lot of internal
>>> looping and multiple emits, so that probably explains it. In short,
>>> the times you're reporting seem reasonable.
>>>
>>> The bottleneck (based on my extremely unscientific use of top) doesn't
>>> seem to be the view server, but rather CouchDB's beam process, which
>>> as I understand it, is busy sorting the results as they come back from
>>> the view server. So the quickest route to parallelizing this may be to
>>> manually partition your data across CouchDB instances, generate the
>>> views, and query them in parallel, merging the results in your
>>> application.
>>>
>>> I don't actually plan to do all that work until my insert rate
>>> eclipses CouchDB's view generation speed. :)
>>>
>>> Once upon a time there was a feature to return the available results
>>> of a view, even while generation is still occurring. The feature has
>>> fallen by the wayside, and it would be non-trivial to turn it back on,
>>> according to Damien on IRC. Maybe if it would be useful to enough
>>> people, we'll see it again.
>>>
>>> --
>>> Chris Anderson
>>> http://jchris.mfdz.com
>>
>>
>

Re: view index build time

Posted by Brad King <br...@gmail.com>.

That would be fantastic, but it sounds like other users are seeing
performance similar to what I see. When you say tuning and
optimizations, are you talking about code changes in future versions
of couchdb or parameters we can change now? VM is definitely a
variable. I probably should try this out on real hardware too and
compare.

On Wed, Jul 2, 2008 at 7:30 PM, Damien Katz <da...@gmail.com> wrote:
> This sounds really slow, like somethings wrong. 25 minutes to process 300k
> means ~500 docs sec, or each document takes 2ms. That's a really long time
> CPU wise.
>
> Assuming it's not another VM bug, we should be able about to get that down
> to under minute with some tuning, and probably closer to 10 secs after
> serious optimizations.
>
> -Damien
>
>
> On Jul 2, 2008, at 6:28 PM, Chris Anderson wrote:
>
>> On Wed, Jul 2, 2008 at 3:08 PM, Paul Davis <pa...@gmail.com>
>> wrote:
>>>
>>> I'd have to go back and double check, but off the top of my head 25
>>> min for 300K docs seems about like what I was getting. Ie, not orders
>>> of magnitude slower or anything.
>>
>> In my experience, views generate about 1/2 as fast as that, if not
>> more slowly. My views are often quite complex with a lot of internal
>> looping and multiple emits, so that probably explains it. In short,
>> the times you're reporting seem reasonable.
>>
>> The bottleneck (based on my extremely unscientific use of top) doesn't
>> seem to be the view server, but rather CouchDB's beam process, which
>> as I understand it, is busy sorting the results as they come back from
>> the view server. So the quickest route to parallelizing this may be to
>> manually partition your data across CouchDB instances, generate the
>> views, and query them in parallel, merging the results in your
>> application.
>>
>> I don't actually plan to do all that work until my insert rate
>> eclipses CouchDB's view generation speed. :)
>>
>> Once upon a time there was a feature to return the available results
>> of a view, even while generation is still occurring. The feature has
>> fallen by the wayside, and it would be non-trivial to turn it back on,
>> according to Damien on IRC. Maybe if it would be useful to enough
>> people, we'll see it again.
>>
>> --
>> Chris Anderson
>> http://jchris.mfdz.com
>
>

Re: view index build time

Posted by Damien Katz <da...@gmail.com>.

This sounds really slow, like something's wrong. 25 minutes to process
300k means ~500 docs sec, or each document takes 2ms. That's a really  
long time CPU wise.

Assuming it's not another VM bug, we should be able to get that
down to under a minute with some tuning, and probably closer to 10 secs
after serious optimizations.

-Damien


On Jul 2, 2008, at 6:28 PM, Chris Anderson wrote:

> On Wed, Jul 2, 2008 at 3:08 PM, Paul Davis <paul.joseph.davis@gmail.com 
> > wrote:
>> I'd have to go back and double check, but off the top of my head 25
>> min for 300K docs seems about like what I was getting. Ie, not orders
>> of magnitude slower or anything.
>
> In my experience, views generate about 1/2 as fast as that, if not
> more slowly. My views are often quite complex with a lot of internal
> looping and multiple emits, so that probably explains it. In short,
> the times you're reporting seem reasonable.
>
> The bottleneck (based on my extremely unscientific use of top) doesn't
> seem to be the view server, but rather CouchDB's beam process, which
> as I understand it, is busy sorting the results as they come back from
> the view server. So the quickest route to parallelizing this may be to
> manually partition your data across CouchDB instances, generate the
> views, and query them in parallel, merging the results in your
> application.
>
> I don't actually plan to do all that work until my insert rate
> eclipses CouchDB's view generation speed. :)
>
> Once upon a time there was a feature to return the available results
> of a view, even while generation is still occurring. The feature has
> fallen by the wayside, and it would be non-trivial to turn it back on,
> according to Damien on IRC. Maybe if it would be useful to enough
> people, we'll see it again.
>
> -- 
> Chris Anderson
> http://jchris.mfdz.com

Re: view index build time

Posted by Sergey <sk...@gmail.com>.

Hello!

I have read through the thread carefully and still cannot get a clear
picture of the expected production speed of building an index with
CouchDB. My own experience was far from satisfactory.
Is there any sample timing that should be taken as a baseline for the current
version of CouchDB, apart from the figures Damien mentioned several messages
back?

2008/7/3, Jan Lehnardt <ja...@apache.org>:
>
>
> On Jul 3, 2008, at 00:28, Chris Anderson wrote:
>
> On Wed, Jul 2, 2008 at 3:08 PM, Paul Davis <pa...@gmail.com>
>> wrote:
>>
>>> I'd have to go back and double check, but off the top of my head 25
>>> min for 300K docs seems about like what I was getting. Ie, not orders
>>> of magnitude slower or anything.
>>>
>>
>>
>> Once upon a time there was a feature to return the available results
>> of a view, even while generation is still occurring. The feature has
>> fallen by the wayside, and it would be non-trivial to turn it back on,
>> according to Damien on IRC. Maybe if it would be useful to enough
>> people, we'll see it again.
>>
>
> 'tis strange, the Damien I talked to on AIM a few nights back said
> it would be easy to allow update=false-view queries to be non-blocking
> and returning the then current (before the update) view data.
>
> Damien? :)
>
> It would be nice if we can have that.
>
> Cheers
> Jan
> --
>
>


-- 
Best regards,
Sergey.

Re: view index build time

Posted by Jan Lehnardt <ja...@apache.org>.

On Jul 3, 2008, at 00:28, Chris Anderson wrote:

> On Wed, Jul 2, 2008 at 3:08 PM, Paul Davis <paul.joseph.davis@gmail.com 
> > wrote:
>> I'd have to go back and double check, but off the top of my head 25
>> min for 300K docs seems about like what I was getting. Ie, not orders
>> of magnitude slower or anything.
>
>
> Once upon a time there was a feature to return the available results
> of a view, even while generation is still occurring. The feature has
> fallen by the wayside, and it would be non-trivial to turn it back on,
> according to Damien on IRC. Maybe if it would be useful to enough
> people, we'll see it again.

'tis strange, the Damien I talked to on AIM a few nights back said
it would be easy to allow update=false-view queries to be non-blocking
and returning the then current (before the update) view data.

Damien? :)

It would be nice if we can have that.

Cheers
Jan
--

Re: view index build time

Posted by Chris Anderson <jc...@grabb.it>.

On Wed, Jul 2, 2008 at 3:08 PM, Paul Davis <pa...@gmail.com> wrote:
> I'd have to go back and double check, but off the top of my head 25
> min for 300K docs seems about like what I was getting. Ie, not orders
> of magnitude slower or anything.

In my experience, views generate about 1/2 as fast as that, if not
more slowly. My views are often quite complex with a lot of internal
looping and multiple emits, so that probably explains it. In short,
the times you're reporting seem reasonable.

The bottleneck (based on my extremely unscientific use of top) doesn't
seem to be the view server, but rather CouchDB's beam process, which
as I understand it, is busy sorting the results as they come back from
the view server. So the quickest route to parallelizing this may be to
manually partition your data across CouchDB instances, generate the
views, and query them in parallel, merging the results in your
application.
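
The application-side merge could look roughly like this. It is a sketch
only: mergeViewRows is a hypothetical helper, each partition's rows are
assumed to arrive already sorted by key, and the plain < comparison is a
simplification that does not reproduce CouchDB's own key collation.

```javascript
// Merge view rows from several partitions into one list ordered by key.
// Each input array is assumed to be sorted by key already.
// Plain < comparison is a simplification of CouchDB's collation rules.
function mergeViewRows(partitions) {
  var positions = partitions.map(function () { return 0; });
  var merged = [];
  while (true) {
    var best = -1;
    for (var i = 0; i < partitions.length; i++) {
      if (positions[i] >= partitions[i].length) continue;
      if (best === -1 ||
          partitions[i][positions[i]].key < partitions[best][positions[best]].key) {
        best = i;
      }
    }
    if (best === -1) return merged;   // every partition is exhausted
    merged.push(partitions[best][positions[best]]);
    positions[best] += 1;
  }
}
```

With only a handful of instances the linear scan for the smallest head row
is fine; a heap would only start to matter with many partitions.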

I don't actually plan to do all that work until my insert rate
eclipses CouchDB's view generation speed. :)

Once upon a time there was a feature to return the available results
of a view, even while generation is still occurring. The feature has
fallen by the wayside, and it would be non-trivial to turn it back on,
according to Damien on IRC. Maybe if it would be useful to enough
people, we'll see it again.

-- 
Chris Anderson
http://jchris.mfdz.com

Re: view index build time

Posted by Paul Davis <pa...@gmail.com>.

I'd have to go back and double check, but off the top of my head 25
min for 300K docs seems about like what I was getting. Ie, not orders
of magnitude slower or anything.

Not sure about moving the design folder to a different disk, you may
check iostat while indexing, although I think I saw either on this
list or in IRC someone reporting that the erlang->javascript and
javascript->erlang translations were what was slowing everything down.
Although I could've made that conversation up in a dream.

HTH,
Paul

On Wed, Jul 2, 2008 at 6:00 PM, Brad King <br...@gmail.com> wrote:
> I've got R12B. We've also got the couchdb 0.8.0-incubating version.
> I'm just curious what my expectations should be for view creation
> times. Also was wondering if anyone had tried putting the design
> folder on different disk to improve I/O.
>
> On Wed, Jul 2, 2008 at 2:18 PM, Paul Davis <pa...@gmail.com> wrote:
>> One thing that got me awhile back was the version of erlang I was
>> using. If you're not on one of the most recent erlang versions R12B or
>> some such, you might try upgrading that bit to see if it fixes things.
>>
>> Paul
>>
>> On Wed, Jul 2, 2008 at 1:58 PM, Brad King <br...@gmail.com> wrote:
>>> I created a view with emit(doc.entityobject.sku, null) to only emit
>>> the doc ids. After trying attachments, I nuked the DB  and started
>>> over, going back to having the documents inline. This is ok, but
>>> again, the index build time of about 25 minutes for this view against
>>> 300K or so docs seems long. What are you seeing as typical for
>>> creating your views against a much larger set? What do your docs look
>>> like? Thanks.
>>>
>>>
>>> On Wed, Jul 2, 2008 at 10:50 AM, Jan Lehnardt <ja...@apache.org> wrote:
>>>>
>>>> On Jul 2, 2008, at 16:17, Brad King wrote:
>>>>
>>>>> Just to post some results here of working with around 300K docs. I
>>>>> changed the view to emit only the doc ID and index time went down to
>>>>> about 25 minutes vs. an hour for the same dataset.
>>>>>
>>>>> I then converted the largest text field to an attachment and things
>>>>> went down hill from there. I deleted the db and started the upload,
>>>>> but repeatedly got random 500 server errors with no real way to know
>>>>> what is happening or why. Also the DB size as reported by Futon seemed
>>>>> to fluctuate wildly as I was adding documents. And I mean wildly like
>>>>> anywhere from 1.2G then back down to 144M. Weird. I don't get a very
>>>>> warm fuzzy feeling about the stability of using attachments right now.
>>>>> Ideally, I don't want to use them anyway, I'd prefer to have the
>>>>> fields all inline and have the database handle these docs as-is. I
>>>>> don't see these as huge documents (2 to 5K) as compared to what I
>>>>> would store in something like Berkeley DB XML, just for comparison
>>>>> sake, so I'm hoping its a goal of the project to handle these
>>>>> effectively, even when several million documents are added.
>>>>
>>>> This doesn't sound right at all. Can you make sure you use the
>>>> very latest SVN version or the 0.8 release and completely
>>>> new databases? Also, just to clarify, do you emit the doc into
>>>> the view payload? As in emit(doc._id, doc); are you just doing
>>>> emit(null, null); to only get the docIds that matter to you and
>>>> then fetch the documents later? I have had the later setup running
>>>> without any problems across ~2mio documents in a database.
>>>>
>>>>
>>>>> As always, thanks for the help.
>>>>
>>>> Thanks for the problem report.
>>>>
>>>> Cheers
>>>> Jan
>>>> --
>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jul 1, 2008 at 9:26 AM, Brad King <br...@gmail.com> wrote:
>>>>>>
>>>>>> Thanks for the tips. I'll start scaling back the data I'm returning
>>>>>> and see if it improves. The largest field is an html description of an
>>>>>> inventory item, which seems like a good candidate for a binary
>>>>>> attachment, but I need to be able to do full text searches on this
>>>>>> data eventually (hopefully with the Lucene integration) so I'll
>>>>>> probably try just not including the document data in the views first.
>>>>>> We've had some success with Lucene independent of couchdb, so I'm
>>>>>> pleased you guys are integrating this.
>>>>>>
>>>>>> On Sat, Jun 21, 2008 at 8:39 AM, Damien Katz <da...@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> Part of the problem is you are storing copies of the documents into the
>>>>>>> btree. If the documents are big, it takes longer to compute on them, and
>>>>>>> if
>>>>>>> the results (emit(...)) are big or numerous, then you'll be spending
>>>>>>> most of
>>>>>>> your time in I/O.
>>>>>>>
>>>>>>> My advice is to not emit the document into the view, and if you can, get
>>>>>>> the
>>>>>>> documents smaller in general. If the data can stored as an binary
>>>>>>> attachment, then that too will give you a performance improvement.
>>>>>>>
>>>>>>> -Damien
>>>>>>>
>>>>>>> On Jun 20, 2008, at 4:51 PM, Brad King wrote:
>>>>>>>
>>>>>>>> Thanks, yes its currently at 357M and growing!
>>>>>>>>
>>>>>>>> On Fri, Jun 20, 2008 at 4:49 PM, Chris Anderson <jc...@grabb.it>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Brad,
>>>>>>>>>
>>>>>>>>> You can look at
>>>>>>>>>
>>>>>>>>> ls -lha /usr/local/var/lib/couchdb/.my-dbname_design/
>>>>>>>>>
>>>>>>>>> to see the view size growing...
>>>>>>>>>
>>>>>>>>> It won't tell you when it's done but it will give you hope that the
>>>>>>>>> progress is happening.
>>>>>>>>>
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>> On Fri, Jun 20, 2008 at 1:45 PM, Brad King <br...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I have about 350K documents in a database. typically around 5K each.
>>>>>>>>>> I
>>>>>>>>>> created and saved a view which simply looks at one field in the
>>>>>>>>>> document. I called the view for the first time with a key that should
>>>>>>>>>> only match one document, and its been awaiting a response for about
>>>>>>>>>> 45
>>>>>>>>>> minutes now.
>>>>>>>>>>
>>>>>>>>>> {
>>>>>>>>>> "sku": {
>>>>>>>>>>   "map": "function(doc) { emit(doc.entityobject.SKU, doc); }"
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Is this typical, or is there some optimizing to be done on either my
>>>>>>>>>> view or the server? I'm also running on a VM so this may have some
>>>>>>>>>> effects, but smaller databases seem to be performing pretty well.
>>>>>>>>>> Insert times to set this up were actually really good I thought, at
>>>>>>>>>> 4000 to 5000 documents per minute running from my laptop.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Chris Anderson
>>>>>>>>> http://jchris.mfdz.com
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Re: view index build time

Posted by Brad King <br...@gmail.com>.

I've got R12B. We've also got the couchdb 0.8.0-incubating version.
I'm just curious what my expectations should be for view creation
times. I was also wondering if anyone had tried putting the design
folder on a different disk to improve I/O.

On Wed, Jul 2, 2008 at 2:18 PM, Paul Davis <pa...@gmail.com> wrote:
> One thing that got me awhile back was the version of erlang I was
> using. If you're not on one of the most recent erlang versions R12B or
> some such, you might try upgrading that bit to see if it fixes things.
>
> Paul
>
> On Wed, Jul 2, 2008 at 1:58 PM, Brad King <br...@gmail.com> wrote:
>> I created a view with emit(doc.entityobject.sku, null) to only emit
>> the doc ids. After trying attachments, I nuked the DB  and started
>> over, going back to having the documents inline. This is ok, but
>> again, the index build time of about 25 minutes for this view against
>> 300K or so docs seems long. What are you seeing as typical for
>> creating your views against a much larger set? What do your docs look
>> like? Thanks.
>>
>>
>> On Wed, Jul 2, 2008 at 10:50 AM, Jan Lehnardt <ja...@apache.org> wrote:
>>>
>>> On Jul 2, 2008, at 16:17, Brad King wrote:
>>>
>>>> Just to post some results here of working with around 300K docs. I
>>>> changed the view to emit only the doc ID and index time went down to
>>>> about 25 minutes vs. an hour for the same dataset.
>>>>
>>>> I then converted the largest text field to an attachment and things
>>>> went down hill from there. I deleted the db and started the upload,
>>>> but repeatedly got random 500 server errors with no real way to know
>>>> what is happening or why. Also the DB size as reported by Futon seemed
>>>> to fluctuate wildly as I was adding documents. And I mean wildly like
>>>> anywhere from 1.2G then back down to 144M. Weird. I don't get a very
>>>> warm fuzzy feeling about the stability of using attachments right now.
>>>> Ideally, I don't want to use them anyway, I'd prefer to have the
>>>> fields all inline and have the database handle these docs as-is. I
>>>> don't see these as huge documents (2 to 5K) as compared to what I
>>>> would store in something like Berkeley DB XML, just for comparison
>>>> sake, so I'm hoping its a goal of the project to handle these
>>>> effectively, even when several million documents are added.
>>>
>>> This doesn't sound right at all. Can you make sure you use the
>>> very latest SVN version or the 0.8 release and completely
>>> new databases? Also, just to clarify, do you emit the doc into
>>> the view payload? As in emit(doc._id, doc); are you just doing
>>> emit(null, null); to only get the docIds that matter to you and
>>> then fetch the documents later? I have had the later setup running
>>> without any problems across ~2mio documents in a database.
>>>
>>>
>>>> As always, thanks for the help.
>>>
>>> Thanks for the problem report.
>>>
>>> Cheers
>>> Jan
>>> --
>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jul 1, 2008 at 9:26 AM, Brad King <br...@gmail.com> wrote:
>>>>>
>>>>> Thanks for the tips. I'll start scaling back the data I'm returning
>>>>> and see if it improves. The largest field is an html description of an
>>>>> inventory item, which seems like a good candidate for a binary
>>>>> attachment, but I need to be able to do full text searches on this
>>>>> data eventually (hopefully with the Lucene integration) so I'll
>>>>> probably try just not including the document data in the views first.
>>>>> We've had some success with Lucene independent of couchdb, so I'm
>>>>> pleased you guys are integrating this.
>>>>>
>>>>> On Sat, Jun 21, 2008 at 8:39 AM, Damien Katz <da...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Part of the problem is you are storing copies of the documents into the
>>>>>> btree. If the documents are big, it takes longer to compute on them, and
>>>>>> if
>>>>>> the results (emit(...)) are big or numerous, then you'll be spending
>>>>>> most of
>>>>>> your time in I/O.
>>>>>>
>>>>>> My advice is to not emit the document into the view, and if you can, get
>>>>>> the
>>>>>> documents smaller in general. If the data can stored as an binary
>>>>>> attachment, then that too will give you a performance improvement.
>>>>>>
>>>>>> -Damien
>>>>>>
>>>>>> On Jun 20, 2008, at 4:51 PM, Brad King wrote:
>>>>>>
>>>>>>> Thanks, yes its currently at 357M and growing!
>>>>>>>
>>>>>>> On Fri, Jun 20, 2008 at 4:49 PM, Chris Anderson <jc...@grabb.it>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Brad,
>>>>>>>>
>>>>>>>> You can look at
>>>>>>>>
>>>>>>>> ls -lha /usr/local/var/lib/couchdb/.my-dbname_design/
>>>>>>>>
>>>>>>>> to see the view size growing...
>>>>>>>>
>>>>>>>> It won't tell you when it's done but it will give you hope that the
>>>>>>>> progress is happening.
>>>>>>>>
>>>>>>>> Chris
>>>>>>>>
>>>>>>>> On Fri, Jun 20, 2008 at 1:45 PM, Brad King <br...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I have about 350K documents in a database. typically around 5K each.
>>>>>>>>> I
>>>>>>>>> created and saved a view which simply looks at one field in the
>>>>>>>>> document. I called the view for the first time with a key that should
>>>>>>>>> only match one document, and its been awaiting a response for about
>>>>>>>>> 45
>>>>>>>>> minutes now.
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>> "sku": {
>>>>>>>>>   "map": "function(doc) { emit(doc.entityobject.SKU, doc); }"
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Is this typical, or is there some optimizing to be done on either my
>>>>>>>>> view or the server? I'm also running on a VM so this may have some
>>>>>>>>> effects, but smaller databases seem to be performing pretty well.
>>>>>>>>> Insert times to set this up were actually really good I thought, at
>>>>>>>>> 4000 to 5000 documents per minute running from my laptop.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Chris Anderson
>>>>>>>> http://jchris.mfdz.com
>>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>

Re: view index build time

Posted by Paul Davis <pa...@gmail.com>.

One thing that got me a while back was the version of Erlang I was
using. If you're not on one of the most recent Erlang versions (R12B or
so), you might try upgrading that bit to see if it fixes things.

Paul

On Wed, Jul 2, 2008 at 1:58 PM, Brad King <br...@gmail.com> wrote:
> I created a view with emit(doc.entityobject.sku, null) to only emit
> the doc ids. After trying attachments, I nuked the DB  and started
> over, going back to having the documents inline. This is ok, but
> again, the index build time of about 25 minutes for this view against
> 300K or so docs seems long. What are you seeing as typical for
> creating your views against a much larger set? What do your docs look
> like? Thanks.
>
>
> On Wed, Jul 2, 2008 at 10:50 AM, Jan Lehnardt <ja...@apache.org> wrote:
>>
>> On Jul 2, 2008, at 16:17, Brad King wrote:
>>
>>> Just to post some results here of working with around 300K docs. I
>>> changed the view to emit only the doc ID and index time went down to
>>> about 25 minutes vs. an hour for the same dataset.
>>>
>>> I then converted the largest text field to an attachment and things
>>> went down hill from there. I deleted the db and started the upload,
>>> but repeatedly got random 500 server errors with no real way to know
>>> what is happening or why. Also the DB size as reported by Futon seemed
>>> to fluctuate wildly as I was adding documents. And I mean wildly like
>>> anywhere from 1.2G then back down to 144M. Weird. I don't get a very
>>> warm fuzzy feeling about the stability of using attachments right now.
>>> Ideally, I don't want to use them anyway, I'd prefer to have the
>>> fields all inline and have the database handle these docs as-is. I
>>> don't see these as huge documents (2 to 5K) as compared to what I
>>> would store in something like Berkeley DB XML, just for comparison
>>> sake, so I'm hoping its a goal of the project to handle these
>>> effectively, even when several million documents are added.
>>
>> This doesn't sound right at all. Can you make sure you use the
>> very latest SVN version or the 0.8 release and completely
>> new databases? Also, just to clarify, do you emit the doc into
>> the view payload? As in emit(doc._id, doc); are you just doing
>> emit(null, null); to only get the docIds that matter to you and
>> then fetch the documents later? I have had the later setup running
>> without any problems across ~2mio documents in a database.
>>
>>
>>> As always, thanks for the help.
>>
>> Thanks for the problem report.
>>
>> Cheers
>> Jan
>> --
>>
>>>
>>>
>>>
>>>
>>> On Tue, Jul 1, 2008 at 9:26 AM, Brad King <br...@gmail.com> wrote:
>>>>
>>>> Thanks for the tips. I'll start scaling back the data I'm returning
>>>> and see if it improves. The largest field is an html description of an
>>>> inventory item, which seems like a good candidate for a binary
>>>> attachment, but I need to be able to do full text searches on this
>>>> data eventually (hopefully with the Lucene integration) so I'll
>>>> probably try just not including the document data in the views first.
>>>> We've had some success with Lucene independent of couchdb, so I'm
>>>> pleased you guys are integrating this.
>>>>
>>>> On Sat, Jun 21, 2008 at 8:39 AM, Damien Katz <da...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Part of the problem is you are storing copies of the documents into the
>>>>> btree. If the documents are big, it takes longer to compute on them, and
>>>>> if
>>>>> the results (emit(...)) are big or numerous, then you'll be spending
>>>>> most of
>>>>> your time in I/O.
>>>>>
>>>>> My advice is to not emit the document into the view, and if you can, get
>>>>> the
>>>>> documents smaller in general. If the data can stored as an binary
>>>>> attachment, then that too will give you a performance improvement.
>>>>>
>>>>> -Damien
>>>>>
>>>>> On Jun 20, 2008, at 4:51 PM, Brad King wrote:
>>>>>
>>>>>> Thanks, yes its currently at 357M and growing!
>>>>>>
>>>>>> On Fri, Jun 20, 2008 at 4:49 PM, Chris Anderson <jc...@grabb.it>
>>>>>> wrote:
>>>>>>>
>>>>>>> Brad,
>>>>>>>
>>>>>>> You can look at
>>>>>>>
>>>>>>> ls -lha /usr/local/var/lib/couchdb/.my-dbname_design/
>>>>>>>
>>>>>>> to see the view size growing...
>>>>>>>
>>>>>>> It won't tell you when it's done but it will give you hope that the
>>>>>>> progress is happening.
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>>> On Fri, Jun 20, 2008 at 1:45 PM, Brad King <br...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> I have about 350K documents in a database. typically around 5K each.
>>>>>>>> I
>>>>>>>> created and saved a view which simply looks at one field in the
>>>>>>>> document. I called the view for the first time with a key that should
>>>>>>>> only match one document, and its been awaiting a response for about
>>>>>>>> 45
>>>>>>>> minutes now.
>>>>>>>>
>>>>>>>> {
>>>>>>>> "sku": {
>>>>>>>>   "map": "function(doc) { emit(doc.entityobject.SKU, doc); }"
>>>>>>>> }
>>>>>>>> }
>>>>>>>>
>>>>>>>> Is this typical, or is there some optimizing to be done on either my
>>>>>>>> view or the server? I'm also running on a VM so this may have some
>>>>>>>> effects, but smaller databases seem to be performing pretty well.
>>>>>>>> Insert times to set this up were actually really good I thought, at
>>>>>>>> 4000 to 5000 documents per minute running from my laptop.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Chris Anderson
>>>>>>> http://jchris.mfdz.com
>>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>

Re: view index build time

Posted by Brad King <br...@gmail.com>.

I created a view with emit(doc.entityobject.sku, null) to only emit
the doc ids. After trying attachments, I nuked the DB and started
over, going back to having the documents inline. This is ok, but
again, the index build time of about 25 minutes for this view against
300K or so docs seems long. What are you seeing as typical for
creating your views against a much larger set? What do your docs look
like? Thanks.
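
A minimal sketch of a key-only map function along these lines. The emit() below is a local stand-in so the snippet runs outside CouchDB, and the sample documents are hypothetical. One thing worth checking: JavaScript property names are case-sensitive, and the earlier view in this thread used doc.entityobject.SKU, so the name in the view has to match the stored field exactly. Whether the lower-case spelling above is just a typo in the mail is not known.

```javascript
// Stand-in for CouchDB's emit(); inside CouchDB the view server provides it.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Key-only map function: emits the SKU as the key and null as the value.
// The guard skips documents that lack the field instead of throwing.
function map(doc) {
  if (doc.entityobject && doc.entityobject.SKU !== undefined) {
    emit(doc.entityobject.SKU, null);
  }
}

// Hypothetical sample documents.
map({ _id: "a", entityobject: { SKU: "X-100", description: "<p>...</p>" } });
map({ _id: "b", entityobject: { sku: "X-200" } }); // lower-case field: skipped
map({ _id: "c" });                                 // no entityobject: skipped

console.log(rows.length); // number of rows emitted for the three documents
```

The guard is a design choice for the sketch: without it, a document that has no entityobject would raise a TypeError inside the map function.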


On Wed, Jul 2, 2008 at 10:50 AM, Jan Lehnardt <ja...@apache.org> wrote:
>
> On Jul 2, 2008, at 16:17, Brad King wrote:
>
>> Just to post some results here of working with around 300K docs. I
>> changed the view to emit only the doc ID and index time went down to
>> about 25 minutes vs. an hour for the same dataset.
>>
>> I then converted the largest text field to an attachment and things
>> went down hill from there. I deleted the db and started the upload,
>> but repeatedly got random 500 server errors with no real way to know
>> what is happening or why. Also the DB size as reported by Futon seemed
>> to fluctuate wildly as I was adding documents. And I mean wildly like
>> anywhere from 1.2G then back down to 144M. Weird. I don't get a very
>> warm fuzzy feeling about the stability of using attachments right now.
>> Ideally, I don't want to use them anyway, I'd prefer to have the
>> fields all inline and have the database handle these docs as-is. I
>> don't see these as huge documents (2 to 5K) as compared to what I
>> would store in something like Berkeley DB XML, just for comparison
>> sake, so I'm hoping its a goal of the project to handle these
>> effectively, even when several million documents are added.
>
> This doesn't sound right at all. Can you make sure you use the
> very latest SVN version or the 0.8 release and completely
> new databases? Also, just to clarify, do you emit the doc into
> the view payload? As in emit(doc._id, doc); are you just doing
> emit(null, null); to only get the docIds that matter to you and
> then fetch the documents later? I have had the later setup running
> without any problems across ~2mio documents in a database.
>
>
>> As always, thanks for the help.
>
> Thanks for the problem report.
>
> Cheers
> Jan
> --
>
>>
>>
>>
>>
>> On Tue, Jul 1, 2008 at 9:26 AM, Brad King <br...@gmail.com> wrote:
>>>
>>> Thanks for the tips. I'll start scaling back the data I'm returning
>>> and see if it improves. The largest field is an html description of an
>>> inventory item, which seems like a good candidate for a binary
>>> attachment, but I need to be able to do full text searches on this
>>> data eventually (hopefully with the Lucene integration) so I'll
>>> probably try just not including the document data in the views first.
>>> We've had some success with Lucene independent of couchdb, so I'm
>>> pleased you guys are integrating this.
>>>
>>> On Sat, Jun 21, 2008 at 8:39 AM, Damien Katz <da...@gmail.com>
>>> wrote:
>>>>
>>>> Part of the problem is you are storing copies of the documents into the
>>>> btree. If the documents are big, it takes longer to compute on them, and
>>>> if
>>>> the results (emit(...)) are big or numerous, then you'll be spending
>>>> most of
>>>> your time in I/O.
>>>>
>>>> My advice is to not emit the document into the view, and if you can, get
>>>> the
>>>> documents smaller in general. If the data can stored as an binary
>>>> attachment, then that too will give you a performance improvement.
>>>>
>>>> -Damien
>>>>
>>>> On Jun 20, 2008, at 4:51 PM, Brad King wrote:
>>>>
>>>>> Thanks, yes its currently at 357M and growing!
>>>>>
>>>>> On Fri, Jun 20, 2008 at 4:49 PM, Chris Anderson <jc...@grabb.it>
>>>>> wrote:
>>>>>>
>>>>>> Brad,
>>>>>>
>>>>>> You can look at
>>>>>>
>>>>>> ls -lha /usr/local/var/lib/couchdb/.my-dbname_design/
>>>>>>
>>>>>> to see the view size growing...
>>>>>>
>>>>>> It won't tell you when it's done but it will give you hope that the
>>>>>> progress is happening.
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>> On Fri, Jun 20, 2008 at 1:45 PM, Brad King <br...@gmail.com> wrote:
>>>>>>>
>>>>>>> I have about 350K documents in a database. typically around 5K each.
>>>>>>> I
>>>>>>> created and saved a view which simply looks at one field in the
>>>>>>> document. I called the view for the first time with a key that should
>>>>>>> only match one document, and its been awaiting a response for about
>>>>>>> 45
>>>>>>> minutes now.
>>>>>>>
>>>>>>> {
>>>>>>> "sku": {
>>>>>>>   "map": "function(doc) { emit(doc.entityobject.SKU, doc); }"
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> Is this typical, or is there some optimizing to be done on either my
>>>>>>> view or the server? I'm also running on a VM so this may have some
>>>>>>> effects, but smaller databases seem to be performing pretty well.
>>>>>>> Insert times to set this up were actually really good I thought, at
>>>>>>> 4000 to 5000 documents per minute running from my laptop.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Chris Anderson
>>>>>> http://jchris.mfdz.com
>>>>>>
>>>>
>>>>
>>>
>>
>
>

Re: view index build time

Posted by Jan Lehnardt <ja...@apache.org>.

On Jul 2, 2008, at 16:17, Brad King wrote:

> Just to post some results here of working with around 300K docs. I
> changed the view to emit only the doc ID and index time went down to
> about 25 minutes vs. an hour for the same dataset.
>
> I then converted the largest text field to an attachment and things
> went down hill from there. I deleted the db and started the upload,
> but repeatedly got random 500 server errors with no real way to know
> what is happening or why. Also the DB size as reported by Futon seemed
> to fluctuate wildly as I was adding documents. And I mean wildly like
> anywhere from 1.2G then back down to 144M. Weird. I don't get a very
> warm fuzzy feeling about the stability of using attachments right now.
> Ideally, I don't want to use them anyway, I'd prefer to have the
> fields all inline and have the database handle these docs as-is. I
> don't see these as huge documents (2 to 5K) as compared to what I
> would store in something like Berkeley DB XML, just for comparison
> sake, so I'm hoping its a goal of the project to handle these
> effectively, even when several million documents are added.

This doesn't sound right at all. Can you make sure you use the
very latest SVN version or the 0.8 release and completely
new databases? Also, just to clarify, do you emit the doc into
the view payload, as in emit(doc._id, doc); or are you just doing
emit(null, null); to only get the docIds that matter to you and
then fetch the documents later? I have had the latter setup running
without any problems across ~2 million documents in a database.
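
To give a feel for the difference between the two setups, here is a small sketch that compares the JSON size of one view row in each case. It assumes Node.js (for Buffer), uses a made-up document of about 4K, and the row objects are only a JSON approximation, not CouchDB's on-disk format.

```javascript
// Hypothetical document, roughly the size discussed in this thread.
const doc = {
  _id: "item-1",
  entityobject: { SKU: "X-100", description: "x".repeat(4000) }
};

// Approximate row when the whole document is emitted as the value.
const fullRow = { id: doc._id, key: doc.entityobject.SKU, value: doc };
// Approximate row when only the key is emitted.
const leanRow = { id: doc._id, key: doc.entityobject.SKU, value: null };

const fullBytes = Buffer.byteLength(JSON.stringify(fullRow), "utf8");
const leanBytes = Buffer.byteLength(JSON.stringify(leanRow), "utf8");

console.log(`emit(key, doc):  ${fullBytes} bytes per row`);
console.log(`emit(key, null): ${leanBytes} bytes per row`);
```

Multiplied across a few hundred thousand documents, the first form writes a second copy of nearly the whole database into the view, which is the I/O cost described earlier in the thread.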


> As always, thanks for the help.

Thanks for the problem report.

Cheers
Jan
--

>
>
>
>
> On Tue, Jul 1, 2008 at 9:26 AM, Brad King <br...@gmail.com> wrote:
>> Thanks for the tips. I'll start scaling back the data I'm returning
>> and see if it improves. The largest field is an html description of an
>> inventory item, which seems like a good candidate for a binary
>> attachment, but I need to be able to do full text searches on this
>> data eventually (hopefully with the Lucene integration) so I'll
>> probably try just not including the document data in the views first.
>> We've had some success with Lucene independent of couchdb, so I'm
>> pleased you guys are integrating this.
>>
>> On Sat, Jun 21, 2008 at 8:39 AM, Damien Katz <da...@gmail.com> wrote:
>>> Part of the problem is you are storing copies of the documents into the
>>> btree. If the documents are big, it takes longer to compute on them, and if
>>> the results (emit(...)) are big or numerous, then you'll be spending most of
>>> your time in I/O.
>>>
>>> My advice is to not emit the document into the view, and if you can, get the
>>> documents smaller in general. If the data can stored as an binary
>>> attachment, then that too will give you a performance improvement.
>>>
>>> -Damien
>>>
>>> On Jun 20, 2008, at 4:51 PM, Brad King wrote:
>>>
>>>> Thanks, yes its currently at 357M and growing!
>>>>
>>>> On Fri, Jun 20, 2008 at 4:49 PM, Chris Anderson <jc...@grabb.it> wrote:
>>>>>
>>>>> Brad,
>>>>>
>>>>> You can look at
>>>>>
>>>>> ls -lha /usr/local/var/lib/couchdb/.my-dbname_design/
>>>>>
>>>>> to see the view size growing...
>>>>>
>>>>> It won't tell you when it's done but it will give you hope that the
>>>>> progress is happening.
>>>>>
>>>>> Chris
>>>>>
>>>>> On Fri, Jun 20, 2008 at 1:45 PM, Brad King <br...@gmail.com> wrote:
>>>>>>
>>>>>> I have about 350K documents in a database. typically around 5K each. I
>>>>>> created and saved a view which simply looks at one field in the
>>>>>> document. I called the view for the first time with a key that should
>>>>>> only match one document, and its been awaiting a response for about 45
>>>>>> minutes now.
>>>>>>
>>>>>> {
>>>>>> "sku": {
>>>>>>    "map": "function(doc) { emit(doc.entityobject.SKU, doc); }"
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> Is this typical, or is there some optimizing to be done on either my
>>>>>> view or the server? I'm also running on a VM so this may have some
>>>>>> effects, but smaller databases seem to be performing pretty well.
>>>>>> Insert times to set this up were actually really good I thought, at
>>>>>> 4000 to 5000 documents per minute running from my laptop.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Chris Anderson
>>>>> http://jchris.mfdz.com
>>>>>
>>>
>>>
>>
>

Re: view index build time

Posted by Brad King <br...@gmail.com>.

Just to post some results here of working with around 300K docs. I
changed the view to emit only the doc ID and index time went down to
about 25 minutes vs. an hour for the same dataset.
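
For reference, those figures work out as follows. This is plain arithmetic on the approximate numbers above, nothing measured separately; the function name docsPerSecond is just for illustration.

```javascript
// Approximate figures reported in this thread.
const docs = 300000;
const minutesWithDoc = 60; // view emitting the whole document
const minutesKeyOnly = 25; // view emitting only the key

// Indexing throughput in documents per second.
function docsPerSecond(count, minutes) {
  return count / (minutes * 60);
}

const rateWithDoc = docsPerSecond(docs, minutesWithDoc);
const rateKeyOnly = docsPerSecond(docs, minutesKeyOnly);

console.log(rateWithDoc.toFixed(1)); // prints "83.3"
console.log(rateKeyOnly.toFixed(1)); // prints "200.0"
console.log((rateKeyOnly / rateWithDoc).toFixed(1)); // prints "2.4"
```

So the key-only view indexes at about 200 docs per second, roughly 2.4 times the rate of the view that emitted the whole document.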

I then converted the largest text field to an attachment and things
went downhill from there. I deleted the db and started the upload,
but repeatedly got random 500 server errors with no real way to know
what was happening or why. Also, the DB size as reported by Futon seemed
to fluctuate wildly as I was adding documents. And I mean wildly, like
anywhere from 1.2G then back down to 144M. Weird. I don't get a very
warm fuzzy feeling about the stability of using attachments right now.
Ideally, I don't want to use them anyway; I'd prefer to have the
fields all inline and have the database handle these docs as-is. I
don't see these as huge documents (2 to 5K) compared to what I
would store in something like Berkeley DB XML, just for comparison's
sake, so I'm hoping it's a goal of the project to handle these
effectively, even when several million documents are added.

As always, thanks for the help.



On Tue, Jul 1, 2008 at 9:26 AM, Brad King <br...@gmail.com> wrote:
> Thanks for the tips. I'll start scaling back the data I'm returning
> and see if it improves. The largest field is an html description of an
> inventory item, which seems like a good candidate for a binary
> attachment, but I need to be able to do full text searches on this
> data eventually (hopefully with the Lucene integration) so I'll
> probably try just not including the document data in the views first.
> We've had some success with Lucene independent of couchdb, so I'm
> pleased you guys are integrating this.
>
> On Sat, Jun 21, 2008 at 8:39 AM, Damien Katz <da...@gmail.com> wrote:
>> Part of the problem is you are storing copies of the documents into the
>> btree. If the documents are big, it takes longer to compute on them, and if
>> the results (emit(...)) are big or numerous, then you'll be spending most of
>> your time in I/O.
>>
>> My advice is to not emit the document into the view, and if you can, get the
>> documents smaller in general. If the data can stored as an binary
>> attachment, then that too will give you a performance improvement.
>>
>> -Damien
>>
>> On Jun 20, 2008, at 4:51 PM, Brad King wrote:
>>
>>> Thanks, yes its currently at 357M and growing!
>>>
>>> On Fri, Jun 20, 2008 at 4:49 PM, Chris Anderson <jc...@grabb.it> wrote:
>>>>
>>>> Brad,
>>>>
>>>> You can look at
>>>>
>>>> ls -lha /usr/local/var/lib/couchdb/.my-dbname_design/
>>>>
>>>> to see the view size growing...
>>>>
>>>> It won't tell you when it's done but it will give you hope that the
>>>> progress is happening.
>>>>
>>>> Chris
>>>>
>>>> On Fri, Jun 20, 2008 at 1:45 PM, Brad King <br...@gmail.com> wrote:
>>>>>
>>>>> I have about 350K documents in a database. typically around 5K each. I
>>>>> created and saved a view which simply looks at one field in the
>>>>> document. I called the view for the first time with a key that should
>>>>> only match one document, and its been awaiting a response for about 45
>>>>> minutes now.
>>>>>
>>>>> {
>>>>>  "sku": {
>>>>>     "map": "function(doc) { emit(doc.entityobject.SKU, doc); }"
>>>>>  }
>>>>> }
>>>>>
>>>>> Is this typical, or is there some optimizing to be done on either my
>>>>> view or the server? I'm also running on a VM so this may have some
>>>>> effects, but smaller databases seem to be performing pretty well.
>>>>> Insert times to set this up were actually really good I thought, at
>>>>> 4000 to 5000 documents per minute running from my laptop.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Chris Anderson
>>>> http://jchris.mfdz.com
>>>>
>>
>>
>