Posted to user@couchdb.apache.org by Fabio Forno <fa...@gmail.com> on 2009/10/22 12:55:19 UTC

re-index efficiency

Hi,
not knowing the internals of couchdb I may be asking a stupid question,
so just ignore it if it's really stupid ;)

While using it I've noticed that re-indexing takes a time comparable
to inserting all the documents without bulk inserts, while with bulk
inserts the insertion of documents is much faster. In my view,
re-indexing should be as fast as bulk inserts, since when computing an
index we don't need to do many fsyncs and can instead allow maximum
caching before disk writes (with berkeley db, for example, sustained
writes of data exceeding the memory cache are 100-1000x faster without
a sync for each write). So, since I don't think this relative slowness
is due to fsyncs, what is the main reason? (Another hint that rules
out fsyncs is that cpu usage is rather high and not in a waiting
state.)
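
To make the fsync argument above concrete, here is a minimal Python sketch
that writes the same records twice: once with an fsync after every write,
and once with a single fsync at the end. It is not CouchDB or Berkeley DB
code; the function names (write_records, demo) and the record size are
assumptions chosen for illustration. It only counts sync calls, since the
actual speed difference depends heavily on the disk and filesystem.

```python
import os
import tempfile


def write_records(path, records, sync_every_write):
    """Write records to path and return the number of fsync calls issued.

    Hypothetical helper: syncs after each record, or once at the end.
    """
    syncs = 0
    with open(path, "wb") as f:
        for record in records:
            f.write(record)
            if sync_every_write:
                f.flush()             # push Python's buffer to the OS
                os.fsync(f.fileno())  # ask the OS to write through to disk
                syncs += 1
        if not sync_every_write:
            f.flush()
            os.fsync(f.fileno())      # one sync covers the whole batch
            syncs += 1
    return syncs


def demo(n=200):
    """Return (syncs per-write, syncs batched, whether both files match)."""
    records = [b"x" * 128 + b"\n" for _ in range(n)]
    with tempfile.TemporaryDirectory() as d:
        per_write = os.path.join(d, "per_write.dat")
        batched = os.path.join(d, "batched.dat")
        syncs_a = write_records(per_write, records, True)
        syncs_b = write_records(batched, records, False)
        with open(per_write, "rb") as fa, open(batched, "rb") as fb:
            same = fa.read() == fb.read()
    return syncs_a, syncs_b, same
```

Both files end up with identical contents; the only difference is how
often the writer waits for the disk. Wrapping the two write_records calls
in time.perf_counter() gives a rough feel for the gap on a given machine.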

-- 
Fabio Forno,
Bluendo srl http://www.bluendo.com
jabber id: ff@jabber.bluendo.com

Re: re-index efficiency

Posted by Adam Kocoloski <ko...@apache.org>.
On Oct 22, 2009, at 11:28 AM, Fabio Forno wrote:

> On Thu, Oct 22, 2009 at 5:12 PM, Paul Davis <paul.joseph.davis@gmail.com> wrote:
>> Fabio,
>>
>> There are about four things that will slow view generation down from
>> the _bulk_docs rate:
>>
>> 1. JSON conversion (twice) when passing data to the view process
>> 2. Collation of keys on tree insertion
>> 3. I/O (Disk and stdio)
>> 4. Memory thresholds
>>
>> Things like native views will give noticeable speed improvements
>> because they avoid JSON serialization and transfer over stdio. The
>> other (theoretically) tunable parameter is the memory threshold that
>> triggers flushes to disk. It's not currently configurable by the
>> client (requires a rebuild of couchdb) and as such I haven't seen
>> anyone attempt to tune it.
>
> Thanks for the answer. So I see that there is considerable margin
> for improvement, because ideally index re-generation should be
> bound by disk speed once all possible optimizations have kicked in
> (except in some pathological situations, such as an application I
> have which stores chunks of xml in document strings, forcing double
> parsing in order to process them ;))
>
> bye

There are optimizations in trunk that get CouchDB closer to achieving
this goal. Re-indexing does lots of random I/O, so you won't be
seeing 30MB/s on spinning platters, but it's many times better than
what we had in 0.9. Best,

Adam


Re: re-index efficiency

Posted by Fabio Forno <fa...@gmail.com>.
On Thu, Oct 22, 2009 at 5:12 PM, Paul Davis <pa...@gmail.com> wrote:
> Fabio,
>
> There are about four things that will slow view generation down from
> the _bulk_docs rate:
>
> 1. JSON conversion (twice) when passing data to the view process
> 2. Collation of keys on tree insertion
> 3. I/O (Disk and stdio)
> 4. Memory thresholds
>
> Things like native views will give noticeable speed improvements
> because they avoid JSON serialization and transfer over stdio. The
> other (theoretically) tunable parameter is the memory threshold that
> triggers flushes to disk. It's not currently configurable by the client
> (requires a rebuild of couchdb) and as such I haven't seen anyone
> attempt to tune it.

Thanks for the answer. So I see that there is considerable margin
for improvement, because ideally index re-generation should be
bound by disk speed once all possible optimizations have kicked in
(except in some pathological situations, such as an application I
have which stores chunks of xml in document strings, forcing double
parsing in order to process them ;))

bye

-- 
Fabio Forno,
Bluendo srl http://www.bluendo.com
jabber id: ff@jabber.bluendo.com

Re: re-index efficiency

Posted by Paul Davis <pa...@gmail.com>.
Fabio,

There are about four things that will slow view generation down from
the _bulk_docs rate:

1. JSON conversion (twice) when passing data to the view process
2. Collation of keys on tree insertion
3. I/O (Disk and stdio)
4. Memory thresholds

Things like native views will give noticeable speed improvements
because they avoid JSON serialization and transfer over stdio. The
other (theoretically) tunable parameter is the memory threshold that
triggers flushes to disk. It's not currently configurable by the client
(requires a rebuild of couchdb) and as such I haven't seen anyone
attempt to tune it.
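
As a rough illustration of the points above, the following Python sketch
builds a toy index the slow way: each document is encoded to JSON, handed
to a stand-in for an external view process, and the emitted rows are
decoded again. Rows are collected in memory and "flushed" whenever a byte
threshold is reached. None of this is CouchDB's actual code; map_doc,
external_view_process, build_index and flush_threshold_bytes are
hypothetical names, and the flush is just a list append standing in for a
disk write.

```python
import json


def map_doc(doc):
    """Hypothetical map function: emit (key, value) pairs for one document."""
    return [(doc["type"], doc["_id"])]


def external_view_process(line):
    """Stand-in for an external view server: JSON text in, JSON text out."""
    doc = json.loads(line)            # decode the document
    return json.dumps(map_doc(doc))   # encode the emitted rows


def build_index(docs, flush_threshold_bytes):
    """Return (sorted rows, number of flushes) for a toy in-memory index."""
    index, buffer, buffered_bytes, flushes = [], [], 0, 0
    for doc in docs:
        request = json.dumps(doc)                  # conversion 1: doc -> JSON
        response = external_view_process(request)  # would cross stdio
        rows = json.loads(response)                # conversion 2: JSON -> rows
        for key, value in rows:
            buffer.append((key, value))
            buffered_bytes += len(key) + len(value)
        if buffered_bytes >= flush_threshold_bytes:
            index.extend(buffer)   # stand-in for writing a batch to disk
            buffer, buffered_bytes = [], 0
            flushes += 1
    if buffer:                     # write whatever is left at the end
        index.extend(buffer)
        flushes += 1
    index.sort()                   # stand-in for key collation on insertion
    return index, flushes
```

A "native" view in this toy model would call map_doc(doc) directly and
skip both json.dumps/json.loads steps. Raising flush_threshold_bytes
trades memory for fewer, larger flushes, which is the tuning knob
described above.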

HTH,
Paul Davis

On Thu, Oct 22, 2009 at 6:55 AM, Fabio Forno <fa...@gmail.com> wrote:
> Hi,
> not knowing the internals of couchdb I may be asking a stupid question,
> so just ignore it if it's really stupid ;)
>
> While using it I've noticed that re-indexing takes a time comparable
> to inserting all the documents without bulk inserts, while with bulk
> inserts the insertion of documents is much faster. In my view,
> re-indexing should be as fast as bulk inserts, since when computing an
> index we don't need to do many fsyncs and can instead allow maximum
> caching before disk writes (with berkeley db, for example, sustained
> writes of data exceeding the memory cache are 100-1000x faster without
> a sync for each write). So, since I don't think this relative slowness
> is due to fsyncs, what is the main reason? (Another hint that rules
> out fsyncs is that cpu usage is rather high and not in a waiting
> state.)
>
> --
> Fabio Forno,
> Bluendo srl http://www.bluendo.com
> jabber id: ff@jabber.bluendo.com
>