Posted to user@couchdb.apache.org by Paul Davis <pa...@gmail.com> on 2009/02/26 19:58:45 UTC

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

On Thu, Feb 26, 2009 at 1:49 PM, Barry Wark <ba...@gmail.com> wrote:
> On Thu, Feb 26, 2009 at 8:30 AM, Chris Anderson <jc...@apache.org> wrote:
>> On Thu, Feb 26, 2009 at 2:04 AM, Jan Lehnardt <ja...@apache.org> wrote:
>>> Hi Scott,
>>>
>>> thanks for your feedback. As a general note, you can't expect any
>>> magic from CouchDB. It is bound by the same constraints as all other
>>> programs. To get the most out of CouchDB or SQL Server or MySQL, you
>>> need to understand how it works.
>>>
>>>
>>> On 26 Feb 2009, at 05:30, Scott Zhang wrote:
>>>
>>>> Hi. Thanks for replying.
>>>> But what is a database for if it is slow? Every database has features
>>>> for clustering to improve speed and capacity (don't mention
>>>> "Access"-type things).
>>>
>>> The point of CouchDB is to allow high numbers of concurrent requests.
>>> This gives you more throughput on a single machine, but not
>>> necessarily faster execution of any single query.
>>>
>>>
>>>> I was expecting CouchDB to be as fast as SQL Server or MySQL. At
>>>> least I know Mnesia is much faster than SQL Server, though Mnesia
>>>> always throws harmless "overload" messages.
>>>
>>> CouchDB is not nearly as old as either of them. Did you really expect
>>> software in its alpha stage to be faster than fine-tuned systems that
>>> have been used in production for a decade or longer?
>>>
>>>
>>>> I will try bulk insert now. But to be fair, I was also inserting into
>>>> SQL Server one record at a time.
>>>
>>> Insert speed can be improved in numerous ways:
>>>
>>>  - Use sequential descending document ids on insert.
>>
>> or ascending...
>
> As an aside, why is it that sequential document ids produce a
> significant performance boost? I suspect the answer is something
> rather fundamental to CouchDB's design, and I'd like to try to grok
> it.
>
> Thanks,
> Barry
>

It has to do with how the btree is updated. Basically, when you write
to a btree, any leaf node that changes, plus the entire path from that
leaf to the root node, must be rewritten. If you use sequential ids,
you minimize the number of nodes that must be rewritten, because
consecutive inserts land in the same rightmost leaf and share the same
path to the root. Thinking idly about it, there might also be gains in
disk seek times, because you're only traveling in one direction with
each append. And there's what Jan said about FS caching.
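
To make that concrete, here is a minimal back-of-the-envelope
simulation (a sketch, not anything from CouchDB's actual source; the
fanout and key counts are made-up assumptions). It models only the
shape of a b+tree over the by-id index and counts how many distinct
nodes a batch of inserts forces to be rewritten:

    import bisect
    import random
    from math import ceil

    def rewritten_nodes(existing_keys, batch, fanout=100):
        # Model only the shape of a b+tree: leaf i holds keys
        # existing_keys[i*fanout:(i+1)*fanout], and each upper level
        # packs 'fanout' children per node. In an append-only file,
        # every leaf the batch touches, plus that leaf's whole path to
        # the root, must be rewritten, so we count the distinct nodes
        # across all of those paths.
        keys = sorted(existing_keys)
        n_leaves = ceil(len(keys) / fanout)
        touched = set()
        for key in batch:
            idx = min(bisect.bisect_left(keys, key) // fanout,
                      n_leaves - 1)
            width = n_leaves          # number of nodes at this level
            level = 0
            while True:
                touched.add((level, idx))
                if width <= 1:        # reached the root
                    break
                idx //= fanout
                width = ceil(width / fanout)
                level += 1
        return len(touched)

    random.seed(42)
    existing = ["%08d" % i for i in range(1000000)]
    sequential = ["%08d" % (1000000 + i) for i in range(1000)]
    scattered = ["%08d" % random.randrange(1000000) for _ in range(1000)]

    print("sequential batch:", rewritten_nodes(existing, sequential))
    print("scattered batch: ", rewritten_nodes(existing, scattered))

With these assumptions the sequential batch rewrites 3 nodes (one leaf
plus its path to the root), while the scattered batch rewrites on the
order of a thousand.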

HTH,
Paul Davis

>>
>>>  - Use bulk insert.
>>
>> With ascending keys and bulk inserts of 1000 docs at a time I was able
>> to write 3k docs per second. Here is the benchmark script:
>> http://friendpaste.com/5g0kOEPonxdXMKibNRzetJ
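
For reference, here is a minimal Python sketch of the same approach
over plain HTTP. The server address, database name, and batch size are
assumptions, and it presumes a local CouchDB that accepts
unauthenticated writes; real numbers depend heavily on your hardware:

    import http.client
    import json
    import time

    HOST, PORT, DB = "127.0.0.1", 5984, "speed_test"  # assumed local server
    BATCH, TOTAL = 1000, 10000

    conn = http.client.HTTPConnection(HOST, PORT)
    conn.request("PUT", "/" + DB)     # create the db; 412 if it exists
    conn.getresponse().read()

    start = time.time()
    for base in range(0, TOTAL, BATCH):
        # Zero-padded counters sort ascending, so every batch lands on
        # the rightmost edge of the by-id btree.
        docs = [{"_id": "%012d" % (base + i), "value": base + i}
                for i in range(BATCH)]
        conn.request("POST", "/%s/_bulk_docs" % DB,
                     json.dumps({"docs": docs}),
                     {"Content-Type": "application/json"})
        conn.getresponse().read()

    elapsed = time.time() - start
    print("%d docs in %.1fs (%.0f docs/sec)"
          % (TOTAL, elapsed, TOTAL / elapsed))
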
>>
>>
>>>  - Bypass the HTTP API and insert native Erlang terms and skip JSON
>>> conversion.
>>
>> Doing this I was able to get 6k docs/sec.
>>
>> In a separate test using attachments of 250k each and an Erlang API
>> (no HTTP), I was able to write to my disk at 80% of the speed it can
>> accept when streaming raw bytes to disk (roughly 20 MB/sec).
>>
>>>
>>> The question is what you need your system to look like eventually. If
>>> this is an initial data import and after that you get mostly read
>>> requests, the longer insertion time will amortize over time.
>>>
>>> What version is the Windows binary you are using? If it is still 0.8,
>>> you should try trunk (which most likely means switching to some UNIXy
>>> system).
>>>
>>> Cheers
>>> Jan
>>> --
>>>>
>>>> Regards.
>>>>
>>>> On Thu, Feb 26, 2009 at 12:18 PM, Jens Alfke <je...@mooseyard.com> wrote:
>>>>
>>>>>
>>>>> On Feb 25, 2009, at 8:02 PM, Scott Zhang wrote:
>>>>>
>>>>>> But the performance is as bad as I could imagine. After several
>>>>>> minutes of running, I had only inserted 120K records. I saw the
>>>>>> speed was ~20 records per second.
>>>>>
>>>>> Use the bulk-insert API to improve speed. The way you're doing it,
>>>>> every record being added is a separate transaction, which requires a
>>>>> separate HTTP request and a flush of the file each time.
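
For contrast with the bulk sketch above, this is roughly the slow
pattern being described: one PUT, one HTTP round trip, and one commit
per document (again only a sketch; the server address and database
name are assumptions):

    import http.client
    import json

    conn = http.client.HTTPConnection("127.0.0.1", 5984)  # assumed local
    conn.request("PUT", "/speed_test_slow")               # assumed db name
    conn.getresponse().read()

    for i in range(1000):
        # One document per request: each PUT is its own transaction,
        # so the server flushes to disk for every single record.
        conn.request("PUT", "/speed_test_slow/%012d" % i,
                     json.dumps({"value": i}),
                     {"Content-Type": "application/json"})
        conn.getresponse().read()
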
>>>>>
>>>>> (I'm a CouchDB newbie, but I don't think the point of CouchDB is speed.
>>>>> What's exciting about it is the flexibility and the ability to build
>>>>> distributed systems. If you're looking for a traditional database with
>>>>> speed, have you tried MySQL?)
>>>>>
>>>>> —Jens
>>>
>>>
>>
>>
>>
>> --
>> Chris Anderson
>> http://jchris.mfdz.com
>>
>