Posted to user@couchdb.apache.org by Barry Wark <ba...@gmail.com> on 2009/02/26 19:49:10 UTC

Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

On Thu, Feb 26, 2009 at 8:30 AM, Chris Anderson <jc...@apache.org> wrote:
> On Thu, Feb 26, 2009 at 2:04 AM, Jan Lehnardt <ja...@apache.org> wrote:
>> Hi Scott,
>>
>> thanks for your feedback. As a general note, you can't expect any magic
>> from CouchDB. It is bound by the same constraints all other programmes
>> are. To get the most out of CouchDB or SqlServer or MySQL, you need
>> to understand how it works.
>>
>>
>> On 26 Feb 2009, at 05:30, Scott Zhang wrote:
>>
>>> Hi. Thanks for replying.
>>> But what a database is for if it is slow? Every database has the feature
>>> to
>>> make clusters to improve speed and capacity (Don't mention "access" things).
>>
>> The point of CouchDB is allowing high numbers of concurrent requests. This
>> gives you more throughput for a single machine but not necessarily faster
>> single query execution speed.
>>
>>
>>> I was expecting couchDB is as fast as SqlServer or mysql. At least I know,
>>> mnesia is much faster than SqlServer. But mnesia always throws harmless
>>> "overload" message.
>>
>> CouchDB is not nearly as old as either of them. Did you really expect
>> software in its alpha stage to be faster than fine-tuned systems that have
>> been used in production for a decade or longer?
>>
>>
>>> I will try bulk insert now. But to be fair, I was inserting into sqlserver
>>> one insert at a time.
>>
>> Insert speed can be sped up in numerous ways:
>>
>>  - Use sequential descending document ids on insert.
>
> or ascending...

As an aside, why is it that sequential document ids would produce a
significant performance boost? I suspect the answer is something
rather fundamental to CouchDB's design, and I'd like to try to grok
it.

Thanks,
Barry

>
>>  - Use bulk insert.
>
> with ascending keys and bulk insert of 1000 docs at a time I was able
> to write 3k docs per second. here is the benchmark script:
> http://friendpaste.com/5g0kOEPonxdXMKibNRzetJ
>
>
>>  - Bypass the HTTP API and insert native Erlang terms and skip JSON
>> conversion.
>
> doing this I was able to get 6k docs / sec
>
> In a separate test using attachments of 250k and an Erlang API (no
> HTTP) I was able to write to my disk at 80% of the speed it can accept
> when streaming raw bytes to disk. (Roughly 20 MB/sec)
>
>>
>> The question is what you need your system to look like eventually. If this is
>> an initial data-import and after that you get mostly read requests, the
>> longer
>> insertion time will amortize over time.
>>
>> What version is the Windows binary you are using? If it is still 0.8, you
>> should
>> try trunk (which most likely means switching to some UNIXy system).
>>
>> Cheers
>> Jan
>> --
>>
>>
>>
>>
>>
>>>
>>> Regards.
>>>
>>>
>>>
>>>
>>> On Thu, Feb 26, 2009 at 12:18 PM, Jens Alfke <je...@mooseyard.com> wrote:
>>>
>>>>
>>>> On Feb 25, 2009, at 8:02 PM, Scott Zhang wrote:
>>>>
>>>>> But the performance is as bad as I can imagine. After several minutes of
>>>>> running, I had only inserted 120K records. I saw the speed is ~20 records
>>>>> per second.
>>>>>
>>>>
>>>> Use the bulk-insert API to improve speed. The way you're doing it, every
>>>> record being added is a separate transaction, which requires a separate
>>>> HTTP
>>>> request and flushing the file.
>>>>
>>>> (I'm a CouchDB newbie, but I don't think the point of CouchDB is speed.
>>>> What's exciting about it is the flexibility and the ability to build
>>>> distributed systems. If you're looking for a traditional database with
>>>> speed, have you tried MySQL?)
>>>>
>>>> —Jens
>>
>>
>
>
>
> --
> Chris Anderson
> http://jchris.mfdz.com
>

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Barry Wark <ba...@gmail.com>.
On Thu, Feb 26, 2009 at 3:54 PM, Chris Anderson <jc...@apache.org> wrote:
> On Thu, Feb 26, 2009 at 3:31 PM, Paul Davis <pa...@gmail.com> wrote:
>>
>> Not sure I follow what you mean by balance issues. Seeing as it's a
>> btree it can't be unbalanced, because if it were it wouldn't be a
>> btree. But then our definitions of balance might be different. :D
>
> When a B-tree is consistently inserted into, with a new highest (or
> lowest) value, you can end up with a deeper tree than when it is
> randomly inserted into. Imagine the inner-nodes all accumulating on
> the right side of the tree, so in the worst case you can end up with
> one lonely leaf node to the left of the root, and essentially a linked
> list of inner-nodes on the right side (such that cost would approach
> O(N) for inserts and lookups).

Thank you for clarifying my intent with a much better statement. This
is what I meant by "unbalanced".

>
> My benchmarks show that inserts do not slow down as you would expect
> them to in this worst case, so Couch is probably doing something
> non-naive to rebalance the tree even when inserts are always at one
> extreme.

Yes, Paul's response shows that this is the case.

>
>>
>> The only performance difference between reading random and the
>> sequential document ids I can think of would favor the sequential side
>> when you're reading documents in roughly the same order as their ids
>> so we get FS cache hits.
>
> From a practical standpoint, you're probably right about this
> (assuming the tree remains shallow even with inserts consistently at
> one edge of the tree.)
>
> Chris
>
> --
> Chris Anderson
> http://jchris.mfdz.com
>

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Paul Davis <pa...@gmail.com>.
On Thu, Feb 26, 2009 at 6:54 PM, Chris Anderson <jc...@apache.org> wrote:
> On Thu, Feb 26, 2009 at 3:31 PM, Paul Davis <pa...@gmail.com> wrote:
>>
>> Not sure I follow what you mean by balance issues. Seeing as it's a
>> btree it can't be unbalanced, because if it were it wouldn't be a
>> btree. But then our definitions of balance might be different. :D
>
> When a B-tree is consistently inserted into, with a new highest (or
> lowest) value, you can end up with a deeper tree than when it is
> randomly inserted into. Imagine the inner-nodes all accumulating on
> the right side of the tree, so in the worst case you can end up with
> one lonely leaf node to the left of the root, and essentially a linked
> list of inner-nodes on the right side (such that cost would approach
> O(N) for inserts and lookups).
>

An unbalanced btree is, by definition, not a btree. One of the major
definition bullet points is that all leaf nodes are always exactly the
same depth.

Some interesting code that makes the rebalancing stick out is the code
in couch_btree:complete_root that 'caps the tree' if you will.

> My benchmarks show that inserts do not slow down as you would expect
> them to in this worst case, so Couch is probably doing something
> non-naive to rebalance the tree even when inserts are always at one
> extreme.
>
>>
>> The only performance difference between reading random and the
>> sequential document ids I can think of would favor the sequential side
>> when you're reading documents in roughly the same order as their ids
>> so we get FS cache hits.
>
> From a practical standpoint, you're probably right about this
> (assuming the tree remains shallow even with inserts consistently at
> one edge of the tree.)
>
> Chris
>
> --
> Chris Anderson
> http://jchris.mfdz.com
>

HTH,
Paul Davis

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Chris Anderson <jc...@apache.org>.
On Thu, Feb 26, 2009 at 3:31 PM, Paul Davis <pa...@gmail.com> wrote:
>
> Not sure I follow what you mean by balance issues. Seeing as it's a
> btree it can't be unbalanced, because if it were it wouldn't be a
> btree. But then our definitions of balance might be different. :D

When a B-tree is consistently inserted into, with a new highest (or
lowest) value, you can end up with a deeper tree than when it is
randomly inserted into. Imagine the inner-nodes all accumulating on
the right side of the tree, so in the worst case you can end up with
one lonely leaf node to the left of the root, and essentially a linked
list of inner-nodes on the right side (such that cost would approach
O(N) for inserts and lookups).
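The degenerate case described above is easy to see with a naive (non-rebalancing) binary search tree. A minimal Python sketch for illustration only; note that a real B-tree like CouchDB's rebalances on insert and never degenerates this way, which is exactly the point made elsewhere in this thread:

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    # Naive binary search tree insert: no rebalancing at all.
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def depth(root):
    return 0 if root is None else 1 + max(depth(root.left), depth(root.right))

n = 500
seq_root = None
for k in range(n):                    # always-ascending keys
    seq_root = insert(seq_root, k)

rnd_root = None
for k in random.sample(range(n), n):  # same keys, random order
    rnd_root = insert(rnd_root, k)

print(depth(seq_root))  # 500: every node hangs off the right, a linked list
print(depth(rnd_root))  # much shallower, typically around 20
```

With ascending keys the naive tree's depth equals the number of inserts, so lookups and inserts approach O(N); random order keeps it near O(log N).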

My benchmarks show that inserts do not slow down as you would expect
them to in this worst case, so Couch is probably doing something
non-naive to rebalance the tree even when inserts are always at one
extreme.

>
> The only performance difference between reading random and the
> sequential document ids I can think of would favor the sequential side
> when you're reading documents in roughly the same order as their ids
> so we get FS cache hits.

From a practical standpoint, you're probably right about this
(assuming the tree remains shallow even with inserts consistently at
one edge of the tree.)

Chris

-- 
Chris Anderson
http://jchris.mfdz.com

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Paul Davis <pa...@gmail.com>.
On Thu, Feb 26, 2009 at 5:58 PM, Barry Wark <ba...@gmail.com> wrote:
> On Thu, Feb 26, 2009 at 11:18 AM, Damien Katz <da...@apache.org> wrote:
>>
>> On Feb 26, 2009, at 1:55 PM, Jan Lehnardt wrote:
>>
>>>
>>> On 26 Feb 2009, at 19:49, Barry Wark wrote:
>>>>>
>>>>> or ascending...
>>>>
>>>> As an aside, why is it that sequential document ids would produce a
>>>> significant performance boost? I suspect the answer is something
>>>> rather fundamental to CouchDB's design, and I'd like to try to grok
>>>> it.
>>>
>>> b-trees inner-nodes can get cached better if inserts basically always
>>> use the same path.
>>>
>>
>> What he said. It's pretty standard btree stuff, most, if not all the major
>> rdbms have similar issues with primary keys.
>>
>> Also, the ids don't need to be sequential (1,2,3,4...), just ordered
>> (1,5,19,22...). And they don't need to sort higher or lower than all the
>> other ids, so long as they are clustered together. Each btree node
>> that has to be loaded that isn't in cache is expensive. The more the keys
>> have to be inserted into random places in the btree, the worse the caching
>> behavior. Right now, with the crypto random UUIDs we generate, it's
>> basically the worst case scenario for doc inserts.
>
>
> Ah, of course. Thanks everyone!
>
> *thinking back to my CS undergrad days* (I don't come across this
> stuff much in my work these days...)
> Inserting with sequential IDs would lead to btree balance issues,
> though, wouldn't it? In that sense the crypto random UUIDs would seem
> to be the ideal for read performance in the future as opposed to
> insert performance. Am I on the right track?
>
> Thanks for the info,
> Barry
>>
>> -Damien
>>
>

Not sure I follow what you mean by balance issues. Seeing as it's a
btree it can't be unbalanced, because if it were it wouldn't be a
btree. But then our definitions of balance might be different. :D

The only performance difference between reading random and the
sequential document ids I can think of would favor the sequential side
when you're reading documents in roughly the same order as their ids
so we get FS cache hits.

HTH,
Paul Davis

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Barry Wark <ba...@gmail.com>.
On Thu, Feb 26, 2009 at 11:18 AM, Damien Katz <da...@apache.org> wrote:
>
> On Feb 26, 2009, at 1:55 PM, Jan Lehnardt wrote:
>
>>
>> On 26 Feb 2009, at 19:49, Barry Wark wrote:
>>>>
>>>> or ascending...
>>>
>>> As an aside, why is it that sequential document ids would produce a
>>> significant performance boost? I suspect the answer is something
>>> rather fundamental to CouchDB's design, and I'd like to try to grok
>>> it.
>>
>> b-trees inner-nodes can get cached better if inserts basically always
>> use the same path.
>>
>
> What he said. It's pretty standard btree stuff, most, if not all the major
> rdbms have similar issues with primary keys.
>
> Also, the ids don't need to be sequential (1,2,3,4...), just ordered
> (1,5,19,22...). And they don't need to sort higher or lower than all the
> other ids, so long as they are clustered together. Each btree node
> that has to be loaded that isn't in cache is expensive. The more the keys
> have to be inserted into random places in the btree, the worse the caching
> behavior. Right now, with the crypto random UUIDs we generate, it's
> basically the worst case scenario for doc inserts.


Ah, of course. Thanks everyone!

*thinking back to my CS undergrad days* (I don't come across this
stuff much in my work these days...)
Inserting with sequential IDs would lead to btree balance issues,
though, wouldn't it? In that sense the crypto random UUIDs would seem
to be the ideal for read performance in the future as opposed to
insert performance. Am I on the right track?

Thanks for the info,
Barry
>
> -Damien
>

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Barry Wark <ba...@gmail.com>.
On Fri, Feb 27, 2009 at 8:23 AM, Paul Davis <pa...@gmail.com> wrote:
> On Fri, Feb 27, 2009 at 9:00 AM, Damien Katz <da...@apache.org> wrote:
>> The doc ids can be generated client side, so clients can use any scheme you
>> want. For how the server generates UUIDs, there has been some talk about
>> doing some work there, but I don't think anyone has started it.
>>
>> -Damien
>>
>>
>> On Feb 27, 2009, at 7:44 AM, Wout Mertens wrote:
>>
>>> On Feb 26, 2009, at 8:18 PM, Damien Katz wrote:
>>>>
>>>> Also, the ids don't need to be sequential (1,2,3,4...), just ordered
>>>> (1,5,19,22...). And they don't need to sort higher or lower than all the
>>>> other ids, so long as they are clustered together. Each btree node
>>>> that has to be loaded that isn't in cache is expensive. The more the keys
>>>> have to be inserted into random places in the btree, the worse the caching
>>>> behavior. Right now, with the crypto random UUIDs we generate, it's
>>>> basically the worst case scenario for doc inserts.
>>>
>>> So why not decrease security by a tunable factor and add a random number
>>> to the previous UUID instead?
>>>
>>> And while we're on the subject, I've wished for automatic doc UUIDs that
>>> nonetheless have a fixed prefix. More human readable plus it doubles as a
>>> type field... Any chance of that or is that one of the beginner mistakes?
>>>
>>> Wout.
>>
>>
>
> There was talk of switching the internal uuid generation to a version
> 1 style instead of the version 4 (random) style. My suggestion was to
> make the _uuids endpoint accept a v=[1|4] url parameter to choose with
> a config file default but people seemed to not like that idea.
>
> HTH,
> Paul Davis
>

FWIW (though the opinions of a newbie should be taken with a big brick
of salt), I don't mind that the auto-generated UUIDs are
"conservative" in their crypto randomness. As a client, I can use
whatever scheme I want so it makes sense to go with the most
conservative default.

And thanks for the very informative answers about btrees everyone. I
learned a lot.

Cheers,
Barry

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Jan Lehnardt <ja...@apache.org>.
On 2 Mar 2009, at 19:29, Damien Katz wrote:

>
> On Mar 2, 2009, at 1:22 PM, Jan Lehnardt wrote:
>
>>
>> On 2 Mar 2009, at 19:17, Brian Candler wrote:
>>
>>> On Fri, Feb 27, 2009 at 10:11:12AM -0800, Chris Anderson wrote:
>>>> I wonder if there's a way to get roughly ascending uuids (v1-ish)
>>>> without the security issues that come from having the originating  
>>>> node
>>>> permanently inscribed on each document...
>>>
>>> If doing bulk inserts: the client can ask for a uuid, take the top  
>>> N bits,
>>> and append its own M-bit sequence number to the end?
>>
>> it can also get document-count UUIDs, sort them, and bulk insert docs
>> in that order. :)
>
> That won't help because the insertion places in the btree are still
> evenly distributed, and besides CouchDB does that step itself.
>
> A better way would be to grab a UUID and increment it by 1 for the  
> id of each new document you are saving. You still don't get great  
> caching behavior as you are still inserting into a random place in  
> the btree, but it's much better than N random inserts.

right, I only read the first half of your earlier remark: "And they  
don't need to sort higher or lower than all the other ids, so long as  
they are clustered together. "

nevermind.

Cheers
Jan
--



Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Damien Katz <da...@apache.org>.
On Mar 2, 2009, at 1:22 PM, Jan Lehnardt wrote:

>
> On 2 Mar 2009, at 19:17, Brian Candler wrote:
>
>> On Fri, Feb 27, 2009 at 10:11:12AM -0800, Chris Anderson wrote:
>>> I wonder if there's a way to get roughly ascending uuids (v1-ish)
>>> without the security issues that come from having the originating  
>>> node
>>> permanently inscribed on each document...
>>
>> If doing bulk inserts: the client can ask for a uuid, take the top  
>> N bits,
>> and append its own M-bit sequence number to the end?
>
> it can also get document-count UUIDs, sort them, and bulk insert docs
> in that order. :)

That won't help because the insertion places in the btree are still
evenly distributed, and besides CouchDB does that step itself.

A better way would be to grab a UUID and increment it by 1 for the id  
of each new document you are saving. You still don't get great caching  
behavior as you are still inserting into a random place in the btree,  
but it's much better than N random inserts.

-Damien
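Damien's increment-by-one suggestion can be sketched in a few lines of Python. The function name is an illustration, not a CouchDB API; the point is that every id in the batch lands in one spot in the btree:

```python
import uuid

def clustered_ids(count):
    # Grab one random 128-bit UUID as a base, then increment it by 1
    # for each document, so the whole batch sorts consecutively.
    base = uuid.uuid4().int
    base = min(base, (1 << 128) - count)  # avoid wraparound at the top
    return [format(base + i, "032x") for i in range(count)]

ids = clustered_ids(1000)
assert ids == sorted(ids)  # adjacent ids: one cache-friendly insert region
```

The ids are still random relative to everything else in the database, which is why Damien notes the caching behavior is only "much better than N random inserts", not ideal.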

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Jan Lehnardt <ja...@apache.org>.
On 2 Mar 2009, at 19:17, Brian Candler wrote:

> On Fri, Feb 27, 2009 at 10:11:12AM -0800, Chris Anderson wrote:
>> I wonder if there's a way to get roughly ascending uuids (v1-ish)
>> without the security issues that come from having the originating  
>> node
>> permanently inscribed on each document...
>
> If doing bulk inserts: the client can ask for a uuid, take the top N  
> bits,
> and append its own M-bit sequence number to the end?

it can also get document-count UUIDs, sort them, and bulk insert docs
in that order. :)

Cheers
Jan
--


Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Brian Candler <B....@pobox.com>.
On Fri, Feb 27, 2009 at 10:11:12AM -0800, Chris Anderson wrote:
> I wonder if there's a way to get roughly ascending uuids (v1-ish)
> without the security issues that come from having the originating node
> permanently inscribed on each document...

If doing bulk inserts: the client can ask for a uuid, take the top N bits,
and append its own M-bit sequence number to the end?

(The docs say that clients have to be able to recover from collisions
anyway, so no harm making them slightly more likely)
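Brian's top-N-bits scheme, sketched in Python. The 20-bit sequence width is an arbitrary choice for illustration, and the function is hypothetical client-side code, not part of CouchDB:

```python
import uuid

SEQ_BITS = 20  # the "M-bit sequence number"; 20 bits chosen arbitrarily here

def batch_ids(count):
    # Keep the top (128 - SEQ_BITS) bits of one random UUID as a shared
    # prefix, then put a local sequence number in the low bits.
    assert count <= 1 << SEQ_BITS
    prefix = (uuid.uuid4().int >> SEQ_BITS) << SEQ_BITS
    return [format(prefix | i, "032x") for i in range(count)]

ids = batch_ids(1000)
assert ids == sorted(ids)  # shared prefix + ascending suffix
```

Collisions between two clients become slightly more likely because only 128 - SEQ_BITS bits are random, which is the trade-off the parenthetical above acknowledges.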

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Antony Blakey <an...@gmail.com>.
On 03/03/2009, at 9:49 AM, Chris Anderson wrote:

> It's an information leak if you can tell the originating node because
> of the uuid. If we want an option to flag docs with information about
> the originating node, we should add that. Also, the doc.id isn't the
> right place for this, as what you want to know is the origination of
> updates, not documents.

Yes, sorry, I was confused - I thought we were talking about  
update_seqs.

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

One should respect public opinion insofar as is necessary to avoid  
starvation and keep out of prison, but anything that goes beyond this  
is voluntary submission to an unnecessary tyranny.
   -- Bertrand Russell



Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Chris Anderson <jc...@apache.org>.
On Mon, Mar 2, 2009 at 3:13 PM, Antony Blakey <an...@gmail.com> wrote:
> I guess it's a tradeoff between wanting a distributed overview and wanting
> global anonymity. Maybe that's a valid configuration option, rather than
> being policy.

It's an information leak if you can tell the originating node because
of the uuid. If we want an option to flag docs with information about
the originating node, we should add that. Also, the doc.id isn't the
right place for this, as what you want to know is the origination of
updates, not documents.

UUIDs that are consistent per-node are bad not just because you could
trace the node. There is even information leakage if you can tell that
a set of documents originated on the same node.

"Oh, the person who's participating in opposition-party discussions is
the person who filed these taxes."

-- 
Chris Anderson
http://jchris.mfdz.com

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Antony Blakey <an...@gmail.com>.
On 03/03/2009, at 9:26 AM, Dean Landolt wrote:

> I can see the point of replication source, but document origination  
> source
> is different than replication source, isn't it?

You are right. However the document origination source is required for  
a useful version vector, because you can then track which peer has how  
much of a source's writes. Which is useful if the source disappears.  
You can then build a distributed picture of the eventual consistency  
status.

> Yes.

I guess it's a tradeoff between wanting a distributed overview and  
wanting global anonymity. Maybe that's a valid configuration option,  
rather than being policy.

Unless a node makes its identity available in some way other than  
replication, the source is still anonymous because there's no mapping  
of node id to source URL (which might change in any case). To identify  
the node you'd have to find the node, try to replicate from it, and  
get lucky that the id it provides is the one you're looking for.

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Success is not the key to happiness. Happiness is the key to success.
  -- Albert Schweitzer


Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Dean Landolt <de...@deanlandolt.com>.
On Mon, Mar 2, 2009 at 4:41 PM, Antony Blakey <an...@gmail.com>wrote:

>
> On 03/03/2009, at 5:20 AM, Chris Anderson wrote:
>
>  On Sat, Feb 28, 2009 at 5:42 AM, Antony Blakey <an...@gmail.com>
>> wrote:
>>
>>>
>>> What security issues are you thinking of?
>>>
>>
>> The one where you can tell by looking at a doc, which node it was
>> inserted on. I suppose that's not strictly security - more like
>> anonymity.
>>
>
> If you want a feature such as replication tracking, i.e. who have I
> replicated from and how up-to-date am I, the ability to track the
> replication source would be a good thing. It would allow a computed
> version-vector form of staleness-tracking.


I can see the point of replication source, but document origination source
is different than replication source, isn't it?

>
>
> More would be required than just having a source identifier, because you
> would need to map that identifier to a domain useful to the user, but it
> would be a start.
>
> I wonder if anonymity, even transitive anonymity, is that useful?


Yes. And not just for copyright infringement apps.

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Antony Blakey <an...@gmail.com>.
On 03/03/2009, at 5:20 AM, Chris Anderson wrote:

> On Sat, Feb 28, 2009 at 5:42 AM, Antony Blakey <antony.blakey@gmail.com 
> > wrote:
>>
>> What security issues are you thinking of?
>
> The one where you can tell by looking at a doc, which node it was
> inserted on. I suppose that's not strictly security - more like
> anonymity.

If you want a feature such as replication tracking, i.e. who have I  
replicated from and how up-to-date am I, the ability to track the  
replication source would be a good thing. It would allow a  
computed version-vector form of staleness-tracking.

More would be required than just having a source identifier, because  
you would need to map that identifier to a domain useful to the user,  
but it would be a start.

I wonder if anonymity, even transitive anonymity, is that useful?

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

It is no measure of health to be well adjusted to a profoundly sick  
society.
   -- Jiddu Krishnamurti



Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Chris Anderson <jc...@apache.org>.
On Sat, Feb 28, 2009 at 5:42 AM, Antony Blakey <an...@gmail.com> wrote:
>
> What security issues are you thinking of?

The one where you can tell by looking at a doc, which node it was
inserted on. I suppose that's not strictly security - more like
anonymity.


-- 
Chris Anderson
http://jchris.mfdz.com

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Antony Blakey <an...@gmail.com>.
On 28/02/2009, at 4:41 AM, Chris Anderson wrote:

> I wonder if there's a way to get roughly ascending uuids (v1-ish)
> without the security issues that come from having the originating node
> permanently inscribed on each document...

What security issues are you thinking of?

I ask because (NodeId, Sequence#) gives a form of global partial  
ordering that can be useful, within the constraints of no global time.

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Some defeats are instalments to victory.
   -- Jacob Riis



Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Wout Mertens <wm...@cisco.com>.
On Feb 27, 2009, at 7:11 PM, Chris Anderson wrote:

> I wonder if there's a way to get roughly ascending uuids (v1-ish)
> without the security issues that come from having the originating node
> permanently inscribed on each document...

1. start at a UUID that would normally be generated (256 bit number in  
hex)
2. add a random 128bit number (or so) each time you need a new UUID

Or is the problem that you would know the node from the UUID range?

Wout.

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Robert Newson <ro...@gmail.com>.
Wouldn't clock correction/manual setting potentially render a
YYMMDDHHMMSS approach collision-prone?

It's an issue I recently faced (though in Java) and the details are
interesting.

On Linux, at least, the kernel is able to access a few different
notions of time, the most typical one is the wall clock, and it's
subject to runtime manipulation. This means, for my case and perhaps
this case, that you can't rely on it to monotonically increase.
Fortunately, there is a way to request access to a monotonically
increasing clock via Java's System.nanoTime(). The number is only
useful locally, which is fine with a v1-style UUID since it encodes
the MAC.

B.
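For reference, Python exposes the same monotonic clock guarantee that Java's System.nanoTime() does; a minimal sketch:

```python
import time

# Wall-clock time (time.time) can jump backwards when NTP or an admin
# corrects the clock. The monotonic clock cannot, which is what you
# want when generating time-ordered ids. Like nanoTime(), the value is
# only meaningful locally, within one process/boot.
a = time.monotonic_ns()
b = time.monotonic_ns()
assert b >= a  # monotonic: guaranteed never to decrease
```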

On Fri, Feb 27, 2009 at 6:24 PM, Dean Landolt <de...@deanlandolt.com> wrote:
> On Fri, Feb 27, 2009 at 1:11 PM, Chris Anderson <jc...@apache.org> wrote:
>
>> On Fri, Feb 27, 2009 at 8:23 AM, Paul Davis <pa...@gmail.com>
>> wrote:
>> >
>> > There was talk of switching the internal uuid generation to a version
>> > 1 style instead of the version 4 (random) style. My suggestion was to
>> > make the _uuids endpoint accept a v=[1|4] url parameter to choose with
>> > a config file default but people seemed to not like that idea.
>> >
>>
>> Now that I've got a better grasp on the why's of each option, I think
>> I'm liking your idea just fine. I think the real resistance was to
>> making an ini setting for it.
>>
>> I wonder if there's a way to get roughly ascending uuids (v1-ish)
>> without the security issues that come from having the originating node
>> permanently inscribed on each document...
>
>
> UUID v1 only _very_ roughly ascends -- it recycles every minute or so. This
> probably wouldn't matter too much though...
>
> Even still, an algo to give you ever-increasing uuids is easy -- just go
> YYMMDDHHMMSS etc. Then you can append something like a UUID v4, or something
> shorter that can still strongly guarantee you uniqueness, and you've got the
> best of both worlds. I think...
>

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Dean Landolt <de...@deanlandolt.com>.
On Fri, Feb 27, 2009 at 1:11 PM, Chris Anderson <jc...@apache.org> wrote:

> On Fri, Feb 27, 2009 at 8:23 AM, Paul Davis <pa...@gmail.com>
> wrote:
> >
> > There was talk of switching the internal uuid generation to a version
> > 1 style instead of the version 4 (random) style. My suggestion was to
> > make the _uuids endpoint accept a v=[1|4] url parameter to choose with
> > a config file default but people seemed to not like that idea.
> >
>
> Now that I've got a better grasp on the why's of each option, I think
> I'm liking your idea just fine. I think the real resistance was to
> making an ini setting for it.
>
> I wonder if there's a way to get roughly ascending uuids (v1-ish)
> without the security issues that come from having the originating node
> permanently inscribed on each document...


UUID v1 only _very_ roughly ascends -- it recycles every minute or so. This
probably wouldn't matter too much though...

Even still, an algo to give you ever-increasing uuids is easy -- just go
YYMMDDHHMMSS etc. Then you can append something like a UUID v4, or something
shorter that can still strongly guarantee you uniqueness, and you've got the
best of both worlds. I think...
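Dean's timestamp-prefix idea sketched in Python. The helper name and the 128-bit suffix width are illustrative choices, not a CouchDB feature:

```python
import secrets
import time

def time_prefixed_id():
    # A YYMMDDHHMMSS prefix keeps ids roughly ascending over time,
    # while the random 128-bit suffix makes collisions within the
    # same second vanishingly unlikely.
    prefix = time.strftime("%y%m%d%H%M%S", time.gmtime())
    return prefix + secrets.token_hex(16)

doc_id = time_prefixed_id()
assert len(doc_id) == 12 + 32 and doc_id[:12].isdigit()
```

As the reply above points out, this only stays monotonic if the wall clock never moves backwards; clock corrections can reorder the prefixes.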

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Chris Anderson <jc...@apache.org>.
On Fri, Feb 27, 2009 at 8:23 AM, Paul Davis <pa...@gmail.com> wrote:
>
> There was talk of switching the internal uuid generation to a version
> 1 style instead of the version 4 (random) style. My suggestion was to
> make the _uuids endpoint accept a v=[1|4] url parameter to choose with
> a config file default but people seemed to not like that idea.
>

Now that I've got a better grasp on the why's of each option, I think
I'm liking your idea just fine. I think the real resistance was to
making an ini setting for it.

I wonder if there's a way to get roughly ascending uuids (v1-ish)
without the security issues that come from having the originating node
permanently inscribed on each document...

-- 
Chris Anderson
http://jchris.mfdz.com

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Paul Davis <pa...@gmail.com>.
On Fri, Feb 27, 2009 at 9:00 AM, Damien Katz <da...@apache.org> wrote:
> The doc ids can be generated client side, so clients can use any scheme you
> want. For how the server generates UUIDs, there has been some talk about
> doing some work there, but I don't think anyone has started it.
>
> -Damien
>
>
> On Feb 27, 2009, at 7:44 AM, Wout Mertens wrote:
>
>> On Feb 26, 2009, at 8:18 PM, Damien Katz wrote:
>>>
>>> Also, the ids don't need to be sequential (1,2,3,4...), just ordered
>>> (1,5,19,22...). And they don't need to sort higher or lower than all the
>>> other ids, so long as they are clustered together. Each btree node
>>> that has to be loaded that isn't in cache is expensive. The more the keys
>>> have to be inserted into random places in the btree, the worse the caching
>>> behavior. Right now, with the crypto random UUIDs we generate, it's
>>> basically the worst case scenario for doc inserts.
>>
>> So why not decrease security by a tunable factor and add a random number
>> to the previous UUID instead?
>>
>> And while we're on the subject, I've wished for automatic doc UUIDs that
>> nonetheless have a fixed prefix. More human readable plus it doubles as a
>> type field... Any chance of that or is that one of the beginner mistakes?
>>
>> Wout.
>
>

There was talk of switching the internal uuid generation to a version
1 style instead of the version 4 (random) style. My suggestion was to
make the _uuids endpoint accept a v=[1|4] url parameter to choose with
a config file default but people seemed to not like that idea.

HTH,
Paul Davis

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Damien Katz <da...@apache.org>.
The doc ids can be generated client side, so clients can use any  
scheme you want. For how the server generates UUIDs, there has been  
some talk about doing some work there, but I don't think anyone has  
started it.
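To illustrate, a client-side scheme might look like this sketch; the zero-padded counter, host, and database name are made up for the example, but the PUT-with-your-own-id call is standard CouchDB:

```python
import itertools
import json
import urllib.request

_counter = itertools.count()

def next_doc_id(prefix="doc"):
    # Zero-padding makes the ids sort correctly as strings.
    return "%s-%010d" % (prefix, next(_counter))

def put_doc(doc, host="http://127.0.0.1:5984", db="mydb"):
    # PUT /db/docid creates the document under the client-chosen id.
    req = urllib.request.Request(
        "%s/%s/%s" % (host, db, next_doc_id()),
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urllib.request.urlopen(req)  # needs a running CouchDB
```

A plain counter like this only works from a single writer, of course; with multiple clients you'd want something like a timestamp prefix instead.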

-Damien


On Feb 27, 2009, at 7:44 AM, Wout Mertens wrote:

> On Feb 26, 2009, at 8:18 PM, Damien Katz wrote:
>> Also, the ids don't need to be sequential (1,2,3,4...), just ordered
>> (1,5,19,22...). And they don't need to sort higher or lower than
>> all the other ids, so long as they are clustered together. Each
>> btree node that has to be loaded that isn't in cache is expensive.
>> The more the keys have to be inserted into random places in the
>> btree, the worse the caching behavior. Right now, with the crypto
>> random UUIDs we generate, it's basically the worst case scenario
>> for doc inserts.
>
> So why not decrease security by a tunable factor and add a random  
> number to the previous UUID instead?
>
> And while we're on the subject, I've wished for automatic doc UUIDs  
> that nonetheless have a fixed prefix. More human readable plus it  
> doubles as a type field... Any chance of that or is that one of the  
> beginner mistakes?
>
> Wout.


Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Wout Mertens <wm...@cisco.com>.
On Feb 26, 2009, at 8:18 PM, Damien Katz wrote:
> Also, the ids don't need to be sequential (1,2,3,4...), just ordered
> (1,5,19,22...). And they don't need to sort higher or lower than all
> the other ids, so long as they are clustered together. Each btree
> node that has to be loaded that isn't in cache is expensive. The
> more the keys have to be inserted into random places in the btree,
> the worse the caching behavior. Right now, with the crypto random
> UUIDs we generate, it's basically the worst case scenario for doc
> inserts.

So why not decrease security by a tunable factor and add a random  
number to the previous UUID instead?
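A minimal sketch of that idea (the names, bit widths, and lock are illustrative choices, not anything CouchDB does):

```python
import random
import threading

_lock = threading.Lock()
_last = random.getrandbits(120)  # headroom below 2**128 avoids overflow

def next_ordered_uuid(jitter_bits=32):
    """Advance the previous id by a random amount rather than drawing
    a fresh random number: ids stay strictly increasing (good for the
    btree), while guessing the next id still takes up to 2**jitter_bits
    tries per step."""
    global _last
    with _lock:
        _last += random.getrandbits(jitter_bits) + 1
        return "%032x" % _last
```

The jitter width is the tunable factor: more bits means harder to guess, fewer bits means tighter clustering in the btree.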

And while we're on the subject, I've wished for automatic doc UUIDs  
that nonetheless have a fixed prefix. More human readable plus it  
doubles as a type field... Any chance of that or is that one of the  
beginner mistakes?

Wout.

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Damien Katz <da...@apache.org>.
On Feb 26, 2009, at 1:55 PM, Jan Lehnardt wrote:

>
> On 26 Feb 2009, at 19:49, Barry Wark wrote:
>>>
>>> or ascending...
>>
>> As an aside, why is it that sequential document ids would produce a
>> significant performance boost? I suspect the answer is something
>> rather fundamental to CouchDB's design, and I'd like to try to grok
>> it.
>
> b-trees inner-nodes can get cached better if inserts basically always
> use the same path.
>

What he said. It's pretty standard btree stuff; most, if not all, of
the major RDBMSs have similar issues with primary keys.

Also, the ids don't need to be sequential (1,2,3,4...), just ordered
(1,5,19,22...). And they don't need to sort higher or lower than all
the other ids, so long as they are clustered together. Each btree node
that has to be loaded that isn't in cache is expensive. The more the
keys have to be inserted into random places in the btree, the worse
the caching behavior. Right now, with the crypto random UUIDs we
generate, it's basically the worst case scenario for doc inserts.

-Damien

Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Jan Lehnardt <ja...@apache.org>.
On 26 Feb 2009, at 19:49, Barry Wark wrote:
>>
>> or ascending...
>
> As an aside, why is it that sequential document ids would produce a
> significant performance boost? I suspect the answer is something
> rather fundamental to CouchDB's design, and I'd like to try to grok
> it.

b-trees inner-nodes can get cached better if inserts basically always
use the same path.

Cheers
Jan
--



Re: Why sequential document ids? [was: Re: What's the speed(performance) of couchdb?]

Posted by Paul Davis <pa...@gmail.com>.
On Thu, Feb 26, 2009 at 1:49 PM, Barry Wark <ba...@gmail.com> wrote:
> On Thu, Feb 26, 2009 at 8:30 AM, Chris Anderson <jc...@apache.org> wrote:
>> On Thu, Feb 26, 2009 at 2:04 AM, Jan Lehnardt <ja...@apache.org> wrote:
>>> Hi Scott,
>>>
>>> thanks for your feedback. As a general note, you can't expect any magic
>>> from CouchDB. It is bound by the same constraints all other programs
>>> are. To get the most out of CouchDB or SqlServer or MySQL, you need
>>> to understand how it works.
>>>
>>>
>>> On 26 Feb 2009, at 05:30, Scott Zhang wrote:
>>>
>>>> Hi. Thanks for replying.
>>>> But what is a database for if it is slow? Every database has the
>>>> ability to cluster to improve speed and capacity (don't mention
>>>> "access" things).
>>>
>>> The point of CouchDB is allowing high numbers of concurrent requests. This
>>> gives you more throughput for a single machine but not necessarily faster
>>> single query execution speed.
>>>
>>>
>>>> I was expecting CouchDB to be as fast as SqlServer or MySQL. At least
>>>> I know mnesia is much faster than SqlServer, though mnesia always
>>>> throws harmless "overload" messages.
>>>
>>> CouchDB is not nearly as old as either of them. Did you really expect
>>> software in its alpha stage to be faster than fine-tuned systems that
>>> have been used in production for a decade or longer?
>>>
>>>
>>>> I will try bulk insert now. But to be fair, I was inserting into
>>>> SqlServer one record at a time.
>>>
>>> Insert speed can be sped up in numerous ways:
>>>
>>>  - Use sequential descending document ids on insert.
>>
>> or ascending...
>
> As an aside, why is it that sequential document ids would produce a
> significant performance boost? I suspect the answer is something
> rather fundamental to CouchDB's design, and I'd like to try to grok
> it.
>
> Thanks,
> Barry
>

It has to do with how the btree is updated. Basically, when you write
to a btree, any leaf node that changes, plus the entire path to the
root node, must be rewritten. If you use sequential ids, you minimize
the number of nodes that must be rewritten. Also, thinking idly about
it, there might be gains in disk seek times because you're only
traveling in one direction with each append. And there's what Jan said
about FS caching.
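The caching effect can be sketched with a toy model: a sorted run chopped into fixed-size pages stands in for the btree's leaves, with a small LRU as the cache. The fanout and cache size here are arbitrary:

```python
import bisect
import random
from collections import OrderedDict

def leaf_cache_misses(keys, fanout=100, cache_size=8):
    """Toy model of btree leaf caching: keys live in one sorted run
    chopped into pages of `fanout` entries, and each insert touches
    the page its key lands in, tracked by a small LRU cache.  Splits
    and inner nodes are ignored, but the trend matches a real btree."""
    sorted_keys = []
    cache = OrderedDict()
    misses = 0
    for k in keys:
        bisect.insort(sorted_keys, k)
        page = bisect.bisect_left(sorted_keys, k) // fanout
        if page in cache:
            cache.move_to_end(page)
        else:
            misses += 1
            cache[page] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)
    return misses

n = 10_000
ascending = list(range(n))
scattered = random.sample(range(10 * n), n)
print(leaf_cache_misses(ascending), leaf_cache_misses(scattered))
```

Ascending keys keep landing in the rightmost page, so misses stay near n/fanout; randomly scattered keys touch pages all over the run and miss far more often.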

HTH,
Paul Davis

>>
>>>  - Use bulk insert.
>>
>> with ascending keys and bulk insert of 1000 docs at a time I was able
>> to write 3k docs per second. here is the benchmark script:
>> http://friendpaste.com/5g0kOEPonxdXMKibNRzetJ
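(For anyone who doesn't want to read the script: a bulk insert is just one POST to _bulk_docs per batch. A rough sketch, with the host and database name assumed:)

```python
import json
import urllib.request

def batches(docs, size=1000):
    """Split the doc list into _bulk_docs-sized chunks."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]

def bulk_insert(docs, host="http://127.0.0.1:5984", db="mydb"):
    """POST each batch to CouchDB's _bulk_docs endpoint: one HTTP
    round trip per 1000 docs instead of one per doc."""
    for chunk in batches(docs):
        req = urllib.request.Request(
            "%s/%s/_bulk_docs" % (host, db),
            data=json.dumps({"docs": chunk}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # needs a running CouchDB
```

One request per batch is where most of the speedup comes from: the per-request HTTP and fsync overhead gets amortized over the whole batch.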
>>
>>
>>>  - Bypass the HTTP API and insert native Erlang terms and skip JSON
>>> conversion.
>>
>> doing this I was able to get 6k docs / sec
>>
>> In a separate test using attachments of 250k and an Erlang API (no
>> HTTP) I was able to write to my disk at 80% of the speed it can accept
>> when streaming raw bytes to disk. (Roughly 20 MB/sec)
>>
>>>
>>> The question is what you need your system to look like eventually. If this is
>>> an initial data-import and after that you get mostly read requests, the
>>> longer
>>> insertion time will amortize over time.
>>>
>>> What version is the Windows binary you are using? If it is still 0.8, you
>>> should
>>> try trunk (which most likely means switching to some UNIXy system).
>>>
>>> Cheers
>>> Jan
>>> --
>>>
>>>
>>>
>>>
>>>
>>>>
>>>> Regards.
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Feb 26, 2009 at 12:18 PM, Jens Alfke <je...@mooseyard.com> wrote:
>>>>
>>>>>
>>>>> On Feb 25, 2009, at 8:02 PM, Scott Zhang wrote:
>>>>>
>>>>>> But the performance is as bad as I can imagine. After several
>>>>>> minutes of running, I had only inserted 120K records. I saw the
>>>>>> speed is ~20 records per second.
>>>>>>
>>>>>
>>>>> Use the bulk-insert API to improve speed. The way you're doing it, every
>>>>> record being added is a separate transaction, which requires a separate
>>>>> HTTP
>>>>> request and flushing the file.
>>>>>
>>>>> (I'm a CouchDB newbie, but I don't think the point of CouchDB is speed.
>>>>> What's exciting about it is the flexibility and the ability to build
>>>>> distributed systems. If you're looking for a traditional database with
>>>>> speed, have you tried MySQL?)
>>>>>
>>>>> —Jens
>>>
>>>
>>
>>
>>
>> --
>> Chris Anderson
>> http://jchris.mfdz.com
>>
>