You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@jackrabbit.apache.org by Thomas Mueller <mu...@adobe.com> on 2012/03/06 10:34:20 UTC

[jr3] Index on randomly distributed data

Hi,

In Jackrabbit 2, we currently use a randomly generated UUID as the node
id. For Jackrabbit 3 this is an option question. I was looking for ways to
index randomly distributed data, but so far didn't find a solution. A
Google query for "uuid primary key performance" gave me:

http://stackoverflow.com/questions/2365132/uuid-performance-in-mysql "At
my job, we use UUID as PKs. What I can tell you from experience is DO NOT
USE THEM as PKs ... It's one of those things that when you have less than
1000 records it;s ok, but when you have millions, it's the worst thing you
can do. Why? Because UUID are not sequential..."

http://kccoder.com/mysql/uuid-vs-int-insert-performance/ "it takes 25
hours to insert 15 million records into an empty UUID table"

http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
"For auto_increment key load process took 1 hour 50 minutes ... For UUID
process took over 12 hours and is still going...  So in this little case
we have about 200 times performance difference"

I believe if we rely on an index on randomly distributed data, performance
will degrade (factor 10 or more, depending on the repository size, the
memory, and potentially on the number of changes). For Jackrabbit 2, to
solve this performance problem, we can actually switch to sequential node
ids - see JCR-2857. For Jackrabbit 3, if we use the content hash as the
node id, then it wouldn't be possible to switch (it is not possible to
generate sequential content hashes). With content hashes, one option is to
make sure the index is always in memory. However, I believe we should not
build a system that has such constraints, unless the alternative
(sequential node ids) has problems we can not solve otherwise.

Regards,
Thomas

Re: [jr3] Index on randomly distributed data

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

> I suggest that we define a set of relevant
> performance benchmarks, and  use them for
> evaluating potential alternatives.

Sounds good to me.

This thread isn't about UUIDs, I'm sorry if this was not clear. By the
way, a UUID doesn't have to be randomly distributed. For example Microsoft
introduced sequential UUIDs in SQL Server 2005 -
http://msdn.microsoft.com/en-us/library/ms189786.aspx - Other libraries /
products support this as well. We could also use such sequential UUIDs in
Jackrabbit (2 or 3) for the jcr:uuid property.

Regards,
Thomas

On 3/6/12 5:01 PM, "Jukka Zitting" <ju...@gmail.com> wrote:

>Hi,
>
>This thread is confusing two different things: the UUID accessible as
>the jcr:uuid property of a JCR node, and the internal reference used
>by the underlying storage implementation to locate a specific node
>state. In Jackrabbit 2.x these are the same, but in jr3/oak this is no
>longer the case since the MVCC model implies that the internal storage
>references change at each content change.
>
>IIUC Thomas' original post was about the latter case, i.e. the
>internal storage references, *not* jcr:uuid properties.
>
>The tree interface we've been drafting in the other thread explicitly
>hides the underlying storage identifiers, which should help isolate
>this design decision from higher levels of code.
>
>Rather than discuss this issue in the abstract, I suggest that we
>define a set of relevant performance benchmarks, and  use them for
>evaluating potential alternatives.
>
>BR,
>
>Jukka Zitting

Re: [jr3] Index on randomly distributed data

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

This thread is confusing two different things: the UUID accessible as
the jcr:uuid property of a JCR node, and the internal reference used
by the underlying storage implementation to locate a specific node
state. In Jackrabbit 2.x these are the same, but in jr3/oak this is no
longer the case since the MVCC model implies that the internal storage
references change at each content change.

IIUC Thomas' original post was about the latter case, i.e. the
internal storage references, *not* jcr:uuid properties.

The tree interface we've been drafting in the other thread explicitly
hides the underlying storage identifiers, which should help isolate
this design decision from higher levels of code.

Rather than discuss this issue in the abstract, I suggest that we
define a set of relevant performance benchmarks, and  use them for
evaluating potential alternatives.

BR,

Jukka Zitting

Re: [jr3] Index on randomly distributed data

Posted by Ard Schrijvers <a....@onehippo.com>.

On Tue, Mar 6, 2012 at 4:19 PM, Marcel Reutegger <mr...@adobe.com> wrote:
>> And with moves the id stays the same?
>
> no, the id would change in that case. a move then
> basically is a copy and delete at the previous location.

Ok. Moving an entire tree will then be expensive as all descendants
also change their id's. Also, references to an id will from other
nodes will change. So, although I am in favor of storing path
information as this can make simple path constraint queries really
efficient, I don't see how this could work as id due to the references
of other nodes

Regards Ard

>
> regards
>  marcel

RE: [jr3] Index on randomly distributed data

Posted by Marcel Reutegger <mr...@adobe.com>.

> And with moves the id stays the same?

no, the id would change in that case. a move then
basically is a copy and delete at the previous location.

regards
 marcel

Re: [jr3] Index on randomly distributed data

Posted by Ard Schrijvers <a....@onehippo.com>.

On Tue, Mar 6, 2012 at 3:15 PM, Marcel Reutegger <mr...@adobe.com> wrote:
>> How can we guarantee ids are unique *and* sequential in a
>> cluster without (too much) overhead?
>
> use the path as id instead of generating some arbitrary id?
>
> after all, most methods of the JCR API operate on a path
> and not an id.

And with moves the id stays the same? What happens in this scenario:

/foo/bar/lux

is moved to

/foo/bar/flux

and you add a new node 'lux' at /foo/bar

If the move did not change the id, the addition of 'lux' later on
cannot have the id of its path.

Regards Ard

>
> regards
>  marcel



-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

RE: [jr3] Index on randomly distributed data

Posted by Marcel Reutegger <mr...@adobe.com>.

> How can we guarantee ids are unique *and* sequential in a
> cluster without (too much) overhead?

use the path as id instead of generating some arbitrary id?

after all, most methods of the JCR API operate on a path
and not an id.

regards
 marcel

Re: [jr3] Index on randomly distributed data

Posted by Raffaele Sena <ra...@gmail.com>.

Can you use time based UUID.  They should be guaranteed to be sequential
and unique in the cluster.
On Mar 6, 2012 4:47 AM, "Bart van der Schans" <b....@onehippo.com>
wrote:

> Hi Thomas,
>
> On Tue, Mar 6, 2012 at 10:34 AM, Thomas Mueller <mu...@adobe.com> wrote:
> > Hi,
> >
> > In Jackrabbit 2, we currently use a randomly generated UUID as the node
> > id. For Jackrabbit 3 this is an option question. I was looking for ways
> to
> > index randomly distributed data, but so far didn't find a solution. A
> > Google query for "uuid primary key performance" gave me:
> >
> > http://stackoverflow.com/questions/2365132/uuid-performance-in-mysql "At
> > my job, we use UUID as PKs. What I can tell you from experience is DO NOT
> > USE THEM as PKs ... It's one of those things that when you have less than
> > 1000 records it;s ok, but when you have millions, it's the worst thing
> you
> > can do. Why? Because UUID are not sequential..."
> >
> > http://kccoder.com/mysql/uuid-vs-int-insert-performance/ "it takes 25
> > hours to insert 15 million records into an empty UUID table"
> >
> > http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
> > "For auto_increment key load process took 1 hour 50 minutes ... For UUID
> > process took over 12 hours and is still going...  So in this little case
> > we have about 200 times performance difference"
> >
> > I believe if we rely on an index on randomly distributed data,
> performance
> > will degrade (factor 10 or more, depending on the repository size, the
> > memory, and potentially on the number of changes). For Jackrabbit 2, to
> > solve this performance problem, we can actually switch to sequential node
> > ids - see JCR-2857. For Jackrabbit 3, if we use the content hash as the
> > node id, then it wouldn't be possible to switch (it is not possible to
> > generate sequential content hashes). With content hashes, one option is
> to
> > make sure the index is always in memory. However, I believe we should not
> > build a system that has such constraints, unless the alternative
> > (sequential node ids) has problems we can not solve otherwise.
>
> I think this is an important subject. Databases are generally
> optimized for sequential indexes and will perform better. What I'm
> wondering about is how we would like to handle the id generation (or
> the uuid generation for that matter) in a cluster. Do we want to make
> it the responsibility of the storage layer (aka auto increment,
> sequences, etc) or the responsibility of jackrabbit? How can we
> guarantee ids are unique *and* sequential in a cluster without (too
> much) overhead?
>
> Regards,
> Bart
>

Re: [jr3] Index on randomly distributed data

Posted by Bart van der Schans <b....@onehippo.com>.

Hi Thomas,

On Tue, Mar 6, 2012 at 10:34 AM, Thomas Mueller <mu...@adobe.com> wrote:
> Hi,
>
> In Jackrabbit 2, we currently use a randomly generated UUID as the node
> id. For Jackrabbit 3 this is an option question. I was looking for ways to
> index randomly distributed data, but so far didn't find a solution. A
> Google query for "uuid primary key performance" gave me:
>
> http://stackoverflow.com/questions/2365132/uuid-performance-in-mysql "At
> my job, we use UUID as PKs. What I can tell you from experience is DO NOT
> USE THEM as PKs ... It's one of those things that when you have less than
> 1000 records it;s ok, but when you have millions, it's the worst thing you
> can do. Why? Because UUID are not sequential..."
>
> http://kccoder.com/mysql/uuid-vs-int-insert-performance/ "it takes 25
> hours to insert 15 million records into an empty UUID table"
>
> http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
> "For auto_increment key load process took 1 hour 50 minutes ... For UUID
> process took over 12 hours and is still going...  So in this little case
> we have about 200 times performance difference"
>
> I believe if we rely on an index on randomly distributed data, performance
> will degrade (factor 10 or more, depending on the repository size, the
> memory, and potentially on the number of changes). For Jackrabbit 2, to
> solve this performance problem, we can actually switch to sequential node
> ids - see JCR-2857. For Jackrabbit 3, if we use the content hash as the
> node id, then it wouldn't be possible to switch (it is not possible to
> generate sequential content hashes). With content hashes, one option is to
> make sure the index is always in memory. However, I believe we should not
> build a system that has such constraints, unless the alternative
> (sequential node ids) has problems we can not solve otherwise.

I think this is an important subject. Databases are generally
optimized for sequential indexes and will perform better. What I'm
wondering about is how we would like to handle the id generation (or
the uuid generation for that matter) in a cluster. Do we want to make
it the responsibility of the storage layer (aka auto increment,
sequences, etc) or the responsibility of jackrabbit? How can we
guarantee ids are unique *and* sequential in a cluster without (too
much) overhead?

Regards,
Bart

Re: [jr3] Index on randomly distributed data

Posted by Bart van der Schans <b....@onehippo.com>.

On Tue, Mar 6, 2012 at 6:00 PM, Felix Meschberger <fm...@adobe.com> wrote:
> Hi,
>
> I see and understand your points. But then, when it comes to clustering, generating a strictly monotone sequential ID without collisions without tampering performance is probably a hard problem to solve, right ?

That was exactly my point ;-)

If we have a internal storage id (for example a db PK) and a different
(jcr:) uuid, would those have a 1-to-1 mapping? Are the nodes
retrieved by looking up some (path based) uuid or by the id (PK)? If
the latter is true then you still need an index on the uuids and you
would have the same "performance problem". Or am I mistaken? I do
agree with Jukka that we really need some numbers to discuss this and
make some choices.

Regards,
Bart

Re: [jr3] Index on randomly distributed data

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

As for clustering, there are multiple solutions that don't require
*randomly distributed* node ids:

- if node ids are only used internally within a cluster node, then having
conflicting node ids isn't a problem.

- the path (compressed or uncompressed) could be used as the node id

You could still use UUIDs, but sequential UUIDs, as even those are
guaranteed to be globally unique - see also

http://www.sqlmag.com/article/quering/using-newsequentialid-instead-of-newi
d-

"In the past, many people used NewID() to assign primary keys to columns,
and these columns were unique across servers. When SQL Server 2000 first
came out, I too used NewID() to create primary keys. Alas, I soon
determined (as did many others) that using the NewID() function this way
can cause a variety of performance problems and that NewID() doesn't
provide a unique value that increases in order, as the IDENTITY-based key
does.You get some important performance benefits when a key value is
always higher than the last assigned value; generally these benefits are
related to how data is stored on the pages." (NewID() is the randomly
generated one we currently use in Jackrabbit 2).


Regards,
Thomas


On 3/6/12 6:00 PM, "Felix Meschberger" <fm...@adobe.com> wrote:

>Hi,
>
>I see and understand your points. But then, when it comes to clustering,
>generating a strictly monotone sequential ID without collisions without
>tampering performance is probably a hard problem to solve, right ?
>
>I wonder, who defines the node IDs ? Is it the layer above of below the
>Mikrokernel API ?
>
>If it would be the Mikrokernel impl. itself (thus below the API), why
>would it matter to the upper layers ?
>
>Regards
>Felix
>
>Am 06.03.2012 um 01:34 schrieb Thomas Mueller:
>
>> Hi,
>> 
>> In Jackrabbit 2, we currently use a randomly generated UUID as the node
>> id. For Jackrabbit 3 this is an option question. I was looking for ways
>>to
>> index randomly distributed data, but so far didn't find a solution. A
>> Google query for "uuid primary key performance" gave me:
>> 
>> http://stackoverflow.com/questions/2365132/uuid-performance-in-mysql "At
>> my job, we use UUID as PKs. What I can tell you from experience is DO
>>NOT
>> USE THEM as PKs ... It's one of those things that when you have less
>>than
>> 1000 records it;s ok, but when you have millions, it's the worst thing
>>you
>> can do. Why? Because UUID are not sequential..."
>> 
>> http://kccoder.com/mysql/uuid-vs-int-insert-performance/ "it takes 25
>> hours to insert 15 million records into an empty UUID table"
>> 
>> http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
>> "For auto_increment key load process took 1 hour 50 minutes ... For UUID
>> process took over 12 hours and is still going...  So in this little case
>> we have about 200 times performance difference"
>> 
>> I believe if we rely on an index on randomly distributed data,
>>performance
>> will degrade (factor 10 or more, depending on the repository size, the
>> memory, and potentially on the number of changes). For Jackrabbit 2, to
>> solve this performance problem, we can actually switch to sequential
>>node
>> ids - see JCR-2857. For Jackrabbit 3, if we use the content hash as the
>> node id, then it wouldn't be possible to switch (it is not possible to
>> generate sequential content hashes). With content hashes, one option is
>>to
>> make sure the index is always in memory. However, I believe we should
>>not
>> build a system that has such constraints, unless the alternative
>> (sequential node ids) has problems we can not solve otherwise.
>> 
>> Regards,
>> Thomas
>> 
>

Re: [jr3] Index on randomly distributed data

Posted by Felix Meschberger <fm...@adobe.com>.

Hi,

I see and understand your points. But then, when it comes to clustering, generating a strictly monotone sequential ID without collisions without tampering performance is probably a hard problem to solve, right ?

I wonder, who defines the node IDs ? Is it the layer above of below the Mikrokernel API ?

If it would be the Mikrokernel impl. itself (thus below the API), why would it matter to the upper layers ?

Regards
Felix

Am 06.03.2012 um 01:34 schrieb Thomas Mueller:

> Hi,
> 
> In Jackrabbit 2, we currently use a randomly generated UUID as the node
> id. For Jackrabbit 3 this is an option question. I was looking for ways to
> index randomly distributed data, but so far didn't find a solution. A
> Google query for "uuid primary key performance" gave me:
> 
> http://stackoverflow.com/questions/2365132/uuid-performance-in-mysql "At
> my job, we use UUID as PKs. What I can tell you from experience is DO NOT
> USE THEM as PKs ... It's one of those things that when you have less than
> 1000 records it;s ok, but when you have millions, it's the worst thing you
> can do. Why? Because UUID are not sequential..."
> 
> http://kccoder.com/mysql/uuid-vs-int-insert-performance/ "it takes 25
> hours to insert 15 million records into an empty UUID table"
> 
> http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
> "For auto_increment key load process took 1 hour 50 minutes ... For UUID
> process took over 12 hours and is still going...  So in this little case
> we have about 200 times performance difference"
> 
> I believe if we rely on an index on randomly distributed data, performance
> will degrade (factor 10 or more, depending on the repository size, the
> memory, and potentially on the number of changes). For Jackrabbit 2, to
> solve this performance problem, we can actually switch to sequential node
> ids - see JCR-2857. For Jackrabbit 3, if we use the content hash as the
> node id, then it wouldn't be possible to switch (it is not possible to
> generate sequential content hashes). With content hashes, one option is to
> make sure the index is always in memory. However, I believe we should not
> build a system that has such constraints, unless the alternative
> (sequential node ids) has problems we can not solve otherwise.
> 
> Regards,
> Thomas
>