Posted to users@jackrabbit.apache.org by David Johnson <db...@gmail.com> on 2007/03/01 00:24:14 UTC

Fwd: Query Performance and Optimization

We are exploring using Jackrabbit in a production environment.  I have a
repository that we have created from our content that has > 100K nodes.
Several of our use cases need date range queries and also use 'order
by' frequently.  We have noticed that query times are significantly slower
than we expected.  After warming up the repository (i.e., running the suite
of queries once), as an example:

"select * from Column where jcr:path like 'Gossip/ColumnName/Columns/%' and
status <> 'hidden' order by publishDate desc" takes 500 ms to execute - this
is just the execution time, I am not actually using or accessing the
NodeIterator.

Whereas: "select * from Column where jcr:path like
'Gossip/ColumnName/Columns/%' and status <> 'hidden'" takes only 33 ms to
execute.

/jcr:root/Gossip/ColumnName/Columns//element(*,Column)[@publishDate >
xs:dateTime("way in the past") and @publishDate < xs:dateTime("way in the
future") and (@status != 'hidden')] order by @publishDate descending takes
1096 ms to execute.

Clearly dates (ordering and ranges) have a significant impact on query
execution speed.

Digging into the internals of Jackrabbit, we have noticed that there is an
implementation of RangeQuery that essentially walks the results if the # of
query terms is greater than what Lucene can handle.  Reading the Lucene
documentation, it looks like Filters are the recommended method of
implementing "large" range queries, and also seem like a natural fit for
matching node types - i.e., select * from Column
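
To make the contrast concrete, here is a rough, self-contained sketch (plain
Python, not actual Lucene or Jackrabbit code; the function names are invented)
of the two strategies over a toy inverted index mapping an indexed term to the
documents containing it:

```python
# Illustrative sketch only - not Lucene or Jackrabbit code. The "index"
# below is a toy inverted index: {term: [doc ids containing that term]}.

def range_by_term_expansion(index, lower, upper, max_clauses=1024):
    """RangeQuery style: expand the range into one boolean clause per
    distinct term; blows up once the term count exceeds max_clauses."""
    terms = [t for t in index if lower <= t <= upper]
    if len(terms) > max_clauses:
        raise ValueError("TooManyClauses: %d terms" % len(terms))
    hits = set()
    for t in terms:
        hits.update(index[t])
    return hits

def range_by_filter(index, num_docs, lower, upper):
    """Filter style: one pass over the term dictionary builds a bit set
    of matching documents; no clause limit, and the bit set is cacheable."""
    bits = [False] * num_docs
    for term, postings in index.items():
        if lower <= term <= upper:
            for doc in postings:
                bits[doc] = True
    return bits

index = {10: [0], 20: [1], 30: [2], 40: [3]}
print(sorted(range_by_term_expansion(index, 15, 40)))  # [1, 2, 3]
print(range_by_filter(index, 4, 15, 40))  # [False, True, True, True]
```

The filter variant makes one pass over the term dictionary, has no clause
limit, and the resulting bit set can be cached across queries - which appears
to be why the Lucene documentation recommends it for large ranges.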

Is there any ongoing work on query optimization and performance?  We would
be very interested in such work, including offering any help that we can.

-Dave

Re: Query Performance and Optimization

Posted by Marcel Reutegger <ma...@gmx.net>.
Jukka Zitting wrote:
> Note that the hierarchical content model of JCR is fundamentally
> different from the relational model. There is no way you could achieve
> such a mapping without major compromises, especially with residual item
> definitions but also with things like type inheritance and even
> handling of the tree structure.

there is a third way: use the database, but without storing nodes of 
different types in separate tables. it would be similar to what the current DB 
PMs do, but in addition make the following information available to queries:
- hierarchy information
- property values and types
- node name
currently this information is buried in the serialized item state.
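
a tiny sketch of what that could look like (hypothetical table and column
names, nothing like the actual DB PM schema; sqlite only for brevity):

```python
import sqlite3

# Hypothetical generic schema: one row per node plus one row per property,
# so hierarchy, node names and property values are visible to SQL instead
# of being buried inside a serialized item state blob.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES node(id),
        name      TEXT NOT NULL
    );
    CREATE TABLE property (
        node_id INTEGER REFERENCES node(id),
        name    TEXT NOT NULL,
        type    TEXT NOT NULL,
        value   TEXT
    );
""")
conn.executemany("INSERT INTO node VALUES (?, ?, ?)",
                 [(1, None, "Columns"), (2, 1, "column-a"), (3, 1, "column-b")])
conn.executemany("INSERT INTO property VALUES (?, ?, ?, ?)",
                 [(2, "status", "STRING", "hidden"),
                  (3, "status", "STRING", "published")])

# "children of node 1 whose status <> 'hidden'" becomes a plain join:
rows = conn.execute("""
    SELECT n.name FROM node n
    JOIN property p ON p.node_id = n.id
    WHERE n.parent_id = 1 AND p.name = 'status' AND p.value <> 'hidden'
""").fetchall()
print(rows)  # [('column-b',)]
```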

the lucene query handler already does this because that's what is required to 
execute a query: the node name, the hierarchy information and values of properties.

yes, that's all you need. the type information is already present in the 
jcr:primaryType property value and the type hierarchy is given by the node type 
registry. IMO the node types do not have to be represented as separate tables.

> In fact I think the use of relational databases in the current
> database persistence managers and the incoming bundle persistence
> managers is a necessary evil until we come up with a "native"
> persistence model that achieves the required ACID features without the
> need for an external component to "do its thing". :-)

but will this native persistence model also have query support? the current 
jackrabbit design offers separation between data storage and indexing, but wrt 
data integrity and performance it may be beneficial to have both in one pot 
(at least in some cases).

regards
  marcel

Re: Query Performance and Optimization

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 3/8/07, Michael Neale <mi...@gmail.com> wrote:
> On 3/7/07, Stefan Guggisberg <st...@gmail.com> wrote:
> > On 3/7/07, Michael Neale <mi...@gmail.com> wrote:
> > > The DDL generation kind of scares me, in terms of complexity, but I
> > > think it's necessary to let the RDBMS "do its thing" so to speak?
> >
> > why?
>
> Mainly for queries. if we have a node type def that has something:title,
> something:size etc... then if they map to columns in a table called
> something_title, something_size we can get the RDBMS to do indexing.

Note that the hierarchical content model of JCR is fundamentally
different from the relational model. There is no way you could achieve
such a mapping without major compromises, especially with residual item
definitions but also with things like type inheritance and even
handling of the tree structure.

> However, this is turning jackrabbit into a kind of ORM itself - probably
> not one of the aims ;)

Exactly. There are already a number of mature ORM tools out there,
each with the fundamental limitations of the ORM approach. Jackrabbit
is actually breaking those barriers by offering an alternative storage
model, and imposing the relational model on Jackrabbit would IMHO be a
major step backwards.

In fact I think the use of relational databases in the current
database persistence managers and the incoming bundle persistence
managers is a necessary evil until we come up with a "native"
persistence model that achieves the required ACID features without the
need for an external component to "do its thing". :-)

BR,

Jukka Zitting

Re: Query Performance and Optimization

Posted by Michael Neale <mi...@gmail.com>.
On 3/7/07, Stefan Guggisberg <st...@gmail.com> wrote:
>
> On 3/7/07, Michael Neale <mi...@gmail.com> wrote:
> > Hi Marcel - yes it would be interesting - I guess to get the most out of it,
> > the node type definitions would have to come into play to generate DDL for
> > the database - so the node type definitions will map to a more "tuned"
> > database schema - of course some concepts may not work that way, like
>
> i guess by "tuned" you mean a normalized schema. why do you think that
> such a normalized schema would improve performance?


Mainly allowing the RDBMS to perform queries - natively.

> > hierarchies, or "nt:unstructured" in which case it would need to use the
> > current style.
> >
> > As for fulltext - database support varies with each vendor, so I would
> > hazard a guess that lucene would still need to be part of it (that is the
> > approach that the newer versions of hibernate have taken - take full text
> > out of the hands of the database).
> >
> > The DDL generation kind of scares me, in terms of complexity, but I think
> > its necessary to let RDBMS "do its thing" so to speak?
>
> why?


Mainly for queries. if we have a node type def that has something:title,
something:size etc... then if they map to columns in a table called
something_title, something_size we can get the RDBMS to do indexing. However,
this is turning jackrabbit into a kind of ORM itself - probably not one of
the aims ;)

> > ORM tools can certainly help here - can avoid programmatically generating
> > DDL by instead generating a meta model that ORM tools work off - just a
> > thought (let the ORMs generate DB specific schemas).
> >
> > I know RDBMS are a proven way to scale up - but as for content, I am a
> > novice, so I am happy to follow the lead of those in the know in how best to
> > help jackrabbit scale. So far I have not been that "whelmed" by the query
> > performance - I am using the SQL dialect cause its familiar, but I think its
> > familiarity makes me want to do things that it is perhaps not optimised for,
> > maybe that is my problem.
> >
> > I should and will join the dev list, so as to not pollute the user list with
> > ponderings over jackrabbit internals ;)
> >
> > Thoughts?
> >
> > Michael.
> >
> > On 3/6/07, Marcel Reutegger <ma...@gmx.net> wrote:
> > >
> > > Michael Neale wrote:
> > > > I know from previous discussions that it is a design decision of Jackrabbit
> > > > to not exclusively work with RDBMS - if it was, I would be all in favour of
> > > > leaning on it to do the hard work.
> > >
> > > please note that it is possible to exclusively use an RDBMS for storing and
> > > querying content, though you have to create your own persistence manager and
> > > query handler. the jackrabbit core does not force you to separate the store and
> > > the index.
> > >
> > > but you are right that it was a design decision to allow separation if you want
> > > to. because jackrabbit initially only had plain file based persistence managers
> > > and because lucene provides very good fulltext indexing we decided to go with
> > > lucene.
> > >
> > > coming back to the RDBMS only approach. you would have to implement a
> > > persistence manager that stores nodes and properties in a way that allows the
> > > database to use its indexes. then create a query handler that translates an
> > > abstract query tree into a SQL statement based on the database schema.
> > >
> > > there are some obstacles you will have to overcome (or actually the database):
> > > 1) handle node hierarchies (e.g. get all ancestors of a certain node)
> > > 2) provide fulltext indexing
> > >
> > > I think this would be a very useful extension for jackrabbit. so, if anyone is
> > > interested in implementing this, I'm very curious how well it performs compared
> > > to the current implementation using lucene.
> > >
> > > regards
> > >   marcel
> >
>

Re: Query Performance and Optimization

Posted by Stefan Guggisberg <st...@gmail.com>.
On 3/7/07, Michael Neale <mi...@gmail.com> wrote:
> Hi Marcel - yes it would be interesting - I guess to get the most out of it,
> the node type definitions would have to come into play to generate DDL for
> the database - so the node type definitions will map to a more "tuned"
> database schema - of course some concepts may not work that way, like

i guess by "tuned" you mean a normalized schema. why do you think that
such a normalized schema would improve performance?

> hierarchies, or "nt:unstructured" in which case it would need to use the
> current style.
>
> As for fulltext - database support varies with each vendor, so I would
> hazard a guess that lucene would still need to be part of it (that is the
> approach that the newer versions of hibernate have taken - take full text
> out of the hands of the database).
>
> The DDL generation kind of scares me, in terms of complexity, but I think
> its necessary to let RDBMS "do its thing" so to speak?

why?

> ORM tools can certainly help here - can avoid programmatically generating
> DDL by instead generating a meta model that ORM tools work off - just a
> thought (let the ORMs generate DB specific schemas).
>
> I know RDBMS are a proven way to scale up - but as for content, I am a
> novice, so I am happy to follow the lead of those in the know in how best to
> help jackrabbit scale. So far I have not been that "whelmed" by the query
> performance - I am using the SQL dialect cause its familiar, but I think its
> familiarity makes me want to do things that it is perhaps not optimised for,
> maybe that is my problem.
>
> I should and will join the dev list, so as to not pollute the user list with
> ponderings over jackrabbit internals ;)
>
> Thoughts?
>
> Michael.
>
> On 3/6/07, Marcel Reutegger <ma...@gmx.net> wrote:
> >
> > Michael Neale wrote:
> > > I know from previous discussions that it is a design decision of Jackrabbit
> > > to not exclusively work with RDBMS - if it was, I would be all in favour of
> > > leaning on it to do the hard work.
> >
> > please note that it is possible to exclusively use an RDBMS for storing and
> > querying content, though you have to create your own persistence manager and
> > query handler. the jackrabbit core does not force you to separate the store and
> > the index.
> >
> > but you are right that it was a design decision to allow separation if you want
> > to. because jackrabbit initially only had plain file based persistence managers
> > and because lucene provides very good fulltext indexing we decided to go with
> > lucene.
> >
> > coming back to the RDBMS only approach. you would have to implement a
> > persistence manager that stores nodes and properties in a way that allows the
> > database to use its indexes. then create a query handler that translates an
> > abstract query tree into a SQL statement based on the database schema.
> >
> > there are some obstacles you will have to overcome (or actually the database):
> > 1) handle node hierarchies (e.g. get all ancestors of a certain node)
> > 2) provide fulltext indexing
> >
> > I think this would be a very useful extension for jackrabbit. so, if anyone is
> > interested in implementing this, I'm very curious how well it performs compared
> > to the current implementation using lucene.
> >
> > regards
> >   marcel
> >
>

Re: Query Performance and Optimization

Posted by Michael Neale <mi...@gmail.com>.
Hi Marcel - yes it would be interesting - I guess to get the most out of it,
the node type definitions would have to come into play to generate DDL for
the database - so the node type definitions will map to a more "tuned"
database schema - of course some concepts may not work that way, like
hierarchies, or "nt:unstructured" in which case it would need to use the
current style.
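
To make the idea concrete, a toy sketch of deriving DDL from a node type def
(the property-to-column mapping and all the names below are invented, just to
illustrate - not a real proposal):

```python
# Toy sketch of the node-type-to-DDL idea: derive one table per node type,
# mapping namespaced property names to columns so the RDBMS can index them.
TYPE_MAP = {"String": "VARCHAR(255)", "Long": "BIGINT", "Date": "TIMESTAMP"}

def ddl_for_node_type(type_name, properties):
    """Build a CREATE TABLE statement from (property, type) pairs,
    flattening the JCR-style 'prefix:name' into 'prefix_name' columns."""
    cols = ", ".join("%s %s" % (prop.replace(":", "_"), TYPE_MAP[ptype])
                     for prop, ptype in properties)
    return ("CREATE TABLE %s (node_id BIGINT PRIMARY KEY, %s)"
            % (type_name.replace(":", "_"), cols))

print(ddl_for_node_type("something:Column",
                        [("something:title", "String"),
                         ("something:publishDate", "Date")]))
# CREATE TABLE something_Column (node_id BIGINT PRIMARY KEY,
#   something_title VARCHAR(255), something_publishDate TIMESTAMP)
```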

As for fulltext - database support varies with each vendor, so I would
hazard a guess that lucene would still need to be part of it (that is the
approach that the newer versions of hibernate have taken - take full text
out of the hands of the database).

The DDL generation kind of scares me, in terms of complexity, but I think
it's necessary to let the RDBMS "do its thing" so to speak?
ORM tools can certainly help here - can avoid programmatically generating
DDL by instead generating a meta model that ORM tools work off - just a
thought (let the ORMs generate DB specific schemas).

I know RDBMS are a proven way to scale up - but as for content, I am a
novice, so I am happy to follow the lead of those in the know on how best to
help jackrabbit scale. So far I have not been that "whelmed" by the query
performance - I am using the SQL dialect because it's familiar, but I think its
familiarity makes me want to do things that it is perhaps not optimised for;
maybe that is my problem.

I should and will join the dev list, so as to not pollute the user list with
ponderings over jackrabbit internals ;)

Thoughts?

Michael.

On 3/6/07, Marcel Reutegger <ma...@gmx.net> wrote:
>
> Michael Neale wrote:
> > I know from previous discussions that it is a design decision of Jackrabbit
> > to not exclusively work with RDBMS - if it was, I would be all in favour of
> > leaning on it to do the hard work.
>
> please note that it is possible to exclusively use an RDBMS for storing and
> querying content, though you have to create your own persistence manager and
> query handler. the jackrabbit core does not force you to separate the store and
> the index.
>
> but you are right that it was a design decision to allow separation if you want
> to. because jackrabbit initially only had plain file based persistence managers
> and because lucene provides very good fulltext indexing we decided to go with
> lucene.
>
> coming back to the RDBMS only approach. you would have to implement a
> persistence manager that stores nodes and properties in a way that allows the
> database to use its indexes. then create a query handler that translates an
> abstract query tree into a SQL statement based on the database schema.
>
> there are some obstacles you will have to overcome (or actually the database):
> 1) handle node hierarchies (e.g. get all ancestors of a certain node)
> 2) provide fulltext indexing
>
> I think this would be a very useful extension for jackrabbit. so, if anyone is
> interested in implementing this, I'm very curious how well it performs compared
> to the current implementation using lucene.
>
> regards
>   marcel
>

Re: Query Performance and Optimization

Posted by Marcel Reutegger <ma...@gmx.net>.
Michael Neale wrote:
> I know from previous discussions that it is a design decision of Jackrabbit
> to not exclusively work with RDBMS - if it was, I would be all in favour of
> leaning on it to do the hard work.

please note that it is possible to exclusively use an RDBMS for storing and 
querying content, though you have to create your own persistence manager and 
query handler. the jackrabbit core does not force you to separate the store and 
the index.

but you are right that it was a design decision to allow separation if you want 
to. because jackrabbit initially only had plain file based persistence managers 
and because lucene provides very good fulltext indexing we decided to go with 
lucene.

coming back to the RDBMS only approach. you would have to implement a 
persistence manager that stores nodes and properties in a way that allows the 
database to use its indexes. then create a query handler that translates an 
abstract query tree into a SQL statement based on the database schema.

there are some obstacles you will have to overcome (or actually the database):
1) handle node hierarchies (e.g. get all ancestors of a certain node)
2) provide fulltext indexing
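
for obstacle 1), an adjacency list plus a recursive query already gets you
quite far (support for recursive queries varies by vendor, though). a minimal
sketch, with made-up table names and sqlite only for brevity:

```python
import sqlite3

# Toy node table: each row points at its parent (adjacency list).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    INSERT INTO node VALUES
        (1, NULL, 'root'), (2, 1, 'Gossip'),
        (3, 2, 'ColumnName'), (4, 3, 'Columns');
""")

# all ancestors of node 4, walking parent_id links with a recursive CTE
ancestors = conn.execute("""
    WITH RECURSIVE anc(id, parent_id, name) AS (
        SELECT id, parent_id, name FROM node WHERE id = 4
        UNION ALL
        SELECT n.id, n.parent_id, n.name
        FROM node n JOIN anc a ON n.id = a.parent_id
    )
    SELECT name FROM anc WHERE id <> 4
    ORDER BY id
""").fetchall()
print(ancestors)  # [('root',), ('Gossip',), ('ColumnName',)]
```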

I think this would be a very useful extension for jackrabbit. so, if anyone is 
interested in implementing this, I'm very curious how well it performs compared 
to the current implementation using lucene.

regards
  marcel

Re: Query Performance and Optimization

Posted by Michael Neale <mi...@gmail.com>.
I would like to second this.

I know from previous discussions that it is a design decision of Jackrabbit
to not exclusively work with RDBMS - if it was, I would be all in favour of
leaning on it to do the hard work.

But I presume Lucene is leaned on to do all the hard work instead (and it is
certainly capable) - but for me, query performance seems to be a bit of
voodoo and somewhat random without diving into Jackrabbit internals. I
definitely think a lot of work can be done in that regard.

On 3/1/07, David Johnson <db...@gmail.com> wrote:
>
> We are exploring using Jackrabbit in a production environment.  I have a
> repository that we have created from our content that has > 100K nodes.
> Several of our use cases need date range queries and also use 'order
> by' frequently.  We have noticed that query times are significantly slower
> than we expected.  After warming up the repository (i.e., running the suite
> of queries once), as an example:
>
> "select * from Column where jcr:path like 'Gossip/ColumnName/Columns/%' and
> status <> 'hidden' order by publishDate desc" takes 500 ms to execute - this
> is just the execution time, I am not actually using or accessing the
> NodeIterator.
>
> Whereas: "select * from Column where jcr:path like
> 'Gossip/ColumnName/Columns/%' and status <> 'hidden'" takes only 33 ms to
> execute.
>
> /jcr:root/Gossip/ColumnName/Columns//element(*,Column)[@publishDate >
> xs:dateTime("way in the past") and @publishDate < xs:dateTime("way in the
> future") and (@status != 'hidden')] order by @publishDate descending takes
> 1096 ms to execute.
>
> Clearly dates (ordering and ranges) have a significant impact on query
> execution speed.
>
> Digging into the internals of Jackrabbit, we have noticed that there is an
> implementation of RangeQuery that essentially walks the results if the # of
> query terms is greater than what Lucene can handle.  Reading the Lucene
> documentation, it looks like Filters are the recommended method of
> implementing "large" range queries, and also seem like a natural fit for
> matching node types - i.e., select * from Column
>
> Is there any ongoing work on query optimization and performance?  We would
> be very interested in such work, including offering any help that we can.
>
> -Dave
>