You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@jackrabbit.apache.org by "Seidel. Robert" <Ro...@aeb.de> on 2011/05/04 17:37:34 UTC

SQL2 can't be used for large amounts of hits

Hi,

I've done some testing with Jackrabbit 2.2.1, SQL2 and nearly 100k nodes and an Oracle bundled data store. A query resulting all 100k nodes took about 10 minutes to execute (not to iterate about the result).
First I thought it was because the sort with the "order by" expression I have used, but I've removed the order by and the query is still slow.

I've looked a litte bit into the code and found LuceneQueryFactory public List<Row> execute(...). There is something like:

Node = hits.nextScoreNode();
While (node != null) {
  ... session.getNodeById(Node.getNodeId()
  ...node = hits.nextScoreNode();
}

The funny thing is, although everything is read into an collection, you can't ask the query how many hits there were, because the collection was put into an iterator afterwards...

For the GUI this is ok for me, because I can only display a defined number of hits, like 100 or so - so I can use the setLimit() (if I do so, the sort is broken - because it sorts only the limited results...).

But I have some other use case, where I want to iterate over an huge amount of nodes and export the results. In this case, the collection from LuceneQueryFactory will simply not fit into the memory.

Is there any solution for this, like a real iterator?

Kindly regards, Robert
________________________________

Logistik und Au?enwirtschaft stehen auf der transport logistic vom 10.-13. Mai 2011 in M?nchen im Mittelpunkt.
Ihre Fragen am Stand von AEB. In Halle B2, Stand 405/506.
Hier k?nnen Sie einen Termin vereinbaren und einen Gewinncode generieren: www.aeb.de/transport-logistic.
Mit etwas Gl?ck gewinnen Sie vor Ort ein Apple iPad.

AW: SQL2 can't be used for large amounts of hits

Posted by "Seidel. Robert" <Ro...@aeb.de>.
Hi Alex,

I've changed the query language to sql and it really helped a lot, now the query is done in seconds.

I don't even need the limit anymore, because the iterator iterates over the results.

Kindly regards, Robert

-----Ursprüngliche Nachricht-----
Von: Alex Parvulescu [mailto:alex.parvulescu@gmail.com]
Gesendet: Donnerstag, 5. Mai 2011 10:00
An: users@jackrabbit.apache.org
Betreff: Re: SQL2 can't be used for large amounts of hits

Hi Robert,

The overall implementation of SQL2 is rather green. If you decide to use it
in a production, you should expect some things to be slower than SQL or
XPath.

In this particular case, if you continue through the code, you'll see that
some conditions are being applied in the java code, on the result set, that
is why you need to accumulate it in a collection and then sort it via
Collections.sort.
I'm not saying it is the best approach, I'm only explaining how it is now.

I have been pushing some patches to make it a bit better performance wise,
but there are still some design decisions that need to be changed, and the
change is rather involved.
I opted for an incremental approach as that would allow us to optimize SQL2
while still keeping it usable, even if slow in some cases.

If you can take a look at this pach [0] for limit and offset, I'd love to
have your feedback on it.
Also, tests are highly welcome ;)

If you don't feel too adventurous, a strong alternative for you is to use
XPath.

[0] https://issues.apache.org/jira/browse/JCR-2830

On Wed, May 4, 2011 at 5:37 PM, Seidel. Robert <Ro...@aeb.de> wrote:

> Hi,
>
> I've done some testing with Jackrabbit 2.2.1, SQL2 and nearly 100k nodes
> and an Oracle bundled data store. A query resulting all 100k nodes took
> about 10 minutes to execute (not to iterate about the result).
> First I thought it was because the sort with the "order by" expression I
> have used, but I've removed the order by and the query is still slow.
>
> I've looked a litte bit into the code and found LuceneQueryFactory public
> List<Row> execute(...). There is something like:
>
> Node = hits.nextScoreNode();
> While (node != null) {
>  ... session.getNodeById(Node.getNodeId()
>  ...node = hits.nextScoreNode();
> }
>
> The funny thing is, although everything is read into an collection, you
> can't ask the query how many hits there were, because the collection was put
> into an iterator afterwards...
>
> For the GUI this is ok for me, because I can only display a defined number
> of hits, like 100 or so - so I can use the setLimit() (if I do so, the sort
> is broken - because it sorts only the limited results...).
>
> But I have some other use case, where I want to iterate over an huge amount
> of nodes and export the results. In this case, the collection from
> LuceneQueryFactory will simply not fit into the memory.
>
> Is there any solution for this, like a real iterator?
>
> Kindly regards, Robert
> ________________________________
>
> Logistik und Au?enwirtschaft stehen auf der transport logistic vom 10.-13.
> Mai 2011 in M?nchen im Mittelpunkt.
> Ihre Fragen am Stand von AEB. In Halle B2, Stand 405/506.
> Hier k?nnen Sie einen Termin vereinbaren und einen Gewinncode generieren:
> www.aeb.de/transport-logistic.
> Mit etwas Gl?ck gewinnen Sie vor Ort ein Apple iPad.
>
________________________________

Logistik und Außenwirtschaft stehen auf der transport logistic vom 10.-13. Mai 2011 in München im Mittelpunkt.
Ihre Fragen am Stand von AEB. In Halle B2, Stand 405/506.
Hier können Sie einen Termin vereinbaren und einen Gewinncode generieren: www.aeb.de/transport-logistic.
Mit etwas Glück gewinnen Sie vor Ort ein Apple iPad.

Re: SQL2 can't be used for large amounts of hits

Posted by Alex Parvulescu <al...@gmail.com>.
Hi Robert,

The overall implementation of SQL2 is rather green. If you decide to use it
in a production, you should expect some things to be slower than SQL or
XPath.

In this particular case, if you continue through the code, you'll see that
some conditions are being applied in the java code, on the result set, that
is why you need to accumulate it in a collection and then sort it via
Collections.sort.
I'm not saying it is the best approach, I'm only explaining how it is now.

I have been pushing some patches to make it a bit better performance wise,
but there are still some design decisions that need to be changed, and the
change is rather involved.
I opted for an incremental approach as that would allow us to optimize SQL2
while still keeping it usable, even if slow in some cases.

If you can take a look at this pach [0] for limit and offset, I'd love to
have your feedback on it.
Also, tests are highly welcome ;)

If you don't feel too adventurous, a strong alternative for you is to use
XPath.

[0] https://issues.apache.org/jira/browse/JCR-2830

On Wed, May 4, 2011 at 5:37 PM, Seidel. Robert <Ro...@aeb.de> wrote:

> Hi,
>
> I've done some testing with Jackrabbit 2.2.1, SQL2 and nearly 100k nodes
> and an Oracle bundled data store. A query resulting all 100k nodes took
> about 10 minutes to execute (not to iterate about the result).
> First I thought it was because the sort with the "order by" expression I
> have used, but I've removed the order by and the query is still slow.
>
> I've looked a litte bit into the code and found LuceneQueryFactory public
> List<Row> execute(...). There is something like:
>
> Node = hits.nextScoreNode();
> While (node != null) {
>  ... session.getNodeById(Node.getNodeId()
>  ...node = hits.nextScoreNode();
> }
>
> The funny thing is, although everything is read into an collection, you
> can't ask the query how many hits there were, because the collection was put
> into an iterator afterwards...
>
> For the GUI this is ok for me, because I can only display a defined number
> of hits, like 100 or so - so I can use the setLimit() (if I do so, the sort
> is broken - because it sorts only the limited results...).
>
> But I have some other use case, where I want to iterate over an huge amount
> of nodes and export the results. In this case, the collection from
> LuceneQueryFactory will simply not fit into the memory.
>
> Is there any solution for this, like a real iterator?
>
> Kindly regards, Robert
> ________________________________
>
> Logistik und Au?enwirtschaft stehen auf der transport logistic vom 10.-13.
> Mai 2011 in M?nchen im Mittelpunkt.
> Ihre Fragen am Stand von AEB. In Halle B2, Stand 405/506.
> Hier k?nnen Sie einen Termin vereinbaren und einen Gewinncode generieren:
> www.aeb.de/transport-logistic.
> Mit etwas Gl?ck gewinnen Sie vor Ort ein Apple iPad.
>