Posted to users@jackrabbit.apache.org by dan <da...@hotmail.com> on 2006/11/29 21:00:36 UTC

RowIterator loop is slow?

Hi,
I was using RowIterator to loop through about 2000 entries in a query result,
and it took 3+ seconds.
I stripped the code to the bare loop structure like below: 
	logger.debug("start loop");
	while (rows.hasNext()){
		Row row = rows.nextRow();
		Value[] values = row.getValues();
	}  
	logger.debug("end loop");

The time for going through the entire RowSet is still 3+ seconds. I tried with
NodeIterator; the result did not change much.

Could anyone advise if this is the normal performance? I'm running this code
on a Windows 2003 server.

Thanks,
Dan

Re: RowIterator loop is slow?

Posted by Marcel Reutegger <ma...@gmx.net>.

Hi Dan,

thanks for the clarification. I think now I understand.

dan wrote:
> An example of my query scenario would be: 
> For my:document nodes that meet the following query conditions: 
>  1) refer to countries US, or Canada, or Mexico and, 
>  2) refer to sizes Small or Medium, and
>  3) refer to colors Red or Yellow, and
>  4) document content contains arbitrary user entered text
>  5) ... other property based query parameters
> Return the country names referred to by the above document results (in a
> unique list), and count the number of documents under each returned country
> name.
> An example of expected result set may be: 
>    US, 19
>    Canada, 4
> 
> You may have noticed that in query condition 1), users are allowed to
> specify target countries, but the result may not have all country names as
> specified (Mexico here), because other document filtering parameters may
> prevent any "Mexico"-referring document from showing in the result.
> 
> I hope this makes things clear for you. 
> My perception is that I can't achieve the result with one query because
> there is no "Select distinct" and "inner join" equivalent in JCR/Jackrabbit.

And you would also need 'group by' and aggregate functions like sum().
Enhancements like those are currently being discussed in JSR 283.

> Would you have any suggestion/comment on the approach? 

I think the best you can do with the current JCR version is:

1) query for all categories (countries) that match your query. That's the SQL 
query you posted initially but converted into an XPath with an additional 
jcr:deref() at the end to the category node.

2) for each matching category run a new query for that category, which will 
return the documents for that category. Then get the number of documents by 
calling NodeIterator.getSize() instead of looping through all the matches. This 
should be faster than your initial approach.
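
The two steps above can be sketched roughly as follows. This is only an
outline, not Jackrabbit code: the class CategoryCounts and the countQuery
parameter are hypothetical names, and countQuery stands in for "run the
per-category query and call NodeIterator.getSize()" so that the sketch
compiles without the JCR jar. Note that getSize() is allowed to return -1
when the size is not known, so the sketch treats that as an error instead
of silently reporting a wrong count.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

/** Hypothetical helper: one count query per category instead of one big loop. */
final class CategoryCounts {

    /**
     * Asks countQuery for the number of matching documents per category and
     * keeps only categories with at least one match, in the given order.
     */
    static Map<String, Long> count(List<String> categories,
                                   ToLongFunction<String> countQuery) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (String category : categories) {
            long size = countQuery.applyAsLong(category);
            if (size < 0) {
                // A size of -1 means "unknown"; the caller would then have
                // to count by iterating over the result.
                throw new IllegalStateException("size unknown for " + category);
            }
            if (size > 0) {
                result.put(category, size);
            }
        }
        return result;
    }
}
```

With step 1 supplying the category list, a category such as Mexico that ends
up with no matching documents simply does not appear in the returned map.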

regards
  marcel

RE: RowIterator loop is slow?

Posted by dan <da...@hotmail.com>.

Hi Marcel,

>I'm also a bit confused whether you are finally interested in documents or
>categories. The SQL query you posted earlier indicates that you are
>interested in documents, but the above XPath query indicates that you are
>interested in the referenced category text.

Yes, I am (ultimately) interested in the referenced Category text. The query
in my previous post was a test toward the goal described here. Let me put it
in plain language...

All my:document nodes refer to MULTIPLE "category nodes" - say Country,
Size, and ColorScheme. Each of them has multiple category-entries. 
Say category "Country" has entries "US, Canada, Mexico, Argentina, Brazil";
Category "Size" has entries "Small, Medium, Large"; And Category
"ColorScheme" has entries "Red, Green, Yellow, Pink, Black".

For any my:document node, 
- it points to ONE or more entries of Country
- it points to zero or more entries of Size
- it points to zero or more entries of ColorScheme
- it has other text/date properties

An example of my query scenario would be: 
For my:document nodes that meet the following query conditions: 
 1) refer to countries US, or Canada, or Mexico and, 
 2) refer to sizes Small or Medium, and
 3) refer to colors Red or Yellow, and
 4) document content contains arbitrary user entered text
 5) ... other property based query parameters
Return the country names referred to by the above document results (in a
unique list), and count the number of documents under each returned country
name.
An example of expected result set may be: 
   US, 19
   Canada, 4

You may have noticed that in query condition 1), users are allowed to
specify target countries, but the result may not have all country names as
specified (Mexico here), because other document filtering parameters may
prevent any "Mexico"-referring document from showing in the result.

I hope this makes things clear for you. 
My perception is that I can't achieve the result with one query because
there is no "Select distinct" and "inner join" equivalent in JCR/Jackrabbit.

Would you have any suggestion/comment on the approach? 

Regards,
Dan

Re: RowIterator loop is slow?

Posted by Marcel Reutegger <ma...@gmx.net>.

dan wrote:
> In my case, my:document nodes refer to MULTIPLE "categories". I need to
> support queries that say: 
> 
> //element(*, my:document)
> [ (jcr:deref(@my:cat1Ref, 'cat1Entry'))    
>        (--- the document must refer to at least one of the 'category1'
>             entries)
> 
>   And (jcr:deref(@my:cat2Ref, 'cat2Entry')/@my:text = 'cat2_Entry_2')
>        (--- the document must also refer to category2 entry, named
>             "cat2_entry_2")
> 
>   And (jcr:deref(@my:cat3Ref, 'cat3Entry')/@my:text = 'cat3_Entry_x')
>        (--- the document must also refer to category3 entry, named
>             "cat3_entry_x")
> 
> ]/jcr:deref(@my:cat1Ref, 'cat1Entry')/@my:text order by @my:text
> 			
> Due to the query limitations, I had to gather all document nodes and
> manually compile the list of referenced category entries.
> 
> Would you advise on some other approaches? 

I'm not sure I understand your requirements correctly. I would simplify the 
query by not using jcr:deref() in the predicate (well, you can't anyway, 
because it's not supported) and replacing it with the UUID of the referenced 
category. I assume that jcr:deref(@my:cat3Ref, 'cat3Entry')/@my:text = 
'cat3_Entry_x' can be easily replaced because it points to a well-known node or 
at least a limited set of nodes?

I'm also a bit confused whether you are finally interested in documents or 
categories. The SQL query you posted earlier indicates that you are interested 
in documents, but the above XPath query indicates that you are interested in the 
referenced category text.

regards
  marcel

RE: RowIterator loop is slow?

Posted by dan <da...@hotmail.com>.

Hi Marcel,

> > - RowIterator.nextRow(): 13234-9375 = 3859ms
> 
> that indicates that most of the time is spent in retrieving the values
> from persistent storage. Does your application really require to read
> all 2000 documents?

Actually, I was forced to use this approach because I could not find a
proper content structure and/or query syntax to support the client's
requirement.

In my case, my:document nodes refer to MULTIPLE "categories". I need to
support queries that say: 

//element(*, my:document)
[ (jcr:deref(@my:cat1Ref, 'cat1Entry'))    
       (--- the document must refer to at least one of the 'category1'
            entries)

  And (jcr:deref(@my:cat2Ref, 'cat2Entry')/@my:text = 'cat2_Entry_2')
       (--- the document must also refer to category2 entry, named
            "cat2_entry_2")

  And (jcr:deref(@my:cat3Ref, 'cat3Entry')/@my:text = 'cat3_Entry_x')
       (--- the document must also refer to category3 entry, named
            "cat3_entry_x")

]/jcr:deref(@my:cat1Ref, 'cat1Entry')/@my:text order by @my:text
			
Due to the query limitations, I had to gather all document nodes and
manually compile the list of referenced category entries.
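
As an illustration of the manual compilation described above, here is a
minimal sketch of the client-side tally. It assumes the category entry names
of each document have already been read into plain collections of strings;
the class name CategoryTally is made up for this example and nothing here
touches the JCR API.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: unique category entries with a document count each. */
final class CategoryTally {

    /**
     * Counts, per category entry, how many documents refer to it.
     * The TreeMap returns the entries sorted by name.
     */
    static Map<String, Integer> tally(
            Collection<? extends Collection<String>> entriesPerDocument) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Collection<String> entries : entriesPerDocument) {
            // The set guards against one document listing an entry twice.
            for (String entry : new HashSet<>(entries)) {
                counts.merge(entry, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

This is the "select distinct ... group by ... count" step done in Java, which
is why it requires every matching document to be loaded first.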

Would you advise on some other approaches? 

Thanks,
Dan

Re: RowIterator loop is slow?

Posted by Marcel Reutegger <ma...@gmx.net>.

dan wrote:
> And here is the result (Extracted from the Log4j output):
> - Query.execute(): 9375-7484 = 1891ms

Do you see an improvement here when you execute the query again? Initial queries 
can be quite slow because the hierarchy structure is resolved from the index, but 
it is later cached, which speeds up queries significantly.

> - QueryResult.getRows(): less than 1 ms
> - RowIterator.nextRow(): 13234-9375 = 3859ms

that indicates that most of the time is spent in retrieving the values from 
persistent storage. Does your application really require to read all 2000 documents?

regards
  marcel

RE: RowIterator loop is slow?

Posted by dan <da...@hotmail.com>.

> Can you give more details where those 3 seconds are spent? I'd be
> interested to

The query is in SQL: 
SELECT Source FROM cm:document WHERE jcr:path LIKE '/cm:contentRoot/CCD/%'
AND cm:state='published' AND  (flag3='2'  OR flag3='1' )   AND
(Source='640'  OR Source='240'  OR Source='220'  OR Source='130'  OR
Source='160'  OR Source='020'  OR Source='630'  OR Source='760'  OR
Source='050'  OR Source='730'  OR Source='190'  OR Source='230'  OR
Source='360'  OR Source='530'  OR Source='040'  OR Source='330'  OR
Source='720'  OR Source='750'  OR Source='390'  OR Source='540'  OR
Source='280'  OR Source='110'  OR Source='580'  OR Source='620' )   AND
(Category1='090'  OR Category1='150'  OR Category1='130'  OR Category1='160'
OR Category1='020'  OR Category1='060'  OR Category1='050'  OR
Category1='140'  OR Category1='040'  OR Category1='010'  OR Category1='080'
OR Category1='110'  OR Category1='030'  OR Category1='070'  OR
Category1='100'  OR Category1='120' )   ORDER BY Source

NodeType cm:document has properties "Source" and "Category1", whose values are
equal to the values of other category nodes. (I can't use references because I
need to query on these multiple values to filter documents, which is not
supported in current JCR if node references are used.)
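
For readers who build such statements from user selections, the repeated
"prop='a' OR prop='b'" groups can be generated with a small helper like the
one below. This is a sketch under assumptions: the class SqlClauses and the
method anyOf are invented for this example, and doubling single quotes
inside values assumes the usual SQL string-literal escaping applies.

```java
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical helper for building one parenthesised OR group. */
final class SqlClauses {

    /** Builds "(prop='v1' OR prop='v2' ...)"; single quotes in values are doubled. */
    static String anyOf(String property, List<String> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("no values for " + property);
        }
        StringJoiner clause = new StringJoiner(" OR ", "(", ")");
        for (String value : values) {
            clause.add(property + "='" + value.replace("'", "''") + "'");
        }
        return clause.toString();
    }
}
```

The groups for flag3, Source and Category1 would then be joined with AND to
form the WHERE part of the statement.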

And here is the result (Extracted from the Log4j output):
- Query.execute(): 9375-7484 = 1891ms
- QueryResult.getRows(): less than 1 ms
- RowIterator.nextRow(): 13234-9375 = 3859ms

***********************************
[2006-12-01 10:32,  7484]DEBUG[WebContainer : 1] executing SQL query for
categories: SELECT Source FROM cm:document WHERE jcr:path LIKE
'/cm:contentRoot/CCD/%'   AND cm:state='published' AND  (flag3='2'  OR
flag3='1' )   AND  (Source='640'  OR Source='240'  OR Source='220'  OR
Source='130'  OR Source='160'  OR Source='020'  OR Source='630'  OR
Source='760'  OR Source='050'  OR Source='730'  OR Source='190'  OR
Source='230'  OR Source='360'  OR Source='530'  OR Source='040'  OR
Source='330'  OR Source='720'  OR Source='750'  OR Source='390'  OR
Source='540'  OR Source='280'  OR Source='110'  OR Source='580'  OR
Source='620' )   AND  (Category1='090'  OR Category1='150'  OR
Category1='130'  OR Category1='160'  OR Category1='020'  OR Category1='060'
OR Category1='050'  OR Category1='140'  OR Category1='040'  OR
Category1='010'  OR Category1='080'  OR Category1='110'  OR Category1='030'
OR Category1='070'  OR Category1='100'  OR Category1='120' )   ORDER BY
Source
[2006-12-01 10:32,  9375]DEBUG[WebContainer : 1] query successful, getting
RowIterator
[2006-12-01 10:32,  9375]DEBUG[WebContainer : 1] got iterator, total number
of entries: 2092
[2006-12-01 10:32,  9375]DEBUG[WebContainer : 1] start parsing categories..
[2006-12-01 10:32, 13234]DEBUG[WebContainer : 1] end processing categories
************************************
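
The subtraction used for the figures above can also be done mechanically.
The sketch below assumes that the second field inside the leading square
brackets is a running millisecond counter, which is how the numbers are
used in this message; LogDeltas is a made-up class name, and continuation
lines that do not start with a bracketed prefix are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: elapsed time between consecutive log entries. */
final class LogDeltas {

    // Matches a "[date time,  millis]" prefix and captures the counter.
    private static final Pattern PREFIX =
            Pattern.compile("^\\[[^,\\]]*,\\s*(\\d+)\\]");

    /** Returns the millisecond counter of every line that has the prefix. */
    static List<Long> counters(List<String> lines) {
        List<Long> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = PREFIX.matcher(line);
            if (m.find()) {
                out.add(Long.parseLong(m.group(1)));
            }
        }
        return out;
    }

    /** Differences between consecutive counters, in log order. */
    static List<Long> deltas(List<Long> counters) {
        List<Long> out = new ArrayList<>();
        for (int i = 1; i < counters.size(); i++) {
            out.add(counters.get(i) - counters.get(i - 1));
        }
        return out;
    }
}
```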

Thanks,
Dan

Re: RowIterator loop is slow?

Posted by Marcel Reutegger <ma...@gmx.net>.

dan wrote:
> IMO, "order by NOT_JUST_jcr:score" is very common use case. The way that
> retrieving all nodes from multiple BLOBs into Java objects and then do Java
> sorting, won't have any performance advantage over that allowing RDB to
> handle everything in one shot. 

The expensive sorting is only done when document order is requested. order by 
jcr:score() was just an example. If you order by any other property lucene will 
do the sorting as well, just like ordering by score.

Can you give more details where those 3 seconds are spent? I'd be interested to 
know how much time is spent in:

- Query.execute()
- QueryResult.getRows()
- RowIterator.nextRow()
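
One way to collect these three numbers separately is a small timing wrapper
around each call. The sketch below is plain Java and does not depend on the
JCR API; PhaseTimer and its nested Phase interface are hypothetical names
for this example. Phase.run() is declared to throw Exception so that calls
throwing RepositoryException can be passed in as lambdas.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: wall-clock time per named phase, in call order. */
final class PhaseTimer {

    /** A unit of work that may throw a checked exception. */
    interface Phase<T> {
        T run() throws Exception;
    }

    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();

    /** Runs the phase, adds its duration to the label, returns its result. */
    <T> T time(String label, Phase<T> phase) throws Exception {
        long start = System.nanoTime();
        try {
            return phase.run();
        } finally {
            elapsedNanos.merge(label, System.nanoTime() - start, Long::sum);
        }
    }

    /** Accumulated milliseconds for a label, or 0 if it never ran. */
    long millis(String label) {
        return elapsedNanos.getOrDefault(label, 0L) / 1_000_000;
    }
}
```

Wrapping query.execute() and result.getRows() once each, and every
rows.nextRow() call inside the loop under a single label, would give one
accumulated figure per bullet above.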

regards
  marcel

RE: RowIterator loop is slow?

Posted by dan <da...@hotmail.com>.

Thanks Marcel,

> if your query does not have an 'order by' clause AND the query handler 
> configuration uses the default value for the
> 'respectDocumentOrder' parameter. In that case there is a post processing
> in the query result which orders the result nodes in document order.

That explains why it was slow for me - I have "order by @my:propname" in the
query, although I've already set "respectDocumentOrder" to false.

Now I somewhat agree with the idea from the threads a few days back, about
"expanding RDB schema" and "if using RDB repository, let RDB do all
queries". 

IMO, "order by NOT_JUST_jcr:score" is a very common use case. Retrieving all
nodes from multiple BLOBs into Java objects and then sorting them in Java
won't have any performance advantage over letting the RDB handle everything
in one shot. 
Also, many RDB products now have full-text search capability, although it
may not be as good as Lucene's. When considering the overall performance,
it might be legitimate to think about an "RDB oriented search/query
mechanism". 

Of course, that may fall beyond the scope of Jackrabbit, as a reference impl
of JCR.

Thanks again & 
Best regards,
Dan

Re: RowIterator loop is slow?

Posted by Marcel Reutegger <ma...@gmx.net>.

dan wrote:
> All at once? I thought the Lucene search would return a set of Node UUIDs or
> something similar. Then the reading of actual Rows/Nodes from the result is
> incremental (by smaller chunks). 

Well, depending on your query and the configuration, it may happen that 
all result nodes are read at once. This is the case if your query does not have 
an 'order by' clause AND the query handler configuration uses the default value 
for the 'respectDocumentOrder' parameter. In that case there is a post-processing 
step in the query result which orders the result nodes in document order.

If you have an 'order by jcr:score()' OR if you set the 'respectDocumentOrder' 
to false the query result will read the nodes on demand from persistent storage 
when you request them through either Row- or NodeIterator.

>> does the performance improve when you look again through the entries?
> 
> Yes, I saw a slight improvement. But still around 3 seconds. 

Try changing the configuration to respectDocumentOrder=false.
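
For reference, the parameter is set on the SearchIndex element of the
workspace configuration. The fragment below is a sketch that assumes the
stock Lucene query handler class and index path; any other params already
present in your configuration stay as they are. Existing workspaces keep
their own workspace.xml, so the change would be needed there as well as in
the template in repository.xml.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- skip the post-processing step that sorts results in document order -->
  <param name="respectDocumentOrder" value="false"/>
</SearchIndex>
```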

regards
  marcel

RE: RowIterator loop is slow?

Posted by dan <da...@hotmail.com>.

Hi,

> That's probably because jackrabbit needs to read all 2000 entries from
> disk.

All at once? I thought the Lucene search would return a set of Node UUIDs or
something similar. Then the reading of actual Rows/Nodes from the result is
incremental (by smaller chunks). 

> does the performance improve when you look again through the entries?

Yes, I saw a slight improvement. But still around 3 seconds. 

I'm using DB2FileSystem. The Jackrabbit FAQ says DBFileSystem is "slower
than native file system", while LocalFileSystem is "slow on Windows boxes". 
Which of these two file systems is faster on a Windows box? Has anyone tested
with both file systems on Windows?

Also, If I want to try using LocalFileSystem, can I simply switch the
settings in repository.xml? 

Thanks,
Dan

Re: RowIterator loop is slow?

Posted by Marcel Reutegger <ma...@gmx.net>.

That's probably because jackrabbit needs to read all 2000 entries from disk. 
does the performance improve when you look again through the entries?

regards
  marcel


RE: RowIterator loop is slow?

Posted by dan <da...@hotmail.com>.

I've seen some discussion of slow performance on Windows boxes, but those
threads are about using LocalFileSystem on Windows. In my case, the repository
uses SimpleDBPersistenceManager and DBFileSystem (DB2). 
I thought that maybe the Lucene index on a Windows file system is also slow to
search, but since I've already got the RowIterator/NodeIterator, I guess
looping through the iterators has nothing to do with the Lucene index.
Is that correct? 
Thanks
Dan

