Posted to users@jackrabbit.apache.org by Dennis van der Laan <d....@rug.nl> on 2009/12/03 09:47:11 UTC

Searching for a property

Hi,

It seems querying on a property is very slow on our system (running
Jackrabbit 1.6.0): almost 1 second per query which would normally return
0 or 1 result.

We use jcr:content nodes of type nt:unstructured to store the contents
of a file (in the jcr:data property) and we store an array of Strings in
a property cms:virtualPath on the same node. So basically, every file in
our repository has the JCR path and zero or more virtual paths in the
cms:virtualPath property. If we want to add a virtual path to a file, we
have to check if the virtual path does not exist already. For this, we
use an XPath query in the following code snippet:

String vpath = QueryUtil.escapeForAttributeSearch(path.toLowerCase());
String query = "/jcr:root//element(*,nt:hierarchyNode)"
        + "[fn:lower-case(jcr:content/@cms:virtualPath) = '" + vpath + "']";
Query q = queryManager.createQuery(query, Query.XPATH);
NodeIterator ni = q.execute().getNodes();
if (ni.getSize() == 0) {
    throw new ItemNotFoundException("Unable to find item by virtual path: " + path);
}
else if (ni.getSize() > 1) {
    throw new IllegalStateException("More than 1 item on virtual path: " + path);
}
else {
    return ni.nextNode();
}

Our repository now contains around 500,000 virtual paths, more or less
divided over 150,000 files which are evenly distributed over more than
1000 (nested) folders.

The repository runs on an Intel Nehalem Xeon (2 x 2.5GHz) running
Solaris 10 and the repository database (for datastore, filesystem, etc)
runs on the same specs, on a different server, running Oracle 10g.

When we try to add virtual paths in a batch (about 2000 virtual path
properties for 1000 files) and all virtual paths already exist (so the
above query returns 1 virtual path), we see a 100% load on our
Tomcat application (which means 1 core fully utilized).

I would expect a JCR repository to be able to handle this kind of
query. How are these properties indexed? Is it possible to optimize
the repository for this kind of query? Or should I use a different
query? The alternative would be to keep a separate database which keeps
track of the virtual paths, but keeping that in sync with the JCR
repository would be a pain, to say the least.

Thanks for your ideas about this issue,
Kind regards,
Dennis van der Laan

Re: Searching for a property

Posted by Dennis van der Laan <d....@rug.nl>.
Hello Ard,

We rewrote a part of our virtual path handling, and now store both the
virtual path itself, and the lower-case equivalent (we really need the
not-lowercased path). All queries are now done on the lowercased virtual
path and indeed (!) everything stays fast, even after a million virtual
paths. We'll try to keep away from the lower-case function and similar
functions.
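
A minimal sketch of this approach, for readers who want to try it. The class,
the method names and the property name cms:virtualPathLower are hypothetical
(the thread only names cms:virtualPath), and doubling single quotes is assumed
here as the escaping for an XPath string literal. The helper is plain Java so
it can be tried without a repository:

```java
import java.util.Locale;

/** Hypothetical helper; names are illustrative and not taken from the thread's code base. */
final class VirtualPathIndexing {
    // Assumed property names: only cms:virtualPath is mentioned in the thread.
    static final String ORIGINAL_PROP = "cms:virtualPath";
    static final String LOWER_PROP = "cms:virtualPathLower";

    /** Value to store in the extra lower-cased property when the path is written. */
    static String toIndexedForm(String virtualPath) {
        return virtualPath.toLowerCase(Locale.ROOT);
    }

    /** Escapes a value for a single-quoted XPath literal by doubling single quotes (assumption). */
    static String escapeLiteral(String value) {
        return value.replace("'", "''");
    }

    /** Builds an exact-match query on the lower-cased copy, without fn:lower-case. */
    static String buildLookupQuery(String virtualPath) {
        return "//element(*, nt:unstructured)[@" + LOWER_PROP + " = '"
                + escapeLiteral(toIndexedForm(virtualPath)) + "']";
    }
}
```

The resulting string can be passed to queryManager.createQuery(query,
Query.XPATH) as in the original snippet. Since cms:virtualPath is an array of
Strings, the lower-cased copy would be an array of the same length.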

Thanks very much for all your help!

Dennis



-- 
Dennis van der Laan


Re: Searching for a property

Posted by Ard Schrijvers <a....@onehippo.com>.
On Mon, Jan 11, 2010 at 1:07 PM, Dennis van der Laan
<d....@rug.nl> wrote:
> Hello Ard,
>
> We rewrote a part of our virtual path handling, and now store both the
> virtual path itself, and the lower-case equivalent (we really need the
> not-lowercased path). All queries are now done on the lowercased virtual
> path and indeed (!) everything stays fast, even after a million virtual
> paths. We'll try to keep away from the lower-case function and similar
> functions.

As long as it is a single term lookup in Lucene, it is always fast,
almost regardless of the number of terms there are.

>
> Thanks very much for all your help!

You're welcome,

Ard


Re: Searching for a property

Posted by Ard Schrijvers <a....@onehippo.com>.
On Thu, Dec 17, 2009 at 10:59 PM, Dennis van der Laan
<d....@rug.nl> wrote:
> Dennis van der Laan wrote:

> See the increase of time spent on the execution: 400+ ms instead of 7ms.
> And this is not a single incident, I see this increase on all queries
> like the above.
>
> The memory of the JVM should not be a problem, it's set to 2Gb and only
> 800Mb is used at the moment the queries are slow. Restarting the
> application does not help either.

No, this seems logical to me. The memory is consumed by internal
Lucene term enums. I am quite sure what your issue is, although I did
not test it, nor have I ever tried it myself. But I have always wondered
*how* fn:lower-case could have been implemented efficiently in
Jackrabbit. It doesn't fit into my understanding of how inverted
indexes work, which is what Lucene is in the end. So, I am happy that my
understanding was correct, and unhappy that fn:lower-case does not
(again, off the top of my head and from looking at the code only) scale
too well.

I think in your setup a lot of time is spent in the CaseTermQuery,
which first traverses all of your 1 million virtual paths and lowercases
them. This cannot scale (neither in CPU nor in memory).

So, could you give me an indication of the query execution
time without the fn:lower-case? I think it will drop to < 1 ms.

I think you should try to get away without using fn:lower-case if
this works for you. Just make sure that you always store the virtualpath
property as lower-case: then you are fine.
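
To illustrate the difference, here is a toy model of a term dictionary. It is
not Lucene or Jackrabbit code; the class ToyTermIndex and its methods are made
up for this sketch. An exact match is a single lookup in a sorted map, while a
case-insensitive match has to visit and lower-case every stored term, which is
roughly the pattern described above for CaseTermQuery:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a term dictionary; not Lucene or Jackrabbit code. */
final class ToyTermIndex {
    // term -> ids of the nodes that hold exactly this value
    private final TreeMap<String, List<Integer>> terms = new TreeMap<>();

    void add(String term, int nodeId) {
        terms.computeIfAbsent(term, k -> new ArrayList<>()).add(nodeId);
    }

    /** Exact match: one lookup in the sorted map, cheap even for many terms. */
    List<Integer> exact(String term) {
        List<Integer> hits = terms.get(term);
        return hits != null ? hits : Collections.<Integer>emptyList();
    }

    /** Case-insensitive match: visits and lower-cases every stored term. */
    List<Integer> lowerCaseScan(String lowered) {
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : terms.entrySet()) {
            if (e.getKey().toLowerCase(Locale.ROOT).equals(lowered)) {
                hits.addAll(e.getValue());
            }
        }
        return hits;
    }
}
```

With a million terms, exact() still touches only a handful of map entries,
whereas lowerCaseScan() touches all of them on every query.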

>
> Again, any help will be appreciated.

let me know if this helped,

Regards Ard


Re: Searching for a property

Posted by Dennis van der Laan <d....@rug.nl>.
Dennis van der Laan wrote:
> Hi Ard,
>   
>> Hello Dennis,
>>
>> On Fri, Dec 11, 2009 at 11:24 AM, Dennis van der Laan
>> <d....@rug.nl> wrote:
>>   
>>     
>>> Hi Ard,
>>>
>>> Thanks! The performance went up by a factor x10. Still not what I hoped
>>> for, but I'm not sure the query itself is still a problem.
>>>     
>>>       
>> So now it is 100 ms? That is still not too fast. What is your query?
>>     
Some logging:

2009-12-17 15:51:42,102 DEBUG (208340) [jcr.JcrFileSystem] - created vpath query string: //element(*,nt:unstructured)[fn:lower-case(@cms:virtualPath) = '/_definition/shared/schemas/include/banner.xsd']
2009-12-17 15:51:42,102 DEBUG (208340) [jcr.JcrFileSystem] - vpath query object created
2009-12-17 15:51:42,109 DEBUG (208340) [jcr.JcrFileSystem] - vpath query executed
2009-12-17 15:51:42,109 DEBUG (208340) [jcr.JcrFileSystem] - vpath node iterator created
2009-12-17 15:51:42,109 DEBUG (208340) [jcr.JcrFileSystem] - vpath query done

Then, several hours later:

2009-12-17 22:49:44,533 DEBUG (      ) [jcr.JcrFileSystem] - created vpath query string: //element(*,nt:unstructured)[fn:lower-case(@cms:virtualPath) = '/fwn/onderwijs/roosters/2007/wi/overzicht/overzicht_4.xml']
2009-12-17 22:49:44,534 DEBUG (      ) [jcr.JcrFileSystem] - vpath query object created
2009-12-17 22:49:44,977 DEBUG (      ) [jcr.JcrFileSystem] - vpath query executed
2009-12-17 22:49:44,977 DEBUG (      ) [jcr.JcrFileSystem] - vpath node iterator created
2009-12-17 22:49:44,977 DEBUG (      ) [jcr.JcrFileSystem] - vpath query done

See the increase of time spent on the execution: 400+ ms instead of 7ms.
And this is not a single incident, I see this increase on all queries
like the above.

The memory of the JVM should not be a problem: it's set to 2 GB and only
800 MB is used at the moment the queries are slow. Restarting the
application does not help either.

Again, any help will be appreciated.

Dennis



Re: Searching for a property

Posted by Ard Schrijvers <a....@onehippo.com>.
Hello Dennis,

On Thu, Dec 17, 2009 at 9:50 PM, Dennis van der Laan
<d....@rug.nl> wrote:
>
> Could this mean that there is not enough memory for the Lucene indexes
> and the indexes are read from disk all the time?

The problem is not Lucene: if Lucene is slow, the problem is in *how*
Lucene is used.

> Any idea how large the indexes will become? I have no idea how the

Depends on the size of your repository. We have sizes of around 20 GB.
Depending on what your queries are, you are fine.

> internals of Lucene look like. The virtual paths have an average string
> length of about 50 characters and we end up having about 1 million of
> these properties.

Depending, again, on how you query, this can lead to large memory
consumption, for example if you sort on it. Internally a lot of memory
is taken for it in Lucene, and again in Jackrabbit. So, 50 chars *
4 bytes a char (?) * 10^6 * 2 = 400 MB...
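
The estimate above can be reproduced with a few lines of Java. This is only a
back-of-the-envelope sketch: the class and method names are made up, and
object headers and other Lucene or Jackrabbit overhead are not modelled. Note
that a Java char is 2 bytes, so the 4 bytes per char figure (marked with a
question mark above) is on the high side; with 2 bytes the same formula gives
200 MB.

```java
/** Back-of-the-envelope estimate only; real Lucene/Jackrabbit overhead is not modelled. */
final class TermMemoryEstimate {
    /**
     * @param values       number of property values (about 10^6 in this thread)
     * @param avgChars     average string length (about 50 in this thread)
     * @param bytesPerChar assumed bytes per character (a Java char is 2 bytes)
     * @param copies       number of in-memory copies (2: one in Lucene, one in Jackrabbit)
     */
    static long estimateBytes(long values, int avgChars, int bytesPerChar, int copies) {
        return values * avgChars * bytesPerChar * copies;
    }
}
```

For example, estimateBytes(1_000_000, 50, 4, 2) gives 400,000,000 bytes, the
400 MB mentioned above.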


>
> Thanks for any help!

See next mail :-))

Regards Ard


Re: Searching for a property

Posted by Dennis van der Laan <d....@rug.nl>.
Hi Ard,
> Hello Dennis,
>
> On Fri, Dec 11, 2009 at 11:24 AM, Dennis van der Laan
> <d....@rug.nl> wrote:
>   
>> Hi Ard,
>>
>> Thanks! The performance went up by a factor x10. Still not what I hoped
>> for, but I'm not sure the query itself is still a problem.
>>     
>
> So now it is 100 ms? That is still not too fast. What is your query?
> Furthermore, of course, index size matters as well.
>   
Triggered by your remark on index size, I created a new repository and
started filling it up with nodes which have a virtual path property
(cms:virtualPath). At a certain point, I see a significant degradation
of the performance. I made a thread dump to see what the VM was doing
and found this stack trace:

   java.lang.Thread.State: RUNNABLE
        at java.io.RandomAccessFile.readBytes(Native Method)
        at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
        at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:596)
        - locked <0x85523040> (a org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor)
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
        at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:247)
        at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:157)
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:116)
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:92)
        at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:82)
        at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:127)
        at org.apache.lucene.index.SegmentMergeInfo.next(SegmentMergeInfo.java:65)
        at org.apache.lucene.index.MultiSegmentReader$MultiTermEnum.next(MultiSegmentReader.java:494)
        at org.apache.lucene.search.FilteredTermEnum.next(FilteredTermEnum.java:67)
        at org.apache.jackrabbit.core.query.lucene.CaseTermQuery$CaseTermEnum.<init>(CaseTermQuery.java:146)
        at org.apache.jackrabbit.core.query.lucene.CaseTermQuery.getEnum(CaseTermQuery.java:53)
        at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:55)
        at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:383)
        at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:383)
        at org.apache.jackrabbit.core.query.lucene.JackrabbitIndexSearcher.evaluate(JackrabbitIndexSearcher.java:99)
        at org.apache.jackrabbit.core.query.lucene.JackrabbitIndexSearcher.execute(JackrabbitIndexSearcher.java:84)
        at org.apache.jackrabbit.core.query.lucene.SearchIndex.executeQuery(SearchIndex.java:760)
        at org.apache.jackrabbit.core.query.lucene.SingleColumnQueryResult.executeQuery(SingleColumnQueryResult.java:66)
        at org.apache.jackrabbit.core.query.lucene.QueryResultImpl.getResults(QueryResultImpl.java:298)
        at org.apache.jackrabbit.core.query.lucene.SingleColumnQueryResult.<init>(SingleColumnQueryResult.java:58)
        at org.apache.jackrabbit.core.query.lucene.QueryImpl.execute(QueryImpl.java:131)
        at org.apache.jackrabbit.core.query.QueryImpl.execute(QueryImpl.java:177)

Could this mean that there is not enough memory for the Lucene indexes
and the indexes are read from disk all the time?
Any idea how large the indexes will become? I have no idea what the
internals of Lucene look like. The virtual paths have an average string
length of about 50 characters and we end up having about 1 million of
these properties.

Thanks for any help!

Dennis


Re: Searching for a property

Posted by Ard Schrijvers <a....@onehippo.com>.
Hello Dennis,

On Fri, Dec 11, 2009 at 11:24 AM, Dennis van der Laan
<d....@rug.nl> wrote:
> Hi Ard,
>
> Thanks! The performance went up by a factor x10. Still not what I hoped
> for, but I'm not sure the query itself is still a problem.

So now it is 100 ms? That is still not too fast. What is your query?
Furthermore, of course, index size matters as well.

>
> A related question: could it be that when a query returns no results,
> this is slower than when it does return a result? Might it have
> something to do with Lucene not having an index for that particular
> property value?

No, an inverted index structure does not suffer from this

Regards Ard


Re: Searching for a property

Posted by Dennis van der Laan <d....@rug.nl>.
Hi Ard,

Thanks! The performance went up by a factor of 10. Still not what I hoped
for, but I'm not sure the query itself is still the problem.

A related question: could it be that when a query returns no results,
this is slower than when it does return a result? Might it have
something to do with Lucene not having an index for that particular
property value?

> Hello Dennis,
>
> It's because you're using a child axis twice (once from jcr:root and once
> within the where clause) that the query is slow. Explaining why is out
> of scope for the user list, but I wrote quite some time ago a few
> guidelines (most of them still valid):
>
> http://n4.nabble.com/Explanation-and-solutions-of-some-Jackrabbit-queries-regarding-performance-td516614.html#a516614
>
> I am not sure what nodetype jcr:content is, but suppose: my:contenttype
>
> now, if your query would be:
>
> //element(*,my:contenttype)[fn:lower-case(@cms:virtualPath)= '" + vpath + "']";
>
> the query will be instant. Just take the parent node of the result and
> you should be fine. 

> Just wondering, are you building a brand new cms
> on jcr? I am not sure what the @cms:virtualPath holds, but if you also
> need virtual environments showing the same jcr nodes in different tree
> structures you might wanna take a look here [1].
>   
We're not building a brand new CMS, we're migrating our old Oracle iFS
storage to a JCR repository. The CMS itself stays the same.

regards,
Dennis
> Regards Ard
>
> [1] http://www.onehippo.org/cms7
>
>


-- 
Dennis van der Laan


Re: Searching for a property

Posted by Ard Schrijvers <a....@onehippo.com>.
Hello Dennis,

It's because you're using a child axis twice (once from jcr:root and once
within the where clause) that the query is slow. Explaining why is out
of scope for the user list, but quite some time ago I wrote a few
guidelines (most of them still valid):

http://n4.nabble.com/Explanation-and-solutions-of-some-Jackrabbit-queries-regarding-performance-td516614.html#a516614

I am not sure what node type jcr:content is, but suppose it is my:contenttype.

Now, if your query would be:

//element(*,my:contenttype)[fn:lower-case(@cms:virtualPath)= '" + vpath + "']";

the query will be instant. Just take the parent node of the result and
you should be fine. Just wondering, are you building a brand new CMS
on JCR? I am not sure what @cms:virtualPath holds, but if you also
need virtual environments showing the same JCR nodes in different tree
structures you might want to take a look here [1].
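
A sketch of the two query shapes side by side, as plain Java string builders.
The class and method names are hypothetical, and my:contenttype is only the
placeholder node type used above. The escaped, lower-cased path is assumed to
be prepared by the caller, as in the original snippet:

```java
/** Hypothetical query builders; only the query strings themselves come from the thread. */
final class VirtualPathQueries {
    /** Original shape: starts at jcr:root and follows jcr:content inside the predicate. */
    static String viaHierarchyNode(String escapedLowerPath) {
        return "/jcr:root//element(*,nt:hierarchyNode)"
                + "[fn:lower-case(jcr:content/@cms:virtualPath) = '" + escapedLowerPath + "']";
    }

    /** Suggested shape: match the content node directly on its own property. */
    static String viaContentNode(String nodeType, String escapedLowerPath) {
        return "//element(*," + nodeType + ")"
                + "[fn:lower-case(@cms:virtualPath) = '" + escapedLowerPath + "']";
    }
}
```

The second query returns the jcr:content node rather than the file node, so
the caller would take the parent of the result, for example
ni.nextNode().getParent() with the NodeIterator from the original snippet.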

Regards Ard

[1] http://www.onehippo.org/cms7

