Posted to dev@jackrabbit.apache.org by James Hang <jh...@bea.com> on 2007/04/19 01:23:41 UTC

Lucene index

After spending some time running Jackrabbit in debug mode, I noticed some
peculiar behavior in the Lucene SearchIndex implementation.  

When indexing of a Node occurs via the AbstractIndex.addDocument() method,
the Lucene Document object being indexed seems to contain all the indexed
fields, i.e. all the properties of the node, the extracted fulltext terms,
etc.  

However, during a search operation, on the call to
SearchIndex.executeQuery(), the Document objects being returned from the
search contain only some of the indexed fields.  In fact, for all of the
Document objects, only these 5 fields are present:

_:UUID
_:PARENT
_:PROPERTIES[0] "3:versionHistory"
_:PROPERTIES[1] "3:baseVersion"
_:PROPERTIES[2] "3:predecessor"

I know that Jackrabbit only really needs the _:UUID field so that it can
look up the Node, so is it stripping out the other fields at some point? 

We've noticed that for large result sets (1000+ nodes), performance can
drag because each Node lookup requires at least one database query.  Since
we are only interested in data contained in the Lucene index, it would be
nice if we could get that data from the index and not have to go through the
Jackrabbit PM at all.

Does anyone know if this is possible?
-- 
View this message in context: http://www.nabble.com/Lucene-index-tf3604049.html#a10069152
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.


Re: Lucene index

Posted by James Hang <jh...@bea.com>.
Thanks Marcel,

Storing the properties in the index on index time did the trick.  Thanks!

Definitely would love to see those enhancements you suggested.   Being able
to retrieve results from a search without having to go to the PM would be a
big performance boost for us.

James



Marcel Reutegger wrote:
> 
> Hi James,
> 
> James Hang wrote:
>> After spending some time running Jackrabbit in debug mode, I noticed some
>> peculiar behavior in the Lucene SearchIndex implementation.  
>> 
>> When indexing of a Node occurs via the AbstractIndex.addDocument() method,
>> the Lucene Document object being indexed seems to contain all the indexed
>> fields, i.e. all the properties of the node, the extracted fulltext terms,
>> etc.
>> 
>> However, during a search operation, on the call to
>> SearchIndex.executeQuery(), the Document objects being returned from the
>> search only contains some of the indexed fields.  In fact for all of the
>> Document objects, only these 5 fields are present:
>> 
>> _:UUID
>> _:PARENT
>> _:PROPERTIES[0] "3:versionHistory"
>> _:PROPERTIES[1] "3:baseVersion"
>> _:PROPERTIES[2] "3:predecessor"
>> 
>> I know that Jackrabbit only really needs the _:UUID field so that it can
>> look up the Node, so is it stripping out the other fields at some point? 
> 
> The other properties you see in the lucene document are all reference 
> properties. Those must also be *stored* in the index to resolve them in a 
> jcr:deref() statement.
> 
>> We've noticed that for large result sets (1000+ nodes), the performance can
>> drag because each Node lookup requires at least one database query.  Since
>> we are only interested in data contained in the Lucene index, it would be
>> nice if we would get that data from the index and not have to go through the
>> Jackrabbit PM at all.
> 
> this is because the index does not store the property values (in lucene
> terminology) but only uses them to create an inverted index using the values.
> funny enough the values are actually present in the index, but you cannot get
> them using a simple call.
>
> in addition, the presence of the jcr:path property in the query result row
> forces us to use node information to resolve the path (be it from the
> persistence manager or from a cache within jackrabbit). in jackrabbit a path of
> a node is always calculated and not stored literally with it.
> 
>> Does anyone know if this is possible?
> 
> With the current implementation this is not possible, but it is feasible to
> implement it.
>
> The required changes to the jackrabbit would be:
>
> - also store the values of all the other properties within the index. because
> this makes the document instances retrieved from the index much heavier we would
> have to move to lucene 2.1. this version supports lazy loading of document fields.
> - a query result row would then use the values from the index, if available.
> whether property values are stored in the index, should be configurable.
> - calculate the values of the jcr:path column only when requested.
>
> with those changes at least the RowIterator result representation could work
> without a single access to the PM.
> 
> ah, well. since I've put that all in an email I can as well create a jira issue ;)
> 
> http://issues.apache.org/jira/browse/JCR-855
> 
> regards
>   marcel
> 
> 



Re: Lucene index

Posted by Marcel Reutegger <ma...@gmx.net>.
Hi James,

James Hang wrote:
> After spending some time running Jackrabbit in debug mode, I noticed some
> peculiar behavior in the Lucene SearchIndex implementation.  
> 
> When indexing of a Node occurs via the AbstractIndex.addDocument() method,
> the Lucene Document object being indexed seems to contain all the indexed
> fields, i.e. all the properties of the node, the extracted fulltext terms,
> etc.  
> 
> However, during a search operation, on the call to
> SearchIndex.executeQuery(), the Document objects being returned from the
> search only contains some of the indexed fields.  In fact for all of the
> Document objects, only these 5 fields are present:
> 
> _:UUID
> _:PARENT
> _:PROPERTIES[0] "3:versionHistory"
> _:PROPERTIES[1] "3:baseVersion"
> _:PROPERTIES[2] "3:predecessor"
> 
> I know that Jackrabbit only really needs the _:UUID field so that it can
> look up the Node, so is it stripping out the other fields at some point? 

The other properties you see in the Lucene document are all reference 
properties. Those must also be *stored* in the index so that a 
jcr:deref() statement can resolve them.

> We've noticed that for large result sets (1000+ nodes), the performance can
> drag because each Node lookup requires at least one database query.  Since
> we are only interested in data contained in the Lucene index, it would be
> nice if we would get that data from the index and not have to go through the
> Jackrabbit PM at all.

This is because the index does not store the property values (in Lucene 
terminology); it only uses them to build an inverted index. 
Funnily enough, the values are actually present in the index, but you cannot get 
them using a simple call.
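
To illustrate the difference between an indexed and a stored field, here is a
minimal sketch in plain Java. It is a toy model, not Lucene's API: the class
ToyIndex and all of its methods are made up for this example, and it keeps
only one value per field name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of "indexed" versus "stored" fields; this is not Lucene's API. */
public class ToyIndex {
    // term ("field:value") -> ids of the documents containing it (the inverted index)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> only those fields that were explicitly stored
    private final List<Map<String, String>> stored = new ArrayList<>();

    /** Starts a new document and returns its id. */
    public int newDocument() {
        stored.add(new LinkedHashMap<>());
        return stored.size() - 1;
    }

    /** Indexes a field value; keeps a retrievable copy only if store is true. */
    public void addField(int doc, String name, String value, boolean store) {
        postings.computeIfAbsent(name + ":" + value, k -> new TreeSet<>()).add(doc);
        if (store) {
            stored.get(doc).put(name, value);
        }
    }

    /** Returns the ids of documents whose field has exactly this value. */
    public Set<Integer> search(String name, String value) {
        return postings.getOrDefault(name + ":" + value, Collections.emptySet());
    }

    /** Returns what a search hit can hand back: the stored fields only. */
    public Map<String, String> document(int doc) {
        return stored.get(doc);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int doc = index.newDocument();
        index.addField(doc, "_:UUID", "uuid-1", true);  // stored, like the UUID field
        index.addField(doc, "title", "hello", false);   // searchable only
        System.out.println(index.search("title", "hello")); // prints [0]
        System.out.println(index.document(doc));            // prints {_:UUID=uuid-1}
    }
}
```

The "title" value can be matched by a search, yet the document handed back
for the hit carries only the stored "_:UUID" field, which mirrors what James
saw in the debugger.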

In addition, the presence of the jcr:path property in the query result row 
forces us to use node information to resolve the path (be it from the 
persistence manager or from a cache within Jackrabbit). In Jackrabbit the path of 
a node is always calculated and never stored literally with it.
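
As a rough illustration of why a calculated path needs node information, the
sketch below walks parent links up to the root. It is a toy model under my own
assumptions: PathResolver, its map layout and the node ids are hypothetical
and do not reflect Jackrabbit's internal classes.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy model: a path is derived by walking parent links, never stored. */
public class PathResolver {
    // node id -> {parent id (null for the root), node name}; hypothetical layout
    private final Map<String, String[]> nodes = new HashMap<>();

    public void addNode(String id, String parentId, String name) {
        nodes.put(id, new String[] {parentId, name});
    }

    /** Builds the path by following parent ids; one lookup per ancestor. */
    public String pathOf(String id) {
        Deque<String> names = new ArrayDeque<>();
        String current = id;
        while (current != null) {
            String[] entry = nodes.get(current);
            if (entry == null) {
                throw new IllegalArgumentException("unknown node: " + current);
            }
            if (entry[0] != null) {
                names.addFirst(entry[1]); // the root contributes no name segment
            }
            current = entry[0];
        }
        return "/" + String.join("/", names);
    }

    public static void main(String[] args) {
        PathResolver resolver = new PathResolver();
        resolver.addNode("root", null, "");
        resolver.addNode("n1", "root", "a");
        resolver.addNode("n2", "n1", "b");
        System.out.println(resolver.pathOf("n2")); // prints /a/b
    }
}
```

Every step of the loop needs the parent and name of one more node, which is
the kind of lookup that ends up at a cache or the persistence manager.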

> Does anyone know if this is possible?

With the current implementation this is not possible, but it is feasible to 
implement it.

The required changes to Jackrabbit would be:

- Also store the values of all the other properties within the index. Because 
this makes the document instances retrieved from the index much heavier, we would 
have to move to Lucene 2.1, which supports lazy loading of document fields.
- A query result row would then use the values from the index, if available. 
Whether property values are stored in the index should be configurable.
- Calculate the values of the jcr:path column only when requested.

With those changes, at least the RowIterator result representation could work 
without a single access to the PM.
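
The proposed row behaviour could look roughly like the sketch below. It only
illustrates the idea and is not how Jackrabbit implements it: LazyRow, its
constructor arguments and the pmAccesses counter are all hypothetical, and the
persistence manager and the path calculation are replaced by simple stand-ins.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

/** Toy result row; none of these names are Jackrabbit classes. */
public class LazyRow {
    private final Map<String, String> storedValues;   // values stored in the index, possibly none
    private final Function<String, String> pmLookup;  // stand-in for a persistence manager read
    private final Supplier<String> pathResolver;      // stand-in for the path calculation
    private String path;                              // cached after the first request
    private int pmAccesses = 0;

    public LazyRow(Map<String, String> storedValues,
                   Function<String, String> pmLookup,
                   Supplier<String> pathResolver) {
        this.storedValues = storedValues;
        this.pmLookup = pmLookup;
        this.pathResolver = pathResolver;
    }

    /** Returns a column value, using the PM stand-in only when the index lacks it. */
    public String getValue(String column) {
        if ("jcr:path".equals(column)) {
            if (path == null) {
                path = pathResolver.get(); // calculated only when requested
            }
            return path;
        }
        String value = storedValues.get(column);
        if (value != null) {
            return value;
        }
        pmAccesses++;
        return pmLookup.apply(column);
    }

    public int pmAccesses() {
        return pmAccesses;
    }

    public static void main(String[] args) {
        Map<String, String> stored = new HashMap<>();
        stored.put("title", "hello");
        LazyRow row = new LazyRow(stored, column -> "from-pm:" + column, () -> "/a/b");
        System.out.println(row.getValue("title"));  // prints hello
        System.out.println(row.pmAccesses());       // prints 0
        System.out.println(row.getValue("author")); // prints from-pm:author
        System.out.println(row.pmAccesses());       // prints 1
    }
}
```

As long as a caller asks only for columns whose values are stored in the
index and never requests jcr:path, neither stand-in is invoked.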

Ah, well. Since I've put all that in an email, I might as well create a Jira issue ;)

http://issues.apache.org/jira/browse/JCR-855

regards
  marcel