Posted to dev@jackrabbit.apache.org by Ard Schrijvers <a....@onehippo.com> on 2010/02/17 16:48:07 UTC

[jr3] Restructure Lucene indexing & make use of Lucene 2.9 features

Currently, we index *all* properties into the same Lucene field. There
is an issue for this already [1]. I think we can gain a lot by having
each property indexed in its own Lucene field. That would spare us
many of the custom queries and the term caches we maintain ourselves,
which end up using a lot of memory.

Furthermore, if we want to use the Lucene 2.9 style of range queries
for dates, doubles and longs, I think we need to refactor to this 1:1
mapping anyway. Currently, range queries over, say, 100,000 dates in
Jackrabbit are quite slow. It will, however, be a backwards
incompatible move (existing indexes would need to be rebuilt), and I
think it touches quite a lot of code. Also, Lucene 2.9 is
incompatible with earlier versions of Lucene.
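
To illustrate why trie-style range queries can be cheaper, here is a
minimal Java sketch of the underlying idea: each number is indexed at
several precisions, so a range can be covered by a few coarse prefix
terms plus some exact terms at the edges. It is a simplified
illustration for non-negative longs, not Lucene's actual code; the
class TrieRangeSketch, its split method and the precision step of
4 bits are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

class TrieRangeSketch {

    /** A block of values [min, max] matched by prefix terms (value >>> shift). */
    static final class SubRange {
        final long min;
        final long max;
        final int shift;

        SubRange(long min, long max, int shift) {
            this.min = min;
            // Bits below 'shift' are ignored by the prefix term, so the
            // block extends to the largest value sharing the last prefix.
            this.max = max | ((1L << shift) - 1L);
            this.shift = shift;
        }

        /** Number of index terms that have to be visited for this block. */
        long termCount() {
            return (max >>> shift) - (min >>> shift) + 1;
        }

        /** Number of distinct values the block covers. */
        long valueCount() {
            return max - min + 1;
        }
    }

    /** Splits [lo, hi] into blocks, using coarser prefixes towards the middle. */
    static List<SubRange> split(long lo, long hi, int step) {
        if (lo < 0 || lo > hi || step < 1 || step > 32) {
            throw new IllegalArgumentException("need 0 <= lo <= hi and 1 <= step <= 32");
        }
        List<SubRange> out = new ArrayList<>();
        long min = lo;
        long max = hi;
        for (int shift = 0; ; shift += step) {
            long diff = 1L << (shift + step);
            long mask = ((1L << step) - 1L) << shift;
            boolean hasLower = (min & mask) != 0L;   // ragged lower edge at this precision
            boolean hasUpper = (max & mask) != mask; // ragged upper edge at this precision
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            boolean wrapped = nextMin < min || nextMax > max;
            if (shift + step >= 64 || nextMin > nextMax || wrapped) {
                // No coarser precision is usable: cover the rest at this one.
                out.add(new SubRange(min, max, shift));
                return out;
            }
            if (hasLower) {
                out.add(new SubRange(min, min | mask, shift));
            }
            if (hasUpper) {
                out.add(new SubRange(max & ~mask, max, shift));
            }
            min = nextMin;
            max = nextMax;
        }
    }

    public static void main(String[] args) {
        long terms = 0;
        for (SubRange r : split(3, 1000, 4)) {
            System.out.println("shift " + r.shift + ": " + r.min + ".." + r.max);
            terms += r.termCount();
        }
        System.out.println("terms visited: " + terms);
    }
}
```

For the range 3..1000 the blocks together cover all 998 values while
visiting far fewer terms than that, which is the effect we would want
for date, double and long properties.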

Regards Ard

[1] http://issues.apache.org/jira/browse/JCR-1080

Re: [jr3] Restructure Lucene indexing & make use of Lucene 2.9 features

Posted by Felix Meschberger <fm...@gmail.com>.
Hi,

On 18.02.2010 10:37, Ard Schrijvers wrote:
> On Thu, Feb 18, 2010 at 10:35 AM, Ard Schrijvers
> <a....@onehippo.com> wrote:
>> On Wed, Feb 17, 2010 at 5:15 PM, Thomas Müller <th...@day.com> wrote:
>>> Hi
>>>
>>>> each property indexed in its own Lucene field
>>>
>>> Could you explain in more details? What is a 1:1 mapping? Do you mean
>>> each property type should have it's own index, or each property name
>>> should have its own index? Would this not increase the number of
>>> Lucene index files a lot?
>>
>> No, I mean that the current implementation is based on a Lucene
>> version that could not handle infinit number of unique lucene fields
>> (jcr can have any property name). Therefor, Jackrabbit indexes every
>> different jcr fieldname in the same lucene field, but prefix the value
>> with the jackrabbit fieldname. This has quite some disadvantages,
>> memory loss (terms are cached in lucene and in jackrabbit without the
>> fieldname prefix), and I think we cannot easily make use of the Lucene
>> trie range stuff making range queries on dates, doubles and longs
>> efficient
> 
> Addon: So my improvement would be to suggest to index every unique jcr
> fieldname in a unique lucene field, and do not prefix values as
> currently is being done. This makes lots of the lucene classes and
> queries in jr easier or redundant

I am by no means an expert in this field, but this sounds very much
doable in the shorter time frame of a 2.x release, right?

Regards
Felix

> 
> Ard
> 
>>
>> Regards Ard
>>
>>>
>>> Regards,
>>> Thomas
>>>
>>
> 


Re: [jr3] Restructure Lucene indexing & make use of Lucene 2.9 features

Posted by Ard Schrijvers <a....@onehippo.com>.
On Thu, Feb 18, 2010 at 1:27 PM, Alexander Klimetschek <ak...@day.com> wrote:
> On Thu, Feb 18, 2010 at 10:37, Ard Schrijvers <a....@onehippo.com> wrote:
>> Addon: So my improvement would be to suggest to index every unique jcr
>> fieldname in a unique lucene field, and do not prefix values as
>> currently is being done. This makes lots of the lucene classes and
>> queries in jr easier or redundant
>
> +1
>
> And as Felix noted, this is "just" an internal improvement to the
> Lucene search index and can be done quite early in 2.x. The only
> question is the migration of indexes. This could be done by still
> supporting old-style indexes (for 2.x releases), but when a new index
> is created, the newer variant is chosen. Existing repositories that
> upgrade could then chose, ie. delete and re-index to get the new
> structure.

If we do this, I would also like to move to Lucene 2.9, which is
incompatible with older versions, even in its API. I really think
backward support for existing indexes would be too much pain: people
should be able to re-index, which would take care of the old-style
indexes. Wouldn't it be reasonable to have people re-index their
content if they want to move to the latest repository with Lucene 2.9,
which is itself incompatible with older versions (though I am not sure
whether it can read existing older indexes)?

>
> If this makes the implementation too difficult, we could simply offer

I am having a headache already, so I think yes :-)

> a different SearchIndex implementation, so one can chose via the
> configuration.

That would be easier. Unfortunately, all the 'old' classes would still
sit redundantly in the core in the new version... though we could
move them to a separate package.

Ard

>
> Regards,
> Alex
>
> --
> Alexander Klimetschek
> alexander.klimetschek@day.com
>

Re: [jr3] Restructure Lucene indexing & make use of Lucene 2.9 features

Posted by Alexander Klimetschek <ak...@day.com>.
On Thu, Feb 18, 2010 at 10:37, Ard Schrijvers <a....@onehippo.com> wrote:
> Addon: So my improvement would be to suggest to index every unique jcr
> fieldname in a unique lucene field, and do not prefix values as
> currently is being done. This makes lots of the lucene classes and
> queries in jr easier or redundant

+1

And as Felix noted, this is "just" an internal improvement to the
Lucene search index and can be done quite early in 2.x. The only
question is the migration of indexes. This could be done by still
supporting old-style indexes (for 2.x releases), but when a new index
is created, the newer variant is chosen. Existing repositories that
upgrade could then choose, i.e. delete and re-index to get the new
structure.

If this makes the implementation too difficult, we could simply offer
a different SearchIndex implementation, so one can choose via the
configuration.
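
The selection rule could look roughly like the following Java sketch.
The names LayoutSelectionSketch, Layout and choose are hypothetical
and only illustrate the policy; nothing here is existing Jackrabbit
API.

```java
class LayoutSelectionSketch {

    /** The two index structures under discussion. */
    enum Layout { SHARED_FIELD, FIELD_PER_PROPERTY }

    /**
     * Picks the layout to use when the search index is opened.
     *
     * @param existing   layout of an index already on disk, or null if none exists
     * @param configured layout requested in the configuration, or null if unset
     */
    static Layout choose(Layout existing, Layout configured) {
        if (existing != null) {
            // An old index keeps working until it is deleted and re-indexed.
            return existing;
        }
        // A fresh index gets the configured layout, defaulting to the new one.
        return configured != null ? configured : Layout.FIELD_PER_PROPERTY;
    }

    public static void main(String[] args) {
        System.out.println(choose(Layout.SHARED_FIELD, null));
        System.out.println(choose(null, null));
    }
}
```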

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: [jr3] Restructure Lucene indexing & make use of Lucene 2.9 features

Posted by Thomas Müller <th...@day.com>.
Hi,

Thanks for the explanation!

> index every unique jcr fieldname in a unique lucene field, and do not prefix
> values as currently is being done.

This sounds very reasonable.

Regards,
Thomas

Re: [jr3] Restructure Lucene indexing & make use of Lucene 2.9 features

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Feb 18, 2010 at 10:37 AM, Ard Schrijvers
<a....@onehippo.com> wrote:
> Addon: So my improvement would be to suggest to index every unique jcr
> fieldname in a unique lucene field, and do not prefix values as
> currently is being done. This makes lots of the lucene classes and
> queries in jr easier or redundant

+1

BR,

Jukka Zitting

Re: [jr3] Restructure Lucene indexing & make use of Lucene 2.9 features

Posted by Ard Schrijvers <a....@onehippo.com>.
On Thu, Feb 18, 2010 at 10:35 AM, Ard Schrijvers
<a....@onehippo.com> wrote:
> On Wed, Feb 17, 2010 at 5:15 PM, Thomas Müller <th...@day.com> wrote:
>> Hi
>>
>>> each property indexed in its own Lucene field
>>
>> Could you explain in more details? What is a 1:1 mapping? Do you mean
>> each property type should have it's own index, or each property name
>> should have its own index? Would this not increase the number of
>> Lucene index files a lot?
>
> No, I mean that the current implementation is based on a Lucene
> version that could not handle infinit number of unique lucene fields
> (jcr can have any property name). Therefor, Jackrabbit indexes every
> different jcr fieldname in the same lucene field, but prefix the value
> with the jackrabbit fieldname. This has quite some disadvantages,
> memory loss (terms are cached in lucene and in jackrabbit without the
> fieldname prefix), and I think we cannot easily make use of the Lucene
> trie range stuff making range queries on dates, doubles and longs
> efficient

Addendum: my suggested improvement is to index every unique JCR field
name in its own Lucene field, without prefixing the values as is
currently done. This would make many of the Lucene classes and
queries in Jackrabbit simpler or redundant.

Ard

>
> Regards Ard
>
>>
>> Regards,
>> Thomas
>>
>

Re: [jr3] Restructure Lucene indexing & make use of Lucene 2.9 features

Posted by Ard Schrijvers <a....@onehippo.com>.
On Wed, Feb 17, 2010 at 5:15 PM, Thomas Müller <th...@day.com> wrote:
> Hi
>
>> each property indexed in its own Lucene field
>
> Could you explain in more details? What is a 1:1 mapping? Do you mean
> each property type should have it's own index, or each property name
> should have its own index? Would this not increase the number of
> Lucene index files a lot?

No, I mean that the current implementation is based on a Lucene
version that could not handle an unbounded number of unique Lucene
fields (JCR can have any property name). Therefore, Jackrabbit indexes
every distinct JCR field name in the same Lucene field, but prefixes
the value with the field name. This has quite some disadvantages:
wasted memory (terms are cached in Lucene, and again in Jackrabbit
without the field name prefix), and I think we cannot easily make use
of the Lucene trie range support that makes range queries on dates,
doubles and longs efficient.
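
To make the difference concrete, here is a minimal Java sketch of the
two layouts, using plain sorted sets instead of a real Lucene index.
The class FieldLayoutSketch, its method names and the separator
character are assumptions for illustration, not Jackrabbit's actual
code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class FieldLayoutSketch {

    // Hypothetical separator between field name and value in the shared field.
    static final char SEP = '\uFFFF';

    /** Shared-field layout: one sorted term set holding "name + SEP + value". */
    static SortedSet<String> sharedField(Map<String, List<String>> props) {
        SortedSet<String> terms = new TreeSet<>();
        for (Map.Entry<String, List<String>> e : props.entrySet()) {
            for (String value : e.getValue()) {
                terms.add(e.getKey() + SEP + value);
            }
        }
        return terms;
    }

    /** Field-per-property layout: one sorted term set per field name. */
    static Map<String, SortedSet<String>> fieldPerProperty(Map<String, List<String>> props) {
        Map<String, SortedSet<String>> fields = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : props.entrySet()) {
            fields.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
        }
        return fields;
    }

    /** Shared field: scan the terms carrying the prefix and strip it again. */
    static List<String> valuesFromShared(SortedSet<String> terms, String name) {
        String prefix = name + SEP;
        List<String> values = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so the prefix block has ended
            }
            values.add(term.substring(prefix.length()));
        }
        return values;
    }

    /** Field per property: the terms of the field are the values themselves. */
    static List<String> valuesFromField(Map<String, SortedSet<String>> fields, String name) {
        SortedSet<String> terms = fields.get(name);
        return terms == null ? new ArrayList<>() : new ArrayList<>(terms);
    }

    public static void main(String[] args) {
        Map<String, List<String>> props = new TreeMap<>();
        props.put("title", Arrays.asList("b", "a"));
        props.put("author", Arrays.asList("ard"));
        System.out.println(valuesFromShared(sharedField(props), "title"));
        System.out.println(valuesFromField(fieldPerProperty(props), "title"));
    }
}
```

Both lookups return the same values, but the shared field needs the
prefix handling on every access, which is the kind of custom code a
field per property would let us drop.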

Regards Ard

>
> Regards,
> Thomas
>

Re: [jr3] Restructure Lucene indexing & make use of Lucene 2.9 features

Posted by Thomas Müller <th...@day.com>.
Hi

> each property indexed in its own Lucene field

Could you explain in more detail? What is a 1:1 mapping? Do you mean
each property type should have its own index, or each property name
should have its own index? Would this not increase the number of
Lucene index files a lot?

Regards,
Thomas