
Posted to users@jackrabbit.apache.org by James Abley <ja...@gmail.com> on 2010/02/10 23:05:32 UTC

Potential performance improvement?

Hi,

Just curious if any of the devs are familiar with zoie [1],[2] and know
whether it might be useful in Jackrabbit?

Cheers,

James

[1] http://code.google.com/p/zoie
[2]
http://code.google.com/p/zoie/wiki/Performance_Comparisons_for_ZoieLucene24ZoieLucene29LuceneNRT

Re: Potential performance improvement?

Posted by Ard Schrijvers <a....@onehippo.com>.

On Tue, Feb 16, 2010 at 10:33 AM, Alexander Klimetschek
<ak...@day.com> wrote:
> On Mon, Feb 15, 2010 at 13:28, Marcel Reutegger
> <ma...@gmx.net> wrote:
>> On Fri, Feb 12, 2010 at 14:47, Alexander Klimetschek <ak...@day.com> wrote:
>>> On Fri, Feb 12, 2010 at 13:33, Marcel Reutegger
>>> <ma...@gmx.net> wrote:
>>>> jackrabbit does it in a similar way for quite some time now.
>>>
>>> To me it sounds like this partial-temporary-indexing feature should be
>>> part of Lucene directly (configurable, of course).
>>
>> well, it's not that easy. jackrabbit makes use of many assumptions and
>> implementation specific properties of the content that is indexed.
>> e.g. nodes are uniquely identifiable and it is not required to
>> immediately persist the index on commit. it is sufficient that a redo
>> log contains enough information to replay the changes. all this cannot
>> be moved easily into a more generic library like lucene. however there
>> is interesting work going on with the near-real-time index that we
>> might want to use in the future.
>
> I see. The near-real-time index sounds great (however, "real-time"
> always has to be taken carefully ;-)).

I scanned http://code.google.com/p/zoie/ and, although it is not totally
clear from the documentation, I assume that they indeed have, as
Marcel points out, something similar to Jackrabbit's indexing
strategy, namely a read-only multi-index reader plus one in-memory index.
AFAIK, it is also similar to Lucene's Ocean Realtime Search [1].

As the current implementation in Jackrabbit already has read-only indexes, I
doubt whether the gain from Lucene 2.9 will be that high. By the way, a good
paper on the changes (what is new in 2.9) can be found at [2].
What I do think we can benefit from greatly is trie ranges, as range
queries on, for example, dates are currently really expensive.
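
To illustrate why trie-style ranges make range queries cheaper, here is a minimal sketch. It is not Lucene's implementation: the class and method names are hypothetical, the greedy cover is a simplified stand-in for the real range-splitting logic, and it assumes non-negative values below 2^62. Each value is indexed at several precisions, so a range can be covered by a few coarse prefix terms instead of one term per distinct value.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of trie-style range matching; not Lucene's actual code. */
final class TrieRangeSketch {

    /** Terms indexed for one value: the pair (shift, value >>> shift) at every precision level. */
    static List<long[]> indexTerms(long value, int step, int maxShift) {
        List<long[]> terms = new ArrayList<>();
        for (int shift = 0; shift <= maxShift; shift += step) {
            terms.add(new long[] {shift, value >>> shift});
        }
        return terms;
    }

    /**
     * Greedily covers [lo, hi] with aligned blocks of size 2^shift, where shift is a
     * multiple of step. Assumes 0 <= lo <= hi < 2^62 and maxShift <= 60.
     */
    static List<long[]> queryTerms(long lo, long hi, int step, int maxShift) {
        List<long[]> terms = new ArrayList<>();
        long cur = lo;
        while (cur <= hi) {
            int shift = 0;
            // Grow the block while it stays aligned at cur and does not run past hi.
            while (shift + step <= maxShift) {
                int next = shift + step;
                long size = 1L << next;
                if ((cur & (size - 1)) != 0 || cur + size - 1 > hi) {
                    break;
                }
                shift = next;
            }
            terms.add(new long[] {shift, cur >>> shift});
            cur += 1L << shift;
        }
        return terms;
    }

    /**
     * True if the value falls under one of the query terms. This is equivalent to
     * checking whether any term from indexTerms(value, ...) equals a query term.
     */
    static boolean matches(long value, List<long[]> queryTerms) {
        for (long[] t : queryTerms) {
            if ((value >>> (int) t[0]) == t[1]) {
                return true;
            }
        }
        return false;
    }
}
```

With a step of 4 bits, the range 0..255 collapses into a single prefix term in this sketch, whereas a plain term-enumerating range query would have to visit every distinct value in the range.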

Regards Ard

[1] http://wiki.apache.org/lucene-java/OceanRealtimeSearch
[2] http://www.lucidimagination.com/solutions/whitepapers


Re: Potential performance improvement?

Posted by Alexander Klimetschek <ak...@day.com>.

On Mon, Feb 15, 2010 at 13:28, Marcel Reutegger
<ma...@gmx.net> wrote:
> On Fri, Feb 12, 2010 at 14:47, Alexander Klimetschek <ak...@day.com> wrote:
>> On Fri, Feb 12, 2010 at 13:33, Marcel Reutegger
>> <ma...@gmx.net> wrote:
>>> jackrabbit does it in a similar way for quite some time now.
>>
>> To me it sounds like this partial-temporary-indexing feature should be
>> part of Lucene directly (configurable, of course).
>
> well, it's not that easy. jackrabbit makes use of many assumptions and
> implementation specific properties of the content that is indexed.
> e.g. nodes are uniquely identifiable and it is not required to
> immediately persist the index on commit. it is sufficient that a redo
> log contains enough information to replay the changes. all this cannot
> be moved easily into a more generic library like lucene. however there
> is interesting work going on with the near-real-time index that we
> might want to use in the future.

I see. The near-real-time index sounds great (however, "real-time"
always has to be taken with a grain of salt ;-)).

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: Potential performance improvement?

Posted by Marcel Reutegger <ma...@gmx.net>.

On Fri, Feb 12, 2010 at 14:47, Alexander Klimetschek <ak...@day.com> wrote:
> On Fri, Feb 12, 2010 at 13:33, Marcel Reutegger
> <ma...@gmx.net> wrote:
>> jackrabbit does it in a similar way for quite some time now.
>
> To me it sounds like this partial-temporary-indexing feature should be
> part of Lucene directly (configurable, of course).

well, it's not that easy. jackrabbit makes use of many assumptions and
implementation specific properties of the content that is indexed.
e.g. nodes are uniquely identifiable and it is not required to
immediately persist the index on commit. it is sufficient that a redo
log contains enough information to replay the changes. all this cannot
be moved easily into a more generic library like lucene. however there
is interesting work going on with the near-real-time index that we
might want to use in the future.
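
The redo-log idea above can be sketched roughly as follows. This is an illustration only, not Jackrabbit's code: all class names are hypothetical, the "log" is an in-memory list standing in for a durable append-only file, and the "index" is a plain map keyed by the unique node id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a volatile index whose state can be rebuilt from a redo log. */
final class RedoLogSketch {

    /** One logged change, keyed by the unique node identifier. */
    static final class Entry {
        final boolean add;    // true = add or update, false = delete
        final String nodeId;
        final String text;    // indexed text; ignored for deletes

        Entry(boolean add, String nodeId, String text) {
            this.add = add;
            this.nodeId = nodeId;
            this.text = text;
        }
    }

    /** Stand-in for a durable append-only log (a real one would be written to disk). */
    static final class RedoLog {
        private final List<Entry> entries = new ArrayList<>();

        void append(Entry e) { entries.add(e); }

        List<Entry> entries() { return entries; }
    }

    /** Volatile index (node id to text). It is lost on a crash and rebuilt by replay. */
    static final class VolatileIndex {
        final Map<String, String> docs = new HashMap<>();

        void apply(Entry e) {
            if (e.add) {
                docs.put(e.nodeId, e.text);
            } else {
                docs.remove(e.nodeId);
            }
        }
    }

    /** Commit: record the change in the log first, then update the in-memory index. */
    static void commit(RedoLog log, VolatileIndex index, Entry e) {
        log.append(e);
        index.apply(e);
    }

    /** Recovery: replay the log in order to reconstruct the volatile state. */
    static VolatileIndex replay(RedoLog log) {
        VolatileIndex index = new VolatileIndex();
        for (Entry e : log.entries()) {
            index.apply(e);
        }
        return index;
    }
}
```

Because every node has a unique id, replaying an add or a delete is idempotent per node, which is what lets the index itself stay unpersisted at commit time in this sketch.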

regards
 marcel


Re: Potential performance improvement?

Posted by Alexander Klimetschek <ak...@day.com>.

On Fri, Feb 12, 2010 at 13:33, Marcel Reutegger
<ma...@gmx.net> wrote:
> jackrabbit does it in a similar way for quite some time now.

To me it sounds like this partial-temporary-indexing feature should be
part of Lucene directly (configurable, of course).

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: Potential performance improvement?

Posted by Marcel Reutegger <ma...@gmx.net>.

Hi,

thanks for the pointers.

On Thu, Feb 11, 2010 at 12:36, Alexander Klimetschek <ak...@day.com> wrote:
> On Wed, Feb 10, 2010 at 23:05, James Abley <ja...@gmail.com> wrote:
>> Just curious if any of the devs are familiar with zoie [1],[2] and know
>> whether it might be useful in Jackrabbit?
>
> IIUC, the additional advantage of zoie over standard Lucene is that it
> makes new documents immediately available for searches by having them
> integrated into the index via a temporary in-memory representation.
> Thus it avoids the time it takes to merge the newly indexed data back
> into the full disk-based index.

jackrabbit has been doing it in a similar way for quite some time now.

some of the features of zoie are also considered useful for jackrabbit
and jira issues have been filed already. e.g. optimize only a single
segment based on how many nodes are marked as deleted.
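
As a rough illustration of the last point, one could pick a single segment to optimize by its share of deleted documents. The names, the ratio-based rule, and the threshold below are assumptions made for this sketch; they are not taken from zoie or from the Jackrabbit issues.

```java
import java.util.List;

/** Hypothetical sketch: choose one segment to optimize by its deleted-document ratio. */
final class SegmentOptimizeSketch {

    static final class Segment {
        final String name;
        final int docCount;      // total documents, including deleted ones
        final int deletedCount;  // documents marked as deleted

        Segment(String name, int docCount, int deletedCount) {
            this.name = name;
            this.docCount = docCount;
            this.deletedCount = deletedCount;
        }

        double deletedRatio() {
            return docCount == 0 ? 0.0 : (double) deletedCount / docCount;
        }
    }

    /** Returns the segment with the highest deleted ratio at or above the threshold, or null. */
    static Segment pickForOptimize(List<Segment> segments, double minDeletedRatio) {
        Segment best = null;
        for (Segment s : segments) {
            if (s.deletedRatio() >= minDeletedRatio
                    && (best == null || s.deletedRatio() > best.deletedRatio())) {
                best = s;
            }
        }
        return best;
    }
}
```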

regards
 marcel

> But it still needs to wait for the time it takes to actually index
> documents (to make up the temporary, partial in-memory index), right?
> In my experience, in practical use of Jackrabbit with binary
> documents, most of the time is spent on full-text extraction from
> various file formats. That's why this is postponed via a queue (on
> demand, if it takes too long), to speed up session.save() and
> update the full-text index later. The actual process of merging index
> segments is already quite fast, at least AFAIK.

Re: Potential performance improvement?

Posted by Alexander Klimetschek <ak...@day.com>.

On Wed, Feb 10, 2010 at 23:05, James Abley <ja...@gmail.com> wrote:
> Just curious if any of the devs are familiar with zoie [1],[2] and know
> whether it might be useful in Jackrabbit?

IIUC, the additional advantage of zoie over standard Lucene is that it
makes new documents immediately available for searches by having them
integrated into the index via a temporary in-memory representation.
Thus it avoids the time it takes to merge the newly indexed data back
into the full disk-based index.
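
A minimal sketch of that idea, as I understand it: new documents go into a small in-memory tier that is searched together with the persisted tier, and a later flush folds the in-memory tier into the persisted one. The class is hypothetical, both tiers are plain maps here, and nothing in it reflects zoie's or Lucene's real API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: an in-memory tier searched alongside a persisted tier. */
final class TwoTierIndexSketch {

    // Stand-in for the disk-based index (term to document ids).
    private final Map<String, Set<String>> persisted = new HashMap<>();
    // Recently added documents, searchable before any merge happens.
    private final Map<String, Set<String>> inMemory = new HashMap<>();

    /** Indexes a document into the in-memory tier only. */
    void add(String docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                inMemory.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Searches both tiers and returns the union of matching document ids. */
    Set<String> search(String term) {
        String key = term.toLowerCase(Locale.ROOT);
        Set<String> hits = new TreeSet<>();
        hits.addAll(persisted.getOrDefault(key, Collections.emptySet()));
        hits.addAll(inMemory.getOrDefault(key, Collections.emptySet()));
        return hits;
    }

    /** Folds the in-memory tier into the persisted tier (the "merge" step). */
    void flush() {
        for (Map.Entry<String, Set<String>> e : inMemory.entrySet()) {
            persisted.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
        }
        inMemory.clear();
    }

    int inMemoryTermCount() {
        return inMemory.size();
    }
}
```

In this sketch a document is visible to search() as soon as add() returns, and flush() can run later without changing what a search returns.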

But it still needs to wait for the time it takes to actually index
documents (to make up the temporary, partial in-memory index), right?
In my experience, in practical use of Jackrabbit with binary
documents, most of the time is spent on full-text extraction from
various file formats. That's why this is postponed via a queue (on
demand, if it takes too long), to speed up session.save() and
update the full-text index later. The actual process of merging index
segments is already quite fast, at least AFAIK.
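
The "postpone on demand" behaviour can be sketched like this. It is only an illustration under stated assumptions: the class, its fields, and the timeout handling are hypothetical and do not mirror Jackrabbit's actual classes. Extraction runs in a background thread, save() waits for it only up to a timeout, and anything slower is queued so the full text can be added later.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: defer slow text extraction so that save() returns quickly. */
final class DeferredExtractionSketch {

    final Map<String, String> fullText = new ConcurrentHashMap<>();  // node id to extracted text
    final Queue<String> pending = new ConcurrentLinkedQueue<>();     // node ids awaiting text
    private final Map<String, Future<String>> running = new ConcurrentHashMap<>();
    private final long timeoutMillis;
    // A single daemon worker keeps the sketch simple; a slow job delays the ones behind it.
    private final ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "extractor");
        t.setDaemon(true);
        return t;
    });

    DeferredExtractionSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Returns true if the text was indexed right away, false if it was postponed. */
    boolean save(String nodeId, Callable<String> extractor) {
        Future<String> future = pool.submit(extractor);
        try {
            fullText.put(nodeId, future.get(timeoutMillis, TimeUnit.MILLISECONDS));
            return true;
        } catch (TimeoutException e) {
            // Too slow: remember the job and let the caller continue.
            running.put(nodeId, future);
            pending.add(nodeId);
            return false;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("extraction failed for " + nodeId, e);
        }
    }

    /** Later: wait for the postponed extractions and add their text to the index. */
    void flushPending() {
        String nodeId;
        while ((nodeId = pending.poll()) != null) {
            try {
                fullText.put(nodeId, running.remove(nodeId).get());
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException("extraction failed for " + nodeId, e);
            }
        }
    }

    void shutdown() {
        pool.shutdown();
    }
}
```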

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com