Posted to dev@jackrabbit.apache.org by Ard Schrijvers <a....@onehippo.com> on 2009/09/16 14:56:03 UTC

Improve indexing performance wrt Aggregates

Hello,

I want to avoid recreating a Lucene Document over and over when using
aggregates. This particularly matters when aggregates include binary
data such as PDFs. I think using aggregates for binary data is a good
use case. But indexing and extracting a large PDF is CPU intensive,
and this is currently done multiple times when using aggregates. The
same holds for non-binary data, but the performance penalty is lower.

I would like to add a really simple, small cache to the SearchIndex:

Something like:

private final Map<String, WeakReference<Document>> lastCreatedDocs =
        new LRUMap(500);

(WeakReferences used in case the Lucene Document is really large)

Now, in

protected Document createDocument(NodeState node, NamespaceMappings
        nsMappings, IndexFormatVersion indexFormatVersion)

we start with:

WeakReference<Document> refDoc =
        lastCreatedDocs.get(node.getNodeId().getUUID().toString());
Document doc;
if (refDoc != null && (doc = refDoc.get()) != null) {
    // We already created a Document for this NodeState within
    // the current transaction. Return it directly.
    return doc;
}


and before the return statement:

lastCreatedDocs.put(node.getNodeId().getUUID().toString(),
        new WeakReference<Document>(doc));
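
Putting the two fragments together, a minimal sketch of the cached
method could look like this (createDocumentInternal is a hypothetical
placeholder for the existing body of createDocument):

protected Document createDocument(NodeState node, NamespaceMappings
        nsMappings, IndexFormatVersion indexFormatVersion)
        throws RepositoryException {
    String uuid = node.getNodeId().getUUID().toString();
    WeakReference<Document> refDoc = lastCreatedDocs.get(uuid);
    Document doc;
    if (refDoc != null && (doc = refDoc.get()) != null) {
        // Already created within the current indexing transaction.
        return doc;
    }
    // Hypothetical placeholder for the existing document-building code.
    doc = createDocumentInternal(node, nsMappings, indexFormatVersion);
    lastCreatedDocs.put(uuid, new WeakReference<Document>(doc));
    return doc;
}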

Now, when you have aggregates, then in

mergeAggregatedNodeIndexes(node, doc);

you'll get a cached version of the node to merge back. This way,
within a *single* indexing transaction, all nodes are indexed by
Lucene once, and not, possibly, many times.

Obviously, there is one catch: on the next indexing transaction, we
need to flush the lastCreatedDocs cache, as a node might have been
changed. If we add a method to the QueryHandler interface, something
like

void flush()

and we call this whenever we are going to index a new set of nodes, I
think we should be fine. AFAICS if in MultiIndex

synchronized void update(Iterator remove, Iterator add) throws IOException {

we do a

handler.flush()

we are there.
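
As a rough sketch of that wiring (assuming MultiIndex holds a
reference to its SearchIndex in a 'handler' field; the exact placement
may differ in the real patch):

// New method on the QueryHandler interface:

/**
 * Flushes any per-transaction state, such as cached Documents.
 */
void flush();

// Implementation in SearchIndex:

public void flush() {
    lastCreatedDocs.clear();
}

// Call site in MultiIndex:

synchronized void update(Iterator remove, Iterator add) throws IOException {
    // Drop Documents cached during the previous indexing transaction.
    handler.flush();
    // ... existing remove/add logic ...
}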

What do others think? If people like it, I'll create a patch to test.

Regards Ard

Re: Improve indexing performance wrt Aggregates

Posted by Ard Schrijvers <a....@onehippo.com>.
>> I am just not sure about the affected and fix versions. What
>> do you think?
>
> Enhancements don't need an affected version IMO, and the fix version
> can be set later when the issue is closed, unless you want to indicate
> that an issue should block a release. In that case you set the fix
> version to the given, yet unreleased, version.
>
>> Should I implement it in the trunk, and then see whether
>> or not we should backport it to other branches?
>
> I think it is a good practice to first implement a new feature or
> enhancement in trunk and then backport it when there is a need.

Thank you, I'll remove the affects and fix versions. Hope to have a
patch ready next week,

Regards Ard

>
> regards
>  marcel
>

Re: Improve indexing performance wrt Aggregates

Posted by Marcel Reutegger <ma...@gmx.net>.
Hi,

On Fri, Sep 18, 2009 at 10:29, Ard Schrijvers <a....@onehippo.com> wrote:
> See [1].

thanks.

> I am just not sure about the affected and fix versions. What
> do you think?

Enhancements don't need an affected version IMO, and the fix version
can be set later when the issue is closed, unless you want to indicate
that an issue should block a release. In that case you set the fix
version to the given, yet unreleased, version.

> Should I implement it in the trunk, and then see whether
> or not we should backport it to other branches?

I think it is a good practice to first implement a new feature or
enhancement in trunk and then backport it when there is a need.

regards
 marcel

Re: Improve indexing performance wrt Aggregates

Posted by Ard Schrijvers <a....@onehippo.com>.
>
> I'm very interested. I think you should create a jira issue and then we
> can work out the details based on your patch.

See [1]. I am just not sure about the affected and fix versions. What
do you think? Should I implement it in the trunk, and then see whether
or not we should backport it to other branches?

Regards Ard

[1] https://issues.apache.org/jira/browse/JCR-2311

>
> regards
>  marcel
>

Re: Improve indexing performance wrt Aggregates

Posted by Marcel Reutegger <ma...@day.com>.
Hi,

On Thu, Sep 17, 2009 at 11:17, Ard Schrijvers <a....@onehippo.com> wrote:
> If people like the idea, I'll work on a suggested patch.

I'm very interested. I think you should create a jira issue and then we
can work out the details based on your patch.

regards
 marcel

Re: Improve indexing performance wrt Aggregates

Posted by Ard Schrijvers <a....@onehippo.com>.
On Thu, Sep 17, 2009 at 10:58 AM, Marcel Reutegger
<ma...@gmx.net> wrote:
> Hi,
>
> In general I think such a cache makes sense, but we have to be careful
> when reusing Document instances. They may contain Readers that have
> state and, once consumed, will probably be closed and throw an
> exception on the next read.

That is a good point! I think it might be better not to treat it as a
cache (LRUMap kind of thing), but to have the MultiIndex, when it
starts a new transaction, thus at

long transactionId = nextTransactionId++;

reset an alreadyIndexed map in the SearchIndex:

Map<String, WeakReference<Document>> alreadyIndexed

When the MultiIndex finishes its 'synchronized void update(Iterator
remove, Iterator add)', we set alreadyIndexed to null, or clear it.
AFAIK this will never interfere with existing Readers, and after the
update we flush the alreadyIndexed map, making sure no references to
Lucene Documents are kept. I suggest using WeakReference in the
alreadyIndexed map in case the update is really large, resulting in
many, possibly large, Lucene Documents.
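
A minimal sketch of that lifecycle (startIndexingTransaction,
endIndexingTransaction and the searchIndex field are hypothetical
names; the real patch may wire this differently):

// In SearchIndex: a map that only lives for the duration of one update.
private Map<String, WeakReference<Document>> alreadyIndexed;

void startIndexingTransaction() {
    alreadyIndexed = new HashMap<String, WeakReference<Document>>();
}

void endIndexingTransaction() {
    alreadyIndexed = null; // or alreadyIndexed.clear()
}

// In MultiIndex:
synchronized void update(Iterator remove, Iterator add) throws IOException {
    long transactionId = nextTransactionId++;
    searchIndex.startIndexingTransaction();
    try {
        // ... existing remove/add logic, which calls createDocument
        // and consults alreadyIndexed ...
    } finally {
        // Drop all Document references once the update is done.
        searchIndex.endIndexingTransaction();
    }
}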

If people like the idea, I'll work on a suggested patch. (I tested
indexing the jsr-170 spec: it takes around 2 seconds on my laptop, and
the jsr-283 spec even 3+ seconds... you do want to reuse that work in
aggregates, making sure that within one update the indexing is not
done multiple times.)


Regards Ard

>
> regards
>  marcel
>

Re: Improve indexing performance wrt Aggregates

Posted by Marcel Reutegger <ma...@gmx.net>.
Hi,

In general I think such a cache makes sense, but we have to be careful
when reusing Document instances. They may contain Readers that have
state and, once consumed, will probably be closed and throw an
exception on the next read.
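
To illustrate the hazard with a hypothetical example (not code from
the proposal): a Lucene Field backed by a Reader can only be consumed
once, so adding the same cached Document to the index twice would read
from an exhausted, likely closed, Reader:

Document doc = new Document();
// e.g. the text extracted from a PDF, supplied as a java.io.Reader
doc.add(new Field("fulltext", new FileReader("extracted.txt")));

// writer is an org.apache.lucene.index.IndexWriter
writer.addDocument(doc); // consumes the Reader (Lucene closes it)
writer.addDocument(doc); // the second write reads from the exhausted
                         // Reader and throws an exception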

regards
 marcel
