Posted to dev@lucene.apache.org by Steven Parkes <st...@esseff.org> on 2007/05/30 19:52:30 UTC

addIndexes()

I'm cleaning up the patch for LUCENE-847 (factored merge policy) and
noticed a couple of things about the addIndexes methods.

Is there any particular reason that the version that takes a Directory[]
optimizes first? The later merge is going to use the normal logarithmic
stepping; is there a compelling reason why it's better to create one
segment from the existing segments before adding the new indexes? I can
certainly come up with counter examples where this is a really poor
choice.
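
To make the cost question concrete, here is a toy model, not Lucene
code. It counts how many documents get rewritten when the target is
collapsed to one segment before the new segments are appended, versus
appending first and letting a logarithmic policy decide. The class name,
the level() rule, and the "merge whenever mergeFactor segments share a
level" policy are all assumptions made for illustration, and the sketch
leaves out the optimize that addIndexes performs afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: segments are represented by their document counts.
class AddIndexesCostSketch {

    /** Integer log: how often size can be divided by mergeFactor while it stays >= mergeFactor. */
    static int level(long size, int mergeFactor) {
        if (mergeFactor < 2) throw new IllegalArgumentException("mergeFactor must be >= 2");
        int lvl = 0;
        while (size >= mergeFactor) {
            size /= mergeFactor;
            lvl++;
        }
        return lvl;
    }

    /**
     * Assumed logarithmic policy: while some level holds mergeFactor segments,
     * merge them into one. Mutates the list; returns the documents rewritten.
     */
    static long logarithmicMerge(List<Long> segments, int mergeFactor) {
        long rewritten = 0;
        boolean merged = true;
        while (merged) {
            merged = false;
            for (int lvl = 0; lvl <= 63 && !merged; lvl++) {
                List<Long> sameLevel = new ArrayList<>();
                for (long s : segments) {
                    if (sameLevel.size() < mergeFactor && level(s, mergeFactor) == lvl) {
                        sameLevel.add(s);
                    }
                }
                if (sameLevel.size() == mergeFactor) {
                    long total = 0;
                    for (Long s : sameLevel) {
                        total += s;
                        segments.remove(s); // remove(Object): drops one matching segment
                    }
                    segments.add(total);
                    rewritten += total;
                    merged = true;
                }
            }
        }
        return rewritten;
    }

    /** Collapse the target to one segment, append the incoming segments, then apply the policy. */
    static long optimizeFirst(List<Long> target, List<Long> incoming, int mergeFactor) {
        List<Long> segs = new ArrayList<>(target);
        long rewritten = 0;
        if (segs.size() > 1) {
            long total = 0;
            for (long s : segs) total += s;
            segs.clear();
            segs.add(total);
            rewritten += total;
        }
        segs.addAll(incoming);
        return rewritten + logarithmicMerge(segs, mergeFactor);
    }

    /** Append the incoming segments and let the policy decide what, if anything, to merge. */
    static long appendOnly(List<Long> target, List<Long> incoming, int mergeFactor) {
        List<Long> segs = new ArrayList<>(target);
        segs.addAll(incoming);
        return logarithmicMerge(segs, mergeFactor);
    }

    public static void main(String[] args) {
        List<Long> target = Arrays.asList(1000L, 1000L, 1000L);
        List<Long> incoming = Arrays.asList(10L);
        System.out.println("optimize first: " + optimizeFirst(target, incoming, 10)); // 3000
        System.out.println("append only:    " + appendOnly(target, incoming, 10));    // 0
    }
}
```

In this model, adding a 10-document index to three 1000-document
segments rewrites 3000 documents when optimizing first and none when
appending, which is the kind of counterexample meant above.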

The IndexReader[] version has kind of the dual issue: it doesn't use the
merge policy semantics at all. It does one giant merge without regard
for the merge factor, etc. Is there any reason why this is better than
the normal logarithmic stepping case? One thing I can imagine is that it
would be done because it's possible? In other words, by definition (?)
you can open all the necessary files because in this case you already
have IndexReaders created for them and if there were a file descriptor
issue, you'd have died before you got to the addIndexes call.
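
A rough way to picture the file descriptor argument, as a sketch rather
than anything measured: a single pass over N inputs keeps all N open at
once, while staged merging keeps at most mergeFactor open but needs more
merge operations. The class and method names below are made up for the
illustration, and each input is counted as one open handle, which
ignores how many files a real segment uses.

```java
// Toy arithmetic only: one "input" stands for one open reader or segment.
class MergeFanInSketch {

    /** Inputs open at the same time when everything is merged in one pass. */
    static int peakOpenSinglePass(int segmentCount) {
        return segmentCount;
    }

    /** Inputs open at the same time when at most mergeFactor are merged per step. */
    static int peakOpenStaged(int segmentCount, int mergeFactor) {
        return Math.min(segmentCount, mergeFactor);
    }

    /** Merge operations needed to reach one segment, merging at most mergeFactor per step. */
    static int stagedMergeCount(int segmentCount, int mergeFactor) {
        if (mergeFactor < 2) throw new IllegalArgumentException("mergeFactor must be >= 2");
        int merges = 0;
        int remaining = segmentCount;
        while (remaining > 1) {
            int batch = Math.min(mergeFactor, remaining);
            remaining = remaining - batch + 1; // batch inputs become one output
            merges++;
        }
        return merges;
    }

    public static void main(String[] args) {
        System.out.println("single pass peak: " + peakOpenSinglePass(100));    // 100
        System.out.println("staged peak:      " + peakOpenStaged(100, 10));    // 10
        System.out.println("staged merges:    " + stagedMergeCount(100, 10));  // 11
    }
}
```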

And seeing where it was first added (1.3 rc2) it was to support derived
IndexReaders.

The fact that it's doing an end around the merge policy makes me uneasy.
Two things jump out as difficulties, both having to do with how much
flexibility we can delegate to merge policies:

First, it'd be nice if merge policies could decide whether resulting
segments are going to be compound or not (talked about previously on
dev). The upshot of this is that there would be no
get/setUseCompoundFile() in IndexWriter anymore (except the deprecated
version for compatibility). So addIndexes can't decide as it does now.

Second, it's possible that a merge policy doesn't want to optimize
this way. In fact, I think it's reasonable for optimize on some merge
policies not to reduce all the way down to a single segment, but instead
to obey something like maxMergeDocs even when optimize is called. (That
may be debatable; all I care about right now is leaving it up to the
merge policy).
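
As a sketch of what delegating both decisions could look like, assuming
a hypothetical interface (this is not the LUCENE-847 API): the policy
answers whether a merged segment should be compound, and its optimize
plan groups adjacent segments without exceeding maxMergeDocs instead of
always producing one segment. The names Policy, CappedPolicy,
planOptimize, and the compound-file threshold are invented for the
example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical shapes for illustration; segments are document counts.
class MergeDecisionSketch {

    interface Policy {
        /** Whether a merged segment of this size should be written as a compound file. */
        boolean useCompoundFile(long mergedDocCount);

        /** Groups of adjacent segments to merge on optimize; a group of one is left alone. */
        List<List<Long>> planOptimize(List<Long> segments);
    }

    /** Assumed policy: never plan a merged segment above maxMergeDocs. */
    static class CappedPolicy implements Policy {
        private final long maxMergeDocs;
        private final long compoundThreshold; // assumption: small segments go compound

        CappedPolicy(long maxMergeDocs, long compoundThreshold) {
            this.maxMergeDocs = maxMergeDocs;
            this.compoundThreshold = compoundThreshold;
        }

        @Override
        public boolean useCompoundFile(long mergedDocCount) {
            return mergedDocCount <= compoundThreshold;
        }

        @Override
        public List<List<Long>> planOptimize(List<Long> segments) {
            List<List<Long>> groups = new ArrayList<>();
            List<Long> current = new ArrayList<>();
            long currentDocs = 0;
            for (long size : segments) {
                // Close the current group if adding this segment would exceed the cap.
                if (!current.isEmpty() && currentDocs + size > maxMergeDocs) {
                    groups.add(current);
                    current = new ArrayList<>();
                    currentDocs = 0;
                }
                current.add(size);
                currentDocs += size;
            }
            if (!current.isEmpty()) groups.add(current);
            return groups;
        }
    }

    public static void main(String[] args) {
        Policy policy = new CappedPolicy(1000L, 500L);
        List<Long> segments = Arrays.asList(400L, 400L, 400L, 900L, 100L);
        System.out.println(policy.planOptimize(segments)); // [[400, 400], [400], [900, 100]]
    }
}
```

With a cap of 1000, optimize in this sketch leaves three segments rather
than one, which is the behavior described above as a reasonable choice
for some policies.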

These things would argue for making addIndexes(IndexReader[]) like
addIndexes(Directory[]), letting the merge policy decide whether to do
staged merges or not. But that would seriously mess with the API.

Are there good example use cases for addIndexes(IndexReader[])?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

RE: addIndexes()

Posted by Andi Vajda <va...@osafoundation.org>.

On Thu, 31 May 2007, Steven Parkes wrote:

> Hmmm ... something's not meshing for me here.
>
> If I understood what you've said, you have a DbD index to which you are
> addIndexes'ing a memory index? I must have missed something, because
> addIndexes pre- and post-optimizes the target (Dbd) index, not the
> operand (mem) index.

I stand corrected. I'm using an IndexWriter opened on a RAMDirectory to do the 
indexing for a given transaction. Then I call addIndexes([writer]) on the 
IndexWriter backed by the DbDirectory to persist this. This approach has 
turned out to be considerably faster and less noisy in the database (the 
amount of random access changes) than indexing into the DbDirectory backed 
index directly and then optimizing it.

The docs for addIndexes() say "After this completes, the index is optimized." 
I mistakenly thought that there was discussion here about making this no 
longer be the case.

Andi..

RE: addIndexes()

Posted by Steven Parkes <st...@esseff.org>.

Hmmm ... something's not meshing for me here.

If I understood what you've said, you have a DbD index to which you are
addIndexes'ing a memory index? I must have missed something, because
addIndexes pre- and post-optimizes the target (Dbd) index, not the
operand (mem) index.

-----Original Message-----
From: Andi Vajda [mailto:vajda@osafoundation.org] 
Sent: Thursday, May 31, 2007 10:10 AM
To: java-dev@lucene.apache.org
Subject: Re: addIndexes()

On Thu, 31 May 2007, Doug Cutting wrote:

> Steven Parkes wrote:
>> Is there any particular reason that the version that takes a Directory[]
>> optimizes first?
>
> There was, but unfortunately I can't recall it now.  Index merging has
> changed substantially since then, so, whatever it was, it may no longer
> apply.  If no one can think of a good reason to optimize any longer, then
> probably we should remove it, no?

No longer optimizing on this call would impact performance in what I'm doing.
My usage pattern involves indexing in a MemoryIndex and adding that index to
an index backed by a DbDirectory. If the index is not optimized first, the
operation becomes very noisy in the database.

In other words, if that change is made, please let us know so that I can adapt
my code to explicitly optimize the MemoryIndex first.

Thanks !

Andi..

Re: addIndexes()

Posted by Andi Vajda <va...@osafoundation.org>.

On Thu, 31 May 2007, Doug Cutting wrote:

> Steven Parkes wrote:
>> Is there any particular reason that the version that takes a Directory[]
>> optimizes first?
>
> There was, but unfortunately I can't recall it now.  Index merging has 
> changed substantially since then, so, whatever it was, it may no longer 
> apply.  If no one can think of a good reason to optimize any longer, then 
> probably we should remove it, no?

No longer optimizing on this call would impact performance in what I'm doing.
My usage pattern involves indexing in a MemoryIndex and adding that index to 
an index backed by a DbDirectory. If the index is not optimized first, the 
operation becomes very noisy in the database.

In other words, if that change is made, please let us know so that I can adapt 
my code to explicitly optimize the MemoryIndex first.

Thanks !

Andi..

Re: addIndexes()

Posted by Doug Cutting <cu...@apache.org>.

Steven Parkes wrote:
> Is there any particular reason that the version that takes a Directory[]
> optimizes first?

There was, but unfortunately I can't recall it now.  Index merging has 
changed substantially since then, so, whatever it was, it may no longer 
apply.  If no one can think of a good reason to optimize any longer, 
then probably we should remove it, no?

Doug
