Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2010/05/10 11:58:05 UTC

When to use addIndexes and when addIndexesNoOptimize

Hi

As I was working on LUCENE-1585 and understanding more of the differences
between addIndexes and addIndexesNoOptimize, I was wondering why we have
those two methods. It seems like addIndexes's usage is discouraged, no? Can
someone please explain to me why it isn't deprecated, with addIndexesNoOpt
becoming the default one? Are we losing any functionality?

Shai

Re: When to use addIndexes and when addIndexesNoOptimize

Posted by Shai Erera <se...@gmail.com>.

I agree w/ two variants for addIndexes.

About the optimize concerns - I somehow doubt there are apps with tens or
hundreds of segments in the target directory, to which they would like to
add yet another index. Might be, but hard to believe. Anyway, apps could
still call optimize() before that. Note that on the issue I've mentioned
that the jdoc of addIndexes is wrong: it says the index is optimized after
the method completes.

Maybe what we should do is add just the external indexes, as you suggest on
the issue, and not touch the local ones. Then, apps could call maybeMerge
following that, or optimize, whatever they prefer.

I think we can move the discussion to the issue.

Shai
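
The two-variant shape discussed above can be sketched in plain Java. This is only an illustration of the idea (one method name, two parameter types, and optimize() left to the caller). ToyDirectory, ToyReader and ToyWriter are hypothetical stand-ins, not Lucene classes, and the logging is there only to make the call order visible.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-ins for the API shape under discussion; none of these are Lucene classes. */
public class AddIndexesOverloads {

    /** Hypothetical stand-in for a directory holding an index. */
    public static class ToyDirectory {
        public final String name;
        public ToyDirectory(String name) { this.name = name; }
    }

    /** Hypothetical stand-in for a reader, e.g. a filtering extension. */
    public static class ToyReader {
        public final String name;
        public ToyReader(String name) { this.name = name; }
    }

    /** Hypothetical writer that records which calls it received. */
    public static class ToyWriter {
        public final List<String> log = new ArrayList<>();

        // Same method name as the variant below, told apart by parameter type alone.
        public void addIndexes(ToyDirectory... dirs) {
            for (ToyDirectory d : dirs) log.add("dir:" + d.name);
        }

        public void addIndexes(ToyReader... readers) {
            for (ToyReader r : readers) log.add("reader:" + r.name);
        }

        // Neither addIndexes variant calls this; the caller decides if and when.
        public void optimize() { log.add("optimize"); }
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        writer.optimize();                         // optional, caller's choice
        writer.addIndexes(new ToyDirectory("a"));  // resolves to the directory variant
        writer.addIndexes(new ToyReader("b"));     // resolves to the reader variant
        System.out.println(writer.log);
    }
}
```

The compiler picks the variant from the argument type, so the method name does not need to repeat it.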

On Tue, May 11, 2010 at 12:55 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Mon, May 10, 2010 at 11:38 PM, Shai Erera <se...@gmail.com> wrote:
> >> Hmm addDirectories feels a bit too low level
> >
> > I don't mind calling it addIndexes(Directory...), but I don't think it's too
> > low level - whoever executes the method passes Directory... and that's
> > exactly what the method does :). Two addIndexes force you to go read the
> > jdoc, but so will addDirectories. I don't mind either way.
>
> I also like addIndexes because it better reflects what it does, ie you
> are in fact adding indexes, it's just that the indexes are delivered
> via a Directory vs via an IndexReader instance.  Also, putting the
> type of the params seems sort of redundant since it's already
> declared/visible in the method's sig.
>
> >> But advertise this in back compat breaks.
> >
> > I don't think it's a bw break? More of a runtime change IMO. True it can
>
> Sorry, you're right, it's a runtime change.  But for users who
> actually "rely" on it, this is a big change.
>
> > affect performance, but did we ever measure addIndexes w/ and w/o
> > optimize()? Are we sure that optimize() first, then SM merges that follow
> > perform better? I mean, on paper it should. But since we do it only for the
>
> Really the question is the perf effect of changing mergeFactor, because
> this method takes the current index's segment plus all incoming
> segments and does one giant merge.
>
> If we remove the optimize, and there are too many segments in the
> index, and the app doesn't use CFS, they can run out of file
> descriptors due to this.
>
> > target index, we don't really know what's happening in users' apps. It's
> > only documentation in CHANGES, so it can go into either section (BACKWARDS or
> > RUNTIME). I prefer the latter. It can also go like that into trunk's
> > CHANGES.
>
> I agree, it should go into RUNTIME and CHANGES.
>
> >
> >> Good!
> >
> > I'll open an issue to track this.
>
> Thanks!
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: When to use addIndexes and when addIndexesNoOptimize

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Mon, May 10, 2010 at 11:38 PM, Shai Erera <se...@gmail.com> wrote:
>> Hmm addDirectories feels a bit too low level
>
> I don't mind calling it addIndexes(Directory...), but I don't think it's too
> low level - whoever executes the method passes Directory... and that's
> exactly what the method does :). Two addIndexes force you to go read the
> jdoc, but so will addDirectories. I don't mind either way.

I also like addIndexes because it better reflects what it does, ie you
are in fact adding indexes, it's just that the indexes are delivered
via a Directory vs via an IndexReader instance.  Also, putting the
type of the params seems sort of redundant since it's already
declared/visible in the method's sig.

>> But advertise this in back compat breaks.
>
> I don't think it's a bw break? More of a runtime change IMO. True it can

Sorry, you're right, it's a runtime change.  But for users who
actually "rely" on it, this is a big change.

> affect performance, but did we ever measure addIndexes w/ and w/o
> optimize()? Are we sure that optimize() first, then SM merges that follow
> perform better? I mean, on paper it should. But since we do it only for the

Really the question is the perf effect of changing mergeFactor, because
this method takes the current index's segment plus all incoming
segments and does one giant merge.

If we remove the optimize, and there are too many segments in the
index, and the app doesn't use CFS, they can run out of file
descriptors due to this.

> target index, we don't really know what's happening in users' apps. It's
> only documentation in CHANGES, so it can go into either section (BACKWARDS or
> RUNTIME). I prefer the latter. It can also go like that into trunk's
> CHANGES.

I agree, it should go into RUNTIME and CHANGES.

>
>> Good!
>
> I'll open an issue to track this.

Thanks!

Mike
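
A rough sketch of the file-descriptor concern above, in plain Java rather than Lucene code. It assumes a single merge keeps every file of every input segment open at once, and that a non-compound segment has about 8 files. The class name and that per-segment figure are illustrative assumptions, not numbers from this thread or from Lucene.

```java
/** Toy estimate of files held open by one merge; not Lucene code. */
public class OpenFileEstimate {

    // Hypothetical number of files in a non-compound segment (assumption).
    public static final int FILES_PER_SEGMENT = 8;

    /**
     * Estimates the files one merge holds open across all of its input segments.
     * With the compound file format (CFS) each segment is modeled as one file.
     */
    public static int openFiles(int segments, boolean compoundFile) {
        int perSegment = compoundFile ? 1 : FILES_PER_SEGMENT;
        return segments * perSegment;
    }

    public static void main(String[] args) {
        int local = 200;    // segments already in the target index
        int incoming = 50;  // segments arriving from the added indexes
        System.out.println("no CFS, no pre-optimize: " + openFiles(local + incoming, false));
        System.out.println("no CFS, pre-optimized:   " + openFiles(1 + incoming, false));
        System.out.println("CFS, no pre-optimize:    " + openFiles(local + incoming, true));
    }
}
```

Under these assumptions the count grows linearly with the number of local segments unless they are optimized first or stored as CFS, which is the trade-off being weighed.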

Re: When to use addIndexes and when addIndexesNoOptimize

Posted by Shai Erera <se...@gmail.com>.

> Hmm addDirectories feels a bit too low level

I don't mind calling it addIndexes(Directory...), but I don't think it's too
low level - whoever executes the method passes Directory... and that's
exactly what the method does :). Two addIndexes force you to go read the
jdoc, but so will addDirectories. I don't mind either way.

> But advertise this in back compat breaks.

I don't think it's a bw break? More of a runtime change IMO. True, it can
affect performance, but did we ever measure addIndexes w/ and w/o
optimize()? Are we sure that optimize() first, then SM merges that follow
perform better? I mean, on paper it should. But since we do it only for the
target index, we don't really know what's happening in users' apps. It's
only documentation in CHANGES, so it can go into either section (BACKWARDS or
RUNTIME). I prefer the latter. It can also go like that into trunk's
CHANGES.

> Good!

I'll open an issue to track this.

Shai

On Tue, May 11, 2010 at 12:13 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Mon, May 10, 2010 at 3:08 PM, Shai Erera <se...@gmail.com> wrote:
> > That's still weird Mike - we call optimize in addIndexes to reduce the
> > number of SRs, that's fair. So why don't we do that in addIndexesNoOpt?
>
> I agree it's weird and inconsistent and all that :)
>
> > There, we get a SR per SI. And name of the method suggests optimize() is
> > avoided on purpose ... it's as if addIndexesNoOpt should be called
> > addDirectories, and we should let the caller decide whether to call
> > optimize() on all IRs (including the local) before he calls addIndexes, or
> > NoOpt.
>
> Well... there used to be an addIndexes(Directory..), that did an
> optimize, I think both before and after.  So addIndexesNoOpt was
> reacting to that.
>
> > I mean, we give those methods confusing names, and don't follow the same
> > approach when handling each ... I can live w/ addIndexes existing to take IR
> > extensions, and w/ addDirectories if you don't need IR extensions. But
> > calling/not-calling optimize() is inconsistent, and from what I understand,
> > for no good reason?
>
> It's for a good reason -- it's to attempt to ensure that the single
> merge done by that method isn't insanely slow, if your index has a lot
> of segments.
>
> But, really, those merges ought to go through a merge
> policy/scheduler, so we do mergeFactor at a time, we do up to N
> concurrently, etc.
>
> So I think this pre-optimize is a hack to try to contain how many
> readers we merge at once.
>
> > I'm asking these questions b/c someone asked me the other day when one
> > should call each and what the hell that NoOpt is doing in the name ... I was
> > confused when I was asked the question, and I'm confused now :).
>
> I hear you...
>
> > So how about if we:
> > 1) Rename addIndexesNoOptimize to addDirectories
>
> Hmm addDirectories feels a bit too low level... why not call it
> addIndexes (it's a different signature since it accepts Dir not IR).
>
> > 2) Remove optimize() call from addIndexes
>
> +1
>
> But advertise this in back compat breaks.  We could also preserve old
> way under Version.
>
> > 3) Document that clearly in both, w/ a recommendation to call optimize()
> > before on any of the Directories/Indexes if it's a concern.
>
> Good.
>
> > That way, we maintain all the flexibility in the API - addIndexes allows for
> > using IR extensions, addDirectories is considered more efficient, by
> > allowing the merges to happen concurrently (depending on MS) and also
> > factors in the MP. So unless you have an IR extension, addDirectories is
> > really the one you should be using. And you have the freedom to call
> > optimize() before each if you care about it, or don't if you don't care.
> > Either way, incurring the cost of optimize() is entirely in your hands.
>
> Good!
>
> Mike

Re: When to use addIndexes and when addIndexesNoOptimize

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Mon, May 10, 2010 at 3:08 PM, Shai Erera <se...@gmail.com> wrote:
> That's still weird Mike - we call optimize in addIndexes to reduce the
> number of SRs, that's fair. So why don't we do that in addIndexesNoOpt?

I agree it's weird and inconsistent and all that :)

> There, we get a SR per SI. And name of the method suggests optimize() is
> avoided on purpose ... it's as if addIndexesNoOpt should be called
> addDirectories, and we should let the caller decide whether to call
> optimize() on all IRs (including the local) before he calls addIndexes, or
> NoOpt.

Well... there used to be an addIndexes(Directory..), that did an
optimize, I think both before and after.  So addIndexesNoOpt was
reacting to that.

> I mean, we give those methods confusing names, and don't follow the same
> approach when handling each ... I can live w/ addIndexes existing to take IR
> extensions, and w/ addDirectories if you don't need IR extensions. But
> calling/not-calling optimize() is inconsistent, and from what I understand,
> for no good reason?

It's for a good reason -- it's to attempt to ensure that the single
merge done by that method isn't insanely slow, if your index has a lot
of segments.

But, really, those merges ought to go through a merge
policy/scheduler, so we do mergeFactor at a time, we do up to N
concurrently, etc.

So I think this pre-optimize is a hack to try to contain how many
readers we merge at once.

> I'm asking these questions b/c someone asked me the other day when one
> should call each and what the hell that NoOpt is doing in the name ... I was
> confused when I was asked the question, and I'm confused now :).

I hear you...

> So how about if we:
> 1) Rename addIndexesNoOptimize to addDirectories

Hmm addDirectories feels a bit too low level... why not call it
addIndexes (it's a different signature since it accepts Dir not IR).

> 2) Remove optimize() call from addIndexes

+1

But advertise this in back compat breaks.  We could also preserve old
way under Version.

> 3) Document that clearly in both, w/ a recommendation to call optimize()
> before on any of the Directories/Indexes if it's a concern.

Good.

> That way, we maintain all the flexibility in the API - addIndexes allows for
> using IR extensions, addDirectories is considered more efficient, by
> allowing the merges to happen concurrently (depending on MS) and also
> factors in the MP. So unless you have an IR extension, addDirectories is
> really the one you should be using. And you have the freedom to call
> optimize() before each if you care about it, or don't if you don't care.
> Either way, incurring the cost of optimize() is entirely in your hands.

Good!

Mike
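
To make the "mergeFactor at a time" idea above concrete, here is a toy count in plain Java. It is a deliberately simplified sketch: it ignores segment sizes and concurrency, and it is not how Lucene's merge policy or scheduler actually selects merges. The class and method names are made up for illustration.

```java
/** Toy model of bounded merging; it ignores segment sizes and is not Lucene's merge policy. */
public class BoundedMergeSketch {

    /** Counts the merges needed when each merge takes exactly mergeFactor segments. */
    public static int countMerges(int segments, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        int merges = 0;
        while (segments >= mergeFactor) {
            segments -= mergeFactor - 1; // mergeFactor inputs become one output
            merges++;
        }
        return merges;
    }

    /** Segments left over once no full group of mergeFactor remains. */
    public static int remainingSegments(int segments, int mergeFactor) {
        return segments - countMerges(segments, mergeFactor) * (mergeFactor - 1);
    }

    public static void main(String[] args) {
        int segments = 100;
        int mergeFactor = 10;
        // One giant merge opens all inputs at once.
        System.out.println("giant merge: 1 merge over " + segments + " readers");
        // Bounded merging does more merges, but each opens at most mergeFactor readers.
        System.out.println("bounded: " + countMerges(segments, mergeFactor)
                + " merges of " + mergeFactor + " readers each, "
                + remainingSegments(segments, mergeFactor) + " segment(s) left");
    }
}
```

The point of the comparison is only that bounding the inputs per merge trades more merges for fewer readers open at any one time.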

Re: When to use addIndexes and when addIndexesNoOptimize

Posted by Shai Erera <se...@gmail.com>.

That's still weird Mike - we call optimize in addIndexes to reduce the
number of SRs, that's fair. So why don't we do that in addIndexesNoOpt?
There, we get a SR per SI. And name of the method suggests optimize() is
avoided on purpose ... it's as if addIndexesNoOpt should be called
addDirectories, and we should let the caller decide whether to call
optimize() on all IRs (including the local) before he calls addIndexes, or
NoOpt.

I mean, we give those methods confusing names, and don't follow the same
approach when handling each ... I can live w/ addIndexes existing to take IR
extensions, and w/ addDirectories if you don't need IR extensions. But
calling/not-calling optimize() is inconsistent, and from what I understand,
for no good reason?

I'm asking these questions b/c someone asked me the other day when one
should call each and what the hell that NoOpt is doing in the name ... I was
confused when I was asked the question, and I'm confused now :).

So how about if we:
1) Rename addIndexesNoOptimize to addDirectories
2) Remove optimize() call from addIndexes
3) Document that clearly in both, w/ a recommendation to call optimize()
before on any of the Directories/Indexes if it's a concern.

That way, we maintain all the flexibility in the API - addIndexes allows for
using IR extensions, addDirectories is considered more efficient, by
allowing the merges to happen concurrently (depending on MS) and also
factors in the MP. So unless you have an IR extension, addDirectories is
really the one you should be using. And you have the freedom to call
optimize() before each if you care about it, or don't if you don't care.
Either way, incurring the cost of optimize() is entirely in your hands.

What do you think?

Shai

On Mon, May 10, 2010 at 9:33 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Mon, May 10, 2010 at 2:18 PM, Shai Erera <se...@gmail.com> wrote:
> > Ahh, I see. Didn't think of IndexReader extensions. Why do we call
> > optimize() on the local dir in addIndexes then? What are the benefits?
>
> I really don't know!  Maybe to handle the case where local index has
> many segments?  Ie, reduce the net number of readers open?
>
> I would think "typically" a smallish number of foreign indexes are
> added to a largish number of local segments?
>
> We should at least make it optional to do the optimize...
>
> > We don't do the same on the incoming readers, so why does it matter if e.g. the
> > local dir has 2 segments and the incoming ones have 100? We insist on
> > optimizing the local 2 segments ...
> >
> > BTW, addIndexesNoOpt does not obtain a reader, but rather reads the SIs from
> > each directory and then calls maybeMerge().
>
> Ahh right, thanks for the clarification.
>
> Mike

Re: When to use addIndexes and when addIndexesNoOptimize

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Mon, May 10, 2010 at 2:18 PM, Shai Erera <se...@gmail.com> wrote:
> Ahh, I see. Didn't think of IndexReader extensions. Why do we call
> optimize() on the local dir in addIndexes then? What are the benefits?

I really don't know!  Maybe to handle the case where local index has
many segments?  Ie, reduce the net number of readers open?

I would think "typically" a smallish number of foreign indexes are
added to a largish number of local segments?

We should at least make it optional to do the optimize...

> We don't do the same on the incoming readers, so why does it matter if e.g. the
> local dir has 2 segments and the incoming ones have 100? We insist on
> optimizing the local 2 segments ...
>
> BTW, addIndexesNoOpt does not obtain a reader, but rather reads the SIs from
> each directory and then calls maybeMerge().

Ahh right, thanks for the clarification.

Mike

Re: When to use addIndexes and when addIndexesNoOptimize

Posted by Shai Erera <se...@gmail.com>.

Ahh, I see. Didn't think of IndexReader extensions. Why do we call
optimize() on the local dir in addIndexes then? What are the benefits? We
don't do the same on the incoming readers, so why does it matter if e.g. the
local dir has 2 segments and the incoming ones have 100? We insist on
optimizing the local 2 segments ...

BTW, addIndexesNoOpt does not obtain a reader, but rather reads the SIs from
each directory and then calls maybeMerge().

Shai

On Mon, May 10, 2010 at 7:27 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> addIndexes accepts IndexReaders, so eg "foreign" IndexReader impls can
> be passed in (eg FilterIndexReader).
>
> While addIndexesNoOptimize accepts Directory, ie it gets a reader using
> IR.open.
>
> Mike
>
> On Mon, May 10, 2010 at 5:58 AM, Shai Erera <se...@gmail.com> wrote:
> > Hi
> >
> > As I was working on LUCENE-1585 and understanding more of the differences
> > between addIndexes and addIndexesNoOptimize, I was wondering why we have
> > those two methods. It seems like addIndexes's usage is discouraged, no? Can
> > someone please explain to me why it isn't deprecated, with addIndexesNoOpt
> > becoming the default one? Are we losing any functionality?
> >
> > Shai
> >

Re: When to use addIndexes and when addIndexesNoOptimize

Posted by Michael McCandless <lu...@mikemccandless.com>.

addIndexes accepts IndexReaders, so eg "foreign" IndexReader impls can
be passed in (eg FilterIndexReader).

While addIndexesNoOptimize accepts Directory, ie it gets a reader using IR.open.

Mike

On Mon, May 10, 2010 at 5:58 AM, Shai Erera <se...@gmail.com> wrote:
> Hi
>
> As I was working on LUCENE-1585 and understanding more of the differences
> between addIndexes and addIndexesNoOptimize, I was wondering why we have
> those two methods. It seems like addIndexes's usage is discouraged, no? Can
> someone please explain to me why it isn't deprecated, with addIndexesNoOpt
> becoming the default one? Are we losing any functionality?
>
> Shai
>
