Posted to dev@lucene.apache.org by John Wang <jo...@gmail.com> on 2009/07/05 19:40:52 UTC

Fwd: addIndexesNoOptimize

Guys:

       Any thoughts? Forwarding the question from the users list after not
hearing back.

Thanks

-John

---------- Forwarded message ----------
From: John Wang <jo...@gmail.com>
Date: Fri, Jul 3, 2009 at 3:49 PM
Subject: addIndexesNoOptimize
To: java-user@lucene.apache.org


Hi guys:

    Running into a question with IndexWriter.addIndexesNoOptimize:

    I am trying to expand a smaller index by replicating it into a larger
index. So I am adding the same directory N times.

    I get an exception because noDupDirs(dirs) fails. For this call, is this
check necessary?

    I temporarily commented it out, and the resulting index seems to be fine.

Thanks

-John

Re: addIndexesNoOptimize

Posted by Jason Rutherglen <ja...@gmail.com>.
> MergePolicy expects to receive SegmentInfo instances
I ran into this implementing LUCENE-1589.

On Mon, Jul 6, 2009 at 3:18 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Mon, Jul 6, 2009 at 2:18 AM, John Wang<jo...@gmail.com> wrote:
>
> >      Currently, addIndexesNoOptimize(Directory[] dir) is really really
> > really fast! (I duplicated my index of 15k docs 200 times and created a
> > 3M doc index in less than a minute) Perhaps we should handle duplicate
> > directory names more gracefully? e.g. append a numeral after the segment
> > name or something? (I'd be happy to work on a patch for it)
>
> I guess we could explicitly disambiguate on adding external
> SegmentInfo instances into IndexWriter's segmentInfos (add a new
> member to SegmentInfo that's normally set to a default value but on
> importing dups is set to unique values, and then use that member in
> hashCode/equals).  It's somewhat "smelly" though...
>
> Or, you could call addIndexesNoOptimize N times, instead; I wonder how
> the performance would compare.  Is performance a real issue here?
> This is just for testing right?
>
> >  For what I need now, I think in my case
> addIndexesNoOptimize(IndexReader[]) would work as well (I wouldn't know how
> performance would compare though).
>
> Implementing this is actually rather tricky, because
> MergePolicy expects to receive SegmentInfo instances, not IndexReader
> instances, to make its decisions.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: addIndexesNoOptimize

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Mon, Jul 6, 2009 at 2:18 AM, John Wang<jo...@gmail.com> wrote:

>      Currently, addIndexesNoOptimize(Directory[] dir) is really really
> really fast! (I duplicated my index of 15k docs 200 times and created a 3M
> doc index in less than a minute) Perhaps we should handle duplicate
> directory names more gracefully? e.g. append a numeral after the segment
> name or something? (I'd be happy to work on a patch for it)

I guess we could explicitly disambiguate on adding external
SegmentInfo instances into IndexWriter's segmentInfos (add a new
member to SegmentInfo that's normally set to a default value but on
importing dups is set to unique values, and then use that member in
hashCode/equals).  It's somewhat "smelly" though...
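A minimal, self-contained sketch of that disambiguation idea (the class and
field names below are invented for illustration; this is not Lucene's actual
SegmentInfo): an extra member defaults to 0, gets unique values when importing
duplicates, and participates in hashCode/equals so two copies of the same
Directory + segment name no longer collide.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class SegmentKey {
  final String directory;   // stand-in for the Directory's identity
  final String segmentName; // e.g. "_0"
  final int importOrdinal;  // 0 normally; unique per imported duplicate

  SegmentKey(String directory, String segmentName, int importOrdinal) {
    this.directory = directory;
    this.segmentName = segmentName;
    this.importOrdinal = importOrdinal;
  }

  @Override public boolean equals(Object o) {
    if (!(o instanceof SegmentKey)) return false;
    SegmentKey k = (SegmentKey) o;
    return directory.equals(k.directory)
        && segmentName.equals(k.segmentName)
        && importOrdinal == k.importOrdinal;
  }

  @Override public int hashCode() {
    return Objects.hash(directory, segmentName, importOrdinal);
  }
}

public class DisambiguateDemo {
  public static void main(String[] args) {
    Set<SegmentKey> segmentInfos = new HashSet<>();
    // Importing the same segment twice: unique ordinals keep both entries.
    segmentInfos.add(new SegmentKey("/path/to/index", "_0", 0));
    segmentInfos.add(new SegmentKey("/path/to/index", "_0", 1));
    System.out.println(segmentInfos.size()); // prints 2: the dups no longer collide
  }
}
```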

Or, you could call addIndexesNoOptimize N times, instead; I wonder how
the performance would compare.  Is performance a real issue here?
This is just for testing, right?

>  For what I need now, I think in my case addIndexesNoOptimize(IndexReader[]) would work as well (I wouldn't know how performance would compare though).

Implementing this is actually rather tricky, because
MergePolicy expects to receive SegmentInfo instances, not IndexReader
instances, to make its decisions.

Mike



Re: addIndexesNoOptimize

Posted by John Wang <jo...@gmail.com>.
Hi Mark and Michael:

     Thanks for your replies.

     Currently, addIndexesNoOptimize(Directory[] dir) is really really
really fast! (I duplicated my index of 15k docs 200 times and created a 3M
doc index in less than a minute) Perhaps we should handle duplicate
directory names more gracefully? e.g. append a numeral after the segment
name or something? (I'd be happy to work on a patch for it)

     For what I need now, I think in my case
addIndexesNoOptimize(IndexReader[]) would work as well (I wouldn't know how
performance would compare though).

Thanks

-John

On Sun, Jul 5, 2009 at 6:10 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> This was added defensively a while back (can't find the issue right
> now), because internally IndexWriter now identifies each SegmentInfo
> as its Directory + segment name.
>
> EG the "runningMerges" set makes use of this.
>
> If you comment the check out, and pass duplicate segments in, I think
> at least IndexWriter would falsely delay certain merges (ie, gain less
> concurrency from CMS) because of the dups.
>
> But offhand I'm not sure where else we key on a SegmentInfo and what
> else might go wrong if dups enter IndexWriter's segmentInfos but it'd
> make me somewhat nervous removing that defensive check.
>
> Maybe instead we can add an addIndexesNoOptimize(IndexReader[]) (and
> deprecate addIndexes(IndexReader[]))?  Would that work?
>
> Mike
>
> On Sun, Jul 5, 2009 at 1:40 PM, John Wang<jo...@gmail.com> wrote:
> > Guys:
> >
> >        Any thoughts? Forwarding the question from the users list after
> not
> > hearing back.
> >
> > Thanks
> >
> > -John
> >
> > ---------- Forwarded message ----------
> > From: John Wang <jo...@gmail.com>
> > Date: Fri, Jul 3, 2009 at 3:49 PM
> > Subject: addIndexesNoOptimize
> > To: java-user@lucene.apache.org
> >
> >
> > Hi guys:
> >
> >     Running into a question with IndexWriter.addIndexesNoOptimize:
> >
> >     I am trying to expand a smaller index by replicating it into a larger
> > index. So I am adding the same directory N times.
> >
> >     I get an exception because noDupDirs(dirs) fails. For this call, is
> > this check necessary?
> >
> >     I temporarily commented it out, and the resulting index seems to be fine.
> >
> > Thanks
> >
> > -John
> >
> >
>

Re: addIndexesNoOptimize

Posted by Michael McCandless <lu...@mikemccandless.com>.
This was added defensively a while back (can't find the issue right
now), because internally IndexWriter now identifies each SegmentInfo
as its Directory + segment name.

EG the "runningMerges" set makes use of this.

If you comment the check out, and pass duplicate segments in, I think
at least IndexWriter would falsely delay certain merges (ie, gain less
concurrency from CMS) because of the dups.

But offhand I'm not sure where else we key on a SegmentInfo, or what
else might go wrong if dups enter IndexWriter's segmentInfos; it'd
make me somewhat nervous to remove that defensive check.
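A self-contained illustration of that hazard (plain Java, not Lucene code;
the key format here is invented): if a set like "runningMerges" is keyed on
Directory + segment name, two imported copies of the same segment produce the
identical key, so the second one silently collapses into the first and the
writer would treat a merge on the second copy as already running.

```java
import java.util.HashSet;
import java.util.Set;

public class DupSegmentDemo {
  public static void main(String[] args) {
    // Stand-in for a set keyed on Directory + segment name.
    Set<String> runningMerges = new HashSet<>();

    // Two copies of the same segment yield the identical key...
    boolean first = runningMerges.add("/path/to/index:_0");
    boolean second = runningMerges.add("/path/to/index:_0");

    // ...so the second add is a no-op and the set holds only one entry.
    System.out.println(first + " " + second + " " + runningMerges.size());
    // prints: true false 1
  }
}
```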

Maybe instead we can add an addIndexesNoOptimize(IndexReader[]) (and
deprecate addIndexes(IndexReader[]))?  Would that work?

Mike

On Sun, Jul 5, 2009 at 1:40 PM, John Wang<jo...@gmail.com> wrote:
> Guys:
>
>        Any thoughts? Forwarding the question from the users list after not
> hearing back.
>
> Thanks
>
> -John
>
> ---------- Forwarded message ----------
> From: John Wang <jo...@gmail.com>
> Date: Fri, Jul 3, 2009 at 3:49 PM
> Subject: addIndexesNoOptimize
> To: java-user@lucene.apache.org
>
>
> Hi guys:
>
>     Running into a question with IndexWriter.addIndexesNoOptimize:
>
>     I am trying to expand a smaller index by replicating it into a larger
> index. So I am adding the same directory N times.
>
>     I get an exception because noDupDirs(dirs) fails. For this call, is this
> check necessary?
>
>     I temporarily commented it out, and the resulting index seems to be fine.
>
> Thanks
>
> -John
>
>



Re: Fwd: addIndexesNoOptimize

Posted by Mark Miller <ma...@gmail.com>.
I don't see why we would disallow that. We would probably just want to
remove one of the two checks though:

      if (dups.contains(dirs[i]))
        throw new IllegalArgumentException("Directory " + dirs[i]
            + " appears more than once");



-- 
- Mark

http://www.lucidimagination.com



John Wang wrote:
> Guys:
>
>        Any thoughts? Forwarding the question from the users list after 
> not hearing back.
>
> Thanks
>
> -John
>
> ---------- Forwarded message ----------
> From: *John Wang* <john.wang@gmail.com>
> Date: Fri, Jul 3, 2009 at 3:49 PM
> Subject: addIndexesNoOptimize
> To: java-user@lucene.apache.org
>
>
> Hi guys:
>
>     Running into a question with IndexWriter.addIndexesNoOptimize:
>
>     I am trying to expand a smaller index by replicating it into a 
> larger index. So I am adding the same directory N times.
>
>     I get an exception because noDupDirs(dirs) fails. For this call, 
> is this check necessary?
>
>     I temporarily commented it out, and the resulting index seems to be fine.
>
> Thanks
>
> -John
>



