Posted to dev@lucene.apache.org by Michael McCandless <lu...@mikemccandless.com> on 2007/07/02 15:35:11 UTC

[VOTE] Commit LUCENE-843 (IndexWriter performance gains)

Hi,

I'd like to commit LUCENE-843.

The patch has gone through a number of iterations, but the final
version that's there now (take9) is quite a bit cleaner & simpler than
the ones leading up to it and, I believe, ready.

It provides solid indexing performance gains (between 2X and 8X), but it
is somewhat more complex than the current "single doc per segment"
approach, and it does introduce a change to the index format (only when
autoCommit=false) whereby multiple segments can share a single set of
term vector & stored fields files.

Given that it's such a big change I think (?) it's appropriate to ask
for a vote (only PMC member votes are binding) to make sure we have
consensus that this is net/net a good change for Lucene.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [VOTE] Commit LUCENE-843 (IndexWriter performance gains)

Posted by Michael McCandless <lu...@mikemccandless.com>.
Ahh, right, I will update fileformats.xml & re-build html/PDF
(with Forrest 0.8) before committing.

The only downside I see now is that if you flush by RAM (which gives
the best performance), you have to be very careful to work around
LUCENE-845 by also setting maxBufferedDocs to something "around"
the right number.  However, this downside should go away once we
resolve LUCENE-845 (which is next on my stack, after the "multiple
writers over NFS" work that's in progress now!).
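
For illustration only (this is not code from the patch): one plausible
reading of "around the right number" is the number of documents that
would roughly fit in the RAM buffer.  The sketch below estimates that
count from an assumed average in-memory document size, and models a
buffer that flushes when either limit is reached.  The class name, the
method names, the average-size input and the "whichever comes first"
rule are all assumptions made for this example; none of them are
Lucene's API or behavior confirmed in this thread.

```java
// Toy model only: every name here is hypothetical and NOT part of Lucene.
final class FlushSketch {

    /** Estimate how many documents fit in a RAM buffer of the given size. */
    static int estimateMaxBufferedDocs(double ramBufferMB, long avgDocBytes) {
        if (ramBufferMB <= 0 || avgDocBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        long bufferBytes = (long) (ramBufferMB * 1024 * 1024);
        long docs = bufferBytes / avgDocBytes;
        // Clamp to at least one document and at most Integer.MAX_VALUE.
        return (int) Math.max(1, Math.min(Integer.MAX_VALUE, docs));
    }

    /**
     * Count flushes for a stream of documents, assuming (for this sketch)
     * that a flush fires as soon as either the RAM limit or the document
     * count limit is reached. Documents still buffered at the end are
     * not counted as a flush.
     */
    static int countFlushes(long[] docSizes, long ramLimitBytes, int maxBufferedDocs) {
        int flushes = 0;
        long usedBytes = 0;
        int bufferedDocs = 0;
        for (long size : docSizes) {
            usedBytes += size;
            bufferedDocs++;
            if (usedBytes >= ramLimitBytes || bufferedDocs >= maxBufferedDocs) {
                flushes++;
                usedBytes = 0;
                bufferedDocs = 0;
            }
        }
        return flushes;
    }

    public static void main(String[] args) {
        // Assumed inputs: a 32 MB buffer and documents averaging 4 KB in memory.
        int docs = estimateMaxBufferedDocs(32.0, 4096);
        System.out.println("estimated maxBufferedDocs = " + docs);
    }
}
```

In practice the average document size would have to be measured on a
sample of the real content, so the result is only a starting point for
tuning, not a precise setting.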

I will also plant a tag just before committing.

Thanks for reviewing, everyone!  I will give it another day or so and
then commit.

Mike

"Grant Ingersoll" <gs...@apache.org> wrote:
> Mike,
> 
> Nice piece of work here.  One caveat, I think you mentioned you  
> needed to update fileformats.xml (don't forget to generate the site  
> and commit those changes too), but I don't see that in the patch.
> 
> Also, do you see any downsides to this patch?  Do you think it would  
> ever be the case that a user would not benefit from it?  If so,  
> probably would be useful to document them.
> 
> Other than that, I am +1
> 
> Cheers,
> Grant
> 
> On Jul 2, 2007, at 9:35 AM, Michael McCandless wrote:
> 
> > Hi,
> >
> > I'd like to commit LUCENE-843.
> >
> > The patch has gone through a number of iterations but the final
> > version that's there now (take9) is quite a bit cleaner & simpler than
> > the ones leading up to it and I believe ready.
> >
> > It provides solid indexing performance gains (between 2X-8X), but, it
> > is somewhat more complex than the current "single doc per segment"
> > approach and it does introduce a change to the index format (only when
> > autoCommit=false) whereby multiple segments can share a single set of
> > term vector & stored fields files.
> >
> > Given that it's such a big change I think (?) it's appropriate to ask
> > for a vote (only PMC member votes are binding) to make sure we have
> > consensus that this is net/net a good change for Lucene.
> >
> > Mike
> >


Re: [VOTE] Commit LUCENE-843 (IndexWriter performance gains)

Posted by Grant Ingersoll <gs...@apache.org>.
Mike,

Nice piece of work here.  One caveat, I think you mentioned you  
needed to update fileformats.xml (don't forget to generate the site  
and commit those changes too), but I don't see that in the patch.

Also, do you see any downsides to this patch?  Do you think it would  
ever be the case that a user would not benefit from it?  If so,  
probably would be useful to document them.

Other than that, I am +1

Cheers,
Grant

On Jul 2, 2007, at 9:35 AM, Michael McCandless wrote:

> Hi,
>
> I'd like to commit LUCENE-843.
>
> The patch has gone through a number of iterations but the final
> version that's there now (take9) is quite a bit cleaner & simpler than
> the ones leading up to it and I believe ready.
>
> It provides solid indexing performance gains (between 2X-8X), but, it
> is somewhat more complex than the current "single doc per segment"
> approach and it does introduce a change to the index format (only when
> autoCommit=false) whereby multiple segments can share a single set of
> term vector & stored fields files.
>
> Given that it's such a big change I think (?) it's appropriate to ask
> for a vote (only PMC member votes are binding) to make sure we have
> consensus that this is net/net a good change for Lucene.
>
> Mike
>


Re: [VOTE] Commit LUCENE-843 (IndexWriter performance gains)

Posted by Yonik Seeley <yo...@apache.org>.
On 7/2/07, Michael McCandless <lu...@mikemccandless.com> wrote:
> I'd like to commit LUCENE-843.

+1
Awesome job!

> The patch has gone through a number of iterations but the final
> version that's there now (take9) is quite a bit cleaner & simpler than
> the ones leading up to it and I believe ready.
>
> It provides solid indexing performance gains (between 2X-8X), but, it
> is somewhat more complex than the current "single doc per segment"
> approach and it does introduce a change to the index format (only when
> autoCommit=false) whereby multiple segments can share a single set of
> term vector & stored fields files.

I'll miss the elegant single doc approach that's been with us for so
long, but one can't ignore the magnitude of these performance gains.

> Given that it's such a big change I think (?) it's appropriate to ask
> for a vote (only PMC member votes are binding) to make sure we have
> consensus that this is net/net a good change for Lucene.

IMO, there's no need to be that formal.  A simple vote on the dev list
(non-committer votes are welcome and carry weight too), and if there's
a consensus then everything is good.

-Yonik



Re: [VOTE] Commit LUCENE-843 (IndexWriter performance gains)

Posted by Grant Ingersoll <gs...@apache.org>.

On Jul 2, 2007, at 4:18 PM, Yonik Seeley wrote:

> On 7/2/07, Grant Ingersoll <gs...@apache.org> wrote:
>> 2. or, at a minimum, do a tag of the trunk right before committing.
>> I just find explicit tags make it easier to rollback or compare diffs
>> if need be
>
> You can always use an explicit revision number, which is easy to find
> out from the bug, or you can even find the closest by time:
>

Yeah, I know you can do that, I just sometimes like explicit tags for  
things of this magnitude.






Re: [VOTE] Commit LUCENE-843 (IndexWriter performance gains)

Posted by Yonik Seeley <yo...@apache.org>.
On 7/2/07, Grant Ingersoll <gs...@apache.org> wrote:
> 2. or, at a minimum, do a tag of the trunk right before committing.
> I just find explicit tags make it easier to rollback or compare diffs
> if need be

You can always use an explicit revision number, which is easy to find
out from the bug, or you can even find the closest by time:

svn info -r {2006-11-10T00:03:00Z} http://svn.apache.org/repos/asf

-Yonik



Re: [VOTE] Commit LUCENE-843 (IndexWriter performance gains)

Posted by Grant Ingersoll <gs...@apache.org>.
Also, is it worth considering a couple of things:

1. Do a build version release prior to committing (i.e. 2.2.1); that
way we could isolate this change and do a separate release for 2.3.  I
don't want to do releases just for the sake of releases, but I think
we should at least prepare people for the fact that the next release
(i.e. the one containing 843) has a significant change.  I don't think
this patch warrants a major revision tick, but it does make sense to
have people really scrutinize it and to let them know that there are
significant gains to be had.

2. Or, at a minimum, tag the trunk right before committing.
I just find explicit tags make it easier to roll back or compare diffs
if need be.

Note these suggestions are by no means a judgment of the quality of  
the patch, just some precautions before such a big change.

-Grant

On Jul 2, 2007, at 1:31 PM, Grant Ingersoll wrote:

>
> On Jul 2, 2007, at 9:35 AM, Michael McCandless wrote:
>
>> Hi,
>>
>> I'd like to commit LUCENE-843.
>>
>> The patch has gone through a number of iterations but the final
>> version that's there now (take9) is quite a bit cleaner & simpler  
>> than
>> the ones leading up to it and I believe ready.
>>
>> It provides solid indexing performance gains (between 2X-8X), but, it
>> is somewhat more complex than the current "single doc per segment"
>> approach and it does introduce a change to the index format (only  
>> when
>> autoCommit=false) whereby multiple segments can share a single set of
>> term vector & stored fields files.
>>
>
> +0 for now, I will try to review tonight or tomorrow night.  From  
> what I gather from reading the issue, etc. it sounds great and you  
> and others have put a lot of hard work into it.  Also, from some  
> benchmarking I have done, it seems to sit well with the notion of  
> optimizing merge factor, etc. based on the amount of memory available.
>
>


Re: [VOTE] Commit LUCENE-843 (IndexWriter performance gains)

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 2, 2007, at 9:35 AM, Michael McCandless wrote:

> Hi,
>
> I'd like to commit LUCENE-843.
>
> The patch has gone through a number of iterations but the final
> version that's there now (take9) is quite a bit cleaner & simpler than
> the ones leading up to it and I believe ready.
>
> It provides solid indexing performance gains (between 2X-8X), but, it
> is somewhat more complex than the current "single doc per segment"
> approach and it does introduce a change to the index format (only when
> autoCommit=false) whereby multiple segments can share a single set of
> term vector & stored fields files.
>

+0 for now, I will try to review tonight or tomorrow night.  From  
what I gather from reading the issue, etc. it sounds great and you  
and others have put a lot of hard work into it.  Also, from some  
benchmarking I have done, it seems to sit well with the notion of  
optimizing merge factor, etc. based on the amount of memory available.

  
  



Re: [VOTE] Commit LUCENE-843 (IndexWriter performance gains)

Posted by Doug Cutting <cu...@apache.org>.
+1 This is great work!  Commit it.

Doug

Michael McCandless wrote:
> Hi,
> 
> I'd like to commit LUCENE-843.
> 
> The patch has gone through a number of iterations but the final
> version that's there now (take9) is quite a bit cleaner & simpler than
> the ones leading up to it and I believe ready.
> 
> It provides solid indexing performance gains (between 2X-8X), but, it
> is somewhat more complex than the current "single doc per segment"
> approach and it does introduce a change to the index format (only when
> autoCommit=false) whereby multiple segments can share a single set of
> term vector & stored fields files.
> 
> Given that it's such a big change I think (?) it's appropriate to ask
> for a vote (only PMC member votes are binding) to make sure we have
> consensus that this is net/net a good change for Lucene.
> 
> Mike
> 