Posted to java-user@lucene.apache.org by Selva Kumar <se...@gmail.com> on 2015/09/14 18:24:37 UTC

Lucene 5 : any merge performance metrics compared to 4.x?

We observe some merge slowness after we migrated from 4.10 to 5.2.
Is this expected? Any new tunable merge parameters in Lucene 5 ?

-Selva

Re: Lucene 5 : any merge performance metrics compared to 4.x?

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Sep 30, 2015 at 7:41 PM, McKinley, James T
<ja...@cengage.com> wrote:

> We really don't have the option of moving to local disk without a significant redesign of our systems.  However, we do have the possibility of switching to iSCSI instead of NFS without changing our hardware, do you happen to know whether iSCSI would be a better protocol for use with Lucene?  Thanks!

I don't have any direct experience with iSCSI, but I think it's likely
it would be "correct" since it's a lower level protocol than NFS, i.e.
from the computer's standpoint it thinks it's talking to a local SCSI
drive (maybe)?

But performance wise I'm not sure if it'd be better or worse ... if it
must cross the same network connection as your NFS connection it seems
likely it'd also have performance issues?

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene 5 : any merge performance metrics compared to 4.x?

Posted by "McKinley, James T" <ja...@cengage.com>.
Hi Mike,

Thanks for your response.  We've been using NFS for 10 years with Lucene and never saw index corruption until we moved to 4.x if I remember correctly.  We are aware of the locking and other issues you mentioned with NFS, but they've not been much of a problem for us.  You're probably correct that using a network file system is causing the check index to be slower than it would be on local disk.

We really don't have the option of moving to local disk without a significant redesign of our systems.  However, we do have the possibility of switching to iSCSI instead of NFS without changing our hardware, do you happen to know whether iSCSI would be a better protocol for use with Lucene?  Thanks!

Jim

________________________________________
From: will martin <wm...@gmail.com>
Sent: 30 September 2015 06:49
To: java-user@lucene.apache.org
Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x?

Thanks Mike. This is very informative.




RE: Lucene 5 : any merge performance metrics compared to 4.x?

Posted by will martin <wm...@gmail.com>.
Thanks Mike. This is very informative. 



-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Tuesday, September 29, 2015 3:22 PM
To: Lucene Users
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?

No, it is not possible to disable, and, yes, we removed that API in 5.x because 1) the risk of silent index corruption is too high to warrant this small optimization and 2) we re-worked how merging works so that this checkIntegrity has IO locality with what's being merged next.

There were other performance gains for merging in 5.x, e.g. using much less memory in the many-fields case, not decompressing + recompressing stored fields and term vectors, etc.

As Adrien pointed out, the cost should be much lower than 25% for a local filesystem ... I suspect something about your NFS setup is making it more costly.

NFS is in general a dangerous filesystem to use with Lucene (no delete on last close, locking is tricky to get right, incoherent client file contents and directory listing caching).

If you want to also checkIntegrity of the merged segment you could e.g. install an IndexReaderWarmer in your IW and call IndexReader.checkIntegrity.

Mike McCandless

http://blog.mikemccandless.com




Re: Lucene 5 : any merge performance metrics compared to 4.x?

Posted by Michael McCandless <lu...@mikemccandless.com>.
No, it is not possible to disable, and, yes, we removed that API in
5.x because 1) the risk of silent index corruption is too high to
warrant this small optimization and 2) we re-worked how merging works
so that this checkIntegrity has IO locality with what's being merged
next.
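
As a rough illustration of the idea behind a pre-merge check (this is not Lucene's code), the sketch below gives each "segment" a CRC32 footer and refuses to merge when any input fails verification, so a corrupt input is never copied into the merged output. The class name PreMergeCheckSketch, the byte-array segment layout and the method names are all assumptions invented for this example; the only real API used is java.util.zip.CRC32.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical model: a "segment" is a payload followed by an 8-byte CRC32 footer.
final class PreMergeCheckSketch {

    /** Appends the CRC32 of the payload as an 8-byte footer. */
    static byte[] withFooter(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + Long.BYTES)
                .put(payload)
                .putLong(crc.getValue())
                .array();
    }

    /** Recomputes the payload checksum and compares it with the stored footer. */
    static boolean verifyFooter(byte[] segment) {
        if (segment.length < Long.BYTES) {
            return false; // too short to even hold a footer
        }
        int payloadLength = segment.length - Long.BYTES;
        CRC32 crc = new CRC32();
        crc.update(segment, 0, payloadLength);
        long stored = ByteBuffer.wrap(segment, payloadLength, Long.BYTES).getLong();
        return stored == crc.getValue();
    }

    /** Verifies every input before copying anything, then writes a fresh footer. */
    static byte[] merge(List<byte[]> segments) {
        int total = 0;
        for (byte[] segment : segments) {
            if (!verifyFooter(segment)) {
                throw new IllegalStateException("corrupt segment, refusing to merge");
            }
            total += segment.length - Long.BYTES;
        }
        ByteBuffer merged = ByteBuffer.allocate(total);
        for (byte[] segment : segments) {
            merged.put(segment, 0, segment.length - Long.BYTES);
        }
        return withFooter(merged.array());
    }
}
```

In this toy model the verification pass reads the same bytes the copy loop reads immediately afterwards, which is loosely the "IO locality" point: data touched by the check is likely still cached when the merge needs it.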

There were other performance gains for merging in 5.x, e.g. using much
less memory in the many-fields case, not decompressing + recompressing
stored fields and term vectors, etc.

As Adrien pointed out, the cost should be much lower than 25% for a
local filesystem ... I suspect something about your NFS setup is
making it more costly.

NFS is in general a dangerous filesystem to use with Lucene (no delete
on last close, locking is tricky to get right, incoherent client file
contents and directory listing caching).

If you want to also checkIntegrity of the merged segment you could
e.g. install an IndexReaderWarmer in your IW and call
IndexReader.checkIntegrity.

Mike McCandless

http://blog.mikemccandless.com


On Tue, Sep 29, 2015 at 9:00 PM, will martin <wm...@gmail.com> wrote:
> Ok So I'm a little confused:
>
> The 4.10 JavaDoc for LiveIndexWriterConfig supports volatile access on a
> flag to setCheckIntegrityAtMerge ...
>
> Method states it controls pre-merge cost.
>
> Ref:
>
> https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/index/LiveIndexWriterConfig.html#setCheckIntegrityAtMerge%28boolean%29
>
> And it seems to be gone in 5.3 folks? Meaning Adrien's comment is a whole
> lot significant? Merges ALWAYS pre-merge CheckIntegrity? Is this a 5.0
> feature drop? You can't deprecate, um, er totally remove an index time audit
> feature on a point release of any level IMHO.


RE: Lucene 5 : any merge performance metrics compared to 4.x?

Posted by will martin <wm...@gmail.com>.
Ok So I'm a little confused:

The 4.10 JavaDoc for LiveIndexWriterConfig supports volatile access on a
flag to setCheckIntegrityAtMerge ... 

Method states it controls pre-merge cost.

Ref: 

https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/index/LiveIndexWriterConfig.html#setCheckIntegrityAtMerge%28boolean%29

And it seems to be gone in 5.3 folks? Meaning Adrien's comment is a whole
lot significant? Merges ALWAYS pre-merge CheckIntegrity? Is this a 5.0
feature drop? You can't deprecate, um, er totally remove an index time audit
feature on a point release of any level IMHO.


-----Original Message-----
From: McKinley, James T [mailto:james.mckinley@cengage.com] 
Sent: Tuesday, September 29, 2015 2:42 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?

Yes, the indexing workflow is completely separate from the runtime system.
The file system is EMC Isilon via NFS.

Jim

________________________________________
From: will martin <wm...@gmail.com>
Sent: 29 September 2015 14:29
To: java-user@lucene.apache.org
Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x?

This sounds robust. Is the index batch creation workflow a separate process?
Distributed shared filesystems?

--will

-----Original Message-----
From: McKinley, James T [mailto:james.mckinley@cengage.com]
Sent: Tuesday, September 29, 2015 2:22 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?

Hi Adrien and Will,

Thanks for your responses.  I work with Selva and he's busy right now with
other things, so I'll add some more context to his question in an attempt to
improve clarity.

The merge in question is part of our batch indexing workflow wherein we
index new content for a given partition and then merge this new index with
the big index of everything that was previously loaded on the given
partition.  The increase in merge time we've seen since upgrading from 4.10
to 5.2 is on the order of 25%.  It varies from partition to partition, but
25% is a good ballpark estimate I think.  Maybe our case is non-standard, we
have a large number of fields (> 200).

The reason we perform an index check after the merge is that this is the
final index state that will be used for a given batch.  Since we have a
batch-oriented workflow we are able to roll back to a previous batch if we
find a problem with a given batch (Lucene or other problem).  However due to
disk space constraints we can only keep a couple batches.  If our indexing
workflow completes without errors but the index is corrupt, we may not know
right away and we might delete the previous good batch thinking the latest
batch is OK, which would be very bad requiring a full reload of all our
content.

Checking the index prior to the merge would no doubt catch many issues, but
it might not catch corruption that occurs during the merge step itself, so
we implemented a check step once the index is in its final state to ensure
that it is OK.
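
The retention rule described above (only let a new batch displace an older one after the final index has passed its check) can be sketched roughly as follows. Everything here is hypothetical: the class BatchRetentionSketch, the string batch ids and the Predicate standing in for the real index check are assumptions for illustration, not part of the workflow described in this thread or of Lucene.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Hypothetical model of "keep a couple of batches, never drop a good one for a bad one".
final class BatchRetentionSketch {
    private final Deque<String> verifiedBatches = new ArrayDeque<>();
    private final int maxBatches;
    private final Predicate<String> indexCheck; // stand-in for the post-merge index check

    BatchRetentionSketch(int maxBatches, Predicate<String> indexCheck) {
        this.maxBatches = maxBatches;
        this.indexCheck = indexCheck;
    }

    /**
     * Promotes a batch only if its final index passes the check.
     * On failure nothing is deleted, so rollback to the previous batch stays possible.
     */
    boolean promote(String batchId) {
        if (!indexCheck.test(batchId)) {
            return false;
        }
        verifiedBatches.addLast(batchId);
        // Prune the oldest batches only after the new one is known to be good.
        while (verifiedBatches.size() > maxBatches) {
            verifiedBatches.removeFirst();
        }
        return true;
    }

    /** The most recent batch that passed the check, or null if there is none. */
    String current() {
        return verifiedBatches.peekLast();
    }

    int retained() {
        return verifiedBatches.size();
    }
}
```

The ordering is the point of the sketch: the check runs on the index in its final state, and pruning happens strictly after it succeeds.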

So, since we want to do the check post-merge, is there a way to disable the
check during merge so we don't have to do two checks?

Thanks!

Jim

________________________________________
From: will martin <wm...@gmail.com>
Sent: 29 September 2015 12:08
To: java-user@lucene.apache.org
Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x?

So, if it's new, it adds to pre-existing time? So it is a cost that needs to
be understood I think.



And, I'm really curious, what happens to the result of the post merge
checkIntegrity IFF (if and only if) there was corruption pre-merge: I mean
if you let it merge anyway could you get a false positive for integrity?
[see the concept of lazy-evaluation]



These are, imo, the kinds of engineering questions Selva's post raised in my
triage mode of the scenario.





-----Original Message-----
From: Adrien Grand [mailto:jpountz@gmail.com]
Sent: Tuesday, September 29, 2015 8:46 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?



Indeed this is new but I'm a bit surprised this is the source of your issues
as it should be much faster than the merge itself. I don't understand your
proposal to check the index after merge: the goal is to make sure that we do
not propagate corruptions so it's better to check the index before the merge
starts so that we don't even try to merge if there are corruptions?



On Tue, Sep 15, 2015 at 00:40, Selva Kumar <selva.kumar.at.work@gmail.com>
wrote:

> it appears Lucene 5.2 index merge is running checkIntegrity on
> existing index prior to merging additional indices.
> This seems to be new.
>
> We have an existing checkIndex but this is run post index merge.
>
> Two follow up questions :
> * Is there a way to turn off the built-in checkIntegrity? Just for my
> understanding. No plan to turn this off.
> * Is running checkIntegrity prior to index merge better than running
> post merge?
>
> On Mon, Sep 14, 2015 at 12:24 PM, Selva Kumar
> <selva.kumar.at.work@gmail.com> wrote:
>
> > We observe some merge slowness after we migrated from 4.10 to 5.2.
> > Is this expected? Any new tunable merge parameters in Lucene 5 ?
> >
> > -Selva


Re: Lucene 5 : any merge performance metrics compared to 4.x?

Posted by "McKinley, James T" <ja...@cengage.com>.
Yes, the indexing workflow is completely separate from the runtime system.  The file system is EMC Isilon via NFS.

Jim


RE: Lucene 5 : any merge performance metrics compared to 4.x?

Posted by will martin <wm...@gmail.com>.
This sounds robust. Is the index batch creation workflow a separate process?
Does it use a distributed shared filesystem?

--will


Re: Lucene 5 : any merge performance metrics compared to 4.x?

Posted by "McKinley, James T" <ja...@cengage.com>.
Hi Adrien and Will,

Thanks for your responses.  I work with Selva and he's busy right now with other things, so I'll add some more context to his question in an attempt to improve clarity.

The merge in question is part of our batch indexing workflow, wherein we index new content for a given partition and then merge this new index with the big index of everything that was previously loaded on the given partition.  The increase in merge time we've seen since upgrading from 4.10 to 5.2 is on the order of 25%.  It varies from partition to partition, but 25% is a good ballpark estimate, I think.  Maybe our case is non-standard: we have a large number of fields (> 200).

The reason we perform an index check after the merge is that this is the final index state that will be used for a given batch.  Since we have a batch-oriented workflow, we are able to roll back to a previous batch if we find a problem with a given batch (Lucene or other problem).  However, due to disk space constraints, we can only keep a couple of batches.  If our indexing workflow completes without errors but the index is corrupt, we may not know right away, and we might delete the previous good batch thinking the latest batch is OK, which would be very bad, requiring a full reload of all our content.

Checking the index prior to the merge would no doubt catch many issues, but it might not catch corruption that occurs during the merge step itself, so we implemented a check step once the index is in its final state to ensure that it is OK.
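
The retention rule described above can be sketched as follows. This is only an illustration in plain Java, not code from our workflow: `BatchRotation`, `acceptNewBatch`, and the `indexIsHealthy` predicate are hypothetical names, and the predicate stands in for whatever post-merge check is run on the final index (for example, a wrapper around Lucene's CheckIndex). The idea is simply that the previous batch is deleted only after the merged index has passed its check.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: keeps the last known-good batch until the new one is verified.
final class BatchRotation {

    /**
     * Deletes the previous batch only if the new batch passes the supplied check.
     * Returns true if the new batch was accepted, false if it was rejected
     * (in which case the previous batch is left untouched for rollback).
     */
    static boolean acceptNewBatch(Path newBatch, Path previousBatch,
                                  Predicate<Path> indexIsHealthy) throws IOException {
        if (!indexIsHealthy.test(newBatch)) {
            return false;
        }
        deleteRecursively(previousBatch);
        return true;
    }

    /** Deletes a directory tree, children before parents. */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> deepestFirst;
        try (Stream<Path> paths = Files.walk(root)) {
            deepestFirst = paths.sorted(Comparator.reverseOrder())
                                .collect(Collectors.toList());
        }
        for (Path p : deepestFirst) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("batches");
        Path previous = Files.createDirectory(root.resolve("batch-1"));
        Path latest = Files.createDirectory(root.resolve("batch-2"));

        // Simulate a failed post-merge check: the previous batch must survive.
        boolean accepted = acceptNewBatch(latest, previous, dir -> false);
        System.out.println("accepted=" + accepted
                + ", previous kept=" + Files.exists(previous));
    }
}
```

Passing the check in as a predicate keeps the sketch independent of how the check is implemented; the ordering (verify first, delete second) is the only part that matters here.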

So, since we want to do the check post-merge, is there a way to disable the check during merge so we don't have to do two checks?

Thanks!

Jim 


RE: Lucene 5 : any merge performance metrics compared to 4.x?

Posted by will martin <wm...@gmail.com>.
So, if it's new, it adds to the pre-existing time? So it is a cost that needs to be understood, I think.

And, I'm really curious: what happens to the result of the post-merge checkIntegrity IFF (if and only if) there was corruption pre-merge? I mean, if you let it merge anyway, could you get a false positive for integrity?  [see the concept of lazy evaluation]

These are, imo, the kinds of engineering questions Selva's post raised in my triage mode of the scenario.


Re: Lucene 5 : any merge performance metrics compared to 4.x?

Posted by Adrien Grand <jp...@gmail.com>.
Indeed, this is new, but I'm a bit surprised this is the source of your
issues, as it should be much faster than the merge itself. I don't
understand your proposal to check the index after merge: the goal is to
make sure that we do not propagate corruptions, so isn't it better to check
the index before the merge starts, so that we don't even try to merge if
there are corruptions?
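
To illustrate the point about not propagating corruption, here is a toy sketch in plain Java. It is not Lucene code: `ToySegment`, `mergeUnchecked`, and `mergeChecked` are hypothetical names, and the model assumes the check is a plain checksum comparison (here `java.util.zip.CRC32` over the whole payload). Under that assumption, a merge that skips the input check writes a fresh checksum over whatever bytes it read, so a checksum-only check of the merged output would pass even though a damaged input went in.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Toy model of a segment: payload bytes plus the CRC32 recorded when it was written.
final class ToySegment {
    final byte[] data;
    final long storedChecksum;

    ToySegment(byte[] data, long storedChecksum) {
        this.data = data;
        this.storedChecksum = storedChecksum;
    }

    /** Writes a new segment, recording the checksum of the bytes as written. */
    static ToySegment write(byte[] data) {
        return new ToySegment(data.clone(), crc(data));
    }

    static long crc(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    /** True if the bytes still match the checksum recorded at write time. */
    boolean isIntact() {
        return crc(data) == storedChecksum;
    }

    /** Naive merge: concatenates the inputs and records a fresh checksum over the result. */
    static ToySegment mergeUnchecked(ToySegment a, ToySegment b) {
        byte[] merged = new byte[a.data.length + b.data.length];
        System.arraycopy(a.data, 0, merged, 0, a.data.length);
        System.arraycopy(b.data, 0, merged, a.data.length, b.data.length);
        return write(merged);
    }

    /** Merge that refuses to start if any input fails verification. */
    static ToySegment mergeChecked(ToySegment a, ToySegment b) {
        if (!a.isIntact() || !b.isIntact()) {
            throw new IllegalStateException("corrupt input segment; refusing to merge");
        }
        return mergeUnchecked(a, b);
    }

    public static void main(String[] args) {
        ToySegment existing = write("existing index".getBytes(StandardCharsets.UTF_8));
        ToySegment incoming = write("new batch".getBytes(StandardCharsets.UTF_8));

        existing.data[0] ^= 0x01; // simulate a flipped bit on disk

        ToySegment merged = mergeUnchecked(existing, incoming);
        System.out.println("input intact:  " + existing.isIntact());
        System.out.println("output intact: " + merged.isIntact());
    }
}
```

In this model the unchecked merge of a damaged segment produces an output that still reports itself intact, while `mergeChecked` refuses to start. A real post-merge check may do more than compare checksums, so this only shows why checking the inputs first is the safer order.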

On Tue, Sep 15, 2015 at 00:40, Selva Kumar <se...@gmail.com>
wrote:

> it appears Lucene 5.2 index merge is running checkIntegrity on existing
> index prior to merging additional indices.
> This seems to be new.
>
> We have an existing checkIndex but this is run post index merge.
>
> Two follow up questions :
> * Is there a way to turn off built-in checkIntegrity? Just for my understanding.
> No plan to turn this off.
> * Is running checkIntegrity prior to index merge better than running post
> merge?
>
>
> On Mon, Sep 14, 2015 at 12:24 PM, Selva Kumar
> <selva.kumar.at.work@gmail.com> wrote:
>
> > We observe some merge slowness after we migrated from 4.10 to 5.2.
> > Is this expected? Any new tunable merge parameters in Lucene 5 ?
> >
> > -Selva
> >
> >
>

Re: Lucene 5 : any merge performance metrics compared to 4.x?

Posted by Selva Kumar <se...@gmail.com>.
It appears that the Lucene 5.2 index merge runs checkIntegrity on the existing
index prior to merging additional indices.
This seems to be new.

We have an existing checkIndex, but it is run after the index merge.

Two follow-up questions:
* Is there a way to turn off the built-in checkIntegrity? Just for my
understanding; no plan to turn this off.
* Is running checkIntegrity prior to the index merge better than running it
post merge?


On Mon, Sep 14, 2015 at 12:24 PM, Selva Kumar <selva.kumar.at.work@gmail.com>
wrote:

> We observe some merge slowness after we migrated from 4.10 to 5.2.
> Is this expected? Any new tunable merge parameters in Lucene 5 ?
>
> -Selva
>
>