Posted to java-user@lucene.apache.org by Bernd Fehling <be...@uni-bielefeld.de> on 2016/08/04 07:14:21 UTC

Re: BufferedUpdateStreams breaks high performance indexing

After updating to version 5.5.3 it looks good now.
Thanks a lot for your help and advice.

Best regards
Bernd

On 29.07.2016 at 15:04, Michael McCandless wrote:
> The deleted terms accumulate whenever you use updateDocument(Term, Doc), or
> when you do deleteDocuments(Term).
> 
> Deleted queries are when you delete by query, but I don't think DIH would
> be doing that unless you asked it to ... maybe a Solr user/dev knows better?
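
To illustrate why an add-only import can still accumulate deleted terms, here is a minimal toy model in plain Java. It is not Lucene code: the class BufferedDeletesSketch and its methods are hypothetical names that only mimic the bookkeeping described above, assuming that an update is recorded as a delete-by-term followed by an add.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only, NOT Lucene code: mimics add vs. update bookkeeping. */
public class BufferedDeletesSketch {
    /** Ids of documents buffered for the next flush. */
    private final List<String> bufferedDocs = new ArrayList<>();
    /** Delete terms buffered until deletes are applied. */
    private final Set<String> bufferedDeleteTerms = new HashSet<>();

    /** Plain add: nothing is recorded on the delete side. */
    public void addDocument(String id) {
        bufferedDocs.add(id);
    }

    /** Update = delete-by-term, then add, so each call buffers a delete term. */
    public void updateDocument(String id) {
        bufferedDeleteTerms.add(id);
        bufferedDocs.add(id);
    }

    public int bufferedDocCount() {
        return bufferedDocs.size();
    }

    public int bufferedDeleteTermCount() {
        return bufferedDeleteTerms.size();
    }

    public static void main(String[] args) {
        BufferedDeletesSketch viaAdd = new BufferedDeletesSketch();
        BufferedDeletesSketch viaUpdate = new BufferedDeletesSketch();
        for (int i = 0; i < 1000; i++) {
            String id = "doc-" + i; // unique ids, no duplicates
            viaAdd.addDocument(id);
            viaUpdate.updateDocument(id);
        }
        System.out.println("addDocument:    delete terms = " + viaAdd.bufferedDeleteTermCount());
        System.out.println("updateDocument: delete terms = " + viaUpdate.bufferedDeleteTermCount());
    }
}
```

In this model both writers end up with the same documents, but only the update path carries one pending delete term per document, even though no document is ever actually replaced.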
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Fri, Jul 29, 2016 at 3:21 AM, Bernd Fehling <
> bernd.fehling@uni-bielefeld.de> wrote:
> 
>> Yes, with the default of 10 it performs much better.
>> I didn't take into account that DIH uses updateDocument for adding new
>> documents, but after thinking about the "why" I assume this is because
>> DIH cannot know whether a document already exists in the index.
>> Conclusion: using DIH and setting segmentsPerTier to a high value is a
>> killer.
>>
>> One question still remains about messages in INFOSTREAM. I have lines
>> saying:
>> BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345 deleted queries bytesUsed=2313024 delGen=2265 packetCount=69 totBytesUsed=262526720
>> ...
>> BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[ 97145 deleted terms (unique count=0) 97142 deleted queries bytesUsed=3108576]; coalesced deletes=[CoalescedUpdates(termSets1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)] newDelCount=0
>>
>> Do you know what these deleted terms and deleted queries are?
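
The figures in the first INFOSTREAM line can be put into perspective with some quick arithmetic. The sketch below is plain Java with hypothetical helper names (DeleteBufferMath, bytesPerBufferedDelete, toMiB). It assumes that bytesUsed covers the whole packet, terms and queries together, so the per-delete figure is only a rough average, and it makes no claim about how Lucene charges these bytes against ramBufferSizeMB.

```java
import java.util.Locale;

/** Back-of-the-envelope arithmetic on the INFOSTREAM numbers quoted above. */
public class DeleteBufferMath {
    private static final long MIB = 1024L * 1024L;

    /** Average bytes held per buffered delete in one packet. */
    public static double bytesPerBufferedDelete(long bytesUsed, long deleteCount) {
        return (double) bytesUsed / deleteCount;
    }

    /** Converts a byte count to mebibytes. */
    public static double toMiB(long bytes) {
        return (double) bytes / MIB;
    }

    public static void main(String[] args) {
        long bytesUsed = 2_313_024L;      // bytesUsed from the "push deletes" line
        long deletedTerms = 24_345L;      // deleted terms in that packet
        long totBytesUsed = 262_526_720L; // totBytesUsed across all 69 packets

        System.out.printf(Locale.ROOT, "about %.0f bytes per buffered delete%n",
                bytesPerBufferedDelete(bytesUsed, deletedTerms));
        System.out.printf(Locale.ROOT, "about %.0f MiB held by buffered deletes%n",
                toMiB(totBytesUsed));
    }
}
```

With the quoted values this comes to roughly 95 bytes per buffered delete and roughly 250 MiB in total, which is a sizable amount next to a configured ramBufferSizeMB of 1024.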
>>
>> Best regards,
>> Bernd
>>
>>
>> On 28.07.2016 at 17:34, Michael McCandless wrote:
>>> Hmm, your merge policy changes are dangerous: they will cause too many
>>> segments in the index, which makes applying deletes take longer.
>>>
>>> Can you revert that and re-test?
>>>
>>> I'm not sure why DIH is using updateDocument instead of addDocument ...
>>> maybe ask on the solr-user list?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
>>> bernd.fehling@uni-bielefeld.de> wrote:
>>>
>>>> Currently I use concurrent DIH but will write some SolrJ for testing
>>>> or even as a replacement for DIH.
>>>> I don't know what DIH does internally when only documents are added.
>>>>
>>>> I have not tried any newer release yet, but after reading LUCENE-6161 I
>>>> really should.
>>>> At least a version > 5.1.
>>>> Maybe before writing some SolrJ.
>>>>
>>>>
>>>> Yes, IndexWriterConfig is changed from the defaults:
>>>> <indexConfig>
>>>>     <maxIndexingThreads>8</maxIndexingThreads>
>>>>     <ramBufferSizeMB>1024</ramBufferSizeMB>
>>>>     <maxBufferedDocs>-1</maxBufferedDocs>
>>>>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>>>       <int name="maxMergeAtOnce">8</int>
>>>>       <int name="segmentsPerTier">100</int>
>>>>       <int name="maxMergedSegmentMB">512</int>
>>>>     </mergePolicy>
>>>>     <mergeFactor>8</mergeFactor>
>>>>     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>>>>     <lockType>${solr.lock.type:native}</lockType>
>>>>     ...
>>>> </indexConfig>
>>>>
>>>> A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
>>>> Somewhere between 20 and 50 characters in length.
>>>>
>>>> Thanks for your help,
>>>> Bernd
>>>>
>>>>
>>>> On 28.07.2016 at 15:35, Michael McCandless wrote:
>>>>> Hmm not good.
>>>>>
>>>>> If you are really only adding documents, you should be using
>>>>> IndexWriter.addDocument, which won't buffer any deleted terms, and that
>>>>> method call (applyDeletesAndUpdates) should be a no-op.  It also makes
>>>>> flushes more efficient since all of your indexing buffer goes to the
>>>>> added documents, not buffered delete terms.  Are you using updateDocument?
>>>>>
>>>>> Can you reproduce this slowness on a newer release?  There have been
>>>>> performance issues fixed in newer releases in this method, e.g.
>>>>> https://issues.apache.org/jira/browse/LUCENE-6161
>>>>>
>>>>> Have you changed any IndexWriterConfig settings from defaults?
>>>>>
>>>>> What are your unique id fields like?  How many bytes in length?
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>> On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
>>>>> bernd.fehling@uni-bielefeld.de> wrote:
>>>>>
>>>>>> While trying to get higher indexing performance, it turned out that
>>>>>> BufferedUpdatesStream is breaking indexing performance:
>>>>>> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
>>>>>>
>>>>>> In IndexWriterConfig I have setRAMBufferSizeMB=1024, and the Lucene 4.10.4
>>>>>> API states:
>>>>>> "Determines the amount of RAM that may be used for buffering added
>>>>>> documents and deletions before they are flushed to the Directory.
>>>>>> Generally for faster indexing performance it's best to flush by RAM
>>>>>> usage instead of document count and use as large a RAM buffer as you
>>>>>> can."
>>>>>>
>>>>>> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
>>>>>>
>>>>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
>>>>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
>>>>>>
>>>>>> That is about 56 minutes with no indexing, only applying deletes.
>>>>>> What is it deleting?
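
As a quick sanity check of the "about 56 minutes" figure, the reported 3411845 msec can be converted with java.time.Duration. This is a minimal sketch; the class and method names are hypothetical and only serve the illustration.

```java
import java.time.Duration;

/** Converts the applyDeletes time reported by INFOSTREAM into h/m/s. */
public class ApplyDeletesDuration {
    /** Formats a millisecond count as "Hh Mm Ss" (fractions of a second dropped). */
    public static String format(long millis) {
        Duration d = Duration.ofMillis(millis);
        long hours = d.toHours();
        long minutes = d.toMinutes() % 60;
        long seconds = d.getSeconds() % 60;
        return hours + "h " + minutes + "m " + seconds + "s";
    }

    public static void main(String[] args) {
        // Value taken from the "applyDeletes took 3411845 msec" line.
        System.out.println(format(3_411_845L)); // prints "0h 56m 51s"
    }
}
```

The result agrees with the two log timestamps (13:42:03 and 14:38:55), which are about 56 minutes and 52 seconds apart at one-second resolution.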
>>>>>>
>>>>>> As the index gets bigger the time gets longer; currently it is 2.5 hours
>>>>>> of waiting.
>>>>>> I'm adding 96 million docs with unique ids: no duplicates, only adds, no
>>>>>> deletes.
>>>>>>
>>>>>> Any suggestions on which config is _really_ suited to high-performance
>>>>>> indexing?
>>>>>>
>>>>>> Best regards,
>>>>>> Bernd
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> --
>> *************************************************************
>> Bernd Fehling                    Bielefeld University Library
>> Dipl.-Inform. (FH)                LibTec - Library Technology
>> Universitätsstr. 25                  and Knowledge Management
>> 33615 Bielefeld
>> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>>
>> BASE - Bielefeld Academic Search Engine - www.base-search.net
>> *************************************************************

Re: BufferedUpdateStreams breaks high performance indexing

Posted by Michael McCandless <lu...@mikemccandless.com>.
Wonderful, thanks for bringing closure!

Mike McCandless

http://blog.mikemccandless.com

On Thu, Aug 4, 2016 at 3:14 AM, Bernd Fehling <
bernd.fehling@uni-bielefeld.de> wrote:

> After updating to version 5.5.3 it looks good now.
> Thanks a lot for your help and advice.
>
> Best regards
> Bernd