Posted to java-user@lucene.apache.org by Bernd Fehling <be...@uni-bielefeld.de> on 2016/07/28 09:01:09 UTC

BufferedUpdateStreams breaks high performance indexing

While trying to get higher performance for indexing it turned out that
BufferedUpdateStreams is breaking indexing performance.
public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)

At IndexWriterConfig I have setRAMBufferSizeMB=1024 and the Lucene 4.10.4 API states:
"Determines the amount of RAM that may be used for buffering added
documents and deletions before they are flushed to the Directory.
Generally for faster indexing performance it's best to flush by RAM
usage instead of document count and use as large a RAM buffer as you can."

Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
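
In plain Lucene API terms that setup is roughly the following (a sketch
only; the analyzer and index path are placeholders):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    Directory dir = FSDirectory.open(new File("/path/to/index"));
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LATEST, new StandardAnalyzer());
    iwc.setRAMBufferSizeMB(1024.0);                                      // flush by RAM usage
    iwc.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);        // -1: no doc-count flush
    iwc.setMaxBufferedDeleteTerms(IndexWriterConfig.DISABLE_AUTO_FLUSH); // -1: no term-count flush
    IndexWriter writer = new IndexWriter(dir, iwc);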

BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec

About 56 minutes no indexing and only applying deletes.
What is it deleting?

If the index gets bigger the time gets longer, currently 2.5 hours of waiting.
I'm adding 96 million docs with unique ids, no duplicates, only adds, no deletes.

Any suggestions on which config _really_ works for high performance indexing?

Best regards,
Bernd

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: BufferedUpdateStreams breaks high performance indexing

Posted by Michael McCandless <lu...@mikemccandless.com>.
Hmm not good.

If you are really only adding documents, you should be using
IndexWriter.addDocument, which won't buffer any deleted terms and that
method call should be a no-op.  It also makes flushes more efficient since
all of your indexing buffer goes to the added documents, not buffered
delete terms.  Are you using updateDocument?
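
The difference in code is roughly this (a sketch; "id" stands for your
unique key field and writer for the IndexWriter):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.Term;

    Document doc = new Document();
    doc.add(new StringField("id", "some-unique-id", Field.Store.YES));

    // Pure append: buffers no delete term, so applyDeletes has nothing to do.
    writer.addDocument(doc);

    // Add-or-replace: buffers a delete-by-term on every call, even when the
    // id has never been indexed -- these buffered terms are what applyDeletes
    // later has to resolve against every segment.
    writer.updateDocument(new Term("id", "some-unique-id"), doc);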

Can you reproduce this slowness on a newer release?  There have been
performance issues fixed in newer releases in this method, e.g
https://issues.apache.org/jira/browse/LUCENE-6161

Have you changed any IndexWriterConfig settings from defaults?

What are your unique id fields like?  How many bytes in length?

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
bernd.fehling@uni-bielefeld.de> wrote:

> While trying to get higher performance for indexing it turned out that
> BufferedUpdateStreams is breaking indexing performance.
> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
>
> At IndexWriterConfig I have setRAMBufferSizeMB=1024 and the Lucene 4.10.4
> API states:
> "Determines the amount of RAM that may be used for buffering added
> documents and deletions before they are flushed to the Directory.
> Generally for faster indexing performance it's best to flush by RAM
> usage instead of document count and use as large a RAM buffer as you can."
>
> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
>
> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes:
> infos=...
> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took
> 3411845 msec
>
> About 56 minutes no indexing and only applying deletes.
> What is it deleting?
>
> If the index gets bigger the time gets longer, currently 2.5 hours of
> waiting.
> I'm adding 96 million docs with unique ids, no duplicates, only adds, no
> deletes.
>
> Any suggestions on which config _really_ works for high performance
> indexing?
>
> Best regards,
> Bernd
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: BufferedUpdateStreams breaks high performance indexing

Posted by Michael McCandless <lu...@mikemccandless.com>.
Hmm, your merge policy changes are dangerous: that will cause too many
segments in the index, which makes it longer to apply deletes.

Can you revert that and re-test?
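
For reference, reverting just means leaving TieredMergePolicy at its
defaults, roughly (a sketch; the values shown are the 4.10.x defaults):

    import org.apache.lucene.index.TieredMergePolicy;

    TieredMergePolicy mp = new TieredMergePolicy();
    mp.setSegmentsPerTier(10.0);        // default; 100 lets ~10x more segments pile up
    mp.setMaxMergeAtOnce(10);           // default
    mp.setMaxMergedSegmentMB(5 * 1024); // default 5 GB, vs. 512 MB in your config
    iwc.setMergePolicy(mp);             // iwc is your IndexWriterConfig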

I'm not sure why DIH is using updateDocument instead of addDocument ...
maybe ask on the solr-user list?

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
bernd.fehling@uni-bielefeld.de> wrote:

> Currently I use concurrent DIH but will write some SolrJ for testing
> or even as replacement for DIH.
> Don't know what's behind DIH if only documents are added.
>
> Not tried any newer release yet, but after reading LUCENE-6161 I really
> should.
> At least a version > 5.1
> Maybe before writing some SolrJ.
>
>
> Yes IndexWriterConfig is changed from default:
> <indexConfig>
>     <maxIndexingThreads>8</maxIndexingThreads>
>     <ramBufferSizeMB>1024</ramBufferSizeMB>
>     <maxBufferedDocs>-1</maxBufferedDocs>
>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>       <int name="maxMergeAtOnce">8</int>
>       <int name="segmentsPerTier">100</int>
>       <int name="maxMergedSegmentMB">512</int>
>     </mergePolicy>
>     <mergeFactor>8</mergeFactor>
>     <mergeScheduler
> class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>     <lockType>${solr.lock.type:native}</lockType>
>     ...
> </indexConfig>
>
> A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
> Somewhere between 20 and 50 characters in length.
>
> Thanks for your help,
> Bernd
>
>
> On 28.07.2016 at 15:35, Michael McCandless wrote:
> > Hmm not good.
> >
> > If you are really only adding documents, you should be using
> > IndexWriter.addDocument, which won't buffer any deleted terms and that
> > method call should be a no-op.  It also makes flushes more efficient
> since
> > all of your indexing buffer goes to the added documents, not buffered
> > delete terms.  Are you using updateDocument?
> >
> > Can you reproduce this slowness on a newer release?  There have been
> > performance issues fixed in newer releases in this method, e.g
> > https://issues.apache.org/jira/browse/LUCENE-6161
> >
> > Have you changed any IndexWriterConfig settings from defaults?
> >
> > What are your unique id fields like?  How many bytes in length?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
> > bernd.fehling@uni-bielefeld.de> wrote:
> >
> >> While trying to get higher performance for indexing it turned out that
> >> BufferedUpdateStreams is breaking indexing performance.
> >> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
> >>
> >> At IndexWriterConfig I have setRAMBufferSizeMB=1024 and the Lucene
> 4.10.4
> >> API states:
> >> "Determines the amount of RAM that may be used for buffering added
> >> documents and deletions before they are flushed to the Directory.
> >> Generally for faster indexing performance it's best to flush by RAM
> >> usage instead of document count and use as large a RAM buffer as you
> can."
> >>
> >> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
> >>
> >> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes:
> >> infos=...
> >> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes
> took
> >> 3411845 msec
> >>
> >> About 56 minutes no indexing and only applying deletes.
> >> What is it deleting?
> >>
> >> If the index gets bigger the time gets longer, currently 2.5 hours of
> >> waiting.
> >> I'm adding 96 million docs with unique ids, no duplicates, only adds, no
> >> deletes.
> >>
> >> Any suggestions on which config _really_ works for high performance
> >> indexing?
> >>
> >> Best regards,
> >> Bernd
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: BufferedUpdateStreams breaks high performance indexing

Posted by Michael McCandless <lu...@mikemccandless.com>.
The deleted terms accumulate whenever you use updateDocument(Term, Doc), or
when you do deleteDocuments(Term).

Deleted queries are when you delete by query, but I don't think DIH would
be doing that unless you asked it to ... maybe a Solr user/dev knows better?
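
In IndexWriter terms the two kinds look like this (a sketch; field names
are placeholders):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;

    // Buffers a deleted term -- updateDocument does the same internally:
    writer.deleteDocuments(new Term("id", "some-unique-id"));

    // Buffers a deleted query (delete-by-query):
    writer.deleteDocuments(new TermQuery(new Term("title", "obsolete")));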

Mike McCandless

http://blog.mikemccandless.com

On Fri, Jul 29, 2016 at 3:21 AM, Bernd Fehling <
bernd.fehling@uni-bielefeld.de> wrote:

> Yes, with default of 10 it performs very much better.
> I didn't take into account that DIH uses updateDocument for adding new
> documents but after thinking about the "why" I assume that
> this might be because you don't know if a document already exists in the
> index.
> Conclusion, using DIH and setting segmentsPerTier to a high value is a
> killer.
>
> One question still remains about messages in INFOSTREAM, I have lines
> saying
> BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345
> deleted queries
>            bytesUsed=2313024 delGen=2265 packetCount=69
> totBytesUsed=262526720
> ...
> BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[ 97145 deleted
> terms (unique count=0)
>            97142 deleted queries bytesUsed=3108576]; coalesced deletes=
>
>  [CoalescedUpdates(termSets1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)]
>             newDelCount=0
>
> Do you know what these deleted terms and deleted queries are?
>
> Best regards,
> Bernd
>
>
> On 28.07.2016 at 17:34, Michael McCandless wrote:
> > Hmm, your merge policy changes are dangerous: that will cause too many
> > segments in the index, which makes it longer to apply deletes.
> >
> > Can you revert that and re-test?
> >
> > I'm not sure why DIH is using updateDocument instead of addDocument ...
> > maybe ask on the solr-user list?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
> > bernd.fehling@uni-bielefeld.de> wrote:
> >
> >> Currently I use concurrent DIH but will write some SolrJ for testing
> >> or even as replacement for DIH.
> >> Don't know what's behind DIH if only documents are added.
> >>
> >> Not tried any newer release yet, but after reading LUCENE-6161 I really
> >> should.
> >> At least a version > 5.1
> >> Maybe before writing some SolrJ.
> >>
> >>
> >> Yes IndexWriterConfig is changed from default:
> >> <indexConfig>
> >>     <maxIndexingThreads>8</maxIndexingThreads>
> >>     <ramBufferSizeMB>1024</ramBufferSizeMB>
> >>     <maxBufferedDocs>-1</maxBufferedDocs>
> >>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> >>       <int name="maxMergeAtOnce">8</int>
> >>       <int name="segmentsPerTier">100</int>
> >>       <int name="maxMergedSegmentMB">512</int>
> >>     </mergePolicy>
> >>     <mergeFactor>8</mergeFactor>
> >>     <mergeScheduler
> >> class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
> >>     <lockType>${solr.lock.type:native}</lockType>
> >>     ...
> >> </indexConfig>
> >>
> >> A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
> >> Somewhere between 20 and 50 characters in length.
> >>
> >> Thanks for your help,
> >> Bernd
> >>
> >>
> >> On 28.07.2016 at 15:35, Michael McCandless wrote:
> >>> Hmm not good.
> >>>
> >>> If you are really only adding documents, you should be using
> >>> IndexWriter.addDocument, which won't buffer any deleted terms and that
> >>> method call should be a no-op.  It also makes flushes more efficient
> >> since
> >>> all of your indexing buffer goes to the added documents, not buffered
> >>> delete terms.  Are you using updateDocument?
> >>>
> >>> Can you reproduce this slowness on a newer release?  There have been
> >>> performance issues fixed in newer releases in this method, e.g
> >>> https://issues.apache.org/jira/browse/LUCENE-6161
> >>>
> >>> Have you changed any IndexWriterConfig settings from defaults?
> >>>
> >>> What are your unique id fields like?  How many bytes in length?
> >>>
> >>> Mike McCandless
> >>>
> >>> http://blog.mikemccandless.com
> >>>
> >>> On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
> >>> bernd.fehling@uni-bielefeld.de> wrote:
> >>>
> >>>> While trying to get higher performance for indexing it turned out that
> >>>> BufferedUpdateStreams is breaking indexing performance.
> >>>> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
> >>>>
> >>>> At IndexWriterConfig I have setRAMBufferSizeMB=1024 and the Lucene
> >> 4.10.4
> >>>> API states:
> >>>> "Determines the amount of RAM that may be used for buffering added
> >>>> documents and deletions before they are flushed to the Directory.
> >>>> Generally for faster indexing performance it's best to flush by RAM
> >>>> usage instead of document count and use as large a RAM buffer as you
> >> can."
> >>>>
> >>>> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
> >>>>
> >>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes:
> >>>> infos=...
> >>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes
> >> took
> >>>> 3411845 msec
> >>>>
> >>>> About 56 minutes no indexing and only applying deletes.
> >>>> What is it deleting?
> >>>>
> >>>> If the index gets bigger the time gets longer, currently 2.5 hours of
> >>>> waiting.
> >>>> I'm adding 96 million docs with unique ids, no duplicates, only adds, no
> >>>> deletes.
> >>>>
> >>>> Any suggestions on which config _really_ works for high performance
> >>>> indexing?
> >>>>
> >>>> Best regards,
> >>>> Bernd
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>>
> >>>>
> >>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
>
> --
> *************************************************************
> Bernd Fehling                    Bielefeld University Library
> Dipl.-Inform. (FH)                LibTec - Library Technology
> Universitätsstr. 25                  and Knowledge Management
> 33615 Bielefeld
> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>
> BASE - Bielefeld Academic Search Engine - www.base-search.net
> *************************************************************
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: BufferedUpdateStreams breaks high performance indexing

Posted by Michael McCandless <lu...@mikemccandless.com>.
Wonderful, thanks for bringing closure!

Mike McCandless

http://blog.mikemccandless.com

On Thu, Aug 4, 2016 at 3:14 AM, Bernd Fehling <
bernd.fehling@uni-bielefeld.de> wrote:

> After updating to version 5.5.3 it looks good now.
> Thanks a lot for your help and advice.
>
> Best regards
> Bernd
>
> On 29.07.2016 at 15:04, Michael McCandless wrote:
> > The deleted terms accumulate whenever you use updateDocument(Term, Doc),
> or
> > when you do deleteDocuments(Term).
> >
> > Deleted queries are when you delete by query, but I don't think DIH would
> > be doing that unless you asked it to ... maybe a Solr user/dev knows
> better?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Fri, Jul 29, 2016 at 3:21 AM, Bernd Fehling <
> > bernd.fehling@uni-bielefeld.de> wrote:
> >
> >> Yes, with default of 10 it performs very much better.
> >> I didn't take into account that DIH uses updateDocument for adding new
> >> documents but after thinking about the "why" I assume that
> >> this might be because you don't know if a document already exists in the
> >> index.
> >> Conclusion, using DIH and setting segmentsPerTier to a high value is a
> >> killer.
> >>
> >> One question still remains about messages in INFOSTREAM, I have lines
> >> saying
> >> BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345
> >> deleted queries
> >>            bytesUsed=2313024 delGen=2265 packetCount=69
> >> totBytesUsed=262526720
> >> ...
> >> BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[ 97145 deleted
> >> terms (unique count=0)
> >>            97142 deleted queries bytesUsed=3108576]; coalesced deletes=
> >>
> >>
> [CoalescedUpdates(termSets1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)]
> >>             newDelCount=0
> >>
> >> Do you know what these deleted terms and deleted queries are?
> >>
> >> Best regards,
> >> Bernd
> >>
> >>
> >> On 28.07.2016 at 17:34, Michael McCandless wrote:
> >>> Hmm, your merge policy changes are dangerous: that will cause too many
> >>> segments in the index, which makes it longer to apply deletes.
> >>>
> >>> Can you revert that and re-test?
> >>>
> >>> I'm not sure why DIH is using updateDocument instead of addDocument ...
> >>> maybe ask on the solr-user list?
> >>>
> >>> Mike McCandless
> >>>
> >>> http://blog.mikemccandless.com
> >>>
> >>> On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
> >>> bernd.fehling@uni-bielefeld.de> wrote:
> >>>
> >>>> Currently I use concurrent DIH but will write some SolrJ for testing
> >>>> or even as replacement for DIH.
> >>>> Don't know what's behind DIH if only documents are added.
> >>>>
> >>>> Not tried any newer release yet, but after reading LUCENE-6161 I
> really
> >>>> should.
> >>>> At least a version > 5.1
> >>>> Maybe before writing some SolrJ.
> >>>>
> >>>>
> >>>> Yes IndexWriterConfig is changed from default:
> >>>> <indexConfig>
> >>>>     <maxIndexingThreads>8</maxIndexingThreads>
> >>>>     <ramBufferSizeMB>1024</ramBufferSizeMB>
> >>>>     <maxBufferedDocs>-1</maxBufferedDocs>
> >>>>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> >>>>       <int name="maxMergeAtOnce">8</int>
> >>>>       <int name="segmentsPerTier">100</int>
> >>>>       <int name="maxMergedSegmentMB">512</int>
> >>>>     </mergePolicy>
> >>>>     <mergeFactor>8</mergeFactor>
> >>>>     <mergeScheduler
> >>>> class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
> >>>>     <lockType>${solr.lock.type:native}</lockType>
> >>>>     ...
> >>>> </indexConfig>
> >>>>
> >>>> A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
> >>>> Somewhere between 20 and 50 characters in length.
> >>>>
> >>>> Thanks for your help,
> >>>> Bernd
> >>>>
> >>>>
> >>>> On 28.07.2016 at 15:35, Michael McCandless wrote:
> >>>>> Hmm not good.
> >>>>>
> >>>>> If you are really only adding documents, you should be using
> >>>>> IndexWriter.addDocument, which won't buffer any deleted terms and
> that
> >>>>> method call should be a no-op.  It also makes flushes more efficient
> >>>> since
> >>>>> all of your indexing buffer goes to the added documents, not buffered
> >>>>> delete terms.  Are you using updateDocument?
> >>>>>
> >>>>> Can you reproduce this slowness on a newer release?  There have been
> >>>>> performance issues fixed in newer releases in this method, e.g
> >>>>> https://issues.apache.org/jira/browse/LUCENE-6161
> >>>>>
> >>>>> Have you changed any IndexWriterConfig settings from defaults?
> >>>>>
> >>>>> What are your unique id fields like?  How many bytes in length?
> >>>>>
> >>>>> Mike McCandless
> >>>>>
> >>>>> http://blog.mikemccandless.com
> >>>>>
> >>>>> On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
> >>>>> bernd.fehling@uni-bielefeld.de> wrote:
> >>>>>
> >>>>>> While trying to get higher performance for indexing it turned out
> that
> >>>>>> BufferedUpdateStreams is breaking indexing performance.
> >>>>>> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
> >>>>>>
> >>>>>> At IndexWriterConfig I have setRAMBufferSizeMB=1024 and the Lucene
> >>>> 4.10.4
> >>>>>> API states:
> >>>>>> "Determines the amount of RAM that may be used for buffering added
> >>>>>> documents and deletions before they are flushed to the Directory.
> >>>>>> Generally for faster indexing performance it's best to flush by RAM
> >>>>>> usage instead of document count and use as large a RAM buffer as you
> >>>> can."
> >>>>>>
> >>>>>> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
> >>>>>>
> >>>>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]:
> applyDeletes:
> >>>>>> infos=...
> >>>>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]:
> applyDeletes
> >>>> took
> >>>>>> 3411845 msec
> >>>>>>
> >>>>>> About 56 minutes no indexing and only applying deletes.
> >>>>>> What is it deleting?
> >>>>>>
> >>>>>> If the index gets bigger the time gets longer, currently 2.5 hours
> of
> >>>>>> waiting.
> >>>>>> I'm adding 96 million docs with unique ids, no duplicates, only adds, no
> >>>>>> deletes.
> >>>>>>
> >>>>>> Any suggestions on which config _really_ works for high performance
> >>>>>> indexing?
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Bernd
> >>>>>>
> >>>>>>
> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>>
> >>>>
> >>>
> >>
> >> --
> >> *************************************************************
> >> Bernd Fehling                    Bielefeld University Library
> >> Dipl.-Inform. (FH)                LibTec - Library Technology
> >> Universitätsstr. 25                  and Knowledge Management
> >> 33615 Bielefeld
> >> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
> >>
> >> BASE - Bielefeld Academic Search Engine - www.base-search.net
> >> *************************************************************
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
>
> --
> *************************************************************
> Bernd Fehling                    Bielefeld University Library
> Dipl.-Inform. (FH)                LibTec - Library Technology
> Universitätsstr. 25                  and Knowledge Management
> 33615 Bielefeld
> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>
> BASE - Bielefeld Academic Search Engine - www.base-search.net
> *************************************************************
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: BufferedUpdateStreams breaks high performance indexing

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
After updating to version 5.5.3 it looks good now.
Thanks a lot for your help and advice.

Best regards
Bernd

On 29.07.2016 at 15:04, Michael McCandless wrote:
> The deleted terms accumulate whenever you use updateDocument(Term, Doc), or
> when you do deleteDocuments(Term).
> 
> Deleted queries are when you delete by query, but I don't think DIH would
> be doing that unless you asked it to ... maybe a Solr user/dev knows better?
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Fri, Jul 29, 2016 at 3:21 AM, Bernd Fehling <
> bernd.fehling@uni-bielefeld.de> wrote:
> 
>> Yes, with default of 10 it performs very much better.
>> I didn't take into account that DIH uses updateDocument for adding new
>> documents but after thinking about the "why" I assume that
>> this might be because you don't know if a document already exists in the
>> index.
>> Conclusion, using DIH and setting segmentsPerTier to a high value is a
>> killer.
>>
>> One question still remains about messages in INFOSTREAM, I have lines
>> saying
>> BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345
>> deleted queries
>>            bytesUsed=2313024 delGen=2265 packetCount=69
>> totBytesUsed=262526720
>> ...
>> BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[ 97145 deleted
>> terms (unique count=0)
>>            97142 deleted queries bytesUsed=3108576]; coalesced deletes=
>>
>>  [CoalescedUpdates(termSets1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)]
>>             newDelCount=0
>>
>> Do you know what these deleted terms and deleted queries are?
>>
>> Best regards,
>> Bernd
>>
>>
>> On 28.07.2016 at 17:34, Michael McCandless wrote:
>>> Hmm, your merge policy changes are dangerous: that will cause too many
>>> segments in the index, which makes it longer to apply deletes.
>>>
>>> Can you revert that and re-test?
>>>
>>> I'm not sure why DIH is using updateDocument instead of addDocument ...
>>> maybe ask on the solr-user list?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
>>> bernd.fehling@uni-bielefeld.de> wrote:
>>>
>>>> Currently I use concurrent DIH but will write some SolrJ for testing
>>>> or even as replacement for DIH.
>>>> Don't know what's behind DIH if only documents are added.
>>>>
>>>> Not tried any newer release yet, but after reading LUCENE-6161 I really
>>>> should.
>>>> At least a version > 5.1
>>>> Maybe before writing some SolrJ.
>>>>
>>>>
>>>> Yes IndexWriterConfig is changed from default:
>>>> <indexConfig>
>>>>     <maxIndexingThreads>8</maxIndexingThreads>
>>>>     <ramBufferSizeMB>1024</ramBufferSizeMB>
>>>>     <maxBufferedDocs>-1</maxBufferedDocs>
>>>>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>>>       <int name="maxMergeAtOnce">8</int>
>>>>       <int name="segmentsPerTier">100</int>
>>>>       <int name="maxMergedSegmentMB">512</int>
>>>>     </mergePolicy>
>>>>     <mergeFactor>8</mergeFactor>
>>>>     <mergeScheduler
>>>> class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>>>>     <lockType>${solr.lock.type:native}</lockType>
>>>>     ...
>>>> </indexConfig>
>>>>
>>>> A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
>>>> Somewhere between 20 and 50 characters in length.
>>>>
>>>> Thanks for your help,
>>>> Bernd
>>>>
>>>>
>>>> On 28.07.2016 at 15:35, Michael McCandless wrote:
>>>>> Hmm not good.
>>>>>
>>>>> If you are really only adding documents, you should be using
>>>>> IndexWriter.addDocument, which won't buffer any deleted terms and that
>>>>> method call should be a no-op.  It also makes flushes more efficient
>>>> since
>>>>> all of your indexing buffer goes to the added documents, not buffered
>>>>> delete terms.  Are you using updateDocument?
>>>>>
>>>>> Can you reproduce this slowness on a newer release?  There have been
>>>>> performance issues fixed in newer releases in this method, e.g
>>>>> https://issues.apache.org/jira/browse/LUCENE-6161
>>>>>
>>>>> Have you changed any IndexWriterConfig settings from defaults?
>>>>>
>>>>> What are your unique id fields like?  How many bytes in length?
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>> On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
>>>>> bernd.fehling@uni-bielefeld.de> wrote:
>>>>>
>>>>>> While trying to get higher performance for indexing it turned out that
>>>>>> BufferedUpdateStreams is breaking indexing performance.
>>>>>> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
>>>>>>
>>>>>> At IndexWriterConfig I have setRAMBufferSizeMB=1024 and the Lucene
>>>> 4.10.4
>>>>>> API states:
>>>>>> "Determines the amount of RAM that may be used for buffering added
>>>>>> documents and deletions before they are flushed to the Directory.
>>>>>> Generally for faster indexing performance it's best to flush by RAM
>>>>>> usage instead of document count and use as large a RAM buffer as you
>>>> can."
>>>>>>
>>>>>> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
>>>>>>
>>>>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes:
>>>>>> infos=...
>>>>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes
>>>> took
>>>>>> 3411845 msec
>>>>>>
>>>>>> About 56 minutes no indexing and only applying deletes.
>>>>>> What is it deleting?
>>>>>>
>>>>>> If the index gets bigger the time gets longer, currently 2.5 hours of
>>>>>> waiting.
>>>>>> I'm adding 96 million docs with unique ids, no duplicates, only adds, no
>>>>>> deletes.
>>>>>>
>>>>>> Any suggestions on which config _really_ works for high performance
>>>>>> indexing?
>>>>>>
>>>>>> Best regards,
>>>>>> Bernd
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>
>>
>> --
>> *************************************************************
>> Bernd Fehling                    Bielefeld University Library
>> Dipl.-Inform. (FH)                LibTec - Library Technology
>> Universitätsstr. 25                  and Knowledge Management
>> 33615 Bielefeld
>> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>>
>> BASE - Bielefeld Academic Search Engine - www.base-search.net
>> *************************************************************
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 

-- 
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: BufferedUpdateStreams breaks high performance indexing

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
Yes, with default of 10 it performs very much better.
I didn't take into account that DIH uses updateDocument for adding new
documents but after thinking about the "why" I assume that
this might be because you don't know if a document already exists in the
index.
Conclusion, using DIH and setting segmentsPerTier to a high value is a killer.

One question still remains about messages in INFOSTREAM, I have lines saying
BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345 deleted queries
           bytesUsed=2313024 delGen=2265 packetCount=69 totBytesUsed=262526720
...
BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[ 97145 deleted terms (unique count=0)
           97142 deleted queries bytesUsed=3108576]; coalesced deletes=
           [CoalescedUpdates(termSets1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)]
            newDelCount=0

Do you know what these deleted terms and deleted queries are?

Best regards,
Bernd


On 28.07.2016 at 17:34, Michael McCandless wrote:
> Hmm, your merge policy changes are dangerous: that will cause too many
> segments in the index, which makes it longer to apply deletes.
> 
> Can you revert that and re-test?
> 
> I'm not sure why DIH is using updateDocument instead of addDocument ...
> maybe ask on the solr-user list?
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
> bernd.fehling@uni-bielefeld.de> wrote:
> 
>> Currently I use concurrent DIH but will write some SolrJ for testing
>> or even as replacement for DIH.
>> Don't know what's behind DIH if only documents are added.
>>
>> Not tried any newer release yet, but after reading LUCENE-6161 I really
>> should.
>> At least a version > 5.1
>> Maybe before writing some SolrJ.
>>
>>
>> Yes IndexWriterConfig is changed from default:
>> <indexConfig>
>>     <maxIndexingThreads>8</maxIndexingThreads>
>>     <ramBufferSizeMB>1024</ramBufferSizeMB>
>>     <maxBufferedDocs>-1</maxBufferedDocs>
>>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>       <int name="maxMergeAtOnce">8</int>
>>       <int name="segmentsPerTier">100</int>
>>       <int name="maxMergedSegmentMB">512</int>
>>     </mergePolicy>
>>     <mergeFactor>8</mergeFactor>
>>     <mergeScheduler
>> class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>>     <lockType>${solr.lock.type:native}</lockType>
>>     ...
>> </indexConfig>
>>
>> A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
>> Somewhere between 20 and 50 characters in length.
>>
>> Thanks for your help,
>> Bernd
>>
>>
>> On 28.07.2016 at 15:35, Michael McCandless wrote:
>>> Hmm not good.
>>>
>>> If you are really only adding documents, you should be using
>>> IndexWriter.addDocument, which won't buffer any deleted terms and that
>>> method call should be a no-op.  It also makes flushes more efficient
>> since
>>> all of your indexing buffer goes to the added documents, not buffered
>>> delete terms.  Are you using updateDocument?
>>>
>>> Can you reproduce this slowness on a newer release?  There have been
>>> performance issues fixed in newer releases in this method, e.g
>>> https://issues.apache.org/jira/browse/LUCENE-6161
>>>
>>> Have you changed any IndexWriterConfig settings from defaults?
>>>
>>> What are your unique id fields like?  How many bytes in length?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
>>> bernd.fehling@uni-bielefeld.de> wrote:
>>>
>>>> While trying to get higher performance for indexing it turned out that
>>>> BufferedUpdateStreams is breaking indexing performance.
>>>> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
>>>>
>>>> At IndexWriterConfig I have setRAMBufferSizeMB=1024 and the Lucene
>> 4.10.4
>>>> API states:
>>>> "Determines the amount of RAM that may be used for buffering added
>>>> documents and deletions before they are flushed to the Directory.
>>>> Generally for faster indexing performance it's best to flush by RAM
>>>> usage instead of document count and use as large a RAM buffer as you
>> can."
>>>>
>>>> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
>>>>
>>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes:
>>>> infos=...
>>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes
>> took
>>>> 3411845 msec
>>>>
>>>> About 56 minutes no indexing and only applying deletes.
>>>> What is it deleting?
>>>>
>>>> If the index gets bigger the time gets longer, currently 2.5 hours of
>>>> waiting.
>>>> I'm adding 96 million docs with unique ids, no duplicates, only adds, no
>>>> deletes.
>>>>
>>>> Any suggestions on which config _really_ works for high performance
>>>> indexing?
>>>>
>>>> Best regards,
>>>> Bernd
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 

-- 
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: BufferedUpdateStreams breaks high performance indexing

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
Currently I use concurrent DIH but will write some SolrJ for testing
or even as replacement for DIH.
Don't know what's behind DIH if only documents are added.

Not tried any newer release yet, but after reading LUCENE-6161 I really should.
At least a version > 5.1
Maybe before writing some SolrJ.
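
A first SolrJ sketch for pure adds might look like this (assuming SolrJ
5.x -- in 4.x the class is ConcurrentUpdateSolrServer -- with URL, core
name and field values as placeholders):

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // queue up to 10000 docs, drain with 8 background threads
    ConcurrentUpdateSolrClient client =
        new ConcurrentUpdateSolrClient("http://localhost:8983/solr/base", 10000, 8);

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "ftoxfordilej:ar.1770.x.x.13.x.x.u1");
    doc.addField("title", "...");
    client.add(doc);   // with a uniqueKey, Solr still routes this through updateDocument
    client.commit();
    client.close();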


Yes IndexWriterConfig is changed from default:
<indexConfig>
    <maxIndexingThreads>8</maxIndexingThreads>
    <ramBufferSizeMB>1024</ramBufferSizeMB>
    <maxBufferedDocs>-1</maxBufferedDocs>
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">8</int>
      <int name="segmentsPerTier">100</int>
      <int name="maxMergedSegmentMB">512</int>
    </mergePolicy>
    <mergeFactor>8</mergeFactor>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
    <lockType>${solr.lock.type:native}</lockType>
    ...
</indexConfig>

A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
Somewhere between 20 and 50 characters in length.

Thanks for your help,
Bernd


On 28.07.2016 at 15:35, Michael McCandless wrote:
> Hmm not good.
> 
> If you are really only adding documents, you should be using
> IndexWriter.addDocument, which won't buffer any deleted terms and that
> method call should be a no-op.  It also makes flushes more efficient since
> all of your indexing buffer goes to the added documents, not buffered
> delete terms.  Are you using updateDocument?
> 
> Can you reproduce this slowness on a newer release?  There have been
> performance issues fixed in newer releases in this method, e.g
> https://issues.apache.org/jira/browse/LUCENE-6161
> 
> Have you changed any IndexWriterConfig settings from defaults?
> 
> What are your unique id fields like?  How many bytes in length?
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
> bernd.fehling@uni-bielefeld.de> wrote:
> 
>> While trying to get higher performance for indexing it turned out that
>> BufferedUpdateStreams is breaking indexing performance.
>> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
>>
>> At IndexWriterConfig I have setRAMBufferSizeMB=1024 and the Lucene 4.10.4
>> API states:
>> "Determines the amount of RAM that may be used for buffering added
>> documents and deletions before they are flushed to the Directory.
>> Generally for faster indexing performance it's best to flush by RAM
>> usage instead of document count and use as large a RAM buffer as you can."
>>
>> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
>>
>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes:
>> infos=...
>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took
>> 3411845 msec
>>
>> About 56 minutes no indexing and only applying deletes.
>> What is it deleting?
>>
>> If the index gets bigger the time gets longer, currently 2.5 hours of
>> waiting.
>> I'm adding 96 million docs with unique ids, no duplicates, only adds, no
>> deletes.
>>
>> Any suggestions on which config _really_ works for high performance
>> indexing?
>>
>> Best regards,
>> Bernd
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org