Posted to java-user@lucene.apache.org by Igor Shalyminov <is...@yandex-team.ru> on 2013/11/21 16:45:05 UTC

Lucene multithreaded indexing problems

Hello!

I tried to perform indexing in multiple threads, with a FixedThreadPool of Callable workers.
The main operation - parsing a single document and calling addDocument() on the index - is done by a single worker.
Parsing a document produces a lot (really a lot) of Strings, and at the end of the worker's call() all of them go to the indexWriter.
I use no merging; the resources are flushed to disk when the segment size limit is reached.

The problem is that after a little while (when most of the heap memory is used) the indexer makes no progress, and CPU load is a constant 100% (no difference whether there are 2 threads or 32). So I think at some point garbage collection brings the whole indexing process down.

Could you please give some advice on proper concurrent indexing with Lucene?
Can there be "memory leaks" somewhere in the indexWriter? Maybe I must perform some operations on the writer to release unused resources from time to time?


-- 
Best Regards,
Igor



Re: Lucene multithreaded indexing problems

Posted by Desidero <de...@gmail.com>.
Apparently writing emails first thing in the morning isn't always a great
idea. I forgot to address the questions you had at the end:

1) How many threads are used by the indexWriter?
     A maximum of IndexWriterConfig.maxThreadStates threads can write at
once.

2) When does it flush segments to disk?
     This can be set in a few different ways: you can specify a certain
number of documents or a RAM buffer size. The RAM buffer size is
preferred, since documents vary in size. Read up on
IndexWriterConfig.setRAMBufferSizeMB. (A configuration sketch follows
after these answers.)

3) Can I know when the indexWriter is done with my Document?
     The thread calling addDocument will block if the IndexWriter is
flushing to disk. When the method call returns, assume all is well (though
obviously there's no guarantee of a flush/commit during the addDocument
call).

4) Do I need to call commit() frequently?
     If you're doing mass indexing, you don't need to call commit
frequently. Call it as infrequently as possible - I split my data into a
few "shards" and only call commit once per shard at the very end of
reindexing. If you're doing live indexing with NRT updates, then that's up
to you and depends on a lot of other factors.

4b) [Do] I also need to keep segment size constant and use no merging?
     I assume this is related to the presence of the forceMerge method? If
so, don't use it unless you have a specific reason to do so. In order to
solve your current problem, I wouldn't worry about segment sizes and merge
policies. That comes later.
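
A minimal configuration sketch tying the above together (Lucene 4.x-era
APIs; the version constant, analyzer, path, and numbers are placeholder
assumptions, not recommendations):

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class WriterSetup {
        public static IndexWriter open(File indexDir) throws Exception {
            IndexWriterConfig iwc = new IndexWriterConfig(
                    Version.LUCENE_45, new StandardAnalyzer(Version.LUCENE_45));
            // Flush segments by RAM usage, not by document count:
            iwc.setRAMBufferSizeMB(256.0);
            iwc.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
            // At most this many threads index concurrently inside the writer:
            iwc.setMaxThreadStates(8);
            // During a mass reindex, call writer.commit() rarely --
            // e.g. once at the very end; close() also commits.
            return new IndexWriter(FSDirectory.open(indexDir), iwc);
        }
    }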


On Mon, Nov 25, 2013 at 11:48 AM, Desidero <de...@gmail.com> wrote:

> On Mon, Nov 25, 2013 at 9:19 AM, Igor Shalyminov <ishalyminov@yandex-team.ru> wrote:
>
>> 23.11.2013, 20:29, "Daniel Penning" <dp...@gamona.de>:
>>> G1 and CMS are both tuned primarily for low pauses, which is typically
>>> preferred for searching an index. In this case I guess that indexing
>>> throughput is preferred, in which case using ParallelGC might be the
>>> better choice.
>>>
>>> Am 23.11.2013 17:15, schrieb Uwe Schindler:
>>>
>>>> Hi,
>>>>
>>>> Maybe your heap size is just too big, so your JVM spends too much time
>>>> in GC? The setup you described in your last eMail is the "officially
>>>> supported" setup :-) Lucene has no problem with that setup and can
>>>> index. Be sure:
>>>> - Don't give too much heap to your indexing app. Larger heaps create
>>>> much more GC load.
>>>> - Use a suitable garbage collector (e.g. the Java 7 G1 collector or the
>>>> Java 6 CMS collector). Other garbage collectors may do GCs in a single
>>>> thread ("stop-the-world").
>>>>
>>>> Uwe
>>>>
>>>>> From: Igor Shalyminov [mailto:ishalyminov@yandex-team.ru]
>>>>> Sent: Saturday, November 23, 2013 4:46 PM
>>>>>
>>>>> So we return to the initially described setup: multiple parallel
>>>>> workers, each doing "parse + indexWriter.addDocument()" for single
>>>>> documents with no synchronization on my side. This setup was also bad
>>>>> on memory consumption and thread blocking, as I reported.
>>>>>
>>>>> Or did I misunderstand you?
>>>>>
>>>>> --
>>>>> Igor
>>>>>
>>>>> 22.11.2013, 23:34, "Uwe Schindler" <uw...@thetaphi.de>:
>>>>>> Hi,
>>>>>> Don't use addDocuments. This method is made for so-called block
>>>>>> indexing (where all documents need to be in one block, for block
>>>>>> joins). Call addDocument for each document, possibly from many
>>>>>> threads. That way Lucene can better handle multithreading and free
>>>>>> memory early. There is really no need to use bulk adds; they are
>>>>>> solely for block joins, where docs need to be sequential and without
>>>>>> gaps.
>>>>>> Uwe
>>>>>>
>>>>>> Igor Shalyminov <is...@yandex-team.ru> schrieb:
>>>>>>> Thanks Uwe!
>>>>>>>
>>>>>>> I changed the logic so that my workers only parse input docs into
>>>>>>> Documents, and the indexWriter does addDocuments() by itself for
>>>>>>> chunks of 100 Documents.
>>>>>>> Unfortunately, the behaviour reproduces: memory usage slightly
>>>>>>> increases with the number of processed documents, at some point the
>>>>>>> program runs very slowly, and it seems that only a single thread is
>>>>>>> active. It happens after lots of parse/index cycles.
>>>>>>>
>>>>>>> The current instance is now in the "single-thread" phase with ~100%
>>>>>>> CPU and 8397M RES memory (the limit for the VM is -Xmx8G).
>>>>>>> My question is: when does addDocuments() release the resources
>>>>>>> passed in (the Documents themselves)?
>>>>>>> Are the resources released when the call returns, or do I have to
>>>>>>> call indexWriter.commit() after, say, each chunk?
>>>>>>>
>>>>>>> --
>>>>>>> Igor
>>>>>>>
>>>>>>> 21.11.2013, 19:59, "Uwe Schindler" <uw...@thetaphi.de>:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Why are you doing this? Lucene's IndexWriter can handle
>>>>>>>> addDocuments in multiple threads. And, since Lucene 4, it will
>>>>>>>> process them almost completely in parallel!
>>>>>>>> If you do the addDocuments single-threaded, you are adding an
>>>>>>>> additional bottleneck to your application. If you are doing
>>>>>>>> synchronization on IndexWriter (which I hope you will not do),
>>>>>>>> things will go wrong, too.
>>>>>>>>
>>>>>>>> Uwe

Re: Lucene multithreaded indexing problems

Posted by Desidero <de...@gmail.com>.
Providing your system specs, parallelism (# of writer threads in a
specific example), and any other special values you set (RAMBufferSizeMB,
maxThreadStates, etc.) would be helpful.

--

I do most of my development on a 64-core machine. When doing a full
reindex, I have 64 worker threads doing concurrent indexing. I have a
simple queuing mechanism in place that only allows ${maxThreadStates}
threads to add documents (and subsequently potentially block during disk
flushes), and I set RAMBufferSizeMB appropriately for my system and the
data that I'm processing. There are obviously lots of other settings that
affect the process, but it's easier to go into those when you explain how
you're doing things...
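
A sketch of that queuing idea (names like parse() are hypothetical; the
semaphore just caps the number of threads that can sit inside
addDocument() at maxThreadStates, so the rest keep parsing instead of
blocking inside Lucene):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;
    import java.util.concurrent.TimeUnit;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class BoundedIndexer {
        public static void indexAll(final IndexWriter writer,
                                    Iterable<String> rawDocs,
                                    int workers,
                                    int maxThreadStates) throws Exception {
            final Semaphore writePermits = new Semaphore(maxThreadStates);
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            for (final String raw : rawDocs) {
                pool.submit(new Runnable() {
                    public void run() {
                        Document doc = parse(raw); // CPU-heavy work, unbounded
                        try {
                            writePermits.acquire(); // cap concurrent writers
                            try {
                                writer.addDocument(doc);
                            } finally {
                                writePermits.release();
                            }
                        } catch (Exception e) {
                            throw new RuntimeException(e);
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
        }

        private static Document parse(String raw) {
            return new Document(); // hypothetical: build fields from raw input
        }
    }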

When I'm doing extremely CPU-heavy processing of input (read: Analyzer
work), I get a nice wavy heap graph in VisualVM and no "stop the world"
collections occur. When I'm doing light processing prior to adding
documents to the index and the JVM isn't tuned properly, I end up creating
garbage extremely quickly and sometimes I do get some stop the world
collections. They are not fun, but the system always recovers after a
little while. I've never seen issues indicative of a memory leak in Lucene.





RE: Lucene multithreaded indexing problems

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

> For now I use such a large RAM buffer to make the segments equally sized
> without merging (it should be good for multithreaded search).
> Can one get such segments with the help of merging (thus keeping the
> indexWriter RAM buffer as small as one wants) using some existing Lucene
> MergePolicy?

If you don't want to merge at all, there is no other way than making the indexer buffer that big. But what is your problem with merging? TieredMergePolicy will merge segments to reasonable sizes for searching. I would suggest tuning the merge policy to merge smaller segments earlier.

For multithreaded search you are right, the segments should be equally sized! But you can tune the merge policy (Tiered is quite good) to do this. Not doing any merges does not really help for that; it is better to let merging do its job and equalize the segment sizes, so you can also reduce the indexer buffers. If you need multiple indexing threads, you cannot create very big segments without using too much RAM. This is one big change in Lucene 4: every indexing thread creates a separate segment, which is eventually merged - and you have disallowed merging.
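
Something like this, as a sketch (the numbers are only placeholders, and
"iwc" is assumed to be your IndexWriterConfig; TieredMergePolicy's
defaults are often fine already):

    import org.apache.lucene.index.TieredMergePolicy;

    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setMaxMergedSegmentMB(5 * 1024); // cap merged segments at ~5 GB
    tmp.setSegmentsPerTier(10.0);        // segments per tier before a merge
    tmp.setMaxMergeAtOnce(10);           // segments merged together at once
    iwc.setMergePolicy(tmp);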

Uwe



Re: Lucene multithreaded indexing problems

Posted by Igor Shalyminov <is...@yandex-team.ru>.
Thanks all!

It's strange, but with the indexWriter's maxThreadStates limited to 1 (and all the other variables left intact) the indexer finally worked without OOMs.
Am I right that the indexWriter, no matter how many threads it uses, stores at most RAMBufferSizeMB of data in memory (512.0 in my case)?

It seems that multiple threads create a noticeable memory overhead (or it's somewhat hard for a multithreaded indexWriter to flush a segment, I don't know...).

For now I use such a large RAM buffer to make the segments equally sized without merging (it should be good for multithreaded search).
Can one get such segments with the help of merging (thus keeping the indexWriter RAM buffer as small as one wants) using some existing Lucene MergePolicy?


-- 
Igor



RE: Lucene multithreaded indexing problems

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

> But here's what I have.
>
> Today I looked at the indexer in VisualVM, and I can definitely say that
> the problem is in memory: the resources (which are mostly Document
> fields) just don't go away.
> I tried different GCs (Parallel, CMS, the default one), and every time the
> behaviour is the same.
> As I pass my Documents into the indexWriter, I forget about them (the
> references are all local-scope), so I think the resources are stuck
> somewhere in the writer.

That is strange! Are you sure this is the case? Maybe you are using Readers in your Fields?

> I now wonder how I can see:
> - how many threads are used by the indexWriter?

As many threads as you use for indexing, up to maxThreadStates. If you use more threads, addDocument() will block.

> - when does it flush segments to disk?

This depends on several settings. But while the flush happens, the documents are no longer referenced! The work of analyzing the documents is done in the addDocument() call. When the method returns, the document is no longer in use.

> Can I also know whether the indexWriter is done with my Document? Is
> the addDocument() operation synchronous in some way?

When addDocument() returns, the Document is no longer referenced. Ideally you can reuse the Document/Field instances!
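
A small sketch of that reuse pattern (the field names and the
ParsedRecord type are hypothetical; "writer" is your IndexWriter):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    // One reusable Document per indexing thread:
    Document doc = new Document();
    Field idField = new StringField("id", "", Field.Store.YES);
    Field bodyField = new TextField("body", "", Field.Store.NO);
    doc.add(idField);
    doc.add(bodyField);

    for (ParsedRecord rec : records) {
        idField.setStringValue(rec.id);   // overwrite values, no new Fields
        bodyField.setStringValue(rec.text);
        writer.addDocument(doc);          // doc is reusable once this returns
    }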

> Do I need to call commit() frequently (I also need to keep segment size
> constant and use no merging)?

Write your own MergePolicy to control how merging is done.
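
(If you truly want no merging at all, you don't even need to write your
own policy -- a sketch using what Lucene 4.x already ships, again with
"iwc" as your IndexWriterConfig:)

    import org.apache.lucene.index.NoMergePolicy;
    import org.apache.lucene.index.NoMergeScheduler;

    iwc.setMergePolicy(NoMergePolicy.COMPOUND_FILES); // never select merges
    iwc.setMergeScheduler(NoMergeScheduler.INSTANCE); // no merge threads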



Re: Lucene multithreaded indexing problems

Posted by Igor Shalyminov <is...@yandex-team.ru>.
Thank you!

But here's what I have.

Today I looked at the indexer in VisualVM, and I can definitely say that the problem is in memory: the resources (which are mostly Document fields) just don't go away.
I tried different GCs (Parallel, CMS, the default one), and every time the behaviour is the same.
As I pass my Documents into the indexWriter, I forget about them (the references are all local-scope), so I think the resources are stuck somewhere in the writer.

I now wonder how I can see:
- how many threads are used by the indexWriter?
- when does it flush segments to disk?

Can I also know whether the indexWriter is done with my Document? Is the addDocument() operation synchronous in some way?
Do I need to call commit() frequently (I also need to keep segment size constant and use no merging)?

-- 
Igor



Re: Lucene multithreaded indexing problems

Posted by Daniel Penning <dp...@gamona.de>.
G1 and CMS are both tuned primarily for low pauses, which is typically 
preferred when searching an index. In this case I guess indexing 
throughput is preferred instead, in which case ParallelGC might be the 
better choice.
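
For a bulk-indexing run that would mean starting the JVM with something like this (a hypothetical command line; the jar name and heap size are placeholders):

    java -Xmx2g -XX:+UseParallelGC -XX:+UseParallelOldGC -jar indexer.jar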



RE: Lucene multithreaded indexing problems

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

Maybe your heap size is just too big, so your JVM spends too much time in GC? The setup you described in your last eMail is the "officially supported" setup :-) Lucene has no problem with that setup and can index. Be sure:
- Don't give too much heap to your indexing app. Larger heaps create much more GC load.
- Use a suitable garbage collector (e.g. the Java 7 G1 collector or the Java 6 CMS collector); a sketch of both options follows below. Other garbage collectors may do GCs in a single thread ("stop-the-world").
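
For example (hypothetical command lines; the jar name and heap size depend on your setup):

    java -Xmx2g -XX:+UseG1GC -jar indexer.jar                # Java 7: G1
    java -Xmx2g -XX:+UseConcMarkSweepGC -jar indexer.jar     # Java 6: CMS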

Uwe
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de




Re: Lucene multithreaded indexing problems

Posted by Daniel Penning <dp...@gamona.de>.
Maybe you should turn on garbage collection logging to confirm that you 
are running into some kind of memory problem (start the JVM with -verbose:gc).
If the GC is running very often as soon as your indexing process slows 
down, I would suggest creating a heap dump and checking what the memory 
is used for.
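
For example (hypothetical commands; the jar name is a placeholder):

    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar indexer.jar
    jmap -dump:live,format=b,file=heap.hprof <pid>    # heap dump of the running JVM; open it in VisualVM or MAT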



Re: Lucene multithreaded indexing problems

Posted by Igor Shalyminov <is...@yandex-team.ru>.
So we return to the setup I described initially: multiple parallel workers, each doing "parse + indexWriter.addDocument()" for single documents, with no synchronization on my side. That setup was also bad on memory consumption and thread blocking, as I reported.

Or did I misunderstand you?

-- 
Igor



Re: Lucene multithreaded indexing problems

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,
Don't use addDocuments(). That method is meant for so-called block indexing (where all documents need to be in one block, for block joins). Call addDocument() for each document, possibly from many threads. That way Lucene can better handle multithreading and free memory early. There is really no need for bulk adds; they exist solely for block joins, where documents need to be sequential and without gaps.
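
In code, that pattern looks roughly like this (a minimal sketch; the worker count, the field name, and the TextField standing in for your real parsing are assumptions):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;

    public class ConcurrentIndexer {
        // Each worker parses one input and calls addDocument() itself -- no shared queue, no locking.
        public static void indexAll(final IndexWriter writer, Iterable<String> rawDocs)
                throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(8);
            for (final String raw : rawDocs) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            Document doc = new Document();  // build the Document inside the worker
                            doc.add(new TextField("body", raw, Field.Store.NO));
                            writer.addDocument(doc);        // IndexWriter is thread-safe
                        } catch (Exception e) {
                            throw new RuntimeException(e);
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
        }
    }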

Uwe

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de

Re: Lucene multithreaded indexing problems

Posted by Igor Shalyminov <is...@yandex-team.ru>.
- uwe@

Thanks Uwe!

I changed the logic so that my workers only parse input docs into Documents, and the indexWriter does addDocuments() by itself for chunks of 100 Documents.
Unfortunately, the behaviour reproduces: memory usage slowly increases with the number of processed documents, and at some point the program runs very slowly, and it seems that only a single thread is active.
It happens after lots of parse/index cycles.

The current instance is now in the "single-thread" phase, with ~100% CPU and 8397M RES memory (the limit for the VM is -Xmx8G).
My question is: when does addDocuments() release the resources passed in (the Documents themselves)?
Are the resources released when the call returns, or do I have to call indexWriter.commit() after, say, each chunk?

-- 
Igor



RE: Lucene multithreaded indexing problems

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

why are you doing this? Lucene's IndexWriter can handle addDocument() calls from multiple threads. And, since Lucene 4, it will process them almost completely in parallel!
If you do the additions single-threaded, you are adding an extra bottleneck to your application. If you are synchronizing on the IndexWriter (which I hope you are not), things will go wrong, too.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

