Posted to java-user@lucene.apache.org by Simon McDuff <sm...@hotmail.com> on 2012/07/19 18:54:55 UTC

Flushing Thread

I see some behavior when I'm flushing and would like to know if I can change it.

One main thread is inserting; when a flush happens, that thread blocks.
Instead of blocking, could it spawn another thread to do the flush?

Basically, I would like to have one main thread adding documents to my index; if a flush needs to occur, it should spawn another thread and never block the main thread. Is that possible?

Is the only solution to have many threads indexing the data?
In that case, is it true to say that ONLY one of them will be busy indexing while the other is flushing? (I do understand that if my flushing takes too much time, they will both flush... :-))

Thank you!

Simon


Re: Flushing Thread

Posted by Simon Willnauer <si...@gmail.com>.
On Fri, Jul 20, 2012 at 2:43 PM, Simon McDuff <sm...@hotmail.com> wrote:
>
> Hi Simon W.,
> See comments below.
>> Date: Fri, 20 Jul 2012 11:49:03 +0200
>> Subject: Re: Flushing Thread
>> From: simon.willnauer@gmail.com
>> To: java-user@lucene.apache.org
>>
>> hey simon ;)
>>
>>
>> On Fri, Jul 20, 2012 at 2:29 AM, Simon McDuff <sm...@hotmail.com> wrote:
>> >
>> > Thank you Simon Willnauer!
>> >
>> > With your explanation, we've decided to control the flushing ourselves by spawning another thread, so the main thread stays available to ingest! :-) (Correct me if I'm wrong.) We do so by checking the RAM size reported by Lucene. (Thank you!) By setting the automatic flush threshold to 1000 MB and our own flush trigger to 900 MB, we know that the automatic flush "should" not happen.
>>
>> it should not. Yet, 1 GB is a large RAM buffer. In my tests I got much
>> better results with lowish RAM buffers like 256 MB, since that causes
>> flushes to happen more often and saturates the IO on the machine.
>> The general goal is to keep the RAM buffer at a level where you flush
>> almost constantly, i.e. you maximise the RAM buffer so that the next
>> flush is due just as you are done with the previous one. Does that
>> make sense?
> [SIMON M.] It makes sense for some use cases. In our case we have Fusion IO cards that write at 6 Gb/s, so we do not have contention for IO. Also, we use a larger RAM buffer to compress as much as possible (we have a lot of compression). (In fact we found that 500 MB was enough.)

cool man, I am jealous! :) I have to admit that the average use case is
not on Fusion IO :) I think for commodity hardware this makes lots of
sense though.
>
>>
>> > I know you contributed a lot to the concurrency feature! This is great! I was very excited to try it!
>> > We tried the following approaches:
>> > Option 1 - 6 threads sharing the same IndexWriter
>> > Option 2 - 6 threads, each with its own IndexWriter, with the indexes merged at the end
>> > Unfortunately, we found that option 2 scales better. I'm not sure why option 1 didn't scale. Is it possible that synchronization between threads is too costly? I don't have an answer, but it was definitely slower.
>>
>> can you provide the numbers and what you actually did in your experiment?
> [SIMON M.] I'm not at work today; I can provide these numbers Monday if you are still interested.

cool, I am totally interested. I will be on vacation until early August
so I might reply late.

>>
>> > With option 2, we are able to insert between 800,000 and 900,000 documents/sec (we've modified Lucene to remove some bottlenecks). The threads DO NOT ONLY index; they do other work before adding documents.
>>
>> what are your modifications? 800k documents/sec is a lot! I wonder what
>> you are indexing; do you have any text you are inverting? I have run
>> tests on a very strong machine with a 4k average doc size and I
>> couldn't even get 10% of this. So in your case lock contention in the
>> IndexWriter (there are still blocking parts) could be dominating. This
>> is certainly not what we optimize for. I'd say in 99% of cases
>> most of the time is spent in DocumentsWriterPerThread inverting the
>> document. If that is not the case in your experiment and you are only
>> measuring thread overhead then I can totally buy your numbers.
> [SIMON M.] We have 3 fields (2 fixed ByteRef fields and one bigger text field). The 800k figure is for the 6 threads all together, so one thread does about 133,333 docs/sec. To achieve that performance we:
> - Removed the notification process in Lucene that checks for stalled flushing... it was really slow.
> - Spotted some places where memory wasn't recycled properly.
> - Removed the stored fields writer... we do not use stored fields.
> - Used one IndexWriter per thread.

that is very interesting - I changed the stalling stuff just before
4.0 alpha from non-blocking to blocking. I might need to think about a
different way of doing that. Can you please report the places where
you see memory problems?

I have some ideas about refactoring the IW to make use cases like yours
easier; here is my brain dump on IRC, just for the record:

[6:46pm] s1monw: 1. I wanna logically divide writing and merging
[6:46pm] s1monw: so all merge code should go in a dedicated API
[6:46pm] s1monw: 2. IndexWriter should be a Composite out of IndexWriters
[6:47pm] s1monw: 2a. the composite would take care of notifying the merger and handle deletes
[6:47pm] s1monw: and do the commit
[6:47pm] s1monw: the other IW are single threaded
[6:48pm] s1monw: no synchronization
[6:48pm] mikemccand: sounds awesome!
[6:48pm] s1monw: ie like a DocumentsWriterPerThread just public
[6:48pm] s1monw: that way you can build something like simonm wants easily
[6:48pm] mikemccand: so each app thread opens a dedicated writer?
[6:48pm] s1monw: well you can have multiple models
[6:49pm] s1monw: you can have a non-blocking IW where you just hand off docs
[6:49pm] s1monw: and get a callback once done
[6:49pm] s1monw: or you have what we have today
[6:49pm] s1monw: like blocking
[6:49pm] s1monw: but you can also if you don't use updateDocument have a IW per thread
[6:49pm] s1monw: like you just said
[6:49pm] s1monw: its all up to you
[6:49pm] mikemccand: ok
[6:49pm] s1monw: makes sense
[6:49pm] s1monw: ?
[6:49pm] mikemccand: yes!
[6:50pm] s1monw: ok cool
[6:50pm] mikemccand: handling pending deletes seems tricky...
[6:50pm] s1monw: did I say its easy
[6:50pm] s1monw: :D
[6:50pm] mikemccand: LOL
[6:50pm] s1monw: its basically all up to the wrapper
[6:50pm] s1monw: so lets say we have a ClassicalIW
[6:51pm] s1monw: that has N IndexWriterPerThread
[6:51pm] s1monw: IndexWriterPerThread allows to flush in a single thread and hands back the deletes it has
[6:52pm] mikemccand: good
[6:52pm] s1monw: the ClassicIW handles deletes on a global level
[6:52pm] s1monw: and it handles the deletes per IWPerThread
[6:53pm] s1monw: so IWPerThread doesn't know about deletes until flush
[6:53pm] s1monw: on flush we pass in a BitSet that marks deleted docs
[6:53pm] s1monw: ie updates on another thread
[6:53pm] s1monw: ie docId + seq id
[6:53pm] s1monw: or something like that
[6:53pm] s1monw: err. Term + seq id
[6:53pm] s1monw: and apply everything on flush


simon
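
The composite idea in the IRC log above can be illustrated with a small sketch. This is not Lucene code and not the design that was eventually implemented: CompositeWriter and PerThreadWriter are hypothetical names, documents are plain strings, and a "Term" is just a string key. It only shows the "Term + seq id" bookkeeping, where per-thread buffers know nothing about deletes until flush and the composite resolves them at that point. Deletes against already flushed segments are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical single-threaded buffer, loosely in the role of "IWPerThread". */
final class PerThreadWriter {
    private static final class Buffered {
        final String idTerm;
        final String body;
        final long seqId;

        Buffered(String idTerm, String body, long seqId) {
            this.idTerm = idTerm;
            this.body = body;
            this.seqId = seqId;
        }
    }

    // Only ever touched by the owning thread, so no synchronization here.
    private final List<Buffered> buffer = new ArrayList<>();

    void add(String idTerm, String body, long seqId) {
        buffer.add(new Buffered(idTerm, body, seqId));
    }

    /** Flush: keep only docs that were not deleted by a later delete on the same term. */
    List<String> flush(Map<String, Long> deletes) {
        List<String> segment = new ArrayList<>();
        for (Buffered d : buffer) {
            Long deletedAt = deletes.get(d.idTerm);
            if (deletedAt == null || deletedAt < d.seqId) {
                segment.add(d.body);
            }
        }
        buffer.clear();
        return segment;
    }
}

/** Hypothetical wrapper, loosely in the role of the composite "ClassicIW". */
final class CompositeWriter {
    private final AtomicLong seq = new AtomicLong();
    // Global deletes: term -> highest sequence id at which it was deleted.
    private final Map<String, Long> deletes = new ConcurrentHashMap<>();

    PerThreadWriter newPerThreadWriter() {
        return new PerThreadWriter();
    }

    void add(PerThreadWriter writer, String idTerm, String body) {
        writer.add(idTerm, body, seq.incrementAndGet());
    }

    void delete(String idTerm) {
        deletes.merge(idTerm, seq.incrementAndGet(), Long::max);
    }

    List<String> flush(PerThreadWriter writer) {
        return writer.flush(deletes);
    }
}
```

The only shared state is the sequence counter and the delete map; each per-thread buffer stays lock-free because only its owning thread adds to it.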

>> > Did you look at the Disruptor pattern (by LMAX)? It helped us a lot to achieve great performance in a multithreaded environment!
>> I know of the pattern, though their use case is totally different from
>> ours. The time spent per transaction is super low compared to the
>> thread overhead, so they try to optimize this for high performance
>> computing, i.e. at something like 5M transactions per second you enter / leave
>> locks literally all the freaking time. With IndexWriter you don't have
>> such a pattern. Large numbers would be like 50k / sec, which is 2 orders
>> of magnitude less, so lock overhead becomes minor since contention is
>> much lower. If you go and make your documents super super small, like
>> not inverting anything or just storing, you might see an overhead in the
>> threading model, I agree. Our bottleneck is not lock contention here
>> but IO, and that is what we optimized this for. Makes sense?
> [SIMON M.] Not really. When adding documents without any stored fields, everything is in memory until we flush, so the bottleneck wasn't IO.
> Adding documents is where we optimized.
> I just mentioned the Disruptor because I thought a design with a ring buffer inside the IndexWriter, and many threads that write or flush, would be faster. In fact this is what we did, but externally, and with one IndexWriter per thread (6 IndexWriters). By doing it internally I think we could remove a lot of overhead. The advantage is that your producer should never block. :-) The drawback is that you need to copy the fields into the ring buffer. I do understand it is not suitable for everybody.
>>
>> That said, if you really wanna optimize this you could write your own
>> DocumentsWriterPerThreadPool and a custom FlushPolicy (both package
>> private in org.apache.lucene.index). In the DWPThreadPool you only maintain
>> one DWPT, and in the FlushPolicy you only track RAM consumption of that
>> DWPT. Once you see that it has filled up, you notify another thread
>> that it's time for a flush, and that thread goes and calls commit. You can
>> then over time find the right RAM buffer size: one that saturates IO, doesn't
>> create so many segments that background merges kill performance,
>> and maximises in-memory throughput.
> [SIMON M.] Thank you for the tips. I will keep looking for our bottlenecks!
>>
>> simonw :)
>>
>>
>> > Thank you
>> > Simon M.
>> >
>> >
>> >
>> >
>> >> Date: Thu, 19 Jul 2012 21:52:19 +0200
>> >> Subject: Re: Flushing Thread
>> >> From: simon.willnauer@gmail.com
>> >> To: java-user@lucene.apache.org
>> >>
>> >> hey,
>> >>
>> >> On Thu, Jul 19, 2012 at 7:41 PM, Simon McDuff <sm...@hotmail.com> wrote:
>> >> >
>> >> > Thank you for your answer!
>> >> >
>> >> > I read all your blogs! It is always interesting!
>> >>
>> >> for details see:
>> >>
>> >> http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/
>> >>
>> >> and
>> >>
>> >> http://www.searchworkings.org/blog/-/blogs/lucene-indexing-gains-concurrency/
>> >> >
>> >> > My understanding is probably incorrect ...
>> >> > I observed that if you have only one thread calling addDocument, it will not spawn another thread for flushing; it uses the main thread.
>> >>
>> >> every indexing thread can hit a flush. If you only have one thread you
>> >> will not make progress adding docs while flushing.
>> >> IW will not create new threads for flushing.
>> >> > In this case, my main thread is blocked. Correct?
>> >> >
>> >> > The concurrent flushing will ONLY work when I have many threads adding documents? (In that case I will need to put a ring buffer in front.)
>> >>
>> >> that is basically correct. You can frequently call commit, or pull a
>> >> reader from the IW, in a different thread before your RAM buffer fills
>> >> up so that flushing happens in that other thread. That could work
>> >> pretty well if you don't have many deletes to be applied. (If you have
>> >> many deletes then pull a reader without applying deletes.)
>> >>
>> >> simon
>> >> >
>> >> > Do I understand correctly ? Did I miss something ?
>> >> >
>> >> > Simon
>> >> >
>> >> >> From: lucene@mikemccandless.com
>> >> >> Date: Thu, 19 Jul 2012 13:02:42 -0400
>> >> >> Subject: Re: Flushing Thread
>> >> >> To: java-user@lucene.apache.org
>> >> >>
>> >> >> This has already been fixed on Lucene 4.0 (we now have fully
>> >> >> concurrent flushing), eg see:
>> >> >>
>> >> >>   http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
>> >> >>
>> >> >> Mike McCandless
>> >> >>
>> >> >> http://blog.mikemccandless.com
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Flushing Thread

Posted by Simon McDuff <sm...@hotmail.com>.
Hi Simon W.,
See comments below.
> Date: Fri, 20 Jul 2012 11:49:03 +0200
> Subject: Re: Flushing Thread
> From: simon.willnauer@gmail.com
> To: java-user@lucene.apache.org
> 
> hey simon ;)
> 
> 
> On Fri, Jul 20, 2012 at 2:29 AM, Simon McDuff <sm...@hotmail.com> wrote:
> >
> > Thank you Simon Willnauer!
> >
> > With your explanation, we've decided to control the flushing ourselves by spawning another thread, so the main thread stays available to ingest! :-) (Correct me if I'm wrong.) We do so by checking the RAM size reported by Lucene. (Thank you!) By setting the automatic flush threshold to 1000 MB and our own flush trigger to 900 MB, we know that the automatic flush "should" not happen.
> 
> it should not. Yet, 1 GB is a large RAM buffer. In my tests I got much
> better results with lowish RAM buffers like 256 MB, since that causes
> flushes to happen more often and saturates the IO on the machine.
> The general goal is to keep the RAM buffer at a level where you flush
> almost constantly, i.e. you maximise the RAM buffer so that the next
> flush is due just as you are done with the previous one. Does that
> make sense?
[SIMON M.] It makes sense for some use cases. In our case we have Fusion IO cards that write at 6 Gb/s, so we do not have contention for IO. Also, we use a larger RAM buffer to compress as much as possible (we have a lot of compression). (In fact we found that 500 MB was enough.)

> 
> > I know you contributed a lot to the concurrency feature! This is great! I was very excited to try it!
> > We tried the following approaches:
> > Option 1 - 6 threads sharing the same IndexWriter
> > Option 2 - 6 threads, each with its own IndexWriter, with the indexes merged at the end
> > Unfortunately, we found that option 2 scales better. I'm not sure why option 1 didn't scale. Is it possible that synchronization between threads is too costly? I don't have an answer, but it was definitely slower.
> 
> can you provide the numbers and what you actually did in your experiment?
[SIMON M.] I'm not at work today; I can provide these numbers Monday if you are still interested.
> 
> > With option 2, we are able to insert between 800,000 and 900,000 documents/sec (we've modified Lucene to remove some bottlenecks). The threads DO NOT ONLY index; they do other work before adding documents.
> 
> what are your modifications? 800k documents/sec is a lot! I wonder what
> you are indexing; do you have any text you are inverting? I have run
> tests on a very strong machine with a 4k average doc size and I
> couldn't even get 10% of this. So in your case lock contention in the
> IndexWriter (there are still blocking parts) could be dominating. This
> is certainly not what we optimize for. I'd say in 99% of cases
> most of the time is spent in DocumentsWriterPerThread inverting the
> document. If that is not the case in your experiment and you are only
> measuring thread overhead then I can totally buy your numbers.
[SIMON M.] We have 3 fields (2 fixed ByteRef fields and one bigger text field). The 800k figure is for the 6 threads all together, so one thread does about 133,333 docs/sec. To achieve that performance we:
- Removed the notification process in Lucene that checks for stalled flushing... it was really slow.
- Spotted some places where memory wasn't recycled properly.
- Removed the stored fields writer... we do not use stored fields.
- Used one IndexWriter per thread.
> > Did you look at the Disruptor pattern (by LMAX)? It helped us a lot to achieve great performance in a multithreaded environment!
> I know of the pattern, though their use case is totally different from
> ours. The time spent per transaction is super low compared to the
> thread overhead, so they try to optimize this for high performance
> computing, i.e. at something like 5M transactions per second you enter / leave
> locks literally all the freaking time. With IndexWriter you don't have
> such a pattern. Large numbers would be like 50k / sec, which is 2 orders
> of magnitude less, so lock overhead becomes minor since contention is
> much lower. If you go and make your documents super super small, like
> not inverting anything or just storing, you might see an overhead in the
> threading model, I agree. Our bottleneck is not lock contention here
> but IO, and that is what we optimized this for. Makes sense?
[SIMON M.] Not really. When adding documents without any stored fields, everything is in memory until we flush, so the bottleneck wasn't IO.
Adding documents is where we optimized.
I just mentioned the Disruptor because I thought a design with a ring buffer inside the IndexWriter, and many threads that write or flush, would be faster. In fact this is what we did, but externally, and with one IndexWriter per thread (6 IndexWriters). By doing it internally I think we could remove a lot of overhead. The advantage is that your producer should never block. :-) The drawback is that you need to copy the fields into the ring buffer. I do understand it is not suitable for everybody.
> 
> That said, if you really wanna optimize this you could write your own
> DocumentsWriterPerThreadPool and a custom FlushPolicy (both package
> private in org.apache.lucene.index). In the DWPThreadPool you only maintain
> one DWPT, and in the FlushPolicy you only track RAM consumption of that
> DWPT. Once you see that it has filled up, you notify another thread
> that it's time for a flush, and that thread goes and calls commit. You can
> then over time find the right RAM buffer size: one that saturates IO, doesn't
> create so many segments that background merges kill performance,
> and maximises in-memory throughput.
[SIMON M.] Thank you for the tips. I will keep looking for our bottlenecks!
> 
> simonw :)
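
The ring-buffer hand-off mentioned in this message can be sketched with plain JDK classes. This is a minimal single-producer / single-consumer ring, not the LMAX Disruptor and not anything inside Lucene; SpscRing is a hypothetical name and documents are stood in for by strings. It stores references rather than copying fields into preallocated slots, so it does not show the copy cost described above. The point it illustrates is that the producer gets a false return instead of blocking when the ring is full.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Minimal single-producer / single-consumer ring buffer (illustrative only). */
final class SpscRing {
    private final String[] slots;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to read
    private final AtomicLong tail = new AtomicLong(); // next slot to write

    SpscRing(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new String[capacity];
        mask = capacity - 1;
    }

    /** Producer side: returns false instead of blocking when the ring is full. */
    boolean offer(String doc) {
        long t = tail.get();
        if (t - head.get() == slots.length) {
            return false;
        }
        slots[(int) (t & mask)] = doc;
        tail.set(t + 1); // volatile write publishes the slot to the consumer
        return true;
    }

    /** Consumer side: returns null when the ring is empty. */
    String poll() {
        long h = head.get();
        if (h == tail.get()) {
            return null;
        }
        int i = (int) (h & mask);
        String doc = slots[i];
        slots[i] = null;
        head.set(h + 1); // volatile write frees the slot for the producer
        return doc;
    }
}
```

In a setup like the one described here, the ingest thread would call offer and an indexing thread would loop on poll and pass each document to its own writer; what to do when offer returns false (retry, spin, drop) is left to the application.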
> 
> 

Re: Flushing Thread

Posted by Simon Willnauer <si...@gmail.com>.
hey simon ;)


On Fri, Jul 20, 2012 at 2:29 AM, Simon McDuff <sm...@hotmail.com> wrote:
>
> Thank you Simon Willnauer!
>
> With your explanation, we've decided to control the flushing ourselves by spawning another thread, so the main thread stays available to ingest! :-) (Correct me if I'm wrong.) We do so by checking the RAM size reported by Lucene. (Thank you!) By setting the automatic flush threshold to 1000 MB and our own flush trigger to 900 MB, we know that the automatic flush "should" not happen.

it should not. Yet, 1 GB is a large RAM buffer. In my tests I got much
better results with lowish RAM buffers like 256 MB, since that causes
flushes to happen more often and saturates the IO on the machine.
The general goal is to keep the RAM buffer at a level where you flush
almost constantly, i.e. you maximise the RAM buffer so that the next
flush is due just as you are done with the previous one. Does that
make sense?

> I know you contributed a lot to the concurrency feature! This is great! I was very excited to try it!
> We tried the following approaches:
> Option 1 - 6 threads sharing the same IndexWriter
> Option 2 - 6 threads, each with its own IndexWriter, with the indexes merged at the end
> Unfortunately, we found that option 2 scales better. I'm not sure why option 1 didn't scale. Is it possible that synchronization between threads is too costly? I don't have an answer, but it was definitely slower.

can you provide the numbers and what you actually did in your experiment?

> With option 2, we are able to insert between 800,000 and 900,000 documents/sec (we've modified Lucene to remove some bottlenecks). The threads DO NOT ONLY index; they do other work before adding documents.

what are your modifications? 800k documents/sec is a lot! I wonder what
you are indexing; do you have any text you are inverting? I have run
tests on a very strong machine with a 4k average doc size and I
couldn't even get 10% of this. So in your case lock contention in the
IndexWriter (there are still blocking parts) could be dominating. This
is certainly not what we optimize for. I'd say in 99% of cases
most of the time is spent in DocumentsWriterPerThread inverting the
document. If that is not the case in your experiment and you are only
measuring thread overhead then I can totally buy your numbers.

> Did you look at the Disruptor pattern (by LMAX)? It helped us a lot to achieve great performance in a multithreaded environment!

I know of the pattern, though their use case is totally different from
ours. The time spent per transaction is super low compared to the
thread overhead, so they try to optimize this for high performance
computing, i.e. at something like 5M transactions per second you enter / leave
locks literally all the freaking time. With IndexWriter you don't have
such a pattern. Large numbers would be like 50k / sec, which is 2 orders
of magnitude less, so lock overhead becomes minor since contention is
much lower. If you go and make your documents super super small, like
not inverting anything or just storing, you might see an overhead in the
threading model, I agree. Our bottleneck is not lock contention here
but IO, and that is what we optimized this for. Makes sense?

That said, if you really wanna optimize this you could write your own
DocumentsWriterPerThreadPool and a custom FlushPolicy (both package
private in org.apache.lucene.index). In the DWPThreadPool you only maintain
one DWPT, and in the FlushPolicy you only track RAM consumption of that
DWPT. Once you see that it has filled up, you notify another thread
that it's time for a flush, and that thread goes and calls commit. You can
then over time find the right RAM buffer size: one that saturates IO, doesn't
create so many segments that background merges kill performance,
and maximises in-memory throughput.

simonw :)
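
The "notify another thread that it's time for a flush" idea can be sketched without touching Lucene internals. Everything here is a stand-in: RamBufferedIndex is a hypothetical interface covering the three operations the sketch needs (add a document, report buffered RAM, commit), ToyIndex is a toy implementation so the example runs, and BackgroundFlusher is an illustrative name. With a real IndexWriter, the soft limit would be set somewhat below the automatic threshold configured through IndexWriterConfig.setRAMBufferSizeMB, as in the 900 MB versus 1000 MB setup discussed in this thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the writer operations this sketch relies on. */
interface RamBufferedIndex {
    void addDocument(String doc);
    long ramBytesUsed();
    void commit();
}

/** Toy in-memory implementation, only here to make the sketch runnable. */
final class ToyIndex implements RamBufferedIndex {
    private final AtomicLong buffered = new AtomicLong();
    final AtomicInteger commits = new AtomicInteger();

    public void addDocument(String doc) { buffered.addAndGet(doc.length()); }
    public long ramBytesUsed() { return buffered.get(); }
    public void commit() { buffered.set(0); commits.incrementAndGet(); }
}

/** Lets the ingest thread keep adding while a separate thread runs the commit. */
final class BackgroundFlusher implements AutoCloseable {
    private final RamBufferedIndex index;
    private final long softLimitBytes;
    private final ExecutorService flushThread = Executors.newSingleThreadExecutor();
    private final AtomicBoolean flushPending = new AtomicBoolean();

    BackgroundFlusher(RamBufferedIndex index, long softLimitBytes) {
        this.index = index;
        this.softLimitBytes = softLimitBytes;
    }

    /** Called by the ingest thread; it never runs the commit itself. */
    void add(String doc) {
        index.addDocument(doc);
        // Schedule at most one commit at a time once the soft limit is crossed.
        if (index.ramBytesUsed() >= softLimitBytes
                && flushPending.compareAndSet(false, true)) {
            flushThread.execute(() -> {
                try {
                    index.commit();
                } finally {
                    flushPending.set(false);
                }
            });
        }
    }

    @Override
    public void close() {
        flushThread.shutdown();
        try {
            flushThread.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

If the background commit cannot keep up, buffered RAM keeps growing past the soft limit, which is why the automatic threshold still matters as a backstop.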



RE: Flushing Thread

Posted by Simon McDuff <sm...@hotmail.com>.
Thank you Simon Willnauer!

With your explanation, we've decided to control the flushing ourselves by spawning another thread, so the main thread stays available to ingest! :-) (Correct me if I'm wrong.) We do so by checking the RAM size reported by Lucene. (Thank you!) By setting the automatic flush threshold to 1000 MB and our own flush trigger to 900 MB, we know that the automatic flush "should" not happen.
I know you contributed a lot to the concurrency feature! This is great! I was very excited to try it!
We tried the following approaches:
Option 1 - 6 threads sharing the same IndexWriter
Option 2 - 6 threads, each with its own IndexWriter, with the indexes merged at the end
Unfortunately, we found that option 2 scales better. I'm not sure why option 1 didn't scale. Is it possible that synchronization between threads is too costly? I don't have an answer, but it was definitely slower.
With option 2, we are able to insert between 800,000 and 900,000 documents/sec (we've modified Lucene to remove some bottlenecks). The threads DO NOT ONLY index; they do other work before adding documents.
Did you look at the Disruptor pattern (by LMAX)? It helped us a lot to achieve great performance in a multithreaded environment!
Thank you
Simon M.
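
The shape of option 2 (one private writer per thread, merged at the end) can be shown with a toy model. This is a sketch under loud assumptions: ToyIndexShard is a hypothetical term-count map standing in for a per-thread index, not a Lucene type, and the final merge loop stands in for whatever merge step the real setup used (in Lucene that would typically be IndexWriter.addIndexes). It shows why the indexing phase needs no locking: no shard is ever shared between threads until the workers have finished.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical toy "index": term -> number of occurrences. */
final class ToyIndexShard {
    final Map<String, Integer> termCounts = new HashMap<>();

    void addDocument(String text) {
        for (String term : text.split("\\s+")) {
            termCounts.merge(term, 1, Integer::sum);
        }
    }

    void mergeFrom(ToyIndexShard other) {
        other.termCounts.forEach((term, count) -> termCounts.merge(term, count, Integer::sum));
    }
}

final class PerThreadIndexing {
    /** Each worker fills its own shard; shards are merged after all workers finish. */
    static ToyIndexShard indexInParallel(List<String> docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<ToyIndexShard>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int offset = t;
                futures.add(pool.submit(() -> {
                    ToyIndexShard shard = new ToyIndexShard(); // private to this worker
                    for (int i = offset; i < docs.size(); i += threads) {
                        shard.addDocument(docs.get(i));
                    }
                    return shard;
                }));
            }
            ToyIndexShard merged = new ToyIndexShard();
            for (Future<ToyIndexShard> future : futures) {
                merged.mergeFrom(future.get()); // single-threaded merge at the end
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("indexing failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The trade-off is that the merge at the end is extra work that a shared writer would not need, so whether this wins depends on how expensive per-document synchronization is relative to that merge.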





Re: Flushing Thread

Posted by Simon Willnauer <si...@gmail.com>.
hey,

On Thu, Jul 19, 2012 at 7:41 PM, Simon McDuff <sm...@hotmail.com> wrote:
>
> Thank you for your answer!
>
> I read all your blogs! It is always interesting!

for details see:

http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/

and

http://www.searchworkings.org/blog/-/blogs/lucene-indexing-gains-concurrency/
>
> My understanding is probably incorrect ...
> I observed that if you have only one thread that addDocument, it will not spawn another thread for flushing, it uses the main thread.

every indexing thread can hit a flush. if you only have one thread, you
will not make progress adding docs while it is flushing.
IW (IndexWriter) will not create new threads for flushing.
> In this case, my main thread is locked. Correct ?
>
> The concurrent flushing will ONLY work when I have many threads adding documents ? (In that case I will need to put a ringbuffer in front)

that is basically correct. You can frequently call commit, or pull a
reader from the IW, in a different thread before your RAM buffer fills
up, so that flushing happens in that other thread. That could work
pretty well if you don't have many deletes to be applied. (If you have
many deletes, then pull a reader without applying deletes.)
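
One way to "frequently call commit" from a different thread is a scheduled task. The following is a minimal sketch under stated assumptions, not an official Lucene recipe: CommittableWriter and PeriodicCommitter are hypothetical names, and the interface stands in for an adapter around IndexWriter.commit() that handles its IOException. The period has to be tuned so that the commit normally runs before the RAM buffer fills up.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the writer; only commit() is needed here. */
interface CommittableWriter {
    void commit();
}

/** Calls commit() at a fixed delay on a dedicated thread, away from the indexing thread. */
final class PeriodicCommitter implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    PeriodicCommitter(CommittableWriter writer, long periodMillis) {
        // Fixed delay (not fixed rate) so a slow commit never overlaps the next one.
        // Note: if commit() throws, the executor cancels all later runs.
        scheduler.scheduleWithFixedDelay(
                writer::commit, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Stops scheduling further commits. */
    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

The variant that pulls a reader instead of committing would open a near-real-time reader from the writer on the same background thread and close it again. In Lucene 4.0 that is, as far as I know, DirectoryReader.open(writer, applyAllDeletes), with the boolean set to false to skip applying deletes.
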

simon


RE: Flushing Thread

Posted by Simon McDuff <sm...@hotmail.com>.
Thank you for your answer!

I read all your blogs! They are always interesting!

My understanding is probably incorrect ...
I observed that if only one thread calls addDocument, Lucene does not spawn another thread for flushing; it flushes on the calling (main) thread.
In this case, my main thread is blocked. Correct?

Will concurrent flushing ONLY work when I have many threads adding documents? (In that case I will need to put a ring buffer in front.)
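
The "buffer in front of several indexing threads" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a plain ArrayBlockingQueue rather than a real ring buffer such as the LMAX Disruptor, IndexingPipeline is a hypothetical name, and the addDocument callback stands in for a call to a shared IndexWriter (which is documented as thread-safe). While one worker is busy flushing, the others keep draining the queue, so the producer blocks only when the queue is full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** One producer hands documents to several indexing threads through a bounded queue. */
final class IndexingPipeline<D> {
    private static final Object POISON = new Object(); // end-of-input marker, one per worker

    private final BlockingQueue<Object> queue;
    private final ExecutorService workers;
    private final int workerCount;

    IndexingPipeline(int workerCount, int capacity, Consumer<D> addDocument) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.workerCount = workerCount;
        this.workers = Executors.newFixedThreadPool(workerCount);
        for (int i = 0; i < workerCount; i++) {
            workers.execute(() -> {
                try {
                    while (true) {
                        Object item = queue.take();
                        if (item == POISON) {
                            return; // this worker is done
                        }
                        @SuppressWarnings("unchecked")
                        D doc = (D) item;
                        addDocument.accept(doc); // may block while this thread flushes
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    /** Blocks only when the queue is full (back-pressure). Returns false if interrupted. */
    boolean submit(D doc) {
        try {
            queue.put(doc);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Signals end of input and waits for the workers to finish. Returns true if they did in time. */
    boolean shutdown(long timeoutMillis) {
        try {
            for (int i = 0; i < workerCount; i++) {
                queue.put(POISON);
            }
            workers.shutdown();
            return workers.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            workers.shutdownNow();
            return false;
        }
    }
}
```

The queue capacity bounds how far the producer can run ahead of the indexing threads; a capacity that is too small brings back the stall on the main thread whenever all workers happen to flush at the same time.
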

Do I understand correctly ? Did I miss something ?

Simon


Re: Flushing Thread

Posted by Michael McCandless <lu...@mikemccandless.com>.
This has already been fixed in Lucene 4.0 (we now have fully
concurrent flushing), e.g. see:

  http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html

Mike McCandless

http://blog.mikemccandless.com
