Posted to dev@lucene.apache.org by Ning Li <ni...@gmail.com> on 2007/02/20 20:22:53 UTC

Concurrent merge

I think it's possible for another version of IndexWriter to have
a concurrent merge thread so that disk segments could be merged
while documents are being added or deleted.

This would be beneficial not only because it would improve indexing
performance when there are enough system resources, but more
importantly because disk segment merges would no longer block document
additions or deletions.

I'd like to get feedback on this idea, and once we agree on the best
design I can submit a full patch.

I have an initial implementation based on an earlier version of
Lucene (but with deletes via IndexWriter). The basic idea is to
separate a merge process into three steps:
  1 select disk segments to merge
  2 merge selected segments into one segment
  3 apply document deletions committed during the merge if any
    and replace selected segments with the result segment
The merge process is carried out in the merge thread. Steps 1 and
3 are executed in the critical section, but step 2, in which most
time is spent, is not.
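
To make the locking pattern concrete, here is a rough sketch of how such
a merge thread could be structured. The class and the abstract methods are
hypothetical placeholders for IndexWriter internals, not existing Lucene
APIs; only the split between locked and unlocked steps is the point.

import java.util.List;

// Hypothetical sketch only: the abstract methods stand in for IndexWriter
// internals and are not existing Lucene APIs. The point is which steps run
// inside the critical section shared with addDocument()/deleteDocuments().
abstract class ConcurrentMergeSketch<S> extends Thread {

  private final Object commitLock;  // same lock the add/delete path synchronizes on

  ConcurrentMergeSketch(Object commitLock) {
    this.commitLock = commitLock;
  }

  abstract List<S> selectSegmentsToMerge();            // step 1
  abstract S mergeSegments(List<S> segments);          // step 2 (long-running)
  abstract void commitMerge(List<S> merged, S result); // step 3: apply deletes
                                                       // committed during the merge
                                                       // and swap in the result

  @Override
  public void run() {
    while (!isInterrupted()) {
      List<S> toMerge;
      synchronized (commitLock) {            // step 1: inside the critical section
        toMerge = selectSegmentsToMerge();
      }
      if (toMerge.isEmpty()) {
        try { Thread.sleep(100); } catch (InterruptedException e) { return; }
        continue;
      }
      S result = mergeSegments(toMerge);     // step 2: no lock held, adds/deletes proceed
      synchronized (commitLock) {            // step 3: back inside the critical section
        commitMerge(toMerge, result);
      }
    }
  }
}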

There are three main challenges in enabling concurrent merge:
  1 a robust merge policy
  2 detect when merge lags document additions/deletions
  3 how to slow down document additions/deletions (and amortize
    the cost) when merge falls behind

Because new disk segments (flushed from ram) can continue to be
produced while a disk merge is going on, it is difficult to maintain
the two invariants guaranteed by the current IndexWriter. It is
therefore important, and challenging, to detect when the merge starts
to lag behind and to slow down document additions/deletions appropriately.

Several merge strategies are possible. In the initial implementation,
I adopted one similar to the merge policy in current IndexWriter.
Two limits on the total number of disk segments are used to detect
the merge's lag and to slow down document additions/deletions: a soft
limit and a hard limit. When the number of disk segments reaches
the soft limit, a document addition/deletion will be slowed down
for time T. As the number of disk segments continues to grow, the
time for which an addition/deletion is slowed down will increase.
When the number of disk segments reaches the hard limit, document
additions/deletions will be blocked until the number falls under
the hard limit. The hard limit is almost never reached with a proper
slowdown.
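
For illustration only, here is a sketch of what that soft/hard limit
back-pressure could look like on the add/delete path. The constants and the
linear growth of the delay are invented for the example and are not the
values used in the implementation.

// Illustrative back-pressure check run at the start of each addDocument()/
// deleteDocuments() call. The limits and the linear back-off are made-up
// example values, not the ones used in the actual implementation.
class MergeBackpressureSketch {

  private static final int SOFT_LIMIT = 20;     // start delaying adds/deletes here
  private static final int HARD_LIMIT = 100;    // block adds/deletes entirely here
  private static final long BASE_DELAY_MS = 50;

  private final Object lock = new Object();
  private int diskSegmentCount;

  // called by the merge thread whenever the segment count changes
  void onSegmentCountChanged(int newCount) {
    synchronized (lock) {
      diskSegmentCount = newCount;
      lock.notifyAll();               // wake writers blocked on the hard limit
    }
  }

  // called by addDocument()/deleteDocuments() before doing any work
  void throttle() throws InterruptedException {
    synchronized (lock) {
      while (diskSegmentCount >= HARD_LIMIT) {
        lock.wait();                  // hard limit: block until merging catches up
      }
      if (diskSegmentCount >= SOFT_LIMIT) {
        // soft limit: the delay grows as the backlog of segments grows
        lock.wait(BASE_DELAY_MS * (diskSegmentCount - SOFT_LIMIT + 1));
      }
    }
  }
}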

Other ideas are most welcome!

I also experimented with a concurrent flush thread, which flushes
ram segments into a disk segment, and multiple disk merge threads.
The flush thread provides limited additional benefit when the ram
size (buffered documents) is not too big. And multiple disk merge
threads require significant system resources to add benefit.



Re: Concurrent merge

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Tue, Feb 20, 2007, robert engels wrote about "Re: Concurrent merge":
> What about a queue of segments to merge. The add document will add  
> segments to the queue, if the queue contains too many segments it  
> blocks.
> 
> Another thread reads the segments from the queue and merges them.
> 
> This would effectively block adding of documents some times, but that  
> is not different than what happens now.

So if adds can still block, what is the point of making this change?

-- 
Nadav Har'El                        |      Wednesday, Feb 21 2007, 3 Adar 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    |Committee: A group of people that keeps
http://nadav.harel.org.il           |minutes and wastes hours.



Re: Concurrent merge

Posted by robert engels <re...@ix.netcom.com>.
The point I was trying to make is that even if you use a background
thread for merging, at some point adds may still block if the document
additions out-pace the merging.

On Feb 21, 2007, at 11:47 AM, Ning Li wrote:

> I agree that the current blocking model works for some applications,
> especially if the indexes are batch built.
>
> But other applications, e.g. with online indexes, would greatly
> benefit from a non-blocking model. Most systems that merge data
> support background merges. As long as we keep it simple (how about the
> original proposal?), applications will benefit from this.
>
> Ning
>
> On 2/20/07, Mike Klaas <mi...@gmail.com> wrote:
>> On 2/20/07, robert engels <re...@ix.netcom.com> wrote:
>> > What about a queue of segments to merge. The add document will add
>> > segments to the queue, if the queue contains too many segments it
>> > blocks.
>> >
>> > Another thread reads the segments from the queue and merges them.
>> >
>> > This would effectively block adding of documents some times, but  
>> that
>> > is not different than what happens now.
>>
>> In our custom database (that has a lucene-like logarithmic set of
>> segmented databases), we have a separate merge thread controlled with
>> a bounded queue as you suggest.  As you note, the unbounded wait time
>> on doc addition must still be present if you have limits on how far
>> ahead you can let the doc addition run ahead of the merging.
>>
>> You could imagine more complexity, allowing smaller sub-merges to
>> occur of fresh segments while large background merges were  
>> proceeding.
>>
>> For us, ironing out the resulting complexity wasn't a beneficial use
>> of time, and in retrospect I prefer lucene's model.
>>
>> -Mike
>>
>>
>>
>
>




Re: Concurrent merge

Posted by Yonik Seeley <yo...@apache.org>.
On 2/21/07, Ning Li <ni...@gmail.com> wrote:
> I agree that the current blocking model works for some applications,
> especially if the indexes are batch built.
>
> But other applications, e.g. with online indexes, would greatly
> benefit from a non-blocking model. Most systems that merge data
> support background merges. As long as we keep it simple (how about the
> original proposal?), applications will benefit from this.

Yes, if we do anything, I think simple is better.  I wouldn't go down
the whole soft-limit/hard-limit, gradually slowing down additions
road... the complexity doesn't sound worth it.

The simplest model could just take a reference to the ram segments and
the deleted terms on a flush, and then another thread could do
merging into a single segment, term deletion, and any other necessary
merging, while the original thread created new empty ram-segments and
deleted-terms so additions could continue.  If the user added more
than maxBufferedDocs before the background merge was complete, the add
would block (as it does currently).
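
A hedged sketch of that hand-off, with invented field and method names
standing in for IndexWriter's internal state:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the buffer hand-off described above. Field and method names are
// invented for illustration; the real buffered state lives inside IndexWriter.
class BackgroundFlushSketch {

  private List<Object> ramSegments = new ArrayList<Object>();
  private Set<String> deletedTerms = new HashSet<String>();
  private Thread backgroundMerge;     // at most one background merge outstanding

  // called when maxBufferedDocs is reached
  synchronized void flushInBackground() throws InterruptedException {
    while (backgroundMerge != null) {
      wait();                         // previous merge still running: block the add,
    }                                 // just as addDocument() blocks today
    final List<Object> toFlush = ramSegments;
    final Set<String> toDelete = deletedTerms;
    ramSegments = new ArrayList<Object>();   // additions continue into fresh buffers
    deletedTerms = new HashSet<String>();

    backgroundMerge = new Thread(new Runnable() {
      public void run() {
        mergeAndApplyDeletes(toFlush, toDelete);    // merge into one segment, apply the
        synchronized (BackgroundFlushSketch.this) { // deletes, cascade disk merges
          backgroundMerge = null;                   // signal completion
          BackgroundFlushSketch.this.notifyAll();
        }
      }
    });
    backgroundMerge.start();
  }

  void mergeAndApplyDeletes(List<Object> segments, Set<String> deletes) {
    // placeholder for the actual merging / deletion work
  }
}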

-Yonik



Re: Concurrent merge

Posted by Ning Li <ni...@gmail.com>.
I agree that the current blocking model works for some applications,
especially if the indexes are batch built.

But other applications, e.g. with online indexes, would greatly
benefit from a non-blocking model. Most systems that merge data
support background merges. As long as we keep it simple (how about the
original proposal?), applications will benefit from this.

Ning

On 2/20/07, Mike Klaas <mi...@gmail.com> wrote:
> On 2/20/07, robert engels <re...@ix.netcom.com> wrote:
> > What about a queue of segments to merge. The add document will add
> > segments to the queue, if the queue contains too many segments it
> > blocks.
> >
> > Another thread reads the segments from the queue and merges them.
> >
> > This would effectively block adding of documents some times, but that
> > is not different than what happens now.
>
> In our custom database (that has a lucene-like logarithmic set of
> segmented databases), we have a separate merge thread controlled with
> a bounded queue as you suggest.  As you note, the unbounded wait time
> on doc addition must still be present if you have limits on how far
> ahead you can let the doc addition run ahead of the merging.
>
> You could imagine more complexity, allowing smaller sub-merges to
> occur of fresh segments while large background merges were proceeding.
>
> For us, ironing out the resulting complexity wasn't a beneficial use
> of time, and in retrospect I prefer lucene's model.
>
> -Mike
>
>
>



Re: Concurrent merge

Posted by Mike Klaas <mi...@gmail.com>.
On 2/20/07, robert engels <re...@ix.netcom.com> wrote:
> What about a queue of segments to merge. The add document will add
> segments to the queue, if the queue contains too many segments it
> blocks.
>
> Another thread reads the segments from the queue and merges them.
>
> This would effectively block adding of documents some times, but that
> is not different than what happens now.

In our custom database (that has a lucene-like logarithmic set of
segmented databases), we have a separate merge thread controlled with
a bounded queue as you suggest.  As you note, the unbounded wait time
on doc addition must still be present if you have limits on how far
ahead you can let the doc addition run ahead of the merging.

You could imagine more complexity, allowing smaller sub-merges to
occur of fresh segments while large background merges were proceeding.

For us, ironing out the resulting complexity wasn't a beneficial use
of time, and in retrospect I prefer lucene's model.

-Mike



Re: Concurrent merge

Posted by robert engels <re...@ix.netcom.com>.
What about a queue of segments to merge. The add document will add  
segments to the queue, if the queue contains too many segments it  
blocks.

Another thread reads the segments from the queue and merges them.

This would effectively block adding of documents some times, but that  
is not different than what happens now.

On Feb 20, 2007, at 1:22 PM, Ning Li wrote:

> I think it's possible for another version of IndexWriter to have
> a concurrent merge thread so that disk segments could be merged
> while documents are being added or deleted.
>
> This would be beneficial not only because it will improve indexing
> performance when there are enough system resources, but more
> importantly, disk segment merges will no longer block document
> additions or deletions.
>
> I'd like to get feedback on this idea and after we agree on a best
> design I can submit a full patch.
>
> I have an initial implementation based on an earlier version of
> Lucene (but with deletes via IndexWriter). The basic idea is to
> separate a merge process into three steps:
>  1 select disk segments to merge
>  2 merge selected segments into one segment
>  3 apply document deletions committed during the merge if any
>    and replace selected segments with the result segment
> The merge process is carried out in the merge thread. Steps 1 and
> 3 are executed in the critical section, but step 2, in which most
> time is spent, is not.
>
> There are three main challenges in enabling concurrent merge:
>  1 a robust merge policy
>  2 detect when merge lags document additions/deletions
>  3 how to slow down document additions/deletions (and amortize
>    the cost) when merge falls behind
>
> Because new disk segments (flushed from ram) can continue to be
> produced while a disk merge is going on, it is difficult to hold
> the two invariants guaranteed by the current IndexWriter. Thus it
> is important and challenging to detect when merge starts to lag
> behind and to slow down document additions/deletions properly.
>
> Several merge strategies are possible. In the initial implementation,
> I adopted one similar to the merge policy in current IndexWriter.
> Two limits on the total number of disk segments are used to detect
> merge's lag and to slow down document additions/deletions: a soft
> limit and a hard limit. When the number of disk segments reaches
> the soft limit, a document addition/deletion will be slowed down
> for time T. As the number of disk segments continues to grow, the
> time for which an addition/deletion is slowed down will increase.
> When the number of disk segments reaches the hard limit, document
> additions/deletions will be blocked until the number falls under
> the hard limit. The hard limit is almost never reached with proper
> slow down.
>
> Other ideas are most welcome!
>
> I also experimented with a concurrent flush thread, which flushes
> ram segments into a disk segment, and multiple disk merge threads.
> The flush thread provides limited additional benefit when the ram
> size (buffered documents) is not too big. And multiple disk merge
> threads require significant system resources to add benefit.
>
>




Re: Concurrent merge

Posted by robert engels <re...@ix.netcom.com>.
I agree, that is why I suggested using the queue. You will add new  
segments to be merged to a queue. The merge thread pulls segments  
from the queue when it is free to do so. The addDocument() creates a
new segment and adds it to the queue. If the queue is full, the add  
to the queue will block until the queue has room.
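
That hand-off maps naturally onto a bounded blocking queue, which gives the
blocking behavior for free. A minimal sketch, with a placeholder segment type
and a placeholder mergeInto() call in place of the real IndexWriter internals:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the bounded-queue hand-off described above. The Object "segment"
// and mergeInto() are stand-ins for IndexWriter internals, not real Lucene API.
class MergeQueueSketch {

  private static final int QUEUE_CAPACITY = 10;   // how far adds may run ahead

  private final BlockingQueue<Object> pending =
      new ArrayBlockingQueue<Object>(QUEUE_CAPACITY);

  // called by addDocument() after it has written a new small segment
  void enqueueNewSegment(Object segment) throws InterruptedException {
    pending.put(segment);       // blocks when the merger falls too far behind
  }

  // single background merge thread
  Thread startMerger() {
    Thread merger = new Thread(new Runnable() {
      public void run() {
        try {
          while (true) {
            Object segment = pending.take();   // wait for work
            mergeInto(segment);                // fold it into the on-disk index
          }
        } catch (InterruptedException e) {
          // writer is closing: stop merging
        }
      }
    });
    merger.setDaemon(true);
    merger.start();
    return merger;
  }

  void mergeInto(Object segment) {
    // placeholder for the merge policy / SegmentMerger work
  }
}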

As for using multiple disks, this is a great idea. We use a balanced  
merge sort in our application, and by splitting the merge across disks
and using multiple threads we were able to double the IO throughput.
It probably doesn't make much difference for sites with advanced  
striped disk controllers/configurations, but for rather dumb/cheap  
installs it makes a huge difference, since a balanced merge involves  
massive amounts of sequential reads and writes.


On Feb 21, 2007, at 9:56 AM, Nadav Har'El wrote:

> On Tue, Feb 20, 2007, Ning Li wrote about "Concurrent merge":
>> I think it's possible for another version of IndexWriter to have
>> a concurrent merge thread so that disk segments could be merged
>> while documents are being added or deleted.
>> This would be beneficial not only because it will improve indexing
>> performance when there are enough system resources, but more
>> importantly, disk segment merges will no longer block document
>> additions or deletions.
>
> Hi Ning,
>
> Reading what you wrote here, I understand that the most important  
> benefit
> (as you see it) of this idea is to make the document addition time  
> more
> predictable: the current situation where addDocument() usually takes a
> second, but sometimes can take an hour, will be gone.
>
> I am wondering, however, whether this problem will actually be  
> solved by
> the single-background-merge-thread solution you propose. I'll  
> explain the
> possible problem I see below.
>
>> ...
>> Two limits on the total number of disk segments are used to detect
>> merge's lag and to slow down document additions/deletions: a soft
>> limit and a hard limit. When the number of disk segments reaches
>> the soft limit, a document addition/deletion will be slowed down
>> for time T. As the number of disk segments continues to grow, the
>> time for which an addition/deletion is slowed down will increase.
>> When the number of disk segments reaches the hard limit, document
>> additions/deletions will be blocked until the number falls under
>> the hard limit. The hard limit is almost never reached with proper
>> slow down.
>> ..
>
> How many merges can go on concurrently in your scheme? You  
> mentioned "a merge
> thread", which leads me to think that you mean that only one merge  
> can go on
> at once. You also said you prefer a single merge thread in the end  
> of your
> email.
>
> However, I think that allowing only one merge to proceed at once  
> can prevent
> you from achieving your intended improvement (of never having to
> wait an hour
> for adding a single document). Consider the following scenario:
>
> Imagine that after having just added the millionth document, a huge
> merge
> starts in the background, which is going to take an hour.
> At the same time, new documents continue to come in and be added to  
> the index.
> When you add more documents, you start getting small segments,  
> which need to
> be merged into larger segments, and so on. But if you don't allow  
> multiple
> merges to be active concurrently, the huge merge in progress will  
> cause these
> small merges to be postponed for later, and in just a few minutes  
> you'll
> accumulate many small segments, and very soon your "slowdown"  
> mechanism will
> start slowing down the additions, to the point that after a few  
> minutes
> additions will have to come to a grinding halt to avoid accumulating
> thousands of new small segments.
>
> This means that if you allow only one concurrent merge, your original
> intention - of allowing additions to continue in (almost) full steam
> while a huge merge is in progress - may not be realized.
>
> A counter-argument may be that if you slow down the additions  
> enough - say -
> wait 10 seconds between every addition - you'll not accumulate too  
> many new
> segments in this hour, and at the same time never wait more than 10  
> seconds
> for an addition, so everything is perfect. Well, this view is only  
> true if
> additions do come only once every 10 seconds. If a new file is  
> queued to be
> added every, say, once every 2 seconds, then your deliberate  
> slowdown will
> cause new additions to be significantly delayed - up to almost the  
> full hour!
> Moreover, the 10 seconds you spend sleeping (doing nothing) could  
> have been
> used for something more productive on a multiple-CPU machine - such as
> continuing to do additions and do more merging.
>
> One possibility of solving this problem is to allow multiple  
> background
> merge threads, not just one.
>
> Another possible solution I can think of is to use the background  
> thread for
> merging when it's free, but do merges in the foreground (blocking  
> the adds)
> when it is busy. This means slowing down the adds a bit, but not  
> with sleep()
> calls which doesn't utilize the available CPU, but with real work,  
> and adds
> will continue to proceed at the fastest possible speed (depending  
> on the
> machine). Presumably, the really huge merges (which I guess are the  
> ones
> that bother you) will happen in the background thread.
>
>> Other ideas are most welcome!
>
> An idea I once had, but I have no idea how much it can help, is this:
> It is conceivable (but I didn't actually test this) that a  
> significant (?)
> percentage of indexing time is IO time - disk write, read and seek.
> It might be possible, while we're doing a merge of a few segments  
> on one
> disk drive, to write new segments, and do new merges, on a second disk
> drive; This means that all the writing we do to the second disk  
> drive doesn't
> disturb the work that is being done on the first drive (less head  
> seeks,
> and more concurrent reading and writing on both disks).
>
> But as I said, I never tested how much this idea can improve  
> performance
> on systems with multiple separate disks and multiple CPUs.
>
>> size (buffered documents) is not too big. And multiple disk merge
>> threads require significant system resources to add benefit.
>
> See my comments above on why multiple concurrent merges might be  
> necessary,
> depending on what the "benefit" you were aiming at.
>
> Thanks,
> Nadav.
>
> -- 
> Nadav Har'El                        |      Wednesday, Feb 21 2007, 3 Adar 5767
> IBM Haifa Research Lab              |-----------------------------------------
>                                     |It's fortunate I have bad luck - without
> http://nadav.harel.org.il           |it I would have no luck at all!
>
>




Re: Concurrent merge

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Tue, Feb 20, 2007, Ning Li wrote about "Concurrent merge":
> I think it's possible for another version of IndexWriter to have
> a concurrent merge thread so that disk segments could be merged
> while documents are being added or deleted.
> This would be beneficial not only because it will improve indexing
> performance when there are enough system resources, but more
> importantly, disk segment merges will no longer block document
> additions or deletions.

Hi Ning,

Reading what you wrote here, I understand that the most important benefit
(as you see it) of this idea is to make the document addition time more
predictable: the current situation where addDocument() usually takes a
second, but sometimes can take an hour, will be gone.

I am wondering, however, whether this problem will actually be solved by
the single-background-merge-thread solution you propose. I'll explain the
possible problem I see below.

>...
> Two limits on the total number of disk segments are used to detect
> merge's lag and to slow down document additions/deletions: a soft
> limit and a hard limit. When the number of disk segments reaches
> the soft limit, a document addition/deletion will be slowed down
> for time T. As the number of disk segments continues to grow, the
> time for which an addition/deletion is slowed down will increase.
> When the number of disk segments reaches the hard limit, document
> additions/deletions will be blocked until the number falls under
> the hard limit. The hard limit is almost never reached with proper
> slow down.
>..

How many merges can go on concurrently in your scheme? You mentioned "a merge
thread", which leads me to think that you mean that only one merge can go on
at once. You also said you prefer a single merge thread at the end of your
email.

However, I think that allowing only one merge to proceed at once can prevent
you from achieving your intended improvement (of never having to wait an hour
for adding a single document). Consider the following scenario:

Imagine that after having just added the millionth document, a huge merge
starts in the background, which is going to take an hour.
At the same time, new documents continue to come in and be added to the index.
When you add more documents, you start getting small segments, which need to
be merged into larger segments, and so on. But if you don't allow multiple
merges to be active concurrently, the huge merge in progress will cause these
small merges to be postponed for later, and in just a few minutes you'll
accumulate many small segments, and very soon your "slowdown" mechanism will
start slowing down the additions, to the point that after a few minutes
additions will have to come to a grinding halt to avoid accumulating
thousands of new small segments.

This means that if you allow only one concurrent merge, your original
intention - of allowing additions to continue in (almost) full steam
while a huge merge is in progress - may not be realized.

A counter-argument may be that if you slow down the additions enough - say -
wait 10 seconds between every addition - you'll not accumulate too many new
segments in this hour, and at the same time never wait more than 10 seconds
for an addition, so everything is perfect. Well, this view is only true if
additions come only once every 10 seconds. If a new file is queued to be
added, say, once every 2 seconds, then your deliberate slowdown will
cause new additions to be significantly delayed - up to almost the full hour!
Moreover, the 10 seconds you spend sleeping (doing nothing) could have been
used for something more productive on a multiple-CPU machine - such as
continuing to do additions and do more merging.

One possibility of solving this problem is to allow multiple background
merge threads, not just one.

Another possible solution I can think of is to use the background thread for
merging when it's free, but do merges in the foreground (blocking the adds)
when it is busy. This means slowing down the adds a bit, but with real work
rather than sleep() calls that don't utilize the available CPU, and adds
will continue to proceed at the fastest possible speed (depending on the
machine). Presumably, the really huge merges (which I guess are the ones
that bother you) will happen in the background thread.

> Other ideas are most welcome!

An idea I once had, but I have no idea how much it can help, is this:
It is conceivable (but I didn't actually test this) that a significant (?)
percentage of indexing time is IO time - disk write, read and seek.
It might be possible, while we're doing a merge of a few segments on one
disk drive, to write new segments, and do new merges, on a second disk
drive. This means that all the writing we do to the second disk drive doesn't
disturb the work that is being done on the first drive (fewer head seeks,
and more concurrent reading and writing on both disks).

But as I said, I never tested how much this idea can improve performance
on systems with multiple separate disks and multiple CPUs.
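
Purely as an untested sketch of that two-disk idea (the Directory and
addIndexes calls are 2.x-era API and should be treated as assumptions): new
segments could be written through a side writer on the second drive while the
first drive is busy with a large merge, and folded back in afterwards.

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Untested sketch of the two-disk idea; the exact 2.x-era calls here are
// assumptions. New documents keep going to a side index on the second drive
// while the first drive is busy with a large merge, and the side index is
// folded back in afterwards with addIndexes(). Paths are examples only.
class TwoDiskIndexingSketch {
  public static void main(String[] args) throws IOException {
    Directory diskA = FSDirectory.getDirectory("/disk1/index", true);       // main index
    Directory diskB = FSDirectory.getDirectory("/disk2/side-index", true);  // side index

    IndexWriter main = new IndexWriter(diskA, new StandardAnalyzer(), true);
    IndexWriter side = new IndexWriter(diskB, new StandardAnalyzer(), true);

    // ... while a long merge runs on diskA, new documents go to the side writer:
    // side.addDocument(doc);

    // once the big merge on diskA is done, pull the side index back in
    side.close();
    main.addIndexes(new Directory[] { diskB });
    main.close();
  }
}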

> size (buffered documents) is not too big. And multiple disk merge
> threads require significant system resources to add benefit.

See my comments above on why multiple concurrent merges might be necessary,
depending on what "benefit" you were aiming at.

Thanks,
Nadav.

-- 
Nadav Har'El                        |      Wednesday, Feb 21 2007, 3 Adar 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    |It's fortunate I have bad luck - without
http://nadav.harel.org.il           |it I would have no luck at all!



Re: Concurrent merge

Posted by Ning Li <ni...@gmail.com>.
Many good points! Thanks, guys!

When background merge is employed, document additions can
out-pace merging, no matter how many background merge threads
are used. Blocking has to happen at some point.

So, if we do anything, we should keep it simple. I agree with what
Robert and Yonik have proposed: documents can be buffered
while segments are merged, but no more than maxBufferedDocs
can be buffered at any time.


Nadav asks a good question: if document additions can still
block, what is the point?

I see the following benefits:
1 Performance improvement to a certain extent
2 Shorter ingestion hiccups: Especially when the ingestion
   rate is low, the hiccups can be very short. In the current
   single-threaded IndexWriter, the hiccups can be very long
   when a document ingestion triggers a cascade of merges.


Doron has a valid concern: Lucene's being single threaded is
a simplifying factor.

Agreed. That's why I propose this as a subclass of IndexWriter,
not as a replacement for the current single-threaded IndexWriter.


So, given this simple background merge design, are people
interested in the benefits it provides?



Re: Concurrent merge

Posted by Yonik Seeley <yo...@apache.org>.
On 2/21/07, Doron Cohen <DO...@il.ibm.com> wrote:
> > > The downside is another complexity increase though.
>
> I think complexity can be divided in two:
> (1) more complex synchronization and data-manipulation/accounting
> (2) multi-threading.
>
> The multi-threading becoming part of, and the responsibility of, Lucene
> seems quite a change to me. Lucene's being single threaded is a
> simplifying factor, an advantage in my opinion.

I definitely agree with that point.  MultiSearcher spins up threads, but it's
not as core as IndexWriter.

> So how about, alternatively, (perhaps optionally, probably in a
> subclass) just reducing the synchronization level of IndexWriter,
> so one could call addDocument, deleteDocument, optimize() etc.
> in more than one thread, in parallel.
>
> The critical sections would be similar to the proposal below, and
> delicate synchronization details need to be looked after. It is
> in fact possible that synchronization-wise this is more challenging
> than the proposal in which Lucene launches the threads, because there
> can be more than two threads.
>
> But this way Lucene itself remains single threaded. It is the
> application decision/responsibility to launch and manage these
> threads.
>
> Just a thought.

Yeah, I thought about things like this as well... including separating
the analysis from adding the doc, but whatever I came up with tended to
complicate the job of the user, esp if deletes are thrown in.

-Yonik



Re: Concurrent merge

Posted by Doron Cohen <DO...@il.ibm.com>.
> > The downside is another complexity increase though.

I think complexity can be divided in two:
(1) more complex synchronization and data-manipulation/accounting
(2) multi-threading.

The multi-threading becoming part of, and the responsibility of, Lucene
seems quite a change to me. Lucene's being single threaded is a
simplifying factor, an advantage in my opinion.

So how about, alternatively, (perhaps optionally, probably in a
subclass) just reducing the synchronization level of IndexWriter,
so one could call addDocument, deleteDocument, optimize() etc.
in more than one thread, in parallel.

The critical sections would be similar to the proposal below, and
delicate synchronization details need to be looked after. It is
in fact possible that synchronization-wise this is more challenging
than the proposal in which Lucene launches the threads, because there
can be more than two threads.

But this way Lucene itself remains single threaded. It is the
application's decision/responsibility to launch and manage these
threads.

Just a thought.
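
As a caller-side illustration of that alternative, here is the usage pattern
with the application owning the threads; today these calls would simply
serialize on IndexWriter's internal synchronization, and the suggestion is to
relax that so the work can actually overlap.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Caller-side illustration only: the application owns the threads and calls
// one IndexWriter from several of them. Today these calls serialize on
// IndexWriter's internal synchronization; the idea is to let them overlap.
class ParallelAddSketch {
  void indexAll(final IndexWriter writer, Iterable<Document> docs, int threads)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (final Document doc : docs) {
      pool.execute(new Runnable() {
        public void run() {
          try {
            writer.addDocument(doc);   // safe to call from several threads
          } catch (Exception e) {
            // real code would report/handle the failure
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
  }
}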

robert engels <re...@ix.netcom.com> wrote on 21/02/2007 15:29:56:

> I think when you start discussing background threads you need to
> think server environment.
>
> It is fairly trivial there. I have pushed to move Lucene in that
> direction, rather than multiple clients accessing a shared
> resource via a network filesystem. No decent server product works
> this way.
>
> On Feb 21, 2007, at 5:23 PM, Yonik Seeley wrote:
>
> > On 2/21/07, Doron Cohen <DO...@il.ibm.com> wrote:
> >> Ning Li wrote:
> >>
> >> > There are three main challenges in enabling concurrent merge:
> >> >   1 a robust merge policy
> >> >   2 detect when merge lags document additions/deletions
> >> >   3 how to slow down document additions/deletions (and amortize
> >> >     the cost) when merge falls behind
> >>
> >> I wonder what it means for current API semantics -
> >>
> >> - An application today can set max-bufferred-docs to N, and after
> >> the Nth (or N+1th?) call to addDoc returns, a newly opened searcher
> >> would see these docs. With merges in a background thread this
> >> might not hold.
> >>
> >> - Today, after add(), an application can call flush() or close(),
> >> but with a background merge thread these calls would be blocked.
> >> Mmm... this is probably not a behavior change, because today
> >> these operations can trigger a merge that would take a long(er) time.
> >
> > We shouldn't advertise or guarantee that behavior.  This wasn't even
> > true before the new merge policy was implemented.
> >
> >> - numRamDocs() and ramSizeInBytes() - not sure what they mean
> >> once a background merge thread had started.
> >
> > IMO, for the current "batch" of documents being buffered.
> > The "old" buffered documents should be flushed to disk ASAP.
> >
> >> Still, having non blocking adds is compelling.
> >
> > Somewhat... It would result in some performance increase...
> > overlapping analysis of new documents with merging of other segments,
> > resulting in a higher CPU utilization (esp on multi-processor
> > systems).  The larger the maxBufferedDocs, the better.
> >
> > The downside is another complexity increase though.
> >
> > -Yonik




Re: Concurrent merge

Posted by robert engels <re...@ix.netcom.com>.
I think when you start discussing background threads you need to  
think server environment.

It is fairly trivial there. I have pushed to move Lucene in that  
direction, rather than multiple clients accessing a shared
resource via a network filesystem. No decent server product works  
this way.

On Feb 21, 2007, at 5:23 PM, Yonik Seeley wrote:

> On 2/21/07, Doron Cohen <DO...@il.ibm.com> wrote:
>> Ning Li wrote:
>>
>> > There are three main challenges in enabling concurrent merge:
>> >   1 a robust merge policy
>> >   2 detect when merge lags document additions/deletions
>> >   3 how to slow down document additions/deletions (and amortize
>> >     the cost) when merge falls behind
>>
>> I wonder what it means for current API semantics -
>>
>> - An application today can set max-buffered-docs to N, and after
>> the Nth (or N+1th?) call to addDoc returns, a newly opened searcher
>> would see these docs. With merges in a background thread this
>> might not hold.
>>
>> - Today, after add(), an application can call flush() or close(),
>> but with a background merge thread these calls would be blocked.
>> Mmm... this is probably not a behavior change, because today
>> these operations can trigger a merge that would take a long(er) time.
>
> We shouldn't advertise or guarantee that behavior.  This wasn't even
> true before the new merge policy was implemented.
>
>> - numRamDocs() and ramSizeInBytes() - not sure what they mean
>> once a background merge thread had started.
>
> IMO, for the current "batch" of documents being buffered.
> The "old" buffered documents should be flushed to disk ASAP.
>
>> Still, having non blocking adds is compelling.
>
> Somewhat... It would result in some performance increase...
> overlapping analysis of new documents with merging of other segments,
> resulting in a higher CPU utilization (esp on multi-processor
> systems).  The larger the maxBufferedDocs, the better.
>
> The downside is another complexity increase though.
>
> -Yonik
>
>




Re: Concurrent merge

Posted by Yonik Seeley <yo...@apache.org>.
On 2/21/07, Doron Cohen <DO...@il.ibm.com> wrote:
> Ning Li wrote:
>
> > There are three main challenges in enabling concurrent merge:
> >   1 a robust merge policy
> >   2 detect when merge lags document additions/deletions
> >   3 how to slow down document additions/deletions (and amortize
> >     the cost) when merge falls behind
>
> I wonder what it means for current API semantics -
>
> - An application today can set max-buffered-docs to N, and after
> the Nth (or N+1th?) call to addDoc returns, a newly opened searcher
> would see these docs. With merges in a background thread this
> might not hold.
>
> - Today, after add(), an application can call flush() or close(),
> but with a background merge thread these calls would be blocked.
> Mmm... this is probably not a behavior change, because today
> these operations can trigger a merge that would take a long(er) time.

We shouldn't advertise or guarantee that behavior.  This wasn't even
true before the new merge policy was implemented.

> - numRamDocs() and ramSizeInBytes() - not sure what they mean
> once a background merge thread had started.

IMO, for the current "batch" of documents being buffered.
The "old" buffered documents should be flushed to disk ASAP.

> Still, having non blocking adds is compelling.

Somewhat... It would result in some performance increase...
overlapping analysis of new documents with merging of other segments,
resulting in a higher CPU utilization (esp on multi-processor
systems).  The larger the maxBufferedDocs, the better.

The downside is another complexity increase though.

-Yonik



Re: Concurrent merge

Posted by Doron Cohen <DO...@il.ibm.com>.
Ning Li wrote:

> There are three main challenges in enabling concurrent merge:
>   1 a robust merge policy
>   2 detect when merge lags document additions/deletions
>   3 how to slow down document additions/deletions (and amortize
>     the cost) when merge falls behind

I wonder what it means for current API semantics -

- An application today can set max-buffered-docs to N, and after
the Nth (or N+1th?) call to addDoc returns, a newly opened searcher
would see these docs. With merges in a background thread this
might not hold.

- Today, after add(), an application can call flush() or close(),
but with a background merge thread these calls would be blocked.
Mmm... this is probably not a behavior change, because today
these operations can trigger a merge that would take a long(er) time.

- numRamDocs() and ramSizeInBytes() - not sure what they mean
once a background merge thread has started.

Still, having non-blocking adds is compelling.
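
For reference, a sketch of the first behavior described above, written against
the 2.x-era API (treat the exact calls as assumptions); the question is whether
a newly opened reader would still see the flushed docs once merging moves to a
background thread.

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Sketch of the contract described above (2.x-era API assumed): with
// maxBufferedDocs set to N, the buffered docs are flushed around the Nth add,
// and a newly opened reader then sees them.
class MaxBufferedDocsSketch {
  public static void main(String[] args) throws IOException {
    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    writer.setMaxBufferedDocs(10);

    for (int i = 0; i < 10; i++) {          // the Nth (or N+1th?) add triggers a flush
      Document doc = new Document();
      doc.add(new Field("id", Integer.toString(i),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
      writer.addDocument(doc);
    }

    IndexReader reader = IndexReader.open(dir);
    System.out.println(reader.numDocs());   // expected: 10, without closing the writer
    reader.close();
    writer.close();
  }
}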


