Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2010/04/20 07:25:23 UTC

Per-Thread DW and IW

Hi

I've been following the PTDW issue, and so far the approach that's been
proposed and agreed on sounds great! Each thread will have its own DW,
and DWs will flush themselves independently of each other, etc.

I'm now wondering if we cannot simplify the approach even further by
eliminating DW altogether and letting several IW instances open and
index over the same Directory inside the same JVM (basically allowing
each thread to open its own IW)?

Would it simplify matters for the currently ongoing issue? Would it
complicate matters for the app (init'ing IW per thread, controlling
RAM settings etc.)?
It will definitely simplify multi-threaded handling for IW extensions
like Parallel Index …

One thing we'll need to expose is a shared deleted-doc-IDs object
which all IWs will update. Anything else?
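
Purely to illustrate the shape of this, not working code: today the second
IndexWriter on the same Directory fails with LockObtainFailedException on
write.lock, and SharedDeletes below is an invented placeholder for that
shared deleted-doc-IDs object.

  // Hypothetical: each indexing thread opens its own IW over the same Directory.
  // (imports from org.apache.lucene.{analysis.standard,document,index,store,util} and java.io assumed)
  final Directory dir = FSDirectory.open(new File("/path/to/index"));
  final SharedDeletes sharedDeletes = new SharedDeletes(); // invented placeholder
  Runnable perThreadIndexer = new Runnable() {
    public void run() {
      try {
        // Today this throws LockObtainFailedException for all but the first thread.
        IndexWriter w = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("id", Thread.currentThread().getName(),
                          Field.Store.YES, Field.Index.NOT_ANALYZED));
        w.addDocument(doc);
        sharedDeletes.delete(new Term("id", "some-stale-id")); // deletes visible to every writer
        w.close();
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
  };
  for (int i = 0; i < 4; i++) new Thread(perThreadIndexer).start();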

I'm just thinking out loud here, which is why I didn't post it on the
issue itself. What I'm thinking of is taking all that per-thread
handling outside IW and giving the app more control. But having IW
itself be multi-threaded is also a great advantage and more convenient
for the app.

Shai



Re: Per-Thread DW and IW

Posted by Doron Cohen <cd...@gmail.com>.
On Thu, Apr 22, 2010 at 5:04 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> I like the "slice" term, but, can we drop the 'd'?  Ie "SliceWriter"
> and "SliceReader".
>

I agree, it is better without the 'd'.

Re: Per-Thread DW and IW

Posted by Shai Erera <se...@gmail.com>.
The big picture includes what you write, but also other usage, such as
loading different slices into memory, introducing the complementary API
to ParallelReader, querying a single slice only, etc.
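
For context, a small sketch of that last use case against the existing 3.x
API; the directory paths and the split into "content"/"facets" slices are
made up:

  // (imports: java.io.File, org.apache.lucene.index.{IndexReader,ParallelReader},
  //  org.apache.lucene.search.IndexSearcher, org.apache.lucene.store.FSDirectory)

  // Two slices of the same logical index, kept in separate directories and aligned by doc id.
  IndexReader contentSlice = IndexReader.open(FSDirectory.open(new File("/index/content")));
  IndexReader facetsSlice  = IndexReader.open(FSDirectory.open(new File("/index/facets")));

  // Combined view over all slices, as ParallelReader already offers today.
  ParallelReader parallel = new ParallelReader();
  parallel.add(contentSlice);
  parallel.add(facetsSlice);
  IndexSearcher combined = new IndexSearcher(parallel);

  // Querying a single slice only: just search that slice's reader directly.
  IndexSearcher facetsOnly = new IndexSearcher(facetsSlice);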

Shai

On Thu, Apr 22, 2010 at 5:04 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> I like the "slice" term, but, can we drop the 'd'?  Ie "SliceWriter"
> and "SliceReader".
>
> BTW, what are the big picture use cases for slices?  Is this just an
> approximation to incremental indexing?  EG if you suddenly need to
> add a new field to all docs in your index, rather than fully
> reindexing all docs, you can just add a new slice?
>
> Or if you need to change the values across some number of docs for a
> single field, it's better to rewrite the entire slice for that field
> than fully reindex those docs?
>
> Mike
>

Re: Per-Thread DW and IW

Posted by Michael McCandless <lu...@mikemccandless.com>.
I like the "slice" term, but, can we drop the 'd'?  Ie "SliceWriter"
and "SliceReader".

BTW, what are the big picture use cases for slices?  Is this just an
approximation to incremental indexing?  EG if you suddenly need to
add a new field to all docs in your index, rather than fully
reindexing all docs, you can just add a new slice?

Or if you need to change the values across some number of docs for a
single field, it's better to rewrite the entire slice for that field
than fully reindex those docs?

Mike

On Wed, Apr 21, 2010 at 4:16 PM, Doron Cohen <cd...@gmail.com> wrote:
> It is somewhat confusing that "Parallel" in this discussion refers to two
> different things - in PI it stands for an index who is sliced into N slices
> which are in turn accessed in Parallel, and in PDW it stands for two
> document writers which run in parallel, and update the same index. Perhaps
> it would be more clear to rename PW to SlicedWriter and similarly
> ParallelReader to SlicedReader, then each of them is working on a slice, and
> parallelism indicates what is done for speed (although slicing also is for
> speed, but in a different manner). This would also remove the confusion
> between ParallelReader and ParallelMultiSearcher.
> (Side comment/thoughts - If one would have attempted to implement a
> SlicedWriter way back when each added document was creating a segment in
> memory, and at flush those segments were merged - well, then, a sliced IW
> would just create two segments - A and B - out of each document (assuming
> two slices) and at flush merge all A's into A* and all B's into B*. Today
> added docs are maintained more efficiently, supporting deletions, merge
> polices, file-deletion-policy, commit points, crash recovery, NRT and more -
> and a sliced DW is more complex than just having two DW's each working on
> its part of the document... The simplicity of the old design was a beauty -
> reusing the segment concept over and over - though it could not achieve the
> nice features of today. Mmm... reading this again not sure that with a
> segment per doc things would be really simpler - IW would still need to
> manage both....)
> Doron


Re: Per-Thread DW and IW

Posted by Doron Cohen <cd...@gmail.com>.
It is somewhat confusing that "Parallel" in this discussion refers to two
different things - in PI it stands for an index that is sliced into N slices
which are in turn accessed in parallel, and in PDW it stands for two
document writers which run in parallel and update the same index. Perhaps
it would be clearer to rename PW to SlicedWriter and similarly
ParallelReader to SlicedReader; then each of them is working on a slice, and
parallelism indicates what is done for speed (although slicing is also for
speed, just in a different manner). This would also remove the confusion
between ParallelReader and ParallelMultiSearcher.

(Side comment/thoughts - if one had attempted to implement a
SlicedWriter way back when each added document created a segment in
memory, and at flush those segments were merged - well, then, a sliced IW
would just create two segments - A and B - out of each document (assuming
two slices) and at flush merge all A's into A* and all B's into B*. Today
added docs are maintained more efficiently, supporting deletions, merge
policies, file-deletion policy, commit points, crash recovery, NRT and more -
and a sliced DW is more complex than just having two DWs each working on
its part of the document... The simplicity of the old design was a beauty -
reusing the segment concept over and over - though it could not achieve the
nice features of today. Mmm... reading this again, I'm not sure that with a
segment per doc things would really be simpler - IW would still need to
manage both....)

Doron

On Wed, Apr 21, 2010 at 8:12 PM, Shai Erera <se...@gmail.com> wrote:

> I don't advocate to develop PI as an external entity to Lucene, you've
> already done that ! :)
>
> We should open up IW enough to develop PI efficiently, but I think we
> should always allow some freedom and flexibility to using applications. If
> IW simply created a Parallel DW, handle the merges on its own as if those
> are just one big happy bunch of Directories, then apps won't be able to plug
> in their own custom IWs, such as a FacetedIW maybe (one which handles the
> facets in the application).
>
> If that 'openness' of IW is the SegmentsWriter API, then that might be
> enough. I imagine apps will want to control things like add/update/delete of
> documents, but it should be IW which controls the MP and MS for all slices
> (you should give your own, but it will be one MP and MS for all slices, and
> not one per slice). Also, methods like addIndexes* probably cannot be
> supported by PI, unless we add a special method signature which accept
> ParallelWriter[] or some such.
>
> Currently, I view SegmentWriter as DocumentWriter, and so I think I'm
> operating under such low-level assumptions. But since I work over IW, some
> things are buried too low. Maybe we should refactor IW first, before PI is
> developed ... any estimates on when PerThread DW is going to be ready? :)
>
> Shai
>
>

Re: Per-Thread DW and IW

Posted by Shai Erera <se...@gmail.com>.
I don't advocate developing PI as an external entity to Lucene; you've
already done that! :)

We should open up IW enough to develop PI efficiently, but I think we should
always allow some freedom and flexibility to the applications using it. If IW
simply created a Parallel DW and handled the merges on its own, as if those
were just one big happy bunch of Directories, then apps won't be able to plug
in their own custom IWs, such as a FacetedIW maybe (one which handles the
facets in the application).

If that 'openness' of IW is the SegmentsWriter API, then that might be
enough. I imagine apps will want to control things like add/update/delete of
documents, but it should be IW which controls the MP and MS for all slices
(you could still give your own, but it will be one MP and MS for all slices,
and not one per slice). Also, methods like addIndexes* probably cannot be
supported by PI, unless we add a special method signature which accepts
ParallelWriter[] or some such.

Currently, I view SegmentWriter as DocumentWriter, and so I think I'm
operating under such low-level assumptions. But since I work over IW, some
things are buried too low. Maybe we should refactor IW first, before PI is
developed ... any estimates on when PerThread DW is going to be ready? :)

Shai

On Wed, Apr 21, 2010 at 6:48 PM, Michael Busch <bu...@gmail.com> wrote:

> Yeah, sounds like we have the same things in mind here.  In fact, this is
> pretty similar to what we discussed a while ago on LUCENE-2026 I think.
>
> SegmentWriter could be a higher level interface with more than one
> implementation.  E.g. there could be one SegmentWriter that supports
> appending documents (i.e. the DocumentsWriter today) and also one that
> allows adding terms at-a-time, e.g. similar to what IW.addIndexes*() does
> today.  Often when you rewrite entire parallel slices you don't want to use
> addDocument().  E.g. when you read from a source slice, modify some data and
> write a new version of that slice it can be dramatically faster to write
> postinglist after postinglist,  because you avoid parallel I/O and a lot of
> seeks. (with dramatically faster I mean e.g. 24 hrs vs. 8 mins, actual
> numbers from an implementation I had at IBM...)
>
> Further, I imagine to utilize the slice concept within Lucene.  The store
> could be a separate slice, and so could be the norms and the new flexible
> scoring data structures.  It's then super easy to turn those off or rewrite
> them individually (see LUCENE-2025).  Often parallel indexes don't need a
> store or norms, so this slice concept makes total sense in my opinion.
>  Norms actually works like this already, you can rewrite them which bumps up
> their generation number.  We just have to make this concept more abstract,
> so that it can be used for any kind of slice.
> Many people have also asked about allowing Lucene to manage external data
> structures.  I think these changes would allow exactly that:  just implement
> your external data structure as a slice, and Lucene will call your code when
> merging, deletions, adds happen. Cool! :)
>
> @Shai: If we implement Parallel indexing outside of Lucene's core then we
> have some of the same drawbacks as with the current master-slave approach.
>  I'm especially worried about how that would work then with realtime
> indexing (both searchable RAM buffer and also NRT).  I think PI must be
> completely segment-aware.  Then it should fit very nicely into realtime
> indexing, which is also very cool!
>
>  Michael
>
>
>

Re: Per-Thread DW and IW

Posted by Michael Busch <bu...@gmail.com>.
Yeah, sounds like we have the same things in mind here.  In fact, this 
is pretty similar to what we discussed a while ago on LUCENE-2026 I think.

SegmentWriter could be a higher-level interface with more than one 
implementation.  E.g. there could be one SegmentWriter that supports 
appending documents (i.e. the DocumentsWriter today) and also one that 
allows adding terms at-a-time, e.g. similar to what IW.addIndexes*() 
does today.  Often when you rewrite entire parallel slices you don't 
want to use addDocument().  E.g. when you read from a source slice, 
modify some data and write a new version of that slice, it can be 
dramatically faster to write posting list after posting list, because you 
avoid parallel I/O and a lot of seeks.  (By dramatically faster I mean 
e.g. 24 hrs vs. 8 mins, actual numbers from an implementation I had at 
IBM...)

Further, I imagine utilizing the slice concept within Lucene itself.  The 
store could be a separate slice, and so could the norms and the new 
flexible scoring data structures.  It's then super easy to turn those 
off or rewrite them individually (see LUCENE-2025).  Often parallel 
indexes don't need a store or norms, so this slice concept makes total 
sense in my opinion.  Norms actually work like this already: you can 
rewrite them, which bumps up their generation number.  We just have to 
make this concept more abstract, so that it can be used for any kind of 
slice.
Many people have also asked about allowing Lucene to manage external 
data structures.  I think these changes would allow exactly that:  just 
implement your external data structure as a slice, and Lucene will call 
your code when merges, deletions, and adds happen. Cool! :)
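
None of this exists; just to make the idea concrete, such a slice extension
point might look roughly like the following. All names are invented;
Document and IOException are the usual classes.

  import java.io.IOException;
  import org.apache.lucene.document.Document;

  // Invented sketch: an external data structure registers as a slice and is
  // called back on the same events Lucene already handles for its own files.
  public interface IndexSlice {
    void documentAdded(int docID, Document doc) throws IOException;  // a doc was added to the segment
    void deletesApplied(int[] deletedDocIDs) throws IOException;     // deletes were committed
    void segmentsMerged(String[] sourceSegments, String newSegment) throws IOException;
    void commit(long generation) throws IOException;                 // bump a generation, like norms do today
  }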

@Shai: If we implement Parallel indexing outside of Lucene's core then 
we have some of the same drawbacks as with the current master-slave 
approach.  I'm especially worried about how that would work then with 
realtime indexing (both searchable RAM buffer and also NRT).  I think PI 
must be completely segment-aware.  Then it should fit very nicely into 
realtime indexing, which is also very cool!

  Michael


On 4/21/10 8:06 AM, Michael McCandless wrote:
> I do think the idea of an abstract class (or interface) SegmentWriter
> is compelling.
>
> Each DWPT would be a [single-threaded] SegmentWriter.
>
> And then we'd make a MultiThreadedSegmentWriterWrapper (manages a
> collection of SegmentWriters, deleting to them, aggregating RAM used
> across all, manages picking which ones to flush, etc.).
>
> Then, a SlicedSegmentWriter (say) would write to separate slices,
> single threaded, and then you could make it multi-threaded by wrapping
> w/ the above class.
>
> Though SegmentWriter isn't a great name since it would in general
> write to multiple segments.  Indexer is a little too broad though :)
>
> Something like that maybe?
>
> Also, allowing an app to directly control the underlying
> SegmentWriters inside IndexWriter (instead of letting the
> multi-threaded wrapper decide for you) is compelling for way advanced
> apps, I think.  EG your app may know it's done indexing from source A
> for a while, so, you should right now go and flush it (whereas the
> default "flush the one using the most RAM" could leave that source
> unflushed for a quite a while, tying up RAM, unless we do some kind of
> LRU flushing policy or something).
>
> Mike
>


Re: Per-Thread DW and IW

Posted by Michael McCandless <lu...@mikemccandless.com>.
I do think the idea of an abstract class (or interface) SegmentWriter
is compelling.

Each DWPT would be a [single-threaded] SegmentWriter.

And then we'd make a MultiThreadedSegmentWriterWrapper (manages a
collection of SegmentWriters, routes deletes to them, aggregates RAM used
across all, picks which ones to flush, etc.).

Then, a SlicedSegmentWriter (say) would write to separate slices,
single threaded, and then you could make it multi-threaded by wrapping
w/ the above class.

Though SegmentWriter isn't a great name since it would in general
write to multiple segments.  Indexer is a little too broad though :)

Something like that maybe?

Also, allowing an app to directly control the underlying
SegmentWriters inside IndexWriter (instead of letting the
multi-threaded wrapper decide for you) is compelling for way advanced
apps, I think.  EG your app may know it's done indexing from source A
for a while, so you should right now go and flush it (whereas the
default "flush the one using the most RAM" could leave that source
unflushed for quite a while, tying up RAM, unless we do some kind of
LRU flushing policy or something).
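
To sketch what such a pluggable decision might look like: the FlushPolicy
and SegmentWriter names are made up here, with SegmentWriter standing for
the per-thread writer abstraction discussed in this thread.

  import java.util.List;

  // Invented sketch of the "which writer do we flush" hook.
  interface FlushPolicy {
    // Returns the writer to flush, or null if nothing needs flushing yet.
    SegmentWriter pickWriterToFlush(List<SegmentWriter> writers, long ramBudgetBytes);
  }

  // The default described above: flush the writer using the most RAM once
  // the aggregate RAM across all writers exceeds the budget.
  class FlushBiggestPolicy implements FlushPolicy {
    public SegmentWriter pickWriterToFlush(List<SegmentWriter> writers, long ramBudgetBytes) {
      long total = 0;
      SegmentWriter biggest = null;
      for (SegmentWriter w : writers) {
        long used = w.ramBytesUsed();   // assumed accessor on the sketched SegmentWriter
        total += used;
        if (biggest == null || used > biggest.ramBytesUsed()) {
          biggest = w;
        }
      }
      return total > ramBudgetBytes ? biggest : null;
    }
  }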

Mike

On Wed, Apr 21, 2010 at 2:27 AM, Shai Erera <se...@gmail.com> wrote:
> I'm not sure that a Parallel DW would work for PI because DW is too internal
> to IW. Currently, the approach I've been thinking about for PI is to tackle
> it from a high level, e.g. allow the application to pass a Directory, or
> even an IW instance, and PI will play the coordinator role, ensuring that
> merge of segments happens across all the slices in accordance, implementing
> two-phase operations etc. A Parallel DW then does not fit nicely w/ that
> approach (unless we want to refactor how IW works completely) because DW is
> not aware of the Directory, and if PI indeed works over IW instances, then
> each will have its own DW.
>
> So there are two basic approaches we can take for PI (following current
> architecture) - either let PI manage IW, or have PI a sort of IW itself,
> which handles events at a much lower level. While the latter is more robust
> (and based on current limitations I'm running into, might be even easier to
> do), it lacks the flexibility of allowing the app to plug any IW it wants.
> That requirement is also important, if the application wants to use PI in
> scenarios where it keeps some slices in RAM and some on disk, or it wants to
> control more closely which fields go to which slice, so that it can at some
> point in time "rebuild" a certain slice outside PI and replace the existing
> slice in PI w/ the new one ...
>
> We should probably continue the discussion on PI, so I suggest we either
> move it to another thread or on the issue directly.
>
> Mike - I agree w/ you that we should keep the life of the application
> developers easy and that having IW itself support concurrency is beneficial.
> Like I said ... it was just a thought which was aimed at keeping our life
> (Lucene developers) easier, but that probably comes second compared to
> app-devs life :). I'm not at all sure also that that would have make our
> life easier ...
>
> So I'm good if you want to drop the discussion.
>
> Shai
>


Re: Per-Thread DW and IW

Posted by Shai Erera <se...@gmail.com>.
I'm not sure that a Parallel DW would work for PI because DW is too internal
to IW. Currently, the approach I've been thinking about for PI is to tackle
it from a high level, e.g. allow the application to pass a Directory, or
even an IW instance, and PI will play the coordinator role, ensuring that
merge of segments happens across all the slices in accordance, implementing
two-phase operations etc. A Parallel DW then does not fit nicely w/ that
approach (unless we want to refactor how IW works completely) because DW is
not aware of the Directory, and if PI indeed works over IW instances, then
each will have its own DW.

So there are two basic approaches we can take for PI (following the current
architecture) - either let PI manage IW, or have PI be a sort of IW itself,
which handles events at a much lower level. While the latter is more robust
(and based on current limitations I'm running into, might even be easier to
do), it lacks the flexibility of allowing the app to plug in any IW it wants.
That requirement is also important, if the application wants to use PI in
scenarios where it keeps some slices in RAM and some on disk, or it wants to
control more closely which fields go to which slice, so that it can at some
point in time "rebuild" a certain slice outside PI and replace the existing
slice in PI w/ the new one ...

We should probably continue the discussion on PI, so I suggest we either
move it to another thread or on the issue directly.

Mike - I agree w/ you that we should keep the life of application
developers easy and that having IW itself support concurrency is beneficial.
Like I said ... it was just a thought aimed at keeping our (Lucene
developers') life easier, but that probably comes second compared to
app-devs' life :). I'm also not at all sure that it would have made our
life easier ...

So I'm good if you want to drop the discussion.

Shai

On Tue, Apr 20, 2010 at 8:16 PM, Michael Busch <bu...@gmail.com> wrote:

> On 4/19/10 10:25 PM, Shai Erera wrote:
>
>> It will definitely simplify multi-threaded handling for IW extensions
>> like Parallel Index …
>>
>>
>
> I'm keeping Parallel indexing in mind.  After we have separate DWPT I'd
> like to introduce parallel DWPTs, that write different slices.
>  Synchronization should not be a big worry then, because writing is
> single-threaded.
>
> We could introduce a new abstract class SegmentWriter, which DWPT would
> implement.  An extension would be ParallelSegmentWriter, which would manage
> multiple SegmentWriters.   Or maybe SegmentSliceWriter would be a better
> name.
>
>  Michael

Re: Per-Thread DW and IW

Posted by Michael Busch <bu...@gmail.com>.
On 4/19/10 10:25 PM, Shai Erera wrote:
> It will definitely simplify multi-threaded handling for IW extensions
> like Parallel Index …
>    

I'm keeping Parallel indexing in mind.  After we have separate DWPTs I'd 
like to introduce parallel DWPTs that write different slices.  
Synchronization should not be a big worry then, because writing is 
single-threaded.

We could introduce a new abstract class SegmentWriter, which DWPT would 
implement.  An extension would be ParallelSegmentWriter, which would 
manage multiple SegmentWriters.   Or maybe SegmentSliceWriter would be a 
better name.
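
To make that concrete, a minimal sketch under the names suggested above;
none of these classes exist, and how the flushed per-slice segments get
tied together is deliberately left open.

  import java.io.IOException;
  import org.apache.lucene.document.Document;

  // Sketch only: an abstract per-thread writer plus a parallel wrapper over slices.
  abstract class SegmentWriter {
    abstract void addDocument(Document doc) throws IOException;
    abstract long ramBytesUsed();
    abstract void flush() throws IOException;   // writes the buffered docs as a new segment
  }

  // One SegmentWriter per slice; each part of a document is written
  // single-threaded, so no synchronization is needed inside the slice writers.
  class ParallelSegmentWriter extends SegmentWriter {
    private final SegmentWriter[] sliceWriters;

    ParallelSegmentWriter(SegmentWriter[] sliceWriters) {
      this.sliceWriters = sliceWriters;
    }

    void addDocument(Document doc) throws IOException {
      for (SegmentWriter slice : sliceWriters) {
        slice.addDocument(doc);          // each slice picks out the fields it owns
      }
    }

    long ramBytesUsed() {
      long sum = 0;
      for (SegmentWriter slice : sliceWriters) {
        sum += slice.ramBytesUsed();
      }
      return sum;
    }

    void flush() throws IOException {
      for (SegmentWriter slice : sliceWriters) {
        slice.flush();                   // the per-slice segments must stay aligned by doc id
      }
    }
  }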

  Michael



Re: Per-Thread DW and IW

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Apr 20, 2010 at 07:27:57AM -0400, Michael McCandless wrote:
> There are elements of IW that still must be "centralized" -- managing
> the merge policy/schedulers, deletion policy, writing/committing the
> segments files, managing ongoing addIndexes, tracking pending
> deletions, the reader pool, etc.

I've got a prototype BackgroundMerger working for KS/Lucy which can work
concurrently with an Indexer.  It drops a "merge.lock" file which blocks
Indexer from merging any segments that existed at the moment of the lockfile's
creation.  When it's done merging, it acquires the write.lock, carries forward
any deletions that Indexer has written against the segments it's merging away,
then commits and releases both locks.
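
In Lucene terms, the same handshake could be sketched with the existing
Directory/Lock API; "merge.lock" is the KS/Lucy convention described above,
not a Lucene file, and the helper methods are placeholders for the steps
described. Imports from org.apache.lucene.store and org.apache.lucene.index
assumed.

  // Sketch of the background-merge handshake against a shared Directory.
  Lock mergeLock = dir.makeLock("merge.lock");
  if (!mergeLock.obtain()) {
    return;                                  // someone else is already merging these segments
  }
  try {
    SegmentInfos snapshot = readSegmentsAtLockTime(dir);  // placeholder: segments that existed at lock creation
    mergeSegments(snapshot);                              // heavy lifting, write.lock not held

    Lock writeLock = dir.makeLock(IndexWriter.WRITE_LOCK_NAME);
    writeLock.obtain(30000);                 // block the Indexer briefly while publishing
    try {
      carryForwardDeletes(snapshot);         // placeholder: deletions written against the merged-away segments
      commitMergedSegments(snapshot);        // placeholder: write the new commit point
    } finally {
      writeLock.release();
    }
  } finally {
    mergeLock.release();
  }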

Based on the success of this prototype, I believe merging concurrently is a
theoretically solvable problem, using mutexes to lay claim to the mergeable
segments while doing the heavy lifting, and the write lock to coordinate
deletions and committing.

However, the management of individual deletions seems like a daunting problem
every time I consider how to expand this model out to multiple indexing
processes operating against the same index, when those processes must be
allowed to create new deletions.  I think there are insoluble race conditions
until you get into document-level locking.

I imagine NRT readers make this problem even harder.

Marvin Humphrey




Re: Per-Thread DW and IW

Posted by Michael McCandless <lu...@mikemccandless.com>.
I don't think an app should be required to manage multiple IWs, when
using multiple threads?

Ie, I think we should still offer the concurrency model we have today
-- you share a single IW across threads, you tell it how much RAM it's
allowed to use, and it has high internal concurrency.
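
For reference, that model is just today's standard usage, e.g. with the 3.x
API; the path and RAM figure below are arbitrary:

  // One IndexWriter shared by every indexing thread; IW synchronizes internally.
  // (imports from org.apache.lucene.{analysis.standard,index,store,util} and java.io assumed)
  IndexWriter writer = new IndexWriter(
      FSDirectory.open(new File("/path/to/index")),
      new StandardAnalyzer(Version.LUCENE_30),
      IndexWriter.MaxFieldLength.UNLIMITED);
  writer.setRAMBufferSizeMB(64.0);   // one RAM budget covering all threads
  // writer.addDocument(...) / writer.deleteDocuments(...) may then be called
  // from any thread; IW flushes on its own once the shared buffer fills up.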

There are elements of IW that still must be "centralized" -- managing
the merge policy/schedulers, deletion policy, writing/committing the
segments files, managing ongoing addIndexes, tracking pending
deletions, the reader pool, etc.

Maybe we can expose individual control of each indexer (DWPT)... on an
advanced basis?  Ie, for convenience we have a single IW instance, with a
single RAM buffer, that manages the separate indexers, but advanced apps can
skip this convenience layer and directly control individual indexers?

If we make an "Indexer" interface (marked experimental!) that each
DWPT implements, but also the convenience layer implements, that could
be a clean way to achieve this?  So if an advanced app disagrees with
how the convenience layer manages concurrency (its thread affinity &
flushing policy based on aggregate RAM used), it could go straight to
individual indexers.

Alternatively, we could make the thread affinity / flush selection
explicitly controllable with a policy?

Also thinking out loud :),

Mike

On Tue, Apr 20, 2010 at 1:25 AM, Shai Erera <se...@gmail.com> wrote:
> Hi
>
> I've been following the PTDW issue and so far the approach that's been
> proposed and agreed on sounds great! - each thread will have it's own
> DW, DWs will flush themselves independently of each other etc.
>
> I'm now wondering if we cannot simplify the approach even further by
> eliminating DW at all and let several IW instances open and index over
> the same Directory inside the same JVM (basically allowing multiple
> threads to open their own IW)?
>
> Would it simplify matters for the currently ongoing issue? Would it
> complicate matters for the app (init'ing IW per thread, controlling
> RAM settings etc.)?
> It will definitely simplify multi-threaded handling for IW extensions
> like Parallel Index …
>
> One thing we'll need to expose is a shared deleted doc IDs object
> which all IW will update. Anything else?
>
> I'm just thinking outloud here, hence why I didn't post it on the
> issue itself. What I'm thinking is taking all that per-thread thing
> outside IW and let the app more control. But having IW multi-threaded
> is also a great advantage and more convenient to the app.
>
> Shai
>
>
>
