Posted to dev@lucene.apache.org by "Michael Busch (JIRA)" <ji...@apache.org> on 2010/04/19 19:39:50 UTC

[jira] Issue Comment Edited: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

    [ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858591#action_12858591 ] 

Michael Busch edited comment on LUCENE-2324 at 4/19/10 1:38 PM:
----------------------------------------------------------------

{quote}
But... could we allow an add/updateDocument call to express this
affinity, explicitly? If you index homogeneous docs you wouldn't use
it, but, if you index drastically different docs that fall into clear
"categories", expressing the affinity can get you a good gain in
indexing throughput.

This may be the best solution, since then one could pass the affinity
even through a thread pool, and then we would fall back to thread
binding if the document class wasn't declared?
{quote}

I would like this if we then also added an API that can be used to specify the
per-DWPT RAM size.  E.g. if someone has an app where different threads
index docs of very different sizes, then the DWPT that indexes the big docs can
be given more memory?
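
Just to illustrate what I have in mind, a rough sketch (all names hypothetical,
nothing below exists in Lucene today): addDocument takes an explicit affinity
key, and each affinity, i.e. each DWPT, can be given its own RAM budget:

{code:java}
import java.io.IOException;

// Hypothetical sketch only -- neither this interface nor its methods exist in Lucene.
// A document is routed to a DWPT via an explicit affinity key, and each affinity
// (i.e. each DWPT) can be given its own RAM budget.
interface AffinityWriter {

  /** Adds a document; docs sharing the same affinity key go to the same DWPT. */
  void addDocument(Object doc, String affinity) throws IOException;

  /** Sets the RAM buffer size (in MB) for the DWPT bound to this affinity key. */
  void setRAMBufferSizeMB(String affinity, double mb);
}

class AffinityWriterExample {
  static void index(AffinityWriter writer, Object smallDoc, Object bigDoc) throws IOException {
    writer.setRAMBufferSizeMB("big-docs", 64.0);   // the DWPT indexing big docs gets more memory
    writer.setRAMBufferSizeMB("small-docs", 16.0);

    writer.addDocument(bigDoc, "big-docs");        // routed to the "big-docs" DWPT
    writer.addDocument(smallDoc, "small-docs");    // routed to the "small-docs" DWPT
  }
}
{code}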

What I'm mainly trying to avoid is synchronization points between the
different DWPTs.  For example, currently the same ByteBlockAllocator is shared
between the different threads, so all its methods need to be synchronized.
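
For illustration, a minimal sketch of the per-DWPT alternative (an assumed
stand-in, not Lucene's actual ByteBlockAllocator): since the allocator is owned
by exactly one DWPT, none of its methods needs to be synchronized:

{code:java}
import java.util.ArrayDeque;

// Assumed stand-in, not Lucene's actual ByteBlockAllocator.  Because each DWPT owns
// its own allocator instance, no method here needs the synchronized keyword.
final class PrivateByteBlockAllocator {
  static final int BLOCK_SIZE = 32 * 1024;
  private final ArrayDeque<byte[]> freeBlocks = new ArrayDeque<>();

  /** Returns a recycled block if one is available, otherwise allocates a new one. */
  byte[] allocate() {
    byte[] block = freeBlocks.poll();
    return block != null ? block : new byte[BLOCK_SIZE];
  }

  /** Returns a block to this DWPT's private free list. */
  void recycle(byte[] block) {
    freeBlocks.push(block);
  }
}
{code}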


{quote}
The other DWs would keep indexing.  That's the beauty of this
approach... a flush of one DW doesn't stop all other DWs from
indexing, unlike today.

And you want to serialize the flushing, right? I.e., only one DW flushes
at a time (the others keep indexing).

Hmm I suppose flushing more than one should be allowed (OS/IO have
a lot of concurrency, esp. since IO goes into write cache)... perhaps
that's the best way to balance index vs flush time? E.g. we pick one to
flush @ 90%, if we cross 95% we pick another to flush, another at
100%, etc.
{quote}

Oh I don't want to disallow flushing in parallel!  I think it makes perfect
sense to allow more than one DW to flush at the same time.  If each DWPT has a
private max buffer size, then it can decide on its own when it's time to
flush.

{quote}
Hmm I suppose flushing more than one should be allowed (OS/IO have
a lot of concurrency, esp. since IO goes into write cache)... perhaps
that's the best way to balance index vs flush time? E.g. we pick one to
flush @ 90%, if we cross 95% we pick another to flush, another at
100%, etc.
{quote}

If we allow flushing in parallel and also allow specifying the max RAM per
DWPT, then there doesn't even have to be any cross-thread RAM tracking?  Each
DWPT could just flush when its own buffer is full?
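
A minimal sketch of that purely local flush decision (made-up names, not the
actual DocumentsWriterPerThread): each DWPT keeps its own counter and compares
it only against its private budget, with no global accounting involved:

{code:java}
// Made-up names, not the actual DocumentsWriterPerThread.  The flush decision is
// purely local: the counter is only ever touched by the owning thread and is only
// compared against this DWPT's private budget.
final class SketchDWPT {
  private final long ramBudgetBytes;
  private long ramUsedBytes;               // single-writer, no synchronization needed

  SketchDWPT(long ramBudgetBytes) {
    this.ramBudgetBytes = ramBudgetBytes;
  }

  void addDocument(int estimatedDocBytes) {
    // ... index the doc, then account for its RAM locally ...
    ramUsedBytes += estimatedDocBytes;
    if (ramUsedBytes >= ramBudgetBytes) {
      flush();                             // triggered without consulting other DWPTs
    }
  }

  private void flush() {
    // ... write this DWPT's private segment ...
    ramUsedBytes = 0;
  }
}
{code}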

So let's summarize: 
 # Expose a ThreadBinder API for controlling the number of DWPT instances and
the thread affinity of DWPTs explicitly. (We can later decide if we want to also
support such an affinity after a segment has been flushed, as Tim is asking for.
But that should IMO not be part of this patch.)
 # Also expose an API for specifying the RAM buffer size per DWPT. 
 # Allow flushing in parallel (multiple DWPTs can flush at the same time). A
DWPT flushes when its buffer is full, independent of what the other DWPTs are
doing.
 # The default implementation of the ThreadBinder API assigns threads to DWPTs
randomly and gives each DWPT 1/n-th of the overall memory.
 # The DWPT RAM value must be updateable.  E.g. when you first start indexing,
only one DWPT should be created with the max RAM.  Then when multiple threads
are used for adding documents, another DWPT should be added, the RAM value of
the already existing one should be reduced, and possibly a flush of that DWPT
triggered (see the sketch below).

How does this sound?
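
To make points 1, 4 and 5 a bit more concrete, here is a rough sketch of what a
default ThreadBinder could look like. All class and method names are made up; it
only illustrates binding threads to DWPTs, splitting the total RAM 1/n ways, and
re-adjusting the per-DWPT budget when a new DWPT is created:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical default ThreadBinder: threads keep an affinity to one DWPT, new DWPTs
// are created lazily up to a limit, and each DWPT gets 1/n-th of the overall memory.
final class DefaultThreadBinder {

  /** Stand-in for a DWPT; only the RAM accounting that matters here is shown. */
  static final class DWPT {
    volatile long ramBudgetBytes;
    DWPT(long ramBudgetBytes) { this.ramBudgetBytes = ramBudgetBytes; }
    void updateRamBudget(long newBudget) {
      // A real implementation might trigger a flush here if the DWPT is now over budget.
      ramBudgetBytes = newBudget;
    }
  }

  private final long totalRamBytes;
  private final int maxDWPTs;
  private final List<DWPT> dwpts = new ArrayList<>();
  private final ThreadLocal<DWPT> bound = new ThreadLocal<>();

  DefaultThreadBinder(long totalRamBytes, int maxDWPTs) {
    this.totalRamBytes = totalRamBytes;
    this.maxDWPTs = maxDWPTs;
  }

  /** Returns the DWPT bound to the calling thread, creating one lazily if allowed. */
  DWPT bind() {
    DWPT mine = bound.get();
    if (mine == null) {
      mine = selectOrCreate();
      bound.set(mine);
    }
    return mine;
  }

  private synchronized DWPT selectOrCreate() {
    if (dwpts.size() < maxDWPTs) {
      DWPT dwpt = new DWPT(totalRamBytes);  // the first DWPT starts with the full budget
      dwpts.add(dwpt);
      rebalance();                          // ... which shrinks to 1/n as more DWPTs appear
      return dwpt;
    }
    // Pool is full: fall back to a random existing DWPT.
    return dwpts.get(ThreadLocalRandom.current().nextInt(dwpts.size()));
  }

  /** Gives each of the n live DWPTs 1/n-th of the overall memory. */
  private void rebalance() {
    long perDWPT = totalRamBytes / dwpts.size();
    for (DWPT dwpt : dwpts) {
      dwpt.updateRamBudget(perDWPT);
    }
  }
}
{code}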

> Per thread DocumentsWriters that write their own private segments
> -----------------------------------------------------------------
>
>                 Key: LUCENE-2324
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2324
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

