Posted to dev@lucene.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2013/05/10 01:06:09 UTC

[jira] [Updated] (LUCENE-2408) Add Document.set/getSourceID, as an optional hint to IndexWriter to improve indexing performance

     [ https://issues.apache.org/jira/browse/LUCENE-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2408:
----------------------------------

    Fix Version/s:     (was: 4.3)
                   4.4
    
> Add Document.set/getSourceID, as an optional hint to IndexWriter to improve indexing performance
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2408
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2408
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Minor
>             Fix For: 4.4
>
>
> (Spinoff from LUCENE-2324).
> The internal indexer (currently DocumentsWriter & its full indexing
> chain) has separate *PerThread objects holding buffered postings in
> RAM until flush.
> The RAM efficiency of these buffers is very dependent on the term
> distributions sent to each.
> As an optimization, today, we use thread affinity (i.e., we try to assign
> the same thread to the same *PerThread classes), on the assumption
> that sometimes that thread may be indexing from its own source of
> docs.  When the assumption applies it means we can have much better
> overall RAM efficiency since a single *PerThread set of classes handles
> the term distribution for that source.
> In the extreme case (many threads, each indexing completely orthogonal
> terms, e.g., different languages) this should be a sizable
> performance gain.
> But really this is a hack -- e.g., if you index using a dedicated
> indexing thread pool, then thread binding has nothing to do with
> source, and you have no way to get this optimization (even though
> it's still "there").
> To fix this, we should add an optional get/setSourceID to Document.
> It's completely optional for an app to set this... and if they do,
> it'd be a hint which IW can make use of (in an implementation-private
> manner). If they don't, we should just fall back to the "best guess" we
> use today (each thread is its own source).
> The javadoc would be something like "as a hint to IW, to possibly
> improve its indexing performance, if you have docs from different
> sources you should set the source ID on your Document". And
> how/whether IW makes use of this information is "under the hood"...
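
The routing idea in the description above could be sketched roughly as
follows. This is only an illustration under stated assumptions, not Lucene
code: Doc, Buffer, Router and the sourceId field are hypothetical stand-ins
(the issue only proposes get/setSourceID on Document), and a real indexer
would buffer postings rather than a set of terms. The sketch shows the
fallback described in the issue: use the source ID when the app supplies
one, otherwise treat the calling thread as the source.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names are real Lucene APIs. */
public class SourceAffinitySketch {

    /** Stand-in for a document carrying the optional source hint. */
    static final class Doc {
        final List<String> terms;
        final String sourceId; // null means the app gave no hint

        Doc(List<String> terms, String sourceId) {
            this.terms = terms;
            this.sourceId = sourceId;
        }
    }

    /** Stand-in for one *PerThread buffer: records the unique terms it has seen. */
    static final class Buffer {
        final Set<String> uniqueTerms = new HashSet<>();

        void add(Doc doc) {
            uniqueTerms.addAll(doc.terms);
        }
    }

    /** Routes each doc to a buffer keyed by source ID, else by the calling thread. */
    static final class Router {
        private final Map<Object, Buffer> buffers = new HashMap<>();

        synchronized Buffer bufferFor(Doc doc) {
            // Assumed policy: the hint wins; with no hint, "each thread is its own source".
            Object key = (doc.sourceId != null) ? doc.sourceId : Thread.currentThread();
            return buffers.computeIfAbsent(key, k -> new Buffer());
        }

        synchronized void index(Doc doc) {
            bufferFor(doc).add(doc);
        }

        synchronized int bufferCount() {
            return buffers.size();
        }
    }

    public static void main(String[] args) {
        Router router = new Router();
        router.index(new Doc(Arrays.asList("hello", "world"), "english"));
        router.index(new Doc(Arrays.asList("hallo", "welt"), "german"));
        router.index(new Doc(Arrays.asList("hello", "again"), "english"));
        router.index(new Doc(Arrays.asList("no", "hint"), null));
        // Two source-keyed buffers plus one thread-keyed fallback buffer.
        System.out.println(router.bufferCount()); // prints 3
    }
}
```

With a shared indexing thread pool, the two "english" docs still land in
the same buffer regardless of which pool thread submits them, which is the
case thread affinity alone cannot handle.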

