Posted to dev@lucene.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2013/05/10 01:06:09 UTC
[jira] [Updated] (LUCENE-2408) Add Document.set/getSourceID, as an
optional hint to IndexWriter to improve indexing performance
[ https://issues.apache.org/jira/browse/LUCENE-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe Schindler updated LUCENE-2408:
----------------------------------
Fix Version/s: 4.4 (was: 4.3)
> Add Document.set/getSourceID, as an optional hint to IndexWriter to improve indexing performance
> ------------------------------------------------------------------------------------------------
>
> Key: LUCENE-2408
> URL: https://issues.apache.org/jira/browse/LUCENE-2408
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael McCandless
> Priority: Minor
> Fix For: 4.4
>
>
> (Spinoff from LUCENE-2324).
> The internal indexer (currently DocumentsWriter & its full indexing
> chain) has separate *PerThread objects holding buffered postings in
> RAM until flush.
> The RAM efficiency of these buffers is very dependent on the term
> distributions sent to each.
> As an optimization, today, we use thread affinity (ie we try to assign
> the same thread to the same *PerThread classes), on the assumption
> that sometimes that thread may be indexing from its own source of
> docs. When the assumption applies it means we can have much better
> overall RAM efficiency since a single *PerThread set of classes handles
> the term distribution for that source.
> In the extreme case (many threads, each doing completely orthogonal
> terms, eg say different languages) this should be a sizable
> performance gain.
> But really this is a hack -- eg if you index using a dedicated
> indexing thread pool, then thread binding has nothing to do with
> source, and you have no way to get this optimization (even though
> it's still "there").
> To fix this, we should add an optional get/setSourceID to Document.
> It's completely optional for an app to set this... and if they do,
> it'd be a hint which IW can make use of (in an impl private manner).
> If they don't, we should just fall back to the "best guess" we use today
> (each thread is its own source).
> The javadoc would be something like "as a hint to IW, to possibly
> improve its indexing performance, if you have docs from different
> sources you should set the source ID on your Document". And
> how/whether IW makes use of this information is "under the hood"...
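
The proposal above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Lucene code: HintedDoc and SourceRouter are hypothetical names, the "slot" stands in for one set of *PerThread objects, and the round-robin assignment is an arbitrary choice. The issue itself leaves how (or whether) IW uses the hint as an implementation detail.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a Document carrying the proposed optional source hint. */
final class HintedDoc {
    private String sourceID; // null means the application gave no hint

    void setSourceID(String sourceID) { this.sourceID = sourceID; }
    String getSourceID() { return sourceID; }
}

/**
 * Hypothetical router: maps each document to one of a fixed number of slots
 * (each slot standing in for one set of *PerThread buffers). Documents with
 * the same source ID share a slot; documents without a hint fall back to
 * "each thread is its own source".
 */
final class SourceRouter {
    private final Map<Object, Integer> slotByKey = new HashMap<>();
    private final int numSlots;
    private int next = 0;

    SourceRouter(int numSlots) {
        if (numSlots <= 0) {
            throw new IllegalArgumentException("numSlots must be positive");
        }
        this.numSlots = numSlots;
    }

    /** Returns the slot for this document; the same key always gets the same slot. */
    synchronized int slotFor(HintedDoc doc) {
        Object key;
        if (doc.getSourceID() != null) {
            key = "source:" + doc.getSourceID(); // hint present: bind by source
        } else {
            // No hint: bind by calling thread. A real implementation would
            // also need to forget threads that have exited.
            key = Thread.currentThread();
        }
        Integer slot = slotByKey.get(key);
        if (slot == null) {
            slot = next;                  // first time this key is seen:
            next = (next + 1) % numSlots; // hand out slots round-robin
            slotByKey.put(key, slot);
        }
        return slot;
    }
}
```

With this sketch, an application indexing from a shared thread pool could call setSourceID("english") or setSourceID("french") on each document, and documents from the same source would land in the same slot regardless of which pool thread happens to index them.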