Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2010/04/21 16:51:49 UTC

[jira] Created: (LUCENE-2408) Add Document.set/getSourceID, as an optional hint to IndexWriter to improve indexing performance

Add Document.set/getSourceID, as an optional hint to IndexWriter to improve indexing performance
------------------------------------------------------------------------------------------------

                 Key: LUCENE-2408
                 URL: https://issues.apache.org/jira/browse/LUCENE-2408
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Michael McCandless
            Priority: Minor
             Fix For: 3.1


(Spinoff from LUCENE-2324).

The internal indexer (currently DocumentsWriter & its full indexing
chain) has separate *PerThread objects holding buffered postings in
RAM until flush.

The RAM efficiency of these buffers depends heavily on the
distribution of terms sent to each one.

As an optimization, today we use thread affinity (i.e. we try to bind
each thread to the same set of *PerThread objects every time), on the
assumption that a thread may be indexing from its own source of docs.
When that assumption holds, overall RAM efficiency is much better,
since a single set of *PerThread objects handles the term distribution
for that source.
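
As a rough illustration of the thread-affinity idea, here is a minimal
sketch in plain Java. It is not the DocumentsWriter code: the class
ThreadAffinitySketch, the PerThreadState stand-in and the stateFor
method are hypothetical names invented for this example. It only shows
the binding rule, where the same thread always gets the same state.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of thread affinity; not Lucene's actual code.
class ThreadAffinitySketch {

    /** Hypothetical stand-in for one set of *PerThread buffers. */
    static final class PerThreadState {
        final int id;          // assigned in order of first use
        int bufferedDocs;      // how many docs were routed here
        PerThreadState(int id) { this.id = id; }
    }

    // Thread -> state binding; guarded by synchronized methods.
    private final Map<Thread, PerThreadState> bindings = new HashMap<>();
    private int nextId = 0;

    /** Returns the state bound to the thread, creating it on first use. */
    synchronized PerThreadState stateFor(Thread t) {
        return bindings.computeIfAbsent(t, k -> new PerThreadState(nextId++));
    }

    /** Routes one doc to the buffers bound to the calling thread. */
    synchronized void addDocument(Thread caller) {
        stateFor(caller).bufferedDocs++;
    }

    public static void main(String[] args) {
        ThreadAffinitySketch writer = new ThreadAffinitySketch();
        Thread current = Thread.currentThread();
        writer.addDocument(current);
        writer.addDocument(current);
        // Both docs landed in the one state bound to this thread.
        System.out.println(writer.stateFor(current).bufferedDocs);
    }
}
```

If each thread really does read from its own source, each
PerThreadState in this sketch only ever sees one source's terms.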

In the extreme case (many threads, each indexing completely orthogonal
terms, e.g. different languages) this should be a sizable
performance gain.

But really this is a hack -- e.g. if you index using a dedicated
indexing thread pool, then thread binding has nothing to do with
source, and you have no way to benefit from this optimization (even
though the mechanism is still "there").

To fix this, we should add an optional get/setSourceID to Document.
It's completely optional for an app to set this... and if it does,
it'd be a hint which IW can make use of (in an implementation-private
manner).  If it doesn't, we should just fall back to the "best guess"
we use today (each thread is its own source).
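
To make the proposed fallback rule concrete, here is a minimal sketch
in plain Java. Everything in it is an assumption for illustration: Doc
is a hypothetical stand-in for Document, the source ID is typed as
Object because the issue does not specify a type, and affinityKey and
buffersFor are invented names. How IW would really use the hint is
left open by the proposal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed hint; not an existing Lucene API.
class SourceIdHintSketch {

    /** Hypothetical stand-in for Document with the proposed accessors. */
    static final class Doc {
        private Object sourceID;   // null means "no hint given"
        void setSourceID(Object id) { this.sourceID = id; }
        Object getSourceID() { return sourceID; }
    }

    /** Hypothetical stand-in for one set of buffered postings. */
    static final class Buffers {
        int docCount;
    }

    // Affinity key -> buffers; guarded by the synchronized method below.
    private final Map<Object, Buffers> buffersByKey = new HashMap<>();

    /** Uses the source ID if the app set one, else the calling thread. */
    static Object affinityKey(Doc doc, Thread caller) {
        Object hint = doc.getSourceID();
        return hint != null ? hint : caller;
    }

    /** Picks (or creates) the buffers for this doc and counts the doc. */
    synchronized Buffers buffersFor(Doc doc, Thread caller) {
        Buffers b = buffersByKey.computeIfAbsent(
            affinityKey(doc, caller), k -> new Buffers());
        b.docCount++;
        return b;
    }

    public static void main(String[] args) {
        SourceIdHintSketch writer = new SourceIdHintSketch();
        Doc hinted = new Doc();
        hinted.setSourceID("feed-A");   // made-up source name
        Doc plain = new Doc();          // no hint: falls back to thread
        Thread t = Thread.currentThread();
        System.out.println(writer.buffersFor(hinted, t)
            != writer.buffersFor(plain, t));
    }
}
```

With a hint set, docs from one source share buffers in this sketch even
when a thread pool hands them to different threads; without a hint the
behavior matches today's per-thread guess.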

The javadoc would be something like "as a hint to IW, to possibly
improve its indexing performance, if you have docs from different
sources you should set the source ID on your Document". And
how/whether IW makes use of this information is "under the hood"...

