Posted to dev@lucene.apache.org by "Yonik Seeley (JIRA)" <ji...@apache.org> on 2009/04/11 23:17:14 UTC

[jira] Issue Comment Edited: (LUCENE-1596) optimize MultiTermEnum/MultiTermDocs

    [ https://issues.apache.org/jira/browse/LUCENE-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698136#action_12698136 ] 

Yonik Seeley edited comment on LUCENE-1596 at 4/11/09 2:16 PM:
---------------------------------------------------------------

Attaching optimization patch.  Results up front:
  random seeks to common terms with term enumerator:  58% improvement
  full iteration over all docs matching relatively unique terms: 1595% improvement

The optimizations:
 - MultiTermEnum keeps track of which segments match the current term.  If termDocs.seek(termEnum) is used, MultiTermDocs then visits only the segments that matched the term.
 - MultiTermEnum defers calling next() on sub enumerators until needed.  This allows MultiTermDocs to use the faster seek(enum) since the enumerator is still on the correct term.  This also avoids unnecessary calls to next() that may never be used, as well as unnecessary insertions into the priority queue.  Using seek(enum) in the sub TermDocs also allows cascading of these optimizations (in the event that one has a MultiReader of MultiReaders).
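
To illustrate the two ideas above, here is a minimal, self-contained Java sketch.  It is not the code from LUCENE-1596.patch and does not use Lucene classes: LazyMergedTermEnum, SubEnum, matchingSegments() and the nextCalls counter are hypothetical names invented for this example, and plain sorted string arrays stand in for per-segment term dictionaries.  It only shows the general shape of "remember which subs matched" plus "advance them lazily on the following next()".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Toy merged term enumerator over several sorted per-segment term lists. */
class LazyMergedTermEnum {

    /** Hypothetical stand-in for a per-segment term enumerator. */
    static final class SubEnum {
        final int segment;
        final String[] terms;   // assumed sorted and unique
        int pos = -1;
        int nextCalls = 0;      // instrumentation only, to observe the deferral

        SubEnum(int segment, String... terms) {
            this.segment = segment;
            this.terms = terms;
        }

        boolean next() {
            nextCalls++;
            return ++pos < terms.length;
        }

        String term() {
            return terms[pos];
        }
    }

    // Only subs positioned on a valid term are ever placed in the queue.
    private final PriorityQueue<SubEnum> queue =
        new PriorityQueue<>((a, b) -> a.term().compareTo(b.term()));
    // Subs that are positioned on the current term and not yet advanced.
    private final List<SubEnum> matching = new ArrayList<>();
    private String current;

    LazyMergedTermEnum(List<SubEnum> subs) {
        for (SubEnum s : subs) {
            if (s.next()) queue.add(s);
        }
    }

    boolean next() {
        // Deferred step: advance the subs that matched the previous term only now.
        for (SubEnum s : matching) {
            if (s.next()) queue.add(s);
        }
        matching.clear();
        if (queue.isEmpty()) {
            current = null;
            return false;
        }
        current = queue.peek().term();
        // Pull every sub sitting on the smallest term; leave them positioned there.
        while (!queue.isEmpty() && queue.peek().term().equals(current)) {
            matching.add(queue.poll());
        }
        return true;
    }

    String term() {
        return current;
    }

    /** Segment ids whose sub enumerator is positioned on the current term. */
    List<Integer> matchingSegments() {
        List<Integer> ids = new ArrayList<>();
        for (SubEnum s : matching) ids.add(s.segment);
        Collections.sort(ids);
        return ids;
    }

    public static void main(String[] args) {
        LazyMergedTermEnum e = new LazyMergedTermEnum(Arrays.asList(
            new SubEnum(0, "apple", "cherry"),
            new SubEnum(1, "banana", "cherry"),
            new SubEnum(2, "apple")));
        while (e.next()) {
            System.out.println(e.term() + " -> segments " + e.matchingSegments());
        }
    }
}
```

In this toy model, a consumer that receives the enumerator could restrict its work to matchingSegments() and reuse each sub's current position, since none of the matching subs has moved off the term yet.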

Test index: this index was deliberately constructed to show best-case performance for these optimizations.  It has 999,999 documents indexed with maxBufferedDocs=10, resulting in 46 segments.  The full iteration test used relatively unique terms (1 or 2 docs matching each), and the random seeks test used very common terms.  (If rare terms are used in the random seeks test, the initial seek dominates and swamps any improvement from the deferral of calls to next().)
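
As a sanity check on the segment count, the following Java sketch models an idealized logarithmic merge policy.  The merge factor of 10 is an assumption (the comment does not state it), as is the simplification that every group of mergeFactor equal-sized segments is merged into one and that the last partial buffer is flushed as its own segment.  SegmentCountSketch and segmentCount are hypothetical names for this example only.

```java
/** Idealized segment-count arithmetic; not Lucene's actual merge policy code. */
class SegmentCountSketch {

    static int segmentCount(int numDocs, int maxBufferedDocs, int mergeFactor) {
        // A trailing partial buffer is assumed to be flushed as one small segment.
        int segments = (numDocs % maxBufferedDocs != 0) ? 1 : 0;
        // Number of full-size flushed segments.
        int flushed = numDocs / maxBufferedDocs;
        // After cascading merges, the segments left at each level are the
        // digits of `flushed` written in base mergeFactor.
        while (flushed > 0) {
            segments += flushed % mergeFactor;
            flushed /= mergeFactor;
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(segmentCount(999_999, 10, 10));
    }
}
```

Under these assumptions, 999,999 documents give 99,999 full flushes, which leaves 9 segments at each of 5 levels, plus one 9-document segment: 5 * 9 + 1 = 46, consistent with the figure reported above.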


      was (Author: yseeley@gmail.com):
    Attaching optimization patch.  Results up front:
  random seeks to common terms with term enumerator:  58% improvement
  full iteration over all docs matching relatively unique terms: 1595% improvement

The optimizations:
 - MultiTermEnum keeps track of which segments match... if termDocs.seek(termEnum) is used, then MultiTermDocs will only visit segments that matched the term.
 - MultiTermEnum defers calling next() on sub enumerators until needed.  This allows MultiTermDocs to use the faster seek(enum) since the enumerator is still on the correct term.  This also avoids unnecessary calls to next() that may never be used, as well as unnecessary insertions into the priority queue.

Test index: this was obviously stacked to show best-case performance for these optimizations.  999,999 documents with maxBufferedDocs=10, resulting in 46 segments.  The full iteration test used relatively unique terms (1 or 2 docs matching each), and the random seeks test used very common terms (if rare terms are used in this test, the initial seek dominates and swamps any improvement from the deferral of calls to next().)

  
> optimize MultiTermEnum/MultiTermDocs
> ------------------------------------
>
>                 Key: LUCENE-1596
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1596
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Yonik Seeley
>            Assignee: Yonik Seeley
>         Attachments: LUCENE-1596.patch
>
>
> Optimize MultiTermEnum and MultiTermDocs to avoid seeks on TermDocs that don't match the term.
