Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2010/07/06 00:36:50 UTC
[jira] Commented: (LUCENE-2410) Optimize PhraseQuery
[ https://issues.apache.org/jira/browse/LUCENE-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885359#action_12885359 ]
Michael McCandless commented on LUCENE-2410:
--------------------------------------------
Alas.... I think I somehow screwed up my performance tests above.
I'm testing search perf (working on LUCENE-2504), and in comparing search perf from 2.9.x -> 3.x, I only saw a ~20% speedup on the phrase query "united states", for a 5M doc Wikipedia index. And, re-running the test on trunk pre and post this commit, I still see only ~20% gain.... still not sure what I did wrong.
I'll update CHANGES. Two steps forward, one step back... sigh.
> Optimize PhraseQuery
> --------------------
>
> Key: LUCENE-2410
> URL: https://issues.apache.org/jira/browse/LUCENE-2410
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Michael McCandless
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2410.patch, LUCENE-2410.patch, LUCENE-2410.patch, LUCENE-2410.patch, LUCENE-2410_rewrite.patch
>
>
> Looking at the scorers for PhraseQuery, I think there are some speedups
> we could do:
> * The AND part of the scorer (which advances to the next doc that
> has all the terms), in PhraseScorer.doNext, should do the same
> optimizing as BooleanQuery's ConjunctionScorer, ie sort terms from
> rarest to most frequent. I don't think it should use the linked
> list/firstToLast() that it uses today.
> * We do way too much work now when .score() is not called, because
> we go and find all occurrences of the phrase in the doc, whereas
> we should stop after finding the first, and only go and count
> the rest if .score() is called.
> * For the exact case, I think we can use two int arrays to find the
> matches. The first array holds the count of how many times a term
> in the phrase "matched" a phrase starting at that position. When
> that count == the number of terms in the phrase, it's a match.
> The 2nd is a "gen" array (holds docID when that count was last
> touched), to avoid clearing. Ie when incrementing the count, if
> the docID != gen, we reset count to 0. I think this'd be faster
> than the PQ we now use. Downside of this is if you have immense
> docs (position gets very large) we'd need 2 immense arrays.
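A minimal sketch of the count/gen idea for the exact case, to make the bookkeeping concrete. This is not the Lucene implementation: the class name ExactPhraseCounter, the maxPosition bound, and the int[][] layout for per-term positions are all assumptions made for illustration. Term i found at position pos votes for a phrase starting at pos - i, and the gen array lets a stale count from an earlier doc be reset lazily instead of clearing the whole array per doc.

```java
import java.util.Arrays;

/** Hypothetical sketch of count/gen exact phrase matching; not Lucene code. */
final class ExactPhraseCounter {
    // count[start]: how many phrase terms have matched a phrase starting at 'start'
    private final int[] count;
    // gen[start]: docID for which count[start] was last touched
    private final int[] gen;

    /** maxPosition is an assumed upper bound on term positions in any doc. */
    ExactPhraseCounter(int maxPosition) {
        count = new int[maxPosition + 1];
        gen = new int[maxPosition + 1];
        Arrays.fill(gen, -1); // -1 is never a valid docID, so every slot starts stale
    }

    /**
     * positions[i] holds the positions of the i-th phrase term in this doc.
     * Returns the number of exact occurrences of the phrase in the doc.
     */
    int phraseFreq(int docID, int[][] positions) {
        int numTerms = positions.length;
        int freq = 0;
        for (int i = 0; i < numTerms; i++) {
            for (int pos : positions[i]) {
                int start = pos - i; // where the phrase would have to begin
                if (start < 0) {
                    continue; // term occurs too early to be the i-th phrase term
                }
                if (gen[start] != docID) {
                    // Slot was last used by another doc: reset lazily
                    gen[start] = docID;
                    count[start] = 0;
                }
                if (++count[start] == numTerms) {
                    freq++; // every term voted for this start position
                }
            }
        }
        return freq;
    }
}
```

To stop after the first match when .score() is never called, the same loop could return as soon as a count reaches numTerms. The memory downside mentioned above shows up here as the two arrays sized by maxPosition.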
> It'd be great to do LUCENE-1252 along with this, ie factor
> PhraseScorer into two AND'd sub-scorers (LUCENE-1252 is open for
> this). The first one should be ConjunctionScorer, and the 2nd one
> checks the positions (ie, either the exact or sloppy scorers). This
> would mean if the PhraseQuery is AND'd w/ other clauses (or, a filter
> is applied) we would save CPU by not checking the positions for a doc
> unless all other AND'd clauses accepted the doc.
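A rough sketch of the two-phase split described above, under heavy simplification. The class TwoPhasePhraseSketch, the sorted int[] docID arrays standing in for postings, and the IntPredicate standing in for the positions check are all hypothetical. Advancing with a linear cursor is a stand-in for real skipping. The point is only the ordering: the rarest list leads the conjunction, and the positions check runs only for docs that every AND'd clause accepted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical sketch: conjunction first, positions check second. */
final class TwoPhasePhraseSketch {
    /**
     * postings: one sorted docID array per AND'd clause (assumed layout).
     * positionsMatch: the expensive exact/sloppy positions check for a doc.
     */
    static List<Integer> search(int[][] postings, IntPredicate positionsMatch) {
        List<Integer> hits = new ArrayList<>();
        if (postings.length == 0) {
            return hits;
        }
        // Sort clauses from rarest to most frequent; the rarest one leads
        int[][] sorted = postings.clone();
        Arrays.sort(sorted, Comparator.comparingInt((int[] p) -> p.length));
        int[] cursor = new int[sorted.length];

        outer:
        for (int doc : sorted[0]) {
            for (int i = 1; i < sorted.length; i++) {
                int[] p = sorted[i];
                // Advance this clause to the first docID >= doc
                while (cursor[i] < p.length && p[cursor[i]] < doc) {
                    cursor[i]++;
                }
                if (cursor[i] == p.length) {
                    break outer; // clause exhausted, no further matches possible
                }
                if (p[cursor[i]] != doc) {
                    continue outer; // some clause rejected the doc: skip positions
                }
            }
            // Only now pay for the positions check
            if (positionsMatch.test(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

With three clauses whose docID lists intersect in only three docs, the positions predicate is invoked three times, no matter how long the more frequent lists are.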