You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2015/06/10 11:48:01 UTC

[jira] [Commented] (LUCENE-3893) TermsQuery should use AutomatonQuery or Terms.intersect

    [ https://issues.apache.org/jira/browse/LUCENE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580317#comment-14580317 ] 

Michael McCandless commented on LUCENE-3893:
--------------------------------------------

I made a simplistic performance test (TermsQueryPerf.java in luceneutil).  It indexes 100M ids into a 7/7/7 segment index, then randomly picks 10 subsets of 50K ids each to run using TermsQuery.

I compared using TermsQuery, vs AutomatonQuery on Automata.makeStringUnion (which uses Terms.intersect), and I only measured the query execution time, not the cost to init the query (that is maybe a separate battle to fight, since we don't have a fast way to build a binary minimal automaton now, i.e. we always go through the UTF32 -> UTF8 conversion unnecessarily).

With "easy" ids (zero pad, sequential, encoded in base 10), Terms.intersect was ~40% faster (1040 msec vs 640 msec), but then when I measured on "hard" ids (fully random long, encoded in base 10), it was slower (4576 msec vs 3352) which is disappointing.

Digging a bit, I think what's happening is the cost of checking each term's suffix when scanning the terms in a block is quite a bit higher with the automaton (a RunAutomaton.step for each one, vs comparing byte[] slices for the seekExact used by TermsQuery).

Maybe we could improve the Terms.intersect impl to be more like AutomatonTermsEnum here, grabbing the next matching suffix as a BytesRef and comparing against that, instead of .step() for every suffix byte, but until we can consistently make Terms.intersect faster I think we should leave TermsQuery as is ...

> TermsQuery should use AutomatonQuery or Terms.intersect
> -------------------------------------------------------
>
>                 Key: LUCENE-3893
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3893
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2014
>
> I think we could see perf gains if TermsFilter sorted the terms, built a minimal automaton, and used TermsEnum.intersect to visit the terms...
> This idea came up on the dev list recently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org