You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2014/07/11 13:32:04 UTC

[jira] [Updated] (LUCENE-5815) Add TermAutomatonQuery, for proximity matching that generalizes MultiPhraseQuery/SpanNearQuery

     [ https://issues.apache.org/jira/browse/LUCENE-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-5815:
---------------------------------------

    Attachment: LUCENE-5815.patch

Patch, work in progress, lots of nocommits...

> Add TermAutomatonQuery, for proximity matching that generalizes MultiPhraseQuery/SpanNearQuery
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5815
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5815
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.10
>
>         Attachments: LUCENE-5815.patch
>
>
> I created a new query, called TermAutomatonQuery, that's a proximity
> query to generalize MultiPhraseQuery/SpanNearQuery: it lets you
> construct an arbitrary automaton whose transitions are whole terms, and
> then find all documents that the automaton matches.  This is different
> from a "normal" automaton whose transitions are usually
> bytes/characters within a term/s.
> So, if the automaton has just 1 transition, it's just an expensive
> TermQuery.  If you have two transitions in sequence, it's a phrase
> query of two terms.  You can express synonyms by using transitions
> that overlap one another but the automaton doesn't have to be a
> "sausage" (as MultiPhraseQuery requires) i.e. it "respects" posLength
> (at query time).
> It also allows "any" transitions, to match any term, so you can do
> sloppy matching and span-like queries, e.g. find "lucene" and "python"
> with up to 3 other terms in between.
> I also added a class to convert a TokenStream directly to the
> automaton for this query, preserving posLength.  (Of course, the index
> can't store posLength, so the matching won't be fully correct if any
> indexed tokens has posLength != 1).  But if you do query-time-only
> synonyms then the matching should finally be correct.
> I haven't tested performance but I suspect it's quite slowish ... its
> cost is O(sum-totalTF) of all terms "used" in the automaton.  There
> are some optimizations we could do, e.g. detecting that some terms in
> the automaton can be upgraded to MUST (right now they are all
> effectively SHOULD).
> I'm not sure how it should assign scores (punted on that for now), but
> the matching seems to be working.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org