You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2014/07/11 13:32:04 UTC
[jira] [Updated] (LUCENE-5815) Add TermAutomatonQuery, for
proximity matching that generalizes MultiPhraseQuery/SpanNearQuery
[ https://issues.apache.org/jira/browse/LUCENE-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-5815:
---------------------------------------
Attachment: LUCENE-5815.patch
Patch, work in progress, lots of nocommits...
> Add TermAutomatonQuery, for proximity matching that generalizes MultiPhraseQuery/SpanNearQuery
> ----------------------------------------------------------------------------------------------
>
> Key: LUCENE-5815
> URL: https://issues.apache.org/jira/browse/LUCENE-5815
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 4.10
>
> Attachments: LUCENE-5815.patch
>
>
> I created a new query, called TermAutomatonQuery, that's a proximity
> query to generalize MultiPhraseQuery/SpanNearQuery: it lets you
> construct an arbitrary automaton whose transitions are whole terms, and
> then find all documents that the automaton matches. This is different
> from a "normal" automaton whose transitions are usually
> bytes/characters within a term/s.
> So, if the automaton has just 1 transition, it's just an expensive
> TermQuery. If you have two transitions in sequence, it's a phrase
> query of two terms. You can express synonyms by using transitions
> that overlap one another but the automaton doesn't have to be a
> "sausage" (as MultiPhraseQuery requires) i.e. it "respects" posLength
> (at query time).
> It also allows "any" transitions, to match any term, so you can do
> sloppy matching and span-like queries, e.g. find "lucene" and "python"
> with up to 3 other terms in between.
> I also added a class to convert a TokenStream directly to the
> automaton for this query, preserving posLength. (Of course, the index
> can't store posLength, so the matching won't be fully correct if any
> indexed tokens has posLength != 1). But if you do query-time-only
> synonyms then the matching should finally be correct.
> I haven't tested performance but I suspect it's quite slowish ... its
> cost is O(sum-totalTF) of all terms "used" in the automaton. There
> are some optimizations we could do, e.g. detecting that some terms in
> the automaton can be upgraded to MUST (right now they are all
> effectively SHOULD).
> I'm not sure how it should assign scores (punted on that for now), but
> the matching seems to be working.
--
This message was sent by Atlassian JIRA
(v6.2#6252)
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org