Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2014/01/24 23:32:40 UTC

[jira] [Commented] (LUCENE-5415) Support wildcard & co in PostingsHighlighter

    [ https://issues.apache.org/jira/browse/LUCENE-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881494#comment-13881494 ] 

Michael McCandless commented on LUCENE-5415:
--------------------------------------------

I love this approach!  Assuming the analyzer is not too costly, this ought to scale much better than the "rewrite Query up front" workaround when the query matches tons of terms, i.e. it's less susceptible to an adversary.

The patch is surprisingly simple :)

How will the FakeDocsEnum.freq() lie affect the default PassageScorer?  Will it bias scoring against passages that have an MTQ (MultiTermQuery) match?

So, all MTQs are squished into a single fake/virtual term for matching, which means I cannot tell which of the N MTQs in my query caused a given hit.  I think this is OK for starters: it's unusual (maybe?) to run multiple MTQs and to *also* care about which one matched each hit in the highlight.  But I guess we could instead add N virtual terms, one per MTQ... later.
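
To make the "single virtual term" behavior concrete, here is a rough standalone sketch. It is not the patch's code and uses no Lucene classes: java.util.regex stands in for the Automaton, a \w+ scan stands in for the analyzer, and the names VirtualTermSketch, Hit, compileWildcard and findHits are hypothetical. It only illustrates matching the patterns against the content directly and recording every hit under one term, so the hit itself does not say which pattern fired.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VirtualTermSketch {

    /** One hit of the single virtual term: character offsets into the content. */
    static final class Hit {
        final int start;
        final int end;

        Hit(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Assumption: a regex stands in for the query's Automaton ('*' = any run, '?' = one char). */
    static Pattern compileWildcard(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    /** Scans the content token by token; no term dictionary or postings are consulted. */
    static List<Hit> findHits(String content, List<String> wildcards) {
        List<Pattern> patterns = new ArrayList<>();
        for (String w : wildcards) {
            patterns.add(compileWildcard(w));
        }
        List<Hit> hits = new ArrayList<>();
        // Assumption: a trivial \w+ tokenizer plus lowercasing replaces the real analyzer.
        Matcher tokens = Pattern.compile("\\w+").matcher(content);
        while (tokens.find()) {
            String token = tokens.group().toLowerCase(Locale.ROOT);
            for (Pattern p : patterns) {
                if (p.matcher(token).matches()) {
                    // Every pattern feeds the same virtual term, so we stop at the
                    // first match and do not record which pattern caused the hit.
                    hits.add(new Hit(tokens.start(), tokens.end()));
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String content = "Lucene highlights wildcard matches lazily";
        for (Hit h : findHits(content, Arrays.asList("high*", "la?ily"))) {
            System.out.println(h.start + "-" + h.end + " " + content.substring(h.start, h.end));
        }
    }
}
```

The "N virtual terms, one per MTQ" variant would amount to storing the index of the matching pattern in each Hit instead of breaking out of the loop anonymously.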

> Support wildcard & co in PostingsHighlighter
> --------------------------------------------
>
>                 Key: LUCENE-5415
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5415
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Robert Muir
>         Attachments: LUCENE-5415.patch
>
>
> PostingsHighlighter uses the offsets encoded in the postings lists for the terms to find query matches.
> As such, it isn't really suitable for stuff like wildcards for two reasons:
> 1. an expensive rewrite against the term dictionary (I think other highlighters share this problem)
> 2. accumulating data from potentially many terms (e.g. reading many postings)
> However, we could provide an option for some of these queries to work, but in a different way, that avoids these downsides.
> Instead we can just grab the Automaton representation of the queries, and match it against the content directly (which won't blow up).


