Posted to dev@lucene.apache.org by "Scott Stults (JIRA)" <ji...@apache.org> on 2015/10/07 17:40:27 UTC
[jira] [Commented] (LUCENE-2287) Unexpected terms are highlighted within nested SpanQuery instances

    [ https://issues.apache.org/jira/browse/LUCENE-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947050#comment-14947050 ] 

Scott Stults commented on LUCENE-2287:
--------------------------------------

LUCENE-5455 has a few tests that should be added here once this patch is cleaned up. 

There are a few hurdles in cleaning this up, though. The first is that this patch was based on a *really* old version, and I can't seem to find anything in SVN or git older than 3.1. The second is that Spans are quite a bit different now than they were in that version.

By the way, I've tried the unit tests in both issues and they still fail in 5.3+. The issue seems to be in WeightedSpanTermExtractor.extractWeightedSpanTerms(). It first builds a list of all position spans, and then it adds all of those position spans to a map of the term irrespective of whether that term was used in that position span. Mike's patch addresses this by keeping a separate list of position spans per term.
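To make the difference concrete, here is a minimal, self-contained sketch of the two bookkeeping strategies. It does not use Lucene at all: PositionSpan, sharedSpans and highlightedPositions are hypothetical stand-ins, not the real WeightedSpanTermExtractor API. The token positions and span boundaries are worked out by hand for the test sentence from the issue description, assuming whitespace tokenization, lowercasing and no stop-word removal. The per-term spans are also assigned by hand; how the patch actually derives them is not modeled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SpanBookkeepingSketch {

    /** Hypothetical stand-in for a matched span; start and end are inclusive token positions. */
    static final class PositionSpan {
        final int start;
        final int end;
        PositionSpan(int start, int end) { this.start = start; this.end = end; }
        boolean contains(int position) { return position >= start && position <= end; }
    }

    /** Shared bookkeeping: every query term is given every span of the outermost query. */
    static Map<String, List<PositionSpan>> sharedSpans(List<String> queryTerms,
                                                       List<PositionSpan> outerSpans) {
        Map<String, List<PositionSpan>> byTerm = new HashMap<>();
        for (String term : queryTerms) {
            byTerm.put(term, new ArrayList<>(outerSpans));
        }
        return byTerm;
    }

    /** Positions whose token is a query term and lies inside one of that term's spans. */
    static List<Integer> highlightedPositions(String[] tokens,
                                              Map<String, List<PositionSpan>> byTerm) {
        List<Integer> hits = new ArrayList<>();
        for (int pos = 0; pos < tokens.length; pos++) {
            List<PositionSpan> spans = byTerm.get(tokens[pos]);
            if (spans == null) continue;
            for (PositionSpan span : spans) {
                if (span.contains(pos)) { hits.add(pos); break; }
            }
        }
        return hits;
    }

    // Positions: the=0 lucene=1 was=2 made=3 by=4 doug=5 cutting=6 and=7 lucene=8 great=9 hadoop=10 was=11
    static String[] tokens() {
        return "the lucene was made by doug cutting and lucene great hadoop was".split(" ");
    }

    /** The outer SpanNearQuery matches positions 1..10, and all three terms share that span. */
    static List<Integer> sharedResult() {
        List<PositionSpan> outer = Arrays.asList(new PositionSpan(1, 10));
        return highlightedPositions(tokens(),
                sharedSpans(Arrays.asList("lucene", "doug", "hadoop"), outer));
    }

    /** Each term keeps only the span of the clause that matched it (assigned by hand here). */
    static List<Integer> perTermResult() {
        Map<String, List<PositionSpan>> byTerm = new HashMap<>();
        byTerm.put("lucene", Arrays.asList(new PositionSpan(1, 5)));   // inner lucene..doug match
        byTerm.put("doug", Arrays.asList(new PositionSpan(1, 5)));     // inner lucene..doug match
        byTerm.put("hadoop", Arrays.asList(new PositionSpan(10, 10))); // the hadoop term itself
        return highlightedPositions(tokens(), byTerm);
    }

    public static void main(String[] args) {
        System.out.println("shared:   " + sharedResult());   // [1, 5, 8, 10]
        System.out.println("per term: " + perTermResult());  // [1, 5, 10]
    }
}
```

With shared spans, the second "lucene" at position 8 falls inside the outer span 1..10 and gets highlighted even though it never took part in the inner match. With per-term spans it is excluded.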

The thing that's *not* fixed by the patch is the notion of when to stop recursing into the spans. I tried several methods of inspecting and classifying the spans, but I end up with either too many positions (resulting in too many term highlights) or too few.

[~romseygeek], why is this so hard? Can't we just use the same methods the searcher uses? Maybe create a new collector and re-search the incoming doc?

> Unexpected terms are highlighted within nested SpanQuery instances
> ------------------------------------------------------------------
>
>                 Key: LUCENE-2287
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2287
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 2.9.1
>         Environment: Linux, Solaris, Windows
>            Reporter: Michael Goddard
>            Priority: Minor
>         Attachments: LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> I haven't yet been able to resolve why I'm seeing spurious highlighting in nested SpanQuery instances.  Briefly, the issue is illustrated by the second instance of "Lucene" being highlighted in the test below, when it doesn't satisfy the inner span.  There's been some discussion about this on the java-dev list, and I'm opening this issue now because I have made some initial progress on this.
> This new test, added to the  HighlighterTest class in lucene_2_9_1, illustrates this:
> /*
>  * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
>  */
> public void testHighlightingNestedSpans2() throws Exception {
>   String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
>   //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay
>   String fieldName = "SOME_FIELD_NAME";
>   SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
>     new SpanTermQuery(new Term(fieldName, "lucene")),
>     new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
>   Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
>     new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
>   String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
>   //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";
>   String observed = highlightField(query, fieldName, theText);
>   System.out.println("Expected: \"" + expected + "\"\n" + "Observed: \"" + observed + "\"");
>   assertEquals("Why is that second instance of the term \"Lucene\" highlighted?", expected, observed);
> }
> Is this an issue that's arisen before?  I've been reading through the source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and NearSpansOrdered, but haven't found the solution yet.  Initially, I thought that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't get me too far.


