Posted to java-user@lucene.apache.org by Duke DAI <du...@gmail.com> on 2015/08/07 04:58:43 UTC

bug of highlighter/SimpleSpanFragmenter, returned longer fragment than expected?

Hi experts,

I'm trying to reproduce a bug on the Lucene side, and I found something.

In the latest release, 5.2.1, I slightly modified the test case
HighlighterTest.testSimpleQueryTermScorerHighlighter, as shown below,
mainly to use SimpleSpanFragmenter and request a single fragment of
length 64.

  public void testSimpleQueryTermScorerHighlighter() throws Exception {
    doSearching(new SpanTermQuery(new Term(FIELD_NAME, "cats")));
    QueryScorer queryScorer = new QueryScorer(query, FIELD_NAME);
    Highlighter highlighter = new Highlighter(queryScorer);
    // Highlighter highlighter = new Highlighter(new QueryTermScorer(query));
    highlighter.setTextFragmenter(new SimpleSpanFragmenter(queryScorer, 64));
    int maxNumFragmentsRequired = 1;  // only need one fragment
    for (int i = 0; i < hits.totalHits; i++) {
      final int docId = hits.scoreDocs[i].doc;
      final Document doc = searcher.doc(docId);
      String text = doc.get(FIELD_NAME);
      TokenStream tokenStream = getAnyTokenStream(FIELD_NAME, docId);

      String result = highlighter.getBestFragments(tokenStream, text,
          maxNumFragmentsRequired, "...");
      System.out.println("\t" + result);
    }
    // Not sure we can assert anything here - just running to check we don't
    // throw any exceptions
  }

With two documents:
1. "The word content does not contain the stem that we are looking for but
the metadata cats does. Do you think fragmenter work well? Do you think
fragmenter work well?"
2. "The word content does not contain the stem that we are looking for but
the metadata cats does. "
I got the corresponding fragments:
1. "for but the metadata <B>cats</B> does. Do you think fragmenter work".
No problem; this is exactly what I expected.
2. "The word content does not contain the stem that we are looking for but
the metadata <B>cats</B> does. ". The length is clearly more than 64.
That's the problem reported by my colleague.

More specifically, the problem is caused by the following code snippet in
SimpleSpanFragmenter.isNewFragment:

    boolean isNewFrag = offsetAtt.endOffset() >= (fragmentSize * currentNumFrags)
        && (textSize - offsetAtt.endOffset()) >= (fragmentSize >>> 1);

Near the end of the text, the fragmenter never starts a new fragment,
because the second condition requires at least fragmentSize / 2 characters
to remain after the current token. The logic that follows does not trim
the oversized fragment either.
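
To make the arithmetic concrete, here is a minimal standalone sketch that
replays only the size check quoted above on the two documents. It is not
the Lucene implementation: the class name, the method newFragmentStarts,
and the minTail parameter are hypothetical, the tokenizer is a plain
regular expression (a real analyzer may tokenize differently, for example
by dropping stop words, which would shift the exact offsets), and the
span-position handling in isNewFragment is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: replays only the size check.
class FragmentBoundarySketch {
    static final String DOC1 =
        "The word content does not contain the stem that we are looking for but "
        + "the metadata cats does. Do you think fragmenter work well? "
        + "Do you think fragmenter work well?";
    static final String DOC2 =
        "The word content does not contain the stem that we are looking for but "
        + "the metadata cats does. ";

    /**
     * Returns the start offsets of the tokens at which the size check
     * would report a new fragment. minTail stands in for the
     * (fragmentSize >>> 1) term so its effect can be varied.
     */
    static List<Integer> newFragmentStarts(String text, int fragmentSize, int minTail) {
        List<Integer> starts = new ArrayList<>();
        int currentNumFrags = 1;
        int textSize = text.length();
        // Assumption: every run of word characters is one token.
        Matcher token = Pattern.compile("\\w+").matcher(text);
        while (token.find()) {
            int endOffset = token.end();
            boolean isNewFrag = endOffset >= fragmentSize * currentNumFrags
                && (textSize - endOffset) >= minTail;
            if (isNewFrag) {
                currentNumFrags++;
                starts.add(token.start());
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        int fragmentSize = 64;
        int half = fragmentSize >>> 1; // 32, as in the quoted condition
        System.out.println(DOC1.length() + " " + newFragmentStarts(DOC1, fragmentSize, half));
        System.out.println(DOC2.length() + " " + newFragmentStarts(DOC2, fragmentSize, half));
        // With the tail requirement disabled, the second document does split.
        System.out.println(DOC2.length() + " " + newFragmentStarts(DOC2, fragmentSize, 0));
    }
}
```

Under these assumptions the first document (164 characters) gets
boundaries at offsets 63 and 124, while the second document (95
characters) gets none: the first token ending at or past offset 64 ends
at 66, which leaves only 29 characters, fewer than the required 32. The
whole text therefore stays in one fragment. Passing 0 for minTail makes
the second document split at offset 63, which suggests the tail check is
what suppresses the split.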


Is it possible to handle this corner case in the standard highlighter code?



Best regards,
Duke
If not now, when? If not me, who?

Re: bug of highlighter/SimpleSpanFragmenter, returned longer fragment than expected?

Posted by Robert Alexander <ro...@gmail.com>.
I've been digging into a similar issue and eventually found this Jira ticket:

https://issues.apache.org/jira/browse/LUCENE-2229

So far I haven't received any response in IRC or from the mailing list, and
the bug is resolved as "won't fix" even though there's a patch attached
that attempts to solve the issue.

For now I have given up. I'm assuming that most of the Lucene community
just doesn't use that highlighter anymore. It is also difficult to
reproduce the issue, so it probably doesn't cause a problem all that often.
It isn't worth my time right now to dig much deeper.

On Tue, Aug 11, 2015 at 10:38 AM, Duke DAI <du...@gmail.com> wrote:

> Greetings!
>
> Does anybody have any input on this?
>
> Best regards,
> Duke
> If not now, when? If not me, who?
>

Re: bug of highlighter/SimpleSpanFragmenter, returned longer fragment than expected?

Posted by Duke DAI <du...@gmail.com>.
Greetings!

Does anybody have any input on this?

Best regards,
Duke
If not now, when? If not me, who?
