You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2010/11/17 22:13:11 UTC

[lucy-dev] Re: [KinoSearch] Highlighter Bug

(cc'ing lucy-dev)

On Wed, Nov 17, 2010 at 01:26:51PM +0100, Nick Wellnhofer wrote:
> On 17/11/2010 00:52, Marvin Humphrey wrote:
> >On Tue, Nov 16, 2010 at 05:15:42PM +0100, Nick Wellnhofer wrote:
> >>
> >>I found a bug in the highlighter in KinoSearch 0.30_121 and 0.31. For
> >>some searches the excerpt doesn't contain any of the search terms, but
> >>only a sentence following the one that looks like it should have been
> >>picked.
> >>
> >>I had a quick look at the code in
> >>core/KinoSearch/Highlight/Highlighter.c and the test "candidate>= top"
> >>in Highlighter_raw_excerpt looks fishy to me. AFAICS this might cause
> >>sentences to be skipped.
> >
> >You're probably right that that's where we're skipping ahead, but it's not
> >clear what aspect of the input is causing Raw_Excerpt() to malfunction.
> >The check inside that block is supposed to verify that the chosen excerpt
> >contains at least some relevant material -- that's what the S_has_heat()
> >call is there for.
> 
> I had a closer look and the error case is that you have three sentences 
> where only the first and the last contain keywords but the middle one is 
> chosen for the excerpt.
> 
> >This will be hard to debug without sample material.  Once we have a
> >document which triggers the bad behavior, we can start throwing in
> >debugging printf's.  If you can supply me with sample code which has the
> >problem, I'll have a good hard look.
> 
> See the attached test case.
> 
> Nick

Thank you for the test case!  It illustrates the problem well and I have
started investigating.

Since Lucy has inherited KinoSearch's highlighter code, the problem also
exists in Lucy and I have opened an issue there.

    https://issues.apache.org/jira/browse/LUCY-126

Can you please create a JIRA account for yourself and attach your test case to
that issue as a contribution to the Apache Software Foundation, so that we can
use it for Lucy as well?  Just as is is fine, no need to adapt it for Lucy
specifically.

(Nick is already familiar with the intellectual property issues involved, but
for the rest of the KS list, here's an explanation:  When a contribution
arrives on the KinoSearch mailing list, we may assume that it is intended
for use in the KinoSearch project under KinoSearch's GPL/Artistic license.
This license is more restrictive than Lucy's Apache License 2.0; we can't
use such contributions for Lucy without an additional step granting an
additional license for the code to the ASF.)

Marvin Humphrey

> #! perl
> use strict;
> 
> use KinoSearch;
> 
> my $schema = KinoSearch::Schema->new;
> 
> my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new( 
>     language => 'de',
> );
> 
> $schema->spec_field( 
>     name => 'url',
>     type => KinoSearch::FieldType::StringType->new(),
> );
> $schema->spec_field( 
>     name  => 'title',
>     type => KinoSearch::FieldType::FullTextType->new(
>         analyzer      => $polyanalyzer,
>         boost         => 3,
>         highlightable => 1,
>     ),
> );
> $schema->spec_field(
>     name => 'bodytext',
>     type => KinoSearch::FieldType::FullTextType->new(
>         analyzer      => $polyanalyzer,
>         highlightable => 1,
>     ),
> );
> 
> my $indexer = KinoSearch::Indexer->new(
>     schema   => $schema,
>     index    => 'debug_index',
>     create   => 1,
>     truncate => 1,
> );
> 
> $indexer->add_doc(
>     doc => {
>         url      => 'urn:test',
>         title    => 'Test',
>         bodytext => <<'EOF',
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla NNN bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla MMM bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> EOF
>     },
> );
> 
> $indexer->commit();
> 
> my $searcher = KinoSearch::Searcher->new(
>     index => 'debug_index'
> );
> my $query = 'nnn mmm';
> my $bodytext_hl = KinoSearch::Highlight::Highlighter->new(
>     searcher => $searcher,
>     query    => $query,
>     field    => 'bodytext'
> );
> my $hits = $searcher->hits(
>     query      => $query,
>     offset     => 0,
>     num_wanted => 10,
> );
> 
> while(my $hit = $hits->next()) {
>     print($bodytext_hl->create_excerpt($hit), "\n");
> }
> 

> _______________________________________________
> kinosearch mailing list
> kinosearch@rectangular.com
> http://rectangular.com/cgi-bin/mailman/listinfo/kinosearch


[lucy-dev] Re: [KinoSearch] Highlighter Bug

Posted by Marvin Humphrey <ma...@rectangular.com>.
(cc to lucy-dev)

On Wed, Nov 17, 2010 at 03:17:42PM -0800, Ashley Pond V wrote:
> Just throwing this out there since it's on point though I have NO
> sample/test code. In highlighting I was doing I was seeing results
> like so
> 
> bla <strong>bla</strong> bla bla bla bla bla bla bla bla bla bla bla
> bla bla bla bla bla
>    bla.<strong></strong><strong></strong><strong></strong><strong></strong><strong></strong><strong></strong>
> 
> I strongly, snerk, suspect your fix will also address that. 

Yes, that may be the case, if one part of the Highlighter was aware of the
truncation but another part was not.

> Sorry I never got around to reporting it. The empty tags were harmless in
> HTML and I never got time to work up a test, etc. {shame on me}

The current Highlighter implementation is a little difficult to work up unit
tests for -- particularly the Raw_Excerpt() method, where Nick's bug was
hiding.  It could stand to be refactored into smaller, more testable chunks.

If you can provide confirmation at some point that the glitch is gone, that
would be nice.  :)

Cheers,

Marvin Humphrey


[lucy-dev] Re: [KinoSearch] Highlighter Bug

Posted by Marvin Humphrey <ma...@rectangular.com>.
> > I had a closer look and the error case is that you have three sentences 
> > where only the first and the last contain keywords but the middle one is 
> > chosen for the excerpt.

Found the bug.  S_has_heat() expects a length but was being passed an offset.

As a result, S_has_heat() was approving an excerpt -- because the excerpt had
"heat", meaning a warm spot in the HeatMap -- but the warm spot actually lay
outside the excerpt's boundaries.  Thus an excerpt which should have been
rejected was being approved.  Or, to be precise, the truncation of the excerpt
to end on a particular sentence boundary was approved when it should not have
been.

Before:

    bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla
    bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla
    bla.<strong></strong>

After (the &#8230; is a Unicode ellipsis):

    bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla
    bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.  bla
    bla bla <strong>MMM</strong> bla bla bla bla bla bla bla bla bla
    bla&#8230;

I've committed the fix as r6485 to the KinoSearch repository.  It would be
nice to augment that with the test case you provided and commit to Lucy as
well.

Cheers,

Marvin Humphrey