Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2011/11/15 04:46:06 UTC

[lucy-dev] Highlighter bug in 0.2.2

Greets,

I've been prioritizing other things, but I received an offlist request about
the Highlighter bug in 0.2.2, which is a consequence of a bad fix for a bug
present in Lucy 0.2.1 and earlier.

Highlighter starts off by obtaining an array of "Span" objects via
Compiler_Highlight_Spans() which tell you which parts of the field contributed
to the score.  Each Span has three member variables, which have the following
meanings in the context of highlighting:

  offset - Start of the selection in code points from the top of the field.
  length - Length of the selection in code points.
  weight - Floating point score indicating how much the selection contributed.

Say that you have this field value...

  Three blind mice.  Three blind mice.  See how they run.  See how they run.

... and you search for 'three'.  You'll see spans like this:

  { offset: 00, length: 05, weight: 0.307 } <-- first occurrence of 'three'
  { offset: 19, length: 05, weight: 0.307 } <-- second occurrence of 'three'
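
To make the offset/length convention concrete, here is a minimal C sketch of
pulling the selected text back out of the field.  The Span struct and the
span_text() helper are illustrative stand-ins, not Lucy's actual definitions,
and the sketch assumes ASCII-only text so that one code point equals one byte.

```c
#include <assert.h>
#include <string.h>

/* Illustrative stand-in for a Span; not Lucy's actual struct. */
typedef struct {
    int   offset;  /* start of the selection, in code points */
    int   length;  /* length of the selection, in code points */
    float weight;  /* how much the selection contributed to the score */
} Span;

/* Copy the text selected by `span` into `buf` as a NUL-terminated string.
 * Assumes ASCII, so code points and bytes coincide.
 * Returns 0 on success, -1 if the span runs past the field or `buf` is
 * too small. */
static int
span_text(const char *field, const Span *span, char *buf, size_t buf_size) {
    size_t field_len = strlen(field);
    assert(span->offset >= 0 && span->length >= 0);
    size_t start = (size_t)span->offset;
    size_t len   = (size_t)span->length;
    if (start + len > field_len || len + 1 > buf_size) {
        return -1;
    }
    memcpy(buf, field + start, len);
    buf[len] = '\0';
    return 0;
}
```

Applied to the field value above, both spans select the text "Three".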

Highlighter uses this array of Spans to build a "HeatMap", taking into account
Spans which occur close to each other and boosting the areas nearby, with the
goal of preferring excerpts with high term density.  Here are the HeatMap's
Spans for the query 'three':

  { offset: 00, length: 05, weight: 0.798 } <-- red hot
  { offset: 05, length: 14, weight: 0.491 } <-- warm
  { offset: 19, length: 05, weight: 0.798 } <-- red hot

Our current bug is that we are using the HeatMap's Spans not only to determine
where to snip our excerpt from, but also where to place our highlighting tags.
We should only be highlighting "Three" and "Three".  Instead we are
highlighting "Three", " blind mice.  ", and "Three".
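
The difference is easy to see in a small C sketch that wraps each span in
tags.  The tag_spans() helper, the Span struct, and the hard-coded <strong>
tags are assumptions made for illustration, not Highlighter's real code; the
sketch also assumes ASCII text and spans that are sorted and non-overlapping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative stand-in for a Span; not Lucy's actual struct. */
typedef struct {
    int   offset;
    int   length;
    float weight;
} Span;

/* Copy `text` into `out`, wrapping every span in <strong>...</strong>.
 * Spans must be sorted by offset and must not overlap.  Assumes ASCII.
 * Returns the number of bytes written, not counting the NUL. */
static size_t
tag_spans(const char *text, const Span *spans, int num_spans,
          char *out, size_t out_size) {
    size_t text_len = strlen(text);
    size_t pos      = 0;  /* how much of `text` has been copied so far */
    size_t used     = 0;  /* how much of `out` has been filled so far */
    assert(out_size > 0);
    out[0] = '\0';

    for (int i = 0; i < num_spans; i++) {
        size_t start = (size_t)spans[i].offset;
        size_t end   = start + (size_t)spans[i].length;
        assert(start >= pos && end <= text_len);
        /* Untagged text before the span, then the tagged span itself. */
        int wrote = snprintf(out + used, out_size - used,
                             "%.*s<strong>%.*s</strong>",
                             (int)(start - pos), text + pos,
                             (int)(end - start), text + start);
        assert(wrote >= 0 && (size_t)wrote < out_size - used);
        used += (size_t)wrote;
        pos = end;
    }

    /* Whatever follows the last span is copied untagged. */
    int wrote = snprintf(out + used, out_size - used, "%s", text + pos);
    assert(wrote >= 0 && (size_t)wrote < out_size - used);
    return used + (size_t)wrote;
}
```

Fed the two original Spans, this tags only the two occurrences of "Three".
Fed the three HeatMap Spans, it also wraps " blind mice.  " in tags, which is
the behavior described above.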

We used to use the original set of Spans produced by
Compiler_Highlight_Spans() to determine where to place the highlighting tags.
However, this was vulnerable to a problem affecting queries with duplicate
terms (<https://issues.apache.org/jira/browse/LUCY-182>).  Here are the spans
produced when searching for 'three three':

  { offset: 00, length: 05, weight: 0.217 }
  { offset: 19, length: 05, weight: 0.217 }
  { offset: 00, length: 05, weight: 0.217 } <-- repeat of first Span
  { offset: 19, length: 05, weight: 0.217 } <-- repeat of second Span

The solution, I believe, is to copy the original spans, then sort and "flatten"
them (using VA_Sort() and HeatMap_Flatten_Spans()), producing the following
set:

  { offset: 00, length: 05, weight: 0.434 } <-- sum of Spans 0 and 2
  { offset: 19, length: 05, weight: 0.434 } <-- sum of Spans 1 and 3
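
For readers who want the idea without reading HeatMap.c, here is a rough C
sketch of flattening.  It is not the HeatMap_Flatten_Spans() implementation:
it uses a simpler boundary-sweep technique, and the Span struct, MAX_SPANS
limit, and function names are all made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-in for a Span; not Lucy's actual struct. */
typedef struct {
    int   offset;
    int   length;
    float weight;
} Span;

#define MAX_SPANS 64  /* arbitrary limit for this sketch */

static int
compare_ints(const void *a, const void *b) {
    int x = *(const int*)a;
    int y = *(const int*)b;
    return (x > y) - (x < y);
}

/* Flatten possibly overlapping spans into non-overlapping ones, where each
 * output weight is the sum of the weights of all input spans covering it.
 * `out` needs room for 2 * num_spans entries.  Returns the output count. */
static int
flatten_spans(const Span *spans, int num_spans, Span *out) {
    int bounds[2 * MAX_SPANS];
    int num_bounds = 0;
    int num_out    = 0;
    assert(num_spans <= MAX_SPANS);

    /* Every span start and end is a potential boundary in the output. */
    for (int i = 0; i < num_spans; i++) {
        bounds[num_bounds++] = spans[i].offset;
        bounds[num_bounds++] = spans[i].offset + spans[i].length;
    }
    qsort(bounds, (size_t)num_bounds, sizeof(int), compare_ints);

    /* Sum the weights covering each gap between consecutive boundaries. */
    for (int b = 0; b + 1 < num_bounds; b++) {
        int start = bounds[b];
        int end   = bounds[b + 1];
        if (start == end) { continue; }  /* repeated boundary */
        float sum     = 0.0f;
        int   covered = 0;
        for (int i = 0; i < num_spans; i++) {
            int span_end = spans[i].offset + spans[i].length;
            if (spans[i].offset <= start && span_end >= end) {
                sum += spans[i].weight;
                covered = 1;
            }
        }
        if (covered) {  /* skip stretches that no span touches */
            out[num_out].offset = start;
            out[num_out].length = end - start;
            out[num_out].weight = sum;
            num_out++;
        }
    }
    return num_out;
}
```

Given the four 'three three' spans, this yields the two summed spans shown
above.  As a side check, feeding it the two 'three' spans plus a hypothetical
proximity span { offset: 0, length: 24, weight: 0.491 } reproduces the three
HeatMap spans listed earlier; whether the HeatMap derives its boost exactly
that way is an assumption inferred from the numbers.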

I've included a small script and its output below my sig which I think
demonstrate that the approach will work, along with a small patch that is
necessary to sum overlapping spans correctly.

It's frustrating that this bug made it through our unit tests, as our test
coverage for Highlighter and HeatMap isn't that horrible.  But that's how it
goes sometimes...

Marvin Humphrey


marvin@smokey:~/projects/lucy5/perl $ cat test_highlight.pl 
use strict;
use warnings;
use Lucy;

my $schema = Lucy::Plan::Schema->new;
my $type   = Lucy::Plan::FullTextType->new(
    analyzer      => Lucy::Analysis::PolyAnalyzer->new( language => 'en' ),
    highlightable => 1,
);
$schema->spec_field( name => 'content', type => $type );
my $folder  = Lucy::Store::RAMFolder->new;
my $indexer = Lucy::Index::Indexer->new(
    index  => $folder,
    create => 1,
    schema => $schema
);
my $string
    = "Three blind mice.  Three blind mice.  See how they run.  See how they run.";
$indexer->add_doc( { content => $string } );
$indexer->commit;

show_spans("three");
show_spans("three three");
exit;

sub show_spans {
    my $qstring  = shift;
    my $searcher = Lucy::Search::IndexSearcher->new( index => $folder );
    my $query    = $searcher->glean_query($qstring);
    my $compiler = $query->make_compiler( searcher => $searcher );
    my $spans    = $compiler->highlight_spans(
        doc_vec  => $searcher->fetch_doc_vec(1),
        searcher => $searcher,
        field    => 'content',
    );
    my $heat_map = Lucy::Highlight::HeatMap->new( spans => $spans );

    print "Spans for query '$qstring':\n";
    pretty_print_spans($spans);
    print "\nHeatMap Spans for query '$qstring':\n";
    pretty_print_spans( $heat_map->get_spans );
    print "\nFlattened Spans for query '$qstring':\n";
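    # Sort by offset explicitly: a bare "sort" compares the objects' string
    # forms, not their offsets.
    my @sorted = sort { $a->get_offset <=> $b->get_offset } @$spans;
    pretty_print_spans( $heat_map->flatten_spans( \@sorted ) );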
    print "\n";
}

sub pretty_print_spans {
    my $spans = shift;
    for my $span (@$spans) {
        printf( "  { offset: %.2d, length: %.2d, weight: %.3f }\n",
            $span->get_offset, $span->get_length, $span->get_weight );
    }
}

marvin@smokey:~/projects/lucy5/perl $ perl -Mblib test_highlight.pl 
Spans for query 'three':
  { offset: 00, length: 05, weight: 0.307 }
  { offset: 19, length: 05, weight: 0.307 }

HeatMap Spans for query 'three':
  { offset: 00, length: 05, weight: 0.798 }
  { offset: 05, length: 14, weight: 0.491 }
  { offset: 19, length: 05, weight: 0.798 }

Flattened Spans for query 'three':
  { offset: 00, length: 05, weight: 0.307 }
  { offset: 19, length: 05, weight: 0.307 }

Spans for query 'three three':
  { offset: 00, length: 05, weight: 0.217 }
  { offset: 19, length: 05, weight: 0.217 }
  { offset: 00, length: 05, weight: 0.217 }
  { offset: 19, length: 05, weight: 0.217 }

HeatMap Spans for query 'three three':
  { offset: 00, length: 05, weight: 2.258 }
  { offset: 05, length: 14, weight: 1.390 }
  { offset: 19, length: 05, weight: 2.258 }

Flattened Spans for query 'three three':
  { offset: 00, length: 05, weight: 0.434 }
  { offset: 19, length: 05, weight: 0.434 }

marvin@smokey:~/projects/lucy5/perl $ git diff ../core/Lucy/Highlight/HeatMap.c
diff --git a/core/Lucy/Highlight/HeatMap.c b/core/Lucy/Highlight/HeatMap.c
index 0472022..6902fc6 100644
--- a/core/Lucy/Highlight/HeatMap.c
+++ b/core/Lucy/Highlight/HeatMap.c
@@ -118,7 +118,7 @@ HeatMap_flatten_spans(HeatMap *self, VArray *spans) {
 
             // Get the location of the flattened span that shares the source
             // span's offset.
-            for (; dest_tick < num_raw_flattened; dest_tick++) {
+            for (dest_tick = 0; dest_tick < num_raw_flattened; dest_tick++) {
                 Span *dest_span = (Span*)VA_Fetch(flattened, dest_tick);
                 if (dest_span->offset == source_span->offset) {
                     break;
marvin@smokey:~/projects/lucy5/perl $