Posted to dev@lucy.apache.org by Peter Karman <pe...@peknet.com> on 2011/08/30 20:38:39 UTC

[lucy-dev] which fields contained which terms

Per the thread here from Feb 2011[0], I want to make it easy to discover why a
document matched a given query, i.e., which terms matched in which fields.

Marvin and I have chatted about this a few different times on #lucy_dev, and
it's clear to me now why it is problematic to do this kind of data gathering in
the existing Matcher/Collector architecture. Post-processing provides a cleaner
way into the solution, provided we can do it without sacrificing performance.

I wanted to get this thread on to the -dev list as we need to sort out if/how
the index structure might change to make this feature possible.

Thoughts?


[0] http://s.apache.org/fz
-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] which fields contained which terms

Posted by Peter Karman <pe...@peknet.com>.

Marvin Humphrey wrote on 8/30/11 4:59 PM:

> Make sure that you spec every field as "highlightable".  Then, at search time,
> try something like this:
> 
>     my $query = $query_parser->parse($query_string);
>     my $compiler = $query->make_compiler(searcher => $searcher);
>     my $hits = $searcher->hits;
>     while (my $hit = $hits->next) {
>         my $doc_vec = $searcher->fetch_doc_vec($hit->get_doc_id);
>         my @relevant_fields;
>         for my $field (@{ $schema->all_fields }) {
>             my $spans = $compiler->highlight_spans(
>                 searcher => $searcher,
>                 doc_vec  => $doc_vec,
>                 field    => $field,
>             );
>             if (@$spans) {
>                 push @relevant_fields, $field;
>             }
>         }
>         print "Relevant fields: ";
>         print join ", ", @relevant_fields;
>         print "\n";
>     }
> 
> If a field produces highlight spans, it was relevant.  If it doesn't produce
> highlight spans, it wasn't relevant.
> 
> Does that work?
> 

it does, quite well, thank you.

I've added a find_relevant_fields() method to SWISH::Prog::Lucy::Results that
implements the above nearly verbatim.
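
For readers following along, a method of that kind might look roughly like the
sketch below. This is not the actual SWISH::Prog::Lucy::Results code; the
argument list is an assumption made for illustration, and only
fetch_doc_vec() and highlight_spans() are taken from the snippet quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper, not the real SWISH::Prog::Lucy::Results method.
# $compiler : compiler made from the parsed query
# $searcher : the searcher that produced the hit
# $doc_id   : the hit's document id, e.g. $hit->get_doc_id
# $fields   : array ref of field names, e.g. $schema->all_fields
sub find_relevant_fields {
    my ( $compiler, $searcher, $doc_id, $fields ) = @_;

    # Fetch the DocVector once per hit and reuse it for every field.
    my $doc_vec = $searcher->fetch_doc_vec($doc_id);

    my @relevant;
    for my $field (@$fields) {
        my $spans = $compiler->highlight_spans(
            searcher => $searcher,
            doc_vec  => $doc_vec,
            field    => $field,
        );

        # Any span at all means the query matched in this field.
        push @relevant, $field if @$spans;
    }
    return \@relevant;
}
```

Returning an array ref rather than printing leaves it to the caller to decide
how to present the field names.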

Aside from the index size increase, I think this is a win for this problem.

Thanks, Marvin.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] which fields contained which terms

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Sat, Sep 10, 2011 at 08:18:02PM -0500, Peter Karman wrote:
> my brief tests show that setting highlightable => 1 for all fields increases the
> size of the index by about 65%. Is that about right, in your experience?

Yes, that's not surprising.  Those miniature inverted indexes contain a lot of
data.  Each has its own term dictionary.  Both term frequency and positional
data are included, and the per-token positional data is augmented with start
and end offsets measured in code points.  
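
To make the size cost concrete, here is a toy sketch of the kind of data
described above for a single field of a single document. It is only an
illustration in plain Perl: the hash layout, the invert_field() name, and the
simple \w+ tokenizer are assumptions, and none of it reflects Lucy's real
serialized format.

```perl
use strict;
use warnings;

# Toy illustration only: this is NOT Lucy's on-disk DocVector format.
# For one field of one document, build a term => postings map holding the
# term frequency plus a position and start/end offsets per occurrence.
sub invert_field {
    my ($text) = @_;
    my %terms;
    my $position = 0;
    while ( $text =~ /(\w+)/g ) {
        my $token = $1;
        my $end   = pos($text);              # offset just past the token
        my $start = $end - length($token);   # characters, i.e. code points
                                             # when $text is a decoded string
        my $entry = $terms{ lc $token } ||= { freq => 0, occurrences => [] };
        $entry->{freq}++;
        push @{ $entry->{occurrences} },
            { position => $position++, start => $start, end => $end };
    }
    return \%terms;
}

# One such structure per highlightable field, per document.
my $doc_vector = {
    author => invert_field('Peter Karman'),
    title  => invert_field('Which fields contained which terms'),
};
```

Every token carries a position and two offsets of its own, which is why
marking all fields highlightable adds so much to the index.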

Marvin Humphrey

Re: [lucy-dev] which fields contained which terms

Posted by Peter Karman <pe...@peknet.com>.

Marvin Humphrey wrote on 8/30/11 4:59 PM:

> 
> To support highlighting, at index-time we create an inverted representation
> for each field that has been marked as "highlightable", then serialize all the
> inverted fields together in one blob (called, for no particularly good reason,
> a "DocVector").  Effectively this is a miniature inverted-index containing a
> single document.  The class which does the work is
> Lucy::Index::HighlightWriter, and the relevant segment files are named
> seg_NNN/highlight.ix and seg_NNN/highlight.dat.
> 

my brief tests show that setting highlightable => 1 for all fields increases the
size of the index by about 65%. Is that about right, in your experience?

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] Re: which fields contained which terms

Posted by Peter Karman <pe...@peknet.com>.

Marvin Humphrey wrote on 9/9/11 12:45 AM:

> The highlight.* files exist as discrete files for a little while during
> indexing, but then they get rolled into the compound file.

ah! thanks.

/me continues with testing from original thread.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] Re: which fields contained which terms

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Thu, Sep 08, 2011 at 10:25:59PM -0500, Peter Karman wrote:
> Marvin Humphrey wrote on 8/30/11 4:59 PM:
> 
> > To support highlighting, at index-time we create an inverted representation
> > for each field that has been marked as "highlightable", then serialize all the
> > inverted fields together in one blob (called, for no particularly good reason,
> > a "DocVector").  Effectively this is a miniature inverted-index containing a
> > single document.  The class which does the work is
> > Lucy::Index::HighlightWriter, and the relevant segment files are named
> > seg_NNN/highlight.ix and seg_NNN/highlight.dat.
> 
> I finally found time to try this, but I must be doing something wrong, because
> despite setting all my fields to 'highlightable', no seg_NNN/highlight.* files
> are getting created.

The highlight.* files exist as discrete files for a little while during
indexing, but then they get rolled into the compound file.  Take a look in
seg_NNN/cfmeta.json.

> |-- locks
> |-- schema_1.json
> |-- seg_1
> |   |-- cf.dat
> |   |-- cfmeta.json
> |   `-- segmeta.json
> |-- snapshot_1.json
> |-- swish.xml
> `-- swish_last_start

The "cf.dat" file has the data, and the "cfmeta.json" file contains the file
metadata, such as offset and length.
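
One way to confirm that the highlight data made it into the compound file is
to decode cfmeta.json and look for the highlight entries. The sketch below
assumes the decoded JSON has a top-level "files" hash that maps each virtual
file name to its "offset" and "length"; that layout is an assumption here, so
check it against your own cfmeta.json. The helper names are made up for this
example.

```perl
use strict;
use warnings;
use JSON::PP ();

# Assumption: cfmeta.json decodes to a hash whose "files" entry maps each
# virtual file name to a hash with "offset" and "length" values.
sub highlight_entries {
    my ($cfmeta) = @_;
    my $files = $cfmeta->{files} || {};
    my %found;
    for my $name ( keys %$files ) {
        $found{$name} = $files->{$name} if $name =~ /highlight\./;
    }
    return \%found;
}

# Read and decode the compound-file metadata for one segment.
sub read_cfmeta {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    local $/;    # slurp the whole file
    my $json = <$fh>;
    close $fh;
    return JSON::PP->new->utf8->decode($json);
}

# Usage (the path is an example):
#   my $entries = highlight_entries( read_cfmeta('seg_1/cfmeta.json') );
#   for my $name ( sort keys %$entries ) {
#       printf "%s offset=%s length=%s\n", $name,
#           $entries->{$name}{offset}, $entries->{$name}{length};
#   }
```

If the returned hash is empty, the fields were probably not indexed as
highlightable.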

Marvin Humphrey

[lucy-dev] Re: which fields contained which terms

Posted by Peter Karman <pe...@peknet.com>.

Marvin Humphrey wrote on 8/30/11 4:59 PM:

> To support highlighting, at index-time we create an inverted representation
> for each field that has been marked as "highlightable", then serialize all the
> inverted fields together in one blob (called, for no particularly good reason,
> a "DocVector").  Effectively this is a miniature inverted-index containing a
> single document.  The class which does the work is
> Lucy::Index::HighlightWriter, and the relevant segment files are named
> seg_NNN/highlight.ix and seg_NNN/highlight.dat.

I finally found time to try this, but I must be doing something wrong, because
despite setting all my fields to 'highlightable', no seg_NNN/highlight.* files
are getting created.

here's a snip from schema_1.json:

   "fields" : {
      "author" : {
         "analyzer" : "1",
         "highlightable" : "1",
         "sortable" : "1",
         "type" : "fulltext"
      },
      "dates" : {
         "analyzer" : "1",
         "highlightable" : "1",
         "sortable" : "1",
         "type" : "fulltext"
      },
      "orgs" : {
         "analyzer" : "1",
         "highlightable" : "1",
         "sortable" : "1",
         "type" : "fulltext"
      },

and here's what my index dir looks like:

index.swish3]$ tree
.
|-- locks
|-- schema_1.json
|-- seg_1
|   |-- cf.dat
|   |-- cfmeta.json
|   `-- segmeta.json
|-- snapshot_1.json
|-- swish.xml
`-- swish_last_start

2 directories, 7 files


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] which fields contained which terms

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Tue, Aug 30, 2011 at 01:38:39PM -0500, Peter Karman wrote:
> Per the thread here from Feb 2011[0] I am want to make it easy to discover why a
> document matched a given query, i.e. which terms matched in which fields.
> 
> Marvin and I have chatted about this a few different times on #lucy_dev, and
> it's clear to me now why it is problematic to do this kind of data gathering in
> the existing Matcher/Collector architecture. Post-processing provides a cleaner
> way into the solution, provided we can do it without sacrificing performance.
> 
> I wanted to get this thread on to the -dev list as we need to sort out if/how
> the index structure might change to make this feature possible.

The general idea is to treat this problem like highlighting, which is also
done using post-processing.

To support highlighting, at index-time we create an inverted representation
for each field that has been marked as "highlightable", then serialize all the
inverted fields together in one blob (called, for no particularly good reason,
a "DocVector").  Effectively this is a miniature inverted-index containing a
single document.  The class which does the work is
Lucy::Index::HighlightWriter, and the relevant segment files are named
seg_NNN/highlight.ix and seg_NNN/highlight.dat.

At search time, we retrieve the single-document mini-inverted-index which
corresponds to each hit, and then use that information to determine what
portions of a given highlightable field matched.  Each Query subclass's
companion Compiler class implements a Highlight_Spans() method which returns
an array of Lucy::Search::Span objects.  If the field matched against the
document, the array returned by Highlight_Spans() will be non-empty, and
Highlighter uses those spans to choose the excerpt and highlight the relevant
sections.
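
Since each span records where in the field the match occurred, the spans can
also be turned back into the matched text. The following is a minimal sketch,
assuming each span responds to get_offset() and get_length() (as
Lucy::Search::Span does) and that the field value is a decoded string, so that
substr() counts code points. The matched_text() helper is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: map spans back onto the stored field value.
# $field_text : decoded field value for the hit
# $spans      : array ref of objects with get_offset() and get_length()
sub matched_text {
    my ( $field_text, $spans ) = @_;
    my %seen;
    my @matches;
    for my $span (@$spans) {
        my $text
            = substr( $field_text, $span->get_offset, $span->get_length );

        # Report each distinct matched string once, ignoring case.
        push @matches, $text unless $seen{ lc $text }++;
    }
    return \@matches;
}
```

Called once per field with that field's spans, this gives a rough answer to
which text matched in which field.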

Hey wait a minute...

It occurs to me that we might be able to fake up a prototype implementation
using the existing Highlight_Spans() functionality.  

Make sure that you spec every field as "highlightable".  Then, at search time,
try something like this:

    my $query = $query_parser->parse($query_string);
    my $compiler = $query->make_compiler(searcher => $searcher);
    # hits() needs the query to run; pass the parsed query explicitly.
    my $hits = $searcher->hits(query => $query);
    while (my $hit = $hits->next) {
        my $doc_vec = $searcher->fetch_doc_vec($hit->get_doc_id);
        my @relevant_fields;
        for my $field (@{ $schema->all_fields }) {
            my $spans = $compiler->highlight_spans(
                searcher => $searcher,
                doc_vec  => $doc_vec,
                field    => $field,
            );
            if (@$spans) {
                push @relevant_fields, $field;
            }
        }
        print "Relevant fields: ";
        print join ", ", @relevant_fields;
        print "\n";
    }

If a field produces highlight spans, it was relevant.  If it doesn't produce
highlight spans, it wasn't relevant.

Does that work?

Marvin Humphrey