Posted to solr-dev@lucene.apache.org by Chris Harris <ry...@gmail.com> on 2009/01/21 01:47:07 UTC

Highlighting: Analyzing query using analyzers for hl.fl fields

Currently during highlighting, the query string is analyzed by the
analyzer returned by IndexSchema.getQueryAnalyzer(). (If you step through
the code, you'll see that the Query object representing the
analyzed-and-parsed query string is generated before SolrHighlighter's
key doHighlighting() method gets called.)

Two things to emphasize here:

 * Query analysis takes place independent of which field is being
highlighted. (In other words, the query analyzer used does not vary
depending on which hl.fl is currently under consideration.)
 * Under the hood, this analyzer delegates to a separate sub-Analyzer for each
field referenced in the query itself. (For example, if you have the
query "body_text:smith AND num:5", then "smith" will perhaps be analyzed
using an analyzer with stopword analysis, stemming, etc., while "5" will
be analyzed with something simpler, more appropriate for a numeric-only
field.)

Or, to summarize: Query analysis during highlighting is a function of
the fields being *searched*, and *not* a function of the fields being
*highlighted*.
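
A minimal sketch of the per-field delegation described above. It does not
use the Solr or Lucene APIs; the analyzers, the FIELD_ANALYZERS table, and
the analyze_query_clauses() helper are all hypothetical stand-ins. The point
is only that the hl.fl value never enters into the analysis.

```python
def simple_text_analyzer(text):
    """Stand-in for a text analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def keyword_analyzer(text):
    """Stand-in for a simpler analyzer that keeps the input as one token."""
    return [text]


# Hypothetical schema: field name -> query analyzer for that field.
FIELD_ANALYZERS = {
    "body_text": simple_text_analyzer,
    "num": keyword_analyzer,
}


def analyze_query_clauses(clauses, hl_fl):
    """Analyze each (field, text) clause with the analyzer of the field
    it searches. hl_fl is accepted but deliberately unused, mirroring
    the behavior described in the post."""
    return [(field, FIELD_ANALYZERS[field](text)) for field, text in clauses]


clauses = [("body_text", "Smith"), ("num", "5")]
# The result is the same whichever field is being highlighted.
print(analyze_query_clauses(clauses, hl_fl="body_text"))
print(analyze_query_clauses(clauses, hl_fl="num"))
```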

It seems to me that this behavior might be backwards. That is, what
we'd really want is for query analysis during highlighting to be a
function of the fields being highlighted (i.e. the hl.fl params), and
*not* of the fields being mentioned in the query.

Let me try to sketch the use case that leads me to think this:

I have an index with two fields:

  body (default field; word bigram analyzer)
  kwic (text is copied here from the body field via copyField; non-bigram analyzer)

(By "word bigram analyzer" I mean one that might analyze the input
"once upon a time" into the token stream "once", "once upon", "upon",
"upon a", "a", "a time".)
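
A rough sketch of such a word bigram analyzer, assuming plain whitespace
tokenization (the real analyzer chain is not shown in this post, and
word_bigram_analyzer() is a hypothetical name). Note that this sketch also
emits the trailing unigram "time", which the list above leaves out.

```python
def word_bigram_analyzer(text):
    """Emit each word followed by the bigram it starts, if there is one."""
    words = text.split()
    tokens = []
    for i, word in enumerate(words):
        tokens.append(word)
        if i + 1 < len(words):
            tokens.append(word + " " + words[i + 1])
    return tokens


print(word_bigram_analyzer("once upon a time"))
# ['once', 'once upon', 'upon', 'upon a', 'a', 'a time', 'time']
```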

Let's say I want to search for

  "audit trail"

(with quotes), and use hl.fl=kwic. If I use the current highlight
mechanism, then it will be the *body* field's bigram-generating
analyzer that will be used to construct the Query object used in
highlighting. The resulting object, in my case, is something like this:

  TermQuery: "audit trail"

Note that my kwic field was not analyzed with word bigrams, though,
so although it contains "audit" and "trail" as adjacent tokens, it
does not contain the composite token "audit trail". As such, when
this TermQuery is used for highlighting, no snippets will be
generated. (This no-snippets situation probably depends on a few
other details of my situation that I haven't mentioned. But I'm trying
to avoid drowning people in details here.)
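
To make the mismatch concrete, here is a small sketch that models the kwic
field as a list of single-word tokens and a TermQuery as a lookup of one
exact token. The sample sentence, plain_analyzer(), and term_query_matches()
are hypothetical and only illustrate the idea; real highlighting involves
more than this.

```python
def plain_analyzer(text):
    """Stand-in for the non-bigram analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def term_query_matches(field_tokens, term):
    """A term query matches only if the exact token is present."""
    return term in field_tokens


kwic_tokens = plain_analyzer("The audit trail was incomplete")

# The composite token produced for the body field is not in kwic...
print(term_query_matches(kwic_tokens, "audit trail"))  # False
# ...even though both of its words are.
print(term_query_matches(kwic_tokens, "audit"))  # True
print(term_query_matches(kwic_tokens, "trail"))  # True
```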

In contrast, let's suppose that, when doing my analysis for query
highlighting, I *ignore* which particular fields are being *searched*,
and instead use the query analyzer for the hl.fl field being requested. In
this case my hl.fl=kwic, and so the non-bigram analyzer will be used,
and so my Query object for highlighting will be something like this:

  PhraseQuery: audit, trail

Unlike the earlier TermQuery, this PhraseQuery works fine for
highlighting with the kwic field, and generates nice snippets.
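
The proposed behavior can be sketched as choosing the query analyzer by the
hl.fl field instead of by the searched field. Everything below is
hypothetical: QUERY_ANALYZERS, analyze_for_highlighting(), and
phrase_matches() are illustration names, and body_query_analyzer() is only a
stand-in that reproduces the single composite token reported earlier, not
the actual bigram analyzer.

```python
def plain_analyzer(text):
    """Stand-in for the kwic field's non-bigram analyzer."""
    return text.lower().split()


def body_query_analyzer(text):
    """Stand-in that mimics the reported result for the body field:
    the whole quoted phrase becomes one composite token."""
    return [text.lower()]


# Hypothetical schema: field name -> query analyzer for that field.
QUERY_ANALYZERS = {"body": body_query_analyzer, "kwic": plain_analyzer}


def analyze_for_highlighting(query_text, hl_fl):
    """Analyze the query text with the analyzer of the highlighted field."""
    return QUERY_ANALYZERS[hl_fl](query_text)


def phrase_matches(field_tokens, phrase_tokens):
    """True if phrase_tokens occur as adjacent tokens in field_tokens."""
    n = len(phrase_tokens)
    return any(
        field_tokens[i:i + n] == phrase_tokens
        for i in range(len(field_tokens) - n + 1)
    )


kwic_tokens = plain_analyzer("The audit trail was incomplete")

# Analyzed as for the body field: one composite token, no match in kwic.
print(phrase_matches(kwic_tokens, analyze_for_highlighting("audit trail", "body")))  # False
# Analyzed as for the kwic field: two adjacent tokens, which do match.
print(phrase_matches(kwic_tokens, analyze_for_highlighting("audit trail", "kwic")))  # True
```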

Does this example make any sense? It would probably be more helpful
to provide a test case, but I'll have to figure out how to make
one that would provide a compelling use case here but that will also run
without requiring you to download and apply patches from JIRA.

I've thrown together a patch that makes this highlighting analysis
change, and it doesn't seem to break the test suite in any major ways.
I may put it up on JIRA, but it's kind of a hack. What's more, *how* to
make the change in behavior I'm talking about is sort of a separate
question from whether it's wildly off course even in theory.

What do you think?

Chris