Posted to solr-user@lucene.apache.org by Edward Garrett <he...@gmail.com> on 2006/12/11 15:53:35 UTC

highlighting phrasal hits

hello,

i'm doing phrasal searches, and am not happy with how highlighting is done
by default.

if i search for something, like "w1 w2 w3", then correctly, only fields that
match perfectly will be found. however, when i specify highlighting with
hl=true&hl.fl=myfield, then two things don't work according to (my)
expectations:

1) "w1 w2 w3" is not highlighted as a whole, but rather the pieces are
highlighted. e.g. <em>w1</em> <em>w2</em> <em>w3</em>. really, the whole
thing should be contained within a single <em> element.

2) relatedly, and presumably for the same reason, all instances of "w1",
"w2" and "w3" in myfield are highlighted, even when they don't occur
together.
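
For anyone wanting to reproduce the request described above, here is a minimal sketch that builds the query string. Only hl=true and hl.fl=myfield come from this message; the base URL, the build_highlight_url helper, and the field-qualified form of the phrase query are assumptions made for illustration.

```python
from urllib.parse import urlencode


def build_highlight_url(base_url, phrase, field):
    """Build a select URL for a quoted phrase query with highlighting on one field.

    Hypothetical helper: base_url and the field:"phrase" query form are
    assumptions; hl and hl.fl are the parameters named in the message.
    """
    params = {
        "q": f'{field}:"{phrase}"',  # quoted, so it is parsed as a phrase query
        "hl": "true",                # turn highlighting on
        "hl.fl": field,              # field to highlight
    }
    return f"{base_url}?{urlencode(params)}"


# Assumed local endpoint; adjust to your own installation.
url = build_highlight_url("http://localhost:8983/solr/select", "w1 w2 w3", "myfield")
print(url)
```

The phrase is URL-encoded (quotes become %22, spaces become +), so the quoting that makes it a phrase query survives the trip to the server.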

i can't see any possible reason for things working this way, but perhaps
SOLR is just following lucene here.

any thoughts appreciated,
edward

p.s. haven't actually tested the above against indexed english data, so it's
possible that it's an artifact of the data and analysis procedures i am
using.
-- 
Edward Garrett

Visiting Fellow (2006-07)
Endangered Languages Academic Programme
School of Oriental and African Studies
London, UK
0207 898 4536

Assistant Professor, Linguistics Program
Eastern Michigan University
612 Pray-Harrold Building
Ypsilanti, MI, USA

Re: highlighting phrasal hits

Posted by Mike Klaas <mi...@gmail.com>.
On 12/11/06, Edward Garrett <he...@gmail.com> wrote:
> hello,
>
> i'm doing phrasal searches, and am not happy with how highlighting is done
> by default.
>
> if i search for something, like "w1 w2 w3", then correctly, only fields that
> match perfectly will be found. however, when i specify highlighting with
> hl=true&hl.fl=myfield, then two things don't work according to (my)
> expectations:
>
> 1) "w1 w2 w3" is not highlighted as a whole, but rather the pieces are
> highlighted. e.g. <em>w1</em> <em>w2</em> <em>w3</em>. really, the whole
> thing should be contained within a single <em> element.
>
> 2) relatedly, and presumably for the same reason, all instances of "w1",
> "w2" and "w3" in myfield are highlighted, even when they don't occur
> together.
>
> i can't see any possible reason for things working this way, but perhaps
> SOLR is just following lucene here.

Solr is using Lucene's built-in highlighter, which has the
deficiencies you mention.  There have been improved highlighting
approaches proposed; see
http://issues.apache.org/jira/browse/LUCENE-663 and
http://issues.apache.org/jira/browse/LUCENE-644.
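
To make the difference concrete, here is a rough sketch in plain Python contrasting per-term highlighting (the behaviour reported above) with phrase-aware highlighting (what was expected). This is not Lucene or Solr code and does not reflect how either JIRA issue approaches the problem; highlight_terms and highlight_phrase are hypothetical names, and simple regex matching on whitespace-separated words stands in for real analysis.

```python
import re


def highlight_terms(text, terms):
    """Wrap every occurrence of each query term on its own,
    whether or not the terms occur together."""
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(r"<em>\1</em>", text)


def highlight_phrase(text, terms):
    """Wrap only contiguous, in-order occurrences of the whole phrase
    in a single <em> element."""
    pattern = re.compile(r"\b" + r"\s+".join(re.escape(t) for t in terms) + r"\b")
    return pattern.sub(lambda m: "<em>" + m.group(0) + "</em>", text)


text = "w2 alone, then w1 w2 w3 together"
terms = ["w1", "w2", "w3"]

print(highlight_terms(text, terms))
# <em>w2</em> alone, then <em>w1</em> <em>w2</em> <em>w3</em> together

print(highlight_phrase(text, terms))
# w2 alone, then <em>w1 w2 w3</em> together
```

The first function shows both complaints at once: the phrase is split into three <em> elements, and the stray "w2" is highlighted even though it is not part of the phrase.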

Improving Solr's highlighting is something I am quite interested in
personally.  Unfortunately, this is an extremely busy time for me at
work, and I doubt that I'll have time to work on this in the near
future.

-Mike