Posted to solr-user@lucene.apache.org by Andrew May <am...@ingenta.com> on 2006/08/11 23:01:29 UTC

Highlighting problem - mutivalue field

Hi,

I'm afraid I've found another slightly odd thing with Highlighting, in this case in a 
multi-valued field I'm using for author names.

The author names are typically Surname, initials (e.g. May, A.D.), and these are the kind 
of results I'm getting:

authors:Buxton

<?xml version="1.0" encoding="UTF-8"?>
<response>
<responseHeader><status>0</status><QTime>2</QTime></responseHeader>
<result numFound="2" start="0">
  <doc>
   <arr name="authors"><str>Duncan, W.I.</str><str>Buxton, N.W.K.</str></arr>
  </doc>
  <doc>
   <arr name="authors"><str>Buxton, M.W.N.</str><str>Pedley, H.M.</str></arr>
  </doc>
</result>
<lst name="highlighting">
  <lst name="geol/jgs/1995/00000152/00000002/15220251">
   <arr name="authors">
     <str>.&lt;em>Buxton&lt;/em>, N.W.K</str>
   </arr>
  </lst>
  <lst name="geol/jgs/1989/00000146/00000005/14650746">
   <arr name="authors">
     <str>&lt;em>Buxton&lt;/em>, M.W.N</str>
   </arr>
  </lst>
</lst>
</response>

So in the first case, where the second author name was matched, the final period has 
disappeared, and there's a stray period at the start. In the second case where the first 
author name was matched, the final period is also missing, but there's no extra period at 
the start.

This pattern is the same for other author searches, which suggests that it's picking up 
the last character from the previous value and returning that at the start, and losing 
the last character of the matched value.

However, some searches on keywords (also multi-valued) seem to suggest that it's not that 
simple:

keywords:rock (with maxSnippets=100)

<?xml version="1.0" encoding="UTF-8"?>
<response>
<responseHeader><status>0</status><QTime>2</QTime></responseHeader>

<result numFound="18" start="0">
  <doc>
   <arr name="keywords"><str>fracture (rock)</str><str>porosity (rock)</str><str>permeability (rock)</str>
	<str>nuclear magnetic resonance</str></arr>
  </doc>
  <doc>
   <arr name="keywords"><str>United Kingdom</str><str>Carboniferous</str><str>clastie rocks</str>
	<str>coal seams</str><str>sedimentary rocks</str></arr>
  </doc>
</result>
<lst name="highlighting">
  <lst name="geol/pg/2002/00000008/00000003/art00001">
   <arr name="keywords">
	<str>fracture (&lt;em>rock&lt;/em></str>
	<str>)porosity (&lt;em>rock&lt;/em></str>
	<str>)permeability (&lt;em>rock&lt;/em></str>
   </arr>
  </lst>
  <lst name="geol/jgs/1995/00000152/00000005/15250819">
   <arr name="keywords">
	<str>clastie &lt;em>rocks&lt;/em></str>
	<str>sedimentary &lt;em>rocks&lt;/em></str>
   </arr>
  </lst>
</lst>
</response>

The first document shows the same behaviour as the author searches, but in the second 
one, where there's no punctuation, there are no missing or moved characters (as far as I 
can tell, this holds whether the highlight is at the start, the end, or the middle of 
the value).
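
One reading that would fit both outputs above is that each fragment is cut at the end of the value's last token, so any trailing punctuation spills into the start of the next fragment. The Python sketch below is only a toy model of that idea, not Lucene or Solr code: the function name `toy_highlight`, the `\w+` tokenizer, and the prefix match standing in for stemming are all assumptions made for illustration.

```python
import re

# Assumption: a token is a run of word characters. The real analyzer may differ.
TOKEN = re.compile(r"\w+")


def toy_highlight(values, term):
    """Toy model: cut each fragment at the end of the value's last token.

    Text after the last token of a value (e.g. a closing '.' or ')') is not
    emitted with that value; it is carried over to the next fragment.
    """
    text = "".join(values)   # values concatenated with no separator
    fragments = []
    frag_start = 0           # where the next fragment begins in `text`
    offset = 0               # start of the current value in `text`
    for value in values:
        tokens = [(offset + m.start(), offset + m.end())
                  for m in TOKEN.finditer(value)]
        offset += len(value)
        if not tokens:
            continue         # nothing to anchor a fragment on; text carries over
        out, pos, matched = [], frag_start, False
        for start, end in tokens:
            out.append(text[pos:start])          # non-token text before the token
            word = text[start:end]
            # Crude stand-in for stemming: "rocks" matches the term "rock".
            if word.lower().startswith(term.lower()):
                out.append("<em>" + word + "</em>")
                matched = True
            else:
                out.append(word)
            pos = end
        if matched:
            fragments.append("".join(out))
        frag_start = tokens[-1][1]               # stop at the last token, not the value end
    return fragments


print(toy_highlight(["Duncan, W.I.", "Buxton, N.W.K."], "Buxton"))
print(toy_highlight(["fracture (rock)", "porosity (rock)"], "rock"))
print(toy_highlight(["clastie rocks", "sedimentary rocks"], "rock"))
```

Under these assumptions the model gives a stray leading period and a missing final period for the author case, a stray leading parenthesis for the keyword case, and clean fragments when values end in a token. The sketch emits literal `<em>` tags; the escaping seen in the XML response is not modelled.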

Any thoughts? Let me know if I should open a JIRA issue.

Thanks,

Andrew


Re: Highlighting problem - mutivalue field

Posted by Yonik Seeley <yo...@apache.org>.
On 9/7/06, Mike Klaas <mi...@gmail.com> wrote:
> Might it be a good time to upgrade our lucene revision?

+1, the current version should be stable.  I usually grab a
lucene-nightly build for the refresh.

-Yonik

Re: Highlighting problem - mutivalue field

Posted by Mike Klaas <mi...@gmail.com>.
On 8/16/06, Mike Klaas <mi...@gmail.com> wrote:

> > http://issues.apache.org/jira/browse/LUCENE-645?page=comments#action_12428521,
> > so the next lucene revision sync should resolve this issue.
>
> Speaking of sync'ing with Lucene, this issue promises to provide a
> rather significant indexing speed boost which might be worth waiting
> for: http://issues.apache.org/jira/browse/LUCENE-388?page=all

LUCENE-645 (highlighter bugfix) and LUCENE-388 have both been resolved
in lucene trunk.  Also LUCENE-629 (which provides a sometimes huge
indexing performance boost for compressed fields) has recently been
added.

Might it be a good time to upgrade our lucene revision?

-Mike

Re: Highlighting problem - mutivalue field

Posted by Mike Klaas <mi...@gmail.com>.
[moved to solr-dev]

On 8/16/06, Mike Klaas <mi...@gmail.com> wrote:

> A fix has been committed to lucene:
> http://issues.apache.org/jira/browse/LUCENE-645?page=comments#action_12428521,
> so the next lucene revision sync should resolve this issue.

Speaking of sync'ing with Lucene, this issue promises to provide a
rather significant indexing speed boost which might be worth waiting
for: http://issues.apache.org/jira/browse/LUCENE-388?page=all

regards,
-Mike

Re: Highlighting problem - mutivalue field

Posted by Mike Klaas <mi...@gmail.com>.
On 8/11/06, Mike Klaas <mi...@gmail.com> wrote:

> Thanks for the report.  This is a known Lucene Highlighter issue; see
> http://issues.apache.org/jira/browse/LUCENE-645.
>
> The issue contains a patch which you may want to apply to your local
> code, though there are some cases which could cause relatively severe
> problems (namely, with very large fields, which you may not care
> about).

A fix has been committed to lucene:
http://issues.apache.org/jira/browse/LUCENE-645?page=comments#action_12428521,
so the next lucene revision sync should resolve this issue.

-Mike

Re: Highlighting problem - mutivalue field

Posted by Mike Klaas <mi...@gmail.com>.
On 8/11/06, Andrew May <am...@ingenta.com> wrote:
> Hi,
>
> I'm afraid I've found another slightly odd thing with Highlighting, in this case in a
> multi-valued field I'm using for author names.

Thanks for the report.  This is a known Lucene Highlighter issue; see
http://issues.apache.org/jira/browse/LUCENE-645.

The issue contains a patch which you may want to apply to your local
code, though there are some cases which could cause relatively severe
problems (namely, with very large fields, which you may not care
about).

regards,
-Mike