Posted to solr-user@lucene.apache.org by Christian Wittern <cw...@gmail.com> on 2008/04/18 10:59:33 UTC

Highlighted field gets truncated

Dear Solr users,

I am having a problem with highlighting that is slightly different
from the one reported by Martijn.
The field that contains the match is rather short, in this case less than
300 characters altogether.  Nevertheless, the field is only returned
truncated. Since I also return the field itself, I can clearly see that the
whole content is there.

Here is the query string I am using:

http://localhost:8983/solr/select/?q=%E8%99%9B%E5%A4%9A&fl=variants,content,cdata,id%2Cdoctitle%2Chead%2Ccitekey%2Cseqnum%2Cjuan&hl=true&f.contents.hl.snippets=20&hl.fl=content,variants&wt=xml&tr=solr-tei.xsl

Any hint on how to debug this would be highly appreciated!

All the best,

Christian

-- 

 Christian Wittern
 Institute for Research in Humanities, Kyoto University
 47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN

Re: Highlighted field gets truncated

Posted by Mike Klaas <mi...@gmail.com>.
On 22-Apr-08, at 6:00 PM, Christian Wittern wrote:
> Mike Klaas wrote:
>> On 19-Apr-08, at 3:02 AM, Christian Wittern wrote:
>>> So it could be that the match is not part of the fragment?  This  
>>> sounds a bit strange.  Is there a way to make sure the fragment  
>>> contains the match other than returning the whole field and do the  
>>> fragmenting myself?
>>
> [...]
>> As you can see, only fragments containing a match are returned
>> (note that there are very often multiple matches--you seemed to
>> assume only one).
>>
> Mike, thank you for the clarification.  Now I understand what went
> wrong in the example I looked at.  I am querying ngram-indexed
> data (Chinese text).  A user enters two or three characters and
> expects them to be matched more or less as a substring match.  The
> fragment I looked at contained only one of the characters (the
> other was cut off at the end); this is what made me wonder.
> From what you say, even adding quotation marks around the query will
> not prevent this from happening (in this case, it would simply
> obscure the match).
> Are there any plans to improve the algorithm for fragmentation?  Or
> are there other workarounds?

LUCENE-794 contains an implementation that solves this problem.  My
plan is to integrate this into Solr eventually, but I don't see
myself having time for this in the short or medium term.

Contributions welcome :)

-Mike

Re: Highlighted field gets truncated

Posted by Christian Wittern <cw...@gmail.com>.
Mike Klaas wrote:
> On 19-Apr-08, at 3:02 AM, Christian Wittern wrote:
>> So it could be that the match is not part of the fragment?  This 
>> sounds a bit strange.  Is there a way to make sure the fragment 
>> contains the match other than returning the whole field and do the 
>> fragmenting myself?
>
[...]
> As you can see, only fragments containing a match are returned (note
> that there are very often multiple matches--you seemed to assume only
> one).
>
Mike, thank you for the clarification.  Now I understand what went wrong
in the example I looked at.  I am querying ngram-indexed data (Chinese
text).  A user enters two or three characters and expects them to be
matched more or less as a substring match.  The fragment I looked at
contained only one of the characters (the other was cut off at the end);
this is what made me wonder.  From what you say, even adding
quotation marks around the query will not prevent this from happening
(in this case, it would simply obscure the match).

Are there any plans to improve the algorithm for fragmentation?  Or are
there other workarounds?
 
All the best,

Christian


Re: Highlighted field gets truncated

Posted by Mike Klaas <mi...@gmail.com>.
On 19-Apr-08, at 3:02 AM, Christian Wittern wrote:
> Mike Klaas wrote:
>>
>> Fragments are generated independently from matching (I realize this  
>> isn't an ideal algorithm).
>>
> So it could be that the match is not part of the fragment?  This  
> sounds a bit strange.  Is there a way to make sure the fragment  
> contains the match other than returning the whole field and do the  
> fragmenting myself?

The highlighting algorithm is as follows:
  1. Fragment the whole field into N fragments.
  2. Score each fragment based on the keyword matches (the more matches
     the better; matches on different keywords are preferred to many
     matches on the same keyword).  Fragments that have no matching
     keywords do not have a positive score.
  3. Return the top hl.snippets fragments that score > 0.

As you can see, only fragments containing a match are returned (note
that there are very often multiple matches--you seemed to assume only
one).
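
The three steps above can be sketched in a few lines of Python.  This is
only an illustration of the described behaviour, not Solr's or Lucene's
actual code: the function names, the fixed-size character fragmenting and
the scoring weights are all assumptions made for the example.

```python
def fragment(text, size):
    """Step 1: cut the field into fixed-size pieces, ignoring where matches fall."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def score(frag, terms):
    """Step 2: count matches; distinct terms weigh more than repeats of one term."""
    counts = [frag.count(t) for t in terms]
    distinct = sum(1 for c in counts if c > 0)
    return distinct * 10 + sum(counts)  # hypothetical weighting, not Lucene's


def top_snippets(text, terms, size=100, max_snippets=1):
    """Step 3: keep the best-scoring fragments whose score is positive."""
    scored = [(score(f, terms), f) for f in fragment(text, size)]
    hits = [(s, f) for s, f in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [f for _, f in hits[:max_snippets]]
```

Because step 1 never looks at the query, a fragment boundary can fall in
the middle of a multi-character match.  For example, calling
top_snippets("xxxxxxxxAByyyyyyyy", ["A", "B"], size=9, max_snippets=2)
yields two fragments, one ending in "A" and one starting with "B", which
is the kind of cut-off match described for the ngram case.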

-Mike

Re: Highlighted field gets truncated

Posted by Christian Wittern <cw...@gmail.com>.
Mike Klaas wrote:
>
> Fragments are generated independently from matching (I realize this 
> isn't an ideal algorithm).
>
So it could be that the match is not part of the fragment?  This sounds 
a bit strange.  Is there a way to make sure the fragment contains the 
match other than returning the whole field and do the fragmenting myself?

>
> Fragments are returned as an xml list; you can combine them together 
> however you like in client code.  Solr can merge adjacent fragments 
> for you if you wish.
>
I see.  That is great.

Thanks,  Christian


Re: Highlighted field gets truncated

Posted by Mike Klaas <mi...@gmail.com>.
On 18-Apr-08, at 2:47 AM, Christian Wittern wrote:
> Martijn Dekkers wrote:
>> Did you look at the hl.fragsize parameter? The default for that is
>> 100. Try:
>>
>> http://localhost:8983/solr/select/?q=%E8%99%9B%E5%A4%9A&fl=variants,content,cdata,id%2Cdoctitle%2Chead%2Ccitekey%2Cseqnum%2Cjuan&hl=true&f.contents.hl.snippets=20&hl.fl=content,variants&wt=xml&tr=solr-tei.xsl&hl.fragsize=500
>>
>>
> Thanks Martijn, with this URL, I do indeed get the whole match.
> Maybe I do not understand the meaning of hl.fragsize
> correctly.  I was assuming that it would grab content in similar
> sizes to the left and right of the match with the default fragmenter.

Fragments are generated independently from matching (I realize this  
isn't an ideal algorithm).

> Maybe I should try to use the regex fragmenter instead, but this  
> seems to be 1.3 only?
> Another related question:  Is there a way to insert some delimiters
> between fragments so that it is clearly visible that these are
> chunks of text with gaps in between? I understand that hl.simple.pre
> and *.post are for surrounding the match, not the snippet, right?

Fragments are returned as an xml list; you can combine them together  
however you like in client code.  Solr can merge adjacent fragments  
for you if you wish.
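
As a rough sketch of the client-side option, the following Python joins
the fragments of one field with a visible separator.  The sample response
is hypothetical (made-up document id and text) and only mimics the general
shape of the highlighting section in Solr's XML output; the function name
and the " [...] " separator are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical response, trimmed to the highlighting section only.
SAMPLE = """<response>
  <lst name="highlighting">
    <lst name="doc-1">
      <arr name="content">
        <str>first &lt;em&gt;hit&lt;/em&gt; here</str>
        <str>second &lt;em&gt;hit&lt;/em&gt; there</str>
      </arr>
    </lst>
  </lst>
</response>"""


def join_fragments(xml_text, field, sep=" [...] "):
    """Return {document id: fragments of `field` joined with `sep`}."""
    root = ET.fromstring(xml_text)
    result = {}
    for doc in root.findall("./lst[@name='highlighting']/lst"):
        path = "./arr[@name='%s']/str" % field
        frags = [s.text or "" for s in doc.findall(path)]
        if frags:  # skip documents with no fragments for this field
            result[doc.get("name")] = sep.join(frags)
    return result
```

Calling join_fragments(SAMPLE, "content") gives one string per document
with the separator marking the gaps between snippets.  For the
server-side option, the parameter to look at is presumably
hl.mergeContiguous; check the highlighting parameters page for the
version in use.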

-Mike

Re: Highlighted field gets truncated

Posted by Christian Wittern <cw...@gmail.com>.
Martijn Dekkers wrote:
> Did you look at the hl.fragsize parameter? The default for that is 100. Try:
>
> http://localhost:8983/solr/select/?q=%E8%99%9B%E5%A4%9A&fl=variants,content,cdata,id%2Cdoctitle%2Chead%2Ccitekey%2Cseqnum%2Cjuan&hl=true&f.contents.hl.snippets=20&hl.fl=content,variants&wt=xml&tr=solr-tei.xsl&hl.fragsize=500
>
>   
Thanks Martijn, with this URL, I do indeed get the whole match.  Maybe I
do not understand the meaning of hl.fragsize correctly.  I was
assuming that it would grab content in similar sizes to the left and
right of the match with the default fragmenter.  Maybe I should try to
use the regex fragmenter instead, but this seems to be 1.3 only?

Another related question:  Is there a way to insert some delimiters
between fragments so that it is clearly visible that these are chunks of
text with gaps in between? I understand that hl.simple.pre and *.post
are for surrounding the match, not the snippet, right?

Christian


Re: Highlighted field gets truncated

Posted by Martijn Dekkers <ma...@dekkers.org.uk>.
Hey Christian,

Did you look at the hl.fragsize parameter? The default for that is 100. Try:

http://localhost:8983/solr/select/?q=%E8%99%9B%E5%A4%9A&fl=variants,content,cdata,id%2Cdoctitle%2Chead%2Ccitekey%2Cseqnum%2Cjuan&hl=true&f.contents.hl.snippets=20&hl.fl=content,variants&wt=xml&tr=solr-tei.xsl&hl.fragsize=500
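
For readability, the same request can be assembled from a parameter
table.  This is only a sketch: the dictionary mirrors the URL above, and
the helper name build_url is made up for the example.  Note that the URL
sets the per-field override on "contents" while hl.fl names "content";
it is not clear from the thread whether that was intended.

```python
from urllib.parse import urlencode, unquote

BASE = "http://localhost:8983/solr/select/"

params = {
    "q": unquote("%E8%99%9B%E5%A4%9A"),  # the two-character query, decoded
    "fl": "variants,content,cdata,id,doctitle,head,citekey,seqnum,juan",
    "hl": "true",
    "hl.fl": "content,variants",         # fields to highlight
    "f.contents.hl.snippets": "20",      # per-field override, as in the URL
    "hl.fragsize": "500",                # fragment size in characters
    "wt": "xml",
    "tr": "solr-tei.xsl",
}


def build_url(base, query_params):
    """Percent-encode the parameters (UTF-8) and append them to the base URL."""
    return base + "?" + urlencode(query_params)


url = build_url(BASE, params)
```

urlencode percent-encodes the commas as well, which is equivalent to the
mixed "," and "%2C" spelling in the hand-written URL.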

Cheers,

Martijn

Re: Highlighted field gets truncated

Posted by Thomas Arni <ar...@gmail.com>.
Have a look at
http://wiki.apache.org/solr/HighlightingParameters?highlight=%28highlighting%29#head-dbf0474b5b2c0db08f3a464ff3525225a9c71fbc

and set
hl.fragsize=0

Hope this helps.
