Posted to solr-user@lucene.apache.org by Christian Wittern <cw...@gmail.com> on 2008/04/18 10:59:33 UTC
Highlighted field gets truncated
Dear Solr users,
I am having a problem with highlighting that is slightly different
from the one reported by Martijn.
The field that contains the match is rather short, in this case less than
300 characters altogether. Nevertheless, the field is only returned
truncated. Since I also return the field itself, I can clearly see that the
whole content is there.
Here is the query string I am using:
http://localhost:8983/solr/select/?q=%E8%99%9B%E5%A4%9A&fl=variants,content,cdata,id%2Cdoctitle%2Chead%2Ccitekey%2Cseqnum%2Cjuan&hl=true&f.contents.hl.snippets=20&hl.fl=content,variants&wt=xml&tr=solr-tei.xsl
Any hint on how to debug this would be highly appreciated!
All the best,
Christian
--
Christian Wittern
Institute for Research in Humanities, Kyoto University
47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN
Re: Highlighted field gets truncated
Posted by Mike Klaas <mi...@gmail.com>.
On 22-Apr-08, at 6:00 PM, Christian Wittern wrote:
> Mike Klaas wrote:
>> On 19-Apr-08, at 3:02 AM, Christian Wittern wrote:
>>> So it could be that the match is not part of the fragment? This
>>> sounds a bit strange. Is there a way to make sure the fragment
>>> contains the match other than returning the whole field and doing
>>> the fragmenting myself?
>>
> [...]
>> As you can see, only fragments containing a match are returned
>> (note that there are very often multiple matches--you seemed to
>> assume only one).
>>
> Mike, thank you for the clarification. Now I understand what went
> wrong in the example I looked at. I am querying ngram-indexed
> data (Chinese text). A user enters two or three characters and
> expects them to be matched more or less as a substring match. The
> fragment I looked at contained only one of the characters (the
> other was cut off at the end), which is what made me wonder.
> From what you say, even adding quotation marks around the query will
> not prevent this from happening (in this case, it would simply
> obscure the match).
> Are there any plans to improve the fragmentation algorithm? Or
> are there other workarounds?
LUCENE-794 contains an implementation that solves this problem. I
plan to integrate it into Solr eventually, but I don't see myself
having time for this in the short or medium term.
Contributions welcome :)
-Mike
Re: Highlighted field gets truncated
Posted by Christian Wittern <cw...@gmail.com>.
Mike Klaas wrote:
> On 19-Apr-08, at 3:02 AM, Christian Wittern wrote:
>> So it could be that the match is not part of the fragment? This
>> sounds a bit strange. Is there a way to make sure the fragment
>> contains the match other than returning the whole field and doing
>> the fragmenting myself?
>
[...]
> As you can see, only fragments containing a match are returned (note
> that there are very often multiple matches--you seemed to assume only
> one).
>
Mike, thank you for the clarification. Now I understand what went wrong
in the example I looked at. I am querying ngram-indexed data (Chinese
text). A user enters two or three characters and expects them to be
matched more or less as a substring match. The fragment I looked at
contained only one of the characters (the other was cut off at the end),
which is what made me wonder. From what you say, even adding
quotation marks around the query will not prevent this from happening
(in this case, it would simply obscure the match).
Are there any plans to improve the fragmentation algorithm? Or are
there other workarounds?
All the best,
Christian
Re: Highlighted field gets truncated
Posted by Mike Klaas <mi...@gmail.com>.
On 19-Apr-08, at 3:02 AM, Christian Wittern wrote:
> Mike Klaas wrote:
>>
>> Fragments are generated independently from matching (I realize this
>> isn't an ideal algorithm).
>>
> So it could be that the match is not part of the fragment? This
> sounds a bit strange. Is there a way to make sure the fragment
> contains the match other than returning the whole field and doing
> the fragmenting myself?
The highlighting algorithm is as follows:
1. Fragment the whole field into N fragments.
2. Score each fragment based on its keyword matches (more matches
are better; matches on different keywords are preferred over many
matches on the same keyword). Fragments that have no matching
keywords do not have a positive score.
3. Return the top hl.snippets fragments that score > 0.
As you can see, only fragments containing a match are returned (note
that there are very often multiple matches--you seemed to assume only
one).
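
To make the three steps concrete, here is a rough Python sketch. It is not the Lucene/Solr code: it cuts on character offsets rather than tokens, and the function names and the scoring weight of 10 are assumptions chosen only for illustration. It does show how a two-character query on ngram-indexed text can end up with only one character inside the returned fragment.

```python
def fragment(text, size):
    """Step 1: cut the field into fixed-size pieces, ignoring where matches are."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def score(frag, terms):
    """Step 2: reward distinct matching terms more than repeats of one term.

    The weight 10 is an arbitrary illustrative value, not Lucene's scoring.
    """
    counts = [frag.count(t) for t in terms]
    distinct = sum(1 for c in counts if c > 0)
    return distinct * 10 + sum(counts)


def highlight(text, terms, size=100, snippets=1):
    """Step 3: return the top-scoring fragments whose score is positive."""
    scored = [(score(f, terms), i, f) for i, f in enumerate(fragment(text, size))]
    hits = [s for s in scored if s[0] > 0]
    # Best score first; ties keep document order.
    hits.sort(key=lambda s: (-s[0], s[1]))
    return [f for _, _, f in hits[:snippets]]


# The query terms "e" and "f" are adjacent in the text, but the fragment
# boundary falls between them, so the returned fragment holds only "e".
print(highlight("abcdefghij", ["e", "f"], size=5))   # ['abcde']
# A fragment size covering the whole field keeps both characters together.
print(highlight("abcdefghij", ["e", "f"], size=10))  # ['abcdefghij']
```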
-Mike
Re: Highlighted field gets truncated
Posted by Christian Wittern <cw...@gmail.com>.
Mike Klaas wrote:
>
> Fragments are generated independently from matching (I realize this
> isn't an ideal algorithm).
>
So it could be that the match is not part of the fragment? This sounds
a bit strange. Is there a way to make sure the fragment contains the
match other than returning the whole field and doing the fragmenting
myself?
>
> Fragments are returned as an xml list; you can combine them together
> however you like in client code. Solr can merge adjacent fragments
> for you if you wish.
>
I see. That is great.
Thanks, Christian
--
Christian Wittern
Institute for Research in Humanities, Kyoto University
47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN
Re: Highlighted field gets truncated
Posted by Mike Klaas <mi...@gmail.com>.
On 18-Apr-08, at 2:47 AM, Christian Wittern wrote:
> Martijn Dekkers wrote:
>> Did you look at the hl.fragsize parameter? the default for that is
>> 100. try:
>>
>> http://localhost:8983/solr/select/?q=%E8%99%9B%E5%A4%9A&fl=variants,content,cdata,id%2Cdoctitle%2Chead%2Ccitekey%2Cseqnum%2Cjuan&hl=true&f.contents.hl.snippets=20&hl.fl=content,variants&wt=xml&tr=solr-tei.xsl&hl.fragsize=500
>>
>>
> Thanks Martijn, with this URL, I do indeed get the whole match.
> Maybe I am not understanding the meaning of the hl.fragsize
> correctly. I was assuming that it would grab content in similar
> sizes to the left and right of the match with the default fragmenter.
Fragments are generated independently from matching (I realize this
isn't an ideal algorithm).
> Maybe I should try to use the regex fragmenter instead, but this
> seems to be 1.3 only?
> Another related question: Is there a way to insert some limiters
> between fragments so that it is clearly visible that these are
> chunks of text with gaps in between? I understand that hl.simple.pre
> and *.post are for surrounding the match, not the snippet, right?
Fragments are returned as an XML list; you can combine them
however you like in client code. Solr can merge adjacent fragments
for you if you wish.
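
As an illustration of combining the fragments in client code, here is a small Python sketch that joins the snippets of one field with a visible gap marker. The sample response, the document id "doc1", the helper name and the " [...] " separator are all made up for the example; the element layout is what I would expect from wt=xml, so check it against your own output.

```python
import xml.etree.ElementTree as ET

# Hypothetical response fragment, shaped like the highlighting section
# of a wt=xml response (assumed layout, invented content).
SAMPLE = """<response>
  <lst name="highlighting">
    <lst name="doc1">
      <arr name="content">
        <str>first &lt;em&gt;match&lt;/em&gt;</str>
        <str>second &lt;em&gt;match&lt;/em&gt;</str>
      </arr>
    </lst>
  </lst>
</response>"""


def joined_snippets(xml_text, field, sep=" [...] "):
    """Map each document id to its snippets for `field`, joined by `sep`."""
    root = ET.fromstring(xml_text)
    result = {}
    for doc in root.findall("./lst[@name='highlighting']/lst"):
        arr = doc.find("./arr[@name='%s']" % field)
        if arr is not None:
            snippets = [s.text or "" for s in arr.findall("str")]
            result[doc.get("name")] = sep.join(snippets)
    return result


print(joined_snippets(SAMPLE, "content"))
```

For the server-side merging of adjacent fragments, I believe the relevant parameter is hl.mergeContiguous, but please verify that against the highlighting documentation for your Solr version.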
-Mike
Re: Highlighted field gets truncated
Posted by Christian Wittern <cw...@gmail.com>.
Martijn Dekkers wrote:
> Did you look at the hl.fragsize parameter? the default for that is 100. try:
>
> http://localhost:8983/solr/select/?q=%E8%99%9B%E5%A4%9A&fl=variants,content,cdata,id%2Cdoctitle%2Chead%2Ccitekey%2Cseqnum%2Cjuan&hl=true&f.contents.hl.snippets=20&hl.fl=content,variants&wt=xml&tr=solr-tei.xsl&hl.fragsize=500
>
>
Thanks Martijn, with this URL, I do indeed get the whole match. Maybe I
am not understanding the meaning of hl.fragsize correctly. I was
assuming that, with the default fragmenter, it would grab content of
similar size to the left and right of the match. Maybe I should try to
use the regex fragmenter instead, but this seems to be 1.3 only?
Another related question: Is there a way to insert some delimiters
between fragments so that it is clearly visible that these are chunks of
text with gaps in between? I understand that hl.simple.pre and *.post
are for surrounding the match, not the snippet, right?
Christian
--
Christian Wittern
Institute for Research in Humanities, Kyoto University
47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN
Re: Highlighted field gets truncated
Posted by Martijn Dekkers <ma...@dekkers.org.uk>.
Hey Christian,
Did you look at the hl.fragsize parameter? The default for that is 100. Try:
http://localhost:8983/solr/select/?q=%E8%99%9B%E5%A4%9A&fl=variants,content,cdata,id%2Cdoctitle%2Chead%2Ccitekey%2Cseqnum%2Cjuan&hl=true&f.contents.hl.snippets=20&hl.fl=content,variants&wt=xml&tr=solr-tei.xsl&hl.fragsize=500
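
If you build the request in code rather than by hand, something like the following Python sketch keeps the encoding straight. The helper name and the trimmed-down parameter list are assumptions for the example, not the full query above; only the hl.fragsize handling is the point.

```python
from urllib.parse import urlencode


def highlight_url(query, fragsize=None,
                  base="http://localhost:8983/solr/select/"):
    """Build a highlighting request; omit hl.fragsize to use the default."""
    params = [
        ("q", query),
        ("fl", "variants,content"),
        ("hl", "true"),
        ("hl.fl", "content,variants"),
        ("hl.snippets", "20"),
        ("wt", "xml"),
    ]
    if fragsize is not None:
        params.append(("hl.fragsize", str(fragsize)))
    # urlencode percent-encodes the UTF-8 bytes of the query string.
    return base + "?" + urlencode(params)


print(highlight_url("虛多", fragsize=500))
```

Passing fragsize=0 would produce the hl.fragsize=0 variant suggested elsewhere in this thread.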
Cheers,
Martijn
On 18/04/2008, Christian Wittern <cw...@gmail.com> wrote:
> Dear Solr users,
>
> I am having a problem with highlighting that is slightly different
> from the one reported by Martijn.
> The field that contains the match is rather short, in this case less than
> 300 characters altogether. Nevertheless, the field is only returned
> truncated. Since I also return the field itself, I can clearly see that the
> whole content is there.
>
> Here is the query string I am using:
>
> http://localhost:8983/solr/select/?q=%E8%99%9B%E5%A4%9A&fl=variants,content,cdata,id%2Cdoctitle%2Chead%2Ccitekey%2Cseqnum%2Cjuan&hl=true&f.contents.hl.snippets=20&hl.fl=content,variants&wt=xml&tr=solr-tei.xsl
>
> Any hint on how to debug this would be highly appreciated!
>
> All the best,
>
> Christian
>
>
> --
>
> Christian Wittern
> Institute for Research in Humanities, Kyoto University
> 47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN
>
Re: Highlighted field gets truncated
Posted by Thomas Arni <ar...@gmail.com>.
Have a look at
http://wiki.apache.org/solr/HighlightingParameters?highlight=%28highlighting%29#head-dbf0474b5b2c0db08f3a464ff3525225a9c71fbc
and set
hl.fragsize=0
Hope this helps.
Christian Wittern said the following on 18/04/2008 09:59:
> Dear Solr users,
>
> I am having a problem with highlighting that is slightly different
> from the one reported by Martijn.
> The field that contains the match is rather short, in this case less than
> 300 characters altogether. Nevertheless, the field is only returned
> truncated. Since I also return the field itself, I can clearly see that the
> whole content is there.
>
> Here is the query string I am using:
>
> http://localhost:8983/solr/select/?q=%E8%99%9B%E5%A4%9A&fl=variants,content,cdata,id%2Cdoctitle%2Chead%2Ccitekey%2Cseqnum%2Cjuan&hl=true&f.contents.hl.snippets=20&hl.fl=content,variants&wt=xml&tr=solr-tei.xsl
>
> Any hint on how to debug this would be highly appreciated!
>
> All the best,
>
> Christian
>