Posted to solr-user@lucene.apache.org by ashokc <as...@qualcomm.com> on 2009/02/05 00:22:43 UTC

Re: Highlighting Oddities

I have seen some of these oddities that Chris is referring to. In my case,
terms that are NOT in the query get highlighted. For example searching for
'Intel' highlights 'Microsoft Corp' as well. I do not have them as synonyms
either. Do these filter factories add some extra intelligence to the index
in that if you search for 'Samsung' even 'LG' is considered a highlightable
term?

I believe this was not the case when I was working with an earlier
development version (from Nov or early Dec). Right now I am using
solr-2008-12-29.war.

- ashok
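
One way to confirm this kind of stray highlighting is to scan the returned snippets for highlighted spans that are not the query term. The following is only a rough sketch: it assumes the highlighter wraps matches in <em>...</em> (as in the snippets quoted below), the names EM_SPAN and check_snippet are hypothetical, and the exact-match comparison will raise false alarms if the field's analyzer stems or otherwise rewrites terms.

```python
import re

# Assumption: highlights are wrapped in <em>...</em>, as in the quoted snippets.
EM_SPAN = re.compile(r"<em>(.*?)</em>", re.DOTALL)


def check_snippet(snippet, term):
    """Return a list of human-readable problems found in one snippet.

    Hypothetical helper: flags snippets with no highlight at all, and
    highlighted spans whose text is anything other than the query term.
    """
    problems = []
    spans = EM_SPAN.findall(snippet)
    if not spans:
        problems.append("no highlighted term")
    for span in spans:
        # Case-insensitive exact comparison; stemming is not accounted for.
        if span.strip().lower() != term.lower():
            problems.append("unexpected highlight: " + repr(span))
    return problems
```

Running this over every snippet in a response, once per query term, gives a quick count of how often a non-query term (or a run of several words) ends up highlighted.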



ryguasu wrote:
> 
> I'm testing out the default (gap) fragmenter with some simple,
> single-word queries on a patched 1.3.0 release populated with some
> real-world data. (I think the primary quirk in my setup is that I'm
> using ShingleFilterFactory to put word bigrams (aka shingles) into my
> index. I was worried that this might mess up highlighting, but
> highlighting is *mostly* working.) There are some oddities here, and
> I'm wondering if people have any suggestions for debugging my setup
> and/or trying to make a good, reproducible test case.
> 
> 1. The main weird thing is that, the vast majority of the time, the
> highlighted term is the last term in the fragment. For example, if I
> search for "cat", then almost all my fragments look like this:
> 
> fragment 1: "to the *cat*"
> fragment 2: "with the *cat*"
> fragment 3: "it's what the *cat*"
> fragment 4: "Once upon a time the *cat*"
> 
> (My actual fragments are longer. The key to note is that all of these
> examples end in "cat".)
> 
> Sometimes "cat" will appear somewhere other than the last position,
> but this is rare. My expectation, in contrast, is that "cat" would
> tend to be more or less evenly distributed throughout fragment
> positions.
> 
> Note: I tried to reproduce this on 1.3.0 with my patches applied but
> using the example dataset/schema from the Solr source tree rather than
> my own dataset/schema. With the example dataset this didn't seem to be
> an issue.
> 
> I've experienced three other highlighting issues, which may or may not
> be related:
> 
> 2. Sometimes, if a term appears multiple times in a fragment, not just
> the term but all the words in between the two appearances will get
> highlighted too. For example, I searched for "fear", and got this as
> one of the snippets:
> 
>     SETTLEMENT AGREEMENT This Agreement ("the Agreement") is entered
>     into this 18th day of August, 2008, by and between Cape <em>Fear
>     Bank Corporation, a North Carolina corporation (the "Company"),
>     and Cape Fear</em>
> 
> In contrast, I would have expected
> 
>     SETTLEMENT AGREEMENT This Agreement ("the Agreement") is entered
>     into this 18th day of August, 2008, by and between Cape
>     <em>Fear</em> Bank Corporation, a North Carolina corporation (the
>     "Company"), and Cape <em>Fear</em>
> 
> 3. My install seems to have a curiously liberal interpretation of
> hl.fragsize. Now if I put hl.fragsize=0, then things are as expected,
> i.e. it highlights the whole field. And it also seems more or less
> true (as it should) that as I increase hl.fragsize, the fragments get
> longer. However, I was surprised to see that when I put hl.fragsize=1
> or hl.fragsize=5, I can get fragments as long as this one:
> 
>     addition, we believe the wireless feature for our controller will
>     facilitate exceptional customer services and response time." About
>     GpsLatitude GpsLatitude, a Montreal-based company, is a provider
>     of security solutions and tracking for mobile assets. It is also a
>     developer of advanced " Videlocalisation" , a cost-effective,
>     integrated mobile digital <em>video</em>
> 
> That seems shockingly long for something of size "five".
> 
> 4. Very rarely I'll get a fragment that doesn't actually contain any
> of the search terms. For example, maybe I'll search for "cat", and
> I'll get back "three ounces of milk" as a snippet. I need to explore
> this more, though the last time this happened I opened the document
> and found that the word "cat" did appear near "three ounces of milk"
> in the document text; so maybe the document did contain "three ounces
> of milk for the cat".
> 
> Obviously I'm not describing my setup in much detail. Let me know what
> you think would be helpful to know more about.
> 
> Thanks,
> Chris
> 
> 
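
For anyone trying to turn issue 3 into a reproducible test case, a small script that issues the same query at several hl.fragsize values and measures the returned fragments may help. This is a minimal sketch, not anything from Chris's setup: the base URL and field name are placeholders, and the helper names highlight_query_url and visible_length are hypothetical. The request parameters themselves (hl, hl.fl, hl.fragsize, hl.snippets) are standard Solr highlighting parameters, and hl.fragsize is expressed in characters.

```python
import re
from urllib.parse import urlencode


def highlight_query_url(base_url, query, field, fragsize, snippets=3):
    """Build a select URL with highlighting turned on.

    base_url and field are placeholders for your own install and schema.
    """
    params = {
        "q": query,
        "hl": "true",
        "hl.fl": field,            # field to highlight
        "hl.fragsize": fragsize,   # requested fragment size, in characters
        "hl.snippets": snippets,   # max fragments returned per field
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


def visible_length(fragment):
    """Length of a fragment in characters, ignoring the <em> markup."""
    return len(re.sub(r"</?em>", "", fragment))
```

Fetching the URL for, say, fragsize values of 1, 5, 50 and 100 and comparing visible_length of each returned fragment against the requested size would show whether the fragments track hl.fragsize at all on a given dataset.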

-- 
View this message in context: http://www.nabble.com/Highlighting-Oddities-tp20351015p21841992.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Highlighting Oddities

Posted by ashokc <as...@qualcomm.com>.
This problem went away when I updated to the latest nightly release
(2009-02-04).

- ashok


-- 
View this message in context: http://www.nabble.com/Highlighting-Oddities-tp20351015p21843092.html
Sent from the Solr - User mailing list archive at Nabble.com.