Posted to solr-user@lucene.apache.org by Eric O'Hanlon <el...@columbia.edu> on 2013/08/07 19:51:10 UTC

Some highlighted snippets aren't being returned

Hi Everyone,

I'm facing an issue in which my Solr query returns highlighted snippets for some, but not all, results.  For reference, I'm searching an index that contains web crawls of human-rights-related websites.  I'm running Solr as a webapp under Tomcat, and I've included the query's Solr params from the Tomcat log:

...
webapp=/solr-4.2
path=/select
params={facet=true&sort=score+desc&group.limit=10&spellcheck.q=Unangan&f.mimetype_code.facet.limit=7&hl.simple.pre=<code>&q.alt=*:*&f.organization_type__facet.facet.limit=6&f.language__facet.facet.limit=6&hl=true&f.date_of_capture_yyyy.facet.limit=6&group.field=original_url&hl.simple.post=</code>&facet.field=domain&facet.field=date_of_capture_yyyy&facet.field=mimetype_code&facet.field=geographic_focus__facet&facet.field=organization_based_in__facet&facet.field=organization_type__facet&facet.field=language__facet&facet.field=creator_name__facet&hl.fragsize=600&f.creator_name__facet.facet.limit=6&facet.mincount=1&qf=text^1&hl.fl=contents&hl.fl=title&hl.fl=original_url&wt=ruby&f.geographic_focus__facet.facet.limit=6&defType=edismax&rows=10&f.domain.facet.limit=6&q=Unangan&f.organization_based_in__facet.facet.limit=6&q.op=AND&group=true&hl.usePhraseHighlighter=true} hits=8 status=0 QTime=108
...

For the query above (in short: find all documents that contain the word "unangan" and return facets, highlights, etc.), I get five search results.  Only three of these return highlighted snippets.  Here's the "highlighting" portion of the Solr response (note: printed in Ruby notation because I'm receiving this response in a Rails app):

--------
"highlighting"=>
  {"20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
    {},
   "20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
    {},
   "20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
    {},
   "20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
    {"contents"=>
      ["...actual snippet is returned here..."]},
   "20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
    {"contents"=>
      ["...actual snippet is returned here..."]},
   "20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-uu-no-39-tahun-1999"=>
    {"contents"=>
      ["...actual snippet is returned here..."]},
   "20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-39-tahun-1999?tmpl=component&format=raw"=>
    {"contents"=>
      ["...actual snippet is returned here..."]},
   "20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf"=>
    {}}
--------

The hash above has eight entries (as opposed to five) because I'm also running a grouped query, grouping by a field called "original_url"; the eight matching documents collapse into five groups.
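
The eight-versus-five arithmetic can be checked against the hash above. This is a minimal Ruby sketch, assuming (as the keys suggest, though the post never says so outright) that each document ID has the form "<capture timestamp>/<original_url>". The method names are made up for illustration, and the snippet text is the post's own placeholder.

```ruby
# Keys copied from the "highlighting" hash in the post; the snippet text is
# the post's placeholder, not real Solr output.
# Assumption: each key is "<capture timestamp>/<original_url>".
SNIPPET = { "contents" => ["...actual snippet is returned here..."] }.freeze

HIGHLIGHTING = {
  "20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf" => {},
  "20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf" => {},
  "20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf" => {},
  "20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf" => SNIPPET,
  "20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf" => SNIPPET,
  "20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-uu-no-39-tahun-1999" => SNIPPET,
  "20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-39-tahun-1999?tmpl=component&format=raw" => SNIPPET,
  "20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf" => {}
}.freeze

# Group entries by original URL, splitting on the first "/" only so that the
# slashes inside the URL itself are preserved.
def group_by_original_url(highlighting)
  highlighting.group_by { |doc_id, _fields| doc_id.split("/", 2).last }
end

# Keep only the groups in which at least one capture came back with a snippet.
def groups_with_snippets(groups)
  groups.select { |_url, docs| docs.any? { |_id, fields| !fields.empty? } }
end

groups = group_by_original_url(HIGHLIGHTING)
puts "documents: #{HIGHLIGHTING.size}"
puts "groups: #{groups.size}"
puts "groups with snippets: #{groups_with_snippets(groups).size}"
```

Run against the data above, this reports 8 documents, 5 groups, and 3 groups with snippets, which matches the counts given in the post.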

I've confirmed that the results without highlights DO contain the word "unangan", as expected, and that the term appears in a text field that's indexed and stored and is searched for all text searches.  For example, one of the search results is for a crawl of this document: http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf

And if you view that document on the web, you'll see that it does contain "unangan".

Has anyone seen this before?  And does anyone have any good suggestions for troubleshooting/fixing the problem?

Thanks!

- Eric

Re: Some highlighted snippets aren't being returned

Posted by Bill Bell <bi...@gmail.com>.
Zip up all your configs and send them.

Bill Bell
Sent from mobile


On Sep 8, 2013, at 3:00 PM, "Eric O'Hanlon" <el...@columbia.edu> wrote:

> Hi again Everyone,
> 
> I didn't get any replies to this, so I thought I'd re-send in case anyone missed it and has any thoughts.
> 
> Thanks,
> Eric

Re: Some highlighted snippets aren't being returned

Posted by Eric O'Hanlon <el...@columbia.edu>.
Setting hl.maxAnalyzedChars did it!  I wasn't setting that param, and I'm working with some very long documents.  I also made the hl.fl param formatting change that you suggested, Aloke.
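
For anyone landing on this thread later, here is a rough Ruby sketch of what the highlighting params might look like with both changes applied. The hl.fragsize and hl.usePhraseHighlighter values come from the logged request; the 1,000,000 limit and the method name are assumptions for illustration only, since the thread does not say which value was actually used.

```ruby
require "uri"

# Highlighting params from the logged request, with the two changes described
# in the thread. The default limit below is an illustrative assumption, not
# the value actually chosen.
def fixed_highlight_params(max_analyzed_chars = 1_000_000)
  {
    "hl" => "true",
    "hl.fl" => %w[contents title original_url].join(","),
    "hl.fragsize" => 600,
    "hl.usePhraseHighlighter" => "true",
    "hl.maxAnalyzedChars" => max_analyzed_chars
  }
end

# Encode as a query-string fragment to append to the rest of the request.
puts URI.encode_www_form(fixed_highlight_params)
```

A suitable limit depends on how long the longest stored field is; a larger value means the highlighter examines more text per document.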

Thanks again!

- Eric

On Sep 11, 2013, at 3:10 AM, Eric O'Hanlon <el...@columbia.edu> wrote:

> Thank you, Aloke and Bryan!  I'll give this a try and I'll report back on what happens!
> 
> - Eric
> 
> On Sep 9, 2013, at 2:32 AM, Aloke Ghoshal <al...@gmail.com> wrote:
> 
>> Hi Eric,
>> 
>> As Bryan suggests, you should look at setting hl.fragsize and
>> hl.maxAnalyzedChars appropriately for long documents.
>> 
>> One issue I see with your search request is that, to highlight
>> across three separate fields, you have added each of them as a
>> separate request param:
>> hl.fl=contents&hl.fl=title&hl.fl=original_url
>> 
>> The way to do it
>> (http://wiki.apache.org/solr/HighlightingParameters#hl.fl) is to pass
>> them as a single comma- (or space-) separated value:
>> hl.fl=contents,title,original_url
>> 
>> Regards,
>> Aloke
>> 
>> On 9/9/13, Bryan Loofbourrow <bl...@knowledgemosaic.com> wrote:
>>> Eric,
>>> 
>>> Your example document is quite long. Are you setting hl.maxAnalyzedChars?
>>> If not, the highlighter you appear to be using will not look past
>>> the first 51,200 characters of the document for snippet candidates.
>>> 
>>> http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars
>>> 
>>> -- Bryan


Re: Some highlighted snippets aren't being returned

Posted by Eric O'Hanlon <el...@columbia.edu>.
Thank you, Aloke and Bryan!  I'll give this a try and I'll report back on what happens!

- Eric



Re: Some highlighted snippets aren't being returned

Posted by Aloke Ghoshal <al...@gmail.com>.
Hi Eric,

As Bryan suggests, you should look at setting hl.fragsize and
hl.maxAnalyzedChars appropriately for long documents.

One issue I see with your search request is that, to highlight
across three separate fields, you have added each of them as a
separate request param:
hl.fl=contents&hl.fl=title&hl.fl=original_url

The way to do it
(http://wiki.apache.org/solr/HighlightingParameters#hl.fl) is to pass
them as a single comma- (or space-) separated value:
hl.fl=contents,title,original_url
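
If the request is assembled in code, the rewrite above can be done mechanically. The following is a small Ruby sketch using only the standard library; the method name collapse_param and the shortened query string are hypothetical, and it takes the single-value form recommended above as given rather than verifying how Solr treats repeated hl.fl params.

```ruby
require "uri"

# Collapse repeated occurrences of one query parameter into a single
# comma-separated value, leaving every other parameter untouched.
def collapse_param(query, name)
  pairs = URI.decode_www_form(query)
  values = pairs.select { |key, _value| key == name }.map(&:last)
  return query if values.size < 2 # nothing to merge

  first_seen = false
  merged = pairs.each_with_object([]) do |(key, value), out|
    if key == name
      # Emit the merged value once, at the position of the first occurrence.
      out << [key, values.join(",")] unless first_seen
      first_seen = true
    else
      out << [key, value]
    end
  end
  URI.encode_www_form(merged)
end

# Shortened version of the logged request, for illustration only.
LOGGED = "hl=true&hl.fl=contents&hl.fl=title&hl.fl=original_url&hl.fragsize=600"
puts URI.decode_www_form(collapse_param(LOGGED, "hl.fl")).inspect
```

The merged value takes the position of the first hl.fl in the original string, so the order of the remaining params is unchanged.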

Regards,
Aloke


RE: Some highlighted snippets aren't being returned

Posted by Bryan Loofbourrow <bl...@knowledgemosaic.com>.
Eric,

Your example document is quite long. Are you setting hl.maxAnalyzedChars?
If not, the highlighter you appear to be using will not look past
the first 51,200 characters of the document for snippet candidates.

http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars
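
To see why a long document can match the query and still come back with an empty highlight, here is a rough Ruby sketch of the effect described above. It is not Solr's implementation: it only models the idea that text past the analysis limit is never considered for snippets. The 51,200 figure is the default cited above; the method name and the sample text are made up for illustration.

```ruby
DEFAULT_MAX_ANALYZED_CHARS = 51_200 # default cited in this thread

# True if the term occurs within the first max_chars characters, i.e. inside
# the part of the text that this simplified model treats as analyzed.
def term_in_analyzed_window?(text, term, max_chars = DEFAULT_MAX_ANALYZED_CHARS)
  text[0, max_chars].downcase.include?(term.downcase)
end

# Hypothetical long document: 60,000 characters of filler, then the only
# mention of the search term.
LONG_DOC = (("lorem ipsum " * 5_000) + "the Unangan people").freeze

puts term_in_analyzed_window?(LONG_DOC, "unangan")          # default limit
puts term_in_analyzed_window?(LONG_DOC, "unangan", 100_000) # raised limit
```

With the default limit the term falls outside the window and the check is false; with the limit raised past the document length it is true, which mirrors a document that matches the search but only yields a snippet once hl.maxAnalyzedChars is increased.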

-- Bryan



Re: Some highlighted snippets aren't being returned

Posted by Eric O'Hanlon <el...@columbia.edu>.
Hi again Everyone,

I didn't get any replies to this, so I thought I'd re-send in case anyone missed it and has any thoughts.

Thanks,
Eric
