Posted to solr-user@lucene.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2015/12/04 16:51:49 UTC

Highlighting large documents

Hi,

I'm using Solr 5.3.0

In large documents, a highlight query sometimes returns results that do not
contain any highlighted terms. The documents do contain matches, but the
matches are located further back in the documents.

I have tried increasing the value of hl.maxAnalyzedChars, as the default
value is 51200 and I have documents that are much larger than 51200
characters. This works, but when I increase the value, the performance of
the search and highlighting drops: the response time can go from less than
0.5 seconds to more than 10 seconds.

Is increasing hl.maxAnalyzedChars the best approach, or are there other
ways to achieve the same result without affecting performance as much?

Regards,
Edwin

Re: Highlighting large documents

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi all,

Thank you for all the information.

I have set the parameter to <str name="hl.maxAnalyzedChars">-1</str>, and
highlighting is working fine now.

Regards,
Edwin


On 14 December 2015 at 18:03, Jens Brandt <br...@docoloc.de> wrote:

> Hi Edwin,
>
> You are limiting the portion of the document that is analyzed for
> highlighting in your solrconfig.xml with
>
>  <str name="hl.maxAnalyzedChars">1000000</str>
>
> Thus, snippets are only produced correctly if the query terms are found in
> the first 1000000 characters of the document.
>
> If you set this parameter to
>
>  <str name="hl.maxAnalyzedChars">-1</str>
>
> the original highlighter uses the whole document to find the snippet.
>
> I hope that helps
>   Jens

Re: Highlighting large documents

Posted by Jens Brandt <br...@docoloc.de>.
Hi Edwin,

You are limiting the portion of the document that is analyzed for highlighting in your solrconfig.xml with

 <str name="hl.maxAnalyzedChars">1000000</str>

Thus, snippets are only produced correctly if the query terms are found in the first 1000000 characters of the document.

If you set this parameter to

 <str name="hl.maxAnalyzedChars">-1</str>

the original highlighter uses the whole document to find the snippet.
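
To make the effect of the limit concrete, here is a minimal Python sketch. It is a toy model only, not Solr's actual highlighter code: the function name find_highlights and the sample document are made up for illustration. It just shows why a match that sits beyond the analyzed prefix produces no snippet, and why a negative limit finds it.

```python
def find_highlights(text, term, max_analyzed_chars=51200):
    """Toy model of a character-limited highlighter (NOT Solr's real code).

    Returns the start offsets of `term` within the analyzed part of `text`.
    A negative limit means "analyze the whole text", mirroring the -1 setting.
    """
    limit = len(text) if max_analyzed_chars < 0 else max_analyzed_chars
    analyzed = text[:limit]  # everything past the limit is never looked at
    offsets = []
    start = analyzed.find(term)
    while start != -1:
        offsets.append(start)
        start = analyzed.find(term, start + 1)
    return offsets


# Hypothetical document whose only match lies past the default limit of 51200.
document = "x" * 60000 + " solr " + "y" * 100

print(find_highlights(document, "solr"))                         # []
print(find_highlights(document, "solr", max_analyzed_chars=-1))  # [60001]
```

The cost side is visible in the same model: with a negative limit the whole text has to be scanned for every document in the result list, which is consistent with the slowdown reported at the start of this thread.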

I hope that helps
  Jens




Re: Highlighting large documents

Posted by Scott Stults <ss...@opensourceconnections.com>.
There are two things going on that you should be aware of. The first is
that Solr highlighting is mainly concerned with putting a representative
snippet in a results listing. There are a couple of configuration changes
you need to make if you want to highlight a whole document, such as setting
the fragListBuilder to SingleFragListBuilder and adjusting the
maxAnalyzedChars setting you've already mentioned:

https://wiki.apache.org/solr/HighlightingParameters#hl.fragsize

Because full-document highlighting is so different from highlighting
snippets in a result list, you'll want to configure two different
highlighters: one for snippets and one for the full document.
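
As a rough illustration of the two-configuration idea, the Python sketch below builds two sets of request parameters: a cheap one for the result list and an expensive one for a single full document. Several things here are assumptions and not from this thread: the base URL (host, port, core name), the helper names, the choice of the content field, and the use of hl.fragsize=0 with the original highlighter in place of SingleFragListBuilder. It relies on request parameters overriding a handler's "defaults" section, so the unlimited setting is only paid for when one document is opened.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and core name are assumptions.
SOLR_HIGHLIGHT = "http://localhost:8983/solr/mycore/highlight"


def snippet_params(query):
    """Fast settings for a result list: short fragments, default char limit."""
    return {
        "q": query,
        "hl": "on",
        "hl.fl": "content",
        "hl.fragsize": 200,
        "hl.maxAnalyzedChars": 51200,
    }


def full_document_params(query, doc_id):
    """Slow but complete settings, restricted to one document by id."""
    return {
        "q": query,
        "fq": f"id:{doc_id}",        # assumes a simple id with no special characters
        "rows": 1,
        "hl": "on",
        "hl.fl": "content",
        "hl.fragsize": 0,            # 0 = use the whole field value as one fragment
        "hl.maxAnalyzedChars": -1,   # analyze the whole document
    }


def build_url(params):
    """Return the request URL for the given parameter dict."""
    return SOLR_HIGHLIGHT + "?" + urlencode(params)


print(build_url(snippet_params("solr")))
print(build_url(full_document_params("solr", "doc42")))
```

The result-list request keeps the bounded limit, so it stays fast but may miss late matches; the second request is only issued when a user opens a specific document.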

The other thing you need to know is that highlighting performance is an
active area of development. Right now the top docs in the current result
list are calculated completely separately from the snippets (highlighting),
which can lead to problems when the most relevant snippets are later in the
document.

What most people do is compromise by making the result list fast but
inaccurate, and having the full-document highlight be accurate but slower.


Hope that helps,
-Scott


On Fri, Dec 4, 2015 at 11:12 AM, Andrea Gazzarini <a....@gmail.com>
wrote:

> No, sorry: the project has not started yet, so I haven't run into your
> issue, but I'll be following this thread closely.
>
> Best,
> Andrea



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com

Re: Highlighting large documents

Posted by Andrea Gazzarini <a....@gmail.com>.
No, sorry: the project has not started yet, so I haven't run into your
issue, but I'll be following this thread closely.

Best,
Andrea

2015-12-04 17:04 GMT+01:00 Zheng Lin Edwin Yeo <ed...@gmail.com>:

> Hi Andrea,
>
> I'm using the original highlighter.
>
> Below is my configuration for the highlighter in solrconfig.xml
>
>   <requestHandler name="/highlight" class="solr.SearchHandler">
>     <lst name="defaults">
>       <str name="echoParams">explicit</str>
>       <int name="rows">10</int>
>       <str name="wt">json</str>
>       <str name="indent">true</str>
>       <str name="df">text</str>
>       <str name="fl">id, title, content_type, last_modified, url, score </str>
>
>       <str name="hl">on</str>
>       <str name="hl.fl">id, title, content, author </str>
>       <str name="hl.highlightMultiTerm">true</str>
>       <str name="hl.preserveMulti">true</str>
>       <str name="hl.encoder">html</str>
>       <str name="hl.fragsize">200</str>
>       <str name="hl.maxAnalyzedChars">1000000</str>
>
>       <str name="group">true</str>
>       <str name="group.field">signature</str>
>       <str name="group.main">true</str>
>       <str name="group.cache.percent">100</str>
>     </lst>
>   </requestHandler>
>
>
> Have you managed to solve the problem?
>
> Regards,
> Edwin

Re: Highlighting large documents

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Andrea,

I'm using the original highlighter.

Below is my configuration for the highlighter in solrconfig.xml

  <requestHandler name="/highlight" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="wt">json</str>
      <str name="indent">true</str>
      <str name="df">text</str>
      <str name="fl">id, title, content_type, last_modified, url, score </str>

      <str name="hl">on</str>
      <str name="hl.fl">id, title, content, author </str>
      <str name="hl.highlightMultiTerm">true</str>
      <str name="hl.preserveMulti">true</str>
      <str name="hl.encoder">html</str>
      <str name="hl.fragsize">200</str>
      <str name="hl.maxAnalyzedChars">1000000</str>

      <str name="group">true</str>
      <str name="group.field">signature</str>
      <str name="group.main">true</str>
      <str name="group.cache.percent">100</str>
    </lst>
  </requestHandler>


Have you managed to solve the problem?

Regards,
Edwin


On 4 December 2015 at 23:54, Andrea Gazzarini <a....@gmail.com> wrote:

> Hi Zheng,
> Just out of curiosity, because I will shortly have to deal with a similar
> scenario (Solr 5.3.1 + large documents + highlighting):
> which highlighter are you using?
>
> Andrea

Re: Highlighting large documents

Posted by Andrea Gazzarini <a....@gmail.com>.
Hi Zheng,
Just out of curiosity, because I will shortly have to deal with a similar
scenario (Solr 5.3.1 + large documents + highlighting):
which highlighter are you using?

Andrea
