Posted to solr-user@lucene.apache.org by Shyam Bhaskaran <Sh...@synopsys.com> on 2011/11/24 16:28:28 UTC

highlighting performance poor with *.tar, *.gz files

Hi,

Highlighting of search results is taking too long, especially for archive files such as *.gz, *.tar and *.zip.
What could be the reason? Is it because these files are unzipped and then highlighted from the index at display time?
Or does it depend on the size of the file? Is there any way to improve search and highlighting performance for these kinds of archive files (*.tar, *.zip, etc.)?

Please let me know if there is a workaround for improving highlighting and search performance for these kinds of files.

-Shyam

RE: highlighting performance poor with *.tar, *.gz files

Posted by Shyam Bhaskaran <Sh...@synopsys.com>.
Hi Erick,

Thanks for the response.

I am already using termVectors with offsets & positions enabled as shown below.


<field name="attachment_bodies"  type="text_rev"    indexed="true"  stored="true"  multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />


I am indexing FAQ content. Some of these FAQs have attachments linked to them, such as PDF, DOC, *.TAR and *.GZIP files, which contain additional information related to the FAQ, and all of this content is indexed. When searching and highlighting, however, search performance degrades for archive files like *.gz, *.tar and *.zip, and the debug flag shows that highlighting these archive files takes more time.

What could be the reason? Is it because these files are unzipped and then highlighted from the index at display time?

Does highlighting depend on file size? That is, does search performance degrade for larger files because of the highlighting?

I have tried reducing the maxAnalyzedChars value from 5 MB to 1 MB, but I still do not see any significant improvement in search and highlighting for these kinds of files.
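
For reference, a minimal sketch of how such a highlighting request might be assembled. The host, handler path and query text are hypothetical; hl, hl.fl, hl.maxAnalyzedChars, hl.useFastVectorHighlighter and debugQuery are standard Solr request parameters, and attachment_bodies is the field shown above. Whether the FastVectorHighlighter is available depends on the Solr version in use.

```python
from urllib.parse import urlencode


def build_highlight_url(base_url, query, field, max_analyzed_chars, use_fvh=False):
    """Build a Solr select URL with highlighting enabled on one field."""
    params = {
        "q": query,
        "hl": "true",
        "hl.fl": field,
        # Upper bound on how many characters the standard highlighter examines.
        "hl.maxAnalyzedChars": str(max_analyzed_chars),
        # debugQuery=true reports per-component timing, including highlighting.
        "debugQuery": "true",
    }
    if use_fvh:
        # Needs termVectors, termPositions and termOffsets on the field.
        params["hl.useFastVectorHighlighter"] = "true"
    return base_url + "?" + urlencode(params)


# Hypothetical host and query, for illustration only.
url = build_highlight_url(
    "http://localhost:8983/solr/select",
    "license error",
    "attachment_bodies",
    1024 * 1024,  # 1 MB
    use_fvh=True,
)
print(url)
```

Comparing the highlighting time in the debug output with and without use_fvh, and at different hl.maxAnalyzedChars values, would show which setting actually affects the slow documents.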

Please let me know if you can suggest a workaround for improving highlighting and search performance for these kinds of files, or for large files in general.


Thanks
Shyam

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Saturday, November 26, 2011 8:57 AM
To: solr-user@lucene.apache.org
Subject: Re: highlighting performance poor with *.tar, *.gz files

Highlighting is dependent on the size of the
data being fed through the highlighter. Unless you have
termVectors & offsets & positions enabled, the text
must be re-analyzed, see:
http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=%28termvector%29%7C%28retrieve%29%7C%28contents%29

But highlighting compressed files seems like an odd
use-case, what is the business reason you need to do this?

Best
Erick


Re: highlighting performance poor with *.tar, *.gz files

Posted by Erick Erickson <er...@gmail.com>.
Highlighting is dependent on the size of the
data being fed through the highlighter. Unless you have
termVectors & offsets & positions enabled, the text
must be re-analyzed, see:
http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=%28termvector%29%7C%28retrieve%29%7C%28contents%29

But highlighting compressed files seems like an odd
use-case, what is the business reason you need to do this?

Best
Erick
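
The difference between re-analysis and term vectors described above can be illustrated with a toy model. This is only a sketch of the idea, not how Lucene or Solr implement highlighting; every function name here is hypothetical and the tokenizer is a plain regular expression.

```python
import re

TOKEN = re.compile(r"\w+")


def offsets_by_reanalysis(text, term):
    """Re-tokenize the whole stored text at query time.

    The work grows with len(text), so a large extracted archive is slow.
    """
    return [(m.start(), m.end()) for m in TOKEN.finditer(text)
            if m.group().lower() == term]


def build_term_vector(text):
    """Done once at index time: map each term to its character offsets."""
    vector = {}
    for m in TOKEN.finditer(text):
        vector.setdefault(m.group().lower(), []).append((m.start(), m.end()))
    return vector


def offsets_from_vector(vector, term):
    """Query-time lookup: no pass over the text is needed."""
    return vector.get(term, [])


def highlight(text, offsets, pre="<em>", post="</em>"):
    """Wrap each (start, end) span in the given markers."""
    out, last = [], 0
    for start, end in offsets:
        out.append(text[last:start])
        out.append(pre + text[start:end] + post)
        last = end
    out.append(text[last:])
    return "".join(out)


text = "Solr highlights terms; solr is fast"
vector = build_term_vector(text)
print(highlight(text, offsets_from_vector(vector, "solr")))
```

Both paths find the same offsets; the difference is when the tokenizing pass over the text is paid for, at index time or on every highlighted result.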
