Posted to java-user@lucene.apache.org by Andreas Guther <an...@gmail.com> on 2007/05/17 08:10:29 UTC

Field.Store.Compress - does it improve performance of document reads?

I am currently exploring how to solve performance problems I encounter with
Lucene document reads.

We have, among other fields, one (default) field that stores all searchable
content.  This field can become quite large, since we index documents and
store their content for display within the results.

I noticed that the read can be very expensive.  I am now wondering whether
it would make sense to add this field as Field.Store.COMPRESS to the index.
Can someone tell me whether this would speed up document reads, or whether
it is only interesting for saving space?
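For reference, here is roughly how we add the field today (simplified; the
field name and variable are just placeholders):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    // Current: the catch-all field, indexed and stored as plain text.
    doc.add(new Field("contents", largeText,
                      Field.Store.YES, Field.Index.TOKENIZED));
    // What I am considering: the same field with compressed storage.
    // doc.add(new Field("contents", largeText,
    //                   Field.Store.COMPRESS, Field.Index.TOKENIZED));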

Thanks in advance,

Andreas

Re: Field.Store.Compress - does it improve performance of document reads?

Posted by Erick Erickson <er...@gmail.com>.
hmmmm. Now that I re-read your first mail, something else
suggests itself. You stated:

"We have amongst other fields one field (default) storing all searchable
fields".

Do you need to store this field at all? You can search fields that are
indexed but NOT stored. I've used something of the same technique
where I index lots of different fields in the same search field so my
queries aren't as complex, but return various stored fields to the
user for display purposes. Often these latter fields are stored but
NOT indexed.
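
A sketch of what I mean (field names are made up):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // One catch-all field that is searched but never loaded from disk,
    // plus small display-only fields that are stored but not indexed.
    static Document makeDoc(String title, String body) {
      Document doc = new Document();
      doc.add(new Field("all", title + " " + body,
                        Field.Store.NO, Field.Index.TOKENIZED));
      doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));
      doc.add(new Field("body", body, Field.Store.YES, Field.Index.NO));
      return doc;
    }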

It might also be useful if you'd post some of your relevant code
snippets; perhaps some innocent line is messing you up... Are you,
perhaps, calling get() in a HitCollector? Or iterating through
many documents with a Hits object? Or.....

Best
Erick

On 5/17/07, Andreas Guther <an...@gmail.com> wrote:
>
> I am actually using the FieldSelector, and unless I did something wrong it
> did not give me any load performance improvement, which was surprising and
> disappointing at the same time.  The only difference I could see was when
> I returned NO_LOAD for all fields, which as I understand it is the same as
> skipping over the document entirely.
>
> Right now I am looking into fragmentation problems with my huge index
> files.  I am defragmenting the hard drive to see if this brings any read
> performance improvement.
>
> I am also wondering if the FieldCache as discussed in
> http://www.gossamer-threads.com/lists/lucene/general/28252 would help
> improve the situation.
>
> Andreas
>
> On 5/17/07, Grant Ingersoll <gs...@apache.org> wrote:
> >
> > I haven't tried compression either.  I know there was some talk a while
> > ago about deprecating it, but that hasn't happened.  The current
> > implementation uses the highest level of compression.  You might find
> > better results by compressing in your application and storing the result
> > as a binary field, which gives you more control over the CPU used.  This
> > is our current recommendation for dealing w/ compression.
> >
> > If you are not actually displaying that field, you should look into the
> > FieldSelector API (via IndexReader).  It allows you to lazily load fields
> > or skip them altogether, and can yield pretty significant savings when it
> > comes to loading documents.  FieldSelector is available in 2.1.
> >
> > -Grant
> >
> > On May 17, 2007, at 4:01 AM, Paul Elschot wrote:
> >
> > > On Thursday 17 May 2007 08:10, Andreas Guther wrote:
> > >> I am currently exploring how to solve performance problems I
> > >> encounter with
> > >> Lucene document reads.
> > >>
> > >> We have, among other fields, one (default) field that stores all
> > >> searchable content.  This field can become quite large, since we
> > >> index documents and store their content for display within the
> > >> results.
> > >>
> > >> I noticed that the read can be very expensive.  I am now wondering
> > >> whether it would make sense to add this field as Field.Store.COMPRESS
> > >> to the index.  Can someone tell me whether this would speed up
> > >> document reads, or whether it is only interesting for saving space?
> > >
> > > I have not tried the compression yet, but in my experience a good way
> > > to reduce the cost of document reads from disk is to read the documents
> > > in document number order whenever possible.  That way you save on disk
> > > head seeks.  Compression should actually help reduce the cost of the
> > > disk head seeks even more.
> > >
> > > Regards,
> > > Paul Elschot
> > >
> >
> > --------------------------
> > Grant Ingersoll
> > Center for Natural Language Processing
> > http://www.cnlp.org/tech/lucene.asp
> >
> > Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
> >
>

Re: Field.Store.Compress - does it improve performance of document reads?

Posted by Mike Klaas <mi...@gmail.com>.
On 17-May-07, at 6:43 AM, Andreas Guther wrote:

> I am actually using the FieldSelector, and unless I did something wrong it
> did not give me any load performance improvement, which was surprising and
> disappointing at the same time.  The only difference I could see was when
> I returned NO_LOAD for all fields, which as I understand it is the same as
> skipping over the document entirely.

Note that storing the field as binary or compressed will increase the
speed gains from lazy loading.  If the stored field is just text, Lucene
has to scan the characters instead of seek()ing to a byte position.
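
A rough sketch of what I mean, assuming the big stored field is a binary
field named "body" (the name and the surrounding reader/docId are
placeholders):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.FieldSelectorResult;

    // Lazily load only the big field; its bytes are not read from disk
    // until binaryValue() is called.
    FieldSelector lazyBody = new FieldSelector() {
      public FieldSelectorResult accept(String fieldName) {
        return "body".equals(fieldName)
            ? FieldSelectorResult.LAZY_LOAD
            : FieldSelectorResult.LOAD;
      }
    };
    Document doc = reader.document(docId, lazyBody);
    byte[] raw = doc.getFieldable("body").binaryValue(); // read happens here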

-Mike



Re: Indexing Open Office documents

Posted by Enis Soztutar <en...@gmail.com>.
There is a parser for OpenOffice in Nutch, a plugin called parse-oo.
You can find more information on the Nutch mailing lists.
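
If you do not want to pull in Nutch, a minimal sketch is possible because
an OpenDocument file is a ZIP archive whose text lives in content.xml.
Real code should use an XML parser instead of this crude tag stripping:

    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    static String extractText(String path) throws Exception {
      ZipFile odf = new ZipFile(path); // e.g. "document.odt"
      ZipEntry entry = odf.getEntry("content.xml");
      Reader r = new InputStreamReader(odf.getInputStream(entry), "UTF-8");
      StringBuilder sb = new StringBuilder();
      char[] buf = new char[4096];
      for (int n; (n = r.read(buf)) != -1; ) {
        sb.append(buf, 0, n);
      }
      odf.close();
      return sb.toString().replaceAll("<[^>]+>", " "); // strip XML tags
    }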

On 5/17/07, jim shirreffs <jp...@verizon.net> wrote:
>
>
> Anyone know how to add OpenOffice documents to a Lucene index? Is there a
> parser for OpenOffice?
>
> thanks in advance
>
> jim s.
>
>

Indexing Open Office documents

Posted by jim shirreffs <jp...@verizon.net>.
Anyone know how to add OpenOffice documents to a Lucene index? Is there a
parser for OpenOffice?

thanks in advance

jim s. 




Re: Field.Store.Compress - does it improve performance of document reads?

Posted by Andreas Guther <an...@gmail.com>.
I am actually using the FieldSelector, and unless I did something wrong it
did not give me any load performance improvement, which was surprising and
disappointing at the same time.  The only difference I could see was when I
returned NO_LOAD for all fields, which as I understand it is the same as
skipping over the document entirely.
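
This is roughly what I tried for the NO_LOAD case (a sketch; reader and
docId are placeholders):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.FieldSelectorResult;

    // NO_LOAD for every field: document() returns essentially empty
    // documents, and only then did the reads become fast.
    FieldSelector noLoad = new FieldSelector() {
      public FieldSelectorResult accept(String fieldName) {
        return FieldSelectorResult.NO_LOAD;
      }
    };
    Document doc = reader.document(docId, noLoad);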

Right now I am looking into fragmentation problems with my huge index files.
I am defragmenting the hard drive to see if this brings any read performance
improvement.

I am also wondering if the FieldCache as discussed in
http://www.gossamer-threads.com/lists/lucene/general/28252 would help
improve the situation.

Andreas

On 5/17/07, Grant Ingersoll <gs...@apache.org> wrote:
>
> I haven't tried compression either.  I know there was some talk a while
> ago about deprecating it, but that hasn't happened.  The current
> implementation uses the highest level of compression.  You might find
> better results by compressing in your application and storing the result
> as a binary field, which gives you more control over the CPU used.  This
> is our current recommendation for dealing w/ compression.
>
> If you are not actually displaying that field, you should look into the
> FieldSelector API (via IndexReader).  It allows you to lazily load fields
> or skip them altogether, and can yield pretty significant savings when it
> comes to loading documents.  FieldSelector is available in 2.1.
>
> -Grant
>
> On May 17, 2007, at 4:01 AM, Paul Elschot wrote:
>
> > On Thursday 17 May 2007 08:10, Andreas Guther wrote:
> >> I am currently exploring how to solve performance problems I
> >> encounter with
> >> Lucene document reads.
> >>
> >> We have, among other fields, one (default) field that stores all
> >> searchable content.  This field can become quite large, since we
> >> index documents and store their content for display within the
> >> results.
> >>
> >> I noticed that the read can be very expensive.  I am now wondering
> >> whether it would make sense to add this field as Field.Store.COMPRESS
> >> to the index.  Can someone tell me whether this would speed up
> >> document reads, or whether it is only interesting for saving space?
> >
> > I have not tried the compression yet, but in my experience a good way
> > to reduce the cost of document reads from disk is to read the documents
> > in document number order whenever possible.  That way you save on disk
> > head seeks.  Compression should actually help reduce the cost of the
> > disk head seeks even more.
> >
> > Regards,
> > Paul Elschot
> >
>
> --------------------------
> Grant Ingersoll
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
>
> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
>
>
>
>

Re: Field.Store.Compress - does it improve performance of document reads?

Posted by Erick Erickson <er...@gmail.com>.
Some time ago I posted the results of using FieldSelector in my peculiar
app, and it gave dramatic improvements in my case (a factor of about 10).
I suspect much of that was peculiar to my index design, so your mileage
may vary.

See the thread titled...

*Lucene 2.1, using FieldSelector speeds up my app by a factor of 10+....*


Best
Erick

On 5/17/07, Grant Ingersoll <gs...@apache.org> wrote:
>
> I haven't tried compression either.  I know there was some talk a while
> ago about deprecating it, but that hasn't happened.  The current
> implementation uses the highest level of compression.  You might find
> better results by compressing in your application and storing the result
> as a binary field, which gives you more control over the CPU used.  This
> is our current recommendation for dealing w/ compression.
>
> If you are not actually displaying that field, you should look into the
> FieldSelector API (via IndexReader).  It allows you to lazily load fields
> or skip them altogether, and can yield pretty significant savings when it
> comes to loading documents.  FieldSelector is available in 2.1.
>
> -Grant
>
> On May 17, 2007, at 4:01 AM, Paul Elschot wrote:
>
> > On Thursday 17 May 2007 08:10, Andreas Guther wrote:
> >> I am currently exploring how to solve performance problems I
> >> encounter with
> >> Lucene document reads.
> >>
> >> We have, among other fields, one (default) field that stores all
> >> searchable content.  This field can become quite large, since we
> >> index documents and store their content for display within the
> >> results.
> >>
> >> I noticed that the read can be very expensive.  I am now wondering
> >> whether it would make sense to add this field as Field.Store.COMPRESS
> >> to the index.  Can someone tell me whether this would speed up
> >> document reads, or whether it is only interesting for saving space?
> >
> > I have not tried the compression yet, but in my experience a good way
> > to reduce the cost of document reads from disk is to read the documents
> > in document number order whenever possible.  That way you save on disk
> > head seeks.  Compression should actually help reduce the cost of the
> > disk head seeks even more.
> >
> > Regards,
> > Paul Elschot
> >
>
> --------------------------
> Grant Ingersoll
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
>
> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
>
>
>
>

Re: Field.Store.Compress - does it improve performance of document reads?

Posted by Grant Ingersoll <gs...@apache.org>.
I haven't tried compression either.  I know there was some talk a while
ago about deprecating it, but that hasn't happened.  The current
implementation uses the highest level of compression.  You might find
better results by compressing in your application and storing the result
as a binary field, which gives you more control over the CPU used.  This
is our current recommendation for dealing w/ compression.
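
Something along these lines (a sketch; the compression level and field
name are just examples, and you would decompress with InflaterInputStream
when reading it back):

    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;
    import java.util.zip.DeflaterOutputStream;
    import org.apache.lucene.document.Field;

    // Compress in the application so you control the speed/size trade-off,
    // then store the result as a binary field.
    static Field compressedField(String name, byte[] textBytes)
        throws Exception {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      DeflaterOutputStream out =
          new DeflaterOutputStream(bos, new Deflater(Deflater.BEST_SPEED));
      out.write(textBytes);
      out.close(); // flushes the deflater
      return new Field(name, bos.toByteArray(), Field.Store.YES);
    }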

If you are not actually displaying that field, you should look into the
FieldSelector API (via IndexReader).  It allows you to lazily load fields
or skip them altogether, and can yield pretty significant savings when it
comes to loading documents.  FieldSelector is available in 2.1.
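
For example (a sketch; the field names, reader and docId are placeholders):

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.document.*;

    // Eagerly load the small fields, lazily load the big one, and skip
    // everything else.
    Set load = new HashSet(Arrays.asList(new String[] {"title", "date"}));
    Set lazy = Collections.singleton("body");
    FieldSelector selector = new SetBasedFieldSelector(load, lazy);
    Document doc = reader.document(docId, selector);
    String title = doc.get("title");           // already loaded
    Fieldable body = doc.getFieldable("body"); // read deferred until accessed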

-Grant

On May 17, 2007, at 4:01 AM, Paul Elschot wrote:

> On Thursday 17 May 2007 08:10, Andreas Guther wrote:
>> I am currently exploring how to solve performance problems I  
>> encounter with
>> Lucene document reads.
>>
>> We have, among other fields, one (default) field that stores all
>> searchable content.  This field can become quite large, since we
>> index documents and store their content for display within the
>> results.
>>
>> I noticed that the read can be very expensive.  I am now wondering
>> whether it would make sense to add this field as Field.Store.COMPRESS
>> to the index.  Can someone tell me whether this would speed up
>> document reads, or whether it is only interesting for saving space?
>
> I have not tried the compression yet, but in my experience a good way
> to reduce the cost of document reads from disk is to read the documents
> in document number order whenever possible.  That way you save on disk
> head seeks.  Compression should actually help reduce the cost of the
> disk head seeks even more.
>
> Regards,
> Paul Elschot
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ





Re: Field.Store.Compress - does it improve performance of document reads?

Posted by Paul Elschot <pa...@xs4all.nl>.
On Thursday 17 May 2007 08:10, Andreas Guther wrote:
> I am currently exploring how to solve performance problems I encounter with
> Lucene document reads.
> 
> We have, among other fields, one (default) field that stores all searchable
> content.  This field can become quite large, since we index documents and
> store their content for display within the results.
>
> I noticed that the read can be very expensive.  I am now wondering whether
> it would make sense to add this field as Field.Store.COMPRESS to the index.
> Can someone tell me whether this would speed up document reads, or whether
> it is only interesting for saving space?

I have not tried the compression yet, but in my experience a good way to
reduce the cost of document reads from disk is to read the documents in
document number order whenever possible.  That way you save on disk head
seeks.  Compression should actually help reduce the cost of the disk head
seeks even more.
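
For example (a sketch; searcher, query and reader are placeholders): Hits
come back in score order, so one can copy the ids out and sort them before
fetching the stored fields:

    import java.util.Arrays;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.Hits;

    Hits hits = searcher.search(query);
    int[] ids = new int[hits.length()];
    for (int i = 0; i < hits.length(); i++) {
      ids[i] = hits.id(i); // ids only, no stored fields read yet
    }
    Arrays.sort(ids); // read in increasing document number
    for (int i = 0; i < ids.length; i++) {
      Document d = reader.document(ids[i]);
      // ... use d ...
    }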

Regards,
Paul Elschot
