Posted to dev@lucene.apache.org by Bernhard Messer <Be...@intrafind.de> on 2004/08/30 23:41:10 UTC

Binary fields and data compression

hi developers,

a few months ago, there was a very interesting discussion about field 
compression and the possibility to store binary field values within a 
Lucene document. Regarding this topic, Drew Farris came up with a 
patch adding the necessary functionality. I ran all the necessary tests 
on his implementation and didn't find a single problem. The original 
implementation from Drew could now be enhanced to compress the binary 
field data (maybe even the text fields, if they are only stored) before 
writing to disk. I made some simple statistical measurements using the 
java.util.zip package for data compression. With compression enabled, we 
could save about 40% of the data when compressing plain text files with 
sizes from 1KB to 4KB. If there is still some interest, we could first 
update the patch, because it's outdated due to several changes within 
the Fields class. After that, compression could be added to the updated 
version of the patch.
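
For illustration, a minimal sketch of the kind of measurement I made, 
using nothing but java.util.zip (the class name is just for the example):

   import java.io.ByteArrayOutputStream;
   import java.io.IOException;
   import java.util.zip.DeflaterOutputStream;

   public class CompressionRatio {
       // deflate a field value and return the compressed bytes
       static byte[] compress(byte[] value) throws IOException {
           ByteArrayOutputStream out = new ByteArrayOutputStream(value.length);
           DeflaterOutputStream deflater = new DeflaterOutputStream(out);
           deflater.write(value);
           deflater.finish();
           return out.toByteArray();
       }

       public static void main(String[] args) throws IOException {
           byte[] plain = args[0].getBytes("UTF-8");  // e.g. a 1KB-4KB text
           byte[] packed = compress(plain);
           System.out.println(plain.length + " -> " + packed.length + " bytes");
       }
   }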

sounds good to me, what do you think?

best regards
Bernhard






Re: Binary fields and data compression

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Roy wrote:

>I also tried Drew Farris's binary patch. It works fine with a few
>test cases of mine. However, I didn't have enough time to do a
>thorough performance comparison. I suggest the patch be checked
>into CVS.
This is especially interesting WRT the Lucene external content PROPOSAL 
I sent off a few weeks ago.

I was considering adding gzip support for exactly this case... In our 
situation we'd rather buy more CPUs than wait for disk IO.

Kevin

-- 

Please reply using PGP.

    http://peerfear.org/pubkey.asc    
    
    NewsMonster - http://www.newsmonster.org/
    
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
       AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster




Re: Binary fields and data compression

Posted by Roy <ro...@gmail.com>.
I also tried Drew Farris's binary patch. It works fine with a few
test cases of mine. However, I didn't have enough time to do a
thorough performance comparison. I suggest the patch be checked
into CVS.

On Wed, 01 Sep 2004 22:42:54 +0200, Bernhard Messer
<be...@intrafind.de> wrote:
> [...]


Re: Binary fields and data compression

Posted by Bernhard Messer <Be...@intrafind.de>.
Doug Cutting wrote:

> Bernhard Messer wrote:
>
>> [...]
>
>
> I like this patch and support upgrading it and adding it to Lucene.
>
Having a single, huge patch implementing all the functionality seems 
very difficult to maintain through Bugzilla, so I would suggest 
splitting the implementation into three steps:
1) update the binary field patch and add it to Lucene
2) make FieldsReader and FieldsWriter more readable using private 
static finals, and add compression
3) think further about compressing whole documents instead of 
single fields.

> I imagine a public API like:
>
>   public static final class Store {
>
>      [ ... ]
>
>      public static final Store COMPRESS = new Store();
>   }
>
>   new Field(String, byte[]) // stored, not compressed or indexed
>   new Field(String, byte[], Store)
>
> Also, in Field.java, perhaps we could replace:
>
>   String stringValue;
>   Reader readerValue;
>   byte[] binaryValue;
>
> with:
>
>   Object value;
>
> And in FieldsReader.java and FieldsWriter.java, some package-private 
> constants would make the code more readable, like:
>
>   // in FieldsWriter:
>   static final int IS_TOKENIZED = 1;
>   static final int IS_BINARY = 2;
>   static final int IS_COMPRESSED = 4;
>
> Note that it makes sense to compress non-binary values.  One could use 
> String.getBytes("UTF-8") and compress that.
>
I'm totally with you. Compressing string values would make sense once 
the length reaches a certain size (the same for byte[]). What the 
minimum size of a compression candidate has to be is something we still 
have to figure out. During my tests, I saw that everything above 100 
bytes was a good candidate for compression. But there is much more 
work to do in that area.

> I wonder if it might make more sense to compress entire document 
> records, rather than individual fields.  This would probably do better 
> when documents have lots of short text fields, as is not uncommon, and 
> would also minimize the fixed compression/decompression setup costs 
> (i.e., Inflater/Deflater allocations).  We could instead add a 
> "isCompressed" flag to Document, and then, in Field{Reader,Writer}, 
> store a bit per document indicating whether it is compressed.  
> Document records could first be serialized uncompressed to a buffer 
> which is then compressed and written.  Thoughts?
>
Interesting idea. I think this strongly depends on the fields, the 
options they have, and their values. Would it make sense to compress a 
field which is tokenized and indexed but not stored? Maybe we could 
think of some kind of algorithm that checks the document's field 
settings and decides whether it is a candidate for compression. Just a 
thought ;-)

> Doug
>
>




Re: Binary fields and data compression

Posted by Doug Cutting <cu...@apache.org>.
Bernhard Messer wrote:
> [...]

I like this patch and support upgrading it and adding it to Lucene.

I imagine a public API like:

   public static final class Store {

      [ ... ]

      public static final Store COMPRESS = new Store();
   }

   new Field(String, byte[]) // stored, not compressed or indexed
   new Field(String, byte[], Store)

Also, in Field.java, perhaps we could replace:

   String stringValue;
   Reader readerValue;
   byte[] binaryValue;

with:

   Object value;

And in FieldsReader.java and FieldsWriter.java, some package-private 
constants would make the code more readable, like:

   // in FieldsWriter:
   static final int IS_TOKENIZED = 1;
   static final int IS_BINARY = 2;
   static final int IS_COMPRESSED = 4;
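
A write path could then pack the flags into one byte, e.g. (a fragment: 
isBinary() assumes the binary-field patch, and 'compress' and 'output' 
stand for the surrounding write code):

   byte bits = 0;
   if (field.isTokenized()) bits |= IS_TOKENIZED;
   if (field.isBinary())    bits |= IS_BINARY;      // added by the patch
   if (compress)            bits |= IS_COMPRESSED;  // from a size check, say
   output.writeByte(bits);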

Note that it makes sense to compress non-binary values.  One could use 
String.getBytes("UTF-8") and compress that.
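
For example (just a sketch; the helper names are made up):

   import java.io.ByteArrayOutputStream;
   import java.io.IOException;
   import java.util.zip.DataFormatException;
   import java.util.zip.Deflater;
   import java.util.zip.Inflater;

   // store side: deflate the UTF-8 bytes of the field value
   static byte[] deflate(byte[] input) {
       Deflater deflater = new Deflater();
       deflater.setInput(input);
       deflater.finish();
       ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
       byte[] buf = new byte[1024];
       while (!deflater.finished())
           out.write(buf, 0, deflater.deflate(buf));
       deflater.end();
       return out.toByteArray();
   }

   // read side: inflate and decode back to the original String
   static String inflateToString(byte[] stored)
           throws IOException, DataFormatException {
       Inflater inflater = new Inflater();
       inflater.setInput(stored);
       ByteArrayOutputStream out = new ByteArrayOutputStream(stored.length * 2);
       byte[] buf = new byte[1024];
       while (!inflater.finished())
           out.write(buf, 0, inflater.inflate(buf));
       inflater.end();
       return new String(out.toByteArray(), "UTF-8");
   }

A stored text field would then round-trip as
inflateToString(deflate(value.getBytes("UTF-8"))).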

I wonder if it might make more sense to compress entire document 
records, rather than individual fields.  This would probably do better 
when documents have lots of short text fields, as is not uncommon, and 
would also minimize the fixed compression/decompression setup costs 
(i.e., Inflater/Deflater allocations).  We could instead add a 
"isCompressed" flag to Document, and then, in Field{Reader,Writer}, 
store a bit per document indicating whether it is compressed.  Document 
records could first be serialized uncompressed to a buffer which is then 
compressed and written.  Thoughts?
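
A sketch of that write path, using a plain Map as a stand-in for a 
document's stored fields:

   import java.io.ByteArrayOutputStream;
   import java.io.DataOutputStream;
   import java.io.IOException;
   import java.util.Map;
   import java.util.zip.DeflaterOutputStream;

   // serialize every stored field uncompressed into one buffer,
   // then run a single Deflater over the whole record
   static byte[] compressDocumentRecord(Map<String, String> stored)
           throws IOException {
       ByteArrayOutputStream fieldBuf = new ByteArrayOutputStream();
       DataOutputStream fields = new DataOutputStream(fieldBuf);
       fields.writeInt(stored.size());
       for (Map.Entry<String, String> e : stored.entrySet()) {
           fields.writeUTF(e.getKey());     // field name
           fields.writeUTF(e.getValue());   // field value
       }
       fields.flush();

       ByteArrayOutputStream record = new ByteArrayOutputStream();
       DeflaterOutputStream zip = new DeflaterOutputStream(record);
       fieldBuf.writeTo(zip);               // one deflate pass per document
       zip.finish();
       return record.toByteArray();
   }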

Doug




Re: Binary fields and data compression

Posted by Bernhard Messer <Be...@intrafind.de>.
Otis,

that's exactly what I have in mind. In a first step, compression should 
be optional and available on binary fields only. The default setting for 
compression should be "off", so the user has to enable it explicitly. I 
would also check the size of the byte array passed in: even with 
compression enabled, it doesn't make sense to compress a dataset which 
is too small, because we would end up with a compressed size bigger than 
the original due to the overhead compression adds.
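
Roughly like this (a fragment; the threshold is a placeholder we would 
have to measure, and compress() stands for a java.util.zip deflate 
helper):

   // don't even try to deflate values below some measured threshold
   static final int MIN_COMPRESS_LENGTH = 100;   // placeholder value

   static byte[] maybeCompress(byte[] value) throws IOException {
       if (value.length < MIN_COMPRESS_LENGTH)
           return value;                    // deflate overhead would expand it
       byte[] packed = compress(value);     // java.util.zip deflate helper
       return packed.length < value.length ? packed : value;  // keep smaller
   }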

Once the implementation is ready, we could run several tests to see how 
the overall performance is affected by compression.

Bernhard


Otis Gospodnetic wrote:

>Bernhard,
>
>Sounds good to me.
>I would, however, also be interested in the performance impact of
>text-field compression.  While adapting Drew's patch, it may be nice to
>make the compression mechanism pluggable.
>
>Otis
>
>--- Bernhard Messer <Be...@intrafind.de> wrote:
>
>> [...]


Re: Binary fields and data compression

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Bernhard,

Sounds good to me.
I would, however, also be interested in the performance impact of
text-field compression.  While adapting Drew's patch, it may be nice to
make the compression mechanism pluggable.
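
Something as small as this might do (names are only a suggestion):

   import java.io.IOException;

   // pluggable hook: a zip-based default could be swapped for any
   // other codec without touching FieldsReader/FieldsWriter
   public interface FieldCompressor {
       byte[] compress(byte[] data) throws IOException;
       byte[] decompress(byte[] data) throws IOException;
   }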

Otis

--- Bernhard Messer <Be...@intrafind.de> wrote:

> [...]


Re: Binary fields and data compression

Posted by Andrzej Bialecki <ab...@getopt.org>.
Robert Engels wrote:
> My estimates are based on our own projects, where we see that wrapping an
> InputStream in an InflaterInputStream takes about 20% of the CPU time, so
> whether to actually use it will depend on whether the IndexReader is
> performance-bound by CPU or by IO.
> 
> The problem with "after the read" decompression is that you still incur
> the overhead of decompression each time a file block is accessed, since
> the OS only caches the compressed block (unless Lucene adds caching to the
> index read operations). The disk IO time, however, is almost always
> eliminated if the index reader frequently accesses the same file blocks
> (since the OS caches the data block).

As I understand the original proposal, compression would be used mostly 
for reading the data of STORED fields. When it comes to inverted lists, 
which are the main data structure used for searching over indexed 
fields, they are already "compressed" in a highly optimized way, so 
adding another level of compression to that part wouldn't make much 
sense IMHO.
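
(For reference, a sketch of the variable-byte encoding Lucene's file 
format already uses for postings; small delta-gaps cost a single byte:)

   import java.io.ByteArrayOutputStream;

   // seven payload bits per byte; a set high bit flags a continuation
   static void writeVInt(ByteArrayOutputStream out, int i) {
       while ((i & ~0x7F) != 0) {
           out.write((i & 0x7F) | 0x80);
           i >>>= 7;
       }
       out.write(i);
   }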

[...]
> ... thus my request that any compression support be optional.

Absolutely. :-)

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




RE: Binary fields and data compression

Posted by Otis Gospodnetic <ot...@yahoo.com>.
--- Robert Engels <re...@ix.netcom.com> wrote:

......

> ... thus my request that any compression support be optional.

I think this goes without saying.  Say say say...

Otis


> -----Original Message-----
> From: David Spencer [mailto:dave-lucene-dev@tropo.com]
> Sent: Monday, August 30, 2004 5:33 PM
> To: Lucene Developers List
> Subject: Re: Binary fields and data compression
> 
> 
> [...]


RE: Binary fields and data compression

Posted by Robert Engels <re...@ix.netcom.com>.
My estimates are based on our own projects, where we see that wrapping an
InputStream in an InflaterInputStream takes about 20% of the CPU time, so
whether to actually use it will depend on whether the IndexReader is
performance-bound by CPU or by IO.
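
The wrapping in question is just this (the file name is hypothetical):

   import java.io.FileInputStream;
   import java.io.IOException;
   import java.io.InputStream;
   import java.util.zip.InflaterInputStream;

   public class InflateCost {
       public static void main(String[] args) throws IOException {
           // every read() inflates on the fly: the CPU cost is paid on
           // each access, even when the OS block cache spares the disk
           InputStream in =
               new InflaterInputStream(new FileInputStream(args[0]));
           byte[] buf = new byte[4096];
           long total = 0;
           for (int n; (n = in.read(buf)) != -1; )
               total += n;
           in.close();
           System.out.println(total + " bytes after decompression");
       }
   }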

The problem with "after the read" decompression is that you still incur
the overhead of decompression each time a file block is accessed, since
the OS only caches the compressed block (unless Lucene adds caching to the
index read operations). The disk IO time, however, is almost always
eliminated if the index reader frequently accesses the same file blocks
(since the OS caches the data block).

If Lucene is IO bound, then increasing the OS cache helps, but the
throughput gains will be limited because CPU cycles are now spent
decompressing the blocks.

With enough physical memory and disk space, I believe compression will
have a negative effect on overall performance, but again, this is going
to depend heavily on the environment (# of CPUs, physical memory, memory
architecture, disk speed, etc.). In the boundary case where the entire
index is loaded into physical memory (I think I read somewhere recently
that this is the scheme Google currently uses), compression will hurt
performance; as the memory-to-index-size ratio drops, compression will
probably start to help overall performance.

... thus my request that any compression support be optional.

-----Original Message-----
From: David Spencer [mailto:dave-lucene-dev@tropo.com]
Sent: Monday, August 30, 2004 5:33 PM
To: Lucene Developers List
Subject: Re: Binary fields and data compression


[...]




Re: Binary fields and data compression

Posted by David Spencer <da...@tropo.com>.
Robert Engels wrote:

> The data size savings are almost certainly not worth the probable 20-40%
> increase in CPU usage in most cases, no?
> 
> I think it should be optional: those who have extremely large indices may
> want to save some space (though that seems less necessary these days),
> while others will want to maximize performance.

You don't know until you benchmark it, but I thought the heuristic 
nowadays was that CPUs are fast and disk I/O is slow (and yes, disk 
space is 'infinite' :) ), so I would guess that in spite of the CPU 
cost of compression, you'd save time due to less disk I/O.


> 
> 
> -----Original Message-----
> From: Bernhard Messer [mailto:Bernhard.Messer@intrafind.de]
> Sent: Monday, August 30, 2004 4:41 PM
> To: lucene-dev@jakarta.apache.org
> Subject: Binary fields and data compression
> 
> 
> [...]




RE: Binary fields and data compression

Posted by Robert Engels <re...@ix.netcom.com>.
The data size savings are almost certainly not worth the probable 20-40%
increase in CPU usage in most cases, no?

I think it should be optional: those who have extremely large indices may
want to save some space (though that seems less necessary these days),
while others will want to maximize performance.


-----Original Message-----
From: Bernhard Messer [mailto:Bernhard.Messer@intrafind.de]
Sent: Monday, August 30, 2004 4:41 PM
To: lucene-dev@jakarta.apache.org
Subject: Binary fields and data compression


[...]

