Posted to dev@lucene.apache.org by Alex Aw Seat Kiong <al...@bigonthenet.com> on 2004/05/14 03:31:12 UTC

Does Lucene have a configuration option for storage compression?

Hi!

Some questions about Lucene:
1. Does Lucene have a configuration option for storage compression?

2. Is there any pure Java code for parsing Excel and PowerPoint files
that could be used so Lucene can index Excel and PowerPoint documents?

3. How can I get the information below? Is there an API for it?

- the total number of documents indexed.
- the total size of the index on disk.
- the date the index was last updated.
- the total number of documents deleted.



Thanks.


Regards,
AlexAw

Re: stored field compression

Posted by Drew Farris <al...@prodigy.net>.
On Fri, 2004-05-14 at 19:35, Dmitry Serebrennikov wrote:

> >
> > Sounds like a good plan.  String-values remain as fast as they are, 
> > and binary values are no slower.  We can easily layer compression, 
> > etc. on top of this.
> >
> > Are you volunteering? 
> 
> :)
> I'm pretty well pressed for time right now, so if someone else can pick 
> this up it would probably get done sooner.
> Let me see how my weekend pans out.

Hi All, 

I'm new here, so I'm not sure what the proper formalities for doing this
are, but I had some free time today and whipped up a patch that adds
binary value support to Field based on what's already been discussed.

Since it's my first contribution ever, please forgive it if it's not 100%
perfect; maybe it will be of some use to Dmitry or anyone else who
was planning on, or in the midst of, implementing this.

This is not extensively tested, and I was hoping for some guidance from
the other developers in this area. I modified the unit test for Document
to verify its operation -- are there any others that I should update to
fully test this addition? Are the unit tests sufficient, or should I go
to the extent of building a little app to test this and do some actual
searching?

At any rate, I hope this is useful to some degree. The patch was made
against today's HEAD. Should I be patching against tagged
releases?

Any critique is welcome.

Drew




Re: stored field compression

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Doug Cutting wrote:

> Dmitry Serebrennikov wrote:
>
>> Actually, I was thinking of something simpler... Something like a 
>> special case where one could supply binary data directly into a 
>> stored field. Something like:
>> public class Field {
>>    public static Field Binary(String name, byte[] value);
>>    public boolean isBinary();
>>    public byte[] binaryValue();
>> }
>>
>> This would automatically become a stored field. Lucene wouldn't need 
>> to know what the data means - just carry it around. The binaryValue() 
>> can return null unless isBinary() is true, in which case you'd get 
>> the data back and stringValue() would return null instead.
>>
>> This would be a start. If we want to provide special handling for 
>> ints, floats, and so on, we provide a BinaryField class, a la DateField.
>
>
> Sounds like a good plan.  String-values remain as fast as they are, 
> and binary values are no slower.  We can easily layer compression, 
> etc. on top of this.
>
> Are you volunteering? 

:)
I'm pretty well pressed for time right now, so if someone else can pick 
this up it would probably get done sooner.
Let me see how my weekend pans out.

Dmitry.






Re: stored field compression

Posted by Doug Cutting <cu...@apache.org>.
Dmitry Serebrennikov wrote:
> Actually, I was thinking of something simpler... Something like a special 
> case where one could supply binary data directly into a stored field. 
> Something like:
> public class Field {
>    public static Field Binary(String name, byte[] value);
>    public boolean isBinary();
>    public byte[] binaryValue();
> }
> 
> This would automatically become a stored field. Lucene wouldn't need to 
> know what the data means - just carry it around. The binaryValue() can 
> return null unless isBinary() is true, in which case you'd get the data 
> back and stringValue() would return null instead.
> 
> This would be a start. If we want to provide special handling for ints, 
> floats, and so on, we provide a BinaryField class, a la DateField.

Sounds like a good plan.  String-values remain as fast as they are, and 
binary values are no slower.  We can easily layer compression, etc. on 
top of this.

Are you volunteering?

Doug



Re: stored field compression

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Doug Cutting wrote:

> Dmitry Serebrennikov wrote:
>
>> A different approach would be to just allow binary data in fields. 
>> That way applications can compress and decompress as they see fit, 
>> plus they would be able to store numerical and other data more 
>> efficiently.
>
>
> That's an interesting idea.  One could, for convenience and 
> compatibility, add accessor methods to Field that, when you add a 
> String, convert it to UTF-8 bytes, and make stringValue() parse (and 
> possibly cache) a UTF-8 string from the binary value.  There'd be 
> another allocation per field read: FieldReader would construct a 
> byte[], then stringValue() would construct a String with a char[].  
> Right now we only construct a String with a char[] per stringValue().  
> Perhaps this is moot, especially if we're lazy about constructing the 
> strings and they're cached.  That way, for all the fields you don't 
> access you save an allocation.

Actually, I was thinking of something simpler... Something like a special 
case where one could supply binary data directly into a stored field. 
Something like:
public class Field {
    public static Field Binary(String name, byte[] value);
    public boolean isBinary();
    public byte[] binaryValue();
}

This would automatically become a stored field. Lucene wouldn't need to 
know what the data means - just carry it around. The binaryValue() can 
return null unless isBinary() is true, in which case you'd get the data 
back and stringValue() would return null instead.

This would be a start. If we want to provide special handling for ints, 
floats, and so on, we provide a BinaryField class, a la DateField.
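For illustration, usage of the proposed API might look like this (a
sketch only: Field.Binary, isBinary(), and binaryValue() are the
hypothetical additions above, not part of the current Lucene API):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BinaryFieldUsage {
    public static void main(String[] args) {
        byte[] payload = new byte[] { 1, 2, 3, 4 };   // any opaque bytes

        Document doc = new Document();
        // Proposed factory method; the field would be stored, not indexed.
        doc.add(Field.Binary("payload", payload));

        // Retrieve the value back from the document.
        Field f = doc.getField("payload");
        byte[] roundTrip = f.isBinary() ? f.binaryValue() : null;
    }
}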

We might lose some efficiency because ints and longs would be better off 
if they were stored as ints and longs rather than as a byte[]...

Actually, we might be able to represent binary data fields as offsets 
into the complete byte[] that was read from the index file in the first 
place. That way we wouldn't need to copy the data until the binaryValue() 
method is called. Also the BinaryField class can do byte[] -> int 
conversion directly from the offsets into the main byte[] buffer, again 
saving byte[] allocation.
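A minimal sketch of that zero-copy idea (the buffer/offset layout and the
helper class name are assumptions for illustration):

// Hypothetical helper: decode a big-endian int directly at an offset
// into the shared buffer read from the index, so no per-field byte[]
// copy is needed until a caller really wants the bytes.
public final class BinaryFieldDecode {
    public static int intAt(byte[] buf, int offset) {
        return ((buf[offset]     & 0xff) << 24)
             | ((buf[offset + 1] & 0xff) << 16)
             | ((buf[offset + 2] & 0xff) <<  8)
             |  (buf[offset + 3] & 0xff);
    }
}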

Would binary fields only be useful for stored fields? I can't really see 
how binary data could be usefully tokenized, but maybe in some 
multimedia applications? Binary keyword fields might be interesting. 
These could allow searching on integer ranges, more straightforward 
date ranges, and more efficient data storage in some cases. That's a big 
change though. We'd have to change all searching to be based on binary 
tokens instead of strings.

>
>
>> Of course, this would then be a per-value compression and probably 
>> not as effective as a whole index compression that could be done with 
>> the other approaches.
>
>
> But, since documents are accessed randomly, we can't easily do a lot 
> better for field data. 

I don't know much about how the ZIP algorithm works internally, but it seems 
that there could be a parallel between a zip file with its zip entries and 
the Lucene index with its Lucene documents.

> This feature is primarily intended to make life easier for folks who 
> want to store whole documents in the index.  Selective use of gzip 
> would be a huge improvement over the present situation.  Alternate 
> compression algorithms might make things a bit better yet, but 
> probably not hugely. 

I agree, unless one can figure out how to share the dictionary across 
documents.
If we just go now with the simple binary data-bucket design described 
above, applications can do any clever implementation they choose. 
The BinaryField class will provide helper methods for the most common 
things. Perhaps a GZipField is another good candidate for the immediate 
future.
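As a sketch of what such a GZipField helper's compress/decompress pair
could look like (the class name is hypothetical; only java.util.zip is
assumed):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public final class GZipField {
    // Compress a string value into gzip bytes for a binary stored field.
    public static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(bytes);
        gzip.write(text.getBytes("UTF-8"));
        gzip.close();
        return bytes.toByteArray();
    }

    // Decompress gzip bytes back into the original string value.
    public static String decompress(byte[] data) throws IOException {
        GZIPInputStream gzip =
            new GZIPInputStream(new ByteArrayInputStream(data));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = gzip.read(buf)) != -1; ) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), "UTF-8");
    }
}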

Going forward, perhaps there is a way to do compression such that the 
dictionary is managed for each segment of the index, and merged when the 
segments are merged? If this is possible, it would be a good argument 
for Lucene to be compression-aware.

How does all of this sound?

Dmitry.





Re: stored field compression

Posted by Doug Cutting <cu...@apache.org>.
Dmitry Serebrennikov wrote:
> A different approach would be to just allow binary data in fields. That 
> way applications can compress and decompress as they see fit, plus they 
> would be able to store numerical and other data more efficiently.

That's an interesting idea.  One could, for convenience and 
compatibility, add accessor methods to Field that, when you add a 
String, convert it to UTF-8 bytes, and make stringValue() parse (and 
possibly cache) a UTF-8 string from the binary value.  There'd be 
another allocation per field read: FieldReader would construct a byte[], 
then stringValue() would construct a String with a char[].  Right now we 
only construct a String with a char[] per stringValue().  Perhaps this 
is moot, especially if we're lazy about constructing the strings and 
they're cached.  That way, for all the fields you don't access you save 
an allocation.

Then you could also add intValue() and floatValue() methods, etc. which 
use binary representations.  These could speed up lots of stuff.
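For instance, the accessors could use a fixed-width big-endian encoding
(a sketch; shown as static helpers since intValue() and floatValue() do
not exist on Field yet):

   import java.nio.ByteBuffer;

   // Sketch of the proposed typed accessors; in Field they would decode
   // the field's binary value rather than take an explicit byte[].
   public final class TypedValues {
     public static byte[] toBytes(int v) {
       return ByteBuffer.allocate(4).putInt(v).array();
     }
     public static int intValue(byte[] b) {
       return ByteBuffer.wrap(b).getInt();     // big-endian by default
     }
     public static float floatValue(byte[] b) {
       return ByteBuffer.wrap(b).getFloat();   // IEEE 754 bits
     }
   }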

For easy extensibility you could do something like:

   interface FieldValue {
     byte[] getBytes();
     void setValue(byte[] bytes);
   }

   /** Extracts the value of the field into <code>value</code>.
    * @see FieldValue#setValue(byte[])
    */
   void getValue(FieldValue value) {
     value.setValue(getBytes());
   }

   // replace the base Field ctor with:
   public Field(String name, FieldValue value,
                boolean store, boolean index,
                boolean token, boolean vector) {
     ...
     bytes = value.getBytes();
     ...
   }

   public class CompressedTextFieldValue implements FieldValue {
     public CompressedTextFieldValue(String text) { ... }
     public String toString() { ... }
     ...
   }

   public class SerializableFieldValue implements FieldValue {
     public SerializableFieldValue(Serializable s) { ... }
     public Serializable getSerializable() { ... }
     ...
   }

It could be up to the application to always use the same FieldValue 
class with a given field, or we could add the FieldValue class to the 
index's FieldInfos...

I'd like to continue to be able to avoid storing type information per field 
instance, and to avoid re-inventing object serialization, but maybe I 
need to give these up...

> Of course, this would then be a per-value compression and probably not 
> as effective as a whole index compression that could be done with the 
> other approaches.

But, since documents are accessed randomly, we can't easily do a lot 
better for field data.

> Doug, what compression algorithm did you have in mind 
> for the actual compression?

I was just thinking gzip.  Alternatively, one could make it extensible, 
and tag each item with the compression algorithm, but I think that gets 
to be a mess.  Also, it's good to stick to a standard algorithm, so that 
the Perl, C#, C++, etc. ports can easily incorporate the feature.

This feature is primarily intended to make life easier for folks who 
want to store whole documents in the index.  Selective use of gzip would 
be a huge improvement over the present situation.  Alternate compression 
algorithms might make things a bit better yet, but probably not hugely.

Doug



Re: stored field compression

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Doug Cutting wrote:

> Doug Cutting wrote:
>
>> A more elaborate approach would be to lazily decompress fields when 
>> values are accessed.
>
>
> Another big advantage of this approach (as Peter Cipollone reminded 
> me) is that it will make indexing faster, as decompression will 
> be avoided when merging.
>
> Doug
>
A different approach would be to just allow binary data in fields. That 
way applications can compress and decompress as they see fit, plus they 
would be able to store numerical and other data more efficiently.

Of course, this would then be a per-value compression and probably not 
as effective as a whole index compression that could be done with the 
other approaches. Doug, what compression algorithm did you have in mind 
for the actual compression?

Dmitry.





Re: stored field compression

Posted by Doug Cutting <cu...@apache.org>.
Doug Cutting wrote:
> A more elaborate approach would be to lazily decompress fields when 
> values are accessed.

Another big advantage of this approach (as Peter Cipollone reminded me) 
is that it will make indexing faster, as decompression will be avoided 
when merging.

Doug



stored field compression

Posted by Doug Cutting <cu...@apache.org>.
[ Moved discussion from lucene-user. ]

Ype Kingma wrote:
> One place where compression might be useful is in the stored fields [...]

I agree, and this would not be hard to add.

The simplest approach would be to just add the following to Field.java:

   private boolean isCompressed;
   public boolean isCompressed() { return isCompressed; }
   public void setIsCompressed(boolean isCompressed) {
     this.isCompressed = isCompressed;
   }

Perhaps along with additional constructors that permit one to specify 
whether a field is compressed, e.g., Field.Text(String name, String 
value, boolean isCompressed).

Then just change FieldsWriter and FieldsReader to use a bit in the bits 
that are stored with each field to indicate whether the value is 
compressed, and, when it is, compress or decompress it accordingly.
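A sketch of the flag handling this paragraph describes (the bit values
and helper names are assumptions for illustration):

   // Hypothetical per-field bits byte, extending the existing pattern.
   public final class FieldBits {
     public static final byte FIELD_IS_TOKENIZED  = 0x1;
     public static final byte FIELD_IS_COMPRESSED = 0x2;  // assumed new bit

     // FieldsWriter side: build the bits byte for a stored field.
     public static byte encode(boolean tokenized, boolean compressed) {
       byte bits = 0;
       if (tokenized)  bits |= FIELD_IS_TOKENIZED;
       if (compressed) bits |= FIELD_IS_COMPRESSED;
       return bits;
     }

     // FieldsReader side: decide whether to decompress the value.
     public static boolean isCompressed(byte bits) {
       return (bits & FIELD_IS_COMPRESSED) != 0;
     }
   }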

A more elaborate approach would be to lazily decompress fields when 
values are accessed.  That way, when you only require one field's value, 
you don't decompress all of the values.  This would require changing 
Field.java a bit more, perhaps replacing its stringValue and readerValue 
fields with something like:

   private Object value;
   private boolean isCompressed;    // as introduced above

   private class CompressedValue {
     private byte[] data;
     public CompressedValue(byte[] data) { this.data = data; }
     public CompressedValue(String value) { ... code to compress ... }
     public String toString() { ... code to decompress ... }
     public byte[] getData() { return data; }
   }

   public String stringValue() {
     return value instanceof Reader ? null : value.toString();
   }

   public Reader readerValue() {
     return value instanceof Reader ? (Reader)value : null;
   }

   public byte[] compressedValue() {
     return value instanceof CompressedValue
      ? ((CompressedValue)value).getData()
      : null;
   }

   public void setIsCompressed(boolean isCompressed) {
     if (isCompressed && !this.isCompressed) {
       value = new CompressedValue((String)value);
     } else if (!isCompressed && this.isCompressed) {
       value = stringValue();
     }
     this.isCompressed = isCompressed;
   }

   // replace the ctor Field(String, String, ...) with the following
   public Field(String name, Object value,
                boolean store, boolean index,
                boolean token, boolean vector) {
      ...
      if (value instanceof String) {
        this.value = (String)value;
      } else if (value instanceof byte[]) {
        this.value = new CompressedValue((byte[])value);
      } else {
        throw new IllegalArgumentException(...);
      }
      ...
   }

Then change FieldsWriter to write the compressedValue() bytes, when 
non-null, and, finally, change FieldsReader to, when a value is 
compressed, read the bytes and pass them instead of a String to the ctor.

Anyone want to take this on?

Doug



Re: Does Lucene have a configuration option for storage compression?

Posted by Ype Kingma <yk...@xs4all.nl>.
Alex, Otis,

On Friday 14 May 2004 13:58, Otis Gospodnetic wrote:
> Moving to lucene-user list.
>
> Hello,
>
> Didn't I already answer these questions?
>
> 1. No :(

There is a bit more to say, see below.

...
>
> --- Alex Aw Seat Kiong <al...@bigonthenet.com> wrote:
> > Hi!
> >
> > Some questions about Lucene:
> > 1. Does Lucene have a configuration option for storage
> > compression?

Lucene indexes are quite compact already.
Text (in Western languages) is normally indexed to about 1/3 of its original
size; I don't know about CJK.
You can have a look at the file formats on the Lucene web site to see how the
compression is done. Among other techniques, there are shared prefixes for
the sorted terms, variable-length integers, and delta encoding: storing
differences between integers instead of the complete numbers when possible.
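For example, the variable-length integers store seven bits per byte with
the high bit as a continuation flag, so the small deltas between sorted
numbers usually fit in a single byte. A sketch (see the file-format
documentation for the authoritative definition):

   import java.io.ByteArrayOutputStream;

   // Sketch of Lucene-style VInt plus delta encoding.
   public class VIntSketch {
     static void writeVInt(ByteArrayOutputStream out, int i) {
       while ((i & ~0x7F) != 0) {        // more than seven bits left?
         out.write((i & 0x7F) | 0x80);   // low seven bits, high bit set
         i >>>= 7;
       }
       out.write(i);                     // final byte, high bit clear
     }

     public static void main(String[] args) {
       ByteArrayOutputStream out = new ByteArrayOutputStream();
       int[] sorted = { 5, 12, 13, 120, 250 };
       int previous = 0;
       for (int j = 0; j < sorted.length; j++) {
         writeVInt(out, sorted[j] - previous);  // store the difference
         previous = sorted[j];
       }
       System.out.println(sorted.length + " ints in " + out.size() + " bytes");
     }
   }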

One place where compression might be useful is in the stored fields, but
there is no API for it in Lucene.

Kind regards,
Ype




Re: Does Lucene have a configuration option for storage compression?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Moving to lucene-user list.

Hello,

Didn't I already answer these questions?

1. No :(

2. Use POI (jakarta.apache.org/poi) API

3. IndexReader can provide at least some of your numbers.  I suggest
you look at the Javadocs for IndexReader, which are available on Lucene's
site.
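For instance, something like this sketch (1.x-era API; the index path is
a placeholder, and total on-disk size isn't exposed directly, so it is
summed from the directory):

import java.io.File;
import java.util.Date;
import org.apache.lucene.index.IndexReader;

public class IndexStats {
    public static void main(String[] args) throws Exception {
        String path = "/path/to/index";   // placeholder
        IndexReader reader = IndexReader.open(path);

        System.out.println("docs indexed:  " + reader.numDocs());
        // maxDoc() still counts deleted docs not yet merged away.
        System.out.println("docs deleted:  "
            + (reader.maxDoc() - reader.numDocs()));
        System.out.println("last modified: "
            + new Date(IndexReader.lastModified(path)));

        // No API for total index size; sum the index files on disk.
        long bytes = 0;
        File[] files = new File(path).listFiles();
        for (int i = 0; i < files.length; i++) {
            bytes += files[i].length();
        }
        System.out.println("index size:    " + bytes + " bytes");

        reader.close();
    }
}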

Otis

--- Alex Aw Seat Kiong <al...@bigonthenet.com> wrote:
> Hi!
> 
> Some questions about Lucene:
> 1. Does Lucene have a configuration option for storage
> compression?
> 
> 2. Is there any pure Java code for parsing Excel and PowerPoint
> files that could be used so Lucene can index Excel and PowerPoint
> documents?
> 
> 3. How can I get the information below? Is there an API for it? 
> 
> - the total number of documents indexed.
> - the total size of the index on disk.
> - the date the index was last updated.
> - the total number of documents deleted.
> 
> 
> 
> Thanks.
> 
> 
> Regards,
> AlexAw




Re: ClassCastException MultiReader

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Sure.  'Bugzilla it', please.

Otis
P.S.
That line 274 should be line 273 in the CVS HEAD as of now.

--- Rasik Pandey <ra...@ajlsm.com> wrote:
> Howdy,
> 
> This exception was thrown with 1.4rc3. Do you need a test case for
> this one?
> 
> java.lang.ClassCastException
>         at
> org.apache.lucene.index.MultiTermEnum.<init>(MultiReader.java:274)
>         at
> org.apache.lucene.index.MultiReader.terms(MultiReader.java:187)
> 
> 
> Regards,
> RBP




ClassCastException MultiReader

Posted by Rasik Pandey <ra...@ajlsm.com>.
Howdy,

This exception was thrown with 1.4rc3. Do you need a test case for this one?

java.lang.ClassCastException
        at org.apache.lucene.index.MultiTermEnum.<init>(MultiReader.java:274)
        at org.apache.lucene.index.MultiReader.terms(MultiReader.java:187)


Regards,
RBP


