Posted to dev@lucene.apache.org by Alex Aw Seat Kiong <al...@bigonthenet.com> on 2004/05/14 03:31:12 UTC
Does Lucene have a configuration option for storage compression?
Hi!
Some questions about Lucene:
1. Does Lucene have a configuration option for storage compression?
2. Is there any pure Java code for Excel and PowerPoint parsing that can
be used to let Lucene index Excel and PowerPoint documents?
3. How can I get the information below? Is there an API for it?
- the total number of documents indexed.
- the total index size on storage.
- the date the index was last updated.
- the total number of documents deleted.
Thanks.
Regards,
AlexAw
Re: stored field compression
Posted by Drew Farris <al...@prodigy.net>.
On Fri, 2004-05-14 at 19:35, Dmitry Serebrennikov wrote:
> >
> > Sounds like a good plan. String-values remain as fast as they are,
> > and binary values are no slower. We can easily layer compression,
> > etc. on top of this.
> >
> > Are you volunteering?
>
> :)
> I'm pretty well pressed for time right now, so if someone else can pick
> this up it would probably get done sooner.
> Let me see how my weekend pans out.
Hi All,
I'm new here, so I'm not sure what the proper formalities for doing this
are, but I had some free time today and whipped up a patch that adds
binary value support to Field based on what's already been discussed.
Since it's my first contribution ever, please forgive any imperfections;
maybe it will be of some use to Dmitry or anyone else who was planning
on, or is in the midst of, implementing this.
This is not extensively tested, and I was hoping for some guidance from
the other developers in this area. I modified the unit test for Document
to verify its operation -- are there any others that I should update to
fully test this addition? Are the unit tests sufficient, or should I go
to the extent of building a little app to test this and do some actual
searching?
At any rate, I hope this is useful to some degree. This patch was made
against today's HEAD. Should I be patching against tagged releases?
Any critique is welcome.
Drew
Re: stored field compression
Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Doug Cutting wrote:
> Dmitry Serebrennikov wrote:
>
>> Actually, I was thinking of something simpler... Something like a
>> special case where one could supply binary data directly into a
>> stored field. Something like:
>> public class Field {
>> public static Field Binary(String name, byte[] value);
>> public boolean isBinary();
>> public byte[] binaryValue();
>> }
>>
>> This would automatically become a stored field. Lucene wouldn't need
>> to know what the data means - just carry it around. The binaryValue()
>> can return null unless isBinary() is true, in which case you'd get
>> the data back and stringValue() would return null instead.
>>
>> This would be a start. If we want to provide special handling for
>> ints, floats, and so on, we provide a BinaryField class, a la DateField.
>
>
> Sounds like a good plan. String-values remain as fast as they are,
> and binary values are no slower. We can easily layer compression,
> etc. on top of this.
>
> Are you volunteering?
:)
I'm pretty well pressed for time right now, so if someone else can pick
this up it would probably get done sooner.
Let me see how my weekend pans out.
Dmitry.
>
>
> Doug
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: stored field compression
Posted by Doug Cutting <cu...@apache.org>.
Dmitry Serebrennikov wrote:
> Actually, I was thinking of something simpler... Something like a special
> case where one could supply binary data directly into a stored field.
> Something like:
> public class Field {
> public static Field Binary(String name, byte[] value);
> public boolean isBinary();
> public byte[] binaryValue();
> }
>
> This would automatically become a stored field. Lucene wouldn't need to
> know what the data means - just carry it around. The binaryValue() can
> return null unless isBinary() is true, in which case you'd get the data
> back and stringValue() would return null instead.
>
> This would be a start. If we want to provide special handling for ints,
> floats, and so on, we provide a BinaryField class, a la DateField.
Sounds like a good plan. String-values remain as fast as they are, and
binary values are no slower. We can easily layer compression, etc. on
top of this.
Are you volunteering?
Doug
Re: stored field compression
Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Doug Cutting wrote:
> Dmitry Serebrennikov wrote:
>
>> A different approach would be to just allow binary data in fields.
>> That way applications can compress and decompress as they see fit,
>> plus they would be able to store numerical and other data more
>> efficiently.
>
>
> That's an interesting idea. One could, for convenience and
> compatibility, add accessor methods to Field that, when you add a
> String, convert it to UTF-8 bytes, and make stringValue() parse (and
> possibly cache) a UTF-8 string from the binary value. There'd be
> another allocation per field read: FieldReader would construct a
> byte[], then stringValue() would construct a String with a char[].
> Right now we only construct a String with a char[] per stringValue().
> Perhaps this is moot, especially if we're lazy about constructing the
> strings and they're cached. That way, for all the fields you don't
> access you save an allocation.
Actually, I was thinking of something simpler... Something like a special
case where one could supply binary data directly into a stored field.
Something like:
public class Field {
  public static Field Binary(String name, byte[] value);
  public boolean isBinary();
  public byte[] binaryValue();
}
This would automatically become a stored field. Lucene wouldn't need to
know what the data means - just carry it around. The binaryValue() can
return null unless isBinary() is true, in which case you'd get the data
back and stringValue() would return null instead.
This would be a start. If we want to provide special handling for ints,
floats, and so on, we provide a BinaryField class, a la DateField.
We might lose some efficiency because ints and longs would be better off
if they were stored as ints and longs rather than a byte[]...
Actually, we might be able to represent binary data fields as offsets
into the complete byte[] that was read from the index file in the first
place. That way we wouldn't need to copy the data until the binaryValue()
method was called. Also the BinaryField class can do byte[] -> int
conversion directly from the offsets into the main byte[] buffer, again
saving byte[] allocation.
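That byte[]-to-int conversion from an offset can be sketched as follows. This is purely illustrative; BinaryFieldSketch and intAt are made-up names, not proposed Lucene API. The point is that an int can be decoded directly from a shared buffer without first copying four bytes into a separate array:

```java
// Illustrative sketch: decode a big-endian int at an offset into a shared
// byte[] buffer, avoiding a per-value byte[] copy.
public class BinaryFieldSketch {
    public static int intAt(byte[] buf, int offset) {
        return ((buf[offset]     & 0xFF) << 24)
             | ((buf[offset + 1] & 0xFF) << 16)
             | ((buf[offset + 2] & 0xFF) << 8)
             |  (buf[offset + 3] & 0xFF);
    }

    public static void main(String[] args) {
        // The int 0x12345678 stored big-endian at offset 4 of a larger buffer.
        byte[] buf = {0, 0, 0, 0, 0x12, 0x34, 0x56, 0x78};
        System.out.println(intAt(buf, 4)); // 305419896 == 0x12345678
    }
}
```

A BinaryField-style helper could expose exactly this kind of accessor over the segment's shared buffer.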
Would binary fields only be useful for stored fields? I can't really see
how binary data could be usefully tokenized, but maybe in some
multimedia applications? Binary keyword fields might be interesting.
These could allow searching on integer ranges, more straight-forward
date ranges, and more efficient data storage in some cases. That's a big
change though. We'd have to change all searching to be based on binary
tokens instead of strings.
>
>
>> Of course, this would then be a per-value compression and probably
>> not as effective as a whole index compression that could be done with
>> the other approaches.
>
>
> But, since documents are accessed randomly, we can't easily do a lot
> better for field data.
I don't know much about how the zip algorithm works internally, but it
seems that there could be a parallel between a zip file with zip entries
and a Lucene index with Lucene documents.
> This feature is primarily intended to make life easier for folks who
> want to store whole documents in the index. Selective use of gzip
> would be a huge improvement over the present situation. Alternate
> compression algorithms might make things a bit better yet, but
> probably not hugely.
I agree, unless one can figure out how to share the dictionary across
documents.
If we just go now with the simple binary data-bucket design described
above, applications can do any clever implementation they choose. A
BinaryField class will provide helper methods for the most common
things. Perhaps GZipField is another good candidate for the immediate
future.
Going forward, perhaps there is a way to do compression such that the
dictionary is managed for each segment of the index, and merged when the
segments are merged? If this is possible, it would be a good argument
for Lucene to be compression-aware.
How does all of this sound?
Dmitry.
Re: stored field compression
Posted by Doug Cutting <cu...@apache.org>.
Dmitry Serebrennikov wrote:
> A different approach would be to just allow binary data in fields. That
> way applications can compress and decompress as they see fit, plus they
> would be able to store numerical and other data more efficiently.
That's an interesting idea. One could, for convenience and
compatibility, add accessor methods to Field that, when you add a
String, convert it to UTF-8 bytes, and make stringValue() parse (and
possibly cache) a UTF-8 string from the binary value. There'd be
another allocation per field read: FieldReader would construct a byte[],
then stringValue() would construct a String with a char[]. Right now we
only construct a String with a char[] per stringValue(). Perhaps this
is moot, especially if we're lazy about constructing the strings and
they're cached. That way, for all the fields you don't access you save
an allocation.
Then you could also add intValue() and floatValue() methods, etc. which
use binary representations. These could speed up lots of stuff.
For easy extensibility you could do something like:
interface FieldValue {
  byte[] getBytes();
  void setValue(byte[] bytes);
}
/** Extracts the value of the field into <code>value</code>.
 * @see FieldValue#setValue()
 */
void getValue(FieldValue value) {
  value.setValue(getBytes());
}
// replace the base Field ctor with:
public Field(String name, FieldValue value,
             boolean store, boolean index,
             boolean token, boolean vector) {
  ...
  bytes = value.getBytes();
  ...
}
public class CompressedTextFieldValue implements FieldValue {
  public CompressedTextFieldValue(String text) { ... }
  public String toString() { ... }
  ...
}
public class SerializableFieldValue implements FieldValue {
  public SerializableFieldValue(Serializable obj) { ... }
  public Serializable getSerializable() { ... }
  ...
}
It could be up to the application to always use the same FieldValue
class with a field, or we could add the FieldValue class to the index's
FieldInfos...
I'd like to continue to be able to avoid storing type information per
field instance, and to avoid re-inventing object serialization, but
maybe I need to give these up...
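To make the FieldValue sketch above concrete, a minimal UTF-8 text implementation might look like the following. TextFieldValue is a hypothetical name used only for illustration, and this uses modern Java (StandardCharsets) rather than 2004-era idioms:

```java
import java.nio.charset.StandardCharsets;

// The extensibility interface sketched above: a field value is just bytes.
interface FieldValue {
    byte[] getBytes();
    void setValue(byte[] bytes);
}

// Hypothetical illustration: stores text as UTF-8 bytes and restores the
// String on demand, as suggested for the convenience accessors.
class TextFieldValue implements FieldValue {
    private byte[] bytes;

    public TextFieldValue() {}                      // for filling via setValue()
    public TextFieldValue(String text) {
        bytes = text.getBytes(StandardCharsets.UTF_8);
    }

    public byte[] getBytes() { return bytes; }
    public void setValue(byte[] bytes) { this.bytes = bytes; }
    public String toString() { return new String(bytes, StandardCharsets.UTF_8); }
}
```

A CompressedTextFieldValue would look the same except that getBytes()/setValue() would run the bytes through a compressor; the Field machinery never needs to know the difference.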
> Of course, this would then be a per-value compression and probably not
> as effective as a whole index compression that could be done with the
> other approaches.
But, since documents are accessed randomly, we can't easily do a lot
better for field data.
> Doug, what compression algorithm did you have in mind
> for the actual compression?
I was just thinking gzip. Alternately, one could make it extensible,
and tag each item with the compression algorithm, but I think that gets
to be a mess. Also, it's good to stick to a standard algorithm, so that
perl, c#, C++, etc. ports can easily incorporate the feature.
This feature is primarily intended to make life easier for folks who
want to store whole documents in the index. Selective use of gzip would
be a huge improvement over the present situation. Alternate compression
algorithms might make things a bit better yet, but probably not hugely.
Doug
Re: stored field compression
Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Doug Cutting wrote:
> Doug Cutting wrote:
>
>> A more elaborate approach would be to lazily decompress fields when
>> values are accessed.
>
>
> Another big advantage of this approach (as reminded by Peter
> Cipollone) is that it will make indexing faster, as decompression will
> be avoided when merging.
>
> Doug
A different approach would be to just allow binary data in fields. That
way applications can compress and decompress as they see fit, plus they
would be able to store numerical and other data more efficiently.
Of course, this would then be a per-value compression and probably not
as effective as a whole index compression that could be done with the
other approaches. Doug, what compression algorithm did you have in mind
for the actual compression?
Dmitry.
Re: stored field compression
Posted by Doug Cutting <cu...@apache.org>.
Doug Cutting wrote:
> A more elaborate approach would be to lazily decompress fields when
> values are accessed.
Another big advantage of this approach (as reminded by Peter Cipollone)
is that it will make indexing faster, as decompression will be avoided
when merging.
Doug
stored field compression
Posted by Doug Cutting <cu...@apache.org>.
[ Moved discussion from lucene-user. ]
Ype Kingma wrote:
> One place where compression might be useful is in the stored fields [...]
I agree, and this would not be hard to add.
The simplest approach would be to just add the following to Field.java:
private boolean isCompressed;
public boolean isCompressed() { return isCompressed; }
public void setIsCompressed(boolean isCompressed) {
  this.isCompressed = isCompressed;
}
Perhaps along with additional constructors that permit one to specify
whether a field is compressed, e.g., Field.Text(String name, String
value, boolean isCompressed).
Then just change FieldsWriter and FieldsReader to use a bit in the bits
that are stored with each field to indicate whether the value is
compressed, and, when it is, compress or decompress it accordingly.
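For reference, the compress/decompress step itself could be a simple gzip round trip via java.util.zip. This is a standalone sketch, not Lucene code; the class and method names are made up, and it uses modern Java (StandardCharsets, UncheckedIOException) for brevity:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: gzip a field's string value to bytes and back.
public class GzipFieldSketch {
    public static byte[] compress(String value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            GZIPOutputStream gzip = new GZIPOutputStream(bytes);
            gzip.write(value.getBytes(StandardCharsets.UTF_8));
            gzip.close(); // finishes the stream and writes the gzip trailer
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static String decompress(byte[] data) {
        try {
            GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gzip.read(buf)) != -1) out.write(buf, 0, n);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Repetitive text compresses well; unique short strings may even grow
        // because of the fixed gzip header/trailer overhead.
        StringBuilder doc = new StringBuilder();
        for (int i = 0; i < 100; i++) doc.append("the quick brown fox ");
        byte[] packed = compress(doc.toString());
        System.out.println(doc.length() + " chars -> " + packed.length + " bytes");
    }
}
```

FieldsWriter would store the compress() output when the field's compression bit is set, and FieldsReader would run decompress() (eagerly or lazily, per the discussion below) when reading it back.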
A more elaborate approach would be to lazily decompress fields when
values are accessed. That way, when you only require one field's value,
you don't decompress all of the values. This would require changing
Field.java a bit more, perhaps replacing its stringValue and readerValue
fields with something like:
private Object value;
private class CompressedValue {
  private byte[] data;
  public CompressedValue(byte[] data) { this.data = data; }
  public CompressedValue(String value) { ... code to compress ... }
  public String toString() { ... code to decompress ... }
  public byte[] getData() { return data; }
}
public String stringValue() {
  return value instanceof Reader ? null : value.toString();
}
public Reader readerValue() {
  return value instanceof Reader ? (Reader)value : null;
}
public byte[] compressedValue() {
  return value instanceof CompressedValue
    ? ((CompressedValue)value).getData()
    : null;
}
public void setIsCompressed(boolean isCompressed) {
  if (isCompressed && !this.isCompressed) {
    value = new CompressedValue((String)value);
  } else if (!isCompressed && this.isCompressed) {
    value = stringValue();
  }
  this.isCompressed = isCompressed;
}
// replace the ctor Field(String, String, ...) with the following
public Field(String name, Object value,
             boolean store, boolean index,
             boolean token, boolean vector) {
  ...
  if (value instanceof String) {
    this.value = (String)value;
  } else if (value instanceof byte[]) {
    this.value = new CompressedValue((byte[])value);
  } else {
    throw new IllegalArgumentException(...);
  }
  ...
}
Then change FieldsWriter to write the compressedValue() bytes, when
non-null, and, finally, change FieldsReader to, when a value is
compressed, read the bytes and pass them instead of a String to the ctor.
Anyone want to take this on?
Doug
Re: Does Lucene have a configuration option for storage compression?
Posted by Ype Kingma <yk...@xs4all.nl>.
Alex, Otis,
On Friday 14 May 2004 13:58, Otis Gospodnetic wrote:
> Moving to lucene-user list.
>
> Hello,
>
> Didn't I already answer these questions?
>
> 1. No :(
There is a bit more to say; see below.
...
>
> --- Alex Aw Seat Kiong <al...@bigonthenet.com> wrote:
> > Hi!
> >
> > Some questions about Lucene:
> > 1. Does Lucene have a configuration option for storage compression?
Lucene indexes are quite compact already.
Text in Western languages is normally indexed to about 1/3 of its
original size; I don't know about CJK.
You can have a look at the file formats on the Lucene web site to see how
the compression is done. Among other things, there are common prefixes for
the sorted terms, variable-length integers, and storing differences between
integers instead of the complete numbers when possible.
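As an illustration of the variable-length integer idea, here is a simplified standalone sketch: small values take one byte, and the high bit of each byte marks whether more bytes follow. The class name is made up, and this is not Lucene's actual file-format code (which writes to its own stream abstractions); see the file-format documentation for the real encoding:

```java
import java.io.ByteArrayOutputStream;

// Simplified sketch of variable-length integer ("VInt") encoding: each byte
// carries 7 bits of the value, low-order group first; the high bit of a byte
// is a continuation flag.
public class VIntSketch {
    public static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {          // more than 7 bits remain
            out.write((value & 0x7F) | 0x80);   // emit 7 bits, set continuation
            value >>>= 7;
        }
        out.write(value);                        // final byte, high bit clear
        return out.toByteArray();
    }

    public static int decode(byte[] bytes) {
        int value = 0, shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;        // accumulate 7-bit groups
            shift += 7;
            if ((b & 0x80) == 0) break;          // continuation flag clear: done
        }
        return value;
    }
}
```

For the small document numbers and deltas that dominate an index, most values fit in one or two bytes instead of a fixed four, which is where much of the compactness comes from.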
One place where compression might be useful is in the stored fields, but
there is no API for it in Lucene.
Kind regards,
Ype
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: Does Lucene have a configuration option for storage compression?
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Moving to lucene-user list.
Hello,
Didn't I already answer these questions?
1. No :(
2. Use the POI API (jakarta.apache.org/poi).
3. IndexReader can provide at least some of your numbers. I suggest
you look at the Javadocs for IndexReader, which are available on Lucene's
site.
Otis
--- Alex Aw Seat Kiong <al...@bigonthenet.com> wrote:
> Hi!
>
> Some questions about Lucene:
> 1. Does Lucene have a configuration option for storage compression?
>
> 2. Is there any pure Java code for Excel and PowerPoint parsing that
> can be used to let Lucene index Excel and PowerPoint documents?
>
> 3. How can I get the information below? Is there an API for it?
>
> - the total number of documents indexed.
> - the total index size on storage.
> - the date the index was last updated.
> - the total number of documents deleted.
>
>
>
> Thanks.
>
>
> Regards,
> AlexAw
Re: ClassCastException MultiReader
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Sure. 'Bugzilla it', please.
Otis
P.S.
That line 274 should be line 273 in the CVS HEAD as of now.
--- Rasik Pandey <ra...@ajlsm.com> wrote:
> Howdy,
>
> This exception was thrown with 1.4rc3. Do you need a test case for
> this one?
>
> java.lang.ClassCastException
> at
> org.apache.lucene.index.MultiTermEnum.<init>(MultiReader.java:274)
> at
> org.apache.lucene.index.MultiReader.terms(MultiReader.java:187)
>
>
> Regards,
> RBP
>
>
>
ClassCastException MultiReader
Posted by Rasik Pandey <ra...@ajlsm.com>.
Howdy,
This exception was thrown with 1.4rc3. Do you need a test case for this one?
java.lang.ClassCastException
at org.apache.lucene.index.MultiTermEnum.<init>(MultiReader.java:274)
at org.apache.lucene.index.MultiReader.terms(MultiReader.java:187)
Regards,
RBP