Posted to java-user@lucene.apache.org by Andreas Sewe <an...@codetrails.com> on 2016/03/15 16:35:25 UTC

Canonicalize stored fields (small set of possible values)

Hi,

I have an index in which each document has an indexed & stored "kind"
StringField, which has a small set of possible values (about 10).

Alas, Lucene (5.2.1) stores these field values over and over again,
which seems wasteful. Is there a way to avoid this, while still having
the fields' value available as stored?

Best wishes,

Andreas

-- 
Codetrails GmbH
The knowledge transfer company

Robert-Bosch-Str. 7, 64293 Darmstadt
Phone: +49-6151-276-7092
Mobile: +49-170-811-3791
http://www.codetrails.com/

Managing Director: Dr. Marcel Bruch
Handelsregister: Darmstadt HRB 91940

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Canonicalize stored fields (small set of possible values)

Posted by Adrien Grand <jp...@gmail.com>.
On Tue, Mar 15, 2016 at 17:33, Andreas Sewe <an...@codetrails.com> wrote:

> I am afraid I don't understand. Do you suggest using IntFields as ID
> instead of StringFields, as they are presumably stored more efficiently?
>

Exactly. Integers are stored using zig-zag encoding followed by
variable-byte encoding. So numbers between -64 and 63 use 1 byte, numbers
between -8192 and 8191 use 2 bytes, etc.
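
The byte counts above can be checked with a small stand-alone sketch. It
does not call Lucene; the class and method names (ZigZagVarintSketch,
zigZag, varintLength, encodedLength) are made up for illustration, and the
encoding is the textbook zig-zag plus 7-bits-per-byte scheme, which is
assumed to match what is described here.

```java
final class ZigZagVarintSketch {

    // Zig-zag maps signed ints to unsigned ones so that values close to
    // zero stay small: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static int zigZag(int value) {
        return (value << 1) ^ (value >> 31);
    }

    // Number of bytes a variable-byte encoding needs when every byte
    // carries 7 payload bits plus 1 continuation bit.
    static int varintLength(int unsigned) {
        int length = 1;
        while ((unsigned & ~0x7F) != 0) {
            unsigned >>>= 7;
            length++;
        }
        return length;
    }

    // Bytes needed for a signed int under zig-zag + variable-byte encoding.
    static int encodedLength(int value) {
        return varintLength(zigZag(value));
    }

    public static void main(String[] args) {
        for (int v : new int[] {-64, 63, 64, -8192, 8191, 8192}) {
            System.out.println(v + " -> " + encodedLength(v) + " byte(s)");
        }
    }
}
```

With about 10 possible values, any id between 0 and 9 falls in the
1-byte range.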


> > Otherwise, even without doing anything, things
> > should not be too bad thanks to stored fields compression.
>
> AFAICT, the fields are not compressed on disk right now. At least, "grep
> -c" finds my field over and over in the index files.
>
> So, how do I enable stored fields compression? Googling turned up
> Store.COMPRESS, but that doesn't exist in 5.2.1.
>

Compression is on by default, but we split the stored fields file into
blocks of 16KB and compress each block individually. So each 16KB block
still needs to store each value at least once before the compression
algorithm can make references to it.

If you want to enable stronger compression, you can do
`indexWriterConfig.setCodec(new Lucene54Codec(Mode.BEST_COMPRESSION))`
which will use DEFLATE instead of LZ4 to compress blocks. In addition to
removing duplicates like LZ4, DEFLATE also applies Huffman coding, so
you should see better compression if your field values use some
symbols much more frequently than others.
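
To get a feeling for what per-block compression does to a field with only
a handful of distinct values, here is a rough stand-alone sketch. It is
not Lucene's stored fields writer: it uses the JDK's java.util.zip.Deflater
as a stand-in for DEFLATE, the 1-byte length prefix is an invented layout,
and all names (BlockCompressionSketch, sampleValues, compressedSize) are
hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

final class BlockCompressionSketch {

    // Block size taken from the explanation above.
    static final int BLOCK_SIZE = 16 * 1024;

    // Builds a byte stream of `count` values picked pseudo-randomly from
    // `kinds`, each preceded by a 1-byte length (sketch-only layout).
    static byte[] sampleValues(String[] kinds, int count, long seed) {
        Random random = new Random(seed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < count; i++) {
            byte[] value = kinds[random.nextInt(kinds.length)]
                    .getBytes(StandardCharsets.UTF_8);
            out.write(value.length);
            out.write(value, 0, value.length);
        }
        return out.toByteArray();
    }

    // Compresses `data` block by block, each block with a fresh Deflater,
    // and returns the total number of compressed bytes.
    static int compressedSize(byte[] data, int blockSize) {
        int total = 0;
        byte[] buffer = new byte[blockSize];
        for (int offset = 0; offset < data.length; offset += blockSize) {
            int length = Math.min(blockSize, data.length - offset);
            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            deflater.setInput(data, offset, length);
            deflater.finish();
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            deflater.end();
        }
        return total;
    }

    public static void main(String[] args) {
        String[] kinds = {"class", "interface", "enum", "method", "field",
                "constructor", "annotation", "package", "variable", "type"};
        byte[] raw = sampleValues(kinds, 100_000, 42L);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + compressedSize(raw, BLOCK_SIZE));
    }
}
```

The values still appear verbatim at least once per block, which is
consistent with grep finding them in the index files even though
compression is active.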

Re: Canonicalize stored fields (small set of possible values)

Posted by Andreas Sewe <an...@codetrails.com>.
Hi Erick, hi Adrien,

thank you for both of your replies.

>> In a word, "no". When you set stored="true", Solr (well
>> actually Lucene) puts a compressed verbatim copy on disk.

I am using Lucene directly, not Solr, so I had the hope of being able to
configure the "store behavior" at this lower level.

(From the Javadoc it looks like SortedDocValuesField has this sharing
behavior, but AFAIK DocValues are not retrievable as StoredFields are.)

>> "Disk space is cheap" is the usual response here 

If we were talking about a server application here, I would agree. But
our application is a search plugin for the Eclipse IDE, and for some
reason developers are still surprisingly sensitive about their workspaces
consuming a few hundred megabytes of disk space.

> You can still give an id to each value on the application side if you want
> to avoid repeating values.

I am afraid I don't understand. Do you suggest using IntFields as ID
instead of StringFields, as they are presumably stored more efficiently?

> Otherwise, even without doing anything, things
> should not be too bad thanks to stored fields compression.

AFAICT, the fields are not compressed on disk right now. At least, "grep
-c" finds my field over and over in the index files.

So, how do I enable stored fields compression? Googling turned up
Store.COMPRESS, but that doesn't exist in 5.2.1.

Best wishes,

Andreas



Re: Canonicalize stored fields (small set of possible values)

Posted by Adrien Grand <jp...@gmail.com>.
You can still give an id to each value on the application side if you want
to avoid repeating values. Otherwise, even without doing anything, things
should not be too bad thanks to stored fields compression.
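
A minimal sketch of the application-side id idea, in plain Java without
any Lucene calls. The class name KindTable and its methods are
hypothetical; the assumption is that the application knows the full set
of "kind" values up front and keeps their order stable across versions,
since the stored ids are only meaningful relative to this table.

```java
import java.util.HashMap;
import java.util.Map;

final class KindTable {

    private final String[] idToKind;
    private final Map<String, Integer> kindToId = new HashMap<>();

    // The position of each value in `kinds` becomes its id, so the order
    // must never change once an index has been written with it.
    KindTable(String... kinds) {
        this.idToKind = kinds.clone();
        for (int id = 0; id < kinds.length; id++) {
            if (kindToId.put(kinds[id], id) != null) {
                throw new IllegalArgumentException("duplicate kind: " + kinds[id]);
            }
        }
    }

    // Id to store in the document instead of the string value.
    int idOf(String kind) {
        Integer id = kindToId.get(kind);
        if (id == null) {
            throw new IllegalArgumentException("unknown kind: " + kind);
        }
        return id;
    }

    // Translates a stored id back to the original string at read time.
    String kindOf(int id) {
        return idToKind[id];
    }

    public static void main(String[] args) {
        KindTable table = new KindTable("class", "interface", "enum");
        int id = table.idOf("interface");
        System.out.println(id + " -> " + table.kindOf(id));
    }
}
```

On the Lucene side, the id would presumably go into a stored-only numeric
field (for example `new StoredField("kind", id)`), while the string
itself stays indexed but not stored for searching; treat that wiring as
an assumption rather than a tested recipe.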

On Tue, Mar 15, 2016 at 16:56, Erick Erickson <er...@gmail.com> wrote:

> In a word, "no". When you set stored="true", Solr (well
> actually Lucene) puts a compressed verbatim copy on disk.
>
> "Disk space is cheap" is the usual response here ;)
>
> Best,
> Erick
>
> On Tue, Mar 15, 2016 at 8:35 AM, Andreas Sewe
> <an...@codetrails.com> wrote:
> > Hi,
> >
> > I have an index in which each document has an indexed & stored "kind"
> > StringField, which has a small set of possible values (about 10).
> >
> > Alas, Lucene (5.2.1) stores these field values over and over again,
> > which seems wasteful. Is there a way to avoid this, while still having
> > the fields' value available as stored?
> >
> > Best wishes,
> >
> > Andreas

Re: Canonicalize stored fields (small set of possible values)

Posted by Erick Erickson <er...@gmail.com>.
In a word, "no". When you set stored="true", Solr (well
actually Lucene) puts a compressed verbatim copy on disk.

"Disk space is cheap" is the usual response here ;)

Best,
Erick

On Tue, Mar 15, 2016 at 8:35 AM, Andreas Sewe
<an...@codetrails.com> wrote:
> Hi,
>
> I have an index in which each document has an indexed & stored "kind"
> StringField, which has a small set of possible values (about 10).
>
> Alas, Lucene (5.2.1) stores these field values over and over again,
> which seems wasteful. Is there a way to avoid this, while still having
> the fields' value available as stored?
>
> Best wishes,
>
> Andreas
