Posted to java-user@lucene.apache.org by Olivier Binda <ol...@wanadoo.fr> on 2015/08/08 16:19:36 UTC

Compressing docValues with variable length bytes[] by block of 16k ?

Greetings

Are there any plans to implement compression of the variable-length
byte[] binary doc values,
say in blocks of 16k, as is done for stored fields?

My .cfs file shrinks from 2 MB to about 400k when I zip it.
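
To gauge how much block-level compression might help, the idea can be sketched outside Lucene. The snippet below is only an illustration using java.util.zip.Deflater from the JDK; the class name, the sample values and the 16 KB block size are assumptions, and it does not reproduce how Lucene actually encodes stored fields or doc values. It compares compressing each value on its own against compressing the concatenated values in 16 KB blocks.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Hypothetical helper, not part of Lucene.
final class BlockCompressionSketch {
    // 16 KB, the block size mentioned in the question (an assumption here).
    static final int BLOCK_SIZE = 16 * 1024;

    /** Returns the DEFLATE-compressed size of data[off .. off+len). */
    static int deflatedSize(byte[] data, int off, int len) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(data, off, len);
            deflater.finish();
            byte[] buf = new byte[4096];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buf);
            }
            return total;
        } finally {
            deflater.end(); // release native resources
        }
    }

    /** Total size when every value is compressed on its own. */
    static int perValueSize(List<byte[]> values) {
        int total = 0;
        for (byte[] v : values) {
            total += deflatedSize(v, 0, v.length);
        }
        return total;
    }

    /** Total size when values are concatenated and compressed in 16 KB blocks. */
    static int perBlockSize(List<byte[]> values) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (byte[] v : values) {
            all.write(v, 0, v.length);
        }
        byte[] data = all.toByteArray();
        int total = 0;
        for (int off = 0; off < data.length; off += BLOCK_SIZE) {
            total += deflatedSize(data, off, Math.min(BLOCK_SIZE, data.length - off));
        }
        return total;
    }

    /** Made-up values that are unique per document but share fragments. */
    static List<byte[]> sampleValues(int count) {
        List<byte[]> values = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            String s = "Expression:" + i + ";Reading:" + (i * 31);
            values.add(s.getBytes(StandardCharsets.UTF_8));
        }
        return values;
    }

    public static void main(String[] args) {
        List<byte[]> values = sampleValues(2000);
        System.out.println("per value: " + perValueSize(values) + " bytes");
        System.out.println("per block: " + perBlockSize(values) + " bytes");
    }
}
```

With many short values that share fragments across documents, perBlockSize(...) should come out well below perValueSize(...), because each block can reuse matches from neighbouring values. The trade-off is that reading one value means inflating the block that contains it.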

Best regards,
Olivier



On 08/08/2015 02:32 PM, jamie wrote:
> Greetings
>
> Our app primarily uses Lucene for its intended purpose i.e. to search 
> across large amounts of unstructured text. However, recently our 
> requirement expanded to perform look-ups on specific documents in the 
> index based on associated custom defined unique keys. For our 
> purposes, a unique key is the string representation of a 128 bit 
> murmur hash, stored in a Lucene field named uid.  We are currently 
> using the TermsFilter to lookup Documents in the Lucene index as follows:
>
> List<Term> terms = new LinkedList<>();
> for (String id : ids) {
>     terms.add(new Term("uid", id));
> }
> TermsFilter idFilter = new TermsFilter(terms);
> ... search logic...
>
> At any time we may need to lookup say a couple of thousand documents. 
> Our problem is one of performance. On very large indexes with 30 
> million records or more, the lookup can be excruciatingly slow. At 
> this stage, it's not practical for us to move the data over to a 
> fit-for-purpose database, nor to change the uid field to a numeric type. I fully 
> appreciate the fact that Lucene is not designed to be a database, 
> however, is there anything we can do to improve the performance of 
> these look-ups?
>
> Much appreciated
>
> Jamie
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Compressing docValues with variable length bytes[] by block of 16k ?

Posted by Olivier Binda <ol...@wanadoo.fr>.

On 08/09/2015 06:29 PM, Uwe Schindler wrote:
> Hi,
>
>> My values are unique, so there are as many distinct values as documents.
>> They vary in size, from at least 10 bytes to much bigger (say 4 KB).
>>
>> I don't share, index or sort them.
>> I don't do grouping/faceting either
>>
>>
>> I only want to store, retrieve and traverse those values
> Then use stored fields.
ok, I'll try.

Thanks !

>
> Uwe

RE: Compressing docValues with variable length bytes[] by block of 16k ?

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,

> My values are unique, so there are as many distinct values as documents.
> They vary in size, from at least 10 bytes to much bigger (say 4 KB).
> 
> I don't share, index or sort them.
> I don't do grouping/faceting either
> 
> 
> I only want to store, retrieve and traverse those values

Then use stored fields.

Uwe



Re: Compressing docValues with variable length bytes[] by block of 16k ?

Posted by Olivier Binda <ol...@wanadoo.fr>.

On 08/09/2015 04:55 PM, Arjen van der Meijden wrote:
>
> On 9-8-2015 16:22, Toke Eskildsen wrote:
>> Robert Muir <rc...@gmail.com> wrote:
>>> I am tired of repeating this:
>>> Don't use BINARY docvalues
>>> Don't use BINARY docvalues
>>> Don't use BINARY docvalues
>>> Use types like SORTED/SORTED_SET which will compress the term
>>> dictionary and make use of ordinals in your application instead.
>> This seems contrary to
>> http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/document/BinaryDocValuesField.html
>>
>> Maybe you could update the JavaDoc for that field to warn against using it?
> It (probably) depends on the contents of the values. If the number of
> distinct values is roughly equal to the number of documents, the javadoc
> suggests that binary docvalues are a valid choice.
My values are unique, so there are as many distinct values as documents.
They vary in size, from at least 10 bytes to much bigger (say 4 KB).

I don't share, index or sort them.
I don't do grouping/faceting either


I only want to store, retrieve and traverse those values
>
> That's this part:
> "The values are stored directly with no sharing, which is a good fit
> when the fields don't share (many) values, such as a title field."
>
> If there are (much) fewer distinct values than documents, Robert's reply
> and the documentation suggest the same:
> " If values may be shared and sorted it's better to use
> SortedDocValuesField."
>
> So as soon as compression of smallish values starts making sense due to
> repetition amongst documents, it may be time to move away from the
> BinaryDocValuesField towards another variant.
>
> If only parts of the values are repeated (for instance something like
> e-mail addresses where many will end with 'gmail.com' and 'outlook.com')
> it becomes more complicated.

At the moment, there are some repeated parts within a value, but many
more repeated across docIds, such as "Expression" and "Reading".

Also, I'm stuck on Lucene 4.7.0 (or 4.7.2) because, starting with
version 4.8, Lucene uses try-with-resources, which isn't supported on
Android before Android 4.4.


    SortedDocValuesField stores a per-document BytesRef
    <http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/util/BytesRef.html>
    value, indexed for sorting.

    If you also need to store the value, you should add a separate
    StoredField
    <http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/document/StoredField.html>
    instance.


I actually went with binary doc values because I thought that
doc values were far more efficient than the pre-4.0 fields for storing data
(for example, only one seek/read with mmap), especially for traversal.

In my app, I traverse all binary doc values in increasing docId order,
deserialize each value (lightning fast with FlatBuffers: complex objects
with no object creation) and compute some stats.

Would I be able to do that as efficiently with a StoredField?


Apparently, only stored fields are compressed:


    CompressingStoredFieldsFormat



Maybe I should use that (and ditch the useless doc value, or make it
store a BytesRef) to get compression?

Many thanks for all the insights, :)
Olivier

> Best regards,
>
> Arjen
>

Re: Compressing docValues with variable length bytes[] by block of 16k ?

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.

Arjen van der Meijden <ac...@tweakers.net> wrote:
> On 9-8-2015 16:22, Toke Eskildsen wrote:
> > Maybe you could update the JavaDoc for that field to warn against using it?
> It (probably) depends on the contents of the values.

That was my impression too, but we both seem to be second-guessing Robert's very non-nuanced and clearly oft-repeated recommendation. I hope Robert can shed some light on this and tell us if he finds the JavaDocs to be in order or if binary DocValues should not be used at all.

- Toke Eskildsen


Re: Compressing docValues with variable length bytes[] by block of 16k ?

Posted by Arjen van der Meijden <ac...@tweakers.net>.


On 9-8-2015 16:22, Toke Eskildsen wrote:
> Robert Muir <rc...@gmail.com> wrote:
>> I am tired of repeating this:
>> Don't use BINARY docvalues
>> Don't use BINARY docvalues
>> Don't use BINARY docvalues
>> Use types like SORTED/SORTED_SET which will compress the term
>> dictionary and make use of ordinals in your application instead.
> This seems contrary to
> http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/document/BinaryDocValuesField.html
>
> Maybe you could update the JavaDoc for that field to warn against using it?
It (probably) depends on the contents of the values. If the number of
distinct values is roughly equal to the number of documents, the javadoc
suggests that binary docvalues are a valid choice.

That's this part:
"The values are stored directly with no sharing, which is a good fit
when the fields don't share (many) values, such as a title field."

If there are (much) fewer distinct values than documents, Robert's reply
and the documentation suggest the same:
" If values may be shared and sorted it's better to use
SortedDocValuesField."

So as soon as compression of smallish values starts making sense due to
repetition amongst documents, it may be time to move away from the
BinaryDocValuesField towards another variant.

If only parts of the values are repeated (for instance something like 
e-mail addresses where many will end with 'gmail.com' and 'outlook.com')
it becomes more complicated.

Best regards,

Arjen


Re: Compressing docValues with variable length bytes[] by block of 16k ?

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.

Robert Muir <rc...@gmail.com> wrote:
> I am tired of repeating this:
> Don't use BINARY docvalues
> Don't use BINARY docvalues
> Don't use BINARY docvalues

> Use types like SORTED/SORTED_SET which will compress the term
> dictionary and make use of ordinals in your application instead.

This seems contrary to
http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/document/BinaryDocValuesField.html

Maybe you could update the JavaDoc for that field to warn against using it?

- Toke Eskildsen


Re: Compressing docValues with variable length bytes[] by block of 16k ?

Posted by Robert Muir <rc...@gmail.com>.

That makes no sense at all, it would make it slow as shit.

I am tired of repeating this:
Don't use BINARY docvalues
Don't use BINARY docvalues
Don't use BINARY docvalues

Use types like SORTED/SORTED_SET which will compress the term
dictionary and make use of ordinals in your application instead.
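
As a rough illustration of the ordinal idea, here is a minimal sketch in plain Java. It is not Lucene's implementation: the class name and fields are hypothetical, and Java's String ordering stands in for Lucene's byte-wise term ordering. Each distinct value is kept once in a sorted dictionary, and every document holds only a small integer ordinal pointing into it.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical illustration of a sorted dictionary plus per-document ordinals.
final class OrdinalDictionarySketch {
    final String[] dictionary; // sorted distinct values, each stored once
    final int[] ordinals;      // ordinals[docId] = index into dictionary

    OrdinalDictionarySketch(String[] perDocValues) {
        // Deduplicate and sort the values.
        TreeSet<String> distinct = new TreeSet<>(Arrays.asList(perDocValues));
        dictionary = distinct.toArray(new String[0]);

        // Map every document to the ordinal of its value.
        ordinals = new int[perDocValues.length];
        for (int doc = 0; doc < perDocValues.length; doc++) {
            ordinals[doc] = Arrays.binarySearch(dictionary, perDocValues[doc]);
        }
    }

    /** Looks up the value of a document through its ordinal. */
    String value(int docId) {
        return dictionary[ordinals[docId]];
    }

    public static void main(String[] args) {
        OrdinalDictionarySketch sketch = new OrdinalDictionarySketch(
                new String[] {"gmail.com", "outlook.com", "gmail.com", "gmail.com"});
        System.out.println(sketch.dictionary.length + " distinct values for "
                + sketch.ordinals.length + " documents");
    }
}
```

The saving comes entirely from sharing: when every document has its own unique value, the dictionary holds as many entries as there are documents and the ordinals add nothing but an indirection.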



On Sat, Aug 8, 2015 at 10:19 AM, Olivier Binda <ol...@wanadoo.fr> wrote:
> Greetings
>
> are there any plans to implement compression of the variable length byte[]
> binary doc Values,
> say in blocks of 16k like for stored values ?
>
> my .cfs file goes from 2MB to like 400k when I zip it
>
> Best regards,
> Olivier
>
>
>
> On 08/08/2015 02:32 PM, jamie wrote:
>>
>> Greetings
>>
>> Our app primarily uses Lucene for its intended purpose i.e. to search
>> across large amounts of unstructured text. However, recently our requirement
>> expanded to perform look-ups on specific documents in the index based on
>> associated custom defined unique keys. For our purposes, a unique key is the
>> string representation of a 128 bit murmur hash, stored in a Lucene field
>> named uid.  We are currently using the TermsFilter to lookup Documents in
>> the Lucene index as follows:
>>
>> List<Term> terms = new LinkedList<>();
>> for (String id : ids) {
>>     terms.add(new Term("uid", id));
>> }
>> TermsFilter idFilter = new TermsFilter(terms);
>> ... search logic...
>>
>> At any time we may need to lookup say a couple of thousand documents. Our
>> problem is one of performance. On very large indexes with 30 million records
>> or more, the lookup can be excruciatingly slow. At this stage, it's not
>> practical for us to move the data over to a fit-for-purpose database, nor to
>> change the uid field to a numeric type. I fully appreciate the fact that
>> Lucene is not designed to be a database, however, is there anything we can
>> do to improve the performance of these look-ups?
>>
>> Much appreciated
>>
>> Jamie