Posted to issues@lucene.apache.org by "Michael McCandless (Jira)" <ji...@apache.org> on 2021/03/15 23:24:00 UTC

[jira] [Commented] (LUCENE-9843) Remove compression option on doc values

    [ https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302094#comment-17302094 ] 

Michael McCandless commented on LUCENE-9843:
--------------------------------------------

I agree having options on {{Codec}} implementations adds frustrating code complexity!

But the compression vs speed option is an especially tricky one since it is so brutally use-case dependent.

Some users want the smallest possible indices and do not care so much about query performance.  Others are willing to have larger indices if querying can go even a wee bit faster.  Our usage (Amazon's customer-facing product search) is in the latter category: when we first upgraded to Lucene 8.5.1, which enabled compression for all {{BINARY}} fields with no option to disable it, it was a big (~30%) hit to red-line QPS in our internal (single production shard) benchmarks.

We proceeded with upgrading, but forked the default {{Codec}} to fall back to the pre-8.5 implementation for doc values as a short-term measure, and then iterated (LUCENE-9378 and [https://github.com/apache/lucene-solr/pull/1543] and [https://github.com/apache/lucene-solr/pull/2069] – thank you [~jpountz]!) to add the option for compression.  But we would really rather not live with a long-term fork of the default {{Codec}}...

At least two other users/use-cases also saw a negative impact on their Lucene-based apps due to {{BINARY}} compression: the vectors extension in Elasticsearch, and Twitter.

We also learned, surprisingly, that compression was GOOD for {{luceneutil}} faceting tasks, perhaps because those tasks compute facets on all documents ({{MatchAllDocsQuery}}) and so load {{byte[]}} for every document in the index.  That is the best case for compression: the decompression cost is "maximally amortized", and fewer bytes are loaded from the index.  We (Amazon hat) have since enabled compression for Lucene faceting in our usage as well, since it was neutral within noise on search metrics yet reduced the index size.
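To make the "maximally amortized" point concrete, here is a rough sketch of the effect, not Lucene's actual doc values format.  It assumes values are compressed in fixed blocks of 32 documents (the class name, block size, and fixed-length values are all assumptions for illustration) and uses {{java.util.zip.Deflater}} purely as a stand-in compressor.  It counts how many block decompressions are paid per document visited, for a scan over every document versus a sparse set of hits.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Toy model of block-compressed per-document byte[] values. NOT Lucene's format. */
class BlockCompressedValues {
    static final int DOCS_PER_BLOCK = 32; // assumed block size, for illustration only

    private final byte[][] compressedBlocks;
    private final int[] rawBlockLengths;
    private final int valueLength;        // fixed-length values keep the sketch short
    private int cachedBlock = -1;         // index of the most recently decompressed block
    private byte[] cachedBytes;
    int decompressions = 0;               // how many blocks we had to decompress

    BlockCompressedValues(byte[][] values) {
        valueLength = values[0].length;
        int blocks = (values.length + DOCS_PER_BLOCK - 1) / DOCS_PER_BLOCK;
        compressedBlocks = new byte[blocks][];
        rawBlockLengths = new int[blocks];
        for (int b = 0; b < blocks; b++) {
            int start = b * DOCS_PER_BLOCK;
            int end = Math.min(values.length, start + DOCS_PER_BLOCK);
            byte[] raw = new byte[(end - start) * valueLength];
            for (int d = start; d < end; d++) {
                System.arraycopy(values[d], 0, raw, (d - start) * valueLength, valueLength);
            }
            rawBlockLengths[b] = raw.length;
            compressedBlocks[b] = deflate(raw);
        }
    }

    /** Returns the value for docId, decompressing its whole block unless it is already cached. */
    byte[] get(int docId) {
        int block = docId / DOCS_PER_BLOCK;
        if (block != cachedBlock) {
            cachedBytes = inflate(compressedBlocks[block], rawBlockLengths[block]);
            cachedBlock = block;
            decompressions++;
        }
        int offset = (docId % DOCS_PER_BLOCK) * valueLength;
        return Arrays.copyOfRange(cachedBytes, offset, offset + valueLength);
    }

    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] compressed, int rawLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] raw = new byte[rawLength];
        try {
            int n = 0;
            while (n < rawLength && !inflater.finished()) {
                n += inflater.inflate(raw, n, rawLength - n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return raw;
    }

    /** Visits every step-th document in order; returns block decompressions per visited document. */
    static double decompressionsPerDoc(int numDocs, int step) {
        byte[][] values = new byte[numDocs][];
        for (int d = 0; d < numDocs; d++) {
            // Redundant, fixed-length (15 byte) values, loosely like facet labels.
            values[d] = String.format("category-%06d", d % 7).getBytes(StandardCharsets.US_ASCII);
        }
        BlockCompressedValues dv = new BlockCompressedValues(values);
        int visited = 0;
        for (int d = 0; d < numDocs; d += step) {
            dv.get(d);
            visited++;
        }
        return (double) dv.decompressions / visited;
    }

    public static void main(String[] args) {
        System.out.println("all docs:        " + decompressionsPerDoc(1024, 1));
        System.out.println("every 32nd doc:  " + decompressionsPerDoc(1024, 32));
    }
}
```

In this toy model a scan over all documents pays one block decompression per 32 documents, while a query whose hits land in different blocks pays a full block decompression for every single hit.  That is at least consistent with faceting over {{MatchAllDocsQuery}} benefiting while more selective workloads regress, though the real cost profile depends on the actual codec and access pattern.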

So we are using compression for some {{BINARY}} doc values fields, but turning it off for other fields.  Having the choice is helpful and impactful, at least for us.

I think especially for {{BINARY}} doc values the use-cases can be even more diverse, since it is more of a catch-all doc values type, where applications can encode interesting things into {{byte[]}}.

If we really must take away the "speed versus compression" choice then I think we should also remove the compression, i.e. we should not try to compress {{BINARY}} fields?  Or, would it make the code simpler if we just made it another {{DocValuesType}} e.g. {{BINARY}} and {{BINARY_COMPRESSED}} or something?

NOTE: LUCENE-9211 is where we first added the {{BINARY}} compression.

I agree testing is also harder because of this option.  Maybe we could improve the test infra, e.g. the silly tool ({{TestBackwardsCompatibility}} itself I think) that generates older indices for testing, to do a better job toggling between {{SPEED}} and {{COMPRESSED}} when it generates test indices?

> Remove compression option on doc values
> ---------------------------------------
>
>                 Key: LUCENE-9843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9843
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Options on file formats add complexity and put a big tax on backward-compatibility testing. I'm the one who introduced it in LUCENE-9378 but I would now like to think about what we can do to remove this option.
> For the record, compression was initially introduced because some binary fields have so much redundancy that it's wasteful not to compress them at all. But unfortunately, this slowed down some search workloads and we decided to introduce this option as a way to let users choose the trade-off they want.


