Posted to issues@lucene.apache.org by "maosuhan (via GitHub)" <gi...@apache.org> on 2023/02/09 06:36:25 UTC

[GitHub] [lucene] maosuhan opened a new issue, #12137: Add compression feature for DocValues format in new Codec

maosuhan opened a new issue, #12137:
URL: https://github.com/apache/lucene/issues/12137

   ### Description
   
   We use ES as an OLAP engine in advertising scenarios, where each advertiser queries only his own data. We usually use advertiser_id as the routing and index-sorting key, so the read density in docvalues is very high. We leverage Lucene's postings index structure to speed up the query, and the performance meets our expectations.
   
   The most complained-about aspect of ES/Lucene is that its disk usage is much higher than ClickHouse/Doris; in our case, Lucene's storage can be 3-4x bigger.
   
   The reason ClickHouse/Doris use less space is that they both compress data in blocks and decompress only the needed blocks on read. Since the read density is high, the performance remains acceptable.
   
   We also implemented zstd/lz4 compression for Lucene docvalues; below is the storage improvement:
   name | total size | docvalue size | docvalue compression ratio
   -- | -- | -- | --
   no compression | 485.8g | 394.6g | 100%
   lz4 | 272g | 255.1g | 64.65%
   zstd | 246.5g | 229.5g | 58.16%
   
   All the docvalues are numeric, and we compress the data in blocks of 4096 values.
   
   We also ran a high-QPS (4000) load test using our online query set; both p50 and p99 latency decreased by 20% to 30%. We were surprised by this improvement, and we suspect the high read density is the key to the result.
   
   The disadvantage of compression is that it hurts random-read performance a lot, because a full block of data must be read and decompressed even when only one value is needed.
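   The block scheme described above (compress a block of 4096 numeric values as one unit; decompress the whole block to read even a single value) can be sketched as follows. This is a minimal illustration, not the actual implementation: it uses the JDK's Deflater/Inflater as a stand-in for zstd/lz4, whose Java bindings (e.g. zstd-jni, lz4-java) are third-party libraries.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class BlockCompressDemo {
    static final int BLOCK_SIZE = 4096; // values per block, as in the experiment above

    // Serialize a block of longs and compress it as one unit.
    static byte[] compressBlock(long[] values) throws Exception {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES);
        for (long v : values) buf.putLong(v);
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(buf.array());
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] tmp = new byte[8192];
        while (!deflater.finished()) {
            out.write(tmp, 0, deflater.deflate(tmp));
        }
        deflater.end();
        return out.toByteArray();
    }

    // The random-read penalty: fetching even one value requires
    // decompressing the entire block first.
    static long[] decompressBlock(byte[] compressed, int count) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] raw = new byte[count * Long.BYTES];
        int off = 0;
        while (off < raw.length) {
            off += inflater.inflate(raw, off, raw.length - off);
        }
        inflater.end();
        ByteBuffer buf = ByteBuffer.wrap(raw);
        long[] values = new long[count];
        for (int i = 0; i < count; i++) values[i] = buf.getLong();
        return values;
    }

    public static void main(String[] args) throws Exception {
        long[] block = new long[BLOCK_SIZE];
        // clustered values (the high-read-density case) compress well
        for (int i = 0; i < BLOCK_SIZE; i++) block[i] = 1_000_000L + i;
        byte[] compressed = compressBlock(block);
        long[] restored = decompressBlock(compressed, BLOCK_SIZE);
        System.out.println("raw bytes: " + BLOCK_SIZE * Long.BYTES);
        System.out.println("compressed bytes: " + compressed.length);
        System.out.println("roundtrip ok: " + (restored[123] == block[123]));
    }
}
```

   When reads cluster inside few blocks (high read density), the per-block decompression cost is amortized over many values, which would explain the latency improvement reported above.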
   
   I suggest we create a new codec for compression. The purpose of this codec is to reduce storage usage while providing adequate, balanced read performance in OLAP cases.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on issue #12137: Add compression feature for DocValues format in new Codec

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on issue #12137:
URL: https://github.com/apache/lucene/issues/12137#issuecomment-1424104929

   It doesn't make sense to compress integers with algorithms like these. We can use a better integer compression algorithm (e.g. PFOR) instead.
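   For context, the core idea behind FOR/PFOR can be sketched as below. This is a simplified hand-rolled illustration, not Lucene's actual `ForUtil`/`PForUtil` (which are generated, SIMD-friendly, and handle exception patching): store the block minimum once, then each value as `(value - min)` bit-packed at the block's maximum delta width. PFOR additionally stores rare large deltas as separate "patches" so outliers don't inflate the bit width of the whole block. Note that, unlike general-purpose block compression, bit-packed values remain randomly addressable.

```java
import java.util.Arrays;

public class ForDemo {
    // Simplified Frame of Reference (FOR) encoding; assumes bit width < 64.

    static int bitsRequired(long maxDelta) {
        return maxDelta == 0 ? 1 : 64 - Long.numberOfLeadingZeros(maxDelta);
    }

    // Bit-pack (value - min) for every value at a fixed width of `bits`.
    static long[] pack(long[] values, long min, int bits) {
        long[] packed = new long[(values.length * bits + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            long delta = values[i] - min;
            int bitPos = i * bits, word = bitPos >>> 6, shift = bitPos & 63;
            packed[word] |= delta << shift;
            if (shift + bits > 64) packed[word + 1] |= delta >>> (64 - shift);
        }
        return packed;
    }

    // Random access: decode a single value without touching the rest.
    static long get(long[] packed, long min, int bits, int i) {
        int bitPos = i * bits, word = bitPos >>> 6, shift = bitPos & 63;
        long delta = packed[word] >>> shift;
        if (shift + bits > 64) delta |= packed[word + 1] << (64 - shift);
        return min + (delta & ((1L << bits) - 1));
    }

    public static void main(String[] args) {
        long[] block = new long[4096];
        for (int i = 0; i < block.length; i++) block[i] = 5_000_000L + (i % 100);
        long min = Arrays.stream(block).min().getAsLong();
        long maxDelta = Arrays.stream(block).map(v -> v - min).max().getAsLong();
        int bits = bitsRequired(maxDelta); // 7 bits per value here, vs 64 raw
        long[] packed = pack(block, min, bits);
        System.out.println("raw bytes: " + block.length * 8);       // 32768
        System.out.println("packed bytes: " + packed.length * 8);   // 3584
        System.out.println("roundtrip ok: " + (get(packed, min, bits, 1234) == block[1234]));
    }
}
```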

