Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2021/11/01 12:34:33 UTC

[GitHub] [pinot] richardstartin commented on pull request #7661: implement size balanced V4 raw chunk format

richardstartin commented on pull request #7661:
URL: https://github.com/apache/pinot/pull/7661#issuecomment-956196251


   > Since the target chunk size is much larger than the header size, I think it should not add much overhead to store long offset and remove the 4G limit for single index. We can also include the uncompressed size in the header in case some compressor does not include the length info in the compressed data.
   
   There are a couple of things here:
   * **Compression metadata** - this was the purpose of #7655: to ensure that all the formats we use carry the correct metadata (three of the four already did) and to enforce an upgrade path for `LZ4` when using this chunk format. So there's no need for any per-chunk compression metadata, which also factors into the next point.
   * **Offset sizes** - to me, 4GB of compressed chunks feels like a lot. At a compression ratio of 2x, that's 8GB of raw data in a single segment; at 10x (JSON can be amazingly repetitive) it's 40GB. I am aware that in the past 32-bit offsets were shown not to be enough for some use cases, but those were signed offsets, so they only permitted 2GB of compressed data. I am unaware of any evidence that 4GB would not have been enough. Why do I care? It's not for the sake of storage overhead; I want to keep the metadata as small as possible in memory so that it can be searched quickly. Can we frame the discussion in terms of why a user would want/need more than 4GB of compressed data for a raw column in a single segment?
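
   To make the arithmetic above concrete, here is a minimal sketch (not code from this PR). It shows how the same 4 stored bytes address 2GB when read as a signed `int` and 4GB when read as unsigned, how much raw data that covers at a given compression ratio, and what 4-byte versus 8-byte offsets cost in memory. The class name, the method names, and the chunk count are hypothetical and only for illustration.

```java
// Illustrative only: names and the chunk count below are assumptions, not Pinot APIs.
final class ChunkOffsetSketch {

    // Largest offset a signed 32-bit int can hold: 2^31 - 1 bytes (just under 2GB).
    static final long MAX_SIGNED_OFFSET = Integer.MAX_VALUE;

    // Largest offset when the same 4 bytes are interpreted as unsigned: 2^32 - 1 bytes (just under 4GB).
    static final long MAX_UNSIGNED_OFFSET = 0xFFFFFFFFL;

    // Interpret a stored 32-bit value as an unsigned offset, widening to long.
    static long toUnsignedOffset(int storedOffset) {
        return Integer.toUnsignedLong(storedOffset);
    }

    // Raw (uncompressed) bytes covered by a given amount of compressed data at an integer ratio.
    static long rawBytesCovered(long compressedBytes, long compressionRatio) {
        return compressedBytes * compressionRatio;
    }

    // Memory used by the offsets alone, for a given offset width in bytes.
    static long offsetBytesInMemory(long numChunks, int bytesPerOffset) {
        return numChunks * bytesPerOffset;
    }

    public static void main(String[] args) {
        long fourGiB = 1L << 32;
        long gib = 1L << 30;

        System.out.println("signed limit (bytes):   " + MAX_SIGNED_OFFSET);
        System.out.println("unsigned limit (bytes): " + MAX_UNSIGNED_OFFSET);

        // An offset past 2GB looks negative as a signed int but is recovered intact when read unsigned.
        int stored = (int) 3_000_000_000L;
        System.out.println("stored as int: " + stored + ", read unsigned: " + toUnsignedOffset(stored));

        System.out.println("raw GiB at 2x:  " + rawBytesCovered(fourGiB, 2) / gib);
        System.out.println("raw GiB at 10x: " + rawBytesCovered(fourGiB, 10) / gib);

        // Hypothetical chunk count, chosen only to compare offset widths.
        long numChunks = 1_000_000L;
        System.out.println("int offsets (bytes):  " + offsetBytesInMemory(numChunks, Integer.BYTES));
        System.out.println("long offsets (bytes): " + offsetBytesInMemory(numChunks, Long.BYTES));
    }
}
```

   Under these assumptions, widening the offsets to `long` doubles the size of the offset array that has to be searched, which is the trade-off raised above.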


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


