Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/01/05 02:28:07 UTC

[GitHub] [pinot] ashishkf opened a new issue #7973: chunk compression type is hardcoded to passthrough for metric columns

ashishkf opened a new issue #7973:
URL: https://github.com/apache/pinot/issues/7973


   There doesn't seem to be a way to use LZ4 compression for metric columns.
   
   https://github.com/apache/pinot/blob/f2f8e38f9424bcacf3946197c9afcd50ef1d58fa/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/RealtimeSegmentConverter.java#L100


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on issue #7973: chunk compression type is hardcoded to passthrough for metric columns

Posted by GitBox <gi...@apache.org>.
richardstartin commented on issue #7973:
URL: https://github.com/apache/pinot/issues/7973#issuecomment-1013421015


   @Jackie-Jiang let's make it configurable when there are encoding modes which make sense for numeric data. LZ4 and Snappy aren't good options for numeric data, and are dominated by dictionary encoding.



[GitHub] [pinot] richardstartin commented on issue #7973: chunk compression type is hardcoded to passthrough for metric columns

Posted by GitBox <gi...@apache.org>.
richardstartin commented on issue #7973:
URL: https://github.com/apache/pinot/issues/7973#issuecomment-1013539250


   The problem is that you won't get any savings from LZ4. Those CPU readings may be almost identical, but with a little noise the data is difficult for a text-oriented algorithm like LZ4 to compress. The XOR of any two adjacent values will typically have very few set bits, so it can yield high compression ratios, perhaps even 8x. Implementing codecs such as XOR or delta encoding has been discussed before and would not be very difficult, and it would solve your problem in a way that making metric columns compressible would not.
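
To make the XOR idea above concrete, here is a minimal sketch in Python. It is not Pinot code: the function names are hypothetical, and it only shows the transform (XOR each value's 64-bit pattern with its predecessor's), not the bit-packing a real codec would apply afterwards. How few bits end up set depends on how similar adjacent readings are.

```python
import struct


def double_to_bits(x: float) -> int:
    """Reinterpret a float as its 64-bit IEEE 754 bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def bits_to_double(b: int) -> float:
    """Inverse of double_to_bits."""
    return struct.unpack(">d", struct.pack(">Q", b))[0]


def xor_encode(values):
    """Keep the first value's bits as-is; store every later value as
    the XOR of its bits with the previous value's bits."""
    bits = [double_to_bits(v) for v in values]
    return bits[:1] + [cur ^ prev for prev, cur in zip(bits, bits[1:])]


def xor_decode(encoded):
    """Undo xor_encode by accumulating the XORs."""
    out = []
    prev = 0
    for e in encoded:
        prev ^= e
        out.append(bits_to_double(prev))
    return out


def set_bits(x: int) -> int:
    """Number of 1 bits in x."""
    return bin(x).count("1")


if __name__ == "__main__":
    # Hypothetical gauge readings, chosen only for illustration.
    readings = [0.5, 0.5, 0.5, 0.75, 0.75]
    encoded = xor_encode(readings)
    # Repeated readings XOR to zero; a small change flips only a few bits.
    print([set_bits(e) for e in encoded[1:]])
    assert xor_decode(encoded) == readings
```

Identical neighbours encode to zero, and a zero or mostly-zero 64-bit word can be stored in far fewer than 8 bytes, which is where the savings would come from.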



[GitHub] [pinot] Jackie-Jiang commented on issue #7973: chunk compression type is hardcoded to passthrough for metric columns

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on issue #7973:
URL: https://github.com/apache/pinot/issues/7973#issuecomment-1012599127


   IMO this is a bug. For metrics, we use `PASS_THROUGH` by default, but we should allow overriding it if a codec is explicitly configured in the `FieldConfig`.
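
The behaviour proposed above could be sketched as follows. This is an illustration in Python, not Pinot's Java code: the enum and function names are hypothetical, and the LZ4 default for dimensions is an assumption inferred from the workaround described elsewhere in this thread.

```python
from enum import Enum
from typing import Optional


class FieldType(Enum):
    DIMENSION = "DIMENSION"
    METRIC = "METRIC"


class ChunkCompression(Enum):
    PASS_THROUGH = "PASS_THROUGH"
    LZ4 = "LZ4"
    SNAPPY = "SNAPPY"


def resolve_chunk_compression(
    field_type: FieldType,
    configured: Optional[ChunkCompression] = None,
) -> ChunkCompression:
    """Pick the chunk compression for a raw (no-dictionary) column.

    An explicitly configured codec always wins. Otherwise metrics fall
    back to PASS_THROUGH and dimensions to LZ4 (assumed default).
    """
    if configured is not None:
        return configured
    if field_type is FieldType.METRIC:
        return ChunkCompression.PASS_THROUGH
    return ChunkCompression.LZ4
```

With this shape, a metric column stays uncompressed unless the user opts in, so existing tables would not change behaviour.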



[GitHub] [pinot] ashishkf commented on issue #7973: chunk compression type is hardcoded to passthrough for metric columns

Posted by GitBox <gi...@apache.org>.
ashishkf commented on issue #7973:
URL: https://github.com/apache/pinot/issues/7973#issuecomment-1013535427


   I think in many cases the metric values don't change much. For example, a CPU usage gauge will have only slight variations for a given metric series over a small interval (say, 10 minutes). We have seen good compression ratios: in our data (storing Kubernetes metrics), we are getting 1 byte per row instead of the 8 allocated for the 'double' column. As a workaround, we marked the value column as a dimension so that it gets LZ4 compression and the resulting savings.



[GitHub] [pinot] richardstartin commented on issue #7973: chunk compression type is hardcoded to passthrough for metric columns

Posted by GitBox <gi...@apache.org>.
richardstartin commented on issue #7973:
URL: https://github.com/apache/pinot/issues/7973#issuecomment-1005437083


   This makes sense the way it is for a couple of reasons:
   * chunks for metric columns are tiny: 4-8KB depending on the data type. This means there would be many chunks to decompress in a column scan. 
   * general purpose compression algorithms work better on text than arbitrary numeric data, so the compression ratio for the average user’s column likely wouldn’t be very good.
   
   These two factors combine to make a less than compelling case for general purpose compression of metric columns. 
   
   There are numerous encoding techniques which could be explored for metric columns in the future, which tend to produce better space reductions and are faster to decode. 
   
   If you have a metric column which you expect to be compressible because it has lots of duplicates, it would be worth experimenting with using a dictionary column instead.
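
As a rough way to judge whether a dictionary column is worth trying, the sizes can be estimated from the column's cardinality. The sketch below is a simplified model in Python, not Pinot's actual on-disk layout: it assumes one 8-byte dictionary entry per distinct value plus a fixed-width dictionary id per row, and the function names are hypothetical.

```python
import math


def raw_size_bytes(values, value_width=8):
    """Size of the column stored raw: one fixed-width value per row."""
    return len(values) * value_width


def dictionary_size_bytes(values, value_width=8):
    """Estimated size with dictionary encoding: the distinct values,
    plus one fixed-bit dictionary id per row."""
    cardinality = len(set(values))
    # Bits needed to address every dictionary entry (at least 1).
    bits_per_id = max(1, (cardinality - 1).bit_length())
    forward_index = math.ceil(len(values) * bits_per_id / 8)
    return cardinality * value_width + forward_index


if __name__ == "__main__":
    # Hypothetical column: 10,000 rows cycling through 16 distinct values.
    column = [float(i % 16) for i in range(10_000)]
    print(raw_size_bytes(column), dictionary_size_bytes(column))
```

The estimate only favours the dictionary when the cardinality is small relative to the row count; a column of mostly unique values would end up larger than the raw form.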



[GitHub] [pinot] richardstartin commented on issue #7973: chunk compression type is hardcoded to passthrough for metric columns

Posted by GitBox <gi...@apache.org>.
richardstartin commented on issue #7973:
URL: https://github.com/apache/pinot/issues/7973#issuecomment-1013526820


   Here are the compressed sizes of 8KB chunks (1024 doubles) drawn from different distributions/patterns, using Snappy and LZ4. There are encodings which could be introduced to reduce the size of metric columns (e.g. XOR or delta encoding), but making it possible to compress metric columns with general-purpose compression algorithms isn't in the user's interest.
   
   |Compression |Distribution      |Compressed Size (KB)|
   |------------|------------------|--------------------|
   |Uncompressed|integer increments|8.00                |
   |LZ4         |integer increments|4.09                |
   |Snappy      |integer increments|4.02                |
   |Uncompressed|noisy increments  |8.00                |
   |LZ4         |noisy increments  |8.03                |
   |Snappy      |noisy increments  |8.00                |
   |Uncompressed|sinusoidal        |8.00                |
   |LZ4         |sinusoidal        |8.03                |
   |Snappy      |sinusoidal        |8.00                |
   |Uncompressed|normal(0,1)       |8.00                |
   |LZ4         |normal(0,1)       |8.03                |
   |Snappy      |normal(0,1)       |8.00                |
   |Uncompressed|exp(0.999)        |8.00                |
   |LZ4         |exp(0.999)        |7.23                |
   |Snappy      |exp(0.999)        |7.16                |
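
An experiment of this kind can be approximated with the sketch below. It is not the benchmark that produced the table: the data generators are guesses at what each label means (for instance, "exp(0.999)" is read as an exponential distribution with rate 0.999), and `zlib` from the Python standard library stands in for LZ4 and Snappy, so the numbers it prints will differ from those above.

```python
import math
import random
import struct
import zlib

N = 1024  # 1024 doubles = 8KB, matching the chunk size in the table


def pack_doubles(values):
    """Serialize floats as little-endian 8-byte doubles."""
    return struct.pack(f"<{len(values)}d", *values)


def make_datasets(seed=0):
    """Assumed generators for each distribution/pattern label."""
    rng = random.Random(seed)
    return {
        "integer increments": [float(i) for i in range(N)],
        "noisy increments": [i + rng.random() for i in range(N)],
        "sinusoidal": [math.sin(i / 10.0) for i in range(N)],
        "normal(0,1)": [rng.gauss(0.0, 1.0) for _ in range(N)],
        "exp(0.999)": [rng.expovariate(0.999) for _ in range(N)],
    }


def compressed_sizes(seed=0):
    """Map each label to (raw bytes, zlib-compressed bytes)."""
    sizes = {}
    for name, values in make_datasets(seed).items():
        raw = pack_doubles(values)
        sizes[name] = (len(raw), len(zlib.compress(raw)))
    return sizes


if __name__ == "__main__":
    for name, (raw, comp) in compressed_sizes().items():
        print(f"{name:20s} raw={raw / 1024:.2f} KB  zlib={comp / 1024:.2f} KB")
```

Swapping `zlib.compress` for an LZ4 or Snappy binding would be needed to compare against the table directly.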
   

