Posted to github@arrow.apache.org by "jun0315 (via GitHub)" <gi...@apache.org> on 2023/05/15 14:02:43 UTC

[GitHub] [arrow-rs] jun0315 opened a new issue, #4218: Provide "optimal" encoding

jun0315 opened a new issue, #4218:
URL: https://github.com/apache/arrow-rs/issues/4218

   Parquet supports many encodings. It would help if the writer provided an "optimal" encoding by default, selecting the most suitable encoding from the characteristics of the data rather than leaving the choice to users. Currently, the default encoding is plain, which is not ideal. Requiring users to pick an encoding based on data characteristics places a relatively high burden on them.
   
   *Originally posted by @tustvold in [issues: Non-Goals](https://github.com/apache/arrow-rs/issues/2677)*
   > Provide "optimal" encoding, rather a reasonable out-of-the-box baseline for common use-cases
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] jun0315 commented on issue #4218: Provide "optimal" encoding

Posted by "jun0315 (via GitHub)" <gi...@apache.org>.

jun0315 commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1557395774

   >  it's a heuristic
   
   May I ask where the code that implements this heuristic is located?


[GitHub] [arrow-rs] mapleFU commented on issue #4218: Provide "optimal" encoding

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.

mapleFU commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1547974275

   RLE hybrid is only used for dictionary-encoded data pages (the indices, not the dictionary values themselves) and for the RLE encoding. @jun0315 
   
   @tustvold By the way, would adding FastPFor as an encoding in the Parquet standard help? It might take a lot of time to implement and prototype, but I guess it could perform better.


[GitHub] [arrow-rs] jun0315 commented on issue #4218: Provide "optimal" encoding

Posted by "jun0315 (via GitHub)" <gi...@apache.org>.

jun0315 commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1547962898

   > the default encoding is plain
   
   Sorry, that was my mistake. I saw these lines before (https://github.com/apache/arrow-rs/blob/master/parquet/src/basic.rs#L222-L230) and assumed everything was PLAIN encoded.
   
   > Perhaps you could give an example where the encoding is not as you would expect?
   
   If our data is `1 1 1 1 2 2 2 2 3 3`, maybe `RLE hybrid` encoding is better. In this case, RLE encoding would not be used by default, right?
   
   


[GitHub] [arrow-rs] tustvold commented on issue #4218: Provide "optimal" encoding

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1548073596

   > default encoding chosen is delta instead of plain
   
   For v2 the dictionary fallback is https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6 for byte arrays, and PLAIN for everything else.
   
   https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5 is never used by default. PLAIN is very fast and well supported, and its space efficiency is good enough for most workloads, so alternatives face a hard task in driving broad ecosystem adoption. Ultimately you can always do better than Parquet, but people use Parquet because it is good enough and well supported.


[GitHub] [arrow-rs] mapleFU commented on issue #4218: Provide "optimal" encoding

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.

mapleFU commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1548039207

   https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5
   
   > This encoding is adapted from the Binary packing described in ["Decoding billions of integers per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf) by D. Lemire and L. Boytsov.
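
For readers unfamiliar with the idea behind this encoding, here is a minimal sketch of delta encoding with a frame of reference, assuming the whole input is treated as one block. The real DELTA_BINARY_PACKED layout splits deltas into blocks and miniblocks with their own headers and bit widths; the names `DeltaBlock` and `delta_encode` are hypothetical and are not part of arrow-rs.

```rust
/// Summary of one delta-encoded block (illustrative only, not the Parquet layout).
#[derive(Debug, PartialEq)]
struct DeltaBlock {
    /// The first value is stored as-is.
    first_value: i64,
    /// Smallest delta in the block, used as the frame of reference.
    min_delta: i64,
    /// Bits needed to store each (delta - min_delta).
    bit_width: u32,
    /// Deltas after subtracting `min_delta`, so all are non-negative.
    adjusted_deltas: Vec<u64>,
}

/// Delta-encode `values` as a single block. Returns `None` for empty input.
fn delta_encode(values: &[i64]) -> Option<DeltaBlock> {
    let first_value = *values.first()?;
    // Differences between consecutive values.
    let deltas: Vec<i64> = values
        .windows(2)
        .map(|w| w[1].wrapping_sub(w[0]))
        .collect();
    let min_delta = deltas.iter().copied().min().unwrap_or(0);
    let adjusted_deltas: Vec<u64> = deltas
        .iter()
        .map(|&d| d.wrapping_sub(min_delta) as u64)
        .collect();
    let max_adjusted = adjusted_deltas.iter().copied().max().unwrap_or(0);
    // Number of bits required to represent the largest adjusted delta.
    let bit_width = u64::BITS - max_adjusted.leading_zeros();
    Some(DeltaBlock {
        first_value,
        min_delta,
        bit_width,
        adjusted_deltas,
    })
}
```

For example, `delta_encode(&[100, 101, 103, 106, 110])` yields deltas `1, 2, 3, 4`, a minimum delta of 1, and adjusted deltas that each fit in 2 bits.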


[GitHub] [arrow-rs] tustvold commented on issue #4218: Provide "optimal" encoding

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1557402600

   https://github.com/apache/arrow-rs/blob/master/parquet/src/column/writer/mod.rs#L569


[GitHub] [arrow-rs] jun0315 commented on issue #4218: Provide "optimal" encoding

Posted by "jun0315 (via GitHub)" <gi...@apache.org>.

jun0315 commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1557437622

   Thank you. If I find some useful information, I will share it.


[GitHub] [arrow-rs] jun0315 closed issue #4218: Provide "optimal" encoding

Posted by "jun0315 (via GitHub)" <gi...@apache.org>.

jun0315 closed issue #4218: Provide "optimal" encoding
URL: https://github.com/apache/arrow-rs/issues/4218


[GitHub] [arrow-rs] mapleFU commented on issue #4218: Provide "optimal" encoding

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.

mapleFU commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1547925134

   Do you have a method to decide which encoding is the "optimal" encoding? It might require both sampling the data and some heuristic or other method. Do you have an idea or formula in mind?


[GitHub] [arrow-rs] jun0315 commented on issue #4218: Provide "optimal" encoding

Posted by "jun0315 (via GitHub)" <gi...@apache.org>.

jun0315 commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1548036251

   > For v2 writers there is a form of delta encoding,
   
   So in the case of v2 writers, the default encoding chosen is delta instead of plain. Has this been chosen internally?
   
   Sorry, my example may not be very good. If the data is `100 100 100 100 10000 10000 10000 1000`, would RLE be better? In v2, what encoding will be chosen by default?
   
   >  amusingly the linked paper says precisely not to do what the parquet specification then goes on to do
   
   I am very interested in this paper. Can you tell me the title of the paper? I'll go study :D


[GitHub] [arrow-rs] jun0315 commented on issue #4218: Provide "optimal" encoding

Posted by "jun0315 (via GitHub)" <gi...@apache.org>.

jun0315 commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1548096960

   So in summary, if we want to write Arrow in-memory data to a Parquet file, we generally do not need to specify an encoding. Will parquet automatically help us choose a more suitable encoding?


[GitHub] [arrow-rs] tustvold commented on issue #4218: Provide "optimal" encoding

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1547969573

   > If our data is 1 1 1 1 2 2 2 2 3 3, maybe RLE hybrid encoding is better,
   
   RLE Hybrid is used to encode level data and dictionary indices. The default settings will therefore PLAIN encode `1, 2, 3` to the dictionary page, and then RLE encode `0, 0, 0, 0, 1, 1, 1, 1, 2, 2` to the data page. I think this should be optimal.
   
   For v2 writers there _is_ a form of delta encoding, however, amusingly the linked paper says precisely not to do what the parquet specification then goes on to do :laughing:. This translates into pretty terrible decode performance, and I would not recommend using it for most workloads.
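
The dictionary-plus-RLE layout described in this comment can be sketched as follows. This is a simplified illustration under stated assumptions, not the arrow-rs implementation: the function names are hypothetical, and the real RLE/bit-packing hybrid also has bit-packed runs and length headers, which are omitted here.

```rust
/// Build a dictionary of distinct values (in first-seen order) together with
/// one dictionary index per input value.
fn dictionary_encode(values: &[i32]) -> (Vec<i32>, Vec<u32>) {
    let mut dictionary: Vec<i32> = Vec::new();
    let mut indices: Vec<u32> = Vec::with_capacity(values.len());
    for &v in values {
        // A linear search keeps the sketch short; a real writer would hash.
        let idx = match dictionary.iter().position(|&d| d == v) {
            Some(i) => i,
            None => {
                dictionary.push(v);
                dictionary.len() - 1
            }
        };
        indices.push(idx as u32);
    }
    (dictionary, indices)
}

/// Collapse consecutive equal indices into (index, run_length) pairs.
fn run_length_encode(indices: &[u32]) -> Vec<(u32, usize)> {
    let mut runs: Vec<(u32, usize)> = Vec::new();
    for &idx in indices {
        if let Some(last) = runs.last_mut() {
            if last.0 == idx {
                // Same index as the previous value: extend the current run.
                last.1 += 1;
                continue;
            }
        }
        // First value, or the index changed: start a new run.
        runs.push((idx, 1));
    }
    runs
}
```

Applied to `1 1 1 1 2 2 2 2 3 3`, `dictionary_encode` produces the dictionary `1, 2, 3` with indices `0, 0, 0, 0, 1, 1, 1, 1, 2, 2`, and `run_length_encode` reduces those indices to three runs.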


[GitHub] [arrow-rs] tustvold commented on issue #4218: Provide "optimal" encoding

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1548128024

   > Will parquet automatically help us choose a more suitable encoding
   
   Correct, the defaults should be appropriate for most workloads. Some workloads may benefit from tweaking based on empirical data, e.g. smaller row groups, but I would advise against premature optimisation here.


[GitHub] [arrow-rs] jun0315 closed issue #4218: Provide "optimal" encoding

Posted by "jun0315 (via GitHub)" <gi...@apache.org>.

jun0315 closed issue #4218: Provide "optimal" encoding
URL: https://github.com/apache/arrow-rs/issues/4218


[GitHub] [arrow-rs] tustvold commented on issue #4218: Provide "optimal" encoding

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1547933827

   > the default encoding is plain
   
   This isn't quite correct. For a V1 writer, the default encoding is RLE_DICTIONARY, falling back to PLAIN on exceeding the dictionary page size. There are no other non-deprecated encodings supported by the V1 spec. For a V2 writer, the defaults are similar, but the fallback for byte array types is DELTA_BYTE_ARRAY instead of PLAIN.
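
The selection rule described in this comment can be modelled roughly as below. This is a sketch of the description only, not code from arrow-rs: the enums are local stand-ins rather than the parquet crate's own types, and `choose_encoding` and its parameters are hypothetical names.

```rust
/// Local stand-in for the writer version (not the parquet crate's enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum WriterVersion {
    V1,
    V2,
}

/// Local stand-in for the encodings mentioned in the comment.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Encoding {
    RleDictionary,
    Plain,
    DeltaByteArray,
}

/// Pick the encoding for the next data page: dictionary encoding while the
/// dictionary fits within its size limit, otherwise the version-specific fallback.
fn choose_encoding(
    version: WriterVersion,
    is_byte_array: bool,
    dictionary_bytes: usize,
    dictionary_limit: usize,
) -> Encoding {
    if dictionary_bytes <= dictionary_limit {
        return Encoding::RleDictionary;
    }
    match (version, is_byte_array) {
        // V2 falls back to a delta encoding for byte arrays.
        (WriterVersion::V2, true) => Encoding::DeltaByteArray,
        // Everything else falls back to PLAIN.
        _ => Encoding::Plain,
    }
}
```

With a 1 MB limit, a V1 writer whose dictionary has grown past the limit gets `Encoding::Plain`, while a V2 writer on a byte array column gets `Encoding::DeltaByteArray`.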


[GitHub] [arrow-rs] jun0315 commented on issue #4218: Provide "optimal" encoding

Posted by "jun0315 (via GitHub)" <gi...@apache.org>.

jun0315 commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1548134832

   Thanks a lot! @tustvold @mapleFU 


[GitHub] [arrow-rs] tustvold commented on issue #4218: Provide "optimal" encoding

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1557390275

   > but all the changed buffers have become larger. Is this expected?
   
   Yes, it's a heuristic; there is no guaranteed way to know ahead of time the most efficient way to encode a given block of data. Consider the case of no repeated values: dictionary encoding will be larger. It will fall back to PLAIN encoding once the dictionary page is full (1 MB), but for very small columns with low repetition it is highly probable that the encoding will be larger.
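
To make the size argument concrete, here is a rough back-of-the-envelope estimate for 32-bit integers. It assumes a dictionary of 4-byte values plus one bit-packed index per value and ignores page headers, run-length savings and compression, so the figures are illustrative only; `plain_size` and `dictionary_size` are hypothetical helpers.

```rust
use std::collections::HashSet;

/// Bytes needed to PLAIN encode `values` as 4-byte integers.
fn plain_size(values: &[i32]) -> usize {
    values.len() * 4
}

/// Rough bytes needed for dictionary encoding: the distinct values stored as
/// 4-byte integers, plus one bit-packed dictionary index per input value.
fn dictionary_size(values: &[i32]) -> usize {
    let distinct = values.iter().collect::<HashSet<_>>().len();
    // Bits needed to address `distinct` entries (0 if there is at most one).
    let bit_width = if distinct <= 1 {
        0
    } else {
        (usize::BITS - (distinct - 1).leading_zeros()) as usize
    };
    // Round the packed indices up to whole bytes.
    let index_bytes = (values.len() * bit_width + 7) / 8;
    distinct * 4 + index_bytes
}
```

Under these assumptions, 1000 distinct values cost 4000 bytes as PLAIN but about 5250 bytes with a dictionary, whereas 1000 values drawn from only 4 distinct values cost about 266 bytes with a dictionary.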


[GitHub] [arrow-rs] jun0315 commented on issue #4218: Provide "optimal" encoding

Posted by "jun0315 (via GitHub)" <gi...@apache.org>.

jun0315 commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1557375786

   Hi @tustvold. Previously we used the plain encoding of arrow2, and we have now switched to the default encoding of arrow-rs. The written buffers have changed, but all the changed buffers have become larger. Is this expected?
   
   https://github.com/datafuselabs/databend/actions/runs/5043938824/jobs/9046595560?pr=11473#step:4:243


[GitHub] [arrow-rs] tustvold commented on issue #4218: Provide "optimal" encoding

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1557430159

   I'm not aware of any, but would be interested should you find such information; we just follow the example of other Parquet writers like parquet-mr. I suspect that if you have a cardinality estimate for the input, you can make a fairly good guess as to whether dictionary encoding is valuable. If your application is really sensitive to storage size, you could consider lowering the max dictionary page size, so that fallback triggers earlier.
   
   Alternatively, if you wanted to contribute a PR that would optionally re-encode on fallback, instead of preserving what has already been dictionary encoded, I would be willing to review it.
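
One way an application might act on the cardinality suggestion above is to sample the column and compare the fraction of distinct values against a threshold. The sketch below is a crude, hypothetical heuristic, not something arrow-rs provides: the function names, the strided sampling and the threshold are all assumptions, and a sample can overstate the distinct ratio when repeats are far apart.

```rust
use std::collections::HashSet;

/// Estimate the fraction of distinct values by looking at every `stride`-th value.
fn sampled_distinct_ratio(values: &[i64], stride: usize) -> f64 {
    // A stride of 0 would panic in `step_by`, so treat it as 1.
    let sample: Vec<i64> = values.iter().copied().step_by(stride.max(1)).collect();
    if sample.is_empty() {
        return 0.0;
    }
    let distinct = sample.iter().collect::<HashSet<_>>().len();
    distinct as f64 / sample.len() as f64
}

/// Hypothetical rule: keep dictionary encoding only when the sample shows
/// enough repetition, i.e. the distinct ratio is at most `max_ratio`.
fn prefer_dictionary(values: &[i64], stride: usize, max_ratio: f64) -> bool {
    sampled_distinct_ratio(values, stride) <= max_ratio
}
```

A caller could use the result to decide whether to enable dictionary encoding for a column or to request PLAIN directly; the right threshold would have to be found empirically.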


[GitHub] [arrow-rs] jun0315 commented on issue #4218: Provide "optimal" encoding

Posted by "jun0315 (via GitHub)" <gi...@apache.org>.

jun0315 commented on issue #4218:
URL: https://github.com/apache/arrow-rs/issues/4218#issuecomment-1557418038

   If I want to choose an encoding, such as delta, based on data characteristics in the upper-layer application, are there any previous studies I could use for reference?

