Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/26 23:27:33 UTC

[GitHub] [arrow] tachyonwill commented on a change in pull request #12274: PARQUET-2115: [C++] Parquet dictionary bit widths are limited to 32 bits

tachyonwill commented on a change in pull request #12274:
URL: https://github.com/apache/arrow/pull/12274#discussion_r793135042



##########
File path: cpp/src/parquet/encoding.cc
##########
@@ -1486,7 +1486,7 @@ class DictDecoderImpl : public DecoderImpl, virtual public DictDecoder<Type> {
       return;
     }
     uint8_t bit_width = *data;
-    if (ARROW_PREDICT_FALSE(bit_width >= 64)) {
+    if (ARROW_PREDICT_FALSE(bit_width > 32)) {
       throw ParquetException("Invalid or corrupted bit_width");

Review comment:
       I think this restriction is somewhat separate from the dictionary-specific bit width restriction. The dictionary bit width restriction has been in the format since at least version 2.2 in 2013: https://github.com/apache/parquet-format/commit/ad2e4c438cdf080bf042a5330965e2eefb0caa65 . A bit width > 32 would also be incompatible with the num_values field in the header: https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L544
   
   Also, parquet-cpp uses int32_t internally for indices, so supporting higher bit widths would require a refactor (e.g. https://github.com/apache/arrow/blob/01855c791056b7f712e6df82acf97ad3ab7b823a/cpp/src/parquet/encoding.cc#L1582 )
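
   To make the limit concrete, here is a minimal standalone sketch of the kind of check under discussion. The names `kMaxDictIndexBitWidth`, `ValidateDictIndexBitWidth` and `MaxIndexForBitWidth` are hypothetical and are not part of parquet-cpp; the real decoder throws ParquetException rather than std::invalid_argument.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical constant for illustration: the upper bound on the
// dictionary index bit width discussed in this review.
constexpr uint8_t kMaxDictIndexBitWidth = 32;

// Hypothetical helper: rejects a bit width read from the first byte of a
// dictionary-encoded data page if it exceeds the 32-bit limit.
inline uint8_t ValidateDictIndexBitWidth(uint8_t bit_width) {
  if (bit_width > kMaxDictIndexBitWidth) {
    throw std::invalid_argument("Invalid or corrupted bit_width");
  }
  return bit_width;
}

// Largest index value that can be encoded with a given (valid) bit width.
// Anything wider than 32 bits would need more than 32 bits of storage
// per decoded index.
inline uint64_t MaxIndexForBitWidth(uint8_t bit_width) {
  ValidateDictIndexBitWidth(bit_width);
  return (uint64_t{1} << bit_width) - 1;
}
```

   With this sketch, a width of 32 is accepted and a width of 33 is rejected, which matches the `bit_width > 32` condition in the diff above.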



