
Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2023/01/01 08:47:27 UTC

[GitHub] [arrow] mapleFU opened a new issue, #15145: [C++][Parquet] The DictEncoder is always PLAIN_DICTIONARY even in parquet_v2 format

mapleFU opened a new issue, #15145:
URL: https://github.com/apache/arrow/issues/15145

   ### Describe the enhancement requested
   
   In `DictEncoderImpl`, the encoding is fixed to `PLAIN_DICTIONARY`. Per the Parquet standard, it should be either `RLE_DICTIONARY` or `PLAIN_DICTIONARY`, and the choice should be decided by the Parquet version.
   
   Though the final format may be right, the temporary encoding can be tricky here.
   
   ### Component(s)
   
   C++, Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] mapleFU commented on issue #15145: [C++][Parquet] The DictEncoder is always PLAIN_DICTIONARY even in parquet_v2 format

Posted by GitBox <gi...@apache.org>.

mapleFU commented on issue #15145:
URL: https://github.com/apache/arrow/issues/15145#issuecomment-1369430274

   It is also odd that our `DictEncoderImpl` only supports encoding as `RLE_DICTIONARY`: its `WriteIndices` uses RLE to encode the indices. But if the Parquet format is v1, the page is labeled `PLAIN_DICTIONARY` even though what gets written is `RLE_DICTIONARY`.
   
   The related code is:
   
   ```c++
     int WriteIndices(uint8_t* buffer, int buffer_len) override { ... }
   
     inline Encoding::type dictionary_page_encoding() const {
       if (parquet_version_ == ParquetVersion::PARQUET_1_0) {
         return Encoding::PLAIN_DICTIONARY;
       } else {
         return Encoding::PLAIN;
       }
     }
   
     void WriteDictionaryPage() override {
       ...
       DictionaryPage page(buffer, current_dict_encoder_->num_entries(),
                           properties_->dictionary_page_encoding());
       total_bytes_written_ += pager_->WriteDictionaryPage(page);
     }
   ```
   
   If we only support writing `RLE_DICTIONARY`, then it's OK for `DictEncoderImpl` to use `RLE_DICTIONARY`.
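   
   A minimal sketch of the two version-dependent labels being discussed, to make the mismatch concrete. The enums and function names below are hypothetical stand-ins for illustration, not the Arrow C++ types. It assumes, as the Parquet format specification describes, that v1 labels both the dictionary page and the index data pages `PLAIN_DICTIONARY`, while v2 prefers `PLAIN` for the dictionary page and `RLE_DICTIONARY` for the data pages:
   
   ```c++
   #include <cassert>
   
   // Hypothetical stand-ins for illustration only; not the Arrow/Parquet C++ types.
   enum class FormatVersion { V1, V2 };
   enum class Encoding { PLAIN, PLAIN_DICTIONARY, RLE_DICTIONARY };
   
   // Label for the dictionary page itself (same branching as the quoted snippet).
   constexpr Encoding DictionaryPageEncoding(FormatVersion v) {
     return v == FormatVersion::V1 ? Encoding::PLAIN_DICTIONARY : Encoding::PLAIN;
   }
   
   // Label for the data pages that hold the dictionary indices.
   constexpr Encoding DictionaryIndexEncoding(FormatVersion v) {
     return v == FormatVersion::V1 ? Encoding::PLAIN_DICTIONARY
                                   : Encoding::RLE_DICTIONARY;
   }
   
   // Usage: only the labels depend on the version in this sketch.
   inline void CheckVersionLabels() {
     assert(DictionaryIndexEncoding(FormatVersion::V1) == Encoding::PLAIN_DICTIONARY);
     assert(DictionaryIndexEncoding(FormatVersion::V2) == Encoding::RLE_DICTIONARY);
     assert(DictionaryPageEncoding(FormatVersion::V2) == Encoding::PLAIN);
   }
   ```
   
   Under this reading, only the label changes between versions; the indices themselves are RLE-encoded in both cases.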
   
   @pitrou 



[GitHub] [arrow] mapleFU closed issue #15145: [C++][Parquet] The DictEncoder is always PLAIN_DICTIONARY even in parquet_v2 format

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.

mapleFU closed issue #15145: [C++][Parquet] The DictEncoder is always PLAIN_DICTIONARY even in parquet_v2 format 
URL: https://github.com/apache/arrow/issues/15145



[GitHub] [arrow] mapleFU commented on issue #15145: [C++][Parquet] The DictEncoder is always PLAIN_DICTIONARY even in parquet_v2 format

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.

mapleFU commented on issue #15145:
URL: https://github.com/apache/arrow/issues/15145#issuecomment-1620030094

   Note that the only in-memory encoder is `PLAIN_DICTIONARY`; when serializing, it becomes `RLE_DICTIONARY`.
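   
   One way to picture this, as a rough sketch rather than the actual Arrow implementation: the in-memory encoder reports a single fixed tag, and a separate serialization step picks the label that ends up in the file. All names here (`InMemoryDictEncoder`, `SerializedEncoding`, `FormatVersion`) are hypothetical, and making the relabeling depend on the format version is an assumption taken from the issue description above:
   
   ```c++
   #include <cassert>
   
   // Hypothetical stand-ins for illustration only; not the Arrow/Parquet C++ types.
   enum class FormatVersion { V1, V2 };
   enum class Encoding { PLAIN_DICTIONARY, RLE_DICTIONARY };
   
   // The in-memory encoder always reports the same tag, whatever the version.
   struct InMemoryDictEncoder {
     Encoding encoding() const { return Encoding::PLAIN_DICTIONARY; }
   };
   
   // Hypothetical serialization step: translate the in-memory tag into the
   // label written to the file (assumed here to depend on the format version).
   inline Encoding SerializedEncoding(Encoding in_memory, FormatVersion version) {
     if (in_memory == Encoding::PLAIN_DICTIONARY && version == FormatVersion::V2) {
       return Encoding::RLE_DICTIONARY;
     }
     return in_memory;
   }
   
   // Usage: the encoder's tag is unchanged; only the written label differs.
   inline void CheckSerializedLabel() {
     InMemoryDictEncoder encoder;
     assert(SerializedEncoding(encoder.encoding(), FormatVersion::V2) ==
            Encoding::RLE_DICTIONARY);
     assert(SerializedEncoding(encoder.encoding(), FormatVersion::V1) ==
            Encoding::PLAIN_DICTIONARY);
   }
   ```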

