You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "wgtmac (via GitHub)" <gi...@apache.org> on 2023/02/15 02:07:34 UTC

[GitHub] [arrow] wgtmac commented on pull request #34177: GH-15145: [C++][Parquet] Specify Dict Encoding Type

wgtmac commented on PR #34177:
URL: https://github.com/apache/arrow/pull/34177#issuecomment-1430650890

   I'd propose to fix it by using dict encoding type of format v2 and only deal with the deprecated `PLAIN_DICTIONARY` when talking to thrift messages.
   
   For example, `WriterProperties` has dealt with the format already: [link](https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L577) 
   ```cpp
     inline Encoding::type dictionary_index_encoding() const {
       if (parquet_version_ == ParquetVersion::PARQUET_1_0) {
         return Encoding::PLAIN_DICTIONARY;
       } else {
         return Encoding::RLE_DICTIONARY;
       }
     }
   
     inline Encoding::type dictionary_page_encoding() const {
       if (parquet_version_ == ParquetVersion::PARQUET_1_0) {
         return Encoding::PLAIN_DICTIONARY;
       } else {
         return Encoding::PLAIN;
       }
     }
   ```
   
   The DictDecoder always uses `RLE_DICTIONARY` to create decoder and check the encoding type of data pages: [link](https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1470)
   ```cpp
     explicit DictDecoderImpl(const ColumnDescriptor* descr,
                              MemoryPool* pool = ::arrow::default_memory_pool())
         : DecoderImpl(descr, Encoding::RLE_DICTIONARY),
           dictionary_(AllocateBuffer(pool, 0)),
           dictionary_length_(0),
           byte_array_data_(AllocateBuffer(pool, 0)),
           byte_array_offsets_(AllocateBuffer(pool, 0)),
           indices_scratch_space_(AllocateBuffer(pool, 0)) {}
   ```
   
   And the ColumnReader always resets the dict index encoding to `RLE_DICTIONARY`: [link](https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.cc#L853)
   ```cpp
       Encoding::type encoding = page.encoding();
   
       if (IsDictionaryIndexEncoding(encoding)) {
         encoding = Encoding::RLE_DICTIONARY;
       }
   ```
   
   So I'd prefer doing the same thing on the writer side. @mapleFU 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org