Posted to issues@arrow.apache.org by "pitrou (via GitHub)" <gi...@apache.org> on 2024/03/18 15:56:42 UTC

[I] [Parquet] Make default fallback encoding choice smarter [arrow]

pitrou opened a new issue, #40636:
URL: https://github.com/apache/arrow/issues/40636

   ### Describe the enhancement requested
   
   Currently, the default encoding for a non-dictionary data page is chosen in a trivial way.
   The choice is made in two places:
   1. in the `FallbackToPlainEncoding` function for columns for which dictionary encoding is attempted:
   https://github.com/apache/arrow/blob/5718a2862b4254d8bf938912d8958837ac7313a5/cpp/src/parquet/column_writer.cc#L1567-L1580
   2. in the `ColumnWriter::Make` factory function for columns for which dictionary encoding is not attempted:
   https://github.com/apache/arrow/blob/5718a2862b4254d8bf938912d8958837ac7313a5/cpp/src/parquet/column_writer.cc#L2375-L2382
   
   I'll note that parquet-mr does not limit dictionary encoding fallback to PLAIN, even for "v1" Parquet files:
   https://github.com/apache/parquet-mr/blob/95b004c3df473e3ab0963dc5136934ce5235d5df/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L124-L139
   
   We should probably consolidate the logic from the two functions above and make it more sophisticated, so that the best available encoding can be chosen for the selected Parquet version.
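   
   As a rough illustration of what a single consolidated helper could look like, here is a minimal sketch. Everything in it is hypothetical: the enums are simplified stand-ins rather than the actual parquet-cpp types, `ChooseFallbackEncoding` is an invented name, and the policy itself (PLAIN for v1, delta encodings for integers and byte arrays on v2) is only an assumption for the sake of the example, not a decision made in this issue.
   
   ```cpp
   // Hypothetical stand-ins for illustration only. These are NOT the
   // parquet-cpp types, just minimal enums using names from the Parquet format.
   enum class PhysicalType { BOOLEAN, INT32, INT64, FLOAT, DOUBLE, BYTE_ARRAY };
   enum class Encoding { PLAIN, DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY };
   enum class FormatVersion { V1, V2 };
   
   // One place to pick the non-dictionary encoding, usable both when
   // dictionary encoding falls back and when it was never attempted.
   //
   // Assumed policy (an illustration, not what Arrow does or will do):
   //   - v1 files: always PLAIN, the conservative choice
   //   - v2 files: DELTA_BINARY_PACKED for INT32/INT64,
   //               DELTA_LENGTH_BYTE_ARRAY for BYTE_ARRAY,
   //               PLAIN for everything else
   constexpr Encoding ChooseFallbackEncoding(PhysicalType type,
                                             FormatVersion version) {
     if (version == FormatVersion::V1) {
       return Encoding::PLAIN;
     }
     switch (type) {
       case PhysicalType::INT32:
       case PhysicalType::INT64:
         return Encoding::DELTA_BINARY_PACKED;
       case PhysicalType::BYTE_ARRAY:
         return Encoding::DELTA_LENGTH_BYTE_ARRAY;
       default:
         return Encoding::PLAIN;
     }
   }
   ```
   
   Both call sites (the dictionary fallback path and the `ColumnWriter::Make` factory) could then defer to one such function, so that any refinement of the policy would only need to be made once.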
   
   Also related: https://github.com/apache/arrow/issues/38441
   
   ### Component(s)
   
   C++, Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Parquet] Make default fallback encoding choice smarter [arrow]

Posted by "ClifHouck (via GitHub)" <gi...@apache.org>.
ClifHouck commented on issue #40636:
URL: https://github.com/apache/arrow/issues/40636#issuecomment-2023361741

   Going to take a look at this if no one objects.

