Posted to dev@parquet.apache.org by Xinli shang <sh...@uber.com.INVALID> on 2023/04/28 15:32:43 UTC

Parquet sync meeting notes - April 2023

Hi all,

Here are the meeting notes for today's Parquet sync meeting.


4/28/2023

Attendees (Shenxuan Liu, Fokko Driesprong, Gang Wu, Jiashen Zhang, Xinli
Shang)

   1. Post-release 1.13.0
      1. Iceberg upgraded to 1.13.0, which bumped the Hadoop support to
         Hadoop 3, but we didn’t notice since we don’t run CI against
         Hadoop 2. This has been fixed in #2290
         <https://github.com/apache/parquet-mr/pull/1083>.
      2. Some small changes (#1073
         <https://github.com/apache/parquet-mr/pull/1073> and #1074
         <https://github.com/apache/parquet-mr/pull/1074>) to let Flink use
         ParquetMR without having Hadoop on the classpath.
   2. In Velox, we store/cache files locally, so the bottleneck shows up
      in Parquet itself.
      1. With the local files stored on SSD, reads reach 3 GB/sec, while
         decompression reaches 200 MB/sec.
      2. The current Parquet reader is designed for remote reading.
      3. There is a trans-compression
         <https://github.com/apache/parquet-mr/blob/master/parquet-cli/src/main/java/org/apache/parquet/cli/commands/TransCompressionCommand.java>
         API you can use to speed things up; it is about 20x faster.
      4. ZSTD is recommended.
   3. Data masking PARQUET-2223
      <https://github.com/apache/parquet-mr/pull/1016>
      1. The code is incomplete. When a column is hidden, it also needs to
         be hidden in the schema, and we also need to mark it as hidden.
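
As a rough illustration of the Velox point above, here is a back-of-the-envelope sketch in Python (not something shown in the meeting). It assumes both rates from the notes are measured over the same bytes and that reading and decompression run as a streaming pipeline. The function names are made up for this example.

```python
# Back-of-the-envelope check of the local-read bottleneck from the notes.
# Assumption: both rates apply to the same bytes (not stated in the notes).

SSD_READ_BYTES_PER_SEC = 3_000_000_000    # "3G bytes/sec" from the notes
DECOMPRESS_BYTES_PER_SEC = 200_000_000    # "200M/Sec" from the notes


def pipeline_throughput(*stage_rates):
    """A streaming pipeline cannot run faster than its slowest stage."""
    return min(stage_rates)


def seconds_to_scan(num_bytes, *stage_rates):
    """Time to push num_bytes through the pipeline at the bottleneck rate."""
    return num_bytes / pipeline_throughput(*stage_rates)


bottleneck = pipeline_throughput(SSD_READ_BYTES_PER_SEC, DECOMPRESS_BYTES_PER_SEC)
slowdown = SSD_READ_BYTES_PER_SEC / DECOMPRESS_BYTES_PER_SEC

print(f"bottleneck rate: {bottleneck} bytes/sec")
print(f"SSD is {slowdown:.0f}x faster than decompression")
```

Under these assumptions the SSD can deliver data 15x faster than it can be decompressed, so once files are local, the codec rather than I/O sets the scan rate.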


-- 
Xinli Shang