Posted to dev@parquet.apache.org by Xinli shang <sh...@uber.com.INVALID> on 2023/04/28 15:32:43 UTC
Parquet sync meeting notes - April 2023
Hi all,
Here are the meeting notes for today's Parquet sync meeting.
4/28/2023
Attendees (Shenxuan Liu, Fokko Driesprong, Gang Wu, Jiashen Zhang, Xinli
Shang)
1. Post-release 1.13.0
   1. When Iceberg upgraded to 1.13.0, it turned out that the release had
      bumped the Hadoop support to Hadoop 3, but we didn’t notice since we
      don’t run CI against Hadoop 2. This has been fixed in #1083
      <https://github.com/apache/parquet-mr/pull/1083>.
   2. Some small changes (#1073
      <https://github.com/apache/parquet-mr/pull/1073> and #1074
      <https://github.com/apache/parquet-mr/pull/1074>) let Flink use
      ParquetMR without having Hadoop on the classpath.
2. In Velox, files are stored/cached locally, and the bottleneck then
   shows up in Parquet itself.
   1. With the local files stored on SSD, reads run at 3 GB/sec, while
      decompression runs at 200 MB/sec.
   2. The current Parquet reader is designed for remote reading.
   3. There is a trans-compression
      <https://github.com/apache/parquet-mr/blob/master/parquet-cli/src/main/java/org/apache/parquet/cli/commands/TransCompressionCommand.java>
      API you can use to speed things up; it is about 20x faster.
   4. ZSTD is recommended.
3. Data masking, PARQUET-2223
   <https://github.com/apache/parquet-mr/pull/1016>
   1. The code is incomplete. When a column is hidden, it also needs to
      be hidden in the schema, and we also need to mark it as hidden.
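
To put the numbers from item 2 in perspective, here is a rough
back-of-the-envelope sketch. It assumes the two figures mean 3 GB/sec for
SSD reads and 200 MB/sec for decompression, and that both are measured
over the same bytes; neither assumption was confirmed in the meeting. The
function names are made up for illustration and are not part of any
Parquet or Velox API.

```python
# Rough throughput model for a two-stage read path: SSD read, then decompress.
# Assumed units (not confirmed in the notes): MB/sec over the same bytes.

SSD_READ_MB_S = 3000.0    # "3G bytes/sec" from the notes, read as 3 GB/sec
DECOMPRESS_MB_S = 200.0   # "200M/Sec" from the notes, read as 200 MB/sec


def serial_throughput(read_mb_s: float, decompress_mb_s: float) -> float:
    """Effective MB/sec when read and decompression do not overlap."""
    return 1.0 / (1.0 / read_mb_s + 1.0 / decompress_mb_s)


def pipelined_throughput(read_mb_s: float, decompress_mb_s: float) -> float:
    """Effective MB/sec when the stages fully overlap: the slower stage wins."""
    return min(read_mb_s, decompress_mb_s)


if __name__ == "__main__":
    print("read/decompress ratio:", SSD_READ_MB_S / DECOMPRESS_MB_S)
    print("serial MB/sec:", serial_throughput(SSD_READ_MB_S, DECOMPRESS_MB_S))
    print("pipelined MB/sec:", pipelined_throughput(SSD_READ_MB_S, DECOMPRESS_MB_S))
```

Under these assumptions the SSD is 15x faster than decompression, so
with local files the decompression stage caps throughput whether or not
the two stages overlap.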
--
Xinli Shang