Posted to dev@parquet.apache.org by Xinli shang <sh...@uber.com.INVALID> on 2021/02/23 18:18:52 UTC

Parquet community sync meeting notes 2/23/2021

Hi all,


These are the meeting notes from today's community meeting.


Date: 2/23/2021

Attendees: Xinli Shang, Gábor Szádovszky, Gidon Gershinsky, Ryan Blue

   1. Iceberg and Parquet
      1. Column ID vs. name
         1. Column resolution: Parquet relies on the column name, while
            Iceberg relies on the column ID. For example, doing column
            filtering and projection by ID would avoid a lot of issues,
            not only in schema resolution.
      2. Filter API: Iceberg expressions cover more cases. It would be
         great if Parquet also supported them.
         1. IN, StartsWith, etc.
      3. How much effort is needed for Parquet to use the Iceberg filter
         API?
         1. It depends on how we do it. We could just move that code to
            Parquet, which would save time. But that is just one solution
            and might not be the best.
      4. Is the requirement generic across the industry or
         Iceberg-specific?
         1. The parquet-avro module has a similar mechanism.
         2. Pig has resolution by position.
         3. So it is pretty generic.
      5. Should we create a parquet-iceberg module or just make it generic
         to use?
         1. Making it generic would make more sense.
      6. Record materialization: read support has MessageColumnIO. In
         Iceberg, we materialize records faster. We run Flink and Spark
         with the same API, so it is fairly general.
      7. Support vectorization into Arrow in Parquet
         1. This is a great idea. It would boost performance.
      8. To conclude, we can start with ID resolution first.



   2. Parquet 1.12 release
      1. Once this PR <https://github.com/apache/parquet-mr/pull/868> is
         done, we can create an RC build.



   3. Interop testing
      1. This is about how to create data structures for interop testing.
      2. Parquet test repo change.

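To illustrate the column ID vs. name point from item 1, here is a minimal sketch of why resolving by ID survives a column rename while resolving by name does not. It is plain Python and does not use the Parquet or Iceberg APIs; the Column class and the two resolve functions are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Column:
    """Hypothetical column descriptor: a stable ID plus the current name."""
    field_id: int
    name: str


def resolve_by_name(file_schema: List[Column], requested: Column) -> Optional[Column]:
    """Match the requested column against the file schema by name only."""
    for col in file_schema:
        if col.name == requested.name:
            return col
    return None


def resolve_by_id(file_schema: List[Column], requested: Column) -> Optional[Column]:
    """Match the requested column against the file schema by ID only."""
    for col in file_schema:
        if col.field_id == requested.field_id:
            return col
    return None


# The file was written when column 1 was still called "user".
file_schema = [Column(1, "user"), Column(2, "ts")]
# The table schema later renamed column 1 to "user_name".
requested = Column(1, "user_name")

by_name = resolve_by_name(file_schema, requested)
by_id = resolve_by_id(file_schema, requested)

print(by_name)  # None: the old file has no column with the new name
print(by_id)    # Column(field_id=1, name='user')
```

The same idea would apply to projection and filtering: if the reader looks columns up by ID, a rename in the table schema does not hide data in older files.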

Please let me know if you have any questions.

Xinli Shang | Tech Lead Manager @ Uber Data Infra