Posted to issues@beam.apache.org by "Ismaël Mejía (Jira)" <ji...@apache.org> on 2021/03/02 15:47:00 UTC
[jira] [Created] (BEAM-11913) Add support for Hadoop configuration on ParquetIO
Ismaël Mejía created BEAM-11913:
-----------------------------------
Summary: Add support for Hadoop configuration on ParquetIO
Key: BEAM-11913
URL: https://issues.apache.org/jira/browse/BEAM-11913
Project: Beam
Issue Type: Improvement
Components: io-java-parquet
Reporter: Ismaël Mejía
Assignee: Ismaël Mejía
We have discussed this issue in the past and tried to avoid exposing Hadoop objects in the ParquetIO public API; however, there are two valid reasons to add this support:
1. Many Parquet features are configured through public helper methods on Parquet that store their settings inside Hadoop's Configuration object, e.g. column projection via `AvroReadSupport.setRequestedProjection(conf, projectionSchema);` or predicate filters via `ParquetInputFormat.setFilterPredicate(sc.hadoopConfiguration(), filterPredicate);`. Giving access to the Configuration would allow power users to use these advanced features without any maintenance on the IO side.
2. The main reason to avoid the Hadoop Configuration object was to align with future Parquet APIs that do not require Hadoop (see PARQUET-1126 for details), but it does not seem that this will happen soon.
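
For illustration, a rough sketch of how the two helper methods named in point 1 fill a Hadoop Configuration. It assumes avro, parquet-avro, parquet-hadoop, parquet-column and hadoop-common are on the classpath. The class name, the "User" record and its "id" field are made up for this example, and how ParquetIO would accept the resulting Configuration is deliberately left out, since that is what this issue proposes.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class ParquetReadConfigSketch {

  /** Builds a Hadoop Configuration carrying a column projection and a predicate filter. */
  public static Configuration buildReadConfiguration() {
    Configuration conf = new Configuration();

    // Hypothetical projection: read only the "id" column of a "User" record.
    Schema projectionSchema =
        SchemaBuilder.record("User").fields().requiredInt("id").endRecord();
    AvroReadSupport.setRequestedProjection(conf, projectionSchema);

    // Hypothetical predicate: keep only records whose id is greater than 100.
    FilterPredicate filterPredicate = FilterApi.gt(FilterApi.intColumn("id"), 100);
    ParquetInputFormat.setFilterPredicate(conf, filterPredicate);

    return conf;
  }

  public static void main(String[] args) {
    Configuration conf = buildReadConfiguration();
    // The projection helper stores the requested schema as a string property.
    System.out.println(conf.get(AvroReadSupport.AVRO_REQUESTED_PROJECTION));
  }
}
```

Both settings end up as ordinary properties on the Configuration, which is why exposing that object on the IO would be enough for users to reach them.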
--
This message was sent by Atlassian Jira
(v8.3.4#803005)