Posted to dev@parquet.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2015/02/06 00:06:36 UTC
[jira] [Resolved] (PARQUET-139) Avoid reading file footers in parquet-avro InputFormat
[ https://issues.apache.org/jira/browse/PARQUET-139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ryan Blue resolved PARQUET-139.
-------------------------------
Resolution: Fixed
Fix Version/s: parquet-mr_1.6.0
Issue resolved by pull request 91
[https://github.com/apache/incubator-parquet-mr/pull/91]
> Avoid reading file footers in parquet-avro InputFormat
> ------------------------------------------------------
>
> Key: PARQUET-139
> URL: https://issues.apache.org/jira/browse/PARQUET-139
> Project: Parquet
> Issue Type: Task
> Reporter: Ryan Blue
> Assignee: Ryan Blue
> Fix For: parquet-mr_1.6.0
>
>
> AvroParquetInputFormat currently relies on ParquetInputFormat, which reads the footers of all of the files that will be processed. It does this for two reasons:
> 1. To plan splits (if using client side splits)
> 2. To get a merged schema for all of the files
> Reading all of the footers is a bottleneck when working with a large number of files, and it can significantly delay a job because only one machine is doing the work. The footers should instead be read in parallel on the task side. PARQUET-84 added the ability to avoid reading footers on the client for split planning, so the remaining difficulty is avoiding the footer reads needed to merge the Parquet schema.
> To avoid merging the Parquet schema, AvroParquetInputFormat should either use whatever schema each file contains or reconcile the projection schema with the file schema on the task side.
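
The second option in the description, reconciling the projection schema with the file schema on the task side, could look roughly like the sketch below. It is an illustration only and does not use the parquet-mr or Avro APIs: schemas are modeled as ordered maps from field name to type name, and the class ProjectionReconciler and its reconcile method are hypothetical names chosen for this example. Each task would run this against the footer of the one file it reads, so no client-side merge across all files is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: schemas are plain maps of field name -> type name,
// NOT parquet-mr MessageType or Avro Schema objects.
final class ProjectionReconciler {

    /**
     * Reconciles a requested projection with the schema of a single file.
     *
     * Returns the requested fields that the file contains, in the requested
     * order and with the file's types. Requested fields that the file lacks
     * are appended to `missing` so the caller can decide how to fill them
     * (for example with a default value).
     */
    static Map<String, String> reconcile(Map<String, String> fileSchema,
                                         List<String> projection,
                                         List<String> missing) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String field : projection) {
            String type = fileSchema.get(field);
            if (type != null) {
                result.put(field, type);   // field exists in this file
            } else {
                missing.add(field);        // field absent from this file
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Schema as it might be read from one file's footer on the task side.
        Map<String, String> fileSchema = new LinkedHashMap<>();
        fileSchema.put("id", "int64");
        fileSchema.put("name", "binary");

        List<String> missing = new ArrayList<>();
        Map<String, String> readSchema =
            reconcile(fileSchema, Arrays.asList("name", "email"), missing);

        System.out.println("read: " + readSchema);
        System.out.println("missing: " + missing);
    }
}
```

Because the reconciliation depends only on one file's schema and the requested projection, it can run independently in every task, which is what removes the single-machine footer scan from the client.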
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)