Posted to dev@parquet.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2015/02/06 00:06:36 UTC

[jira] [Resolved] (PARQUET-139) Avoid reading file footers in parquet-avro InputFormat

     [ https://issues.apache.org/jira/browse/PARQUET-139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue resolved PARQUET-139.
-------------------------------
       Resolution: Fixed
    Fix Version/s: parquet-mr_1.6.0

Issue resolved by pull request 91
[https://github.com/apache/incubator-parquet-mr/pull/91]

> Avoid reading file footers in parquet-avro InputFormat
> ------------------------------------------------------
>
>                 Key: PARQUET-139
>                 URL: https://issues.apache.org/jira/browse/PARQUET-139
>             Project: Parquet
>          Issue Type: Task
>            Reporter: Ryan Blue
>            Assignee: Ryan Blue
>             Fix For: parquet-mr_1.6.0
>
>
> AvroParquetInputFormat currently relies on ParquetInputFormat, which reads the footers of all of the files that will be processed. It does so for two reasons:
> 1. To plan splits (if using client-side splits)
> 2. To get a merged schema for all of the files
> Reading all of the footers is a bottleneck when working with a large number of files and can significantly delay a job, because only one machine (the client) is doing the work. Footers should instead be read in parallel on the task side. PARQUET-84 added the ability to avoid reading footers on the client for split planning, so the remaining difficulty is avoiding the footer reads needed to merge the Parquet schema.
> To avoid merging the Parquet schema, AvroParquetInputFormat should either use whatever schema each file contains or reconcile the projection schema with the file schema on the task side.
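
The second option in the description, reconciling the projection schema with each file's schema on the task side, can be illustrated with a small sketch. This is not the parquet-mr API and not necessarily how pull request 91 implements it. It assumes a deliberately simplified model in which a schema is a flat mapping from column name to type name, and the function name reconcile_projection is hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified model: a schema is a flat mapping of
# column name -> type name. Real Parquet/Avro schemas are nested.
Schema = Dict[str, str]


def reconcile_projection(projection: Schema,
                         file_schema: Schema) -> Tuple[Schema, List[str]]:
    """Return (columns to read from this file, projected columns the file lacks)."""
    to_read: Schema = {}
    missing: List[str] = []
    for name, wanted_type in projection.items():
        file_type = file_schema.get(name)
        if file_type is None:
            # Column is absent from this file; the reader would have to
            # supply a default or null for it.
            missing.append(name)
        elif file_type != wanted_type:
            raise ValueError(
                f"type mismatch for {name!r}: file has {file_type}, "
                f"projection wants {wanted_type}")
        else:
            to_read[name] = file_type
    return to_read, missing


# Each task reconciles against the footer of its own file only, so the
# client never needs a schema merged across all files.
projection = {"id": "int64", "name": "binary", "email": "binary"}
old_file = {"id": "int64", "name": "binary"}
new_file = {"id": "int64", "name": "binary", "email": "binary", "age": "int32"}

old_read, old_missing = reconcile_projection(projection, old_file)
new_read, new_missing = reconcile_projection(projection, new_file)
```

In this sketch the older file yields a narrower read set plus a list of columns to fill in, while the newer file satisfies the projection completely and its extra column is simply not read. Each call needs only one file's footer, which is what allows the work to move to the tasks.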


