Posted to issues@spark.apache.org by "Olivier Toupin (JIRA)" <ji...@apache.org> on 2015/08/12 17:24:45 UTC

[jira] [Commented] (SPARK-6795) Avoid reading Parquet footers on driver side when an global arbitrative schema is available

    [ https://issues.apache.org/jira/browse/SPARK-6795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693670#comment-14693670 ] 

Olivier Toupin commented on SPARK-6795:
---------------------------------------

This doesn't seem to be fixed. I built the latest branch-1.4 and removed our custom fix for this, and we still experience the issue. On a query against a table with a lot of files, the driver hangs for a while while it's reading partitions. The timeline in the UI makes this pretty clear: with our branch there is almost no empty space, whereas with branch-1.4 there is a 50s void in our worst case.

The suspected culprits:

1. readAllFootersInParallelUsingSummaryFiles defaults to reading all footers if no summary file is available. So most of the time we probably read all footers even if schema merging is off.

https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L357

2. Why do we read the schema if a metastore schema is available?

Shouldn't it be 

          val dataSchema0 = maybeDataSchema
            .orElse(maybeMetastoreSchema)
            .orElse(readSchema())

?

https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L370
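To illustrate why the ordering proposed above matters, here is a minimal, self-contained sketch. It relies only on the fact that Option.orElse takes its argument by name, so the alternative is evaluated only when the receiver is None. Everything else is an assumption made for illustration: the Schema type alias, the footerReads counter, the stand-in readSchema(), and the footersFirst variant (my reconstruction of the ordering the question implies) are hypothetical and are not Spark's actual code.

```scala
// Illustrative sketch only: these names are hypothetical stand-ins, not Spark's API.
object SchemaResolutionSketch {
  // Assumption: a schema is modelled as a plain list of column names.
  type Schema = Seq[String]

  // Counts how often the (expensive) footer scan would have been triggered.
  var footerReads = 0

  // Hypothetical stand-in for the footer-reading step.
  def readSchema(): Option[Schema] = {
    footerReads += 1
    Some(Seq("from_footers"))
  }

  // Proposed ordering: the metastore schema is consulted before the footers.
  // Because orElse is by-name, readSchema() runs only if both earlier options are None.
  def metastoreFirst(maybeDataSchema: Option[Schema],
                     maybeMetastoreSchema: Option[Schema]): Option[Schema] =
    maybeDataSchema
      .orElse(maybeMetastoreSchema)
      .orElse(readSchema())

  // Assumed "before" ordering, reconstructed from the question above:
  // footers are read whenever no data schema is given, even with a metastore schema.
  def footersFirst(maybeDataSchema: Option[Schema],
                   maybeMetastoreSchema: Option[Schema]): Option[Schema] =
    maybeDataSchema
      .orElse(readSchema())
      .orElse(maybeMetastoreSchema)
}
```

With a metastore schema present and no user-specified data schema, metastoreFirst never calls readSchema(), while footersFirst calls it once and ignores the metastore schema for the result.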

> Avoid reading Parquet footers on driver side when an global arbitrative schema is available
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6795
>                 URL: https://issues.apache.org/jira/browse/SPARK-6795
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.1
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>            Priority: Critical
>
> With the help of [Parquet MR PR #91|https://github.com/apache/incubator-parquet-mr/pull/91] which will be included in the official release of Parquet MR 1.6.0, now it's possible to avoid reading footers on the driver side completely when an global arbitrative schema is available.
> Currently, the global schema can be either Hive metastore schema or specified via data sources DDL. All tasks should verify Parquet data files and reconcile possible schema conflicts locally against this global schema.
> However, when no global schema is available and schema merging is enabled, we still need to read schemas from all data files to infer a valid global schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
