Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2015/08/21 03:07:45 UTC

[jira] [Assigned] (SPARK-10143) Parquet changed the behavior of calculating splits

     [ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10143:
------------------------------------

    Assignee:     (was: Apache Spark)

> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
>                 Key: SPARK-10143
>                 URL: https://issues.apache.org/jira/browse/SPARK-10143
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Priority: Critical
>
> When Parquet's task-side metadata is enabled (it is enabled by default, and it must be enabled to handle tables with many files), Parquet delegates the calculation of initial splits to FileInputFormat (see https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311). If the filesystem's block size is smaller than the row group size and users do not set a minimum split size, the initial split list will contain many dummy splits that turn into empty tasks (because the range from a split's starting point to its ending point does not cover the starting point of any row group).
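The mismatch can be sketched with some arithmetic. The following is a minimal simulation, not the actual Parquet or Hadoop code; the file size, block size, and row group size are illustrative assumptions, and a split is counted as useful only if it covers a row group's starting offset, matching the description above.

```python
# Hypothetical illustration of the split/row-group mismatch.
# All sizes below are assumed values, not taken from the JIRA report.

def compute_splits(file_size, block_size):
    """Mimic FileInputFormat's initial splitting: with no minimum
    split size configured, split size collapses to the block size."""
    return [(start, min(start + block_size, file_size))
            for start in range(0, file_size, block_size)]

def non_empty_splits(splits, row_group_starts):
    """A split produces work only if it covers the starting
    offset of at least one row group."""
    return [(lo, hi) for (lo, hi) in splits
            if any(lo <= rg < hi for rg in row_group_starts)]

MB = 1024 * 1024
file_size = 1024 * MB        # assumed 1 GB Parquet file
block_size = 64 * MB         # filesystem block smaller than a row group
row_group_size = 256 * MB    # row group spanning several blocks

row_group_starts = list(range(0, file_size, row_group_size))
splits = compute_splits(file_size, block_size)
useful = non_empty_splits(splits, row_group_starts)

print(len(splits))                 # 16 splits in the initial list
print(len(useful))                 # 4 splits cover a row group start
print(len(splits) - len(useful))   # 12 splits become empty tasks
```

With a 64 MB block size and 256 MB row groups, 12 of the 16 initial splits cover no row group start, so three quarters of the scheduled tasks read nothing.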



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org