Posted to issues@spark.apache.org by "Noam Asor (JIRA)" <ji...@apache.org> on 2017/06/07 08:38:18 UTC
[jira] [Updated] (SPARK-20622) Parquet partition discovery for non key=value named directories
[ https://issues.apache.org/jira/browse/SPARK-20622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Noam Asor updated SPARK-20622:
------------------------------
Priority: Minor (was: Major)
> Parquet partition discovery for non key=value named directories
> ---------------------------------------------------------------
>
> Key: SPARK-20622
> URL: https://issues.apache.org/jira/browse/SPARK-20622
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Noam Asor
> Priority: Minor
>
> h4. Why
> There are cases where traditional M/R jobs and RDD-based Spark jobs write out partitioned parquet in 'value only' named directories, e.g. {{hdfs:///some/base/path/2017/05/06}}, rather than in 'key=value' named directories, e.g. {{hdfs:///some/base/path/year=2017/month=05/day=06}}. This prevents users from leveraging Spark SQL parquet partition discovery when reading the former back.
> h4. What
> This issue proposes a way for Spark SQL to discover parquet partitions in 'value only' named directories.
> h4. How
> By introducing a new Spark SQL read option, *partitionTemplate*.
> *partitionTemplate* takes the form of a path: the base path followed by the missing 'key=' prefixes, one per directory level, serving as a template for transforming 'value only' named dirs into 'key=value' named dirs. For the example above it would look like:
> {{hdfs:///some/base/path/year=/month=/day=/}}.
> To simplify the solution, this option should be tied to the *basePath* option, meaning that *partitionTemplate* is valid only if *basePath* is also set.
> For the above scenario, the read would look something like:
> {code}
> spark.read
> .option("basePath", "hdfs:///some/base/path")
> .option("partitionTemplate", "hdfs:///some/base/path/year=/month=/day=/")
> .parquet(...)
> {code}
> which will allow Spark SQL to do parquet partition discovery on the following directory tree:
> {code}
> some
> |--base
>    |--path
>       |--2016
>       |  |--...
>       |--2017
>          |--01
>          |--02
>          |  |--...
>          |  |--15
>          |  |--...
>          |--...
> {code}
> adding the columns year, month and day, with their respective values, to the schema of the resulting DataFrame, as expected.
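
The path rewriting that the proposed *partitionTemplate* option implies can be sketched in plain Scala. This is only an illustration under stated assumptions, not Spark code: the object name PartitionTemplate and the method applyTemplate are hypothetical, and the sketch assumes that each template segment after the base path ends in '=' and lines up, in order, with one 'value only' directory level.

```scala
// Hypothetical helper, not part of Spark: rewrites a 'value only' path into
// 'key=value' form using a template such as
// "hdfs:///some/base/path/year=/month=/day=/".
object PartitionTemplate {
  def applyTemplate(basePath: String, template: String, path: String): String = {
    val base = basePath.stripSuffix("/")
    require(template.startsWith(base + "/"), "template must extend basePath")
    require(path.startsWith(base + "/"), "path must be under basePath")

    // "year=/month=/day=/" -> Seq("year=", "month=", "day=")
    val keys = template.stripPrefix(base + "/").split("/").filter(_.nonEmpty).toSeq
    require(keys.forall(_.endsWith("=")), "every template segment must end with '='")

    // "2017/05/06/part-0.parquet" -> Seq("2017", "05", "06", "part-0.parquet")
    val segments = path.stripPrefix(base + "/").split("/").filter(_.nonEmpty).toSeq
    require(segments.length >= keys.length, "path has fewer levels than the template")

    // Pair each key with the directory at the same depth; keep the rest untouched.
    val (values, rest) = segments.splitAt(keys.length)
    val partitionDirs = keys.zip(values).map { case (k, v) => k + v }
    (base +: (partitionDirs ++ rest)).mkString("/")
  }
}
```

With the paths from the description, PartitionTemplate.applyTemplate("hdfs:///some/base/path", "hdfs:///some/base/path/year=/month=/day=/", "hdfs:///some/base/path/2017/05/06") yields hdfs:///some/base/path/year=2017/month=05/day=06, which is the form that the existing 'key=value' partition discovery already understands.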