Posted to issues@spark.apache.org by "V Luong (JIRA)" <ji...@apache.org> on 2018/02/19 23:38:00 UTC

[jira] [Updated] (SPARK-23467) Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) with each in-memory partition mapped to 1 physical file partition

     [ https://issues.apache.org/jira/browse/SPARK-23467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

V Luong updated SPARK-23467:
----------------------------
    Description: 
I would like to echo the need described here: [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html]

In many of my use cases, data is appended by date (say, date X) into an S3 subdir s3://bucket/path/to/parquet/date=X, after which I need to run analytics by date. These analytics involve sorts and window functions, but I'm only interested in within-date sorts/windows and don't care about cross-date sorts and windows.
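
For concreteness, the kind of within-date analytics I mean looks roughly like the sketch below (the app name and the "ts" column are placeholders I'm making up for illustration):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder().appName("per-date-analytics").getOrCreate()

// Load the whole date-partitioned dataset; "date" is picked up as a partition column.
val df = spark.read.parquet("s3://bucket/path/to/parquet")

// Within-date ranking: the window is partitioned by "date", so cross-date ordering is
// never needed, yet Spark still shuffles the rows in order to group them by date.
val withinDate = Window.partitionBy("date").orderBy(col("ts"))
val ranked = df.withColumn("rank", row_number().over(withinDate))
{code}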

Currently, if I simply load the entire data set from the parent dir s3://bucket/path/to/parquet and then write Spark SQL statements involving "SORT" or "WINDOW", very expensive shuffles are invoked across all dates. Hence I am exploring ways to write an analytics function per Spark partition and send that function to each partition.
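
The "function per partition" idea I'm exploring is sketched below with RDD.mapPartitions (reusing df from the sketch above). It only produces per-date results if each in-memory partition happens to contain exactly one date, which is not guaranteed today; that gap is what this ticket is about:

{code:scala}
// Send a summarizing function to each in-memory partition. The result shows which
// dates ended up in which partition; correct per-date analytics would need each
// partition to hold exactly one date.
val perPartitionSummary = df.rdd.mapPartitions { rows =>
  val buffered = rows.toList                                  // materialize this partition's rows
  if (buffered.isEmpty) Iterator.empty
  else {
    val datesHere = buffered.map(_.getAs[Any]("date").toString).distinct
    Iterator((datesHere, buffered.size.toLong))               // (dates in this partition, row count)
  }
}
perPartitionSummary.collect().foreach(println)
{code}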

The biggest problem now is that Spark's in-memory partitions do not correspond one-to-one to the physical files loaded from S3, so there is no way to guarantee that the per-partition analytics are applied per date as desired.
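
The mismatch is easy to observe: the file-based sources pack small files together and split large ones (driven by settings like spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes), so the partition count and the file count generally differ. A quick check, again reusing df from above:

{code:scala}
// Physical Parquet files under the path vs. in-memory partitions of the scan.
// Because of file packing/splitting, a partition cannot be assumed to hold
// exactly the contents of one date=X directory.
val numFiles = df.inputFiles.length
val numPartitions = df.rdd.getNumPartitions
println(s"input files = $numFiles, in-memory partitions = $numPartitions")
{code}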

Is there a way we can explicitly enable a direct correspondence between file partitions and in-memory partitions?
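
The best workaround I can see today is to read each date's subdir as its own DataFrame, so every sort/window only ever touches one date's data. A rough sketch, assuming the list of dates to process is already known (the dates and the "ts" column below are placeholders):

{code:scala}
// Workaround: one DataFrame per date, so any shuffle stays confined to a single date.
val dates = Seq("2018-02-01", "2018-02-02")  // placeholder values
dates.foreach { d =>
  val oneDate = spark.read.parquet(s"s3://bucket/path/to/parquet/date=$d")
  val sorted = oneDate.sort("ts")            // within-date sort only
  // ... run the per-date analytics on `sorted` and write the results out ...
}
{code}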

  was:
I would like to echo the need described here: [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html]

In many of my use cases, data is appended by date (say, date X) into an S3 subdir `s3://bucket/path/to/parquet/date=X`, after which I need to run analytics by date. These analytics involve sorts and window functions, but I'm only interested in within-date sorts/windows and don't care about cross-date sorts and windows.

Currently, if I simply load the entire data set from the parent dir `s3://bucket/path/to/parquet` and then write Spark SQL statements involving "SORT" or "WINDOW", very expensive shuffles are invoked across all dates. Hence I am exploring ways to write an analytics function per Spark partition and send that function to each partition.

The biggest problem now is that Spark's in-memory partitions do not correspond one-to-one to the physical files loaded from S3, so there is no way to guarantee that the per-partition analytics are applied per date as desired.

Is there a way we can explicitly enable a direct correspondence between file partitions and in-memory partitions?


> Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) with each in-memory partition mapped to 1 physical file partition
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23467
>                 URL: https://issues.apache.org/jira/browse/SPARK-23467
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 2.2.1
>            Reporter: V Luong
>            Priority: Major
>
> I would like to echo the need described here: [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html]
> In many of my use cases, data is appended by date (say, date X) into an S3 subdir s3://bucket/path/to/parquet/date=X, after which I need to run analytics by date. These analytics involve sorts and window functions, but I'm only interested in within-date sorts/windows and don't care about cross-date sorts and windows.
> Currently, if I simply load the entire data set from the parent dir s3://bucket/path/to/parquet and then write Spark SQL statements involving "SORT" or "WINDOW", very expensive shuffles are invoked across all dates. Hence I am exploring ways to write an analytics function per Spark partition and send that function to each partition.
> The biggest problem now is that Spark's in-memory partitions do not correspond one-to-one to the physical files loaded from S3, so there is no way to guarantee that the per-partition analytics are applied per date as desired.
> Is there a way we can explicitly enable a direct correspondence between file partitions and in-memory partitions?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org