Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:14:19 UTC

[jira] [Resolved] (SPARK-23467) Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) with each in-memory partition mapped to 1 physical file partition

     [ https://issues.apache.org/jira/browse/SPARK-23467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-23467.
----------------------------------
    Resolution: Incomplete

> Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) with each in-memory partition mapped to 1 physical file partition
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23467
>                 URL: https://issues.apache.org/jira/browse/SPARK-23467
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 2.2.1
>            Reporter: V Luong
>            Priority: Major
>              Labels: bulk-closed
>
> I would like to echo the need described here: [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html]. A related ticket is https://issues.apache.org/jira/browse/SPARK-17998.
> In many of my use cases, data is appended by date (say, date D) into an S3 subdir s3://bucket/path/to/parquet/date=D, after which I need to run analytics by date. The analytics involve sorts and window functions, but I'm only interested in within-date sorts/windows and don't care about cross-date ones.
> Currently, if I simply load the entire data set from the parent dir s3://bucket/path/to/parquet and then write Spark SQL statements involving "SORT" or "WINDOW", very expensive shuffles are invoked. Hence I am exploring ways to write an analytics function per Spark partition and ship that function to each partition.
> The biggest problem now is that Spark's in-memory partitions do not correspond to the physical files loaded from S3, so there is no way to guarantee that per-partition analytics are applied per date as desired.
> Is there a way we can explicitly enable a direct correspondence between file partitions and in-memory partitions?
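One workaround, pending a built-in mechanism, is to sidestep the correspondence problem entirely by submitting one read per date subdir, so every sort/window only ever sees a single date's files. The sketch below assumes the s3://bucket/path/to/parquet/date=D layout described above; the date list, the "ts" ordering column, and the output path are all illustrative, not part of the original report.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    object PerDateAnalytics {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("per-date-analytics").getOrCreate()

        // Hypothetical date list; in practice it could be obtained by listing
        // the date=* subdirs of s3://bucket/path/to/parquet.
        val dates = Seq("2018-02-01", "2018-02-02")

        val perDate = dates.map { d =>
          // Reading a single date=D subdir confines the window below to that
          // date's rows, so no cross-date shuffle is ever requested.
          val df = spark.read.parquet(s"s3://bucket/path/to/parquet/date=$d")
          // An un-partitioned window still funnels this date's rows into one
          // task, which is acceptable here because the scope is a single date.
          val w = Window.orderBy(col("ts"))
          df.withColumn("rank", row_number().over(w))
        }

        perDate.reduce(_ union _).write.parquet("s3://bucket/path/to/output")
      }
    }

Separately, on the scan side, Spark packs files into read partitions using spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes; raising openCostInBytes to at least maxPartitionBytes tends to yield one file per in-memory partition, but that is an implementation detail of the file-scan planner rather than a documented contract, so anything built on it (e.g. mapPartitions per file) should verify the resulting partitioning at runtime.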



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
