Posted to issues@spark.apache.org by "ZhangYao (Jira)" <ji...@apache.org> on 2019/08/22 12:48:00 UTC

[jira] [Created] (SPARK-28853) Support conf to organize filePartitions by file path

ZhangYao created SPARK-28853:
--------------------------------

             Summary:  Support conf to organize filePartitions by file path
                 Key: SPARK-28853
                 URL: https://issues.apache.org/jira/browse/SPARK-28853
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 2.4.3
            Reporter: ZhangYao


Dynamically writing data to HDFS may generate a lot of small files, so sometimes we need to merge those files. When reading these files and writing them again, it would be helpful if the RDD partitions of the file read were formed according to the partitions on HDFS.

Currently, in FileSourceScanExec.createNonBucketedReadRDD, after splitting the files Spark sorts them by file size, which may scatter the files of one HDFS partition across different read partitions. It would be a great help to support sorting by file path here :)
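
To illustrate the effect, here is a minimal sketch in plain Python (not Spark code) of a greedy packing step that closes a read partition once the next file would exceed a size limit. It is a simplified model of the behaviour described above, not the actual implementation. The names pack, by_size_desc, by_path and dirs_per_partition, the sample paths and sizes, and the limits are all hypothetical, and "partitions on HDFS" is assumed to mean partition directories such as dt=1.

```python
import posixpath

def pack(files, max_split_bytes, open_cost, sort_key):
    """Greedily pack (path, length) pairs into read partitions.

    Simplified model: files are sorted by sort_key, then appended to the
    current partition until adding the next file would exceed max_split_bytes.
    """
    partitions, current, current_size = [], [], 0
    for path, length in sorted(files, key=sort_key):
        if current and current_size + length > max_split_bytes:
            partitions.append(current)          # close the current partition
            current, current_size = [], 0
        current_size += length + open_cost      # per-file open cost (assumed)
        current.append((path, length))
    if current:
        partitions.append(current)
    return partitions

def by_size_desc(f):
    return -f[1]   # largest files first (the ordering described as current)

def by_path(f):
    return f[0]    # the ordering proposed in this issue

def dirs_per_partition(partitions):
    """For each read partition, the set of parent directories it touches."""
    return [{posixpath.dirname(p) for p, _ in part} for part in partitions]

# Hypothetical small files under two partition directories.
files = [
    ("/t/dt=1/a", 40), ("/t/dt=1/b", 10),
    ("/t/dt=2/a", 40), ("/t/dt=2/b", 10),
]

size_sorted = pack(files, max_split_bytes=50, open_cost=0, sort_key=by_size_desc)
path_sorted = pack(files, max_split_bytes=50, open_cost=0, sort_key=by_path)

print(dirs_per_partition(size_sorted))
print(dirs_per_partition(path_sorted))
```

With this sample input, sorting by size puts files from both directories into one read partition, while sorting by path keeps each read partition inside a single directory. Whether that holds for real data depends on file sizes and on the configured split limits.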



--
This message was sent by Atlassian Jira
(v8.3.2#803003)