Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:24:11 UTC

[jira] [Updated] (SPARK-18556) Suboptimal number of tasks when writing partitioned data with desired number of files per directory

     [ https://issues.apache.org/jira/browse/SPARK-18556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-18556:
---------------------------------
    Labels: bulk-closed  (was: )

> Suboptimal number of tasks when writing partitioned data with desired number of files per directory
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18556
>                 URL: https://issues.apache.org/jira/browse/SPARK-18556
>             Project: Spark
>          Issue Type: Improvement
>    Affects Versions: 2.0.0, 2.0.1, 2.0.2
>            Reporter: Damian Momot
>            Priority: Major
>              Labels: bulk-closed
>
> There is no way to get an optimal number of write tasks when the desired number of files per directory is known. For example, when saving data to HDFS:
> 1. The data is supposed to be partitioned by a column (for example date) and contains, say, 90 different dates
> 2. It is known upfront that each date should be written into X files (for example 4, because of the recommended HDFS/Parquet block size, etc.)
> 3. During processing, the dataset was partitioned into 200 partitions (for example because of some grouping operations)
> Currently we can do:
> {code}
> val data: Dataset[Row] = ???
> data
>   .write
>   .partitionBy("date")
>   .parquet("/xyz")
> {code}
> This will correctly write data into 90 date directories (see point 1), but each directory will contain 200 files (see point 3)
> We can force the number of files by using repartition/coalesce:
> {code}
> val data: Dataset[Row] = ???
> data
>   .repartition(4)
>   .write
>   .partitionBy("date")
>   .parquet("xyz")
> {code}
> This will correctly save 90 directories with 4 files each, but it will be done using only 4 tasks, which is far too slow: 360 files could be written in parallel using 360 tasks
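> 
> A possible workaround (a sketch against the Spark 2.x API, not something this ticket proposes) is to add a synthetic salt column and repartition on the partition column plus the salt, so each date is spread across roughly X partitions and the write can use up to dates * X tasks. The column name "salt" and the constants below are illustrative:
> {code}
> import org.apache.spark.sql.functions.{col, rand}
> 
> val data: Dataset[Row] = ???
> val filesPerDate = 4 // desired files per date directory
> 
> data
>   // assign each row a random bucket in [0, filesPerDate)
>   .withColumn("salt", (rand() * filesPerDate).cast("int"))
>   // hash-partition on (date, salt) into 90 * 4 = 360 partitions,
>   // so the 360 output files can be written by 360 parallel tasks
>   .repartition(360, col("date"), col("salt"))
>   // drop the helper column so it is not written into the files
>   .drop("salt")
>   .write
>   .partitionBy("date")
>   .parquet("/xyz")
> {code}
> Because the (date, salt) pairs are hash-partitioned, collisions make the per-directory file count approximately, rather than exactly, 4; a first-class way to request N files per partition directory would still be preferable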



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org