Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:24:11 UTC
[jira] [Updated] (SPARK-18556) Suboptimal number of tasks when
writing partitioned data with desired number of files per directory
[ https://issues.apache.org/jira/browse/SPARK-18556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-18556:
---------------------------------
Labels: bulk-closed (was: )
> Suboptimal number of tasks when writing partitioned data with desired number of files per directory
> ---------------------------------------------------------------------------------------------------
>
> Key: SPARK-18556
> URL: https://issues.apache.org/jira/browse/SPARK-18556
> Project: Spark
> Issue Type: Improvement
> Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Reporter: Damian Momot
> Priority: Major
> Labels: bulk-closed
>
> Spark cannot currently use an optimal number of write tasks even when the optimal number of files per directory is known in advance. Example:
> When saving data to HDFS:
> 1. The data is to be partitioned by a column (for example date) that contains, say, 90 distinct dates.
> 2. It is known upfront that each date should be written into X files (for example 4, because of the recommended HDFS/Parquet block size etc.).
> 3. During processing, the dataset ended up in 200 partitions (for example because of some grouping operations).
> Currently we can do:
> {code}
> val data: Dataset[Row] = ???
> data
>   .write
>   .partitionBy("date")
>   .parquet("/xyz")
> {code}
> This correctly writes the data into 90 date directories (see point 1), but each directory can contain up to 200 files, one per partition that holds rows for that date (see point 3).
> We can force the number of files by using repartition/coalesce:
> {code}
> val data: Dataset[Row] = ???
> data
>   .repartition(4)
>   .write
>   .partitionBy("date")
>   .parquet("xyz")
> {code}
> This correctly produces 90 directories with 4 files each, but the write runs in only 4 tasks, which is far too slow: the 360 files (90 x 4) could be written in parallel by 360 tasks.
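> A possible workaround can be sketched as follows. This is only an illustration under stated assumptions, not something Spark provides: the helper writeWithFilesPerDate, the temporary "salt" column, and the parameters distinctDates and filesPerDate are hypothetical names. Rows are shuffled into distinctDates * filesPerDate partitions keyed by (date, salt), so each date directory should receive at most filesPerDate files while far more than 4 tasks write in parallel:
> {code}
> import org.apache.spark.sql.{Dataset, Row}
> import org.apache.spark.sql.functions.{col, rand}
>
> // Hypothetical helper, not part of the Spark API.
> def writeWithFilesPerDate(
>     data: Dataset[Row],
>     path: String,
>     distinctDates: Int, // e.g. 90, see point 1
>     filesPerDate: Int   // e.g. 4, see point 2
> ): Unit = {
>   data
>     // Assign each row a random bucket in [0, filesPerDate).
>     .withColumn("salt", (rand() * filesPerDate).cast("int"))
>     // Hash-partition by (date, salt): every pair lands in exactly one task.
>     .repartition(distinctDates * filesPerDate, col("date"), col("salt"))
>     // The salt is only needed for the shuffle, not in the output.
>     .drop("salt")
>     .write
>     .partitionBy("date")
>     .parquet(path)
> }
> {code}
> Because the (date, salt) pairs are hash-partitioned, several pairs can land in the same task, so fewer than 360 tasks may actually do work, and the random salt balances file sizes only approximately.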
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)