Posted to issues@spark.apache.org by "Damian Momot (JIRA)" <ji...@apache.org> on 2016/11/23 06:58:58 UTC
[jira] [Created] (SPARK-18556) Suboptimal number of tasks when writing partitioned data with desired number of files per directory
Damian Momot created SPARK-18556:
------------------------------------
Summary: Suboptimal number of tasks when writing partitioned data with desired number of files per directory
Key: SPARK-18556
URL: https://issues.apache.org/jira/browse/SPARK-18556
Project: Spark
Issue Type: Improvement
Affects Versions: 2.0.2, 2.0.1, 2.0.0
Reporter: Damian Momot
There is no way to get an optimal number of write tasks when the desired number of files per directory is known up front. Example, when saving data to HDFS:
1. The data is supposed to be partitioned by a column (for example date), which contains e.g. 90 distinct dates
2. It is known up front that each date should be written into X files (for example 4, because of the recommended HDFS/Parquet block size etc.)
3. During processing the dataset was partitioned into 200 partitions (for example because of some grouping operations)
Currently we can do:
{code}
val data: Dataset[Row] = ???
data
.write
.partitionBy("date")
.parquet("/xyz")
{code}
This will correctly write data into 90 date directories (see point '1'), but each directory will contain up to 200 files, one per shuffle partition that holds rows for that date (see point '3').
We can force the number of files by using repartition/coalesce:
{code}
val data: Dataset[Row] = ???
data
.repartition(4)
.write
.partitionBy("date")
.parquet("/xyz")
{code}
This will correctly produce 90 directories with 4 files each... but the write will be performed by only 4 tasks, which is far too slow: 360 files could be written in parallel using 360 tasks.
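A possible workaround (a sketch only; the salt column name, the rand()-based assignment and the literal 90 are illustrative assumptions, not part of this issue) is to repartition by the partition column plus a random salt in [0, X), so rows for each date spread across roughly X shuffle partitions and the write can use up to 90 * X tasks:

{code}
import org.apache.spark.sql.functions.{col, rand}

val data: Dataset[Row] = ???
val filesPerDate = 4   // assumed desired files per date directory
val numDates = 90      // assumed distinct dates in the data

// Hash-partition by (date, salt) into 90 * 4 = 360 partitions, so that
// each date's rows land in ~4 partitions and are written by ~360 tasks.
data
  .withColumn("salt", (rand() * filesPerDate).cast("int"))
  .repartition(numDates * filesPerDate, col("date"), col("salt"))
  .drop("salt")
  .write
  .partitionBy("date")
  .parquet("/xyz")
{code}

Note that because hash partitioning can collide, the per-directory file count is approximately (not exactly) 4, which is why a first-class API for this would still be an improvement.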
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org