Posted to issues@spark.apache.org by "Damian Momot (JIRA)" <ji...@apache.org> on 2016/11/23 06:58:58 UTC

[jira] [Created] (SPARK-18556) Suboptimal number of tasks when writing partitioned data with desired number of files per directory

Damian Momot created SPARK-18556:
------------------------------------

             Summary: Suboptimal number of tasks when writing partitioned data with desired number of files per directory
                 Key: SPARK-18556
                 URL: https://issues.apache.org/jira/browse/SPARK-18556
             Project: Spark
          Issue Type: Improvement
    Affects Versions: 2.0.2, 2.0.1, 2.0.0
            Reporter: Damian Momot


There is no way to get an optimal number of write tasks when the desired number of files per directory is known up front. Example:

When saving data to HDFS:

1. The data is supposed to be partitioned by a column (for example date) and contains, say, 90 different dates
2. It is known upfront that each date should be written into X files (for example 4, because of the recommended HDFS/Parquet block size etc.)
3. During processing the dataset was split into 200 partitions (for example because of some grouping operations)

Currently we can do:

{code}
val data: Dataset[Row] = ???

data
  .write
  .partitionBy("date")
  .parquet("/xyz")
{code}

This will properly write data into 90 date directories (see point 1), but each directory will contain 200 files (see point 3).
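For illustration only (the dates and file names below are hypothetical), the resulting layout looks roughly like this:

{code}
/xyz/date=2016-01-01/part-00000-....parquet
/xyz/date=2016-01-01/part-00001-....parquet
...                                         (200 files for this date)
/xyz/date=2016-01-02/part-00000-....parquet
...                                         (and so on for all 90 dates)
{code}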

We can force the number of files by using repartition/coalesce:

{code}
val data: Dataset[Row] = ???

data
  .repartition(4)
  .write
  .partitionBy("date")
  .parquet("xyz")
{code}

This will properly save 90 directories with 4 files each... but it will be done using only 4 tasks, which is far too slow: the 360 files could be written in parallel by 360 tasks.
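One possible workaround (a hedged sketch, not part of this report): add a salt column and repartition by the partition column plus the salt, so that roughly 90 * 4 = 360 tasks write in parallel while each date directory still ends up with about 4 files. The column name "id" below is an assumed well-distributed column, and because of hash collisions the per-directory file count is only approximate:

{code}
// Sketch only: shuffle into ~360 partitions keyed by (date, salt) so many tasks
// write in parallel, while each date directory gets roughly `filesPerDate` files.
// "id" is an assumed well-distributed column; the resulting file counts are
// approximate because several (date, salt) pairs can hash to the same partition.
import org.apache.spark.sql.functions._

val filesPerDate = 4
val dateCount = 90

val salted = data
  .withColumn("salt", pmod(hash(col("id")), lit(filesPerDate)))
  .repartition(dateCount * filesPerDate, col("date"), col("salt"))
  .drop("salt")

salted
  .write
  .partitionBy("date")
  .parquet("/xyz")
{code}

This still does not give exact control over the number of files per directory, which is why a built-in option would be preferable.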


