Posted to issues@spark.apache.org by "Tomohiro Tanaka (Jira)" <ji...@apache.org> on 2020/02/05 03:15:00 UTC

[jira] [Created] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

Tomohiro Tanaka created SPARK-30735:
---------------------------------------

             Summary: Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter
                 Key: SPARK-30735
                 URL: https://issues.apache.org/jira/browse/SPARK-30735
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.4.4, 2.4.3
         Environment: * Spark-3.0.0
 * Scala: version 2.12.10
 * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
 * Java: 1.8.0_231
 ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
 ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
            Reporter: Tomohiro Tanaka
             Fix For: 3.0.0, 3.1.0


h1. New functionality for {{partitionBy}}

To enhance writing performance with {{partitionBy}}, it helps to call the {{repartition}} method on the partition columns before calling {{partitionBy}}. I added a new overload to {{partitionBy}}: {color:#0747a6}{{partitionBy(<true | false>, columns)}}{color}.

 
h2. Problems when not using {{repartition}} before {{partitionBy}}

When using {{partitionBy}}, the following problems happen because rows sharing the same values in the specified partition columns are scattered across many Spark partitions:
 * A Spark application that uses {{partitionBy}} takes much longer (for example, [partitionBy taking too long while saving a dataset on S3 using Pyspark - Stack Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
 * When using {{partitionBy}}, memory usage is much higher than without it (I tested this with Spark 2.4.3, comparing the following two cases):
 ** Not using {{repartition}} before {{partitionBy}}
 ** Using {{repartition}} before {{partitionBy}}
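Today this has to be done by hand, which is what the proposal would fold into {{partitionBy}}. A minimal sketch of the existing-API workaround, assuming a DataFrame {{df}} that has a {{date}} column (the column name and output path are illustrative):
{code:java}
import org.apache.spark.sql.functions.col

// Cluster rows with the same "date" value into the same Spark partition
// before writing, so each write task only opens a writer for one (or few)
// output directories instead of one per distinct "date" it happens to see.
df.repartition(col("date"))
  .write
  .format("json")
  .partitionBy("date")
  .save("/tmp/output"){code}
Without the {{repartition}}, each task may hold an open writer for every distinct {{date}} value in its partition, which drives up memory usage and produces many small output files.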

h2. How to use?

It's very simple: if you want {{repartition}} to run before {{partitionBy}}, just pass {color:#0747a6}{{true}}{color} as the first argument of {{partitionBy}}.

Example:
{code:java}
val df = spark.read.format("csv").option("header", true).load(<INPUT_PATH>)
df.write.format("json").partitionBy(true, columns).save(<OUTPUT_PATH>){code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org