Posted to issues@spark.apache.org by "Tomohiro Tanaka (Jira)" <ji...@apache.org> on 2020/02/05 03:15:00 UTC
[jira] [Created] (SPARK-30735) Improving writing performance by
adding repartition based on columns to partitionBy for DataFrameWriter
Tomohiro Tanaka created SPARK-30735:
---------------------------------------
Summary: Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter
Key: SPARK-30735
URL: https://issues.apache.org/jira/browse/SPARK-30735
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 2.4.4, 2.4.3
Environment: * Spark-3.0.0
* Scala: version 2.12.10
* sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
* Java: 1.8.0_231
** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
Reporter: Tomohiro Tanaka
Fix For: 3.0.0, 3.1.0
h1. New functionality for {{partitionBy}}
To improve write performance with {{partitionBy}}, it helps to call {{repartition}} on the same columns before calling {{partitionBy}}. I added a new overload to {{partitionBy}}: {color:#0747a6}{{partitionBy(<true | false>, columns)}}{color}.
h2. Problems when not using {{repartition}} before {{partitionBy}}.
When {{partitionBy}} is used without a preceding {{repartition}}, the following problems occur because rows with the same values in the specified columns are scattered across partitions.
* A Spark application that includes {{partitionBy}} takes much longer to run (for example, [python - partitionBy taking too long while saving a dataset on S3 using Pyspark - Stack Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
* Memory usage is much higher with {{partitionBy}} than without it (tested with Spark 2.4.3, as follows).
** Not using repartition before partitionBy:
** Using repartition before partitionBy
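The effect of scattered rows can be illustrated with a toy model that needs no Spark. This is only a sketch under a stated assumption: each write task creates one output file per distinct partition-column value it holds. The object and method names ({{PartitionedWriteModel}}, {{countFiles}}, {{roundRobin}}, {{byKey}}) are hypothetical and exist only for this illustration.

```scala
// Toy model (no Spark needed): count the files a partitioned write would create.
// Assumption: each task writes one file per distinct partition-column value it holds.
object PartitionedWriteModel {
  /** Total output files across all tasks. */
  def countFiles(tasks: Seq[Seq[String]]): Int =
    tasks.map(_.distinct.size).sum

  /** Rows scattered round-robin: the same key ends up in many tasks. */
  def roundRobin(keys: Seq[String], numTasks: Int): Seq[Seq[String]] =
    keys.zipWithIndex
      .groupBy { case (_, i) => i % numTasks }
      .values
      .map(_.map(_._1))
      .toSeq

  /** Rows grouped by key hash, roughly what repartition on the columns does. */
  def byKey(keys: Seq[String], numTasks: Int): Seq[Seq[String]] =
    keys.groupBy(k => Math.floorMod(k.hashCode, numTasks)).values.toSeq
}
```

With 1000 rows, 10 distinct keys and 8 tasks, the round-robin layout yields more files than the key-grouped layout, which yields exactly one file per key. Fewer open files per task is one plausible reason for the lower memory usage and shorter run time reported above.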
h2. How to use?
It's very simple. To repartition before {{partitionBy}}, specify {color:#0747a6}{{true}}{color} as the first argument to {{partitionBy}}.
Example:
{code:scala}
val df = spark.read.format("csv").option("header", true).load(<INPUT_PATH>)
df.write.format("json").partitionBy(true, columns).save(<OUTPUT_PATH>)
{code}
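For comparison, roughly the same behavior can be written with the existing API by calling {{repartition}} explicitly on the partition columns before the write. The sketch below assumes Spark SQL is on the classpath. {{writePartitioned}} and its parameters are hypothetical names used only for illustration; they are not part of Spark or of this proposal.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object RepartitionThenPartitionBy {
  // Hypothetical helper: repartition on the same columns that partitionBy uses,
  // so rows sharing a partition value are handled by the same task before the write.
  def writePartitioned(df: DataFrame, columns: Seq[String], outputPath: String): Unit =
    df.repartition(columns.map(col): _*)
      .write
      .format("json")
      .partitionBy(columns: _*)
      .save(outputPath)
}
```

Calling {{RepartitionThenPartitionBy.writePartitioned(df, Seq("year", "month"), outputPath)}} would correspond to {{partitionBy(true, columns)}} in the proposal, assuming {{year}} and {{month}} are columns of {{df}}. The explicit {{repartition}} adds a shuffle, so it is a trade-off rather than a guaranteed speed-up.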