Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2020/02/05 21:50:00 UTC

[jira] [Comment Edited] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

    [ https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031065#comment-17031065 ] 

Dongjoon Hyun edited comment on SPARK-30735 at 2/5/20 9:49 PM:
---------------------------------------------------------------

Hi, [~tom_tanaka]. Thank you for filing a JIRA and making a PR.
Since this seems to be your first time, here is some information.

- https://spark.apache.org/contributing.html

According to the guideline above, we set `Fix Version` only when we finally merge, so please keep it empty. Also, we don't allow backporting of new features; if merged, your contribution will land in Apache Spark 3.1.0, so you should use `3.1.0` for `Affected Version`. In other words, a new improvement or feature cannot affect old versions. Finally, `Target Version` is reserved for committers, so please keep it empty, too.

I'll adjust the fields accordingly. Thanks.



> Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30735
>                 URL: https://issues.apache.org/jira/browse/SPARK-30735
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>         Environment: * Spark-3.0.0
>  * Scala: version 2.12.10
>  * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
>  * Java: 1.8.0_231
>  ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
>  ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>            Reporter: Tomohiro Tanaka
>            Priority: Trivial
>              Labels: performance, pull-request-available
>         Attachments: repartition-before-partitionby.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To improve write performance with {{partitionBy}}, it helps to call the {{repartition}} method on the partition columns before calling {{partitionBy}}. I added a new form, {color:#0747a6}{{partitionBy(<true | false>, columns)}}{color}, to {{partitionBy}}.
>  
> h2. Problems when not using {{repartition}} before {{partitionBy}}
> When using {{partitionBy}}, the following problems happen because the values of the columns specified in {{partitionBy}} are scattered across many partitions:
>  * A Spark application that uses {{partitionBy}} takes much longer (for example, [partitionBy taking too long while saving a dataset on S3 using Pyspark - Stack Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
>  * Memory usage increases significantly compared with not using {{partitionBy}} (I tested this with Spark 2.4.3, as follows).
>  * For the memory-usage impact of {{partitionBy}}, please check the attachment (the left figure shows "using partitionBy", the right shows "not using partitionBy").
> h2. How to use?
> It's very simple. If you want {{repartition}} to run before {{partitionBy}}, just specify {color:#0747a6}{{true}}{color} in {{partitionBy}}.
> Example:
> {code:java}
> val df = spark.read.format("csv").option("header", true).load(<INPUT_PATH>)
> df.write.format("json").partitionBy(true, columns).save(<OUTPUT_PATH>){code}
>  
>  
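To see why repartitioning on the partition columns before {{partitionBy}} reduces the number of output files, here is a small stand-alone sketch (plain Scala, no Spark required; the task and key counts are illustrative assumptions, not measurements from the issue). The model is that each write task emits one file per distinct partition-column value it holds, so a key scattered across many tasks multiplies the file count:

```scala
// Toy model of dynamic-partition writes: every task produces one
// output file per distinct partition-key value it contains.
object PartitionFiles {
  // partitions: the rows held by each task, represented by their
  // partition-column values. Returns the total number of files written.
  def filesWritten(partitions: Seq[Seq[String]]): Int =
    partitions.map(_.distinct.size).sum

  def main(args: Array[String]): Unit = {
    val keys = Seq("a", "b", "c", "d")

    // Without repartition: every one of 8 tasks holds a mix of all 4 keys.
    val scattered = Seq.fill(8)(keys)

    // With repartition on the key column: each key lands in a single task.
    val grouped = keys.map(k => Seq.fill(8)(k))

    println(filesWritten(scattered)) // 32 files (8 tasks x 4 keys)
    println(filesWritten(grouped))   // 4 files (one per key)
    assert(filesWritten(scattered) == 32)
    assert(filesWritten(grouped) == 4)
  }
}
```

In current Spark the same effect can already be had, without the proposed flag, by calling something like {{df.repartition(cols: _*).write.partitionBy(colNames: _*).save(path)}}, which shuffles each key into a single task before the write begins.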



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org