Posted to issues@spark.apache.org by "Sujit Das (Jira)" <ji...@apache.org> on 2020/09/22 16:50:00 UTC
[jira] [Created] (SPARK-32966) Spark| PartitionBy is taking long time to process
Sujit Das created SPARK-32966:
---------------------------------
Summary: Spark| PartitionBy is taking long time to process
Key: SPARK-32966
URL: https://issues.apache.org/jira/browse/SPARK-32966
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 2.4.5
Environment: EMR - 5.30.0; Hadoop -2.8.5; Spark- 2.4.5
Reporter: Sujit Das
1. When I do a write without any partition, it takes 8 min:
df2_merge.write.mode('overwrite').parquet(dest_path)
2. I added the conf spark.sql.sources.partitionOverwriteMode=dynamic; it took much longer (more than 50 min before I force-terminated the EMR cluster). The partitions had been created and the data files were present, but the EMR cluster still showed the step as running, whereas the Spark history server showed no running or pending jobs:
df2_merge.write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
3. I then set a new conf, spark.sql.shuffle.partitions=3; it took 24 min:
df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
4. Finally I disabled that conf and ran the partitioned write again. It took 30 min:
df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
The only conf common to all the scenarios above is spark.sql.adaptive.coalescePartitions.initialPartitionNum=100.
My goal is to reduce the time of writing with partitionBy. Is there anything I am missing?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org