Posted to mapreduce-dev@hadoop.apache.org by "Igor Dvorzhak (JIRA)" <ji...@apache.org> on 2019/02/11 22:53:00 UTC
[jira] [Created] (MAPREDUCE-7185) Parallelize part files move in FileOutputCommitter
Igor Dvorzhak created MAPREDUCE-7185:
----------------------------------------
Summary: Parallelize part files move in FileOutputCommitter
Key: MAPREDUCE-7185
URL: https://issues.apache.org/jira/browse/MAPREDUCE-7185
Project: Hadoop Map/Reduce
Issue Type: Improvement
Affects Versions: 2.9.2, 3.2.0
Reporter: Igor Dvorzhak
Attachments: MAPREDUCE-7185.patch
If a map task writes multiple output files, moving them from the temporary directory to the output directory can be slow on object stores.
To improve performance, FileOutputCommitter should move the files in parallel whenever a task produces more than one file.
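To illustrate the idea, here is a minimal sketch of moving several files concurrently with a thread pool. It is not the attached patch: it uses java.nio.file on a local filesystem as a stand-in for Hadoop FileSystem renames, and the class name ParallelMoveSketch, the method moveAll, and the threads parameter are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; not code from FileOutputCommitter.
final class ParallelMoveSketch {

    // Moves every regular file directly under srcDir into dstDir and
    // returns the number of files moved.
    static int moveAll(Path srcDir, Path dstDir, int threads)
            throws IOException, InterruptedException {
        Files.createDirectories(dstDir);

        List<Path> files;
        try (Stream<Path> listing = Files.list(srcDir)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        // With zero or one file a thread pool adds nothing, so move serially.
        if (files.size() <= 1) {
            for (Path f : files) {
                Files.move(f, dstDir.resolve(f.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
            return files.size();
        }

        int poolSize = Math.max(1, Math.min(threads, files.size()));
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            // Submit one move per file so slow per-file operations overlap.
            List<Future<Path>> pending = new ArrayList<>();
            for (Path f : files) {
                Callable<Path> task = () -> Files.move(
                        f, dstDir.resolve(f.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
                pending.add(pool.submit(task));
            }
            // Wait for every move; surface the first failure as an IOException.
            for (Future<Path> result : pending) {
                try {
                    result.get();
                } catch (ExecutionException e) {
                    throw new IOException("Parallel move failed", e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
        return files.size();
    }
}
```

Calling ParallelMoveSketch.moveAll(tempDir, outputDir, 8) would move all files under tempDir using up to 8 threads. The benefit is expected mainly where each move has high latency, as with object stores; on a local disk the difference is likely small.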
Repro:
Start spark-shell:
{code:bash}
spark-shell --num-executors 2 --executor-memory 10G --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=2
{code}
From spark-shell:
{code:scala}
val df = (1 to 10000).toList.toDF("value").withColumn("p", $"value" % 10).repartition(50)
df.write.partitionBy("p").mode("overwrite").format("parquet").options(Map("path" -> s"gs://some/path")).saveAsTable("parquet_partitioned_bench")
{code}
With the fix, execution time drops from 130 seconds to 50 seconds.