Posted to user@spark.apache.org by Luke Swift <lu...@googlemail.com> on 2017/03/29 13:34:58 UTC

Issues with partitionBy method on DataFrameWriter, Spark 2.0.2

Hello, I am trying to write Parquet files from a DataFrame. I am able to
use partitionBy("year", "month", "day"), and Spark correctly partitions
the data physically into the directory structure I expect.
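
For reference, the write looks roughly like this (a minimal sketch; the
DataFrame name, input source and output paths are placeholders, not the
actual job code):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("partitioned-write").getOrCreate()

  // Assumed input; the real job builds the DataFrame differently, but it
  // does contain the year, month and day columns used for partitioning.
  val df = spark.read.parquet("s3://my-bucket/input/events/")

  // Partitioned write, as described above.
  df.write
    .partitionBy("year", "month", "day")
    .parquet("s3://my-bucket/output/events_partitioned/")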

The issue is that when the partitions themselves are anything non-trivial
in size, memory usage seems to blow up and I get a lot of GC pressure on
my cluster. There is a lot of red in the Executors tab of the web UI, in
the GC Time column, for all the executors. If I try to coalesce the
DataFrame to get reasonably sized output files, the job falls over due to
GC pressure.
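
The coalesce attempt is along these lines (again a sketch; the target
partition count is illustrative only):

  // Coalesce to fewer tasks before the partitioned write, to get
  // reasonably sized output files; this is where the job falls over.
  df.coalesce(50)
    .write
    .partitionBy("year", "month", "day")
    .parquet("s3://my-bucket/output/events_partitioned/")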

Removing the partitionBy and writing directly to the output destination
alleviates the problem; however, we would like to keep this functionality
to improve our query performance in engines like Hive.
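
For comparison, the unpartitioned write that runs without trouble is
simply (output path again a placeholder):

  // Same data, no partitionBy; this completes without GC problems.
  df.write.parquet("s3://my-bucket/output/events_flat/")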

I am running Spark 2.0.2 on EMR 5.3.1, using fairly large c3.4xlarge
nodes with 30 GB of RAM each; every executor gets 5.5 GB. I saw some
previous mails about a similar issue, but that was back in the Spark 1.4
days and it seems to have been resolved, yet I still hit this problem.

Any help would be appreciated.