Posted to user@spark.apache.org by Daniel Haviv <da...@veracity-group.com> on 2016/10/20 05:27:05 UTC

partitionBy produces wrong number of tasks

Hi,
I have a case where I use partitionBy to write my DF using a calculated
column, so it looks something like this:

val df = spark.sql(
  "select *, from_unixtime(ts, 'yyyyMMddH') partition_key from mytable")

df.write.partitionBy("partition_key").orc("/partitioned_table")


df has 8 partitions (spark.sql.shuffle.partitions is set to 8) and
partition_key usually has only 1 or 2 distinct values.
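
(For reference, a quick sketch of how those numbers can be checked, using
the same df and column as above:)

df.rdd.getNumPartitions                       // number of partitions backing df
df.select("partition_key").distinct().count() // distinct values of partition_key
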
When the write action begins, it is split into 330 tasks and takes much
longer than it should. If I switch to the following code instead, it works
as expected with 8 tasks:

df.createTempView("tab")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("insert into partitioned_table select * from tab")



Any idea why this is happening?
How does partitionBy decide how to repartition the DF?


Thank you,
Daniel