Posted to user@spark.apache.org by Rishi Shah <ri...@gmail.com> on 2019/12/08 04:11:09 UTC

[pyspark 2.4.0] write with overwrite mode fails

Hi All,

df = spark.read.csv(PATH)
# Overwrite only the partitions present in df, not the whole directory.
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
df.repartition('col1', 'col2') \
  .write.mode('overwrite') \
  .partitionBy('col1') \
  .parquet(OUT_PATH)

This works fine and overwrites the partitioned directory as expected.

However, the overwrite does not happen when the previous run was abruptly
interrupted and the partition directory contains only a _started flag file,
with no _SUCCESS or _committed marker. In that case the second run does not
overwrite the existing files, leaving the partition with duplicates. Could
someone please help?
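
For reference, here is a rough sketch of one possible cleanup before the
second run: scan OUT_PATH and delete any partition directory that has a
_started marker but no _committed or _SUCCESS marker, treating it as a
leftover from the interrupted run. This is only a sketch, not something I
have validated; it assumes the markers sit inside each partition directory
as described above, and it reaches Hadoop's FileSystem through Spark's
internal py4j gateway (_jvm / _jsc), which is one possible route from
pyspark.

# Sketch: drop partition directories left behind by an interrupted run
# before re-writing. Assumes commit markers (_started*, _committed*,
# _SUCCESS) live inside each partition directory.
jvm = spark.sparkContext._jvm
hconf = spark.sparkContext._jsc.hadoopConfiguration()
out = jvm.org.apache.hadoop.fs.Path(OUT_PATH)
fs = out.getFileSystem(hconf)

if fs.exists(out):
    for status in fs.listStatus(out):
        if not status.isDirectory():
            continue
        part_dir = status.getPath()
        names = [f.getPath().getName() for f in fs.listStatus(part_dir)]
        started = any(n.startswith('_started') for n in names)
        committed = (any(n.startswith('_committed') for n in names)
                     or '_SUCCESS' in names)
        if started and not committed:
            # Partition left over from an interrupted run; remove it so
            # the next dynamic-overwrite write starts clean.
            fs.delete(part_dir, True)  # True = recursive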

-- 
Regards,

Rishi Shah