Posted to user@spark.apache.org by Rishi Shah <ri...@gmail.com> on 2019/12/08 04:11:09 UTC
[pyspark 2.4.0] write with overwrite mode fails
Hi All,
df = spark.read.csv(PATH)
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
df.repartition('col1', 'col2') \
    .write.mode('overwrite') \
    .partitionBy('col1') \
    .parquet(OUT_PATH)
This works fine and overwrites the partitioned directories as expected.

However, it does not overwrite when the previous run was abruptly
interrupted and a partition directory contains only a _started marker
file, with no _SUCCESS or _committed. In that case the second run does
not overwrite the partition, leaving it with duplicate files. Could
someone please help?
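[Editor's note: a possible workaround, sketched below, is to scan the output directory before re-running the job and delete any partition directory that looks half-written, i.e. one that contains a _started marker but neither _SUCCESS nor _committed. This is an illustrative sketch only: the marker-file names are taken from the report above, it uses the local filesystem (a real deployment on HDFS/S3 would use the Hadoop FileSystem API or dbutils instead), and `clean_incomplete_partitions` is a hypothetical helper, not a Spark API.]

```python
import os
import shutil

def clean_incomplete_partitions(base_dir):
    """Remove partition directories left behind by an interrupted write.

    A subdirectory of base_dir is treated as incomplete when it contains
    a file whose name starts with '_started' but no '_SUCCESS' or
    '_committed' marker. Returns the list of directories removed.
    (Sketch only; marker names are assumptions based on the report.)
    """
    removed = []
    for entry in sorted(os.listdir(base_dir)):
        part_dir = os.path.join(base_dir, entry)
        if not os.path.isdir(part_dir):
            continue
        files = os.listdir(part_dir)
        started = any(f.startswith('_started') for f in files)
        committed = any(f.startswith('_SUCCESS') or f.startswith('_committed')
                        for f in files)
        if started and not committed:
            shutil.rmtree(part_dir)
            removed.append(part_dir)
    return removed
```

Running this against OUT_PATH before the write would clear the stale `col1=...` directories so the subsequent dynamic-overwrite run starts clean.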
--
Regards,
Rishi Shah