Posted to user@spark.apache.org by Serega Sheypak <se...@gmail.com> on 2019/02/05 18:15:20 UTC
Spark 2.x duplicates output when task fails at "repartition" stage.
Checkpointing is enabled before repartition.
Hi, I have a Spark job that produces duplicate output when one or more tasks
from the repartition stage fail.
Here is the simplified code:
sparkContext.setCheckpointDir("hdfs://path-to-checkpoint-dir")

val inputRDDs: List[RDD[String]] = List.empty // an RDD per input dir
val updatedRDDs = inputRDDs.map { inputRDD => // some stuff happens here
  inputRDD
    .filter(???)
    .map(???)
}

val unionOfUpdatedRDDs = sparkContext.union(updatedRDDs)
unionOfUpdatedRDDs.checkpoint() // it didn't help

unionOfUpdatedRDDs
  .repartition(42) // task failed here,
  .saveAsNewAPIHadoopFile("/path") // task failed here too.

// what really causes duplicates in output?
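A few observations, hedged since I can only guess from the simplified snippet.
First, checkpoint() is lazy: it materializes on the first action, and here the
only action is the save that comes after repartition, so the checkpoint cannot
shield the repartition shuffle from recomputation. Second, a known cause of
exactly this symptom in Spark 2.x is that repartition distributes records
round-robin, and the iteration order of a task's input is not guaranteed to be
the same across attempts; when a task is retried, records can land in
different partitions than on the first attempt, duplicating some and dropping
others (tracked upstream as SPARK-23207; Spark 2.4 works around it by sorting
records before the round-robin distribution). One way to sidestep that with
the RDD API is to make the record-to-partition mapping a pure function of the
record itself. A sketch along those lines, reusing unionOfUpdatedRDDs and the
partition count 42 from your snippet:

    // Hedged sketch, not a verified drop-in fix: hash-partitioning keys the
    // shuffle on the record itself, so a retried task sends every record to
    // the same partition it went to on the first attempt.
    import org.apache.spark.HashPartitioner

    val deterministicallyRepartitioned = unionOfUpdatedRDDs
      .map(record => (record, null))        // key by the record itself
      .partitionBy(new HashPartitioner(42)) // same record -> same partition
      .keys

    // then save as before (the output committer must still be idempotent):
    // deterministicallyRepartitioned.saveAsNewAPIHadoopFile("/path")

The trade-off is that partition sizes now follow the key distribution rather
than being balanced round-robin, but the mapping is stable across retries.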