Posted to user@spark.apache.org by sim <si...@swoop.com> on 2015/07/18 03:14:34 UTC

Cleanup when tasks generate errors

I've observed a number of cases where Spark does not clean up its HDFS side
effects after errors, especially out-of-memory conditions. Here is an example,
produced by the following code snippet executed in spark-shell:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

ctx.jsonFile("file:///test_data/*/*/*/*.gz")
  .saveAsTable("test_data", SaveMode.Overwrite)
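
For reference, the same job written against the DataFrameReader/DataFrameWriter
API (Spark 1.4+) would be roughly the following; I assume it goes through the
same _temporary output handling and so hits the same issue:

  import org.apache.spark.sql.hive.HiveContext
  import org.apache.spark.sql.SaveMode

  val ctx = sqlContext.asInstanceOf[HiveContext]

  // Read the gzipped JSON and overwrite the Hive table via DataFrameWriter.
  ctx.read.json("file:///test_data/*/*/*/*.gz")
    .write
    .mode(SaveMode.Overwrite)
    .saveAsTable("test_data")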
First run: saveAsTable terminates with an out-of-memory exception.

Second run (with more RAM for the driver and executors): fails with many
variations of

  java.lang.RuntimeException:
  hdfs://localhost:54310/user/hive/warehouse/test_data/_temporary/0/_temporary/attempt_201507171538_0008_r_000021_0/part-r-00022.parquet
  is not a Parquet file (too small)

Third run (after hdfs dfs -rm -r hdfs:///user/hive/warehouse/test_data):
succeeds.
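
For now the only workaround I have is deleting the stale output by hand before
retrying, either with the hdfs command above or scripted from spark-shell via
the Hadoop FileSystem API, roughly:

  import org.apache.hadoop.fs.{FileSystem, Path}

  // Path taken from the error message above; adjust to your warehouse location.
  val stale = new Path("hdfs://localhost:54310/user/hive/warehouse/test_data")
  val fs = stale.getFileSystem(sc.hadoopConfiguration)
  if (fs.exists(stale)) {
    fs.delete(stale, true)  // recursive delete, equivalent to hdfs dfs -rm -r
  }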
What are the best practices for dealing with these types of cleanup
failures? Do they tend to come in known varieties?
Thanks,
Sim



