Posted to user@spark.apache.org by "David P. Kleinschmidt" <da...@kleinschmidt.name> on 2015/10/30 13:44:46 UTC
Spark Streaming (1.5.0) flaky when recovering from checkpoint
I have a Spark Streaming job that runs great the first time around (Elastic
MapReduce 4.1.0), but when recovering from a checkpoint in S3, the job runs
but Spark itself seems to be jacked-up in lots of little ways:
- Executors, which are normally stable for days, are terminated within a
couple hours. I can see the termination notices in the logs, but no related
exceptions. The nodes are active in YARN, but Spark doesn't pick them up
again.
- Hadoop web proxy can't find Spark web UI ("no route to host")
- When I get to the web UI, the Streaming tab is missing
- The web UI appears to stop updating after a few thousand jobs
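For context, the driver follows the standard checkpoint-recovery pattern, i.e. StreamingContext.getOrCreate. Here is a minimal sketch of that pattern (the app name, batch interval, checkpoint path, and DStream logic below are placeholders, not the actual job):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "s3://my-bucket/checkpoints"  # placeholder path

def create_context():
    # Only called when no checkpoint exists; builds the DStream graph fresh.
    sc = SparkContext(appName="MyStreamingApp")
    ssc = StreamingContext(sc, batchDuration=60)
    # ... define input DStreams and transformations here ...
    ssc.checkpoint(CHECKPOINT_DIR)
    return ssc

# On restart, getOrCreate reconstructs the context (and DStream graph)
# from the checkpoint instead of calling create_context again.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()
```

The flakiness only shows up on the getOrCreate-from-checkpoint path, never on a fresh start.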
I'm kind of at my wit's end here. I've been banging my head against this for a couple of weeks now, and any help would be greatly appreciated. Below is the configuration that I'm sending to EMR.
- dpk
[
  {
    "Classification": "emrfs-site",
    "Properties": {"fs.s3.consistent": "true"}
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.default.parallelism": "8",
      "spark.dynamicAllocation.enabled": "true",
      "spark.dynamicAllocation.minExecutors": "1",
      "spark.executor.cores": "4",
      "spark.executor.memory": "4148M",
      "spark.streaming.receiver.writeAheadLog.enable": "true",
      "spark.yarn.executor.memoryOverhead": "460"
    }
  },
  {
    "Classification": "spark-env",
    "Configurations": [{
      "Classification": "export",
      "Properties": {"SPARK_YARN_MODE": "true"}
    }]
  },
  {
    "Classification": "spark-log4j",
    "Properties": {"log4j.rootCategory": "WARN,console"}
  }
]
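In case it helps anyone reproduce this, I generate that configuration file with a small script and pass it to `aws emr create-cluster --configurations file://emr-config.json` (the file name is arbitrary):

```python
import json

# Build the same EMR configuration list shown above and dump it as JSON.
config = [
    {"Classification": "emrfs-site",
     "Properties": {"fs.s3.consistent": "true"}},
    {"Classification": "spark-defaults",
     "Properties": {
         "spark.default.parallelism": "8",
         "spark.dynamicAllocation.enabled": "true",
         "spark.dynamicAllocation.minExecutors": "1",
         "spark.executor.cores": "4",
         "spark.executor.memory": "4148M",
         "spark.streaming.receiver.writeAheadLog.enable": "true",
         "spark.yarn.executor.memoryOverhead": "460",
     }},
    {"Classification": "spark-env",
     "Configurations": [{"Classification": "export",
                         "Properties": {"SPARK_YARN_MODE": "true"}}]},
    {"Classification": "spark-log4j",
     "Properties": {"log4j.rootCategory": "WARN,console"}},
]

with open("emr-config.json", "w") as f:
    json.dump(config, f, indent=2)
```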