Posted to user@spark.apache.org by "David P. Kleinschmidt" <da...@kleinschmidt.name> on 2015/10/30 13:44:46 UTC

Spark Streaming (1.5.0) flaky when recovering from checkpoint

I have a Spark Streaming job that runs great the first time around (Elastic
MapReduce 4.1.0), but when recovering from a checkpoint in S3, the job keeps
running while Spark itself misbehaves in lots of little ways:

   - Executors, which are normally stable for days, are terminated within a
   couple of hours. I can see the termination notices in the logs, but no
   related exceptions. The nodes remain active in YARN, but Spark doesn't
   pick them up again.
   - Hadoop web proxy can't find Spark web UI ("no route to host")
   - When I get to the web UI, the Streaming tab is missing
   - The web UI appears to stop updating after a few thousand jobs
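For reference, the recovery path in question is the standard
StreamingContext.getOrCreate pattern from the Spark Streaming docs. A minimal
sketch in PySpark (the checkpoint path, app name, and batch interval here are
illustrative, not the actual job's values):

```python
# Sketch of checkpoint-based recovery, assuming the job uses the
# standard StreamingContext.getOrCreate pattern. All names below
# (bucket, app name, batch interval) are hypothetical.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "s3://my-bucket/checkpoints"  # hypothetical path

def create_context():
    sc = SparkContext(appName="my-streaming-job")
    ssc = StreamingContext(sc, 60)  # 60-second batches (illustrative)
    # The full DStream graph must be built here, inside the setup
    # function, before checkpointing is enabled.
    ssc.checkpoint(CHECKPOINT_DIR)
    return ssc

# Cold start: calls create_context(). Restart: rebuilds the context
# from the checkpoint data in S3 and ignores create_context().
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()
```

One known gotcha with this pattern is that configuration baked into the
checkpoint is reused on restart, so spark-defaults changes made between runs
may not take effect after recovery.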

I'm at my wits' end here. I've been banging my head against this for a
couple of weeks now, and any help would be greatly appreciated. Below is the
configuration that I'm sending to EMR.

- dpk

[
  {
    "Classification": "emrfs-site",
    "Properties": {"fs.s3.consistent": "true"}
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.default.parallelism": "8",
      "spark.dynamicAllocation.enabled": "true",
      "spark.dynamicAllocation.minExecutors": "1",
      "spark.executor.cores": "4",
      "spark.executor.memory": "4148M",
      "spark.streaming.receiver.writeAheadLog.enable": "true",
      "spark.yarn.executor.memoryOverhead": "460"
    }
  },
  {
    "Classification": "spark-env",
    "Configurations": [{
        "Classification": "export",
        "Properties": {"SPARK_YARN_MODE": "true"}
    }]
  },
  {
    "Classification": "spark-log4j",
    "Properties": {"log4j.rootCategory": "WARN,console"}
  }
]
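(As a sanity check on the memory settings above: the 4148M executor heap plus
the 460M YARN overhead should be what each executor container requests. The
YARN minimum-allocation value below is an assumption, not something from my
cluster config.)

```python
# Sanity check on the executor container footprint implied by the
# spark-defaults above. yarn_min_alloc_mb is an assumed value for
# yarn.scheduler.minimum-allocation-mb; adjust for your cluster.
import math

executor_memory_mb = 4148   # spark.executor.memory
memory_overhead_mb = 460    # spark.yarn.executor.memoryOverhead
yarn_min_alloc_mb = 256     # assumed YARN minimum allocation

requested_mb = executor_memory_mb + memory_overhead_mb
# YARN rounds each container request up to a multiple of the
# minimum allocation.
container_mb = math.ceil(requested_mb / yarn_min_alloc_mb) * yarn_min_alloc_mb

print(requested_mb)   # 4608
print(container_mb)   # 4608 (already a multiple of 256)
```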