Posted to user@spark.apache.org by Jeff Schecter <je...@levelmoney.com> on 2014/12/10 02:35:49 UTC

Workers keep dying on EC2 Spark cluster: PriviledgedActionException

Hi Spark users,

I've been attempting to get flambo
<https://github.com/yieldbot/flambo/blob/develop/README.md>, a Clojure
library for Spark, working with my codebase. After getting things to build
with this very simple interface:

(ns sharknado.core
  (:require [flambo.conf :as conf]
            [flambo.api :as spark]))

;; Build a SparkConf pointing at the given master URL.
(defn configure [master-url app-name]
  (-> (conf/spark-conf)
      (conf/master master-url)
      (conf/app-name app-name)))

;; Create a SparkContext from that conf.
(defn get-context [master-url app-name]
  (spark/spark-context (configure master-url app-name)))



I run in the lein repl:

(use 'sharknado.core)
(def cx (get-context "spark://MASTER-URL.compute-1.amazonaws.com:7077" "flambo-test"))


This connects to the master and successfully creates an app; however, the
app's workers all die after several seconds.

It looks like user Saiph Kappa had similar problems about a month ago.
Someone suggested that the cluster and the submitted Spark application
might be using different versions of Spark; that's definitely not the
case here. I've tried with both 1.1.0 and 1.1.1 on both ends.

With Spark 1.1.0, after all workers die, the application exits.

With Spark 1.1.1, after each worker dies, another is automatically created;
at the moment the app detail screen in the UI is showing 150 exited and 5
running workers.

Anyone have any ideas? Example trace from a worker below.
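One thing I plan to try: the 30-second wait in the trace matches the default Akka ask timeout listed in the Spark 1.1 configuration docs, so I may bump the Akka timeouts in conf/spark-defaults.conf on the cluster. (Whether these settings actually govern the driverPropsFetcher wait, rather than it being hardcoded, is an assumption on my part; the values below are arbitrary.)

```properties
# conf/spark-defaults.conf - candidate timeout bumps (Spark 1.1 property names)
# Values are seconds; defaults are 30, 30, and 100 respectively per the docs.
spark.akka.askTimeout      120
spark.akka.lookupTimeout   120
spark.akka.timeout         200
```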

Thanks,

Jeff

14/12/10 01:22:09 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
14/12/10 01:22:10 INFO spark.SecurityManager: Changing view acls to: root,Jeff
14/12/10 01:22:10 INFO spark.SecurityManager: Changing modify acls to: root,Jeff
14/12/10 01:22:10 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, Jeff); users with modify permissions: Set(root, Jeff)
14/12/10 01:22:10 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/12/10 01:22:10 INFO Remoting: Starting remoting
14/12/10 01:22:10 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@ip-address.ec2.internal:49050]
14/12/10 01:22:10 INFO Remoting: Remoting now listens on addresses: [akka.tcp://driverPropsFetcher@ip-address.ec2.internal:49050]
14/12/10 01:22:10 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 49050.
14/12/10 01:22:40 ERROR security.UserGroupInformation: PriviledgedActionException as:Jeff cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
	... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
	at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
	at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
	at scala.concurrent.Await$.result(package.scala:107)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
	... 7 more