Posted to user@spark.apache.org by Luis Ángel Vicente Sánchez <la...@gmail.com> on 2014/06/16 19:25:20 UTC

Worker dies while submitting a job

I'm playing with a modified version of the TwitterPopularTags example, and
when I try to submit the job to my cluster, the workers keep dying with this
message:

14/06/16 17:11:16 INFO DriverRunner: Launch Command: "java" "-cp"
"/opt/spark-1.0.0-bin-hadoop1/work/driver-20140616171115-0014/spark-test-0.1-SNAPSHOT.jar:::/opt/spark-1.0.0-bin-hadoop1/conf:/opt/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar"
"-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M"
"org.apache.spark.deploy.worker.DriverWrapper"
"akka.tcp://sparkWorker@int-spark-worker:51676/user/Worker"
"org.apache.spark.examples.streaming.TwitterPopularTags"
14/06/16 17:11:17 ERROR OneForOneStrategy: FAILED (of class scala.Enumeration$Val)
scala.MatchError: FAILED (of class scala.Enumeration$Val)
    at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:317)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/06/16 17:11:17 INFO Worker: Starting Spark worker int-spark-app-ie005d6a3.mclabs.io:51676 with 2 cores, 6.5 GB RAM
14/06/16 17:11:17 INFO Worker: Spark home: /opt/spark-1.0.0-bin-hadoop1
14/06/16 17:11:17 INFO WorkerWebUI: Started WorkerWebUI at http://int-spark-app-ie005d6a3.mclabs.io:8081
14/06/16 17:11:17 INFO Worker: Connecting to master spark://int-spark-app-ie005d6a3.mclabs.io:7077...
14/06/16 17:11:17 ERROR Worker: Worker registration failed: Attempted to re-register worker at same address: akka.tcp://sparkWorker@int-spark-app-ie005d6a3.mclabs.io:51676

This happens when the worker receives a DriverStateChanged(driverId, state,
exception) message.

To deploy the job, I copied the jar file to the temporary folder of the
master node and executed the following command:

./spark-submit \
--class org.apache.spark.examples.streaming.TwitterPopularTags \
--master spark://int-spark-master:7077 \
--deploy-mode cluster \
file:///tmp/spark-test-0.1-SNAPSHOT.jar

I don't really know what the problem could be, as there is a 'case _' that
should prevent it :S

Re: Worker dies while submitting a job

Posted by Shivani Rao <ra...@gmail.com>.

That error typically means that there is a communication error (wrong
ports) between the master and the worker. Also check whether the worker has
"write" permissions to create the "work" directory. We were getting this
error for one of those two reasons.
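
Both checks above can be scripted. The following is only a rough sketch, assuming the master host and port (int-spark-master:7077) and the Spark home shown earlier in this thread; WorkerChecks, canConnect and canWrite are hypothetical helper names, not part of Spark. It should be run on the worker machine as the user that runs the worker process. Note that canWrite only tests a directory that already exists, so if "work" has not been created yet, point it at the Spark home instead.

```scala
import java.net.{InetSocketAddress, Socket}
import java.nio.file.{Files, Paths}
import scala.util.Try

// Hypothetical helper, not part of Spark: two quick sanity checks for a worker node.
object WorkerChecks {

  // True if a TCP connection to host:port can be opened within timeoutMs.
  def canConnect(host: String, port: Int, timeoutMs: Int = 2000): Boolean =
    Try {
      val socket = new Socket()
      try socket.connect(new InetSocketAddress(host, port), timeoutMs)
      finally socket.close()
    }.isSuccess

  // True if dir is an existing directory that the current user may write to.
  def canWrite(dir: String): Boolean = {
    val path = Paths.get(dir)
    Files.isDirectory(path) && Files.isWritable(path)
  }

  def main(args: Array[String]): Unit = {
    // Host, port and path are taken from the logs in this thread; adjust as needed.
    println(s"master reachable:  ${canConnect("int-spark-master", 7077)}")
    println(s"work dir writable: ${canWrite("/opt/spark-1.0.0-bin-hadoop1/work")}")
  }
}
```

A successful TCP connection only shows that the port is open; it does not prove that the master and worker agree on hostnames, which the "re-register worker at same address" message may also point to.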



On Tue, Jun 17, 2014 at 10:04 AM, Luis Ángel Vicente Sánchez <
langel.groups@gmail.com> wrote:

> I have been able to submit a job successfully, but I had to configure my
> Spark job this way:
>
>   val sparkConf: SparkConf =
>     new SparkConf()
>       .setAppName("TwitterPopularTags")
>       .setMaster("spark://int-spark-master:7077")
>       .setSparkHome("/opt/spark")
>       .setJars(Seq("/tmp/spark-test-0.1-SNAPSHOT.jar"))
>
> Now I'm getting this error on my worker:
>
> 14/06/17 17:03:40 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
>


-- 
Software Engineer
Analytics Engineering Team@ Box
Mountain View, CA

Re: Worker dies while submitting a job

Posted by Luis Ángel Vicente Sánchez <la...@gmail.com>.

I have been able to submit a job successfully, but I had to configure my
Spark job this way:

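  import org.apache.spark.SparkConf

  // Master URL, Spark home and application jar are set in code
  // instead of being passed through spark-submit.
  val sparkConf: SparkConf =
    new SparkConf()
      .setAppName("TwitterPopularTags")
      .setMaster("spark://int-spark-master:7077")
      .setSparkHome("/opt/spark")
      .setJars(Seq("/tmp/spark-test-0.1-SNAPSHOT.jar"))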

Now I'm getting this error on my worker:

14/06/17 17:03:40 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient memory




2014-06-17 17:36 GMT+01:00 Luis Ángel Vicente Sánchez <
langel.groups@gmail.com>:

> Ok... I was checking the wrong version of that file yesterday. My worker
> is sending a DriverStateChanged(_, DriverState.FAILED, _), but there is no
> case branch for that state and the worker is crashing. I still don't know
> why I'm getting a FAILED state, but I'm sure it kills the actor through a
> scala.MatchError.
>
> In Scala it is usually best practice to match on a sealed trait with case
> classes/objects instead of an enumeration (the compiler will then complain
> about missing cases); I think this should be refactored so that this kind
> of error is caught at compile time.
>
> Now I need to find out why that state-changed message is sent... I will
> keep updating this thread until I find the problem :D

Re: Worker dies while submitting a job

Posted by Luis Ángel Vicente Sánchez <la...@gmail.com>.

Ok... I was checking the wrong version of that file yesterday. My worker is
sending a DriverStateChanged(_, DriverState.FAILED, _), but there is no case
branch for that state and the worker is crashing. I still don't know why
I'm getting a FAILED state, but I'm sure it kills the actor through a
scala.MatchError.

In Scala it is usually best practice to match on a sealed trait with case
classes/objects instead of an enumeration (the compiler will then complain
about missing cases); I think this should be refactored so that this kind of
error is caught at compile time.
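
To illustrate the difference, here is a minimal sketch assuming a simplified three-state model. EnumState, SealedState and MatchDemo are hypothetical names made up for this example; they are not Spark's actual DriverState or Worker code.

```scala
import scala.util.Try

// Hypothetical Enumeration-based state (illustrative only, not Spark's DriverState).
object EnumState extends Enumeration {
  val RUNNING, FINISHED, FAILED = Value
}

// The same states modelled as a sealed trait with case objects.
sealed trait SealedState
case object Running extends SealedState
case object Finished extends SealedState
case object Failed extends SealedState

object MatchDemo {

  // The compiler does not check Enumeration matches for exhaustiveness,
  // so the missing FAILED branch only shows up at runtime as a scala.MatchError.
  def describeEnum(state: EnumState.Value): String = state match {
    case EnumState.RUNNING  => "running"
    case EnumState.FINISHED => "finished"
  }

  // With a sealed trait, removing any of these cases makes the compiler
  // warn that the match may not be exhaustive.
  def describeSealed(state: SealedState): String = state match {
    case Running  => "running"
    case Finished => "finished"
    case Failed   => "failed"
  }

  def main(args: Array[String]): Unit = {
    // Wrapped in Try so the demo prints the failure instead of crashing.
    println(Try(describeEnum(EnumState.FAILED)))
    println(describeSealed(Failed))
  }
}
```

The exhaustiveness check on the sealed trait is a warning by default; it only stops the build if warnings are treated as errors.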

Now I need to find out why that state-changed message is sent... I will
keep updating this thread until I find the problem :D

