Posted to dev@spark.apache.org by Romi Kuntsman <ro...@totango.com> on 2015/11/01 17:08:23 UTC
Some spark apps fail with "All masters are unresponsive", while
others pass normally
[Adding the dev list since this is probably a bug, but I'm not sure how to
reproduce it in order to open a bug report.]
Hi,
I have a standalone Spark 1.4.0 cluster with 100s of applications running
every day.
From time to time, applications crash with the error shown below.
At the same time (and afterwards), other applications keep running,
so I assume the master and workers are working.
1. Why is there a NullPointerException? (I can't trace the Scala stack
trace back to the code, but an NPE is usually an obvious bug, even if the
underlying cause is a network error.)
2. Why can't it connect to the master? (If it's a network timeout, how can I
increase it? The values appear to be hardcoded inside AppClient.)
3. How can I recover from this error?
ERROR 01-11 15:32:54,991 SparkDeploySchedulerBackend - Application has been killed. Reason: All masters are unresponsive! Giving up. ERROR
ERROR 01-11 15:32:55,087 OneForOneStrategy - ERROR
logs/error.log
java.lang.NullPointerException NullPointerException
    at org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
ERROR 01-11 15:32:55,603 SparkContext - Error initializing SparkContext. ERROR
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
    at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
    at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
    at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
Thanks!
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
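On question 3 (recovering from the error), one application-side option is to retry the failing startup step with a short backoff. The following is only a minimal sketch under stated assumptions: the `Retry` class, `withRetries`, and its parameters are hypothetical names, not part of Spark, and the thread does not confirm that a retry in the same JVM works on Spark 1.4.0 after a failed context initialization.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not a Spark API): retries a startup step with linear backoff. */
final class Retry {
    private Retry() {}

    /**
     * Calls factory up to maxAttempts times. After a failed attempt n it sleeps
     * backoffMillis * n before trying again, and rethrows the last failure if
     * every attempt fails.
     */
    static <T> T withRetries(Callable<T> factory, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return factory.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated startup step that fails twice before succeeding.
        String value = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated startup failure");
            }
            return "context-ready";
        }, 5, 10L);
        System.out.println(value + " after " + calls[0] + " attempts");
    }
}
```

In an application, the factory could be something like `() -> new JavaSparkContext(conf)`. If a failed initialization leaves state behind in the JVM, resubmitting the whole application from the scheduler may be the safer form of retry.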
Re: Some spark apps fail with "All masters are unresponsive", while
others pass normally
Posted by Tim Preece <te...@mail.com>.
Searching shows that several people have hit this same NPE at AppClient.scala
line 160 (perhaps because appID was null; could the application have been
stopped before it was registered?)
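The hypothesis above (a stop arriving before registration completes) can be illustrated with a toy model. This is a sketch only: `ToyClient`, `Master`, and the method names are hypothetical and are not the actual AppClient code. It shows how a field that is only set on registration produces an NPE when a stop is handled first, and how a null check avoids it.

```java
/** Hypothetical stand-in for a client that learns its master only on registration. */
final class ToyClient {
    /** Minimal message sink standing in for a reference to the master. */
    interface Master {
        void send(String message);
    }

    private Master master; // stays null until register() is called
    private String appId;  // likewise null before registration

    void register(Master m, String id) {
        master = m;
        appId = id;
    }

    /** Unguarded stop: throws NullPointerException if called before register(). */
    void stopUnguarded() {
        master.send("unregister " + appId);
    }

    /** Guarded stop: notifies the master only if one is known; returns whether it did. */
    boolean stopGuarded() {
        if (master == null) {
            return false;
        }
        master.send("unregister " + appId);
        return true;
    }

    public static void main(String[] args) {
        ToyClient client = new ToyClient();
        try {
            client.stopUnguarded();
        } catch (NullPointerException e) {
            System.out.println("unguarded stop before registration: NPE");
        }
        System.out.println("guarded stop sent a message: " + client.stopGuarded());
    }
}
```

Whether the real failure at AppClient.scala:160 follows this pattern is not established in this thread.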
Re: Some spark apps fail with "All masters are unresponsive", while
others pass normally
Posted by Romi Kuntsman <ro...@totango.com>.
I didn't see anything about an OOM.
This sometimes happens before the application has done anything, and it
hits a few applications at the same time, so I guess it's a
communication failure. The problem is that the error shown doesn't
reflect the actual cause (which may be a network timeout, etc.).
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Mon, Nov 9, 2015 at 6:00 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:
> Did you find anything regarding the OOM in the executor logs?
>
> Thanks
> Best Regards
>
> On Mon, Nov 9, 2015 at 8:44 PM, Romi Kuntsman <ro...@totango.com> wrote:
>
>> If they have a problem managing memory, shouldn't there be an OOM error?
>> Why does AppClient throw an NPE?
>>
>> *Romi Kuntsman*, *Big Data Engineer*
>> http://www.totango.com
>>
>> On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>>
>>> Is that all you have in the executor logs? I suspect some of those jobs
>>> are having a hard time managing the memory.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman <ro...@totango.com> wrote:
>>>
>>>> [adding dev list since it's probably a bug, but i'm not sure how to
>>>> reproduce so I can open a bug about it]
>>>>
>>>> Hi,
>>>>
>>>> I have a standalone Spark 1.4.0 cluster with 100s of applications
>>>> running every day.
>>>>
>>>> From time to time, the applications crash with the following error (see
>>>> below)
>>>> But at the same time (and also after that), other applications are
>>>> running, so I can safely assume the master and workers are working.
>>>>
>>>> 1. why is there a NullPointerException? (i can't track the scala stack
>>>> trace to the code, but anyway NPE is usually a obvious bug even if there's
>>>> actually a network error...)
>>>> 2. why can't it connect to the master? (if it's a network timeout, how
>>>> to increase it? i see the values are hardcoded inside AppClient)
>>>> 3. how to recover from this error?
>>>>
>>>>
>>>> ERROR 01-11 15:32:54,991 SparkDeploySchedulerBackend - Application
>>>> has been killed. Reason: All masters are unresponsive! Giving up. ERROR
>>>> ERROR 01-11 15:32:55,087 OneForOneStrategy - ERROR
>>>> logs/error.log
>>>> java.lang.NullPointerException NullPointerException
>>>> at
>>>> org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
>>>> at
>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>>> at
>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>>> at
>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>>> at
>>>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
>>>> at
>>>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>>>> at
>>>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>>> at
>>>> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>>> at
>>>> org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
>>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>>> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>>> at
>>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>>>> at
>>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>> at
>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>> at
>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>> at
>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>> ERROR 01-11 15:32:55,603 SparkContext - Error
>>>> initializing SparkContext. ERROR
>>>> java.lang.IllegalStateException: Cannot call methods on a stopped
>>>> SparkContext
>>>> at org.apache.spark.SparkContext.org
>>>> $apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
>>>> at
>>>> org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
>>>> at
>>>> org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
>>>> at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
>>>> at
>>>> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
>>>>
>>>>
>>>> Thanks!
>>>>
>>>> *Romi Kuntsman*, *Big Data Engineer*
>>>> http://www.totango.com
>>>>
>>>
>>>
>>
>
Re: Some spark apps fail with "All masters are unresponsive", while
others pass normally
Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Did you find anything regarding the OOM in the executor logs?
Thanks
Best Regards
On Mon, Nov 9, 2015 at 8:44 PM, Romi Kuntsman <ro...@totango.com> wrote:
> If they have a problem managing memory, shouldn't there be an OOM error?
> Why does AppClient throw an NPE?
>
> *Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
>
> On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> Is that all you have in the executor logs? I suspect some of those jobs
>> are having a hard time managing the memory.
>>
>> Thanks
>> Best Regards
>>
>> On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman <ro...@totango.com> wrote:
>>
>>> [adding dev list since it's probably a bug, but i'm not sure how to
>>> reproduce so I can open a bug about it]
>>>
>>> Hi,
>>>
>>> I have a standalone Spark 1.4.0 cluster with 100s of applications
>>> running every day.
>>>
>>> From time to time, the applications crash with the following error (see
>>> below)
>>> But at the same time (and also after that), other applications are
>>> running, so I can safely assume the master and workers are working.
>>>
>>> 1. why is there a NullPointerException? (i can't track the scala stack
>>> trace to the code, but anyway NPE is usually a obvious bug even if there's
>>> actually a network error...)
>>> 2. why can't it connect to the master? (if it's a network timeout, how
>>> to increase it? i see the values are hardcoded inside AppClient)
>>> 3. how to recover from this error?
>>>
>>>
>>> ERROR 01-11 15:32:54,991 SparkDeploySchedulerBackend - Application
>>> has been killed. Reason: All masters are unresponsive! Giving up. ERROR
>>> ERROR 01-11 15:32:55,087 OneForOneStrategy - ERROR
>>> logs/error.log
>>> java.lang.NullPointerException NullPointerException
>>> at
>>> org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
>>> at
>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>> at
>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>> at
>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>> at
>>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
>>> at
>>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>>> at
>>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>> at
>>> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>> at
>>> org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>> at
>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>>> at
>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> at
>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>> at
>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>> at
>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> ERROR 01-11 15:32:55,603 SparkContext - Error
>>> initializing SparkContext. ERROR
>>> java.lang.IllegalStateException: Cannot call methods on a stopped
>>> SparkContext
>>> at org.apache.spark.SparkContext.org
>>> $apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
>>> at
>>> org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
>>> at
>>> org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
>>> at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
>>> at
>>> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
>>>
>>>
>>> Thanks!
>>>
>>> *Romi Kuntsman*, *Big Data Engineer*
>>> http://www.totango.com
>>>
>>
>>
>
Re: Some spark apps fail with "All masters are unresponsive", while
others pass normally
Posted by Romi Kuntsman <ro...@totango.com>.
If they have a problem managing memory, shouldn't there be an OOM error?
Why does AppClient throw an NPE?
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:
> Is that all you have in the executor logs? I suspect some of those jobs
> are having a hard time managing the memory.
>
> Thanks
> Best Regards
>
> On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman <ro...@totango.com> wrote:
>
>> [adding dev list since it's probably a bug, but i'm not sure how to
>> reproduce so I can open a bug about it]
>>
>> Hi,
>>
>> I have a standalone Spark 1.4.0 cluster with 100s of applications running
>> every day.
>>
>> From time to time, the applications crash with the following error (see
>> below)
>> But at the same time (and also after that), other applications are
>> running, so I can safely assume the master and workers are working.
>>
>> 1. why is there a NullPointerException? (i can't track the scala stack
>> trace to the code, but anyway NPE is usually a obvious bug even if there's
>> actually a network error...)
>> 2. why can't it connect to the master? (if it's a network timeout, how to
>> increase it? i see the values are hardcoded inside AppClient)
>> 3. how to recover from this error?
>>
>>
>> ERROR 01-11 15:32:54,991 SparkDeploySchedulerBackend - Application
>> has been killed. Reason: All masters are unresponsive! Giving up. ERROR
>> ERROR 01-11 15:32:55,087 OneForOneStrategy - ERROR
>> logs/error.log
>> java.lang.NullPointerException NullPointerException
>> at
>> org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
>> at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>> at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>> at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>> at
>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
>> at
>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>> at
>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>> at
>> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>> at
>> org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>> at
>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>> at
>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> at
>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> at
>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> ERROR 01-11 15:32:55,603 SparkContext - Error
>> initializing SparkContext. ERROR
>> java.lang.IllegalStateException: Cannot call methods on a stopped
>> SparkContext
>> at org.apache.spark.SparkContext.org
>> $apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
>> at
>> org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
>> at
>> org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
>> at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
>> at
>> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
>>
>>
>> Thanks!
>>
>> *Romi Kuntsman*, *Big Data Engineer*
>> http://www.totango.com
>>
>
>
Re: Some spark apps fail with "All masters are unresponsive", while
others pass normally
Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Is that all you have in the executor logs? I suspect some of those jobs are
having a hard time managing the memory.
Thanks
Best Regards
On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman <ro...@totango.com> wrote:
> [adding dev list since it's probably a bug, but i'm not sure how to
> reproduce so I can open a bug about it]
>
> Hi,
>
> I have a standalone Spark 1.4.0 cluster with 100s of applications running
> every day.
>
> From time to time, the applications crash with the following error (see
> below)
> But at the same time (and also after that), other applications are
> running, so I can safely assume the master and workers are working.
>
> 1. why is there a NullPointerException? (i can't track the scala stack
> trace to the code, but anyway NPE is usually a obvious bug even if there's
> actually a network error...)
> 2. why can't it connect to the master? (if it's a network timeout, how to
> increase it? i see the values are hardcoded inside AppClient)
> 3. how to recover from this error?
>
>
> ERROR 01-11 15:32:54,991 SparkDeploySchedulerBackend - Application
> has been killed. Reason: All masters are unresponsive! Giving up. ERROR
> ERROR 01-11 15:32:55,087 OneForOneStrategy - ERROR
> logs/error.log
> java.lang.NullPointerException NullPointerException
> at
> org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
> at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> at
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
> at
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> at
> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> at
> org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
> at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> ERROR 01-11 15:32:55,603 SparkContext - Error
> initializing SparkContext. ERROR
> java.lang.IllegalStateException: Cannot call methods on a stopped
> SparkContext
> at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
> at
> org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
> at
> org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
> at
> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
>
>
> Thanks!
>
> *Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
>