Posted to user@spark.apache.org by lokeshkumar <lo...@dataken.net> on 2019/02/27 14:57:40 UTC
Spark 2.4.0 Master going down
Hi All,
We are running Spark 2.4.0 with a few Spark Streaming jobs listening on
Kafka topics. We receive an average of 10-20 messages per second.
The Spark master has been going down after 1-2 hours of running, and the
Spark executors also get killed.
This was not happening with Spark 2.1.1; it started happening with Spark
2.4.0. Any help or suggestions are appreciated.
The exception that we see is:
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from 192.168.43.167:40007 in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
    at scala.util.Try$.apply(Try.scala:192)
    at scala.util.Failure.recover(Try.scala:216)
    at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
    at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
    at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
    at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
    at scala.concurrent.Promise$class.complete(Promise.scala:55)
    at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:157)
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
    at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
    at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:157)
    at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:206)
    at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:243)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply from 192.168.43.167:40007 in 120 seconds
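The exception message above names spark.rpc.askTimeout as the setting behind the 120-second limit. As a rough sketch only, the snippet below assembles spark-submit "--conf" flags for that property and for spark.network.timeout, which Spark uses as the fallback when askTimeout is not set. The helper name and the 300s values are illustrative assumptions, not something suggested in this thread, and a longer timeout may only delay the failure rather than explain why 192.168.43.167:40007 stopped replying.

```python
# Sketch only: build spark-submit "--conf" flags for the RPC timeout settings.
# The function name and the 300s defaults are hypothetical, chosen for illustration.

def timeout_conf_args(ask_timeout="300s", network_timeout="300s"):
    """Return spark-submit arguments that set the two timeout properties."""
    settings = {
        # Property named in the RpcTimeoutException message.
        "spark.rpc.askTimeout": ask_timeout,
        # Fallback used by Spark when spark.rpc.askTimeout is not configured.
        "spark.network.timeout": network_timeout,
    }
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Print the flags as they would appear on a spark-submit command line.
print(" ".join(timeout_conf_args()))
```

The resulting flags would be placed between spark-submit and the application jar; the same keys could instead go into spark-defaults.conf.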
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org
Re: Spark 2.4.0 Master going down
Posted by Lokesh Kumar Padhnavis <lo...@dataken.net>.
Hi Akshay,
Thanks for the response. Please find the answers to your questions below.
1. We are running Spark in cluster mode, using Spark's standalone cluster
manager.
2. All the ports are open. We preconfigure the ports on which
communication should happen and modify the firewall rules to allow traffic
on these ports. (Everything works fine until the Spark master goes down
after 60 minutes.)
3. Memory consumption of all the components:
Spark Master:
   S0     S1      E      O      M    CCS   YGC     YGCT  FGC    FGCT      GCT
 0.00   0.00  12.91  35.11  97.08  95.80     5    0.239    2   0.197    0.436
Spark Worker:
   S0     S1      E      O      M    CCS   YGC     YGCT  FGC    FGCT      GCT
51.64   0.00  46.66  27.44  97.57  95.85    10    0.381    2   0.233    0.613
Spark Submit Process (Driver):
   S0     S1      E      O      M    CCS   YGC     YGCT  FGC    FGCT      GCT
 0.00  63.57  93.82  26.29  98.24  97.53  4663  124.648  109  20.910  145.558
Spark executor (Coarse grained):
   S0     S1      E      O      M    CCS   YGC     YGCT  FGC    FGCT      GCT
 0.00  69.77  17.74  31.13  95.67  90.44  7353  556.888    5   1.572  558.460
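To make the four rows above easier to compare, here is a small sketch that derives average pause times from them. It assumes the columns follow the `jstat -gcutil` layout, where YGC and FGC are young and full collection counts and YGCT, FGCT and GCT are cumulative seconds; the post does not name the tool, so that reading is an assumption. The function names are hypothetical.

```python
# Sketch only: derive average GC pause times from the rows posted above.
# Assumes the `jstat -gcutil` column layout: YGC/FGC are collection counts,
# YGCT/FGCT/GCT are cumulative seconds spent in GC.

HEADER = "S0 S1 E O M CCS YGC YGCT FGC FGCT GCT".split()

# Values copied from the message, one row per JVM.
ROWS = {
    "master": "0.00 0.00 12.91 35.11 97.08 95.80 5 0.239 2 0.197 0.436",
    "worker": "51.64 0.00 46.66 27.44 97.57 95.85 10 0.381 2 0.233 0.613",
    "driver": "0.00 63.57 93.82 26.29 98.24 97.53 4663 124.648 109 20.910 145.558",
    "executor": "0.00 69.77 17.74 31.13 95.67 90.44 7353 556.888 5 1.572 558.460",
}


def parse(row):
    """Map each column name to its numeric value."""
    return dict(zip(HEADER, map(float, row.split())))


def avg_ms(total_seconds, count):
    """Average pause in milliseconds, or 0.0 when no collection happened."""
    return 1000.0 * total_seconds / count if count else 0.0


def summarize(row):
    stats = parse(row)
    return {
        "total_gc_s": stats["GCT"],
        "avg_young_pause_ms": avg_ms(stats["YGCT"], stats["YGC"]),
        "avg_full_pause_ms": avg_ms(stats["FGCT"], stats["FGC"]),
    }


for name, row in ROWS.items():
    s = summarize(row)
    print(f"{name:9s} total={s['total_gc_s']:8.3f}s "
          f"young={s['avg_young_pause_ms']:6.1f}ms "
          f"full={s['avg_full_pause_ms']:6.1f}ms")
```

Under that reading, the executor's young collections average roughly 76 ms each and the driver's roughly 27 ms. The JVM uptimes are not given, so these totals cannot be turned into a GC-overhead percentage from this data alone.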
On Thu, Feb 28, 2019 at 3:13 PM Akshay Bhardwaj <
akshay.bhardwaj1988@gmail.com> wrote:
> Hi Lokesh,
>
> Please provide further information to help identify the issue.
>
> 1) Are you running in standalone mode or cluster mode? If cluster, is it
> Spark master/slave or YARN/Mesos?
> 2) Have you tried checking if all ports between your master and the
> machine with IP 192.168.43.167 are accessible?
> 3) Have you checked the memory consumption of the executors/driver running
> in the cluster?
>
>
> Akshay Bhardwaj
> +91-97111-33849
--
Regards
-Lokesh
Re: Spark 2.4.0 Master going down
Posted by Akshay Bhardwaj <ak...@gmail.com>.
Hi Lokesh,
Please provide further information to help identify the issue.
1) Are you running in standalone mode or cluster mode? If cluster, is it
Spark master/slave or YARN/Mesos?
2) Have you tried checking if all ports between your master and the machine
with IP 192.168.43.167 are accessible?
3) Have you checked the memory consumption of the executors/driver running
in the cluster?
Akshay Bhardwaj
+91-97111-33849
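For question 2 above, one way to check a single port from the master host is a plain TCP connection attempt. This is a minimal sketch using only Python's standard library; the function name and the 3-second default are hypothetical choices.

```python
# Sketch only: test whether a TCP port accepts connections from this machine.
import socket


def port_is_reachable(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

For example, port_is_reachable("192.168.43.167", 40007) would probe the address from the exception in the original post. A successful connection only shows that the port is open at that moment; it does not show that the Spark RPC endpoint behind it is answering requests in time.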