Posted to user@spark.apache.org by ANDREA SPINA <74...@studenti.unimore.it> on 2016/06/28 13:04:19 UTC

Issue with Spark on 25 nodes cluster

Hello everyone,

I am running some experiments with Spark 1.4.0 on a ~80 GiB dataset stored on HDFS 2.7.1. The environment is a 25-node cluster with 16 cores per node. I set the following params:

spark.master = "spark://"${runtime.hostname}":7077"

# 28 GiB of memory
spark.executor.memory = "28672m"
spark.worker.memory = "28672m"
spark.driver.memory = "2048m"

spark.driver.maxResultSize = "0"
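For reference, the same settings map onto a plain conf/spark-defaults.conf roughly as follows (a sketch, with a placeholder master host; note that spark.worker.memory is not a documented Spark property — standalone worker memory is normally set through SPARK_WORKER_MEMORY in conf/spark-env.sh):

```
# conf/spark-defaults.conf (equivalent sketch; <master-host> is a placeholder)
spark.master               spark://<master-host>:7077
spark.executor.memory      28672m
spark.driver.memory        2048m
spark.driver.maxResultSize 0
```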

I am running scaling experiments, varying the number of machines. Runs with the full set of nodes (25) succeed, as do runs with 20 nodes, but experiments in 5-node and 10-node environments fail relentlessly. During the run, the Spark executors begin to accumulate failed tasks across different stages, and the job finally aborts with the following trace:

16/06/28 03:11:09 INFO DAGScheduler: Job 14 failed: reduce at
sGradientDescent.scala:229, took 1778.508309 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 212 in stage 14.0 failed 4 times, most recent
failure: Lost task 212.3 in stage 14.0 (TID 12278, 130.149.21.19):
java.io.IOException: Connection from /130.149.21.16:35997 closed
at org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
at org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
at io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:738)
at io.netty.channel.AbstractChannel$AbstractUnsafe$6.run(AbstractChannel.java:606)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

The full Master log is available here:
<https://dl.dropboxusercontent.com/u/78598929/spark-hadoop-org.apache.spark.deploy.master.Master-1-cloud-11.log>
In addition, each Worker receives signal SIGTERM: 15.

I can't figure out a solution.
Thank you. Regards,

Andrea


-- 
*Andrea Spina*
N.Tessera: *74598*
MAT: *89369*
*Ingegneria Informatica* *[LM] *(D.M. 270)

Re: Issue with Spark on 25 nodes cluster

Posted by ANDREA SPINA <74...@studenti.unimore.it>.
Hi,
I solved it by increasing the Akka timeout.
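For anyone hitting the same failure, the relevant timeouts in Spark 1.4 can be raised through settings like the following (illustrative values, not necessarily the ones I used):

```
# spark-defaults.conf fragment (sketch); defaults are 100 and 120s respectively
spark.akka.timeout     600
spark.network.timeout  600s
```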
All the best,
