Posted to user@spark.apache.org by "igor.berman" <ig...@gmail.com> on 2018/04/12 08:48:27 UTC
Driver aborts on Mesos when unable to connect to one of external
shuffle services
Hi,
Any input on whether the following is expected?
The driver starts and is unable to connect to the external shuffle service
on one of the nodes (whatever the reason).
This makes the framework go to Inactive in the Mesos UI.
However, the driver does not seem to exit and continues to execute tasks
(or tries to). The stack trace below shows a few lines around the
connection error and the abort message.
The question is: is this expected behaviour?
Here is the stack trace:
I0412 07:31:25.827283 274 sched.cpp:759] Framework registered with 15d9838f-b266-413b-842d-f7c3567bd04a-0051
Exception in thread "Thread-295" java.io.IOException: Failed to connect to my-company.com/x.x.x.x:7337
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.network.shuffle.mesos.MesosExternalShuffleClient.registerDriverWithShuffleService(MesosExternalShuffleClient.java:75)
    at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.statusUpdate(MesosCoarseGrainedSchedulerBackend.scala:537)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused:my-company.com/x.x.x.x:7337
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
I0412 07:35:12.032925 277 sched.cpp:2055] Asked to abort the driver
I0412 07:35:12.033035 277 sched.cpp:1233] Aborting framework 15d9838f-b266-413b-842d-f7c3567bd04a-0051
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org
Re: Driver aborts on Mesos when unable to connect to one of
external shuffle services
Posted by "igor.berman" <ig...@gmail.com>.
Hi Szuromi,
We manage the external shuffle service with Marathon, not manually.
Sometimes, though, e.g. when adding a new node to the cluster, there is a
delay between Mesos scheduling tasks on a slave and Marathon scheduling the
external shuffle service task on that node.
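
For anyone who wants to check for this race from the outside, here is a
minimal sketch (not Spark or Mesos code) that probes whether the shuffle
service port on a node accepts TCP connections, retrying a few times to
cover the delay described above. The function name wait_for_port and all of
its parameters are hypothetical; 7337 is simply the port that appears in the
stack trace.

```python
import socket
import time


def wait_for_port(host, port=7337, attempts=5, delay_s=2.0, timeout_s=1.0):
    """Return True once a TCP connect to host:port succeeds, else False.

    Hypothetical helper for illustration only; it checks reachability of
    the port and says nothing about whether the service behind it is healthy.
    """
    for attempt in range(attempts):
        try:
            # create_connection raises OSError (for example
            # ConnectionRefusedError) when nothing is listening.
            with socket.create_connection((host, port), timeout=timeout_s):
                return True
        except OSError:
            # Sleep only between attempts, not after the last one.
            if attempt < attempts - 1:
                time.sleep(delay_s)
    return False
```

Something like wait_for_port("my-company.com") could be run against a newly
added node before it is expected to serve shuffle data. It only shows whether
the port is open at that moment and does not change how the driver reacts to
a failed registration.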
Re: Driver aborts on Mesos when unable to connect to one of external
shuffle services
Posted by Szuromi Tamás <tr...@gmail.com>.
Hi Igor,
Have you started the external shuffle service manually?
Cheers