Posted to user@spark.apache.org by "igor.berman" <ig...@gmail.com> on 2018/04/12 08:48:27 UTC

Driver aborts on Mesos when unable to connect to one of external shuffle services

Hi,
Any input on whether the following is expected?
The driver starts and is unable to connect to the external shuffle service on
one of the nodes (regardless of the reason).
This makes the framework go to Inactive mode in the Mesos UI.
However, it seems that the driver doesn't exit and continues to execute tasks
(or tries to). The stack trace below shows a few lines around the
connection error and the aborting message.

Is this expected behaviour?

Here is the stack trace:

I0412 07:31:25.827283   274 sched.cpp:759] Framework registered with 15d9838f-b266-413b-842d-f7c3567bd04a-0051
Exception in thread "Thread-295" java.io.IOException: Failed to connect to my-company.com/x.x.x.x:7337
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
        at org.apache.spark.network.shuffle.mesos.MesosExternalShuffleClient.registerDriverWithShuffleService(MesosExternalShuffleClient.java:75)
        at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.statusUpdate(MesosCoarseGrainedSchedulerBackend.scala:537)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: my-company.com/x.x.x.x:7337
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)
I0412 07:35:12.032925   277 sched.cpp:2055] Asked to abort the driver
I0412 07:35:12.033035   277 sched.cpp:1233] Aborting framework 15d9838f-b266-413b-842d-f7c3567bd04a-0051



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Driver aborts on Mesos when unable to connect to one of external shuffle services

Posted by "igor.berman" <ig...@gmail.com>.
Hi Szuromi,
We manage the external shuffle service with Marathon, not manually.
Sometimes, though, e.g. when adding a new node to the cluster, there is a
delay between Mesos scheduling tasks on a slave and Marathon scheduling the
external shuffle service task on that node.
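
To illustrate the timing gap described above, here is a minimal sketch of a
readiness probe that checks whether a node's shuffle service port accepts TCP
connections before work is sent there. It is not part of Spark, Mesos, or
Marathon; the function names and the retry parameters are hypothetical, and
port 7337 is simply the one shown in the stack trace.

```python
import socket
import time


def shuffle_service_reachable(host, port=7337, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening on the port yet.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_shuffle_service(host, port=7337, attempts=5, delay=1.0, timeout=2.0):
    """Poll the port up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if shuffle_service_reachable(host, port, timeout):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A probe like this could serve as a health check or be run by hand on a newly
added node. It only confirms that the port is open; it does not change how the
driver reacts to a failed registration.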





Re: Driver aborts on Mesos when unable to connect to one of external shuffle services

Posted by Szuromi Tamás <tr...@gmail.com>.
Hi Igor,

Have you started the external shuffle service manually?

Cheers
