Posted to issues@spark.apache.org by "Stavros Kontopoulos (JIRA)" <ji...@apache.org> on 2018/06/24 15:51:00 UTC

[jira] [Commented] (SPARK-24641) Spark-Mesos integration doesn't respect request to abort itself

    [ https://issues.apache.org/jira/browse/SPARK-24641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16521524#comment-16521524 ] 

Stavros Kontopoulos commented on SPARK-24641:
---------------------------------------------

[~igor.berman] The current idea of the shuffle service on Mesos is to run it on all slaves upfront with a constraint: {{"constraints": [["hostname", "UNIQUE"]]}}.
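
For illustration only, a Marathon app definition along those lines could look roughly like the one below. The id, install path, instance count and sizing are placeholders, not a recommendation; instances should be at least the number of agents so that the UNIQUE constraint ends up placing one copy per host, and 7337 is just the default shuffle service port.

{code:json}
{
  "id": "/spark-mesos-shuffle-service",
  "cmd": "/opt/spark/bin/spark-class org.apache.spark.deploy.mesos.MesosExternalShuffleService",
  "instances": 10,
  "cpus": 0.5,
  "mem": 1024,
  "ports": [7337],
  "requirePorts": true,
  "constraints": [["hostname", "UNIQUE"]]
}
{code}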

So let's set aside the case of "Marathon hasn't yet provisioned the external shuffle service on a particular node", because if that holds the setup is not going to work by design.

Now, if the shuffle service is available but there is a communication error due to network issues, then the executor's Mesos task should be unreachable as well.

I can think of a case, though, where the shuffle service fails while the Mesos executor has started fine. In that case we could tell the executor to fail so that the task can be restarted elsewhere. Right now, when an executor is started and we get a Mesos task update, we try to connect to the shuffle service; if that fails we do nothing. So we should probably improve the logic there (see the sketch below), but first let's identify what should work and what shouldn't.
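
Just to make the improvement concrete, here is a rough sketch of what the status-update path could do instead of silently ignoring a failed connection. This is not the actual Spark code: the object and helper names are made up for the sketch, and the shuffle host/port lookup is assumed to be done by the caller; in Spark the real entry points are MesosCoarseGrainedSchedulerBackend.statusUpdate and MesosExternalShuffleClient.registerDriverWithShuffleService (both visible in the stack trace below).

{code:scala}
import java.net.Socket

import scala.util.{Failure, Success, Try}

import org.apache.mesos.Protos.{TaskState, TaskStatus}
import org.apache.mesos.SchedulerDriver

// Sketch only: illustrates killing the Mesos task when the driver cannot
// reach the external shuffle service on the agent the executor landed on,
// so the executor gets restarted elsewhere instead of being left as-is.
object ShuffleAwareStatusUpdate {

  // Stand-in for the real registration call; here it is just a simple
  // reachability check against the shuffle service port.
  def registerWithShuffleService(host: String, port: Int): Unit = {
    val socket = new Socket(host, port)
    socket.close()
  }

  // Called when a status update reports the executor task as RUNNING.
  def onStatusUpdate(
      driver: SchedulerDriver,
      status: TaskStatus,
      shuffleHost: String,
      shufflePort: Int): Unit = {
    if (status.getState == TaskState.TASK_RUNNING) {
      Try(registerWithShuffleService(shuffleHost, shufflePort)) match {
        case Success(_) =>
          () // shuffle service reachable, nothing more to do
        case Failure(e) =>
          // Instead of doing nothing, kill the task so Mesos reports it and
          // the executor can be started on an agent whose shuffle service is up.
          System.err.println(s"Shuffle service unreachable at $shuffleHost:$shufflePort " +
            s"(${e.getMessage}); killing task ${status.getTaskId.getValue}")
          driver.killTask(status.getTaskId)
      }
    }
  }
}
{code}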

Btw, there is an ongoing effort to re-design the shuffle service to be backed by storage and run as a standalone service, so things will improve at some point.

[~susanxhuynh] thoughts?

> Spark-Mesos integration doesn't respect request to abort itself
> ---------------------------------------------------------------
>
>                 Key: SPARK-24641
>                 URL: https://issues.apache.org/jira/browse/SPARK-24641
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos, Shuffle
>    Affects Versions: 2.2.0
>            Reporter: Igor Berman
>            Priority: Major
>
> Hi,
> lately we came across the following corner scenario:
> We are using dynamic allocation with an external shuffle service that is managed by Marathon.
>  
> Due to some network/operational issue, the external shuffle service on one of the machines (mesos-slaves) is not available for a few seconds (e.g. Marathon hasn't yet provisioned the external shuffle service on a particular node, but the framework itself has already accepted an offer on this node and tries to start up an executor).
>  
> This makes the framework (Spark driver) fail and I see an error in the driver's stderr (it seems the mesos-agent asks the driver to abort itself); however, the Spark context continues to run (seemingly in a kind of zombie mode, since it can't release resources to the cluster and can't get additional offers because the framework is aborted from Mesos' perspective).
>  
> The framework in the Mesos UI moves to the "inactive" state.
> [~skonto] [~susanxhuynh] any input on this problem? Have you come across such behavior?
> I'm ready to work on a patch, but currently I don't understand where to start; it seems the driver is too fragile in this sense and something in the Mesos-Spark integration is missing.
>  
>  
> {code:java}
> I0412 07:31:25.827283   274 sched.cpp:759] Framework registered with 15d9838f-b266-413b-842d-f7c3567bd04a-0051
> Exception in thread "Thread-295" java.io.IOException: Failed to connect to my-company.com/10.106.14.61:7337
>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
>         at org.apache.spark.network.shuffle.mesos.MesosExternalShuffleClient.registerDriverWithShuffleService(MesosExternalShuffleClient.java:75)
>         at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.statusUpdate(MesosCoarseGrainedSchedulerBackend.scala:537)
> Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: my-company.com/10.106.14.61:7337
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>         at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
>         at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>         at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>         at java.lang.Thread.run(Thread.java:748)
> I0412 07:35:12.032925   277 sched.cpp:2055] Asked to abort the driver
> I0412 07:35:12.033035   277 sched.cpp:1233] Aborting framework 15d9838f-b266-413b-842d-f7c3567bd04a-0051
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org