Posted to yarn-dev@hadoop.apache.org by marc nicole <mk...@gmail.com> on 2021/11/12 13:55:54 UTC

Yarn cluster mode stalling when applicationMaster is elected on a worker node (can't find the driver)

Hi guys!

If I specify bindAddress in spark-defaults.conf, then in YARN client mode
everything works fine and the ApplicationMaster finds the driver. But if I
submit in cluster mode, the ApplicationMaster, when hosted on a worker
node, can't find the driver and fails with a bind error.
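
For illustration, the kind of spark-defaults.conf entry I mean looks roughly
like this (the hostname is only a placeholder, not my actual value;
spark.driver.host is shown alongside it only because the two usually get
discussed together):

    # sketch of the relevant spark-defaults.conf lines (placeholder hostname)
    spark.driver.bindAddress   driver-host.example.com
    spark.driver.host          driver-host.example.com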



Any idea what config is missing?


Note that I create the driver through a SparkSession object (not a
SparkContext).
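
Roughly, the driver side looks like this (a simplified sketch; the app name
is a placeholder):

    // simplified sketch of how the SparkSession (and hence the driver) is created;
    // the app name is a placeholder
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("my-yarn-app")
      .getOrCreate()  // in yarn cluster mode this runs inside the ApplicationMaster container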

Hint: I was thinking that propagating the driver config to the workers
would solve this, e.g. through spark.yarn.dist.files.
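
Something along these lines is what I had in mind (the paths, class name and
jar below are just placeholders):

    # sketch of the submit command I had in mind; paths, class and jar are placeholders
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.yarn.dist.files=/path/to/spark-defaults.conf \
      --class com.example.MyApp \
      my-app.jar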

Any suggestions here?

Re: Yarn cluster mode stalling when applicationMaster is elected on a worker node (can't find the driver)

Posted by marc nicole <mk...@gmail.com>.
Whenever export SPARK_LOCAL_IP="127.0.0.1" is added to spark-defaults, the
ApplicationMaster will always be hosted on 127.0.0.1 (in cluster or client
mode), which is not the intended goal.

On Sat, Nov 13, 2021 at 4:57 AM, Prabhu Joseph <pr...@gmail.com>
wrote:

> I have seen this exception. I think it was fixed by exporting
> SPARK_LOCAL_IP="127.0.0.1" before the spark-submit command.
>
> Can you check the links below for more details?
>
> scala - How to solve "Can't assign requested address: Service
> 'sparkDriver' failed after 16 retries" when running spark code? - Stack
> Overflow
> <https://stackoverflow.com/questions/52133731/how-to-solve-cant-assign-requested-address-service-sparkdriver-failed-after>
>
> Unable to find Spark Driver after 16 retries · Issue #435 · dotnet/spark
> (github.com) <https://github.com/dotnet/spark/issues/435>
>
> What is spark.local.ip ,spark.driver.host,spark.driver.bindAddress and
> spark.driver.hostname? - Stack Overflow
> <https://stackoverflow.com/questions/43692453/what-is-spark-local-ip-spark-driver-host-spark-driver-bindaddress-and-spark-dri>
>
> On Fri, Nov 12, 2021 at 9:52 PM marc nicole <mk...@gmail.com> wrote:
>
>> Here's the exception whenever the ApplicationMaster is hosted on one of the
>> slaves (cluster mode); increasing Spark or YARN retries didn't help either:
>>
>> 2021-11-12 17:20:37,301 ERROR yarn.ApplicationMaster: Uncaught exception:
>> org.apache.spark.SparkException: Exception thrown in awaitResult:
>> 	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
>> 	at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:504)
>> 	at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:268)
>> 	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:899)
>> 	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:898)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:422)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>> 	at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:898)
>> 	at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>> Caused by: java.net.BindException: Cannot assign requested address: bind: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
>> 	at sun.nio.ch.Net.bind0(Native Method)
>> 	at sun.nio.ch.Net.bind(Net.java:438)
>> 	at sun.nio.ch.Net.bind(Net.java:430)
>> 	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:225)
>> 	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
>> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
>> 	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
>> 	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
>> 	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
>> 	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
>> 	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
>> 	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
>> 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>> 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>> 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>> 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>> 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>> 	at java.lang.Thread.run(Thread.java:748)
>> 2021-11-12 17:20:37,308 INFO util.ShutdownHookManager: Shutdown hook called
>>
>>
>> On Fri, Nov 12, 2021 at 4:43 PM, Prabhu Joseph <pr...@gmail.com>
>> wrote:
>>
>>> Can you share the exception seen in the Spark application logs? Thanks.
>>>
>>> On Fri, Nov 12, 2021, 7:24 PM marc nicole <mk...@gmail.com> wrote:
>>>
>>>> Hi guys!
>>>>
>>>> If I specify bindAddress in spark-defaults.conf, then in YARN client
>>>> mode everything works fine and the ApplicationMaster finds the driver.
>>>> But if I submit in cluster mode, the ApplicationMaster, when hosted on
>>>> a worker node, can't find the driver and fails with a bind error.
>>>>
>>>>
>>>>
>>>> Any idea what config is missing?
>>>>
>>>>
>>>> Note that I create the driver through a SparkSession object (not a
>>>> SparkContext).
>>>>
>>>> Hint: I was thinking that propagating the driver config to the workers
>>>> would solve this, e.g. through spark.yarn.dist.files.
>>>>
>>>> Any suggestions here?
>>>>
>>>

Re: Yarn cluster mode stalling when applicationMaster is elected on a worker node (can't find the driver)

Posted by Prabhu Joseph <pr...@gmail.com>.
I have seen this exception. I think it was fixed by exporting
SPARK_LOCAL_IP="127.0.0.1" before the spark-submit command.
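
A sketch of what I mean (the submit arguments below are placeholders; only
the export matters):

    # export before running the usual spark-submit command (arguments below are placeholders)
    export SPARK_LOCAL_IP="127.0.0.1"
    spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar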

Can you check the links below for more details?

scala - How to solve "Can't assign requested address: Service 'sparkDriver'
failed after 16 retries" when running spark code? - Stack Overflow
<https://stackoverflow.com/questions/52133731/how-to-solve-cant-assign-requested-address-service-sparkdriver-failed-after>

Unable to find Spark Driver after 16 retries · Issue #435 · dotnet/spark
(github.com) <https://github.com/dotnet/spark/issues/435>

What is spark.local.ip ,spark.driver.host,spark.driver.bindAddress and
spark.driver.hostname? - Stack Overflow
<https://stackoverflow.com/questions/43692453/what-is-spark-local-ip-spark-driver-host-spark-driver-bindaddress-and-spark-dri>

On Fri, Nov 12, 2021 at 9:52 PM marc nicole <mk...@gmail.com> wrote:

> Here's the exception whenever the ApplicationMaster is hosted on one of the
> slaves (cluster mode); increasing Spark or YARN retries didn't help either:
>
> 2021-11-12 17:20:37,301 ERROR yarn.ApplicationMaster: Uncaught exception:
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> 	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
> 	at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:504)
> 	at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:268)
> 	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:899)
> 	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:898)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> 	at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:898)
> 	at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> Caused by: java.net.BindException: Cannot assign requested address: bind: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
> 	at sun.nio.ch.Net.bind0(Native Method)
> 	at sun.nio.ch.Net.bind(Net.java:438)
> 	at sun.nio.ch.Net.bind(Net.java:430)
> 	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:225)
> 	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
> 	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
> 	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
> 	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
> 	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
> 	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
> 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> 	at java.lang.Thread.run(Thread.java:748)
> 2021-11-12 17:20:37,308 INFO util.ShutdownHookManager: Shutdown hook called
>
>
> On Fri, Nov 12, 2021 at 4:43 PM, Prabhu Joseph <pr...@gmail.com>
> wrote:
>
>> Can you share the exception seen in the Spark application logs? Thanks.
>>
>> On Fri, Nov 12, 2021, 7:24 PM marc nicole <mk...@gmail.com> wrote:
>>
>>> Hi guys!
>>>
>>> If I specify bindAddress in spark-defaults.conf, then in YARN client
>>> mode everything works fine and the ApplicationMaster finds the driver.
>>> But if I submit in cluster mode, the ApplicationMaster, when hosted on
>>> a worker node, can't find the driver and fails with a bind error.
>>>
>>>
>>>
>>> Any idea what config is missing?
>>>
>>>
>>> Note that I create the driver through a SparkSession object (not a
>>> SparkContext).
>>>
>>> Hint: I was thinking that propagating the driver config to the workers
>>> would solve this, e.g. through spark.yarn.dist.files.
>>>
>>> Any suggestions here?
>>>
>>

Re: Yarn cluster mode stalling when applicationMaster is elected on a worker node (can't find the driver)

Posted by marc nicole <mk...@gmail.com>.
Here's the exception whenever the ApplicationMaster is hosted on one of the
slaves (cluster mode); increasing Spark or YARN retries didn't help either:

2021-11-12 17:20:37,301 ERROR yarn.ApplicationMaster: Uncaught exception:
org.apache.spark.SparkException: Exception thrown in awaitResult:
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
	at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:504)
	at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:268)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:899)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:898)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
	at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:898)
	at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: java.net.BindException: Cannot assign requested address:
bind: Service 'sparkDriver' failed after 16 retries (on a random free
port)! Consider explicitly setting the appropriate binding address for
the service 'sparkDriver' (for example spark.driver.bindAddress for
SparkDriver) to the correct binding address.
	at sun.nio.ch.Net.bind0(Native Method)
	at sun.nio.ch.Net.bind(Net.java:438)
	at sun.nio.ch.Net.bind(Net.java:430)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:225)
	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
2021-11-12 17:20:37,308 INFO util.ShutdownHookManager: Shutdown hook called


On Fri, Nov 12, 2021 at 4:43 PM, Prabhu Joseph <pr...@gmail.com>
wrote:

> Can you share the exception seen in the Spark application logs? Thanks.
>
> On Fri, Nov 12, 2021, 7:24 PM marc nicole <mk...@gmail.com> wrote:
>
>> Hi guys!
>>
>> If I specify bindAddress in spark-defaults.conf, then in YARN client
>> mode everything works fine and the ApplicationMaster finds the driver.
>> But if I submit in cluster mode, the ApplicationMaster, when hosted on
>> a worker node, can't find the driver and fails with a bind error.
>>
>>
>>
>> Any idea what config is missing?
>>
>>
>> Note that I create the driver through a SparkSession object (not a
>> SparkContext).
>>
>> Hint: I was thinking that propagating the driver config to the workers
>> would solve this, e.g. through spark.yarn.dist.files.
>>
>> Any suggestions here?
>>
>

Re: Yarn cluster mode stalling when applicationMaster is elected on a worker node (can't find the driver)

Posted by Prabhu Joseph <pr...@gmail.com>.
Can you share the exception seen in the Spark application logs? Thanks.

On Fri, Nov 12, 2021, 7:24 PM marc nicole <mk...@gmail.com> wrote:

> Hi guys!
>
> If I specify bindAddress in spark-defaults.conf, then in YARN client
> mode everything works fine and the ApplicationMaster finds the driver.
> But if I submit in cluster mode, the ApplicationMaster, when hosted on
> a worker node, can't find the driver and fails with a bind error.
>
>
>
> Any idea what config is missing?
>
>
> Note that I create the driver through a SparkSession object (not a
> SparkContext).
>
> Hint: I was thinking that propagating the driver config to the workers
> would solve this, e.g. through spark.yarn.dist.files.
>
> Any suggestions here?
>