Posted to user@spark.apache.org by Jochen Hebbrecht <jo...@gmail.com> on 2019/10/04 14:07:54 UTC

Spark job fails because of timeout to Driver

Hi,

I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to submit a Spark job
to the cluster. The job gets accepted, but the YARN application fails
with:


{code}
19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception:
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
{code}

It actually goes wrong at this line:
https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468

Now, I'm 100% sure Spark is OK and there's no bug, so there must be
something wrong with my setup. I don't understand the code of the
ApplicationMaster, so could somebody explain to me what it is trying to
reach? Where exactly does the connection time out? Then I can at least
debug it further, because right now I don't have a clue what it is doing :-)

Thanks for any help!
Jochen
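
A note for readers who land here with the same trace: as far as the linked
v2.4.2 source shows, line 468 is inside runDriver, where in cluster mode the
ApplicationMaster starts the user's main class in a separate thread and then
waits for that code to create its SparkContext. The wait is capped by
spark.yarn.am.waitTime (100s by default), which matches the 100000
milliseconds in the log. The sketch below is only a Python analogy of that
kind of bounded wait, not Spark's code; wait_for_context, fast_main and
slow_main are hypothetical names.

```python
import threading
import time
from concurrent.futures import Future, TimeoutError


def wait_for_context(user_main, wait_seconds):
    """Run user_main in a thread and wait up to wait_seconds for a context."""
    # Stands in for the promise the ApplicationMaster waits on.
    context_ready = Future()

    # user_main is expected to call context_ready.set_result(...) once its
    # (simulated) SparkContext exists.
    worker = threading.Thread(target=user_main, args=(context_ready,), daemon=True)
    worker.start()

    try:
        return context_ready.result(timeout=wait_seconds)
    except TimeoutError:
        # The real ApplicationMaster marks the application FAILED at this point.
        return None


def fast_main(context_ready):
    # The context is created right away, so the wait succeeds.
    context_ready.set_result("context")


def slow_main(context_ready):
    # Simulates user code that is slow (or stuck) before creating the context.
    time.sleep(0.5)
    context_ready.set_result("context")
```

Under this reading, anything that delays or prevents SparkContext creation in
the user code (slow initialization, blocked network calls, a failure before
the context is built) would surface as the same "Futures timed out" message.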

Re: Spark job fails because of timeout to Driver

Posted by Jochen Hebbrecht <jo...@gmail.com>.
Hi Igor,

No, it was not a memory issue, but thanks for your question. It could
indeed have been a resource problem :-)

Jochen

On Fri, 4 Oct 2019 at 19:51, igor cabral uchoa <
igoruchoa5e@yahoo.com.br> wrote:

> Maybe it is a basic question, but does your cluster have enough resources
> to run your application? It is requesting 208G of RAM.
>
> Thanks,
>
> On Friday, October 4, 2019, 2:31 PM, Jochen Hebbrecht <
> jochenhebbrecht@gmail.com> wrote:
>
> Hi Igor,
>
> We are deploying by submitting a batch job on a Livy server (from our
> local PC or a Jenkins node). The Livy server then deploys the Spark job on
> the cluster itself.
>
> For example:
> ---
>
> Running '/usr/lib/spark/bin/spark-submit' '--class' '##MY_MAIN_CLASS##' '--conf' 'spark.driver.userClassPathFirst=true' '--conf' 'spark.default.parallelism=180' '--conf' 'spark.executor.memory=52g' '--conf' 'spark.driver.memory=52g' '--conf' 'spark.yarn.tags=livy-batch-0-owjPBdmC' '--conf' 'spark.executor.instances=3' '--conf' 'spark.executor.memoryOverhead=6144' '--conf' 'spark.driver.cores=6' '--conf' 'spark.driver.memoryOverhead=6144' '--conf' 'spark.executor.extraJavaOptions=-XX:ThreadStackSize=2048 -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError=\'kill -9 %p\'' '--conf' 'spark.executor.userClassPathFirst=true' '--conf' 'spark.submit.deployMode=cluster' '--conf' 'spark.yarn.submit.waitAppCompletion=false' '--conf' 'spark.executor.extraClassPath=true' '-- ...
>
> ---
>
> Jochen
>
> On Fri, 4 Oct 2019 at 17:42, igor cabral uchoa <
> igoruchoa5e@yahoo.com.br> wrote:
>
> Hi Roland!
>
> What deploy mode are you using when you submit your applications? Is it
> client or cluster mode?
>
> Regards,
>
> On Friday, October 4, 2019, 12:37 PM, Roland Johann
> <ro...@phenetic.io.INVALID> wrote:
>
> These are dynamic port ranges and depend on the configuration of your
> cluster. There is a separate application master per job, so there can't be
> just one port.
> If I remember correctly, the default EMR setup creates worker security
> groups with unrestricted traffic within the group, i.e. between the worker
> nodes.
> Depending on your security requirements, I suggest that you start with a
> default-like setup and afterwards determine the ports and port ranges from
> the docs to further restrict traffic between the nodes.
>
> Kind regards
>
> Jochen Hebbrecht <jo...@gmail.com> wrote on Fri, 4 Oct 2019
> at 17:16:
>
> Hi Roland,
>
> We have indeed custom security groups. Can you tell me where exactly I
> need to be able to access what?
> For example, is it from the master instance to the driver instance? And
> which port should be open?
>
> Jochen
>
> On Fri, 4 Oct 2019 at 17:14, Roland Johann <
> roland.johann@phenetic.io> wrote:
>
> Hi Jochen,
>
> Did you set up the EMR cluster with custom security groups? Can you confirm
> that the relevant EC2 instances can connect through the relevant ports?
>
> Best regards
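
One way to act on Roland's question is a plain TCP reachability check run
from one cluster node against another. The snippet is a minimal sketch under
assumptions: can_connect is a hypothetical helper, and the host and port you
pass in are whatever your own cluster uses (the thread does not name specific
ports, since they are dynamic).

```python
import socket


def can_connect(host, port, timeout_seconds=3.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

A False result only says that the TCP connection could not be opened within
the timeout; it does not tell you whether a security group, a local firewall
or a missing listener is the cause.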
>
> Jochen Hebbrecht <jo...@gmail.com> wrote on Fri, 4 Oct 2019
> at 17:09:
>
> Hi Jeff,
>
> Thanks! Just tried that, but the same timeout occurs :-( ...
>
> Jochen
>
> On Fri, 4 Oct 2019 at 16:37, Jeff Zhang <zj...@gmail.com> wrote:
>
> You can try increasing the property spark.yarn.am.waitTime (the default
> is 100s).
> Maybe you are doing some very time-consuming operation when initializing
> the SparkContext, which causes the timeout.
>
> See this property here
> http://spark.apache.org/docs/latest/running-on-yarn.html
>
>
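
Jeff's suggestion amounts to passing one extra --conf to spark-submit. A
minimal sketch follows, assuming the argument list is built in Python;
with_am_wait_time is a hypothetical helper and 300s is an arbitrary example
value, not a recommendation from the thread.

```python
def with_am_wait_time(submit_args, wait_time="300s"):
    """Return a copy of a spark-submit argument list with the AM wait time set."""
    extra = ["--conf", f"spark.yarn.am.waitTime={wait_time}"]
    # Options must come before the application jar, so add them right after
    # the spark-submit executable itself.
    return submit_args[:1] + extra + submit_args[1:]


base_args = [
    "/usr/lib/spark/bin/spark-submit",
    "--class", "##MY_MAIN_CLASS##",
    "--conf", "spark.submit.deployMode=cluster",
]
longer_wait_args = with_am_wait_time(base_args)
```

Note that a longer wait only helps if the SparkContext is eventually created;
in this thread the same timeout still occurred after raising the value.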

Re: Spark job fails because of timeout to Driver

Posted by igor cabral uchoa <ig...@yahoo.com.br.INVALID>.
Maybe it is a basic question, but does your cluster have enough resources to run your application? It is requesting 208G of RAM.
Thanks,
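
The 208G figure appears to come from the submitted settings: one driver plus
three executors at 52g of heap each. The sketch below redoes that arithmetic
and also adds the configured memoryOverhead of 6144 MiB per container. It is
an estimate under assumptions: yarn_request_mib is a hypothetical helper, and
any rounding YARN applies to container sizes is ignored.

```python
def yarn_request_mib(heap_gib, overhead_mib, containers):
    """Memory requested from YARN for `containers` identical JVM containers."""
    return containers * (heap_gib * 1024 + overhead_mib)


executors = 3   # spark.executor.instances=3
drivers = 1     # cluster mode: the driver also runs in a YARN container

# Heap only: spark.executor.memory=52g and spark.driver.memory=52g
heap_only_gib = (executors + drivers) * 52

# Heap plus spark.executor.memoryOverhead / spark.driver.memoryOverhead (6144 MiB)
total_gib = yarn_request_mib(52, 6144, executors + drivers) / 1024
```

With these settings the heap alone is 208 GiB and the total including
overhead is 232 GiB, so the cluster needs room for four containers of 58 GiB
each.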


Re: Spark job fails because of timeout to Driver

Posted by Jochen Hebbrecht <jo...@gmail.com>.
Hi Igor,

We are deploying by submitting a batch job on a Livy server (from our local
PC or a Jenkins node). The Livy server then deploys the Spark job on the
cluster itself.

For example:
---

Running '/usr/lib/spark/bin/spark-submit' '--class'
'##MY_MAIN_CLASS##' '--conf' 'spark.driver.userClassPathFirst=true'
'--conf' 'spark.default.parallelism=180' '--conf'
'spark.executor.memory=52g' '--conf' 'spark.driver.memory=52g'
'--conf' 'spark.yarn.tags=livy-batch-0-owjPBdmC' '--conf'
'spark.executor.instances=3' '--conf'
'spark.executor.memoryOverhead=6144' '--conf' 'spark.driver.cores=6'
'--conf' 'spark.driver.memoryOverhead=6144' '--conf'
'spark.executor.extraJavaOptions=-XX:ThreadStackSize=2048
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70
-XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled
-XX:OnOutOfMemoryError=\'kill -9 %p\'' '--conf'
'spark.executor.userClassPathFirst=true' '--conf'
'spark.submit.deployMode=cluster' '--conf'
'spark.yarn.submit.waitAppCompletion=false' '--conf'
'spark.executor.extraClassPath=true' '-- ...

---

Jochen
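
For context, a Livy batch submission is a POST to the server's /batches
endpoint with a JSON body; Livy then runs spark-submit as shown above. The
sketch below only builds such a body and does not send it. The jar path is a
made-up placeholder, livy_batch_payload is a hypothetical helper, and only a
few of the settings from the command above are repeated.

```python
import json


def livy_batch_payload(jar_path, main_class, conf):
    """Build the JSON body for a Livy POST /batches request."""
    body = {
        "file": jar_path,         # application jar to run
        "className": main_class,  # main class inside that jar
        "conf": conf,             # Spark properties, passed on as --conf
    }
    return json.dumps(body, sort_keys=True)


payload = livy_batch_payload(
    "s3://example-bucket/my-job.jar",  # placeholder path, not from the thread
    "##MY_MAIN_CLASS##",
    {
        "spark.executor.memory": "52g",
        "spark.driver.memory": "52g",
        "spark.executor.instances": "3",
        "spark.executor.memoryOverhead": "6144",
        "spark.driver.memoryOverhead": "6144",
    },
)
```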


Re: Spark job fails because of timeout to Driver

Posted by igor cabral uchoa <ig...@yahoo.com.br.INVALID>.
Hi Roland!
What deploy mode are you using when you submit your applications? Is it client or cluster mode?
Regards,


Sent from Yahoo Mail for iPhone


On Friday, October 4, 2019, 12:37 PM, Roland Johann <ro...@phenetic.io.INVALID> wrote:

This are dynamic port ranges and dependa on configuration of your cluster. Per job there is a separate application master so there can‘t be just one port.If I remeber correctly the default EMR setup creates worker security groups with unrestricted traffic within the group, e.g. Between the worker nodes.Depending on your security requirements I suggest that you start with a  default like setup and determine ports and port ranges from the docs afterwards to further restrict traffic between the nodes.
Kind regards
Jochen Hebbrecht <jo...@gmail.com> schrieb am Fr. 4. Okt. 2019 um 17:16:

Hi Roland,
We have indeed custom security groups. Can you tell me where exactly I need to be able to access what?
For example, is it from the master instance to the driver instance? And which port should be open?

Jochen
Op vr 4 okt. 2019 om 17:14 schreef Roland Johann <ro...@phenetic.io>:

Ho Jochen,
did you setup the EMR cluster with custom security groups? Can you confirm that the relevant EC2 instances can connect through relevant ports?
Best regards
Jochen Hebbrecht <jo...@gmail.com> schrieb am Fr. 4. Okt. 2019 um 17:09:

Hi Jeff,
Thanks! Just tried that, but the same timeout occurs :-( ...

Jochen
Op vr 4 okt. 2019 om 16:37 schreef Jeff Zhang <zj...@gmail.com>:

You can try to increase property spark.yarn.am.waitTime (by default it is 100s)  Maybe you are doing some very time consuming operation when initializing SparkContext, which cause timeout.
See this property here http://spark.apache.org/docs/latest/running-on-yarn.html

Jochen Hebbrecht <jo...@gmail.com> 于2019年10月4日周五 下午10:08写道:


Hi,

I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to send a Spark job towards the cluster. Thhe job gets accepted, but the YARN application fails with:


{code}
19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception: 
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
 at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
 at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
 at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
 at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
 at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
 at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
 at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
 at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
 at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
 at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
 at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
 at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
 at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
 at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
 at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
 at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
{code}

It actually goes wrong at this line: https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468

Now, I'm 100% sure Spark is OK and there's no bug, but there must be something wrong with my setup. I don't understand the code of the ApplicationMaster, so could somebody explain to me what it is trying to reach? Where exactly does the connection time out? That way I can at least debug it further, because I don't have a clue what it is doing :-)

Thanks for any help!
Jochen




-- 
Best Regards

Jeff Zhang

-- 

Roland Johann
Software Developer/Data Engineer

phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany

Mobile: +49 172 365 26 46
Mail: roland.johann@phenetic.io
Web: phenetic.io

Commercial register: Amtsgericht Köln (HRB 92595)
Managing directors: Roland Johann, Uwe Reimann




Re: Spark job fails because of timeout to Driver

Posted by Jochen Hebbrecht <jo...@gmail.com>.
Hi Roland,

I just tried what you suggested, and it actually helped me find the
root cause. Once I had the default EMR cluster, I submitted the Spark job
from the master instance (using the 'spark-submit' command in a terminal)
instead of using Livy to submit it.
That way I got much more logging in the terminal, and the logging
actually showed me what was causing the timeout. The timeout was related
to a call to one of our company's internal services, and this call failed
due to access constraints.
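
To illustrate the failure mode described above, here is a toy model in
plain Python (not Spark code). It assumes, as Jeff's earlier reply
suggests, that the ApplicationMaster waits a bounded time for the user
code to finish initializing, and that the blocked service call happened
before that point. All function names are hypothetical.

```python
import threading
import time
from concurrent.futures import Future
from concurrent.futures import TimeoutError as FutureTimeoutError


def run_application_master(user_main, wait_time_ms):
    """Toy stand-in for the waiting side: start the user code in a thread
    and wait at most wait_time_ms for it to report that it is ready."""
    context_ready = Future()
    worker = threading.Thread(target=user_main, args=(context_ready,), daemon=True)
    worker.start()
    try:
        return context_ready.result(timeout=wait_time_ms / 1000)
    except FutureTimeoutError:
        # The waiter only knows that time ran out, not why the user code was slow.
        return f"FAILED: Futures timed out after [{wait_time_ms} milliseconds]"


def healthy_job(context_ready):
    # Initialization finishes immediately.
    context_ready.set_result("SparkContext ready")


def job_with_blocked_service_call(context_ready):
    # Assumption: a slow or blocked service call delays initialization.
    time.sleep(1.0)
    context_ready.set_result("SparkContext ready")


print(run_application_master(healthy_job, wait_time_ms=200))
print(run_application_master(job_with_blocked_service_call, wait_time_ms=200))
```

In this model the timeout message says nothing about the blocked call
itself, which matches the experience that the real cause only became
visible in the more detailed terminal logging.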

Fixing those access constraints made the Spark job succeed!

So, in conclusion: the problem was not related to Spark itself; the Livy
output logging was hiding the real error details.
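
For anyone who wants to repeat this debugging approach, the following is
a minimal sketch that only assembles the command lines; it does not run
them. The jar name, main class and application id are hypothetical
placeholders, the deploy mode is left as a parameter because the thread
does not say which one was used, and the optional
spark.yarn.am.waitTime setting is the property Jeff mentioned.

```python
import shlex


def build_spark_submit(app_jar, main_class, deploy_mode="client", am_wait_time=None):
    """Build a spark-submit command for a YARN cluster as an argument list."""
    cmd = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--class", main_class,
    ]
    if am_wait_time is not None:
        # Property suggested earlier in the thread (default 100s).
        cmd += ["--conf", f"spark.yarn.am.waitTime={am_wait_time}"]
    cmd.append(app_jar)
    return cmd


def build_yarn_logs(application_id):
    """Build the command that fetches the aggregated YARN logs of one application."""
    return ["yarn", "logs", "-applicationId", application_id]


# Hypothetical values, for illustration only.
print(shlex.join(build_spark_submit("my-job.jar", "com.example.MyJob")))
print(shlex.join(build_yarn_logs("application_0000000000000_0001")))
```

Running the printed spark-submit command from a terminal on the master
instance is the step that surfaced the real error here; the yarn logs
command is an additional place to look when the driver output is not
shown directly.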

Thank you all for help! :-)

Jochen

On Fri, 4 Oct 2019 at 19:32, Roland Johann <roland.johann@phenetic.io>
wrote:

> Hi Jochen,
>
> Can you create a small EMR cluster with all defaults and run the job there?
> This way we can ensure that the issue is not infrastructure and YARN
> configuration related.
>
> Kind regards

Re: Spark job fails because of timeout to Driver

Posted by Roland Johann <ro...@phenetic.io.INVALID>.
Hi Jochen,

Can you create a small EMR cluster with all defaults and run the job there?
This way we can ensure that the issue is not infrastructure and YARN
configuration related.

Kind regards

On Fri, 4 Oct 2019 at 19:27, Jochen Hebbrecht <jo...@gmail.com> wrote:

> Hi Roland,
>
> I switched to the default security groups, ran my job again but the same
> exception pops up :-( ...
> All traffic is open on the security groups now.
>
> Jochen


Roland Johann
Software Developer/Data Engineer

phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany

Mobile: +49 172 365 26 46
Mail: roland.johann@phenetic.io
Web: phenetic.io

Commercial register: Amtsgericht Köln (HRB 92595)
Managing directors: Roland Johann, Uwe Reimann

Re: Spark job fails because of timeout to Driver

Posted by Jochen Hebbrecht <jo...@gmail.com>.
Hi Roland,

I switched to the default security groups, ran my job again but the same
exception pops up :-( ...
All traffic is open on the security groups now.

Jochen

On Fri, 4 Oct 2019 at 17:37, Roland Johann <roland.johann@phenetic.io>
wrote:

> These are dynamic port ranges and depend on the configuration of your
> cluster. There is a separate application master per job, so there can't
> be just one port.
> If I remember correctly, the default EMR setup creates worker security
> groups with unrestricted traffic within the group, e.g. between the worker
> nodes.
> Depending on your security requirements, I suggest that you start with a
> default-like setup and afterwards determine the ports and port ranges
> from the docs to further restrict traffic between the nodes.
>
> Kind regards

Re: Spark job fails because of timeout to Driver

Posted by Roland Johann <ro...@phenetic.io.INVALID>.
These are dynamic port ranges and depend on the configuration of your cluster.
There is a separate application master per job, so there can't be just one
port.
If I remember correctly, the default EMR setup creates worker security groups
with unrestricted traffic within the group, e.g. between the worker nodes.
Depending on your security requirements, I suggest that you start with a
default-like setup and afterwards determine the ports and port ranges from
the docs to further restrict traffic between the nodes.
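
As a rough aid for the "can the instances reach each other" question, a
small TCP reachability check could look like the sketch below. It uses
only the Python standard library; the function names are hypothetical,
and the hosts and ports to test have to come from your own cluster
configuration, since the ports are dynamic as described above.

```python
import socket


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_all(targets, timeout_s=3.0):
    """Check a list of (host, port) pairs and return a dict of results."""
    return {(host, port): can_connect(host, port, timeout_s) for host, port in targets}
```

Run it on one node with a list of (host, port) pairs for the other
nodes. A False result only shows that the connection could not be
opened from that node; it does not say whether a security group, a
firewall or a stopped service is responsible.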

Kind regards

On Fri, 4 Oct 2019 at 17:16, Jochen Hebbrecht <jo...@gmail.com> wrote:

> Hi Roland,
>
> We do indeed have custom security groups. Can you tell me exactly which
> instances need to be able to reach which?
> For example, is it from the master instance to the driver instance? And
> which port should be open?
>
> Jochen
>
> On Fri, 4 Oct 2019 at 17:14, Roland Johann <roland.johann@phenetic.io>
> wrote:
>
>> Hi Jochen,
>>
>> Did you set up the EMR cluster with custom security groups? Can you
>> confirm that the relevant EC2 instances can connect through the relevant
>> ports?
>>
>> Best regards
>>
>> On Fri, 4 Oct 2019 at 17:09, Jochen Hebbrecht <jo...@gmail.com> wrote:
>>
>>> Hi Jeff,
>>>
>>> Thanks! Just tried that, but the same timeout occurs :-( ...
>>>
>>> Jochen
>>>
>>> On Fri, 4 Oct 2019 at 16:37, Jeff Zhang <zj...@gmail.com> wrote:
>>>
>>>> You can try increasing the property spark.yarn.am.waitTime (the default
>>>> is 100s).
>>>> Maybe you are doing some very time-consuming operation when
>>>> initializing the SparkContext, which causes the timeout.
>>>>
>>>> See this property here:
>>>> http://spark.apache.org/docs/latest/running-on-yarn.html
>>>>
>>>>
>>>> On Fri, 4 Oct 2019 at 22:08, Jochen Hebbrecht <jo...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to submit a Spark
>>>>> job to the cluster. The job gets accepted, but the YARN application
>>>>> fails with:
>>>>>
>>>>>
>>>>> {code}
>>>>> 19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception:
>>>>> java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
>>>>> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>>>>> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>>>>> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>>>>> 19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
>>>>> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>>>>> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>>>>> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>>>>> {code}
>>>>>
>>>>> It actually goes wrong at this line:
>>>>> https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468
>>>>>
>>>>> Now, I'm 100% sure Spark is OK and there's no bug, but there must be
>>>>> something wrong with my setup. I don't understand the code of the
>>>>> ApplicationMaster, so could somebody explain me what it is trying to reach?
>>>>> Where exactly does the connection timeout? So at least I can debug it
>>>>> further because I don't have a clue what it is doing :-)
>>>>>
>>>>> Thanks for any help!
>>>>> Jochen
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards
>>>>
>>>> Jeff Zhang
>>>>
>>> --
>>
>>
>> *Roland Johann*Software Developer/Data Engineer
>>
>> *phenetic GmbH*
>> Lütticher Straße 10, 50674 Köln, Germany
>> <https://www.google.com/maps/search/L%C3%BCtticher+Stra%C3%9Fe+10,+50674+K%C3%B6ln,+Germany?entry=gmail&source=g>
>>
>> Mobil: +49 172 365 26 46
>> Mail: roland.johann@phenetic.io
>> Web: phenetic.io
>>
>> Handelsregister: Amtsgericht Köln (HRB 92595)
>> Geschäftsführer: Roland Johann, Uwe Reimann
>>
> --


*Roland Johann*
Software Developer/Data Engineer

*phenetic GmbH*
Lütticher Straße 10, 50674 Köln, Germany

Mobile: +49 172 365 26 46
Mail: roland.johann@phenetic.io
Web: phenetic.io

Commercial register: Amtsgericht Köln (HRB 92595)
Managing directors: Roland Johann, Uwe Reimann

Re: Spark job fails because of timeout to Driver

Posted by Jochen Hebbrecht <jo...@gmail.com>.
Hi Roland,

We do indeed have custom security groups. Can you tell me exactly which
instances need to be able to reach which?
For example, is it from the master instance to the driver instance? And
which port should be open?

Jochen

On Fri, 4 Oct 2019 at 17:14, Roland Johann <roland.johann@phenetic.io> wrote:

> Hi Jochen,
>
> Did you set up the EMR cluster with custom security groups? Can you confirm
> that the relevant EC2 instances can connect through the relevant ports?
>
> Best regards

Re: Spark job fails because of timeout to Driver

Posted by Roland Johann <ro...@phenetic.io.INVALID>.
Hi Jochen,

Did you set up the EMR cluster with custom security groups? Can you confirm
that the relevant EC2 instances can connect through the relevant ports?

Best regards
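
A quick way to confirm this is to try opening a TCP connection from one
instance to another. The sketch below uses only Python's standard library.
The function name is made up for illustration, and the thread does not say
which ports the security groups need to allow, so any host and port passed
in are placeholders to be replaced with the real ones.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False
```

Run it on the instance that initiates the connection, once per host and port
pair you want to verify. A False result only says the connection could not be
opened; it does not say whether a security group or something else blocked it.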

On Fri, 4 Oct 2019 at 17:09, Jochen Hebbrecht <jo...@gmail.com> wrote:

> Hi Jeff,
>
> Thanks! Just tried that, but the same timeout occurs :-( ...
>
> Jochen



Re: Spark job fails because of timeout to Driver

Posted by Jochen Hebbrecht <jo...@gmail.com>.
Hi Jeff,

Thanks! Just tried that, but the same timeout occurs :-( ...

Jochen
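
One way to check whether the new setting was actually picked up is to look at
the number in the error message. Judging by the linked ApplicationMaster
source, the wait at that line appears to use the configured
spark.yarn.am.waitTime, so a log that still reports 100000 milliseconds would
suggest the override never reached the ApplicationMaster. A small sketch (the
helper name is made up for illustration) that pulls the value out of a log:

```python
import re

# Matches the message printed by the failing wait, e.g.
# "Futures timed out after [100000 milliseconds]", even when a mail client
# has wrapped the line between the number and the unit.
TIMEOUT_RE = re.compile(r"Futures timed out after \[(\d+)\s+milliseconds\]")

def reported_timeout_ms(log_text: str):
    """Return the timeout in milliseconds from the first matching message, or None."""
    match = TIMEOUT_RE.search(log_text)
    return int(match.group(1)) if match else None

log = ("java.util.concurrent.TimeoutException: Futures timed out after [100000\n"
       "milliseconds]")
print(reported_timeout_ms(log))  # 100000
```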

On Fri, 4 Oct 2019 at 16:37, Jeff Zhang <zj...@gmail.com> wrote:

> You can try increasing the spark.yarn.am.waitTime property (the default is
> 100s).
> Maybe you are doing some very time-consuming operation when initializing the
> SparkContext, which causes the timeout.
>
> See this property here
> http://spark.apache.org/docs/latest/running-on-yarn.html

Re: Spark job fails because of timeout to Driver

Posted by Jeff Zhang <zj...@gmail.com>.
You can try increasing the spark.yarn.am.waitTime property (the default is
100s).
Maybe you are doing some very time-consuming operation when initializing the
SparkContext, which causes the timeout.
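
To illustrate the mechanism: in cluster mode the ApplicationMaster runs the
user's main class in a separate thread and waits a bounded time for it to
create the SparkContext. The sketch below is only a Python analogy of that
wait-with-timeout pattern, not Spark's code; slow_user_main, wait_for_context
and the timings are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def slow_user_main(init_seconds: float) -> str:
    # Stand-in for user code that does slow work before creating its context.
    time.sleep(init_seconds)
    return "context ready"

def wait_for_context(init_seconds: float, wait_time: float) -> str:
    """Run the user code in a thread and wait at most wait_time seconds for it."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(slow_user_main, init_seconds)
    try:
        return future.result(timeout=wait_time)
    except TimeoutError:
        return "timed out"
    finally:
        # Do not block here; the worker thread finishes on its own.
        pool.shutdown(wait=False)

print(wait_for_context(init_seconds=0.01, wait_time=1.0))  # context ready
print(wait_for_context(init_seconds=0.5, wait_time=0.05))  # timed out
```

Under this reading, the timeout is not a failed network connection as such:
anything that delays or prevents the user code from creating its context
within the wait time produces the same exception.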

See this property here
http://spark.apache.org/docs/latest/running-on-yarn.html
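
For example, the property can be passed with --conf at submission time. The
sketch below only assembles a spark-submit argument list; the 300s value, the
class name and the jar path are placeholders, not values from this thread, and
cluster deploy mode is assumed because the stack trace goes through runDriver.

```python
def build_submit_command(app_jar: str, main_class: str, am_wait_time: str = "300s"):
    """Assemble a spark-submit argument list that raises spark.yarn.am.waitTime."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--conf", f"spark.yarn.am.waitTime={am_wait_time}",
        "--class", main_class,
        app_jar,
    ]

# Hypothetical application; replace with the real class and jar.
cmd = build_submit_command("s3://my-bucket/my-job.jar", "com.example.MyJob")
print(" ".join(cmd))
```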




-- 
Best Regards

Jeff Zhang