Posted to dev@spark.apache.org by Olivier Girardot <o....@lateral-thoughts.com> on 2019/04/29 12:42:45 UTC

Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

Hi everyone,
I have ~300 Spark jobs on Kubernetes (GKE) using the cluster auto-scaler,
and sometimes while running these jobs a pretty bad thing happens: the
driver (in cluster mode) gets scheduled on Kubernetes and launches many
executor pods.
So far so good, but the k8s "Service" associated with the driver does not
seem to be propagated in terms of DNS resolution, so all the executors fail
with a "spark-application-......cluster.svc.local" does not exist error.

With all executors failing, the driver should be failing too, but it
considers this a "pending" initial allocation and stays stuck forever in a
loop of "Initial job has not accepted any resources, please check Cluster UI"

Has anyone else observed this kind of behaviour?
We had it on 2.3.1 and I upgraded to 2.4.1, but the issue still seems to
exist even after the "big refactoring" in the Kubernetes cluster scheduler
backend.

I can work on a fix / workaround but I'd like to check with you on the proper
way forward:

   - Some processes (like the Airflow Helm recipe) rely on a "sleep 30s"
   before launching the dependent pods (that could be added to the
   /opt/entrypoint.sh used in the Kubernetes packaging)
   - We can add a simple step to the init container that tries the DNS
   resolution and fails after 60s if it did not work (a rough sketch
   follows below)

But these steps won't change the fact that the driver will stay stuck,
thinking we're still within the initial allocation delay.
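
For illustration, a rough sketch of what that init-container check could
look like (the DRIVER_SVC variable and the 60s budget are made up here,
nothing Spark provides today):

    # Wait up to ~60s for the driver Service DNS record to become resolvable.
    for i in $(seq 1 12); do
      if nslookup "$DRIVER_SVC" > /dev/null 2>&1; then
        echo "Driver service resolves, proceeding."
        exit 0
      fi
      echo "Driver service not resolvable yet, retrying in 5s..."
      sleep 5
    done
    echo "Driver service still not resolvable after 60s, failing the init container."
    exit 1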

Thoughts ?

-- 
*Olivier Girardot*
o.girardot@lateral-thoughts.com

Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

Posted by Jose Luis Pedrosa <Jo...@microsoft.com.INVALID>.
Hi

In order to address this issue, as well as other use cases such as virtual kubelet, I’ve created this JIRA ticket:

https://issues.apache.org/jira/browse/SPARK-28149


From: Jose Luis Pedrosa <Jo...@microsoft.com>
Date: Tuesday 18 June 2019 at 16:38
To: "Prudhvi Chennuru (CONT)" <pr...@capitalone.com>
Cc: Olivier Girardot <o....@lateral-thoughts.com>, Li Gao <li...@gmail.com>, dev <de...@spark.apache.org>, user <us...@spark.apache.org>
Subject: Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

Hi!

I am assuming you’re running it in cluster mode,
Service should be created by the submit binary,  in this file: org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala
Don’t you have any failing logs where spark submit has been launched?

JL
From: "Prudhvi Chennuru (CONT)" <pr...@capitalone.com>
Date: Tuesday 18 June 2019 at 16:15
To: Jose Luis Pedrosa <Jo...@microsoft.com>
Cc: Olivier Girardot <o....@lateral-thoughts.com>, Li Gao <li...@gmail.com>, dev <de...@spark.apache.org>, user <us...@spark.apache.org>
Subject: Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

Thanks for the response Oliver.

I am facing this issue intermittently, once in a while i don't see service being created for the respective spark driver( i don't see service for that driver on kubernetes dashboard and not even via kubectl but in driver logs i see the service endpoint) and by default driver requests for executors in a batch of 5 as soon as 5 executors are created they fail with below error.


Caused by: java.io.IOException: Failed to connect to group9990-features-282526d440ab3f12a68746fbef289c95-driver-svc.experimental.svc:7078
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: group9990-features-282526d440ab3f12a68746fbef289c95-driver-svc.experimental.svc
at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)

Did you face the same problem or were you able to see the service for the driver pod on your cluster?


On Tue, Jun 18, 2019 at 8:00 AM Jose Luis Pedrosa <Jo...@microsoft.com>> wrote:
Hi guys

There’s also an interesting one that we found in a similar case. In our case the service ip ranges takes more time to be reachable, so DNS was timing out. The approach that I was suggesting was:

  1.  Add retries in the connection from the executor to the driver: https://github.com/apache/spark/pull/24702
  2.  Disable negative DNS caching at JVM level, on the entrypoint.sh

JL


From: Olivier Girardot <o....@lateral-thoughts.com>>
Date: Tuesday 18 June 2019 at 10:06
To: "Prudhvi Chennuru (CONT)" <pr...@capitalone.com>>
Cc: Li Gao <li...@gmail.com>>, dev <de...@spark.apache.org>>, user <us...@spark.apache.org>>
Subject: Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

Hi Prudhvi,
not really but we took a drastic approach mitigating this, modifying the bundled launch script to be more resilient.
In the kubernetes/dockerfiles/spark/entrypoint.sh in the executor case we added something like that :


  executor)

    DRIVER_HOST=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":" -f 1)

    DRIVER_PORT=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":" -f 2)



    for i in $(seq 1 20);

    do

      nc -zvw1 $DRIVER_HOST $DRIVER_PORT

      status=$?

      if [ $status -eq 0 ]

      then

        echo "Driver is accessible, let's rock'n'roll."

        break

      else

        echo "Driver not accessible :-| napping for a while..."

        sleep 3

      fi

    done

    CMD=(

      ${JAVA_HOME}/bin/java

    ....


That way the executor will not start before the driver is really connectable.
That's kind of a hack but we did not experience the issue anymore, so I guess I'll keep it for now.

Regards,

Olivier.

Le mar. 11 juin 2019 à 18:23, Prudhvi Chennuru (CONT) <pr...@capitalone.com>> a écrit :
Hey Oliver,

                     I am also facing the same issue on my kubernetes cluster(v1.11.5)  on AWS with spark version 2.3.3, any luck in figuring out the root cause?

On Fri, May 3, 2019 at 5:37 AM Olivier Girardot <o....@lateral-thoughts.com>> wrote:
Hi,
I did not try on another vendor, so I can't say if it's only related to gke, and no, I did not notice anything on the kubelet or kube-dns processes...

Regards

Le ven. 3 mai 2019 à 03:05, Li Gao <li...@gmail.com>> a écrit :
hi Olivier,

This seems a GKE specific issue? have you tried on other vendors ? Also on the kubelet nodes did you notice any pressure on the DNS side?

Li


On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <o....@lateral-thoughts.com>> wrote:
Hi everyone,
I have ~300 spark job on Kubernetes (GKE) using the cluster auto-scaler, and sometimes while running these jobs a pretty bad thing happens, the driver (in cluster mode) gets scheduled on Kubernetes and launches many executor pods.
So far so good, but the k8s "Service" associated to the driver does not seem to be propagated in terms of DNS resolution so all the executor fails with a "spark-application-......cluster.svc.local" does not exists.

All executors failing the driver should be failing too, but it considers that it's a "pending" initial allocation and stay stuck forever in a loop of "Initial job has not accepted any resources, please check Cluster UI"

Has anyone else observed this king of behaviour ?
We had it on 2.3.1 and I upgraded to 2.4.1 but this issue still seems to exist even after the "big refactoring" in the kubernetes cluster scheduler backend.

I can work on a fix / workaround but I'd like to check with you the proper way forward :

  *   Some processes (like the airflow helm recipe) rely on a "sleep 30s" before launching the dependent pods (that could be added to /opt/entrypoint.sh used in the kubernetes packing)
  *   We can add a simple step to the init container trying to do the DNS resolution and failing after 60s if it did not work
But these steps won't change the fact that the driver will stay stuck thinking we're still in the case of the Initial allocation delay.

Thoughts ?

--
Olivier Girardot
o.girardot@lateral-thoughts.com<ma...@lateral-thoughts.com>


--
Thanks,
Prudhvi Chennuru.



--
Olivier Girardot | Associé
o.girardot@lateral-thoughts.com<ma...@lateral-thoughts.com>
+33 6 24 09 17 94


--
Thanks,
Prudhvi Chennuru.


Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

Posted by Jose Luis Pedrosa <Jo...@microsoft.com.INVALID>.
Hi!

I am assuming you’re running it in cluster mode.
The service should be created by the submit binary, in this file: org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala
Don’t you have any failing logs from where spark-submit was launched?

JL
From: "Prudhvi Chennuru (CONT)" <pr...@capitalone.com>
Date: Tuesday 18 June 2019 at 16:15
To: Jose Luis Pedrosa <Jo...@microsoft.com>
Cc: Olivier Girardot <o....@lateral-thoughts.com>, Li Gao <li...@gmail.com>, dev <de...@spark.apache.org>, user <us...@spark.apache.org>
Subject: Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

Thanks for the response Oliver.

I am facing this issue intermittently, once in a while i don't see service being created for the respective spark driver( i don't see service for that driver on kubernetes dashboard and not even via kubectl but in driver logs i see the service endpoint) and by default driver requests for executors in a batch of 5 as soon as 5 executors are created they fail with below error.


Caused by: java.io.IOException: Failed to connect to group9990-features-282526d440ab3f12a68746fbef289c95-driver-svc.experimental.svc:7078
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: group9990-features-282526d440ab3f12a68746fbef289c95-driver-svc.experimental.svc
at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)

Did you face the same problem or were you able to see the service for the driver pod on your cluster?


On Tue, Jun 18, 2019 at 8:00 AM Jose Luis Pedrosa <Jo...@microsoft.com>> wrote:
Hi guys

There’s also an interesting one that we found in a similar case. In our case the service ip ranges takes more time to be reachable, so DNS was timing out. The approach that I was suggesting was:

  1.  Add retries in the connection from the executor to the driver: https://github.com/apache/spark/pull/24702
  2.  Disable negative DNS caching at JVM level, on the entrypoint.sh

JL


From: Olivier Girardot <o....@lateral-thoughts.com>>
Date: Tuesday 18 June 2019 at 10:06
To: "Prudhvi Chennuru (CONT)" <pr...@capitalone.com>>
Cc: Li Gao <li...@gmail.com>>, dev <de...@spark.apache.org>>, user <us...@spark.apache.org>>
Subject: Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

Hi Prudhvi,
not really but we took a drastic approach mitigating this, modifying the bundled launch script to be more resilient.
In the kubernetes/dockerfiles/spark/entrypoint.sh in the executor case we added something like that :


  executor)

    DRIVER_HOST=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":" -f 1)

    DRIVER_PORT=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":" -f 2)



    for i in $(seq 1 20);

    do

      nc -zvw1 $DRIVER_HOST $DRIVER_PORT

      status=$?

      if [ $status -eq 0 ]

      then

        echo "Driver is accessible, let's rock'n'roll."

        break

      else

        echo "Driver not accessible :-| napping for a while..."

        sleep 3

      fi

    done

    CMD=(

      ${JAVA_HOME}/bin/java

    ....


That way the executor will not start before the driver is really connectable.
That's kind of a hack but we did not experience the issue anymore, so I guess I'll keep it for now.

Regards,

Olivier.

Le mar. 11 juin 2019 à 18:23, Prudhvi Chennuru (CONT) <pr...@capitalone.com>> a écrit :
Hey Oliver,

                     I am also facing the same issue on my kubernetes cluster(v1.11.5)  on AWS with spark version 2.3.3, any luck in figuring out the root cause?

On Fri, May 3, 2019 at 5:37 AM Olivier Girardot <o....@lateral-thoughts.com>> wrote:
Hi,
I did not try on another vendor, so I can't say if it's only related to gke, and no, I did not notice anything on the kubelet or kube-dns processes...

Regards

Le ven. 3 mai 2019 à 03:05, Li Gao <li...@gmail.com>> a écrit :
hi Olivier,

This seems a GKE specific issue? have you tried on other vendors ? Also on the kubelet nodes did you notice any pressure on the DNS side?

Li


On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <o....@lateral-thoughts.com>> wrote:
Hi everyone,
I have ~300 spark job on Kubernetes (GKE) using the cluster auto-scaler, and sometimes while running these jobs a pretty bad thing happens, the driver (in cluster mode) gets scheduled on Kubernetes and launches many executor pods.
So far so good, but the k8s "Service" associated to the driver does not seem to be propagated in terms of DNS resolution so all the executor fails with a "spark-application-......cluster.svc.local" does not exists.

All executors failing the driver should be failing too, but it considers that it's a "pending" initial allocation and stay stuck forever in a loop of "Initial job has not accepted any resources, please check Cluster UI"

Has anyone else observed this king of behaviour ?
We had it on 2.3.1 and I upgraded to 2.4.1 but this issue still seems to exist even after the "big refactoring" in the kubernetes cluster scheduler backend.

I can work on a fix / workaround but I'd like to check with you the proper way forward :

  *   Some processes (like the airflow helm recipe) rely on a "sleep 30s" before launching the dependent pods (that could be added to /opt/entrypoint.sh used in the kubernetes packing)
  *   We can add a simple step to the init container trying to do the DNS resolution and failing after 60s if it did not work
But these steps won't change the fact that the driver will stay stuck thinking we're still in the case of the Initial allocation delay.

Thoughts ?

--
Olivier Girardot
o.girardot@lateral-thoughts.com<ma...@lateral-thoughts.com>


--
Thanks,
Prudhvi Chennuru.



--
Olivier Girardot | Associé
o.girardot@lateral-thoughts.com<ma...@lateral-thoughts.com>
+33 6 24 09 17 94


--
Thanks,
Prudhvi Chennuru.


Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

Posted by "Prudhvi Chennuru (CONT)" <pr...@capitalone.com>.
Thanks for the response, Olivier.

I am facing this issue intermittently. Once in a while I don't see the
service being created for the respective Spark driver (*I don't see a
service for that driver on the Kubernetes dashboard, and not even via
kubectl, but in the driver logs I do see the service endpoint*), and by
default the driver requests executors in batches of 5; as soon as the 5
executors are created, they fail with the error below.


Caused by: java.io.IOException: Failed to connect to
group9990-features-282526d440ab3f12a68746fbef289c95-driver-svc.experimental.svc:7078
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at
org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException:
group9990-features-282526d440ab3f12a68746fbef289c95-driver-svc.experimental.svc
at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)

Did you face the same problem or were you able to see the service for the
driver pod on your cluster?
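
For reference, a quick way to check this while the executors are failing
(namespace and service name below are placeholders):

    # Does the driver Service object exist at all?
    kubectl get svc -n <namespace> | grep driver-svc
    # Can it be resolved from inside the cluster? busybox:1.28 ships nslookup.
    kubectl run dns-check --rm -it --image=busybox:1.28 --restart=Never \
      -n <namespace> -- nslookup <driver-svc-name>.<namespace>.svc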


On Tue, Jun 18, 2019 at 8:00 AM Jose Luis Pedrosa <
Jose.Pedrosa@microsoft.com> wrote:

> Hi guys
>
>
>
> There’s also an interesting one that we found in a similar case. In our
> case the service ip ranges takes more time to be reachable, so DNS was
> timing out. The approach that I was suggesting was:
>
>    1. Add retries in the connection from the executor to the driver:
>    https://github.com/apache/spark/pull/24702
>    2. Disable negative DNS caching at JVM level, on the entrypoint.sh
>
>
>
> JL
>
>
>
>
>
> *From: *Olivier Girardot <o....@lateral-thoughts.com>
> *Date: *Tuesday 18 June 2019 at 10:06
> *To: *"Prudhvi Chennuru (CONT)" <pr...@capitalone.com>
> *Cc: *Li Gao <li...@gmail.com>, dev <de...@spark.apache.org>, user <
> user@spark.apache.org>
> *Subject: *Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS
> resolution of driver fails
>
>
>
> Hi Prudhvi,
>
> not really but we took a drastic approach mitigating this, modifying the
> bundled launch script to be more resilient.
>
> In the kubernetes/dockerfiles/spark/entrypoint.sh in the executor case we
> added something like that :
>
>
>
>   executor)
>
>     DRIVER_HOST=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":"
> -f 1)
>
>     DRIVER_PORT=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":"
> -f 2)
>
>
>
>     for i in $(seq 1 20);
>
>     do
>
>       nc -zvw1 $DRIVER_HOST $DRIVER_PORT
>
>       status=$?
>
>       if [ $status -eq 0 ]
>
>       then
>
>         echo "Driver is accessible, let's rock'n'roll."
>
>         break
>
>       else
>
>         echo "Driver not accessible :-| napping for a while..."
>
>         sleep 3
>
>       fi
>
>     done
>
>     CMD=(
>
>       ${JAVA_HOME}/bin/java
>
>     ....
>
>
>
> That way the executor will not start before the driver is really
> connectable.
>
> That's kind of a hack but we did not experience the issue anymore, so I
> guess I'll keep it for now.
>
>
>
> Regards,
>
>
>
> Olivier.
>
>
>
> Le mar. 11 juin 2019 à 18:23, Prudhvi Chennuru (CONT) <
> prudhvi.chennuru@capitalone.com> a écrit :
>
> Hey Oliver,
>
>
>
>                      I am also facing the same issue on my kubernetes
> cluster(v1.11.5)  on AWS with spark version 2.3.3, any luck in figuring out
> the root cause?
>
>
>
> On Fri, May 3, 2019 at 5:37 AM Olivier Girardot <
> o.girardot@lateral-thoughts.com> wrote:
>
> Hi,
>
> I did not try on another vendor, so I can't say if it's only related to
> gke, and no, I did not notice anything on the kubelet or kube-dns
> processes...
>
>
>
> Regards
>
>
>
> Le ven. 3 mai 2019 à 03:05, Li Gao <li...@gmail.com> a écrit :
>
> hi Olivier,
>
>
>
> This seems a GKE specific issue? have you tried on other vendors ? Also on
> the kubelet nodes did you notice any pressure on the DNS side?
>
>
>
> Li
>
>
>
>
>
> On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <
> o.girardot@lateral-thoughts.com> wrote:
>
> Hi everyone,
>
> I have ~300 spark job on Kubernetes (GKE) using the cluster auto-scaler,
> and sometimes while running these jobs a pretty bad thing happens, the
> driver (in cluster mode) gets scheduled on Kubernetes and launches many
> executor pods.
>
> So far so good, but the k8s "Service" associated to the driver does not
> seem to be propagated in terms of DNS resolution so all the executor fails
> with a "spark-application-......cluster.svc.local" does not exists.
>
>
>
> All executors failing the driver should be failing too, but it considers
> that it's a "pending" initial allocation and stay stuck forever in a loop
> of "Initial job has not accepted any resources, please check Cluster UI"
>
>
>
> Has anyone else observed this king of behaviour ?
>
> We had it on 2.3.1 and I upgraded to 2.4.1 but this issue still seems to
> exist even after the "big refactoring" in the kubernetes cluster scheduler
> backend.
>
>
>
> I can work on a fix / workaround but I'd like to check with you the proper
> way forward :
>
>    - Some processes (like the airflow helm recipe) rely on a "sleep 30s"
>    before launching the dependent pods (that could be added to
>    /opt/entrypoint.sh used in the kubernetes packing)
>    - We can add a simple step to the init container trying to do the DNS
>    resolution and failing after 60s if it did not work
>
> But these steps won't change the fact that the driver will stay stuck
> thinking we're still in the case of the Initial allocation delay.
>
>
>
> Thoughts ?
>
>
>
> --
>
> *Olivier Girardot*
>
> o.girardot@lateral-thoughts.com
>
>
>
>
> --
>
> *Thanks,*
>
> *Prudhvi Chennuru.*
>
>
>
>
>
>
> --
>
> *Olivier Girardot *| Associé
>
> o.girardot@lateral-thoughts.com
> +33 6 24 09 17 94
>


-- 
*Thanks,*
*Prudhvi Chennuru.*

Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

Posted by Jose Luis Pedrosa <Jo...@microsoft.com.INVALID>.
Hi guys

There’s also an interesting one that we found in a similar case. In our case the service IP ranges take more time to become reachable, so DNS was timing out. The approach I was suggesting was:

  1.  Add retries in the connection from the executor to the driver: https://github.com/apache/spark/pull/24702
  2.  Disable negative DNS caching at the JVM level, in the entrypoint.sh (a minimal sketch follows)
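
A minimal sketch of point 2 (the java.security path assumes an OpenJDK 8
layout inside the image; adjust for your base image):

    # Disable negative DNS caching: networkaddress.cache.negative.ttl is the
    # standard java.security property controlling how long failed lookups are
    # cached (default 10s); 0 turns the cache off.
    echo "networkaddress.cache.negative.ttl=0" >> "${JAVA_HOME}/jre/lib/security/java.security"
    # Per-JVM alternative (JDK-specific system property) that could be added to
    # the driver/executor extraJavaOptions instead:
    #   -Dsun.net.inetaddr.negative.ttl=0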

JL


From: Olivier Girardot <o....@lateral-thoughts.com>
Date: Tuesday 18 June 2019 at 10:06
To: "Prudhvi Chennuru (CONT)" <pr...@capitalone.com>
Cc: Li Gao <li...@gmail.com>, dev <de...@spark.apache.org>, user <us...@spark.apache.org>
Subject: Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

Hi Prudhvi,
not really but we took a drastic approach mitigating this, modifying the bundled launch script to be more resilient.
In the kubernetes/dockerfiles/spark/entrypoint.sh in the executor case we added something like that :


  executor)

    DRIVER_HOST=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":" -f 1)

    DRIVER_PORT=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":" -f 2)



    for i in $(seq 1 20);

    do

      nc -zvw1 $DRIVER_HOST $DRIVER_PORT

      status=$?

      if [ $status -eq 0 ]

      then

        echo "Driver is accessible, let's rock'n'roll."

        break

      else

        echo "Driver not accessible :-| napping for a while..."

        sleep 3

      fi

    done

    CMD=(

      ${JAVA_HOME}/bin/java

    ....


That way the executor will not start before the driver is really connectable.
That's kind of a hack but we did not experience the issue anymore, so I guess I'll keep it for now.

Regards,

Olivier.

Le mar. 11 juin 2019 à 18:23, Prudhvi Chennuru (CONT) <pr...@capitalone.com>> a écrit :
Hey Oliver,

                     I am also facing the same issue on my kubernetes cluster(v1.11.5)  on AWS with spark version 2.3.3, any luck in figuring out the root cause?

On Fri, May 3, 2019 at 5:37 AM Olivier Girardot <o....@lateral-thoughts.com>> wrote:
Hi,
I did not try on another vendor, so I can't say if it's only related to gke, and no, I did not notice anything on the kubelet or kube-dns processes...

Regards

Le ven. 3 mai 2019 à 03:05, Li Gao <li...@gmail.com>> a écrit :
hi Olivier,

This seems a GKE specific issue? have you tried on other vendors ? Also on the kubelet nodes did you notice any pressure on the DNS side?

Li


On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <o....@lateral-thoughts.com>> wrote:
Hi everyone,
I have ~300 spark job on Kubernetes (GKE) using the cluster auto-scaler, and sometimes while running these jobs a pretty bad thing happens, the driver (in cluster mode) gets scheduled on Kubernetes and launches many executor pods.
So far so good, but the k8s "Service" associated to the driver does not seem to be propagated in terms of DNS resolution so all the executor fails with a "spark-application-......cluster.svc.local" does not exists.

All executors failing the driver should be failing too, but it considers that it's a "pending" initial allocation and stay stuck forever in a loop of "Initial job has not accepted any resources, please check Cluster UI"

Has anyone else observed this king of behaviour ?
We had it on 2.3.1 and I upgraded to 2.4.1 but this issue still seems to exist even after the "big refactoring" in the kubernetes cluster scheduler backend.

I can work on a fix / workaround but I'd like to check with you the proper way forward :

  *   Some processes (like the airflow helm recipe) rely on a "sleep 30s" before launching the dependent pods (that could be added to /opt/entrypoint.sh used in the kubernetes packing)
  *   We can add a simple step to the init container trying to do the DNS resolution and failing after 60s if it did not work
But these steps won't change the fact that the driver will stay stuck thinking we're still in the case of the Initial allocation delay.

Thoughts ?

--
Olivier Girardot
o.girardot@lateral-thoughts.com<ma...@lateral-thoughts.com>


--
Thanks,
Prudhvi Chennuru.



--
Olivier Girardot | Associé
o.girardot@lateral-thoughts.com<ma...@lateral-thoughts.com>
+33 6 24 09 17 94

Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

Posted by Olivier Girardot <o....@lateral-thoughts.com>.
Hi Prudhvi,
not really, but we took a drastic approach to mitigate this, modifying the
bundled launch script to be more resilient.
In kubernetes/dockerfiles/spark/entrypoint.sh, in the executor case, we
added something like this:

  executor)
    # Extract the driver host and port from the RPC URL handed to the executor.
    DRIVER_HOST=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":" -f 1)
    DRIVER_PORT=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":" -f 2)

    # Poll the driver for up to ~60s (20 attempts, 3s apart) before starting the JVM.
    for i in $(seq 1 20);
    do
      nc -zvw1 $DRIVER_HOST $DRIVER_PORT
      status=$?
      if [ $status -eq 0 ]
      then
        echo "Driver is accessible, let's rock'n'roll."
        break
      else
        echo "Driver not accessible :-| napping for a while..."
        sleep 3
      fi
    done

    CMD=(
      ${JAVA_HOME}/bin/java
    ....


That way the executor will not start before the driver is really
connectable.
That's kind of a hack, but we have not experienced the issue since, so I
guess I'll keep it for now.

Regards,

Olivier.

On Tue, Jun 11, 2019 at 18:23, Prudhvi Chennuru (CONT) <
prudhvi.chennuru@capitalone.com> wrote:

> Hey Oliver,
>
>                      I am also facing the same issue on my kubernetes
> cluster(v1.11.5)  on AWS with spark version 2.3.3, any luck in figuring out
> the root cause?
>
> On Fri, May 3, 2019 at 5:37 AM Olivier Girardot <
> o.girardot@lateral-thoughts.com> wrote:
>
>> Hi,
>> I did not try on another vendor, so I can't say if it's only related to
>> gke, and no, I did not notice anything on the kubelet or kube-dns
>> processes...
>>
>> Regards
>>
>> Le ven. 3 mai 2019 à 03:05, Li Gao <li...@gmail.com> a écrit :
>>
>>> hi Olivier,
>>>
>>> This seems a GKE specific issue? have you tried on other vendors ? Also
>>> on the kubelet nodes did you notice any pressure on the DNS side?
>>>
>>> Li
>>>
>>>
>>> On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <
>>> o.girardot@lateral-thoughts.com> wrote:
>>>
>>>> Hi everyone,
>>>> I have ~300 spark job on Kubernetes (GKE) using the cluster
>>>> auto-scaler, and sometimes while running these jobs a pretty bad thing
>>>> happens, the driver (in cluster mode) gets scheduled on Kubernetes and
>>>> launches many executor pods.
>>>> So far so good, but the k8s "Service" associated to the driver does not
>>>> seem to be propagated in terms of DNS resolution so all the executor fails
>>>> with a "spark-application-......cluster.svc.local" does not exists.
>>>>
>>>> All executors failing the driver should be failing too, but it
>>>> considers that it's a "pending" initial allocation and stay stuck forever
>>>> in a loop of "Initial job has not accepted any resources, please check
>>>> Cluster UI"
>>>>
>>>> Has anyone else observed this king of behaviour ?
>>>> We had it on 2.3.1 and I upgraded to 2.4.1 but this issue still seems
>>>> to exist even after the "big refactoring" in the kubernetes cluster
>>>> scheduler backend.
>>>>
>>>> I can work on a fix / workaround but I'd like to check with you the
>>>> proper way forward :
>>>>
>>>>    - Some processes (like the airflow helm recipe) rely on a "sleep
>>>>    30s" before launching the dependent pods (that could be added to
>>>>    /opt/entrypoint.sh used in the kubernetes packing)
>>>>    - We can add a simple step to the init container trying to do the
>>>>    DNS resolution and failing after 60s if it did not work
>>>>
>>>> But these steps won't change the fact that the driver will stay stuck
>>>> thinking we're still in the case of the Initial allocation delay.
>>>>
>>>> Thoughts ?
>>>>
>>>> --
>>>> *Olivier Girardot*
>>>> o.girardot@lateral-thoughts.com
>>>>
>>>
>
> --
> *Thanks,*
> *Prudhvi Chennuru.*
>
>


-- 
*Olivier Girardot* | Associé
o.girardot@lateral-thoughts.com
+33 6 24 09 17 94

Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

Posted by "Prudhvi Chennuru (CONT)" <pr...@capitalone.com>.
Hey Olivier,

                     I am also facing the same issue on my Kubernetes
cluster (v1.11.5) on AWS with Spark version 2.3.3. Any luck in figuring out
the root cause?

On Fri, May 3, 2019 at 5:37 AM Olivier Girardot <
o.girardot@lateral-thoughts.com> wrote:

> Hi,
> I did not try on another vendor, so I can't say if it's only related to
> gke, and no, I did not notice anything on the kubelet or kube-dns
> processes...
>
> Regards
>
> Le ven. 3 mai 2019 à 03:05, Li Gao <li...@gmail.com> a écrit :
>
>> hi Olivier,
>>
>> This seems a GKE specific issue? have you tried on other vendors ? Also
>> on the kubelet nodes did you notice any pressure on the DNS side?
>>
>> Li
>>
>>
>> On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <
>> o.girardot@lateral-thoughts.com> wrote:
>>
>>> Hi everyone,
>>> I have ~300 spark job on Kubernetes (GKE) using the cluster auto-scaler,
>>> and sometimes while running these jobs a pretty bad thing happens, the
>>> driver (in cluster mode) gets scheduled on Kubernetes and launches many
>>> executor pods.
>>> So far so good, but the k8s "Service" associated to the driver does not
>>> seem to be propagated in terms of DNS resolution so all the executor fails
>>> with a "spark-application-......cluster.svc.local" does not exists.
>>>
>>> All executors failing the driver should be failing too, but it considers
>>> that it's a "pending" initial allocation and stay stuck forever in a loop
>>> of "Initial job has not accepted any resources, please check Cluster UI"
>>>
>>> Has anyone else observed this king of behaviour ?
>>> We had it on 2.3.1 and I upgraded to 2.4.1 but this issue still seems to
>>> exist even after the "big refactoring" in the kubernetes cluster scheduler
>>> backend.
>>>
>>> I can work on a fix / workaround but I'd like to check with you the
>>> proper way forward :
>>>
>>>    - Some processes (like the airflow helm recipe) rely on a "sleep
>>>    30s" before launching the dependent pods (that could be added to
>>>    /opt/entrypoint.sh used in the kubernetes packing)
>>>    - We can add a simple step to the init container trying to do the
>>>    DNS resolution and failing after 60s if it did not work
>>>
>>> But these steps won't change the fact that the driver will stay stuck
>>> thinking we're still in the case of the Initial allocation delay.
>>>
>>> Thoughts ?
>>>
>>> --
>>> *Olivier Girardot*
>>> o.girardot@lateral-thoughts.com
>>>
>>

-- 
*Thanks,*
*Prudhvi Chennuru.*

Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

Posted by Olivier Girardot <o....@lateral-thoughts.com>.
Hi,
I did not try another vendor, so I can't say if it's only related to
GKE, and no, I did not notice anything on the kubelet or kube-dns
processes...

Regards

On Fri, May 3, 2019 at 03:05, Li Gao <li...@gmail.com> wrote:

> hi Olivier,
>
> This seems a GKE specific issue? have you tried on other vendors ? Also on
> the kubelet nodes did you notice any pressure on the DNS side?
>
> Li
>
>
> On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <
> o.girardot@lateral-thoughts.com> wrote:
>
>> Hi everyone,
>> I have ~300 spark job on Kubernetes (GKE) using the cluster auto-scaler,
>> and sometimes while running these jobs a pretty bad thing happens, the
>> driver (in cluster mode) gets scheduled on Kubernetes and launches many
>> executor pods.
>> So far so good, but the k8s "Service" associated to the driver does not
>> seem to be propagated in terms of DNS resolution so all the executor fails
>> with a "spark-application-......cluster.svc.local" does not exists.
>>
>> All executors failing the driver should be failing too, but it considers
>> that it's a "pending" initial allocation and stay stuck forever in a loop
>> of "Initial job has not accepted any resources, please check Cluster UI"
>>
>> Has anyone else observed this king of behaviour ?
>> We had it on 2.3.1 and I upgraded to 2.4.1 but this issue still seems to
>> exist even after the "big refactoring" in the kubernetes cluster scheduler
>> backend.
>>
>> I can work on a fix / workaround but I'd like to check with you the
>> proper way forward :
>>
>>    - Some processes (like the airflow helm recipe) rely on a "sleep 30s"
>>    before launching the dependent pods (that could be added to
>>    /opt/entrypoint.sh used in the kubernetes packing)
>>    - We can add a simple step to the init container trying to do the DNS
>>    resolution and failing after 60s if it did not work
>>
>> But these steps won't change the fact that the driver will stay stuck
>> thinking we're still in the case of the Initial allocation delay.
>>
>> Thoughts ?
>>
>> --
>> *Olivier Girardot*
>> o.girardot@lateral-thoughts.com
>>
>

Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

Posted by Li Gao <li...@gmail.com>.
Hi Olivier,

This seems like a GKE-specific issue? Have you tried on other vendors? Also,
on the kubelet nodes, did you notice any pressure on the DNS side?
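
For what it's worth, a quick way to eyeball DNS pressure (assuming kube-dns,
the GKE default; swap the label for CoreDNS setups, and the top command
needs a metrics add-on installed):

    kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
    kubectl -n kube-system top pods -l k8s-app=kube-dns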

Li


On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <
o.girardot@lateral-thoughts.com> wrote:

> Hi everyone,
> I have ~300 spark job on Kubernetes (GKE) using the cluster auto-scaler,
> and sometimes while running these jobs a pretty bad thing happens, the
> driver (in cluster mode) gets scheduled on Kubernetes and launches many
> executor pods.
> So far so good, but the k8s "Service" associated to the driver does not
> seem to be propagated in terms of DNS resolution so all the executor fails
> with a "spark-application-......cluster.svc.local" does not exists.
>
> All executors failing the driver should be failing too, but it considers
> that it's a "pending" initial allocation and stay stuck forever in a loop
> of "Initial job has not accepted any resources, please check Cluster UI"
>
> Has anyone else observed this king of behaviour ?
> We had it on 2.3.1 and I upgraded to 2.4.1 but this issue still seems to
> exist even after the "big refactoring" in the kubernetes cluster scheduler
> backend.
>
> I can work on a fix / workaround but I'd like to check with you the proper
> way forward :
>
>    - Some processes (like the airflow helm recipe) rely on a "sleep 30s"
>    before launching the dependent pods (that could be added to
>    /opt/entrypoint.sh used in the kubernetes packing)
>    - We can add a simple step to the init container trying to do the DNS
>    resolution and failing after 60s if it did not work
>
> But these steps won't change the fact that the driver will stay stuck
> thinking we're still in the case of the Initial allocation delay.
>
> Thoughts ?
>
> --
> *Olivier Girardot*
> o.girardot@lateral-thoughts.com
>