Posted to issues@spark.apache.org by "Andy Grove (Jira)" <ji...@apache.org> on 2019/12/12 10:42:00 UTC

[jira] [Issue Comment Deleted] (SPARK-29640) [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver

     [ https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove updated SPARK-29640:
-------------------------------
    Comment: was deleted

(was: We were finally able to get to a root cause on this so I'm documenting it here in the hopes that it helps someone else in the future.

The issue was due to the way routing was set up on our EKS clusters, combined with the fact that we were using an NLB rather than an ELB in front of our nginx ingress controllers.

Specifically, NLB does not support "hairpinning" as explained in [https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-troubleshooting.html]

In layman's terms: if pod A tries to communicate with pod B, both pods are on the same node, and the request egresses from the node and is then routed back to that same node via the NLB and nginx ingress controller, then the request can never succeed and will time out.

Switching to an ELB resolves the issue, but a better solution is to use cluster-local addressing so that communication between pods on the same node stays on the local network.)
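A minimal sketch of what cluster-local addressing means here (the service name and port below are hypothetical, not taken from this issue): instead of calling another pod through the external NLB hostname, call the Kubernetes Service directly, so DNS resolves to its ClusterIP and same-node traffic never leaves the node.

{code:scala}
// Hedged sketch: "my-service" and port 8080 are hypothetical placeholders.
object ClusterLocalAddressing {
  // External route: traffic egresses the node, hits the NLB + nginx ingress,
  // and can be hairpinned back to the same node, where it times out.
  val externalUrl = "https://my-service.example.com/api/health"

  // Cluster-local route: CoreDNS/kube-dns resolves the Service's ClusterIP,
  // so pod-to-pod traffic on the same node stays on the cluster network.
  def clusterLocalUrl(service: String, namespace: String, port: Int): String =
    s"http://$service.$namespace.svc.cluster.local:$port/api/health"

  def main(args: Array[String]): Unit =
    println(clusterLocalUrl("my-service", "tenant-8-workflows", 8080))
}
{code}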

> [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29640
>                 URL: https://issues.apache.org/jira/browse/SPARK-29640
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.4
>            Reporter: Andy Grove
>            Priority: Major
>
> We are running into intermittent DNS issues where the Spark driver fails to resolve "kubernetes.default.svc" when trying to create executors. We are running Spark 2.4.4 (with the patch for SPARK-28921) in cluster mode in EKS.
> This happens approximately 10% of the time.
> Here is the stack trace:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: External scheduler cannot be instantiated
> 	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
> 	at org.apache.spark.SparkContext.<init>(SparkContext.scala:493)
> 	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
> 	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
> 	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
> 	at scala.Option.getOrElse(Option.scala:121)
> 	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
> 	at com.rms.execution.test.SparkPiTask$.main(SparkPiTask.scala:36)
> 	at com.rms.execution.test.SparkPiTask.main(SparkPiTask.scala)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> 	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
> 	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
> 	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
> 	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> 	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
> 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
> 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [wf-50000-69674f15d0fc45-1571354060179-driver]  in namespace: [tenant-8-workflows]  failed.
> 	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
> 	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
> 	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:229)
> 	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:162)
> 	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
> 	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
> 	at scala.Option.map(Option.scala:146)
> 	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
> 	at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
> 	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
> 	... 20 more
> Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
> 	at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
> 	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
> 	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
> 	at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
> 	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
> 	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
> 	at okhttp3.Dns$1.lookup(Dns.java:39)
> 	at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:171)
> 	at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:137)
> 	at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:82)
> 	at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:171)
> 	at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
> 	at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
> 	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:110)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
> 	at okhttp3.RealCall.execute(RealCall.java:69)
> 	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:404)
> 	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:365)
> 	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:330)
> 	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:311)
> 	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:810)
> 	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:218)
> 	... 27 more  {code}
> This issue seems to be caused by [https://github.com/kubernetes/kubernetes/issues/76790]
> One suggested workaround is to specify TCP mode for DNS lookups in the pod spec ([https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-424498508]); a sketch of this pod-spec change is shown after this description.
> I would like the ability to provide a flag to spark-submit that specifies TCP mode for DNS lookups.
> I am working on a PR for this.
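A minimal sketch of the pod-spec change that workaround describes, expressed with the fabric8 client that Spark's K8s scheduler backend already uses (the fluent builder method names are assumptions based on fabric8's generated API, and this is not the proposed spark-submit flag or its implementation):

{code:scala}
import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}

object ForceTcpDns {
  // Adds the glibc resolver option "use-vc" to the pod's dnsConfig, which
  // forces DNS lookups over TCP instead of UDP. Equivalent pod spec fragment:
  //   spec:
  //     dnsConfig:
  //       options:
  //         - name: use-vc
  def withTcpDns(pod: Pod): Pod =
    new PodBuilder(pod)
      .editOrNewSpec()
        .editOrNewDnsConfig()
          .addNewOption()
            .withName("use-vc")
          .endOption()
        .endDnsConfig()
      .endSpec()
      .build()
}
{code}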



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org