Posted to issues@spark.apache.org by "Andy Grove (Jira)" <ji...@apache.org> on 2019/11/05 13:28:00 UTC

[jira] [Resolved] (SPARK-29640) [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver

     [ https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove resolved SPARK-29640.
--------------------------------
    Resolution: Not A Bug

Specifying TCP mode for DNS lookups did not help; this turned out to be an environment issue caused by two bad nodes in the cluster. Deleting those nodes and creating new ones resolved the problem.
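For reference, the TCP-mode workaround that was tested is usually applied through the pod spec's dnsConfig, which injects resolv.conf options; a minimal sketch might look like the following (the pod name and image are placeholders, and the use-vc option only affects glibc-based images — musl resolvers, e.g. Alpine, ignore it):

{code:yaml}
# Hypothetical pod spec fragment: force the glibc resolver to use TCP
# for DNS lookups via the resolv.conf "use-vc" option.
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver            # example name
spec:
  dnsConfig:
    options:
      - name: use-vc            # glibc: resolve over TCP instead of UDP
  containers:
    - name: spark-kubernetes-driver
      image: my-spark:2.4.4     # placeholder image
{code}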

> [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29640
>                 URL: https://issues.apache.org/jira/browse/SPARK-29640
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.4
>            Reporter: Andy Grove
>            Priority: Major
>
> We are running into intermittent DNS issues where the Spark driver fails to resolve "kubernetes.default.svc" when trying to create executors. We are running Spark 2.4.4 (with the patch for SPARK-28921) in cluster mode in EKS.
> This happens approximately 10% of the time.
> Here is the stack trace:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: External scheduler cannot be instantiated
> 	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
> 	at org.apache.spark.SparkContext.<init>(SparkContext.scala:493)
> 	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
> 	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
> 	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
> 	at scala.Option.getOrElse(Option.scala:121)
> 	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
> 	at com.rms.execution.test.SparkPiTask$.main(SparkPiTask.scala:36)
> 	at com.rms.execution.test.SparkPiTask.main(SparkPiTask.scala)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> 	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
> 	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
> 	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
> 	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> 	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
> 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
> 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [wf-50000-69674f15d0fc45-1571354060179-driver]  in namespace: [tenant-8-workflows]  failed.
> 	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
> 	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
> 	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:229)
> 	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:162)
> 	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
> 	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
> 	at scala.Option.map(Option.scala:146)
> 	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
> 	at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
> 	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
> 	... 20 more
> Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
> 	at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
> 	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
> 	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
> 	at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
> 	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
> 	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
> 	at okhttp3.Dns$1.lookup(Dns.java:39)
> 	at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:171)
> 	at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:137)
> 	at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:82)
> 	at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:171)
> 	at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
> 	at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
> 	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:110)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
> 	at okhttp3.RealCall.execute(RealCall.java:69)
> 	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:404)
> 	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:365)
> 	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:330)
> 	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:311)
> 	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:810)
> 	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:218)
> 	... 27 more  {code}
> This issue seems to be caused by [https://github.com/kubernetes/kubernetes/issues/76790]
> One suggested workaround is to specify TCP mode for DNS lookups in the pod spec ([https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-424498508]).
> I would like spark-submit to accept a flag that forces TCP mode for DNS lookups.
> I am working on a PR for this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org