Posted to user@spark.apache.org by Jenna Hoole <je...@gmail.com> on 2018/03/27 17:48:26 UTC
Spark on K8s resource staging server timeout
So I'm running into an issue with my resource staging server that's
producing a stacktrace like Issue 342
<https://github.com/apache-spark-on-k8s/spark/issues/342>, but I don't
think it's for the same reasons. What's happening is that every time after I
start up a resource staging server, the first job submitted that uses it
fails with a java.net.SocketTimeoutException: timeout, and then every
subsequent job runs perfectly, including with different jars and different
users. It's only ever the first job that fails, and it always fails. I know
I'm also running into Issue 577
<https://github.com/apache-spark-on-k8s/spark/issues/577> in that it takes
about three minutes before the resource staging server is accessible, but
I'm still failing after waiting over ten minutes, and in one case overnight. And
I'm just using the examples jar, so it's not a super large jar like in
Issue 342.
This isn't great for our CI process, so has anyone seen anything like this
before, or does anyone know how to increase the timeout in case it just takes
a while on initial contact? Setting spark.network.timeout has no effect.
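Since the failure described above is deterministic (only the very first submission against a fresh staging server fails), one stopgap for a CI pipeline is to wrap the submission in a retry loop rather than letting the whole run fail. This is a sketch, not something from the thread: the retry count and delay are arbitrary, and the RETRY_DELAY variable is invented here for configurability.

```shell
# Hedged CI workaround sketch: retry a command a few times before giving up.
# Retry count, delay, and the wrapped command are illustrative choices.
retry() {
  local attempts=$1; shift
  local delay=${RETRY_DELAY:-30}
  local i
  for (( i = 1; i <= attempts; i++ )); do
    "$@" && return 0                      # command succeeded; stop retrying
    echo "attempt $i/$attempts failed" >&2
    if (( i < attempts )); then
      sleep "$delay"                      # back off before the next attempt
    fi
  done
  return 1                                # all attempts failed
}

# Hypothetical usage: absorb the first-contact timeout instead of failing CI.
# retry 3 bin/spark-submit --class org.apache.spark.examples.SparkPi \
#   --conf spark.kubernetes.resourceStagingServer.uri=http://192.168.0.1:30622 \
#   ./examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
```

This only masks the symptom; it does not explain why the first upload times out.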
[jhoole@nid00006 spark]$ kubectl get pods | grep jhoole-spark
jhoole-spark-resource-staging-server-64666675c8-w5cdm   1/1   Running   0   13m
[jhoole@nid00006 spark]$ kubectl get svc | grep jhoole-spark
jhoole-spark-resource-staging-service   NodePort   10.96.143.55   <none>   10000:30622/TCP   13m
[jhoole@nid00006 spark]$ bin/spark-submit --class
org.apache.spark.examples.SparkPi --conf spark.app.name=spark-pi --conf
spark.kubernetes.resourceStagingServer.uri=http://192.168.0.1:30622
./examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
2018-03-27 12:30:13 WARN NativeCodeLoader:62 - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2018-03-27 12:30:13 INFO UserGroupInformation:966 - Login successful for
user jhoole@local using keytab file /security/secrets/jhoole.keytab
2018-03-27 12:30:14 INFO HadoopStepsOrchestrator:54 - Hadoop Conf
directory: /etc/hadoop/conf
2018-03-27 12:30:14 INFO SecurityManager:54 - Changing view acls to: jhoole
2018-03-27 12:30:14 INFO SecurityManager:54 - Changing modify acls to:
jhoole
2018-03-27 12:30:14 INFO SecurityManager:54 - Changing view acls groups to:
2018-03-27 12:30:14 INFO SecurityManager:54 - Changing modify acls groups
to:
2018-03-27 12:30:14 INFO SecurityManager:54 - SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(jhoole); groups with view permissions: Set(); users with modify
permissions: Set(jhoole); groups with modify permissions: Set()
Exception in thread "main" java.net.SocketTimeoutException: timeout
at okio.Okio$4.newTimeoutException(Okio.java:230)
at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)
at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)
at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)
at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
at okhttp3.RealCall.execute(RealCall.java:69)
at retrofit2.OkHttpCall.execute(OkHttpCall.java:174)
at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.getTypedResponseResult(SubmittedDependencyUploaderImpl.scala:101)
at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.doUpload(SubmittedDependencyUploaderImpl.scala:97)
at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.uploadJars(SubmittedDependencyUploaderImpl.scala:70)
at org.apache.spark.deploy.k8s.submit.submitsteps.initcontainer.SubmittedResourcesInitContainerConfigurationStep.configureInitContainer(SubmittedResourcesInitContainerConfigurationStep.scala:48)
at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep$$anonfun$configureDriver$1.apply(InitContainerBootstrapStep.scala:43)
at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep$$anonfun$configureDriver$1.apply(InitContainerBootstrapStep.scala:42)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep.configureDriver(InitContainerBootstrapStep.scala:42)
at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(Client.scala:102)
at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(Client.scala:101)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.deploy.k8s.submit.Client.run(Client.scala:101)
at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$5.apply(Client.scala:200)
at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$5.apply(Client.scala:193)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2551)
at org.apache.spark.deploy.k8s.submit.Client$.run(Client.scala:193)
at org.apache.spark.deploy.k8s.submit.Client$.main(Client.scala:213)
at org.apache.spark.deploy.k8s.submit.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:786)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:204)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at okio.Okio$2.read(Okio.java:139)
at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
... 47 more
2018-03-27 12:30:24 INFO ShutdownHookManager:54 - Shutdown hook called
2018-03-27 12:30:24 INFO ShutdownHookManager:54 - Deleting directory
/tmp/uploaded-jars-4c7ca1cf-31d6-4dba-9203-c9a6f1cd4099
Thanks,
Jenna
Re: Spark on K8s resource staging server timeout
Posted by Jenna Hoole <je...@gmail.com>.
Unfortunately the other Kubernetes cluster I was using was rebuilt from
scratch yesterday, but the RSS I have up today has pretty uninteresting
logs.
[root@nid00006 ~]# kubectl logs
default-spark-resource-staging-server-7669dd57d7-xkvp6
++ id -u
+ myuid=0
++ id -g
+ mygid=0
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ /sbin/tini -s -- /opt/spark/bin/spark-class
org.apache.spark.deploy.rest.k8s.ResourceStagingServer
/etc/spark-resource-staging-server/resource-staging-server.properties
2018-03-29 18:44:03 INFO log:192 - Logging initialized @23503ms
2018-03-29 18:44:07 WARN ContextHandler:1444 -
o.s.j.s.ServletContextHandler@7a55af6b{/,null,null} contextPath ends with /
2018-03-29 18:44:17 WARN NativeCodeLoader:62 - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2018-03-29 18:44:21 INFO SecurityManager:54 - Changing view acls to: root
2018-03-29 18:44:21 INFO SecurityManager:54 - Changing modify acls to: root
2018-03-29 18:44:21 INFO SecurityManager:54 - Changing view acls groups to:
2018-03-29 18:44:21 INFO SecurityManager:54 - Changing modify acls groups
to:
2018-03-29 18:44:21 INFO SecurityManager:54 - SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(root); groups with view permissions: Set(); users with modify
permissions: Set(root); groups with modify permissions: Set()
2018-03-29 18:44:22 INFO Server:345 - jetty-9.3.z-SNAPSHOT
2018-03-29 18:44:47 INFO ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@7a55af6b{/api,null,AVAILABLE}
2018-03-29 18:44:48 INFO AbstractConnector:270 - Started
ServerConnector@4f8b4bd0{HTTP/1.1,[http/1.1]}{0.0.0.0:10000}
2018-03-29 18:44:48 INFO Server:403 - Started @68600ms
2018-03-29 18:44:48 INFO ResourceStagingServer:54 - Resource staging
server started on port 10000.
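The log above shows Jetty answering only at @68600ms, roughly 45 seconds after logging initialized at @23503ms, so the pod can report Running well before the server actually accepts HTTP requests. A CI script could block until the staging server answers before the first submission. This is a sketch under assumptions: the URL and timeout are illustrative, and it only checks that the port answers HTTP at all, not that uploads will succeed.

```shell
# Sketch: poll a URL until the server answers HTTP, so the first spark-submit
# is not raced against the staging server's slow Jetty startup.
# Usage: wait_for_url URL [TIMEOUT_SECONDS]
wait_for_url() {
  local url=$1
  local deadline=$(( SECONDS + ${2:-300} ))   # default: give up after 5 min
  # curl -s returns 0 for any HTTP response (even 404) and nonzero when the
  # connection itself fails, which is the condition we care about here.
  until curl -s --max-time 5 -o /dev/null "$url"; do
    if (( SECONDS >= deadline )); then
      echo "timed out waiting for $url" >&2
      return 1
    fi
    sleep 2
  done
}

# Hypothetical usage against the NodePort service from the thread:
# wait_for_url http://192.168.0.1:30622/api 600 && bin/spark-submit ...
```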
-Jenna
On Thu, Mar 29, 2018 at 1:26 PM, Matt Cheah <mc...@palantir.com> wrote:
> Hello Jenna,
>
> Are there any logs from the resource staging server pod? They might show
> something interesting.
>
> Unfortunately, we haven’t been maintaining the resource staging server
> because we’ve moved all of our effort to the main repository instead of the
> fork. When we consider the submission of local files in the official
> release we should probably create a mechanism that’s more resilient. Using
> a single HTTP server isn’t ideal – would ideally like something that’s
> highly available, replicated, etc.
>
> -Matt Cheah
Re: Spark on K8s resource staging server timeout
Posted by Matt Cheah <mc...@palantir.com>.
Hello Jenna,
Are there any logs from the resource staging server pod? They might show something interesting.
Unfortunately, we haven’t been maintaining the resource staging server because we’ve moved all of our effort to the main repository instead of the fork. When we consider the submission of local files in the official release we should probably create a mechanism that’s more resilient. Using a single HTTP server isn’t ideal – would ideally like something that’s highly available, replicated, etc.
-Matt Cheah
From: Jenna Hoole <je...@gmail.com>
Date: Thursday, March 29, 2018 at 10:37 AM
To: "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: Spark on K8s resource staging server timeout
I added overkill high timeouts to the OkHttpClient.Builder() in RetrofitClientFactory.scala and I don't seem to be timing out anymore.
val okHttpClientBuilder = new OkHttpClient.Builder()
.dispatcher(dispatcher)
.proxy(resolvedProxy)
.connectTimeout(120, TimeUnit.SECONDS)
.writeTimeout(120, TimeUnit.SECONDS)
.readTimeout(120, TimeUnit.SECONDS)
-Jenna
Re: Spark on K8s resource staging server timeout
Posted by Jenna Hoole <je...@gmail.com>.
I added overkill high timeouts to the OkHttpClient.Builder() in
RetrofitClientFactory.scala and I don't seem to be timing out anymore.
import java.util.concurrent.TimeUnit
import okhttp3.OkHttpClient

// Raise OkHttp's default 10-second connect/read/write timeouts so the first
// upload to a freshly started resource staging server has time to complete.
val okHttpClientBuilder = new OkHttpClient.Builder()
  .dispatcher(dispatcher)
  .proxy(resolvedProxy)
  .connectTimeout(120, TimeUnit.SECONDS)
  .writeTimeout(120, TimeUnit.SECONDS)
  .readTimeout(120, TimeUnit.SECONDS)
-Jenna
On Tue, Mar 27, 2018 at 10:48 AM, Jenna Hoole <je...@gmail.com> wrote:
> So I'm running into an issue with my resource staging server that's
> producing a stacktrace like Issue 342
> <https://github.com/apache-spark-on-k8s/spark/issues/342>, but I don't
> think for the same reasons. What's happening is that every time after I
> start up a resource staging server, the first job submitted that uses it
> will fail with a java.net.SocketTimeoutException: timeout, and then every
> subsequent job will run perfectly. Including with different jars and
> different users. It's only ever the first job that fails and it always
> fails. I know I'm also running into Issue 577
> <https://github.com/apache-spark-on-k8s/spark/issues/577> in that it
> takes about three minutes before the resource staging server is accessible,
> but I'm still failing waiting over ten minutes or in one case overnight.
> And I'm just using the examples jar, so it's not a super large jar like in
> Issue 342.
>
> This isn't great for our CI process, so has anyone seen anything like this
> before or know how to increase the timeout if it just takes a while on
> initial contact? Using spark.network.timeout has no effect.
>
> [jhoole@nid00006 spark]$ kubectl get pods | grep jhoole-spark
> jhoole-spark-resource-staging-server-64666675c8-w5cdm   1/1   Running   0   13m
>
> [jhoole@nid00006 spark]$ kubectl get svc | grep jhoole-spark
> jhoole-spark-resource-staging-service   NodePort   10.96.143.55   <none>   10000:30622/TCP   13m
>
> [jhoole@nid00006 spark]$ bin/spark-submit --class org.apache.spark.examples.SparkPi --conf spark.app.name=spark-pi --conf spark.kubernetes.resourceStagingServer.uri=http://192.168.0.1:30622 ./examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
>
> 2018-03-27 12:30:13 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2018-03-27 12:30:13 INFO  UserGroupInformation:966 - Login successful for user jhoole@local using keytab file /security/secrets/jhoole.keytab
> 2018-03-27 12:30:14 INFO  HadoopStepsOrchestrator:54 - Hadoop Conf directory: /etc/hadoop/conf
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing view acls to: jhoole
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing modify acls to: jhoole
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing view acls groups to:
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing modify acls groups to:
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jhoole); groups with view permissions: Set(); users with modify permissions: Set(jhoole); groups with modify permissions: Set()
>
> Exception in thread "main" java.net.SocketTimeoutException: timeout
>     at okio.Okio$4.newTimeoutException(Okio.java:230)
>     at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
>     at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
>     at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)
>     at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)
>     at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)
>     at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
>     at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>     at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>     at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>     at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
>     at okhttp3.RealCall.execute(RealCall.java:69)
>     at retrofit2.OkHttpCall.execute(OkHttpCall.java:174)
>     at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.getTypedResponseResult(SubmittedDependencyUploaderImpl.scala:101)
>     at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.doUpload(SubmittedDependencyUploaderImpl.scala:97)
>     at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.uploadJars(SubmittedDependencyUploaderImpl.scala:70)
>     at org.apache.spark.deploy.k8s.submit.submitsteps.initcontainer.SubmittedResourcesInitContainerConfigurationStep.configureInitContainer(SubmittedResourcesInitContainerConfigurationStep.scala:48)
>     at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep$$anonfun$configureDriver$1.apply(InitContainerBootstrapStep.scala:43)
>     at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep$$anonfun$configureDriver$1.apply(InitContainerBootstrapStep.scala:42)
>     at scala.collection.immutable.List.foreach(List.scala:381)
>     at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep.configureDriver(InitContainerBootstrapStep.scala:42)
>     at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(Client.scala:102)
>     at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(Client.scala:101)
>     at scala.collection.immutable.List.foreach(List.scala:381)
>     at org.apache.spark.deploy.k8s.submit.Client.run(Client.scala:101)
>     at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$5.apply(Client.scala:200)
>     at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$5.apply(Client.scala:193)
>     at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2551)
>     at org.apache.spark.deploy.k8s.submit.Client$.run(Client.scala:193)
>     at org.apache.spark.deploy.k8s.submit.Client$.main(Client.scala:213)
>     at org.apache.spark.deploy.k8s.submit.Client.main(Client.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:786)
>     at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>     at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.net.SocketException: Socket closed
>     at java.net.SocketInputStream.read(SocketInputStream.java:204)
>     at java.net.SocketInputStream.read(SocketInputStream.java:141)
>     at okio.Okio$2.read(Okio.java:139)
>     at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
>     ... 47 more
>
> 2018-03-27 12:30:24 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2018-03-27 12:30:24 INFO  ShutdownHookManager:54 - Deleting directory /tmp/uploaded-jars-4c7ca1cf-31d6-4dba-9203-c9a6f1cd4099
>
> Thanks,
> Jenna
>