Posted to issues@spark.apache.org by "Anton Ippolitov (Jira)" <ji...@apache.org> on 2022/10/17 08:45:00 UTC
[jira] [Updated] (SPARK-40817) Remote spark.jars URIs ignored for Spark on Kubernetes in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-40817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anton Ippolitov updated SPARK-40817:
------------------------------------
Description:
I discovered that remote URIs in {{spark.jars}} get discarded when launching Spark on Kubernetes in cluster mode via spark-submit.
h1. Reproduction
Here is an example reproduction with S3 being used for remote JAR storage:
I first created 2 JARs:
* {{/opt/my-local-jar.jar}} on the host where I'm running spark-submit
* {{s3a://$BUCKET_NAME/my-remote-jar.jar}} in an S3 bucket I own
I then ran the following spark-submit command with {{spark.jars}} pointing to both the local JAR and the remote JAR:
{code:bash}
spark-submit \
--master k8s://https://$KUBERNETES_API_SERVER_URL:443 \
--deploy-mode cluster \
--name=spark-submit-test \
--class org.apache.spark.examples.SparkPi \
--conf spark.jars=/opt/my-local-jar.jar,s3a://$BUCKET_NAME/my-remote-jar.jar \
--conf spark.kubernetes.file.upload.path=s3a://$BUCKET_NAME/my-upload-path/ \
[...]
/opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar
{code}
Once the driver and the executors started, I confirmed that there was no trace of {{my-remote-jar.jar}} anymore. For example, looking at the Spark History Server, I could see that {{spark.jars}} got transformed into this:
!image-2022-10-17-10-44-46-862.png|width=991,height=80!
There was no mention of {{my-remote-jar.jar}} on the classpath or anywhere else.
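For reference, the rewritten value followed a pattern like the one below (illustrative only: {{spark-upload-<uuid>}} stands in for the randomly-named per-submission upload directory, not a literal value):
{code}
spark.jars=s3a://$BUCKET_NAME/my-upload-path/spark-upload-<uuid>/my-local-jar.jar
{code}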
Note that I ran all tests with Spark 3.1.3; however, the code that handles these dependencies appears to be the same in more recent versions of Spark as well.
h1. Root cause description
I believe the issue comes from [this logic|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L163-L186] in {{BasicDriverFeatureStep.getAdditionalPodSystemProperties()}}.
Specifically, this logic takes all URIs in {{spark.jars}}, [filters only on local URIs,|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L165] [uploads|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L173] those local files to {{spark.kubernetes.file.upload.path}} and then [*replaces*|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L182] the value of {{spark.jars}} with those newly uploaded JARs. By overwriting the previous value of {{spark.jars}}, we lose all mention of the remote JARs that were previously specified there.
Consequently, when the Spark driver starts afterwards, it only downloads the JARs from {{spark.kubernetes.file.upload.path}}.
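To make the effect concrete, here is a minimal, self-contained Scala sketch of that behavior (not the actual Spark code: {{isLocal}} stands in for {{KubernetesUtils.isLocalAndResolvable}} and the upload step is stubbed out):
{code:scala}
// Standalone illustration of the current overwrite behavior: the remote URI is dropped.
object JarsOverwriteDemo {
  // Stand-in for KubernetesUtils.isLocalAndResolvable: treat anything without
  // a remote scheme as a local file that needs to be uploaded.
  private def isLocal(uri: String): Boolean = !uri.contains("://")

  // Stand-in for the upload step: pretend the local file was uploaded under
  // spark.kubernetes.file.upload.path and return its new remote URI.
  private def upload(uri: String, uploadPath: String): String =
    s"$uploadPath/${uri.split('/').last}"

  def main(args: Array[String]): Unit = {
    val jars = Seq("/opt/my-local-jar.jar", "s3a://bucket/my-remote-jar.jar")
    val uploaded = jars.filter(isLocal).map(upload(_, "s3a://bucket/my-upload-path"))
    // The current logic replaces spark.jars with ONLY the uploaded URIs:
    println(uploaded.mkString(","))
    // prints: s3a://bucket/my-upload-path/my-local-jar.jar
    // => s3a://bucket/my-remote-jar.jar has disappeared
  }
}
{code}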
h1. Possible solution
I think a possible fix would be to not fully overwrite the value of {{spark.jars}}, but to keep the remote URIs in it.
The new logic would look something like this:
{code:scala}
Seq(JARS, FILES, ARCHIVES, SUBMIT_PYTHON_FILES).foreach { key =>
  val uris = conf.get(key).filter(uri => KubernetesUtils.isLocalAndResolvable(uri))
  // Save the remote URIs so they can be re-appended after the upload step
  val remoteUris = conf.get(key).filter(uri => !KubernetesUtils.isLocalAndResolvable(uri))
  val value = {
    if (key == ARCHIVES) {
      uris.map(UriBuilder.fromUri(_).fragment(null).build()).map(_.toString)
    } else {
      uris
    }
  }
  val resolved = KubernetesUtils.uploadAndTransformFileUris(value, Some(conf.sparkConf))
  if (resolved.nonEmpty) {
    val resolvedValue = if (key == ARCHIVES) {
      // Re-attach the original URI fragments to the uploaded archive URIs
      uris.zip(resolved).map { case (uri, r) =>
        UriBuilder.fromUri(r).fragment(new java.net.URI(uri).getFragment).build().toString
      }
    } else {
      resolved
    }
    // Re-append the remote URIs that were filtered out above
    additionalProps.put(key.key, (resolvedValue ++ remoteUris).mkString(","))
  }
}
{code}
I tested it out in my environment and it worked: {{s3a://$BUCKET_NAME/my-remote-jar.jar}} was kept in {{spark.jars}} and the Spark driver was able to download it.
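With the patched logic, the driver's {{spark.jars}} instead looked something like this (again with {{spark-upload-<uuid>}} standing in for the randomly-named upload directory):
{code}
spark.jars=s3a://$BUCKET_NAME/my-upload-path/spark-upload-<uuid>/my-local-jar.jar,s3a://$BUCKET_NAME/my-remote-jar.jar
{code}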
I don't know the codebase well enough, though, to assess whether I am missing something else or whether this is enough to fix the issue.
> Remote spark.jars URIs ignored for Spark on Kubernetes in cluster mode
> -----------------------------------------------------------------------
>
> Key: SPARK-40817
> URL: https://issues.apache.org/jira/browse/SPARK-40817
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Submit
> Affects Versions: 3.0.0, 3.1.3, 3.3.0, 3.2.2
> Environment: Spark 3.1.3
> Kubernetes 1.21
> Ubuntu 20.04.1
> Reporter: Anton Ippolitov
> Priority: Major
> Attachments: image-2022-10-17-10-44-46-862.png
>
>