Posted to issues@spark.apache.org by "Anton Ippolitov (Jira)" <ji...@apache.org> on 2022/10/17 08:45:00 UTC
[jira] [Updated] (SPARK-40817) Remote spark.jars URIs ignored for Spark on Kubernetes in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-40817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anton Ippolitov updated SPARK-40817:
------------------------------------
Description:
I discovered that remote URIs in {{spark.jars}} get discarded when launching Spark on Kubernetes in cluster mode via spark-submit.
h1. Reproduction
Here is an example reproduction with S3 being used for remote JAR storage:
I first created 2 JARs:
* {{/opt/my-local-jar.jar}} on the host where I'm running spark-submit
* {{s3a://$BUCKET_NAME/my-remote-jar.jar}} in an S3 bucket I own
I then ran the following spark-submit command with {{spark.jars}} pointing to both the local JAR and the remote JAR:
{code:bash}
spark-submit \
--master k8s://https://$KUBERNETES_API_SERVER_URL:443 \
--deploy-mode cluster \
--name=spark-submit-test \
--class org.apache.spark.examples.SparkPi \
--conf spark.jars=/opt/my-local-jar.jar,s3a://$BUCKET_NAME/my-remote-jar.jar \
--conf spark.kubernetes.file.upload.path=s3a://$BUCKET_NAME/my-upload-path/ \
[...]
/opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar
{code}
Once the driver and the executors started, I confirmed that there was no trace of {{my-remote-jar.jar}} anymore. For example, looking at the Spark History Server, I could see that {{spark.jars}} got transformed into this:
!image-2022-10-17-10-44-46-862.png|width=991,height=80!
There was no mention of {{my-remote-jar.jar}} on the classpath or anywhere else.
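For reference, the rewritten value followed a pattern like the one below (illustrative only: {{spark-upload-<uuid>}} stands in for the randomly-named per-submission upload directory, not a literal value):
{code}
spark.jars=s3a://$BUCKET_NAME/my-upload-path/spark-upload-<uuid>/my-local-jar.jar
{code}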
Note that I ran all tests with Spark 3.1.3; however, the code that handles these dependencies appears to be the same in more recent versions of Spark as well.
h1. Root cause description
I believe the issue comes from [this logic|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L163-L186] in {{BasicDriverFeatureStep.getAdditionalPodSystemProperties()}}.
Specifically, this logic takes all URIs in {{spark.jars}}, [filters only on local URIs,|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L165] [uploads|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L173] those local files to {{spark.kubernetes.file.upload.path}} and then [*replaces*|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L182] the value of {{spark.jars}} with those newly uploaded JARs. By overwriting the previous value of {{spark.jars}}, we lose all mention of the remote JARs that were previously specified there.
Consequently, when the Spark driver starts afterwards, it only downloads the JARs from {{spark.kubernetes.file.upload.path}}.
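To make the effect concrete, here is a minimal, self-contained Scala sketch of that behavior (not the actual Spark code: {{isLocal}} stands in for {{KubernetesUtils.isLocalAndResolvable}} and the upload step is stubbed out):
{code:scala}
// Standalone illustration of the current overwrite behavior: the remote URI is dropped.
object JarsOverwriteDemo {
  // Stand-in for KubernetesUtils.isLocalAndResolvable: treat anything without
  // a remote scheme as a local file that needs to be uploaded.
  private def isLocal(uri: String): Boolean = !uri.contains("://")

  // Stand-in for the upload step: pretend the local file was uploaded under
  // spark.kubernetes.file.upload.path and return its new remote URI.
  private def upload(uri: String, uploadPath: String): String =
    s"$uploadPath/${uri.split('/').last}"

  def main(args: Array[String]): Unit = {
    val jars = Seq("/opt/my-local-jar.jar", "s3a://bucket/my-remote-jar.jar")
    val uploaded = jars.filter(isLocal).map(upload(_, "s3a://bucket/my-upload-path"))
    // The current logic replaces spark.jars with ONLY the uploaded URIs:
    println(uploaded.mkString(","))
    // prints: s3a://bucket/my-upload-path/my-local-jar.jar
    // => s3a://bucket/my-remote-jar.jar has disappeared
  }
}
{code}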
h1. Possible solution
I think a possible fix would be to not fully overwrite the value of {{spark.jars}}, but to keep the remote URIs in it.
The new logic would look something like this:
{code:scala}
Seq(JARS, FILES, ARCHIVES, SUBMIT_PYTHON_FILES).foreach { key =>
  val uris = conf.get(key).filter(uri => KubernetesUtils.isLocalAndResolvable(uri))
  // Save the remote URIs so they can be re-appended after the upload step
  val remoteUris = conf.get(key).filter(uri => !KubernetesUtils.isLocalAndResolvable(uri))
  val value = {
    if (key == ARCHIVES) {
      uris.map(UriBuilder.fromUri(_).fragment(null).build()).map(_.toString)
    } else {
      uris
    }
  }
  val resolved = KubernetesUtils.uploadAndTransformFileUris(value, Some(conf.sparkConf))
  if (resolved.nonEmpty) {
    val resolvedValue = if (key == ARCHIVES) {
      // Re-attach the original URI fragments to the uploaded archive URIs
      uris.zip(resolved).map { case (uri, r) =>
        UriBuilder.fromUri(r).fragment(new java.net.URI(uri).getFragment).build().toString
      }
    } else {
      resolved
    }
    // Re-append the remote URIs that were filtered out above
    additionalProps.put(key.key, (resolvedValue ++ remoteUris).mkString(","))
  }
}
{code}
I tested it out in my environment and it worked: {{s3a://$BUCKET_NAME/my-remote-jar.jar}} was kept in {{spark.jars}} and the Spark driver was able to download it.
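With the patched logic, the driver's {{spark.jars}} instead looked something like this (again with {{spark-upload-<uuid>}} standing in for the randomly-named upload directory):
{code}
spark.jars=s3a://$BUCKET_NAME/my-upload-path/spark-upload-<uuid>/my-local-jar.jar,s3a://$BUCKET_NAME/my-remote-jar.jar
{code}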
I don't know the codebase well enough, though, to assess whether I am missing something else or whether this is enough to fix the issue.
> Remote spark.jars URIs ignored for Spark on Kubernetes in cluster mode
> -----------------------------------------------------------------------
>
> Key: SPARK-40817
> URL: https://issues.apache.org/jira/browse/SPARK-40817
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Submit
> Affects Versions: 3.0.0, 3.1.3, 3.3.0, 3.2.2
> Environment: Spark 3.1.3
> Kubernetes 1.21
> Ubuntu 20.04.1
> Reporter: Anton Ippolitov
> Priority: Major
> Attachments: image-2022-10-17-10-44-46-862.png
>
>