Posted to issues@spark.apache.org by "Stavros Kontopoulos (JIRA)" <ji...@apache.org> on 2017/11/29 23:50:00 UTC

[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used

     [ https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stavros Kontopoulos updated SPARK-22657:
----------------------------------------
    Description: 
To reproduce this issue run:
```
./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out
```
within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
You get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n"
This can be reproduced with local[*] as well; a minimal sketch of the triggering access follows.
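Any s3n read after startup triggers the failing scheme lookup. This is an illustrative sketch, not the actual S3Job source:
```
import org.apache.spark.sql.SparkSession

// Minimal sketch (an illustration, not the actual S3Job source): any s3n
// read after startup triggers the FileSystem scheme lookup. Submitted via
// spark-submit with --packages, this throws:
//   java.io.IOException: No FileSystem for scheme: s3n
object S3nRepro {
  def main(args: Array[String]): Unit = {
    // master is supplied by spark-submit (e.g. --master local[*])
    val spark = SparkSession.builder().appName("s3n-repro").getOrCreate()
    val lines = spark.sparkContext.textFile("s3n://arand-sandbox-mesosphere/big.txt")
    println(lines.count())
    spark.stop()
  }
}
```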

The specific Spark job used is here: https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala

Using this code: https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df
you get this output: https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5

The commit that introduced this is:
5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark <jerryshao> Thu, 6 Jul 2017 15:32:49 +0800
See line 950 in https://github.com/apache/spark/pull/18235/files

The FileSystem class is already initialized before the Spark job's main method is launched, because the --packages logic uses the Hadoop libraries to download files.
Maven resolution happens before the app jar and the resolved jars are added to the classpath, so at the moment the FileSystem class's static members are first initialized and its static map (SERVICE_FILE_SYSTEMS) is filled, there is no s3n implementation available to add to it.
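A simplified sketch of that one-shot loading pattern (an illustration, not Hadoop's actual source; the real map lives in org.apache.hadoop.fs.FileSystem):
```
import java.util.ServiceLoader
import scala.collection.mutable

trait Fs { def scheme: String } // stand-in for org.apache.hadoop.fs.FileSystem

object FsRegistry {
  private val serviceFileSystems = mutable.Map.empty[String, Class[_ <: Fs]]
  private var loaded = false // mirrors Hadoop's "already loaded" flag

  private def loadFileSystems(): Unit = synchronized {
    if (!loaded) { // filled once and only once
      // One-time ServiceLoader scan of the classpath as it looks *now*;
      // jars added to the classpath afterwards are never rescanned.
      val it = ServiceLoader.load(classOf[Fs]).iterator()
      while (it.hasNext) {
        val fs = it.next()
        serviceFileSystems(fs.scheme) = fs.getClass
      }
      loaded = true
    }
  }

  def getFileSystemClass(scheme: String): Option[Class[_ <: Fs]] = {
    loadFileSystems() // a no-op on every call after the first
    serviceFileSystems.get(scheme) // s3n is missing if its jar arrived late
  }
}
```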

Later, in the Spark job's main method, when we try to access the s3n filesystem, we get the exception: at that point the app jar containing the s3n implementation is on the classpath, but its scheme was never loaded into the FileSystem class's static map.
hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect, since the problem is the static map, which is filled once and only once.
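By contrast, explicitly naming the implementation class does work, because Hadoop 2.x's FileSystem.getFileSystemClass consults fs.<scheme>.impl from the configuration before falling back to the static map. A sketch, assuming the s3n implementation jar is on the classpath by the time of the call:
```
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object ExplicitImpl {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // "fs.s3n.impl" is checked before the static SERVICE_FILE_SYSTEMS map,
    // so this works even though the map was filled before the jar arrived.
    conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    val fs = FileSystem.get(new URI("s3n://arand-sandbox-mesosphere/big.txt"), conf)
    println(fs.getClass.getName) // the configured class, not a map lookup
  }
}
```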
This one-shot filling is also why we see two prints of the map contents in the output above when --packages is used: the first print happens before creating the s3n filesystem. (We use reflection there to read the static map's entries; a sketch follows.) When --packages is not used, the map is empty at the first print, because the FileSystem class has not yet been loaded by the classloader.
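The reflection used to print the map's entries is along these lines (a sketch; SERVICE_FILE_SYSTEMS is the private static field name in Hadoop 2.x and may differ in other versions):
```
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

object DumpFsMap {
  def main(args: Array[String]): Unit = {
    // SERVICE_FILE_SYSTEMS is private and static, hence reflection.
    val field = classOf[FileSystem].getDeclaredField("SERVICE_FILE_SYSTEMS")
    field.setAccessible(true)
    val map = field.get(null).asInstanceOf[java.util.Map[String, Class[_]]]
    map.asScala.foreach { case (scheme, cls) =>
      println(s"$scheme -> ${cls.getName}")
    }
  }
}
```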


> Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used 
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22657
>                 URL: https://issues.apache.org/jira/browse/SPARK-22657
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Stavros Kontopoulos
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
