Posted to user@spark.apache.org by Jhon Anderson Cardenas Diaz <jh...@gmail.com> on 2019/03/18 20:17:13 UTC

Spark - Hadoop custom filesystem service loading

Hi everyone,

On Spark 2.2.0, if you wanted to create a custom file system
implementation, you just created a subclass of
org.apache.hadoop.fs.FileSystem and put the canonical name of the custom
class in the file
src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem.
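
For concreteness, a minimal sketch of what such an implementation could
look like (package, class and scheme names here are placeholders; it just
delegates to the local filesystem to keep the example short):

    // com/example/fs/CustomFileSystem.scala -- placeholder names; delegates
    // to the local filesystem only to illustrate the registration mechanism.
    package com.example.fs

    import java.net.URI
    import org.apache.hadoop.fs.RawLocalFileSystem

    class CustomFileSystem extends RawLocalFileSystem {
      // Hadoop's ServiceLoader scan keys the registration off this scheme.
      override def getScheme: String = "customfs"
      override def getUri: URI = URI.create("customfs:///")
    }

and the service file
src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem
then contains the single line com.example.fs.CustomFileSystem.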

Once you added that jar as a dependency of your spark-submit application,
the custom scheme was loaded automatically, and you could start using it
with something like ds.load("customfs://path").
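
For example, something along these lines in spark-shell or application
code worked out of the box (format and path here are made up):

    // "spark" is the usual SparkSession; format and path are illustrative.
    val df = spark.read
      .format("parquet")
      .load("customfs://some-bucket/some/path")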

But on Spark 2.4.0 this no longer seems to work. If you do exactly the
same thing, you will get an error like "No FileSystem for customfs".

The only way I got this working on 2.4.0 was by specifying the Spark
property spark.hadoop.fs.customfs.impl.
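
In other words, something like this (the implementation class name is a
placeholder for whatever the fully-qualified custom class is):

    // Workaround on 2.4.0: register the scheme explicitly through the Hadoop
    // configuration instead of relying on ServiceLoader discovery.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.hadoop.fs.customfs.impl", "com.example.fs.CustomFileSystem")
      .getOrCreate()

    // or, equivalently, on the command line:
    //   spark-submit --conf spark.hadoop.fs.customfs.impl=com.example.fs.CustomFileSystem ...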

Do you consider this a bug, or is it an intentional change that should be
documented somewhere?

Btw, digging a little into this, the cause seems to be that the
FileSystem is now initialized before the actual dependencies are
downloaded from the Maven repo (see here
<https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L66>).
And since that initialization loads the available filesystems at that
point, and only once, the filesystems in the downloaded jars are not
taken into account.
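
A quick way to check whether a scheme ever got registered (e.g. from
spark-shell); this throws a "No FileSystem" error for the scheme when
neither the ServiceLoader scan nor an fs.customfs.impl configuration
entry knows about it:

    import org.apache.hadoop.fs.FileSystem

    // Resolves the implementation class for the scheme from the loaded
    // service filesystems and the Hadoop configuration; throws if missing.
    val cls = FileSystem.getFileSystemClass(
      "customfs", spark.sparkContext.hadoopConfiguration)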

Thanks.

Re: Spark - Hadoop custom filesystem service loading

Posted by Felix Cheung <fe...@hotmail.com>.
Hmm thanks. Do you have a proposed solution?

