Posted to dev@spark.apache.org by Shivaram Venkataraman <sh...@eecs.berkeley.edu> on 2014/02/18 03:15:52 UTC

Accessing Hadoop2 HDFS from Spark app

I ran into a weird bug today where trying to read a file from an HDFS
cluster built with Hadoop 2 gives an error saying "No FileSystem for
scheme: hdfs". Specifically, this only seems to happen when the
application is built as an assembly jar and not when using sbt's run-main.
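
For illustration, a minimal sketch of the kind of read that hits the error.
The master URL, HDFS path, and object name are made up for the example; this
is not the actual SparkHdfsApp.scala:

    import org.apache.spark.{SparkConf, SparkContext}

    object HdfsReadSketch {
      def main(args: Array[String]) {
        // Hypothetical master URL and application name.
        val conf = new SparkConf()
          .setMaster("spark://ec2-master:7077")
          .setAppName("HdfsReadSketch")
        val sc = new SparkContext(conf)
        // Reading an hdfs:// URI is what fails with "No FileSystem for scheme:
        // hdfs" when the assembly loses the Hadoop service registration files.
        val lines = sc.textFile("hdfs://ec2-master:9000/user/root/test.txt")
        println("Line count: " + lines.count())
        sc.stop()
      }
    }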

The project's setup[0] is pretty simple and is only a slight
modification of the project used by the release audit tool. The sbt
assembly instructions[1] are mostly copied from Spark's sbt build
files.

We run into this in SparkR as well, so it would be great if anybody has
an idea on how to debug this.
To reproduce, you can do the following:

1. Launch a Spark 0.9.0 EC2 cluster with --hadoop-major-version=2
2. Clone https://github.com/shivaram/spark-utils
3. Run release-audits/sbt_app_core/run-hdfs-test.sh

Thanks
Shivaram

[0] https://github.com/shivaram/spark-utils/blob/master/release-audits/sbt_app_core/src/main/scala/SparkHdfsApp.scala
[1] https://github.com/shivaram/spark-utils/blob/master/release-audits/sbt_app_core/build.sbt

Re: Accessing Hadoop2 HDFS from Spark app

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
Thanks for the pointer -- I guess I should have checked Spark's build
script again while debugging. This might be useful to include in a
documentation page about how to write and run Spark apps. I think
there's a bunch of know-how like this just floating around right now.

Shivaram

Re: Accessing Hadoop2 HDFS from Spark app

Posted by Patrick Wendell <pw...@gmail.com>.
BTW my fix in Spark was later generalized to be equivalent to what you
did, which is to do this for the entire services directory rather than
just FileSystem.
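
To make the difference concrete, here is a rough sketch of the two variants
as sbt-assembly merge rules. This is illustrative only (it assumes the
0.10.x-era mergeStrategy key) and is not the actual Spark build change:

    // Narrow fix: concatenate only the Hadoop FileSystem service registry.
    mergeStrategy in assembly := {
      case "META-INF/services/org.apache.hadoop.fs.FileSystem" => MergeStrategy.concat
      case _ => MergeStrategy.first
    }

    // Generalized fix: concatenate every ServiceLoader registry under
    // META-INF/services so no dependency's service file is silently dropped.
    mergeStrategy in assembly := {
      case m if m.startsWith("META-INF/services/") => MergeStrategy.concat
      case _ => MergeStrategy.first
    }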


Re: Accessing Hadoop2 HDFS from Spark app

Posted by Patrick Wendell <pw...@gmail.com>.
Ya, I ran into this a few months ago. We actually patched the Spark
build back then. It took me a long time to figure it out.

https://github.com/apache/incubator-spark/commit/0c1985b153a2dc2c891ae61c1ee67506926384ae

Re: Accessing Hadoop2 HDFS from Spark app

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
Thanks a lot, Jey! That fixes things. For reference, I had to add the
following line to build.sbt:

    case m if m.toLowerCase.matches("meta-inf/services.*$") => MergeStrategy.concat
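
For context, here is roughly how that case sits inside a full build.sbt,
assuming an sbt-assembly version of that era (0.10.x-style keys); the imports
and the other cases below are illustrative defaults, not copied from the
linked build.sbt:

    import sbtassembly.Plugin._
    import AssemblyKeys._

    assemblySettings

    mergeStrategy in assembly := {
      // Concatenate ServiceLoader registries so the hdfs entry survives assembly.
      case m if m.toLowerCase.matches("meta-inf/services.*$") => MergeStrategy.concat
      // Drop jar manifests and keep the first copy of everything else.
      case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
      case _ => MergeStrategy.first
    }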

Should we also add this to Spark's assembly build?

Thanks
Shivaram

Re: Accessing Hadoop2 HDFS from Spark app

Posted by Jey Kottalam <je...@cs.berkeley.edu>.
We ran into this issue with ADAM, and it came down to the
"META-INF/services" files not being merged correctly. Here's the change
we made to our Maven build files to fix it -- you can probably do
something similar under SBT too:
https://github.com/bigdatagenomics/adam/commit/b0997760b23c4284efe32eeb968ef2744af8be82
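
Some background on why the merge matters: Hadoop 2 discovers FileSystem
implementations through java.util.ServiceLoader, which reads
META-INF/services/org.apache.hadoop.fs.FileSystem from the jars on the
classpath. If the assembly keeps only one dependency's copy of that file
instead of concatenating them, the hdfs entry contributed by hadoop-hdfs is
lost and the scheme can no longer be resolved. A hypothetical diagnostic (not
part of ADAM or Spark) that lists what an assembly actually registers:

    import java.util.ServiceLoader
    import org.apache.hadoop.fs.FileSystem
    import scala.collection.JavaConverters._

    // Prints the FileSystem implementations found via the merged
    // META-INF/services files on the classpath. If DistributedFileSystem is
    // missing, the service files were not concatenated during assembly.
    object ListFileSystems {
      def main(args: Array[String]) {
        val loader = ServiceLoader.load(classOf[FileSystem])
        loader.asScala.foreach(fs => println(fs.getClass.getName))
      }
    }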

-Jey

