Posted to user@spark.apache.org by "Williams, Ken" <Ke...@windlogics.com> on 2014/04/21 21:03:53 UTC

Problem connecting to HDFS in Spark shell

I'm trying to get my feet wet with Spark.  I've done some simple stuff in the shell in standalone mode, and now I'm trying to connect to HDFS resources, but I'm running into a problem.

I synced to the git master branch (c399baa - "SPARK-1456 Remove view bounds on Ordered in favor of a context bound on Ordering. (3 days ago) <Michael Armbrust>") and built like so:

    SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly

This created various jars in various places, including these (I think):

   ./examples/target/scala-2.10/spark-examples-assembly-1.0.0-SNAPSHOT.jar
   ./tools/target/scala-2.10/spark-tools-assembly-1.0.0-SNAPSHOT.jar
   ./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.2.0.jar

In `conf/spark-env.sh`, I added this (actually before I did the assembly):

    export HADOOP_CONF_DIR=/etc/hadoop/conf

Now I fire up the shell (bin/spark-shell) and try to grab data from HDFS, and get the following exception:

scala> var hdf = sc.hadoopFile("hdfs:///user/kwilliams/dat/part-m-00000")
hdf: org.apache.spark.rdd.RDD[(Nothing, Nothing)] = HadoopRDD[0] at hadoopFile at <console>:12

scala> hdf.count()
java.lang.RuntimeException: java.lang.InstantiationException
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
        at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:155)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:168)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:209)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:207)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1064)
        at org.apache.spark.rdd.RDD.count(RDD.scala:806)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
        at $iwC$$iwC$$iwC.<init>(<console>:20)
        at $iwC$$iwC.<init>(<console>:22)
        at $iwC.<init>(<console>:24)
        at <init>(<console>:26)
        at .<init>(<console>:30)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:777)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1045)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
        at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
        at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
        at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
        at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
Caused by: java.lang.InstantiationException
        at sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:48)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)
        ... 41 more


Is this recognizable to anyone as a build problem, or a config problem, or anything?  Failing that, any way to get more information about where in the process it's failing?

Thanks.

--
Ken Williams, Senior Research Scientist
WindLogics
http://windlogics.com




RE: Problem connecting to HDFS in Spark shell

Posted by "Williams, Ken" <Ke...@windlogics.com>.
> -----Original Message-----
> From: Marcelo Vanzin [mailto:vanzin@cloudera.com]
> Hi Ken,
>
> On Mon, Apr 21, 2014 at 1:39 PM, Williams, Ken
> <Ke...@windlogics.com> wrote:
> > I haven't figured out how to let the hostname default to the host
> mentioned in our /etc/hadoop/conf/hdfs-site.xml like the Hadoop
> command-line tools do, but that's not so important.
>
> Try adding "/etc/hadoop/conf" to SPARK_CLASSPATH.

It looks like I already had my config set up properly, but I didn't understand the URL syntax - the following works:

  sc.textFile("hdfs:///user/kwilliams/dat/part-m-00000")

In other words, just omit the hostname between the second and third slashes of the URL.
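
For comparison (untested; "namenode" below is just a placeholder for the actual NameNode host):

  sc.textFile("hdfs:///user/kwilliams/dat/part-m-00000")          // host taken from the Hadoop config
  sc.textFile("hdfs://namenode/user/kwilliams/dat/part-m-00000")  // host given explicitly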

 -Ken



Re: Problem connecting to HDFS in Spark shell

Posted by Marcelo Vanzin <va...@cloudera.com>.
Hi Ken,

On Mon, Apr 21, 2014 at 1:39 PM, Williams, Ken
<Ke...@windlogics.com> wrote:
> I haven't figured out how to let the hostname default to the host mentioned in our /etc/hadoop/conf/hdfs-site.xml like the Hadoop command-line tools do, but that's not so important.

Try adding "/etc/hadoop/conf" to SPARK_CLASSPATH.
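
For example (untested), in conf/spark-env.sh:

    export SPARK_CLASSPATH=/etc/hadoop/conf:$SPARK_CLASSPATH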

-- 
Marcelo

RE: Problem connecting to HDFS in Spark shell

Posted by "Williams, Ken" <Ke...@windlogics.com>.
I figured it out - I should be using textFile(...), not hadoopFile(...).  And my HDFS URL should include the host:

  hdfs://host/user/kwilliams/corTable2/part-m-00000
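
For the record, I believe hadoopFile(...) can also work if it's given explicit type parameters - called with none, it infers RDD[(Nothing, Nothing)] and then can't instantiate an InputFormat, which matches the stack trace above. An untested sketch using the old mapred API:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    // Spell out the key, value, and InputFormat types explicitly
    val lines = sc.hadoopFile[LongWritable, Text, TextInputFormat](
      "hdfs://host/user/kwilliams/corTable2/part-m-00000"
    ).map(_._2.toString)  // keep just the line text, dropping the byte-offset keys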

I haven't figured out how to let the hostname default to the host mentioned in our /etc/hadoop/conf/hdfs-site.xml like the Hadoop command-line tools do, but that's not so important.

 -Ken

