Posted to user@spark.apache.org by "Taylor, Ronald C" <Ro...@pnnl.gov> on 2016/02/29 06:36:49 UTC

a basic question on first use of PySpark shell and example, which is failing

Hello folks,

I am a newbie, running Spark on a small Cloudera CDH 5.5.1 cluster at our lab. I am trying to use the PySpark shell for the first time, and am attempting to duplicate the documentation example of creating an RDD, which I called "lines", from a text file.

I placed a text file called Warehouse.java in this HDFS location:

[rtaylor@bigdatann ~]$ hadoop fs -ls /user/rtaylor/Spark
-rw-r--r--   3 rtaylor supergroup    1155355 2016-02-28 18:09 /user/rtaylor/Spark/Warehouse.java
[rtaylor@bigdatann ~]$

I then invoked sc.textFile() in the PySpark shell. That did not work; see below. Apparently a class is not found? I don't know why that would be the case. Any guidance would be very much appreciated.

The Cloudera Manager for the cluster says that Spark is operating in the "green", for whatever that is worth.

 - Ron Taylor

>>> lines = sc.textFile("file:///user/taylor/Spark/Warehouse.java")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py", line 451, in textFile
    return RDD(self._jsc.textFile(name, minPartitions), self,
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
    return f(*a, **kw)
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
    at org.apache.spark.SparkContext.textFile(SparkContext.scala:825)
    at org.apache.spark.api.java.JavaSparkContext.textFile(JavaSparkContext.scala:191)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

>>>
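An editorial aside on the path in that call: the stack trace shows the failure happens during class initialization, before the path is ever resolved, but the URI would matter afterwards. A file:/// URI addresses the local Linux filesystem, not HDFS, and the listing above puts the file under /user/rtaylor, not /user/taylor. A short standard-library sketch of how the two URI forms split (illustrative only, not Spark code):

```python
from urllib.parse import urlparse

# The URI used in the failing call: "file" scheme -> local Linux filesystem.
local = urlparse("file:///user/taylor/Spark/Warehouse.java")
print(local.scheme, local.path)  # file /user/taylor/Spark/Warehouse.java

# The file actually lives in HDFS under /user/rtaylor/Spark, so an
# hdfs:// URI (or a bare path resolved against the default FS) is needed.
hdfs = urlparse("hdfs:///user/rtaylor/Spark/Warehouse.java")
print(hdfs.scheme, hdfs.path)  # hdfs /user/rtaylor/Spark/Warehouse.java
```

Once the class-initialization problem is solved, a call like sc.textFile("hdfs:///user/rtaylor/Spark/Warehouse.java"), or just the bare HDFS path, would be the form to try.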

RE: a basic question on first use of PySpark shell and example, which is failing

Posted by "Taylor, Ronald C" <Ro...@pnnl.gov>.
I guess I should also point out that I do an

export CLASSPATH

in my .bash_profile file, so the CLASSPATH info should be usable by the PySpark shell that I invoke.
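A caveat on this point (an assumption about how the launch scripts behave, not something verified in this thread): the spark-class/pyspark scripts construct the driver JVM's classpath themselves and pass it with an explicit -cp, which overrides any inherited CLASSPATH environment variable, so an export in .bash_profile may never reach the shell's JVM. In Spark 1.x the supported routes are the --driver-class-path flag or the spark.driver.extraClassPath property, sketched here with an illustrative path taken from the CLASSPATH shown later in this thread:

```
# spark-defaults.conf (illustrative fragment)
spark.driver.extraClassPath  /people/rtaylor/SparkWork/DataAlgUtils

# or, equivalently, at launch time:
#   pyspark --driver-class-path /people/rtaylor/SparkWork/DataAlgUtils
```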

Ron

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568,  email: ronald.taylor@pnnl.gov
web page:  http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048



RE: a basic question on first use of PySpark shell and example, which is failing

Posted by "Taylor, Ronald C" <Ro...@pnnl.gov>.
Hi Yin,

My Classpath is set to:

CLASSPATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:/people/rtaylor/SparkWork/DataAlgUtils:.

And there is indeed a spark-core jar in that jars subdirectory, though it is not named precisely "spark-core.jar"; it has a version number in its name, as you can see:

[rtaylor@bigdatann ~]$ find /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars -name "spark-core*.jar"

/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-core_2.10-1.5.0-cdh5.5.1.jar

I extracted the class names into a text file:

[rtaylor@bigdatann jars]$ jar tf spark-core_2.10-1.5.0-cdh5.5.1.jar > /people/rtaylor/SparkWork/jar_file_listing_of_spark-core_jar.txt

And then searched for RDDOperationScope. I found these classes:

[rtaylor@bigdatann SparkWork]$ grep RDDOperationScope jar_file_listing_of_spark-core_jar.txt

org/apache/spark/rdd/RDDOperationScope$$anonfun$5.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$3.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$4$$anonfun$apply$1.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$4.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$1.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$getAllScopes$2.class
org/apache/spark/rdd/RDDOperationScope$.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$getAllScopes$1.class
org/apache/spark/rdd/RDDOperationScope.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$2.class
[rtaylor@bigdatann SparkWork]$


It looks like the RDDOperationScope class is present. Shouldn’t that work?
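(Editorial note) The jar tf check above can also be scripted: a jar is an ordinary zip archive, so Python's standard zipfile module can list its entries. A small sketch, with the jar path from above as an example argument:

```python
import zipfile

def classes_matching(jar, needle):
    """List .class entries in a jar (a zip archive) whose path contains `needle`."""
    with zipfile.ZipFile(jar) as archive:
        return [name for name in archive.namelist()
                if name.endswith(".class") and needle in name]

# Example call against the jar found above:
#   classes_matching(
#       "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/"
#       "spark-core_2.10-1.5.0-cdh5.5.1.jar",
#       "RDDOperationScope")
```

That the class is present supports the observation above; note, though, that the error says the class could not be *initialized*, not that it was missing, so its presence in the jar does not by itself rule the error out.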

Ron



Re: a basic question on first use of PySpark shell and example, which is failing

Posted by Yin Yang <yy...@gmail.com>.
RDDOperationScope is in the spark-core_2.1x jar file.

  7148 Mon Feb 29 09:21:32 PST 2016 org/apache/spark/rdd/RDDOperationScope.class

Can you check whether the spark-core jar is in classpath ?

FYI
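(Editorial sketch) One way to answer this from inside the running shell is to ask the driver JVM for the classpath it actually ended up with, via the Py4J gateway that PySpark exposes as sc._jvm, and scan it. The helper below is plain Python; the sc._jvm line in the comment is the assumed PySpark usage:

```python
import os

def find_jars(classpath, fragment):
    """Return classpath entries whose file name contains `fragment`."""
    return [entry for entry in classpath.split(os.pathsep)
            if fragment in os.path.basename(entry)]

# Assumed usage inside the PySpark shell:
#   cp = sc._jvm.java.lang.System.getProperty("java.class.path")
#   print(find_jars(cp, "spark-core"))
```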


RE: a basic question on first use of PySpark shell and example, which is failing

Posted by "Taylor, Ronald C" <Ro...@pnnl.gov>.
Hi Jules, folks,

I have tried "hdfs://<HDFS filepath>" as well as "file://<local Linux filepath>", and several variants. Every time, I get the same message: NoClassDefFoundError. See below. Why do I get such a message if the problem is simply that Spark cannot find the text file? Doesn't the error message indicate some other source of the problem?

I may be missing something in the error report; I am a Java person, not a Python programmer. But doesn't it look like a call to a Java class (something associated with "o9.textFile") is failing? If so, how do I fix this?

  Ron


"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py", line 451, in textFile
    return RDD(self._jsc.textFile(name, minPartitions), self,
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
    return f(*a, **kw)
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
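(Editorial note on the error's semantics, which bears on the question above) In the JVM, "NoClassDefFoundError: Could not initialize class X" does not mean the class was missing: the class was found, its static initializer threw on first use (reported once as an ExceptionInInitializerError, often earlier in the logs), and every subsequent reference fails with this terser message. So the instinct here is sound: the cause is not a missing text file but something going wrong while spark-core classes initialize, commonly a mismatched or duplicated jar on the classpath. Python has a rough analogue when a module's top-level code raises at import time; a sketch:

```python
import importlib.util, os, tempfile

# A module whose top-level ("static") initialization fails,
# loosely analogous to a Java static initializer throwing.
src = "CONFIG = {}\nraise RuntimeError('init failed')\n"
path = os.path.join(tempfile.mkdtemp(), "flaky.py")
with open(path, "w") as fh:
    fh.write(src)

spec = importlib.util.spec_from_file_location("flaky", path)
module = importlib.util.module_from_spec(spec)
try:
    spec.loader.exec_module(module)  # runs the module's top-level code
    failure = None
except RuntimeError as err:
    failure = str(err)  # the original cause, visible only at this first attempt

print("initialization failed with:", failure)
```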


Re: a basic question on first use of PySpark shell and example, which is failing

Posted by Jules Damji <dm...@comcast.net>.
Hello Ronald,

Since you have placed the file under HDFS, you might want to change the path name to:

lines = sc.textFile("hdfs://user/taylor/Spark/Warehouse.java")

Sent from my iPhone
Pardon the dumb thumb typos :)
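(Editorial note on the suggested URI, a point about URI syntax rather than this cluster's configuration) In hdfs://user/taylor/..., the component right after hdfs:// is parsed as the authority, i.e. the NameNode host, so "user" would be read as a hostname and the path would start at /taylor. Keeping /user/... in the path requires an empty authority (hdfs:///user/...) or an explicit one (e.g. the hypothetical hdfs://namenode:8020/user/...). Standard URI parsing shows the difference:

```python
from urllib.parse import urlparse

bad = urlparse("hdfs://user/taylor/Spark/Warehouse.java")
print(bad.netloc, bad.path)   # user /taylor/Spark/Warehouse.java  (host "user"!)

good = urlparse("hdfs:///user/rtaylor/Spark/Warehouse.java")
print(good.netloc, good.path)  # prints an empty authority, then /user/rtaylor/Spark/Warehouse.java
```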
