Posted to user@spark.apache.org by David Vincelli <da...@vantageanalytics.com> on 2014/09/30 20:37:13 UTC

pyspark cassandra examples

I've been trying to get the cassandra_inputformat.py and
cassandra_outputformat.py examples running for the past half day. I am
running Cassandra 2.1 Community from DataStax on a single node (in my dev
environment) with spark-1.1.0-bin-hadoop2.4.

I can connect to and use Cassandra via cqlsh, and I can run the pyspark
pi computation job.

Unfortunately, I cannot run the cassandra_inputformat and
cassandra_outputformat examples successfully.

This is the output I am getting now:

14/09/30 18:15:41 INFO AkkaUtils: Connecting to HeartbeatReceiver:
akka.tcp://sparkDriver@dev:40208/user/HeartbeatReceiver
14/09/30 18:15:42 INFO deprecation: mapreduce.outputformat.class is
deprecated. Instead, use mapreduce.job.outputformat.class
14/09/30 18:15:43 INFO Converter: Loaded converter:
org.apache.spark.examples.pythonconverters.ToCassandraCQLKeyConverter
14/09/30 18:15:43 INFO Converter: Loaded converter:
org.apache.spark.examples.pythonconverters.ToCassandraCQLValueConverter
Traceback (most recent call last):
  File "/opt/spark-1.1.0-bin-hadoop2.4/examples/src/main/python/cassandra_outputformat.py", line 83, in <module>
    valueConverter="org.apache.spark.examples.pythonconverters.ToCassandraCQLValueConverter")
  File "/opt/spark-1.1.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1184, in saveAsNewAPIHadoopDataset
    keyConverter, valueConverter, True)
  File "/opt/spark-1.1.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/opt/spark-1.1.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset.
: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.checkOutputSpecs(AbstractColumnFamilyOutputFormat.java:75)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:900)
    at org.apache.spark.api.python.PythonRDD$.saveAsHadoopDataset(PythonRDD.scala:687)
    at org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

Should I have built a custom Spark assembly? Am I missing a Cassandra
driver? I have browsed through the documentation and found nothing
specifically relevant to Cassandra. Is there documentation covering this?

Thank you,

- David

Re: pyspark cassandra examples

Posted by David Vincelli <da...@vantageanalytics.com>.
Thanks, that worked! I downloaded the version pre-built against Hadoop 1
and the examples ran successfully.
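
For reference, a rough sketch of how the example can be invoked once the
Hadoop 1 build is in place. The examples jar path and the argument names are
assumptions here; the exact argument list is printed by the script's own
usage message, so check cassandra_outputformat.py itself:

```shell
# Sketch: run the Cassandra output-format example against a local node.
# The converters live in the examples assembly jar, so it must be on the
# driver classpath. <keyspace>, <cf>, etc. are placeholders.
cd /opt/spark-1.1.0-bin-hadoop1
./bin/spark-submit \
  --driver-class-path "$(ls lib/spark-examples-*.jar)" \
  examples/src/main/python/cassandra_outputformat.py \
  localhost <keyspace> <cf> <user_id> <fname> <lname>
```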

- David

On Tue, Sep 30, 2014 at 5:08 PM, Kan Zhang <kz...@apache.org> wrote:

> >  java.lang.IncompatibleClassChangeError: Found interface
> org.apache.hadoop.mapreduce.JobContext, but class was expected
>
> Most likely this is a Hadoop 1 vs. Hadoop 2 issue. The example was written
> for Hadoop 1 (the default Hadoop version for Spark). You can either set the
> output format class in the conf for Hadoop 2 or rebuild your Spark against
> Hadoop 1.
>
> On Tue, Sep 30, 2014 at 11:37 AM, David Vincelli <
> david.vincelli@vantageanalytics.com> wrote:

Re: pyspark cassandra examples

Posted by Kan Zhang <kz...@apache.org>.
>  java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.JobContext, but class was expected

Most likely this is a Hadoop 1 vs. Hadoop 2 issue. The example was written for
Hadoop 1 (the default Hadoop version for Spark). You can either set the output
format class in the conf for Hadoop 2 or rebuild your Spark against Hadoop 1.
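
A sketch of the two fixes. The download URL pattern and the Hadoop version
number are assumptions; verify them against the Spark downloads page and the
"Building Spark" docs for your release:

```shell
# Option 1: use the distribution pre-built against Hadoop 1 instead of
# spark-1.1.0-bin-hadoop2.4 (URL pattern from the Apache archive).
wget https://archive.apache.org/dist/spark/spark-1.1.0/spark-1.1.0-bin-hadoop1.tgz
tar xzf spark-1.1.0-bin-hadoop1.tgz

# Option 2: rebuild Spark from source against Hadoop 1
# (1.0.4 is the default hadoop.version in the Spark 1.x Maven build).
mvn -Dhadoop.version=1.0.4 -DskipTests clean package
```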

On Tue, Sep 30, 2014 at 11:37 AM, David Vincelli <
david.vincelli@vantageanalytics.com> wrote:
