Posted to user@spark.apache.org by David Kennedy <da...@gmail.com> on 2016/02/15 10:35:07 UTC

How to add Kafka streaming jars when initialising the SparkContext in Python

I have no problems when submitting the task using spark-submit.  Passing
the list of required jars with the --jars option succeeds, and I can see
the jars being added in the output:

16/02/10 11:14:24 INFO spark.SparkContext: Added JAR
file:/usr/lib/spark/extras/lib/spark-streaming-kafka.jar at
http://192.168.10.4:33820/jars/spark-streaming-kafka.jar with timestamp
1455102864058
16/02/10 11:14:24 INFO spark.SparkContext: Added JAR
file:/opt/kafka/libs/scala-library-2.10.1.jar at
http://192.168.10.4:33820/jars/scala-library-2.10.1.jar with timestamp
1455102864077
16/02/10 11:14:24 INFO spark.SparkContext: Added JAR
file:/opt/kafka/libs/kafka_2.10-0.8.1.1.jar at
http://192.168.10.4:33820/jars/kafka_2.10-0.8.1.1.jar with timestamp
1455102864085
16/02/10 11:14:24 INFO spark.SparkContext: Added JAR
file:/opt/kafka/libs/metrics-core-2.2.0.jar at
http://192.168.10.4:33820/jars/metrics-core-2.2.0.jar with timestamp
1455102864086
16/02/10 11:14:24 INFO spark.SparkContext: Added JAR
file:/usr/share/java/mysql.jar at http://192.168.10.4:33820/jars/mysql.jar
with timestamp 1455102864090
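
For reference, the submit invocation looks roughly like this (the script
name here is illustrative, not my real one):

spark-submit \
  --jars /usr/lib/spark/extras/lib/spark-streaming-kafka.jar,\
/opt/kafka/libs/scala-library-2.10.1.jar,\
/opt/kafka/libs/kafka_2.10-0.8.1.1.jar,\
/opt/kafka/libs/metrics-core-2.2.0.jar,\
/usr/share/java/mysql.jar \
  my_streaming_job.py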

But when I try to create a context programmatically in Python (I want to
set up some tests) I don't see this, and I end up with class-not-found
errors.

Trying to cover all bases (even though I suspect some of this is redundant
when running locally), I've tried:

from pyspark import SparkConf, SparkContext

jars = ','.join([
    '/usr/lib/spark/extras/lib/spark-streaming-kafka.jar',
    '/opt/kafka/libs/scala-library-2.10.1.jar',
    '/opt/kafka/libs/kafka_2.10-0.8.1.1.jar',
    '/opt/kafka/libs/metrics-core-2.2.0.jar',
    '/usr/share/java/mysql.jar',
])

spark_conf = SparkConf()
spark_conf.setMaster('local[4]')
spark_conf.set('spark.executor.extraLibraryPath', jars)
spark_conf.set('spark.executor.extraClassPath', jars)
spark_conf.set('spark.driver.extraClassPath', jars)
spark_conf.set('spark.driver.extraLibraryPath', jars)
self.spark_context = SparkContext(conf=spark_conf)

But I still get the same failure to find the same class:

Py4JJavaError: An error occurred while calling o30.loadClass.
: java.lang.ClassNotFoundException:
org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper

The class is certainly inside spark-streaming-kafka.jar, and the jar is
present on the filesystem at that location.
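
In fact, I can reproduce the failure directly through Py4J (a minimal
sketch; as far as I can tell this mirrors the driver-side lookup that
KafkaUtils performs):

# Ask the JVM's context class loader for the helper class directly.
loader = (self.spark_context._jvm.java.lang.Thread
          .currentThread().getContextClassLoader())
# Raises the same Py4JJavaError / ClassNotFoundException as above.
loader.loadClass('org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper')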

I'm under the impression that if I were using Java I'd be able to use the
setJars method on the conf to take care of this, but there doesn't appear
to be anything corresponding for Python.

Hacking about, I found that adding:

spark_conf.set('spark.jars', jars)

got the logging to confirm the jars being added to the HTTP server (just as
in the spark-submit output above), but whether I leave the other config
options in place or remove them, the class is still not found.

Is this not possible in Python?

Incidentally, I have tried SPARK_CLASSPATH (which only gets me the message
that it's deprecated and ignored anyway), and I cannot find anything else
to try.

Can anybody help?

David K.

Re: How to add Kafka streaming jars when initialising the SparkContext in Python

Posted by Jorge Machado <jo...@me.com>.
Hi David, 

Just package everything into one jar with Maven and deploy that; you don't
need to do it like this. Also, check whether your cluster already has these
libraries loaded: if you are using CDH, for example, you can simply import
the classes because they are already on your JVM's classpath.
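
If the Kafka classes are already visible to the JVM, the Python side should
then work directly. A minimal sketch (Spark 1.x API; the Zookeeper quorum,
consumer group, and topic name below are placeholders, not from your setup):

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(conf=SparkConf().setMaster('local[4]'))
# 5-second batches; adjust to taste.
ssc = StreamingContext(sc, 5)

# createStream(ssc, zkQuorum, groupId, {topic: numPartitions})
stream = KafkaUtils.createStream(
    ssc, 'localhost:2181', 'test-group', {'my-topic': 1})
stream.pprint()

ssc.start()
ssc.awaitTermination()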