Posted to users@zeppelin.apache.org by Jeff Steinmetz <je...@gmail.com> on 2015/10/29 23:10:24 UTC

pyspark with jar

In Zeppelin, what is the equivalent of adding jars to a pyspark invocation?

Such as running pyspark with the elasticsearch-hadoop jar:

./bin/pyspark --master local[2] --jars jars/elasticsearch-hadoop-2.1.0.Beta2.jar

My assumption is that loading something like this inside a %dep paragraph is pointless, since those dependencies would only live in the %spark Scala world (the Spark JVM). In Zeppelin, pyspark spawns a separate process.

Also, how is the interpreter's "spark.home" property used? How is it different from the SPARK_HOME set in zeppelin-env.sh?
And finally, how are args used in the interpreter? (What uses them?)

Thank you.
Jeff


Re: pyspark with jar

Posted by Jeff Steinmetz <je...@gmail.com>.
Thank you for the quick response, and good to hear I can use %dep with %pyspark.

However, this doesn't work, even though the same dependencies work fine with %spark and a similar Scala test (shown at the end).


%dep
z.load("org.elasticsearch:elasticsearch-hadoop:2.2.0-beta1")
z.load("org.elasticsearch::elasticsearch-spark:2.2.0-beta1”)
res0: org.apache.zeppelin.spark.dep.Dependency = org.apache.zeppelin.spark.dep.Dependency@c448844

%pyspark
df = sqlContext.read.format("org.elasticsearch.spark.sql").load("index/type")
df.printSchema()

— returns
Py4JJavaError: An error occurred while calling o41.load. : java.lang.RuntimeException: Failed to load class for data source: org.elasticsearch.spark.sql

Same results with:
%pyspark
query = '{"query… somequery"}'  # this line is pseudo code
conf = {"es.nodes": "192.168.1.1", "es.resource": "index/type", "es.query": query}

rdd = sc.newAPIHadoopRDD(
    "org.elasticsearch.hadoop.mr.EsInputFormat",
    "org.apache.hadoop.io.NullWritable",
    "org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=conf)

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD. : java.lang.ClassNotFoundException: org.elasticsearch.hadoop.mr.LinkedMapWritable


——
This works fine (index and IP address replaced in sample)

%spark
sqlContext.sql( "CREATE TEMPORARY TABLE sessions    " +  "USING org.elasticsearch.spark.sql " +  "OPTIONS ( resource ‘index/type', nodes ‘192.168.1.1')” )





Re: pyspark with jar

Posted by Jeff Steinmetz <je...@gmail.com>.
I also saw an example you posted regarding %dep and python
This example 

%dep
z.load("org.apache.spark:spark-streaming-kafka_2.10:1.5.1”)

works even if you remove the %dep.

    from pyspark.streaming.kafka import KafkaUtils

This import will always resolve, likely because it is part of the Spark assembly already.

Give it a try: reset the interpreter and just run the following (with no z.load(…)):


%pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
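
The difference should only show up when a stream is actually created, since that is the point where the JVM-side classes are needed. A rough sketch (the ZooKeeper address, consumer group, and topic below are just placeholders):

%pyspark
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# The import succeeds because the Python wrapper ships with pyspark itself;
# a missing spark-streaming-kafka jar should only surface at createStream time.
ssc = StreamingContext(sc, 5)
stream = KafkaUtils.createStream(ssc, "localhost:2181", "zeppelin-test-group", {"some-topic": 1})
stream.pprint()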

So, I am still looking for a real-world example of an external dependency loaded via %dep that demonstrates best practice around %pyspark dependency loading.
I'll stay tuned and continue to dig around a bit.
The next step is to start over and try a no-frills basic install with z-manager.

Jeff




Re: pyspark with jar

Posted by Jeff Steinmetz <je...@gmail.com>.
Update: I have it working now.
Loading elasticsearch-hadoop via %dep and then using it from %pyspark works.

Tried it with Spark 1.3.1 via z-manager using a vanilla install.

Thanks again for the pointers. I was originally trying to use Zeppelin 0.5.0 with Spark 1.4.

The version that I have working via z-manager looks like a Zeppelin 0.6.0 snapshot build, with Spark 1.3.1 and Hadoop 2.4.0.
With:

%dep
z.load("org.elasticsearch:elasticsearch-hadoop:2.2.0-beta1")
z.load("org.elasticsearch::elasticsearch-spark:2.2.0-beta1”)

Best
Jeff






Re: pyspark with jar

Posted by moon soo Lee <mo...@apache.org>.
Hi,

Thanks for the question.

Actually, %pyspark runs in the same JVM process as %spark, and they share a
single SparkContext instance (although %pyspark does launch an additional
Python process).
Libraries loaded from %dep should be available in %pyspark, too.

The interpreter property 'spark.home' is a little confusing alongside SPARK_HOME.
At the moment, defining SPARK_HOME in conf/zeppelin-env.sh is recommended
instead of spark.home.
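
That is, something like this in conf/zeppelin-env.sh (the path is just a placeholder):

export SPARK_HOME=/usr/local/spark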

Best,
moon


Re: pyspark with jar

Posted by Jeff Steinmetz <je...@gmail.com>.
That's a good pointer.
The question still stands: how do you load libraries (jars) for %pyspark?

It's clear how to do it for %spark (Scala) via %dep.

Looking for the equivalent of:

./bin/pyspark --master local[2] --jars jars/elasticsearch-hadoop-2.1.0.Beta2.jar
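
Roughly speaking, I imagine the Zeppelin-side equivalent would be something in conf/zeppelin-env.sh along the lines of the line below, assuming a build that starts the Spark interpreter through spark-submit; the variable name and path here are an assumption on my part, not something I have verified:

export SPARK_SUBMIT_OPTIONS="--jars /path/to/elasticsearch-hadoop-2.1.0.Beta2.jar"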





Re: pyspark with jar

Posted by Matt Sochor <ma...@mobiledefense.com>.
I actually *just* figured it out.  Zeppelin has sqlContext "already created
and exposed" (
https://zeppelin.incubator.apache.org/docs/interpreter/spark.html).

So when I do "sqlContext = SQLContext(sc)" I overwrite sqlContext.  Then
Zeppelin cannot see this new sqlContext.

Anyway, for anyone out there experiencing this problem: do NOT initialize
sqlContext yourself, and it works fine.
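
A minimal illustration (the table name and rows here are hypothetical, and sc/sqlContext are the instances Zeppelin already provides):

%pyspark
# Don't create your own SQLContext in Zeppelin; sqlContext (and sc) are already
# provided, and re-creating it hides the instance the rest of Zeppelin uses.
# from pyspark.sql import SQLContext
# sqlContext = SQLContext(sc)   # <-- this is what broke things for me

df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.registerTempTable("example")   # should be visible to %sql as well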

-- 
Best regards,

Matt Sochor
Data Scientist
Mobile Defense

Mobile +1 215 307 7768

