Posted to dev@spark.apache.org by Egor Pahomov <pa...@gmail.com> on 2014/07/11 14:50:26 UTC

How pySpark works?

Hi, I want to use PySpark, but I can't understand how it works. The
documentation doesn't provide enough information.

1) How is Python shipped to the cluster? Should the machines in the
cluster already have Python installed?
2) What happens when I write some Python code in a "map" function - is it
shipped to the cluster and just executed there? How does Spark figure out
all the dependencies my code needs and ship them there? If I use math in
my "map" code, does that mean the math module would be shipped, or would
a Python math module already present on the cluster be used?
3) I have compiled C++ code. Can I ship this executable with "addPyFile"
and just run it with Python's "exec" function? Would that work?

-- 
Sincerely yours,
Egor Pakhomov
Scala Developer, Yandex

Re: How pySpark works?

Posted by Reynold Xin <rx...@databricks.com>.
Also take a look at this:
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals


Re: How pySpark works?

Posted by Andrew Or <an...@databricks.com>.
Hi Egor,

Here are a few answers to your questions:

1) Python needs to be installed on all machines, but PySpark itself does
not. How the executors get the PySpark code depends on which cluster
manager you use. In standalone mode, your executors need to have the
actual Python files in their working directory. In YARN mode, the Python
files are included in the assembly jar, which is then shipped to your
executor containers through a distributed cache.
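
For example, you can also ship your own Python modules to the executors
yourself when you create the context (a minimal sketch; the master URL
and "helpers.py" are placeholders for your own setup):

    from pyspark import SparkContext

    # Files listed in pyFiles (or added later via sc.addPyFile) are
    # copied to the executors and put on their Python path, so
    # "import helpers" works inside tasks.
    sc = SparkContext("spark://master:7077", "my-app",
                      pyFiles=["helpers.py"])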

2) PySpark is just a thin wrapper around Spark. When you write a closure
in Python, it is serialized and shipped to the executors within the task
itself, the same way Scala closures are shipped. If you use a special
library, all of the nodes will need to have that library pre-installed.
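
To make that concrete, here is a minimal sketch. The lambda below is
pickled and sent to the executors inside each task, but the "math" module
itself is not shipped; the closure only refers to it by name, so it has
to be importable on the workers (it is, being part of the standard
library):

    import math
    from pyspark import SparkContext

    sc = SparkContext("local", "closure-demo")

    # The closure travels with the task; "math" is resolved on the
    # worker side at import time.
    print(sc.parallelize([1.0, 4.0, 9.0])
            .map(lambda x: math.sqrt(x))
            .collect())  # [1.0, 2.0, 3.0]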

3) Are you trying to run your C++ code inside the "map" function? If so,
you need to make sure the compiled binary is present in the working
directory of all the executors beforehand so that Python can execute it.
I haven't done this before, though, so there may be a few gotchas.
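
If I were to attempt it, something like the sketch below might work. Note
that Python's built-in "exec" runs Python source, not native binaries, so
you would probably reach for subprocess instead, and use sc.addFile
rather than addPyFile for a compiled executable ("my_binary" and its path
are placeholders):

    import subprocess
    from pyspark import SparkContext, SparkFiles

    sc = SparkContext("local", "native-binary-demo")

    # Ship the binary to every executor; addFile handles arbitrary
    # files, while addPyFile is meant for Python sources.
    sc.addFile("/path/to/my_binary")

    def run_binary(x):
        # Resolve the local path of the shipped file on this executor.
        binary = SparkFiles.get("my_binary")
        # Possible gotcha: the executable bit may not survive the
        # transfer, so an os.chmod(binary, 0o755) may be needed first.
        return subprocess.check_output([binary, str(x)]).strip()

    print(sc.parallelize([1, 2, 3]).map(run_binary).collect())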

Maybe others can add more information?

Andrew

