You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spark.apache.org by Amit Rana <am...@gmail.com> on 2016/07/07 07:49:19 UTC

Understanding pyspark data flow on worker nodes

Hi all,

I am trying  to trace the data flow in pyspark. I am using intellij IDEA in
windows 7.
I had submitted  a python  job as follows:
--master local[4] <path to pyspark  job> <arguments to the job>

I have made the following  insights after running the above command in
debug mode:
->Locally when a pyspark's interpreter starts, it also starts a JVM with
which it communicates through socket.
->py4j is used to handle this communication
->Now this JVM acts as actual spark driver, and loads a JavaSparkContext
which communicates with the spark executors in cluster.

In cluster I have read that the data flow between spark executors and
python interpreter happens using pipes. But I am not able to trace that
data flow.

Please correct me if my understanding is wrong. It would be very helpful
if, someone can help me understand tge code-flow for data transfer between
JVM and python workers.

Thanks,
Amit Rana

Re: Understanding pyspark data flow on worker nodes

Posted by Adam Roberts <AR...@uk.ibm.com>.

Hi, sharing what I discovered with PySpark too, corroborates with what 
Amit notices and also interested in the pipe question:
h
ttps://mail-archives.apache.org/mod_mbox/spark-dev/201603.mbox/%3C201603291521.u2TFLBfO024212@d06av05.portsmouth.uk.ibm.com%3E


// Start a thread to feed the process input from our parent's iterator 
  val writerThread = new WriterThread(env, worker, inputIterator, 
partitionIndex, context)

...

// Return an iterator that read lines from the process's stdout 
  val stream = new DataInputStream(new 
BufferedInputStream(worker.getInputStream, bufferSize))

The above code and what follows look to be the important parts.



Note that Josh Rosen replied to my comment with more information:

"One clarification: there are Python interpreters running on executors so 
that Python UDFs and RDD API code can be executed. Some slightly-outdated 
but mostly-correct reference material for this can be found at 
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. 

See also: search the Spark codebase for PythonRDD and look at 
python/pyspark/worker.py"




From:   Reynold Xin <rx...@databricks.com>
To:     Amit Rana <am...@gmail.com>
Cc:     "dev@spark.apache.org" <de...@spark.apache.org>
Date:   08/07/2016 07:03
Subject:        Re: Understanding pyspark data flow on worker nodes



You can look into its source code: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala


On Thu, Jul 7, 2016 at 11:01 PM, Amit Rana <am...@gmail.com> 
wrote:
Hi all,
Did anyone get a chance to look into it??
Any sort of guidance will be much appreciated.
Thanks,
Amit Rana
On 7 Jul 2016 14:28, "Amit Rana" <am...@gmail.com> wrote:
As mentioned in the documentation:
PythonRDD objects launch Python subprocesses and communicate with them 
using pipes, sending the user's code and the data to be processed.
I am trying to understand  the implementation of how this data transfer is 
happening  using pipes.
Can anyone please guide me along that line??
Thanks, 
Amit Rana
On 7 Jul 2016 13:44, "Sun Rui" <su...@163.com> wrote:
You can read 
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
For pySpark data flow on worker nodes, you can read the source code of 
PythonRDD.scala. Python worker processes communicate with Spark executors 
via sockets instead of pipes.

On Jul 7, 2016, at 15:49, Amit Rana <am...@gmail.com> wrote:

Hi all,
I am trying  to trace the data flow in pyspark. I am using intellij IDEA 
in windows 7.
I had submitted  a python  job as follows:
--master local[4] <path to pyspark  job> <arguments to the job>
I have made the following  insights after running the above command in 
debug mode:
->Locally when a pyspark's interpreter starts, it also starts a JVM with 
which it communicates through socket.
->py4j is used to handle this communication 
->Now this JVM acts as actual spark driver, and loads a JavaSparkContext 
which communicates with the spark executors in cluster.
In cluster I have read that the data flow between spark executors and 
python interpreter happens using pipes. But I am not able to trace that 
data flow.
Please correct me if my understanding is wrong. It would be very helpful 
if, someone can help me understand tge code-flow for data transfer between 
JVM and python workers.
Thanks,
Amit Rana



Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

Re: Understanding pyspark data flow on worker nodes

Posted by Reynold Xin <rx...@databricks.com>.

You can look into its source code:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala


On Thu, Jul 7, 2016 at 11:01 PM, Amit Rana <am...@gmail.com> wrote:

> Hi all,
>
> Did anyone get a chance to look into it??
> Any sort of guidance will be much appreciated.
>
> Thanks,
> Amit Rana
> On 7 Jul 2016 14:28, "Amit Rana" <am...@gmail.com> wrote:
>
>> As mentioned in the documentation:
>> PythonRDD objects launch Python subprocesses and communicate with them
>> using pipes, sending the user's code and the data to be processed.
>>
>> I am trying to understand  the implementation of how this data transfer
>> is happening  using pipes.
>> Can anyone please guide me along that line??
>>
>> Thanks,
>> Amit Rana
>> On 7 Jul 2016 13:44, "Sun Rui" <su...@163.com> wrote:
>>
>>> You can read
>>> https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
>>> For pySpark data flow on worker nodes, you can read the source code of
>>> PythonRDD.scala. Python worker processes communicate with Spark executors
>>> via sockets instead of pipes.
>>>
>>> On Jul 7, 2016, at 15:49, Amit Rana <am...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I am trying  to trace the data flow in pyspark. I am using intellij IDEA
>>> in windows 7.
>>> I had submitted  a python  job as follows:
>>> --master local[4] <path to pyspark  job> <arguments to the job>
>>>
>>> I have made the following  insights after running the above command in
>>> debug mode:
>>> ->Locally when a pyspark's interpreter starts, it also starts a JVM with
>>> which it communicates through socket.
>>> ->py4j is used to handle this communication
>>> ->Now this JVM acts as actual spark driver, and loads a JavaSparkContext
>>> which communicates with the spark executors in cluster.
>>>
>>> In cluster I have read that the data flow between spark executors and
>>> python interpreter happens using pipes. But I am not able to trace that
>>> data flow.
>>>
>>> Please correct me if my understanding is wrong. It would be very helpful
>>> if, someone can help me understand tge code-flow for data transfer between
>>> JVM and python workers.
>>>
>>> Thanks,
>>> Amit Rana
>>>
>>>
>>>

Re: Understanding pyspark data flow on worker nodes

Posted by Amit Rana <am...@gmail.com>.

Hi all,

Did anyone get a chance to look into it??
Any sort of guidance will be much appreciated.

Thanks,
Amit Rana
On 7 Jul 2016 14:28, "Amit Rana" <am...@gmail.com> wrote:

> As mentioned in the documentation:
> PythonRDD objects launch Python subprocesses and communicate with them
> using pipes, sending the user's code and the data to be processed.
>
> I am trying to understand  the implementation of how this data transfer is
> happening  using pipes.
> Can anyone please guide me along that line??
>
> Thanks,
> Amit Rana
> On 7 Jul 2016 13:44, "Sun Rui" <su...@163.com> wrote:
>
>> You can read
>> https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
>> For pySpark data flow on worker nodes, you can read the source code of
>> PythonRDD.scala. Python worker processes communicate with Spark executors
>> via sockets instead of pipes.
>>
>> On Jul 7, 2016, at 15:49, Amit Rana <am...@gmail.com> wrote:
>>
>> Hi all,
>>
>> I am trying  to trace the data flow in pyspark. I am using intellij IDEA
>> in windows 7.
>> I had submitted  a python  job as follows:
>> --master local[4] <path to pyspark  job> <arguments to the job>
>>
>> I have made the following  insights after running the above command in
>> debug mode:
>> ->Locally when a pyspark's interpreter starts, it also starts a JVM with
>> which it communicates through socket.
>> ->py4j is used to handle this communication
>> ->Now this JVM acts as actual spark driver, and loads a JavaSparkContext
>> which communicates with the spark executors in cluster.
>>
>> In cluster I have read that the data flow between spark executors and
>> python interpreter happens using pipes. But I am not able to trace that
>> data flow.
>>
>> Please correct me if my understanding is wrong. It would be very helpful
>> if, someone can help me understand tge code-flow for data transfer between
>> JVM and python workers.
>>
>> Thanks,
>> Amit Rana
>>
>>
>>

Re: Understanding pyspark data flow on worker nodes

Posted by Amit Rana <am...@gmail.com>.

As mentioned in the documentation:
PythonRDD objects launch Python subprocesses and communicate with them
using pipes, sending the user's code and the data to be processed.

I am trying to understand  the implementation of how this data transfer is
happening  using pipes.
Can anyone please guide me along that line??

Thanks,
Amit Rana
On 7 Jul 2016 13:44, "Sun Rui" <su...@163.com> wrote:

> You can read
> https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
> For pySpark data flow on worker nodes, you can read the source code of
> PythonRDD.scala. Python worker processes communicate with Spark executors
> via sockets instead of pipes.
>
> On Jul 7, 2016, at 15:49, Amit Rana <am...@gmail.com> wrote:
>
> Hi all,
>
> I am trying  to trace the data flow in pyspark. I am using intellij IDEA
> in windows 7.
> I had submitted  a python  job as follows:
> --master local[4] <path to pyspark  job> <arguments to the job>
>
> I have made the following  insights after running the above command in
> debug mode:
> ->Locally when a pyspark's interpreter starts, it also starts a JVM with
> which it communicates through socket.
> ->py4j is used to handle this communication
> ->Now this JVM acts as actual spark driver, and loads a JavaSparkContext
> which communicates with the spark executors in cluster.
>
> In cluster I have read that the data flow between spark executors and
> python interpreter happens using pipes. But I am not able to trace that
> data flow.
>
> Please correct me if my understanding is wrong. It would be very helpful
> if, someone can help me understand tge code-flow for data transfer between
> JVM and python workers.
>
> Thanks,
> Amit Rana
>
>
>

Re: Understanding pyspark data flow on worker nodes

Posted by Sun Rui <su...@163.com>.

You can read https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals <https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals>
For pySpark data flow on worker nodes, you can read the source code of PythonRDD.scala. Python worker processes communicate with Spark executors via sockets instead of pipes.

> On Jul 7, 2016, at 15:49, Amit Rana <am...@gmail.com> wrote:
> 
> Hi all,
> 
> I am trying  to trace the data flow in pyspark. I am using intellij IDEA in windows 7.
> I had submitted  a python  job as follows:
> --master local[4] <path to pyspark  job> <arguments to the job>
> 
> I have made the following  insights after running the above command in debug mode:
> ->Locally when a pyspark's interpreter starts, it also starts a JVM with which it communicates through socket.
> ->py4j is used to handle this communication 
> ->Now this JVM acts as actual spark driver, and loads a JavaSparkContext which communicates with the spark executors in cluster.
> 
> In cluster I have read that the data flow between spark executors and python interpreter happens using pipes. But I am not able to trace that data flow.
> 
> Please correct me if my understanding is wrong. It would be very helpful if, someone can help me understand tge code-flow for data transfer between JVM and python workers.
> 
> Thanks,
> Amit Rana
>