Posted to dev@spark.apache.org by Amit Rana <am...@gmail.com> on 2016/07/07 07:49:19 UTC
Understanding pyspark data flow on worker nodes
Hi all,
I am trying to trace the data flow in PySpark. I am using IntelliJ IDEA on
Windows 7.
I submitted a Python job as follows:
--master local[4] <path to pyspark job> <arguments to the job>
I made the following observations after running the above command in
debug mode:
-> Locally, when the PySpark interpreter starts, it also starts a JVM, with
which it communicates through a socket.
-> Py4J is used to handle this communication.
-> This JVM acts as the actual Spark driver and loads a JavaSparkContext,
which communicates with the Spark executors in the cluster.
I have read that in a cluster the data flow between the Spark executors and
the Python interpreter happens using pipes, but I am not able to trace that
data flow.
Please correct me if my understanding is wrong. It would be very helpful
if someone could help me understand the code flow for data transfer between
the JVM and the Python workers.
Thanks,
Amit Rana
Re: Understanding pyspark data flow on worker nodes
Posted by Adam Roberts <AR...@uk.ibm.com>.
Hi, I am sharing what I discovered about PySpark as well. It corroborates what
Amit noticed, and I am also interested in the pipe question:
https://mail-archives.apache.org/mod_mbox/spark-dev/201603.mbox/%3C201603291521.u2TFLBfO024212@d06av05.portsmouth.uk.ibm.com%3E
// Start a thread to feed the process input from our parent's iterator
val writerThread = new WriterThread(env, worker, inputIterator, partitionIndex, context)
...
// Return an iterator that read lines from the process's stdout
val stream = new DataInputStream(new BufferedInputStream(worker.getInputStream, bufferSize))
The above code and what follows look to be the important parts.
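To make the shape of that excerpt concrete, here is a minimal Python sketch of the same pattern: a background thread feeds input to a worker while the caller lazily reads the results back. It is an illustration only, not Spark's code. The names worker_loop, feed_input and run_partition are hypothetical, a thread stands in for the Python worker process, and a newline-delimited text protocol is assumed for brevity.

```python
import socket
import threading


def worker_loop(conn):
    """Stand-in for the worker: upper-case every line it receives."""
    with conn, conn.makefile("r", encoding="utf-8") as reader, \
            conn.makefile("w", encoding="utf-8") as writer:
        for line in reader:
            writer.write(line.upper())
            writer.flush()


def feed_input(sock, items):
    """Writer thread body: push every input item, then signal end of input."""
    for item in items:
        sock.sendall((item + "\n").encode("utf-8"))
    sock.shutdown(socket.SHUT_WR)  # half-close so the worker sees EOF


def run_partition(items):
    """Yield worker results lazily while a writer thread feeds the input."""
    parent, child = socket.socketpair()
    threading.Thread(target=worker_loop, args=(child,), daemon=True).start()
    threading.Thread(target=feed_input, args=(parent, items), daemon=True).start()
    with parent, parent.makefile("r", encoding="utf-8") as results:
        for line in results:
            yield line.rstrip("\n")


if __name__ == "__main__":
    print(list(run_partition(["a", "b", "c"])))
```

Writing from a separate thread matters because the reader and the writer would otherwise block each other once the socket buffers fill up.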
Note that Josh Rosen replied to my comment with more information:
"One clarification: there are Python interpreters running on executors so
that Python UDFs and RDD API code can be executed. Some slightly-outdated
but mostly-correct reference material for this can be found at
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals.
See also: search the Spark codebase for PythonRDD and look at
python/pyspark/worker.py"
Re: Understanding pyspark data flow on worker nodes
Posted by Reynold Xin <rx...@databricks.com>.
You can look into its source code:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
Re: Understanding pyspark data flow on worker nodes
Posted by Amit Rana <am...@gmail.com>.
Hi all,
Did anyone get a chance to look into this?
Any guidance would be much appreciated.
Thanks,
Amit Rana
Re: Understanding pyspark data flow on worker nodes
Posted by Amit Rana <am...@gmail.com>.
As mentioned in the documentation:
PythonRDD objects launch Python subprocesses and communicate with them
using pipes, sending the user's code and the data to be processed.
I am trying to understand how this data transfer over pipes is
implemented.
Can anyone please point me in the right direction?
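For reference, the general mechanism that the documentation sentence describes (a parent launching a Python subprocess, handing it code, and exchanging data through its stdin/stdout pipes) can be sketched as below. This is a generic standard-library illustration under my own assumptions, not Spark's implementation. CHILD_CODE and square_in_subprocess are hypothetical names.

```python
import subprocess
import sys

# Code the child interpreter runs: read integers from stdin, print squares.
CHILD_CODE = """
import sys
for line in sys.stdin:
    print(int(line) ** 2)
"""


def square_in_subprocess(numbers):
    """Send numbers to a child Python process over pipes and collect replies."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    data = "".join(f"{n}\n" for n in numbers)
    # communicate() writes to stdin, closes it, then reads stdout until EOF.
    out, _ = proc.communicate(data)
    return [int(line) for line in out.splitlines()]


if __name__ == "__main__":
    print(square_in_subprocess([1, 2, 3]))
```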
Thanks,
Amit Rana
Re: Understanding pyspark data flow on worker nodes
Posted by Sun Rui <su...@163.com>.
You can read https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
For the PySpark data flow on worker nodes, you can read the source code of PythonRDD.scala. Python worker processes communicate with Spark executors via sockets, not pipes.
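As a rough illustration of what exchanging records over a socket can look like, the sketch below frames each pickled record with a 4-byte big-endian length and ends the stream with a sentinel. The function names and the -1 sentinel are assumptions made for this example, not a description of Spark's wire protocol; the actual format is whatever PythonRDD.scala and the Python side of the Spark source define.

```python
import pickle
import socket
import struct

END_OF_STREAM = -1  # hypothetical sentinel chosen for this sketch


def send_records(sock, records):
    """Send each record as <4-byte signed length><pickled bytes>."""
    for record in records:
        payload = pickle.dumps(record)
        sock.sendall(struct.pack("!i", len(payload)) + payload)
    sock.sendall(struct.pack("!i", END_OF_STREAM))


def read_exact(sock, size):
    """Read exactly `size` bytes, since recv() may return fewer."""
    chunks = []
    while size > 0:
        chunk = sock.recv(size)
        if not chunk:
            raise EOFError("socket closed before the frame was complete")
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)


def recv_records(sock):
    """Yield records until the end-of-stream sentinel arrives."""
    while True:
        (length,) = struct.unpack("!i", read_exact(sock, 4))
        if length == END_OF_STREAM:
            return
        yield pickle.loads(read_exact(sock, length))


if __name__ == "__main__":
    executor_side, worker_side = socket.socketpair()
    with executor_side, worker_side:
        send_records(executor_side, [1, "two", (3.0, None)])
        print(list(recv_records(worker_side)))
```

The length prefix lets the receiver know where one record ends and the next begins, which a raw byte stream does not provide on its own.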