Posted to dev@spark.apache.org by Renyi Xiong <re...@gmail.com> on 2015/09/15 22:46:13 UTC

pyspark streaming DStream compute

Can anybody help me understand why PySpark Streaming uses a py4j callback to
execute Python code, while PySpark batch uses worker.py?

Regarding PySpark Streaming, is the py4j callback only used for
DStream, while worker.py is still used for RDD?

thanks,
Renyi.

Re: pyspark streaming DStream compute

Posted by Davies Liu <da...@databricks.com>.
On Tue, Sep 15, 2015 at 1:46 PM, Renyi Xiong <re...@gmail.com> wrote:
> Can anybody help me understand why PySpark Streaming uses a py4j callback to
> execute Python code, while PySpark batch uses worker.py?

There are two kinds of callbacks in PySpark Streaming:
1) those that operate on RDDs: they take an RDD and return a new RDD,
and use the py4j callback, because SparkContext and RDDs are not
accessible in worker.py;
2) those that operate on the records of an RDD: they take a record and
return new records, and run in worker.py.
A sketch below illustrates both kinds.
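
To make that concrete, here is a minimal sketch of a streaming job (the
socket source, host/port, and lambdas are just placeholder assumptions):
the function passed to transform() receives a whole RDD and runs on the
driver through the py4j callback server, while the function passed to
flatMap() is pickled and shipped to the executors, where worker.py
applies it to individual records.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "CallbackDemo")
ssc = StreamingContext(sc, 1)  # 1-second batches

lines = ssc.socketTextStream("localhost", 9999)

# 1) RDD-level callback: receives a whole RDD per batch. Runs on the
#    driver via the py4j callback server, since SparkContext and RDD
#    objects are not available inside worker.py.
deduped = lines.transform(lambda rdd: rdd.distinct())

# 2) Record-level callback: receives individual records. The function
#    is pickled and shipped to the executors, where worker.py runs it.
words = deduped.flatMap(lambda line: line.split(" "))

words.pprint()
ssc.start()
ssc.awaitTermination()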

> Regarding PySpark Streaming, is the py4j callback only used for DStream,
> while worker.py is still used for RDD?

Yes.

> thanks,
> Renyi.
