Posted to user@spark.apache.org by Prasanth Prahladan <pr...@gmail.com> on 2014/02/21 12:33:25 UTC

Using PySpark for Streaming

Hi,
I am new to Spark, Hadoop and related technologies. I intend to use this
for GPS data stream processing. As I am more comfortable with Python, I
intend to use Python-based technologies for the application development.


Is it possible to use the current PySpark API to implement stream
processing within the Spark Streaming framework?

-- 
Regards,
Prasanth Prahladan

Re: Using PySpark for Streaming

Posted by "gaurav.dasgupta" <ga...@gmail.com>.
Hi Tathagata,

I am very new to Spark Streaming and have not used the pipe() function
yet.

I have written a Spark Streaming program (Java API) which receives data
from Kafka and, for now, simply prints it.

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

JavaStreamingContext ssc = new JavaStreamingContext(args[0],
        "SparkStreamExample", new Duration(1000),
        System.getenv("SPARK_HOME"),
        JavaStreamingContext.jarOfClass(SparkStreamExample.class));

// Kafka topic name -> number of consumer threads (example values)
Map<String, Integer> topicMap = new HashMap<String, Integer>();
topicMap.put("mytopic", 1);

// Stream of (key, message) pairs received from Kafka
JavaPairDStream<String, String> messages =
        KafkaUtils.createStream(ssc, args[1], args[2], topicMap);

messages.print();

ssc.start();
ssc.awaitTermination();

There is another simple Spark program (Python API) which does some data
cleaning and saves the result to HDFS.

import sys

from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print >> sys.stderr, "Usage: <master> <file>"
        exit(-1)

    # Instead of reading from HDFS, I want this program to read from the
    # Java-Spark streaming process
    sc = SparkContext(sys.argv[1], "TextCleanUp")
    lines = sc.textFile(sys.argv[2])
    # cleanFunction is my own record-cleaning function (defined elsewhere)
    cleanText = lines.map(cleanFunction).filter(lambda x: len(x) > 0)
    cleanText.saveAsTextFile("hdfs://<IP>/user/root/cleanout")

What I want is for the Python Spark program to read the data from the
standard output of the Java Spark Streaming program. From what I have
read, I need to use pipe() for this, but I am unable to figure out how to
use it.

Can you please provide an example of how to use Spark's pipe() function
in the above context?

Thanks in advance.

Regards,
Gaurav




Re: Using PySpark for Streaming

Posted by Tathagata Das <ta...@gmail.com>.
As Jeremy said, Spark Streaming has no Python API yet. However, there are
a number of things you can do that allow you to do your main data
manipulation in Python. The Spark API allows the data of a dataset to be
"piped" out to an arbitrary external script (say, a Bash script or a
Python script); look up the RDD.pipe() function
(http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD).
So you can use the Scala or Java based Spark Streaming API to read data
from different sources, generate RDDs (the data abstraction in Spark),
and pipe them out to an external Python script for the more complex
processing.
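
For example, here is a minimal sketch of that in Java, building on the
Kafka example earlier in the thread (Spark 0.9-era API; the cleanup
script path is a placeholder, and the script is assumed to read lines on
stdin and write cleaned lines to stdout):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaDStream;
import scala.Tuple2;

// ... inside main(), after creating the `messages` DStream as above:

// Keep only the Kafka message payloads from the (key, message) pairs
JavaDStream<String> lines = messages.map(
        new Function<Tuple2<String, String>, String>() {
            public String call(Tuple2<String, String> kv) {
                return kv._2();
            }
        });

// For each batch, pipe the records through an external Python script
// (placeholder path). pipe() feeds the RDD's elements to the script's
// stdin and returns an RDD of the lines the script writes to stdout.
JavaDStream<String> cleaned = lines.transform(
        new Function<JavaRDD<String>, JavaRDD<String>>() {
            public JavaRDD<String> call(JavaRDD<String> rdd) {
                return rdd.pipe("python /path/to/cleanup.py");
            }
        });

// One output directory per batch interval, named cleanout-<batch time>.txt
cleaned.dstream().saveAsTextFiles("hdfs://<IP>/user/root/cleanout", "txt");

Note that the piped script runs on the worker nodes, so it has to be
present at the same path on every worker.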

TD


On Fri, Feb 21, 2014 at 5:57 PM, Jeremy Freeman <fr...@gmail.com> wrote:

> There is currently no support for Streaming in the Python API, but I
> believe it's on the roadmap.
>
> -- Jeremy
>
> On Feb 21, 2014, at 6:33 AM, Prasanth Prahladan <pr...@gmail.com>
> wrote:
>
> Hi,
> I am new to Spark, Hadoop and related technologies. I intend to use this
> for gps data stream processing. As I am more comfortable with Python, I
> intend to use Python based technologies for the application development.
>
>
> Is it possible to use the current PySpark API for implementing Stream
> Processing as executed within the Spark Streaming framework?
>
> --
> Regards,
> Prasanth Prahladan

Re: Using PySpark for Streaming

Posted by "D.Y Feng" <yy...@gmail.com>.
https://github.com/douban/dpark/


On 22 February 2014 09:57, Jeremy Freeman <fr...@gmail.com> wrote:

> There is currently no support for Streaming in the Python API, but I
> believe it's on the roadmap.
>
> -- Jeremy
>
> On Feb 21, 2014, at 6:33 AM, Prasanth Prahladan <pr...@gmail.com>
> wrote:
>
> Hi,
> I am new to Spark, Hadoop and related technologies. I intend to use this
> for gps data stream processing. As I am more comfortable with Python, I
> intend to use Python based technologies for the application development.
>
>
> Is it possible to use the current PySpark API for implementing Stream
> Processing as executed within the Spark Streaming framework?
>
> --
> Regards,
> Prasanth Prahladan


-- 


DY.Feng(叶毅锋)
yyfeng88625@twitter
Department of Applied Mathematics
Guangzhou University,China
dyfeng@stu.gzhu.edu.cn

Re: Using PySpark for Streaming

Posted by Jeremy Freeman <fr...@gmail.com>.
There is currently no support for Streaming in the Python API, but I believe it's on the roadmap.

-- Jeremy

On Feb 21, 2014, at 6:33 AM, Prasanth Prahladan <pr...@gmail.com> wrote:

> Hi,
> I am new to Spark, Hadoop and related technologies. I intend to use this for gps data stream processing. As I am more comfortable with Python, I intend to use Python based technologies for the application development. 
> 
> 
> Is it possible to use the current PySpark API for implementing Stream Processing as executed within the Spark Streaming framework? 
> 
> -- 
> Regards,
> Prasanth Prahladan