You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Andy Davidson <An...@SantaCruzIntegration.com> on 2014/09/30 02:56:09 UTC

newbie system architecture problem, trouble using streaming and RDD.pipe()

Hello

I am trying to build a system that does a very simple calculation on a
stream and displays the results in a graph that I want to update the graph
every second or so. I think I have a fundamental mis understanding about how
steams and rdd.pipe() works. I want to do the data visualization part using
Ipython notebook. Its really easy to graph, animate, and share the page. I
understand streaming does not work in python yet. Googleing around it
appears you can use RDD.pipe() to get the streaming data into python.

I have a little Ipython notebook I have been experimenting with. It use
rdd.pipe() to run the following java job The psudo code is

Main() {
   JavaStreamingContext ssc = new JavaStreamingContext(jsc, new
Duration(1000));

 JavaDStream<String> logs = createStream(ssc)
   JavaDStream<String> msg = logs.filter(selectMsgLevel);

JavaDStream<Long> count = msg.count()

         Logs.print()
    ssc.start();

    ssc.awaitTermination();

}



I do not understand how I can pass any data back to python. If my
understanding is correct everything runs on a worker, there is no way for
the driver to get get the value of mini batch count and write it out to
standard out.


The Streaming documentation on design patterns for using foreachRDD
 demonstrates how the slaves/works can send data to other systems. So does
the over all architecture of my little demo need to be something like

Process a) iPython Note book rdd.pipe(myReader.sh)

Process b) myReader.sh is basically some little daemon process that the
workers can connect to. It will just write what ever it receives to standard
out.

Process c) is java spark streaming code

Do I really need the ³daemon in the middle² ?

Any insights would be greatly appreciated

Andy


P.s. I assume that if I wanted to use aggregates I would need to use
JDStream.wrappRDD() . Is this correct? Is it expensive?