Posted to user@spark.apache.org by Bill Jay <bi...@gmail.com> on 2014/06/26 20:40:50 UTC

Spark Streaming RDD transformation

Hi all,

I am currently working on a project that requires transforming each RDD in a
DStream to a Map. Basically, when we get a list of data in each batch, we
would like to update a global map. I would like to return the map as a
single RDD.

I am currently trying to use the function *transform*. The output will be an
RDD of the updated map after each batch. How can I create an RDD from
another data structure such as an Int, a Map, etc.? Thanks!

Bill

Re: Spark Streaming RDD transformation

Posted by Bill Jay <bi...@gmail.com>.
Thanks, Sean!

I am currently using foreachRDD to update the global map with the data in each
RDD. The reason I want to return the map as an RDD instead of just updating the
map is that RDD provides many handy methods for output. For example, I want
to save the global map to files in HDFS after each batch in the stream. In
this case, do you have any suggestions on how Spark can easily let me do
that? Thanks!
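
One way to combine the two (a sketch only, assuming the stream carries (String, Long) pairs; the map name, stream name, and HDFS path are illustrative, not from the thread):

```scala
import org.apache.spark.streaming.dstream.DStream
import scala.collection.mutable

// Driver-side global map, updated once per batch.
val globalCounts = mutable.Map[String, Long]()

def saveSnapshots(stream: DStream[(String, Long)]): Unit = {
  stream.foreachRDD { (rdd, time) =>
    // Merge this batch's (small) result into the global map on the driver.
    rdd.collect().foreach { case (k, v) =>
      globalCounts(k) = globalCounts.getOrElse(k, 0L) + v
    }
    // Turn the map back into an RDD so the usual output methods apply,
    // then write one snapshot per batch, keyed by batch time.
    rdd.sparkContext
      .parallelize(globalCounts.toSeq)
      .saveAsTextFile(s"hdfs:///tmp/global-map/${time.milliseconds}")
  }
}
```

Here the map itself stays a plain driver-side object; parallelize is used only at output time to get access to saveAsTextFile.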


On Thu, Jun 26, 2014 at 12:26 PM, Sean Owen <so...@cloudera.com> wrote:

> If you want to transform an RDD to a Map, I assume you have an RDD of
> pairs. The method collectAsMap() creates a Map from the RDD in this
> case.
>
> Do you mean that you want to update a Map object using data in each
> RDD? You would use foreachRDD() in that case. Then you can use
> RDD.foreach to do something like update a global Map object.
>
> Not sure if this is what you mean but SparkContext.parallelize() can
> be used to make an RDD from a List or Array of objects. But that's not
> really related to streaming or updating a Map.
>
> On Thu, Jun 26, 2014 at 1:40 PM, Bill Jay <bi...@gmail.com>
> wrote:
> > Hi all,
> >
> > I am currently working on a project that requires transforming each RDD
> > in a DStream to a Map. Basically, when we get a list of data in each
> > batch, we would like to update a global map. I would like to return the
> > map as a single RDD.
> >
> > I am currently trying to use the function transform. The output will be
> > an RDD of the updated map after each batch. How can I create an RDD from
> > another data structure such as an Int, a Map, etc.? Thanks!
> >
> > Bill
>

Re: Spark Streaming RDD transformation

Posted by Sean Owen <so...@cloudera.com>.
If you want to transform an RDD to a Map, I assume you have an RDD of
pairs. The method collectAsMap() creates a Map from the RDD in this
case.
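
For example (a sketch, assuming sc is an existing SparkContext; note that collectAsMap() brings the whole RDD to the driver, so it only suits small RDDs):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))

// collectAsMap() is available on RDDs of pairs (PairRDDFunctions).
val asMap: scala.collection.Map[String, Int] = pairs.collectAsMap()
// asMap is now an ordinary driver-side Map.
```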

Do you mean that you want to update a Map object using data in each
RDD? You would use foreachRDD() in that case. Then you can use
RDD.foreach to do something like update a global Map object.
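
A sketch of that pattern, assuming dstream is a DStream[(String, Int)] and the map lives on the driver (one caveat: mutating a driver-side map from inside RDD.foreach only works in local mode, since on a cluster the closure runs on executors, so this sketch collects each batch first):

```scala
import scala.collection.mutable

val globalMap = mutable.Map[String, Int]()

dstream.foreachRDD { rdd =>
  // Bring the per-batch result to the driver, then fold it into the map.
  rdd.collect().foreach { case (k, v) =>
    globalMap(k) = globalMap.getOrElse(k, 0) + v
  }
}
```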

Not sure if this is what you mean, but SparkContext.parallelize() can
be used to make an RDD from a List or Array of objects. But that's not
really related to streaming or updating a Map.
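
Concretely (a sketch, assuming sc is an existing SparkContext):

```scala
// From a List of pairs:
val fromList = sc.parallelize(List(("a", 1), ("b", 2)))

// From an existing Map: convert it to a Seq of pairs first.
val m = Map("x" -> 10, "y" -> 20)
val fromMap = sc.parallelize(m.toSeq)

// From a single value such as an Int: wrap it in a one-element Seq.
val fromInt = sc.parallelize(Seq(42))
```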

On Thu, Jun 26, 2014 at 1:40 PM, Bill Jay <bi...@gmail.com> wrote:
> Hi all,
>
> I am currently working on a project that requires transforming each RDD in
> a DStream to a Map. Basically, when we get a list of data in each batch, we
> would like to update a global map. I would like to return the map as a
> single RDD.
>
> I am currently trying to use the function transform. The output will be an
> RDD of the updated map after each batch. How can I create an RDD from
> another data structure such as an Int, a Map, etc.? Thanks!
>
> Bill