Posted to user@spark.apache.org by Soumitra Kumar <ku...@gmail.com> on 2014/09/21 19:43:01 UTC

How to initialize updateStateByKey operation

I started with StatefulNetworkWordCount to have a running count of words seen.

I have a file 'stored.count' which contains the word counts.

$ cat stored.count
a 1
b 2

I want to initialize StatefulNetworkWordCount with the values in the 'stored.count' file. How do I do that?
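
For concreteness, here is a minimal sketch of the pieces I already have (assuming the space-separated 'word count' format above, the running-count update function from the StatefulNetworkWordCount example, and 'sc' as the SparkContext):

    import org.apache.spark.rdd.RDD

    // Parse 'stored.count' into an RDD[(String, Int)] of prior counts.
    val storedCounts: RDD[(String, Int)] = sc.textFile("stored.count")
      .map { line =>
        val Array(word, count) = line.split(" ")
        (word, count.toInt)
      }

    // Running-count update function, as in StatefulNetworkWordCount.
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      Some(values.sum + state.getOrElse(0))
    }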

I looked at the paper 'EECS-2012-259.pdf' by Matei et al.; in Figure 2, it would be useful to have an initial RDD feeding into 'counts' at 't = 1', as below.

                           initial
                             |
t = 1: pageView -> ones -> counts
                             |
t = 2: pageView -> ones -> counts
...

I have also attached the modified figure 2 of http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf .

I managed to hack the Spark code to achieve this, and have attached the modified files.

Essentially, I added an argument 'initial: RDD[(K, S)]' to the updateStateByKey method:
    def updateStateByKey[S: ClassTag](
        initial: RDD[(K, S)],
        updateFunc: (Seq[V], Option[S]) => Option[S],
        partitioner: Partitioner
      ): DStream[(K, S)]
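
With that overload, the word-count example would be wired up roughly as below. This is a sketch against my patched build, not the stock API; 'wordDstream' is the usual DStream[(String, Int)] of (word, 1) pairs, and 'ssc' is the StreamingContext:

    import org.apache.spark.HashPartitioner

    // Seed the state with the counts loaded from 'stored.count',
    // then keep updating it with each new batch.
    val stateDstream = wordDstream.updateStateByKey[Int](
      storedCounts,
      updateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism))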

If this sounds interesting to a larger crowd, I would be happy to clean up the code and volunteer to push it into the codebase. I don't know the procedure for that, though.

-Soumitra.

Re: How to initialize updateStateByKey operation

Posted by Tathagata Das <ta...@gmail.com>.
At a high level, the suggestion sounds good to me. However, regarding the
code, it's best to submit a Pull Request on the Spark GitHub page for
community review. You will find more information here:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

On Tue, Sep 23, 2014 at 10:11 PM, Soumitra Kumar <ku...@gmail.com>
wrote:

> I thought I did a good job ;-)
>
> OK, so what is the best way to initialize the updateStateByKey operation?
> I have counts from a previous spark-submit and want to load them in the
> next spark-submit job.

Re: How to initialize updateStateByKey operation

Posted by Soumitra Kumar <ku...@gmail.com>.
I thought I did a good job ;-)

OK, so what is the best way to initialize the updateStateByKey operation? I have counts from a previous spark-submit and want to load them in the next spark-submit job.
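
The only workaround I see today is to persist the state myself at the end of each batch and reload it on the next submit. A sketch, assuming the same 'word count' text format ('stateDstream' is the DStream returned by updateStateByKey, and the output path is made up for illustration):

    // Write out the latest state so the next spark-submit can reload
    // it with sc.textFile; saveAsTextFile emits one file per partition.
    stateDstream.foreachRDD { (rdd, time) =>
      rdd.map { case (word, count) => word + " " + count }
         .saveAsTextFile("stored.count-" + time.milliseconds)
    }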

