You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Soumitra Kumar <ku...@gmail.com> on 2014/09/12 20:36:37 UTC

How to initialize StateDStream

Hello,

How do I initialize StateDStream used in updateStateByKey?

-Soumitra.

Re: How to initialize StateDStream

Posted by Soumitra Kumar <ku...@gmail.com>.

Thanks for the pointers. I meant previous run of spark-submit.

For 1: This would be a bit more computation in every batch.

2: Its a good idea, but it may be inefficient to retrieve each value.

In general, for a generic state machine the initialization and input
sequence is critical for correctness.




On Sat, Sep 13, 2014 at 12:17 PM, qihong <qc...@pivotal.io> wrote:

> I'm not sure what you mean by "previous run". Is it previous batch? or
> previous run of spark-submit?
>
> If it's "previous batch" (spark streaming creates a batch every batch
> interval), then there's nothing to do.
>
> If it's previous run of spark-submit (assuming you are able to save the
> result somewhere), then I can think of two possible ways to do it:
>
> 1. read saved result as RDD (just do this once), and join the RDD with each
> RDD of the stateStream.
>
> 2. add extra logic to updateFunction: when the previous state is None (one
> of two Option type values), you get save state for the given key from saved
> result somehow, then your original logic to create new state object based
> on
> Seq[V] and previous state. note that you need use this version of
> updateFunction: "updateFunc: (Iterator[(K, Seq[V], Option[S])]) =>
> Iterator[(K, S)]", which make key available to the update function.
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-initialize-StateDStream-tp14113p14176.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>

Re: How to initialize StateDStream

Posted by qihong <qc...@pivotal.io>.

I'm not sure what you mean by "previous run". Is it previous batch? or
previous run of spark-submit?

If it's "previous batch" (spark streaming creates a batch every batch
interval), then there's nothing to do.

If it's previous run of spark-submit (assuming you are able to save the
result somewhere), then I can think of two possible ways to do it:

1. read saved result as RDD (just do this once), and join the RDD with each
RDD of the stateStream. 

2. add extra logic to updateFunction: when the previous state is None (one
of two Option type values), you get save state for the given key from saved
result somehow, then your original logic to create new state object based on
Seq[V] and previous state. note that you need use this version of
updateFunction: "updateFunc: (Iterator[(K, Seq[V], Option[S])]) =>
Iterator[(K, S)]", which make key available to the update function.





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-initialize-StateDStream-tp14113p14176.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: How to initialize StateDStream

Posted by Soumitra Kumar <ku...@gmail.com>.

I had looked at that.
If I have a set of saved word counts from previous run, and want to load
that in the next run, what is the best way to do it?

I am thinking of hacking the Spark code and have an initial rdd in
StateDStream,
and use that in for the first time.

On Fri, Sep 12, 2014 at 11:04 PM, qihong <qc...@pivotal.io> wrote:

> there's no need to initialize StateDStream. Take a look at example
> StatefulNetworkWordCount.scala, it's part of spark source code.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-initialize-StateDStream-tp14113p14146.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>

Re: How to initialize StateDStream

Posted by qihong <qc...@pivotal.io>.

there's no need to initialize StateDStream. Take a look at example
StatefulNetworkWordCount.scala, it's part of spark source code.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-initialize-StateDStream-tp14113p14146.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org