Posted to user@spark.apache.org by Haopu Wang <HW...@qilinsoft.com> on 2016/06/12 08:40:09 UTC

Should I avoid "state" in a Spark application?

I have a Spark application whose structure is below:

 

    var ts: Long = 0L

    dstream1.foreachRDD {
        (x, time) => {
            ts = time
            x.do_something()...
        }
    }

    ......

    process_data(dstream2, ts, ......)

 

I assume the foreachRDD call can update the "ts" variable, which is then
used in the Spark tasks of the "process_data" function.
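For context on why this can appear to work, here is a sketch of the pattern (the
`process` function and stream element types are hypothetical stand-ins, not
taken from the actual application): the body of foreachRDD runs on the driver
once per batch, so the driver-side var really is updated; the question is which
value the executor-side tasks see.

```scala
// Sketch only -- assumes an existing StreamingContext with DStreams
// `dstream1` and `dstream2`; `process` is a hypothetical per-record function.
import org.apache.spark.streaming.Time
import org.apache.spark.rdd.RDD

var ts: Long = 0L

// The closure passed to foreachRDD executes on the DRIVER, once per batch,
// so this assignment genuinely updates the driver-side variable.
dstream1.foreachRDD { (rdd: RDD[String], time: Time) =>
  ts = time.milliseconds
}

// Risk: a closure shipped to executors captures whatever value `ts` had
// when the closure was serialized for that batch's tasks. Executors work
// on a copy and never observe later driver-side updates, and the relative
// ordering of the two output operations within a batch is a scheduling
// detail, not a documented contract.
dstream2.foreachRDD { rdd =>
  val capturedTs = ts   // read on the driver, at job-setup time
  rdd.foreach(record => process(record, capturedTs)) // executors see the copy
}
```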

 

In my tests on a standalone Spark cluster this works, but should I be
concerned if I switch to YARN?

 

I have also seen articles recommending that mutable state be avoided in
Scala programming. Without the state variable, how could this be done?
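One stateless alternative worth noting (a sketch, with `processWithTime`
as a hypothetical stand-in for the per-record logic): DStream.transform has
an overload that passes the batch Time directly, so the timestamp flows in
as a parameter instead of living in a shared var.

```scala
// Sketch only -- assumes an existing DStream `dstream2`
// and a hypothetical helper `processWithTime`.
val processed = dstream2.transform { (rdd, time) =>
  // `time` is this batch's timestamp; no shared mutable state involved.
  rdd.map(record => processWithTime(record, time.milliseconds))
}
```

If the two streams must be combined batch-by-batch, DStream.transformWith
offers the same idea for a pair of DStreams.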

 

Any comments or suggestions are appreciated.

 

Thanks,

Haopu


Re: Should I avoid "state" in a Spark application?

Posted by Alonso Isidoro Roman <al...@gmail.com>.
Hi Haopu, please check these threads:

http://stackoverflow.com/questions/24331815/spark-streaming-historical-state

https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/total.html
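Those threads describe keeping history inside Spark itself with
updateStateByKey rather than in a driver-side var. A minimal sketch (the
key/value types, the `events` stream, and the checkpoint path are all
illustrative assumptions):

```scala
// Sketch only -- assumes a StreamingContext `ssc` and a keyed stream
// `events: DStream[(String, Long)]`.
ssc.checkpoint("/tmp/spark-checkpoint")  // required by stateful operations

val runningTotals = events.updateStateByKey[Long] {
  (newValues: Seq[Long], state: Option[Long]) =>
    Some(state.getOrElse(0L) + newValues.sum)  // per-key running total
}
```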

Alonso Isidoro Roman
about.me/alonso.isidoro.roman

2016-06-13 3:11 GMT+02:00 Haopu Wang <HW...@qilinsoft.com>:

> Can someone look at my questions? Thanks again!

RE: Should I avoid "state" in a Spark application?

Posted by Haopu Wang <HW...@qilinsoft.com>.
Can someone look at my questions? Thanks again!

 


