Posted to user@spark.apache.org by "turing.us" <tu...@gmail.com> on 2016/01/12 11:16:31 UTC

Maintaining state until the end of the application

Hi!

I have some application (skeleton):

val sc = new SparkContext($SOME_CONF)
val input = sc.textFile(inputFile)

val result = input
  .map { record =>
    val myState = new MyState() // fresh state for this record
    ($EXTRACT_UUID(record), myState) // keyed by UUID, so the pair operations below apply
  }
  .filter($SOME_FILTER)
  .sortBy($SOME_SORT)
  .partitionBy(new HashPartitioner(100))
  .reduceByKey((first, current) => current.updateRecord(first))

val report = result.map { case (uuid, state) => state.generateReport(uuid) }.coalesce(1)
report.saveAsTextFile($SOME_OUTPUT)


// classes
case class AccumulatedState() {
  ... // the data accumulated for one UUID
}

case class MyState() {
  var data: AccumulatedState = new AccumulatedState()
  ... // updateRecord, generateReport, etc.
}
// EOC



I key the input by UUID, and I keep a custom state per key (MyState, which
holds an AccumulatedState).
If the application runs in local mode, or on a real cluster with a single
reducer, the behavior is correct;
but if I run it with multiple reducers, the state gets reset somewhere
during processing.
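
For what it's worth, here is a minimal, self-contained sketch of the kind of
per-key accumulation I am after, written with aggregateByKey so that partial
states get merged across reducers. The Acc class and the UUID extraction are
simplified placeholders, not my real code:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder state: a count plus the raw records seen for one UUID.
case class Acc(count: Long, payload: List[String]) {
  def add(record: String): Acc = Acc(count + 1, record :: payload) // fold one record in
  def merge(other: Acc): Acc = // combine partial states from different reducers
    Acc(count + other.count, payload ::: other.payload)
}

object StatePerKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("state-per-key"))
    val input = sc.textFile(args(0))

    val perKeyState = input
      .map(record => (record.takeWhile(_ != ','), record)) // assume the UUID is the first CSV field
      .aggregateByKey(Acc(0L, Nil))( // per-key zero value
        (acc, record) => acc.add(record), // applied within one partition
        (left, right) => left.merge(right)) // applied across partitions / reducers

    perKeyState.map { case (uuid, acc) => s"$uuid: ${acc.count}" }
      .coalesce(1)
      .saveAsTextFile(args(1))
    sc.stop()
  }
}

Is something along those lines the intended pattern, or is there a better way?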

How is shared state maintained over the lifecycle of the application?
I know there are Accumulators and Broadcast variables, but those serve
different use cases (counters, and global read-only data such as lookup tables).
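
For example, this is roughly how I understand the split between the two
(a simplified sketch against the Spark 1.x API):

import org.apache.spark.{SparkConf, SparkContext}

object AccVsBroadcast {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("acc-vs-broadcast"))

    // Accumulator: workers only write to it; the driver reads the total.
    // (Updates made inside a transformation can be re-applied on task retries.)
    val misses = sc.accumulator(0L, "lookup misses")

    // Broadcast variable: a read-only value shipped once to every executor.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

    val resolved = sc.parallelize(Seq("a", "b", "c")).map { key =>
      if (!lookup.value.contains(key)) misses += 1L
      lookup.value.getOrElse(key, -1)
    }
    resolved.count() // force evaluation so the accumulator gets updated
    println(s"lookup misses: ${misses.value}")
    sc.stop()
  }
}

Neither of those gives me a mutable per-key state that survives the shuffle,
which is what I need here.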

How can I keep such shared state across the program until the end, so that I
can generate my results?

Thanks!!!



