Posted to user@spark.apache.org by Ron Ayoub <ro...@live.com> on 2014/12/12 17:52:09 UTC

RDD lineage and broadcast variables

I'm still wrapping my head around the fact that the data backing an RDD is immutable, since an RDD may need to be reconstructed from its lineage at any point. In the context of clustering there are many iterations in which an RDD may need to change (for instance cluster assignments, etc.) based on a broadcast variable holding a list of centroids, which are objects that in turn contain a list of features. So immutability is all well and good for the purposes of being able to replay a lineage. But now I'm wondering: during each iteration, in which this RDD goes through many transformations, it will be transformed based on that broadcast variable of centroids, and those centroids are mutable. How would Spark replay the lineage in this instance? Does a dependency on mutable variables mess up the whole lineage thing?
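To make the question concrete, here is a minimal sketch in plain Python rather than Spark. Every name in it (nearest, lazy_assign, the sample points and centroids) is hypothetical and is not part of any Spark API. It only mimics a lazily evaluated transformation that depends on a list of centroids, and compares replaying it after the centroids were mutated in place against replaying it with an immutable snapshot taken for that iteration.

```python
# Plain-Python stand-in for the situation described above (no Spark involved).
# All names are hypothetical and exist only to illustrate the question.

def nearest(point, centroids):
    """Return the index of the centroid closest to a 1-D point."""
    return min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))

def lazy_assign(points, centroids):
    """Mimic a lazy transformation: nothing is computed until it is called."""
    return lambda: [nearest(p, centroids) for p in points]

points = (1.0, 4.0, 9.0, 10.0)

# Case 1: the shared centroid list is mutated in place between iterations.
mutable_centroids = [0.0, 5.0]
assign_mutable = lazy_assign(points, mutable_centroids)
first_run_mutable = assign_mutable()
mutable_centroids[0], mutable_centroids[1] = 1.5, 9.5  # in-place update
replay_mutable = assign_mutable()  # the "replay" now sees the new centroids

# Case 2: each iteration captures its own immutable snapshot (a tuple).
snapshot = (0.0, 5.0)
assign_snapshot = lazy_assign(points, snapshot)
first_run_snapshot = assign_snapshot()
snapshot = (1.5, 9.5)  # rebinding the name; the captured tuple is untouched
replay_snapshot = assign_snapshot()  # the "replay" reproduces the first result

print(first_run_mutable, replay_mutable)
print(first_run_snapshot, replay_snapshot)
```

In the first case the replay disagrees with the original run, which is the kind of inconsistency I am worried about. The second case corresponds to publishing a fresh, read-only set of centroids for every iteration instead of mutating one shared object.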
Any help appreciated. Just trying to wrap my head around using Spark correctly. I will say it does seem like there is a common misconception that Spark RDDs are in-memory arrays, but perhaps this is for a reason. Perhaps in some cases an option for mutability, with a failure simply raising an exception, is exactly what is needed for a one-off algorithm that doesn't necessarily need resiliency. Just a thought.