
Posted to user@spark.apache.org by nimbus <ni...@radius.com> on 2014/07/10 02:25:03 UTC

Pyspark, references to different rdds being overwritten to point to the same rdd, different results when using .cache()

I discovered this in an IPython notebook and haven't yet checked whether it
happens elsewhere.

Here's a simple example:


This produces the output:


Which is not what I wanted.

Alarmingly, if I call .cache() on these RDDs, the result changes and I get
what I wanted.



which produces:


It's very unexpected for .cache() to change the results here. There is also
further odd behavior in more complex cases that .cache() does not fix, but
I don't yet have a simple example of it.
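
One possible cause, stated as an assumption rather than a diagnosis: if the
RDDs are built in a loop using lambdas that refer to the loop variable,
Python looks that variable up only when the function finally runs (or, in
PySpark, when the function is serialized), so every RDD can end up seeing
the last value. Calling .cache() may force that serialization earlier, which
would be consistent with the changed result. The sketch below is plain
Python with no Spark involved; the names lazy, bound, late and early are
hypothetical and only illustrate the late-binding behavior.

```python
# Hypothetical stand-in for lazily evaluated transformations; no Spark needed.
# Each lambda reads `i` when it is called, not when it is defined.
lazy = {}
for i in range(3):
    lazy[i] = lambda x: x + i        # late binding: all three share one `i`

# By the time these run, the loop has finished and `i` holds its last value.
late = [lazy[k](10) for k in range(3)]

# Binding the current value at definition time gives each function its own copy.
bound = {}
for i in range(3):
    bound[i] = lambda x, i=i: x + i  # default argument freezes the value of `i`

early = [bound[k](10) for k in range(3)]

print(late)   # every entry uses the final value of i
print(early)  # each entry uses the value of i from its own iteration
```

If late binding is the cause, freezing the value with a default argument
(i=i) should give the intended result whether or not .cache() is called.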



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-references-to-different-rdds-being-overwritten-to-point-to-the-same-rdd-different-results-wh-tp9248.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.