Posted to user@spark.apache.org by Harry Brundage <ha...@shopify.com> on 2014/07/21 17:23:27 UTC

Is deferred execution of multiple RDDs ever coming?

Hello fellow Sparkians.

In https://groups.google.com/d/msg/spark-users/eXV7Bp3phsc/gVgm-MdeEkwJ,
Matei suggested that Spark might eventually let you defer multiple jobs and
then force their execution together as one efficient job. His code sample:

rdd.reduceLater(reduceFunction1) // returns Future[ResultType1]
rdd.reduceLater(reduceFunction2) // returns Future[ResultType2]
SparkContext.force() // executes all the "later" operations as part of a
single optimized job

This would be immensely useful. Whenever you want to make two passes over
the same data and save two different results to disk, you currently have to
either cache the RDD (which can be slow, or deprive the processing code of
memory) or recompute the whole lineage twice. If Spark were smart enough to
let you group these operations together and "fork" an RDD (say, via an
RDD.partition-style method), you could easily implement these n-pass
operations across RDDs and have Spark execute them efficiently.

Our use case for a feature like this is processing many records, attaching
metadata about our confidence in each data point as we process it, and then
writing the data to one location and the metadata to another.
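
To make that concrete, here is a minimal sketch of the two workarounds we
have today for that pattern. The paths, record type, and enrichment logic
are made up purely for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object TwoPassExample {
  // Hypothetical record type: a value plus confidence metadata.
  case class Enriched(data: String, metadata: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("two-pass"))
    val raw = sc.textFile("hdfs:///input/records")

    // Stand-in for the expensive enrichment step.
    val enriched = raw.map(line => Enriched(line, "confidence=high"))

    // Workaround 1: cache so both actions share one computation of the
    // lineage, at the cost of executor memory (or disk spill).
    enriched.persist(StorageLevel.MEMORY_AND_DISK)
    enriched.map(_.data).saveAsTextFile("hdfs:///output/data")
    enriched.map(_.metadata).saveAsTextFile("hdfs:///output/metadata")
    enriched.unpersist()

    // Workaround 2: skip caching and let Spark recompute the whole lineage
    // once per action -- twice the work, but no extra memory:
    //   raw.map(...).map(_.data).saveAsTextFile(...)
    //   raw.map(...).map(_.metadata).saveAsTextFile(...)

    sc.stop()
  }
}

The "force" API above would let both saves run as a single job with
neither the cache nor the recomputation.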

I've also wanted this for taking a dataset, profiling it for partition
sizes or anomalously large partitions, and then using the profiling result
to repartition the data before saving it to disk, which I think is
impossible to do without caching right now. This use case is more
interesting because information from earlier in the DAG needs to influence
later stages, so I suspect the answer will be "cache the thing". I
explicitly don't want to cache, because I'm not doing an "iterative"
algorithm where I'm willing to pay the heap and time penalties; I'm just
doing an operation that needs run-time information without a collect call.
This suggests that something like a repartition driven by a lazily
evaluated accumulator might work as well, but I haven't been able to figure
out a solution even with that primitive and the current APIs.
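
For reference, this is roughly what I end up writing today. The processing
step, paths, and the partition-count heuristic are invented placeholders;
the point is the cache plus the extra action needed just to learn the
partition sizes:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ProfileThenRepartition {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("profile-then-repartition"))

    // Stand-in for the real processing pipeline.
    val processed = sc.textFile("hdfs:///input/records").map(_.toUpperCase)

    // Have to cache, because profiling is an extra action over the same data.
    processed.persist(StorageLevel.MEMORY_AND_DISK)

    // Pass 1: profile record counts per partition to spot anomalies.
    val sizes = processed
      .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
      .collect()

    // Use the run-time information to pick a partition count (made-up heuristic).
    val targetPartitions = math.max(1, sizes.map(_._2).sum / 1000000)

    // Pass 2: repartition and write out.
    processed.repartition(targetPartitions).saveAsTextFile("hdfs:///output/records")

    processed.unpersist()
    sc.stop()
  }
}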

So, does anyone know if this feature might land, and if not, where to start
implementing it? What would the Python story for Futures be?