You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by innowireless TaeYun Kim <ta...@innowireless.co.kr> on 2014/06/11 05:10:38 UTC

Suggestion: rdd.compute()

Hi,

Regarding the following scenario, Would it be nice to have an action method
named like 'compute()' that does nothing but computing/materializing the
whole partitions of an RDD?
It can also be useful for the profiling.


-----Original Message-----
From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr] 
Sent: Wednesday, June 11, 2014 11:40 AM
To: user@spark.apache.org
Subject: Question about RDD cache, unpersist, materialization

Hi,

What I (seems to) know about RDD persisting API is as follows:
- cache() and persist() is not an action. It only does a marking.
- unpersist() is also not an action. It only removes a marking. But if the
rdd is already in memory, it is unloaded.

And there seems no API to forcefully materialize the RDD without requiring a
data by an action method, for example first().

So, I am faced with the following scenario.

{
    JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>());  // create
empty for merging
    for (int i = 0; i < 10; i++)
    {
        JavaRDD<T2> rdd = sc.textFile(inputFileNames[i]);
        rdd.cache();  // Since it will be used twice, cache.
        rdd.map(...).filter(...).saveAsTextFile(outputFileNames[i]);  //
Transform and save, rdd materializes
        rddUnion = rddUnion.union(rdd.map(...).filter(...));  // Do another
transform to T and merge by union
        rdd.unpersist();  // Now it seems not needed. (But needed actually)
    }
    // Here, rddUnion actually materializes, and needs all 10 rdds that
already unpersisted.
    // So, rebuilding all 10 rdds will occur.
    rddUnion.saveAsTextFile(mergedFileName);
}

If rddUnion can be materialized before the rdd.unpersist() line and
cache()d, the rdds in the loop will not be needed on
rddUnion.saveAsTextFile().

Now what is the best strategy?
- Do not unpersist all 10 rdds in the loop.
- Materialize rddUnion in the loop by calling 'light' action API, like
first().
- Give up and just rebuild/reload all 10 rdds when saving rddUnion.

Is there some misunderstanding?

Thanks.



Re: Suggestion: rdd.compute()

Posted by Ankur Dave <an...@gmail.com>.
You can achieve an equivalent effect by calling rdd.foreach(x => {}), which
is the lightest possible action that forces materialization of the whole
RDD.

Ankur <http://www.ankurdave.com/>