Posted to dev@spark.apache.org by innowireless TaeYun Kim <ta...@innowireless.co.kr> on 2014/05/30 04:06:15 UTC

Suggestion or question: Adding rdd.cancelCache() method

As I understand it, rdd.cache() really means
rdd.cache_this_rdd_when_it_actually_materializes().
Because of this, a somewhat esoteric problem can occur.

The example is as follows: 

void method1()
{
    JavaRDD<...> rdd =
        sc.textFile(...)
        .map(...);

    rdd.cache();
        // since the following methods can call the action methods
        // multiple times, cache the rdd to prevent rebuilding.

    method2(rdd);  // may or may not call the action methods on rdd
    method3(rdd);  // may or may not call the action methods on rdd

    // #HERE#, the action methods could have been called or not.

    rdd.saveAsTextFile(...);
        // if none of the above methods called an action method,
        // rdd will materialize here and be cached.
    // but we don't need the cache anymore, so the caching was unnecessary.
    rdd.unpersist();
}

If there were an rdd.cancelCache() method and we could call it at #HERE#,
the unnecessary caching could be avoided.
What cancelCache() would do is cancel the pending request for caching, if
the caching has not been done yet.
It is different from unpersist(), since unpersist() undoes caching that
has actually been done.
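
To make the proposed semantics concrete, here is a minimal sketch in plain
Java. It is a toy model, not Spark: ToyRdd, isMaterialized() and computeCount
are hypothetical names invented for illustration, and cancelCache() is the
proposed method, which does not exist in Spark. Following the description
above, unpersist() in this toy only discards data that was actually stored;
how Spark's real unpersist() treats a pending request is not modeled here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Toy model of lazy caching; hypothetical, NOT Spark's API. */
public class LazyCacheSketch {

    static final class ToyRdd<T> {
        private final Supplier<List<T>> compute; // stands in for the lineage
        private boolean cacheRequested = false;  // set by cache(), nothing stored yet
        private List<T> cached = null;           // filled only when an action runs
        int computeCount = 0;                    // how often the data was rebuilt

        ToyRdd(Supplier<List<T>> compute) {
            this.compute = compute;
        }

        /** Only records the request; nothing is computed or stored here. */
        ToyRdd<T> cache() {
            cacheRequested = true;
            return this;
        }

        /** Proposed method: withdraw the request if nothing is stored yet. */
        ToyRdd<T> cancelCache() {
            if (cached == null) {
                cacheRequested = false;
            }
            return this;
        }

        /** Toy assumption: only discards data that was actually stored. */
        ToyRdd<T> unpersist() {
            cached = null;
            return this;
        }

        boolean isMaterialized() {
            return cached != null;
        }

        /** An "action": computes the data, and stores it if caching was requested. */
        List<T> collect() {
            if (cached != null) {
                return cached;
            }
            computeCount++;
            List<T> result = compute.get();
            if (cacheRequested) {
                cached = result;
            }
            return result;
        }
    }

    public static void main(String[] args) {
        // cache() followed by an action: the data is stored on first use.
        ToyRdd<Integer> a = new ToyRdd<>(() -> Arrays.asList(1, 2, 3));
        a.cache();
        System.out.println(a.isMaterialized()); // false
        a.collect();
        System.out.println(a.isMaterialized()); // true

        // cache(), then cancelCache() at #HERE#: the later action stores nothing.
        ToyRdd<Integer> b = new ToyRdd<>(() -> Arrays.asList(1, 2, 3));
        b.cache();
        b.cancelCache();
        b.collect();
        System.out.println(b.isMaterialized()); // false
    }
}
```

In this toy, calling cancelCache() after the data has already been stored is
a no-op, which matches the "if caching is not done yet" condition above.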

Is rdd.cancelCache() really needed, or am I misunderstanding the caching
mechanism?