Posted to user@flink.apache.org by Frank Grimes <fr...@yahoo.com> on 2019/02/21 18:41:21 UTC

Is there a Flink DataSet equivalent to Spark's RDD.persist?

Hi,
I'm trying to port an existing Spark job to Flink and have gotten stuck on the same issue brought up here:
https://stackoverflow.com/questions/46243181/cache-and-persist-datasets
Is there some way to accomplish the same thing in Flink? i.e. avoid re-computing a particular DataSet when multiple different subsequent transformations are required on it.
I've even tried explicitly writing out the DataSet to avoid the re-computation, but I still take an I/O hit for the initial write to HDFS and the subsequent re-reading of it in the following stages. While this does yield a performance improvement over no caching at all, it doesn't match the performance I get with RDD.persist in Spark.
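For context, the workaround I described looks roughly like this (a minimal sketch against the DataSet API; the HDFS paths and the transformations are placeholders, not my actual job):

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.core.fs.FileSystem;

public class MaterializeWorkaround {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Expensive intermediate result we want to reuse across executions.
        DataSet<String> expensive = env.readTextFile("hdfs:///data/input")
                .map(String::toUpperCase);  // stand-in for the costly work

        // First execution: materialize the intermediate result to HDFS.
        expensive.writeAsText("hdfs:///tmp/expensive-cache",
                FileSystem.WriteMode.OVERWRITE);
        env.execute("materialize intermediate result");

        // Later executions re-read the materialized copy instead of
        // recomputing the whole upstream pipeline.
        DataSet<String> cached = env.readTextFile("hdfs:///tmp/expensive-cache");
        cached.filter(s -> s.startsWith("A"))
              .writeAsText("hdfs:///out/a-lines", FileSystem.WriteMode.OVERWRITE);
        cached.map(s -> s.length() + ":" + s)
              .writeAsText("hdfs:///out/lengths", FileSystem.WriteMode.OVERWRITE);
        env.execute("consume cached result");
    }
}
```

Note that within a single env.execute() the plan does branch and compute a shared DataSet once; the re-computation I'm trying to avoid happens when separate executions (e.g. triggered by count()/collect()) each re-run the upstream DAG, which is why the intermediate write/read spans two execute() calls here.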
Thanks,
Frank Grimes

Re: Is there a Flink DataSet equivalent to Spark's RDD.persist?

Posted by Andrey Zagrebin <an...@ververica.com>.
Hi Frank,

This feature is currently under discussion. You can follow it in this issue:
https://issues.apache.org/jira/browse/FLINK-11199

Best,
Andrey
