You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/09/01 15:54:45 UTC

[jira] [Commented] (FLINK-1730) Add a FlinkTools.persist style method to the Data Set.

    [ https://issues.apache.org/jira/browse/FLINK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725427#comment-14725427 ] 

ASF GitHub Bot commented on FLINK-1730:
---------------------------------------

GitHub user sachingoel0101 opened a pull request:

    https://github.com/apache/flink/pull/1083

    [FLINK-1730]Persist operator on Data Sets

    This PR introduces a `persist` operation on `DataSet` which allows persisting the data set in memory, allowing for direct access if this data set is operated on again and again.
    The idea behind the implementation is this:
    1. A `PersistOperator` extending a `SingleInputUdfOperator` for common api and Java API.
    2. A `Persist` driver strategy which allows the Job Graph to generate a `PersistNode`, which just uses a `NoOpDriver` to forward results from input to output.
    3. `RegularPactTask` determines whether it is a Persist task and accordingly uses a `SpillingResettableMutableObjectIterator` to read the input and persist them.
    4. To make the results truly persistent, the `MemorySegment`s must not be freed when the `Task` ends. To this end, I have created a `DummyPersistInvokable` which does nothing. It just prevents freeing of memory.
    5. All persisted memory segments are cleared out when the `MemoryManager` is shutting down. There is a possibility of writing some kind of Cache clearing strategy here.
    
    For testing the functionality, I have written a test `PersistITCase` which generates 100 random Long values inside a Map function and persisted the output. Then, triggering the execution twice must provide the same results.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachingoel0101/flink flink-1730

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1083.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1083
    
----
commit a22cc670697cc601facb164f6fd84ef6438c2499
Author: Sachin Goel <sa...@gmail.com>
Date:   2015-08-24T16:07:04Z

    Implemented a persist operator which caches elements into a Spilling
    buffer.

----


> Add a FlinkTools.persist style method to the Data Set.
> ------------------------------------------------------
>
>                 Key: FLINK-1730
>                 URL: https://issues.apache.org/jira/browse/FLINK-1730
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: Stephan Ewen
>            Priority: Minor
>
> I think this is an operation that will be needed more prominently. Defining a point where one long logical program is broken into different executions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)