Posted to dev@tinkerpop.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/10/30 18:58:27 UTC

[jira] [Commented] (TINKERPOP3-925) Use persisted SparkContext to persist an RDD across Spark jobs.

    [ https://issues.apache.org/jira/browse/TINKERPOP3-925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982976#comment-14982976 ] 

ASF GitHub Bot commented on TINKERPOP3-925:
-------------------------------------------

GitHub user okram opened a pull request:

    https://github.com/apache/incubator-tinkerpop/pull/129

    TINKERPOP3-925: Use persisted SparkContext to persist an RDD across Spark jobs.

    https://issues.apache.org/jira/browse/TINKERPOP3-925
    
    This is implemented and it's badass. There are now `PersistedOutputRDD` and `PersistedInputRDD`, where the `outputLocation` and `inputLocation` serve as the names of the RDDs in the `SparkContext`. Tada! Now, if you have a chained GraphComputer job, you don't need to write the RDD to disk (e.g. HDFS); you can have `SparkContext` persist it across jobs. This work naturally extends the constructs we already have; thanks to @RussellSpitzer for implementing persistent Spark contexts for us. I updated various docs.
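
    A minimal sketch of how such a chained-job wiring might look (the property keys and class names below are assumptions inferred from this description, not verbatim from the pull request):

        import org.apache.commons.configuration.BaseConfiguration;
        import org.apache.commons.configuration.Configuration;

        // Sketch of a job-chain configuration; the keys are illustrative assumptions.
        final Configuration configuration = new BaseConfiguration();
        // read the graph from a named RDD that a previous job persisted in the SparkContext
        configuration.setProperty("gremlin.spark.graphInputRDD",
                "org.apache.tinkerpop.gremlin.spark.structure.io.PersistedInputRDD");
        configuration.setProperty("gremlin.hadoop.inputLocation", "myGraphRDD");
        // persist the result under a new name instead of writing it to HDFS
        configuration.setProperty("gremlin.spark.graphOutputRDD",
                "org.apache.tinkerpop.gremlin.spark.structure.io.PersistedOutputRDD");
        configuration.setProperty("gremlin.hadoop.outputLocation", "myOutputGraphRDD");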
    
    * NOTE: I also renamed GraphComputer.config() to .configure() in this push. This relates to another ticket that was recently closed; we decided that configure() is a better name given the naming convention of the other GraphComputer methods.
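
    To illustrate the rename, a hedged sketch of the fluent usage it enables (the property key is illustrative only, `graph` is assumed to be an existing HadoopGraph, and exception handling is elided):

        import org.apache.tinkerpop.gremlin.process.computer.ComputerResult;
        import org.apache.tinkerpop.gremlin.process.computer.ranking.pagerank.PageRankVertexProgram;
        import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;

        // configure() returns the GraphComputer, so per-job settings chain fluently
        final ComputerResult result = graph.compute(SparkGraphComputer.class)
                .configure("gremlin.spark.persistContext", true)  // illustrative key
                .program(PageRankVertexProgram.build().create(graph))
                .submit().get();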
    
    I ran `mvn clean install`, full integration tests, and built and published the docs.
      http://tinkerpop.incubator.apache.org/docs/3.1.0-SNAPSHOT/#sparkgraphcomputer
    
    VOTE +1.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP3-925

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-tinkerpop/pull/129.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #129
    
----
commit 82bbc59e676a49ccafe54567735b320df84d60f7
Author: Marko A. Rodriguez <ok...@gmail.com>
Date:   2015-10-27T21:03:55Z

    Added SparkHelper to grab RDDs from SparkContext.getPersistedRDDs(). A simple test case proves it works. A more involved test case using BulkLoaderVertexProgram is needed.
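
    A sketch of what such a helper might look like (the helper name is hypothetical; note that Spark's actual API is SparkContext.getPersistentRDDs(), which keys RDDs by numeric id, so a name-based lookup has to filter on rdd.name()):

        import org.apache.spark.SparkContext;
        import org.apache.spark.rdd.RDD;
        import scala.collection.JavaConversions;

        // Hypothetical name-based lookup over the RDDs persisted in a SparkContext.
        public static RDD<?> getPersistedRDD(final SparkContext sparkContext, final String name) {
            for (final RDD<?> rdd : JavaConversions.asJavaIterable(sparkContext.getPersistentRDDs().values())) {
                if (name.equals(rdd.name()))
                    return rdd;
            }
            return null;  // no persisted RDD under that name
        }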

commit 3d2d6a69086166ebd34ee10ade656c0e61a1ac0c
Author: Marko A. Rodriguez <ok...@gmail.com>
Date:   2015-10-27T22:02:42Z

    Added a test case that verifies a PageRankVertexProgram-to-BulkLoaderVertexProgram chain loads into Spark without touching HDFS. Need to do the GraphComputer.config() ticket to make this all pretty.
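
    As a hedged illustration, the second job in such a chain might look like this (the target configuration file is hypothetical, and the graph is assumed to read its input via PersistedInputRDD as sketched above; exception handling elided):

        import org.apache.tinkerpop.gremlin.process.computer.bulkloading.BulkLoaderVertexProgram;
        import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;

        // Job 2 of the chain: bulk-load the graph that job 1 persisted
        // in the shared SparkContext.
        graph.compute(SparkGraphComputer.class)
                .program(BulkLoaderVertexProgram.build()
                        .writeGraph("conf/tinkergraph.properties")  // hypothetical target graph config
                        .create(graph))
                .submit().get();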

commit 16b50052eb27ebb365341dc3b6e90608b07a71e6
Author: Marko A. Rodriguez <ok...@gmail.com>
Date:   2015-10-29T23:29:02Z

    Merged master.

commit 528ba027a098bc722211767b11b7dc010fb2cba1
Author: Marko A. Rodriguez <ok...@gmail.com>
Date:   2015-10-30T17:48:42Z

    This is a masterpiece here. PersistedXXXRDD is now a Spark RDD class where the inputLocation (outputLocation) is the name of the RDD. No HDFS is used between jobs, as the graphRDD is stored on the Spark server using a persisted context. Added test cases, and renamed GraphComputer.config() to configure() to be consistent with the naming conventions of GraphComputer methods. Also made it a default method, as most implementations won't need it and there is no point in requiring a boilerplate 'return this'. Updated docs accordingly.

----


> Use persisted SparkContext to persist an RDD across Spark jobs.
> ---------------------------------------------------------------
>
>                 Key: TINKERPOP3-925
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP3-925
>             Project: TinkerPop 3
>          Issue Type: Improvement
>          Components: hadoop
>    Affects Versions: 3.0.2-incubating
>            Reporter: Marko A. Rodriguez
>            Assignee: Marko A. Rodriguez
>             Fix For: 3.1.0-incubating
>
>
> If a provider is using Spark, they are currently forced to use HDFS to store intermediate RDD data. However, if they plan on using that data in a {{GraphComputer}} "job chain," then they should be able to look up a {{.cache()}}'d RDD by name.
> Create {{inputGraphRDD.name}} and {{outputGraphRDD.name}} properties so that the configuration can reference {{SparkContext.getPersistentRDDs()}}.


