Posted to dev@tinkerpop.apache.org by "Marko A. Rodriguez (JIRA)" <ji...@apache.org> on 2016/01/29 21:34:39 UTC

[jira] [Commented] (TINKERPOP-1108) Produce two RDDs from executeVertexProgram in SparkGraphComputer

    [ https://issues.apache.org/jira/browse/TINKERPOP-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124157#comment-15124157 ] 

Marko A. Rodriguez commented on TINKERPOP-1108:
-----------------------------------------------

The scary thing about this is that we have Spark accumulators emitted in the {{viewOutgoingMessagesRDD}}, and thus generating two RDDs may be a problem, as we might duplicate the accumulator data. However, we may just want to put the accumulator data into the {{viewRDD}} and then, on the {{join()}}, broadcast the variables! ...needs some thinking.
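
A minimal sketch of the broadcast idea, assuming the accumulator results can be pulled to the driver between iterations (Spark 2.x Java API; the {{memory}} map and the RDD contents below are made-up stand-ins, not the real {{SparkGraphComputer}} structures):

{code}
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import scala.Tuple2;

public class BroadcastMemorySketch {
    public static void main(final String[] args) {
        final JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setMaster("local[2]").setAppName("broadcast-memory-sketch"));

        // stand-ins for graphRDD and viewRDD, keyed by vertex id
        final JavaPairRDD<Long, String> graphRDD = sc.parallelizePairs(
                Arrays.asList(new Tuple2<>(1L, "v1"), new Tuple2<>(2L, "v2")));
        final JavaPairRDD<Long, String> viewRDD = sc.parallelizePairs(
                Arrays.asList(new Tuple2<>(1L, "view1"), new Tuple2<>(2L, "view2")));

        // driver-side "memory": what the accumulators reduced to this iteration
        final Map<String, Object> memory = new HashMap<>();
        memory.put("iteration", 3);

        // broadcast the memory once so executors can read it on the join()
        // instead of the accumulator data being duplicated inside RDD payloads
        final Broadcast<Map<String, Object>> memoryBroadcast = sc.broadcast(memory);

        graphRDD.join(viewRDD)
                .mapValues(pair -> pair._1() + "+" + pair._2()
                        + "@iteration=" + memoryBroadcast.getValue().get("iteration"))
                .collect()
                .forEach(System.out::println);

        sc.stop();
    }
}
{code}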

> Produce two RDDs from executeVertexProgram in SparkGraphComputer
> ----------------------------------------------------------------
>
>                 Key: TINKERPOP-1108
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1108
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: hadoop
>    Affects Versions: 3.1.1-incubating
>            Reporter: Marko A. Rodriguez
>
> I have done a lot to optimize our implementation of {{SparkGraphComputer}}. I now know the reason for every shuffle, input, spill, etc. that happens during a job. There is one more optimization that MAY or MAY NOT work, but it is worth trying: if it does what I think it will do, we may get a (perhaps) 2x improvement.
> We currently do:
> {code}
> graphRDD -> viewOutgoingMessagesRDD
> {code}
> We should do:
> {code}
> graphRDD -->
>    viewRDD
>    outgoingMessageRDD
> {code}
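> A minimal sketch of that split, assuming a toy stand-in for the vertex-program output (Spark 2.x Java API; all names and payloads below are hypothetical, not the real {{SparkGraphComputer}} structures):
> {code}
> import java.util.Arrays;
> import java.util.List;
>
> import org.apache.spark.HashPartitioner;
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaPairRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import scala.Tuple2;
>
> public class TwoRddSketch {
>     public static void main(final String[] args) {
>         final JavaSparkContext sc = new JavaSparkContext(
>                 new SparkConf().setMaster("local[2]").setAppName("two-rdd-sketch"));
>
>         // stand-in for graphRDD: (vertexId, vertexPayload), explicitly partitioned
>         final JavaPairRDD<Long, String> graphRDD = sc.parallelizePairs(
>                 Arrays.asList(new Tuple2<>(1L, "v1"), new Tuple2<>(2L, "v2")))
>                 .partitionBy(new HashPartitioner(2));
>
>         // stand-in for executeVertexProgram: each vertex yields a view plus
>         // messages addressed to other vertices: (id, (view, [(receiverId, msg)]))
>         final JavaPairRDD<Long, Tuple2<String, List<Tuple2<Long, String>>>> computedRDD =
>                 graphRDD.mapValues(v -> new Tuple2<String, List<Tuple2<Long, String>>>(
>                         v + "-view",
>                         Arrays.asList(new Tuple2<Long, String>(1L, "msg-from-" + v))));
>         computedRDD.cache(); // evaluate the vertex program once, feed both children
>
>         // viewRDD keeps graphRDD's partitioner: mapValues preserves partitioning
>         final JavaPairRDD<Long, String> viewRDD = computedRDD.mapValues(t -> t._1());
>
>         // outgoingMessageRDD is re-keyed by receiver, so the partitioner is lost
>         final JavaPairRDD<Long, String> outgoingMessageRDD =
>                 computedRDD.flatMapToPair(kv -> kv._2()._2().iterator());
>
>         System.out.println(viewRDD.rdd().partitioner());            // Some(HashPartitioner)
>         System.out.println(outgoingMessageRDD.rdd().partitioner()); // None
>         sc.stop();
>     }
> }
> {code}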
> The {{viewRDD}} will have the same partitioner as the {{graphRDD}}, and thus a local join is all that is required. The {{outgoingMessageRDD}} will not be partitioned, so its join will cause a shuffle. Thus, after this block, we do:
> {code}
> graphRDD.join(viewRDD).mapValues(...attach the view...).join(outgoingMessageRDD)
> {code}
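> Continuing the sketch above: the first join is partitioner-aligned and therefore shuffle-free, while the second shuffles only the unpartitioned message side ({{...attach the view...}} is stubbed with string concatenation):
> {code}
> // first join: graphRDD and viewRDD share a partitioner, so this is a
> // narrow (shuffle-free, partition-local) join; mapValues keeps the partitioner
> final JavaPairRDD<Long, String> viewedGraphRDD =
>         graphRDD.join(viewRDD)
>                 .mapValues(pair -> pair._1() + "|" + pair._2()); // ...attach the view...
>
> // second join: outgoingMessageRDD has no partitioner, so only the message
> // side is shuffled over to graphRDD's partitions
> final JavaPairRDD<Long, Tuple2<String, String>> nextGraphRDD =
>         viewedGraphRDD.join(outgoingMessageRDD);
> {code}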



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)