Posted to dev@tinkerpop.apache.org by "Marko A. Rodriguez (JIRA)" <ji...@apache.org> on 2016/01/29 18:53:40 UTC
[jira] [Created] (TINKERPOP-1108) Produce two RDDs from executeVertexProgram in SparkGraphComputer
Marko A. Rodriguez created TINKERPOP-1108:
---------------------------------------------
Summary: Produce two RDDs from executeVertexProgram in SparkGraphComputer
Key: TINKERPOP-1108
URL: https://issues.apache.org/jira/browse/TINKERPOP-1108
Project: TinkerPop
Issue Type: Improvement
Components: hadoop
Affects Versions: 3.1.1-incubating
Reporter: Marko A. Rodriguez
I have done a lot to optimize our implementation of {{SparkGraphComputer}}. I now know the reason for every shuffle, input, spill, etc. that happens during a job. There is one more optimization that MAY or MAY NOT work, but it is worth trying: if it does what I think it will do, we may get a (perhaps) 2x improvement.
We currently do:
{code}
graphRDD -> viewOutgoingMessagesRDD
{code}
We should do:
{code}
graphRDD -->
    viewRDD
    outgoingMessageRDD
{code}
The {{viewRDD}} will have the same partitioner as the {{graphRDD}} and thus a local join is all that is required. The {{outgoingMessageRDD}} will not be partitioned, so its join will cause a shuffle. Thus, after this block, we do:
{code}
graphRDD.join(viewRDD).mapValues(...attach the view...).join(outgoingMessageRDD)
{code}
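To make the reasoning concrete, here is a minimal plain-Java model of why the two outputs behave differently at join time. It is only a sketch and uses no Spark classes: {{TwoRddSketch}}, {{Record}}, {{partitionOf}} and the modulo partitioner are hypothetical stand-ins, and the 4-vertex ring graph is made up for illustration. A view record is keyed by the vertex that produced it, so it already sits in the partition where that vertex lives; a message is keyed by its destination vertex but is emitted in the partition of its source.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model only; no Spark classes are used and all names are hypothetical.
class TwoRddSketch {
    static final int NUM_PARTITIONS = 2;

    // Stand-in for the partitioner of graphRDD: vertex id -> partition index.
    static int partitionOf(int vertexId) {
        return Math.floorMod(vertexId, NUM_PARTITIONS);
    }

    // A keyed record together with the partition that currently holds it.
    static final class Record {
        final int key;
        final int locatedIn;

        Record(int key, int locatedIn) {
            this.key = key;
            this.locatedIn = locatedIn;
        }
    }

    // View records are keyed by the vertex that produced them,
    // so they stay in that vertex's own partition.
    static List<Record> views(int[][] edges) {
        List<Record> out = new ArrayList<>();
        for (int[] edge : edges) {
            out.add(new Record(edge[0], partitionOf(edge[0])));
        }
        return out;
    }

    // Messages are keyed by the destination vertex,
    // but they are emitted in the partition of the source vertex.
    static List<Record> messages(int[][] edges) {
        List<Record> out = new ArrayList<>();
        for (int[] edge : edges) {
            out.add(new Record(edge[1], partitionOf(edge[0])));
        }
        return out;
    }

    // Counts records held by a partition other than the one their key maps to.
    // Only these records would have to move before a join keyed like graphRDD.
    static int recordsToShuffle(List<Record> records) {
        int moved = 0;
        for (Record r : records) {
            if (partitionOf(r.key) != r.locatedIn) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        // Toy graph as directed edges {source, target}: a ring over 4 vertices.
        int[][] edges = {{0, 1}, {1, 2}, {2, 3}, {3, 0}};
        System.out.println("view records to move: " + recordsToShuffle(views(edges)));
        System.out.println("message records to move: " + recordsToShuffle(messages(edges)));
    }
}
```

On this toy ring the model reports 0 view records to move and 4 message records to move, which mirrors the expectation above: the view join can stay local, while the message join pays for a shuffle. Whether Spark actually realizes this saving in {{SparkGraphComputer}} is exactly what the ticket proposes to test.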
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)