Posted to user@spark.apache.org by wu zeming <ze...@gmail.com> on 2014/04/22 17:20:00 UTC

Some questions about using GraphX

Hi all,

I am using GraphX in spark-0.9.0-incubating. Our graph can have 100 million vertices and 1 billion edges, so I have to use my limited memory carefully. I have some questions about the GraphX module.

1. Why do some transformations like partitionBy and mapVertices cache the new graph, while others like outerJoinVertices do not?

2. I use the Pregel API and only read edgeTriplet.srcAttr in sendMsg. After that I take the resulting graph and call graph.mapReduceTriplets, reading both edgeTriplet.srcAttr and edgeTriplet.dstAttr in sendMsg. I found that, with the implementation of ReplicatedVertexView, Spark recomputes parts of the graph that should already have been computed. Can anyone explain the implementation here?
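Roughly, the access pattern I mean is the following (a simplified
sketch with made-up logic, not my real job):

    import org.apache.spark.graphx._

    def twoPhase(graph: Graph[Int, Int]): VertexRDD[Int] = {
      // Phase 1: sendMsg only reads srcAttr, so only source attributes
      // need to be replicated to the edge partitions.
      val result = graph.pregel(0, maxIterations = 5)(
        vprog = (id, attr, msg) => math.max(attr, msg),
        sendMsg = triplet => Iterator((triplet.dstId, triplet.srcAttr)),
        mergeMsg = (a, b) => math.max(a, b))

      // Phase 2: sendMsg reads both srcAttr and dstAttr; this is where
      // I see work that was already done being recomputed.
      result.mapReduceTriplets[Int](
        triplet => Iterator((triplet.dstId, triplet.srcAttr + triplet.dstAttr)),
        (a, b) => a + b)
    }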

3. Why does VertexPartition not extend Serializable? It is used by an RDD.

4. Can you provide a "spark.default.cache.useDisk" option so that cache() uses DISK_ONLY by default?



- Wu Zeming





Re: Some questions about using GraphX

Posted by Wu Zeming <ze...@gmail.com>.
Hi Ankur, 
Thanks for your reply! These answers are very useful to me. I hope
these issues can be fixed in spark-0.9.2 or spark-1.0.

One more thing I forgot: I think the persist API on Graph, VertexRDD
and EdgeRDD should not be public for now, because it leads to an
UnsupportedOperationException when I try to change the StorageLevel.
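For example, this is roughly what I hit (my own reduction of it, not
the exact job):

    import org.apache.spark.storage.StorageLevel

    // The graph's RDDs are already cached as MEMORY_ONLY by the time I
    // get the graph back, so asking for a different level fails inside
    // RDD.persist:
    graph.persist(StorageLevel.DISK_ONLY)
    // => java.lang.UnsupportedOperationException: Cannot change storage
    //    level of an RDD after it was already assigned a level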



-----
Wu Zeming

Re: Some questions about using GraphX

Posted by Ankur Dave <an...@gmail.com>.
These are excellent questions. Answers below:

On Tue, Apr 22, 2014 at 8:20 AM, wu zeming <ze...@gmail.com> wrote:

> 1. Why do some transformations like partitionBy and mapVertices cache
> the new graph, while others like outerJoinVertices do not?


In general, we cache RDDs that are used more than once to avoid
recomputation. In mapVertices, the newVerts RDD is used to construct both
changedVerts and the new graph, so we have to cache it. Good catch that
outerJoinVertices doesn't follow this convention! To be safe, we should be
caching newVerts there as well. Also, partitionBy doesn't need to cache
newEdges, so we should remove the call to cache().
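
The rule is easiest to see with plain RDDs (an illustration of the
convention only, not the GraphX source itself):

    import org.apache.spark.rdd.RDD

    // An RDD that feeds two downstream computations should be cached;
    // otherwise the second use recomputes its entire lineage.
    def twoUses(input: RDD[Int]): (Long, Long) = {
      val derived = input.map(x => x * x).cache() // used twice below
      val big = derived.filter(_ > 10).count()    // computes and caches
      val small = derived.filter(_ <= 10).count() // served from the cache
      (big, small)
    }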

> 2. I use the Pregel API and only read edgeTriplet.srcAttr in sendMsg.
> After that I take the resulting graph and call graph.mapReduceTriplets,
> reading both edgeTriplet.srcAttr and edgeTriplet.dstAttr in sendMsg. I
> found that, with the implementation of ReplicatedVertexView, Spark
> recomputes parts of the graph that should already have been computed.
> Can anyone explain the implementation here?


This is a known problem with the current implementation of join
elimination. If you do an operation that only needs one side of the triplet
(say, the source attribute) followed by an operation that uses both, the
second operation will re-replicate the source attributes unnecessarily. I
wrote a PR that solves this problem by only moving the necessary vertex
attributes: https://github.com/amplab/graphx/pull/137 (see
https://github.com/ankurdave/graphx/blob/74349e9c81fa626949172d1905dd7713ab6160ac/graphx/src/main/scala/org/apache/spark/graphx/impl/ReplicatedVertexView.scala#L52).
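
Conceptually, the PR tracks which sides of the triplet have already
been replicated and only ships what is missing (a simplified sketch of
the idea, not the actual PR code):

    // Which vertex attributes are already present on edge partitions.
    case class ReplicationState(hasSrc: Boolean, hasDst: Boolean) {
      def upgrade(needSrc: Boolean, needDst: Boolean): ReplicationState = {
        // Ship only attributes that are needed but not yet replicated.
        val shipSrc = needSrc && !hasSrc
        val shipDst = needDst && !hasDst
        // ... ship the corresponding vertex attribute blocks here ...
        ReplicationState(hasSrc || needSrc, hasDst || needDst)
      }
    }

So a srcAttr-only operation followed by one that needs both sides only
ships the dst attributes the second time instead of re-replicating
everything.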

> 3. Why does VertexPartition not extend Serializable? It is used by an
> RDD.


Though we do create RDD[VertexPartition]s, VertexPartition itself should
never need to be serialized. Instead, we move vertex attributes using
VertexAttributeBlock.

Some other classes in GraphX (GraphImpl, for example) are marked
serializable for a different reason: to make it harmless if closures
accidentally capture them. These classes have all their fields marked
@transient so nothing really gets serialized.
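
For example (a hypothetical class, not the actual GraphImpl):

    // Serializable so that accidental closure capture is harmless, but
    // every field is @transient, so nothing real gets written out.
    class GraphHandle(@transient val vertices: AnyRef,
                      @transient val edges: AnyRef) extends Serializable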

> 4. Can you provide a "spark.default.cache.useDisk" option so that
> cache() uses DISK_ONLY by default?


This isn't officially supported yet, but I have a patch
<https://github.com/ankurdave/graphx/commit/aa78d782e606a6c0c0f5b8253ae4f408531fd015>
that will let you do it in a hacky way by passing the desired storage
level everywhere.
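
The idea is roughly the following (my own simplification with guessed
names, not the actual commit): thread an explicit StorageLevel through
graph construction instead of letting GraphX hard-code MEMORY_ONLY in
its internal cache() calls.

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.storage.StorageLevel

    // Hypothetical loader that accepts the desired storage level.
    def loadEdges(sc: SparkContext, path: String,
                  level: StorageLevel): Graph[Int, Int] = {
      val edges = sc.textFile(path).map { line =>
        val fields = line.split("\\s+")
        Edge(fields(0).toLong, fields(1).toLong, 1)
      }.persist(level) // e.g. StorageLevel.DISK_ONLY
      Graph.fromEdges(edges, defaultValue = 1)
    }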

Ankur <http://www.ankurdave.com/>