Posted to issues@spark.apache.org by "Ankur Dave (JIRA)" <ji...@apache.org> on 2014/08/24 01:32:11 UTC
[jira] [Comment Edited] (SPARK-3190) Creation of large graph (> 2.15 B nodes) seems to be broken: possible overflow somewhere
[ https://issues.apache.org/jira/browse/SPARK-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108193#comment-14108193 ]
Ankur Dave edited comment on SPARK-3190 at 8/23/14 11:31 PM:
-------------------------------------------------------------
I haven't tried to reproduce this yet, but the vertex count is computed in [VertexRDD.scala:110|https://github.com/apache/spark/blob/3519b5e8e55b4530d7f7c0bcab254f863dbfa814/graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala#L110], which sums an Int from each partition and only promotes the total to a Long when returning the result. Changing line 111 to {{partitionsRDD.map(_.size.toLong).reduce(_ + _)}} should therefore fix the problem.
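To illustrate the overflow pattern described above, here is a minimal standalone sketch (not the actual VertexRDD code; the partition sizes are hypothetical) showing why reducing Int sizes before widening goes wrong, and why mapping each size to Long first gives the correct total:

{{{
// Minimal sketch of the suspected bug: summing per-partition Int sizes
// overflows before the final result is widened to Long.
object OverflowSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical per-partition sizes whose true total exceeds Int.MaxValue.
    val partitionSizes: Seq[Int] = Seq(Int.MaxValue, 1, 1)

    // Buggy pattern: reduce over Int, then widen -- the overflow has
    // already happened inside the Int reduction.
    val buggy: Long = partitionSizes.reduce(_ + _).toLong
    println(buggy) // -2147483647 (wrapped around)

    // Fixed pattern, as in the proposed change: widen each size to Long
    // before reducing, so the sum never wraps.
    val fixed: Long = partitionSizes.map(_.toLong).reduce(_ + _)
    println(fixed) // 2147483649
  }
}
}}}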
> Creation of large graph (> 2.15 B nodes) seems to be broken: possible overflow somewhere
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-3190
> URL: https://issues.apache.org/jira/browse/SPARK-3190
> Project: Spark
> Issue Type: Bug
> Components: GraphX
> Affects Versions: 1.0.3
> Environment: Standalone mode running on EC2. Using the latest code from the master branch, up to commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6.
> Reporter: npanj
> Priority: Critical
>
> While creating a graph with 6B nodes and 12B edges, I noticed that the 'numVertices' API returns an incorrect result; 'numEdges' reports the correct number. A few times (with different datasets of > 2.5B nodes) I have also noticed that numVertices is returned as a negative number, so I suspect there is an overflow somewhere (maybe we are using an Int for some field?).
> Here are some details of the experiments I have done so far:
> 1. Input: numNodes=6101995593 ; noEdges=12163784626
> Graph returns: numVertices=1807028297 ; numEdges=12163784626
> 2. Input: numNodes=2157586441 ; noEdges=2747322705
> Graph returns: numVertices=-2137380855 ; numEdges=2747322705
> 3. Input: numNodes=1725060105 ; noEdges=204176821
> Graph returns: numVertices=1725060105 ; numEdges=2041768213
> You can find the code to generate this bug here:
> https://gist.github.com/npanj/92e949d86d08715bf4bf
> Note: Nodes are labeled 1...6B.
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org