Posted to dev@spark.apache.org by npanj <ni...@gmail.com> on 2014/08/23 02:43:39 UTC

Graphx seems to be broken while Creating a large graph(6B nodes in my case)

While creating a graph with 6B nodes and 12B edges, I noticed that
*the 'numVertices' API returns an incorrect result*; 'numEdges' reports the
correct number. A few times (with different datasets > 2.5B nodes) I have
also noticed numVertices being returned as a negative number, so I suspect
there is some overflow (maybe we are using Int for some field?).

Environment: standalone mode running on EC2, using the latest code from the
master branch up to commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6.

Here are some details of the experiments I have done so far:
1. Input: numNodes=6101995593 ; noEdges=12163784626
   Graph returns: numVertices=1807028297 ; numEdges=12163784626
2. Input: numNodes=*2157586441* ; noEdges=2747322705
   Graph returns: numVertices=*-2137380855* ; numEdges=2747322705
3. Input: numNodes=1725060105 ; noEdges=204176821
   Graph returns: numVertices=1725060105 ; numEdges=2041768213
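
The numVertices values in the three experiments are consistent with a count
that was wrapped into a signed 32-bit Int. The sketch below checks that
arithmetic on the reported numbers only. It is plain Scala with no Spark
dependency, and the names WrapCheck and wrapToInt are made up for this
illustration; nothing here is taken from the GraphX source.

```scala
object WrapCheck {
  // Keep only the low 32 bits of a 64-bit count and read them as a signed Int.
  // A total accumulated with Int addition wraps to exactly this value.
  def wrapToInt(trueCount: Long): Int = trueCount.toInt

  def main(args: Array[String]): Unit = {
    // (input numNodes, numVertices reported by the graph) from the experiments
    val cases = Seq(
      6101995593L -> 1807028297,
      2157586441L -> -2137380855,
      1725060105L -> 1725060105
    )
    for ((input, reported) <- cases) {
      val wrapped = wrapToInt(input)
      println(s"input=$input wrapped=$wrapped matchesReported=${wrapped == reported}")
    }
  }
}
```

Experiment 3 stays below Int.MaxValue (2147483647), which would explain why
only that run reports the right vertex count.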


You can find the code to generate this bug here:
https://gist.github.com/npanj/92e949d86d08715bf4bf

(I have also filed this jira ticket:
https://issues.apache.org/jira/browse/SPARK-3190)





--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Graphx-seems-to-be-broken-while-Creating-a-large-graph-6B-nodes-in-my-case-tp7966.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Graphx seems to be broken while Creating a large graph(6B nodes in my case)

Posted by Ankur Dave <an...@gmail.com>.
I posted the fix on the JIRA ticket (https://issues.apache.org/jira/browse/SPARK-3190). To update the user list, this is indeed an integer overflow problem when summing up the partition sizes. The fix is to use Longs for the sum: https://github.com/apache/spark/pull/2106.
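
For readers who want to see this class of bug in isolation, here is a minimal sketch in plain Scala collections. It is not the GraphX code or the patch from the pull request; the object name, the method names, and the partition sizes are hypothetical and chosen only so that each size fits in an Int while the total does not.

```scala
object PartitionSizeSum {
  // Hypothetical per-partition vertex counts: each fits in an Int,
  // but the total (4,500,000,000) exceeds Int.MaxValue (2,147,483,647).
  val partitionSizes: Seq[Int] = Seq(1500000000, 1500000000, 1500000000)

  // Overflowing pattern: the accumulator is an Int, so the total wraps around.
  def sumAsInt(sizes: Seq[Int]): Int =
    sizes.foldLeft(0)(_ + _)

  // Widened pattern: the accumulator is a Long, so each Int size is
  // promoted to Long before the addition.
  def sumAsLong(sizes: Seq[Int]): Long =
    sizes.foldLeft(0L)(_ + _)

  def main(args: Array[String]): Unit = {
    println(s"Int sum:  ${sumAsInt(partitionSizes)}")
    println(s"Long sum: ${sumAsLong(partitionSizes)}")
  }
}
```

Only the accumulator type differs between the two methods; the per-partition sizes can stay Ints as long as they are widened before being added.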

Ankur




Re: Graphx seems to be broken while Creating a large graph(6B nodes in my case)

Posted by Jeffrey Picard <jp...@placeiq.com>.
I’m seeing this issue also. I have a graph with 5828339535 vertices and 7398447992 edges; graph.numVertices returns 1533266498, while graph.numEdges is correct and returns 7398447992. I am also having an issue where connected components stops after one iteration and returns an incorrect graph, and I’m beginning to suspect it is caused by the same underlying problem.