You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by "Carr, J. Ryan" <Ry...@jhuapl.edu> on 2015/06/24 16:55:04 UTC

[GraphX] Graph 500 graph generator

Hi Spark Devs,

  As part of a project at work, I have written a graph generator for RMAT graphs consistent with the specifications in the Graph 500 benchmark (http://www.graph500.org/specifications). We had originally planned to use the rmatGenerator function in GraphGenerators, but found that it wasn't suitable for generating graphs with billions of edges; the edges are generated in a single thread and stored in a Set, meaning it can't generate a graph larger than memory on a single JVM (and I think Sets are limited to Int.MaxValue elements anyway).

  The generator I have is essentially a more scalable version of rmatGenerator. We have used it to generate a graph with 2^32 vertices and 2^36 edges on our modestly-specced cluster of 16 machines. It seems like other people interested in Spark might want to play with some large RMAT graphs (or run the Graph 500 benchmark), so I would like to contribute my generator. It does have some minor differences from the current generator, though:

  1.  Vertex IDs are shuffled after the graph structure is generated, so the degree of a vertex cannot be predicted from its ID (without this step vertex 0 would always have the largest degree, followed by vertices 1,2,4,8, etc.). This is per the Graph 500 spec. It could be easily made optional.
  2.  Duplicate edges are not removed from the resulting graph. This could easily be done with a call to distinct() on the resulting edge list, but then there would be slightly fewer edges than one generated by the current rmatGenerator. Also this process would be very slow on large graphs due to skew.
  3.  Doesn't set the out degree as the vertex attribute. Again this would be simple to add, but it could be slow on the super vertices.

  My question for the Spark Devs is: Is this something you would want as part of GraphX (either as a replacement for the current rmatGenerator or a separate function in GraphGenerators)? Since it was developed at work I need to go through our legal department and QA processes to open-source it, and to fill out the paperwork I need to know whether I'll be submitting a pull request or standing it up as a separate project on GitHub.

Thanks!

-Ryan

--
J. Ryan Carr, Ph. D.

The Johns Hopkins University, Applied Physics Laboratory
11100 Johns Hopkins Rd., Laurel, MD 20723
Office: 240-228-9157
Cell: 443-744-1004
Email: Ryan.Carr@jhuapl.edu<ma...@jhuapl.edu> or James.Carr@jhuapl.edu<ma...@jhuapl.edu>


Re: [GraphX] Graph 500 graph generator

Posted by Burak Yavuz <br...@gmail.com>.
Hi Ryan,
If you can get past the paperwork, I'm sure this can make a great Spark
Package (http://spark-packages.org). People then can use it for
benchmarking purposes, and I'm sure people will be looking for graph
generators!

Best,
Burak

On Wed, Jun 24, 2015 at 7:55 AM, Carr, J. Ryan <Ry...@jhuapl.edu> wrote:

>  Hi Spark Devs,
>
>    As part of a project at work, I have written a graph generator for
> RMAT graphs consistent with the specifications in the Graph 500 benchmark (
> http://www.graph500.org/specifications). We had originally planned to use
> the rmatGenerator function in GraphGenerators, but found that it wasn’t
> suitable for generating graphs with billions of edges; the edges are
> generated in a single thread and stored in a Set, meaning it can’t generate
> a graph larger than memory on a single JVM (and I think Sets are limited to Int.MaxValue
> elements anyway).
>
>    The generator I have is essentially a more scalable version of
> rmatGenerator. We have used it to generate a graph with 2^32 vertices and
> 2^36 edges on our modestly-specced cluster of 16 machines. It seems like
> other people interested in Spark might want to play with some large RMAT
> graphs (or run the Graph 500 benchmark), so I would like to contribute my
> generator. It does have some minor differences from the current
> generator, though:
>
>    1. Vertex IDs are shuffled after the graph structure is generated, so
>    the degree of a vertex cannot be predicted from its ID (without this step
>    vertex 0 would always have the largest degree, followed by vertices
>    1,2,4,8, etc.). This is per the Graph 500 spec. It could be easily made
>    optional.
>    2. Duplicate edges are not removed from the resulting graph. This
>    could easily be done with a call to distinct() on the resulting edge list,
>    but then there would be slightly fewer edges than one generated by the
>    current rmatGenerator. Also this process would be very slow on large graphs
>    due to skew.
>    3. Doesn’t set the out degree as the vertex attribute. Again this
>    would be simple to add, but it could be slow on the super vertices.
>
>   My question for the Spark Devs is: Is this something you would want as
> part of GraphX (either as a replacement for the current rmatGenerator or a
> separate function in GraphGenerators)? Since it was developed at work I
> need to go through our legal department and QA processes to open-source it,
> and to fill out the paperwork I need to know whether I’ll be submitting a
> pull request or standing it up as a separate project on GitHub.
>
>  Thanks!
>
>  -Ryan
>
>   --
> J. Ryan Carr, Ph. D.
>
>  The Johns Hopkins University, Applied Physics Laboratory
> 11100 Johns Hopkins Rd., Laurel, MD 20723
> Office: 240-228-9157
> Cell: 443-744-1004
> Email: *Ryan.Carr@jhuapl.edu <Ry...@jhuapl.edu>* or *James.Carr@jhuapl.edu
> <Ja...@jhuapl.edu>*
>
>