Posted to dev@tinkerpop.apache.org by Marko Rodriguez <ok...@gmail.com> on 2015/04/03 16:48:00 UTC

Analyzing Friendster with SparkGraphComputer and the Gryo Size Problem

Hi,

Over the last couple of weeks, Daniel Kuppitz and I have been running repeated Gremlin traversals over the Friendster graph using SparkGraphComputer.

	http://tinkerpop.incubator.apache.org/docs/3.0.0-SNAPSHOT/#sparkgraphcomputer

The Friendster graph has 124 million vertices and 2.5 billion edges. We are using 3 blades, each with 39 gigs of memory and 24 cores.

	https://ia600500.us.archive.org/27/items/friendster-dataset-201107/README-friends.txt

We use ScriptInputFormat to read the text-based representation of the Friendster data and generate VertexWritables, which (when on the wire) are Gryo adjacency-list serializations.

	http://tinkerpop.incubator.apache.org/docs/3.0.0-SNAPSHOT/#gremlin-kryo
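
For illustration, here is a toy Java parse of one such line. This is only a sketch: it assumes the "id:friend1,friend2,..." adjacency-list format from the README above, and the real ScriptInputFormat script emits VertexWritables rather than arrays.

	import java.util.Arrays;

	// Toy parser for one Friendster adjacency-list line (assumed format:
	// "id:friend1,friend2,..."). Non-numeric friend lists (e.g. deleted or
	// private profiles in the dump) are treated as vertices with no out-edges.
	public class FriendsterLineParser {

	    public static long[] parse(String line) {
	        String[] parts = line.split(":", 2);
	        long id = Long.parseLong(parts[0].trim());
	        if (parts.length < 2 || !parts[1].matches("[0-9,\\s]+"))
	            return new long[]{id}; // isolated vertex
	        String[] friends = parts[1].split(",");
	        long[] out = new long[friends.length + 1];
	        out[0] = id;
	        for (int i = 0; i < friends.length; i++)
	            out[i + 1] = Long.parseLong(friends[i].trim());
	        return out;
	    }

	    public static void main(String[] args) {
	        // "1:2,3,4" -> vertex 1 with out-edges to vertices 2, 3 and 4
	        System.out.println(Arrays.toString(parse("1:2,3,4")));
	    }
	}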

Here is the amount of data that is generated:

	raw text-based input: 22.6 gigs
	gryo graph generated from parsed text data: 94.6 gigs
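
Some quick arithmetic on those numbers (rough, and it assumes each edge is stored on both of its endpoints in the adjacency-list encoding, with property data ignored):

	// Back-of-the-envelope look at the 94.6 gig Gryo file. Assumes each edge
	// appears twice (once in an out-list, once in an in-list) -- rough only.
	public class GryoSizeEstimate {
	    public static void main(String[] args) {
	        double gryoBytes = 94.6e9;
	        double vertices = 124e6;
	        double edges = 2.5e9;
	        System.out.printf("bytes per edge entry: %.1f%n",
	                gryoBytes / (2 * edges)); // ~18.9
	        System.out.printf("average out-degree:   %.1f%n",
	                edges / vertices);        // ~20.2
	    }
	}

At ~19 bytes per edge entry, where a raw vertex id needs at most 8 bytes (and far fewer varint-encoded), there is clearly room to shrink.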
	
Next, here are the runtimes for the various stages:

	g.V() -- 2.5 hours
	g.V().out() -- 6 hours (+3.5 hours)
	g.V().out().out() -- 9.4 hours (+3.4 hours)
	g.V().out().out().out() -- 12.8 hours (+3.4 hours) --- SPECULATED (running right now, but on target for that amount of time)

The first computation has no incoming messages, so there is no join() involved; it is just the raw processing of VertexProgram.execute(). Each of the others incurs one join per out(). The cost of each join is the same -- this is the nature of the Gremlin OLAP algorithm.

	http://tinkerpop.incubator.apache.org/docs/3.0.0-SNAPSHOT/#traversalvertexprogram
		http://thinkaurelius.com/2012/11/11/faunus-provides-big-graph-data-analytics/ (read the section called The Traversal Mechanics of Faunus) 
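
To make the join mechanics concrete, here is a toy in-memory rendition of a single out() message pass in plain Java. This is not TinkerPop's actual Spark code, just the shape of the algorithm: every vertex forwards its traverser bulk along each out-edge, and the "join" re-keys the messages by receiving vertex.

	import java.util.HashMap;
	import java.util.List;
	import java.util.Map;

	// Toy model of one out() message pass. The Long values play the role of
	// Traverser bulk (how many traversers currently sit at each vertex).
	public class MessagePassSketch {

	    public static Map<Long, Long> passMessages(
	            Map<Long, List<Long>> outEdges, Map<Long, Long> counts) {
	        Map<Long, Long> incoming = new HashMap<>();
	        counts.forEach((v, bulk) ->
	            outEdges.getOrDefault(v, List.of()).forEach(neighbor ->
	                // the "join": messages are re-keyed by destination vertex
	                incoming.merge(neighbor, bulk, Long::sum)));
	        return incoming;
	    }

	    public static void main(String[] args) {
	        Map<Long, List<Long>> g = Map.of(1L, List.of(2L, 3L), 2L, List.of(3L));
	        Map<Long, Long> start = Map.of(1L, 1L, 2L, 1L); // one traverser each
	        System.out.println(passMessages(g, start));     // {2=1, 3=2}
	    }
	}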

Here is the amount of data being shuffled on each join:

	393.5 gigs

That is how much data is created by a single message pass. That is a lot -- an out() will send a message over each edge in the graph, and there are 2.5 billion edges. What is the message being sent? It's a Traverser<Vertex>, whose Gryo representation is a DetachedVertex.
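
To get a feel for why those messages are heavy, here is a small experiment with plain Kryo. FakeDetachedVertex is a made-up stand-in (TinkerPop's actual Gryo registrations and DetachedVertex layout differ); the point is only the relative sizes.

	import com.esotericsoftware.kryo.Kryo;
	import com.esotericsoftware.kryo.io.Output;

	// Compare a message carrying a whole detached element against one
	// carrying only a varint-encoded vertex id. FakeDetachedVertex is a
	// stand-in, not TinkerPop's DetachedVertex.
	public class MessageSizeSketch {

	    static class FakeDetachedVertex {
	        long id = 12345L;
	        String label = "person";
	    }

	    public static void main(String[] args) {
	        Kryo kryo = new Kryo();
	        kryo.register(FakeDetachedVertex.class);

	        Output full = new Output(256);
	        kryo.writeObject(full, new FakeDetachedVertex());

	        Output idOnly = new Output(16);
	        idOnly.writeVarLong(12345L, true);

	        System.out.println("detached-element bytes: " + full.position());
	        System.out.println("id-only bytes:          " + idOnly.position());
	    }
	}

Multiply any per-message savings by 2.5 billion edges and the shuffle shrinks accordingly.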

I believe that if we can get our Gryo serialization size down, we can expect much better runtimes. An off-the-cuff calculation estimates that we can shrink the Gryo file by ~8x.

	https://issues.apache.org/jira/browse/TINKERPOP3-609
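
As one illustration of why ~8x is plausible (an assumption of mine, not necessarily the approach taken in TINKERPOP3-609): if an adjacency list is sorted, its ids can be stored as varint-encoded deltas instead of one fixed 8-byte long per neighbor.

	import java.io.ByteArrayOutputStream;

	// Sorted neighbor ids stored as varint deltas vs. fixed 8-byte longs.
	public class DeltaVarintSketch {

	    static void writeVarLong(ByteArrayOutputStream out, long v) {
	        while ((v & ~0x7FL) != 0) { out.write((int) (v & 0x7F) | 0x80); v >>>= 7; }
	        out.write((int) v);
	    }

	    public static void main(String[] args) {
	        long[] neighbors = {10234, 10240, 10299, 11002, 11003}; // sorted
	        ByteArrayOutputStream out = new ByteArrayOutputStream();
	        long prev = 0;
	        for (long id : neighbors) { writeVarLong(out, id - prev); prev = id; }
	        System.out.println("fixed 8-byte longs: " + neighbors.length * 8); // 40
	        System.out.println("delta varints:      " + out.size());           // 7
	    }
	}

Five neighbor ids drop from 40 bytes to 7 in this toy case, before even touching object overhead.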

That will not yield an 8x increase in speed, as VertexProgram.execute() still takes a fixed amount of time irrespective of the serialization size -- but perhaps 5x+? Or perhaps even more than 8x, as more data can be held in memory by Spark and partitions will not be constantly written to and read from disk.
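
To frame the "how much faster" question, a rough Amdahl-style model (with made-up fractions, since we have not profiled the split): if a fraction f of a stage's time scales with serialized size, an 8x shrink speeds the stage up by 1 / ((1 - f) + f/8).

	// Rough Amdahl-style model with illustrative (not measured) fractions.
	public class SpeedupModel {
	    public static void main(String[] args) {
	        double shrink = 8.0;
	        for (double f : new double[]{0.5, 0.8, 0.9}) {
	            double speedup = 1.0 / ((1.0 - f) + f / shrink);
	            System.out.printf("f=%.1f -> %.1fx%n", f, speedup); // 1.8x, 3.3x, 4.7x
	        }
	    }
	}

On this model, "perhaps 5x" corresponds to roughly 90% of stage time being size-bound; the "more than 8x" case comes from disk spills disappearing entirely, which a linear model like this cannot capture.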

Anywho -- I was IMing about this with Stephen and Daniel and decided to turn it into an email to the list in case people have thoughts or wish to help with this problem.

Enjoy!,
Marko.

http://markorodriguez.com


Re: Analyzing Friendster with SparkGraphComputer and the Gryo Size Problem

Posted by Marko Rodriguez <ok...@gmail.com>.
By the way… g.V().repeat(out()).times(3).count() just finished.

gremlin> g.V().repeat(out()).times(3).count()
==>208214159917830

That's 208,214,159,917,830 --- or 208 trillion length-3 paths in the Friendster dataset. 

It took 12.1 hours --- so the last out() took 2.7 hours.
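
As a sanity check on that count (rough arithmetic only): if every vertex had the average out-degree, the number of length-3 walks would be |V| * avgDeg^3.

	// Compare the uniform-degree estimate to the measured count.
	public class PathCountCheck {
	    public static void main(String[] args) {
	        double vertices = 124e6, edges = 2.5e9;
	        double avgDeg = edges / vertices;                    // ~20.2
	        double uniformEstimate = vertices * Math.pow(avgDeg, 3);
	        System.out.printf("uniform-degree estimate: %.2e%n", uniformEstimate); // ~1.0e12
	        System.out.printf("actual count:            %.2e%n", 208214159917830d); // ~2.1e14
	    }
	}

The real count is ~200x the uniform-degree estimate, which says the Friendster degree distribution is heavily skewed -- high-degree vertices dominate the walk count.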

Marko.

http://markorodriguez.com
