Posted to dev@tinkerpop.apache.org by Marko Rodriguez <ok...@gmail.com> on 2016/02/09 23:31:35 UTC

Ruminations on SparkGraphComputer -- Part Deux

Hi,

Two tickets were recently completed.
	https://issues.apache.org/jira/browse/TINKERPOP-1131 (TinkerPop 3.1.2-SNAPSHOT & TinkerPop 3.2.0-SNAPSHOT)
	https://issues.apache.org/jira/browse/TINKERPOP-962 (TinkerPop 3.2.0-SNAPSHOT)
		- with updates to serialization as well in this push.

With these merged, I benchmarked SparkGraphComputer against Friendster (2.5 billion edges) for the following queries:

g.V().count() -- answer 125000000 (125 million vertices)
	- TinkerPop 3.0.0.MX: 2.5 hours
	- TinkerPop 3.0.0:	1.5 hours
	- TinkerPop 3.1.1:	23 minutes
	- TinkerPop 3.2.0:	6.8 minutes

g.V().out().count() -- answer 2586147869 (2.5 billion length-1 paths (i.e. edges))
	- TinkerPop 3.0.0.MX: unknown
	- TinkerPop 3.0.0:	2.5 hours
	- TinkerPop 3.1.1:	1.1 hours
	- TinkerPop 3.2.0:	13 minutes (*** TinkerPop 3.1.2 will be this fast too)
	
g.V().out().out().count() -- answer 640528666156 (640 billion length-2 paths)
	- TinkerPop 3.0.0.MX: unknown
	- TinkerPop 3.0.0:	unknown
	- TinkerPop 3.1.1:	unknown
	- TinkerPop 3.2.0:	55 minutes (*** TinkerPop 3.1.2 will be this fast too)

g.V().out().out().out().count() -- answer 215664338057221 (215 trillion length-3 paths)
	- TinkerPop 3.0.0.MX: 12.8 hours
	- TinkerPop 3.0.0:	8.6 hours
	- TinkerPop 3.1.1:	2.4 hours
	- TinkerPop 3.2.0:	1.6 hours (*** TinkerPop 3.1.2 will be this fast too)		
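
For anyone puzzled by the sizes above: out().out()...count() counts *traversers* (i.e. paths), not distinct vertices, which is why the numbers explode with each hop. A minimal Python sketch of that semantics on a hypothetical 4-vertex graph (this is an illustration of the counting model, not TinkerPop code):

```python
# Count length-k paths the way g.V().out()...out().count() does:
# every traverser is counted, so the result is the number of paths,
# not the number of distinct reachable vertices.
from collections import Counter

# hypothetical directed graph: vertex -> out-neighbors
adj = {1: [2, 3], 2: [3], 3: [4], 4: []}

def path_count(adj, k):
    # start one traverser per vertex (g.V()), then take k out() hops
    frontier = Counter({v: 1 for v in adj})   # vertex -> traverser bulk
    for _ in range(k):
        nxt = Counter()
        for v, bulk in frontier.items():
            for w in adj[v]:
                nxt[w] += bulk                # bulk the traversers together
        frontier = nxt
    return sum(frontier.values())

print(path_count(adj, 0))  # g.V().count()             -> 4
print(path_count(adj, 1))  # g.V().out().count()       -> 4 (one per edge)
print(path_count(adj, 2))  # g.V().out().out().count() -> 3
```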

For SparkGraphComputer, I no longer have to use DISK_ONLY: the memory optimizations have greatly reduced heap usage, so I can run with MEMORY_AND_DISK_SER without the GC going crazy. Moreover, because of TINKERPOP-1131, ReducingBarrierSteps (e.g. groupCount(), count(), sum(), max()) are significantly faster and use a minuscule amount of memory. Together, these updates have greatly improved GraphComputer, as the SparkGraphComputer benchmarks above show.
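
The reason a reducing barrier can be so memory-light is that each partition folds its traversers down to a single accumulator, and only those per-partition values are combined at the end. A rough Python sketch of the idea (names here are illustrative, not TinkerPop internals):

```python
# Per-partition reduction: the idea behind cheap ReducingBarrierSteps
# like count() and sum(). Each partition folds to one accumulator
# (O(1) memory per partition); only the tiny list of partials is
# shipped to the final combine.
from functools import reduce

partitions = [range(0, 5), range(5, 9), range(9, 12)]  # stand-in for RDD partitions

def reduce_barrier(parts, map_fn, combine):
    # map side: one accumulator per partition
    partials = [reduce(combine, (map_fn(x) for x in p), 0) for p in parts]
    # reduce side: combine the handful of partials
    return reduce(combine, partials, 0)

print(reduce_barrier(partitions, lambda x: 1, lambda a, b: a + b))  # count -> 12
print(reduce_barrier(partitions, lambda x: x, lambda a, b: a + b))  # sum   -> 66
```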

Finally, check this out. I decided to test the speed of g.V().count() when the input graph is already partitioned across the Spark cluster. This is what you will see when you use PersistedOutputRDD/PersistedInputRDD, or when your graph system supplies a Partitioner with its InputRDD and thus avoids the initial partitioning that SparkGraphComputer would otherwise do.

g.V().count() -- answer 125000000 (125 million vertices)
	- TinkerPop 3.2.0:	5.2 minutes
… hmm, not as good as I was hoping. I thought this would be around 1-2 minutes. :| I bet there is something I'm doing wrong.
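
For anyone wanting to try the persisted-RDD setup themselves, a HadoopGraph properties file along these lines should do it (property and class names as I recall them from the 3.2.0 reference docs -- double check against your version; the RDD name "myGraphRDD" is just an example):

```properties
# first job: write the loaded graph into Spark's cache instead of HDFS
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.spark.structure.io.PersistedOutputRDD
gremlin.hadoop.outputLocation=myGraphRDD
# keep the SparkContext alive between jobs so the cached RDD survives
gremlin.spark.persistContext=true
# subsequent jobs: read the already-partitioned RDD back by name
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.spark.structure.io.PersistedInputRDD
gremlin.hadoop.inputLocation=myGraphRDD
```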

Enjoy!
Marko.

http://markorodriguez.com


Re: Ruminations on SparkGraphComputer -- Part Deux

Posted by Marko Rodriguez <ok...@gmail.com>.
Hello,

Apologies, but there is one correction. Where I say "TinkerPop 3.1.2 will be this fast too" -- that is not right. I forgot that the GryoSerializer for SparkGraphComputer was only updated for 3.2.0. Thus, TinkerPop 3.1.2 should have speeds somewhere between 3.1.1 and 3.2.0 (leaning more towards 3.2.0 speeds).

Thanks,
Marko.

http://markorodriguez.com
