Posted to user@spark.apache.org by ShreyanshB <sh...@gmail.com> on 2014/07/11 23:23:40 UTC

Graphx : optimal partitions for a graph and error in logs

Hi,

I am trying GraphX on the LiveJournal data. I have a cluster of 17 computing
nodes: 1 master and 16 workers. I have a few questions about this.
* I built Spark from spark-master (to avoid the partitionBy error in Spark 1.0).
* I am using edgeListFile() to load the data, and I figured I need to specify
how many partitions I want. The exact syntax I am using is the following:

val graph = GraphLoader.edgeListFile(sc,
  "filepath", true, 64).partitionBy(PartitionStrategy.RandomVertexCut)

-- Is this a correct way to load the file to get the best performance?
-- What should the number of partitions be? Equal to the number of computing
nodes, or to the number of cores?
-- I see the following error many times in my logs:
ERROR BlockManagerWorker: Exception handling buffer message
java.io.NotSerializableException:
org.apache.spark.graphx.impl.ShippableVertexPartition
Does this suggest that my graph wasn't partitioned properly? I suspect it
affects performance.

Please let me know whether I'm following every step correctly.

Thanks in advance,
-Shreyansh




Re: Graphx : optimal partitions for a graph and error in logs

Posted by ShreyanshB <sh...@gmail.com>.
Perfect! Thanks Ankur.




Re: Graphx : optimal partitions for a graph and error in logs

Posted by Ankur Dave <an...@gmail.com>.
Spark just opens up inter-slave TCP connections for message passing during
shuffles (I think the relevant code is in ConnectionManager). Since TCP
automatically determines the optimal sending rate
<http://en.wikipedia.org/wiki/TCP_congestion-avoidance_algorithm>, Spark
doesn't need any configuration parameters for this.

Ankur <http://www.ankurdave.com/>

Re: Graphx : optimal partitions for a graph and error in logs

Posted by ShreyanshB <sh...@gmail.com>.
Great! Thanks a lot.
I hate to say this, but I promise this is the last quickie.

I looked at the configuration options but didn't find any parameter to tune
for network bandwidth. That is, is there any way to tell GraphX (Spark) that
I'm using a 1G network, a 10G network, or InfiniBand? Does it figure this
out on its own and speed up message passing accordingly?




Re: Graphx : optimal partitions for a graph and error in logs

Posted by Ankur Dave <an...@gmail.com>.
I don't think it should affect performance very much, because GraphX
doesn't serialize ShippableVertexPartition in the "fast path" of
mapReduceTriplets execution (instead it calls
ShippableVertexPartition.shipVertexAttributes and serializes the result). I
think it should only get serialized for speculative execution, if you have
that enabled.
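
To make the "fast path" concrete: below is a minimal sketch (against the
Spark 1.x-era GraphX API) of a typical mapReduceTriplets call; computing
in-degrees is just an illustrative workload, not something from this thread.

import org.apache.spark.graphx._

// Map: each edge triplet sends the message 1 to its destination vertex.
// Reduce: messages arriving at the same vertex are summed.
// On this path GraphX ships only the needed vertex attributes to the edge
// partitions (shipVertexAttributes); ShippableVertexPartition itself is
// not serialized.
val inDegrees: VertexRDD[Int] = graph.mapReduceTriplets[Int](
  triplet => Iterator((triplet.dstId, 1)),
  (a, b) => a + b
)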

By the way, here's the fix: https://github.com/apache/spark/pull/1376

Ankur <http://www.ankurdave.com/>

Re: Graphx : optimal partitions for a graph and error in logs

Posted by ShreyanshB <sh...@gmail.com>.
Thanks a lot Ankur, I'll follow that.

A last quick question: does that error affect performance?

~Shreyansh




Re: Graphx : optimal partitions for a graph and error in logs

Posted by Ankur Dave <an...@gmail.com>.
On Fri, Jul 11, 2014 at 2:23 PM, ShreyanshB <sh...@gmail.com> wrote:

> -- Is this a correct way to load the file to get the best performance?


Yes, edgeListFile should be efficient at loading the edges.

> -- What should the number of partitions be? Equal to the number of computing
> nodes, or to the number of cores?


In general it should be a multiple of the number of cores to exploit all
available parallelism, but because of shuffle overhead, it might help to
use fewer partitions -- in some cases even fewer than the number of cores.
You can measure the performance with different numbers of partitions to see
what is best; a minimal sketch of one way to do that follows.
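
Here is a minimal sketch of such a measurement, reusing the file path and
RandomVertexCut strategy from your original snippet; the candidate partition
counts and the use of connected components as the workload are illustrative
assumptions, not recommendations from this thread.

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Time a full load-partition-compute cycle for a few partition counts.
for (numParts <- Seq(16, 32, 64, 128)) {
  val start = System.currentTimeMillis()
  val g = GraphLoader.edgeListFile(sc, "filepath", true, numParts)
    .partitionBy(PartitionStrategy.RandomVertexCut)
  // Run a real job so loading, partitioning, and shuffles are all measured.
  g.connectedComponents().vertices.count()
  println(s"$numParts partitions: ${System.currentTimeMillis() - start} ms")
}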

> -- I see the following error many times in my logs [...]
> NotSerializableException


This is a known bug, and there are two possible resolutions:

1. Switch from Java serialization to Kryo serialization, which is faster
and will also resolve the problem, by setting the following Spark
properties in conf/spark-defaults.conf (see the sketch after this list):
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.spark.graphx.GraphKryoRegistrator

2. Mark the affected classes as Serializable. I'll submit a patch with this
fix as well, but for now I'd suggest trying Kryo if possible.
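
For completeness, here is a minimal sketch of setting the same two
properties programmatically on a SparkConf instead of editing
conf/spark-defaults.conf; the application name is a hypothetical placeholder.

import org.apache.spark.{SparkConf, SparkContext}

// Equivalent to the two lines in conf/spark-defaults.conf above.
val conf = new SparkConf()
  .setAppName("graphx-livejournal")  // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator",
       "org.apache.spark.graphx.GraphKryoRegistrator")
val sc = new SparkContext(conf)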

Ankur <http://www.ankurdave.com/>