Posted to user@spark.apache.org by ShreyanshB <sh...@gmail.com> on 2014/07/19 04:14:10 UTC

GraphX: Performance comparison over cluster

Hi,

I am trying to compare GraphX with other distributed graph processing systems
(GraphLab) on my cluster of 64 nodes, each node having 32 cores and
connected with InfiniBand.

I looked at http://arxiv.org/pdf/1402.2394.pdf and the stats provided
there. I had a few questions regarding configuration and achieving the best
performance.

* Should I use the PageRank application already available in GraphX for this
purpose, or do I need to modify it or write my own?
   - If I shouldn't use the built-in PageRank, can you share your PageRank
application?

* What should executor_memory be: the maximum available, or sized according
to the graph?

* Other than the number of cores, executor_memory, and the partition strategy,
is there any other configuration I should set to get the best performance?

I am using the following script:
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Time how long loading the edge list into a Graph takes
// (the last two arguments request canonical orientation and 32 edge partitions)
val startgraphloading = System.currentTimeMillis
val graph = GraphLoader.edgeListFile(sc, "filepath", true, 32)
val endgraphloading = System.currentTimeMillis
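
For the PageRank run itself, I am planning something like the following rough
sketch (using the built-in staticPageRank and counting the result to force
evaluation; please correct me if this is not the right way to time it):

// Sketch only, assuming the graph loaded above
graph.cache()
graph.edges.count()        // materialize the graph before starting the timer
val startpagerank = System.currentTimeMillis
val ranks = graph.staticPageRank(10).vertices   // 10 PageRank iterations
ranks.count()              // force the computation to actually run
val endpagerank = System.currentTimeMillis
println("PageRank took " + (endpagerank - startpagerank) + " ms")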


Thanks in advance :)




Re: GraphX: Performance comparison over cluster

Posted by Ankur Dave <an...@gmail.com>.
ShreyanshB <sh...@gmail.com> writes:
>> The version with in-memory shuffle is here:
>> https://github.com/amplab/graphx2/commits/vldb.
>
> It'd be great if you can tell me how to configure and invoke this spark
> version.

Sorry for the delay on this. Assuming you're planning to launch an EC2 cluster, here's how to use the version of GraphX with in-memory shuffle:

1. Check out the in-memory shuffle branch locally. It's important to do this before launching the cluster to make sure the cluster gets set up in a way that's compatible with this version of Spark (using the v2 branch of https://github.com/mesos/spark-ec2).

    git clone https://github.com/amplab/graphx2 -b vldb
    mv graphx2 spark

2. Launch a cluster.

    cd spark
    ec2/spark-ec2 -s 16 -w 500 -k ec2-key-name -i path/to/ec2-key.pem -t m2.4xlarge -z us-east-1e --spot-price=1 launch graphx-benchmarking

3. On the cluster, check out and build the in-memory shuffle branch.

    cd /mnt
    git clone https://github.com/amplab/graphx2 -b vldb
    mv graphx2 spark
    cd spark
    mkdir -p conf
    cp ~/spark/conf/* conf/
    sbt/sbt assembly
    rsync -r --delete . ~/spark
    ~/spark/sbin/stop-all.sh
    ~/spark-ec2/copy-dir --delete ~/spark
    ~/spark/sbin/start-all.sh

4. Load your input graph onto HDFS in edge list format.

    ~/ephemeral-hdfs/bin/hadoop fs -put edge-list.txt /

5. Run PageRank using the Analytics driver.

    cd ~/spark
    MASTER=spark://$(wget -q -O - http://169.254.169.254/latest/meta-data/public-hostname):7077
    /usr/bin/time -f "TOTAL TIME: %e seconds" ~/spark/bin/spark-class org.apache.spark.graphx.lib.Analytics $MASTER pagerank /edge-list.txt --numEPart=128 --numIter=10
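
(The --numEPart=128 above is just the total number of cores in this example
cluster; a quick sketch of that arithmetic, assuming 8 cores per m2.4xlarge
slave:)

    // Illustrative arithmetic only
    val slaves = 16                         // matches -s 16 in the launch command above
    val coresPerSlave = 8                   // assumed for m2.4xlarge
    val numEPart = slaves * coresPerSlave   // = 128, the value passed to --numEPart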

Ankur

Re: GraphX: Performance comparison over cluster

Posted by ShreyanshB <sh...@gmail.com>.
Thanks Ankur.

> The version with in-memory shuffle is here:
> https://github.com/amplab/graphx2/commits/vldb. Unfortunately Spark has
> changed a lot since then, and the way to configure and invoke Spark is
> different. I can send you the correct configuration/invocation for this if
> you're interested in benchmarking it.

It'd be great if you could tell me how to configure and invoke this Spark
version.








Re: GraphX: Performance comparison over cluster

Posted by Ankur Dave <an...@gmail.com>.
On Fri, Jul 18, 2014 at 9:07 PM, ShreyanshB <sh...@gmail.com>
 wrote:
>
> Does the suggested version with in-memory shuffle affect performance too
> much?


We've observed a 2-3x speedup from it, at least on larger graphs like
twitter-2010 <http://law.di.unimi.it/webdata/twitter-2010/> and uk-2007-05
<http://law.di.unimi.it/webdata/uk-2007-05/>.

> (According to previously reported numbers, GraphX did 10 iterations in 142
> seconds, and in the latest stats it does them in 68 seconds.) Is it just the
> in-memory shuffle version that changed?


If you're referring to previous results vs. the arXiv paper, there were
several improvements, but in-memory shuffle had the largest impact.

Ankur <http://www.ankurdave.com/>

Re: GraphX: Performance comparison over cluster

Posted by ShreyanshB <sh...@gmail.com>.
Thanks a lot Ankur.

> The version with in-memory shuffle is here:
> https://github.com/amplab/graphx2/commits/vldb. Unfortunately Spark has
> changed a lot since then, and the way to configure and invoke Spark is
> different. I can send you the correct configuration/invocation for this if
> you're interested in benchmarking it.

Actually I wanted to see how GraphLab and GraphX perform on the cluster we
have (32 cores per node and InfiniBand). I tried the LiveJournal graph with
partitions = 400 (16 nodes, each node with 32 cores), but it performed better
with partitions = 64. I'll try it again. Does the suggested version with
in-memory shuffle affect performance too much? (According to previously
reported numbers, GraphX did 10 iterations in 142 seconds, and in the latest
stats it does them in 68 seconds.) Is it just the in-memory shuffle version
that changed?










Re: GraphX: Performance comparison over cluster

Posted by Ankur Dave <an...@gmail.com>.
Thanks for your interest. I should point out that the numbers in the arXiv
paper are from GraphX running on top of a custom version of Spark with an
experimental in-memory shuffle prototype. As a result, if you benchmark
GraphX at the current master, it's expected that it will be 2-3x slower
than GraphLab.

The version with in-memory shuffle is here:
https://github.com/amplab/graphx2/commits/vldb. Unfortunately Spark has
changed a lot since then, and the way to configure and invoke Spark is
different. I can send you the correct configuration/invocation for this if
you're interested in benchmarking it.

On Fri, Jul 18, 2014 at 7:14 PM, ShreyanshB <sh...@gmail.com>
 wrote:
>
> Should I use the pagerank application already available in graphx for this purpose
> or need to modify or need to write my own?


You should use the built-in PageRank. If your graph is available in edge
list format, you can run it using the Analytics driver as follows:

~/spark/bin/spark-submit --master spark://$MASTER_URL:7077 \
  --class org.apache.spark.graphx.lib.Analytics \
  ~/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar \
  pagerank $EDGE_FILE --numEPart=$NUM_PARTITIONS --numIter=$NUM_ITERATIONS \
  [--partStrategy=$PARTITION_STRATEGY]

> What should be the executor_memory, i.e. maximum or according to graph
> size?


As much memory as possible while leaving room for the operating system.
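
As a rough sketch (illustrative numbers only, and assuming you construct the
SparkContext yourself rather than relying on the launch scripts' defaults), on
a node with 64 GB of RAM you might reserve a few GB for the OS and give the
rest to the executor:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: the 60g figure is an assumed value for a 64 GB node,
// not a measured recommendation.
val conf = new SparkConf()
  .setAppName("graphx-pagerank-benchmark")
  .set("spark.executor.memory", "60g")
val sc = new SparkContext(conf)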

> Is there any other configuration I should do to have the best performance?


I think the parameters to Analytics above should be sufficient:

- numEPart - should be equal to or a small integer multiple of the number
of cores. More partitions improve work balance but also increase memory
usage and communication, so in some cases it can even be faster with fewer
partitions than cores.
- partStrategy - If your edges are already sorted, you can skip this
option, because GraphX will leave them as-is by default and that may be
close to optimal. Otherwise, EdgePartition2D and RandomVertexCut are both
worth trying; a short sketch illustrating both parameters follows below.
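
As a rough illustration (a sketch against the public GraphX API rather than
the Analytics driver; the path and the 128 are placeholders, and sc is the
SparkContext from a Scala shell), the two knobs look like this:

import org.apache.spark.graphx._

// 128 plays the role of numEPart above; use roughly the number of cores.
val graph = GraphLoader.edgeListFile(sc, "/edge-list.txt", false, 128).cache()

// Only repartition if the edges aren't already in a good layout on disk;
// EdgePartition2D and RandomVertexCut are the strategies worth trying first.
val repartitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)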

CC'ing Joey and Dan, who may have other suggestions.

Ankur <http://www.ankurdave.com/>

