Posted to user@spark.apache.org by Jiaxin Shi <sh...@gmail.com> on 2014/07/31 14:28:40 UTC

configuration needed to run twitter(25GB) dataset

We have a 6-node cluster; each node has 64 GB of memory.

Here is the command:
./bin/spark-submit --class org.apache.spark.examples.graphx.LiveJournalPageRank \
  examples/target/scala-2.10/spark-examples-1.0.1-hadoop1.0.4.jar \
  hdfs://dataset/twitter --tol=0.01 --numEPart=144 --numIter=10

But it ran out of memory. I also tried 2D and 1D partitioning.

I also tried Giraph under the same configuration; it ran for 10
iterations and then ran out of memory as well.

Actually, I'm not sure whether the command is right.
Should numEPart equal the number of nodes, or the number of nodes * cores?
I think a smaller numEPart will require less memory, just as in PowerGraph.

Thanks in advance!

Re: configuration needed to run twitter(25GB) dataset

Posted by Ankur Dave <an...@gmail.com>.
At 2014-08-01 02:12:08 -0700, shijiaxin <sh...@gmail.com> wrote:
> When I use fewer partitions (like 6), it seems that all the tasks are
> assigned to the same machine, because that machine has more than 6 cores.
> But this runs out of memory.
> How can I set a smaller number of partitions and still use all the machines at the same time?

Yes, I've encountered this problem myself. I haven't tried this, but one idea is to reduce the number of cores that Spark is allowed to use on each worker by passing --cores to spark-submit or setting SPARK_WORKER_CORES.
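
For example, in standalone mode this could be done by setting SPARK_WORKER_CORES in conf/spark-env.sh on each worker and restarting the workers. A rough, untested sketch (the value 1 is just for illustration; it also limits parallelism within each machine):

    # conf/spark-env.sh on each of the 6 worker nodes
    # With only 1 core advertised per worker, at most one task runs at a
    # time on each machine, so 6 partitions spread across the 6 machines.
    export SPARK_WORKER_CORES=1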

Ankur

Re: configuration needed to run twitter(25GB) dataset

Posted by shijiaxin <sh...@gmail.com>.
When I use fewer partitions (like 6), it seems that all the tasks are
assigned to the same machine, because that machine has more than 6 cores.
But this runs out of memory.
How can I set a smaller number of partitions and still use all the machines at the same time?




Re: configuration needed to run twitter(25GB) dataset

Posted by Ankur Dave <an...@gmail.com>.
At 2014-07-31 21:40:39 -0700, shijiaxin <sh...@gmail.com> wrote:
> Is it possible to reduce the number of edge partitions and exploit
> parallelism fully at the same time?
> For example, one partition per node, with the threads on the same node
> sharing that partition.

It's theoretically possible to parallelize operations within a partition, but I wouldn't worry about exploiting all available parallelism. PageRank is typically communication-bound rather than computation-bound, so it can be a net gain to reduce the amount of communication by using fewer partitions even if that means sacrificing some parallelism.

Ankur

Re: configuration needed to run twitter(25GB) dataset

Posted by shijiaxin <sh...@gmail.com>.
Is it possible to reduce the number of edge partitions and exploit
parallelism fully at the same time?
For example, one partition per node, with the threads on the same node
sharing that partition.




Re: configuration needed to run twitter(25GB) dataset

Posted by Ankur Dave <an...@gmail.com>.
On Thu, Jul 31, 2014 at 08:28 PM, Jiaxin Shi <sh...@gmail.com> wrote:
> We have a 6-node cluster; each node has 64 GB of memory.
> [...]
> But it ran out of memory. I also tried 2D and 1D partitioning.
>
> I also tried Giraph under the same configuration; it ran for 10
> iterations and then ran out of memory as well.

If Giraph is also running out of memory, it sounds like the graph is just too big to fit entirely in memory on your cluster. In that case, you could try changing the storage level from MEMORY_ONLY (the default) to MEMORY_AND_DISK. That would allow GraphX to spill partitions to disk, hurting performance but at least allowing the computation to finish.

You can do this by passing

    --edgeStorageLevel=MEMORY_AND_DISK --vertexStorageLevel=MEMORY_AND_DISK

to spark-submit.
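
With your original command, that would look something like this (untested; everything except the two new flags is unchanged):

    ./bin/spark-submit --class org.apache.spark.examples.graphx.LiveJournalPageRank \
      examples/target/scala-2.10/spark-examples-1.0.1-hadoop1.0.4.jar \
      hdfs://dataset/twitter --tol=0.01 --numEPart=144 --numIter=10 \
      --edgeStorageLevel=MEMORY_AND_DISK --vertexStorageLevel=MEMORY_AND_DISK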

> Should numEPart equal the number of nodes, or the number of nodes * cores?
> I think a smaller numEPart will require less memory, just as in PowerGraph.

Right, increasing the number of edge partitions will increase the memory and communication overhead for both GraphX and PowerGraph. Setting the number of edge partitions to the total number of cores (nodes * cores) is a good starting point since it will allow GraphX to exploit parallelism fully, and you can experiment with half or double that number if necessary.
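
For example, if each of your 6 nodes has 16 cores (a guess on my part; substitute your actual core count), you would start with 6 * 16 = 96 edge partitions and compare against half and double that:

    --numEPart=96     (6 nodes * 16 cores)
    --numEPart=48     (half)
    --numEPart=192    (double)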

Ankur