Posted to user@giraph.apache.org by "Tu, Min" <mi...@paypal.com> on 2013/02/14 23:50:34 UTC

General Scalability Questions for Giraph

Hi,

I have some general scalability questions about Giraph. Based on the Giraph design, I am assuming that all the mappers in a Giraph job must be running at the same time.

If so, then

  1.  The max number of mappers for a Giraph job <= total mapper slots in the whole cluster
  2.  The max data input size to Giraph should be <= total mapper slots * mapper memory limit
  3.  If the total number of mapper slots in the cluster is 200, only 100 are currently available, and the Giraph job requires 150 mappers:
     *   Without any configuration change, 100 of the Giraph job's mappers will be started, but the job will NOT run successfully
     *   Is there any configuration in Giraph to start the job ONLY once all the required mapper slots are available?
  4.  How well does Giraph scale? I can ONLY run up to 150 mappers for my Giraph job. Has anyone successfully run a large Giraph job on a large cluster?
     *   I am using Giraph 0.1 in my cluster
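
The bounds in items 1 and 2 can be written out as a quick back-of-the-envelope check. This is only a sketch of the assumption stated in the question, not a Giraph API: the function name and the example figures (200 slots, 1 GiB per mapper) are made up for illustration, and the bound ignores JVM overhead and message traffic.

```python
GIB = 1024 ** 3


def max_input_bytes(total_mapper_slots: int, mapper_memory_bytes: int) -> int:
    """Upper bound from item 2: every mapper runs at once and keeps its share in memory."""
    if total_mapper_slots < 0 or mapper_memory_bytes < 0:
        raise ValueError("slots and memory must be non-negative")
    return total_mapper_slots * mapper_memory_bytes


# Hypothetical cluster: 200 mapper slots, 1 GiB memory limit per mapper.
bound = max_input_bytes(200, 1 * GIB)
print(bound // GIB, "GiB")
```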

Thanks a lot for your time and inputs.

Min

Re: General Scalability Questions for Giraph

Posted by Eli Reisman <ap...@gmail.com>.

Claudio is right on. If your cluster is provisioned with lots of nodes and
you have the map slots, you can scale well into four figures on Giraph 0.2
(by the way, use Giraph 0.2!). The key is the memory and network bandwidth
per node if you run on a smaller cluster, and yes, Giraph worker/master
tasks might be (and often are) placed more than one to a compute node.

If you need a certain number of workers and there are not enough mapper
slots, your job will eventually time out and die, either on the Giraph end
or the Hadoop end, whichever stalls out first while waiting to begin
claiming input splits. Input reading can't start until all the mappers the
job needs are up and running.
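
A minimal sketch of that start condition, assuming one extra map task for the master as Claudio describes (n - 1 workers out of n mappers). The function name is hypothetical and this is not how Giraph checks it internally; it only shows why 100 free slots cannot satisfy a job that needs 150 mappers.

```python
def can_start(requested_workers: int, free_map_slots: int) -> bool:
    """True if the master plus every requested worker can run at the same time."""
    if requested_workers < 0 or free_map_slots < 0:
        raise ValueError("counts must be non-negative")
    tasks_needed = requested_workers + 1  # assumption: one map task for the master
    return free_map_slots >= tasks_needed


# The case from the question: 150 mappers (149 workers + 1 master), 100 free slots.
print(can_start(149, 100))
# The same job once 150 slots are free.
print(can_start(149, 150))
```

In the first case the started mappers simply wait for the missing ones until a timeout kills the job.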


Re: General Scalability Questions for Giraph

Posted by "Tu, Min" <mi...@paypal.com>.

Hi Claudio,

Thank you very much for your valuable input. I will follow your suggestions and try Giraph 0.2 (from trunk) and the worker settings.

Min

Re: General Scalability Questions for Giraph

Posted by Claudio Martella <cl...@gmail.com>.

Hi Tu,

First of all, I really suggest you run trunk, especially if you have a
large graph. That being said:

1) Yes and no; the jargon is misleading. With n as the maximum number of
mappers you can have in your cluster, you should have at most n - 1 workers
(what you call mappers for the Giraph job); the additional one goes to the
master. In general, I'd strongly suggest you have 1 mapper/worker per
node/MACHINE, and k compute threads per worker, with k as the number of
cores on that machine. You'll save Netty from sending messages over the
loopback, and you'll avoid additional JVM overhead.

2) Yes, but I challenge you to compute those sizes beforehand :) Also
consider the size of the messages being produced by your algorithm. E.g.,
roughly, PageRank produces a double for each edge in the graph during each
superstep.

3) AFAIK there's no way, but I might be wrong here.

4) I'd suggest you also talk in terms of nodes. Having multiple workers per
machine gives a misleading picture of scalability in certain respects (such
as network I/O). I have been running Giraph jobs on hundreds of mappers and
around 65 machines. I know others here have done bigger numbers (~300
workers). I'd say the upper limit to scalability is your main memory at the
moment, so you might want to have a look at out-of-core graph and messages.
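
Points 1 and 2 can be put into numbers with a rough sketch like the one below. The names are hypothetical and nothing here is a Giraph API; it assumes 8 bytes per double (payload only, no serialization or object overhead), and the 8 cores per machine and one billion edges are made-up example figures.

```python
from dataclasses import dataclass

DOUBLE_BYTES = 8  # payload of one double; real messages carry extra overhead


@dataclass
class Layout:
    workers: int
    compute_threads_per_worker: int

    @property
    def map_tasks(self) -> int:
        # Workers plus one additional map task for the master.
        return self.workers + 1


def one_worker_per_machine(machines: int, cores_per_machine: int) -> Layout:
    """Point 1: one worker per machine, one compute thread per core."""
    if machines < 1 or cores_per_machine < 1:
        raise ValueError("need at least one machine and one core")
    return Layout(workers=machines, compute_threads_per_worker=cores_per_machine)


def pagerank_message_bytes_per_superstep(num_edges: int) -> int:
    """Point 2: roughly one double per edge during each superstep."""
    return num_edges * DOUBLE_BYTES


layout = one_worker_per_machine(machines=65, cores_per_machine=8)
print(layout.workers, layout.compute_threads_per_worker, layout.map_tasks)
print(pagerank_message_bytes_per_superstep(1_000_000_000))
```

Dividing the per-superstep message volume by the number of workers gives a first guess at the extra memory each worker needs on top of its share of the graph.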

Hope it helps,
Claudio

-- 
   Claudio Martella
   claudio.martella@gmail.com