Posted to user@giraph.apache.org by Arjun Sharma <as...@gmail.com> on 2015/04/24 02:15:06 UTC

Re: Optimal configuration for Giraph on YARN

Just bumping this thread, as I have the same question as Steven.

Steven, did you find out whether setting both mapreduce.map.cpu.vcores and
yarn.nodemanager.resource.cpu-vcores is required? What happens if they are
not set while giraph.numComputeThreads is set? Are there any
other parameters that must be set to make sure we are *really*
using the cores, rather than just multi-threading on a single core?


On Wed, Mar 18, 2015 at 11:48 AM, Steven Harenberg <sd...@ncsu.edu>
wrote:

> Hi all,
>
> Previously with MapReduceV1, the suggestion was to have a 1:1
> correspondence between workers and compute nodes (machines) and to set the
> number of threads to be the number of cores per machine. To achieve
> this configuration, we would set "mapred.tasktracker.map.tasks.maximum=1".
> Since workers correspond to mappers, this ensured there was one worker
> per machine.
>
> Now I am reading that with YARN this property no longer exists, as there
> aren't tasktrackers. Instead, we have the global properties
> "yarn.nodemanager.resource.cpu-vcores", which specifies the cores _per
> node_, and the property "mapreduce.map.cpu.vcores", which specifies the
> cores _per map task_.
>
> If we want to have one mapper per node that is fully utilizing the
> machine, I assume we should just set mapreduce.map.cpu.vcores =
> yarn.nodemanager.resource.cpu-vcores = the # of cores per node. Is this
> correct?
>
> Do I still need to set giraph.numComputeThreads to be the number of cores
> per node?
>
> Thanks,
> Steve
>

Re: Optimal configuration for Giraph on YARN

Posted by Steven Harenberg <sd...@ncsu.edu>.
I would guess the same, but I don't know for sure.

--Steve

On Wed, Apr 29, 2015 at 12:21 PM, Arjun Sharma <as...@gmail.com> wrote:

> Hi Steven,
>
> Thank you so much for your detailed reply! Actually, my second question
> was about what happens if we do not set mapreduce.map.cpu.vcores (defaults
> to 1) or yarn.nodemanager.resource.cpu-vcores (defaults to 8), while we
> set giraph.numComputeThreads (say, to 16). I expect every worker will run
> 16 threads on 1 core, but wanted to see if you have the same understanding.
>
> Thanks,
> Arjun.
>
> On Wed, Apr 29, 2015 at 8:50 AM, Steven Harenberg <sd...@ncsu.edu>
> wrote:
>
>> Hey Arjun,
>>
>> I am glad someone finally responded to this thread. I am surprised no one
>> else is trying to figure out these configuration settings...
>>
>> Here is my understanding of your questions (though I am not sure they are
>> right):
>>
>>
>> *Is setting both mapreduce.map.cpu.vcores and
>> yarn.nodemanager.resource.cpu-vcores required?*
>>
>> Yes, I believe you need both of these set or else they will revert to
>> default values. Importantly, I think you should set these to the same value
>> so that you spawn one mapper/giraph-worker per machine (as this was said to
>> be optimal).
>>
>> Since I have 32 cores per machine, I have set both of these values to 32,
>> and this has worked to spawn only one worker per machine (unless I try to
>> have a worker share a machine with the master).
>>
>> Check this page out:
>> http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
>>
>>
>> *What happens if they are not set, while giraph.numComputeThreads is set?*
>>
>> The above parameters specify how many cores per machine you are allowing
>> for workers AND how many cores one worker will use. If you don't set
>> *giraph.numComputeThreads* then the worker will use the default number (I
>> think that is 1) despite possibly being allocated more cores. Hence, I set
>> *giraph.numComputeThreads*, *giraph.numInputThreads*, and
>> *giraph.numOutputThreads* to be the same as the above two parameters, the
>> total number of cores in one machine (for me, 32).
>>
>> Giraph is never going to fully utilize the entire machine, so I don't
>> think it's really possible to tell if these are the correct settings, but
>> all of this seems reasonable based on my experience and how these
>> parameters are defined.
>>
>>
>>
>> *Are there any other parameters that must be set in order to make sure we
>> are *really* using the cores, not just multi-threading on a single core?*
>>
>> No idea, but the above parameters and some memory configurations are all
>> I set. The memory configurations are worse in my opinion, as I was running
>> into memory issues and ended up having to manually set the following
>> parameters:
>>
>>    - yarn.nodemanager.resource.memory-mb
>>    - yarn.scheduler.minimum-allocation-mb
>>    - yarn.scheduler.maximum-allocation-mb
>>    - mapreduce.map.memory.mb
>>    - -yh (in Giraph arguments)
>>
>> All of these were required to be manually set to get Giraph to run
>> without having memory issues.
>>
>> Best regards,
>> Steve
>>
>>
>> On Thu, Apr 23, 2015 at 8:15 PM, Arjun Sharma <as...@gmail.com> wrote:
>>
>>> Just bumping this thread, as I have the same question as
>>> Steven.
>>>
>>> Steven, did you find out whether setting both mapreduce.map.cpu.vcores
>>> and yarn.nodemanager.resource.cpu-vcores is required? What happens if
>>> they are not set while giraph.numComputeThreads is set? Are there any
>>> other parameters that must be set to make sure we are *really*
>>> using the cores, rather than just multi-threading on a single core?
>>>
>>>
>>> On Wed, Mar 18, 2015 at 11:48 AM, Steven Harenberg <sd...@ncsu.edu>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Previously with MapReduceV1, the suggestion was to have a 1:1
>>>> correspondence between workers and compute nodes (machines) and to set
>>>> the number of threads to be the number of cores per machine. To achieve
>>>> this configuration, we would set "mapred.tasktracker.map.tasks.maximum=1".
>>>> Since workers correspond to mappers, this ensured there was one worker
>>>> per machine.
>>>>
>>>> Now I am reading that with YARN this property no longer exists, as there
>>>> aren't tasktrackers. Instead, we have the global properties
>>>> "yarn.nodemanager.resource.cpu-vcores", which specifies the cores _per
>>>> node_, and the property "mapreduce.map.cpu.vcores", which specifies the
>>>> cores _per map task_.
>>>>
>>>> If we want to have one mapper per node that is fully utilizing the
>>>> machine, I assume we should just set mapreduce.map.cpu.vcores =
>>>> yarn.nodemanager.resource.cpu-vcores = the # of cores per node. Is this
>>>> correct?
>>>>
>>>> Do I still need to set giraph.numComputeThreads to be the number of
>>>> cores per node?
>>>>
>>>> Thanks,
>>>> Steve
>>>>
>>>
>>>
>>
>

Re: Optimal configuration for Giraph on YARN

Posted by Arjun Sharma <as...@gmail.com>.
Hi Steven,

Thank you so much for your detailed reply! Actually, my second question was
about what happens if we do not set mapreduce.map.cpu.vcores (defaults to 1)
or yarn.nodemanager.resource.cpu-vcores (defaults to 8), while we set
giraph.numComputeThreads (say, to 16). I expect every worker will run 16
threads on 1 core, but wanted to see if you have the same understanding.

Thanks,
Arjun.

On Wed, Apr 29, 2015 at 8:50 AM, Steven Harenberg <sd...@ncsu.edu> wrote:

> Hey Arjun,
>
> I am glad someone finally responded to this thread. I am surprised no one
> else is trying to figure out these configuration settings...
>
> Here is my understanding of your questions (though I am not sure they are
> right):
>
>
> *Is setting both mapreduce.map.cpu.vcores and
> yarn.nodemanager.resource.cpu-vcores required?*
>
> Yes, I believe you need both of these set or else they will revert to
> default values. Importantly, I think you should set these to the same value
> so that you spawn one mapper/giraph-worker per machine (as this was said to
> be optimal).
>
> Since I have 32 cores per machine, I have set both of these values to 32,
> and this has worked to spawn only one worker per machine (unless I try to
> have a worker share a machine with the master).
>
> Check this page out:
> http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
>
>
> *What happens if they are not set, while giraph.numComputeThreads is set?*
>
> The above parameters specify how many cores per machine you are allowing
> for workers AND how many cores one worker will use. If you don't set
> *giraph.numComputeThreads* then the worker will use the default number (I
> think that is 1) despite possibly being allocated more cores. Hence, I set
> *giraph.numComputeThreads*, *giraph.numInputThreads*, and
> *giraph.numOutputThreads* to be the same as the above two parameters, the
> total number of cores in one machine (for me, 32).
>
> Giraph is never going to fully utilize the entire machine, so I don't
> think it's really possible to tell if these are the correct settings, but
> all of this seems reasonable based on my experience and how these
> parameters are defined.
>
>
>
> *Are there any other parameters that must be set in order to make sure we
> are *really* using the cores, not just multi-threading on a single core?*
>
> No idea, but the above parameters and some memory configurations are all I
> set. The memory configurations are worse in my opinion, as I was running
> into memory issues and ended up having to manually set the following
> parameters:
>
>    - yarn.nodemanager.resource.memory-mb
>    - yarn.scheduler.minimum-allocation-mb
>    - yarn.scheduler.maximum-allocation-mb
>    - mapreduce.map.memory.mb
>    - -yh (in Giraph arguments)
>
> All of these were required to be manually set to get Giraph to run without
> having memory issues.
>
> Best regards,
> Steve
>
>
> On Thu, Apr 23, 2015 at 8:15 PM, Arjun Sharma <as...@gmail.com> wrote:
>
>> Just bumping this thread, as I have the same question as Steven.
>>
>> Steven, did you find out whether setting both mapreduce.map.cpu.vcores
>> and yarn.nodemanager.resource.cpu-vcores is required? What happens if
>> they are not set while giraph.numComputeThreads is set? Are there any
>> other parameters that must be set to make sure we are *really*
>> using the cores, rather than just multi-threading on a single core?
>>
>>
>> On Wed, Mar 18, 2015 at 11:48 AM, Steven Harenberg <sd...@ncsu.edu>
>> wrote:
>>
>>> Hi all,
>>>
>>> Previously with MapReduceV1, the suggestion was to have a 1:1
>>> correspondence between workers and compute nodes (machines) and to set
>>> the number of threads to be the number of cores per machine. To achieve
>>> this configuration, we would set "mapred.tasktracker.map.tasks.maximum=1".
>>> Since workers correspond to mappers, this ensured there was one worker
>>> per machine.
>>>
>>> Now I am reading that with YARN this property no longer exists, as there
>>> aren't tasktrackers. Instead, we have the global properties
>>> "yarn.nodemanager.resource.cpu-vcores", which specifies the cores _per
>>> node_, and the property "mapreduce.map.cpu.vcores", which specifies the
>>> cores _per map task_.
>>>
>>> If we want to have one mapper per node that is fully utilizing the
>>> machine, I assume we should just set mapreduce.map.cpu.vcores =
>>> yarn.nodemanager.resource.cpu-vcores = the # of cores per node. Is this
>>> correct?
>>>
>>> Do I still need to set giraph.numComputeThreads to be the number of
>>> cores per node?
>>>
>>> Thanks,
>>> Steve
>>>
>>
>>
>

Re: Optimal configuration for Giraph on YARN

Posted by Steven Harenberg <sd...@ncsu.edu>.
Hey Arjun,

I am glad someone finally responded to this thread. I am surprised no one
else is trying to figure out these configuration settings...

Here is my understanding of your questions (though I am not sure they are
right):


*Is setting both mapreduce.map.cpu.vcores and
yarn.nodemanager.resource.cpu-vcores required?*

Yes, I believe you need both of these set or else they will revert to
default values. Importantly, I think you should set these to the same value
so that you spawn one mapper/giraph-worker per machine (as this was said to
be optimal).

Since I have 32 cores per machine, I have set both of these values to 32,
and this has worked to spawn only one worker per machine (unless I try to
have a worker share a machine with the master).
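
For what it's worth, here is a rough sketch of what that looks like on my
end. The two property names are the real Hadoop/YARN keys, but the jar,
computation class, and paths below are just illustrative placeholders, and
yarn.nodemanager.resource.cpu-vcores lives in yarn-site.xml on each node
rather than on the command line:

```shell
# Sketch only: property names are real; jar/class/paths are placeholders.
#
# yarn-site.xml on every NodeManager (32-core machines) -- a cluster-wide
# setting, not overridable per job:
#   <property>
#     <name>yarn.nodemanager.resource.cpu-vcores</name>
#     <value>32</value>
#   </property>
#
# Per job, request all 32 vcores for each map task (Giraph worker):
hadoop jar giraph-examples.jar org.apache.giraph.GiraphRunner \
  -Dmapreduce.map.cpu.vcores=32 \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/steve/input/graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/steve/output/sssp \
  -w 4
```

With both values equal, the scheduler can only fit one such map task per
node, which is what gives the one-worker-per-machine layout.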

Check this page out:
http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/


*What happens if they are not set, while giraph.numComputeThreads is set?*

The above parameters specify how many cores per machine you are allowing
for workers AND how many cores one worker will use. If you don't set
*giraph.numComputeThreads* then the worker will use the default number (I
think that is 1) despite possibly being allocated more cores. Hence, I set
*giraph.numComputeThreads*, *giraph.numInputThreads*, and
*giraph.numOutputThreads* to be the same as the above two parameters, the
total number of cores in one machine (for me, 32).

Giraph is never going to fully utilize the entire machine, so I don't think
it's really possible to tell if these are the correct settings, but all of
this seems reasonable based on my experience and how these parameters are
defined.
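
As a sketch, those three thread settings can be passed as Giraph custom
arguments via the -ca flag (the giraph.* keys are the real property names;
the jar, class, and paths are placeholders again):

```shell
# Sketch: -ca sets a custom Giraph property; 32 matches the per-node vcore
# count above so threads roughly map one-to-one onto cores.
hadoop jar giraph-examples.jar org.apache.giraph.GiraphRunner \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -ca giraph.numComputeThreads=32 \
  -ca giraph.numInputThreads=32 \
  -ca giraph.numOutputThreads=32 \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/steve/input/graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/steve/output/sssp \
  -w 4
```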



*Are there any other parameters that must be set in order to make sure we
are *really* using the cores, not just multi-threading on a single core?*

No idea, but the above parameters and some memory configurations are all I
set. The memory configurations are worse in my opinion, as I was running
into memory issues and ended up having to manually set the following
parameters:

   - yarn.nodemanager.resource.memory-mb
   - yarn.scheduler.minimum-allocation-mb
   - yarn.scheduler.maximum-allocation-mb
   - mapreduce.map.memory.mb
   - -yh (in Giraph arguments)

All of these were required to be manually set to get Giraph to run without
having memory issues.
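
To make that concrete, here is a hedged sketch of where each memory setting
lives; the numbers are purely illustrative (a 64 GB node), not
recommendations, and you will have to tune them for your own cluster:

```shell
# Sketch with illustrative values for a 64 GB node.
#
# yarn-site.xml (cluster-wide NodeManager/scheduler settings):
#   yarn.nodemanager.resource.memory-mb  = 57344  # RAM YARN may hand out per node
#   yarn.scheduler.minimum-allocation-mb = 1024   # smallest container granted
#   yarn.scheduler.maximum-allocation-mb = 57344  # largest container granted
#
# Per job: container size for each map task (Giraph worker), plus Giraph's
# -yh heap argument (in MB); keep -yh below the container size so there is
# headroom for non-heap memory.
hadoop jar giraph-examples.jar org.apache.giraph.GiraphRunner \
  -Dmapreduce.map.memory.mb=57344 \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -yh 51200 \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/steve/input/graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/steve/output/sssp \
  -w 4
```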

Best regards,
Steve

On Thu, Apr 23, 2015 at 8:15 PM, Arjun Sharma <as...@gmail.com> wrote:

> Just bumping this thread, as I have the same question as Steven.
>
> Steven, did you find out whether setting both mapreduce.map.cpu.vcores
> and yarn.nodemanager.resource.cpu-vcores is required? What happens if
> they are not set while giraph.numComputeThreads is set? Are there any
> other parameters that must be set to make sure we are *really*
> using the cores, rather than just multi-threading on a single core?
>
>
> On Wed, Mar 18, 2015 at 11:48 AM, Steven Harenberg <sd...@ncsu.edu>
> wrote:
>
>> Hi all,
>>
>> Previously with MapReduceV1, the suggestion was to have a 1:1
>> correspondence between workers and compute nodes (machines) and to set
>> the number of threads to be the number of cores per machine. To achieve
>> this configuration, we would set "mapred.tasktracker.map.tasks.maximum=1".
>> Since workers correspond to mappers, this ensured there was one worker
>> per machine.
>>
>> Now I am reading that with YARN this property no longer exists, as there
>> aren't tasktrackers. Instead, we have the global properties
>> "yarn.nodemanager.resource.cpu-vcores", which specifies the cores _per
>> node_, and the property "mapreduce.map.cpu.vcores", which specifies the
>> cores _per map task_.
>>
>> If we want to have one mapper per node that is fully utilizing the
>> machine, I assume we should just set mapreduce.map.cpu.vcores =
>> yarn.nodemanager.resource.cpu-vcores = the # of cores per node. Is this
>> correct?
>>
>> Do I still need to set giraph.numComputeThreads to be the number of cores
>> per node?
>>
>> Thanks,
>> Steve
>>
>
>