Posted to user@hama.apache.org by Benedikt Elser <el...@disi.unitn.it> on 2012/12/05 11:43:46 UTC

Partitioning

Hi List,

I am using the hama-0.6.0 release to run graph jobs on various input graphs in an EC2-based cluster of size 12. However, as I see in the logs, not every node in the cluster contributes to the job (they have no tasklog/job<ID> dir and are idle). Theoretically, a distribution of 1 million nodes across 12 buckets should hit every node at least once, so I think it's a configuration problem. So far I have experimented with these settings:

   <name>bsp.max.tasks.per.job</name>
   <name>bsp.local.tasks.maximum</name>
   <name>bsp.tasks.maximum</name>
   <name>bsp.child.java.opts</name>

Setting bsp.local.tasks.maximum to 1 and bsp.max.tasks.per.job to 12 did not have the desired effect. I also split the input into 12 files (because of an issue in 0.5 that was fixed in 0.6).

Could you recommend some settings, or walk me through the system's partitioning decision? I thought it would be:

Input -> input splits based on the input and the max* conf values -> number of tasks
HashPartitioner.class distributes IDs across that number of tasks.
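As a minimal sketch of that last step (this mirrors the idea behind a HashPartitioner-style scheme, not Hama's exact code), vertex IDs map to tasks like this:

```java
// Sketch of hash partitioning: assigns a vertex ID to one of numTasks
// buckets. Mirrors the idea behind Hama's HashPartitioner, not its
// exact implementation.
public class HashPartitionSketch {

    public static int partitionFor(String vertexId, int numTasks) {
        // Math.abs guards against negative hash codes (still wrong for
        // Integer.MIN_VALUE; a real implementation might mask the sign
        // bit instead).
        return Math.abs(vertexId.hashCode() % numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 12;
        for (String id : new String[] {"vertex-1", "vertex-2", "vertex-3"}) {
            System.out.println(id + " -> task " + partitionFor(id, numTasks));
        }
    }
}
```

With a million vertex IDs spread over 12 buckets, every bucket index is hit with overwhelming probability, so idle grooms point at the number of tasks, not at the partitioner.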

Thanks,

Benedikt

Re: Partitioning

Posted by Thomas Jungblut <th...@gmail.com>.
Exactly,

maybe you first want to read up on the different modes and how they are
configured:

http://wiki.apache.org/hama/GettingStarted#Modes

We also have some nice documentation as a PDF, which you can get here:

http://wiki.apache.org/hama/GettingStarted#Hama_0.6.0

The configuration property to change the number of tasks on every host is
"bsp.tasks.maximum", which is described as "The maximum number of BSP tasks
that will be run simultaneously by a groom server."

Setting this to 1 on every host where a groom server starts, and then
restarting your cluster, should do what you want to achieve.
I can recommend Puppet for maintaining these kinds of configurations.
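For reference, setting it in hama-site.xml on each groom host would look roughly like this (a sketch; the description text is quoted from the default configuration):

```xml
<!-- hama-site.xml on each groom host (sketch) -->
<property>
  <name>bsp.tasks.maximum</name>
  <value>1</value>
  <description>The maximum number of BSP tasks that will be run
  simultaneously by a groom server.</description>
</property>
```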

If you need a more formal complexity model for BSP applications, let me
know; I have derived one from Rob Bisseling's BSP model that fits
Apache Hama's style of computation better.




Re: Partitioning

Posted by Benedikt Elser <el...@disi.unitn.it>.
Ah, local mode, Bingo! 

About the communication costs: yes, I am aware of these; however, this is exactly what I want to test in the first place :) Hence I would need a bsp.distributed.tasks.maximum

Thanks for the clarifications,

Benedikt



Re: Partitioning

Posted by Thomas Jungblut <th...@gmail.com>.
Because the property is called "local". It doesn't affect the distributed
mode.
Note that spreading multiple tasks across different host machines can hurt,
because it increases your communication costs.
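To spell out the distinction between the two similarly named properties (a sketch; the values are examples):

```xml
<!-- Controls tasks in *local* mode only; ignored in distributed mode -->
<property>
  <name>bsp.local.tasks.maximum</name>
  <value>1</value>
</property>

<!-- Controls tasks per groom server in *distributed* mode -->
<property>
  <name>bsp.tasks.maximum</name>
  <value>1</value>
</property>
```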


Re: Partitioning

Posted by Benedikt Elser <el...@disi.unitn.it>.
Thank you, I will try that. However, if I set bsp.local.tasks.maximum to 1, why doesn't it distribute one task to each machine?



Re: Partitioning

Posted by Thomas Jungblut <th...@gmail.com>.
So it will spawn 12 tasks. If this doesn't satisfy the load on your
machines, try using smaller block sizes.
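The 12 tasks line up with one input split per HDFS block. A back-of-the-envelope sketch (assumption: FileInputFormat-style splitting with one task per block and no split combining):

```java
// Sketch: estimates how many input splits (and hence tasks) one file
// produces, assuming one split per HDFS block with no combining.
public class SplitCountSketch {

    public static long splitsForFile(long fileSizeBytes, long blockSizeBytes) {
        // Ceiling division: e.g. a ~4 MB file at a 64 MB block size is 1 split.
        return Math.max(1, (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes);
    }

    public static void main(String[] args) {
        long fileSize = 4089898L;               // avg block size reported for the 12 input files
        long defaultBlock = 64L * 1024 * 1024;  // 64 MB (Hadoop 1.x default)
        long smallBlock = 1L * 1024 * 1024;     // 1 MB
        System.out.println("splits per file at 64 MB blocks: "
                + splitsForFile(fileSize, defaultBlock));
        System.out.println("splits per file at 1 MB blocks:  "
                + splitsForFile(fileSize, smallBlock));
    }
}
```

So 12 files at one block each gives 12 tasks; shrinking the block size multiplies the split count per file.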


Re: Partitioning

Posted by Benedikt Elser <el...@disi.unitn.it>.
Hi,

thanks for your reply!

 Total size:    49078776 B
 Total dirs:    1
 Total files:   12
 Total blocks (validated):      12 (avg. block size 4089898 B)

Benedikt



Re: Partitioning

Posted by Thomas Jungblut <th...@gmail.com>.
So how many blocks does your data have in HDFS?
