Posted to user@hbase.apache.org by Ninad Raut <hb...@gmail.com> on 2009/07/24 12:30:34 UTC

Re: Why only few map tasks are running at a time inspite of plenty of scope for remaining?

If your data is stored on just one region server you will have only one
map in spite of setting
 conf.set("mapred.tasktracker.map.tasks.maximum", "2");
There are two approaches:

1) Do a manual table split.

2) Write a custom input split which will divide the table's rows amongst
the maps.

On Fri, Jul 24, 2009 at 4:54 AM, akhil1988 <ak...@gmail.com> wrote:

>
> Hi all,
>
> I am using an HTable as input to my map jobs and my reducer outputs to
> another HTable. There are 10 regions in my input HTable, and I have set
>        conf.set("mapred.tasktracker.map.tasks.maximum", "2");
>        conf.set("mapred.tasktracker.map.tasks.maximum", "2");
>        c.setNumReduceTasks(26);
> My cluster contains 15 nodes (of which 2 are masters). When I run my
> job, only 2 map tasks run at a time and the remaining 8 are shown as
> pending. 24 reduce tasks (out of 26) also get started initially and the
> remaining 2 are shown as pending. I am confused why only 2 map tasks are
> running at a time, though there are a total of 26 slots for map tasks.
>
> However, this does not happen when I run jobs that take files as input
> (i.e. plain MapReduce jobs not involving HBase at all). Only when an
> HTable is taken as input do fewer map tasks run concurrently than
> expected.
>
> Can anyone suggest why this is happening?
>
> What I have observed in plain MapReduce jobs is that all map tasks are
> instantiated first and then the reduce tasks, but this does not seem to
> happen in the HTable case.
>
> --
> View this message in context:
> http://www.nabble.com/Why-only-few-map-tasks-are-running-at-a-time-inspite-of-plenty-of-scope-for-remaining--tp24636315p24636315.html
> Sent from the HBase User mailing list archive at Nabble.com.
>
>

Re: Why only few map tasks are running at a time inspite of plenty of scope for remaining?

Posted by Ninad Raut <hb...@gmail.com>.
An InputSplit won't harm your data locality; that is taken care of in the
partitioning phase. An InputSplit helps you create more maps and increase
your parallelism. Here is the code:
public InputSplit[] getSplits(JobConf conf, int numSplits) throws IOException {
  // Walk the table once and cut a split boundary every 1000 rows,
  // capped at numSplits splits.
  int rowCount = 0;
  int numMaps = 0;

  List<InputSplit> splitList = new ArrayList<InputSplit>();
  Scanner scanner = table.getScanner(new String[] { columnName });
  Iterator<RowResult> ie = scanner.iterator();
  byte[] firstRow = ie.next().getRow();

  while (ie.hasNext()) {
    RowResult row = ie.next();
    rowCount++;
    if (rowCount % 1000 == 0 && numMaps < numSplits) {
      byte[] endRow = row.getRow();
      log.info("Start:: " + new String(firstRow) + " End:: " + new String(endRow));
      splitList.add(new TableSplit(cTableName.getBytes(), firstRow, endRow, "slave"));
      numMaps++;
      firstRow = endRow;
    }
  }
  // Note: rows scanned after the last emitted boundary are not covered
  // by any split.
  scanner.close();

  InputSplit[] splits = splitList.toArray(new InputSplit[splitList.size()]);
  log.info("getSplits() returning " + splits.length + " splits");
  return splits;
}
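The cut-every-1000-rows logic above can be exercised without HBase at all. Below is a minimal, self-contained sketch of the same boundary arithmetic over a plain list of row keys (the class and method names are illustrative only, not part of any HBase API):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitBoundaries {

    // Same policy as the getSplits() above: emit a (start, end) key pair
    // every rowsPerSplit rows, capped at maxSplits pairs. As in the scan
    // loop above, rows after the last emitted boundary are not covered.
    static List<String[]> boundaries(List<String> rowKeys,
                                     int rowsPerSplit, int maxSplits) {
        List<String[]> splits = new ArrayList<String[]>();
        if (rowKeys.isEmpty()) {
            return splits;
        }
        String first = rowKeys.get(0);
        int count = 0;
        for (int k = 1; k < rowKeys.size(); k++) {
            count++;
            if (count % rowsPerSplit == 0 && splits.size() < maxSplits) {
                String end = rowKeys.get(k);
                splits.add(new String[] { first, end });
                first = end;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<String>();
        for (int k = 0; k < 3500; k++) {
            keys.add(String.format("row%05d", k));
        }
        // 3500 rows with a boundary every 1000 rows -> 3 splits.
        System.out.println(boundaries(keys, 1000, 10).size());
    }
}
```

With 10 regions and cuts every ~1000 rows, this hands Hadoop more and smaller splits than the default one split per region, which is the parallelism gain described above.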


On Sat, Jul 25, 2009 at 12:54 AM, akhil1988 <ak...@gmail.com> wrote:


Re: Why only few map tasks are running at a time inspite of plenty of scope for remaining?

Posted by akhil1988 <ak...@gmail.com>.
Hi,

The 10 regions of the table are lying on just 6 region servers (out of a
total of 13 region servers). And even after setting
mapred.tasktracker.map.tasks.maximum = 4, only a total of 2 map tasks run
at a time.

Also, another observation is that the map tasks are not running on local
data, i.e. the input region for a map task lies on some other node.

Name                             Region Server                      Encoded Name   Start Key   End Key
WikiPages,,1248388318240         cn84.cloud.cs.illinois.edu:60020   1479806923                 117236
WikiPages,117236,1248388318240   cn84.cloud.cs.illinois.edu:60020   1753302296     117236      13813
WikiPages,13813,1248388323072    cn77.cloud.cs.illinois.edu:60020   200507463      13813       184272
WikiPages,184272,1248388323072   cn77.cloud.cs.illinois.edu:60020   1543767328     184272      22998
WikiPages,22998,1248388310452    cn71.cloud.cs.illinois.edu:60020   1972228055     22998       29193
WikiPages,29193,1248388310452    cn71.cloud.cs.illinois.edu:60020   1630029649     29193       37870
WikiPages,37870,1248388306711    cn73.cloud.cs.illinois.edu:60020   1028558084     37870       56
WikiPages,56,1248388313083       cn82.cloud.cs.illinois.edu:60020   332484191      56          73976
WikiPages,73976,1248388316165    cn83.cloud.cs.illinois.edu:60020   231296585      73976       85491
WikiPages,85491,1248388316165    cn83.cloud.cs.illinois.edu:60020   1935329066     85491


Each region is approximately 90 MB and I have set the maximum region size
to 128 MB. Each region is already smaller than the maximum, so how should
I split it?

Hadoop does show 10 map tasks to be run. How would writing a custom input
split help? Moreover, if I write a custom InputSplit to divide the rows, I
will end up giving rows that lie in different regions to one map task (as
the rows are not in sorted order).
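One note on the key order in the table above: HBase compares row keys lexicographically as raw bytes, so numeric-looking keys do not sort numerically (13813 sorts after 117236). A quick self-contained check (plain Java, no HBase needed; the class name is illustrative):

```java
import java.util.Arrays;

public class KeyOrder {
    public static void main(String[] args) {
        // The region start keys from the table above, in the order listed.
        String[] startKeys = { "117236", "13813", "184272", "22998", "29193",
                               "37870", "56", "73976", "85491" };
        String[] sorted = startKeys.clone();
        Arrays.sort(sorted); // String.compareTo is lexicographic
        // The listed order is exactly the lexicographic order.
        System.out.println(Arrays.equals(startKeys, sorted)); // prints "true"
    }
}
```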

Thanks,
--Akhil


-- 
View this message in context: http://www.nabble.com/Why-only-few-map-tasks-are-running-at-a-time-inspite-of-plenty-of-scope-for-remaining--tp24636315p24650457.html
Sent from the HBase User mailing list archive at Nabble.com.


Re: Why only few map tasks are running at a time inspite of plenty of scope for remaining?

Posted by Ninad Raut <hb...@gmail.com>.
If your data is stored on just one region server you will have only one
map in spite of setting
 conf.set("mapred.tasktracker.map.tasks.maximum", "2");
There are two approaches:

1) Do a manual table split.

2) Write a custom input split which will divide the table's rows amongst
the maps.

