Posted to user@hbase.apache.org by Ninad Raut <hb...@gmail.com> on 2009/07/24 12:30:34 UTC
Re: Why only few map tasks are running at a time inspite of plenty of
scope for remaining?
If your data is stored on just one regionserver you will have only one map
in spite of setting
conf.set("mapred.tasktracker.map.tasks.maximum", "2");
There are two approaches:

1) Do a manual table split

2) Write a custom input split which will divide the table's rows amongst
the maps.
On Fri, Jul 24, 2009 at 4:54 AM, akhil1988 <ak...@gmail.com> wrote:
>
> Hi all,
>
> I am using an HTable as input to my map jobs and my reducer outputs to
> another HTable. There are 10 regions of my input HTable. And I have set
> conf.set("mapred.tasktracker.map.tasks.maximum", "2");
> c.setNumReduceTasks(26);
> My cluster contains 15 nodes (of which 2 are masters). When I run my job,
> only 2 map tasks run at a time and the remaining 8 are shown as pending. 24
> reduce tasks (out of 26) also get started initially and the remaining 2 are
> shown as pending. I am confused why only 2 tasks are running at a time,
> though there are a total of 26 slots for map tasks.
>
> However, this does not happen when I run jobs in which I take files as
> inputs (i.e. simple MapReduce jobs not involving HBase at all). Only
> when an HTable is taken as input do fewer map tasks run concurrently than
> expected.
>
> Can anyone suggest why this is happening?
>
> What I have observed in simple MapReduce jobs is that all map tasks are
> instantiated first and then the reduce tasks. But this does not seem to
> be happening in the HTable case.
>
> --
> View this message in context:
> http://www.nabble.com/Why-only-few-map-tasks-are-running-at-a-time-inspite-of-plenty-of-scope-for-remaining--tp24636315p24636315.html
> Sent from the HBase User mailing list archive at Nabble.com.
>
>
Re: Why only few map tasks are running at a time inspite of plenty of
scope for remaining?
Posted by Ninad Raut <hb...@gmail.com>.
An InputSplit won't harm your data locality; that is taken care of in the
partitioning phase. InputSplits help you form more maps and increase your
parallelism. Here is the code:
public InputSplit[] getSplits(JobConf conf, int numSplits) throws IOException {
    List<InputSplit> splitList = new ArrayList<InputSplit>();
    int rowCount = 0;
    int numMaps = 0;
    Scanner scanner = table.getScanner(new String[] { columnName });
    try {
        Iterator<RowResult> it = scanner.iterator();
        if (!it.hasNext()) {
            return new InputSplit[0]; // empty table, nothing to split
        }
        byte[] firstRow = it.next().getRow();
        while (it.hasNext()) {
            RowResult row = it.next();
            rowCount++;
            // Start a new split every 1000 rows, up to the requested number.
            if (rowCount % 1000 == 0 && numMaps < numSplits) {
                byte[] endRow = row.getRow();
                log.info("Start:: " + new String(firstRow)
                        + " End:: " + new String(endRow));
                // Note: the location is hard-coded to "slave", which
                // defeats locality-aware scheduling.
                splitList.add(new TableSplit(cTableName.getBytes(),
                        firstRow, endRow, "slave"));
                numMaps++;
                firstRow = endRow;
            }
        }
        // Final split: an empty end row means "scan to the end of the
        // table", so rows after the last boundary are not dropped.
        splitList.add(new TableSplit(cTableName.getBytes(),
                firstRow, new byte[0], "slave"));
    } finally {
        scanner.close();
    }
    return splitList.toArray(new InputSplit[splitList.size()]);
}
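For readers without the old HBase API at hand, the boundary logic above can be sketched as a small self-contained routine. This is a hedged illustration only: hypothetical String row keys, a split boundary every 3 rows instead of every 1000, and no cap on the number of splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitSketch {

    // Mirrors the getSplits() loop above: walk sorted row keys and emit
    // [startRow, endRow) pairs, plus a final pair whose empty end row
    // stands for "scan to the end of the table".
    static List<String[]> computeSplits(List<String> sortedRowKeys, int rowsPerSplit) {
        List<String[]> splits = new ArrayList<String[]>();
        if (sortedRowKeys.isEmpty()) {
            return splits;
        }
        String first = sortedRowKeys.get(0);
        for (int i = 1; i < sortedRowKeys.size(); i++) {
            if (i % rowsPerSplit == 0) {
                String end = sortedRowKeys.get(i);
                splits.add(new String[] { first, end }); // covers [first, end)
                first = end;
            }
        }
        // Trailing split so rows after the last boundary are not dropped.
        splits.add(new String[] { first, "" });
        return splits;
    }

    public static void main(String[] args) {
        List<String[]> splits =
                computeSplits(Arrays.asList("a", "b", "c", "d", "e", "f", "g"), 3);
        for (String[] s : splits) {
            System.out.println(s[0] + " -> " + (s[1].isEmpty() ? "(table end)" : s[1]));
        }
    }
}
```

With 7 keys and a boundary every 3 rows this yields three splits: [a, d), [d, g), and [g, end of table).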
On Sat, Jul 25, 2009 at 12:54 AM, akhil1988 <ak...@gmail.com> wrote:
>
> [quoted message snipped; it appears in full as akhil1988's reply below]
>
Re: Why only few map tasks are running at a time inspite of plenty
of scope for remaining?
Posted by akhil1988 <ak...@gmail.com>.
Hi,
The 10 regions of the table are lying on just 6 region servers (out of a
total of 13 region servers). And even after setting
mapred.tasktracker.map.tasks.maximum = 4, only a total of 2 map tasks run at
a time.
Also, another observation is that the map tasks are not running on local
data, i.e. the input region for a map task lies on some other node.
Name                            Region Server                     Encoded Name  Start Key  End Key
WikiPages,,1248388318240        cn84.cloud.cs.illinois.edu:60020  1479806923               117236
WikiPages,117236,1248388318240  cn84.cloud.cs.illinois.edu:60020  1753302296    117236     13813
WikiPages,13813,1248388323072   cn77.cloud.cs.illinois.edu:60020  200507463     13813      184272
WikiPages,184272,1248388323072  cn77.cloud.cs.illinois.edu:60020  1543767328    184272     22998
WikiPages,22998,1248388310452   cn71.cloud.cs.illinois.edu:60020  1972228055    22998      29193
WikiPages,29193,1248388310452   cn71.cloud.cs.illinois.edu:60020  1630029649    29193      37870
WikiPages,37870,1248388306711   cn73.cloud.cs.illinois.edu:60020  1028558084    37870      56
WikiPages,56,1248388313083      cn82.cloud.cs.illinois.edu:60020  332484191     56         73976
WikiPages,73976,1248388316165   cn83.cloud.cs.illinois.edu:60020  231296585     73976      85491
WikiPages,85491,1248388316165   cn83.cloud.cs.illinois.edu:60020  1935329066    85491
Each region is approximately 90 MB and I have set the region max size to 128
MB. The region size is already less than the maximum, so how should I split
it?
Hadoop does show 10 map tasks to be run. How would writing a custom input
split help? Moreover, if I write a custom InputSplit to divide the rows, I
will end up giving rows lying in different regions to a map task (as the
rows are not in sorted order).
Thanks,
--Akhil
Ninad Raut-2 wrote:
>
> If your data is stored on just one regionserver you will have only one map
> in spite of setting
> conf.set("mapred.tasktracker.map.tasks.maximum", "2");
> there are two approaches:
>
> 1) Do a manual table split
>
> 2) Write a custom input split which will divide the table's rows amongst
> the maps.
>
>
>> On Fri, Jul 24, 2009 at 4:54 AM, akhil1988 <ak...@gmail.com> wrote:
>>
>>> [original message snipped; it is quoted in full at the top of the thread]
>>
>
>
--
View this message in context: http://www.nabble.com/Why-only-few-map-tasks-are-running-at-a-time-inspite-of-plenty-of-scope-for-remaining--tp24636315p24650457.html
Sent from the HBase User mailing list archive at Nabble.com.
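One clarification on the worry above that a custom split would hand one map rows from different, non-contiguous regions: HBase stores rows in sorted byte order, so the region start keys in the listing, which look shuffled when read as numbers, are in fact ordered lexicographically ("117236" sorts before "13813" because '1' < '3' at the second byte). A split defined by a [start row, end row) range therefore always covers one contiguous run of rows. A quick sanity check of the start keys from the table above:

```java
import java.util.Arrays;
import java.util.List;

public class RegionKeyOrder {
    public static void main(String[] args) {
        // Start keys copied from the region listing above, in table order.
        List<String> startKeys = Arrays.asList(
                "", "117236", "13813", "184272", "22998",
                "29193", "37870", "56", "73976", "85491");
        boolean sorted = true;
        for (int i = 1; i < startKeys.size(); i++) {
            // String.compareTo is lexicographic, matching HBase's byte
            // ordering for these ASCII keys.
            if (startKeys.get(i - 1).compareTo(startKeys.get(i)) > 0) {
                sorted = false;
            }
        }
        System.out.println(sorted
                ? "start keys are already in lexicographic order"
                : "start keys are out of order");
    }
}
```

So a custom InputSplit whose boundaries are row keys never mixes rows from non-adjacent regions; the ordering only looks wrong if the keys are read as numbers.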
Re: Why only few map tasks are running at a time inspite of plenty of
scope for remaining?
Posted by Ninad Raut <hb...@gmail.com>.
If your data is stored on just one regionserver you will have only one map
in spite of setting
conf.set("mapred.tasktracker.map.tasks.maximum", "2");
There are two approaches:

1) Do a manual table split

2) Write a custom input split which will divide the table's rows amongst
the maps.
> On Fri, Jul 24, 2009 at 4:54 AM, akhil1988 <ak...@gmail.com> wrote:
>
>> [original message snipped; it is quoted in full at the top of the thread]
>