Posted to user@hbase.apache.org by Avery Ching <ac...@yahoo-inc.com> on 2011/04/09 18:15:56 UTC

TableInputFormat and number of mappers == number of regions

Hi,

First off, I'd like to say thanks to the developers for HBase, it's been fun to work with.

I've been using TableInputFormat to run a Map-Reduce job and ran into an issue.

Exception in thread "main" org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.io.IOException: The number of tasks for this job 149624 exceeds the configured limit 100000

The table I'm accessing has 149,624 regions, but my Hadoop instance won't allow me to start a job with that many map tasks.  After briefly looking at the TableInputFormatBase code, it appears that since a TableSplit only knows about a single region, my job is forced into having mappers == # of regions.  Since the Hadoop instance I'm using is shared, I'm concerned that even if the configured limit were raised, jobs with so many mappers would eventually wreak havoc on the JobTracker.

Given that I have no control over the number of regions in the table (it is maintained by someone else), is the only solution to implement another input format (e.g. a MultiRegionTableFormat) that allows each InputSplit to cover more than one region?  I don't mind doing it, but I didn't want to write it if another solution already exists.
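The grouping I have in mind could look roughly like this (a sketch only; the RegionGrouper name, the fixed split count, and the integer region indices are illustrative, not an existing HBase API — a real MultiRegionTableFormat would carry start/stop row keys and locations per region):

```java
import java.util.ArrayList;
import java.util.List;

public class RegionGrouper {
  /**
   * Pack a table's regions into at most maxSplits input splits, so
   * mappers == maxSplits instead of mappers == regions.  Returns, for
   * each split, the list of region indices it covers; split sizes
   * differ by at most one region, larger splits first.
   */
  public static List<List<Integer>> group(int numRegions, int maxSplits) {
    int numSplits = Math.min(numRegions, maxSplits);
    List<List<Integer>> splits = new ArrayList<>();
    int region = 0;
    for (int i = 0; i < numSplits; i++) {
      // Spread the remaining regions over the remaining splits,
      // rounding up so the bigger splits come first.
      int size = (numRegions - region + (numSplits - i) - 1) / (numSplits - i);
      List<Integer> split = new ArrayList<>();
      for (int j = 0; j < size; j++) {
        split.add(region++);
      }
      splits.add(split);
    }
    return splits;
  }
}
```

With 149,624 regions and a cap of, say, 1,000 splits, each split would scan roughly 150 adjacent regions in sequence.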

Apologies if this issue has been raised before, but a quick search didn't turn anything up for me.

Thanks,

Avery

Re: TableInputFormat and number of mappers == number of regions

Posted by Avery Ching <ac...@yahoo-inc.com>.
I found that the code still exists in this codebase for the old mapred interface:

src/main/java/org/apache/hadoop/hbase/mapred/TableInputFormatBase.java

I'll adapt it for my needs.  Thanks!

Avery



Re: TableInputFormat and number of mappers == number of regions

Posted by Jean-Daniel Cryans <jd...@apache.org>.
It's weird, I thought we had already done something like that, and it seems
that the old TableInputFormatBase does it but the new one doesn't. From
its javadoc:

   * Splits are created in number equal to the smallest between numSplits and
   * the number of {@link HRegion}s in the table. If the number of splits is
   * smaller than the number of {@link HRegion}s then splits are spanned across
   * multiple {@link HRegion}s and are grouped the most evenly possible. In the
   * case splits are uneven the bigger splits are placed first in the
   * {@link InputSplit} array.

J-D
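The spanning behavior that javadoc describes can be sketched like so (a rough illustration; the SpanningSplits name and plain-string keys are made up for the example — the real code works on byte[] row keys and HRegion boundaries):

```java
import java.util.ArrayList;
import java.util.List;

public class SpanningSplits {
  /**
   * startKeys[i] and endKeys[i] bound region i, in table order.
   * Produces min(numSplits, numRegions) splits; when there are fewer
   * splits than regions, each split spans a run of adjacent regions,
   * covering [first region's start key, last region's end key), with
   * the bigger splits placed first as the javadoc promises.
   */
  public static List<String[]> span(String[] startKeys, String[] endKeys,
                                    int numSplits) {
    int regions = startKeys.length;
    int splits = Math.min(numSplits, regions);
    List<String[]> result = new ArrayList<>();
    int next = 0;
    for (int i = 0; i < splits; i++) {
      // Remaining regions over remaining splits, rounded up.
      int size = (regions - next + (splits - i) - 1) / (splits - i);
      // One scan range covering `size` adjacent regions.
      result.add(new String[] { startKeys[next], endKeys[next + size - 1] });
      next += size;
    }
    return result;
  }
}
```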


Re: TableInputFormat and number of mappers == number of regions

Posted by Vidhyashankar Venkataraman <vi...@yahoo-inc.com>.
Just so you guys know, the 150K regions were in a test cluster that we had let run amok.

Our prod cluster has fewer than 50 regions per region server. With 700 nodes, that comes to around 22K regions! The JobTracker could still potentially be overloaded with this number. The solution is indeed a new TableInputFormat rewritten with one task per region server rather than one per region.

V
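One-task-per-regionserver grouping could go along these lines (a sketch; the PerServerSplits name and plain hostname strings are illustrative — real code would pull each region's location from the table's region locator and set it as the split's preferred location):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerServerSplits {
  /**
   * regionHosts[i] is the server hosting region i.  Returns, per
   * server, the regions it would scan — i.e. one input split (and
   * hence one map task) per region server, in first-seen host order.
   */
  public static Map<String, List<Integer>> groupByServer(String[] regionHosts) {
    Map<String, List<Integer>> splits = new LinkedHashMap<>();
    for (int i = 0; i < regionHosts.length; i++) {
      // Append region i to the split for its hosting server.
      splits.computeIfAbsent(regionHosts[i], h -> new ArrayList<>()).add(i);
    }
    return splits;
  }
}
```

This also keeps the locality Stack mentioned: each split's regions all live on one server, so the task can be scheduled next to that RegionServer.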




Re: TableInputFormat and number of mappers == number of regions

Posted by Avery Ching <ac...@yahoo-inc.com>.
Thanks for the suggestions; I'll get to work on my own InputFormat then.  I just wanted to see if someone else had run into these issues.  Stack, Vidhyashankar and I are on the same team here at Yahoo!, and this is the same cluster that he uses.

Avery



Re: TableInputFormat and number of mappers == number of regions

Posted by Stack <st...@duboce.net>.
Yes, you could make a different splitter.  It would be nice if the
splitter could preserve locality, so that each map task runs on the
TaskTracker adjacent to the hosting RegionServer.  That shouldn't be
hard.  Study the current splitter and see how it juggles locations.

Can you put us in contact with the person running the cluster (offline
if you prefer)?  150k sounds like the regions need to be bigger.

Thanks,
St.Ack


Re: TableInputFormat and number of mappers == number of regions

Posted by Avery Ching <ac...@yahoo-inc.com>.
The number of regions is pretty insane, but unfortunately it's not under my control.  The workaround I suggested is to write another InputFormat and InputSplit such that each InputSplit is responsible for a configurable number of regions.  For example, if I have 100k regions and I configure each InputSplit to handle 1k regions, then I'll only have 100 map tasks.  I was just wondering if anyone else had faced these issues.

Thanks for your quick response on a Saturday morning =),

Avery



Re: TableInputFormat and number of mappers == number of regions

Posted by Jean-Daniel Cryans <jd...@apache.org>.
You cannot have more mappers than you have regions, but you can have
fewer. Try going that way.

Also, 149,624 regions is insane; is that really the case? I don't think
I've ever seen such a large deploy, and it's probably bound to hit some
issues...

J-D
