You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hbase.apache.org by "Jonathan M. Kupferman" <jk...@umail.ucsb.edu> on 2008/05/08 22:32:12 UTC

Splitting within regions

Hi Everyone,
I am currently attempting to run a Map Reduce job where the input
comed from HBase. The input table has 22 regions, and thus creates 22
map tasks. This however creates an issue since so few map tasks
results in a poor distribution of labor on a cluster of 10+ machines,
specifically since the amount of work required is highly variable
depending on the region.

I would like to increase the number of map tasks at least 2 fold,
the relevant code seems to be in TableInputFormat.

//Original code
     Text[] startKeys = m_table.getStartKeys();
      if(startKeys == null || startKeys.length == 0) {
        throw new IOException("Expecting at least one region");
      }
      InputSplit[] splits = new InputSplit[startKeys.length];
      for(int i = 0; i < startKeys.length; i++) {
        splits[i] = new TableSplit(m_tableName, startKeys[i],
            ((i + 1) < startKeys.length) ? startKeys[i + 1] : new Text());
      }
//end-original

//Modified code
     Text[] startKeys = m_table.getStartKeys();
      if(startKeys == null || startKeys.length == 0) {
        throw new IOException("Expecting at least one region");
      }
      InputSplit[] splits = new InputSplit[startKeys.length*2];
      for(int i = 0; i < startKeys.length; i++) {
       Text halfsplit = new Text(""+Integer.parseInt(startKeys[i +  
1].toString())/2);
        splits[i] = new TableSplit(m_tableName, startKeys[i], halfsplit);
        splits[i+1] = new TableSplit(m_tableName, halfsplit ,((i + 1)  
< startKeys.length) ? startKeys[i + 2] : new Text());
      }
//end-modified

Is seems like the required modifications would be something along the  
lines the code written above. Is this the correct/best way to go about  
this?


Thanks,
Jonathan


Re: Splitting within regions

Posted by Bryan Duxbury <br...@rapleaf.com>.
I agree - you want a 1:1 mapping of tasks to regions. If you only  
have 22 regions, then I would wonder whether you even need HBase to  
host that data.

If the rows in a region are REALLY small, and the amount of data  
you'll have is bounded pretty low, then maybe you should consider  
changing the configuration for region size to be something less than  
256MB. Right now this is something that's configured at the server  
level, but ideally it'll eventually be configurable on a table-by- 
table basis.

-Bryan

On May 8, 2008, at 1:43 PM, Andrew Purtell wrote:

> I have also been considering this issue but from the opposite  
> direction
> -- forcing splits of the table. From the perspective of I/O and  
> loading
> optimization, doesn't it make the most sense to have a 1:1 mapping of
> regions to tasks?
>
> This issue I think will come up now and again if a user has tables  
> that
> hold a large number of items yet those items have small keys and  
> column
> data.
>
> Of course there are some problems with this, first and foremost the
> problem that too many regions for the carrying capacity of the cluster
> will take down all of the region servers via OOME in a cascading  
> spiral
> of death. Then there is the issue of making sure the key space of a
> forced split is not too small as to underutilize the mapfile storage
> available. Then there is the issue of key distributions being  
> dependent
> on the particular dataset and schema, so tweaking the global setting
> hbase.hregion.max.filesize might not be a good idea.
>
> Just some random thoughts on the topic,
>
>    - Andrew Purtell
>
> --- "Jonathan M. Kupferman" <jk...@umail.ucsb.edu> wrote:
>
>> Hi Everyone,
>> I am currently attempting to run a Map Reduce job where the input
>> comed from HBase. The input table has 22 regions, and thus creates 22
>> map tasks. This however creates an issue since so few map tasks
>> results in a poor distribution of labor on a cluster of 10+ machines,
>> specifically since the amount of work required is highly variable
>> depending on the region.
>>
>> I would like to increase the number of map tasks at least 2 fold,
>> the relevant code seems to be in TableInputFormat.
>>
>> //Original code
>>      Text[] startKeys = m_table.getStartKeys();
>>       if(startKeys == null || startKeys.length == 0) {
>>         throw new IOException("Expecting at least one region");
>>       }
>>       InputSplit[] splits = new InputSplit[startKeys.length];
>>       for(int i = 0; i < startKeys.length; i++) {
>>         splits[i] = new TableSplit(m_tableName, startKeys[i],
>>             ((i + 1) < startKeys.length) ? startKeys[i + 1] : new
>> Text());
>>       }
>> //end-original
>>
>> //Modified code
>>      Text[] startKeys = m_table.getStartKeys();
>>       if(startKeys == null || startKeys.length == 0) {
>>         throw new IOException("Expecting at least one region");
>>       }
>>       InputSplit[] splits = new InputSplit[startKeys.length*2];
>>       for(int i = 0; i < startKeys.length; i++) {
>>        Text halfsplit = new Text(""+Integer.parseInt(startKeys[i +
>> 1].toString())/2);
>>         splits[i] = new TableSplit(m_tableName, startKeys[i],
>> halfsplit);
>>         splits[i+1] = new TableSplit(m_tableName, halfsplit ,((i + 1)
>> < startKeys.length) ? startKeys[i + 2] : new Text());
>>       }
>> //end-modified
>>
>> Is seems like the required modifications would be something along the
>> lines the code written above. Is this the correct/best way to go  
>> about
>>
>> this?
>>
>>
>> Thanks,
>> Jonathan
>>
>>
>
>
>
>        
> ______________________________________________________________________ 
> ______________
> Be a better friend, newshound, and
> know-it-all with Yahoo! Mobile.  Try it now.  http:// 
> mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ


Re: Splitting within regions

Posted by Andrew Purtell <ap...@yahoo.com>.
I have also been considering this issue but from the opposite direction
-- forcing splits of the table. From the perspective of I/O and loading
optimization, doesn't it make the most sense to have a 1:1 mapping of
regions to tasks? 

This issue I think will come up now and again if a user has tables that
hold a large number of items yet those items have small keys and column
data. 

Of course there are some problems with this, first and foremost the
problem that too many regions for the carrying capacity of the cluster
will take down all of the region servers via OOME in a cascading spiral
of death. Then there is the issue of making sure the key space of a
forced split is not too small as to underutilize the mapfile storage
available. Then there is the issue of key distributions being dependent
on the particular dataset and schema, so tweaking the global setting
hbase.hregion.max.filesize might not be a good idea. 

Just some random thoughts on the topic,

   - Andrew Purtell

--- "Jonathan M. Kupferman" <jk...@umail.ucsb.edu> wrote:

> Hi Everyone,
> I am currently attempting to run a Map Reduce job where the input
> comed from HBase. The input table has 22 regions, and thus creates 22
> map tasks. This however creates an issue since so few map tasks
> results in a poor distribution of labor on a cluster of 10+ machines,
> specifically since the amount of work required is highly variable
> depending on the region.
> 
> I would like to increase the number of map tasks at least 2 fold,
> the relevant code seems to be in TableInputFormat.
> 
> //Original code
>      Text[] startKeys = m_table.getStartKeys();
>       if(startKeys == null || startKeys.length == 0) {
>         throw new IOException("Expecting at least one region");
>       }
>       InputSplit[] splits = new InputSplit[startKeys.length];
>       for(int i = 0; i < startKeys.length; i++) {
>         splits[i] = new TableSplit(m_tableName, startKeys[i],
>             ((i + 1) < startKeys.length) ? startKeys[i + 1] : new
> Text());
>       }
> //end-original
> 
> //Modified code
>      Text[] startKeys = m_table.getStartKeys();
>       if(startKeys == null || startKeys.length == 0) {
>         throw new IOException("Expecting at least one region");
>       }
>       InputSplit[] splits = new InputSplit[startKeys.length*2];
>       for(int i = 0; i < startKeys.length; i++) {
>        Text halfsplit = new Text(""+Integer.parseInt(startKeys[i +  
> 1].toString())/2);
>         splits[i] = new TableSplit(m_tableName, startKeys[i],
> halfsplit);
>         splits[i+1] = new TableSplit(m_tableName, halfsplit ,((i + 1)  
> < startKeys.length) ? startKeys[i + 2] : new Text());
>       }
> //end-modified
> 
> Is seems like the required modifications would be something along the  
> lines the code written above. Is this the correct/best way to go about 
> 
> this?
> 
> 
> Thanks,
> Jonathan
> 
> 



      ____________________________________________________________________________________
Be a better friend, newshound, and 
know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ