Posted to issues@hbase.apache.org by "Ming Ma (JIRA)" <ji...@apache.org> on 2011/07/06 03:28:16 UTC

[jira] [Created] (HBASE-4063) Improve TableInputFormat to allow application to configure the number of mappers

Improve TableInputFormat to allow application to configure the number of mappers
--------------------------------------------------------------------------------

                 Key: HBASE-4063
                 URL: https://issues.apache.org/jira/browse/HBASE-4063
             Project: HBase
          Issue Type: Improvement
          Components: mapreduce
            Reporter: Ming Ma
            Assignee: Ming Ma


TableInputFormat creates one split/mapper task per region. When a table has lots of small regions, the per-task overhead of the MapReduce framework becomes a significant part of the job's cost. Some related work items could address this issue.

1.	Reduce the number of small regions. https://issues.apache.org/jira/browse/HBASE-420 
2.	Improvement in map reduce framework to handle small jobs. https://issues.apache.org/jira/browse/MAPREDUCE-1220 

Another quick way to solve this is to improve TableInputFormat so that it can pack a configurable number of regions from a given region server into one mapper task. I tested this approach and achieved a 40% improvement in map job latency.
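The packing step can be sketched without any HBase dependency. This is only an illustration under stated assumptions: RegionPacker, Region and regionsPerMapper are hypothetical names, not existing HBase classes or settings, and each region is reduced to a name plus the region server hosting it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of HBase: groups regions by host, then chunks each group. */
class RegionPacker {

    /** Minimal stand-in for a region: its name and the region server that hosts it. */
    static final class Region {
        final String name;
        final String server;

        Region(String name, String server) {
            this.name = name;
            this.server = server;
        }
    }

    /**
     * Returns one list of region names per mapper task. Regions are packed
     * together only when they are hosted on the same region server.
     */
    static List<List<String>> pack(List<Region> regions, int regionsPerMapper) {
        if (regionsPerMapper < 1) {
            throw new IllegalArgumentException("regionsPerMapper must be >= 1");
        }
        // Group by server, keeping the order in which servers are first seen.
        Map<String, List<String>> byServer = new LinkedHashMap<>();
        for (Region r : regions) {
            byServer.computeIfAbsent(r.server, s -> new ArrayList<>()).add(r.name);
        }
        // Cut each server's list into chunks of at most regionsPerMapper regions.
        List<List<String>> tasks = new ArrayList<>();
        for (List<String> names : byServer.values()) {
            for (int i = 0; i < names.size(); i += regionsPerMapper) {
                int end = Math.min(i + regionsPerMapper, names.size());
                tasks.add(new ArrayList<>(names.subList(i, end)));
            }
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<Region> regions = new ArrayList<>();
        regions.add(new Region("r1", "rs-a"));
        regions.add(new Region("r2", "rs-b"));
        regions.add(new Region("r3", "rs-a"));
        regions.add(new Region("r4", "rs-a"));
        regions.add(new Region("r5", "rs-b"));
        // Prints [[r1, r3], [r4], [r2, r5]]
        System.out.println(pack(regions, 2));
    }
}
```

Regions packed into one task this way are not necessarily adjacent in the key space, so a real split would probably have to carry several key ranges rather than a single start/end row.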


In addition, Ophir Cohen suggested support for multiple mappers per region as below.

On Thu, Jun 30, 2011 at 8:38 AM, Ophir Cohen <op...@gmail.com> wrote:
> Actually I thought of the opposite version:
> If I have spare map slots, why not configure it to run more than one mapper
> per region?
> The question then is how to 'skip' the mappers to the needed places inside
> the regions.

Well, the current splitter passed mappers Scans where the start/end
rows are the region boundaries (at the time at which the splitter
ran).

To do your case,  in the splitter, you'd just give out multiple splits
per region.  To cut up the region key-space, you might use the
Bytes.split code.  It does coarse BigNumber math dividing the key
space.  See here:
http://hbase.apache.org/xref/org/apache/hadoop/hbase/util/Bytes.html#1034

St.Ack
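To make the key-space arithmetic concrete, here is a stand-alone sketch of dividing a region's key range with BigInteger math. It is not the Bytes.split implementation; KeyRangeSplitter and its methods are hypothetical names. It assumes both boundary keys are explicit (the empty start/end keys of the first and last regions are not handled) and right-pads the shorter key with zero bytes.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: cuts [start, end] into evenly spaced boundary keys. */
class KeyRangeSplitter {

    /**
     * Returns pieces + 1 boundary keys, from start to end inclusive. Keys are
     * treated as unsigned big-endian integers of equal length.
     */
    static List<byte[]> boundaries(byte[] start, byte[] end, int pieces) {
        if (pieces < 1) {
            throw new IllegalArgumentException("pieces must be >= 1");
        }
        int len = Math.max(start.length, end.length);
        // Arrays.copyOf pads on the right with zero bytes; signum 1 means unsigned.
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, len));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, len));
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("start must sort before end");
        }
        BigInteger span = hi.subtract(lo);
        List<byte[]> keys = new ArrayList<>();
        for (int i = 0; i <= pieces; i++) {
            BigInteger offset = span.multiply(BigInteger.valueOf(i))
                                    .divide(BigInteger.valueOf(pieces));
            keys.add(toFixedLength(lo.add(offset), len));
        }
        return keys;
    }

    /** Converts a non-negative value back to exactly len bytes, big-endian. */
    static byte[] toFixedLength(BigInteger value, int len) {
        byte[] raw = value.toByteArray(); // may carry a leading sign byte or be short
        byte[] out = new byte[len];
        int copy = Math.min(raw.length, len);
        System.arraycopy(raw, raw.length - copy, out, len - copy, copy);
        return out;
    }

    public static void main(String[] args) {
        byte[] start = {0x10, 0x00};
        byte[] end = {0x50, 0x00};
        // Prints 1000, 2000, 3000, 4000, 5000 on separate lines.
        for (byte[] key : boundaries(start, end, 4)) {
            StringBuilder hex = new StringBuilder();
            for (byte b : key) {
                hex.append(String.format("%02x", b));
            }
            System.out.println(hex);
        }
    }
}
```

Each adjacent pair of boundaries would become the start/stop row of one mapper's Scan. The division is even in key space, not in data volume, so skewed keys can still produce uneven mappers; very narrow ranges can also yield duplicate boundaries that a caller would need to drop.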


To support the scenarios of:
a) One mapper for multiple regions.
b) Multiple mappers for one region.


We can modify TableInputFormat to let the application configure the number of mappers. TableInputFormat would then calculate internally how to set each mapper's key range.
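One possible shape for that internal calculation, covering both scenarios, is sketched below. MapperPlan and its methods are hypothetical names, and the sketch assumes regions of roughly equal size and ignores which region server hosts each region.

```java
import java.util.Arrays;

/** Hypothetical planner: turns a requested mapper count into per-mapper or per-region counts. */
class MapperPlan {

    /** Spreads items over buckets so that bucket sizes differ by at most one. */
    static int[] spread(int items, int buckets) {
        if (items < 0 || buckets < 1) {
            throw new IllegalArgumentException("need items >= 0 and buckets >= 1");
        }
        int[] sizes = new int[buckets];
        int base = items / buckets;
        int extra = items % buckets;
        for (int i = 0; i < buckets; i++) {
            // The first `extra` buckets take one additional item.
            sizes[i] = base + (i < extra ? 1 : 0);
        }
        return sizes;
    }

    /** Scenario a): mappers <= regions. Entry i is how many consecutive regions mapper i scans. */
    static int[] regionsPerMapper(int regions, int mappers) {
        if (mappers < 1 || mappers > regions) {
            throw new IllegalArgumentException("need 1 <= mappers <= regions");
        }
        return spread(regions, mappers);
    }

    /** Scenario b): mappers >= regions. Entry i is how many key-range pieces region i is cut into. */
    static int[] splitsPerRegion(int regions, int mappers) {
        if (regions < 1 || mappers < regions) {
            throw new IllegalArgumentException("need 1 <= regions <= mappers");
        }
        return spread(mappers, regions);
    }

    public static void main(String[] args) {
        // 10 regions, 4 mappers: prints [3, 3, 2, 2]
        System.out.println(Arrays.toString(regionsPerMapper(10, 4)));
        // 3 regions, 8 mappers: prints [3, 3, 2]
        System.out.println(Arrays.toString(splitsPerRegion(3, 8)));
    }
}
```

In scenario a) each mapper's key range would run from the start row of its first region to the end row of its last one; in scenario b) each region's range would be cut into the given number of pieces.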

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4063) Improve TableInputFormat to allow application to configure the number of mappers

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13138848#comment-13138848 ] 

Ted Yu commented on HBASE-4063:
-------------------------------

RegionLoad carries statistics about the region, such as the total size of the store files for the region, uncompressed, in MB.
We should utilize such information to form balanced region groups.
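A minimal sketch of size-aware grouping follows, assuming the store file size in MB of each region has already been read from RegionLoad into a plain map. BalancedGrouper is a hypothetical name, and the greedy "largest region first, into the lightest group" heuristic is just one simple choice, not something this issue prescribes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical helper: greedy size-balanced grouping of regions. */
class BalancedGrouper {

    /** One mapper's worth of regions and their combined store file size. */
    static final class Group {
        final List<String> regions = new ArrayList<>();
        long totalMB = 0;
    }

    /** Assigns each region, largest first, to the group with the smallest total so far. */
    static List<Group> group(Map<String, Integer> sizeMBByRegion, int groupCount) {
        if (groupCount < 1) {
            throw new IllegalArgumentException("groupCount must be >= 1");
        }
        List<Group> groups = new ArrayList<>();
        PriorityQueue<Group> lightest =
            new PriorityQueue<>(Comparator.comparingLong((Group g) -> g.totalMB));
        for (int i = 0; i < groupCount; i++) {
            Group g = new Group();
            groups.add(g);
            lightest.add(g);
        }
        // Sort regions by size, descending.
        List<Map.Entry<String, Integer>> bySize = new ArrayList<>(sizeMBByRegion.entrySet());
        bySize.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        for (Map.Entry<String, Integer> e : bySize) {
            // Remove the lightest group, grow it, and re-insert it so the queue re-orders.
            Group g = lightest.poll();
            g.regions.add(e.getKey());
            g.totalMB += e.getValue();
            lightest.add(g);
        }
        return groups;
    }
}
```

For example, sizes of 800, 500, 300, 200, 100 and 100 MB split into two groups end up with 1000 MB in each. To keep mappers local, the same grouping could be applied separately to the regions of each region server.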
                
