Posted to user@hbase.apache.org by a <ka...@gmail.com> on 2012/04/04 16:52:55 UTC

Custom HBase table split that sends collocated rows to the same region

Hello,

Suppose that I have a "tall-narrow" HBase table with a composite key, e.g.
{class_id}#{student_id}.

Example data looks like this:

ROW_KEY  |   ONE COLUMN FAMILY
----------------------------------------------------------------
1        |   name = "Object Oriented Programming"
         |   location = "Building A"
         |   semester = "Winter"
         |   // much more information about the class
----------------------------------------------------------------
1_1      |   name = "Alice White"
1_2      |   name = "Betty Lipcon"
// many other records related to class with ID = 1
----------------------------------------------------------------
// many other records related to class with ID = 2, 3, 4, .. N


I would like to use this HBase table as the input source for my MapReduce job,
in which the mapper emits <key, value> pairs where:
key = ${class_id}#${student_id},
value = some information about the corresponding class.

Because row keys are sorted lexicographically, this would be easy to implement
if I could split the HBase table into regions such that all collocated rows
(those with the same row prefix, i.e. {class_id}) reside in the same region.
Then, for each group of such collocated rows, I could use its first row to get
the information about the class and emit that information together with the
row key of each remaining row.
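
The grouping idea above can be sketched without any HBase dependency. This is
only an illustration under stated assumptions: a TreeMap stands in for rows
arriving in sorted key order, the '_' separator is taken from the sample rows,
and the class and method names are hypothetical. It also assumes fixed-width
(e.g. zero-padded) IDs, because with plain byte ordering a key such as "10"
sorts between "1" and "1_1" and would break the group apart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CollocatedRowsSketch {

    /**
     * Walks rows in sorted key order. A key without '_' is treated as a class
     * row; every following key that starts with that class ID is paired with
     * the class row's value.
     */
    static List<String[]> emitPairs(TreeMap<String, String> sortedRows) {
        List<String[]> out = new ArrayList<>();
        String classId = null;
        String classInfo = null;
        for (Map.Entry<String, String> row : sortedRows.entrySet()) {
            String key = row.getKey();
            int sep = key.indexOf('_');
            if (sep < 0) {
                // Class row: remember it for the student rows that follow.
                classId = key;
                classInfo = row.getValue();
            } else if (key.substring(0, sep).equals(classId)) {
                // Student row of the current class: emit <row key, class info>.
                out.add(new String[] {key, classInfo});
            }
            // Otherwise the class row was not seen in this input (for example
            // because it lives in another region), so the row is skipped.
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, String> rows = new TreeMap<>();
        rows.put("1", "Object Oriented Programming");
        rows.put("1_1", "Alice White");
        rows.put("1_2", "Betty Lipcon");
        for (String[] pair : emitPairs(rows)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

The skipped branch is exactly the case a custom split is meant to prevent: if
a region boundary falls inside a group, the mapper reading the second part
never sees the class row.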

So I would like to ask whether such a custom split is easy to implement.

I know that I could:
1) model it with a "flat-wide" table, so that each row contains everything I
need, or
2) use two MR jobs for that,

but I am interested in the best solution for a "tall-narrow" table with one MR job.

Many thanks in advance for any hints!

Re: Custom HBase table split that sends collocated rows to the same region

Posted by lars hofhansl <lh...@yahoo.com>.
Yep. Need 0.94+.

If you don't mind the plug, you can read a bit about how to set up a
RegionSplitPolicy here:
http://hadoop-hbase.blogspot.com/2012/02/limited-cross-row-transactions-in-hbase.html
Note that KeyPrefixRegionSplitPolicy assumes a fixed-length prefix. If you need
variable-length prefixes, you need to implement your own RegionSplitPolicy,
perhaps using KeyPrefixRegionSplitPolicy as an example.
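
To make the fixed-length versus variable-length difference concrete, here is a
minimal sketch of the key truncation only. It deliberately uses no HBase
classes, so the class and method names are hypothetical; in a real policy this
logic would presumably sit where the RegionSplitPolicy subclass computes its
split point. The '_' delimiter is an assumption taken from the sample rows in
the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PrefixSplitPointSketch {

    /** Fixed-length variant: keep only the first prefixLength bytes of the proposed split key. */
    static byte[] fixedLengthPrefix(byte[] splitPoint, int prefixLength) {
        return Arrays.copyOf(splitPoint, Math.min(prefixLength, splitPoint.length));
    }

    /** Variable-length variant: cut the proposed split key at the first delimiter byte. */
    static byte[] delimitedPrefix(byte[] splitPoint, byte delimiter) {
        for (int i = 0; i < splitPoint.length; i++) {
            if (splitPoint[i] == delimiter) {
                return Arrays.copyOf(splitPoint, i);
            }
        }
        // No delimiter: the key is already a bare prefix (a class row).
        return splitPoint.clone();
    }

    public static void main(String[] args) {
        byte[] proposed = "17_204".getBytes(StandardCharsets.UTF_8);
        // Both calls print "17" for this key.
        System.out.println(new String(delimitedPrefix(proposed, (byte) '_'), StandardCharsets.UTF_8));
        System.out.println(new String(fixedLengthPrefix(proposed, 2), StandardCharsets.UTF_8));
    }
}
```

Splitting at the truncated key "17" rather than at "17_204" puts the region
boundary in front of the class row, so the class row and its student rows stay
in the same region.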

-- Lars



________________________________
 From: Suraj Varma <sv...@gmail.com>
To: user@hbase.apache.org 
Sent: Wednesday, April 4, 2012 10:15 AM
Subject: Re: Custom HBase table split that sends collocated rows to the same region
 
You did not mention what version of HBase you are on.

In 0.94/trunk, there is a RegionSplitPolicy feature that may work in
your case ...
https://issues.apache.org/jira/browse/HBASE-5304
http://search-hadoop.com/jd/hbase/org/apache/hadoop/hbase/regionserver/RegionSplitPolicy.html

I came across this implementation which may be what you want
http://search-hadoop.com/jd/hbase/org/apache/hadoop/hbase/regionserver/KeyPrefixRegionSplitPolicy.html
--Suraj


Re: Custom HBase table split that sends collocated rows to the same region

Posted by Suraj Varma <sv...@gmail.com>.
You did not mention what version of HBase you are on.

In 0.94/trunk, there is a RegionSplitPolicy feature that may work in
your case ...
https://issues.apache.org/jira/browse/HBASE-5304
http://search-hadoop.com/jd/hbase/org/apache/hadoop/hbase/regionserver/RegionSplitPolicy.html

I came across this implementation which may be what you want
http://search-hadoop.com/jd/hbase/org/apache/hadoop/hbase/regionserver/KeyPrefixRegionSplitPolicy.html
--Suraj
