Posted to user@hbase.apache.org by Gokul Balakrishnan <ro...@gmail.com> on 2015/03/17 13:30:52 UTC

Splitting up an HBase Table into partitions

Hi,

My requirement is to partition an HBase Table and return a group of records
(i.e. rows having a specific format) without having to iterate over all of
its rows. These partitions (which should ideally be along regions) will
eventually be sent to Spark but rather than use the HBase or Hadoop RDDs
directly, I'll be using a custom RDD which recognizes partitions as the
aforementioned group of records.

I was looking at achieving this by creating InputSplits through
TableInputFormat.getSplits(), as is done in the HBase RDD [1], but I
can't figure out a way to do this without having access to the mapred
context etc.

I'd greatly appreciate it if someone could point me in the right direction.

[1]
https://github.com/tmalaska/SparkOnHBase/blob/master/src/main/scala/com/cloudera/spark/hbase/HBaseScanRDD.scala

Thanks,
Gokul

Re: Splitting up an HBase Table into partitions

Posted by Michael Segel <mi...@hotmail.com>.
Ok… 

Let’s take a step back. 

If you’re writing your own M/R program, you will get one split per region. 
If your scan doesn’t contain a start or stop row, you will scan every row in the table. 

The splits provide parallelism. 
So when you launch your job and you have 10 regions, you’ll have 10 splits. 

Going from memory, if your scan has a start/stop row, then for those regions that hold no data in range (e.g. a region’s start row isn’t inside the scope of your scan), the mapper created will complete quickly, with no rows scanned or returned in the result set. 

I think what you’re looking for is already done for you. 
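To make the pruning described above concrete, here is a toy sketch. This is an illustration only, not HBase API: region boundaries and row keys are modeled as Strings (real HBase compares byte[] keys with Bytes.compareTo), the empty string stands in for HBase's empty "unbounded" first/last boundary keys, and overlappingRegions is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of how a start/stop row prunes regions: only regions whose
// [start, end) range overlaps the scan range would do real work.
public class RegionPruning {

    /** Indexes of regions [starts[i], ends[i]) overlapping [scanStart, scanStop).
        An empty string means "unbounded", mirroring HBase's empty boundary keys. */
    static List<Integer> overlappingRegions(String[] starts, String[] ends,
                                            String scanStart, String scanStop) {
        List<Integer> hit = new ArrayList<>();
        for (int i = 0; i < starts.length; i++) {
            boolean beforeStop = scanStop.isEmpty() || starts[i].isEmpty()
                    || starts[i].compareTo(scanStop) < 0;
            boolean afterStart = scanStart.isEmpty() || ends[i].isEmpty()
                    || ends[i].compareTo(scanStart) > 0;
            if (beforeStop && afterStart) hit.add(i);
        }
        return hit;
    }

    public static void main(String[] args) {
        // Three regions: ["", "g"), ["g", "p"), ["p", "")
        String[] starts = {"", "g", "p"};
        String[] ends   = {"g", "p", ""};
        // A scan over ["h", "k") only touches the middle region, so only
        // that region's mapper would return rows.
        System.out.println(overlappingRegions(starts, ends, "h", "k")); // [1]
    }
}
```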

HTH

-Mike

> On Mar 17, 2015, at 2:09 PM, Gokul Balakrishnan <ro...@gmail.com> wrote:
> [...]

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com






Re: Splitting up an HBase Table into partitions

Posted by Michael Segel <mi...@hotmail.com>.
> On Mar 18, 2015, at 1:52 AM, Gokul Balakrishnan <ro...@gmail.com> wrote:
> 
> 
> 
> @Sean this was exactly what I was looking for. Based on the region
> boundaries, I should be able to create virtual groups of rows which can
> then be retrieved from the table (e.g. through a scan) on demand.
> 

Huh? 

You don’t need to do this. 

It’s already done for you by the existing APIs. 

A scan will allow you to do either a full table scan (no range limits provided) or a range scan where you provide the boundaries. 

So if you’re using a client connection to HBase, it’s done for you. 

If you’re writing an M/R job, you already get one mapper task assigned per region, so your parallelism is handled for you. 

It’s possible that the InputFormat is smart enough to pre-check whether each region falls within the scan boundaries and, if not, to generate no mapper task for it.

HTH

-Mike

> [...]







Re: Splitting up an HBase Table into partitions

Posted by Gokul Balakrishnan <ro...@gmail.com>.
@Mikhail I wanted to split the table into groups of rows, but did not want
to initialize a scan and go over all rows and group them into batches in
the client code. In other words, I'm looking for a way to divide the rows
in the table and merely maintain the boundary information of each division
rather than actually populate them at the time of creation.

@Shahab yes, the row key ranges for the splits are not known in advance,
which was why I was looking at retrieving the region information of the
table and create the groupings that way.

@Sean this was exactly what I was looking for. Based on the region
boundaries, I should be able to create virtual groups of rows which can
then be retrieved from the table (e.g. through a scan) on demand.
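The "virtual groups" idea can be sketched as follows. This is a hypothetical illustration, not HBase API: Partition and fromBoundaries are made-up names, and the inputs stand in for the arrays returned by RegionLocator.getStartEndKeys().

```java
// Sketch: turn region boundaries into lightweight partition descriptors.
// Only the boundaries are stored; rows are fetched later, one ranged
// Scan per partition, so no rows are iterated at creation time.
public class RegionPartitions {

    static final class Partition {
        final byte[] startRow;
        final byte[] stopRow; // exclusive, like a Scan's stop row
        Partition(byte[] startRow, byte[] stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    /** One Partition per region, zipping the start/end key arrays. */
    static Partition[] fromBoundaries(byte[][] startKeys, byte[][] endKeys) {
        int n = Math.min(startKeys.length, endKeys.length);
        Partition[] parts = new Partition[n];
        for (int i = 0; i < n; i++) {
            parts[i] = new Partition(startKeys[i], endKeys[i]);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Two regions: ["", "m") and ["m", ""); empty key = unbounded.
        byte[][] starts = { {}, {'m'} };
        byte[][] ends   = { {'m'}, {} };
        Partition[] parts = fromBoundaries(starts, ends);
        System.out.println(parts.length + " partitions"); // 2 partitions
    }
}
```

Each Partition could then back one custom RDD partition, with its rows retrieved on demand via a ranged scan (e.g. new Scan(p.startRow, p.stopRow) in the client API of that era).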

Thanks everyone for your help.

On 18 March 2015 at 00:57, Sean Busbey <bu...@cloudera.com> wrote:

> [...]

Re: Splitting up an HBase Table into partitions

Posted by Sean Busbey <bu...@cloudera.com>.
You should ask for a RegionLocator if you want to know the boundaries of
all the regions in a table


final Connection connection = ConnectionFactory.createConnection(config);
try {
  final RegionLocator locator =
      connection.getRegionLocator(TableName.valueOf("myTable"));
  final Pair<byte[][], byte[][]> startEndKeys = locator.getStartEndKeys();
  final byte[][] startKeys = startEndKeys.getFirst();
  final byte[][] endKeys = startEndKeys.getSecond();
  for (int i = 0; i < startKeys.length && i < endKeys.length; i++) {
    System.out.println("Region " + i + " starts at '" +
        Bytes.toStringBinary(startKeys[i]) + "' and ends at '" +
        Bytes.toStringBinary(endKeys[i]) + "'");
  }
} finally {
  connection.close();
}


There are other methods in RegionLocator if you need other details.

On Tue, Mar 17, 2015 at 2:09 PM, Gokul Balakrishnan <ro...@gmail.com>
wrote:

> [...]



-- 
Sean

Re: Splitting up an HBase Table into partitions

Posted by Nick Dimiduk <nd...@gmail.com>.
If you don't want to use the getSplits method, you're welcome to pull the
relevant code out into your own RDD. The RegionLocator object is public
API, and the code is trivial if you're not interested in normalizing the
split points as the MR job does.
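The "trivial" core being referred to here can be sketched without TableInputFormat at all: given the sorted region start keys from RegionLocator, a row key is routed to its region by binary search, and each region becomes one split. This is an illustration only; regionFor is a hypothetical helper and Strings stand in for byte[] row keys.

```java
// Sketch: route a row key to its region by binary search over sorted
// region start keys, the same idea the HBase client uses internally.
public class RegionLookup {

    /** Index of the region whose [startKey, nextStartKey) range holds rowKey.
        startKeys must be sorted; startKeys[0] is the empty "first" key. */
    static int regionFor(String[] startKeys, String rowKey) {
        int lo = 0, hi = startKeys.length - 1, ans = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (startKeys[mid].compareTo(rowKey) <= 0) {
                ans = mid;      // this region starts at or before rowKey
                lo = mid + 1;   // but a later one might too; keep looking
            } else {
                hi = mid - 1;
            }
        }
        return ans;
    }

    public static void main(String[] args) {
        String[] starts = {"", "g", "p"}; // three regions
        System.out.println(regionFor(starts, "hello")); // 1
        System.out.println(regionFor(starts, "apple")); // 0
        System.out.println(regionFor(starts, "zebra")); // 2
    }
}
```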

On Tue, Mar 17, 2015 at 12:12 PM, Mikhail Antonov <ol...@gmail.com>
wrote:

> Not sure what do you mean by "a means of creating splits based on
> regions, without having to iterate over all rows in the table through the
> client API.". Could you elaborate?
>
> -Mikhail
>
> [...]

Re: Splitting up an HBase Table into partitions

Posted by Shahab Yunus <sh...@gmail.com>.
If you know the row key range of your data, then you can create split
points yourself and then use the HBase API to actually make the splits.

E.g. if you know that your row key (and it is a very contrived example) has
a range of A - Z, then you can choose every 5th letter as a split point and
use the HBaseAdmin.split method to do the splits for you. This way you
don't have to iterate over your data.

Or are you saying that you don't have the row key range?
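The contrived A-Z example above can be sketched like this. The computed byte[] values could then be handed to the admin split call (HBaseAdmin.split in the older client API, Admin.split later); everyNth is a made-up helper name.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: pick every n-th letter in a known key range as a split point,
// so the table can be pre-split without scanning any data.
public class SplitPoints {

    /** Split points at every n-th letter in (from, to], as single-byte keys. */
    static byte[][] everyNth(char from, char to, int n) {
        List<byte[]> pts = new ArrayList<>();
        for (char c = (char) (from + n); c <= to; c += n) {
            pts.add(new byte[] { (byte) c });
        }
        return pts.toArray(new byte[0][]);
    }

    public static void main(String[] args) {
        byte[][] pts = everyNth('A', 'Z', 5);
        for (byte[] p : pts) {
            System.out.print((char) p[0] + " "); // F K P U Z
        }
    }
}
```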

Regards,
Shahab

On Tue, Mar 17, 2015 at 3:12 PM, Mikhail Antonov <ol...@gmail.com>
wrote:

> Not sure what do you mean by "a means of creating splits based on
> regions, without having to iterate over all rows in the table through the
> client API.". Could you elaborate?
>
> -Mikhail
>
> [...]

Re: Splitting up an HBase Table into partitions

Posted by Mikhail Antonov <ol...@gmail.com>.
Not sure what you mean by "a means of creating splits based on
regions, without having to iterate over all rows in the table through the
client API." Could you elaborate?

-Mikhail

On Tue, Mar 17, 2015 at 12:09 PM, Gokul Balakrishnan <ro...@gmail.com> wrote:
> [...]



-- 
Thanks,
Michael Antonov

Re: Splitting up an HBase Table into partitions

Posted by Gokul Balakrishnan <ro...@gmail.com>.
Hi Michael,

Thanks for the reply. Yes, I do realise that HBase has regions, perhaps my
usage of the term partitions was misleading. What I'm looking for is
exactly what you've mentioned - a means of creating splits based on
regions, without having to iterate over all rows in the table through the
client API. Do you have any idea how I might achieve this?

Thanks,

On Tuesday, March 17, 2015, Michael Segel <mi...@hotmail.com> wrote:

> [...]

Re: Splitting up an HBase Table into partitions

Posted by Michael Segel <mi...@hotmail.com>.
HBase doesn’t have partitions.  It has regions.

The split occurs against the regions so that if you have n regions, you have n splits. 

Please don’t confuse partitions and regions because they are not the same or synonymous. 

> On Mar 17, 2015, at 7:30 AM, Gokul Balakrishnan <ro...@gmail.com> wrote:
> [...]
