Posted to user@hbase.apache.org by Cedric Ho <ce...@gmail.com> on 2008/10/08 09:33:32 UTC

map reduce range of records from hbase table

Hi all,

I am using 0.18.0 and have successfully used data from an HBase table as
input to my map/reduce job.

I wonder how to specify a subset of records from a table instead of
taking all records as input, such as a range of the row keys, or maybe
specific values of certain columns.

Any help is appreciated.

Cedric

Re: map reduce range of records from hbase table

Posted by tigertail <ty...@yahoo.com>.
Hi Cedric,

Can you share with me your version of getSplits that feeds only a subset of
records to the job? I expect your method can select the subset based on row
keys as well as some column values. Thank you.


Cedric Ho wrote:
> 
> Thanks for the solutions, I've tried overriding getSplits and it does
> what I need.
> 
> But for the RowFilter, I guess it would also need to scan through all
> records and do filtering. So wouldn't it be the same if I do the
> filtering myself during the map phase?
> 
> Cedric
> 
> 
> On Thu, Oct 9, 2008 at 5:13 AM, stack <st...@duboce.net> wrote:
>> Cedric Ho wrote:
>>>
>>> Hi all,
>>>
>>> I am using 0.18.0 and have successfully used data from hbase table as
>>> input to my map/reduce job.
>>>
>>> I wonder how to specify a subset of records from a table instead of
>>> taking all records as input.
>>> Such as a range of the row keys or maybe by specific values of certain
>>> columns.
>>>
>>
>> You'll have to subclass the TableInputFormat.
>>
>> There is an example in the javadoc on subclassing TIF:
>> http://hadoop.apache.org/hbase/docs/r0.18.0/api/org/apache/hadoop/hbase/mapred/TableInputFormatBase.html
>> (Sorry, the example is mangled.  Do a get of the html source to see
>> non-garbled code).
>>
>> The example shows you how to set a filter.  Filters can filter on rows
>> and
>> values.
>>
>> To work against a subset, you'd probably need to play with getSplits in
>> your subclass.  By default, it basically returns as many splits as there
>> are regions in your table, so it's always the whole table.  Filters could
>> stop unwanted rows being returned, but maybe it's better if the rows
>> weren't considered in the first place; hence the need for getSplits
>> subclassing.
>>
>> St.Ack
>>
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/map-reduce-range-of-records-from-hbase-table-tp19873787p20948685.html
Sent from the HBase User mailing list archive at Nabble.com.


Re: map reduce range of records from hbase table

Posted by stack <st...@duboce.net>.
Cedric Ho wrote:
> But for the RowFilter, I guess it would also need to scan through all
> records and do filtering. So wouldn't it be the same if I do the
> filtering myself during the map phase?
>   

No.  Filters work serverside, so you save the cost of shipping cells that
would just be discarded by your map.
St.Ack
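To illustrate the serverside point, here is a sketch of wiring a row filter into a TableInputFormat subclass. The class names (RegExpRowFilter, RowFilterInterface) and the setRowFilter hook are recalled from the 0.18-era API and should be checked against the javadoc for your release; the row-key pattern is made up for the example.

```java
import org.apache.hadoop.hbase.filter.RegExpRowFilter;
import org.apache.hadoop.hbase.filter.RowFilterInterface;
import org.apache.hadoop.hbase.mapred.TableInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class FilteredTableInputFormat extends TableInputFormat {
  public void configure(JobConf job) {
    super.configure(job);
    // The filter is evaluated on the region servers, so rows it rejects
    // are never shipped to the client or handed to the map tasks at all.
    // "^2008-10-.*" is a hypothetical key pattern for this sketch.
    RowFilterInterface filter = new RegExpRowFilter("^2008-10-.*");
    setRowFilter(filter);
  }
}
```

That is the difference versus filtering in the map: the discard happens before any network transfer.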

Re: map reduce range of records from hbase table

Posted by Cedric Ho <ce...@gmail.com>.
Thanks for the solutions, I've tried overriding getSplits and it does
what I need.

But for the RowFilter, I guess it would also need to scan through all
records and do filtering. So wouldn't it be the same if I do the
filtering myself during the map phase?

Cedric


On Thu, Oct 9, 2008 at 5:13 AM, stack <st...@duboce.net> wrote:
> Cedric Ho wrote:
>>
>> Hi all,
>>
>> I am using 0.18.0 and have successfully used data from hbase table as
>> input to my map/reduce job.
>>
>> I wonder how to specify a subset of records from a table instead of
>> taking all records as input.
>> Such as a range of the row keys or maybe by specific values of certain
>> columns.
>>
>
> You'll have to subclass the TableInputFormat.
>
> There is an example in the javadoc on subclassing TIF:
> http://hadoop.apache.org/hbase/docs/r0.18.0/api/org/apache/hadoop/hbase/mapred/TableInputFormatBase.html
> (Sorry, the example is mangled.  Do a get of the html source to see
> non-garbled code).
>
> The example shows you how to set a filter.  Filters can filter on rows and
> values.
>
> To work against a subset, you'd probably need to play with getSplits in
> your subclass.  By default, it basically returns as many splits as there are
> regions in your table, so it's always the whole table.  Filters could stop
> unwanted rows being returned, but maybe it's better if the rows weren't
> considered in the first place; hence the need for getSplits subclassing.
>
> St.Ack
>
>

Re: map reduce range of records from hbase table

Posted by stack <st...@duboce.net>.
Jaeyun Noh wrote:
> Another question:
>
> Will HBase support a so-called "multi-get" function, which receives a set
> of row keys as an input parameter and returns a set of RowResults as output?
>   

Not yet.  We need HBASE-880 to go in first (it enables batching of puts,
gets, and deletes).
St.Ack
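Until HBASE-880 lands, a client-side stand-in is just a loop of single gets, one RPC per key, so it buys convenience but no batching. A hedged sketch against the 0.18-era HTable API (getRow returning RowResult; verify the signatures for your release):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.RowResult;

public class MultiGetWorkaround {
  // Emulates a "multi-get": one round trip per key, so it is only a
  // stopgap until batched gets (HBASE-880) are available.
  public static List<RowResult> multiGet(HTable table, List<byte[]> keys)
      throws IOException {
    List<RowResult> results = new ArrayList<RowResult>();
    for (byte[] key : keys) {
      RowResult r = table.getRow(key);
      if (r != null) {
        results.add(r);
      }
    }
    return results;
  }
}
```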

Re: map reduce range of records from hbase table

Posted by Jaeyun Noh <me...@gmail.com>.
Another question:

Will HBase support a so-called "multi-get" function, which receives a set
of row keys as an input parameter and returns a set of RowResults as output?
It might be useful when you have a distinct set of row keys, similar to the
IN (...) condition in SQL.

2008/10/9 Jaeyun Noh <me...@gmail.com>

> I wonder if a network RPC is involved whenever we call next() on the
> scanner class.
> Also, whether the scanner works as parallel requests to the HRegions with
> a fetch to a temporary cache on the HBase client.
> If so, we're happy to live with that.
>
> Is the following hbase parameter related to my question?
>
> <property>
>
>     <name>hbase.client.scanner.caching</name>
>
>     <value>30</value>
>
>     <description>Number of rows that will be fetched when calling next
>
>     on a scanner if it is not served from memory. Higher caching values
>
>     will enable faster scanners but will eat up more memory and some
>
>     calls of next may take longer and longer times when the cache is empty.
>
>     </description>
>
>   </property>
>
> Regards, Jaeyun Noh.
>
>
> On Wed, Oct 8, 2008 at 10:10 PM, stack <st...@duboce.net> wrote:
>
>> On Wed, Oct 8, 2008 at 9:01 PM, Jaeyun Noh <me...@gmail.com> wrote:
>>
>> > Thx.
>> >
>> > BTW, it seems that the output format (subclass of
>> > org.apache.hadoop.mapred.OutputFormat) of MR job can only be a file. Can
>> we
>> > define our own file format which hbase clients can access?
>>
>>
>> No.  You can output to anything as long as you make it implement
>> OutputFormat.  To output to hbase subclass TableReduce or see
>> TableOutputFormat.
>>
>>
>> >
>> >
>> > My goal is to implement a filter-enabled table scanner that runs as
>> > multi-process clients using MR. I'm trying to leverage MR since the
>> > ClientScanner class of HTable sequentially accesses each HRegion and thus
>> > involves multiple round trips between servers and clients.
>>
>>
>> I'm not sure I follow.  Perhaps start simple, then see where the
>> bottlenecks are and optimize there.  Regarding round trips between client
>> and server, what do you want? A scanner that returns batches rather than
>> a row at a time?
>>
>> St.Ack
>>
>>
>>
>>
>>
>> >
>> >
>> > On Wed, Oct 8, 2008 at 4:30 PM, stack <st...@duboce.net> wrote:
>> >
>> > > Jaeyun Noh wrote:
>> > >
>> > >> Hi,
>> > >>
>> > >> May I ask another question?
>> > >>
>> > >> I'm running HBase/Hadoop on linux server, and implementing business
>> > >> application with java, which runs on a different windows machine.
>> > >>  It looks like MapReduce job runs on a server node. Can I run the
>> > >> MapReduce
>> > >> job built on windows client with an existing linux server? How can we
>> > get
>> > >> result done by MapReduce job at the server?
>> > >>
>> > >>
>> > >
>> > > You should be able to, yes.  Make sure you use same java on both
>> > machines.
>> > >  This page might help some:
>> > http://wiki.apache.org/hadoop/Hbase/MapReduce.
>> > > St.Ack
>> > >
>> >
>>
>
>

Re: map reduce range of records from hbase table

Posted by stack <st...@duboce.net>.
Jaeyun Noh wrote:
> I wonder if a network RPC is involved whenever we call next() on the
> scanner class.
>   

It's not a pretty story.  A next() in the client makes a trip over to the
server carrying the region that hosts the row the scanner is currently
stalled on.  Serverside, the region has a scanner context that holds a
scanner on the memcache and a scanner for each of the storefiles present
in the filesystem.  The storefile scanners in turn reduce to Hadoop
MapFile#next calls, so another network hop is involved out to the
particular datanode hosting the MapFile block the scanner is currently
within.  The next on the serverside is a careful stepping through the
memcache first and then through each of the store files, respecting
order, trying to turn up the appropriate next result.

> Also, whether the scanner works as parallel requests to the HRegions with
> a fetch to a temporary cache on the HBase client.
>   

Well, the scanner is homed on a single row at a time, so it runs against a
single region only at any one time.  That said, at the moment, if a row
comprises many column families, we currently proceed through each in
series.  I believe there is an issue open to parallelize the requests
across all the column families in a row.


> If so, we're happy to live with that.
>
> Is the following hbase parameter related to my question?
>
> <property>
>
>     <name>hbase.client.scanner.caching</name>
>
>     <value>30</value>
>
>     <description>Number of rows that will be fetched when calling next
>
>     on a scanner if it is not served from memory. Higher caching values
>
>     will enable faster scanners but will eat up more memory and some
>
>     calls of next may take longer and longer times when the cache is empty.
>
>     </description>
>
>   </property>
>   
Yes, just added.  It fetches a bunch of rows at a time rather than one at a
time as it used to.  In my testing, it makes scanners 4X faster.

St.Ack
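If you prefer to set this per client rather than in hbase-site.xml, a sketch using the standard Hadoop Configuration API (the property name is taken from the snippet quoted above; the value 100 is just an illustration):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ScannerCachingExample {
  public static void main(String[] args) {
    // Fetch 100 rows per next() RPC instead of the configured default.
    // Larger values mean fewer round trips but more client memory, and
    // the occasional next() that refills the cache takes longer.
    HBaseConfiguration conf = new HBaseConfiguration();
    conf.setInt("hbase.client.scanner.caching", 100);
    // ... open an HTable with this conf and scan as usual ...
  }
}
```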

Re: map reduce range of records from hbase table

Posted by Jaeyun Noh <me...@gmail.com>.
I wonder if a network RPC is involved whenever we call next() on the
scanner class.
Also, whether the scanner works as parallel requests to the HRegions with a
fetch to a temporary cache on the HBase client.
If so, we're happy to live with that.

Is the following hbase parameter related to my question?

<property>

    <name>hbase.client.scanner.caching</name>

    <value>30</value>

    <description>Number of rows that will be fetched when calling next

    on a scanner if it is not served from memory. Higher caching values

    will enable faster scanners but will eat up more memory and some

    calls of next may take longer and longer times when the cache is empty.

    </description>

  </property>

Regards, Jaeyun Noh.

On Wed, Oct 8, 2008 at 10:10 PM, stack <st...@duboce.net> wrote:

> On Wed, Oct 8, 2008 at 9:01 PM, Jaeyun Noh <me...@gmail.com> wrote:
>
> > Thx.
> >
> > BTW, it seems that the output format (subclass of
> > org.apache.hadoop.mapred.OutputFormat) of MR job can only be a file. Can
> we
> > define our own file format which hbase clients can access?
>
>
> No.  You can output to anything as long as you make it implement
> OutputFormat.  To output to hbase subclass TableReduce or see
> TableOutputFormat.
>
>
> >
> >
> > My goal is to implement a filter-enabled table scanner that runs as
> > multi-process clients using MR. I'm trying to leverage MR since the
> > ClientScanner class of HTable sequentially accesses each HRegion and thus
> > involves multiple round trips between servers and clients.
>
>
> I'm not sure I follow.  Perhaps start simple, then see where the bottlenecks
> are and optimize there.  Regarding round trips between client and server,
> what do you want? A scanner that returns batches rather than a row at a time?
>
> St.Ack
>
>
>
>
>
> >
> >
> > On Wed, Oct 8, 2008 at 4:30 PM, stack <st...@duboce.net> wrote:
> >
> > > Jaeyun Noh wrote:
> > >
> > >> Hi,
> > >>
> > >> May I ask another question?
> > >>
> > >> I'm running HBase/Hadoop on linux server, and implementing business
> > >> application with java, which runs on a different windows machine.
> > >>  It looks like MapReduce job runs on a server node. Can I run the
> > >> MapReduce
> > >> job built on windows client with an existing linux server? How can we
> > get
> > >> result done by MapReduce job at the server?
> > >>
> > >>
> > >
> > > You should be able to, yes.  Make sure you use same java on both
> > machines.
> > >  This page might help some:
> > http://wiki.apache.org/hadoop/Hbase/MapReduce.
> > > St.Ack
> > >
> >
>

Re: map reduce range of records from hbase table

Posted by stack <st...@duboce.net>.
On Wed, Oct 8, 2008 at 9:01 PM, Jaeyun Noh <me...@gmail.com> wrote:

> Thx.
>
> BTW, it seems that the output format (subclass of
> org.apache.hadoop.mapred.OutputFormat) of MR job can only be a file. Can we
> define our own file format which hbase clients can access?


No, it's not restricted to files.  You can output to anything as long as
you make it implement OutputFormat.  To output to HBase, subclass
TableReduce or see TableOutputFormat.


>
>
> My goal is to implement a filter-enabled table scanner that runs as
> multi-process clients using MR. I'm trying to leverage MR since the
> ClientScanner class of HTable sequentially accesses each HRegion and thus
> involves multiple round trips between servers and clients.


I'm not sure I follow.  Perhaps start simple, then see where the bottlenecks
are and optimize there.  Regarding round trips between client and server,
what do you want? A scanner that returns batches rather than a row at a time?

St.Ack





>
>
> On Wed, Oct 8, 2008 at 4:30 PM, stack <st...@duboce.net> wrote:
>
> > Jaeyun Noh wrote:
> >
> >> Hi,
> >>
> >> May I ask another question?
> >>
> >> I'm running HBase/Hadoop on linux server, and implementing business
> >> application with java, which runs on a different windows machine.
> >>  It looks like MapReduce job runs on a server node. Can I run the
> >> MapReduce
> >> job built on windows client with an existing linux server? How can we
> get
> >> result done by MapReduce job at the server?
> >>
> >>
> >
> > You should be able to, yes.  Make sure you use same java on both
> machines.
> >  This page might help some:
> http://wiki.apache.org/hadoop/Hbase/MapReduce.
> > St.Ack
> >
>
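For the archive, a hedged sketch of the HBase output path stack mentions. The names (TableReduce, BatchUpdate, ImmutableBytesWritable, TableMapReduceUtil.initTableReduceJob) are recalled from the 0.18-era mapred API; verify all of them, and whether TableReduce is a class or an interface in your release, against the javadoc. Table and column names are hypothetical.

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapred.TableReduce;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ValuesToTableReduce extends MapReduceBase
    implements TableReduce<Text, Text> {
  // Writes each reduced record back to an HBase row instead of a file;
  // TableOutputFormat (wired up via TableMapReduceUtil.initTableReduceJob)
  // commits the emitted BatchUpdate to the row it names.
  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
      Reporter reporter) throws IOException {
    BatchUpdate update = new BatchUpdate(key.toString());
    while (values.hasNext()) {
      // "content:value" is a made-up family:qualifier for this sketch.
      update.put("content:value", values.next().toString().getBytes());
    }
    output.collect(
        new ImmutableBytesWritable(key.toString().getBytes()), update);
  }
}
```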

Re: map reduce range of records from hbase table

Posted by Jaeyun Noh <me...@gmail.com>.
Thx.

BTW, it seems that the output format (a subclass of
org.apache.hadoop.mapred.OutputFormat) of an MR job can only be a file. Can
we define our own file format which HBase clients can access?

My goal is to implement a filter-enabled table scanner that runs as
multi-process clients using MR. I'm trying to leverage MR since the
ClientScanner class of HTable sequentially accesses each HRegion and thus
involves multiple round trips between servers and clients.

On Wed, Oct 8, 2008 at 4:30 PM, stack <st...@duboce.net> wrote:

> Jaeyun Noh wrote:
>
>> Hi,
>>
>> May I ask another question?
>>
>> I'm running HBase/Hadoop on linux server, and implementing business
>> application with java, which runs on a different windows machine.
>>  It looks like MapReduce job runs on a server node. Can I run the
>> MapReduce
>> job built on windows client with an existing linux server? How can we get
>> result done by MapReduce job at the server?
>>
>>
>
> You should be able to, yes.  Make sure you use same java on both machines.
>  This page might help some: http://wiki.apache.org/hadoop/Hbase/MapReduce.
> St.Ack
>

Re: map reduce range of records from hbase table

Posted by stack <st...@duboce.net>.
Jaeyun Noh wrote:
> Hi,
>
> May I ask another question?
>
> I'm running HBase/Hadoop on linux server, and implementing business
> application with java, which runs on a different windows machine.
>   
> It looks like MapReduce job runs on a server node. Can I run the MapReduce
> job built on windows client with an existing linux server? How can we get
> result done by MapReduce job at the server?
>   

You should be able to, yes.  Make sure you use the same Java version on both
machines.  This page might help some:
http://wiki.apache.org/hadoop/Hbase/MapReduce
St.Ack

Re: map reduce range of records from hbase table

Posted by Jaeyun Noh <me...@gmail.com>.
Hi,

May I ask another question?

I'm running HBase/Hadoop on a Linux server, and implementing a business
application in Java, which runs on a different Windows machine.

It looks like the MapReduce job runs on a server node. Can I run a MapReduce
job built on the Windows client against the existing Linux server? How can
we get the results of the MapReduce job from the server?

e.g. scanning specific table with some filter conditions and return sum of
specific columns...

Regards,
Jaeyun Noh.


On Wed, Oct 8, 2008 at 2:13 PM, stack <st...@duboce.net> wrote:

> Cedric Ho wrote:
>
>> Hi all,
>>
>> I am using 0.18.0 and have successfully used data from hbase table as
>> input to my map/reduce job.
>>
>> I wonder how to specify a subset of records from a table instead of
>> taking all records as input.
>> Such as a range of the row keys or maybe by specific values of certain
>> columns.
>>
>>
> You'll have to subclass the TableInputFormat.
>
> There is an example in the javadoc on subclassing TIF:
> http://hadoop.apache.org/hbase/docs/r0.18.0/api/org/apache/hadoop/hbase/mapred/TableInputFormatBase.html
> (Sorry, the example is mangled.  Do a get of the html source to see
> non-garbled code).
>
> The example shows you how to set a filter.  Filters can filter on rows and
> values.
>
> To work against a subset, you'd probably need to play with getSplits in
> your subclass.  By default, it basically returns as many splits as there are
> regions in your table, so it's always the whole table.  Filters could stop
> unwanted rows being returned, but maybe it's better if the rows weren't
> considered in the first place; hence the need for getSplits subclassing.
>
> St.Ack
>
>

Re: map reduce range of records from hbase table

Posted by stack <st...@duboce.net>.
Cedric Ho wrote:
> Hi all,
>
> I am using 0.18.0 and have successfully used data from hbase table as
> input to my map/reduce job.
>
> I wonder how to specify a subset of records from a table instead of
> taking all records as input.
> Such as a range of the row keys or maybe by specific values of certain columns.
>   
You'll have to subclass the TableInputFormat.

There is an example in the javadoc on subclassing TIF: 
http://hadoop.apache.org/hbase/docs/r0.18.0/api/org/apache/hadoop/hbase/mapred/TableInputFormatBase.html 
(Sorry, the example is mangled.  Do a get of the html source to see 
non-garbled code).

The example shows you how to set a filter.  Filters can filter on rows 
and values.

To work against a subset, you'd probably need to play with getSplits in
your subclass.  By default, it basically returns as many splits as there
are regions in your table, so it's always the whole table.  Filters could
stop unwanted rows being returned, but maybe it's better if the rows
weren't considered in the first place; hence the need for getSplits
subclassing.

St.Ack
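To make the getSplits route concrete, a hedged sketch of restricting a job to a key range by overriding getSplits in a TableInputFormat subclass. TableSplit's accessors and the 0.18 signatures are recalled from memory, and the START/STOP keys are hypothetical; check everything against the javadoc linked above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.mapred.TableInputFormat;
import org.apache.hadoop.hbase.mapred.TableSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

public class RangeTableInputFormat extends TableInputFormat {
  // Hypothetical range; a real job would read these from the JobConf.
  private static final byte[] START = "row-1000".getBytes();
  private static final byte[] STOP  = "row-2000".getBytes();

  public InputSplit[] getSplits(JobConf job, int numSplits)
      throws IOException {
    // Start from the default one-split-per-region layout...
    InputSplit[] all = super.getSplits(job, numSplits);
    List<InputSplit> kept = new ArrayList<InputSplit>();
    for (InputSplit split : all) {
      TableSplit ts = (TableSplit) split;
      // ...and keep only regions that can hold rows in [START, STOP).
      // An empty end key means the region runs to the end of the table.
      boolean beforeRange = ts.getEndRow().length > 0
          && compare(ts.getEndRow(), START) <= 0;
      boolean afterRange = compare(ts.getStartRow(), STOP) >= 0;
      if (!beforeRange && !afterRange) {
        kept.add(split);
      }
    }
    return kept.toArray(new InputSplit[kept.size()]);
  }

  // Plain unsigned lexicographic compare on row keys.
  private static int compare(byte[] a, byte[] b) {
    for (int i = 0; i < a.length && i < b.length; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) return d;
    }
    return a.length - b.length;
  }
}
```

Note this only drops whole regions; within a kept boundary region you would still pair it with a row filter, or a key check in the map, to skip rows outside the range.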


Re: map reduce range of records from hbase table

Posted by Erik Holstad <er...@gmail.com>.
Hi Cedric,
Not sure if it is possible to set a range of row keys, but setting columns
is fairly easy.

I would assume that you are using something like
TableMapReduceUtil.initTableMapJob(inputTable, inputCols, cla, Text.class,
Text.class, c);
where inputCols is a String[], for example {"content:a", "content:b",
"content:c"};

Erik

On Wed, Oct 8, 2008 at 12:33 AM, Cedric Ho <ce...@gmail.com> wrote:

> Hi all,
>
> I am using 0.18.0 and have successfully used data from hbase table as
> input to my map/reduce job.
>
> I wonder how to specify a subset of records from a table instead of
> taking all records as input.
> Such as a range of the row keys or maybe by specific values of certain
> columns.
>
> Any help is appreciated.
>
> Cedric
>
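To make the column-restriction route above concrete, a sketch of a driver around the call Erik names. The table name, column names, and mapper are hypothetical, and the exact 0.18 signatures (including whether the columns argument is a space-delimited String, as used here, or a String[] as in Erik's note) should be verified against the javadoc for your release.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.mapred.TableMap;
import org.apache.hadoop.hbase.mapred.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ColumnSubsetJob {
  // Minimal mapper: emits only the row key, just to show the wiring.
  public static class RowKeyMapper extends MapReduceBase
      implements TableMap<Text, Text> {
    public void map(ImmutableBytesWritable key, RowResult value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      output.collect(new Text(key.get()), new Text(""));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(ColumnSubsetJob.class);
    job.setJobName("scan a subset of columns");
    // Only the listed columns are read, so other cells never reach the map.
    TableMapReduceUtil.initTableMapJob("mytable", "content:a content:b",
        RowKeyMapper.class, Text.class, Text.class, job);
    JobClient.runJob(job);
  }
}
```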