Posted to general@hadoop.apache.org by "Kilbride, James P." <Ja...@gd-ais.com> on 2010/07/06 14:38:07 UTC

MapReduce HBASE examples

All,

The examples that ship with HBase, and those on the Hadoop wiki, all reference the deprecated interfaces of the mapred package. Are there any examples of how to use HBase as the input for a MapReduce job that uses the mapreduce package instead? I'm looking to set up a job which will read from an HBase table based on a row value passed into the job, and which starts the map with the row values (as the map keys) and the column names (or values) as the map values.

James Kilbride
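
A minimal sketch of the kind of job being asked about here, using the newer org.apache.hadoop.hbase.mapreduce API. The table, family, and class names are invented for illustration, and the client calls shown are the 0.90-style ones (on 0.20.x, HBaseConfiguration is constructed with new HBaseConfiguration() instead):

    import java.io.IOException;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ExampleTableJob {

      // Emits one (rowKey, columnName) pair per column in the family,
      // i.e. row values as map keys and column names as map values.
      static class ExampleMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result values, Context context)
            throws IOException, InterruptedException {
          Map<byte[], byte[]> columns = values.getFamilyMap(Bytes.toBytes("myfamily"));
          if (columns == null) return; // row had nothing in this family
          for (byte[] qualifier : columns.keySet()) {
            context.write(new Text(Bytes.toString(row.get())),
                          new Text(Bytes.toString(qualifier)));
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "example-table-job");
        job.setJarByClass(ExampleTableJob.class);

        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("myfamily")); // only fetch the family we need

        // Wires up TableInputFormat, the serialized Scan, and the mapper classes.
        TableMapReduceUtil.initTableMapperJob(
            "mytable", scan, ExampleMapper.class, Text.class, Text.class, job);

        job.setNumReduceTasks(0); // map-only: pairs go straight to the output files
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }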

Re: MapReduce HBASE examples

Posted by Jean-Daniel Cryans <jd...@apache.org>.
(moving the thread to the HBase user mailing list, on reply please remove
the general@ since this is not a general question)

It is indeed a parallelizable problem that could use a job management
system, but in your case I don't think MR is the right solution. You will
have to do all sorts of weird tweaks and in the end you won't get much out of
it since you basically want to process a tiny portion of the whole dataset.
You also talk about possible data locality, but I don't see that being
a particularly strong argument in what you describe. Yes, you could start
one mapper per region that contains some of the rows you are looking for,
but the cost of starting and managing those JVMs is high compared to just
starting one that does the work (since it can be done easily in a single
process that can be multi-threaded).
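
To illustrate, a sketch of that single multi-threaded process using a plain client-side thread pool. The table name and row keys are placeholders, each task opens its own HTable because HTable is not thread-safe, and the constructors shown are the 0.90-style ones (they differ slightly on 0.20.x):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ParallelGets {
      public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        ExecutorService pool = Executors.newFixedThreadPool(10);

        List<Future<Result>> futures = new ArrayList<Future<Result>>();
        for (final String rowKey : args) { // the rows you already know you want
          futures.add(pool.submit(new Callable<Result>() {
            public Result call() throws Exception {
              HTable table = new HTable(conf, "mytable"); // one per thread
              try {
                return table.get(new Get(Bytes.toBytes(rowKey)));
              } finally {
                table.close();
              }
            }
          }));
        }

        for (Future<Result> f : futures) {
          Result r = f.get();
          if (!r.isEmpty()) { // rows that don't exist come back empty
            System.out.println(Bytes.toString(r.getRow()) + ": " + r.size() + " cells");
          }
        }
        pool.shutdown();
      }
    }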

To sum up, using MR on a small dataset basically gives you all the
disadvantages for almost none of the advantages.

Instead you could look into running Gearman (or similar) on those machines
and that would give you exactly what you need IMHO.

J-D

On Tue, Jul 6, 2010 at 10:50 AM, Kilbride, James P. <
James.Kilbride@gd-ais.com> wrote:

> I'm assuming the rows being pulled back are smaller than the full row set
> of the entire database. So say the 10 out of 2B case. But, each row has a
> column family whose 'columns' are actually rowIds in the database
> (basically my one-to-many relationship mapping). I'm not trying to use MR
> for the initial get of 10 rows, but rather for the fact that each of those 10
> initial rows generates potentially hundreds or thousands of other calls.
>
> I am trying to do this for a real-time user request, but I expect the total
> processing to take some time, so it's more of a user-initiated call. There
> also may be dozens of users making the request at any given time, so I want
> to farm this out into the MR world so that multiple instances of the job can
> be running (with completely different starting rows) at any given time.
>
> I could do this using a serialized local process, but I explicitly want some
> of my processing, which could take some time, to happen out in the MapReduce
> world to take advantage of spare cycles elsewhere as well as potential data
> locality; and the fact that it is a parallelizable problem seems to imply
> that M/R would be a logical way to do it.
>
> James Kilbride
>
> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of
> Jean-Daniel Cryans
> Sent: Tuesday, July 06, 2010 1:12 PM
> To: general@hadoop.apache.org
> Subject: Re: MapReduce HBASE examples
>
> That won't be very efficient either... are you trying to do this for a real
> time user request? If so, it really isn't the way you want to go.
>
> If you are in a batch processing situation, I'd say it depends on how many
> rows you have vs. how many you need to retrieve, e.g. scanning 2B rows only
> to find 10 rows really doesn't make sense. How do you determine which users you
> need to process? How big is your dataset? I understand that you wish to use
> the MR-provided functionalities of grouping and such, but simply issuing a
> bunch of Gets in parallel may just be easier to write and maintain.
>
> J-D
>
> On Tue, Jul 6, 2010 at 10:02 AM, Kilbride, James P. <
> James.Kilbride@gd-ais.com> wrote:
>
> > So, if that's the case, and your argument makes sense understanding how
> > scan versus get works, I'd have to write a custom InputFormat class that
> > looks like the TableInputFormat class, but uses a Get (or series of Gets)
> > rather than a Scan object as the current table mapper does?
> >
> > James Kilbride
> >
> > -----Original Message-----
> > From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of
> > Jean-Daniel Cryans
> > Sent: Tuesday, July 06, 2010 12:53 PM
> > To: general@hadoop.apache.org
> > Subject: Re: MapReduce HBASE examples
> >
> > >
> > >
> > > Does this make any sense?
> > >
> > >
> > Not in a MapReduce context. What you want to do is a LIKE with a bunch of
> > values, right? Since a mapper will always read all the input that it's given
> > (minus some filters like you can do with HBase), whatever you do will always
> > end up being a full table scan. You "could" solve your problem by
> > configuring your Scan object with a RowFilter that knows about the names you
> > are looking for, but that still ends up being a full scan on the region
> > server side, so it will be slow and will generate a lot of IO.
> >
> > WRT examples, HBase ships with a couple of utility classes that can also
> > be used as examples. The Export class has the Scan configuration stuff:
> >
> > http://github.com/apache/hbase/blob/0.20/src/java/org/apache/hadoop/hbase/mapreduce/Export.java
> >
> > J-D
> >
>

RE: MapReduce HBASE examples

Posted by "Kilbride, James P." <Ja...@gd-ais.com>.
I'm assuming the rows being pulled back are smaller than the full row set of the entire database. So say the 10 out of 2B case. But, each row has a column family whose 'columns' are actually rowIds in the database (basically my one-to-many relationship mapping). I'm not trying to use MR for the initial get of 10 rows, but rather for the fact that each of those 10 initial rows generates potentially hundreds or thousands of other calls.

I am trying to do this for a real-time user request, but I expect the total processing to take some time, so it's more of a user-initiated call. There also may be dozens of users making the request at any given time, so I want to farm this out into the MR world so that multiple instances of the job can be running (with completely different starting rows) at any given time.

I could do this using a serialized local process, but I explicitly want some of my processing, which could take some time, to happen out in the MapReduce world to take advantage of spare cycles elsewhere as well as potential data locality; and the fact that it is a parallelizable problem seems to imply that M/R would be a logical way to do it.

James Kilbride
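
To make that fan-out concrete, a sketch of the first hop, with invented table and family names: the seed row's column qualifiers are read out as the row keys to chase next. The follow-up Gets can then go through a thread pool like the one sketched earlier in the thread.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.NavigableMap;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FanOutSeed {
      // Returns the row keys stored as qualifiers in the seed row's 'links' family.
      public static List<byte[]> linkedRows(String seedRow) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // new HBaseConfiguration() on 0.20.x
        HTable table = new HTable(conf, "mytable");
        try {
          Get get = new Get(Bytes.toBytes(seedRow));
          get.addFamily(Bytes.toBytes("links")); // one-to-many family: qualifier = target rowId
          Result seed = table.get(get);
          NavigableMap<byte[], byte[]> links = seed.getFamilyMap(Bytes.toBytes("links"));
          return links == null ? new ArrayList<byte[]>()
                               : new ArrayList<byte[]>(links.keySet());
        } finally {
          table.close();
        }
      }
    }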

-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Tuesday, July 06, 2010 1:12 PM
To: general@hadoop.apache.org
Subject: Re: MapReduce HBASE examples

That won't be very efficient either... are you trying to do this for a real
time user request? If so, it really isn't the way you want to go.

If you are in a batch processing situation, I'd say it depends on how many
rows you have vs. how many you need to retrieve, e.g. scanning 2B rows only to
find 10 rows really doesn't make sense. How do you determine which users you
need to process? How big is your dataset? I understand that you wish to use
the MR-provided functionalities of grouping and such, but simply issuing a
bunch of Gets in parallel may just be easier to write and maintain.

J-D

On Tue, Jul 6, 2010 at 10:02 AM, Kilbride, James P. <
James.Kilbride@gd-ais.com> wrote:

> So, if that's the case, and your argument makes sense understanding how scan
> versus get works, I'd have to write a custom InputFormat class that looks
> like the TableInputFormat class, but uses a Get (or series of Gets) rather
> than a Scan object as the current table mapper does?
>
> James Kilbride
>
> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of
> Jean-Daniel Cryans
> Sent: Tuesday, July 06, 2010 12:53 PM
> To: general@hadoop.apache.org
> Subject: Re: MapReduce HBASE examples
>
> >
> >
> > Does this make any sense?
> >
> >
> Not in a MapReduce context. What you want to do is a LIKE with a bunch of
> values, right? Since a mapper will always read all the input that it's given
> (minus some filters like you can do with HBase), whatever you do will always
> end up being a full table scan. You "could" solve your problem by
> configuring your Scan object with a RowFilter that knows about the names you
> are looking for, but that still ends up being a full scan on the region
> server side, so it will be slow and will generate a lot of IO.
>
> WRT examples, HBase ships with a couple of utility classes that can also be
> used as examples. The Export class has the Scan configuration stuff:
>
> http://github.com/apache/hbase/blob/0.20/src/java/org/apache/hadoop/hbase/mapreduce/Export.java
>
> J-D
>

Re: MapReduce HBASE examples

Posted by Jean-Daniel Cryans <jd...@apache.org>.
That won't be very efficient either... are you trying to do this for a real
time user request? If so, it really isn't the way you want to go.

If you are in a batch processing situation, I'd say it depends on how many
rows you have vs. how many you need to retrieve, e.g. scanning 2B rows only to
find 10 rows really doesn't make sense. How do you determine which users you
need to process? How big is your dataset? I understand that you wish to use
the MR-provided functionalities of grouping and such, but simply issuing a
bunch of Gets in parallel may just be easier to write and maintain.

J-D

On Tue, Jul 6, 2010 at 10:02 AM, Kilbride, James P. <
James.Kilbride@gd-ais.com> wrote:

> So, if that's the case, and your argument makes sense understanding how scan
> versus get works, I'd have to write a custom InputFormat class that looks
> like the TableInputFormat class, but uses a Get (or series of Gets) rather
> than a Scan object as the current table mapper does?
>
> James Kilbride
>
> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of
> Jean-Daniel Cryans
> Sent: Tuesday, July 06, 2010 12:53 PM
> To: general@hadoop.apache.org
> Subject: Re: MapReduce HBASE examples
>
> >
> >
> > Does this make any sense?
> >
> >
> Not in a MapReduce context. What you want to do is a LIKE with a bunch of
> values, right? Since a mapper will always read all the input that it's given
> (minus some filters like you can do with HBase), whatever you do will always
> end up being a full table scan. You "could" solve your problem by
> configuring your Scan object with a RowFilter that knows about the names you
> are looking for, but that still ends up being a full scan on the region
> server side, so it will be slow and will generate a lot of IO.
>
> WRT examples, HBase ships with a couple of utility classes that can also be
> used as examples. The Export class has the Scan configuration stuff:
>
> http://github.com/apache/hbase/blob/0.20/src/java/org/apache/hadoop/hbase/mapreduce/Export.java
>
> J-D
>

RE: MapReduce HBASE examples

Posted by "Kilbride, James P." <Ja...@gd-ais.com>.
So, if that's the case, and your argument makes sense understanding how scan versus get works, I'd have to write a custom InputFormat class that looks like the TableInputFormat class, but uses a Get (or series of Gets) rather than a Scan object as the current table mapper does?

James Kilbride
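
A condensed sketch of what such a Get-based input format could look like: every class and conf-key name here is hypothetical, error handling is elided, and the client constructors are the 0.90-style ones. It also makes J-D's objection visible: one split per row key means one task, with its JVM overhead, per single Get.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // Hypothetical: one split per requested row key, each read with a Get.
    public class MultiGetInputFormat
        extends InputFormat<ImmutableBytesWritable, Result> {

      public static final String TABLE = "multiget.table"; // invented conf keys
      public static final String ROWS  = "multiget.rows";  // comma-separated keys

      public static class RowSplit extends InputSplit implements Writable {
        private String row;
        public RowSplit() {}                        // required for deserialization
        public RowSplit(String row) { this.row = row; }
        public String getRow() { return row; }
        @Override public long getLength() { return 0; }
        @Override public String[] getLocations() { return new String[0]; }
        public void write(DataOutput out) throws IOException { out.writeUTF(row); }
        public void readFields(DataInput in) throws IOException { row = in.readUTF(); }
      }

      @Override
      public List<InputSplit> getSplits(JobContext ctx) {
        // Assumes ROWS was set on the job Configuration before submission.
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (String row : ctx.getConfiguration().getStrings(ROWS)) {
          splits.add(new RowSplit(row));
        }
        return splits;
      }

      @Override
      public RecordReader<ImmutableBytesWritable, Result> createRecordReader(
          InputSplit split, TaskAttemptContext ctx) {
        return new RecordReader<ImmutableBytesWritable, Result>() {
          private Result result;
          private boolean consumed = false;

          public void initialize(InputSplit s, TaskAttemptContext c) throws IOException {
            String row = ((RowSplit) s).getRow();
            HTable table = new HTable(c.getConfiguration(),
                                      c.getConfiguration().get(TABLE));
            try {
              result = table.get(new Get(Bytes.toBytes(row))); // the one Get
            } finally {
              table.close();
            }
          }
          public boolean nextKeyValue() { // exactly one record per split
            if (consumed) return false;
            consumed = true;
            return true;
          }
          public ImmutableBytesWritable getCurrentKey() {
            return new ImmutableBytesWritable(result.getRow());
          }
          public Result getCurrentValue() { return result; }
          public float getProgress() { return consumed ? 1.0f : 0.0f; }
          public void close() {}
        };
      }
    }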

-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Tuesday, July 06, 2010 12:53 PM
To: general@hadoop.apache.org
Subject: Re: MapReduce HBASE examples

>
>
> Does this make any sense?
>
>
Not in a MapReduce context. What you want to do is a LIKE with a bunch of
values, right? Since a mapper will always read all the input that it's given
(minus some filters like you can do with HBase), whatever you do will always
end up being a full table scan. You "could" solve your problem by
configuring your Scan object with a RowFilter that knows about the names you
are looking for, but that still ends up being a full scan on the region
server side, so it will be slow and will generate a lot of IO.

WRT examples, HBase ships with a couple of utility classes that can also be
used as examples. The Export class has the Scan configuration stuff:
http://github.com/apache/hbase/blob/0.20/src/java/org/apache/hadoop/hbase/mapreduce/Export.java

J-D

Re: MapReduce HBASE examples

Posted by Jean-Daniel Cryans <jd...@apache.org>.
>
>
> Does this make any sense?
>
>
Not in a MapReduce context. What you want to do is a LIKE with a bunch of
values, right? Since a mapper will always read all the input that it's given
(minus some filters like you can do with HBase), whatever you do will always
end up being a full table scan. You "could" solve your problem by
configuring your Scan object with a RowFilter that knows about the names you
are looking for, but that still ends up being a full scan on the region
server side, so it will be slow and will generate a lot of IO.
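
For concreteness, a sketch of that Scan-plus-RowFilter configuration. The row keys stand in for the names being looked up; note again that the region servers still walk the whole table, the filter only keeps non-matching rows off the wire:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.BinaryComparator;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.Filter;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.filter.RowFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FilteredScanExample {
      public static Scan scanFor(String... rowKeys) {
        List<Filter> matchers = new ArrayList<Filter>();
        for (String key : rowKeys) {
          matchers.add(new RowFilter(CompareOp.EQUAL,
                                     new BinaryComparator(Bytes.toBytes(key))));
        }
        Scan scan = new Scan();
        // MUST_PASS_ONE = logical OR across the individual row matchers
        scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ONE, matchers));
        return scan;
      }
    }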

WRT examples, HBase ships with a couple of utility classes that can also be
used as examples. The Export class has the Scan configuration stuff:
http://github.com/apache/hbase/blob/0.20/src/java/org/apache/hadoop/hbase/mapreduce/Export.java

J-D

RE: MapReduce HBASE examples

Posted by "Kilbride, James P." <Ja...@gd-ais.com>.
This is an interesting start, but I'm really interested in the opposite direction, where HBase is the input to my MapReduce job; I'm then going to push some data into reducers, which can ultimately just write it to a file.

I get the impression that I need to set up a TableInputFormat type of object. But since Job only allows you to do setInputFormatClass, I'm not sure how to dynamically configure the input format class to accept some parameters that limit its scan on the table to only specific rows. Here's the general thrust of what I'm trying to do with MapReduce and HBase.

I have a table called People which has rows of people (names, ids, whatever is used for identifying a person in the system). That table also has a column family called relatives, where the column ids are the names of relatives for the person. I want to pass into the InputFormat object the names of the people I want it to look up, and the mapper should get the person's name as the key and the columnFamily relatives as the value (that's the result of the scan limitations I'm putting into place).

I then will retrieve the relatives (in the map function), look at relationships between them, and push onto the context the relative's name (keyOut) and a floating point value (valueOut). The reducer will combine all these floating point values for each relative and output (in a file is fine) the relative's name and cumulative score.

But I can't seem to figure out how to set up a job that uses the TableInputFormat I want, and which also allows me to set the parameter for it so that it will only give me the people I ask for when I run the program, not the entire table.

Does this make any sense?

James Kilbride
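
Putting the pieces of this message together, one possible wiring is sketched below. The table and family names are taken from the description above, the per-relative scoring is a placeholder, and the row restriction reuses the RowFilter approach discussed earlier in the thread, with the same full-scan caveat. Usage would be roughly: hadoop jar relatives.jar RelativesJob /tmp/out alice bob.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.NavigableMap;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.BinaryComparator;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.Filter;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.filter.RowFilter;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RelativesJob {

      static class RelativesMapper extends TableMapper<Text, FloatWritable> {
        protected void map(ImmutableBytesWritable person, Result row, Context ctx)
            throws IOException, InterruptedException {
          // Every qualifier in 'relatives' is a relative's name; score each one.
          NavigableMap<byte[], byte[]> relatives =
              row.getFamilyMap(Bytes.toBytes("relatives"));
          if (relatives == null) return;
          for (byte[] relative : relatives.keySet()) {
            float score = 1.0f; // placeholder for the real relationship scoring
            ctx.write(new Text(Bytes.toString(relative)), new FloatWritable(score));
          }
        }
      }

      static class SumReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {
        protected void reduce(Text relative, Iterable<FloatWritable> scores, Context ctx)
            throws IOException, InterruptedException {
          float total = 0;
          for (FloatWritable s : scores) total += s.get();
          ctx.write(relative, new FloatWritable(total)); // name TAB cumulative score
        }
      }

      public static void main(String[] args) throws Exception {
        // args: <outputDir> <person> [person...]
        Job job = new Job(HBaseConfiguration.create(), "relatives-scoring");
        job.setJarByClass(RelativesJob.class);

        // Restrict the scan to the requested people; still a full scan server-side.
        List<Filter> wanted = new ArrayList<Filter>();
        for (int i = 1; i < args.length; i++) {
          wanted.add(new RowFilter(CompareOp.EQUAL,
                                   new BinaryComparator(Bytes.toBytes(args[i]))));
        }
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("relatives"));
        scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ONE, wanted));

        TableMapReduceUtil.initTableMapperJob(
            "People", scan, RelativesMapper.class, Text.class, FloatWritable.class, job);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }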

-----Original Message-----
From: Harsh J [mailto:qwertymaniac@gmail.com] 
Sent: Tuesday, July 06, 2010 12:10 PM
To: general@hadoop.apache.org
Subject: Re: MapReduce HBASE examples

I believe this article will help you understand the new (well, not so new
anymore) API + HBase MR: http://kdpeterson.net/blog/2009/09/minimal-hbase-mapreduce-example.html
[Look at the second example, which uses the Put object]

On Tue, Jul 6, 2010 at 6:08 PM, Kilbride, James P.
<Ja...@gd-ais.com> wrote:
> All,
>
> The examples that ship with HBase, and those on the Hadoop wiki, all reference the deprecated interfaces of the mapred package. Are there any examples of how to use HBase as the input for a MapReduce job that uses the mapreduce package instead? I'm looking to set up a job which will read from an HBase table based on a row value passed into the job, and which starts the map with the row values (as the map keys) and the column names (or values) as the map values.
>
> James Kilbride
>



-- 
Harsh J
www.harshj.com

Re: MapReduce HBASE examples

Posted by Harsh J <qw...@gmail.com>.
I believe this article will help you understand the new (well, not so new
anymore) API + HBase MR: http://kdpeterson.net/blog/2009/09/minimal-hbase-mapreduce-example.html
[Look at the second example, which uses the Put object]
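
For reference, the Put-based direction covered by that second example looks roughly like this. This is only a sketch, with invented table, family, and class names:

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    // A reducer that writes its sums back into an HBase table as Puts.
    public class PutReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        Put put = new Put(Bytes.toBytes(key.toString()));
        put.add(Bytes.toBytes("details"), Bytes.toBytes("total"),
                Bytes.toBytes(sum)); // family:qualifier = cumulative total
        ctx.write(new ImmutableBytesWritable(Bytes.toBytes(key.toString())), put);
      }
    }
    // Wired up with: TableMapReduceUtil.initTableReducerJob("mytable", PutReducer.class, job);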

On Tue, Jul 6, 2010 at 6:08 PM, Kilbride, James P.
<Ja...@gd-ais.com> wrote:
> All,
>
> The examples that ship with HBase, and those on the Hadoop wiki, all reference the deprecated interfaces of the mapred package. Are there any examples of how to use HBase as the input for a MapReduce job that uses the mapreduce package instead? I'm looking to set up a job which will read from an HBase table based on a row value passed into the job, and which starts the map with the row values (as the map keys) and the column names (or values) as the map values.
>
> James Kilbride
>



-- 
Harsh J
www.harshj.com