Posted to user@hbase.apache.org by Shawn Quinn <sq...@moxiegroup.com> on 2012/04/03 14:52:56 UTC

HBase MapReduce Job with Multiple Scans

Hello,

I have a table whose key is structured as "eventType + time", and I need to
periodically run a MapReduce job on the table that processes each
event type within a specific time range.  So the job needs to
process multiple segments of the table as input, and therefore can't be
set up with a single scan.  (Using a filter on the scan would theoretically
work, but doesn't scale well as the data size increases.)
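
To make the "multiple segments" concrete, here is a minimal sketch of how the
per-event-type row ranges could be computed.  The key encoding is an
assumption (the event type followed by a 13-digit, zero-padded epoch-millis
timestamp), and the class and method names are hypothetical; each resulting
range would become the start/stop row of one scan.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: builds one row-key range per event type.
final class EventScanRanges {

    /** Half-open row-key range [startRow, stopRow), matching scan start/stop semantics. */
    static final class RowRange {
        final String startRow;
        final String stopRow;

        RowRange(String startRow, String stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    // Assumed key layout: eventType + 13-digit zero-padded millisecond timestamp,
    // so that keys for one event type sort by time.
    static String rowKey(String eventType, long timeMillis) {
        return eventType + String.format("%013d", timeMillis);
    }

    /** One range per event type, covering [fromMillis, toMillis). */
    static List<RowRange> rangesFor(List<String> eventTypes, long fromMillis, long toMillis) {
        List<RowRange> ranges = new ArrayList<>();
        for (String type : eventTypes) {
            ranges.add(new RowRange(rowKey(type, fromMillis), rowKey(type, toMillis)));
        }
        return ranges;
    }
}
```

With N event types this yields N disjoint key ranges, which is why a single
start/stop row pair cannot cover them without also reading the rows in between.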

Given that the HBase-provided "TableMapReduceUtil.initTableMapperJob" only
supports a single scan, there doesn't appear to be a built-in way to run a
MapReduce job that has multiple scans as input.  I found the following
related post, which points me toward creating my own MapReduce "InputFormat"
type by extending HBase's "TableInputFormatBase" and overriding the
"getSplits()" method:

http://stackoverflow.com/questions/4821455/hbase-mapreduce-on-multiple-scan-objects
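
As a rough illustration of what an overridden "getSplits()" would have to
compute, the sketch below intersects each scan range with each region's key
range and emits one split per non-empty overlap.  It deliberately uses plain
Java strings instead of the HBase API: the class name, the Range type, and
the use of an empty string for an unbounded region edge are assumptions made
for the example, and ASCII keys are assumed so that String ordering matches
byte ordering.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the split logic; not an HBase class.
final class MultiRangeSplitter {

    /** Half-open key range [start, stop); an empty string means unbounded on that side. */
    static final class Range {
        final String start;
        final String stop;

        Range(String start, String stop) {
            this.start = start;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + stop + ")";
        }
    }

    // The later of two start keys. An empty start sorts first, so compareTo suffices.
    static String maxStart(String a, String b) {
        return a.compareTo(b) >= 0 ? a : b;
    }

    // The earlier of two stop keys, treating an empty stop as "no upper bound".
    static String minStop(String a, String b) {
        if (a.isEmpty()) return b;
        if (b.isEmpty()) return a;
        return a.compareTo(b) <= 0 ? a : b;
    }

    /** One split per (scan, region) pair whose key ranges overlap. */
    static List<Range> splits(List<Range> scans, List<Range> regions) {
        List<Range> out = new ArrayList<>();
        for (Range scan : scans) {
            for (Range region : regions) {
                String start = maxStart(scan.start, region.start);
                String stop = minStop(scan.stop, region.stop);
                // Skip pairs that do not overlap.
                if (stop.isEmpty() || start.compareTo(stop) < 0) {
                    out.add(new Range(start, stop));
                }
            }
        }
        return out;
    }
}
```

A real InputFormat would additionally attach the region's location to each
split so that map tasks can be scheduled close to the data.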

So, that's currently the direction I'm heading.  However, before I get too
far into the weeds, I thought I'd ask:

1. Is this still the best/right way to handle this situation?

2. Does anyone have an example of a custom InputFormat that sets up
multiple scans against an HBase input table (something like the
"MultiSegmentTableInputFormat" referred to in the post) that they'd be
willing to share?

Thanks,

       -Shawn

Re: HBase MapReduce Job with Multiple Scans

Posted by Shawn Quinn <sq...@moxiegroup.com>.
Sounds good, thanks Ted.  I'll give it a whirl and add any
comments/findings to the Jira issue.

     -Shawn

On Tue, Apr 3, 2012 at 10:45 AM, Ted Yu <yu...@gmail.com> wrote:

> Stack said he might help implement his suggestions if Eran is busy.
>
> The patch doesn't depend on recent changes to Hadoop/MapReduce.
>
> Give it a try. Feedback would help us refine the patch.
>
> Thanks

Re: HBase MapReduce Job with Multiple Scans

Posted by Ted Yu <yu...@gmail.com>.
Stack said he might help implement his suggestions if Eran is busy.

The patch doesn't depend on recent changes to Hadoop/MapReduce.

Give it a try. Feedback would help us refine the patch.

Thanks

On Tue, Apr 3, 2012 at 7:43 AM, Shawn Quinn <sq...@moxiegroup.com> wrote:

> Thanks for the quick reply, Ted!  That's exactly what I'm looking for.
> Reading through the Jira comments, I'm a bit confused about the
> status/plan for that patch.  Do you expect it will be included in the
> next HBase release, or has it been postponed?  Also, does that change
> depend on any recent changes to Hadoop/MapReduce, or will it work
> as-is?
>
> In the meantime, I'll give that patch a closer look and set up some custom
> classes in my own project to try to pull off something similar.
>
>     -Shawn

Re: HBase MapReduce Job with Multiple Scans

Posted by Shawn Quinn <sq...@moxiegroup.com>.
Thanks for the quick reply, Ted!  That's exactly what I'm looking for.
Reading through the Jira comments, I'm a bit confused about the
status/plan for that patch.  Do you expect it will be included in the
next HBase release, or has it been postponed?  Also, does that change
depend on any recent changes to Hadoop/MapReduce, or will it work as-is?

In the meantime, I'll give that patch a closer look and set up some custom
classes in my own project to try to pull off something similar.

     -Shawn

On Tue, Apr 3, 2012 at 9:42 AM, Ted Yu <yu...@gmail.com> wrote:

> Take a look at HBASE-3996 where Stack has some comments outstanding.
>
> Cheers

Re: HBase MapReduce Job with Multiple Scans

Posted by Ted Yu <yu...@gmail.com>.
Take a look at HBASE-3996 where Stack has some comments outstanding.

Cheers
