Posted to common-user@hadoop.apache.org by "Dhodapkar, Chinmay" <ch...@qualcomm.com> on 2011/09/09 23:54:45 UTC

Hbase + mapreduce -- operational design question

Hello,
I have a setup where a bunch of clients store 'events' in an HBase table. Also, periodically (once a day), I run a MapReduce job that goes over the table and computes some reports.

Now my issue is that on the next run I don't want the MapReduce job to process the 'events' that it has already processed. I know that I can mark processed events in the HBase table and have the mapper filter them out during the next run. But what I would really like is for previously processed events to not even hit the mapper.

One solution I can think of is to back up the HBase table after running the job and then clear the table. But this has a lot of problems:
1) Clients may have inserted events while the job was running.
2) I could disable and drop the table and then create it again, but then the clients would complain about the short window of unavailability.


What do people using HBase (live) + MapReduce typically do?

Thanks!
Chinmay

Re: Hbase + mapreduce -- operational design question

Posted by Sonal Goyal <so...@gmail.com>.

Chinmay, how are you configuring your job? Have you tried using setScan
to select only the keys you want to run MR over? See

http://ofps.oreilly.com/titles/9781449396107/mapreduce.html
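
A minimal sketch of the key-selection idea, under assumptions that are not
from the thread: row keys start with an 8-byte big-endian event timestamp (a
hypothetical layout), and each run covers the half-open window from the
previous run's start to the current run's start. HBase orders rows by
unsigned lexicographic byte comparison and treats a scan's start row as
inclusive and its stop row as exclusive, which Python's bytes comparison
mirrors here. The function names are illustrative, and this models only the
scan bounds, not the HBase client API.

```python
import struct
from typing import Tuple


def row_key(ts_millis: int, event_id: str) -> bytes:
    """Hypothetical key layout: 8-byte big-endian timestamp, then the event id."""
    return struct.pack(">Q", ts_millis) + event_id.encode("utf-8")


def scan_bounds(last_run_millis: int, this_run_millis: int) -> Tuple[bytes, bytes]:
    """Return (start_row, stop_row) covering events in [last_run, this_run)."""
    if this_run_millis < last_run_millis:
        raise ValueError("this_run_millis must not precede last_run_millis")
    # Big-endian packing makes byte order agree with numeric timestamp order.
    return struct.pack(">Q", last_run_millis), struct.pack(">Q", this_run_millis)


def in_scan(key: bytes, start_row: bytes, stop_row: bytes) -> bool:
    """Model of a range scan: start row inclusive, stop row exclusive."""
    return start_row <= key < stop_row
```

Fixing the upper bound at job start means events written while the job runs
fall into the next window rather than being skipped or processed twice,
provided their key timestamps are assigned at write time by a reasonably
synchronized clock. A known trade-off of timestamp-leading keys is that new
writes concentrate on a single region.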

As a shameless plug - For your reports, see if you want to leverage Crux:
https://github.com/sonalgoyal/crux

Best Regards,
Sonal
Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
Nube Technologies <http://www.nubetech.co>

On Sat, Sep 10, 2011 at 2:53 PM, Eugene Kirpichov <ek...@gmail.com> wrote:

> I believe HBase has some kind of TTL (timeout-based expiry) for
> records and it can clean them up on its own.

Re: Hbase + mapreduce -- operational design question

Posted by Eugene Kirpichov <ek...@gmail.com>.

I believe HBase supports a TTL (time-based expiry) on column families,
so it can clean up old records on its own.
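
As a rough model of the TTL idea: in HBase a TTL is configured per column
family in seconds, and cells older than that stop being returned by reads and
are physically removed during compaction. The sketch below only models that
arithmetic in plain Python; the helper names and the safety factor are
assumptions for illustration. Note that a TTL bounds table growth but does
not by itself keep processed, not-yet-expired events away from the mapper.

```python
def ttl_seconds(run_interval_s: int, max_job_duration_s: int,
                safety_factor: int = 2) -> int:
    """Pick a TTL long enough that an event outlives the run that processes it.

    safety_factor is a hypothetical margin for delayed or failed runs.
    """
    if run_interval_s <= 0 or max_job_duration_s < 0 or safety_factor < 1:
        raise ValueError("invalid interval, duration, or safety factor")
    return safety_factor * (run_interval_s + max_job_duration_s)


def is_expired(cell_ts_ms: int, now_ms: int, ttl_s: int) -> bool:
    """A cell counts as expired once its age exceeds the TTL."""
    return now_ms - cell_ts_ms > ttl_s * 1000
```

For a daily job that takes at most an hour, `ttl_seconds(86400, 3600)` gives
a TTL of a little over two days, so an event survives one missed run before
it expires.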

-- 
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/