Posted to user@hbase.apache.org by Rahul Ravindran <ra...@yahoo.com> on 2013/02/14 20:40:28 UTC

Using HBase for Deduping

Hi,
   We have events which are delivered into our HDFS cluster and which may be duplicated. Each event has a UUID, and we were hoping to leverage HBase to dedupe them. We run a MapReduce job which performs a lookup for each UUID in HBase, emits the event only if the UUID was absent, and also inserts the UUID into the HBase table (this is simplistic; I am omitting details that make this more resilient to failures). My concern is that doing a read+write for every event in MR would be slow (we expect around 1 billion events every hour). Does anyone use HBase for a similar use case, or is there a different approach to achieving the same end result? Any information or comments would be great.

Thanks,
~Rahul.
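
A minimal sketch of the lookup-and-insert step described above, written against the 0.94-era HBase client API; the table, family, and qualifier names are made up. Using checkAndPut instead of a separate Get and Put makes the insert-if-absent check atomic on the region server, so two concurrent mappers cannot both claim the same UUID:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupeCheck {
    // Table, family, and qualifier names are hypothetical.
    private static final byte[] CF = Bytes.toBytes("d");
    private static final byte[] QUAL = Bytes.toBytes("seen");
    private final HTable table;

    public DedupeCheck(Configuration conf) throws IOException {
        table = new HTable(conf, "event_uuids");
    }

    /** Returns true if the UUID is new (emit the event), false if duplicate. */
    public boolean firstSighting(byte[] uuid) throws IOException {
        Put put = new Put(uuid);
        put.add(CF, QUAL, Bytes.toBytes(System.currentTimeMillis()));
        // checkAndPut succeeds only if the cell is still absent (expected
        // value null), so the read and write happen as one atomic operation.
        return table.checkAndPut(uuid, CF, QUAL, null, put);
    }
}

Even so, this is one round trip per event, which is exactly the cost being asked about; the replies in this thread explore ways to push the dedupe work into HBase itself.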

Re: Using HBase for Deduping

Posted by Michael Segel <mi...@hotmail.com>.
But then he can't trigger an event if it's a net new row.

Methinks he needs to better define the problem he is trying to solve, along with the number of events: a billion an hour, or roughly 300K events a second? (OK, it's 277.78K events a second.)
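
Spelled out, the conversion is

$\frac{10^{9}\ \text{events/hour}}{3600\ \text{s/hour}} \approx 277{,}778\ \text{events/s} \approx 278\text{K}\ \text{events/s}$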


On Feb 14, 2013, at 10:19 PM, Anoop Sam John <an...@huawei.com> wrote:

> When max versions is set to 1 and a duplicate key is added, the last one added will win, removing the old. Is this what you want, Rahul? I think from his explanation he needs the reverse.
> 
> -Anoop-
> ________________________________________
> From: Asaf Mesika [asaf.mesika@gmail.com]
> Sent: Friday, February 15, 2013 3:56 AM
> To: user@hbase.apache.org; Rahul Ravindran
> Subject: Re: Using HBase for Deduping
> 
> You can load the events into an HBase table which has the event id as the
> unique row key. You can set max versions to 1 on the column family, thus
> letting HBase get rid of the duplicates for you during major compaction.
> 
> 
> 
> On Thursday, February 14, 2013, Rahul Ravindran wrote:
> 
>> Hi,
>>   We have events which are delivered into our HDFS cluster and which may
>> be duplicated. Each event has a UUID, and we were hoping to leverage
>> HBase to dedupe them. We run a MapReduce job which performs a lookup for
>> each UUID in HBase, emits the event only if the UUID was absent, and also
>> inserts the UUID into the HBase table (this is simplistic; I am omitting
>> details that make this more resilient to failures). My concern is that
>> doing a read+write for every event in MR would be slow (we expect around
>> 1 billion events every hour). Does anyone use HBase for a similar use
>> case, or is there a different approach to achieving the same end result?
>> Any information or comments would be great.
>> 
>> Thanks,
>> ~Rahul.


RE: Using HBase for Deduping

Posted by Anoop Sam John <an...@huawei.com>.
Or maybe go with a large value for max versions and put the duplicate entries. Then, in the compaction, you need a wrapper around InternalScanner whose next() method returns only the 1st KV, removing the others. The same kind of logic will also be needed while scanning. This will be good enough IMO, especially when there won't be many duplicate events for the same rowkey. That is why I asked some questions earlier.

I think this solution is worth checking.

-Anoop-
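
A rough sketch of that wrapper, assuming the 0.94-era coprocessor API with KeyValue-based scanners. The InternalScanner interface is abbreviated to its core methods here, and a production version would also need to handle a row that spans multiple next() batches:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.regionserver.Store;

public class KeepFirstKvObserver extends BaseRegionObserver {

    @Override
    public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> e,
            Store store, final InternalScanner scanner) {
        // Wrap the compaction scanner so only the first KV per batch survives.
        return new InternalScanner() {
            public boolean next(List<KeyValue> results) throws IOException {
                boolean more = scanner.next(results);
                keepFirst(results);
                return more;
            }

            public boolean next(List<KeyValue> results, int limit) throws IOException {
                boolean more = scanner.next(results, limit);
                keepFirst(results);
                return more;
            }

            // Retain only the first KeyValue handed back, dropping the
            // duplicates behind it.
            private void keepFirst(List<KeyValue> results) {
                if (results.size() > 1) {
                    KeyValue first = results.get(0);
                    results.clear();
                    results.add(first);
                }
            }

            public void close() throws IOException {
                scanner.close();
            }
        };
    }
}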
________________________________________
From: Asaf Mesika [asaf.mesika@gmail.com]
Sent: Friday, February 15, 2013 3:06 PM
To: user@hbase.apache.org
Cc: Rahul Ravindran
Subject: Re: Using HBase for Deduping

Then maybe he can place each event in the same rowkey but with a column
qualifier holding the timestamp of the event saved as a long. Upon preCompact
in a region observer, he can then filter out, for any row, all columns but
the first?

On Friday, February 15, 2013, Anoop Sam John wrote:

> When max versions is set to 1 and a duplicate key is added, the last one
> added will win, removing the old. Is this what you want, Rahul? I think
> from his explanation he needs the reverse.
>
> -Anoop-
> ________________________________________
> From: Asaf Mesika [asaf.mesika@gmail.com]
> Sent: Friday, February 15, 2013 3:56 AM
> To: user@hbase.apache.org; Rahul Ravindran
> Subject: Re: Using HBase for Deduping
>
> You can load the events into an HBase table which has the event id as the
> unique row key. You can set max versions to 1 on the column family, thus
> letting HBase get rid of the duplicates for you during major compaction.
>
>
>
> On Thursday, February 14, 2013, Rahul Ravindran wrote:
>
> > Hi,
> >    We have events which are delivered into our HDFS cluster and which
> > may be duplicated. Each event has a UUID, and we were hoping to leverage
> > HBase to dedupe them. We run a MapReduce job which performs a lookup for
> > each UUID in HBase, emits the event only if the UUID was absent, and
> > also inserts the UUID into the HBase table (this is simplistic; I am
> > omitting details that make this more resilient to failures). My concern
> > is that doing a read+write for every event in MR would be slow (we
> > expect around 1 billion events every hour). Does anyone use HBase for a
> > similar use case, or is there a different approach to achieving the same
> > end result? Any information or comments would be great.
> >
> > Thanks,
> > ~Rahul.

Re: Using HBase for Deduping

Posted by Asaf Mesika <as...@gmail.com>.
Then maybe he can place each event in the same rowkey but with a column
qualifier holding the timestamp of the event saved as a long. Upon preCompact
in a region observer, he can then filter out, for any row, all columns but
the first?
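
The write path for that layout might look like the sketch below; the class and family name are hypothetical. Because Bytes.toBytes(long) encodes big-endian, qualifiers within a row sort by timestamp, so the earliest sighting is the row's first column, which is the one the preCompact hook would keep:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class EventWriter {
    private static final byte[] CF = Bytes.toBytes("d"); // hypothetical family

    /** One row per event UUID; one column per sighting, qualifier = timestamp. */
    public static void recordSighting(HTable table, byte[] uuid, long tsMillis)
            throws IOException {
        Put put = new Put(uuid);
        // The qualifier is the event timestamp as an 8-byte big-endian long,
        // so columns within the row sort oldest-first.
        put.add(CF, Bytes.toBytes(tsMillis), Bytes.toBytes(true));
        table.put(put);
    }
}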

On Friday, February 15, 2013, Anoop Sam John wrote:

> When max versions is set to 1 and a duplicate key is added, the last one
> added will win, removing the old. Is this what you want, Rahul? I think
> from his explanation he needs the reverse.
>
> -Anoop-
> ________________________________________
> From: Asaf Mesika [asaf.mesika@gmail.com]
> Sent: Friday, February 15, 2013 3:56 AM
> To: user@hbase.apache.org; Rahul Ravindran
> Subject: Re: Using HBase for Deduping
>
> You can load the events into an HBase table which has the event id as the
> unique row key. You can set max versions to 1 on the column family, thus
> letting HBase get rid of the duplicates for you during major compaction.
>
>
>
> On Thursday, February 14, 2013, Rahul Ravindran wrote:
>
> > Hi,
> >    We have events which are delivered into our HDFS cluster and which
> > may be duplicated. Each event has a UUID, and we were hoping to leverage
> > HBase to dedupe them. We run a MapReduce job which performs a lookup for
> > each UUID in HBase, emits the event only if the UUID was absent, and
> > also inserts the UUID into the HBase table (this is simplistic; I am
> > omitting details that make this more resilient to failures). My concern
> > is that doing a read+write for every event in MR would be slow (we
> > expect around 1 billion events every hour). Does anyone use HBase for a
> > similar use case, or is there a different approach to achieving the same
> > end result? Any information or comments would be great.
> >
> > Thanks,
> > ~Rahul.

RE: Using HBase for Deduping

Posted by Anoop Sam John <an...@huawei.com>.
When max versions is set to 1 and a duplicate key is added, the last one added will win, removing the old. Is this what you want, Rahul? I think from his explanation he needs the reverse.

-Anoop-
________________________________________
From: Asaf Mesika [asaf.mesika@gmail.com]
Sent: Friday, February 15, 2013 3:56 AM
To: user@hbase.apache.org; Rahul Ravindran
Subject: Re: Using HBase for Deduping

You can load the events into an HBase table which has the event id as the
unique row key. You can set max versions to 1 on the column family, thus
letting HBase get rid of the duplicates for you during major compaction.



On Thursday, February 14, 2013, Rahul Ravindran wrote:

> Hi,
>    We have events which are delivered into our HDFS cluster and which may
> be duplicated. Each event has a UUID, and we were hoping to leverage HBase
> to dedupe them. We run a MapReduce job which performs a lookup for each
> UUID in HBase, emits the event only if the UUID was absent, and also
> inserts the UUID into the HBase table (this is simplistic; I am omitting
> details that make this more resilient to failures). My concern is that
> doing a read+write for every event in MR would be slow (we expect around
> 1 billion events every hour). Does anyone use HBase for a similar use
> case, or is there a different approach to achieving the same end result?
> Any information or comments would be great.
>
> Thanks,
> ~Rahul.

Re: Using HBase for Deduping

Posted by Asaf Mesika <as...@gmail.com>.
You can load the events into an HBase table which has the event id as the
unique row key. You can set max versions to 1 on the column family, thus
letting HBase get rid of the duplicates for you during major compaction.
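
As a concrete sketch of that setup (table and family names are illustrative), the max-versions setting is declared on the column family at creation time via the 0.94-era admin API:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateEventsTable {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("events");
        HColumnDescriptor cf = new HColumnDescriptor("d");
        cf.setMaxVersions(1); // duplicate puts to a cell collapse to the newest
        desc.addFamily(cf);
        admin.createTable(desc);
        admin.close();
    }
}

One caveat, noted elsewhere in the thread: with max versions = 1 the newest duplicate wins, older versions are only physically dropped at compaction, and the table cannot tell you whether a given row was net new.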



On Thursday, February 14, 2013, Rahul Ravindran wrote:

> Hi,
>    We have events which are delivered into our HDFS cluster and which may
> be duplicated. Each event has a UUID, and we were hoping to leverage HBase
> to dedupe them. We run a MapReduce job which performs a lookup for each
> UUID in HBase, emits the event only if the UUID was absent, and also
> inserts the UUID into the HBase table (this is simplistic; I am omitting
> details that make this more resilient to failures). My concern is that
> doing a read+write for every event in MR would be slow (we expect around
> 1 billion events every hour). Does anyone use HBase for a similar use
> case, or is there a different approach to achieving the same end result?
> Any information or comments would be great.
>
> Thanks,
> ~Rahul.