Posted to user@hbase.apache.org by Rahul Ravindran <ra...@yahoo.com> on 2013/02/14 20:23:41 UTC

Using HBase for Deduping

Hi,
   We have events delivered into our HDFS cluster that may be duplicated. Each event has a UUID, and we were hoping to leverage HBase to dedupe them. We would run a MapReduce job that performs a lookup for each UUID in HBase, emits the event only if the UUID is absent, and then inserts the UUID into the HBase table. (This is simplistic; I am leaving out the details that make it more resilient to failures.) My concern is that doing a read plus a write for every event in MR would be slow (we expect around 1 billion events every hour). Does anyone use HBase for a similar use case, or is there a different approach to achieving the same end result? Any information or comments would be great.

Thanks,
~Rahul.
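
For readers following the thread, the lookup-then-insert flow described in the message above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual job and not the HBase client API: `SeenStore`, `put_if_absent`, `dedupe`, and `required_ops_per_second` are hypothetical names, the events are assumed to be (uuid, payload) pairs, and an in-memory dictionary stands in for the HBase table. The failure handling that the message says it leaves out is also left out here.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, str]  # (uuid, payload); assumed event shape for illustration


class SeenStore:
    """In-memory stand-in for the table of seen UUIDs (not the HBase API)."""

    def __init__(self) -> None:
        self._rows: Dict[str, bool] = {}

    def put_if_absent(self, uuid: str) -> bool:
        """Look up uuid and record it; return True only if it was absent."""
        if uuid in self._rows:
            return False
        self._rows[uuid] = True
        return True


def dedupe(events: Iterable[Event], store: SeenStore) -> Iterator[Event]:
    """Emit an event only when its UUID has not been seen before."""
    for uuid, payload in events:
        if store.put_if_absent(uuid):
            yield (uuid, payload)


def required_ops_per_second(events_per_hour: int) -> float:
    """Store operations per second if every event costs one read and one write."""
    return 2 * events_per_hour / 3600


if __name__ == "__main__":
    store = SeenStore()
    batch = [("u1", "first"), ("u2", "second"), ("u1", "duplicate")]
    for event in dedupe(batch, store):
        print(event)
    print(round(required_ops_per_second(10**9)))
```

At the stated rate of 1 billion events per hour, this amounts to about 278,000 events per second, or roughly 556,000 store operations per second if every event needs both a read and a write, which is the cost the message is concerned about.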

RE: Using HBase for Deduping

Posted by Anoop Sam John <an...@huawei.com>.
Hi Rahul
   When you say that some events can arrive with a duplicate UUID, what is the probability of such duplicates? Are most events unique, with only a few duplicates? Also, do the same duplicated events arrive again and again (that is, the same UUID many times)?

-Anoop-