Posted to user@hbase.apache.org by Paul Mackles <pm...@adobe.com> on 2012/10/05 20:17:36 UTC

bulk deletes

We need to do deletes pretty regularly, and sometimes we have hundreds of millions of cells to delete. TTLs won't work for us because we have a fair amount of business logic around the deletes.

Given their current implementation (we are on 0.90.4), this delete process can take a really long time (half a day or more with 100 or so concurrent threads). From everything I can tell, the performance issue comes down to each delete being an individual RPC call (even when using the batch API). In other words, I don't see any thrashing on HBase while this process is running – just lots of waiting for the RPC calls to return.

The alternative we came up with is to use the standard bulk load facilities to handle the deletes. The code turned out to be surprisingly simple and appears to work in the small-scale tests we have tried so far. Is anyone else doing deletes in this fashion? Are there drawbacks that I might be missing? Here is a link to the code:

https://gist.github.com/3841437

Pretty simple, eh? I haven't seen much mention of this technique, which is why I am a tad paranoid about it.
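
In outline, the approach looks like the sketch below (an illustration of
the idea, not the actual gist code; DeleteMarkerMapper and the "cf" family
are made-up names). A mapreduce job emits delete-type KeyValues through
HFileOutputFormat, and the resulting HFiles are then loaded with the
completebulkload tool (LoadIncrementalHFiles):

import java.io.IOException;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one family-level delete marker per input row key. The markers land
// in HFiles that completebulkload then hands to the region servers.
public class DeleteMarkerMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

  private static final byte[] CF = Bytes.toBytes("cf");

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    // One row key per input line; the real job applies the business rules.
    byte[] row = Bytes.toBytes(line.toString().trim());
    // DeleteFamily masks every cell in the family at or before this
    // timestamp; DeleteColumn/Delete markers can be emitted the same way.
    KeyValue kv = new KeyValue(row, CF, null,
        System.currentTimeMillis(), KeyValue.Type.DeleteFamily);
    ctx.write(new ImmutableBytesWritable(row), kv);
  }
}

The job is wired up with HFileOutputFormat.configureIncrementalLoad(job,
table), which sets up the total-order partitioning that HFileOutputFormat
requires.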

Thanks,
Paul


Re: bulk deletes

Posted by Jacques <wh...@gmail.com>.
While I didn't spend a lot of time with your code, I believe your approach
is sound.

Depending on your consistency requirements, I would suggest you consider
utilizing a coprocessor to handle the deletes.  Coprocessors can intercept
compaction scans, so you can shift your delete logic into an additional
filter applied at compaction time.  This should be less load and less
complexity than the bulk load.  Depending on the complexity and frequency
of the criteria, you could potentially add an endpoint to configure these
batch deletes.
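
Roughly, the hook could look like the sketch below (written against the
0.92/0.94-era API, where the exact signatures vary a bit between versions;
shouldDelete() is just a placeholder for your business rules):

import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.regionserver.Store;

public class CompactionDeleteObserver extends BaseRegionObserver {

  @Override
  public InternalScanner preCompact(
      ObserverContext<RegionCoprocessorEnvironment> e,
      Store store, final InternalScanner scanner) {
    // Wrap the compaction scanner; any KV dropped here simply never makes
    // it into the compacted storefile.
    return new InternalScanner() {
      public boolean next(List<KeyValue> results) throws IOException {
        boolean more = scanner.next(results);
        filter(results);
        return more;
      }

      public boolean next(List<KeyValue> results, String metric)
          throws IOException {
        return next(results);
      }

      public boolean next(List<KeyValue> results, int limit)
          throws IOException {
        boolean more = scanner.next(results, limit);
        filter(results);
        return more;
      }

      public boolean next(List<KeyValue> results, int limit, String metric)
          throws IOException {
        return next(results, limit);
      }

      public void close() throws IOException {
        scanner.close();
      }
    };
  }

  private void filter(List<KeyValue> results) {
    Iterator<KeyValue> it = results.iterator();
    while (it.hasNext()) {
      if (shouldDelete(it.next())) {
        it.remove();
      }
    }
  }

  // Placeholder: the delete criteria go here.
  private boolean shouldDelete(KeyValue kv) {
    return false;
  }
}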

I was considering a generic version of this but haven't spent much time on
it...

Jacques


Re: bulk deletes

Posted by Jerry Lam <ch...@gmail.com>.
Hi Anoop:

In my use case, I make extensive use of the version delete marker because
I need to delete a specific version of a cell (row key, CF, qualifier,
timestamp). I have a mapreduce job that runs across some regions and,
based on some business rules, deletes some of the cells in the table using
the version delete marker. The business rules for deletion are scoped to
one column family at a time, so there is no logical dependency between
deletions across column families.
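
For concreteness, a version delete marker is what
Delete.deleteColumn(family, qualifier, ts) places: it masks exactly one
cell version and leaves the other versions intact. A minimal illustration
(the table, family and coordinates below are made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionDeleteExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");
    Delete d = new Delete(Bytes.toBytes("row-1"));
    // Version delete marker: only the cell version at exactly this
    // (row, CF, qualifier, timestamp) coordinate is deleted.
    d.deleteColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), 1349900000000L);
    table.delete(d);
    table.close();
  }
}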

I also posted the above use case in HBASE-6942.

Best Regards,

Jerry


RE: bulk deletes

Posted by Anoop Sam John <an...@huawei.com>.
You are right, Jerry.
In your use case, do you want to delete full rows or only some CFs/columns?  Please feel free to look at HBASE-6942 and give your valuable comments.
There I am trying to delete full rows [that is our use case].

-Anoop-

Re: bulk deletes

Posted by Jerry Lam <ch...@gmail.com>.
Hi guys:

The bulk delete approaches described in this thread are helpful in my case
as well. If I understand correctly, Paul's approach is useful for offline
bulk deletes (i.e., mapreduce) whereas Anoop's approach is useful for
online/real-time bulk deletes (i.e., coprocessor)?

Best Regards,

Jerry


Re: bulk deletes

Posted by Paul Mackles <pm...@adobe.com>.
Very cool Anoop. I can definitely see how that would be useful.

Lars - the bulk deletes do appear to work. I just wasn't sure if there was
something I might be missing since I haven't seen this documented
elsewhere.

Coprocessors do seem a better fit for this in the long term.

Thanks everyone.



RE: bulk deletes

Posted by Anoop Sam John <an...@huawei.com>.
We have also done an implementation using compaction-time deletes (skipping the KVs at compaction). This works very well for us....
As that approach delays the deletes until the next major compaction, we also have an implementation that does real-time bulk deletes. [We have such a use case]
Here I am using an endpoint implementation to do the scan and delete on the server side only. I just raised an issue for this [HBASE-6942] and will post a patch there based on the 0.94 model... Pls have a look....  I have noticed a big performance improvement over the normal scan() + delete(List<Delete>) approach, as it avoids several network calls and a lot of traffic...
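
To give a rough idea of the shape of it (this is a sketch, not the actual
HBASE-6942 patch; BulkDeleteProtocol and deleteMatchingRows are made-up
names), a 0.94-style endpoint can run the scan and the deletes entirely
inside each region:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.coprocessor.BaseEndpointCoprocessor;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.ipc.CoprocessorProtocol;
import org.apache.hadoop.hbase.regionserver.HRegion;
import org.apache.hadoop.hbase.regionserver.InternalScanner;

// The client-visible protocol (made-up name).
interface BulkDeleteProtocol extends CoprocessorProtocol {
  long deleteMatchingRows(Scan scan) throws IOException;
}

public class BulkDeleteEndpoint extends BaseEndpointCoprocessor
    implements BulkDeleteProtocol {

  public long deleteMatchingRows(Scan scan) throws IOException {
    HRegion region =
        ((RegionCoprocessorEnvironment) getEnvironment()).getRegion();
    InternalScanner scanner = region.getScanner(scan);
    long deleted = 0;
    try {
      List<KeyValue> kvs = new ArrayList<KeyValue>();
      boolean more;
      do {
        kvs.clear();
        more = scanner.next(kvs);
        if (!kvs.isEmpty()) {
          // The scan and the delete both stay inside the region server;
          // only this call and the row count cross the network.
          region.delete(new Delete(kvs.get(0).getRow()), null, true);
          deleted++;
        }
      } while (more);
    } finally {
      scanner.close();
    }
    return deleted;
  }
}

On the client side you would fan this out with
HTable.coprocessorExec(BulkDeleteProtocol.class, startRow, stopRow,
callable), so every region in the range runs its own scan + delete.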

-Anoop-

Re: bulk deletes

Posted by lars hofhansl <lh...@yahoo.com>.
Does it work? :)

How did you do the deletes before? I assume you used the HTable.delete(List<Delete>) API?
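
(For reference, that batch API looks roughly like this; the table and row
keys below are made up:)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchDeleteExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");
    List<Delete> batch = new ArrayList<Delete>();
    for (int i = 0; i < 1000; i++) {
      batch.add(new Delete(Bytes.toBytes("row-" + i)));
    }
    // One client call; the client groups the deletes per region server,
    // but every Delete still has to travel over the wire.
    table.delete(batch);
    table.close();
  }
}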

(Doesn't really help you, but) In 0.92+ you could hook up a coprocessor into the compactions and simply filter out any KVs you want to have removed.


-- Lars


