Posted to user@hbase.apache.org by Haijia Zhou <le...@gmail.com> on 2012/02/21 21:52:48 UTC

hbase delete operation is very slow

Hi, All
I'm new to this email list and hope I can get help here.
My task is to come up with an M/R job in HBase to scan the whole table, find
some data and delete it (delete the whole row); this job will be
executed on a daily basis.
Basically I have a mapper class whose map() looks as follows:
public void map(ImmutableBytesWritable key, Result columns,
                Context context)
{
  ... do some check
  byte[] row = ...
  if (needs to delete user) {
       Delete delete = new Delete(row);
       table.delete(delete);
  }
}

There's no reducer needed for this task.

Now, we are observing that this job takes a long time to finish (around 3-4
hours) for 49,565,000 delete operations out of 191,838,114 total records
across 7 region servers.
We know that a full table scan on the corresponding column/column family
takes around 40 minutes, so the rest of the time was spent on the delete operations.

I wonder if there's any way or tool to profile the Hadoop M/R job?

Thanks

Haijia

Re: hbase delete operation is very slow

Posted by Daniel Iancu <da...@1and1.ro>.
Hi
The deletes hit back the region you scan from, so I wonder whether this can
create hotspots if many rows need to be deleted from a single region.
Can you check that?
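One way to check that, as a sketch: tally deletes per region with Hadoop job counters and read the counts off the job's counter page afterwards. The class name and the counter group name below are made up for illustration; the Delete itself is issued as in the original mapper.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

// Sketch: count deletes per region via job counters. After the job
// finishes, the counter page shows whether the deletes concentrate
// on a few regions (i.e. a hotspot).
public class DeleteHotspotMapper extends TableMapper<NullWritable, NullWritable> {

  @Override
  public void map(ImmutableBytesWritable row, Result columns, Context context)
      throws IOException {
    if (shouldDelete(columns)) {
      // Each mapper reads one region's split, so the split's start key
      // identifies the region the deletes will hit.
      TableSplit split = (TableSplit) context.getInputSplit();
      String region = Bytes.toStringBinary(split.getStartRow());
      context.getCounter("DeletesPerRegion", region).increment(1);
      // ... issue the Delete as before ...
    }
  }

  private boolean shouldDelete(Result columns) {
    return false; // placeholder for the job's actual check
  }
}
```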
Daniel

On 02/21/2012 10:52 PM, Haijia Zhou wrote:

-- 
Daniel Iancu
Java Developer,Big Data Solutions Romania
1&1 Internet Development srl.
18 Mircea Eliade St
Sect 1, Bucharest
RO Bucharest, 012015
www.1and1.ro
Phone:+40-031-223-9081




Re: hbase delete operation is very slow

Posted by Stack <st...@duboce.net>.
On Tue, Feb 21, 2012 at 5:54 PM, Doug Meil
<do...@explorysmedical.com> wrote:

Or use a Mapper that keeps an HTable around across map invocations, and
set up a write buffer on it?
St.Ack
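What Stack describes might look like the sketch below: one HTable per mapper, created in setup() and reused across all map() invocations rather than per row. The class and table names are illustrative, and note Doug's caveat that the client-side write buffer may only apply to Puts in this version.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;

// Sketch: reuse a single HTable across map() calls instead of creating
// one (or issuing a standalone RPC) per row.
public class ReusedTableMapper extends TableMapper<NullWritable, NullWritable> {

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = HBaseConfiguration.create(context.getConfiguration());
    table = new HTable(conf, "the_table"); // illustrative table name
    // Enable the client-side write buffer; per Doug's comment it may
    // only buffer Puts, not Deletes, in this HBase version.
    table.setAutoFlush(false);
  }

  @Override
  public void map(ImmutableBytesWritable key, Result columns, Context context)
      throws IOException {
    if (shouldDelete(columns)) {
      table.delete(new Delete(key.get()));
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.flushCommits(); // push anything still buffered
    table.close();
  }

  private boolean shouldDelete(Result columns) {
    return false; // placeholder for the job's actual check
  }
}
```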

Re: hbase delete operation is very slow

Posted by Ioan Eugen Stan <st...@gmail.com>.
On 22.02.2012 17:02, Haijia Zhou wrote:


Hello Haijia,

Try JETM (http://jetm.void.fm/) for that kind of work. If you configure it
using Spring proxy AOP, you can enable/disable performance monitoring
from a config file.

Cheers,

-- 
Ioan Eugen Stan
http://ieugen.blogspot.com
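For a sense of what the non-Spring setup looks like, here is a sketch of manual JETM instrumentation wrapped around the delete path, following JETM's basic configurator. The class name and the measurement-point name are made up; check the JETM docs for the exact API in your version.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;

import etm.core.configuration.BasicEtmConfigurator;
import etm.core.configuration.EtmManager;
import etm.core.monitor.EtmMonitor;
import etm.core.monitor.EtmPoint;
import etm.core.renderer.SimpleTextRenderer;

// Sketch: time each HTable.delete() call with a JETM measurement point
// and dump the aggregated statistics at the end.
public class TimedDeletes {

  private final EtmMonitor monitor;

  public TimedDeletes() {
    BasicEtmConfigurator.configure(true); // true = collect nested points
    monitor = EtmManager.getEtmMonitor();
    monitor.start();
  }

  public void deleteRow(HTable table, byte[] row) throws IOException {
    EtmPoint point = monitor.createPoint("HTable:delete");
    try {
      table.delete(new Delete(row));
    } finally {
      point.collect(); // records elapsed time for this point
    }
  }

  public void report() {
    monitor.render(new SimpleTextRenderer()); // aggregated timings to stdout
    monitor.stop();
  }
}
```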

RE: hbase delete operation is very slow

Posted by Haijia Zhou <ha...@adobe.com>.
Thanks for the suggestion. I did use a List<Delete> with size 1000; actually the performance was not much different from deleting one row at a time.
I investigated the HRegion.delete() method; my understanding is that when you call delete() to delete a row, it actually deletes all the column families for that row, meaning it puts a tombstone in each column family.
In my case each row has 5 column families, so each delete results in putting 5 tombstones for the row; I'm thinking that could be the reason why delete is so slow.

I am just wondering if there's any way or tool to profile an HBase application and measure the time taken in each individual method.

Haijia

-----Original Message-----
From: Doug Meil [mailto:doug.meil@explorysmedical.com] 
Sent: Tuesday, February 21, 2012 8:54 PM
To: user@hbase.apache.org
Subject: Re: hbase delete operation is very slow





Re: hbase delete operation is very slow

Posted by Doug Meil <do...@explorysmedical.com>.
I don't think write-buffering is an option, because that was Put-only the
last time I looked, but the advice I put in the book is to use
delete(List<Delete>).  He'll have to keep track of the List<Delete>
himself and determine when the batch should be sent, but it's a lot better
than one delete at a time.
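A sketch of that batching approach, accumulating Deletes and shipping them in one delete(List<Delete>) call. The class name and the threshold of 1000 are arbitrary choices for illustration:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;

// Sketch: buffer Deletes client-side and send them in batches rather
// than one RPC per row.
public class DeleteBatcher {

  private static final int BATCH_SIZE = 1000; // arbitrary threshold

  private final HTable table;
  private final List<Delete> pending = new ArrayList<Delete>();

  public DeleteBatcher(HTable table) {
    this.table = table;
  }

  public void add(byte[] row) throws IOException {
    pending.add(new Delete(row));
    if (pending.size() >= BATCH_SIZE) {
      flush();
    }
  }

  public void flush() throws IOException {
    if (!pending.isEmpty()) {
      table.delete(pending); // one batched call to the region servers
      pending.clear();
    }
  }
}
```

In a mapper you would call add() from map() and call flush() once more from cleanup(), so the final partial batch isn't lost when the task ends.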




On 2/21/12 7:39 PM, "Stack" <st...@duboce.net> wrote:




Re: hbase delete operation is very slow

Posted by Stack <st...@duboce.net>.
On Tue, Feb 21, 2012 at 2:45 PM, Doug Meil
<do...@explorysmedical.com> wrote:

Do what Doug suggests.  Sounds like you are setting up a Map per row
and then, per row, figuring whether to Delete.  If a Delete, you do an
invocation per row.  Where are you getting your table instance from?  Is
it created each time?  And, as per Doug, are you write-buffering your
deletes?

St.Ack

Re: hbase delete operation is very slow

Posted by Doug Meil <do...@explorysmedical.com>.
Hi there-

You probably want to see this...

http://hbase.apache.org/book.html#perf.deleting

... that particular method doesn't use the write buffer and submits the
deletes one-by-one to the region servers.




On 2/21/12 3:52 PM, "Haijia Zhou" <le...@gmail.com> wrote:

>Hi, All
>I'm new to this email list and hope I can get help from here.
>My task is to come up with a M/R job in hbase to scan the whole table,
>find
>out some data and delete them (delete the whole row), this job will be
>executed on a daily basis.
>Basically I have mapper class whose map() looks like follows:
>public void map(ImmutableBytesWritable row, Result columns,
>                Context context)
>{
>  ... do some check
>  byte[] row = ...
>  if(needs to delete user){
>       Delete delete = new Delete(row);
>       table.delete(delete)
>   }
>
>There's no reducer needed for this task.
>
>Now, we are observing that this job takes a long time to finish (around
>3-4
>hours) for 49,565,000 delete operations and 191,838,114 total records
>across 7 region servers
>We know that a full table scan on the corresponding column/column family
>takes around 40 minutes, so all the rest time were for the delete
>operation.
>
>I wonder if there's anyway or tool to profile the hadoop M/R job ?
>
>Thanks
>
>Haijia