Posted to dev@accumulo.apache.org by z11373 <z1...@outlook.com> on 2015/11/16 16:35:11 UTC

delete rows test result

Last week, on a separate thread, it was suggested that I use
tableOperations.deleteRows to delete rows that match specific
ranges. I was curious to see whether it's better than my current
implementation, which iterates over all rows and calls putDelete for each.
While researching, I also found that Accumulo already provides BatchDeleter,
which does the same thing.
I tried all three, and below are my test results against three different
tables (numbers are in milliseconds):

Test 1 (using an iterator and calling putDelete for each entry):
Table 1: 5,702
Table 2: 6,912
Table 3: 4,694

Test 2 (using BatchDeleter class):
Table 1: 8,089
Table 2: 10,405
Table 3: 7,818

Test 3 (using tableOperations.deleteRows; note that I first iterate over all
rows just to get the last row id, which is then passed as an argument to
the function):
Table 1: 196,597
Table 2: 226,496
Table 3: 8,442


I ran the tests a few times and got pretty much the same results each time.
I didn't look at the code to see what deleteRows is really doing, but looking
at my test results, I can say it sucks!
Note that for that test I did scan and iterate just to get the last row id,
but even if I subtract the time for doing that, it's still way too slow.
Therefore, I'd recommend that anyone avoid using deleteRows for this scenario.
YMMV, but I'd stick with my original approach, which is the same as
Test 1 above.


Thanks,
Z




--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/delete-rows-test-result-tp15569.html
Sent from the Developers mailing list archive at Nabble.com.

Re: delete rows test result

Posted by William Slacum <ws...@gmail.com>.
"Reading" all of the rows first implies you're bringing back the entire
result to a client, which provides you serial access to the data.

I think you should re-run test #3 so that it measures only the time it takes
to call deleteRows. I'm emphasizing this because I've worked on projects that
could quickly define a range to be deleted without reading any data, and
using deleteRows decreased our latency significantly.
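
A minimal sketch of timing only the operation under test, as suggested above. The class and method names (OperationTimer, timeMillis) are illustrative, and the sleeping workload is a placeholder: in a real test the Runnable would wrap the single deleteRows call, with the start and end rows computed before the timer starts.

```java
import java.util.concurrent.TimeUnit;

// Minimal timing helper: measures one operation and nothing else.
class OperationTimer {

    // Runs the operation once and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable operation) {
        long start = System.nanoTime();
        operation.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Placeholder workload (an assumption for illustration); the real test
        // would call deleteRows here and do no scanning inside the timed block.
        long elapsed = timeMillis(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("elapsed ms: " + elapsed);
    }
}
```

Keeping the scan outside the timed block is the point: any loop that reads rows to find the range would otherwise be charged to deleteRows.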

On Mon, Nov 16, 2015 at 11:19 AM, z11373 <z1...@outlook.com> wrote:

> I didn't do that, but I'm sure we can extrapolate it from Test 1.
>
> Test 1 is doing:
> foreach k/v in scanner's iterator
>     create a new mutation with that row
>     call putDelete
>
> Test 3 is doing:
> foreach k/v in scanner's iterator
>     if this is the first entry, assign its row to the 'first' var
>     assign the row to the 'last' var
> After the loop is done, pass the 'first' and 'last' vars to deleteRows.
>
> So, to estimate the deleteRows time without reading all rows, we can
> subtract the Test 1 result from the Test 3 result; e.g. for Table 1 that is
> 196,597 - 5,702 = 190,895 ms (this is still way too long).

Re: delete rows test result

Posted by z11373 <z1...@outlook.com>.
I didn't do that, but I'm sure we can extrapolate it from Test 1.

Test 1 is doing:
foreach k/v in scanner's iterator
    create a new mutation with that row
    call putDelete

Test 3 is doing:
foreach k/v in scanner's iterator
    if this is the first entry, assign its row to the 'first' var
    assign the row to the 'last' var
After the loop is done, pass the 'first' and 'last' vars to deleteRows.

So, to estimate the deleteRows time without reading all rows, we can
subtract the Test 1 result from the Test 3 result; e.g. for Table 1 that is
196,597 - 5,702 = 190,895 ms (this is still way too long).
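
The subtraction above can be written out for all three tables. This is only a rough sketch: it assumes the scan costs the same in both tests, and Test 1 also includes the time spent writing the deletes, so the difference is not a direct measurement of deleteRows. The class name and array names are illustrative; the numbers are the ones reported in this thread.

```java
// Estimates the deleteRows-only time by subtracting Test 1 from Test 3.
class DeleteRowsEstimate {

    // Timings in milliseconds for Tables 1-3, copied from the reported results.
    static final long[] TEST1_SCAN_PUT_DELETE = {5_702, 6_912, 4_694};
    static final long[] TEST3_SCAN_DELETE_ROWS = {196_597, 226_496, 8_442};

    // Test 3 minus Test 1 for one table (0-based index).
    static long estimateMillis(int table) {
        return TEST3_SCAN_DELETE_ROWS[table] - TEST1_SCAN_PUT_DELETE[table];
    }

    public static void main(String[] args) {
        for (int i = 0; i < TEST1_SCAN_PUT_DELETE.length; i++) {
            System.out.println("Table " + (i + 1) + ": " + estimateMillis(i) + " ms");
        }
    }
}
```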




--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/delete-rows-test-result-tp15569p15571.html
Sent from the Developers mailing list archive at Nabble.com.

Re: delete rows test result

Posted by William Slacum <ws...@gmail.com>.
What happens when you subtract the time to read all of your rows?
deleteRows is designed so you don't have to read any data; you can compute
a range to delete. For instance, in a time series table, it's trivial to give
a start and end date as your rows and call deleteRows.
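
A minimal sketch of computing such a range without reading any data. It assumes (this is not stated in the thread) that every row id is a yyyyMMdd date followed by at least one more character, for example "20151116_sensor42", and that deleteRows treats the start row as exclusive and the end row as inclusive, which is how I read the TableOperations documentation. The class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Computes deleteRows-style bounds for a time series table without a scan.
// Assumption: each row id is a yyyyMMdd date plus at least one more character.
class DateRowRange {

    // Start bound, treated as exclusive: the bare date prefix of the first day
    // sorts before every row of that day, so those rows fall inside the range.
    static String startRow(LocalDate firstDay) {
        return firstDay.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // End bound, treated as inclusive: the bare date prefix of the day after
    // the last day sorts after every row of the last day.
    static String endRow(LocalDate lastDay) {
        return lastDay.plusDays(1).format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        LocalDate first = LocalDate.of(2015, 11, 1);
        LocalDate last = LocalDate.of(2015, 11, 7);
        // These strings would be wrapped in Text and passed as the start and
        // end arguments of deleteRows.
        System.out.println("start (exclusive): " + startRow(first));
        System.out.println("end (inclusive):   " + endRow(last));
    }
}
```

If a row could consist of the bare date alone, the bounds would need adjusting: such a row on the first day would be skipped, and one on the day after the last day would be deleted.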

On Mon, Nov 16, 2015 at 10:35 AM, z11373 <z1...@outlook.com> wrote:

> Last week, on a separate thread, it was suggested that I use
> tableOperations.deleteRows to delete rows that match specific
> ranges. I was curious to see whether it's better than my current
> implementation, which iterates over all rows and calls putDelete for each.
> While researching, I also found that Accumulo already provides BatchDeleter,
> which does the same thing.
> I tried all three, and below are my test results against three different
> tables (numbers are in milliseconds):
>
> Test 1 (using an iterator and calling putDelete for each entry):
> Table 1: 5,702
> Table 2: 6,912
> Table 3: 4,694
>
> Test 2 (using BatchDeleter class):
> Table 1: 8,089
> Table 2: 10,405
> Table 3: 7,818
>
> Test 3 (using tableOperations.deleteRows; note that I first iterate over all
> rows just to get the last row id, which is then passed as an argument to
> the function):
> Table 1: 196,597
> Table 2: 226,496
> Table 3: 8,442
>
>
> I ran the tests a few times and got pretty much the same results each
> time.
> I didn't look at the code to see what deleteRows is really doing, but
> looking at my test results, I can say it sucks!
> Note that for that test I did scan and iterate just to get the last row
> id, but even if I subtract the time for doing that, it's still way too slow.
> Therefore, I'd recommend that anyone avoid using deleteRows for this
> scenario.
> YMMV, but I'd stick with my original approach, which is the same as
> Test 1 above.
>
>
> Thanks,
> Z

Re: delete rows test result

Posted by Christopher <ct...@apache.org>.
Without ACCUMULO-3235, one way you can make deleteRows faster is to only
use it to delete rows on existing tablet boundaries. Even then, there may
be cases where it's going to do a chop compaction before it completes the
delete, and some tablets may be offline while it does this.

Aside from possibly only using existing tablet boundaries, I'm not sure
there is anything you can do which would be faster.
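
A minimal sketch of restricting a delete range to existing tablet boundaries. It assumes the split points are available as sorted strings (a real client could probably get them from tableOperations.listSplits), that each split point is the inclusive end row of a tablet, and that plain string ordering matches the row ordering, which holds for ASCII rows. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Narrows a desired delete range (start, end] to existing split points.
class TabletAlignedRange {

    // Returns {alignedStart, alignedEnd} such that (alignedStart, alignedEnd]
    // covers only whole tablets, or null if no whole tablet lies in (start, end].
    static String[] align(TreeSet<String> splits, String start, String end) {
        String alignedStart = splits.ceiling(start); // first split at or after start
        String alignedEnd = splits.floor(end);       // last split at or before end
        if (alignedStart == null || alignedEnd == null
                || alignedStart.compareTo(alignedEnd) >= 0) {
            return null;
        }
        return new String[] {alignedStart, alignedEnd};
    }

    public static void main(String[] args) {
        // Hypothetical split points, for illustration only.
        TreeSet<String> splits = new TreeSet<>(Arrays.asList("d", "h", "m", "r"));
        String[] aligned = align(splits, "e", "p");
        System.out.println(aligned == null
                ? "no whole tablet in range"
                : "(" + aligned[0] + ", " + aligned[1] + "]");
    }
}
```

The leftover pieces on either side of the aligned range would still have to be removed some other way, for example with the scan/putDelete approach.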

If the deleteMany (scan/putDelete) strategy is faster for you, and memory
is less important than speed, then stick with that. That's almost certainly
going to be better if the data you wish to delete is interspersed with data
you wish to keep.

deleteRows is going to work best in cases where you have large quantities
of sequential rows to delete, spanning more than one tablet. If your
application can tolerate it, you could wait for a significantly large run
before doing a delete. For instance, if you wish to age-off old data, and
your data is ordered by time, you could age off once a week instead of
daily, to allow the ranges of things to delete to build up.

On Mon, Nov 30, 2015 at 3:18 PM z11373 <z1...@outlook.com> wrote:

> Hi Christopher,
> Do you have any idea what I should do to improve the perf in my case, or
> should I wait for ACCUMULO-3235?
>
> If you look at my test results, calling deleteRows was >15x slower than
> calling putDelete for the same table and data. Is it because the actual
> number of rows (i.e. the ones being combined) is way bigger than the number
> of combined rows? I'd imagine that if deleteRows has to delete 100M rows,
> while putDelete may only need to deal with 3-4M rows (the combined
> results), that would explain why it takes so long.
>
>
> Thanks,
> Z

Re: delete rows test result

Posted by z11373 <z1...@outlook.com>.
Hi Christopher,
Do you have any idea what I should do to improve the perf in my case, or
should I wait for ACCUMULO-3235?

If you look at my test results, calling deleteRows was >15x slower than
calling putDelete for the same table and data. Is it because the actual
number of rows (i.e. the ones being combined) is way bigger than the number
of combined rows? I'd imagine that if deleteRows has to delete 100M rows,
while putDelete may only need to deal with 3-4M rows (the combined
results), that would explain why it takes so long.


Thanks,
Z




--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/delete-rows-test-result-tp15569p15637.html
Sent from the Developers mailing list archive at Nabble.com.

Re: delete rows test result

Posted by Christopher <ct...@apache.org>.
deleteRows should be fine with a combiner, but it's probably not going to
be efficient for small ranges. ACCUMULO-3235 should make it more efficient,
but it'll still probably add a split point and (very) briefly take tablets
offline.

On Mon, Nov 30, 2015 at 12:46 PM z11373 <z1...@outlook.com> wrote:

> Revisiting this thread...
> I just want to know whether deleteRows is inappropriate for a table with
> summing combiners.
> The problem with scanning and calling putDelete for each entry is that it
> consumes more memory, though in my test it is way faster than calling
> deleteRows for this particular case.
>
> Thanks,
> Z

Re: delete rows test result

Posted by z11373 <z1...@outlook.com>.
Revisiting this thread...
I just want to know whether deleteRows is inappropriate for a table with
summing combiners.
The problem with scanning and calling putDelete for each entry is that it
consumes more memory, though in my test it is way faster than calling
deleteRows for this particular case.

Thanks,
Z



--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/delete-rows-test-result-tp15569p15634.html
Sent from the Developers mailing list archive at Nabble.com.

Re: delete rows test result

Posted by z11373 <z1...@outlook.com>.
Hi William,
I re-ran the same test calling deleteRows without scanning the table (so
it's only timing the deleteRows operation here), and you're right, it's
faster, as shown in the results below.

Table 1: 3,301 
Table 2: 3,184 
Table 3: 2,635

It's definitely faster than the fastest result I got by scanning the table
and calling putDelete for each entry, shown below.

Table 1: 5,702 
Table 2: 6,912 
Table 3: 4,694

However, there is one case I didn't mention last time, in which the table has
a summing combiner installed. So even though it may appear to have 1M rows, it
can actually have 10M rows or more before combining, which may explain why
deleteRows can take longer. Still, looking at my test result, something seems
wrong.

Test 1 (using an iterator and calling putDelete for each entry):
Table 4 (with summing combiner): 11,081

Test 2 (calling deleteRows):
Table 4 (with summing combiner): 197,050

Last time someone mentioned compaction, so I was curious and ran the
following test, compacting first before calling deleteRows (to see if it'd
be faster). Here is the result:
Compact on Table 4 (with summing combiner): 376,619
Call deleteRows on Table 4 (with summing combiner): 188,862

So, given the result above, I'd say the table compaction doesn't help.
Perhaps I did something wrong here. It seems to me that for certain cases
(like this one), scanning the table and calling putDelete for each entry will
perform better than calling deleteRows. Does this make sense?
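
The timings in this message can be compared as ratios. A small sketch using only the numbers reported above; the class and method names are illustrative.

```java
// Compares the timings reported in this message (all values in milliseconds).
class DeleteTimingRatios {

    // How many times larger `slow` is than `fast`.
    static double ratio(long slow, long fast) {
        return (double) slow / fast;
    }

    public static void main(String[] args) {
        long[] scanPutDelete = {5_702, 6_912, 4_694};  // Tables 1-3, scan + putDelete
        long[] deleteRowsOnly = {3_301, 3_184, 2_635}; // Tables 1-3, deleteRows only
        for (int i = 0; i < scanPutDelete.length; i++) {
            System.out.printf("Table %d: deleteRows is %.1fx faster%n",
                    i + 1, ratio(scanPutDelete[i], deleteRowsOnly[i]));
        }
        // Table 4 (summing combiner): deleteRows compared with scan + putDelete.
        System.out.printf("Table 4: deleteRows is %.1fx slower%n",
                ratio(197_050, 11_081));
    }
}
```

By these numbers deleteRows is roughly 1.7x to 2.2x faster on Tables 1 to 3 and roughly 18x slower on Table 4.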


Thanks,
Z



--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/delete-rows-test-result-tp15569p15609.html
Sent from the Developers mailing list archive at Nabble.com.

Re: delete rows test result

Posted by z11373 <z1...@outlook.com>.
Thanks Chris!
I had no idea why it didn't work yesterday; that's why I thought it might be
looking for the authz.
I just tried running the command again from the shell, and this time it works
fine.
Yes, the authz string is actually already used as the prefix of the row in
our case, so it works nicely :-)

Thanks,
Z



--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/delete-rows-test-result-tp15569p15597.html
Sent from the Developers mailing list archive at Nabble.com.

Re: delete rows test result

Posted by Christopher <ct...@apache.org>.
Deletemany does a scan and selective putDeletes based on the matches to the
scan.

Deleterows doesn't use authorizations, because it just drops whole ranges
without discrimination of their contents. We don't partition ranges of keys
based on authorizations, so deleterows wouldn't be able to make use of this
parameter. Your application could do that, if efficient deletes for
particular authorizations were essential, by making an authorization
string a prefix of your row, or by partitioning your data into separate
tables based on authorizations. But this probably wouldn't be that useful
unless all your deletes of this nature were associated with a single
authorization which you knew in advance.
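
A minimal sketch of turning a row prefix (such as an authorization string) into a begin/end pair for a range delete. It assumes every row has at least one more byte after the prefix, and that the begin row is exclusive and the end row inclusive. The class name is hypothetical; Accumulo's Range class has, as far as I know, a similar followingPrefix helper, but this version is written from scratch for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Computes the bounds that cover every row starting with a given prefix.
class PrefixRowRange {

    // Smallest byte string that sorts after all strings with this prefix:
    // drop trailing 0xff bytes, then increment the last remaining byte.
    // Returns null when the prefix is empty or consists only of 0xff bytes.
    static byte[] followingPrefix(byte[] prefix) {
        int last = prefix.length - 1;
        while (last >= 0 && prefix[last] == (byte) 0xff) {
            last--;
        }
        if (last < 0) {
            return null;
        }
        byte[] result = Arrays.copyOf(prefix, last + 1);
        result[last]++;
        return result;
    }

    public static void main(String[] args) {
        // "9000" stands in for an authorization string used as a row prefix.
        byte[] begin = "9000".getBytes(StandardCharsets.UTF_8);
        byte[] end = followingPrefix(begin);
        System.out.println("begin (exclusive): " + new String(begin, StandardCharsets.UTF_8));
        System.out.println("end (inclusive):   " + new String(end, StandardCharsets.UTF_8));
    }
}
```

For the prefix 9000 this yields the pair 9000 and 9001, the same shape as the -b and -e arguments in the deleterows shell command elsewhere in this thread.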

On Wed, Nov 18, 2015 at 6:10 PM z11373 <z1...@outlook.com> wrote:

> Never mind, I'm no longer able to reproduce the issue.
>
> Why doesn't deleterows support the caller passing authz (unlike deletemany,
> which has -s)?
> Without being able to pass the authorization string, I pretty much cannot
> use this command :-(
>
> Thanks,
> Z

Re: delete rows test result

Posted by z11373 <z1...@outlook.com>.
Never mind, I'm no longer able to reproduce the issue.

Why doesn't deleterows support the caller passing authz (unlike deletemany,
which has -s)?
Without being able to pass the authorization string, I pretty much cannot
use this command :-(

Thanks,
Z



--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/delete-rows-test-result-tp15569p15593.html
Sent from the Developers mailing list archive at Nabble.com.

Re: delete rows test result

Posted by z11373 <z1...@outlook.com>.
Has anyone seen this exception before when calling deleterows from the shell?

user@dev> deleterows -t T1 -b 9000 -e 9001
Thread "shell" died no net in java.library.path
java.lang.UnsatisfiedLinkError: no net in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1865)
        at java.lang.Runtime.loadLibrary0(Runtime.java:870)
        at java.lang.System.loadLibrary(System.java:1122)
        at java.net.AbstractPlainSocketImpl$1.run(AbstractPlainSocketImpl.java:84)
        at java.net.AbstractPlainSocketImpl$1.run(AbstractPlainSocketImpl.java:82)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.AbstractPlainSocketImpl.<clinit>(AbstractPlainSocketImpl.java:81)
        at java.net.Socket.setImpl(Socket.java:503)
        at java.net.Socket.<init>(Socket.java:84)
        at org.apache.thrift.transport.TSocket.initSocket(TSocket.java:116)
        at org.apache.thrift.transport.TSocket.<init>(TSocket.java:109)
        at org.apache.thrift.transport.TSocket.<init>(TSocket.java:94)
        at org.apache.accumulo.core.util.ThriftUtil.createClientTransport(ThriftUtil.java:277)
        at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:487)
        at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:420)
        at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:397)
        at org.apache.accumulo.core.util.ThriftUtil.getClient(ThriftUtil.java:128)
        at org.apache.accumulo.core.util.ThriftUtil.getClientNoTimeout(ThriftUtil.java:116)
        at org.apache.accumulo.core.client.impl.MasterClient.getConnection(MasterClient.java:67)
        at org.apache.accumulo.core.client.impl.MasterClient.getConnectionWithRetry(MasterClient.java:45)
        at org.apache.accumulo.core.client.impl.TableOperationsImpl.beginFateOperation(TableOperationsImpl.java:233)
        at org.apache.accumulo.core.client.impl.TableOperationsImpl.doFateOperation(TableOperationsImpl.java:303)
        at org.apache.accumulo.core.client.impl.TableOperationsImpl.doFateOperation(TableOperationsImpl.java:295)
        at org.apache.accumulo.core.client.impl.TableOperationsImpl.doTableFateOperation(TableOperationsImpl.java:1594)
        at org.apache.accumulo.core.client.impl.TableOperationsImpl.deleteRows(TableOperationsImpl.java:557)
        at org.apache.accumulo.core.util.shell.commands.DeleteRowsCommand.execute(DeleteRowsCommand.java:39)
        at org.apache.accumulo.core.util.shell.Shell.execCommand(Shell.java:747)
        at org.apache.accumulo.core.util.shell.Shell.start(Shell.java:607)
        at org.apache.accumulo.core.util.shell.Shell.main(Shell.java:528)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.accumulo.start.Main$1.run(Main.java:141)
        at java.lang.Thread.run(Thread.java:745)



--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/delete-rows-test-result-tp15569p15592.html
Sent from the Developers mailing list archive at Nabble.com.

Re: delete rows test result

Posted by Keith Turner <ke...@deenlo.com>.
On Mon, Nov 16, 2015 at 10:35 AM, z11373 <z1...@outlook.com> wrote:

> Last week, on a separate thread, it was suggested that I use
> tableOperations.deleteRows to delete rows that match specific
> ranges. I was curious to see whether it's better than my current
> implementation, which iterates over all rows and calls putDelete for each.
> While researching, I also found that Accumulo already provides BatchDeleter,
> which does the same thing.
> I tried all three, and below are my test results against three different
> tables (numbers are in milliseconds):
>
> Test 1 (using an iterator and calling putDelete for each entry):
> Table 1: 5,702
> Table 2: 6,912
> Table 3: 4,694
>
> Test 2 (using BatchDeleter class):
> Table 1: 8,089
> Table 2: 10,405
> Table 3: 7,818
>
> Test 3 (using tableOperations.deleteRows; note that I first iterate over all
> rows just to get the last row id, which is then passed as an argument to
> the function):
> Table 1: 196,597
> Table 2: 226,496
> Table 3: 8,442
>
>
> I ran the tests a few times and got pretty much the same results each
> time.
> I didn't look at the code to see what deleteRows is really doing, but
> looking at my test results, I can say it sucks!
>

An advantage of deleteRows is that it can drop entire tablets that fall
completely within a range. However, the tablet at the end of the range may
need to be compacted in order to extend its range. Using deleteRows for a
"small" range that falls completely within a tablet may be suboptimal. Is
that your case? How many key values are you deleting? If it's not the
compaction that's causing the delay, then there may be a bug.

Not sure if it will help, but there is a utility function for finding a max
row.   It does a binary search within the key space.

http://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/client/admin/TableOperations.html#getMaxRow%28java.lang.String,%20org.apache.accumulo.core.security.Authorizations,%20org.apache.hadoop.io.Text,%20boolean,%20org.apache.hadoop.io.Text,%20boolean%29


> Note that for that test I did scan and iterate just to get the last row
> id, but even if I subtract the time for doing that, it's still way too slow.
> Therefore, I'd recommend that anyone avoid using deleteRows for this
> scenario.
> YMMV, but I'd stick with my original approach, which is the same as
> Test 1 above.
>
>
> Thanks,
> Z