Posted to user@accumulo.apache.org by Mike Drob <md...@mdrob.com> on 2013/11/01 00:08:03 UTC

Re: Deleting many rows that match a given criterion

Terry,

Yea, a RowFilter + full compaction takes care of the issue. Note that
simply setting a RowFilter for scan time and expecting the data to delete
naturally might not work if your clients set varying fetch columns on their
scanners.

Mike
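[Editor's note: for illustration, the row-level decision such a compaction-time filter makes can be sketched in plain, self-contained Java. The class below is hypothetical; a real implementation would extend org.apache.accumulo.core.iterators.user.RowFilter and implement acceptRow() over the per-row iterator Accumulo hands it.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch, not the real Accumulo API: models the per-row
// decision a compaction-time RowFilter would make. At a full major
// compaction, entries of a rejected row are simply not written to the
// new files, which is what removes them.
public class RowMatchSketch {

    // Keep a row only if none of its column values contains the pattern.
    // Note this inspects every column of the row, which is why it works at
    // compaction time even when scan clients restrict their fetch columns.
    public static boolean acceptRow(Map<String, String> rowEntries, String pattern) {
        for (String value : rowEntries.values()) {
            if (value.contains(pattern)) {
                return false; // criterion matched: drop the whole row
            }
        }
        return true; // keep the row
    }

    public static void main(String[] args) {
        Map<String, String> doomed = new LinkedHashMap<>();
        doomed.put("col", "xEXPRESSIONy");
        doomed.put("other", "unrelated");

        Map<String, String> kept = new LinkedHashMap<>();
        kept.put("col", "clean value");

        System.out.println(acceptRow(doomed, "EXPRESSION")); // false
        System.out.println(acceptRow(kept, "EXPRESSION"));   // true
    }
}
```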


On Thu, Oct 31, 2013 at 5:11 PM, Terry P. <te...@gmail.com> wrote:

> Hi Mike,
> Did you wind up writing Java code to do this? Did you go with a RowFilter?
>
> I have a similar circumstance where I need to delete millions of rows
> daily and the criteria for deletion is not in the rowkey.
>
> Thanks in advance,
> Terry
>
>
>
> On Wed, Oct 23, 2013 at 4:21 PM, Mike Drob <md...@mdrob.com> wrote:
>
>> Thanks for the feedback, Aru and Keith.
>>
>> I've had some more time to play around with this, and here are some
>> additional observations.
>>
>> My existing process is very slow. I think this is due to each deletemany
>> command starting up a new scanner and batchwriter, and creating a lot of
>> RPC overhead. I didn't initially think that it would be a significant
>> amount of data, but maybe I just had the wrong idea of what "significant"
>> is in this case.
>>
>> I'm not sure the RowDeletingIterator would work in this case because I do
>> use empty rows for other purposes. The RowFilter at compaction is a great
>> option, except I had hoped to avoid writing actual Java code. Looking back
>> at this, I might have to bite that bullet.
>>
>> Again, thanks both for the suggestions!
>>
>> Mike
>>
>>
>> On Tue, Oct 22, 2013 at 12:04 PM, Keith Turner <ke...@deenlo.com> wrote:
>>
>>> If it's a significant amount of data, you could create a class that
>>> extends RowFilter and set it as a compaction iterator.
>>>
>>>
>>> On Tue, Oct 22, 2013 at 11:45 AM, Mike Drob <md...@mdrob.com> wrote:
>>>
>>>> I'm attempting to delete all rows from a table that contain a specific
>>>> word in the value of a specified column. My current process looks like:
>>>>
>>>> accumulo shell -e 'egrep .*EXPRESSION.* -np -t tab -c col' \
>>>>   | awk 'BEGIN {print "table tab"}; {print "deletemany -f -np -r " $1}; END {print "exit"}' \
>>>>   > rows.out
>>>> accumulo shell -f rows.out
>>>>
>>>> I tried playing around with scan iterators and various options on
>>>> deletemany and deleterows but wasn't able to find a more straightforward
>>>> way to do this. Does anybody have any suggestions?
>>>>
>>>> Mike
>>>>
>>>
>>>
>>
>
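[Editor's note: as a plain-Java illustration of what the shell-plus-awk pipeline in the message above is doing, the script-generation step can be sketched as below. It assumes, as the awk program does, that each scan output line begins with the row id as its first whitespace-separated field; the class and method names are hypothetical.]

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: reproduces the awk stage of the pipeline above,
// turning scan output lines into a deletemany script for the Accumulo
// shell (the slow part being that each deletemany opens its own scanner
// and batch writer).
public class DeleteScriptSketch {

    // Assumes each scan line starts with the row id, e.g.
    // "row1 colfam:colqual [] value".
    public static List<String> buildDeleteScript(List<String> scanLines, String table) {
        List<String> script = new ArrayList<>();
        script.add("table " + table);
        for (String line : scanLines) {
            String row = line.split("\\s+")[0];
            script.add("deletemany -f -np -r " + row);
        }
        script.add("exit");
        return script;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "row1 cf:cq [] some EXPRESSION value",
            "row2 cf:cq [] another EXPRESSION value");
        buildDeleteScript(lines, "tab").forEach(System.out::println);
    }
}
```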

Re: Deleting many rows that match a given criterion

Posted by "Terry P." <te...@gmail.com>.
Thanks Mike. It looks like the AgeOffFilter class would be a good starting
point as a template for my filter: override the logic in the *init* method
as appropriate and put the criteria in the *accept* method.

What I can't figure out is where the magic to remove entries happens. I
don't see anything in the AgeOffFilter class nor the base Filter class. For
my case, I need to remove all entries for any rowkey whose expiration
timestamp column meets the test I'll be applying. So really I'm removing the
whole row (all entries for a given rowkey), not just some entries.

Any chance you could share your code?  Thanks in advance for any help you
can provide.

Kind regards,
Terry
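[Editor's note: the "magic" asked about above is the compaction itself: entries for which a filter's accept returns false are simply not written into the files a full major compaction produces. A rough, self-contained sketch of the per-entry test an AgeOffFilter-style accept might perform is below; the real org.apache.accumulo.core.iterators.Filter subclass would receive a Key and Value, and dropping a whole row rather than single entries is what RowFilter.acceptRow, as suggested earlier in the thread, is for.]

```java
// Illustrative sketch of an AgeOffFilter-style accept() test. The
// expiration timestamp is passed in directly here for clarity; a real
// Filter would extract it from the entry it is handed.
public class ExpireSketch {

    // Keep an entry only while "now" has not reached its expiration time.
    // At a full major compaction, entries rejected here are not rewritten,
    // which is what actually removes them from the table.
    public static boolean accept(long expirationMillis, long nowMillis) {
        return nowMillis < expirationMillis;
    }

    public static void main(String[] args) {
        System.out.println(accept(2000L, 1000L)); // true: not yet expired
        System.out.println(accept(2000L, 3000L)); // false: expired, drop it
    }
}
```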


