Posted to user@accumulo.apache.org by Marc Reichman <mr...@pixelforensics.com> on 2013/05/10 18:39:49 UTC

deletion technique question

I have a table with rows which have 3 column values in one column family,
and a column visibility.

There are situations where I will want to replace the row content with a
new column visibility; I understand that the visibility attributes are
immutable, so I will have to delete and re-put.

Am I better off doing:
1. BatchDeleter with authorizations to allow access, set range to the key
in question, call delete, and then put in mutations with the new visibility
2. Create mutations with a putDelete followed by a put with the new
visibility for each value
3. Something else entirely?

For option #2, can I simply do a putDelete on the column family/qualifier?
Or do I need to "know" the old authorizations to put in a visibility
expression with the putDelete?

For all of these, can a client get up-to-the-minute results immediately
after? Or does some kind of compaction need to occur first?

Re: deletion technique question

Posted by Marc Reichman <mr...@pixelforensics.com>.
The only limitation with the approach that I can see is that I may not know
every permutation of visibility on a given key, and with the scan-driven
approach I can use the user's entire authorization set as a way to get all
of the rows for deletion.

Thanks,
Marc


On Fri, May 10, 2013 at 2:19 PM, Christopher <ct...@apache.org> wrote:

> The BatchDeleter is essentially a BatchScanner with the
> SortedKeyIterator (which drops values from the returned entries...
> they aren't needed to delete), and a BatchWriter that inserts a delete
> entry in a mutation for every entry the scanner sees.
>
> You can, and should, select option 2, because you're better off
> sending two column updates in each mutation rather than sending twice as
> many mutations, as you'd be doing for option 1.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii

Re: deletion technique question

Posted by Christopher <ct...@apache.org>.
The BatchDeleter is essentially a BatchScanner with the
SortedKeyIterator (which drops values from the returned entries...
they aren't needed to delete), and a BatchWriter that inserts a delete
entry in a mutation for every entry the scanner sees.

You can, and should, select option 2, because you're better off
sending two column updates in each mutation rather than sending twice as
many mutations, as you'd be doing for option 1.
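
The difference in mutation count can be sketched with a toy model. This is plain Python rather than the Accumulo client API; the names Update, Mutation, option1 and option2 are hypothetical stand-ins, and the sketch assumes option 1 ends up writing one delete mutation and one put mutation per value.

```python
from collections import namedtuple

# Toy stand-ins (hypothetical, not Accumulo classes).
Update = namedtuple("Update", "qualifier visibility value is_delete")
Mutation = namedtuple("Mutation", "row updates")


def option1(row, columns, old_vis, new_vis):
    """Option 1: a separate delete mutation and put mutation per value."""
    out = []
    for qualifier, value in columns.items():
        out.append(Mutation(row, [Update(qualifier, old_vis, None, True)]))
        out.append(Mutation(row, [Update(qualifier, new_vis, value, False)]))
    return out


def option2(row, columns, old_vis, new_vis):
    """Option 2: one mutation per value holding the delete and the put."""
    return [
        Mutation(row, [Update(qualifier, old_vis, None, True),
                       Update(qualifier, new_vis, value, False)])
        for qualifier, value in columns.items()
    ]


columns = {"q1": "a", "q2": "b", "q3": "c"}
separate = option1("row1", columns, "old", "new")
combined = option2("row1", columns, "old", "new")
```

Both lists carry the same number of column updates; only the number of mutations differs.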

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii



Re: deletion technique question

Posted by Keith Turner <ke...@deenlo.com>.
On Mon, May 13, 2013 at 11:24 AM, Marc Reichman <
mreichman@pixelforensics.com> wrote:

> The 1.5 solution looks nice.
>
> I'm aware of the potential data loss angle, and the sort ordering is also an
> interesting angle, thank you.
>
> In my particular case where I may not necessarily be aware of all
> permutations of column visibility of a given key but want to replace them
> all with a particular new visibility with the same data, how would I go
> about that? Is there a way to use a batchscanner (step 1 of the
> batchdeleter approach) to pull down all the permutations, then putdeletes
> for them and put what I want?
>

No.  It's like you said.  You will only see entries based on the auths you
give the scanner.  There is no way to turn off colvis checking in a scan.
Using the transforming iterator from ACCUMULO-956 at compaction time is
a nice option, because all data passes through iterators at compaction time.
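
A toy model of the point above, in plain Python and not the Accumulo API: a scan filters entries by the caller's auths, while a compaction-time rewrite sees every entry. The simplification that a visibility is just a set of labels that must all be held is an assumption of this sketch, and the names scan and compact_with_transform are hypothetical.

```python
def visible(visibility, auths):
    """An entry is visible only if the auths hold every label it requires."""
    return visibility <= auths


def scan(table, auths):
    """Return only the entries whose visibility the given auths satisfy."""
    return {key: value for key, value in table.items()
            if visible(key[1], auths)}


def compact_with_transform(table, new_vis):
    """Model of a compaction-time rewrite: every entry passes through,
    regardless of anyone's auths, and gets the new visibility."""
    return {(qualifier, new_vis): value
            for (qualifier, _old_vis), value in table.items()}


# Three visibility permutations of the same column, keyed by
# (qualifier, visibility).
table = {
    ("name", frozenset({"A"})): "x",
    ("name", frozenset({"B"})): "x",
    ("name", frozenset({"A", "C"})): "x",
}
seen_by_user = scan(table, frozenset({"A"}))
rewritten = compact_with_transform(table, frozenset({"NEW"}))
```

A user holding only the label A sees one of the three permutations, so a scan-driven delete would leave the other two in place; the compaction-time rewrite replaces all of them.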


>
> In my case I'm pulling one copy of the data down first to verify I have it
> at the user's current scan auth, then using the #1 approach to clear it out
> and then put it in again as the vis I need.
>

This is a good way to do it.  You could possibly clone the table instead of
pulling a copy down.



Re: deletion technique question

Posted by Marc Reichman <mr...@pixelforensics.com>.
The 1.5 solution looks nice.

I'm aware of the potential data loss angle, and the sort ordering is also an
interesting angle, thank you.

In my particular case where I may not necessarily be aware of all
permutations of column visibility of a given key but want to replace them
all with a particular new visibility with the same data, how would I go
about that? Is there a way to use a batchscanner (step 1 of the
batchdeleter approach) to pull down all the permutations, then putdeletes
for them and put what I want?

In my case I'm pulling one copy of the data down first to verify I have it
at the user's current scan auth, then using the #1 approach to clear it out
and then put it in again as the vis I need.


On Mon, May 13, 2013 at 10:05 AM, Keith Turner <ke...@deenlo.com> wrote:

>
>
>
> On Fri, May 10, 2013 at 12:39 PM, Marc Reichman <
> mreichman@pixelforensics.com> wrote:
>
>> I have a table with rows which have 3 column values in one column family,
>> and a column visibility.
>>
>> There are situations where I will want to replace the row content with a
>> new column visibility; I understand that the visibility attributes are
>> immutable, so I will have to delete and re-put.
>>
>> Am I better off doing:
>> 1. BatchDeleter with authorizations to allow access, set range to the key
>> in question, call delete, and then put in mutations with the new visibility
>> 2. Create mutations with a putDelete followed by a put with the new
>> visibility for each value
>> 3. Something else entirely?
>>
>
> In 1.5, you can use the transforming iterator added by ACCUMULO-956.
>
>
>>
>> For option #2, can I simply do a putDelete on the column
>> family/qualifier? Or do I need to "know" the old authorizations to put in a
>> visibility expression with the putDelete?
>>
>> For all of these, can a client get up-to-the-minute results immediately
>> after? Or does some kind of compaction need to occur first?
>>
>
> If you send a mutation with a delete and put, the client will be able to
> see it after the batchwriter flushes or closes.  No compaction needed.
>
> I am a little fuzzy on #1.  Will you delete everything in one pass (using
> batchdeleter), and then do another pass writing data w/ updated colvis?  If
> so, this would seem to imply that you are pulling the data from another
> source (other than the table the data was deleted from)?
>
> Make sure the method you choose is not susceptible to data loss in the
> event that the client dies.  For example, suppose a client was reading a
> table and then writing a delete mutation and an update mutation for each
> key/val read.  If the client died after some deletes were written, but not
> the corresponding updates, then that data would no longer be in the table
> to be read and transformed on the second run.
>
> When you change the colvis, you change the sort order.  Suppose you read a
> key K and change it to K', where K' sorts after K.  When you insert K', it's
> possible that you may read it later in the same scan, because it is being
> inserted in front of the scanner's pointer.  Because of buffering in the
> batch writer and scanner, this would not always occur, but it would occur
> occasionally.  Something to be aware of.
>
>
>
>

Re: deletion technique question

Posted by Keith Turner <ke...@deenlo.com>.
On Fri, May 10, 2013 at 12:39 PM, Marc Reichman <
mreichman@pixelforensics.com> wrote:

> I have a table with rows which have 3 column values in one column family,
> and a column visibility.
>
> There are situations where I will want to replace the row content with a
> new column visibility; I understand that the visibility attributes are
> immutable, so I will have to delete and re-put.
>
> Am I better off doing:
> 1. BatchDeleter with authorizations to allow access, set range to the key
> in question, call delete, and then put in mutations with the new visibility
> 2. Create mutations with a putDelete followed by a put with the new
> visibility for each value
> 3. Something else entirely?
>

In 1.5, you can use the transforming iterator added by ACCUMULO-956.


>
> For option #2, can I simply do a putDelete on the column family/qualifier?
> Or do I need to "know" the old authorizations to put in a visibility
> expression with the putDelete?
>
> For all of these, can a client get up-to-the-minute results immediately
> after? Or does some kind of compaction need to occur first?
>

If you send a mutation with a delete and put, the client will be able to
see it after the batchwriter flushes or closes.  No compaction needed.
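
Why no compaction is needed can be illustrated with a toy model of delete markers. This is plain Python, not Accumulo's implementation; the class ToyTable is hypothetical. It assumes that the column visibility is part of the key, so a delete marker hides only the entry whose visibility it names, and that reads resolve markers on the fly.

```python
class ToyTable:
    """Toy model of delete markers (hypothetical, not Accumulo code)."""

    def __init__(self):
        self.entries = []  # (key, timestamp, value); value None = delete
        self.clock = 0

    def apply(self, updates):
        """Apply one mutation: a list of (key, value) pairs."""
        self.clock += 1
        for key, value in updates:
            self.entries.append((key, self.clock, value))

    def read(self):
        """Newest write per key wins; a delete marker hides that key."""
        latest = {}
        for key, ts, value in self.entries:
            if key not in latest or ts >= latest[key][0]:
                latest[key] = (ts, value)
        return {k: v for k, (_ts, v) in latest.items() if v is not None}

    def compact(self):
        """Physically drop hidden entries and markers; reads do not change."""
        self.entries = [(k, self.clock, v) for k, v in self.read().items()]


# Keys are (row, family, qualifier, visibility).
old_key = ("r1", "cf", "cq", "A")
new_key = ("r1", "cf", "cq", "B")

t = ToyTable()
t.apply([(old_key, "v")])
t.apply([(("r1", "cf", "cq", ""), None)])  # delete naming no visibility
after_wrong_delete = t.read()              # old entry is still there
t.apply([(old_key, None), (new_key, "v")]) # delete + put in one mutation
before_compaction = t.read()
t.compact()
after_compaction = t.read()
```

In this model the reader sees the new entry as soon as the mutation is applied, and compaction only reclaims space.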

I am a little fuzzy on #1.  Will you delete everything in one pass (using
batchdeleter), and then do another pass writing data w/ updated colvis?  If
so, this would seem to imply that you are pulling the data from another
source (other than the table the data was deleted from)?

Make sure the method you choose is not susceptible to data loss in the event
that the client dies.  For example, suppose a client was reading a table and
then writing a delete mutation and an update mutation for each key/val read.
If the client died after some deletes were written, but not the corresponding
updates, then that data would no longer be in the table to be read and
transformed on the second run.
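
The failure mode can be sketched with a toy model in plain Python (not the Accumulo API). The function migrate and its crash_after parameter are hypothetical, and the sketch assumes that a single mutation is applied all-or-nothing while separate mutations are not.

```python
def migrate(table, new_vis, combined, crash_after=None):
    """Rewrite every entry to new_vis. If crash_after is set, the client
    'dies' once that many mutations have been written."""
    written = 0
    for (qualifier, vis), value in list(table.items()):
        if vis == new_vis:
            continue  # already migrated
        delete = ("del", (qualifier, vis), None)
        put = ("put", (qualifier, new_vis), value)
        mutations = [[delete, put]] if combined else [[delete], [put]]
        for mutation in mutations:
            if crash_after is not None and written == crash_after:
                return  # remaining mutations are never sent
            for op, key, val in mutation:  # one mutation applies as a unit
                if op == "del":
                    table.pop(key, None)
                else:
                    table[key] = val
            written += 1


# Separate delete and put mutations: the crash lands between them.
separate = {("a", "old"): 1, ("b", "old"): 2}
migrate(separate, "new", combined=False, crash_after=1)
migrate(separate, "new", combined=False)  # second run after restart

# Delete and put in one mutation: the crash lands between rows.
together = {("a", "old"): 1, ("b", "old"): 2}
migrate(together, "new", combined=True, crash_after=1)
migrate(together, "new", combined=True)   # second run after restart
```

With separate mutations the value for "a" is gone after the restart, because the second run has nothing left to read; with the combined mutation both values survive.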

When you change the colvis, you change the sort order.  Suppose you read a
key K and change it to K', where K' sorts after K.  When you insert K', it's
possible that you may read it later in the same scan, because it is being
inserted in front of the scanner's pointer.  Because of buffering in the
batch writer and scanner, this would not always occur, but it would occur
occasionally.  Something to be aware of.
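
A toy model of this effect, in plain Python and not the Accumulo API: keys are (row, visibility) tuples kept in sorted order, and the hypothetical function scan_and_rewrite models a scanner with no buffering at all, so the effect shows up every time rather than occasionally.

```python
import bisect


def scan_and_rewrite(keys, rewrite):
    """Unbuffered scan over sorted keys that writes rewrite(key) back
    into the same key list while the scan is still running."""
    keys = sorted(keys)
    seen = []
    last = None
    while True:
        # Re-seek just past the last key returned (the scanner's pointer).
        i = 0 if last is None else bisect.bisect_right(keys, last)
        if i == len(keys):
            return seen
        last = keys[i]
        seen.append(last)
        new_key = rewrite(last)
        if new_key != last:
            keys.remove(last)
            bisect.insort(keys, new_key)


def to_vis(target):
    """Build a rewrite function that sets the visibility to target."""
    return lambda key: (key[0], target)


start = [("r1", "m"), ("r2", "m")]
sorts_after = scan_and_rewrite(start, to_vis("z"))   # "z" sorts after "m"
sorts_before = scan_and_rewrite(start, to_vis("a"))  # "a" sorts before "m"
```

When the new visibility sorts after the old one, each rewritten key lands ahead of the pointer and is returned again; when it sorts before, the scan never sees it.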