Posted to user@accumulo.apache.org by "Terry P." <te...@gmail.com> on 2013/12/09 18:02:32 UTC

What priority for purge filter

Greetings all,
With Accumulo v1.4.2, we have a purge filter/iterator that extends
RowFilter and I have a question about what priority it should be
implemented with. I see the default VersioningIterator runs at priority 20.

Our purge iterator is designed to suppress (scan time) or remove (majc or
minc compactions) rows based on the value in a column. Is it more efficient
to run our purge iterator at a higher priority than the VersioningIterator,
or does it really matter? Our VersioningIterator maxVersions is set to the
default of 1 which is what we want/need.

Thanks in advance,
Terry

Re: What priority for purge filter

Posted by Christopher <ct...@apache.org>.
If you want your filter to make decisions based on older versions, do
it before (lower than 20). If you want it to make decisions based on
the current version(s), do it after (higher than 20). I think that's
what Billie was saying.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii



Re: What priority for purge filter

Posted by Keith Turner <ke...@deenlo.com>.
On Wed, Dec 11, 2013 at 3:22 PM, Terry P. <te...@gmail.com> wrote:

> Thanks Keith, wonderful explanation as always, and you are helping ensure
> everything goes as expected. Thank you sir!
> For minor compactions and partial major compactions, my approach to
> "letting everything pass" is:
> 1. In the init() method (the boolean variable inCorrectScope is declared
> at the head of the class and set to false to be safe):
>
> IteratorScope is = env.getIteratorScope();
> if (is.equals(IteratorScope.scan) || env.isFullMajorCompaction())
>   inCorrectScope = true;
> else
>   inCorrectScope = false;
>

The isFullMajorCompaction() method will throw an exception if it is called
when the scope is not major compaction, so you will want code like the
following.

if (is.equals(IteratorScope.scan)
    || (is.equals(IteratorScope.majc) && env.isFullMajorCompaction()))
  inCorrectScope = true;
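
A minimal, self-contained sketch of that guard, for anyone who wants to try it without Accumulo on the classpath. Scope and Env are hypothetical stand-ins for IteratorScope and IteratorEnvironment, and the exception thrown by the stand-in is an assumption modelled on the behaviour described above, not a copy of Accumulo's code.

```java
class ScopeCheckSketch {
  // Stand-in for Accumulo's IteratorScope enum (same constant names).
  enum Scope { scan, minc, majc }

  // Stand-in for the two IteratorEnvironment methods used in this thread.
  static final class Env {
    private final Scope scope;
    private final boolean fullMajc;

    Env(Scope scope, boolean fullMajc) {
      this.scope = scope;
      this.fullMajc = fullMajc;
    }

    Scope getIteratorScope() {
      return scope;
    }

    boolean isFullMajorCompaction() {
      // Assumed behaviour: refuse to answer outside major compaction scope.
      if (scope != Scope.majc)
        throw new IllegalStateException("scope is " + scope);
      return fullMajc;
    }
  }

  // True only for scopes that are guaranteed to see every entry of a row.
  static boolean seesAllData(Env env) {
    Scope is = env.getIteratorScope();
    // && short-circuits, so isFullMajorCompaction() is only reached for majc.
    return is == Scope.scan || (is == Scope.majc && env.isFullMajorCompaction());
  }

  public static void main(String[] args) {
    System.out.println(seesAllData(new Env(Scope.scan, false)));  // true
    System.out.println(seesAllData(new Env(Scope.minc, false)));  // false
    System.out.println(seesAllData(new Env(Scope.majc, true)));   // true
    System.out.println(seesAllData(new Env(Scope.majc, false)));  // false
  }
}
```

The order of the two tests inside the parentheses matters: checking the scope first is what keeps the second call from being made during a scan or minor compaction.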




> 2. In the acceptRow() method:
>
> while ( rowIterator.hasTop() ) {
>   // If not in scan or full major compaction scope, short circuit and return true
>   if (!inCorrectScope)
>     return true;
>   <otherwise perform the steps to see if the row has the expTs column family
>    and if the purge criteria is met or not from the value in that column>
> My main question is just to confirm that I've put the return in the
> correct place.
>
> Also, I saw something that surprised me with a scan too. I did a scan with
> explicit columns listed, and NOT the expTimestamp column the purge iterator
> operates on, and I still see entries. If I include the expTs column the
> purge is done on in the explicit list of columns for the scan, entries are
> filtered out as they should be.  In our environment and use case
> for Accumulo, that shouldn't be an issue, but I can see how that might
> confuse someone in other circumstances.  Just curious if there is some way
> to "force" it to always run even if the "purge criterion column" is not
> included in the scan columns.
>

You can seek the iterator inside the acceptRow() method with the columns you
need. The iterator passed to acceptRow() is confined to the current row, so
you do not need to specify a particular range. You should be able to do
something like the following in acceptRow(), with the scope check placed
before the loop rather than inside it.

     if (!inCorrectScope)
        return true;

     // myColumns is the set of column families you need to make a decision
     rowIterator.seek(new Range(), myColumns, true);

     while (rowIterator.hasTop()) {
          // make decision based on the top key/value
          rowIterator.next();  // advance, otherwise the loop never ends
     }
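
To make the effect of the re-seek concrete, here is a small sketch that runs without Accumulo. RowIter and Entry are hypothetical stand-ins for the row-confined iterator and its key/value pairs, and the purge rule (treating the expTs value as an epoch-millisecond expiry time) is only an assumption for illustration, since the thread does not spell out the actual criterion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

class RowSeekSketch {
  // One key/value pair of a row, reduced to column family and value.
  static final class Entry {
    final String family;
    final String value;

    Entry(String family, String value) {
      this.family = family;
      this.value = value;
    }
  }

  // Simplified stand-in for the row-confined iterator handed to acceptRow().
  static final class RowIter {
    private final List<Entry> row;
    private List<Entry> view;
    private int pos;

    // Starts out limited to the families the scan asked for.
    RowIter(List<Entry> row, Set<String> fetchedFamilies) {
      this.row = row;
      seek(fetchedFamilies);
    }

    // Re-position at the start of the row, exposing only the given families.
    void seek(Set<String> families) {
      view = new ArrayList<>();
      for (Entry e : row)
        if (families.contains(e.family))
          view.add(e);
      pos = 0;
    }

    boolean hasTop() { return pos < view.size(); }
    Entry top() { return view.get(pos); }
    void next() { pos++; }
  }

  // Keep the row unless its expTs value (assumed epoch millis) is in the past.
  static boolean acceptRow(RowIter rowIterator, long now) {
    // Ask for the decision column explicitly, whatever the scan fetched.
    rowIterator.seek(Collections.singleton("expTs"));
    while (rowIterator.hasTop()) {
      if (Long.parseLong(rowIterator.top().value) < now)
        return false; // expired: suppress the whole row
      rowIterator.next();
    }
    return true;
  }

  public static void main(String[] args) {
    List<Entry> row = Arrays.asList(new Entry("data", "hello"), new Entry("expTs", "100"));
    // The scan fetched only "data", yet the filter still finds expTs.
    RowIter scanOfDataOnly = new RowIter(row, Collections.singleton("data"));
    System.out.println(acceptRow(scanOfDataOnly, 200L)); // false
  }
}
```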





Re: What priority for purge filter

Posted by "Terry P." <te...@gmail.com>.
Thanks Keith, wonderful explanation as always, and you are helping ensure
everything goes as expected. Thank you sir!
For minor compactions and partial major compactions, my approach to
"letting everything pass" is:
1. In the init() method (the boolean variable inCorrectScope is declared at
the head of the class and set to false to be safe):

IteratorScope is = env.getIteratorScope();
if (is.equals(IteratorScope.scan) || env.isFullMajorCompaction())
  inCorrectScope = true;
else
  inCorrectScope = false;

2. In the acceptRow() method:

while ( rowIterator.hasTop() ) {
  // If not in scan or full major compaction scope, short circuit and return true
  if (!inCorrectScope)
    return true;
  <otherwise perform the steps to see if the row has the expTs column family
   and if the purge criteria is met or not from the value in that column>
My main question is just to confirm that I've put the return in the correct
place.

Also, I saw something that surprised me with a scan too. I did a scan with
explicit columns listed, and NOT the expTimestamp column the purge iterator
operates on, and I still see entries. If I include the expTs column the
purge is done on in the explicit list of columns for the scan, entries are
filtered out as they should be.  In our environment and use case
for Accumulo, that shouldn't be an issue, but I can see how that might
confuse someone in other circumstances.  Just curious if there is some way
to "force" it to always run even if the "purge criterion column" is not
included in the scan columns.

Thanks again as always for all the help.

Best regards,
Terry





Re: What priority for purge filter

Posted by Keith Turner <ke...@deenlo.com>.
On Mon, Dec 9, 2013 at 4:18 PM, Terry P. <te...@gmail.com> wrote:

> Thanks Billie and Christopher, sounds like I should have the purge
> iterator run after the VersioningIterator.
>
> Keith, uh oh, I was not aware that not all compactions will see the entire
> row.  That sounds like it could be bad for my case!  Here is the original
> thread that you helped me with as background:
>

Sometimes Accumulo will compact only a subset of the data in a tablet.  This
can happen during a minor compaction and when a major compaction is operating
on a subset of files.  A row's columns and updates may be spread across
multiple files, so in these cases you may see only a subset of the columns
in a row, and you may not see the latest version.  Scans and full major
compactions see all data.  You can tell the difference when an iterator is
initialized: an IteratorEnvironment is passed into the init method.  If the
scope is majc and isFullMajorCompaction() is true, then you know you will see
all data (also if the scope is scan).  For minor compactions and partial
major compactions you may want to just let everything pass.
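
A toy model of why a partial compaction cannot be trusted to make a row-level decision. This is plain Java, not Accumulo code: each map stands for one file's share of a single row (column to value), and the column name expTs is borrowed from this thread.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PartialCompactionSketch {
  // The row as seen by a compaction (or scan) that reads the given files.
  static Map<String, String> rowAsSeenBy(List<Map<String, String>> filesOldestFirst) {
    Map<String, String> merged = new TreeMap<>();
    for (Map<String, String> file : filesOldestFirst)
      merged.putAll(file); // later (newer) files overwrite older values
    return merged;
  }

  // A purge rule is only meaningful when its decision column is visible.
  static boolean canDecide(Map<String, String> row) {
    return row.containsKey("expTs");
  }

  public static void main(String[] args) {
    Map<String, String> older = new HashMap<>();
    older.put("data", "v1");
    older.put("expTs", "100");
    Map<String, String> newer = Collections.singletonMap("data", "v2");

    // Full view (scan or full major compaction): every column, newest values.
    System.out.println(rowAsSeenBy(Arrays.asList(older, newer))); // {data=v2, expTs=100}
    // Partial compaction over the newer file only: expTs is missing.
    System.out.println(rowAsSeenBy(Collections.singletonList(newer))); // {data=v2}
    // Partial compaction over the older file only: data is stale.
    System.out.println(rowAsSeenBy(Collections.singletonList(older))); // {data=v1, expTs=100}
  }
}
```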


>
>
> http://mail-archives.apache.org/mod_mbox/accumulo-user/201311.mbox/%3CCAGUtCHryW3RR9PF5BAD+psxE-dswL9FyOGVv5Mn_Wj00o2mxig@mail.gmail.com%3E
>
> We only have 10-12 k/v pairs per row -- is that a factor? Can you explain
> the nuances with respect to when a compaction won't see the entire row?
>
> Thanks,
> Terry

Re: What priority for purge filter

Posted by "Terry P." <te...@gmail.com>.
Thanks Billie and Christopher, sounds like I should have the purge iterator
run after the VersioningIterator.

Keith, uh oh, I was not aware that not all compactions will see the entire
row.  That sounds like it could be bad for my case!  Here is the original
thread that you helped me with as background:

http://mail-archives.apache.org/mod_mbox/accumulo-user/201311.mbox/%3CCAGUtCHryW3RR9PF5BAD+psxE-dswL9FyOGVv5Mn_Wj00o2mxig@mail.gmail.com%3E

We only have 10-12 k/v pairs per row -- is that a factor? Can you explain
the nuances with respect to when a compaction won't see the entire row?

Thanks,
Terry




Re: What priority for purge filter

Posted by Keith Turner <ke...@deenlo.com>.
On Mon, Dec 9, 2013 at 12:02 PM, Terry P. <te...@gmail.com> wrote:

> Greetings all,
> With Accumulo v1.4.2, we have a purge filter/iterator that extends
> RowFilter and I have a question about what priority it should be
> implemented with. I see the default VersioningIterator runs at priority 20.
>
> Our purge iterator is designed to suppress (scan time) or remove (majc or
> minc compactions) rows based on the value in a column. Is it more efficient
> to run our purge iterator at a higher priority than the VersioningIterator,
> or does it
>

Are you aware that not all compactions will see the entire row?


> really matter? Our VersioningIterator maxVersions is set to the default of
> 1 which is what we want/need.
>
> Thanks in advance,
> Terry
>

Re: What priority for purge filter

Posted by Billie Rinaldi <bi...@gmail.com>.
In the event that the value could change (by having a new version of the
value written) in the column used to determine whether or not to keep the
row, the purge iterator should be applied after the VersioningIterator.
Iterators are applied in order of increasing priority number, so this means
giving it a priority larger than 20.
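
A small sketch of the difference the ordering makes, in plain Java rather than real iterators. newestOnly stands in for a VersioningIterator with maxVersions of 1, and the purge rule (purge when a visible value equals "expired") is a made-up assumption for the example.

```java
import java.util.Arrays;
import java.util.List;

class PriorityOrderSketch {
  // Stand-in for VersioningIterator with maxVersions=1: keep the newest version.
  static List<String> newestOnly(List<String> versionsNewestFirst) {
    return versionsNewestFirst.isEmpty()
        ? versionsNewestFirst
        : versionsNewestFirst.subList(0, 1);
  }

  // Hypothetical purge rule: purge if any visible version says "expired".
  static boolean shouldPurge(List<String> visibleVersions) {
    return visibleVersions.contains("expired");
  }

  public static void main(String[] args) {
    // The decision column was rewritten: the newest value is "active".
    List<String> versions = Arrays.asList("active", "expired");

    // Filter after versioning (priority above 20): only the current value counts.
    System.out.println(shouldPurge(newestOnly(versions))); // false
    // Filter before versioning (priority below 20): the stale value is visible too.
    System.out.println(shouldPurge(versions)); // true
  }
}
```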

