Posted to user@accumulo.apache.org by Russ Weeks <rw...@newbrightidea.com> on 2014/10/01 00:27:41 UTC

Distinguishing between processed and unprocessed data in an Iterator

Hi, folks,

The StatsCombiner[1] shows one way for an Iterator to distinguish between
processed and unprocessed data. In this case, the StatsCombiner treats
string representations of integers as unprocessed data and comma-separated
string representations of integers as processed data.
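
As a rough illustration of that convention, here is a minimal, library-free Java sketch. It is not the StatsCombiner source: the class name StatsSketch, the use of plain Strings instead of Accumulo Value objects, and the assumed "min,max,sum,count" layout of the processed form are all assumptions made for the example.

```java
import java.util.List;

// Hypothetical stand-in for a StatsCombiner-style reducer. Assumptions: the
// processed form is "min,max,sum,count", and Strings replace Accumulo Values.
final class StatsSketch {

    // Folds raw integers ("7") and earlier summaries ("1,9,13,3") into one summary.
    static String reduce(List<String> values) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        long sum = 0;
        long count = 0;
        for (String value : values) {
            String[] parts = value.split(",");
            if (parts.length == 1) {
                // Unprocessed: a single integer counts as one observation.
                long n = Long.parseLong(parts[0].trim());
                min = Math.min(min, n);
                max = Math.max(max, n);
                sum += n;
                count += 1;
            } else if (parts.length == 4) {
                // Processed: merge a summary emitted by an earlier pass.
                min = Math.min(min, Long.parseLong(parts[0].trim()));
                max = Math.max(max, Long.parseLong(parts[1].trim()));
                sum += Long.parseLong(parts[2].trim());
                count += Long.parseLong(parts[3].trim());
            } else {
                throw new IllegalArgumentException("Unrecognized value: " + value);
            }
        }
        return min + "," + max + "," + sum + "," + count;
    }
}
```

The shape of the value (one field versus four) is the only thing that tells the two kinds of data apart, which is why the approach does not carry over directly to arbitrary payloads.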

Two questions: First, is it possible to do this in an arbitrary fashion?
For example, let's say my Iterator adds Values to a bloom filter which it
maintains internally - like a combiner, but potentially across multiple
column families. If the iterator encounters unprocessed data, it should
offer it to the bloom filter. If it encounters processed data (i.e., a
serialized bloom filter), it should merge it with its own bloom filter.

The only way that I can think of to do this is to have a higher-priority
iterator that "escapes" Values, and have my Iterator emit unescaped Values.
Then my iterator can make decisions based on whether a current Value is or
isn't escaped. I find this approach pretty kludgy though, and any advice is
welcome.
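
One hypothetical variant of this flagging idea, sketched in plain Java: instead of a separate escaping iterator, every Value carries a one-byte tag saying whether the rest is a raw payload or a serialized filter. The tag values, the 1024-bit filter size, the two-hash scheme, and the class name TaggedBloomSketch are all assumptions for illustration; none of this is an Accumulo API, and a java.util.BitSet stands in for a real bloom filter implementation.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch: a one-byte tag separates raw payloads from serialized
// filters. Tags, sizes, and the hash scheme are assumptions, not Accumulo APIs.
final class TaggedBloomSketch {
    static final byte RAW = 0;
    static final byte FILTER = 1;
    static final int BITS = 1024;

    static byte[] tagRaw(byte[] payload) {
        return prepend(RAW, payload);
    }

    static byte[] tagFilter(BitSet filter) {
        return prepend(FILTER, filter.toByteArray());
    }

    // Offers raw data to the filter, or ORs in a previously emitted filter.
    static void absorb(BitSet filter, byte[] tagged) {
        if (tagged.length == 0) {
            throw new IllegalArgumentException("Missing tag byte");
        }
        byte[] body = Arrays.copyOfRange(tagged, 1, tagged.length);
        if (tagged[0] == RAW) {
            for (int bit : bitsFor(body)) {
                filter.set(bit);
            }
        } else if (tagged[0] == FILTER) {
            filter.or(BitSet.valueOf(body));
        } else {
            throw new IllegalArgumentException("Unknown tag: " + tagged[0]);
        }
    }

    static boolean mightContain(BitSet filter, byte[] payload) {
        for (int bit : bitsFor(payload)) {
            if (!filter.get(bit)) {
                return false;
            }
        }
        return true;
    }

    // Two bit positions derived from one array hash; a toy hashing scheme.
    private static int[] bitsFor(byte[] payload) {
        int h = Arrays.hashCode(payload);
        return new int[] {Math.floorMod(h, BITS), Math.floorMod(h * 31 + 17, BITS)};
    }

    private static byte[] prepend(byte tag, byte[] body) {
        byte[] out = new byte[body.length + 1];
        out[0] = tag;
        System.arraycopy(body, 0, out, 1, body.length);
        return out;
    }
}
```

Because merging is a bitwise OR, absorbing the same emitted filter twice leaves the accumulated filter unchanged, which matters if the same data can be iterated over more than once.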

Second question: is the need to distinguish between processed and
unprocessed data due to the Iterator running in all three scopes (scan,
minor compaction, and major compaction)? Would a per-scanner Iterator or an
Iterator running in scan scope be guaranteed to only see unprocessed data?

Thanks,
-Russ

1:
https://github.com/apache/accumulo/blob/master/examples/simple/src/main/java/org/apache/accumulo/examples/simple/combiner/StatsCombiner.java

Re: Distinguishing between processed and unprocessed data in an Iterator

Posted by Christopher <ct...@apache.org>.
Without looking at the code, I can't recall the specific circumstances
where a re-seek might occur (maybe continueScan?), but no API guarantees
are made about it, so even if Accumulo itself doesn't re-seek arbitrarily
today, that could change in a different version.


--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

On Tue, Sep 30, 2014 at 10:13 PM, Russ Weeks <rw...@newbrightidea.com>
wrote:

> I see, thanks Christopher.
>
> For a lot of the iterators that my colleagues and I are thinking about,
> we'd be OK with constraints like, "only apply this iterator at scan time"
> and "don't stick other iterators on top of this one". But my understanding
> is that Accumulo itself, either on the tserver-side and/or the
> scanner-side, might arbitrarily re-seek any type of iterator at any time it
> chooses.
>
> -Russ
>
> On Tue, Sep 30, 2014 at 6:41 PM, Christopher <ct...@apache.org> wrote:
>
>> On Tue, Sep 30, 2014 at 9:34 PM, Russ Weeks <rw...@newbrightidea.com>
>> wrote:
>>
>>> > an iterator in the scan scope would be guaranteed to only see
>>> > unprocessed data if the iterator has not been configured for minor
>>> > compaction or major compaction scopes at all
>>>
>>> Excellent, thanks Christopher. That simplifies things. One more
>>> question: I understand that an iterator may be re-seeked at any point in
>>> its lifetime, which could cause it to see unprocessed data a second time. I
>>> assume this is true for scan-scope iterators as well?
>>>
>>> -Russ
>>>
>>>
>> Yes, it is true for scan scope also. A simple example would be another
>> user iterator that sits on top of yours that does a Cartesian product of
>> its data source.
>>
>> --
>> Christopher L Tubbs II
>> http://gravatar.com/ctubbsii
>>
>>
>

Re: Distinguishing between processed and unprocessed data in an Iterator

Posted by Russ Weeks <rw...@newbrightidea.com>.
I see, thanks Christopher.

For a lot of the iterators that my colleagues and I are thinking about,
we'd be OK with constraints like, "only apply this iterator at scan time"
and "don't stick other iterators on top of this one". But my understanding
is that Accumulo itself, either on the tserver-side and/or the
scanner-side, might arbitrarily re-seek any type of iterator at any time it
chooses.

-Russ

On Tue, Sep 30, 2014 at 6:41 PM, Christopher <ct...@apache.org> wrote:

> On Tue, Sep 30, 2014 at 9:34 PM, Russ Weeks <rw...@newbrightidea.com>
> wrote:
>
>> > an iterator in the scan scope would be guaranteed to only see
>> > unprocessed data if the iterator has not been configured for minor
>> > compaction or major compaction scopes at all
>>
>> Excellent, thanks Christopher. That simplifies things. One more question:
>> I understand that an iterator may be re-seeked at any point in its
>> lifetime, which could cause it to see unprocessed data a second time. I
>> assume this is true for scan-scope iterators as well?
>>
>> -Russ
>>
>>
> Yes, it is true for scan scope also. A simple example would be another
> user iterator that sits on top of yours that does a Cartesian product of
> its data source.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>

Re: Distinguishing between processed and unprocessed data in an Iterator

Posted by Christopher <ct...@apache.org>.
On Tue, Sep 30, 2014 at 9:34 PM, Russ Weeks <rw...@newbrightidea.com>
wrote:

> > an iterator in the scan scope would be guaranteed to only see
> > unprocessed data if the iterator has not been configured for minor
> > compaction or major compaction scopes at all
>
> Excellent, thanks Christopher. That simplifies things. One more question:
> I understand that an iterator may be re-seeked at any point in its
> lifetime, which could cause it to see unprocessed data a second time. I
> assume this is true for scan-scope iterators as well?
>
> -Russ
>
>
Yes, it is true for scan scope also. A simple example would be another user
iterator that sits on top of yours that does a Cartesian product of its
data source.
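
One possible way to cope with re-seeks, sketched here as an assumption rather than as anything Accumulo prescribes, is to treat every seek as a fresh start and discard state accumulated before it. The class below is a toy stand-in: it is not Accumulo's SortedKeyValueIterator interface, and the names ReseekSketch, seek, hasTop, next, and observedCount are hypothetical.

```java
import java.util.List;

// Hypothetical stand-in for a stateful iterator; this is NOT Accumulo's
// SortedKeyValueIterator interface, only the shape of the idea.
final class ReseekSketch {
    private final List<String> source; // stand-in for the sorted underlying data
    private int position;
    private int observed;              // state accumulated since the last seek

    ReseekSketch(List<String> source) {
        this.source = source;
    }

    // Every seek starts from a clean slate, so entries seen before a re-seek
    // are not counted a second time.
    void seek(int start) {
        position = start;
        observed = 0;
    }

    boolean hasTop() {
        return position < source.size();
    }

    // Consumes the current entry and advances to the next one.
    void next() {
        observed++;
        position++;
    }

    int observedCount() {
        return observed;
    }
}
```

An iterator layered on top that seeks its source repeatedly, as in the Cartesian product example, would then see a consistent count on every pass instead of an ever-growing one.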

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

Re: Distinguishing between processed and unprocessed data in an Iterator

Posted by Russ Weeks <rw...@newbrightidea.com>.
> an iterator in the scan scope would be guaranteed to only see unprocessed
> data if the iterator has not been configured for minor compaction or major
> compaction scopes at all

Excellent, thanks Christopher. That simplifies things. One more question: I
understand that an iterator may be re-seeked at any point in its lifetime,
which could cause it to see unprocessed data a second time. I assume this
is true for scan-scope iterators as well?

-Russ

On Tue, Sep 30, 2014 at 3:46 PM, Christopher <ct...@apache.org> wrote:

> On Tue, Sep 30, 2014 at 6:27 PM, Russ Weeks <rw...@newbrightidea.com>
> wrote:
>
>> Hi, folks,
>>
>> The StatsCombiner[1] shows one way for an Iterator to distinguish between
>> processed and unprocessed data. In this case, the StatsCombiner treats
>> string representations of integers as unprocessed data and comma-separated
>> string representations of integers as processed data.
>>
>> Two questions: First, is it possible to do this in an arbitrary fashion?
>> For example, let's say my Iterator adds Values to a bloom filter which it
>> maintains internally - like a combiner, but potentially across multiple
>> CF's. If the iterator encounters unprocessed data, it should offer it to
>> the bloom filter. If it encounters processed data (ie. a bloom filter), it
>> should merge it with its own bloom filter.
>>
>> The only way that I can think of to do this is to have a higher-priority
>> iterator that "escapes" Values, and have my Iterator emit unescaped Values.
>> Then my iterator can make decisions based on whether a current Value is or
>> isn't escaped. I find this approach pretty kludgy though, and any advice is
>> welcome.
>>
>>
> Sure, you could generalize this, like standardize the way you flag data as
> evaluated. However, I think most people would interpret "evaluated" to mean
> "evaluated by this specific iterator", which would imply that the flagging
> is iterator-specific.
>
>
>> Second question: the need to distinguish between processed and
>> unprocessed data, is this due to the Iterator running in all three scopes?
>> Would a per-scanner Iterator or an Iterator running in scan scope be
>> guaranteed to only see unprocessed data?
>>
>>
> It's more that the iterator may run over the same data multiple times, not
> that it runs in different scopes (although, different scopes increases the
> number of times the data is iterated over). This could happen, for
> instance, if a tablet is compacted multiple times and the only scope the
> iterator is configured for is major compaction.
>
> So, in response to the second part of this question, an iterator in the
> scan scope would be guaranteed to only see unprocessed data if the iterator
> has not been configured for minor compaction or major compaction scopes at
> all.
>
>
>> Thanks,
>> -Russ
>>
>> 1:
>> https://github.com/apache/accumulo/blob/master/examples/simple/src/main/java/org/apache/accumulo/examples/simple/combiner/StatsCombiner.java
>>
>
>
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>

Re: Distinguishing between processed and unprocessed data in an Iterator

Posted by Christopher <ct...@apache.org>.
On Tue, Sep 30, 2014 at 6:27 PM, Russ Weeks <rw...@newbrightidea.com>
wrote:

> Hi, folks,
>
> The StatsCombiner[1] shows one way for an Iterator to distinguish between
> processed and unprocessed data. In this case, the StatsCombiner treats
> string representations of integers as unprocessed data and comma-separated
> string representations of integers as processed data.
>
> Two questions: First, is it possible to do this in an arbitrary fashion?
> For example, let's say my Iterator adds Values to a bloom filter which it
> maintains internally - like a combiner, but potentially across multiple
> CF's. If the iterator encounters unprocessed data, it should offer it to
> the bloom filter. If it encounters processed data (ie. a bloom filter), it
> should merge it with its own bloom filter.
>
> The only way that I can think of to do this is to have a higher-priority
> iterator that "escapes" Values, and have my Iterator emit unescaped Values.
> Then my iterator can make decisions based on whether a current Value is or
> isn't escaped. I find this approach pretty kludgy though, and any advice is
> welcome.
>
>
Sure, you could generalize this, like standardize the way you flag data as
evaluated. However, I think most people would interpret "evaluated" to mean
"evaluated by this specific iterator", which would imply that the flagging
is iterator-specific.


> Second question: the need to distinguish between processed and unprocessed
> data, is this due to the Iterator running in all three scopes? Would a
> per-scanner Iterator or an Iterator running in scan scope be guaranteed to
> only see unprocessed data?
>
>
It's more that the iterator may run over the same data multiple times, not
that it runs in different scopes (although running in more scopes does
increase the number of times the data is iterated over). This could happen,
for instance, if a tablet is compacted multiple times and the only scope the
iterator is configured for is major compaction.

So, in response to the second part of this question, an iterator in the
scan scope would be guaranteed to only see unprocessed data if the iterator
has not been configured for minor compaction or major compaction scopes at
all.
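
To make the repeated-compaction case concrete, here is a small plain-Java simulation. It is only a sketch under stated assumptions: the "c:" prefix marking an already-emitted count, the class name CompactionPassSketch, and both method names are hypothetical, and Strings stand in for Accumulo Values.

```java
import java.util.List;

// Hypothetical simulation of a counting iterator that runs on each compaction.
// The "c:" prefix for already-emitted counts is an assumption for illustration.
final class CompactionPassSketch {

    // Naive: treats every value as a raw observation, including its own output.
    static String naiveCount(List<String> values) {
        return Integer.toString(values.size());
    }

    // Aware: adds up previously emitted counts and counts raw values as one each.
    static String awareCount(List<String> values) {
        long count = 0;
        for (String value : values) {
            if (value.startsWith("c:")) {
                count += Long.parseLong(value.substring(2));
            } else {
                count += 1;
            }
        }
        return "c:" + count;
    }
}
```

With three raw values in the first compaction and one new raw value before the second, awareCount yields "c:3" and then "c:4", while naiveCount yields "3" and then "2", because on the second pass it mistakes its own earlier output for a single raw observation.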


> Thanks,
> -Russ
>
> 1:
> https://github.com/apache/accumulo/blob/master/examples/simple/src/main/java/org/apache/accumulo/examples/simple/combiner/StatsCombiner.java
>



--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

