
Posted to user@accumulo.apache.org by "Cardon, Tejay E" <te...@lmco.com> on 2012/09/13 21:50:52 UTC

Iterators and seeking the middle of a row

The javadoc for SortedKeyValueIterator.seek states:
"Iterators that examine groups of adjacent key/value pairs (e.g. rows) to determine their top key and value should be sure that they properly handle a seek to a key in the middle of such a group (e.g. the middle of a row). Even if the client always seeks to a range containing an entire group (a,c), the tablet server could send back a batch of entries corresponding to (a,b], then reseek the iterator to range (b,c) when the scan is continued."

However, it gives no indication of what proper handling is.  What should an iterator that considers an entire row do in this case?  Should it simply ignore the row, or attempt to seek its source iterator to the full row of the first range?  I'm struggling to understand the best approach here.

In my specific case, if it matters, I'm largely looking for ColumnQualifiers that exist in all Column Families in a given set (an intersecting iterator, sort of).

Thanks,
Tejay

RE: EXTERNAL: Re: Iterators and seeking the middle of a row

Posted by "Cardon, Tejay E" <te...@lmco.com>.

So, if I understand you correctly, I would need to pass the raw source columns as my column constraint, and then use some other mechanism to control the columns in my seek calls.  That way, I ensure that the bottom-level iterator has the raw columns it needs, while the intermediate iterators can be adjusted with crafted calls to seek() that modify the columns argument as appropriate.  Is that correct?

Thanks,
Tejay

From: Billie Rinaldi [mailto:billie@apache.org]
Sent: Friday, September 14, 2012 8:09 AM
To: user@accumulo.apache.org
Subject: Re: EXTERNAL: Re: Iterators and seeking the middle of a row

On Thu, Sep 13, 2012 at 4:44 PM, Cardon, Tejay E <te...@lmco.com> wrote:
Excellent, thank you William.  That raises an interesting point for me.  In my case, as with the IntersectingIterator, the schema of my iterator's topKey and topValue is not the same as the schema for the underlying source.

In IntersectingIterator, for example, the underlying source has data in the format;

row: shardID, colfam: term, colqual: docID

But the data being returned by the iterator is in the form

row: shardID, colfam: (empty), colqual: docID

Would I expect a seek on that iterator to have a range based on the ColF and ColQ being returned, or the ones being used on the sources?

It appears from the code of IntersectingIterator that seek is called based on the out-going schema, and the code then translates the keys in the range into the source schema before seeking the sources.

Yes, I believe that is correct.  The seek is passed down through the iterator stack, so an iterator that changes the schema has an opportunity to adjust the range when seeking its sources.  We recently discovered that the behavior with column filtering is not as intuitive.  When you fetch columns with a scanner, column filters are created at the system level (so they'll be sources for user level iterators) and they are passed the set of fetched columns directly, without giving the user iterators a chance to transform the columns.  This doesn't come up with the IntersectingIterator because it manages the columns itself (there's no reason to fetch columns with it), but in general this would be something to watch out for when writing schema-transforming iterators.

Billie

Thanks,
Tejay

From: William Slacum [mailto:wilhelm.von.cloud@accumulo.net]
Sent: Thursday, September 13, 2012 4:39 PM
To: user@accumulo.apache.org
Subject: EXTERNAL: Re: Iterators and seeking the middle of a row

Another thing to keep in mind is that the documentation is actually meant to enforce the notion that, between returning keys, your iterator could be destroyed and reconstituted. If an iterator is originally given a range, ("a", "c"), and it returns a key "b", the system *may* deconstruct the iterator stack and at a later time, reinitialize it with the range ("b", "c"), since "b" was the last place your iterator stack was known to be at.

On Thu, Sep 13, 2012 at 3:34 PM, William Slacum <wi...@accumulo.net> wrote:
Remember that the range given to an iterator is, at some point in time, user set. If a client only wants to scan between keys K1 and K2, and each occur in the same row, then the iterator should not be considering data that is outside of the range supplied to it. Someone can correct me if I'm wrong, but I also believe that if a client received a key outside of the original scan range, then that was considered a termination condition and the scan would stop.

Let's say I have a flat record structure for people, where the row is the name of the person, the column family is some attribute about them, and the column qualifier is the value for that attribute. Here's a record for Bob:

Bob eyes: blue
Bob hair: brown
Bob height: tall
Bob pants: brown
Bob shirt: white
Bob tie: blue

If you were searching for all attributes that were 'brown', you could do a look up using the range `new Range("Bob", "Bob")`. Your iterator would be able to see all of Bob and return to the user his hair and pants color. However, you could just as easily perform your look up with `new Range(new Key("Bob", "height"), new Key("Bob", "z"))`*. Your iterator would then be allowed to look at a subset of Bob, starting at his height and continuing until the end of his record.

* I used "z" because it sorts lexicographically after the other attributes.

On Thu, Sep 13, 2012 at 1:01 PM, Keith Turner <ke...@deenlo.com> wrote:
On Thu, Sep 13, 2012 at 3:50 PM, Cardon, Tejay E <te...@lmco.com> wrote:
> The javadoc for SortedKeyValueIterator.seek states:
>
> "Iterators that examine groups of adjacent key/value pairs (e.g. rows) to
> determine their top key and value should be sure that they properly handle a
> seek to a key in the middle of such a group (e.g. the middle of a row). Even
> if the client always seeks to a range containing an entire group (a,c), the
> tablet server could send back a batch of entries corresponding to (a,b],
> then reseek the iterator to range (b,c) when the scan is continued."
>
>
>
> However, it gives no indication of what proper handling is.  What should an
> iterator that considers an entire row do in this case?  Does it simply
> ignore the row?  Attempt to seek its source iterator to the full row of the
> first range?  I'm struggling to understand the best approach here.

org.apache.accumulo.core.iterators.user.RowFilter does what you
suggested.  It seeks to the beginning of a row if the range starts in
the middle of the row.  Look at the javadoc for the row filter; it
discusses the seeking behavior.

>
>
>
> In my specific case, if it matters, I'm largely looking for ColumnQualifiers
> which exist in all Column Families in a given set (intersecting iterator,
> sort of).
>
>
>
> Thanks,
> Tejay

Re: EXTERNAL: Re: Iterators and seeking the middle of a row

Posted by Billie Rinaldi <bi...@apache.org>.

On Thu, Sep 13, 2012 at 4:44 PM, Cardon, Tejay E <te...@lmco.com> wrote:

> Excellent, thank you William.  That raises an interesting point for me.
> In my case, as with the IntersectingIterator, the schema of my iterator's
> topKey and topValue is not the same as the schema for the underlying source.
>
> In IntersectingIterator, for example, the underlying source has data in
> the format;
>
> row: shardID, colfam: term, colqual: docID
>
> But the data being returned by the iterator is in the form
>
> row: shardID, colfam: (empty), colqual: docID
>
> Would I expect a seek on that iterator to have a range based on the ColF
> and ColQ being returned, or the ones being used on the sources?
>
> It appears from the code of IntersectingIterator that seek is called based
> on the out-going schema, and the code then translates the keys in the range
> into the source schema before seeking the sources.
>

Yes, I believe that is correct.  The seek is passed down through the
iterator stack, so an iterator that changes the schema has an opportunity
to adjust the range when seeking its sources.  We recently discovered that
the behavior with column filtering is not as intuitive.  When you fetch
columns with a scanner, column filters are created at the system level (so
they'll be sources for user level iterators) and they are passed the set of
fetched columns directly, without giving the user iterators a chance to
transform the columns.  This doesn't come up with the IntersectingIterator
because it manages the columns itself (there's no reason to fetch columns
with it), but in general this would be something to watch out for when
writing schema-transforming iterators.
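
To make the range translation described above concrete, here is a minimal, self-contained sketch. It does not use Accumulo's classes and is not the IntersectingIterator's actual code: the `Key` record, `toSourceKeys`, and `toOutgoingKey` are hypothetical names invented for illustration. The assumption is only what the thread states, that the outgoing key carries (shardID, empty colfam, docID) while each source key carries (shardID, term, docID).

```java
import java.util.ArrayList;
import java.util.List;

class SchemaTranslationSketch {
    /** Hypothetical stand-in for an Accumulo key: row, column family, column qualifier. */
    record Key(String row, String colFam, String colQual) {}

    /**
     * Translates a key in the outgoing schema (colfam empty) into one key per
     * term in the source schema, so that each term's source can be seeked.
     */
    static List<Key> toSourceKeys(Key outgoing, List<String> terms) {
        List<Key> sourceKeys = new ArrayList<>();
        for (String term : terms) {
            // Same shard and docID; the term fills the column family slot.
            sourceKeys.add(new Key(outgoing.row(), term, outgoing.colQual()));
        }
        return sourceKeys;
    }

    /** The reverse mapping for emitted keys: drop the term from the column family. */
    static Key toOutgoingKey(Key source) {
        return new Key(source.row(), "", source.colQual());
    }

    public static void main(String[] args) {
        Key seekStart = new Key("shard1", "", "doc7");
        System.out.println(toSourceKeys(seekStart, List.of("apple", "pear")));
        System.out.println(toOutgoingKey(new Key("shard1", "apple", "doc7")));
    }
}
```

A real iterator would apply the same idea to both ends of the Range it receives in seek() before seeking its sources. Fetched columns, as noted above, do not get this opportunity.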

Billie



RE: EXTERNAL: Re: Iterators and seeking the middle of a row

Posted by "Cardon, Tejay E" <te...@lmco.com>.

Excellent, thank you William.  That raises an interesting point for me.  In my case, as with the IntersectingIterator, the schema of my iterator's topKey and topValue is not the same as the schema for the underlying source.

In IntersectingIterator, for example, the underlying source has data in the format;

row: shardID, colfam: term, colqual: docID

But the data being returned by the iterator is in the form

row: shardID, colfam: (empty), colqual: docID

Would I expect a seek on that iterator to have a range based on the ColF and ColQ being returned, or the ones being used on the sources?

It appears from the code of IntersectingIterator that seek is called based on the out-going schema, and the code then translates the keys in the range into the source schema before seeking the sources.

Thanks,
Tejay


Re: Iterators and seeking the middle of a row

Posted by William Slacum <wi...@accumulo.net>.

Another thing to keep in mind is that the documentation is actually meant
to enforce the notion that, between returning keys, your iterator could be
destroyed and reconstituted. If an iterator is originally given a range,
("a", "c"), and it returns a key "b", the system *may* deconstruct the
iterator stack and at a later time, reinitialize it with the range ("b",
"c"), since "b" was the last place your iterator stack was known to be at.


Re: Iterators and seeking the middle of a row

Posted by William Slacum <wi...@accumulo.net>.

Remember that the range given to an iterator is, at some point, set by the
user. If a client only wants to scan between keys K1 and K2, and both occur
in the same row, then the iterator should not consider data outside the
range supplied to it. Someone can correct me if I'm wrong, but I also
believe that if a client receives a key outside of the original scan range,
that is treated as a termination condition and the scan stops.

Let's say I have a flat record structure for people, where the row is the
name of the person, the column family is some attribute about them, and the
column qualifier is the value for that attribute. Here's a record for Bob:

Bob eyes: blue
Bob hair: brown
Bob height: tall
Bob pants: brown
Bob shirt: white
Bob tie: blue

If you were searching for all attributes that were 'brown', you could do a
look up using the range `new Range("Bob", "Bob")`. Your iterator would be
able to see all of Bob and return to the user his hair and pants color.
However, you could just as easily perform your look up with `new Range(new
Key("Bob", "height"), new Key("Bob", "z"))`*. Your iterator would then be
allowed to look at a subset of Bob, starting at his height and continuing
until the end of his record.

* I used "z" because it sorts lexicographically after the other attributes.
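
The two lookups on Bob's record can be modelled without Accumulo at all. The sketch below is only an illustration under simplifying assumptions: a `TreeMap` stands in for the sorted row, its keys are the column families (attributes), its values are the column qualifiers, and `find` is a hypothetical helper, not an Accumulo call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class BobRangeSketch {
    /** Bob's record, sorted by attribute name the way keys sort within a row. */
    static NavigableMap<String, String> bob() {
        NavigableMap<String, String> row = new TreeMap<>();
        row.put("eyes", "blue");
        row.put("hair", "brown");
        row.put("height", "tall");
        row.put("pants", "brown");
        row.put("shirt", "white");
        row.put("tie", "blue");
        return row;
    }

    /** Attributes in the inclusive range [from, to] whose value equals wanted. */
    static List<String> find(NavigableMap<String, String> row, String from, String to, String wanted) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : row.subMap(from, true, to, true).entrySet()) {
            if (e.getValue().equals(wanted)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> row = bob();
        // Whole row: both brown attributes are visible.
        System.out.println(find(row, row.firstKey(), row.lastKey(), "brown")); // [hair, pants]
        // Range starting at "height": "hair" sorts before the start key and is never seen.
        System.out.println(find(row, "height", "z", "brown")); // [pants]
    }
}
```

The second lookup returns only the pants because the iterator is limited to what the range exposes, which is the point being made about ranges that start mid-row.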

Re: Iterators and seeking the middle of a row

Posted by Keith Turner <ke...@deenlo.com>.

On Thu, Sep 13, 2012 at 3:50 PM, Cardon, Tejay E
<te...@lmco.com> wrote:
> The javadoc for SortedKeyValueIterator.seek states:
>
> “Iterators that examine groups of adjacent key/value pairs (e.g. rows) to
> determine their top key and value should be sure that they properly handle a
> seek to a key in the middle of such a group (e.g. the middle of a row). Even
> if the client always seeks to a range containing an entire group (a,c), the
> tablet server could send back a batch of entries corresponding to (a,b],
> then reseek the iterator to range (b,c) when the scan is continued.”
>
>
>
> However, it gives no indication of what proper handling is.  What should an
> iterator that considers an entire row do in this case?  Does it simply
> ignore the row?  Attempt to seek its source iterator to the full row of the
> first range?  I’m struggling to understand the best approach here.

org.apache.accumulo.core.iterators.user.RowFilter does what you
suggested.  It seeks to the beginning of a row if the range starts in
the middle of the row.  Look at the javadoc for the row filter; it
discusses the seeking behavior.
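
As a rough illustration of that approach (decide on the whole row, then emit only what falls inside the requested range), here is a self-contained model. It is not RowFilter's source and uses no Accumulo classes: the `Key` record, `wholeRow`, `scan`, and the sample data are all hypothetical, and keys are simplified to a row plus a single column.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.function.Predicate;

class MidRowSeekSketch {
    /** Simplified stand-in for an Accumulo key: row plus one column, sorted row-first. */
    record Key(String row, String col) implements Comparable<Key> {
        @Override
        public int compareTo(Key other) {
            int byRow = row.compareTo(other.row);
            return byRow != 0 ? byRow : col.compareTo(other.col);
        }
    }

    /** Every entry of one row, read from the source regardless of the seek range. */
    static NavigableMap<Key, String> wholeRow(NavigableMap<Key, String> source, String row) {
        // row + "\0" is the smallest string sorting after row, so this bounds exactly one row.
        return source.subMap(new Key(row, ""), true, new Key(row + "\0", ""), false);
    }

    /**
     * Returns the keys in (startExclusive, endInclusive] whose row passes rowTest.
     * The row decision uses the whole row; the range only limits what is emitted.
     */
    static List<String> scan(NavigableMap<Key, String> source, Key startExclusive,
                             Key endInclusive, Predicate<NavigableMap<Key, String>> rowTest) {
        List<String> out = new ArrayList<>();
        String currentRow = null;
        boolean accepted = false;
        for (Key k : source.subMap(startExclusive, false, endInclusive, true).keySet()) {
            if (!k.row().equals(currentRow)) {
                // First in-range key of a new row: go back and examine the full row.
                currentRow = k.row();
                accepted = rowTest.test(wholeRow(source, currentRow));
            }
            if (accepted) {
                out.add(k.row() + " " + k.col());
            }
        }
        return out;
    }

    static NavigableMap<Key, String> sample() {
        NavigableMap<Key, String> t = new TreeMap<>();
        t.put(new Key("r1", "c1"), "keep");
        t.put(new Key("r1", "c2"), "x");
        t.put(new Key("r1", "c3"), "y");
        t.put(new Key("r2", "c1"), "drop");
        return t;
    }

    public static void main(String[] args) {
        Predicate<NavigableMap<Key, String>> hasKeep = row -> row.containsValue("keep");
        // A first batch ended at (r1, c2); the scan resumes just after that key.
        System.out.println(scan(sample(), new Key("r1", "c2"), new Key("r9", ""), hasKeep)); // [r1 c3]
    }
}
```

In this model the resumed scan still returns `r1 c3`, even though the entry that makes row r1 acceptable (`c1`) lies before the new start key. An iterator that judged the row only from in-range entries would have dropped it.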

>
>
>
> In my specific case, if it matters, I’m largely looking for ColumnQualifiers
> which exist in all Column Families in a given set (intersecting iterator,
> sort of).
>
>
>
> Thanks,
> Tejay