Posted to user@accumulo.apache.org by Mike Hugo <mi...@piragua.com> on 2013/02/27 06:12:04 UTC

Reset column iterator while using AccumuloRowInputFormat

Is there a way to "reset" the column iterator back to the "beginning" when
using the AccumuloRowInputFormat?  We have a case in which we need to
iterate over the columns for a row at least twice and it could be a large
row that may not fit in memory.

I think we can work around this by having a separate scanner used within
the map method for this purpose.  Other than that, is there a way to clone
or copy or reset the column iterator such that we can iterate over it more
than once?

Thanks,

Mike

public void map(Text key, PeekingIterator<Map.Entry<Key, Value>> columnIterator, Context context) {
    while (columnIterator.hasNext()) {
        Map.Entry<Key, Value> kv = columnIterator.next();
    }

    // reset column iterator back to the beginning

    while (columnIterator.hasNext()) {
        Map.Entry<Key, Value> kv = columnIterator.next();
    }
}

Re: Reset column iterator while using AccumuloRowInputFormat

Posted by Christopher <ct...@apache.org>.
You could leverage the new TransformingIterator to seek and
iterate over the keys n times:

r1 cf:cq v
r2 cf:cq v
r3 cf:cq v

becomes:

pass1-r1 cf:cq v
pass1-r2 cf:cq v
pass1-r3 cf:cq v
pass2-r1 cf:cq v
pass2-r2 cf:cq v
pass2-r3 cf:cq v
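
To make that ordering concrete, here is a minimal sketch in plain Java.
It does not use TransformingIterator or any Accumulo API: a TreeMap
stands in for the sorted table, and the names PassPrefixSketch and
repeat are hypothetical. It only shows why prefixing the row IDs lets a
single sorted scan visit the same data once per pass.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class PassPrefixSketch {
    /**
     * Returns a sorted copy of the table in which every key appears once per pass,
     * prefixed with "pass<N>-". Assumes fewer than 10 passes; with more, the pass
     * number would need zero-padding to keep lexicographic order.
     */
    static SortedMap<String, String> repeat(SortedMap<String, String> table, int passes) {
        SortedMap<String, String> out = new TreeMap<>();
        for (int pass = 1; pass <= passes; pass++) {
            for (Map.Entry<String, String> e : table.entrySet()) {
                out.put("pass" + pass + "-" + e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the three-row table in the example above.
        SortedMap<String, String> table = new TreeMap<>();
        table.put("r1 cf:cq", "v");
        table.put("r2 cf:cq", "v");
        table.put("r3 cf:cq", "v");

        // Sorted iteration yields all pass1-* keys, then all pass2-* keys.
        for (Map.Entry<String, String> e : repeat(table, 2).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Because the pass prefix sorts ahead of the original row ID, every
pass1 key comes before every pass2 key, so the reader sees the rows
twice in one scan.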

However, are you sure you need to iterate over the whole row twice?
There are strategies to internally intersect a row with itself (see
IntersectingIterator) that avoid this (at least, avoid it from the
user's perspective).

If you don't need the range in the same mapper, you could specify the
range twice in the AccumuloInputFormat's configuration (disabling the
auto-adjust ranges feature so they won't be collapsed into one), and
you'll get one mapper per range (though I'm pretty sure this gets you
nothing more than simply doing two actions in the same mapper before
moving on to the next key/value pair).

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii



Re: Reset column iterator while using AccumuloRowInputFormat

Posted by Billie Rinaldi <bi...@apache.org>.
On Tue, Feb 26, 2013 at 9:12 PM, Mike Hugo <mi...@piragua.com> wrote:

> Is there a way to "reset" the column iterator back to the "beginning" when
> using the AccumuloRowInputFormat?  We have a case in which we need to
> iterate over the columns for a row at least twice and it could be a large
> row that may not fit in memory.
>
> I think we can work around this by having a separate scanner used within
> the map method for this purpose.  Other than that, is there a way to clone
> or copy or reset the column iterator such that we can iterate over it more
> than once?
>

Currently, no.  It's not immediately obvious how we could change the
InputFormat to accomplish this.  The RecordReader creates a scanner, does
the seeking/fetching for the InputSplit once in its initialize method, then
iterates over the scanner, grouping together rows as appropriate.  Going
back to the beginning of a row would require us to seek the scanner again,
and replace the old iterator with a new one.  We could make a special
RecordReader with a reset method, but I don't know how we could call the
method.  Interactions with the RecordReader are handled by the MapContext,
and I don't know if you can use a custom MapContext.  Maybe we could have
an InputFormat that gives you a Scanner directly that you could reseek in
the Mapper, but we'd have to spend some time thinking about it to make sure
it would work.
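
The "seek again and replace the old iterator with a new one" idea can
be sketched in plain Java. This is only an illustration under stated
assumptions, not an Accumulo API: ResettableIterator, ResetSketch and
twoPasses are hypothetical names, and the Supplier stands in for
whatever would re-seek a Scanner over the row's range. Nothing is
buffered, so the row does not have to fit in memory.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;

/** Iterator that is rewound by re-opening its source rather than by buffering entries. */
class ResettableIterator<T> implements Iterator<T> {
    // Hypothetical stand-in for "seek the scanner again": each call starts a fresh pass.
    private final Supplier<Iterator<T>> opener;
    private Iterator<T> current;

    ResettableIterator(Supplier<Iterator<T>> opener) {
        this.opener = opener;
        this.current = opener.get();
    }

    /** Discards the old iterator and replaces it with a new one at the beginning. */
    void reset() {
        current = opener.get();
    }

    @Override
    public boolean hasNext() {
        return current.hasNext();
    }

    @Override
    public T next() {
        return current.next();
    }
}

class ResetSketch {
    /** Walks the columns twice and returns the number of entries seen in each pass. */
    static int[] twoPasses(ResettableIterator<Map.Entry<String, String>> columns) {
        int first = 0;
        while (columns.hasNext()) {
            columns.next();
            first++;
        }
        columns.reset(); // back to the beginning of the row
        int second = 0;
        while (columns.hasNext()) {
            columns.next();
            second++;
        }
        return new int[] { first, second };
    }

    public static void main(String[] args) {
        // In-memory stand-in for the columns of a single row.
        TreeMap<String, String> row = new TreeMap<>();
        row.put("cf:cq1", "v1");
        row.put("cf:cq2", "v2");
        row.put("cf:cq3", "v3");

        ResettableIterator<Map.Entry<String, String>> columns =
                new ResettableIterator<>(() -> row.entrySet().iterator());
        int[] counts = twoPasses(columns);
        System.out.println(counts[0] + " " + counts[1]); // prints "3 3"
    }
}
```

The open question above still applies: the sketch shows what a reset
would do, not how a Mapper could reach it through the MapContext.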

Billie


