Posted to user@accumulo.apache.org by Michael Moss <mi...@gmail.com> on 2014/09/07 22:38:51 UTC

Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

All, thanks again for your feedback. I just consolidated some of these
learnings with some code samples here.

http://www.mammothdatallc.com/blog/accumulo-in-depth-look-at-filters-combiners-iterators-against-complex-values/

Best,

-Mike

On Fri, Jul 18, 2014 at 11:54 AM, William Slacum <
wilhelm.von.cloud@accumulo.net> wrote:

> Oh wow, I have totally read your problem incorrectly then. I thought you
> wanted a total count across rows for some reason (when you mentioned you
> had versioning turned off, things clicked).
>
> You can use a combiner, but I'd write an iterator that strips out the
> count field for each value (like we did with the other iterator), and then place
> that lower in the iterator stack. This way you can get around your original
> issue with the combiner only taking a single input/output type.
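A minimal sketch of the two-stage idea above, in plain Java rather than against the Accumulo API. `HugeRecordStub` is a hypothetical stand-in for MyHugeAvroObject, and `extractCounts` and `sum` are illustrative names, not library methods. The lower stage projects each complex value down to its count; the upper stage is a combiner-style fold whose input and output share one type.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for MyHugeAvroObject; only the count field matters here. */
class HugeRecordStub {
    final String payload;
    final int count;

    HugeRecordStub(String payload, int count) {
        this.payload = payload;
        this.count = count;
    }
}

class ExtractThenCombineSketch {
    /** Lower stage: strip each complex value down to its count field. */
    static List<Long> extractCounts(List<HugeRecordStub> values) {
        List<Long> counts = new ArrayList<>();
        for (HugeRecordStub v : values) {
            counts.add((long) v.count);
        }
        return counts;
    }

    /** Upper stage: fold longs into a long, so input and output types match. */
    static long sum(List<Long> counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        return total;
    }
}
```

With the two values from the original question (count=2 and count=3), `sum(extractCounts(...))` yields 5.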
>
>
> On Tue, Jul 15, 2014 at 2:25 PM, Adam Fuchs <af...@apache.org> wrote:
>
>> Mike,
>>
>> The way we usually aggregate by row is to check the source's top key
>> within the next function to see if it breaks the row boundary. If your
>> source starts giving you data in the next row then break out of the loop in
>> the next function. You'll also need to construct a row key to return from
>> your iterator and then handle the reseeking case (automatic seeking to
>> second key in row). See the RowEncodingIterator for hints on
>> implementation. You might actually want to subclass RowEncodingIterator to
>> implement your counter.
>>
>> Cheers,
>> Adam
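A rough sketch of the row-boundary check described above, using a sorted list of (row, value) pairs as a stand-in for the source iterator. This is not RowEncodingIterator and does not handle reseeking; `RowBoundaryCounter` and `countPerRow` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class RowBoundaryCounter {
    /**
     * Input: entries sorted by row, each a two-element array {row, value}.
     * Output: one "row=count" string per row, in input order.
     */
    static List<String> countPerRow(List<String[]> sortedEntries) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < sortedEntries.size()) {
            String row = sortedEntries.get(i)[0];
            long count = 0;
            // Keep consuming until the next entry belongs to a different row,
            // then break out and emit a result for the row just finished.
            while (i < sortedEntries.size() && sortedEntries.get(i)[0].equals(row)) {
                count++;
                i++;
            }
            out.add(row + "=" + count);
        }
        return out;
    }
}
```

The inner loop plays the role of the loop inside next(): it stops as soon as the source's top key crosses into the following row, so each call produces one aggregate per row instead of one for the whole scan.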
>>  Cool. I'll write something up and share.
>>
>> I'm curious how to get my Counter (WrappingIterator) implementation to
>> aggregate by row (which, for some reason, I assumed was default?)
>>
>> Let's say I have rows (and CF="", CQ="" and versioningiterator off):
>> 1 (Value1, Value 2...Value N)
>> 2
>> 3
>>
>> How can my iterator return?
>> 1 (Count of values 1..N)
>> 2 (Count of values 1..N)
>> 3 ...
>>
>> I tried scan -b "1" -e "1" and it counts an individual row. But if I
>> don't specify anything, it returns,
>> 3 (Count of all values across all rows)
>>
>> Code:
>> http://pastebin.com/8xFNLHFS
>>
>> Example:
>> root@dev pe> listiter -scan -t pojo
>> -
>> -    Iterator counter, scan scope options:
>> -        iteratorPriority = 10
>> -        iteratorClassName = iterators.Counter
>> -
>> root@dev pe> scan -b "1_1_20140101" -e "1_1_20140101"
>> 1_1_20140101 : [public]    65
>>
>> root@dev pe> scan -b "1_1_20140101" -e "3_9_20140727"
>> 3_9_20140727 : [public]    100000
>>
>> root@dev pe> scan
>> 3_9_20140727 : [public]    100000
>>
>>
>> Thanks.
>>
>> -Mike
>>
>>
>>
>>  On Tue, Jul 15, 2014 at 12:29 PM, Josh Elser <jo...@gmail.com>
>> wrote:
>>
>>> There's been some mention about a desire to rethink the Iterator
>>> interface as it has some deficiencies (notably the lack of a "cleanup"
>>> before the iterators are torn down), but no one has stated that they're
>>> actively working on this.
>>>
>>> Getting better documentation wrt conventions: let us know where the
>>> Accumulo documentation falls short (and give us patches to fix the
>>> documentation :D). Additionally, write up your own findings from problems
>>> that you've run into. It's the entire community (users specifically) that
>>> we need to help encourage to grow.
>>>
>>> Even things as simple as "how do I count entries in an iterator" are big
>>> as you are now an "expert" on the subject :)
>>>
>>>
>>> On 7/15/14, 12:17 PM, Michael Moss wrote:
>>>
>>>> That worked ;) - Thanks!
>>>>
>>>> What a journey...
>>>>
>>>> I like Accumulo's architecture and promise, but the difficulty in
>>>> querying it (lack of documentation, conventions) is a major concern and
>>>> I'd imagine has to have an impact on adoption. I'm curious if there have
>>>> been any conversations around changing the interface around iterators
>>>> which are still confusing to me. Let me know how I can help!
>>>>
>>>>
>>>> On Tue, Jul 15, 2014 at 12:03 PM, William Slacum
>>>> <wilhelm.von.cloud@accumulo.net> wrote:
>>>>
>>>>     Herp... serves me right for not setting up a proper test case.
>>>>
>>>>     I think you need to override seek as well:
>>>>
>>>>     @Override
>>>>     public void seek(...) throws IOException {
>>>>        super.seek(...);
>>>>        next();
>>>>     }
>>>>
>>>>     I think I just realized the wrapping iterator could use some clean
>>>>     up, because this isn't obvious. Basically after the wrapping
>>>>     iterator's seek is called, it never calls the implementor's next()
>>>>     to actually set up the first top key and value.
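A small sketch of why the extra next() in seek matters, written against simplified stand-in classes rather than Accumulo's WrappingIterator. `ValueListSource` and `SeekPrimingCounter` are hypothetical names, and the seek here takes no range argument.

```java
import java.util.List;

/** Simplified stand-in for a source iterator over values (not the Accumulo API). */
class ValueListSource {
    private final List<String> values;
    private int pos;

    ValueListSource(List<String> values) { this.values = values; }

    void seek() { pos = 0; }
    boolean hasTop() { return pos < values.size(); }
    String getTopValue() { return values.get(pos); }
    void next() { pos++; }
}

/** Counts every entry of the source and exposes the count as a single top value. */
class SeekPrimingCounter {
    private final ValueListSource source;
    private Long topValue; // null means "no top"

    SeekPrimingCounter(ValueListSource source) { this.source = source; }

    void seek() {
        source.seek();
        next(); // prime the first top; without this, hasTop() is false right after seek
    }

    void next() {
        topValue = null;
        if (!source.hasTop()) {
            return; // source exhausted, nothing to report
        }
        long count = 0;
        while (source.hasTop()) { // loop on the source's hasTop(), not our own
            count++;
            source.next();
        }
        topValue = count;
    }

    boolean hasTop() { return topValue != null; }

    /** Only valid while hasTop() is true. */
    long getTopValue() { return topValue; }
}
```

Removing the `next()` call from `seek()` reproduces the "0 entries" symptom: the caller asks hasTop() right after seek and sees no top, so it never calls next() at all.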
>>>>
>>>>
>>>>
>>>>     On Tue, Jul 15, 2014 at 9:50 AM, Michael Moss
>>>>     <michael.moss@gmail.com> wrote:
>>>>
>>>>         I set up debugging and am rethrowing the exception. What's
>>>>         strange is it appears that despite the iterator instance being
>>>>         properly set to iterator.Counter (my implementation), my
>>>>         breakpoints aren't being hit, only in the parent classes
>>>>         (WrappingIterator) and (SortedKeyValueIterator).
>>>>
>>>>         I have two rows in the table, when I scan with no iterator:
>>>>         2014-07-15 06:46:26,577 [Audit   ] INFO : operation: permitted;
>>>>         user: root; action: scan; targetTable: pojo; authorizations:
>>>>         public,; range: (-inf,+inf); columns: []; iterators: [];
>>>>         iteratorOptions: {};
>>>>         2014-07-15 06:46:26,589 [tserver.TabletServer] DEBUG: ScanSess
>>>>         tid 10.0.2.15:45073 8 *2 entries* in 0.01 secs, nbTimes = [7 7 7.00 1]
>>>>
>>>>         When I scan with the iterator (0 entries?):
>>>>         2014-07-15 06:45:58,036 [Audit   ] INFO : operation: permitted;
>>>>         user: root; action: scan; targetTable: pojo; authorizations:
>>>>         public,; range: (-inf,+inf); columns: []; iterators: [];
>>>>         iteratorOptions: {};
>>>>         2014-07-15 06:45:58,047 [tserver.TabletServer] DEBUG: ScanSess
>>>>         tid 10.0.2.15:44992 8 *0 entries* in 0.01 secs, nbTimes = [6 6 6.00 1]
>>>>
>>>>         No exceptions otherwise. Really appreciate all the ongoing help.
>>>>
>>>>         Best,
>>>>
>>>>         -Mike
>>>>
>>>>
>>>>         On Mon, Jul 14, 2014 at 6:40 PM, William Slacum
>>>>         <wilhelm.von.cloud@accumulo.net> wrote:
>>>>
>>>>             Anything in your Tserver log? I think you should just
>>>>             rethrow that IOException on your source's next() method,
>>>>             since they're usually not recoverable (i.e., just make
>>>>             Counter#next throw IOException).
>>>>
>>>>
>>>>             On Mon, Jul 14, 2014 at 5:48 PM, Josh Elser
>>>>             <josh.elser@gmail.com> wrote:
>>>>
>>>>                 A quick sanity check is to make sure you have data in
>>>>                 the table and that you can read the data without your
>>>>                 iterator (I've thought I had a bug because I didn't have
>>>>                 proper visibilities more times than I'd like to admit).
>>>>
>>>>                 Alternatively, you can also enable remote-debugging via
>>>>                 Eclipse into the TabletServer which might help you
>>>>                 understand more of what's going on.
>>>>
>>>>                 Lots of articles on how to set this up [1]. In short,
>>>>                 add -Xdebug
>>>>                 -Xrunjdwp:transport=dt_socket,server=y,address=8000 to
>>>>                 ACCUMULO_TSERVER_OPTS in accumulo-env.sh, restart the
>>>>                 tserver, connect Eclipse to port 8000 via the Debug
>>>>                 configuration menu, set a breakpoint in your init, seek
>>>>                 and next methods, and `scan` in the shell.
>>>>
>>>>
>>>>                 [1]
>>>>                 http://javarevisited.blogspot.com/2011/02/how-to-setup-remote-debugging-in.html
>>>>
>>>>
>>>>                 On 7/14/14, 5:33 PM, Michael Moss wrote:
>>>>
>>>>                     Hmm...Still doesn't return anything from the shell.
>>>>
>>>>                     http://pastebin.com/ndRhspf8
>>>>
>>>>                     Any thoughts? What's the best way to debug these?
>>>>
>>>>
>>>>                     On Mon, Jul 14, 2014 at 5:14 PM, William Slacum
>>>>                     <wilhelm.von.cloud@accumulo.net> wrote:
>>>>
>>>>                          Ah, an artifact of me just willy-nilly writing an iterator :)
>>>>                          Any reference to `this.source` should be replaced with
>>>>                          `this.getSource()`. In `next()`, your workaround ends up
>>>>                          calling `this.hasTop()` as the while loop condition. It will
>>>>                          always return false because two lines up we set `top_key` to
>>>>                          null. We need to make sure that the source iterator has a top,
>>>>                          because we want to read data from it. We'll have to change the
>>>>                          loop condition to `while(this.getSource().hasTop())`. On line
>>>>                          38 of your code we'll need to call `this.getSource().next()`
>>>>                          instead of `this.next()`.
>>>>
>>>>                          The iterator interface is documented, but there hasn't been a
>>>>                          definitive go-to for making one. I've been drafting a blog post,
>>>>                          but since it doesn't exist yet, hopefully the following will
>>>>                          suffice.
>>>>
>>>>                          The lifetime of an iterator is (usually) as follows:
>>>>
>>>>                          (1) A new instance is created via Class.newInstance (so a
>>>>                          no-args constructor is needed).
>>>>                          (2) init() is called. This allows users to configure the
>>>>                          iterator, set its source, and possibly check the environment.
>>>>                          We can also call `deepCopy` on the source if we want to have
>>>>                          multiple sources (we'd do this if we wanted to do a merge read
>>>>                          out of multiple column families within a row).
>>>>                          (3) seek() is called. This gets our readers to the correct
>>>>                          positions in the data that are within the scan range the user
>>>>                          requested, as well as turning column families on or off. The
>>>>                          name should be reminiscent of seeking to some key on disk.
>>>>                          (4) hasTop() is called. If true, that means we have data, and
>>>>                          the iterator has a key/value pair that can be retrieved by
>>>>                          calling getTopKey() and getTopValue(). If false, we're done
>>>>                          because there's no data to return.
>>>>                          (5) next() is called. This will attempt to find a new top key
>>>>                          and value. We go back to (4) to see if next was successful in
>>>>                          finding a new top key/value and will repeat until the client is
>>>>                          satisfied or hasTop() returns false.
>>>>
>>>>                          You can kind of make a state machine out of those steps where
>>>>                          we loop between (4) and (5) until there's no data. There are
>>>>                          more advanced workflows where next() can be reading from
>>>>                          multiple sources, as well as seeking them to different
>>>>                          positions in the tablet.
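The seek / hasTop / next loop in steps (3) to (5) can be sketched in plain Java. `MiniIterator` is a deliberately simplified stand-in for SortedKeyValueIterator (String keys and values, no init, no column families), and `SortedMapSource` and `ScanLoopSketch` are hypothetical names, so treat this as an illustration of the call order rather than of the real interface.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Simplified stand-in for Accumulo's SortedKeyValueIterator. */
interface MiniIterator {
    void seek(String startKey); // step (3): position at the first key >= startKey
    boolean hasTop();           // step (4): is there a current entry?
    String getTopKey();
    String getTopValue();
    void next();                // step (5): advance to the next entry
}

/** A source backed by an in-memory sorted map. seek() must be called before next(). */
class SortedMapSource implements MiniIterator {
    private final TreeMap<String, String> data;
    private Iterator<Map.Entry<String, String>> cursor;
    private Map.Entry<String, String> top;

    SortedMapSource(TreeMap<String, String> data) { this.data = data; }

    public void seek(String startKey) {
        cursor = data.tailMap(startKey, true).entrySet().iterator();
        next(); // load the first entry so hasTop() is meaningful after seek
    }

    public boolean hasTop() { return top != null; }
    public String getTopKey() { return top.getKey(); }
    public String getTopValue() { return top.getValue(); }
    public void next() { top = cursor.hasNext() ? cursor.next() : null; }
}

class ScanLoopSketch {
    /** Drives an iterator the way a scan does: seek once, then loop (4) and (5). */
    static List<String> drain(MiniIterator iter, String startKey) {
        List<String> out = new ArrayList<>();
        iter.seek(startKey);
        while (iter.hasTop()) {
            out.add(iter.getTopKey() + "=" + iter.getTopValue());
            iter.next();
        }
        return out;
    }
}
```

The loop in `drain` is the state machine mentioned above: it only ever reads the top after hasTop() has returned true, and it stops at the first false.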
>>>>
>>>>
>>>>                          On Mon, Jul 14, 2014 at 4:51 PM, Michael Moss
>>>>                          <michael.moss@gmail.com> wrote:
>>>>
>>>>                              Thanks, William. I was just hitting you up for an example :)
>>>>
>>>>                              I adapted your pseudocode (http://pastebin.com/ufPJq0g3), but
>>>>                              noticed that "this.source" in your example didn't have
>>>>                              visibility. Did I work around it correctly?
>>>>
>>>>                              When I add my iterator to my table and run scan from the
>>>>                              shell, it returns nothing - what should I expect here? In
>>>>                              general I've found the iterator interface pretty confusing
>>>>                              and haven't spent the time wrapping my head around it yet.
>>>>                              Any documentation or examples (beyond what I could find on
>>>>                              the site or in the code) appreciated!
>>>>
>>>>                              root@dev> table pojo
>>>>                              root@dev pojo> listiter -scan -t pojo
>>>>                              -
>>>>                              -    Iterator counter, scan scope options:
>>>>                              -        iteratorPriority = 10
>>>>                              -        iteratorClassName = iterators.Counter
>>>>                              -
>>>>                              root@dev pojo> scan
>>>>                              root@dev pojo>
>>>>
>>>>
>>>>                              Best,
>>>>
>>>>                              -Mike
>>>>
>>>>
>>>>
>>>>
>>>>                              On Mon, Jul 14, 2014 at 4:07 PM, William Slacum
>>>>                              <wilhelm.von.cloud@accumulo.net> wrote:
>>>>
>>>>                                  For a bit of pseudocode, I'd probably make a class that
>>>>                                  did something akin to: http://pastebin.com/pKqAeeCR
>>>>
>>>>                                  I wrote that up real quick in a text editor -- it won't
>>>>                                  compile or anything, but should point you in the right
>>>>                                  direction.
>>>>
>>>>
>>>>                                  On Mon, Jul 14, 2014 at 3:44 PM, William Slacum
>>>>                                  <wilhelm.von.cloud@accumulo.net> wrote:
>>>>
>>>>                                      Hi Mike!
>>>>
>>>>                                      The Combiner interface is only for aggregating keys
>>>>                                      within a single row. You can probably get away with
>>>>                                      implementing your combining logic in a WrappingIterator
>>>>                                      that reads across all the rows in a given tablet.
>>>>
>>>>                                      To do some combine/fold/reduce operation, Accumulo needs
>>>>                                      the input type to be the same as the output type. The
>>>>                                      combiner doesn't have a notion of a "present" type (as
>>>>                                      you'd see in something like Algebird's Groups), but you
>>>>                                      can use another iterator to perform your transformation.
>>>>
>>>>                                      If you wanted to extract the "count" field from your
>>>>                                      Avro object, you could write a new Iterator that took
>>>>                                      your Avro object, extracted the desired field, and
>>>>                                      returned it as its top value. You can then set this
>>>>                                      iterator as the source of the aggregator, either
>>>>                                      programmatically or by wrapping the source object passed
>>>>                                      to the aggregator in its SortedKeyValueIterator#init
>>>>                                      call.
>>>>
>>>>                                      This is a bit inefficient as you'd have to serialize to
>>>>                                      a Value and then immediately deserialize it in the
>>>>                                      iterator above it. You could mitigate this by exposing a
>>>>                                      method that would get the extracted value before
>>>>                                      serializing it.
>>>>
>>>>                                      This kind of counting also requires client side logic to
>>>>                                      do a final combine operation, since the aggregations
>>>>                                      from all the tservers are partial results.
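A minimal sketch of that final client-side combine, assuming each partial result arrives as a decimal string (one per tablet or scan batch). `PartialCountMergeSketch` and `mergePartialCounts` are hypothetical names, and the string encoding is an assumption made for illustration, not something the thread specifies.

```java
import java.util.List;

class PartialCountMergeSketch {
    /** Sums partial counts returned by server-side iterators into one total. */
    static long mergePartialCounts(List<String> partialValues) {
        long total = 0;
        for (String v : partialValues) {
            total += Long.parseLong(v.trim()); // each entry is one server's partial count
        }
        return total;
    }
}
```

In a real client the strings would come from the values of the scanned entries; the fold itself stays the same.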
>>>>
>>>>                                      I believe that CountingIterator is not meant for user
>>>>                                      consumption, but I do not know if it's related to your
>>>>                                      issue in trying to use it from the shell. Iterators set
>>>>                                      through the shell, in previous versions of Accumulo,
>>>>                                      have a requirement to implement OptionDescriber. Many
>>>>                                      default iterators do not implement this, and thus can't
>>>>                                      be set in the shell.
>>>>
>>>>
>>>>
>>>>                                      On Mon, Jul 14, 2014 at 2:44 PM, Michael Moss
>>>>                                      <michael.moss@gmail.com> wrote:
>>>>
>>>>                                          Hi, All.
>>>>
>>>>                                          I'm curious what the best practices are around
>>>>                                          persisting complex types/data in Accumulo (and
>>>>                                          aggregating on fields within them).
>>>>
>>>>                                          Let's say I have (row, column family, column
>>>>                                          qualifier, value):
>>>>                                          "A" "foo" "" MyHugeAvroObject(count=2)
>>>>                                          "A" "foo" "" MyHugeAvroObject(count=3)
>>>>
>>>>                                          Let's say MyHugeAvroObject has a field "Integer
>>>>                                          count" with the values above.
>>>>
>>>>                                          What is the best way to aggregate on row, column
>>>>                                          family, column qualifier by count? In my above
>>>>                                          example:
>>>>                                          "A" "foo" "" 5
>>>>
>>>>                                          The TypedValueCombiner.typedReduce method can
>>>>                                          deserialize any "V", in my case MyHugeAvroObject,
>>>>                                          but it needs to return a value of type "V". What are
>>>>                                          the best practices for deeply nested/complex
>>>>                                          objects? It's not always straightforward to map a
>>>>                                          complex Avro type into Row -> Column Family ->
>>>>                                          Column Qualifier.
>>>>
>>>>                                          Rather than using a TypedCombiner, I looked into
>>>>                                          using an Aggregator (which appears deprecated as of
>>>>                                          1.4), which appears to let me return arbitrary
>>>>                                          values, but despite running setiter, my aggregator
>>>>                                          doesn't seem to do anything.
>>>>
>>>>                                          I also tried looking at implementing a
>>>>                                          WrappingIterator, which also appears to allow me to
>>>>                                          return arbitrary values (such as Accumulo's
>>>>                                          CountingIterator), but I get cryptic errors when
>>>>                                          trying to setiter. I'm on Accumulo 1.6:
>>>>
>>>>                                          root@dev kyt> setiter -t kyt -scan -p 10 -n
>>>>                                          countingIter -class
>>>>                                          org.apache.accumulo.core.iterators.system.CountingIterator
>>>>                                          2014-07-14 11:12:55,623 [shell.Shell] ERROR:
>>>>                                          java.lang.IllegalArgumentException:
>>>>                                          org.apache.accumulo.core.iterators.system.CountingIterator
>>>>
>>>>
>>>>                                          This is odd because other included implementations
>>>>                                          of WrappingIterator seem to work (perhaps the
>>>>                                          implementation of CountingIterator is dated):
>>>>                                          root@dev kyt> setiter -t kyt -scan -p 10 -n
>>>>                                          deletingIterator -class
>>>>                                          org.apache.accumulo.core.iterators.system.DeletingIterator
>>>>                                          The iterator class does not implement
>>>>                                          OptionDescriber. Consider this for better iterator
>>>>                                          configuration using this setiter command.
>>>>                                          Name for iterator (enter to skip):
>>>>
>>>>                                          All in all, how can I aggregate simple values, like
>>>>                                          counters, from rows with complex Avro objects as
>>>>                                          Values without having to add aggregation fields to
>>>>                                          these Value objects?
>>>>
>>>>                                          Thanks!
>>>>
>>>>                                          -Mike
>>>>
>>
>