Posted to user@accumulo.apache.org by Michael Moss <mi...@gmail.com> on 2014/09/07 22:38:51 UTC
Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values
All, thanks again for your feedback. I just consolidated some of these
learnings with some code samples here.
http://www.mammothdatallc.com/blog/accumulo-in-depth-look-at-filters-combiners-iterators-against-complex-values/
Best,
-Mike
On Fri, Jul 18, 2014 at 11:54 AM, William Slacum <
wilhelm.von.cloud@accumulo.net> wrote:
> Oh wow, I have totally read your problem incorrectly then. I thought you
> wanted a total count across rows for some reason (when you mentioned you
> had versioning turned off, things clicked).
>
> You can use a combiner, but I'd write an iterator that strips out the
> count field for each value (like we did the other iterator), and then place
> that lower in the iterator stack. This way you can get around your original
> issue with the combiner only taking a single input/output type.
>
>
> On Tue, Jul 15, 2014 at 2:25 PM, Adam Fuchs <af...@apache.org> wrote:
>
>> Mike,
>>
>> The way we usually aggregate by row is to check the source's top key
>> within the next function to see if it breaks the row boundary. If your
>> source starts giving you data in the next row then break out of the loop in
>> the next function. You'll also need to construct a row key to return from
>> your iterator and then handle the reseeking case (automatic seeking to
>> second key in row). See the RowEncodingIterator for hints on
>> implementation. You might actually want to subclass RowEncodingIterator to
>> implement your counter.
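The row-boundary check described above can be sketched without any Accumulo classes. What follows is a plain-Java model, not the RowEncodingIterator API: `Entry`, `Source`, and `RowCounter` are hypothetical stand-ins for a key/value pair, the source iterator, and the counting iterator. It assumes the source is already sorted by row, and it leaves out the reseeking case mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of per-row counting. None of these types are Accumulo classes. */
class RowCountSketch {

    /** One sorted entry: a row id plus an opaque value (hypothetical type). */
    static final class Entry {
        final String row;
        final String value;
        Entry(String row, String value) { this.row = row; this.value = value; }
    }

    /** Stand-in for the source iterator: walks entries already sorted by row. */
    static final class Source {
        private final List<Entry> entries;
        private int pos = 0;
        Source(List<Entry> entries) { this.entries = entries; }
        boolean hasTop() { return pos < entries.size(); }
        Entry top() { return entries.get(pos); }
        void next() { pos++; }
    }

    /** Exposes one (row, count) pair as its top; next() advances to the following row. */
    static final class RowCounter {
        private final Source source;
        private String topRow; // null means there is no top
        private long topCount;

        RowCounter(Source source) {
            this.source = source;
            next(); // prime the first row
        }

        boolean hasTop() { return topRow != null; }
        String getTopRow() { return topRow; }
        long getTopCount() { return topCount; }

        void next() {
            topRow = null;
            topCount = 0;
            if (!source.hasTop()) {
                return; // source exhausted, so there is no new top
            }
            String row = source.top().row;
            // Consume entries until the source's top key leaves the current row.
            while (source.hasTop() && source.top().row.equals(row)) {
                topCount++;
                source.next();
            }
            topRow = row;
        }
    }

    /** Drains a RowCounter into "row=count" strings. */
    static List<String> countRows(List<Entry> sorted) {
        List<String> out = new ArrayList<>();
        RowCounter counter = new RowCounter(new Source(sorted));
        while (counter.hasTop()) {
            out.add(counter.getTopRow() + "=" + counter.getTopCount());
            counter.next();
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> data = Arrays.asList(
            new Entry("1", "a"), new Entry("1", "b"),
            new Entry("2", "c"),
            new Entry("3", "d"), new Entry("3", "e"), new Entry("3", "f"));
        System.out.println(countRows(data));
    }
}
```

The key point is that next() stops as soon as the source's top moves to a different row, so each call yields one count per row rather than one count for the whole scan.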
>>
>> Cheers,
>> Adam
>> Cool. I'll write something up and share.
>>
>> I'm curious how to get my Counter (WrappingIterator) implementation to
>> aggregate by row (which, for some reason, I assumed was default?)
>>
>> Let's say I have rows (and CF="", CQ="" and versioningiterator off):
>> 1 (Value1, Value 2...Value N)
>> 2
>> 3
>>
>> How can my iterator return?
>> 1 (Count of values 1..N)
>> 2 (Count of values 1..N)
>> 3 ...
>>
>> I tried scan -b "1" -e "1" and it counts an individual row. But if I
>> don't specify anything, it returns,
>> 3 (Count of all values across all rows)
>>
>> Code:
>> http://pastebin.com/8xFNLHFS
>>
>> Example:
>> root@dev pe> listiter -scan -t pojo
>> -
>> - Iterator counter, scan scope options:
>> - iteratorPriority = 10
>> - iteratorClassName = iterators.Counter
>> -
>> root@dev pe> scan -b "1_1_20140101" -e "1_1_20140101"
>> 1_1_20140101 : [public] 65
>>
>> root@dev pe> scan -b "1_1_20140101" -e "3_9_20140727"
>> 3_9_20140727 : [public] 100000
>>
>> root@dev pe> scan
>> 3_9_20140727 : [public] 100000
>>
>>
>> Thanks.
>>
>> -Mike
>>
>>
>>
>> On Tue, Jul 15, 2014 at 12:29 PM, Josh Elser <jo...@gmail.com>
>> wrote:
>>
>>> There's been some mention about a desire to rethink the Iterator
>>> interface as it has some deficiencies (notably the lack of a "cleanup"
>>> before the iterators are torn down), but no one has stated that they're
>>> actively working on this.
>>>
>>> Getting better documentation wrt conventions: let us know where the
>>> Accumulo documentation falls short (and give us patches to fix the
>>> documentation :D). Additionally, write up your own findings from problems
>>> that you've run into. It's the entire community (users specifically) that
>>> we need to help encourage to grow.
>>>
>>> Even things as simple as "how do I count entries in an iterator" are big
>>> as you are now an "expert" on the subject :)
>>>
>>>
>>> On 7/15/14, 12:17 PM, Michael Moss wrote:
>>>
>>>> That worked ;) - Thanks!
>>>>
>>>> What a journey...
>>>>
>>>> I like Accumulo's architecture and promise, but the difficulty in
>>>> querying it (lack of documentation, conventions) is a major concern and
>>>> I'd imagine has to have an impact on adoption. I'm curious if there have
>>>> been any conversations around changing the interface around iterators
>>>> which are still confusing to me. Let me know how I can help!
>>>>
>>>>
>>>> On Tue, Jul 15, 2014 at 12:03 PM, William Slacum
>>>> <wilhelm.von.cloud@accumulo.net> wrote:
>>>>
>>>> Herp... serves me right for not setting up a proper test case.
>>>>
>>>> I think you need to override seek as well:
>>>>
>>>> @Override
>>>> public void seek(...) throws IOException {
>>>> super.seek(...);
>>>> next();
>>>> }
>>>>
>>>> I think I just realized the wrapping iterator could use some clean
>>>> up, because this isn't obvious. Basically after the wrapping
>>>> iterator's seek is called, it never calls the implementor's next()
>>>> to actually set up the first top key and value.
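To illustrate why the extra next() call in seek matters, here is a small plain-Java model. It is a sketch under assumptions, not the WrappingIterator API: `Source` and `Counter` are hypothetical classes, and `Counter` is assumed to report hasTop() from its own cached top rather than from the source. The `primeOnSeek` flag exists only to show both behaviours side by side.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of the "prime the top in seek()" fix. Not Accumulo API. */
class SeekPrimingSketch {

    /** Stand-in source over sorted keys; seek() positions at the first key >= start. */
    static final class Source {
        private final List<String> keys;
        private int pos;
        Source(List<String> keys) { this.keys = keys; this.pos = keys.size(); }
        void seek(String start) {
            pos = 0;
            while (pos < keys.size() && keys.get(pos).compareTo(start) < 0) {
                pos++;
            }
        }
        boolean hasTop() { return pos < keys.size(); }
        void next() { pos++; }
    }

    /** Counts every key the source yields; its top is only ever set inside next(). */
    static final class Counter {
        private final Source source;
        private final boolean primeOnSeek;
        private Long top; // null means there is no top

        Counter(Source source, boolean primeOnSeek) {
            this.source = source;
            this.primeOnSeek = primeOnSeek;
        }

        void seek(String start) {
            source.seek(start);
            top = null;
            if (primeOnSeek) {
                next(); // the fix: compute the first top right after seeking
            }
        }

        boolean hasTop() { return top != null; }
        long getTop() { return top; }

        void next() {
            if (!source.hasTop()) {
                top = null; // nothing left to aggregate
                return;
            }
            long n = 0;
            while (source.hasTop()) {
                n++;
                source.next();
            }
            top = n;
        }
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c");
        Counter unprimed = new Counter(new Source(keys), false);
        unprimed.seek("a");
        System.out.println("without priming, hasTop = " + unprimed.hasTop());
        Counter primed = new Counter(new Source(keys), true);
        primed.seek("a");
        System.out.println("with priming, hasTop = " + primed.hasTop());
    }
}
```

Without priming, the caller sees hasTop() return false straight after seek and never asks for data, which matches the "0 entries" symptom in the tserver log.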
>>>>
>>>>
>>>>
>>>> On Tue, Jul 15, 2014 at 9:50 AM, Michael Moss
>>>> <michael.moss@gmail.com> wrote:
>>>>
>>>> I set up debugging and am rethrowing the exception. What's
>>>> strange is it appears that despite the iterator instance being
>>>> properly set to iterator.Counter (my implementation), my
>>>> breakpoints aren't being hit, only in the parent classes
>>>> (Wrapping Iterator) and (SortedKeyValueIterator).
>>>>
>>>> I have two rows in the table, when I scan with no iterator:
>>>> 2014-07-15 06:46:26,577 [Audit ] INFO : operation: permitted;
>>>> user: root; action: scan; targetTable: pojo; authorizations:
>>>> public,; range: (-inf,+inf); columns: []; iterators: [];
>>>> iteratorOptions: {};
>>>> 2014-07-15 06:46:26,589 [tserver.TabletServer] DEBUG: ScanSess
>>>> tid 10.0.2.15:45073 8 *2 entries* in
>>>> 0.01 secs, nbTimes = [7 7 7.00 1]
>>>>
>>>> When I scan with the iterator (0 entries?):
>>>> 2014-07-15 06:45:58,036 [Audit ] INFO : operation: permitted;
>>>> user: root; action: scan; targetTable: pojo; authorizations:
>>>> public,; range: (-inf,+inf); columns: []; iterators: [];
>>>> iteratorOptions: {};
>>>> 2014-07-15 06:45:58,047 [tserver.TabletServer] DEBUG: ScanSess
>>>> tid 10.0.2.15:44992 8 *0 entries* in
>>>> 0.01 secs, nbTimes = [6 6 6.00 1]
>>>>
>>>> No exceptions otherwise. Really appreciate all the ongoing help.
>>>>
>>>> Best,
>>>>
>>>> -Mike
>>>>
>>>>
>>>> On Mon, Jul 14, 2014 at 6:40 PM, William Slacum
>>>> <wilhelm.von.cloud@accumulo.net> wrote:
>>>>
>>>> Anything in your Tserver log? I think you should just
>>>> rethrow that IOException on your source's next() method,
>>>> since they're usually not recoverable (ie, just make
>>>> Counter#next throw IOException)
>>>>
>>>>
>>>> On Mon, Jul 14, 2014 at 5:48 PM, Josh Elser
>>>> <josh.elser@gmail.com> wrote:
>>>>
>>>> A quick sanity check is to make sure you have data in
>>>> the table and that you can read the data without your
>>>> iterator (I've thought I had a bug because I didn't have
>>>> proper visibilities more times than I'd like to admit).
>>>>
>>>> Alternatively, you can also enable remote-debugging via
>>>> Eclipse into the TabletServer which might help you
>>>> understand more of what's going on.
>>>>
>>>> Lots of articles on how to set this up [1]. In short,
>>>> add -Xdebug
>>>> -Xrunjdwp:transport=dt_socket,server=y,address=8000
>>>> to
>>>>
>>>> ACCUMULO_TSERVER_OPTS in accumulo-env.sh, restart the
>>>> tserver, connect eclipse to 8000 via the Debug
>>>> configuration menu, set a breakpoint in your init, seek
>>>> and next methods, and `scan` in the shell.
>>>>
>>>>
>>>> [1]
>>>> http://javarevisited.blogspot.com/2011/02/how-to-setup-remote-debugging-in.html
>>>>
>>>>
>>>> On 7/14/14, 5:33 PM, Michael Moss wrote:
>>>>
>>>> Hmm...Still doesn't return anything from the shell.
>>>>
>>>> http://pastebin.com/ndRhspf8
>>>>
>>>> Any thoughts? What's the best way to debug these?
>>>>
>>>>
>>>> On Mon, Jul 14, 2014 at 5:14 PM, William Slacum
>>>> <wilhelm.von.cloud@accumulo.net> wrote:
>>>>
>>>> Ah, an artifact of me just willy nilly writing
>>>> an iterator :) Any
>>>> reference to `this.source` should be replaced
>>>> with
>>>> `this.getSource()`. In `next()`, your
>>>> workaround ends up calling
>>>> `this.hasTop()` as the while loop condition. It
>>>> will always return
>>>> false because two lines up we set `top_key` to
>>>> null. We need to make
>>>> sure that the source iterator has a top,
>>>> because we want to read
>>>> data from it. We'll have to change the loop
>>>> condition to
>>>> `while(this.getSource().hasTop())`. On line
>>>> 38 of your code we'll
>>>> need to call `this.getSource().next()` instead
>>>> of `this.next()`.
>>>>
>>>> The iterator interface is documented, but there
>>>> hasn't been a
>>>> definitive go-to for making one. I've been
>>>> drafting a blog post, but
>>>> since it doesn't exist yet, hopefully the
>>>> following will suffice.
>>>>
>>>> The lifetime of an iterator is (usually) as
>>>> follows:
>>>>
>>>> (1) A new instance is created via
>>>> Class.newInstance (so a no-args
>>>> constructor is needed)
>>>> (2) Init is called. This allows users to
>>>> configure the iterator, set
>>>> its source, and possibly check the environment.
>>>> We can also call
>>>> `deepCopy` on the source if we want to have
>>>> multiple sources (we'd
>>>> do this if we wanted to do a merge read out of
>>>> multiple column
>>>> families within a row).
>>>> (3) seek() is called. This gets our readers to
>>>> the correct positions
>>>> in the data that are within the scan range the
>>>> user requested, as
>>>> well as turning column families on or off. The
>>>> name should be
>>>> reminiscent of seeking to some key on disk.
>>>> (4) hasTop() is called. If true, that means we
>>>> have data, and the
>>>> iterator has a key/value pair that can be
>>>> retrieved by calling
>>>> getTopKey() and getTopValue(). If false, we're
>>>> done because there's
>>>> no data to return.
>>>> (5) next() is called. This will attempt to find a
>>>> new top key and
>>>> value. We go back to (4) to see if next was
>>>> successful in finding a
>>>> new top key/value and will repeat until the
>>>> client is satisfied or
>>>> hasTop() returns false.
>>>>
>>>> You can kind of make a state machine out of
>>>> those steps where we
>>>> loop between (4) and (5) until there's no data.
>>>> There are more
>>>> advanced workflows where next() can be reading
>>>> from multiple
>>>> sources, as well as seeking them to different
>>>> positions in the tablet.
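The five lifecycle steps above can be modelled as a small driver loop. This is a plain-Java sketch, not Accumulo code: `SimpleIterator` is a hypothetical, simplified contract that uses strings where the real interface uses Key and Value, and `MapIterator` is a pass-through implementation over a sorted map. It assumes a single source and no column-family filtering.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of the iterator lifecycle. Not Accumulo API. */
class LifecycleSketch {

    /** Hypothetical, simplified iterator contract (strings instead of Key/Value). */
    interface SimpleIterator {
        void init(TreeMap<String, String> data, Map<String, String> options);
        void seek(String startKey);
        boolean hasTop();
        String getTopKey();
        String getTopValue();
        void next();
    }

    /** Pass-through implementation backed by a sorted map. */
    static final class MapIterator implements SimpleIterator {
        private TreeMap<String, String> data;
        private Iterator<Map.Entry<String, String>> cursor;
        private Map.Entry<String, String> top;

        public MapIterator() {} // no-args constructor, needed for step (1)

        public void init(TreeMap<String, String> data, Map<String, String> options) {
            this.data = data; // options are ignored in this sketch
        }
        public void seek(String startKey) {
            cursor = data.tailMap(startKey, true).entrySet().iterator();
            next(); // set up the first top key/value
        }
        public boolean hasTop() { return top != null; }
        public String getTopKey() { return top.getKey(); }
        public String getTopValue() { return top.getValue(); }
        public void next() { top = cursor.hasNext() ? cursor.next() : null; }
    }

    /** Runs steps (1) to (5) and collects "key=value" strings. */
    static List<String> scan(Class<? extends SimpleIterator> cls,
                             TreeMap<String, String> data, String startKey) {
        SimpleIterator iter;
        try {
            iter = cls.getDeclaredConstructor().newInstance(); // (1) reflective construction
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("iterator needs a no-args constructor", e);
        }
        iter.init(data, Collections.<String, String>emptyMap()); // (2) configure, set source
        iter.seek(startKey);                                     // (3) position at the range start
        List<String> out = new ArrayList<>();
        while (iter.hasTop()) {                                  // (4) is there data?
            out.add(iter.getTopKey() + "=" + iter.getTopValue());
            iter.next();                                         // (5) look for a new top
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, String> data = new TreeMap<>();
        data.put("a", "1");
        data.put("b", "2");
        data.put("c", "3");
        System.out.println(scan(MapIterator.class, data, "b"));
    }
}
```

The loop between (4) and (5) is the state machine mentioned above: the caller only reads the top after hasTop() says there is one.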
>>>>
>>>>
>>>> On Mon, Jul 14, 2014 at 4:51 PM, Michael Moss
>>>> <michael.moss@gmail.com> wrote:
>>>>
>>>> Thanks, William. I was just hitting you up
>>>> for an example :)
>>>>
>>>> I adapted your pseudocode
>>>> (http://pastebin.com/ufPJq0g3), but
>>>> noticed that "this.source" in your example
>>>> didn't have
>>>> visibility. Did I work around it
>>>> correctly?
>>>>
>>>> When I add my iterator to my table and run
>>>> scan from the shell,
>>>> it returns nothing - what should I expect
>>>> here? In general I've
>>>> found the iterator interface pretty
>>>> confusing and haven't spent
>>>> the time wrapping my head around it yet.
>>>> Any documentation or
>>>> examples (beyond what I could find on the
>>>> site or in the code)
>>>> appreciated!
>>>>
>>>> root@dev> table pojo
>>>> root@dev pojo> listiter -scan -t pojo
>>>> -
>>>> - Iterator counter, scan scope options:
>>>> - iteratorPriority = 10
>>>> - iteratorClassName = iterators.Counter
>>>> -
>>>> root@dev pojo> scan
>>>> root@dev pojo>
>>>>
>>>>
>>>> Best,
>>>>
>>>> -Mike
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Jul 14, 2014 at 4:07 PM, William
>>>> Slacum
>>>> <wilhelm.von.cloud@accumulo.net> wrote:
>>>>
>>>> For a bit of pseudocode, I'd probably
>>>> make a class that did
>>>> something akin to:
>>>> http://pastebin.com/pKqAeeCR
>>>>
>>>> I wrote that up real quick in a text
>>>> editor-- it won't
>>>> compile or anything, but should point
>>>> you in the right
>>>> direction.
>>>>
>>>>
>>>> On Mon, Jul 14, 2014 at 3:44 PM,
>>>> William Slacum
>>>> <wilhelm.von.cloud@accumulo.net> wrote:
>>>>
>>>> Hi Mike!
>>>>
>>>> The Combiner interface is only for
>>>> aggregating keys
>>>> within a single row. You can
>>>> probably get away with
>>>> implementing your combining logic
>>>> in a WrappingIterator
>>>> that reads across all the rows in a
>>>> given tablet.
>>>>
>>>> To do some combine/fold/reduce
>>>> operation, Accumulo needs
>>>> the input type to be the same as
>>>> the output type. The
>>>> combiner doesn't have a notion of a
>>>> "present" type (as
>>>> you'd see in something like
>>>> Algebird's Groups), but you
>>>> can use another iterator to perform
>>>> your transformation.
>>>>
>>>> If you wanted to extract the
>>>> "count" field from your
>>>> Avro object, you could write a new
>>>> Iterator that took
>>>> your Avro object, extracted the
>>>> desired field, and
>>>> returned it as its top value. You
>>>> can then set this
>>>> iterator as the source of the
>>>> aggregator, either
>>>> programmatically or by wrapping
>>>> the source object
>>>> passed to the aggregator in its
>>>> SortedKeyValueIterator#init call.
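As a rough illustration of the two-layer idea, here is a plain-Java sketch using the example from the original question ("A" "foo" with counts 2 and 3). Nothing here is Accumulo or Avro API: `BigRecord` is a hypothetical stand-in for the large Avro object, `extractCount` plays the role of the lower iterator, and `sumByKey` plays the role of the combiner, whose input and output are both plain longs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of "extract a field, then combine". Not Accumulo or Avro API. */
class ExtractThenSumSketch {

    /** Hypothetical stand-in for a large Avro record stored as a value. */
    static final class BigRecord {
        final int count;
        final String otherFields;
        BigRecord(int count, String otherFields) {
            this.count = count;
            this.otherFields = otherFields;
        }
    }

    /** A key (row plus column, flattened to one string here) and its complex value. */
    static final class Entry {
        final String key;
        final BigRecord value;
        Entry(String key, BigRecord value) { this.key = key; this.value = value; }
    }

    /** Lower layer: reduce the complex value to just the field being aggregated. */
    static long extractCount(BigRecord record) {
        return record.count;
    }

    /** Upper layer: sum the extracted values per key; input and output are both long. */
    static Map<String, Long> sumByKey(List<Entry> entries) {
        Map<String, Long> sums = new TreeMap<>();
        for (Entry e : entries) {
            sums.merge(e.key, extractCount(e.value), Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
            new Entry("A foo", new BigRecord(2, "...")),
            new Entry("A foo", new BigRecord(3, "...")));
        System.out.println(sumByKey(entries));
    }
}
```

Because the extraction happens before the summing step, the summing step never needs to know about the complex type, which is how this arrangement gets around the single input/output type restriction.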
>>>>
>>>> This is a bit inefficient as you'd
>>>> have to serialize to
>>>> a Value and then immediately
>>>> deserialize it in the
>>>> iterator above it. You could
>>>> mitigate this by exposing a
>>>> method that would get the extracted
>>>> value before
>>>> serializing it.
>>>>
>>>> This kind of counting also requires
>>>> client side logic to
>>>> do a final combine operation, since
>>>> the aggregations
>>>> from all the tservers are partial
>>>> results.
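The final client-side step can be as small as summing whatever partial counts come back. A minimal plain-Java sketch, assuming the partial counts have already been read from the scanner and decoded into longs (the `combine` helper is hypothetical, not an Accumulo call):

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of the client-side final combine. Not Accumulo API. */
class FinalCombineSketch {

    /** Adds up the partial counts returned by the server-side iterators. */
    static long combine(List<Long> partialCounts) {
        long total = 0;
        for (long partial : partialCounts) {
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        // Each element stands for one partial result returned by a scan.
        List<Long> partials = Arrays.asList(2L, 3L, 5L);
        System.out.println(combine(partials));
    }
}
```

This works for counts and sums because addition is associative; an aggregate such as an average would need the partial results to carry both a sum and a count.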
>>>>
>>>> I believe that CountingIterator is
>>>> not meant for user
>>>> consumption, but I do not know if
>>>> it's related to your
>>>> issue in trying to use it from the
>>>> shell. Iterators set
>>>> through the shell, in previous
>>>> versions of Accumulo,
>>>> have a requirement to implement
>>>> OptionDescriber. Many
>>>> default iterators do not implement
>>>> this, and thus can't
>>>> be set in the shell.
>>>>
>>>>
>>>>
>>>> On Mon, Jul 14, 2014 at 2:44 PM,
>>>> Michael Moss
>>>> <michael.moss@gmail.com> wrote:
>>>>
>>>> Hi, All.
>>>>
>>>> I'm curious what the best
>>>> practices are around
>>>> persisting complex types/data
>>>> in Accumulo (and
>>>> aggregating on fields within
>>>> them).
>>>>
>>>> Let's say I have (row, column
>>>> family, column
>>>> qualifier, value):
>>>> "A" "foo" ""
>>>> MyHugeAvroObject(count=2)
>>>> "A" "foo" ""
>>>> MyHugeAvroObject(count=3)
>>>>
>>>> Let's say MyHugeAvroObject has
>>>> a field "Integer
>>>> count" with the values above.
>>>>
>>>> What is the best way to
>>>> aggregate on row, column
>>>> family, column qualifier by
>>>> count? In my above example:
>>>> "A" "foo" "" 5
>>>>
>>>> The
>>>> TypedValueCombiner.typedReduce method can
>>>> deserialize any "V", in my case
>>>> MyHugeAvroObject,
>>>> but it needs to return a value
>>>> of type "V". What are
>>>> the best practices for deeply
>>>> nested/complex
>>>> objects? It's not always
>>>> straightforward to map a
>>>> complex Avro type into Row ->
>>>> Column Family ->
>>>> Column Qualifier.
>>>>
>>>> Rather than using a
>>>> TypedCombiner, I looked into
>>>> using an Aggregator (which
>>>> appears deprecated as of
>>>> 1.4), which appears to let me
>>>> return arbitrary
>>>> values, but despite running
>>>> setiter, my aggregator
>>>> doesn't seem to do anything.
>>>>
>>>> I also tried looking at
>>>> implementing a
>>>> WrappingIterator, which also
>>>> appears to allow me to
>>>> return arbitrary values (such as
>>>> Accumulo's
>>>> CountingIterator), but I get
>>>> cryptic errors when
>>>> trying to setiter, I'm on
>>>> Accumulo 1.6:
>>>>
>>>> root@dev kyt> setiter -t kyt -scan -p 10 -n countingIter -class
>>>> org.apache.accumulo.core.iterators.system.CountingIterator
>>>> 2014-07-14 11:12:55,623 [shell.Shell] ERROR:
>>>> java.lang.IllegalArgumentException:
>>>> org.apache.accumulo.core.iterators.system.CountingIterator
>>>>
>>>>
>>>> This is odd because other
>>>> included implementations
>>>> of WrappingIterator seem to
>>>> work (perhaps the
>>>> implementation of
>>>> CountingIterator is dated):
>>>> root@dev kyt> setiter -t kyt -scan -p 10 -n deletingIterator -class
>>>> org.apache.accumulo.core.iterators.system.DeletingIterator
>>>>
>>>> The iterator class does not
>>>> implement
>>>> OptionDescriber. Consider this
>>>> for better iterator
>>>> configuration using this
>>>> setiter command.
>>>> Name for iterator (enter to
>>>> skip):
>>>>
>>>> All in all, how can I aggregate
>>>> simple values, like
>>>> counters from rows with complex
>>>> Avro objects as
>>>> Values without having to add
>>>> aggregations fields to
>>>> these Value objects?
>>>>
>>>> Thanks!
>>>>
>>>> -Mike