Posted to user@accumulo.apache.org by Russ Weeks <rw...@newbrightidea.com> on 2014/09/02 07:01:45 UTC

Re: Using iterators to generate data

Hi, William,

Thanks very much for your response. I get that it's not supported or
desirable for an Iterator to instantiate a scanner or writer. It's sort of
analogous to opening a JDBC connection from inside a stored procedure -
lots of reasons why that would be a bad idea. I'm more interested in the
case where an iterator that processes input A, B, C, D might emit values A,
A1=f(A), B, B1=f(B) etc. Under what conditions is it safe to use iterators
this way? It seems there are at least two constraints: A1 must sort
lexicographically between A and B (otherwise the iterator could emit data
out of order), and A1 must be in the same row as A (otherwise A1 might
properly be handled by a different tablet server).
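The A, A1=f(A), B, B1=f(B) pattern above can be sketched as a toy model in plain Java. This is only an illustration, not Accumulo's SortedKeyValueIterator API: keys are modeled as plain strings, and the "\0derived" suffix is an arbitrary convention invented here, which guarantees the derived key sorts immediately after its source key and before any other key.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model of a "generating" iterator: for each source key K it emits
// K, then a derived key K + "\0derived". Because the suffix starts with
// a NUL byte, K < K+"\0derived" < any other source key, so the output
// stream stays sorted.
class GeneratingIterator implements Iterator<String> {
    private final Iterator<String> source;
    private String pendingDerived; // derived key waiting to be emitted

    GeneratingIterator(Iterator<String> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        return pendingDerived != null || source.hasNext();
    }

    @Override
    public String next() {
        if (pendingDerived != null) {
            String k = pendingDerived;
            pendingDerived = null;
            return k;
        }
        String k = source.next();
        pendingDerived = k + "\0derived"; // stands in for f(K)
        return k;
    }

    // Convenience: expand a sorted list of keys into the emitted stream.
    static List<String> expand(List<String> sortedKeys) {
        List<String> out = new ArrayList<>();
        new GeneratingIterator(sortedKeys.iterator()).forEachRemaining(out::add);
        return out;
    }
}
```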

Seems like the consensus is to use MR for this sort of thing. I'm
definitely keeping an eye on fluo though, looks like a very cool project!

-Russ


On Sat, Aug 30, 2014 at 12:20 AM, William Slacum <
wilhelm.von.cloud@accumulo.net> wrote:

> This comes up a bit, so maybe we should add it to the FAQ (or just have
> better information about iterators in general). The short answer is that
> it's usually not recommended, because there aren't strong guarantees about
> the lifetime of an iterator (so we wouldn't know when to close any
> resources held by an iterator instance, such as batch writer thread pools)
> and there's 0 resource management related to tablet server-to-tablet server
> communications.
>
> Check out Fluo, made by our own "Chief" Keith Turner & Mike "The Trike"
> Walch: https://github.com/fluo-io/fluo
>
> It's an implementation of Google's percolator, which provides the
> capability to handle "new" data server side as well as transactional
> guarantees.
>
>
> On Fri, Aug 29, 2014 at 5:09 PM, Russ Weeks <rw...@newbrightidea.com>
> wrote:
>
>> There are plenty of examples of using custom iterators to filter or
>> combine data at either the cell level or the row level. In these cases, the
>> amount of data coming out of the iterator is less than the amount going in.
>> What about going the other direction, using a custom iterator to generate
>> new data based on the contents of a cell or a row? I guess this is also
>> what a combiner does but bear with me...
>>
>> The immediately obvious use case is parsing. Suppose one cell in my row
>> holds an XML document. I'd like to configure an iterator with an XPath
>> expression to pull a field out of the document, so that I can leverage the
>> distributed processing of the cluster instead of parsing the doc on the
>> scanner-side.
>>
>> I'm sure there are constraints or things to watch out for, does anybody
>> have any recommendations here? For instance, the generated cells would
>> probably have to be in the same row as the input cells?
>>
>> I'm using MapReduce to satisfy all these use cases right now but I'm
>> interested to know how much of my code could be ported to Iterators.
>>
>> Thanks!
>> -Russ
>>
>
>

Re: Using iterators to generate data

Posted by William Slacum <wi...@accumulo.net>.
Ah I see. You're correct about the ordering. How different would your key
be? Another thing to consider is that if you are returning a generated key
that's not actually in the data, your iterator needs to handle the case
where it is reseek'd with a range that has an exclusive start on a
generated key. You'd have to potentially recompute results if you return
multiple generated keys.
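The reseek case William describes can be sketched as a small decision helper, again modeling keys as plain strings. The "\0derived" suffix is an assumption invented for this sketch so the iterator can recognize its own generated keys in a range start; real code would implement this logic inside SortedKeyValueIterator.seek().

```java
// Toy sketch of resuming after a reseek with an exclusive range start.
// Two cases: the start is a real source key (it was returned, but its
// derived value was not, so f(start) must be recomputed and emitted
// first), or the start is a generated key (both the source and its
// derived value were returned, so we resume strictly after the source).
class ReseekPlan {
    static final String SUFFIX = "\0derived"; // assumed marker for generated keys

    final String seekSource;        // source key to reposition the underlying data at
    final boolean emitDerivedFirst; // must f(seekSource) be recomputed and emitted?

    private ReseekPlan(String seekSource, boolean emitDerivedFirst) {
        this.seekSource = seekSource;
        this.emitDerivedFirst = emitDerivedFirst;
    }

    static ReseekPlan afterExclusive(String start) {
        if (start.endsWith(SUFFIX)) {
            // Generated key: strip the suffix to recover the source key,
            // then resume strictly after it.
            String src = start.substring(0, start.length() - SUFFIX.length());
            return new ReseekPlan(src, false);
        }
        // Real source key: recompute its derived value before moving on.
        return new ReseekPlan(start, true);
    }
}
```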



Re: Using iterators to generate data

Posted by Andrew Wells <aw...@clearedgeit.com>.
This expands on what Keith is saying. I have done something like this
before.

For this example, I am going to assume a few things. This is a matter of my
opinion and experience, and may not line up exactly with what you want to do.
I have experience doing this both inserting into the same table and inserting
into a different table. I am going to assume you want the results in the same
table, without needing to form a new connection.

*Assumptions*

*You want to keep your original data*

*You are going to insert into the same table that you are reading from*

*Requirements*
   I am assuming some requirements; if there are others, please reply with
them.

*Data being read must be output*

This is a pretty straightforward requirement. Simply put, the data you
read must also be part of the output.


*Order must be maintained* (in most cases)

   Ordering is very important in Accumulo, in that keys
(rowId, cf, cq, visibility, timestamp) must be in order. A small exception can
be noted with scanners, as ordering is currently not enforced there, though it
is still expected; minor and major compactions (MinC and MajC) do enforce the
ordering policy.


*Generated Data must come after the original record* (in most cases)

   Besides the exception with scanners mentioned before, ORDER MATTERS. So
when you read a record, whatever you generate needs to come after it. This
can be achieved by changing the column family, column qualifier, or column
visibility. *Do not modify the rowId, as the new key cannot be guaranteed to
land in the same tablet.* Also, the timestamp is used for Accumulo's
versioning, so it is best not to change that either.
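To make that rule concrete, here is a toy sketch with (row, cf, cq) triples standing in for Accumulo's Key class (the "\0derived" suffix is a convention invented for this example): deriving a key by suffixing only the column qualifier keeps it in the same row, and therefore the same tablet, while still sorting immediately after the original.

```java
// Minimal (row, cf, cq) key model with Accumulo-style lexicographic
// ordering: compare row first, then column family, then qualifier.
class KeyParts implements Comparable<KeyParts> {
    final String row, cf, cq;

    KeyParts(String row, String cf, String cq) {
        this.row = row;
        this.cf = cf;
        this.cq = cq;
    }

    @Override
    public int compareTo(KeyParts o) {
        int c = row.compareTo(o.row);
        if (c == 0) c = cf.compareTo(o.cf);
        if (c == 0) c = cq.compareTo(o.cq);
        return c;
    }

    // Derive a follow-on key: same row and family, qualifier suffixed.
    // The NUL byte makes it sort before any longer "real" qualifier.
    KeyParts derive() {
        return new KeyParts(row, cf, cq + "\0derived");
    }
}
```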

*Implementation*

Let us propose that the data we are generating goes into an in-order
queue. The data being read is also in order, so we can do a merge sort,
which means the resulting stream or dataset will also be in order.

Treating the data like this resolves all of our requirements, provided
that the generated data always comes after the data that generated it.

The complexity comes from all of the customization needed in the
SortedKeyValueIterator implementation.
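The merge itself is straightforward once both streams are sorted; a minimal sketch (plain strings in place of Accumulo keys, lists in place of live streams) might look like:

```java
import java.util.ArrayList;
import java.util.List;

// Toy merge of two sorted streams: the source data being read and the
// queue of generated entries. Since both inputs are sorted, a standard
// two-pointer merge keeps the combined output sorted.
class MergeEmit {
    static List<String> merge(List<String> source, List<String> generated) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < source.size() || j < generated.size()) {
            // Prefer the source entry on ties, so original data comes
            // out ahead of anything generated from it.
            boolean takeSource = j >= generated.size()
                || (i < source.size()
                    && source.get(i).compareTo(generated.get(j)) <= 0);
            out.add(takeSource ? source.get(i++) : generated.get(j++));
        }
        return out;
    }
}
```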

It would also be recommended to run a versioning iterator BEFORE the
generating iterator, to limit the number of records generated.

*Pitfall*

*Memory*: in the worst case, memory can be an issue if generated records
are buffered for output at the end of the tablet or scan.





-- 
*Andrew George Wells*
*Software Engineer*
*awells@clearedgeit.com <aw...@clearedgeit.com>*

Re: Using iterators to generate data

Posted by Keith Turner <ke...@deenlo.com>.
Does extending org.apache.accumulo.core.iterators.user.TransformingIterator
meet your needs?

Transforming data is tricky.  You have covered some of the issues, such as
making sure you generate sorted data within the tablet's range.  You also
need to handle the reseek case: Accumulo reads batches of data, and at any
point it could take the last key the iterator returned and reseek with a
non-inclusive start.  For example, if you initially seek [A,R] and your
iterator returns keys B,F,L,N,Q, your iterator should work correctly with
the following seek ranges: (B,R], (F,R], (L,R], (N,R], and (Q,R].
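That contract can be written down as a property check. This is a toy model, not the Accumulo API: keys are plain strings and scan() simulates a seek with an exclusive start, but the property it asserts is exactly the one above, so it is handy as a unit test for a custom iterator's reseek logic.

```java
import java.util.ArrayList;
import java.util.List;

// Property check for the reseek contract: for a full scan that returned
// keys B,F,L,N,Q, re-seeking with an exclusive start at any returned
// key must yield exactly the remaining keys, in order.
class ReseekContract {
    // Model of seeking range (exclusiveStart, +inf): return all keys
    // strictly greater than the start key.
    static List<String> scan(List<String> allKeys, String exclusiveStart) {
        List<String> out = new ArrayList<>();
        for (String k : allKeys) {
            if (k.compareTo(exclusiveStart) > 0) {
                out.add(k);
            }
        }
        return out;
    }

    // Verify the contract against every possible reseek point.
    static boolean contractHolds(List<String> fullScan) {
        for (int i = 0; i < fullScan.size(); i++) {
            List<String> expected = fullScan.subList(i + 1, fullScan.size());
            if (!scan(fullScan, fullScan.get(i)).equals(expected)) {
                return false;
            }
        }
        return true;
    }
}
```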



