Posted to user@accumulo.apache.org by "Cardon, Tejay E" <te...@lmco.com> on 2012/08/22 22:41:20 UTC

Custom Iterators

All,
I'm interested in writing a custom iterator, and I've been looking for documentation on how to do so.  Thus far, I've not been able to find anything beyond the java docs in SortedKeyValueIterator and a few other sub-classes.  A few of the examples use Iterators, but provide no real info on how to properly implement one.  Is there anywhere to find general guidance on the iterator stack?

(If you're interested)
Specifically, for those that are curious, I'm trying to implement something similar to the wikisearch example, but with some key differences.  In my case, I've got a file with various attributes that are being indexed.  So for each file there are 5 attributes, and each attribute has a fixed number of possible values.  For example (totally made up):
personID, gender, hair color, country, race, personRecord

Row:binID; ColFam:Attribute_AttributeValue; ColQ:PersonID; Val:blank
AND
Row:binID; ColFam:"D"; ColQ:personID; value:personRecord

A typical query would be:
Give me the personRecord for all people with:
Gender: male &
Hair color: blond or brown &
Country: USA or England or china or korea &
Race: white or oriental

The existing Iterators used in the wikisearch example are unable to handle the "or" clauses in each attribute.
The OrIterator doesn't appear to handle the possibility of more than one row per tablet.

Thanks,
Tejay Cardon
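
The query above has the shape "AND across attributes, OR within an attribute". A minimal sketch of that shape in plain Java, not written against the Accumulo API: the map stands in for the index entries of one bin row (the ColFam string as key, the set of ColQ personIDs as value). The class name, method name, and sample IDs are all made up for illustration.

```java
import java.util.*;

final class AndOfOrsSketch {

    /**
     * index:  "Attribute_AttributeValue" -> sorted personIDs (one bin row).
     * groups: one inner list per attribute; terms inside a list are OR-ed,
     *         and the lists themselves are AND-ed.
     */
    static SortedSet<String> query(Map<String, SortedSet<String>> index,
                                   List<List<String>> groups) {
        SortedSet<String> result = null;
        for (List<String> group : groups) {
            // OR: union of every posting list named in this group.
            SortedSet<String> union = new TreeSet<>();
            for (String term : group) {
                SortedSet<String> ids = index.get(term);
                if (ids != null) {
                    union.addAll(ids);
                }
            }
            // AND: intersect with whatever has survived so far.
            if (result == null) {
                result = union;
            } else {
                result.retainAll(union);
            }
        }
        if (result == null) {
            return new TreeSet<>();
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample data.
        Map<String, SortedSet<String>> index = new HashMap<>();
        index.put("gender_male", new TreeSet<>(Arrays.asList("p1", "p2", "p4")));
        index.put("hair_blond", new TreeSet<>(Arrays.asList("p1")));
        index.put("hair_brown", new TreeSet<>(Arrays.asList("p2", "p3")));
        index.put("country_USA", new TreeSet<>(Arrays.asList("p1", "p3")));
        index.put("country_England", new TreeSet<>(Arrays.asList("p4")));

        List<List<String>> groups = Arrays.asList(
                Arrays.asList("gender_male"),
                Arrays.asList("hair_blond", "hair_brown"),
                Arrays.asList("country_USA", "country_England"));

        System.out.println(query(index, groups)); // prints [p1]
    }
}
```

The surviving personIDs would then be used to fetch the "D" column family entries holding the personRecord values. A server-side iterator would do the same work lazily over sorted sources instead of materializing sets.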

RE: EXTERNAL: Re: Custom Iterators

Posted by "Cardon, Tejay E" <te...@lmco.com>.

Ah, thank you Marc.  I should have pieced that together, but hadn't.  So the OrIterator should never be added directly to the stack using the scanner.  It is utilized by something like the BooleanLogicIterator to build a composite iterator, but it is the BooleanLogicIterator that actually gets added to the stack.

-----Original Message-----
From: Marc P. [mailto:marc.parisi@gmail.com] 
Sent: Thursday, August 23, 2012 9:15 AM
To: user@accumulo.apache.org
Subject: Re: EXTERNAL: Re: Custom Iterators

Thanks for catching that! I did indeed write that down incorrectly. I apologize. I'll fix that tonight.

Iterators are stacked based on their priority, which is assigned when you set them (via the scanner, for example, or the input format's IteratorSetting).

The init method comment is a general suggestion, for example if you are using it within a scan session.

The OrIterator (as in the wikisearch example) is created by the BooleanLogicIterator, and the sources are added through the addTerm method. This is, apparently, its expected use. You will also note that the BooleanLogicIterator (or any iterator that uses the OrIterator) has an implemented initializer method.

On Thu, Aug 23, 2012 at 10:59 AM, Cardon, Tejay E <te...@lmco.com> wrote:
> Marc,
>
> Thanks for the writeup.  It is by far the most comprehensive info I've 
> seen on iterators, and was very helpful to me.  A couple notes/questions:
>
>
>
> You mention that SortedKeyValueIterator implements FileSKVIterator.  
> I've only looked at the 1.4.1 source, but it appears that the opposite is true.
>
>
>
> You also mention that iterators get their source from the init method, 
> but some (like OrIterator) seem to throw exceptions on that method.  
> Where do they get their source data, and what are the API implications 
> of having iterators that reject init (or deep copy for that matter).
>
>
>
> Final thought.  If I want to stack several iterators, what's the best 
> way to go about that?  In other words, I'd like an iterator that I 
> write to be the source to another iterator that I've written, which in 
> turn may feed yet another that I've written.  Preferably, I'd like 
> each to be independently re-useable, so I don't want to build that 
> stacking into the source of any of the iterators themselves.  Is that 
> possible, or would I need some sort of iterator factory that builds 
> the stacks and then acts as an interface to the fully formed stack?
>
>
>
> Thanks,
>
> Tejay
>
> From: Marc Parisi [mailto:marc@accumulo.net]
> Sent: Wednesday, August 22, 2012 5:33 PM
>
>
> To: user@accumulo.apache.org
> Subject: EXTERNAL: Re: Custom Iterators
>
>
>
> Here's a quick write up
>
>
>
>     http://www.accumulo.net/node/1
>
> On Wed, Aug 22, 2012 at 8:03 PM, Josh Elser <jo...@gmail.com> wrote:
>
> Err, double (triple) reply:
>
> No, you are incorrect. The wikisearch example can handle any arbitrary 
> boolean expression containing NOT, AND, and OR. As always, I'll 
> preface it the same as Bill did: it *should* be able to handle them :).
>
> I know that cleaning-up/reworking the Wikisearch code is in the works. 
> I'm just not positive about the timeframe.
>
> As far as examples, I'd push you to the write-up Eric did after 
> benchmarking the wikisearch example: 
> http://accumulo.apache.org/example/wikisearch.html
>
> He has some example queries that give the basic idea behind what's 
> supported (minus the NOTs)
>
> On 08/22/2012 05:27 PM, Cardon, Tejay E wrote:
>
>
> Josh,
>
> Thanks for getting back to me so quickly. I explained in my lengthy 
> reply to William that the comment on OrIterator.TermSource.compareTo 
> indicates that implementations with more than one row per tablet need 
> to compare row key first (and that is not being done in this code). It 
> may be that it's not an issue and I'm simply misunderstanding 
> something. As for the wikisearch example, as I understood it, it could only handle searches for "anded"
> terms. If that's not the case, then an example of an or search would 
> be helpful. In any case, I'd love a deeper dive on the wikisearch 
> somewhere. I get the source code and a high level explanation of 
> what's happening, but I'd love a tutorial or something that walks 
> through the classes and explains how each one contributes to the 
> functionality. Don't consider that a request (that would be a lot more 
> to ask than I'm willing to ask), but I would certainly find it useful if it does exist.
>
> Thanks,
>
> Tejay
>
> *From:*Josh Elser [mailto:josh.elser@gmail.com]
> *Sent:* Wednesday, August 22, 2012 2:53 PM
> *To:* user@accumulo.apache.org
> *Subject:* EXTERNAL: Re: Custom Iterators
>
>
>
> What makes you say that the OrIterator cannot handle more than one row 
> per tablet? Can you provide details?
>
> AFAIK, the OrIterator should work correctly in all cases (e.g. 
> regardless of row distribution in a tablet). Any issues in the code 
> that prevent it from doing so would be a bug that should be fixed.
>
> Also, the wikisearch example supports indexing over multiple 
> attributes (and I believe indexes document metadata in addition to the tokenized document).
> Is there something unclear that could be better documented?
>
> On 8/22/12 4:41 PM, Cardon, Tejay E wrote:
>
>     All,
>
>     I'm interested in writing a custom iterator, and I've been looking
>     for documentation on how to do so. Thus far, I've not been able to
>     find anything beyond the java docs in SortedKeyValueIterator and a
>     few other sub-classes. A few of the examples use Iterators, but
>     provide no real info on how to properly implement one. Is there
>     anywhere to find general guidance on the iterator stack?
>
>     (If you're interested)
>
>     Specifically, for those that are curious, I'm trying to implement
>     something similar to the wikisearch example, but with some key
>     differences. In my case, I've got a file with various attributes
>     that are being indexed. So for each file there are 5 attributes, and
>     each attribute has a fixed number of possible values. For example
>     (totally made up):
>
>     personID, gender, hair color, country, race, personRecord
>
>     Row:binID; ColFam:Attribute_AttributeValue; ColQ:PersonID; Val:blank
>
>     AND
>     Row:binID; ColFam:"D"; ColQ:personID; value:personRecord
>
>     A typical query would be:
>
>     Give me the personRecord for all people with:
>
>     Gender: male &
>
>     Hair color: blond or brown &
>
>     Country: USA or England or china or korea &
>
>     Race: white or oriental
>
>     The existing Iterators used in the wikisearch example are unable
>     to handle the "or" clauses in each attribute.
>
>     The OrIterator doesn't appear to handle the possibility of more than
>     one row per tablet
>
>     Thanks,
>
>     Tejay Cardon
>
>

Re: EXTERNAL: Re: Custom Iterators

Posted by "Marc P." <ma...@gmail.com>.

Thanks for catching that! I did indeed write that down incorrectly. I
apologize. I'll fix that tonight.

Iterators are stacked based on their priority, which is assigned when
you set them (via the scanner, for example, or the input format's
IteratorSetting).

The init method comment is a general suggestion, for example if you
are using it within a scan session.

The OrIterator (as in the wikisearch example) is created by the
BooleanLogicIterator, and the sources are added through the addTerm
method. This is, apparently, its expected use. You will also note
that the BooleanLogicIterator (or any iterator that uses the
OrIterator) has an implemented initializer method.
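
To illustrate the addTerm pattern described above, here is a rough sketch in plain Java. It is not the actual wikisearch OrIterator: it is a union over several sorted sources that a parent hands in one at a time, instead of receiving a single source through init. OrMergeSketch, Key, TermSource, and drain are hypothetical names; only addTerm echoes the method mentioned above. The comparison looks at the row first and then the document ID, which keeps the output sorted when more than one row is present.

```java
import java.util.*;

final class OrMergeSketch {

    /** A (row, docId) pair; ordering compares the row first, then the docId. */
    static final class Key implements Comparable<Key> {
        final String row;
        final String docId;

        Key(String row, String docId) {
            this.row = row;
            this.docId = docId;
        }

        @Override
        public int compareTo(Key other) {
            int c = row.compareTo(other.row);
            return c != 0 ? c : docId.compareTo(other.docId);
        }

        @Override
        public String toString() {
            return row + "/" + docId;
        }
    }

    /** One sorted source plus the key it is currently positioned on. */
    private static final class TermSource implements Comparable<TermSource> {
        final Iterator<Key> it;
        Key head;

        TermSource(Iterator<Key> it) {
            this.it = it;
            this.head = it.next();
        }

        boolean advance() {
            if (!it.hasNext()) {
                return false;
            }
            head = it.next();
            return true;
        }

        @Override
        public int compareTo(TermSource other) {
            return head.compareTo(other.head);
        }
    }

    private final PriorityQueue<TermSource> heap = new PriorityQueue<>();

    /** Sources are added by the parent; there is no init-style entry point. */
    void addTerm(Iterator<Key> sortedSource) {
        if (sortedSource.hasNext()) {
            heap.add(new TermSource(sortedSource));
        }
    }

    /** Emits the union in sorted order, each key once. */
    List<String> drain() {
        List<String> out = new ArrayList<>();
        Key last = null;
        while (!heap.isEmpty()) {
            TermSource s = heap.poll();
            if (last == null || last.compareTo(s.head) != 0) {
                out.add(s.head.toString());
                last = s.head;
            }
            if (s.advance()) {
                heap.add(s); // re-insert so the heap sees the new head
            }
        }
        return out;
    }

    public static void main(String[] args) {
        OrMergeSketch or = new OrMergeSketch();
        or.addTerm(Arrays.asList(new Key("r1", "d1"), new Key("r2", "d1")).iterator());
        or.addTerm(Arrays.asList(new Key("r1", "d2"), new Key("r2", "d1")).iterator());
        System.out.println(or.drain()); // prints [r1/d1, r1/d2, r2/d1]
    }
}
```

If the comparison ignored the row, r2/d1 would sort ahead of r1/d2 and the merged output would no longer be in key order.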

RE: EXTERNAL: Re: Custom Iterators

Posted by "Cardon, Tejay E" <te...@lmco.com>.

Marc,
Thanks for the writeup.  It is by far the most comprehensive info I've seen on iterators, and was very helpful to me.  A couple notes/questions:

You mention that SortedKeyValueIterator implements FileSKVIterator.  I've only looked at the 1.4.1 source, but it appears that the opposite is true.

You also mention that iterators get their source from the init method, but some (like OrIterator) seem to throw exceptions on that method.  Where do they get their source data, and what are the API implications of having iterators that reject init (or deep copy for that matter)?

Final thought.  If I want to stack several iterators, what's the best way to go about that?  In other words, I'd like an iterator that I write to be the source to another iterator that I've written, which in turn may feed yet another that I've written.  Preferably, I'd like each to be independently reusable, so I don't want to build that stacking into the source of any of the iterators themselves.  Is that possible, or would I need some sort of iterator factory that builds the stacks and then acts as an interface to the fully formed stack?

Thanks,
Tejay
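
One possible shape for the stacking question above, sketched in plain Java on the assumption that each stage only ever sees "a source" and never names its neighbours. StackSketch and Stage are invented for this example and are not Accumulo classes. The sketch also assumes that lower priority numbers sit closer to the data, which is the ordering Accumulo's documentation describes for configured iterators.

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class StackSketch {

    /** A reusable stage: it wraps whatever source it is given. */
    static final class Stage {
        final int priority;
        final String name;
        final Function<Stream<String>, Stream<String>> wrap;

        Stage(int priority, String name,
              Function<Stream<String>, Stream<String>> wrap) {
            this.priority = priority;
            this.name = name;
            this.wrap = wrap;
        }
    }

    /** Orders the stages by priority and feeds each one the output of the previous. */
    static List<String> run(List<String> data, List<Stage> stages) {
        List<Stage> ordered = new ArrayList<>(stages);
        ordered.sort(Comparator.comparingInt((Stage s) -> s.priority));
        Stream<String> current = data.stream();
        for (Stage s : ordered) {
            current = s.wrap.apply(current);
        }
        return current.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Declared out of order on purpose; priority decides the stacking.
        Stage upper = new Stage(20, "upper", src -> src.map(String::toUpperCase));
        Stage onlyA = new Stage(10, "onlyA", src -> src.filter(v -> v.startsWith("a")));

        List<String> out = run(Arrays.asList("apple", "banana", "avocado"),
                               Arrays.asList(upper, onlyA));
        System.out.println(out); // prints [APPLE, AVOCADO]
    }
}
```

The stacking lives in configuration (the priority numbers), not in the stages themselves, so each stage can be reused in a different stack. Swapping the two priorities would make the filter see upper-case values and return nothing.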

Re: Custom Iterators

Posted by Marc Parisi <ma...@accumulo.net>.

Here's a quick write up

    http://www.accumulo.net/node/1

RE: EXTERNAL: Re: Custom Iterators

Posted by "Cardon, Tejay E" <te...@lmco.com>.

Excellent.  I'll have to look more closely at the wikisearch code then.  That should get me most of the way to my solution.  Let me lay out the next piece of this, and please tell me if doing this in an Iterator would make sense.

My actual "query" is more than just an Or-ing of index terms/values.  It's actually looking for a probability of match.  So to expand on the earlier example:

personID, gender, hair color, country, race, personRecord

Row:binID; ColFam:Attribute_AttributeValue; ColQ:PersonID; Val:blank
AND
Row:binID; ColFam:"D"; ColQ:personID; value:personRecord

My query would be a lookup table where the lookupKey would be attribute/attribute_value combinations and the lookupValue would be score (probability) for that attribute/attribute_value pair.  So something like:

Attribute       | Score
Gender:male     | 10
Gender:female   | 30
Hair:brown      | 30
Hair:blond      | 80

I intend to write an iterator that will use that lookup table as input (along with a threshold).  My iterator would then return only those records where the sum of the scores is greater than the threshold.  Because the lookup matrix is sparsely populated (no scores under 5 are included), I would start with an ORing iterator that only returns records that contain at least one attribute that has a score.  Then, for only those records that have at least some score, I would filter out any that didn't reach the threshold.

One final iterator would sit at the top of the stack.  It would take the records which passed the threshold, extract the actual document, run it through a more detailed filter, and return as a final result only the records which pass this final filter.  

The goal here is to keep all of the processing on the server side, and if possible, do it all in one stack of iterators so as to avoid passing intermediate results across the network.

Is this a reasonable use of iterators?  Or am I taking an entirely inappropriate approach to the problem?

Thanks,
Tejay Cardon  
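
A minimal sketch of the scoring step described above, in plain Java rather than as a real iterator: sum the lookup-table scores for each record's attributes and keep the record only if the sum exceeds the threshold. The scores are the ones from the table above. The person IDs, the "Hair:red" attribute, the threshold of 35, and the class and method names are assumptions for illustration.

```java
import java.util.*;

final class ScoreThresholdSketch {

    /**
     * attrsByPerson: personID -> attribute:value strings indexed for that person.
     * scores:        sparse lookup table; a missing entry means "no score".
     * Returns the personIDs whose summed score is greater than the threshold.
     */
    static List<String> matches(Map<String, Set<String>> attrsByPerson,
                                Map<String, Integer> scores,
                                int threshold) {
        // Sort by personID so the output order is deterministic.
        SortedMap<String, Set<String>> sorted = new TreeMap<>(attrsByPerson);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : sorted.entrySet()) {
            int sum = 0;
            boolean anyScored = false; // the "OR" pre-filter: at least one scored attribute
            for (String attr : e.getValue()) {
                Integer s = scores.get(attr);
                if (s != null) {
                    sum += s;
                    anyScored = true;
                }
            }
            if (anyScored && sum > threshold) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> scores = new HashMap<>();
        scores.put("Gender:male", 10);
        scores.put("Gender:female", 30);
        scores.put("Hair:brown", 30);
        scores.put("Hair:blond", 80);

        // Hypothetical records.
        Map<String, Set<String>> people = new HashMap<>();
        people.put("p1", new HashSet<>(Arrays.asList("Gender:male", "Hair:blond"))); // 90
        people.put("p2", new HashSet<>(Arrays.asList("Gender:male", "Hair:brown"))); // 40
        people.put("p3", new HashSet<>(Arrays.asList("Gender:male", "Hair:red")));   // 10
        people.put("p4", new HashSet<>(Arrays.asList("Hair:red")));                  // no score

        System.out.println(matches(people, scores, 35)); // prints [p1, p2]
    }
}
```

In an iterator stack the same two checks would be separate stages: the OR-style stage drops records with no scored attribute, and the threshold stage sums and filters what remains before the final document-level filter runs.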

Re: Custom Iterators

Posted by Josh Elser <jo...@gmail.com>.

Err, double (triple) reply:

No, you are incorrect. The wikisearch example can handle any arbitrary 
boolean expression containing NOT, AND, and OR. As always, I'll preface 
it the same as Bill did: it *should* be able to handle them :).

I know that cleaning-up/reworking the Wikisearch code is in the works. 
I'm just not positive about the timeframe.

As far as examples, I'd push you to the write-up Eric did after 
benchmarking the wikisearch example: 
http://accumulo.apache.org/example/wikisearch.html

He has some example queries that give the basic idea behind what's 
supported (minus the NOTs)

On 08/22/2012 05:27 PM, Cardon, Tejay E wrote:
>
> Josh,
>
> Thanks for getting back to me so quickly. I explained in my lengthy 
> reply to William that the comment on OrIterator.TermSource.compareTo 
> indicates that implementations with more than one row per tablet need 
> to compare row key first (and that is not being done in this code). It 
> may be that it’s not an issue and I’m simply misunderstanding 
> something. As for the wikisearch example, as I understood it, it could 
> only handle searches for “anded” terms. If that’s not the case, then 
> an example of an or search would be helpful. In any case, I’d love a 
> deeper dive on the wikisearch somewhere. I get the source code and a 
> high level explanation of what’s happening, but I’d love a tutorial or 
> something that walks through the classes and explains how each one 
> contributes to the functionality. Don’t consider that a request (that 
> would be a lot more to ask than I’m willing to ask), but I would 
> certainly find it useful if it does exist.
>
> Thanks,
>
> Tejay
>
> *From:*Josh Elser [mailto:josh.elser@gmail.com]
> *Sent:* Wednesday, August 22, 2012 2:53 PM
> *To:* user@accumulo.apache.org
> *Subject:* EXTERNAL: Re: Custom Iterators
>
> What makes you say that the OrIterator cannot handle more than one row 
> per tablet? Can you provide details?
>
> AFAIK, the OrIterator should work correctly in all cases (e.g. 
> regardless of row distribution in a tablet). Any issues in the code 
> that prevent it from doing so would be a bug that should be fixed.
>
> Also, the wikisearch example supports indexing over multiple 
> attributes (and I believe indexes document metadata in addition to the 
> tokenized document). Is there something unclear that could be better 
> documented?
>
> On 8/22/12 4:41 PM, Cardon, Tejay E wrote:
>
>     All,
>
>     I’m interested in writing a custom iterator, and I’ve been looking
>     for documentation on how to do so. Thus far, I’ve not been able to
>     find anything beyond the java docs in SortedKeyValueIterator and a
>     few other sub-classes. A few of the examples use Iterators, but
>     provide no real info on how to properly implement one. Is there
>     anywhere to find general guidance on the iterator stack?
>
>     (If you’re interested)
>
>     Specifically, for those that are curious, I’m trying to implement
>     something similar to the wikisearch example, but with some key
>     differences. In my case, I’ve got a file with various attributes
>     that being indexed. So for each file there are 5 attributes, and
>     each attribute has a fixed number of possible values. For example
>     (totally made up):
>
>     personID, gender, hair color, country, race, personRecord
>
>     Row:binID; ColFam:Attribute_AttributeValue; ColQ:PersonID; Val:blank
>
>     AND
>     Row:binID; ColFam:”D”; ColQ:personID; value:personRecord
>
>     A typical query would be:
>
>     Give me the personRecord for all people with:
>
>     Gender: male &
>
>     Hair color: blond or brown &
>
>     Country: USA or England or china or korea &
>
>     Race: white or oriental
>
>     The existing Iterators used in the wikisearch example are unable
>     to handle the “or” clauses in each attribute.
>
>     The OrIterator doesn’t appear to handle the possibility more than
>     one row per tablet
>
>     Thanks,
>
>     Tejay Cardon
>

RE: EXTERNAL: Re: Custom Iterators

Posted by "Cardon, Tejay E" <te...@lmco.com>.

Josh,
Thanks for getting back to me so quickly.  I explained in my lengthy reply to William that the comment on OrIterator.TermSource.compareTo indicates that implementations with more than one row per tablet need to compare row key first (and that is not being done in this code).  It may be that it's not an issue and I'm simply misunderstanding something.  As for the wikisearch example, as I understood it, it could only handle searches for "anded" terms.  If that's not the case, then an example of an OR search would be helpful.  In any case, I'd love a deeper dive on the wikisearch somewhere.  I get the source code and a high-level explanation of what's happening, but I'd love a tutorial or something that walks through the classes and explains how each one contributes to the functionality.  Don't consider that a request (that would be a lot more to ask than I'm willing to ask), but I would certainly find it useful if it does exist.

Thanks,
Tejay

From: Josh Elser [mailto:josh.elser@gmail.com]
Sent: Wednesday, August 22, 2012 2:53 PM
To: user@accumulo.apache.org
Subject: EXTERNAL: Re: Custom Iterators

What makes you say that the OrIterator cannot handle more than one row per tablet? Can you provide details?

AFAIK, the OrIterator should work correctly in all cases (e.g. regardless of row distribution in a tablet). Any issues in the code that prevent it from doing so would be a bug that should be fixed.

Also, the wikisearch example supports indexing over multiple attributes (and I believe indexes document metadata in addition to the tokenized document). Is there something unclear that could be better documented?
On 8/22/12 4:41 PM, Cardon, Tejay E wrote:
All,
I'm interested in writing a custom iterator, and I've been looking for documentation on how to do so.  Thus far, I've not been able to find anything beyond the java docs in SortedKeyValueIterator and a few other sub-classes.  A few of the examples use Iterators, but provide no real info on how to properly implement one.  Is there anywhere to find general guidance on the iterator stack?

(If you're interested)
Specifically, for those that are curious, I'm trying to implement something similar to the wikisearch example, but with some key differences.  In my case, I've got a file with various attributes that being indexed.  So for each file there are 5 attributes, and each attribute has a fixed number of possible values.  For example (totally made up):
personID, gender, hair color, country, race, personRecord

Row:binID; ColFam:Attribute_AttributeValue; ColQ:PersonID; Val:blank
AND
Row:binID; ColFam:"D"; ColQ:personID; value:personRecord

A typical query would be:
Give me the personRecord for all people with:
Gender: male &
Hair color: blond or brown &
Country: USA or England or china or korea &
Race: white or oriental

The existing Iterators used in the wikisearch example are unable to handle the "or" clauses in each attribute.
The OrIterator doesn't appear to handle the possibility more than one row per tablet

Thanks,
Tejay Cardon

Re: Custom Iterators

Posted by Josh Elser <jo...@gmail.com>.

What makes you say that the OrIterator cannot handle more than one row 
per tablet? Can you provide details?

AFAIK, the OrIterator should work correctly in all cases (e.g. 
regardless of row distribution in a tablet). Any issues in the code that 
prevent it from doing so would be a bug that should be fixed.

Also, the wikisearch example supports indexing over multiple attributes 
(and I believe indexes document metadata in addition to the tokenized 
document). Is there something unclear that could be better documented?

On 8/22/12 4:41 PM, Cardon, Tejay E wrote:
>
> All,
>
> I'm interested in writing a custom iterator, and I've been looking for 
> documentation on how to do so.  Thus far, I've not been able to find 
> anything beyond the java docs in SortedKeyValueIterator and a few 
> other sub-classes.  A few of the examples use Iterators, but provide 
> no real info on how to properly implement one.  Is there anywhere to 
> find general guidance on the iterator stack?
>
> (If you're interested)
>
> Specifically, for those that are curious, I'm trying to implement 
> something similar to the wikisearch example, but with some key 
> differences.  In my case, I've got a file with various attributes that 
> being indexed.  So for each file there are 5 attributes, and each 
> attribute has a fixed number of possible values.  For example (totally 
> made up):
>
> personID, gender, hair color, country, race, personRecord
>
> Row:binID; ColFam:Attribute_AttributeValue; ColQ:PersonID; Val:blank
>
> AND
> Row:binID; ColFam:"D"; ColQ:personID; value:personRecord
>
> A typical query would be:
>
> Give me the personRecord for all people with:
>
> Gender: male &
>
> Hair color: blond or brown &
>
> Country: USA or England or china or korea &
>
> Race: white or oriental
>
> The existing Iterators used in the wikisearch example are unable to 
> handle the "or" clauses in each attribute.
>
> The OrIterator doesn't appear to handle the possibility more than one 
> row per tablet
>
> Thanks,
>
> Tejay Cardon
>

Re: EXTERNAL: Re: Custom Iterators

Posted by Billie Rinaldi <bi...@apache.org>.

On Wed, Aug 22, 2012 at 3:22 PM, Cardon, Tejay E <te...@lmco.com> wrote:

>  Why do some iterators have so many constructors if the system will
> simply construct them from the default constructor?
>
> Some iterators (such as OrIterator) throw an exception if init is called.
> How do these iterators get constructed and initialized?
>

Some iterators are "system" iterators and the tserver uses their special
constructors directly. These iterators have been moved to the
iterators.system package as of 1.4. Prior to 1.4, some user iterators had
constructors for testing purposes, but we have since tried to move towards
testing them in the way they will be used, i.e. through the default
constructor and passing configuration in the init method. This has been
accompanied by the use of static methods to make configuring an
IteratorSetting easier.
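
The pattern described above (default constructor, configuration passed to init, a static helper that fills in the options) can be sketched roughly as follows. This is only an illustration written against simplified stand-in types so that it compiles on its own; SketchSetting, PrefixFilterSketch, LifecycleSketchDemo and the "prefix" option key are all hypothetical names, not Accumulo classes. In Accumulo itself the corresponding pieces would be IteratorSetting and the init method of SortedKeyValueIterator.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-iterator configuration object.
final class SketchSetting {
    final String iteratorClassName;
    final Map<String, String> options = new HashMap<>();

    SketchSetting(String iteratorClassName) {
        this.iteratorClassName = iteratorClassName;
    }
}

// A "user" iterator in the style described: default constructor, configured in init().
class PrefixFilterSketch {
    static final String PREFIX_OPT = "prefix"; // hypothetical option key

    private String prefix = "";

    // The server side builds the iterator reflectively, so a no-arg constructor is required.
    public PrefixFilterSketch() {}

    // Static helper so callers never have to spell the option key themselves.
    static void setPrefix(SketchSetting setting, String prefix) {
        setting.options.put(PREFIX_OPT, prefix);
    }

    // All configuration arrives as string options.
    void init(Map<String, String> options) {
        String configured = options.get(PREFIX_OPT);
        if (configured == null) {
            throw new IllegalArgumentException("missing option: " + PREFIX_OPT);
        }
        prefix = configured;
    }

    boolean accept(String columnFamily) {
        return columnFamily.startsWith(prefix);
    }
}

final class LifecycleSketchDemo {
    // Roughly what a server would do: instantiate by class name, then call init().
    static PrefixFilterSketch build(SketchSetting setting) {
        try {
            PrefixFilterSketch iterator = (PrefixFilterSketch) Class
                .forName(setting.iteratorClassName)
                .getDeclaredConstructor()
                .newInstance();
            iterator.init(setting.options);
            return iterator;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not construct iterator", e);
        }
    }

    public static void main(String[] args) {
        SketchSetting setting = new SketchSetting(PrefixFilterSketch.class.getName());
        PrefixFilterSketch.setPrefix(setting, "gender_");
        PrefixFilterSketch iterator = build(setting);
        System.out.println(iterator.accept("gender_male"));  // prints true
        System.out.println(iterator.accept("country_USA"));  // prints false
    }
}
```

Testing through build() rather than through a special constructor exercises the same path the iterator takes in a real scan, which is the point of the move away from test-only constructors.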

Billie



> If OrIterator can do what I’m asking for, how do I get it the “terms” and
> what format do they come in?  You mentioned JEXL expressions, but I haven’t
> seen anything about them in the documentation.
>
> As for my statement about the OrIterator and multiple rows, the comments
> on the compareTo for OrIterator.TermSource state “If your implementation
> can have more than one row in a tablet, you must compare row key here
> first, then column qualifier.”  But the code does not do so.  It may be
> that I’m just not fully understanding the code, however.
>
> Finally, I’m actually trying to do something a little more complex than
> just what I described below.  This reply is already too long and had too
> many questions in it, but I’ll get more detail out after I have a better
> handle on how the iterator framework works.
>
> Thanks,
>
> Tejay
>
> *From:* William Slacum [mailto:wilhelm.von.cloud@accumulo.net]
> *Sent:* Wednesday, August 22, 2012 3:00 PM
> *To:* user@accumulo.apache.org
> *Subject:* EXTERNAL: Re: Custom Iterators
>
> An or clause should be able to handle an enumeration of values, as that's
> supported in a JEXL expression. It would not, however, surprise me if those
> iterators could not handle multiple rows in a tablet. If you can reproduce
> that, please file a ticket. There will be a large update occurring to the
> Wiki example in the near future.
>
> Do you have any specific questions about how you should structure your
> iterator or the contract? Making a tutorial has been on my to do list, but
> we all know how to do lists end up...
>
> The big things to remember are:
>
> 1) The call order: Your iterator will be created via the default
> constructor, init() will be called, then seek(). After seek() is called,
> your iterator should have a top if there is data available. A client then
> can call hasTop(), getTopKey() and getTopValue() to check and retrieve data
> (similar to hasNext() and next()) and then next to advance the pointer.
>
> 2) Your iterator can be destroyed during a scan and then reconstructed,
> being passed in the last key returned to the client as the start of the
> range.
>
> 3) You can have multiple sources feed into a single iterator in a tree
> like fashion by clone()'ing the source passed in to init.
>
> On Wed, Aug 22, 2012 at 1:41 PM, Cardon, Tejay E <te...@lmco.com>
> wrote:
>
> All,
>
> I’m interested in writing a custom iterator, and I’ve been looking for
> documentation on how to do so.  Thus far, I’ve not been able to find
> anything beyond the java docs in SortedKeyValueIterator and a few other
> sub-classes.  A few of the examples use Iterators, but provide no real info
> on how to properly implement one.  Is there anywhere to find general
> guidance on the iterator stack?
>
> (If you’re interested)
>
> Specifically, for those that are curious, I’m trying to implement
> something similar to the wikisearch example, but with some key
> differences.  In my case, I’ve got a file with various attributes that
> being indexed.  So for each file there are 5 attributes, and each attribute
> has a fixed number of possible values.  For example (totally made up):
>
> personID, gender, hair color, country, race, personRecord
>
> Row:binID; ColFam:Attribute_AttributeValue; ColQ:PersonID; Val:blank
>
> AND
> Row:binID; ColFam:“D”; ColQ:personID; value:personRecord
>
> A typical query would be:
>
> Give me the personRecord for all people with:
>
> Gender: male &
>
> Hair color: blond or brown &
>
> Country: USA or England or china or korea &
>
> Race: white or oriental
>
> The existing Iterators used in the wikisearch example are unable to handle
> the “or” clauses in each attribute.
>
> The OrIterator doesn’t appear to handle the possibility more than one row
> per tablet
>
> Thanks,
>
> Tejay Cardon
>

RE: EXTERNAL: Re: Custom Iterators

Posted by "Cardon, Tejay E" <te...@lmco.com>.

And I'm actually looking at the OrIterator in 1.4.1.  I really need to pull trunk just for the additional insights it may give me, but ultimately I'll be running on the 1.4.1 release.

Tejay

-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com] 
Sent: Wednesday, August 22, 2012 5:55 PM
To: user@accumulo.apache.org
Subject: Re: EXTERNAL: Re: Custom Iterators

... and I just realized I was looking at the OrIterator in trunk, not contrib/wikisearch x.x

Still, I think most of my comments still apply. Should verify with test cases...

On 08/22/2012 06:44 PM, Josh Elser wrote:
> You could compare clone()'ing multiple sources inside of an iterator 
> to maintaining multiple pointers at different offsets to a file on 
> disk. The clone()'ed iterators are all operating over the same row; 
> however, they are all pointing at different offsets (keys).
>
> Concretely, the OrIterator is sent a list of terms to union, and 
> clone()'s the source it was given for each term (note the addTerm() 
> method on the class). The OrIterator attempts to find the index 
> entries for each term, and return the minimum docid to satisfy the 
> SortedKeyValueIterator contract.
>
> Given your comment on the TermSource.compareTo() method's comment 
> (....), yes, it does appear that you have found a bug. That comment 
> about "multiple rows in a tablet" should really be removed, IMO. It's 
> rather confusing, and shouldn't matter when you're writing an 
> iterator. In other words, you, as a developer, don't need to know what 
> rows are contained in a tablet. The only issue you need to worry about 
> is if you're trying to do some operation *across* rows. Given that all 
> of the index entries for a single document are contained in one row 
> (which happens to just be a bucket in the Wiki application), this 
> point is meaningless.
>
> You might also note that the next() method on the OrIterator doesn't 
> check if the new topKey for the term it just advanced is contained in 
> the current Range before adding it back to the PriorityQueue. This 
> could cause a term who has passed outside of the initial Range 
> provided to seek() to be added unnecessarily to said PriorityQueue.
>
> +2 bugs
>
> On 08/22/2012 05:22 PM, Cardon, Tejay E wrote:
>>
>> William,
>>
>> Thanks for the quick response. Let me start by stating what I 
>> understand about Iterators (to be sure I'm not completely off my 
>> rocker).
>>
>> 1. An iterator receives, as its source, another iterator (by way of 
>> the init method), which becomes its source of data.
>>
>> 2. When seek is called on an iterator, the iterator should respond by 
>> moving the pointer to the first key/value that applied to that 
>> iterator and is within the range
>>
>> a. Depending on the iterator, that may not be the first key in the 
>> range
>>
>> b. Only keys (and their corresponding values) which include one of 
>> the column families listed in the family list should be available as 
>> topKey and topValue. (this restriction should continue until seek is 
>> called again, meaning that subsequent calls to next will only proceed 
>> to key/values that also match the list provided.
>>
>> c. Generally speaking, a seek will result in the iterator calling 
>> seek on its source iterator (although the parameters passed in may be
>> different)
>>
>> 3. If an iterator needs configuration beyond just the source obtained 
>> in the init call, it can get that through the options and/or env.
>>
>> 4. Iterators do not necessarily return the same types of key/values 
>> as they consume. ie, a Combiner may call next() and getTopValue 
>> multiple times each time those methods are called on it. And the 
>> value it returns as topKey may be a key that doesn't actually exist 
>> in the datastore itself.
>>
>> So my questions:
>>
>> Is it correct that once seek is called, only topKeys that conform to 
>> the columnFamilies collection should be returned. And that this 
>> behavior persists until seek is called again, even when next has been 
>> called?
>>
>> How do iterators like the OrIterator obtain multiple sources? (I 
>> assume you were trying to address that with #3 in your response, but 
>> I don't understand what you mean by clone()ing the source. That would 
>> give me copies of the one source, but not multiple sources)
>>
>> Why do some iterators have so many constructors if the system will 
>> simply construct them from the default constructor?
>>
>> Some iterators (such as OrIterator) throw an exception if init is 
>> called. How do these iterators get constructed and initialized?
>>
>> If OrIterator can do what I'm asking for, how do I get it the "terms" 
>> and what format do they come in? You mentioned JEXL expressions, but 
>> I haven't seen anything about them in the documentation.
>>
>> As for my statement about the OrIterator and multiple rows, the 
>> comments on the compareTo for OrIterator.TermSource state "If your 
>> implementation can have more than one row in a tablet, you must 
>> compare row key here first, then column qualifier." But the code does 
>> not do so. It may be that I'm just not fully understanding the code, 
>> however.
>>
>> Finally, I'm actually trying to do something a little more complex 
>> than just what I described below. This reply is already too long and 
>> had too many questions in it, but I'll get more detail out after I 
>> have a better handle on how the iterator framework works.
>>
>>
>> Thanks,
>>
>> Tejay
>>
>> *From:*William Slacum [mailto:wilhelm.von.cloud@accumulo.net]
>> *Sent:* Wednesday, August 22, 2012 3:00 PM
>> *To:* user@accumulo.apache.org
>> *Subject:* EXTERNAL: Re: Custom Iterators
>>
>> An or clause should be able to handle an enumeration of values, as 
>> that's supported in a JEXL expression. It would not, however, 
>> surprise me if those iterators could not handle multiple rows in a 
>> tablet. If you can reproduce that, please file a ticket. There will 
>> be a large update occurring to the Wiki example in the near future.
>>
>> Do you have any specific questions about how you should structure 
>> your iterator or the contract? Making a tutorial has been on my to do 
>> list, but we all know how to do lists end up...
>>
>> The big things to remember are:
>>
>> 1) The call order: Your iterator will be created via the default 
>> constructor, init() will be called, then seek(). After seek() is 
>> called, your iterator should have a top if there is data available. A 
>> client then can call hasTop(), getTopKey() and getTopValue() to check 
>> and retrieve data (similar to hasNext() and next()) and then next to 
>> advance the pointer.
>>
>> 2) Your iterator can be destroyed during a scan and then 
>> reconstructed, being passed in the last key returned to the client as 
>> the start of the range.
>>
>> 3) You can have multiple sources feed into a single iterator in a 
>> tree like fashion by clone()'ing the source passed in to init.
>>
>> On Wed, Aug 22, 2012 at 1:41 PM, Cardon, Tejay E 
>> <tejay.e.cardon@lmco.com <ma...@lmco.com>> wrote:
>>
>> All,
>>
>> I'm interested in writing a custom iterator, and I've been looking 
>> for documentation on how to do so. Thus far, I've not been able to 
>> find anything beyond the java docs in SortedKeyValueIterator and a 
>> few other sub-classes. A few of the examples use Iterators, but 
>> provide no real info on how to properly implement one. Is there 
>> anywhere to find general guidance on the iterator stack?
>>
>> (If you're interested)
>>
>> Specifically, for those that are curious, I'm trying to implement 
>> something similar to the wikisearch example, but with some key 
>> differences. In my case, I've got a file with various attributes that 
>> being indexed. So for each file there are 5 attributes, and each 
>> attribute has a fixed number of possible values. For example (totally 
>> made up):
>>
>> personID, gender, hair color, country, race, personRecord
>>
>> Row:binID; ColFam:Attribute_AttributeValue; ColQ:PersonID; Val:blank
>>
>> AND
>> Row:binID; ColFam:"D"; ColQ:personID; value:personRecord
>>
>> A typical query would be:
>>
>> Give me the personRecord for all people with:
>>
>> Gender: male &
>>
>> Hair color: blond or brown &
>>
>> Country: USA or England or china or korea &
>>
>> Race: white or oriental
>>
>> The existing Iterators used in the wikisearch example are unable to 
>> handle the "or" clauses in each attribute.
>>
>> The OrIterator doesn't appear to handle the possibility more than one 
>> row per tablet
>>
>> Thanks,
>>
>> Tejay Cardon
>>

Re: EXTERNAL: Re: Custom Iterators

Posted by Josh Elser <jo...@gmail.com>.

... and I just realized I was looking at the OrIterator in trunk, not 
contrib/wikisearch x.x

Even so, I think most of my comments still apply. I should verify with 
test cases...

On 08/22/2012 06:44 PM, Josh Elser wrote:
> You could compare clone()'ing multiple sources inside of an iterator 
> to maintaining multiple pointers at different offsets to a file on 
> disk. The clone()'ed iterators are all operating over the same row; 
> however, they are all pointing at different offsets (keys).
>
> Concretely, the OrIterator is sent a list of terms to union, and 
> clone()'s the source it was given for each term (note the addTerm() 
> method on the class). The OrIterator attempts to find the index 
> entries for each term, and return the minimum docid to satisfy the 
> SortedKeyValueIterator contract.
>
> Given your comment on the TermSource.compareTo() method's comment 
> (....), yes, it does appear that you have found a bug. That comment 
> about "multiple rows in a tablet" should really be removed, IMO. It's 
> rather confusing, and shouldn't matter when you're writing an 
> iterator. In other words, you, as a developer, don't need to know what 
> rows are contained in a tablet. The only issue you need to worry about 
> is if you're trying to do some operation *across* rows. Given that all 
> of the index entries for a single document are contained in one row 
> (which happens to just be a bucket in the Wiki application), this 
> point is meaningless.
>
> You might also note that the next() method on the OrIterator doesn't 
> check if the new topKey for the term it just advanced is contained in 
> the current Range before adding it back to the PriorityQueue. This 
> could cause a term who has passed outside of the initial Range 
> provided to seek() to be added unnecessarily to said PriorityQueue.
>
> +2 bugs
>
> On 08/22/2012 05:22 PM, Cardon, Tejay E wrote:
>>
>> William,
>>
>> Thanks for the quick response. Let me start by stating what I 
>> understand about Iterators (to be sure I’m not completely off my 
>> rocker).
>>
>> 1. An iterator receives, as its source, another iterator (by way of 
>> the init method), which becomes its source of data.
>>
>> 2. When seek is called on an iterator, the iterator should respond by 
>> moving the pointer to the first key/value that applied to that 
>> iterator and is within the range
>>
>> a. Depending on the iterator, that may not be the first key in the range
>>
>> b. Only keys (and their corresponding values) which include one of 
>> the column families listed in the family list should be available as 
>> topKey and topValue. (this restriction should continue until seek is 
>> called again, meaning that subsequent calls to next will only proceed 
>> to key/values that also match the list provided.
>>
>> c. Generally speaking, a seek will result in the iterator calling 
>> seek on its source iterator (although the parameters passed in may be 
>> different)
>>
>> 3. If an iterator needs configuration beyond just the source obtained 
>> in the init call, it can get that through the options and/or env.
>>
>> 4. Iterators do not necessarily return the same types of key/values 
>> as they consume. ie, a Combiner may call next() and getTopValue 
>> multiple times each time those methods are called on it. And the 
>> value it returns as topKey may be a key that doesn’t actually exist 
>> in the datastore itself.
>>
>> So my questions:
>>
>> Is it correct that once seek is called, only topKeys that conform to 
>> the columnFamilies collection should be returned. And that this 
>> behavior persists until seek is called again, even when next has been 
>> called?
>>
>> How do iterators like the OrIterator obtain multiple sources? (I 
>> assume you were trying to address that with #3 in your response, but 
>> I don’t understand what you mean by clone()ing the source. That would 
>> give me copies of the one source, but not multiple sources)
>>
>> Why do some iterators have so many constructors if the system will 
>> simply construct them from the default constructor?
>>
>> Some iterators (such as OrIterator) throw an exception if init is 
>> called. How do these iterators get constructed and initialized?
>>
>> If OrIterator can do what I’m asking for, how do I get it the “terms” 
>> and what format do they come in? You mentioned JEXL expressions, but 
>> I haven’t seen anything about them in the documentation.
>>
>> As for my statement about the OrIterator and multiple rows, the 
>> comments on the compareTo for OrIterator.TermSource state “If your 
>> implementation can have more than one row in a tablet, you must 
>> compare row key here first, then column qualifier.” But the code does 
>> not do so. It may be that I’m just not fully understanding the code, 
>> however.
>>
>> Finally, I’m actually trying to do something a little more complex 
>> than just what I described below. This reply is already too long and 
>> had too many questions in it, but I’ll get more detail out after I 
>> have a better handle on how the iterator framework works.
>>
>>
>> Thanks,
>>
>> Tejay
>>
>> *From:*William Slacum [mailto:wilhelm.von.cloud@accumulo.net]
>> *Sent:* Wednesday, August 22, 2012 3:00 PM
>> *To:* user@accumulo.apache.org
>> *Subject:* EXTERNAL: Re: Custom Iterators
>>
>> An or clause should be able to handle an enumeration of values, as 
>> that's supported in a JEXL expression. It would not, however, 
>> surprise me if those iterators could not handle multiple rows in a 
>> tablet. If you can reproduce that, please file a ticket. There will 
>> be a large update occurring to the Wiki example in the near future.
>>
>> Do you have any specific questions about how you should structure 
>> your iterator or the contract? Making a tutorial has been on my to do 
>> list, but we all know how to do lists end up...
>>
>> The big things to remember are:
>>
>> 1) The call order: Your iterator will be created via the default 
>> constructor, init() will be called, then seek(). After seek() is 
>> called, your iterator should have a top if there is data available. A 
>> client then can call hasTop(), getTopKey() and getTopValue() to check 
>> and retrieve data (similar to hasNext() and next()) and then next to 
>> advance the pointer.
>>
>> 2) Your iterator can be destroyed during a scan and then 
>> reconstructed, being passed in the last key returned to the client as 
>> the start of the range.
>>
>> 3) You can have multiple sources feed into a single iterator in a 
>> tree like fashion by clone()'ing the source passed in to init.
>>
>> On Wed, Aug 22, 2012 at 1:41 PM, Cardon, Tejay E 
>> <tejay.e.cardon@lmco.com <ma...@lmco.com>> wrote:
>>
>> All,
>>
>> I’m interested in writing a custom iterator, and I’ve been looking 
>> for documentation on how to do so. Thus far, I’ve not been able to 
>> find anything beyond the java docs in SortedKeyValueIterator and a 
>> few other sub-classes. A few of the examples use Iterators, but 
>> provide no real info on how to properly implement one. Is there 
>> anywhere to find general guidance on the iterator stack?
>>
>> (If you’re interested)
>>
>> Specifically, for those that are curious, I’m trying to implement 
>> something similar to the wikisearch example, but with some key 
>> differences. In my case, I’ve got a file with various attributes that 
>> being indexed. So for each file there are 5 attributes, and each 
>> attribute has a fixed number of possible values. For example (totally 
>> made up):
>>
>> personID, gender, hair color, country, race, personRecord
>>
>> Row:binID; ColFam:Attribute_AttributeValue; ColQ:PersonID; Val:blank
>>
>> AND
>> Row:binID; ColFam:”D”; ColQ:personID; value:personRecord
>>
>> A typical query would be:
>>
>> Give me the personRecord for all people with:
>>
>> Gender: male &
>>
>> Hair color: blond or brown &
>>
>> Country: USA or England or china or korea &
>>
>> Race: white or oriental
>>
>> The existing Iterators used in the wikisearch example are unable to 
>> handle the “or” clauses in each attribute.
>>
>> The OrIterator doesn’t appear to handle the possibility more than one 
>> row per tablet
>>
>> Thanks,
>>
>> Tejay Cardon
>>

Re: EXTERNAL: Re: Custom Iterators

Posted by Josh Elser <jo...@gmail.com>.

You could compare clone()'ing multiple sources inside of an iterator to 
maintaining multiple pointers at different offsets to a file on disk. 
The clone()'ed iterators are all operating over the same row; however, 
they are all pointing at different offsets (keys).

Concretely, the OrIterator is sent a list of terms to union, and 
clone()'s the source it was given for each term (note the addTerm() 
method on the class). The OrIterator attempts to find the index entries 
for each term, and return the minimum docid to satisfy the 
SortedKeyValueIterator contract.
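
As a rough illustration of the two paragraphs above (independent cursors over the same sorted data, merged so that the smallest docid comes out first), here is a minimal, self-contained sketch. It does not use the Accumulo API: IndexEntry, Cursor, TermCursor and UnionSketch are hypothetical stand-ins, copy() plays the role of clone(), and a plain sorted list stands in for one row of index entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// One index entry: a term (think column family) and a document id (column qualifier).
final class IndexEntry {
    final String term;
    final String docId;
    IndexEntry(String term, String docId) { this.term = term; this.docId = docId; }
}

// A cursor over shared sorted data. copy() gives the same data with an independent position.
final class Cursor {
    private final List<IndexEntry> data; // sorted by term, then docId
    private int pos;
    Cursor(List<IndexEntry> data) { this.data = data; }
    Cursor copy() { return new Cursor(data); }
    // Position at the first entry whose term is >= the requested term.
    void seek(String term) {
        pos = 0;
        while (pos < data.size() && data.get(pos).term.compareTo(term) < 0) pos++;
    }
    boolean hasTop() { return pos < data.size(); }
    IndexEntry top() { return data.get(pos); }
    void next() { pos++; }
}

// Pairs a term with its own cursor; valid only while the cursor still points at that term.
final class TermCursor {
    final String term;
    final Cursor cursor;
    TermCursor(String term, Cursor cursor) { this.term = term; this.cursor = cursor; }
    boolean valid() { return cursor.hasTop() && cursor.top().term.equals(term); }
    String docId() { return cursor.top().docId; }
}

final class UnionSketch {
    // Returns the sorted, de-duplicated docIds that match any of the terms.
    static List<String> union(Cursor source, List<String> terms) {
        PriorityQueue<TermCursor> queue =
            new PriorityQueue<>((a, b) -> a.docId().compareTo(b.docId()));
        for (String term : terms) {
            TermCursor tc = new TermCursor(term, source.copy());
            tc.cursor.seek(term);
            if (tc.valid()) queue.add(tc);
        }
        List<String> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            TermCursor smallest = queue.poll(); // cursor with the minimum docId
            String docId = smallest.docId();
            if (result.isEmpty() || !result.get(result.size() - 1).equals(docId)) {
                result.add(docId);
            }
            smallest.cursor.next();
            if (smallest.valid()) queue.add(smallest); // re-queue only while still on its term
        }
        return result;
    }

    static List<IndexEntry> sampleIndex() {
        return Arrays.asList(
            new IndexEntry("hair_blond", "d1"), new IndexEntry("hair_blond", "d4"),
            new IndexEntry("hair_brown", "d2"), new IndexEntry("hair_brown", "d4"),
            new IndexEntry("hair_red", "d3"));
    }

    public static void main(String[] args) {
        Cursor source = new Cursor(sampleIndex());
        System.out.println(union(source, Arrays.asList("hair_blond", "hair_brown")));
        // prints [d1, d2, d4]
    }
}
```

Each cursor is removed from the queue before it is advanced and only re-added afterwards, so the queue's ordering is never invalidated by a position change.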

Regarding the code comment on TermSource.compareTo() that you quoted, 
yes, it does appear that you have found a bug. That comment about 
"multiple rows in a tablet" should really be removed, IMO. It's rather 
confusing, and it shouldn't matter when you're writing an iterator. In 
other words, you, as a developer, don't need to know what rows are 
contained in a tablet. The only case you need to worry about is when 
you're trying to do some operation *across* rows. Given that all of the 
index entries for a single document are contained in one row (which 
happens to just be a bucket in the Wiki application), this point is 
meaningless.
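
For reference, the ordering that the code comment asks for ("compare row key here first, then column qualifier") might look like the sketch below. It uses plain strings rather than Accumulo's Key type, and SketchKey and RowThenQualifier are hypothetical names. When every entry lives in the same row, the row comparison always returns 0 and the result comes down to the qualifier alone, which is why the distinction does not matter in the single-row case described above.

```java
// Hypothetical stand-in for the parts of a key that matter here.
final class SketchKey {
    final String row;
    final String qualifier;
    SketchKey(String row, String qualifier) {
        this.row = row;
        this.qualifier = qualifier;
    }
}

final class RowThenQualifier {
    // Compare by row first; fall back to the column qualifier only within a row.
    static int compare(SketchKey a, SketchKey b) {
        int byRow = a.row.compareTo(b.row);
        return byRow != 0 ? byRow : a.qualifier.compareTo(b.qualifier);
    }

    public static void main(String[] args) {
        SketchKey early = new SketchKey("bin1", "person9");
        SketchKey late = new SketchKey("bin2", "person1");
        // A qualifier-only comparison would order these the other way round.
        System.out.println(compare(early, late) < 0); // prints true
    }
}
```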

You might also note that the next() method on the OrIterator doesn't 
check whether the new topKey for the term it just advanced is contained 
in the current Range before adding it back to the PriorityQueue. This 
could cause a term that has passed outside of the initial Range provided 
to seek() to be added unnecessarily to said PriorityQueue.
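
The missing check can be sketched like this. Again, this is not the OrIterator code, only a self-contained model of the idea: BoundedSource and RangeCheckedMerge are hypothetical, integer ids stand in for keys, and an inclusive [start, end] pair stands in for the Range passed to seek(). Duplicate ids across sources are not collapsed here, to keep the focus on the range test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// A source of sorted ids with a movable position.
final class BoundedSource {
    private final int[] sortedIds;
    private int pos;
    BoundedSource(int... sortedIds) { this.sortedIds = sortedIds; }
    boolean hasTop() { return pos < sortedIds.length; }
    int top() { return sortedIds[pos]; }
    void next() { pos++; }
    // Position at the first id >= start.
    void seek(int start) {
        pos = 0;
        while (hasTop() && top() < start) pos++;
    }
}

final class RangeCheckedMerge {
    // Merges all sources in id order, restricted to [start, endInclusive].
    static List<Integer> scan(List<BoundedSource> sources, int start, int endInclusive) {
        PriorityQueue<BoundedSource> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.top(), b.top()));
        for (BoundedSource source : sources) {
            source.seek(start);
            if (inRange(source, endInclusive)) queue.add(source);
        }
        List<Integer> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            BoundedSource smallest = queue.poll();
            out.add(smallest.top());
            smallest.next();
            // The check in question: a source that has run past the end of the
            // range is dropped instead of being put back on the queue.
            if (inRange(smallest, endInclusive)) queue.add(smallest);
        }
        return out;
    }

    private static boolean inRange(BoundedSource source, int endInclusive) {
        return source.hasTop() && source.top() <= endInclusive;
    }

    public static void main(String[] args) {
        List<BoundedSource> sources =
            Arrays.asList(new BoundedSource(1, 5, 9), new BoundedSource(3, 7, 20));
        System.out.println(scan(sources, 2, 8)); // prints [3, 5, 7]
    }
}
```

Without the inRange test after next(), the ids 9 and 20 would go back on the queue and be returned even though they lie beyond the requested range.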

+2 bugs

On 08/22/2012 05:22 PM, Cardon, Tejay E wrote:
>
> William,
>
> Thanks for the quick response. Let me start by stating what I 
> understand about Iterators (to be sure I’m not completely off my rocker).
>
> 1. An iterator receives, as its source, another iterator (by way of 
> the init method), which becomes its source of data.
>
> 2. When seek is called on an iterator, the iterator should respond by 
> moving the pointer to the first key/value that applied to that 
> iterator and is within the range
>
> a. Depending on the iterator, that may not be the first key in the range
>
> b. Only keys (and their corresponding values) which include one of the 
> column families listed in the family list should be available as 
> topKey and topValue. (this restriction should continue until seek is 
> called again, meaning that subsequent calls to next will only proceed 
> to key/values that also match the list provided.
>
> c. Generally speaking, a seek will result in the iterator calling seek 
> on its source iterator (although the parameters passed in may be 
> different)
>
> 3. If an iterator needs configuration beyond just the source obtained 
> in the init call, it can get that through the options and/or env.
>
> 4. Iterators do not necessarily return the same types of key/values as 
> they consume. ie, a Combiner may call next() and getTopValue multiple 
> times each time those methods are called on it. And the value it 
> returns as topKey may be a key that doesn’t actually exist in the 
> datastore itself.
>
> So my questions:
>
> Is it correct that once seek is called, only topKeys that conform to 
> the columnFamilies collection should be returned. And that this 
> behavior persists until seek is called again, even when next has been 
> called?
>
> How do iterators like the OrIterator obtain multiple sources? (I 
> assume you were trying to address that with #3 in your response, but I 
> don’t understand what you mean by clone()ing the source. That would 
> give me copies of the one source, but not multiple sources)
>
> As for my statement about the OrIterator and multiple rows, the 
> comments on the compareTo for OrIterator.TermSource state “If your 
> implementation can have more than one row in a tablet, you must 
> compare row key here first, then column qualifier.” But the code does 
> not do so. It may be that I’m just not fully understanding the code, 
> however.

RE: EXTERNAL: Re: Custom Iterators

Posted by "Cardon, Tejay E" <te...@lmco.com>.

William,
Thanks for the quick response.  Let me start by stating what I understand about Iterators (to be sure I'm not completely off my rocker).

1. An iterator receives another iterator (by way of the init method), which becomes its source of data.
2. When seek is called on an iterator, the iterator should respond by moving the pointer to the first key/value that applies to that iterator and is within the range.
    a. Depending on the iterator, that may not be the first key in the range.
    b. Only keys (and their corresponding values) which include one of the column families listed in the family list should be available as topKey and topValue. (This restriction should continue until seek is called again, meaning that subsequent calls to next will only proceed to key/values that also match the list provided.)
    c. Generally speaking, a seek will result in the iterator calling seek on its source iterator (although the parameters passed in may be different).
3. If an iterator needs configuration beyond just the source obtained in the init call, it can get that through the options and/or env.
4. Iterators do not necessarily return the same types of key/values as they consume, i.e., a Combiner may call next() and getTopValue() on its source multiple times each time those methods are called on it.  And the key it returns as topKey may be a key that doesn't actually exist in the datastore itself.
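
Point 4 can be pictured with a small sketch. This is plain Java rather than the Accumulo Combiner class: SummingSketch is a hypothetical name, keys are strings, values are longs, and the source is assumed to be an in-memory list already sorted by key. It only shows that one next() on the combining iterator may consume several entries from its source.

```java
import java.util.List;
import java.util.Map;

// Hypothetical summing iterator over a key-sorted source in which
// entries with the same key sit next to each other.
final class SummingSketch {
    private final List<Map.Entry<String, Long>> source;
    private int pos = 0;          // offset into the source
    private String topKey;
    private long topValue;
    private boolean hasTop = false;

    SummingSketch(List<Map.Entry<String, Long>> sortedSource) {
        this.source = sortedSource;
        next(); // position on the first combined entry, if any
    }

    boolean hasTop() { return hasTop; }
    String getTopKey() { return topKey; }
    long getTopValue() { return topValue; }

    // One call here reads every adjacent source entry sharing the same key
    // and exposes their sum as a single top key/value.
    void next() {
        if (pos >= source.size()) {
            hasTop = false;
            return;
        }
        topKey = source.get(pos).getKey();
        topValue = 0;
        while (pos < source.size() && source.get(pos).getKey().equals(topKey)) {
            topValue += source.get(pos).getValue();
            pos++;
        }
        hasTop = true;
    }
}
```

Given the source entries (a, 1), (a, 2), (b, 5), the sketch exposes (a, 3) and then (b, 5); the value 3 is never stored anywhere in the source.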


So my questions:
Is it correct that once seek is called, only topKeys that conform to the columnFamilies collection should be returned, and that this behavior persists until seek is called again, even when next has been called?
How do iterators like the OrIterator obtain multiple sources?  (I assume you were trying to address that with #3 in your response, but I don't understand what you mean by clone()ing the source.  That would give me copies of the one source, but not multiple sources)
Why do some iterators have so many constructors if the system will simply construct them from the default constructor?
Some iterators (such as OrIterator) throw an exception if init is called.  How do these iterators get constructed and initialized?

If OrIterator can do what I'm asking for, how do I pass it the "terms", and what format do they come in?  You mentioned JEXL expressions, but I haven't seen anything about them in the documentation.


As for my statement about the OrIterator and multiple rows, the comments on the compareTo for OrIterator.TermSource state "If your implementation can have more than one row in a tablet, you must compare row key here first, then column qualifier."  But the code does not do so.  It may be that I'm just not fully understanding the code, however.

Finally, I'm actually trying to do something a little more complex than just what I described below.  This reply is already too long and has too many questions in it, but I'll get more detail out after I have a better handle on how the iterator framework works.

Thanks,
Tejay

From: William Slacum [mailto:wilhelm.von.cloud@accumulo.net]
Sent: Wednesday, August 22, 2012 3:00 PM
To: user@accumulo.apache.org
Subject: EXTERNAL: Re: Custom Iterators

An or clause should be able to handle an enumeration of values, as that's supported in a JEXL expression. It would not, however, surprise me if those iterators could not handle multiple rows in a tablet. If you can reproduce that, please file a ticket. There will be a large update occurring to the Wiki example in the near future.

Do you have any specific questions about how you should structure your iterator or the contract? Making a tutorial has been on my to-do list, but we all know how to-do lists end up...

The big things to remember are:

1) The call order: Your iterator will be created via the default constructor, init() will be called, then seek(). After seek() is called, your iterator should have a top if there is data available. A client then can call hasTop(), getTopKey() and getTopValue() to check and retrieve data (similar to hasNext() and next()) and then next() to advance the pointer.

2) Your iterator can be destroyed during a scan and then reconstructed, being passed in the last key returned to the client as the start of the range.

3) You can have multiple sources feed into a single iterator in a tree-like fashion by clone()'ing the source passed in to init.

Re: Custom Iterators

Posted by William Slacum <wi...@accumulo.net>.

An or clause should be able to handle an enumeration of values, as that's
supported in a JEXL expression. It would not, however, surprise me if those
iterators could not handle multiple rows in a tablet. If you can reproduce
that, please file a ticket. There will be a large update occurring to the
Wiki example in the near future.

Do you have any specific questions about how you should structure your
iterator or the contract? Making a tutorial has been on my to-do list, but
we all know how to-do lists end up...

The big things to remember are:

1) The call order: Your iterator will be created via the default
constructor, init() will be called, then seek(). After seek() is called,
your iterator should have a top if there is data available. A client then
can call hasTop(), getTopKey() and getTopValue() to check and retrieve data
(similar to hasNext() and next()) and then next() to advance the pointer.

2) Your iterator can be destroyed during a scan and then reconstructed,
being passed in the last key returned to the client as the start of the
range.

3) You can have multiple sources feed into a single iterator in a tree-like
fashion by clone()'ing the source passed in to init.
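
The call order in 1) and the rebuild in 2) can be sketched roughly as follows. This is plain Java and deliberately not the SortedKeyValueIterator interface: MiniIterator, ListSource, PrefixFilter and CallOrderSketch are hypothetical names, keys are strings, the "idx:" prefix is invented for the example, and the assumption that a rebuilt iterator is seeked just past the last key returned (exclusive start) is made only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified stand-in for the iterator contract.
interface MiniIterator {
    void init(MiniIterator source);
    void seek(String startKey, boolean startInclusive);
    boolean hasTop();
    String getTopKey();
    void next();
}

// Bottom of the stack: walks a sorted in-memory list of keys.
final class ListSource implements MiniIterator {
    private final List<String> keys;
    private int pos;

    ListSource(List<String> sortedKeys) { keys = sortedKeys; pos = keys.size(); }

    public void init(MiniIterator source) { /* nothing below this one */ }

    public void seek(String startKey, boolean startInclusive) {
        pos = 0;
        while (pos < keys.size()) {
            int cmp = keys.get(pos).compareTo(startKey);
            if (cmp > 0 || (cmp == 0 && startInclusive)) break;
            pos++;
        }
    }

    public boolean hasTop() { return pos < keys.size(); }
    public String getTopKey() { return keys.get(pos); }
    public void next() { pos++; }
}

// A custom iterator: passes through only keys starting with "idx:".
final class PrefixFilter implements MiniIterator {
    private MiniIterator source;

    public PrefixFilter() { } // the framework is assumed to use this constructor

    public void init(MiniIterator source) { this.source = source; }

    public void seek(String startKey, boolean startInclusive) {
        source.seek(startKey, startInclusive);
        skipNonMatching(); // after seek() there must be a top if data exists
    }

    public boolean hasTop() { return source.hasTop(); }
    public String getTopKey() { return source.getTopKey(); }
    public void next() { source.next(); skipNonMatching(); }

    private void skipNonMatching() {
        while (source.hasTop() && !source.getTopKey().startsWith("idx:")) {
            source.next();
        }
    }
}

final class CallOrderSketch {
    // Builds a fresh iterator each time, as happens after a teardown,
    // and reads at most `limit` keys from the given start position.
    static List<String> readBatch(List<String> data, String startKey,
                                  boolean startInclusive, int limit) {
        MiniIterator it = new PrefixFilter(); // 1. default constructor
        it.init(new ListSource(data));        // 2. init()
        it.seek(startKey, startInclusive);    // 3. seek()
        List<String> out = new ArrayList<>();
        while (it.hasTop() && out.size() < limit) { // 4. hasTop / getTopKey / next
            out.add(it.getTopKey());
            it.next();
        }
        return out;
    }
}
```

Reading two keys from [doc:1, idx:a, idx:b, idx:c, meta:x] returns idx:a and idx:b; a second readBatch starting just after idx:b, on a brand new iterator, returns idx:c. The point of the sketch is that no state survives between the two batches, so an iterator has to be able to resume from nothing more than the range it is seeked to.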
