Posted to dev@uima.apache.org by Marshall Schor <ms...@schor.com> on 2015/04/09 23:42:48 UTC

inconsistency in implementation of SubIterators

In UIMA, Subiterators are defined relative to an existing Annotation Index over
some subtype of Annotation.

When you create a subiterator, you pass in boundaries (begin / end) used to
restrict the iterator to those instances within that span.

The boundaries are passed in using a FeatureStructure, which may be a new one,
or an existing one (perhaps also in the Annotation Index, but it need not be).

When these were defined, the concept of having multiple "equal" FSs (in the
sense that the defined keys - begin, end, and type priority order - match
between two FeatureStructures) was not thought of, I think.  The implementation
currently includes code that, when creating the iterator, does a
"moveTo(the_bounding_fs)" operation, and then, if it finds that the FS at that
spot is "equal" to the bounding FS, it moves-to-next to "skip" it.

Extending this to the possibility of "multiple" equal FSs, the effect is
currently to skip just the first (of possibly many "equal" instances).
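
A minimal sketch of the behavior described above, in plain Java rather than
against the UIMA classes. The Ann class, the ORDER comparator (begin ascending,
end descending, then a numeric type rank) and the method names are assumptions
made for illustration; this is not the UIMA implementation. It only shows that
a single "move to next" after moveTo leaves the remaining "equal" instances in
the iteration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SkipFirstEqualSketch {
    /** Hypothetical stand-in for an annotation; not a UIMA class. */
    static final class Ann {
        final int begin, end, typeRank;
        final String label;
        Ann(int begin, int end, int typeRank, String label) {
            this.begin = begin; this.end = end; this.typeRank = typeRank; this.label = label;
        }
    }

    /** Assumed index order: begin ascending, end descending, then type rank. */
    static final Comparator<Ann> ORDER = (x, y) -> {
        if (x.begin != y.begin) return Integer.compare(x.begin, y.begin);
        if (x.end != y.end) return Integer.compare(y.end, x.end);
        return Integer.compare(x.typeRank, y.typeRank);
    };

    /** "moveTo": position of the first element that is not less than the bound. */
    static int moveTo(List<Ann> sorted, Ann bound) {
        int i = 0;
        while (i < sorted.size() && ORDER.compare(sorted.get(i), bound) < 0) i++;
        return i;
    }

    /** Skips at most one element that compares equal to the bound, then collects covered items. */
    static List<String> subiterate(List<Ann> sorted, Ann bound) {
        int i = moveTo(sorted, bound);
        if (i < sorted.size() && ORDER.compare(sorted.get(i), bound) == 0) i++;
        List<String> out = new ArrayList<>();
        for (; i < sorted.size() && sorted.get(i).begin <= bound.end; i++) {
            if (sorted.get(i).end <= bound.end) out.add(sorted.get(i).label);
        }
        return out;
    }

    static List<String> demo() {
        List<Ann> index = new ArrayList<>(Arrays.asList(
            new Ann(20, 30, 0, "B"),
            new Ann(10, 100, 0, "A1"), new Ann(10, 100, 0, "A2"), new Ann(10, 100, 0, "A3")));
        index.sort(ORDER); // stable sort, so the order is A1, A2, A3, B
        Ann bound = new Ann(10, 100, 0, "bound"); // never added to the index
        return subiterate(index, bound);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [A2, A3, B]
    }
}
```

Only A1 is skipped; A2 and A3, which compare equal to the bound, stay in the
result.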

The documentation (which is in the Javadocs, mostly, for AnnotationIndex, here
http://uima.apache.org/d/uimaj-2.7.0/apidocs/index.html ), doesn't cover this
case.  It also seems to believe that the annotation supplying the bounding
information needs to be in the index, whereas, the implementation doesn't
require that.  For instance, one could decide to get all annotations between 10
and 100, and just make an instance of a subtype of Annotation, setting the
begin/end values to 10/100, and ** never add this to the indexes **, and pass it
to the subiterator method as the bounding annotation.

I realize this is an edge case, that might not be too interesting, but I'd like
to do some kind of better implementation to cover this.  The choices seem to be
to a) continue skipping the 1st one, and leave the others in the iteration, or
b) continue skipping the 1st one, and skip all of the other "equal" ones as well.

Another edge case happens if the bounding annotation *is* in the index.  In that
case the definition in the Javadocs specifies the iterator will return
annotations *following* the particular bounding annotation that is in the index.
To implement this correctly, the code would need to search all "equal" items in
the index to find the one that is "EQ" / == / has the same exact
FeatureStructure "id", and return items "following" that in the index.
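
The identity search described above could look roughly like the following
plain-Java sketch. The Ann class, the ORDER comparator and the choice to skip
nothing when the bound is not found in the index are assumptions for
illustration, not existing UIMA code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class IdentitySkipSketch {
    /** Hypothetical stand-in for an annotation; not a UIMA class. */
    static final class Ann {
        final int begin, end, typeRank;
        final String label;
        Ann(int begin, int end, int typeRank, String label) {
            this.begin = begin; this.end = end; this.typeRank = typeRank; this.label = label;
        }
    }

    /** Assumed index order: begin ascending, end descending, then type rank. */
    static final Comparator<Ann> ORDER = (x, y) -> {
        if (x.begin != y.begin) return Integer.compare(x.begin, y.begin);
        if (x.end != y.end) return Integer.compare(y.end, x.end);
        return Integer.compare(x.typeRank, y.typeRank);
    };

    static final Ann A1 = new Ann(10, 100, 0, "A1");
    static final Ann A2 = new Ann(10, 100, 0, "A2");
    static final Ann A3 = new Ann(10, 100, 0, "A3");
    static final Ann B = new Ann(20, 30, 0, "B");
    /** Already in index order. */
    static final List<Ann> INDEX = Arrays.asList(A1, A2, A3, B);

    /** Labels of the covered items that follow the bound itself, located by identity (==). */
    static List<String> following(List<Ann> sorted, Ann bound) {
        int start = 0;
        while (start < sorted.size() && ORDER.compare(sorted.get(start), bound) < 0) start++;
        // Scan the run of "equal" items for the very same object.
        for (int j = start; j < sorted.size() && ORDER.compare(sorted.get(j), bound) == 0; j++) {
            if (sorted.get(j) == bound) { start = j + 1; break; }
        }
        // Assumption: if the bound is not in the index, start is unchanged and nothing is skipped.
        List<String> out = new ArrayList<>();
        for (int i = start; i < sorted.size() && sorted.get(i).begin <= bound.end; i++) {
            if (sorted.get(i).end <= bound.end) out.add(sorted.get(i).label);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(following(INDEX, A2));                        // [A3, B]
        System.out.println(following(INDEX, new Ann(10, 100, 0, "x")));  // [A1, A2, A3, B]
    }
}
```

Note that with this reading, "equal" items that sit before the bound in the
index (A1 in the example) are not returned either.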

This code is not present in the current implementation; should it be added?  Or
should we update the Javadocs?

Does anyone have a preference, one way or another (or perhaps even a better
analysis and an alternate suggestion)?

Thanks. -Marshall

Re: inconsistency in implementation of SubIterators

Posted by Marshall Schor <ms...@schor.com>.

Re: confusion regarding subiterators and "type priorities". I agree that many
users have wanted a simpler version of a subiterator that just bounds the
iterator to a specific begin and end, without any reference to type priorities.

To do this, I'm thinking of an API which expresses this quite directly.  It
would be nice to be able to continue to offer the "strict" and "unambiguous"
styles as well, for "architectural generality" (although I don't really know if
there's a perceived need for this - architectural generality has a side benefit
of having users "learn" less unique special things; instead they learn some base
things which they can then combine).

I know some languages (Python?) use the term "slice" to express a subsequence
from an ordered collection.

So perhaps we can have for annotation indexes an additional method:
  slice(int begin, int end)  or
  slice(FeatureStructure fs) [ where the fs just supplies a begin / end ]

which would return a lightweight wrapper of the specific index to operate as a
subiterator.

We could also make the strict and unambiguous work this way too
  strict()
  unambiguous()

This would then permit the use of this now-specialized index in Java for (xxx :
yyy) style.

Note that this version would not "skip" anything; if users wanted to skip some
particular items, they'd need to do that in the loop.

The implementation of this could make use of additional index structures
(lists), like the current base UIMA implementations for strict and unambiguous
do.  I'd like to keep this "detail" out of the API if possible, hoping that the
decision to do this (or not) could be somehow automatic :-) .
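
To make the idea concrete, here is a rough plain-Java sketch of such a slice
view. Everything in it (the Ann class, the Slice wrapper, the slice method) is
hypothetical and only models the proposal; it is not an existing UIMA API. The
linear scan is there for brevity; a real implementation would presumably
position itself with a binary search.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class SliceSketch {
    /** Hypothetical stand-in for an annotation; not a UIMA class. */
    static final class Ann {
        final int begin, end;
        final String label;
        Ann(int begin, int end, String label) { this.begin = begin; this.end = end; this.label = label; }
    }

    /** Lightweight view over a sorted index; copies nothing and skips nothing by equality. */
    static final class Slice implements Iterable<Ann> {
        private final List<Ann> sortedIndex;
        private final int begin, end;
        Slice(List<Ann> sortedIndex, int begin, int end) {
            this.sortedIndex = sortedIndex; this.begin = begin; this.end = end;
        }
        @Override public Iterator<Ann> iterator() {
            return new Iterator<Ann>() {
                private int pos = advance(0);
                /** Next position at or after 'from' whose item lies within [begin, end]. */
                private int advance(int from) {
                    for (int i = from; i < sortedIndex.size(); i++) {
                        Ann a = sortedIndex.get(i);
                        if (a.begin > end) break; // sorted by begin: nothing further can fit
                        if (a.begin >= begin && a.end <= end) return i;
                    }
                    return sortedIndex.size();
                }
                @Override public boolean hasNext() { return pos < sortedIndex.size(); }
                @Override public Ann next() {
                    if (!hasNext()) throw new NoSuchElementException();
                    Ann a = sortedIndex.get(pos);
                    pos = advance(pos + 1);
                    return a;
                }
            };
        }
    }

    static Slice slice(List<Ann> sortedIndex, int begin, int end) {
        return new Slice(sortedIndex, begin, end);
    }

    static List<String> demo() {
        // Already in the assumed index order: begin ascending, end descending.
        List<Ann> index = Arrays.asList(
            new Ann(0, 5, "a"), new Ann(10, 100, "c"), new Ann(10, 20, "b"),
            new Ann(50, 120, "d"), new Ann(200, 210, "e"));
        List<String> out = new ArrayList<>();
        for (Ann a : slice(index, 10, 100)) out.add(a.label); // for (xxx : yyy) style
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [c, b]
    }
}
```

The item "c" has exactly the span of the slice and is still returned, which
matches the "would not skip anything" rule above.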

-Marshall

On 4/10/2015 2:23 AM, Richard Eckart de Castilho wrote:
> On 09.04.2015, at 23:42, Marshall Schor <ms...@schor.com> wrote:
>
>> In UIMA, Subiterators are defined relative to an existing Annotation Index over
>> some subtype of Annotation.
>>
>> When you create a subiterator, you pass in boundaries (begin / end) used to
>> restrict the iterator to those instances within that span.
>>
>> The boundaries are passed in using a FeatureStructure, which may be a new one,
>> or an existing one (perhaps also in the Annotation Index, but it need not be).
>>
>> When these were defined, the concept of having multiple "equal" FSs (in the
>> sense that the defined keys - begin, end, and type priority order - match
>> between two FeatureStructures) was not thought of, I think.  The implementation
>> currently includes code that, when creating the iterator, does a
>> "moveTo(the_bounding_fs)" operation, and then, if it finds that the FS at that
>> spot is "equal" to the bounding FS, it moves-to-next to "skip" it.
>>
>> Extending this to the possibility of "multiple" equal FSs, the effect is
>> currently to skip just the first (of possibly many "equal" instances).
> In my experience, users have often been confused by that. They thought that
> begin/end was sufficient and that type priorities were not even needed.
>
> This confusion gave rise to the uimaFIT selectCovered(jcas, begin, end) method
> which only takes offsets into account, ignores type priorities, and rewinds
> to the first of possibly multiple annotations with equal begin after the
> initial moveTo operation.
>
>> The documentation (which is in the Javadocs, mostly, for AnnotationIndex, here
>> http://uima.apache.org/d/uimaj-2.7.0/apidocs/index.html ), doesn't cover this
>> case.  It also seems to believe that the annotation supplying the bounding
>> information needs to be in the index, whereas, the implementation doesn't
>> require that.  For instance, one could decide to get all annotations between 10
>> and 100, and just make an instance of a subtype of Annotation, setting the
>> begin/end values to 10/100, and ** never add this to the indexes **, and pass it
>> to the subiterator method as the bounding annotation.
> That is also something I often saw. The problem is, that - to my understanding -
> creating such a temporary annotation consumes space in the CAS heaps even if the
> annotation is never indexed.
>
>> I realize this is an edge case, that might not be too interesting, but I'd like
>> to do some kind of better implementation to cover this.  The choices seem to be
>> to a) continue skipping the 1st one, and leave the others in the iteration, or
>> b) continue skipping the 1st one, and skip all of the other "equal" ones as well.
>>
>> Another edge case happens if the bounding annotation *is* in the index.  In that
>> case the definition in the Javadocs specifies the iterator will return
>> annotations *following* the particular bounding annotation that is in the index.
>> To implement this correctly, the code would need to search all "equal" items in
>> the index to find the one that is "EQ" / == / has the same exact
>> FeatureStructure "id", and return items "following" that in the index.
>>
>> This code is not present in the current implementation; should it be added?  Or
>> should we update the Javadocs?
> If compatibility is an issue, I'd be for updating the JavaDoc to reflect the current
> behavior more clearly, then think about adding a new API that supports other kinds
> of behaviors, e.g. in the way that uimaFIT selectCovered is handling this.
>
> I think it would be great to investigate the possibilities that the Java 8 stream
> API might open up. Years back, we had been contemplating in uimaFIT an alternative
> CAS "selection" API, thinking in directions akin to that stream API or to the Hibernate
> Criteria API. It might be interesting to refer to our notes from back then. Steven even
> did some initial coding:
>
> https://code.google.com/p/uimafit/issues/detail?id=65&colspec=ID%20Type%20Status%20Priority%20Milestone%20Compatible%20ASFJira%20Owner%20Summary
>
> Cheers,
>
> -- Richard

Re: inconsistency in implementation of SubIterators

Posted by Richard Eckart de Castilho <re...@apache.org>.

On 09.04.2015, at 23:42, Marshall Schor <ms...@schor.com> wrote:

> In UIMA, Subiterators are defined relative to an existing Annotation Index over
> some subtype of Annotation.
> 
> When you create a subiterator, you pass in boundaries (begin / end) used to
> restrict the iterator to those instances within that span.
> 
> The boundaries are passed in using a FeatureStructure, which may be a new one,
> or an existing one (perhaps also in the Annotation Index, but it need not be).
> 
> When these were defined, the concept of having multiple "equal" FSs (in the
> sense that the defined keys - begin, end, and type priority order - match
> between two FeatureStructures) was not thought of, I think.  The implementation
> currently includes code that, when creating the iterator, does a
> "moveTo(the_bounding_fs)" operation, and then, if it finds that the FS at that
> spot is "equal" to the bounding FS, it moves-to-next to "skip" it.
> 
> Extending this to the possibility of "multiple" equal FSs, the effect is
> currently to skip just the first (of possibly many "equal" instances).

In my experience, users have often been confused by that. They thought that
begin/end was sufficient and that type priorities were not even needed.

This confusion gave rise to the uimaFIT selectCovered(jcas, begin, end) method
which only takes offsets into account, ignores type priorities, and rewinds
to the first of possibly multiple annotations with equal begin after the
initial moveTo operation.
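
A plain-Java sketch of that offsets-only idea follows. It is not uimaFIT's
actual code; the Ann class and this standalone selectCovered signature are
assumptions for illustration. The point is the rewind step: the initial search
may land anywhere inside a run of items with the same begin, so the position is
moved back to the first of them before collecting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CoveredByOffsetsSketch {
    /** Hypothetical stand-in for an annotation; not a UIMA or uimaFIT class. */
    static final class Ann {
        final int begin, end;
        final String label;
        Ann(int begin, int end, String label) { this.begin = begin; this.end = end; this.label = label; }
    }

    /** Labels of all items within [begin, end], using offsets only (no type priorities). */
    static List<String> selectCovered(List<Ann> sortedByBegin, int begin, int end) {
        // Binary search on begin only; it may stop in the middle of a run of equal begins.
        int lo = 0, hi = sortedByBegin.size() - 1, pos = sortedByBegin.size();
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int b = sortedByBegin.get(mid).begin;
            if (b < begin) {
                lo = mid + 1;
            } else {
                pos = mid;
                if (b == begin) break;
                hi = mid - 1;
            }
        }
        // Rewind to the first item whose begin is not before the requested begin.
        while (pos > 0 && sortedByBegin.get(pos - 1).begin >= begin) pos--;

        List<String> out = new ArrayList<>();
        for (int i = pos; i < sortedByBegin.size() && sortedByBegin.get(i).begin <= end; i++) {
            if (sortedByBegin.get(i).end <= end) out.add(sortedByBegin.get(i).label);
        }
        return out;
    }

    static List<String> demo() {
        List<Ann> index = Arrays.asList(
            new Ann(0, 4, "x"), new Ann(10, 100, "Sentence"), new Ann(10, 100, "Paragraph"),
            new Ann(10, 15, "Token1"), new Ann(16, 20, "Token2"),
            new Ann(90, 110, "y"), new Ann(120, 130, "z"));
        return selectCovered(index, 10, 100);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [Sentence, Paragraph, Token1, Token2]
    }
}
```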

> The documentation (which is in the Javadocs, mostly, for AnnotationIndex, here
> http://uima.apache.org/d/uimaj-2.7.0/apidocs/index.html ), doesn't cover this
> case.  It also seems to believe that the annotation supplying the bounding
> information needs to be in the index, whereas, the implementation doesn't
> require that.  For instance, one could decide to get all annotations between 10
> and 100, and just make an instance of a subtype of Annotation, setting the
> begin/end values to 10/100, and ** never add this to the indexes **, and pass it
> to the subiterator method as the bounding annotation.

That is also something I often saw. The problem is, that - to my understanding -
creating such a temporary annotation consumes space in the CAS heaps even if the
annotation is never indexed.

> I realize this is an edge case, that might not be too interesting, but I'd like
> to do some kind of better implementation to cover this.  The choices seem to be
> to a) continue skipping the 1st one, and leave the others in the iteration, or
> b) continue skipping the 1st one, and skip all of the other "equal" ones as well.
> 
> Another edge case happens if the bounding annotation *is* in the index.  In that
> case the definition in the Javadocs specifies the iterator will return
> annotations *following* the particular bounding annotation that is in the index.
> To implement this correctly, the code would need to search all "equal" items in
> the index to find the one that is "EQ" / == / has the same exact
> FeatureStructure "id", and return items "following" that in the index.
> 
> This code is not present in the current implementation; should it be added?  Or
> should we update the Javadocs?

If compatibility is an issue, I'd be for updating the JavaDoc to reflect the current
behavior more clearly, then think about adding a new API that supports other kinds
of behaviors, e.g. in the way that uimaFIT selectCovered is handling this.

I think it would be great to investigate the possibilities that the Java 8 stream
API might open up. Years back, we had been contemplating in uimaFIT an alternative
CAS "selection" API, thinking in directions akin to that stream API or to the Hibernate
Criteria API. It might be interesting to refer to our notes from back then. Steven even
did some initial coding:

https://code.google.com/p/uimafit/issues/detail?id=65&colspec=ID%20Type%20Status%20Priority%20Milestone%20Compatible%20ASFJira%20Owner%20Summary

Cheers,

-- Richard

Re: inconsistency in implementation of SubIterators

Posted by Jens Grivolla <j+...@grivolla.net>.

On Fri, Apr 10, 2015 at 9:00 AM, Richard Eckart de Castilho <re...@apache.org>
wrote:

> I personally like the approach in uimaFIT better where different modes of
> selection/iteration are expressed using different verbs, e.g.
> selectCovered, selectCovering, etc. I think we have no selectOverlapping,
> selectRightOverlapping, selectLeftOverlapping yet, but they are sometimes
> requested by users.
>

Yes, the "overlapping" selection would be very useful...

-- Jens

Re: inconsistency in implementation of SubIterators

Posted by Richard Eckart de Castilho <re...@apache.org>.

On 10.04.2015, at 03:17, Nick Hill <ap...@nickhill.org> wrote:

> I was proposing a simple subiterator(int start, int end) method to be added to the AnnotationIndex interface. This should satisfy all requirements, i.e. it doesn't matter if the user has an existing annotation in mind or not to define the span, or whether or not that annotation exists in the index. If the user doesn't want that annotation in the result they can just exclude it while iterating.

+1 ( a new method qualifies nicely as "new API" ;) )

> A boolean param could also be included to indicate whether to return annotations wholly covered by the span, or all whose start index are in the span.

I personally like the approach in uimaFIT better where different modes of selection/iteration are expressed using different verbs, e.g. selectCovered, selectCovering, etc. I think we have no selectOverlapping, selectRightOverlapping, selectLeftOverlapping yet, but they are sometimes requested by users.
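
To illustrate how the different verbs differ, here is a small plain-Java sketch
of the span relations. The Ann class, the SpanTest interface and in particular
the exact definitions chosen for the "overlapping" variants are assumptions for
illustration; they are not the uimaFIT implementations, and the overlapping
ones do not exist in uimaFIT as noted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SelectionVerbsSketch {
    /** Hypothetical stand-in for an annotation; not a UIMA or uimaFIT class. */
    static final class Ann {
        final int begin, end;
        final String label;
        Ann(int begin, int end, String label) { this.begin = begin; this.end = end; this.label = label; }
    }

    /** Relation between an annotation and the span [begin, end]. */
    interface SpanTest { boolean matches(Ann a, int begin, int end); }

    // Lies entirely within the span.
    static final SpanTest COVERED = (a, b, e) -> a.begin >= b && a.end <= e;
    // Contains the whole span.
    static final SpanTest COVERING = (a, b, e) -> a.begin <= b && a.end >= e;
    // Shares at least one character position with the span (assumed definition).
    static final SpanTest OVERLAPPING = (a, b, e) -> a.begin < e && a.end > b;
    // Starts before the span and ends inside it (assumed definition).
    static final SpanTest LEFT_OVERLAPPING = (a, b, e) -> a.begin < b && a.end > b && a.end <= e;
    // Starts inside the span and ends after it (assumed definition).
    static final SpanTest RIGHT_OVERLAPPING = (a, b, e) -> a.begin >= b && a.begin < e && a.end > e;

    static final List<Ann> SAMPLE = Arrays.asList(
        new Ann(5, 25, "around"), new Ann(5, 12, "left"), new Ann(10, 20, "exact"),
        new Ann(12, 18, "inside"), new Ann(18, 25, "right"), new Ann(30, 40, "outside"));

    static List<String> select(List<Ann> anns, SpanTest test, int begin, int end) {
        List<String> out = new ArrayList<>();
        for (Ann a : anns) {
            if (test.matches(a, begin, end)) out.add(a.label);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(select(SAMPLE, COVERED, 10, 20));           // [exact, inside]
        System.out.println(select(SAMPLE, COVERING, 10, 20));          // [around, exact]
        System.out.println(select(SAMPLE, OVERLAPPING, 10, 20));       // [around, left, exact, inside, right]
        System.out.println(select(SAMPLE, LEFT_OVERLAPPING, 10, 20));  // [left]
        System.out.println(select(SAMPLE, RIGHT_OVERLAPPING, 10, 20)); // [right]
    }
}
```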

Cheers,

-- Richard

Re: inconsistency in implementation of SubIterators

Posted by Nick Hill <ap...@nickhill.org>.

Having said that, I just noticed that in my obj-based impl I exclude  
all "equal" ones (effectively skipping all of them), I think based on  
initial interpretation/reading of that javadoc.
But it could be changed to match the existing behaviour if necessary.

Nick

Re: inconsistency in implementation of SubIterators

Posted by Nick Hill <ap...@nickhill.org>.

Hi Marshall, I made a related suggestion in this comment:  
https://issues.apache.org/jira/browse/UIMA-4329?focusedCommentId=14486825&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14486825

I think a common use case is to narrow a given annotation index (i.e.
maybe of subtypes) to those annotations falling within a particular  
span, regardless of types. This isn't provided by the current  
AnnotationIndex.subiterator() methods because of the type priority  
behaviour, hence the custom acrobatics required.

I was proposing a simple subiterator(int start, int end) method to be
added to the AnnotationIndex interface. This should satisfy all
requirements .. i.e. it doesn't matter if the user has an existing  
annotation in mind or not to define the span, or whether or not that  
annotation exists in the index. If the user doesn't want that  
annotation in the result they can just exclude it while iterating.

A boolean param could also be included to indicate whether to return  
annotations wholly covered by the span, or all whose start index are  
in the span.

Implementing this would be equivalent to calling the existing  
subiterator impl with a dummy annotation whose start/end indices match  
the provided span and whose type is guaranteed to be of lower priority  
than all other annotation types (some virtual/internal type maybe).  
Also exposing the existing "strict" boolean parameter would give the  
optional behaviour described above.

Regarding the existing edge case you describe, I don't have a strong  
feeling but maybe it makes sense just to update the javadoc given that  
it's presumably rare and has been working that way all along.

Regards,
Nick
