You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@uima.apache.org by Richard Eckart de Castilho <re...@apache.org> on 2020/11/12 20:40:00 UTC

Adjusting the behavior of SelectFS limit and shifted operations

SelectFS provides different way of accessing the selection:

- as list
- as array
- as FS iterator
- as a single result
- ...

I believe it is a desirable property that these behave consistently.

In particular, when obtaining the iterator and moving over it in any
order, one should encounter the same results as when accessing the
results as a list or array (although hopefully more efficiently, 
in particular if not all of the results need to be looked at).

I fixed up the code such that going through the iterator front-to-back
and back-to-front yields results consistent with the list/array.

But working on this, I found two issues:

== limit(x)

It is possible to limit the results of the selection operation. This
is implemented as a limit on the number of get() operations one can 
make on the iterator. That means, one can move the iterator around
after obtaining it (without calling get()) without triggering the limit:

it = select().limit(2).fsIterator();
it.moveToNext();
it.moveToNext();
it.moveToNext();
it.get() <= can still get here!

I would rather expect that the limit would restrict size of the
result space and that the iterator could be used to freely move
inside that limited space while allowing to call get() as often
as one likes. So the limit should in my opinion rather be imposed
on the move operations than on the get operations.

WDYT?


== shifted(x < 0)

The shift operation is defined as moving the iterator to the left/right
from the starting position. Consider the following situation:

t1 = new Token(0,1)
t2 = new Token(2,3)
t3 = new Token(4,5)
t4 = new Token(6,7)
t5 = new Token(8,9)

select().following(t3) => {t4, t5}
select().preceding(t3) => {t1, t2} 

NOTE: results are returned in index order, not in iteration order!
select.preceding(t3) iterates backwards starting from t3, but then 
reverses the result list.

So with positive shift, we can skip some of the results before returning.

select().shifted(1).following(t3) => {t5}

When I started looking into this, it was also possible to use a negative
shift with following, e.g.:

select().shifted(-1).following(t3) => {t3, t4, t5}

So the "following" operation would select "t4" as the starting point and
the shift would then move the iterator one to the left so that the result
list would include t3 as well.

However, in order to align the following and preceding operations with the
I have installed filtered iterators to ensure that edge cases with
zero-width annotations are properly handled. These filters put a hard limit
on the iterator when the boundaries of the startFS (here t3) are hit and
will not allow moving beyond this limit. That means that using shift with
following/preceding returns an empty list because moving the iterator
backwards from its starting position invariably hits the filtered iterator
limit and invalidates it.

select().shifted(-1).following(t3) => {}
select().shifted(-1).preceding(t3) => {}

IMHO that makes sense. Operations like preceding/following/coveredBy/etc.
are supposed to work offset-oriented and to gloss over the index order
(and type priorities). Mixing this with an operation that strongly
depends on index order like shift feels like a bad idea. It works out
nicely for positive shift though - it boils down to skipping the first x
elements of the result. However, allowing a negative shift to move to some
element just before the start position for this kind of operation seems
like a bad idea conceptually and also causes quite a bit of headache
at the implementation level.

Instead of using a negative shift with preceding/following, it should
be used in conjunction with startAt. startAt itself is strongly dependent
on index order and the shift operation is well defined in conjunction with
startAt.

So when SelectFS following/preceding is used with a negative shift now,
a warning is logged suggesting the use of startAt. 

I believe this change is defendable in a feature-level release because 
- the behavior of shift is quite odd to start with (and IMHO should also
 be adjusted in other cases)
- I would not expect much if any code to actually rely on the shift 
 operator with a negative index

WDYT?

Cheers,

-- Richard

Re: Adjusting the behavior of SelectFS limit and shifted operations

Posted by "Zesch, Torsten" <to...@uni-due.de>.


    I would rather expect that the limit would restrict size of the
    result space and that the iterator could be used to freely move
    inside that limited space while allowing to call get() as often
    as one likes. So the limit should in my opinion rather be imposed
    on the move operations than on the get operations.

    WDYT?

TZ: that makes sense


    I believe this change is defendable in a feature-level release because 
    - the behavior of shift is quite odd to start with (and IMHO should also
     be adjusted in other cases)
    - I would not expect much if any code to actually rely on the shift 
     operator with a negative index

    WDYT?

TZ: AFAIK we are not using any shift operations in our code, so no objections from this side. 

-Torsten

Re: Adjusting the behavior of SelectFS limit and shifted operations

Posted by Mario Juric <ma...@cactusglobal.com>.

Hi Richard,

Sorry for the late reply, I’ve been busy with other things lately, but thanks for the extensive explanation, and it gives me a much better understanding of the select API.  Generally I don’t oppose to any of the thoughts you have, but I don’t necessarily always find it intuitive (yet), and there hasn’t been any issues for our limited use cases, except for the bug you fixed very quickly, thanks for that :)

Cheers
Mario


> On 20 Nov 2020, at 22.04, Richard Eckart de Castilho <re...@apache.org> wrote:
>
> External email – Do not click links or open attachments unless you recognize the sender and know that the content is safe.
>
>
> Hi Mario,
>
> if I understand you correctly, you imagine select() to be a streaming operation. Actually, it is not - at least not immediately.
>
> When select() is invoked, it creates an object that is a hybrid between a builder, an Iterable and a Stream. If at any point you invoke an Iterable or Stream method on it, it loses the other personalities.
>
> The methods such as the following are part of the "builder" personality:
>
> - following(x)
> - coveredBy(x)
> - covering(x)
> - ...
>
> - shifted(y)
> - backwards()
> - noneOverlapping()
> - typePriorities()
> - ...
>
> While operating on the "builder" personality, the order of methods has no effect. E.g. the following calls are all equivalent:
>
> cas.select(Token.class).shifted(-1).following(t3).backwards()
> cas.select(Token.class).following(t3).backwards().shifted(-1)
> cas.select(Token.class).backwards().shifted(-1).following(t3)
>
> If you try to give conflicting instructions to the builder personality, the last instruction should be used, e.g.
>
> cas.select(Token.class).following(t3).shifted(-1).preceding(t4)
>
> should be equivalent to
>
> cas.select(Token.class).preceding(t4).shifted(-1)
>
> (... or if there are bugs it might do something unexpected ...)
>
> Methods like coveredBy(x) or covering(x) set up bounds for the iterator internally created by SelectFS.
> I think the initial idea for following(x)/preceding(x) was that they would not define bounds - but IMHO that doesn't make too much sense. From my perspective they also define bounds either from the beginning of the document to x (preceding) or from x to the end of the document (following). There is also the startAt(x) method - this does not define a boundary - it just moves the iterator to a given start position.
>
> So while the following operations are bounded:
>
> cas.select(Token.class).following(x).asList()
> cas.select(Token.class).preceding(x).asList()
>
> these operations are their respective not-bounded versions
>
> cas.select(Token.class).startAt(x).asList()
> cas.select(Token.class).startAt(x).backwards().asList()
>
> The not-bounded versions behave a bit differently from the bounded ones. E.g. preceding(x) returns annotations in document order while startAt(x).backwards() returns them in iteration order. Also,
> following(x) and preceding(x) would never include x in their results, while startAt(x) should return
> x as the first entry in the result list. I do hope that I explained this correctly and that it makes sense and that it mostly matches the implementation. I am still working on setting up a tighter test suite to ensure it does ;)
>
> select() only really becomes a stream if you invoke stream() or a method from the Stream interface (e.g. filter() or map()). It can also become a list, an array, or an iterator. So the following is actually *not* possible:
>
> select(Token.class).filter(t -> t.getCoveredText().equals("blah")).shifted(1)
>
> because "shifted()" is a method from the builder personality of SelectFS while "filter()" is a method of the Stream personality. However, this would work:
>
> select(Token.class).filter(t -> t.getCoveredText().equals("blah")).skip(1)
>
> because "skip()" is a method on Stream.
>
> Ok, but independent of the different personalities of select(), I understand that you'd find it not logical or intuitive that limit and shifted interact with each other. But you do support the idea of
> capping shift at 0 and simply ignoring any smaller values for bounded selections.
>
> Cheers,
>
> -- Richard


________________________________
Disclaimer:
This email and any files transmitted with it are confidential and directed solely for the use of the intended addressee or addressees and may contain information that is legally privileged, confidential, and exempt from disclosure. If you have received this email in error, please notify the sender by telephone, fax, or return email and immediately delete this email and any files transmitted along with it. Unintended recipients are not authorized to disclose, disseminate, distribute, copy or take any action in reliance on information contained in this email and/or any files attached thereto, in any manner other than to notify the sender; any unauthorized use is subject to legal prosecution.

Re: Adjusting the behavior of SelectFS limit and shifted operations

Posted by Richard Eckart de Castilho <re...@apache.org>.

Hi Mario,

if I understand you correctly, you imagine select() to be a streaming operation. Actually, it is not - at least not immediately.

When select() is invoked, it creates an object that is a hybrid between a builder, an Iterable and a Stream. If at any point you invoke an Iterable or Stream method on it, it loses the other personalities.

The methods such as the following are part of the "builder" personality:

- following(x)
- coveredBy(x)
- covering(x)
- ...

- shifted(y)
- backwards()
- noneOverlapping()
- typePriorities()
- ...

While operating on the "builder" personality, the order of methods has no effect. E.g. the following calls are all equivalent:

cas.select(Token.class).shifted(-1).following(t3).backwards()
cas.select(Token.class).following(t3).backwards().shifted(-1)
cas.select(Token.class).backwards().shifted(-1).following(t3)

If you try to give conflicting instructions to the builder personality, the last instruction should be used, e.g.

cas.select(Token.class).following(t3).shifted(-1).preceding(t4)

should be equivalent to

cas.select(Token.class).preceding(t4).shifted(-1)

(... or if there are bugs it might do something unexpected ...)

Methods like coveredBy(x) or covering(x) set up bounds for the iterator internally created by SelectFS.
I think the initial idea for following(x)/preceding(x) was that they would not define bounds - but IMHO that doesn't make too much sense. From my perspective they also define bounds either from the beginning of the document to x (preceding) or from x to the end of the document (following). There is also the startAt(x) method - this does not define a boundary - it just moves the iterator to a given start position. 

So while the following operations are bounded:

cas.select(Token.class).following(x).asList()
cas.select(Token.class).preceding(x).asList()

these operations are their respective not-bounded versions

cas.select(Token.class).startAt(x).asList()
cas.select(Token.class).startAt(x).backwards().asList()

The not-bounded versions behave a bit differently from the bounded ones. E.g. preceding(x) returns annotations in document order while startAt(x).backwards() returns them in iteration order. Also,
following(x) and preceding(x) would never include x in their results, while startAt(x) should return
x as the first entry in the result list. I do hope that I explained this correctly and that it makes sense and that it mostly matches the implementation. I am still working on setting up a tighter test suite to ensure it does ;)

select() only really becomes a stream if you invoke stream() or a method from the Stream interface (e.g. filter() or map()). It can also become a list, an array, or an iterator. So the following is actually *not* possible:

select(Token.class).filter(t -> t.getCoveredText().equals("blah")).shifted(1)

because "shifted()" is a method from the builder personality of SelectFS while "filter()" is a method of the Stream personality. However, this would work:

select(Token.class).filter(t -> t.getCoveredText().equals("blah")).skip(1)

because "skip()" is a method on Stream.

Ok, but independent of the different personalities of select(), I understand that you'd find it not logical or intuitive that limit and shifted interact with each other. But you do support the idea of
capping shift at 0 and simply ignoring any smaller values for bounded selections.

Cheers,

-- Richard

Re: Adjusting the behavior of SelectFS limit and shifted operations

Posted by Mario Juric <ma...@cactusglobal.com>.

Hej Richard,

I have to admit that I needed to read this a number of times before I believe I understood this. I think the semantics can be hard to understand in any of these edge cases, but I’ll try to explain how my intuition sees it.

> So I am thinking now that it might make more sense to simply internally set the shift from the negative number to 0 when used
> in such a situation:
>
> ----
> select().shifted(-1).following(t3) => {t4, t5} // internally shifted is capped to 0
> select().following(t3) => {t4, t5}
> ——
Makes sense if we use a list model with a final begin and end, and I would with this choice prefer to think of shift as an operation that is limited to the bounds of the list, meaning it stops once these have been reached, which effectively means it is capped to 0 in the above example. The alternative is to take the view of the list as a circular entity with respect to operations like shift, which is similar to negative list indexes in Python. However, applying the “following”-operation makes the shift redundant in the above example. Both operations just move the iterator, so to me it seems irrelevant what shift does before, since “following” should move the iterator to the specified location independently of where it was placed by shift.
>
> However, that would/should then introduce interaction with the `limit()` operator, e.g.
>
> ----
> select().shifted(-1).following(t3)          => {t4, t5} // internally shifted is capped to 0
> select().shifted(-1).following(t3).limit(2) => {t5}
> ——
>
> We shift by -1, do not return the result because it is outside the bounds, but still limit the
> forward move operations. So effectively, the negative shift would need to be subtracted from
> the limit in this case.
>
> Any thoughts?
>
This is not obvious to me. Given my argumentation above, it doesn’t matter what shift does, since “following” is the last operation that determines the iterator position, and limit should apply to the iteration from that position. I would expect it to be independent from shift, except that shift, like following, changes the iterator position, but limit applies to the iteration from that point on whatever that is, so I would still expect {t4, t5}. I don’t see why there should be any direct interaction between shift and limit. I would think of this like stream operations, except that shift and following don’t make much sense in the middle of a stream chain, i.e. they can only be applied during initialisation of the stream source.


Cheers
Mario


________________________________
Disclaimer:
This email and any files transmitted with it are confidential and directed solely for the use of the intended addressee or addressees and may contain information that is legally privileged, confidential, and exempt from disclosure. If you have received this email in error, please notify the sender by telephone, fax, or return email and immediately delete this email and any files transmitted along with it. Unintended recipients are not authorized to disclose, disseminate, distribute, copy or take any action in reliance on information contained in this email and/or any files attached thereto, in any manner other than to notify the sender; any unauthorized use is subject to legal prosecution.

Re: Adjusting the behavior of SelectFS limit and shifted operations

Posted by Richard Eckart de Castilho <re...@apache.org>.

On 13. Nov 2020, at 12:08, Mario Juric <ma...@cactusglobal.com> wrote:
> 
> We have no use cases where something like shift is used, especially negative shifts, so it’s hard for me to have a qualified opinion about this. The closest thing it reminds me of are negative list indexes in Python, and I am not sure that is comparable. I guess, invalidating or sending out warnings in some of the mentioned edge cases seems reasonable for now.

While writing up some migration notes related to behavioral changes in the next UIMA version, I came across this again.

So, right now (after the changes we discussed), using a negative shift with a bounding operator like "following" or "coveredBy"
would return an empty list:

----
select().shifted(-1).following(t3) => {}
----

Using it without the negative shift, it would return this result e.g.:

----
select().following(t3) => {t4, t5}
----

So I am thinking now that it might make more sense to simply internally set the shift from the negative number to 0 when used
in such a situation:

----
select().shifted(-1).following(t3) => {t4, t5} // internally shifted is capped to 0
select().following(t3) => {t4, t5}
----

However, that would/should then introduce interaction with the `limit()` operator, e.g.

----
select().shifted(-1).following(t3)          => {t4, t5} // internally shifted is capped to 0
select().shifted(-1).following(t3).limit(2) => {t5}     
----

We shift by -1, do not return the result because it is outside the bounds, but still limit the
forward move operations. So effectively, the negative shift would need to be subtracted from
the limit in this case.

Any thoughts?

Cheers,

-- Richard

Re: Adjusting the behavior of SelectFS limit and shifted operations

Posted by Richard Eckart de Castilho <re...@apache.org>.

On 13. Nov 2020, at 12:08, Mario Juric <ma...@cactusglobal.com> wrote:
> 
> We have no use cases where something like shift is used, especially negative shifts, so it’s hard for me to have a qualified opinion about this. The closest thing it reminds me of are negative list indexes in Python, and I am not sure that is comparable. I guess, invalidating or sending out warnings in some of the mentioned edge cases seems reasonable for now.

While writing up some migration notes related to behavioral changes in the next UIMA version, I came across this again.

So, right now (after the changes we discussed), using a negative shift with a bounding operator like "following" or "coveredBy"
would return an empty list:

----
select().shifted(-1).following(t3) => {}
----

Using it without the negative shift, it would return this result e.g.:

----
select().following(t3) => {t4, t5}
----

So I am thinking now that it might make more sense to simply internally set the shift from the negative number to 0 when used
in such a situation:

----
select().shifted(-1).following(t3) => {t4, t5} // internally shifted is capped to 0
select().following(t3) => {t4, t5}
----

However, that would/should then introduce interaction with the `limit()` operator, e.g.

----
select().shifted(-1).following(t3)          => {t4, t5} // internally shifted is capped to 0
select().shifted(-1).following(t3).limit(2) => {t5}     
----

We shift by -1, do not return the result because it is outside the bounds, but still limit the
forward move operations. So effectively, the negative shift would need to be subtracted from
the limit in this case.

Any thoughts?

Cheers,

-- Richard

Re: Adjusting the behavior of SelectFS limit and shifted operations

Posted by Mario Juric <ma...@cactusglobal.com>.

Hi Richard,

I haven’t used any of the mentioned features yet, so I have to think of some use cases that we have, where they could be applied.

> it = select().limit(2).fsIterator();
> it.moveToNext();
> it.moveToNext();
> it.moveToNext();
> it.get() <= can still get here!
>
> I would rather expect that the limit would restrict size of the
> result space and that the iterator could be used to freely move
> inside that limited space while allowing to call get() as often
> as one likes. So the limit should in my opinion rather be imposed
> on the move operations than on the get operations.

A top N filter would be the closest use case that we have for something like this, and it picks the first N that satisfy some constraints, so in this case it doesn’t make sense to limit the iteration range, because we don’t know whether the next item satisfies the constraints, but then I would probably implement it something like this:

It = select().filter(t -> satisfiesConstraints(t)).limit(n);

In this case it makes sense to limit the iterator range as you propose, and this is basically also how our implementation works with streams.

> == shifted(x < 0)
>
> The shift operation is defined as moving the iterator to the left/right
> from the starting position. Consider the following situation:
>
> t1 = new Token(0,1)
> t2 = new Token(2,3)
> t3 = new Token(4,5)
> t4 = new Token(6,7)
> t5 = new Token(8,9)
>
> select().following(t3) => {t4, t5}
> select().preceding(t3) => {t1, t2}
>
> NOTE: results are returned in index order, not in iteration order!
> select.preceding(t3) iterates backwards starting from t3, but then
> reverses the result list.
>
> So with positive shift, we can skip some of the results before returning.
>
> select().shifted(1).following(t3) => {t5}
>
> When I started looking into this, it was also possible to use a negative
> shift with following, e.g.:
>
> select().shifted(-1).following(t3) => {t3, t4, t5}
>
> So the "following" operation would select "t4" as the starting point and
> the shift would then move the iterator one to the left so that the result
> list would include t3 as well.
>
> However, in order to align the following and preceding operations with the
> I have installed filtered iterators to ensure that edge cases with
> zero-width annotations are properly handled. These filters put a hard limit
> on the iterator when the boundaries of the startFS (here t3) are hit and
> will not allow moving beyond this limit. That means that using shift with
> following/preceding returns an empty list because moving the iterator
> backwards from its starting position invariably hits the filtered iterator
> limit and invalidates it.
>
> select().shifted(-1).following(t3) => {}
> select().shifted(-1).preceding(t3) => {}
>
> IMHO that makes sense. Operations like preceding/following/coveredBy/etc.
> are supposed to work offset-oriented and to gloss over the index order
> (and type priorities). Mixing this with an operation that strongly
> depends on index order like shift feels like a bad idea. It works out
> nicely for positive shift though - it boils down to skipping the first x
> elements of the result. However, allowing a negative shift to move to some
> element just before the start position for this kind of operation seems
> like a bad idea conceptually and also causes quite a bit of headache
> at the implementation level.
>
> Instead of using a negative shift with preceding/following, it should
> be used in conjunction with startAt. startAt itself is strongly dependent
> on index order and the shift operation is well defined in conjunction with
> startAt.
>
> So when SelectFS following/preceding is used with a negative shift now,
> a warning is logged suggesting the use of startAt.
>
> I believe this change is defendable in a feature-level release because
> - the behavior of shift is quite odd to start with (and IMHO should also
> be adjusted in other cases)
> - I would not expect much if any code to actually rely on the shift
> operator with a negative index

We have no use cases where something like shift is used, especially negative shifts, so it’s hard for me to have a qualified opinion about this. The closest thing it reminds me of are negative list indexes in Python, and I am not sure that is comparable. I guess, invalidating or sending out warnings in some of the mentioned edge cases seems reasonable for now.

Cheers
Mario



________________________________
Disclaimer:
This email and any files transmitted with it are confidential and directed solely for the use of the intended addressee or addressees and may contain information that is legally privileged, confidential, and exempt from disclosure. If you have received this email in error, please notify the sender by telephone, fax, or return email and immediately delete this email and any files transmitted along with it. Unintended recipients are not authorized to disclose, disseminate, distribute, copy or take any action in reliance on information contained in this email and/or any files attached thereto, in any manner other than to notify the sender; any unauthorized use is subject to legal prosecution.