Posted to dev@uima.apache.org by Marshall Schor <ms...@schor.com> on 2016/03/31 21:22:06 UTC

changing edge case impl details in casCopiers

I'm thinking of changing how cas copier works with respect to managing Sofas and
sofa ref updating.  I've written something up here:
https://cwiki.apache.org/confluence/display/UIMA/CasCopier+and+Views

Comments / feedback / what did I overlook?  appreciated :-) -Marshall

Re: changing edge case impl details in casCopiers

Posted by Richard Eckart de Castilho <re...@apache.org>.
I'll answer inline below.

> On 04.04.2016, at 22:41, Marshall Schor <ms...@schor.com> wrote:
> 
> The existing CasCopier code, when copying a FS which is a subtype of
> AnnotationBase, copies the sofa ref by getting the "corresponding" sofa in the
> target CAS.  It does this by getting the sofa whose sofa number is the same. 
> 
> In the use case:
> 
> * view A contains a text
> * view B is created through a transformation of the text from A
> * annotations are created in view B
> * annotations are copied back to view A
> * offsets in the copied annotations are updated based on a reverse of the transformation operation in the second step
> 
> it would seem that step 3 (annotations created in view B) would create them in
> view "B".   So the sofa references of annotations (which are subtypes of
> AnnotationBase) would refer to the sofa associated with view "B".
> 
> In the code example above, a cas copier is created to go from view "B" as the
> source to view "A".  (I'm assuming the cas copier creation call is passing in
> two CAS "views", the source being some CAS's view "B", and the target being some
> CAS's view "A".   It's ambiguous whether or not these are two separate CASes, or
> two views of the same CAS; can you clarify?)

Well, actually we aim to support both. We have an older implementation following the
example above which uses different views in the same CAS. A newer one is currently
implemented as a CAS multiplier and copies between CASes. We plan to extend the new
version in the future to also support view-to-view transformations.

> In an earlier note, I said maybe we could add an API to allow updating the sofa
> reference.  The DKPro code above found a way using existing APIs to do this; we
> could just keep this.

Definitely +1

-- Richard


Re: changing edge case impl details in casCopiers

Posted by Marshall Schor <ms...@schor.com>.
This continues to be an interesting set of use cases :-)

On 4/1/2016 5:28 PM, Richard Eckart de Castilho wrote:
> Hi,
>
> I would say as long as the CasCopier doesn't simply fail if it thinks that a copy would be invalid/unsafe and as long as one can fix potentially broken copies afterwards, it would be in general ok. Ok, existing code might break...
breaking existing code - probably a bad thing, and to be avoided ...
>
> The use-case below was half hypothetical. Very real is a reverse use-case which we have implemented in DKPro Core.
>
> * view A contains a text
> * view B is created through a transformation of the text from A
> * annotations are created in view B
> * annotations are copied back to view A
> * offsets in the copied annotations are updated based on a reverse of the transformation operation in the second step
>
> The code we currently use to handle the copying back looks like this:
>
> CasCopier copier = new CasCopier(inputCas, outputCas);
>
> for (FeatureStructure fs : selectFS(inputCas, getType(inputCas, typeName))) {
>   if (!copier.alreadyCopied(fs)) {
>     FeatureStructure fsCopy = copier.copyFs(fs);
>     // Make sure that the sofa annotation in the copy is set
>     if (fs instanceof AnnotationBaseFS) {
>       FeatureStructure sofa = fsCopy.getFeatureValue(mDestSofaFeature);
>       if (sofa == null) {
>         fsCopy.setFeatureValue(mDestSofaFeature, outputCas.getSofa());
>       }
>     }
>     aOutput.addFsToIndexes(fsCopy);
>   }
> }
The existing CasCopier code, when copying a FS which is a subtype of
AnnotationBase, copies the sofa ref by getting the "corresponding" sofa in the
target CAS.  It does this by getting the sofa whose sofa number is the same. 

In the use case:

* view A contains a text
* view B is created through a transformation of the text from A
* annotations are created in view B
* annotations are copied back to view A
* offsets in the copied annotations are updated based on a reverse of the transformation operation in the second step

it would seem that step 3 (annotations created in view B) would create them in
view "B".   So the sofa references of annotations (which are subtypes of
AnnotationBase) would refer to the sofa associated with view "B".

In the code example above, a cas copier is created to go from view "B" as the
source to view "A".  (I'm assuming the cas copier creation call is passing in
two CAS "views", the source being some CAS's view "B", and the target being some
CAS's view "A".   It's ambiguous whether or not these are two separate CASes, or
two views of the same CAS; can you clarify?)

The copyFs call sets the sofa ref in the copy to point to the sofa in the target
CAS which has the same sofa number as the sofa had in view "B" (the source CAS
view), unless the source had null for the sofa reference, in which case, the
target is left as null. 

It's possible that this might accidentally "work" for some view populations of
the two CASes.
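
To make the "accidentally works" point concrete, here is a minimal plain-Java model of the lookup described above. It does not use the UIMA API: ToyCas, ToySofa and correspondingSofa are hypothetical names, and the rule that sofa numbers are handed out in view-creation order is an assumption of the sketch, not a statement about the real implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; none of these types are UIMA classes. */
final class SofaNumberLookupSketch {

    /** A sofa identified by its view name and its number within the owning CAS. */
    static final class ToySofa {
        final String viewName;
        final int sofaNum;

        ToySofa(String viewName, int sofaNum) {
            this.viewName = viewName;
            this.sofaNum = sofaNum;
        }
    }

    /** A CAS that numbers its sofas 1, 2, 3, ... in creation order (assumed for this sketch). */
    static final class ToyCas {
        private final List<ToySofa> sofas = new ArrayList<>();

        ToySofa createView(String viewName) {
            ToySofa sofa = new ToySofa(viewName, sofas.size() + 1);
            sofas.add(sofa);
            return sofa;
        }

        /** Returns the sofa with the given number, or null if there is none. */
        ToySofa sofaByNumber(int sofaNum) {
            for (ToySofa sofa : sofas) {
                if (sofa.sofaNum == sofaNum) {
                    return sofa;
                }
            }
            return null;
        }
    }

    /** Same sofa number in the target; a null source reference stays null. */
    static ToySofa correspondingSofa(ToySofa sourceSofa, ToyCas target) {
        return sourceSofa == null ? null : target.sofaByNumber(sourceSofa.sofaNum);
    }

    public static void main(String[] args) {
        ToyCas source = new ToyCas();
        source.createView("A");
        ToySofa sourceB = source.createView("B");   // number 2 in the source

        ToyCas sameOrder = new ToyCas();
        sameOrder.createView("A");
        sameOrder.createView("B");                  // number 2 is "B" here as well

        ToyCas otherOrder = new ToyCas();
        otherOrder.createView("B");
        otherOrder.createView("A");                 // number 2 is "A" here

        System.out.println(correspondingSofa(sourceB, sameOrder).viewName);  // prints B
        System.out.println(correspondingSofa(sourceB, otherOrder).viewName); // prints A
    }
}
```

In this model the lookup lands on the intended view only when both CASes happened to create their views in the same order.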


> Source: https://github.com/dkpro/dkpro-core/blob/7c8785647ca8c5905aa108251935069e601cbb8d/dkpro-core-api-transform-asl/src/main/java/de/tudarmstadt/ukp/dkpro/core/api/transform/JCasTransformer_ImplBase.java#L99
>
> I guess this code would still work and wouldn't throw exceptions or such.
It might or might not, depending on whether or not there's a hard constraint
that the source and target CASes be the same CAS (with multiple views).  It
doesn't work, I think, in all cases when there are multiple CASes/ multiple views.
>
> If I understand the diagrams in the wiki correctly, there is one case where the sofa of the copied FS points to the source view but the FS is indexed in the target view. This seems to be the only difference between copying between CASes and copying within a CAS. I think it may be better/simpler/more consistent to set the sofa of the copy to null in both cases and if the user really wants the FS to point to a sofa in a different view, then he should set the sofa in this way manually after the copy is complete.

I'm hoping to have an approach which won't break backwards compatibility...
>  
>
> Btw... at least when copying individual FSes, the copy isn't indexed anyway by the CasCopier. We are talking only about the bulk-copy method then?

You are correct, the copy isn't indexed when you use the copyFs API.  However,
its sofa reference is set, and if set "wrong", an attempt to add the fs to the
indexes will throw an error.  This check was added in version 2.7.0 and is
intended to prevent accidents.
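
As a rough illustration of that check, here is a plain-Java toy model. ToyView, ToyAnnotation and retargetAndIndex are hypothetical names, the exception type is a stand-in, and treating a null reference as a mismatch is a simplification of this sketch; none of it is taken from the UIMA sources.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; ToyView and ToyAnnotation are hypothetical, not UIMA classes. */
final class SofaIndexCheckSketch {

    static final class ToyAnnotation {
        String sofaId;   // id of the sofa this annotation refers to, or null

        ToyAnnotation(String sofaId) {
            this.sofaId = sofaId;
        }
    }

    static final class ToyView {
        final String sofaId;
        final List<ToyAnnotation> index = new ArrayList<>();

        ToyView(String sofaId) {
            this.sofaId = sofaId;
        }

        /** Rejects annotations whose sofa reference does not match this view's sofa. */
        void addToIndexes(ToyAnnotation fs) {
            if (!sofaId.equals(fs.sofaId)) {
                throw new IllegalStateException(
                        "annotation refers to sofa " + fs.sofaId + ", view has sofa " + sofaId);
            }
            index.add(fs);
        }
    }

    /** Repairs the sofa reference first, then indexes the annotation. */
    static void retargetAndIndex(ToyAnnotation fs, ToyView target) {
        fs.sofaId = target.sofaId;
        target.addToIndexes(fs);
    }

    public static void main(String[] args) {
        ToyView viewA = new ToyView("sofaA");
        ToyAnnotation copy = new ToyAnnotation("sofaB");   // copy still refers to view B's sofa

        try {
            viewA.addToIndexes(copy);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        retargetAndIndex(copy, viewA);
        System.out.println(viewA.index.size());   // prints 1
    }
}
```

Note that a fix-up which only fills in a null reference would not help in the first call here, because the copy already carries a non-null reference to the wrong sofa.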

In an earlier note, I said maybe we could add an API to allow updating the sofa
reference.  The DKPro code above found a way using existing APIs to do this; we
could just keep this.

-Marshall
>
> Cheers,
>
> -- Richard
>
>> On 01.04.2016, at 15:57, Marshall Schor <ms...@schor.com> wrote:
>>
>> Hi Richard,
>>
>> Thanks for this use-case.  I think there may be 2 subcases.
>>
>> 1) The views, A and B, are in the same CAS, and
>> 2) The views, A and B, are in different CASes
>>
>> In case 1), with this new proposal the annotations copied from view A to B would
>> have their "sofa" reference continue to point to the text in view A.  This means:
>>
>> a) The references into the text are still "valid", but of course point to the
>> text in view A.
>> b) To do the updating process to have them point to the de-xml'ed version of the
>> text, not only do the begin/end references need to be updated, but the sofa
>> reference needs to be changed.  We could add an API to update that to the
>> current view's.
>>
>> In case 2), the annotations in B would no longer have a valid sofa reference at
>> all (it would be set to null).
>> This would clearly be a problem; but once again, we could add an API to update
>> that to the current view's.
>>
>> --------------------------------
>>
>> So, it looks like this proposed design change would break the use-case you
>> suggested. 
>>
>> The current design would seem to support this use case but only if the two
>> views are in different CASes.
>> If they were in the same CAS, I think the current implementation (not tested,
>> just reading the code) would have the copied Annotations have their sofa
>> references be to the sofa in CAS A.
>>
>> Does this match what you're currently seeing?
>>
>> -Marshall
>>
>>
>> On 3/31/2016 4:36 PM, Richard Eckart de Castilho wrote:
>>> On 31.03.2016, at 21:22, Marshall Schor <ms...@schor.com> wrote:
>>>> I'm thinking of changing how cas copier works with respect to managing Sofas and
>>>> sofa ref updating.  I've written something up here:
>>>> https://cwiki.apache.org/confluence/display/UIMA/CasCopier+and+Views
>>>>
>>>> Comments / feedback / what did I overlook?  appreciated :-) -Marshall
>>> Consider the following case:
>>>
>>> - there are two views, A and B
>>> - the text in B has been derived from A through some transformation, e.g. the removal of XML tags
>>> - A contains UIMA annotations that represent the XML tags and that point into the text in A
>>> - as part of a second transformation process, all annotations in A are to be copied into B
>>> - after the copy has been performed, the offsets of the copied annotations are updated
>>>
>>> Would such a scenario still be supported after the changes you suggest?
>>>
>>> Best,
>>>
>>> -- Richard
>


Re: changing edge case impl details in casCopiers

Posted by Richard Eckart de Castilho <re...@apache.org>.
Hi,

I would say as long as the CasCopier doesn't simply fail if it thinks that a copy would be invalid/unsafe and as long as one can fix potentially broken copies afterwards, it would be in general ok. Ok, existing code might break...

The use-case below was half hypothetical. Very real is a reverse use-case which we have implemented in DKPro Core.

* view A contains a text
* view B is created through a transformation of the text from A
* annotations are created in view B
* annotations are copied back to view A
* offsets in the copied annotations are updated based on a reverse of the transformation operation in the second step

The code we currently use to handle the copying back looks like this:

// Excerpt: inputCas, outputCas, aOutput, typeName and mDestSofaFeature are
// defined in the surrounding class (see the source link below).
CasCopier copier = new CasCopier(inputCas, outputCas);

for (FeatureStructure fs : selectFS(inputCas, getType(inputCas, typeName))) {
  if (!copier.alreadyCopied(fs)) {
    FeatureStructure fsCopy = copier.copyFs(fs);
    // Make sure that the sofa annotation in the copy is set
    if (fs instanceof AnnotationBaseFS) {
      FeatureStructure sofa = fsCopy.getFeatureValue(mDestSofaFeature);
      if (sofa == null) {
        fsCopy.setFeatureValue(mDestSofaFeature, outputCas.getSofa());
      }
    }
    aOutput.addFsToIndexes(fsCopy);
  }
}

Source: https://github.com/dkpro/dkpro-core/blob/7c8785647ca8c5905aa108251935069e601cbb8d/dkpro-core-api-transform-asl/src/main/java/de/tudarmstadt/ukp/dkpro/core/api/transform/JCasTransformer_ImplBase.java#L99

I guess this code would still work and wouldn't throw exceptions or such.
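
For the last step of the use case (updating offsets by reversing the transformation), a minimal plain-Java sketch could look like the following. It is not the DKPro Core implementation: the Deletion record of removed spans, the toOriginal helper and the boundary policy for begin versus end offsets are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: maps offsets in a transformed text (view B) back to the original text (view A). */
final class ReverseOffsetSketch {

    /** One span of the original text that the transformation deleted. */
    static final class Deletion {
        final int start;    // offset in the original text
        final int length;

        Deletion(int start, int length) {
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Maps an offset in the transformed text to the original text.
     * Deletions must be sorted by start and must not overlap.
     * A begin offset skips a deletion located exactly at it; an end offset stops in front of it.
     */
    static int toOriginal(int offset, List<Deletion> deletions, boolean isEnd) {
        int mapped = offset;
        for (Deletion d : deletions) {
            boolean precedes = isEnd ? d.start < mapped : d.start <= mapped;
            if (!precedes) {
                break;
            }
            mapped += d.length;
        }
        return mapped;
    }

    public static void main(String[] args) {
        String original = "<b>Hi</b> all";
        List<Deletion> deletions = new ArrayList<>();
        deletions.add(new Deletion(0, 3));   // "<b>"
        deletions.add(new Deletion(5, 4));   // "</b>"

        // An annotation over "Hi" in the transformed text "Hi all" spans [0, 2).
        int begin = toOriginal(0, deletions, false);
        int end = toOriginal(2, deletions, true);
        System.out.println(original.substring(begin, end));   // prints Hi
    }
}
```

The two boundary policies keep the mapped annotation from swallowing an adjacent removed tag; zero-length annotations sitting exactly on a deletion would need extra care that this sketch leaves out.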

If I understand the diagrams in the wiki correctly, there is one case where the sofa of the copied FS points to the source view but the FS is indexed in the target view. This seems to be the only difference between copying between CASes and copying within a CAS. I think it may be better/simpler/more consistent to set the sofa of the copy to null in both cases and if the user really wants the FS to point to a sofa in a different view, then he should set the sofa in this way manually after the copy is complete.

Btw... at least when copying individual FSes, the copy isn't indexed anyway by the CasCopier. We are talking only about the bulk-copy method then?

Cheers,

-- Richard

> On 01.04.2016, at 15:57, Marshall Schor <ms...@schor.com> wrote:
> 
> Hi Richard,
> 
> Thanks for this use-case.  I think there may be 2 subcases.
> 
> 1) The views, A and B, are in the same CAS, and
> 2) The views, A and B, are in different CASes
> 
> In case 1), with this new proposal the annotations copied from view A to B would
> have their "sofa" reference continue to point to the text in view A.  This means:
> 
> a) The references into the text are still "valid", but of course point to the
> text in view A.
> b) To do the updating process to have them point to the de-xml'ed version of the
> text, not only do the begin/end references need to be updated, but the sofa
> reference needs to be changed.  We could add an API to update that to the
> current view's.
> 
> In case 2), the annotations in B would no longer have a valid sofa reference at
> all (it would be set to null).
> This would clearly be a problem; but once again, we could add an API to update
> that to the current view's.
> 
> --------------------------------
> 
> So, it looks like this proposed design change would break the use-case you
> suggested. 
> 
> The current design would seem to support this use case but only if the two
> views are in different CASes.
> If they were in the same CAS, I think the current implementation (not tested,
> just reading the code) would have the copied Annotations have their sofa
> references be to the sofa in CAS A.
> 
> Does this match what you're currently seeing?
> 
> -Marshall
> 
> 
> On 3/31/2016 4:36 PM, Richard Eckart de Castilho wrote:
>> On 31.03.2016, at 21:22, Marshall Schor <ms...@schor.com> wrote:
>>> I'm thinking of changing how cas copier works with respect to managing Sofas and
>>> sofa ref updating.  I've written something up here:
>>> https://cwiki.apache.org/confluence/display/UIMA/CasCopier+and+Views
>>> 
>>> Comments / feedback / what did I overlook?  appreciated :-) -Marshall
>> Consider the following case:
>> 
>> - there are two views, A and B
>> - the text in B has been derived from A through some transformation, e.g. the removal of XML tags
>> - A contains UIMA annotations that represent the XML tags and that point into the text in A
>> - as part of a second transformation process, all annotations in A are to be copied into B
>> - after the copy has been performed, the offsets of the copied annotations are updated
>> 
>> Would such a scenario still be supported after the changes you suggest?
>> 
>> Best,
>> 
>> -- Richard


Re: changing edge case impl details in casCopiers

Posted by Marshall Schor <ms...@schor.com>.
Hi Richard,

Thanks for this use-case.  I think there may be 2 subcases.

1) The views, A and B, are in the same CAS, and
2) The views, A and B, are in different CASes

In case 1), with this new proposal the annotations copied from view A to B would
have their "sofa" reference continue to point to the text in view A.  This means:

a) The references into the text are still "valid", but of course point to the
text in view A.
b) To do the updating process to have them point to the de-xml'ed version of the
text, not only do the begin/end references need to be updated, but the sofa
reference needs to be changed.  We could add an API to update that to the
current view's.

In case 2), the annotations in B would no longer have a valid sofa reference at
all (it would be set to null).
This would clearly be a problem; but once again, we could add an API to update
that to the current view's.
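
The two subcases can be sketched with a small plain-Java model. Everything in it is hypothetical: ToyCas, ToySofa, ToyAnnotation, copyInto and retarget are illustration names, and retarget stands in for the "update to the current view's sofa" API that does not exist yet.

```java
/** Toy model of the two subcases; none of these types are UIMA classes. */
final class ProposedSofaRefSketch {

    static final class ToyCas {
        final String name;

        ToyCas(String name) {
            this.name = name;
        }
    }

    static final class ToySofa {
        final ToyCas cas;
        final String viewName;

        ToySofa(ToyCas cas, String viewName) {
            this.cas = cas;
            this.viewName = viewName;
        }
    }

    static final class ToyAnnotation {
        ToySofa sofa;
        int begin;
        int end;

        ToyAnnotation(ToySofa sofa, int begin, int end) {
            this.sofa = sofa;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Proposed behaviour as described above: keep the source sofa within one CAS, null across CASes. */
    static ToyAnnotation copyInto(ToyAnnotation source, ToySofa targetView) {
        boolean sameCas = source.sofa != null && source.sofa.cas == targetView.cas;
        return new ToyAnnotation(sameCas ? source.sofa : null, source.begin, source.end);
    }

    /** Hypothetical fix-up: point the copy at the given view and store the recomputed offsets. */
    static void retarget(ToyAnnotation fs, ToySofa view, int newBegin, int newEnd) {
        fs.sofa = view;
        fs.begin = newBegin;
        fs.end = newEnd;
    }

    public static void main(String[] args) {
        ToyCas cas1 = new ToyCas("cas1");
        ToyCas cas2 = new ToyCas("cas2");
        ToySofa viewA = new ToySofa(cas1, "A");
        ToySofa viewB = new ToySofa(cas1, "B");        // case 1: same CAS as A
        ToySofa otherB = new ToySofa(cas2, "B");       // case 2: different CAS

        ToyAnnotation tag = new ToyAnnotation(viewA, 0, 9);
        ToyAnnotation sameCasCopy = copyInto(tag, viewB);
        ToyAnnotation crossCasCopy = copyInto(tag, otherB);

        System.out.println(sameCasCopy.sofa.viewName);   // prints A
        System.out.println(crossCasCopy.sofa);           // prints null

        retarget(sameCasCopy, viewB, 0, 2);
        retarget(crossCasCopy, otherB, 0, 2);
        System.out.println(sameCasCopy.sofa.viewName);   // prints B
    }
}
```

In both cases the copy needs the same fix-up before it is usable in the target view, which is why a single update API would cover them.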

--------------------------------

So, it looks like this proposed design change would break the use-case you
suggested. 

The current design would seem to support this use case but only if the two
views are in different CASes.
If they were in the same CAS, I think the current implementation (not tested,
just reading the code) would have the copied Annotations have their sofa
references be to the sofa in CAS A.

Does this match what you're currently seeing?

-Marshall


On 3/31/2016 4:36 PM, Richard Eckart de Castilho wrote:
> On 31.03.2016, at 21:22, Marshall Schor <ms...@schor.com> wrote:
>> I'm thinking of changing how cas copier works with respect to managing Sofas and
>> sofa ref updating.  I've written something up here:
>> https://cwiki.apache.org/confluence/display/UIMA/CasCopier+and+Views
>>
>> Comments / feedback / what did I overlook?  appreciated :-) -Marshall
> Consider the following case:
>
> - there are two views, A and B
> - the text in B has been derived from A through some transformation, e.g. the removal of XML tags
> - A contains UIMA annotations that represent the XML tags and that point into the text in A
> - as part of a second transformation process, all annotations in A are to be copied into B
> - after the copy has been performed, the offsets of the copied annotations are updated
>
> Would such a scenario still be supported after the changes you suggest?
>
> Best,
>
> -- Richard
>
>


Re: changing edge case impl details in casCopiers

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 31.03.2016, at 21:22, Marshall Schor <ms...@schor.com> wrote:
> 
> I'm thinking of changing how cas copier works with respect to managing Sofas and
> sofa ref updating.  I've written something up here:
> https://cwiki.apache.org/confluence/display/UIMA/CasCopier+and+Views
> 
> Comments / feedback / what did I overlook?  appreciated :-) -Marshall

Consider the following case:

- there are two views, A and B
- the text in B has been derived from A through some transformation, e.g. the removal of XML tags
- A contains UIMA annotations that represent the XML tags and that point into the text in A
- as part of a second transformation process, all annotations in A are to be copied into B
- after the copy has been performed, the offsets of the copied annotations are updated

Would such a scenario still be supported after the changes you suggest?
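
To illustrate the scenario in plain Java (no UIMA involved), the sketch below removes tags from a text and records, for every offset in A, the matching offset in B. TagStripSketch, Result and strip are hypothetical names, and the tag detection is deliberately naive: it treats everything between '<' and the next '>' as a tag and ignores comments, CDATA and entities.

```java
/** Sketch only: strips tags and keeps an offset map from the original text (A) to the result (B). */
final class TagStripSketch {

    static final class Result {
        final String text;        // text of view B
        final int[] offsetMap;    // offsetMap[i] = offset in B for offset i in A

        Result(String text, int[] offsetMap) {
            this.text = text;
            this.offsetMap = offsetMap;
        }
    }

    static Result strip(String original) {
        StringBuilder out = new StringBuilder();
        int[] map = new int[original.length() + 1];
        boolean inTag = false;
        for (int i = 0; i < original.length(); i++) {
            map[i] = out.length();
            char c = original.charAt(i);
            if (c == '<') {
                inTag = true;             // tag opens, drop the character
            } else if (c == '>' && inTag) {
                inTag = false;            // tag closes, drop the character
            } else if (!inTag) {
                out.append(c);            // ordinary text is kept
            }
        }
        map[original.length()] = out.length();
        return new Result(out.toString(), map);
    }

    public static void main(String[] args) {
        Result b = strip("<b>Hi</b> all");
        System.out.println(b.text);   // prints Hi all

        // An annotation in A covering the whole element "<b>Hi</b>" spans [0, 9).
        int begin = b.offsetMap[0];
        int end = b.offsetMap[9];
        System.out.println(b.text.substring(begin, end));   // prints Hi
    }
}
```

After the copy, an annotation's begin and end would be replaced by the mapped values, so that it covers the element content in B.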

Best,

-- Richard