Posted to dev@uima.apache.org by Richard Eckart de Castilho <re...@apache.org> on 2017/11/30 22:04:17 UTC

Deserializing CAS v2 -> v3 + type system change

Hi,

now that DKPro Core can be made to run on UIMA v3, I'm looking into WebAnno next.
It uses serialized CASes and CAS addresses a lot, so I am very curious
if/how well it works with v3.

However, I have a small problem that may or may not be related to the v3
upgrade. As part of upgrading WebAnno from v2 to v3, I am also upgrading
it from DKPro Core 1.7.0 to 1.9.0-SNAPSHOT (v3 branch). We have had some
changes in the type system from 1.7.0 to 1.9.0, one of which I am now hitting:

The supertype of the annotation "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Document"
has changed from "uima.tcas.Annotation" in 1.7.0 to
"de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div" in 1.9.0 and it
appears that the v3 CAS deserialization code is validating JCas classes against
the data being read from a serialized CAS:

> 2017-11-30 22:38:58 ERROR [admin] AnnotationPage - Error: The JCas class: "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Document" has supertype: "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div" which doesn't match the UIMA type "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Document"'s supertype "uima.tcas.Annotation".
> org.apache.uima.cas.CASRuntimeException: The JCas class: "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Document" has supertype: "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div" which doesn't match the UIMA type "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Document"'s supertype "uima.tcas.Annotation".
> 	at org.apache.uima.cas.impl.FSClassRegistry.validateSuperClass(FSClassRegistry.java:460) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
> 	at org.apache.uima.cas.impl.FSClassRegistry.maybeLoadJCasAndSubtypes(FSClassRegistry.java:391) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
> 	at org.apache.uima.cas.impl.FSClassRegistry.maybeLoadJCasAndSubtypes(FSClassRegistry.java:435) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
> 	at org.apache.uima.cas.impl.FSClassRegistry.maybeLoadJCasAndSubtypes(FSClassRegistry.java:435) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
> 	at org.apache.uima.cas.impl.FSClassRegistry.maybeLoadJCasAndSubtypes(FSClassRegistry.java:435) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
> 	at org.apache.uima.cas.impl.FSClassRegistry.loadJCasForTSandClassLoader(FSClassRegistry.java:334) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
> 	at org.apache.uima.cas.impl.FSClassRegistry.getGeneratorsForClassLoader(FSClassRegistry.java:871) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
> 	at org.apache.uima.cas.impl.TypeSystemImpl.getGeneratorsForClassLoader(TypeSystemImpl.java:2651) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
> 	at org.apache.uima.cas.impl.TypeSystemImpl.commit(TypeSystemImpl.java:1393) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
> 	at org.apache.uima.cas.impl.CASImpl.commitTypeSystem(CASImpl.java:1532) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
> 	at org.apache.uima.cas.impl.BinaryCasSerDes.reinit(BinaryCasSerDes.java:314) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
> 	at org.apache.uima.cas.impl.Serialization.deserializeCASComplete(Serialization.java:129) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]


I wonder if that is really necessary and whether it can be turned off or worked around.
Is this just a consistency check or does the deserialization really produce invalid results
in such a case? 

Normally (i.e. v2), I would expect that I should be able to deserialize any
CAS data into a CAS instance no matter if there are JCas classes available and no matter
what their inheritance hierarchy is. I may not be able to use the JCas classes to access
this particular deserialized CAS, but via the CAS interface, it should at least be possible
to access all the data.

If there is no way to do a "lenient loading" directly from the serialized CAS, at least it
would be good if there was a way to load the serialized data into a CAS, write that out
again in another format (XMI or another binary format supporting lenient loading) and
to load it back in to the desired target type system, i.e. the one which matches the JCas
classes that are on the classpath.

Cheers,

-- Richard

Re: Deserializing CAS v2 -> v3 + type system change

Posted by Marshall Schor <ms...@schor.com>.
The CasIOUtils API uses an enum CasLoadMode, currently having 3 values:

    DEFAULT, LENIENT, and REINIT

I propose providing the local config option for this by adding 3 new enum values:

    DEFAULT_PRESERVE_V2_IDS, LENIENT_PRESERVE_V2_IDS and REINIT_PRESERVE_V2_IDS

The existing APIs already are designed to take one instance of the CasLoadMode,
so these have to be multiplied out, it seems.
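
To illustrate what "multiplying out" means here, the following self-contained
sketch models the six values as a plain Java enum. LoadModeSketch, its three
boolean fields and withPreservedIds() are hypothetical names used only for
illustration; this is not the actual CasLoadMode source, and the meaning
attached to each flag is an assumption.

```java
// Hypothetical stand-in for the proposed enum values; not the real CasLoadMode.
enum LoadModeSketch {
    DEFAULT(false, false, false),
    LENIENT(true, false, false),
    REINIT(false, true, false),
    DEFAULT_PRESERVE_V2_IDS(false, false, true),
    LENIENT_PRESERVE_V2_IDS(true, false, true),
    REINIT_PRESERVE_V2_IDS(false, true, true);

    final boolean lenient;        // assumed: tolerate type system differences
    final boolean reinit;         // assumed: re-initialize the CAS from the loaded data
    final boolean preserveV2Ids;  // the new dimension proposed in this thread

    LoadModeSketch(boolean lenient, boolean reinit, boolean preserveV2Ids) {
        this.lenient = lenient;
        this.reinit = reinit;
        this.preserveV2Ids = preserveV2Ids;
    }

    /** Returns the variant of this mode that also preserves v2 ids. */
    LoadModeSketch withPreservedIds() {
        if (preserveV2Ids) {
            return this;
        }
        return valueOf(name() + "_PRESERVE_V2_IDS");
    }
}
```

Because the existing methods take exactly one mode value, each base mode needs
its own "preserve" twin; a separate boolean parameter would avoid the doubling
but would change the method signatures.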

-Marshall




Re: Deserializing CAS v2 -> v3 + type system change

Posted by Marshall Schor <ms...@schor.com>.
I've thought a bit about generalizing this.  Since V3 supports arbitrary Java
objects in the CAS, users could construct their own custom mapping between ids
(ints or strings or ...) and FSs and carry these around in the CAS, with weak or
strong references, using plain or linked hashmaps (if iteration order was
important), etc.

Using something like that would still require some kind of bridge or migration
code, so the deserializers which use the V2 addresses (for backwards
compatibility) could "dump" these into whatever kind of map users wanted.

-----------------

There is a backwards compatibility concern, though, where it would be nice to
run applications which depended on this in V3.  An implementation "mode" could
do this as follows:

Use the mode to modify some operations:

1) modify deserialization operations (for those deserializers where they are
maintaining a v2 id mapping to FSs):  at the end, dump the v2 id map to fss into
either:
  a linked hash map with weak references
    and special code to remove entries
    when the weak reference is gc'd, or
  just the currently existing map used for low-level cas fs-creation
    which does not use weak refs (required)

2) a way to access this new map: either overload the current low level accessors
(if alternative 1) based on the "mode" to use this new map, or provide a
different API (would imply less backwards compatibility).

3) new serialization to have the serialized id's match the saved ones.  This
could be implemented by having the deserialization install the v2 id's as the v3
ids, and then set the internals so new id's would be created after the highest
deserialized id.  This approach would facilitate reusing the existing low level
APIs for this.
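
As a rough sketch of point 3 (not UIMA code): a toy id table that installs
objects under their saved ids and then hands out new ids above the highest
installed one. IdTableSketch and its methods are hypothetical names; the map
holds strong references, which matches the non-weak alternative in point 1.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of an id -> feature structure table; hypothetical, not UIMA code.
final class IdTableSketch<T> {
    // Strong references: entries stay retrievable and are never GC'd.
    private final Map<Integer, T> byId = new HashMap<>();
    private int nextId = 1;

    /** Installs a deserialized object under its saved (v2) id. */
    void install(int savedId, T fs) {
        if (savedId <= 0) {
            throw new IllegalArgumentException("ids must be positive: " + savedId);
        }
        if (byId.containsKey(savedId)) {
            throw new IllegalStateException("duplicate id " + savedId);
        }
        byId.put(savedId, fs);
        // Keep the allocator above every id seen so far.
        nextId = Math.max(nextId, savedId + 1);
    }

    /** Registers a newly created object and returns its fresh id. */
    int add(T fs) {
        int id = nextId++;
        byId.put(id, fs);
        return id;
    }

    /** Returns the object for the id, or null if there is none. */
    T get(int id) {
        return byId.get(id);
    }
}
```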

The mode could be set programmatically (which would require modifying the
application of course) or via a -Duima.xxxx style of start-up configuration
option (-Duima.v2_deserialize_preserve_ids).

====================

I propose to implement this focusing more on making it backward-compatible,
reusing the existing low-level APIs and the internal maps they use (which are
not weak refs), and an external flag (which should permit running a V2
application unmodified), plus some local config option whose default value is
set from the external flag.
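
A minimal sketch of that "local option, defaulting to the external flag"
resolution, using only the JDK. The property name is the one floated above;
PreserveIdsConfigSketch and its methods are hypothetical, and treating a
present-but-empty property as "on" is my assumption.

```java
import java.util.Properties;

// Hypothetical helper; shows only the precedence, not any real UIMA API.
final class PreserveIdsConfigSketch {
    // Property name taken from the proposal in this thread.
    static final String FLAG = "uima.v2_deserialize_preserve_ids";

    /** Global default: on if the property is present, unless its value is "false". */
    static boolean globalDefault(Properties props) {
        String v = props.getProperty(FLAG);
        return v != null && !v.equalsIgnoreCase("false");
    }

    /** A non-null local option wins; otherwise fall back to the global flag. */
    static boolean resolve(Boolean localOption, Properties props) {
        if (localOption != null) {
            return localOption.booleanValue();
        }
        return globalDefault(props);
    }
}
```

An unmodified v2 application would only ever hit the fallback branch, e.g.
resolve(null, System.getProperties()).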

WDYT?

-Marshall



Re: Deserializing CAS v2 -> v3 + type system change

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 02.12.2017, at 20:03, Marshall Schor <ms...@schor.com> wrote:
> 
>> That is completely acceptable. Although, wouldn't the use of weak references
>> be a way to allow the FSes to be looked up by ID and still be GCed?
> Yes, the weak ref would allow the FS to be GC'd. 
> 
> But here's a use case which fails with this design:
> 1) the original CAS is created.  It has a bunch of FSs, whose IDs are written
> out into a database somewhere.
> 1a) some of the FSs become unreachable. 
> 2) The CAS is serialized by V2, and includes the unreachables.
> 
> 3) the FSs are deserialized
> 4) the database of ids is used to "look up" these FSs - it may fail to find the
> unreachable ones, because they may be GC'd.
> 
> I assume the intent is to have the FSs be found?

Well, as long as the FSes are indexed in the CAS, they are reachable, right? So they
wouldn't be GCed. When FSes are removed from indexes and also no longer reachable via
other FSes, then it should be ok to GC them. I guess in v2 even FSes that were removed
from indexes and were no longer reachable could be retrieved via their addresses - but
at least in my context, I consider non-indexed and non-reachable FSes to be actually
non-existent. I don't think it would cause a problem in WebAnno if such FSes were GCed.

>>> I don't think I'd want this as the "normal" mode of deserialization.  Perhaps we
>>> could have some kind of a -Duima.xxxx flag that turned on this mode for those
>>> cases which needed it. 
>> That would be an option. Personally, I'd favor a local configuration option for
>> this in the deserializer, but the default value of that could be controlled
>> via a global property.
> 
> Would just a local config option be OK?

That should be sufficient for my case, yes.

-- Richard

Re: Deserializing CAS v2 -> v3 + type system change

Posted by Marshall Schor <ms...@schor.com>.

On 12/2/2017 8:23 AM, Richard Eckart de Castilho wrote:
> On 02.12.2017, at 04:06, Marshall Schor <ms...@schor.com> wrote:
>>> WebAnno doesn't make use of the unreferenced FSes - there is no problem
>>> if they are GCed. However, WebAnno relies on the fact that for v2 CASes
>>> stored/loaded using CasCompleteSerializer have stable addresses/IDs.
>>>
>>> The following should be possible:
>>>
>>> * create a FS in the CAS
>>> * let X be the ID of the FS
>>> * save CAS to file
>>> * load CAS to memory
>>> * get the FS from the CAS via X
>> This could be made to work, but as I've said before, it would prevent future
>> GC's from happening on the loaded FSs, should some updates occur which made them
>> unreachable.
>>
>> Is this acceptable?
> That is completely acceptable. Although, wouldn't the use of weak references
> be a way to allow the FSes to be looked up by ID and still be GCed?
Yes, the weak ref would allow the FS to be GC'd. 

But here's a use case which fails with this design:
1) the original CAS is created.  It has a bunch of FSs, whose IDs are written
out into a database somewhere.
1a) some of the FSs become unreachable. 
2) The CAS is serialized by V2, and includes the unreachables.

3) the FSs are deserialized
4) the database of ids is used to "look up" these FSs - it may fail to find the
unreachable ones, because they may be GC'd.

I assume the intent is to have the FSs be found? 

-Marshall

>
>> I don't think I'd want this as the "normal" mode of deserialization.  Perhaps we
>> could have some kind of a -Duima.xxxx flag that turned on this mode for those
>> cases which needed it. 
> That would be an option. Personally, I'd favor a local configuration option for
> this in the deserializer, but the default value of that could be controlled
> via a global property.

Would just a local config option be OK?

-M
>
> Cheers,
>
> -- Richard


Re: Deserializing CAS v2 -> v3 + type system change

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 02.12.2017, at 04:06, Marshall Schor <ms...@schor.com> wrote:
> 
>> WebAnno doesn't make use of the unreferenced FSes - there is no problem
>> if they are GCed. However, WebAnno relies on the fact that for v2 CASes
>> stored/loaded using CasCompleteSerializer have stable addresses/IDs.
>> 
>> The following should be possible:
>> 
>> * create a FS in the CAS
>> * let X be the ID of the FS
>> * save CAS to file
>> * load CAS to memory
>> * get the FS from the CAS via X
> This could be made to work, but as I've said before, it would prevent future
> GC's from happening on the loaded FSs, should some updates occur which made them
> unreachable.
> 
> Is this acceptable?

That is completely acceptable. Although, wouldn't the use of weak references
be a way to allow the FSes to be looked up by ID and still be GCed?
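
To make the weak-reference idea concrete, here is a small JDK-only sketch of
an id map whose values are weakly held, with stale entries removed via a
ReferenceQueue. WeakIdMapSketch is a hypothetical name and this is not how
uimaj-core is implemented; it only shows the lookup-by-id-but-still-collectable
behavior.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical id -> object map with weakly held values; not UIMA code.
final class WeakIdMapSketch<T> {
    // Weak reference that remembers its id so its entry can be removed later.
    private static final class Ref<T> extends WeakReference<T> {
        final int id;

        Ref(int id, T value, ReferenceQueue<T> queue) {
            super(value, queue);
            this.id = id;
        }
    }

    private final Map<Integer, Ref<T>> map = new LinkedHashMap<>();
    private final ReferenceQueue<T> queue = new ReferenceQueue<>();

    void put(int id, T fs) {
        expunge();
        map.put(id, new Ref<>(id, fs, queue));
    }

    /** Returns the object for the id, or null if unknown or already collected. */
    T get(int id) {
        expunge();
        Ref<T> ref = map.get(id);
        return ref == null ? null : ref.get();
    }

    int size() {
        expunge();
        return map.size();
    }

    // Drops entries whose referent the garbage collector has already cleared.
    private void expunge() {
        Reference<? extends T> cleared;
        while ((cleared = queue.poll()) != null) {
            Ref<?> ref = (Ref<?>) cleared;
            map.remove(ref.id, ref); // only remove if the entry was not replaced
        }
    }
}
```

With such a map, get(id) keeps working for as long as something else (an
index, another FS, application code) holds the object, and returns null once
the object has been collected.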

> I don't think I'd want this as the "normal" mode of deserialization.  Perhaps we
> could have some kind of a -Duima.xxxx flag that turned on this mode for those
> cases which needed it. 

That would be an option. Personally, I'd favor a local configuration option for
this in the deserializer, but the default value of that could be controlled
via a global property.

Cheers,

-- Richard

Re: Deserializing CAS v2 -> v3 + type system change

Posted by Marshall Schor <ms...@schor.com>.
On 12/1/2017 5:39 PM, Richard Eckart de Castilho wrote:
> On 01.12.2017, at 22:47, Marshall Schor <ms...@schor.com> wrote:
>> I'm wondering if some other approach could be done, that would treat these kinds
>> of use cases specially, and not have to give up the v3 benefits like GC in the
>> general case.
>>
>> A more general question: The v3 approach to serialization / deserialization is
>> to serialize just those FSs which are indexed, or reachable from other
>> serializable things. Does this work for WebAnno?  A consequence would be, that
>> deserializing a CAS produced by v2, which had a bunch of FSs which were not
>> indexed, and not referenced by anything, would end up being GC'd.
>>
>>    Whereas, in v2, they would be in the "cas" and gettable via their "address"
>> (integer).
>>
>> Is this (maybe made-up) use case something that goes on in WebAnno?
> WebAnno doesn't make use of the unreferenced FSes - there is no problem
> if they are GCed. However, WebAnno relies on the fact that for v2 CASes
> stored/loaded using CasCompleteSerializer have stable addresses/IDs.
>
> The following should be possible:
>
> * create a FS in the CAS
> * let X be the ID of the FS
> * save CAS to file
> * load CAS to memory
> * get the FS from the CAS via X
This could be made to work, but as I've said before, it would prevent future
GC's from happening on the loaded FSs, should some updates occur which made them
unreachable.

Is this acceptable?

I don't think I'd want this as the "normal" mode of deserialization.  Perhaps we
could have some kind of a -Duima.xxxx flag that turned on this mode for those
cases which needed it. 

WDYT?

-Marshall
> So X should be stable across the save/load cycle. In v2, this works
> because there is no GC when using CasCompleteSerializer. WebAnno has
> an alternative way of doing GC at times when the stability of the
> IDs is not relevant: it temporarily saves the CAS into form 6 and
> then loads it back - this does not only do GC, but also permits 
> factoring in changes to the type system.
>
> Cheers,
>
> -- Richard


Re: Deserializing CAS v2 -> v3 + type system change

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 01.12.2017, at 22:47, Marshall Schor <ms...@schor.com> wrote:
> 
> I'm wondering if some other approach could be done, that would treat these kinds
> of use cases specially, and not have to give up the v3 benefits like GC in the
> general case.
> 
> A more general question: The v3 approach to serialization / deserialization is
> to serialize just those FSs which are indexed, or reachable from other
> serializable things. Does this work for WebAnno?  A consequence would be, that
> deserializing a CAS produced by v2, which had a bunch of FSs which were not
> indexed, and not referenced by anything, would end up being GC'd.
> 
>    Whereas, in v2, they would be in the "cas" and gettable via their "address"
> (integer).
> 
> Is this (maybe made-up) use case something that goes on in WebAnno?

WebAnno doesn't make use of the unreferenced FSes - there is no problem
if they are GCed. However, WebAnno relies on the fact that for v2 CASes
stored/loaded using CasCompleteSerializer have stable addresses/IDs.

The following should be possible:

* create a FS in the CAS
* let X be the ID of the FS
* save CAS to file
* load CAS to memory
* get the FS from the CAS via X

So X should be stable across the save/load cycle. In v2, this works
because there is no GC when using CasCompleteSerializer. WebAnno has
an alternative way of doing GC at times when the stability of the
IDs is not relevant: it temporarily saves the CAS into form 6 and
then loads it back - this does not only do GC, but also permits 
factoring in changes to the type system.
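
To show why IDs cannot be expected to survive such a GC-style round trip, here
is a toy model (plain Java, no UIMA): only objects that are indexed or
reachable from indexed ones survive, and the survivors are renumbered. The
class name, the data layout and the consecutive renumbering are assumptions
made for illustration; the ids a real form 6 round trip assigns may differ.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy reachability-based compaction; hypothetical, not the form 6 algorithm.
final class CompactionSketch {
    /**
     * @param refs    old id -> ids it references (stand-in for FS-valued features)
     * @param indexed old ids of objects that are in some index
     * @return old id -> new id for every survivor, renumbered from 1
     */
    static Map<Integer, Integer> compact(Map<Integer, List<Integer>> refs,
                                         Set<Integer> indexed) {
        // Collect everything reachable from the indexed objects.
        Set<Integer> reachable = new TreeSet<>();
        Deque<Integer> todo = new ArrayDeque<>(indexed);
        while (!todo.isEmpty()) {
            Integer id = todo.pop();
            if (reachable.add(id)) {
                todo.addAll(refs.getOrDefault(id, Collections.<Integer>emptyList()));
            }
        }
        // Survivors get fresh consecutive ids; everything else is dropped.
        Map<Integer, Integer> renumbered = new LinkedHashMap<>();
        int next = 1;
        for (Integer oldId : reachable) {
            renumbered.put(oldId, next++);
        }
        return renumbered;
    }
}
```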

Cheers,

-- Richard

Re: Deserializing CAS v2 -> v3 + type system change

Posted by Marshall Schor <ms...@schor.com>.
On 12/1/2017 4:16 PM, Richard Eckart de Castilho wrote:
> On 01.12.2017, at 16:12, Marshall Schor <ms...@schor.com> wrote:
>> I'm in the middle of things with changes for 3.0.0sdk;  could you check out the
>> 3.0.0-beta tag and change
>> in the uimaj-core project,
>>   Class: org.apache.uima.cas.impl.FSClassRegistry
>>   to comment out the throw clause: lines 456-460
>>
>> and then rebuild uimaj-core, try it, and let me know if it works?
>>
>> If so, let's put in a Jira to switch the "throw" to a "report" (not completely
>> straight-forward, but not too hard- I can do the change...)
> Commenting out these lines, I can load the CAS and I can also see that WebAnno
> can access and render the annotations. 
>
> However, it doesn't seem to be possible to retrieve the annotations by their addresses:
> org.apache.uima.cas.impl.LowLevelException: Error in low-level CAS APIs: accessing FS with id 15, but no such FS exists in this CAS.
> 	at org.apache.uima.cas.impl.CASImpl.getFsFromId_checked(CASImpl.java:2444) ~[classes/:?]
> 	at org.apache.uima.cas.impl.CASImpl.ll_getFSForRef(CASImpl.java:2641) ~[classes/:?]
>
> WebAnno uses the CasCompleteSerializer since FS addresses in v2 remained stable with this
> particular serialization format.
>
> I checked the CASImpl.svd.id2fs (JCasHashMap): all four of its JCasHashMapSubMaps report a size of 0.
> It seems like during the deserialization, the id2fs map is not updated with the addresses obtained
> from the serialized file. 

The JCasHashMap is normally not used and not maintained - hence you see it is
0.  It is there mainly to support Pear trampolines.

>
> Digging further, I found that BinaryCasSerDes.createFSsFromHeaps actually sets up a
> map of the v2 IDs obtained from the serialized CAS to the v3 FSes, but it does not actually
> set the IDs of the v3 FSes to the values obtained from the CAS. Unfortunately, this is an
> essential assumption made in the WebAnno code. 
>
> It looks to me that v3 is rather flexible with respect to assigning IDs to FSes (unlike
> v2 where this was bound to the heap organization). It would be great if this flexibility
> could be used in order to assign the IDs in the way that they are read from the serialized
> CAS (cf. CommonSerDesSequential.addr2fs).
This could be done, but it would not be enough.  On top of this, you would need
to have a map from these numbers to the feature structures.

The map you saw throwing the exception, is not normally populated, because it
prevents "garbage collection" of unreferenced Feature Structures.  The map is
used when low level APIs are used to create Feature Structures.  This is
required because of a race condition that can happen where a GC happens after
the Feature Structure is created (returning an "int"), and before that Feature
Structure instance can be "held onto" by something to prevent GC.

I'm wondering if some other approach could be done, that would treat these kinds
of use cases specially, and not have to give up the v3 benefits like GC in the
general case.

A more general question: The v3 approach to serialization / deserialization is
to serialize just those FSs which are indexed, or reachable from other
serializable things. Does this work for WebAnno?  A consequence would be, that
deserializing a CAS produced by v2, which had a bunch of FSs which were not
indexed, and not referenced by anything, would end up being GC'd.

   Whereas, in v2, they would be in the "cas" and gettable via their "address"
(integer).

Is this (maybe made-up) use case something that goes on in WebAnno?

  If so, we may need some creative thinking - how to support something like
this, while keeping the v3 benefits.

Thanks for all your testing.
I'll put in a Jira to change the throw to a message, for the JCas super class test.

-Marshall

> Cheers,
>
> -- Richard


Re: Deserializing CAS v2 -> v3 + type system change

Posted by Marshall Schor <ms...@schor.com>.

On 12/1/2017 5:31 PM, Richard Eckart de Castilho wrote:
> On 01.12.2017, at 22:16, Richard Eckart de Castilho <re...@apache.org> wrote:
>> It looks to me that v3 is rather flexible with respect to assigning IDs to FSes (unlike
>> v2 where this was bound to the heap organization). It would be great if this flexibility
>> could be used in order to assign the IDs in the way that they are read from the serialized
>> CAS (cf. CommonSerDesSequential.addr2fs).
> I believe one way to address this without refactoring major parts of the JCas code would
> be to allow CommonSerDesSequential to set CASImpl.SharedViewData.reuseId to the heapIndex
> before instantiating the FS.
More would be needed to ensure the ids for subsequently created FSs were
assigned starting above the highest one deserialized.

There might be other things to watch out for...

-Marshall
>
> -- Richard


Re: Deserializing CAS v2 -> v3 + type system change

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 01.12.2017, at 22:16, Richard Eckart de Castilho <re...@apache.org> wrote:
> 
> It looks to me that v3 is rather flexible with respect to assigning IDs to FSes (unlike
> v2 where this was bound to the heap organization). It would be great if this flexibility
> could be used in order to assign the IDs in the way that they are read from the serialized
> CAS (cf. CommonSerDesSequential.addr2fs).

I believe one way to address this without refactoring major parts of the JCas code would
be to allow CommonSerDesSequential to set CASImpl.SharedViewData.reuseId to the heapIndex
before instantiating the FS.

-- Richard

Re: Deserializing CAS v2 -> v3 + type system change

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 01.12.2017, at 16:12, Marshall Schor <ms...@schor.com> wrote:
> 
> I'm in the middle of things with changes for 3.0.0sdk;  could you check out the
> 3.0.0-beta tag and change
> in the uimaj-core project,
>   Class: org.apache.uima.cas.impl.FSClassRegistry
>   to comment out the throw clause: lines 456-460
> 
> and then rebuild uimaj-core, try it, and let me know if it works?
> 
> If so, let's put in a Jira to switch the "throw" to a "report" (not completely
> straight-forward, but not too hard- I can do the change...)

Commenting out these lines, I can load the CAS and I can also see that WebAnno
can access and render the annotations. 

However, it doesn't seem to be possible to retrieve the annotations by their addresses:

org.apache.uima.cas.impl.LowLevelException: Error in low-level CAS APIs: accessing FS with id 15, but no such FS exists in this CAS.
	at org.apache.uima.cas.impl.CASImpl.getFsFromId_checked(CASImpl.java:2444) ~[classes/:?]
	at org.apache.uima.cas.impl.CASImpl.ll_getFSForRef(CASImpl.java:2641) ~[classes/:?]

WebAnno uses the CasCompleteSerializer since FS addresses in v2 remained stable with this
particular serialization format.

I checked the CASImpl.svd.id2fs (JCasHashMap): all four of its JCasHashMapSubMaps report a size of 0.
It seems like during the deserialization, the id2fs map is not updated with the addresses obtained
from the serialized file. 

Digging further, I found that BinaryCasSerDes.createFSsFromHeaps actually sets up a
map of the v2 IDs obtained from the serialized CAS to the v3 FSes, but it does not actually
set the IDs of the v3 FSes to the values obtained from the CAS. Unfortunately, this is an
essential assumption made in the WebAnno code. 

It looks to me that v3 is rather flexible with respect to assigning IDs to FSes (unlike
v2 where this was bound to the heap organization). It would be great if this flexibility
could be used in order to assign the IDs in the way that they are read from the serialized
CAS (cf. CommonSerDesSequential.addr2fs).

Cheers,

-- Richard

Re: Deserializing CAS v2 -> v3 + type system change

Posted by Marshall Schor <ms...@schor.com>.
To summarize:

JCas classes loaded: 
  X  (extends SX)
  SX (extends Annotation)

Deserialize a CAS with a type system having:
  X (extends Annotation)

Alternative use case:
  start with empty system,
  create a new pipeline from a spec which includes type X (extends Annotation)
  have a classpath which has the JCas class definitions listed at the top.

Thinking about this:  I think the implementation could be relaxed to report this
case, but continue. 
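
A sketch of what "report, but continue" could look like, reduced to plain maps
of type name to supertype name. SuperclassCheckSketch and validate() are
hypothetical names; this is not the FSClassRegistry code, just the
throw-versus-report distinction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical consistency check; not the actual FSClassRegistry logic.
final class SuperclassCheckSketch {
    /**
     * @param typeSystemSuper type name -> supertype according to the loaded type system
     * @param jcasSuper       type name -> supertype according to the JCas classes
     * @param throwOnMismatch true for the strict behavior, false to only report
     * @return one message per mismatch (empty if everything is consistent)
     */
    static List<String> validate(Map<String, String> typeSystemSuper,
                                 Map<String, String> jcasSuper,
                                 boolean throwOnMismatch) {
        List<String> reports = new ArrayList<>();
        for (Map.Entry<String, String> e : jcasSuper.entrySet()) {
            String uimaSuper = typeSystemSuper.get(e.getKey());
            // Skip JCas classes whose type is absent from this type system.
            if (uimaSuper == null || uimaSuper.equals(e.getValue())) {
                continue;
            }
            String msg = "JCas class " + e.getKey() + " has supertype " + e.getValue()
                    + " but the UIMA type has supertype " + uimaSuper;
            if (throwOnMismatch) {
                throw new IllegalStateException(msg);
            }
            reports.add(msg);
        }
        return reports;
    }
}
```

For the case above (JCas: X extends SX, SX extends Annotation; type system: X
extends Annotation) the lenient call returns a single report for X and the
load can proceed.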

I'm in the middle of things with changes for 3.0.0sdk;  could you check out the
3.0.0-beta tag and change
in the uimaj-core project,
  Class: org.apache.uima.cas.impl.FSClassRegistry
  to comment out the throw clause: lines 456-460

and then rebuild uimaj-core, try it, and let me know if it works?

If so, let's put in a Jira to switch the "throw" to a "report" (not completely
straight-forward, but not too hard- I can do the change...)

Thanks. -Marshall



Re: Deserializing CAS v2 -> v3 + type system change

Posted by Marshall Schor <ms...@schor.com>.
I'll have to think about this, probably in the morning, when there are more
neurons available :-).

-Marshall


On 11/30/2017 5:04 PM, Richard Eckart de Castilho wrote:
> Hi,
>
> now that DKPro Core can be made to run on UIMA v3, I'm looking into WebAnno next.
> It uses serialized CASes and CAS addresses a lot, so I am very curious
> if/how well it works with v3.
>
> However, I have a small problem that may or may not be related to the v3
> upgrade. As part of upgrading WebAnno from v2 to v3, I am also upgrading
> it from DKPro Core 1.7.0 to 1.9.0-SNAPSHOT (v3 branch). We have had some
> changes in the type system from 1.7.0 to 1.9.0 on of which I am now hitting:
>
> The supertype of the annotation "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Document"
> has changed from "uima.tcas.Annotation" in 1.7.0 to
> "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div" in 1.9.0 and it
> appears that the v3 CAS deserialization code is validating JCas classes against
> the data being read from a serialized CAS:
>
>> 2017-11-30 22:38:58 ERROR [admin] AnnotationPage - Error: The JCas class: "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Document" has supertype: "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div" which doesn't match the UIMA type "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Document"'s supertype "uima.tcas.Annotation".
>> org.apache.uima.cas.CASRuntimeException: The JCas class: "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Document" has supertype: "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div" which doesn't match the UIMA type "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Document"'s supertype "uima.tcas.Annotation".
>> 	at org.apache.uima.cas.impl.FSClassRegistry.validateSuperClass(FSClassRegistry.java:460) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
>> 	at org.apache.uima.cas.impl.FSClassRegistry.maybeLoadJCasAndSubtypes(FSClassRegistry.java:391) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
>> 	at org.apache.uima.cas.impl.FSClassRegistry.maybeLoadJCasAndSubtypes(FSClassRegistry.java:435) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
>> 	at org.apache.uima.cas.impl.FSClassRegistry.maybeLoadJCasAndSubtypes(FSClassRegistry.java:435) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
>> 	at org.apache.uima.cas.impl.FSClassRegistry.maybeLoadJCasAndSubtypes(FSClassRegistry.java:435) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
>> 	at org.apache.uima.cas.impl.FSClassRegistry.loadJCasForTSandClassLoader(FSClassRegistry.java:334) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
>> 	at org.apache.uima.cas.impl.FSClassRegistry.getGeneratorsForClassLoader(FSClassRegistry.java:871) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
>> 	at org.apache.uima.cas.impl.TypeSystemImpl.getGeneratorsForClassLoader(TypeSystemImpl.java:2651) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
>> 	at org.apache.uima.cas.impl.TypeSystemImpl.commit(TypeSystemImpl.java:1393) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
>> 	at org.apache.uima.cas.impl.CASImpl.commitTypeSystem(CASImpl.java:1532) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
>> 	at org.apache.uima.cas.impl.BinaryCasSerDes.reinit(BinaryCasSerDes.java:314) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
>> 	at org.apache.uima.cas.impl.Serialization.deserializeCASComplete(Serialization.java:129) ~[uimaj-core-3.0.0-beta.jar:3.0.0-beta]
>
> I wonder if that is really necessary and whether it can be turned off or worked around.
> Is this just a consistency check or does the deserialization really produce invalid results
> in such a case? 
>
> Normally (i.e. v2), I would expect that I should be able to deserialize any
> CAS data into a CAS instance no matter if there are JCas classes available and no matter
> what their inheritance hierarchy is. I may not be able to use the JCas classes to access
> this particular deserialized CAS, but via the CAS interface, it should at least be possible
> to access all the data.
>
> If there is no way to do a "lenient loading" directly from the serialized CAS, at least it
> would be good if there was a way to load the serialized data into a CAS, write that out
> again in another format (XMI or an other binary format supporting lenient loading) and
> to load it back in to the desired target type system, i.e. the one which matches the JCas
> classes that are on the classpath.
>
> Cheers,
>
> -- Richard