Posted to user@uima.apache.org by Mario Juric <mj...@unsilo.ai> on 2019/09/12 08:41:31 UTC

Migrating type system of form 6 compressed CAS binaries

Hi,

We use form 6 compressed binaries to persist the CAS. We now want to make a change to the type system that is not directly compatible, although in principle the new type system is really a subset of the old one from a data perspective. We therefore want to migrate existing binaries to the new type system, but we don’t know how. The change is as follows:

In the existing type system we have a type A with an FSArray feature of element type X, and we want to change X to Y, where Y contains a genuine feature subset of X. This means we basically want to replace X with Y for the FSArray and ditch a few attributes of X when loading the CAS into the new type system.

Had the CAS been stored as JSON this would be trivial, by just mapping the attributes that they have in common, but when I try to load the CAS binary into the new target type system it chokes with an EOF, so I don’t know whether this is at all possible with a form 6 compressed CAS binary?

I poked around a bit in the reference documentation, the API and the mailing list archive, but I was not able to find anything useful. I can of course keep parallel attributes for both X and Y and then have a separate step that makes an explicit conversion/copy (a sketch of this follows below), but I would prefer to avoid it. I would appreciate any input on the problem, thanks :)

Cheers,
Mario
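
The explicit conversion/copy step mentioned above could be scripted as a one-off migration along the following lines. This is only a minimal sketch under stated assumptions: the type and feature names (my.pkg.A, my.pkg.X, my.pkg.Y, "features") and file names are hypothetical placeholders, the A instances are assumed to be indexed, and the CAS is created from a transitional type system that still declares both X and Y:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import org.apache.uima.cas.ArrayFS;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.FSIterator;
    import org.apache.uima.cas.Feature;
    import org.apache.uima.cas.FeatureStructure;
    import org.apache.uima.cas.SerialFormat;
    import org.apache.uima.cas.Type;
    import org.apache.uima.resource.metadata.TypeSystemDescription;
    import org.apache.uima.util.CasCreationUtils;
    import org.apache.uima.util.CasIOUtils;

    public class MigrateForm6 {
      // transitionalTsd must still declare both the old type X and the new type Y
      static void migrate(TypeSystemDescription transitionalTsd) throws Exception {
        CAS cas = CasCreationUtils.createCas(transitionalTsd, null, null);
        try (FileInputStream in = new FileInputStream("doc.bin")) {
          CasIOUtils.load(in, cas); // form 6 binary, loaded with the transitional type system
        }
        Type aType = cas.getTypeSystem().getType("my.pkg.A");
        Type yType = cas.getTypeSystem().getType("my.pkg.Y");
        Feature arrFeat = aType.getFeatureByBaseName("features"); // the FSArray feature
        FSIterator<FeatureStructure> it = cas.getIndexRepository().getAllIndexedFS(aType);
        while (it.hasNext()) {
          FeatureStructure a = it.next();
          ArrayFS oldArr = (ArrayFS) a.getFeatureValue(arrFeat);
          if (oldArr == null) continue;
          ArrayFS newArr = cas.createArrayFS(oldArr.size());
          for (int i = 0; i < oldArr.size(); i++) {
            FeatureStructure x = oldArr.get(i);
            FeatureStructure y = cas.createFS(yType);
            // copy the primitive-valued features that X and Y have in common, by name
            for (Feature yf : yType.getFeatures()) {
              Feature xf = x.getType().getFeatureByBaseName(yf.getShortName());
              if (xf != null && yf.getRange().isPrimitive()) {
                String v = x.getFeatureValueAsString(xf);
                if (v != null) y.setFeatureValueFromString(yf, v);
              }
            }
            newArr.set(i, y);
          }
          a.setFeatureValue(arrFeat, newArr); // the X instances become unreachable
        }
        try (FileOutputStream out = new FileOutputStream("doc-migrated.bin")) {
          CasIOUtils.save(cas, out, SerialFormat.COMPRESSED_FILTERED); // write form 6 again
        }
      }
    }

Since form 6 keeps only reachable feature structures, the orphaned X instances should not be written out, and the migrated binary should then load into the new type system, which tolerates types present on one side only.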

Re: Migrating type system of form 6 compressed CAS binaries

Posted by Mario Juric <mj...@unsilo.ai>.
Thanks. I will take a look at it and then get back to you.

Cheers,
Mario

> On 25 Sep 2019, at 20:46 , Marshall Schor <ms...@schor.com> wrote:
> 
> Here's working code that serializes in XML 1.1 format.
> 
> The key idea is to set the OutputProperty OutputKeys.VERSION to "1.1".
> 
> // Uses org.apache.uima.cas.impl.XmiCasSerializer, org.apache.uima.util.XMLSerializer
> // and javax.xml.transform.OutputKeys.
> XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
> OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
> try {
>   XMLSerializer xml11Serializer = new XMLSerializer(out);
>   xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>   xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
> } finally {
>   out.close();
> }
> 
> This is from a test case. -Marshall
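
Reading the resulting file back in is not shown above. Presumably the standard XMI deserializer applies unchanged, since the SAX parser picks the XML version up from the document's declaration. A minimal sketch, assuming a CAS variable cas created beforehand with the same type system:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.uima.cas.impl.XmiCasDeserializer;

    // cas: a CAS created with the same (or a compatible) type system
    try (InputStream in = new FileInputStream("odd-doc-txt-v11.xmi")) {
      XmiCasDeserializer.deserialize(in, cas);
    }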
> 
> On 9/25/2019 2:16 PM, Mario Juric wrote:
>> Thanks Marshall,
>> 
>> If you prefer then I can also have a look at it, although I probably need to finish something first within the next 3-4 weeks. It would get me started faster if you could share some of your experimental sample code.
>> 
>> Cheers,
>> Mario
>> 
>>> On 24 Sep 2019, at 21:32 , Marshall Schor <ms...@schor.com> wrote:
>>> 
>>> yes, makes sense, thanks for posting the Jira.
>>> 
>>> If no one else steps up to work on this, I'll probably take a look in a few
>>> days. -Marshall
>>> 
>>> On 9/24/2019 6:47 AM, Mario Juric wrote:
>>>> Hi Marshall,
>>>> 
>>>> I added the following feature request to Apache Jira:
>>>> 
>>>> https://issues.apache.org/jira/browse/UIMA-6128
>>>> 
>>>> Hope it makes sense :)
>>>> 
>>>> Thanks a lot for the help, it’s appreciated.
>>>> 
>>>> Cheers,
>>>> Mario
>>>> 
>>>>> On 23 Sep 2019, at 16:33 , Marshall Schor <ms...@schor.com> wrote:
>>>>> 
>>>>> Re: serializing using XML 1.1
>>>>> 
>>>>> This was not thought of when setting up the CasIOUtils.
>>>>> 
>>>>> The way it was done (above) was using some more "primitive/lower level" APIs,
>>>>> rather than the CasIOUtils.
>>>>> 
>>>>> Please open a Jira ticket for this, with perhaps some suggestions on how it
>>>>> might be specified in the CasIOUtils APIs.
>>>>> 
>>>>> Thanks! -Marshall
>>>>> 
>>>>> On 9/23/2019 3:45 AM, Mario Juric wrote:
>>>>>> Hi Marshall,
>>>>>> 
>>>>>> Thanks for the thorough and excellent investigation.
>>>>>> 
>>>>>> We are looking into possible normalisation/cleanup of whitespace/invisible characters, but I don’t think we can necessarily do the same for some of the other characters. It sounds to me though that serialising to XML 1.1 could also be a simple fix right now, but can this be configured? CasIOUtils doesn’t seem to have an option for this, so I assume it’s something you have working in your branch.
>>>>>> 
>>>>>> Regarding the other problem: it seems that the JDK bug is fixed in Java 9 and later. Do you think switching to a more recent Java version would make a difference? I think we can also try this out ourselves when we look into migrating to UIMA 3 once our current deliveries are complete. We would also like to switch to Java 11, and like the UIMA 3 migration it will require some thorough testing.
>>>>>> 
>>>>>> Cheers,
>>>>>> Mario
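
For context, the CasIOUtils call being discussed looks like the sketch below; as noted, it exposes no output property for the XML version (cas and the file name are placeholders):

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.uima.cas.SerialFormat;
    import org.apache.uima.util.CasIOUtils;

    try (OutputStream out = new FileOutputStream("doc.xmi")) {
      CasIOUtils.save(cas, out, SerialFormat.XMI); // no knob here for the XML version
    }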
>>>>>> 
>>>>>>> On 20 Sep 2019, at 20:52 , Marshall Schor <ms...@schor.com> wrote:
>>>>>>> 
>>>>>>> In the test "OddDocumentText", this produces a "throw" due to an invalid xml
>>>>>>> char, which is the \u0002.
>>>>>>> 
>>>>>>> This is in part because the xml version being used is xml 1.0.
>>>>>>> 
>>>>>>> XML 1.1 expanded the set of valid characters to include \u0002.
>>>>>>> 
>>>>>>> Here's a snip from the XmiCasSerializerTest class which serializes with xml 1.1:
>>>>>>> 
>>>>>>>      XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
>>>>>>>      OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>>>>>>>      try {
>>>>>>>        XMLSerializer xml11Serializer = new XMLSerializer(out);
>>>>>>>        xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>>>>>>>        xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
>>>>>>>      }
>>>>>>>      finally {
>>>>>>>        out.close();
>>>>>>>      }
>>>>>>> 
>>>>>>> This succeeds and serializes this using xml 1.1.
>>>>>>> 
>>>>>>> I also tried serializing some doc text which includes the code point 77987
>>>>>>> (the pair \uD80C\uDCA3).  That did not serialize correctly.
>>>>>>> I could see it in the code while tracing down into the innards of the internal
>>>>>>> SAX java code (com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize),
>>>>>>> where it was still correct in the Java string.
>>>>>>> 
>>>>>>> When serialized (as UTF-8) it came out as the 4-byte sequence E7 9E 98 37.
>>>>>>> 
>>>>>>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in utf8 is a 3 byte encoding:
>>>>>>>      1110 xxxx 10xx xxxx 10xx xxxx
>>>>>>> 
>>>>>>> of 0111 0111 1001 1000 which in hex is "7 7 9 8" so it looks fishy to me.
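
That byte-level analysis can be reproduced directly; a quick check that decodes the four bytes seen in the output:

    import java.nio.charset.StandardCharsets;

    public class Utf8Check {
      public static void main(String[] args) {
        byte[] seen = { (byte) 0xE7, (byte) 0x9E, (byte) 0x98, 0x37 };
        String s = new String(seen, StandardCharsets.UTF_8);
        // prints "7798 0037": U+7798 followed by an ASCII '7',
        // not the single code point that went in
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1));
      }
    }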
>>>>>>> 
>>>>>>> But I think it's out of our hands - it's somewhere deep in the sax transform
>>>>>>> java code.
>>>>>>> 
>>>>>>> I looked for a bug report and found some
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>>>>>> 
>>>>>>> Bottom line is, I think, to clean out these characters early :-) .
>>>>>>> 
>>>>>>> -Marshall
>>>>>>> 
>>>>>>> 
>>>>>>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>>>>>>> here's an idea.
>>>>>>>> 
>>>>>>>> If you have a string with the surrogate pair &#77987; at position 10, and you
>>>>>>>> have some Java code, which is iterating through the string and getting the
>>>>>>>> code-point at each character offset, then that code will produce:
>>>>>>>> 
>>>>>>>> at position 10:  the code-point 77987
>>>>>>>> at position 11:  the code-point 56483
>>>>>>>> 
>>>>>>>> Of course, it's a "bug" to iterate through a string of characters, assuming you
>>>>>>>> have characters at each point, if you don't handle surrogate pairs.
>>>>>>>> 
>>>>>>>> The 56483 is just the lower bits of the surrogate pair, added to xDC00 (see
>>>>>>>> https://tools.ietf.org/html/rfc2781 )
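
A minimal demonstration of that iteration behaviour, using the same code point discussed elsewhere in the thread ("\uD80C\uDCA3", i.e. 77987):

    public class SurrogateDemo {
      public static void main(String[] args) {
        // code point 77987 (U+130A3) at index 10, stored as a surrogate pair
        String s = "0123456789" + "\uD80C\uDCA3";
        System.out.println(s.codePointAt(10)); // 77987 -- the full code point
        System.out.println(s.codePointAt(11)); // 56483 -- the low surrogate alone
        System.out.println(s.codePointCount(0, s.length())); // 11 code points in 12 chars
      }
    }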
>>>>>>>> 
>>>>>>>> I worry that even tools like the CVD or similar may not work properly, since
>>>>>>>> they're not designed to handle surrogate pairs, I think, so I have no idea if
>>>>>>>> they would work well enough for you.
>>>>>>>> 
>>>>>>>> I'll poke around some more to see if I can enable the conversion for document
>>>>>>>> strings.
>>>>>>>> 
>>>>>>>> -Marshall
>>>>>>>> 
>>>>>>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>>>>>>> Thanks Marshall,
>>>>>>>>> 
>>>>>>>>> Encoding the characters like you suggest should work just fine for us as long as we can serialize and deserialise the XMI, so that we can open the content in a tool like the CVD or similar. These characters are just noise from the original content that happen to remain in the CAS, but they are not visible in our final output because they are basically filtered out one way or the other by downstream components. They become a problem though when they make it more difficult for us to inspect the content.
>>>>>>>>> 
>>>>>>>>> Regarding the feature name issue: Might you have an idea why we are getting a different XMI output for the same character in our actual pipeline, where it results in "&#77987;&#56483;"? I investigated the value in the debugger again, and as you illustrate it is also just a single code point with the value 77987. We are simply not able to load this XMI because of this, but unfortunately I couldn’t reproduce it in my small example.
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> Mario
>>>>>>>>> 
>>>>>>>>>> On 19 Sep 2019, at 22:41 , Marshall Schor <ms...@schor.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> The odd-feature-text seems to work OK, but has some unusual properties, due to
>>>>>>>>>> that unicode character.
>>>>>>>>>> 
>>>>>>>>>> Here's what I see:  The FeatureRecord "name" field is set to a
>>>>>>>>>> 1-unicode-character, that must be encoded as 2 java characters.
>>>>>>>>>> 
>>>>>>>>>> When output, it shows up in the xmi as <noNamespace:FeatureRecord xmi:id="18"
>>>>>>>>>> name="&#77987;" value="1.0"/>
>>>>>>>>>> which seems correct.  The name field only has 1 (extended)unicode character
>>>>>>>>>> (taking 2 Java characters to represent),
>>>>>>>>>> due to setting it with this code:   String oddName = "\uD80C\uDCA3";
>>>>>>>>>> 
>>>>>>>>>> When read in, the name field is assigned to a String, that string says it has a
>>>>>>>>>> length of 2 (but that's because it takes 2 java chars to represent this char).
>>>>>>>>>> If you have the name string in a variable "n", and do
>>>>>>>>>> System.out.println(n.codePointAt(0)), it shows (correctly) 77987.
>>>>>>>>>> n.codePointCount(0, n.length()) is, as expected, 1.
>>>>>>>>>> 
>>>>>>>>>> So, the string value serialization and deserialization seems to be "working".
>>>>>>>>>> 
>>>>>>>>>> The other code - for the sofa (document) serialization, is throwing that error,
>>>>>>>>>> because as currently designed, the
>>>>>>>>>> serialization code checks for these kinds of characters, and if found throws
>>>>>>>>>> that exception.  The code checking is
>>>>>>>>>> in XMLUtils.checkForNonXmlCharacters
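
A pre-flight check with that same utility might look as follows. This is a sketch; it assumes, consistent with its use in the serializer, that the method returns the index of the first offending character, or -1 when the text is clean:

    import org.apache.uima.internal.util.XMLUtils;

    int bad = XMLUtils.checkForNonXmlCharacters(cas.getDocumentText());
    if (bad >= 0) {
      // strip or replace the offending character(s) before serializing to XMI
    }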
>>>>>>>>>> 
>>>>>>>>>> This is because it's highly likely that "fixing this" in the same way as the
>>>>>>>>>> other, would result in hard-to-diagnose
>>>>>>>>>> future errors, because the subject of analysis string is processed with begin /
>>>>>>>>>> end offset all over the place, and makes
>>>>>>>>>> the assumption that the characters are all not coded as surrogate pairs.
>>>>>>>>>> 
>>>>>>>>>> We could change the code to output these like the name, as, e.g.,  &#77987; 
>>>>>>>>>> 
>>>>>>>>>> Would that help in your case, or do you imagine other kinds of things might
>>>>>>>>>> break (due to begin/end offsets no longer
>>>>>>>>>> being on character boundaries, for example)?
>>>>>>>>>> 
>>>>>>>>>> -Marshall
>>>>>>>>>> 
>>>>>>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> I investigated the XMI issue as promised and these are my findings.
>>>>>>>>>>> 
>>>>>>>>>>> It is related to special unicode characters that are not handled by XMI
>>>>>>>>>>> serialisation, and there seem to be two distinct categories of issues we have
>>>>>>>>>>> identified so far.
>>>>>>>>>>> 
>>>>>>>>>>> 1) The document text of the CAS contains special unicode characters
>>>>>>>>>>> 2) Annotations with String features have values containing special unicode
>>>>>>>>>>> characters
>>>>>>>>>>> 
>>>>>>>>>>> In both cases we could for sure solve the problem if we did a better clean up
>>>>>>>>>>> job upstream, but with the amount and variety of data we receive there is
>>>>>>>>>>> always a chance something passes through, and some of it may in the general
>>>>>>>>>>> case even be valid content.
>>>>>>>>>>> 
>>>>>>>>>>> The first case is easy to reproduce with the OddDocumentText example I
>>>>>>>>>>> attached. In this example the text is a snippet taken from the content of a
>>>>>>>>>>> parsed XML document.
>>>>>>>>>>> 
>>>>>>>>>>> The other case was not possible to reproduce with the OddFeatureText example,
>>>>>>>>>>> because I am getting slightly different output to what I have in our real
>>>>>>>>>>> setup. The OddFeatureText example is based on the simple type system I shared
>>>>>>>>>>> previously. The name value of a FeatureRecord contains special unicode
>>>>>>>>>>> characters that I found in a similar data structure in our actual CAS. The
>>>>>>>>>>> value comes from an external knowledge base holding some noisy strings, which
>>>>>>>>>>> in this case is a hieroglyph entity. However, when I write the CAS to XMI
>>>>>>>>>>> using the small example it only outputs the first of the two characters in
>>>>>>>>>>> "\uD80C\uDCA3”, which yields the value "&#77987;” in the XMI, but in our
>>>>>>>>>>> actual setup both character values are written as "&#77987;&#56483;”. This
>>>>>>>>>>> means that the attached example will for some reason parse the XMI again, but
>>>>>>>>>>> it will not work in the case where both characters are written the way we
>>>>>>>>>>> experience it. The XMI can be manually changed, so that both character values
>>>>>>>>>>> are included the way it happens in our output, and in this case a
>>>>>>>>>>> SAXParseException happens.
>>>>>>>>>>> 
>>>>>>>>>>> I don’t know whether it is outside the scope of the XMI serialiser to handle
>>>>>>>>>>> any of this, but it will be good to know in any case :)
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Mario
>>>>>>>>>>> 
>>>>>>>>>>>> On 17 Sep 2019, at 09:36 , Mario Juric <mj@unsilo.ai> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Thank you very much for looking into this. It is really appreciated and I
>>>>>>>>>>>> think it touches upon something important, which is about data migration in
>>>>>>>>>>>> general.
>>>>>>>>>>>> 
>>>>>>>>>>>> I agree that some of these solutions can appear specific, awkward or complex
>>>>>>>>>>>> and the way forward is not to address our use case alone. I think there is a
>>>>>>>>>>>> need for a compact and efficient binary serialization format for the CAS when
>>>>>>>>>>>> dealing with large amounts of data because this is directly visible in costs
>>>>>>>>>>>> of processing and storing, and I found the compressed binary format to be
>>>>>>>>>>>> much better than XMI in this regard, although I have to admit it’s been a
>>>>>>>>>>>> while since I benchmarked this. Given that UIMA already has a well described
>>>>>>>>>>>> type system then maybe it just lacks a way to describe schema evolution
>>>>>>>>>>>> similar to what Apache Avro and other serialisation frameworks provide. I think a more
>>>>>>>>>>>> formal approach to data migration would be critical to any larger operational
>>>>>>>>>>>> setup.
>>>>>>>>>>>> 
>>>>>>>>>>>> Regarding XMI I like to provide some input to the problem we are observing,
>>>>>>>>>>>> so that it can be solved. We are primarily using XMI for inspection/debugging
>>>>>>>>>>>> purposes, and we are sometimes not able to do this because of this error. I
>>>>>>>>>>>> will try to extract a minimal example to avoid involving parts that have to do
>>>>>>>>>>>> with our pipeline and type system, and I think this would also be the best
>>>>>>>>>>>> way to illustrate that the problem exists outside of this context. However,
>>>>>>>>>>>> converting all our data to XMI first in order to do the conversion in our
>>>>>>>>>>>> example would not be very practical for us, because it involves a large
>>>>>>>>>>>> amount of data.
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Mario
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 16 Sep 2019, at 23:02 , Marshall Schor <msa@schor.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In this case, the original looks kind-of like this:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Container
>>>>>>>>>>>>> features -> FSArray of FeatureAnnotation each of which
>>>>>>>>>>>>>                          has 5 slots: sofaRef, begin, end, name, value
>>>>>>>>>>>>> 
>>>>>>>>>>>>> the new TypeSystem has
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Container
>>>>>>>>>>>>> features -> FSArray of FeatureRecord each of which
>>>>>>>>>>>>>                           has 2 slots: name, value
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>>>>>>> 1) create an FSArray of FeatureRecord,
>>>>>>>>>>>>> 2) for each element,
>>>>>>>>>>>>>   map the FeatureAnnotation to a new instance of FeatureRecord
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>>>>>>>> 1) change the type from A to B
>>>>>>>>>>>>> 2) set equal-named features from A to B, drop other features
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This mapping would need to apply to a subset of the A's and B's, namely, only
>>>>>>>>>>>>> those referenced by the FSArray where the element type changed.  Seems complex
>>>>>>>>>>>>> and specific to this use case though.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -Marshall
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <msa@schor.com> wrote:
>>>>>>>>>>>>>>> I can reproduce the problem, and see what is happening.  The deserialization
>>>>>>>>>>>>>>> code compares the two type systems, and allows for some mismatches (things
>>>>>>>>>>>>>>> present in one and not in the other), but it doesn't allow for having a
>>>>>>>>>>>>>>> feature
>>>>>>>>>>>>>>> whose range (value) is type XXXX in one type system and type YYYY in the
>>>>>>>>>>>>>>> other.
>>>>>>>>>>>>>>> See CasTypeSystemMapper lines 299 - 315.
>>>>>>>>>>>>>> Without reading the code in detail - could we not relax this check such
>>>>>>>>>>>>>> that the element type of FSArrays is not checked and the code simply
>>>>>>>>>>>>>> assumes that the source element type has the same features as the target
>>>>>>>>>>>>>> element type (with the usual lenient handling of missing features in the
>>>>>>>>>>>>>> target type)? - Kind of a "duck typing" approach?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -- Richard
>> 


Re: Migrating type system of form 6 compressed CAS binaries

Posted by Marshall Schor <ms...@schor.com>.
Here's code that works that serializes in 1.1 format.

The key idea is to set the OutputProperty OutputKeys.VERSION to "1.1".

XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
OutputStream out = new FileOutputStream(new File ("odd-doc-txt-v11.xmi"));
try {
  XMLSerializer xml11Serializer = new XMLSerializer(out);
  xml11Serializer.setOutputProperty(OutputKeys.VERSION,"1.1");
  xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
    }
finally {
  out.close();
}

This is from a test case. -Marshall

On 9/25/2019 2:16 PM, Mario Juric wrote:
> Thanks Marshall,
>
> If you prefer then I can also have a look at it, although I probably need to finish something first within the next 3-4 weeks. It would probably get me faster started if you could share some of your experimental sample code.
>
> Cheers,
> Mario
>
>
>
>
>
>
>
>
>
>
>
>
>
>> On 24 Sep 2019, at 21:32 , Marshall Schor <ms...@schor.com> wrote:
>>
>> yes, makes sense, thanks for posting the Jira.
>>
>> If no one else steps up to work on this, I'll probably take a look in a few
>> days. -Marshall
>>
>> On 9/24/2019 6:47 AM, Mario Juric wrote:
>>> Hi Marshall,
>>>
>>> I added the following feature request to Apache Jira:
>>>
>>> https://issues.apache.org/jira/browse/UIMA-6128
>>>
>>> Hope it makes sense :)
>>>
>>> Thanks a lot for the help, it’s appreciated.
>>>
>>> Cheers,
>>> Mario
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>> On 23 Sep 2019, at 16:33 , Marshall Schor <ms...@schor.com> wrote:
>>>>
>>>> Re: serializing using XML 1.1
>>>>
>>>> This was not thought of, when setting up the CasIOUtils.
>>>>
>>>> The way it was done (above) was using some more "primitive/lower level" APIs,
>>>> rather than the CasIOUtils.
>>>>
>>>> Please open a Jira ticket for this, with perhaps some suggestions on how it
>>>> might be specified in the CasIOUtils APIs.
>>>>
>>>> Thanks! -Marshall
>>>>
>>>> On 9/23/2019 3:45 AM, Mario Juric wrote:
>>>>> Hi Marshall,
>>>>>
>>>>> Thanks for the thorough and excellent investigation.
>>>>>
>>>>> We are looking into possible normalisation/cleanup of whitespace/invisible characters, but I don’t think we can necessarily do the same for some of the other characters. It sounds to me though that serialising to XML 1.1 could also be a simple fix right now, but can this be configured? CasIOUtils doesn’t seem to have an option for this, so I assume it’s something you have working in your branch.
>>>>>
>>>>> Regarding the other problem. It seems that the JDK bug is fixed from Java 9 and after. Do you think switching to a more recent Java version would make a difference? I think we can also try this out ourselves when we look into migrating to UIMA 3 once our current deliveries are complete. We also like to switch to Java 11, and like UIMA 3 migration it will require some thorough testing.
>>>>>
>>>>> Cheers,
>>>>> Mario
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On 20 Sep 2019, at 20:52 , Marshall Schor <ms...@schor.com> wrote:
>>>>>>
>>>>>> In the test "OddDocumentText", this produces a "throw" due to an invalid xml
>>>>>> char, which is the \u0002.
>>>>>>
>>>>>> This is in part because the xml version being used is xml 1.0.
>>>>>>
>>>>>> XML 1.1 expanded the set of valid characters to include \u0002.
>>>>>>
>>>>>> Here's a snip from the XmiCasSerializerTest class which serializes with xml 1.1:
>>>>>>
>>>>>>       XmiCasSerializer xmiCasSerializer = new
>>>>>> XmiCasSerializer(jCas.getTypeSystem());
>>>>>>       OutputStream out = new FileOutputStream(new File ("odd-doc-txt-v11.xmi"));
>>>>>>       try {
>>>>>>         XMLSerializer xml11Serializer = new XMLSerializer(out);
>>>>>>         xml11Serializer.setOutputProperty(OutputKeys.VERSION,"1.1");
>>>>>>         xmiCasSerializer.serialize(jCas.getCas(),
>>>>>> xml11Serializer.getContentHandler());
>>>>>>       }
>>>>>>       finally {
>>>>>>         out.close();
>>>>>>       }
>>>>>>
>>>>>> This succeeds and serializes this using xml 1.1.
>>>>>>
>>>>>> I also tried serializing some doc text which includes \u77987.  That did not
>>>>>> serialize correctly.
>>>>>> I could see it in the code while tracing up to some point down in the innards of
>>>>>> some internal
>>>>>> sax java code
>>>>>> com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize  where it was
>>>>>> "Correct" in the Java string.
>>>>>>
>>>>>> When serialized (as UTF-8) it came out as a 4 byte string E79E 9837.
>>>>>>
>>>>>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in utf8 is a 3 byte encoding:
>>>>>>       1110 xxxx 10xx xxxx 10xx xxxx
>>>>>>
>>>>>> of 0111 0111 1001 1000 which in hex is "7 7 9 8" so it looks fishy to me.
>>>>>>
>>>>>> But I think it's out of our hands - it's somewhere deep in the sax transform
>>>>>> java code.
>>>>>>
>>>>>> I looked for a bug report and found some
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>>>>>
>>>>>> Bottom line, is, I think to clean out these characters early :-) .
>>>>>>
>>>>>> -Marshall
>>>>>>
>>>>>>
>>>>>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>>>>>> here's an idea.
>>>>>>>
>>>>>>> If you have a string, with the surrogate pair &#77987 at position 10, and you
>>>>>>> have some Java code, which is iterating through the string and getting the
>>>>>>> code-point at each character offset, then that code will produce:
>>>>>>>
>>>>>>> at position 10:  the code-point 77987
>>>>>>> at position 11:  the code-point 56483
>>>>>>>
>>>>>>> Of course, it's a "bug" to iterate through a string of characters, assuming you
>>>>>>> have characters at each point, if you don't handle surrogate pairs.
>>>>>>>
>>>>>>> The 56483 is just the lower bits of the surrogate pair, added to xDC00 (see
>>>>>>> https://tools.ietf.org/html/rfc2781 )
>>>>>>>
>>>>>>> I worry that even tools like the CVD or similar may not work properly, since
>>>>>>> they're not designed to handle surrogate pairs, I think, so I have no idea if
>>>>>>> they would work well enough for you.
>>>>>>>
>>>>>>> I'll poke around some more to see if I can enable the conversion for document
>>>>>>> strings.
>>>>>>>
>>>>>>> -Marshall
>>>>>>>
>>>>>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>>>>>> Thanks Marshall,
>>>>>>>>
>>>>>>>> Encoding the characters like you suggest should work just fine for us as long as we can serialize and deserialise the XMI, so that we can open the content in a tool like the CVD or similar. These characters are just noise from the original content that happen to remain in the CAS, but they are not visible in our final output because they are basically filtered out one way or the other by downstream components. They become a problem though when they make it more difficult for us to inspect the content.
>>>>>>>>
>>>>>>>> Regarding the feature name issue: Might you have an idea why we are getting a different XMI output for the same character in our actual pipeline, where it results in "&#77987;&#56483;”? I investigated the value in the debugger again, and like you are illustrating it is also just a single codepoint with the value 77987. We are simply not able to load this XMI because of this, but unfortunately I couldn’t reproduce it in my small example.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Mario
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 19 Sep 2019, at 22:41 , Marshall Schor <ms...@schor.com> wrote:
>>>>>>>>>
>>>>>>>>> The odd-feature-text seems to work OK, but has some unusual properties, due to
>>>>>>>>> that unicode character.
>>>>>>>>>
>>>>>>>>> Here's what I see:  The FeatureRecord "name" field is set to a
>>>>>>>>> 1-unicode-character, that must be encoded as 2 java characters.
>>>>>>>>>
>>>>>>>>> When output, it shows up in the xmi as <noNamespace:FeatureRecord xmi:id="18"
>>>>>>>>> name="&#77987;" value="1.0"/>
>>>>>>>>> which seems correct.  The name field only has 1 (extended)unicode character
>>>>>>>>> (taking 2 Java characters to represent),
>>>>>>>>> due to setting it with this code:   String oddName = "\uD80C\uDCA3";
>>>>>>>>>
>>>>>>>>> When read in, the name field is assigned to a String, that string says it has a
>>>>>>>>> length of 2 (but that's because it takes 2 java chars to represent this char).
>>>>>>>>> If you have the name string in a variable "n", and do
>>>>>>>>> System.out.println(n.codePointAt(0)), it shows (correctly) 77987.
>>>>>>>>> n.codePointCount(0, n.length()) is, as expected, 1.
>>>>>>>>>
>>>>>>>>> So, the string value serialization and deserialization seems to be "working".
>>>>>>>>>
>>>>>>>>> The other code - for the sofa (document) serialization, is throwing that error,
>>>>>>>>> because as currently designed, the
>>>>>>>>> serialization code checks for these kinds of characters, and if found throws
>>>>>>>>> that exception.  The code checking is
>>>>>>>>> in XMLUtils.checkForNonXmlCharacters
>>>>>>>>>
>>>>>>>>> This is because it's highly likely that "fixing this" in the same way as the
>>>>>>>>> other, would result in hard-to-diagnose
>>>>>>>>> future errors, because the subject of analysis string is processed with begin /
>>>>>>>>> end offset all over the place, and makes
>>>>>>>>> the assumption that the characters are all not coded as surrogate pairs.
>>>>>>>>>
>>>>>>>>> We could change the code to output these like the name, as, e.g.,  &#77987; 
>>>>>>>>>
>>>>>>>>> Would that help in your case, or do you imagine other kinds of things might
>>>>>>>>> break (due to begin/end offsets no longer
>>>>>>>>> being on character boundaries, for example).
>>>>>>>>>
>>>>>>>>> -Marshall
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I investigated the XMI issue as promised and these are my findings.
>>>>>>>>>>
>>>>>>>>>> It is related to special unicode characters that are not handled by XMI
>>>>>>>>>> serialisation, and there seems to be two distinct categories of issues we have
>>>>>>>>>> identified so far.
>>>>>>>>>>
>>>>>>>>>> 1) The document text of the CAS contains special unicode characters
>>>>>>>>>> 2) Annotations with String features have values containing special unicode
>>>>>>>>>> characters
>>>>>>>>>>
>>>>>>>>>> In both cases we could for sure solve the problem if we did a better clean up
>>>>>>>>>> job upstream, but with the amount and variety of data we receive there is
>>>>>>>>>> always a chance something passes through, and some of it may in the general
>>>>>>>>>> case even be valid content.
>>>>>>>>>>
>>>>>>>>>> The first case is easy to reproduce with the OddDocumentText example I
>>>>>>>>>> attached. In this example the text is a snippet taken from the content of a
>>>>>>>>>> parsed XML document.
>>>>>>>>>>
>>>>>>>>>> The other case was not possible to reproduce with the OddFeatureText example,
>>>>>>>>>> because I am getting slightly different output to what I have in our real
>>>>>>>>>> setup. The OddFeatureText example is based on the simple type system I shared
>>>>>>>>>> previously. The name value of a FeatureRecord contains special unicode
>>>>>>>>>> characters that I found in a similar data structure in our actual CAS. The
>>>>>>>>>> value comes from an external knowledge base holding some noisy strings, which
>>>>>>>>>> in this case is a hieroglyph entity. However, when I write the CAS to XMI
>>>>>>>>>> using the small example it only outputs the first of the two characters in
>>>>>>>>>> "\uD80C\uDCA3”, which yields the value "&#77987;” in the XMI, but in our
>>>>>>>>>> actual setup both character values are written as "&#77987;&#56483;”. This
>>>>>>>>>> means that the attached example will for some reason parse the XMI again, but
>>>>>>>>>> it will not work in the case where both characters are written the way we
>>>>>>>>>> experience it. The XMI can be manually changed, so that both character values
>>>>>>>>>> are included the way it happens in our output, and in this case a
>>>>>>>>>> SAXParserException happens.
>>>>>>>>>>
>>>>>>>>>> I don’t know whether it is outside the scope of the XMI serialiser to handle
>>>>>>>>>> any of this, but it will be good to know in any case :)
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Mario
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 17 Sep 2019, at 09:36 , Mario Juric <mj@unsilo.ai <ma...@unsilo.ai> <mailto:mj@unsilo.ai <ma...@unsilo.ai>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much for looking into this. It is really appreciated and I
>>>>>>>>>>> think it touches upon something important, which is about data migration in
>>>>>>>>>>> general.
>>>>>>>>>>>
>>>>>>>>>>> I agree that some of these solutions can appear specific, awkward or complex
>>>>>>>>>>> and the way forward is not to address our use case alone. I think there is a
>>>>>>>>>>> need for a compact and efficient binary serialization format for the CAS when
>>>>>>>>>>> dealing with large amounts of data because this is directly visible in costs
>>>>>>>>>>> of processing and storing, and I found the compressed binary format to be
>>>>>>>>>>> much better than XMI in this regard, although I have to admit it’s been a
>>>>>>>>>>> while since I benchmarked this. Given that UIMA already has a well described
>>>>>>>>>>> type system then maybe it just lacks a way to describe schema evolution
>>>>>>>>>>> similar to Apache Avro or similar serialisation frameworks. I think a more
>>>>>>>>>>> formal approach to data migration would be critical to any larger operational
>>>>>>>>>>> setup.
>>>>>>>>>>>
>>>>>>>>>>> Regarding XMI I like to provide some input to the problem we are observing,
>>>>>>>>>>> so that it can be solved. We are primarily using XMI for inspection/debugging
>>>>>>>>>>> purposes, and we are sometimes not able to do this because of this error. I
>>>>>>>>>>> will try to extract a minimum example to avoid involving parts that has to do
>>>>>>>>>>> with our pipeline and type system, and I think this would also be the best
>>>>>>>>>>> way to illustrate that the problem exists outside of this context. However,
>>>>>>>>>>> converting all our data to XMI first in order to do the conversion in our
>>>>>>>>>>> example would not be very practical for us, because it involves a large
>>>>>>>>>>> amount of data.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Mario
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On 16 Sep 2019, at 23:02 , Marshall Schor <msa@schor.com <ma...@schor.com>
>>>>>>>>>>>> <mailto:msa@schor.com <ma...@schor.com>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> In this case, the original looks kind-of like this:
>>>>>>>>>>>>
>>>>>>>>>>>> Container
>>>>>>>>>>>> features -> FSArray of FeatureAnnotation each of which
>>>>>>>>>>>>                           has 5 slots: sofaRef, begin, end, name, value
>>>>>>>>>>>>
>>>>>>>>>>>> the new TypeSystem has
>>>>>>>>>>>>
>>>>>>>>>>>> Container
>>>>>>>>>>>> features -> FSArray of FeatureRecord each of which
>>>>>>>>>>>>                            has 2 slots: name, value
>>>>>>>>>>>>
>>>>>>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>>>>>> 1) create an FSArray of FeatureRecord,
>>>>>>>>>>>> 2) for each element,
>>>>>>>>>>>>    map the FeatureAnnotation to a new instance of FeatureRecord
>>>>>>>>>>>>
>>>>>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>>>>>>> 1) change the type from A to B
>>>>>>>>>>>> 2) set equal-named features from A to B, drop other features
>>>>>>>>>>>>
>>>>>>>>>>>> This mapping would need to apply to a subset of the A's and B's, namely, only
>>>>>>>>>>>> those referenced by the FSArray where the element type changed.  Seems complex
>>>>>>>>>>>> and specific to this use case though.
>>>>>>>>>>>>
>>>>>>>>>>>> -Marshall
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <msa@schor.com <ma...@schor.com>
>>>>>>>>>>>>> <mailto:msa@schor.com <ma...@schor.com>>> wrote:
>>>>>>>>>>>>>> I can reproduce the problem, and see what is happening.  The deserialization
>>>>>>>>>>>>>> code compares the two type systems, and allows for some mismatches (things
>>>>>>>>>>>>>> present in one and not in the other), but it doesn't allow for having a
>>>>>>>>>>>>>> feature
>>>>>>>>>>>>>> whose range (value) is type XXXX in one type system and type YYYY in the
>>>>>>>>>>>>>> other.
>>>>>>>>>>>>>> See CasTypeSystemMapper lines 299 - 315.
>>>>>>>>>>>>> Without reading the code in detail - could we not relax this check such
>>>>>>>>>>>>> that the element type of FSArrays is not checked and the code simply
>>>>>>>>>>>>> assumes that the source element type has the same features as the target
>>>>>>>>>>>>> element type (with the usual lenient handling of missing features in the
>>>>>>>>>>>>> target type)? - Kind of a "duck typing" approach?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- Richard
>

Re: Migrating type system of form 6 compressed CAS binaries

Posted by Mario Juric <mj...@unsilo.ai>.
Thanks Marshall,

If you prefer then I can also have a look at it, although I probably need to finish something first within the next 3-4 weeks. It would probably get me faster started if you could share some of your experimental sample code.

Cheers,
Mario













> On 24 Sep 2019, at 21:32 , Marshall Schor <ms...@schor.com> wrote:
> 
> yes, makes sense, thanks for posting the Jira.
> 
> If no one else steps up to work on this, I'll probably take a look in a few
> days. -Marshall
> 
> On 9/24/2019 6:47 AM, Mario Juric wrote:
>> Hi Marshall,
>> 
>> I added the following feature request to Apache Jira:
>> 
>> https://issues.apache.org/jira/browse/UIMA-6128
>> 
>> Hope it makes sense :)
>> 
>> Thanks a lot for the help, it’s appreciated.
>> 
>> Cheers,
>> Mario
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>> On 23 Sep 2019, at 16:33 , Marshall Schor <ms...@schor.com> wrote:
>>> 
>>> Re: serializing using XML 1.1
>>> 
>>> This was not thought of, when setting up the CasIOUtils.
>>> 
>>> The way it was done (above) was using some more "primitive/lower level" APIs,
>>> rather than the CasIOUtils.
>>> 
>>> Please open a Jira ticket for this, with perhaps some suggestions on how it
>>> might be specified in the CasIOUtils APIs.
>>> 
>>> Thanks! -Marshall
>>> 
>>> On 9/23/2019 3:45 AM, Mario Juric wrote:
>>>> Hi Marshall,
>>>> 
>>>> Thanks for the thorough and excellent investigation.
>>>> 
>>>> We are looking into possible normalisation/cleanup of whitespace/invisible characters, but I don’t think we can necessarily do the same for some of the other characters. It sounds to me though that serialising to XML 1.1 could also be a simple fix right now, but can this be configured? CasIOUtils doesn’t seem to have an option for this, so I assume it’s something you have working in your branch.
>>>> 
>>>> Regarding the other problem. It seems that the JDK bug is fixed from Java 9 and after. Do you think switching to a more recent Java version would make a difference? I think we can also try this out ourselves when we look into migrating to UIMA 3 once our current deliveries are complete. We also like to switch to Java 11, and like UIMA 3 migration it will require some thorough testing.
>>>> 
>>>> Cheers,
>>>> Mario
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 20 Sep 2019, at 20:52 , Marshall Schor <ms...@schor.com> wrote:
>>>>> 
>>>>> In the test "OddDocumentText", this produces a "throw" due to an invalid xml
>>>>> char, which is the \u0002.
>>>>> 
>>>>> This is in part because the xml version being used is xml 1.0.
>>>>> 
>>>>> XML 1.1 expanded the set of valid characters to include \u0002.
>>>>> 
>>>>> Here's a snip from the XmiCasSerializerTest class which serializes with xml 1.1:
>>>>> 
>>>>>       XmiCasSerializer xmiCasSerializer = new
>>>>> XmiCasSerializer(jCas.getTypeSystem());
>>>>>       OutputStream out = new FileOutputStream(new File ("odd-doc-txt-v11.xmi"));
>>>>>       try {
>>>>>         XMLSerializer xml11Serializer = new XMLSerializer(out);
>>>>>         xml11Serializer.setOutputProperty(OutputKeys.VERSION,"1.1");
>>>>>         xmiCasSerializer.serialize(jCas.getCas(),
>>>>> xml11Serializer.getContentHandler());
>>>>>       }
>>>>>       finally {
>>>>>         out.close();
>>>>>       }
>>>>> 
>>>>> This succeeds and serializes this using xml 1.1.
>>>>> 
>>>>> I also tried serializing some doc text which includes \u77987.  That did not
>>>>> serialize correctly.
>>>>> I could see it in the code while tracing up to some point down in the innards of
>>>>> some internal
>>>>> sax java code
>>>>> com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize  where it was
>>>>> "Correct" in the Java string.
>>>>> 
>>>>> When serialized (as UTF-8) it came out as a 4 byte string E79E 9837.
>>>>> 
>>>>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in utf8 is a 3 byte encoding:
>>>>>       1110 xxxx 10xx xxxx 10xx xxxx
>>>>> 
>>>>> of 0111 0111 1001 1000 which in hex is "7 7 9 8" so it looks fishy to me.
>>>>> 
>>>>> But I think it's out of our hands - it's somewhere deep in the sax transform
>>>>> java code.
>>>>> 
>>>>> I looked for a bug report and found some
>>>>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>>>> 
>>>>> Bottom line, is, I think to clean out these characters early :-) .
>>>>> 
>>>>> -Marshall
>>>>> 
>>>>> 
>>>>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>>>>> here's an idea.
>>>>>> 
>>>>>> If you have a string, with the surrogate pair &#77987 at position 10, and you
>>>>>> have some Java code, which is iterating through the string and getting the
>>>>>> code-point at each character offset, then that code will produce:
>>>>>> 
>>>>>> at position 10:  the code-point 77987
>>>>>> at position 11:  the code-point 56483
>>>>>> 
>>>>>> Of course, it's a "bug" to iterate through a string of characters, assuming you
>>>>>> have characters at each point, if you don't handle surrogate pairs.
>>>>>> 
>>>>>> The 56483 is just the lower bits of the surrogate pair, added to xDC00 (see
>>>>>> https://tools.ietf.org/html/rfc2781 )
>>>>>> 
>>>>>> I worry that even tools like the CVD or similar may not work properly, since
>>>>>> they're not designed to handle surrogate pairs, I think, so I have no idea if
>>>>>> they would work well enough for you.
>>>>>> 
>>>>>> I'll poke around some more to see if I can enable the conversion for document
>>>>>> strings.
>>>>>> 
>>>>>> -Marshall
>>>>>> 
>>>>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>>>>> Thanks Marshall,
>>>>>>> 
>>>>>>> Encoding the characters like you suggest should work just fine for us as long as we can serialize and deserialise the XMI, so that we can open the content in a tool like the CVD or similar. These characters are just noise from the original content that happen to remain in the CAS, but they are not visible in our final output because they are basically filtered out one way or the other by downstream components. They become a problem though when they make it more difficult for us to inspect the content.
>>>>>>> 
>>>>>>> Regarding the feature name issue: Might you have an idea why we are getting a different XMI output for the same character in our actual pipeline, where it results in "&#77987;&#56483;”? I investigated the value in the debugger again, and like you are illustrating it is also just a single codepoint with the value 77987. We are simply not able to load this XMI because of this, but unfortunately I couldn’t reproduce it in my small example.
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Mario
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 19 Sep 2019, at 22:41 , Marshall Schor <ms...@schor.com> wrote:
>>>>>>>> 
>>>>>>>> The odd-feature-text seems to work OK, but has some unusual properties, due to
>>>>>>>> that unicode character.
>>>>>>>> 
>>>>>>>> Here's what I see:  The FeatureRecord "name" field is set to a
>>>>>>>> 1-unicode-character, that must be encoded as 2 java characters.
>>>>>>>> 
>>>>>>>> When output, it shows up in the xmi as <noNamespace:FeatureRecord xmi:id="18"
>>>>>>>> name="&#77987;" value="1.0"/>
>>>>>>>> which seems correct.  The name field only has 1 (extended)unicode character
>>>>>>>> (taking 2 Java characters to represent),
>>>>>>>> due to setting it with this code:   String oddName = "\uD80C\uDCA3";
>>>>>>>> 
>>>>>>>> When read in, the name field is assigned to a String, that string says it has a
>>>>>>>> length of 2 (but that's because it takes 2 java chars to represent this char).
>>>>>>>> If you have the name string in a variable "n", and do
>>>>>>>> System.out.println(n.codePointAt(0)), it shows (correctly) 77987.
>>>>>>>> n.codePointCount(0, n.length()) is, as expected, 1.
>>>>>>>> 
>>>>>>>> So, the string value serialization and deserialization seems to be "working".
>>>>>>>> 
>>>>>>>> The other code - for the sofa (document) serialization, is throwing that error,
>>>>>>>> because as currently designed, the
>>>>>>>> serialization code checks for these kinds of characters, and if found throws
>>>>>>>> that exception.  The code checking is
>>>>>>>> in XMLUtils.checkForNonXmlCharacters
>>>>>>>> 
>>>>>>>> This is because it's highly likely that "fixing this" in the same way as the
>>>>>>>> other, would result in hard-to-diagnose
>>>>>>>> future errors, because the subject of analysis string is processed with begin /
>>>>>>>> end offset all over the place, and makes
>>>>>>>> the assumption that the characters are all not coded as surrogate pairs.
>>>>>>>> 
>>>>>>>> We could change the code to output these like the name, as, e.g.,  &#77987; 
>>>>>>>> 
>>>>>>>> Would that help in your case, or do you imagine other kinds of things might
>>>>>>>> break (due to begin/end offsets no longer
>>>>>>>> being on character boundaries, for example).
>>>>>>>> 
>>>>>>>> -Marshall
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> I investigated the XMI issue as promised and these are my findings.
>>>>>>>>> 
>>>>>>>>> It is related to special unicode characters that are not handled by XMI
>>>>>>>>> serialisation, and there seems to be two distinct categories of issues we have
>>>>>>>>> identified so far.
>>>>>>>>> 
>>>>>>>>> 1) The document text of the CAS contains special unicode characters
>>>>>>>>> 2) Annotations with String features have values containing special unicode
>>>>>>>>> characters
>>>>>>>>> 
>>>>>>>>> In both cases we could for sure solve the problem if we did a better clean up
>>>>>>>>> job upstream, but with the amount and variety of data we receive there is
>>>>>>>>> always a chance something passes through, and some of it may in the general
>>>>>>>>> case even be valid content.
>>>>>>>>> 
>>>>>>>>> The first case is easy to reproduce with the OddDocumentText example I
>>>>>>>>> attached. In this example the text is a snippet taken from the content of a
>>>>>>>>> parsed XML document.
>>>>>>>>> 
>>>>>>>>> The other case was not possible to reproduce with the OddFeatureText example,
>>>>>>>>> because I am getting slightly different output to what I have in our real
>>>>>>>>> setup. The OddFeatureText example is based on the simple type system I shared
>>>>>>>>> previously. The name value of a FeatureRecord contains special unicode
>>>>>>>>> characters that I found in a similar data structure in our actual CAS. The
>>>>>>>>> value comes from an external knowledge base holding some noisy strings, which
>>>>>>>>> in this case is a hieroglyph entity. However, when I write the CAS to XMI
>>>>>>>>> using the small example it only outputs the first of the two characters in
>>>>>>>>> "\uD80C\uDCA3”, which yields the value "&#77987;” in the XMI, but in our
>>>>>>>>> actual setup both character values are written as "&#77987;&#56483;”. This
>>>>>>>>> means that the attached example will for some reason parse the XMI again, but
>>>>>>>>> it will not work in the case where both characters are written the way we
>>>>>>>>> experience it. The XMI can be manually changed, so that both character values
>>>>>>>>> are included the way it happens in our output, and in this case a
>>>>>>>>> SAXParserException happens.
>>>>>>>>> 
>>>>>>>>> I don’t know whether it is outside the scope of the XMI serialiser to handle
>>>>>>>>> any of this, but it will be good to know in any case :)
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> Mario
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On 17 Sep 2019, at 09:36 , Mario Juric <mj@unsilo.ai <ma...@unsilo.ai> <mailto:mj@unsilo.ai <ma...@unsilo.ai>>>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Thank you very much for looking into this. It is really appreciated and I
>>>>>>>>>> think it touches upon something important, which is about data migration in
>>>>>>>>>> general.
>>>>>>>>>> 
>>>>>>>>>> I agree that some of these solutions can appear specific, awkward or complex
>>>>>>>>>> and the way forward is not to address our use case alone. I think there is a
>>>>>>>>>> need for a compact and efficient binary serialization format for the CAS when
>>>>>>>>>> dealing with large amounts of data because this is directly visible in costs
>>>>>>>>>> of processing and storing, and I found the compressed binary format to be
>>>>>>>>>> much better than XMI in this regard, although I have to admit it’s been a
>>>>>>>>>> while since I benchmarked this. Given that UIMA already has a well described
>>>>>>>>>> type system then maybe it just lacks a way to describe schema evolution
>>>>>>>>>> similar to Apache Avro or similar serialisation frameworks. I think a more
>>>>>>>>>> formal approach to data migration would be critical to any larger operational
>>>>>>>>>> setup.
>>>>>>>>>> 
>>>>>>>>>> Regarding XMI I like to provide some input to the problem we are observing,
>>>>>>>>>> so that it can be solved. We are primarily using XMI for inspection/debugging
>>>>>>>>>> purposes, and we are sometimes not able to do this because of this error. I
>>>>>>>>>> will try to extract a minimum example to avoid involving parts that has to do
>>>>>>>>>> with our pipeline and type system, and I think this would also be the best
>>>>>>>>>> way to illustrate that the problem exists outside of this context. However,
>>>>>>>>>> converting all our data to XMI first in order to do the conversion in our
>>>>>>>>>> example would not be very practical for us, because it involves a large
>>>>>>>>>> amount of data.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Mario
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 16 Sep 2019, at 23:02 , Marshall Schor <msa@schor.com <ma...@schor.com>
>>>>>>>>>>> <mailto:msa@schor.com <ma...@schor.com>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> In this case, the original looks kind-of like this:
>>>>>>>>>>> 
>>>>>>>>>>> Container
>>>>>>>>>>> features -> FSArray of FeatureAnnotation each of which
>>>>>>>>>>>                           has 5 slots: sofaRef, begin, end, name, value
>>>>>>>>>>> 
>>>>>>>>>>> the new TypeSystem has
>>>>>>>>>>> 
>>>>>>>>>>> Container
>>>>>>>>>>> features -> FSArray of FeatureRecord each of which
>>>>>>>>>>>                            has 2 slots: name, value
>>>>>>>>>>> 
>>>>>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>>>>> 1) create an FSArray of FeatureRecord,
>>>>>>>>>>> 2) for each element,
>>>>>>>>>>>    map the FeatureAnnotation to a new instance of FeatureRecord
>>>>>>>>>>> 
>>>>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>>>>>> 1) change the type from A to B
>>>>>>>>>>> 2) set equal-named features from A to B, drop other features
>>>>>>>>>>> 
>>>>>>>>>>> This mapping would need to apply to a subset of the A's and B's, namely, only
>>>>>>>>>>> those referenced by the FSArray where the element type changed.  Seems complex
>>>>>>>>>>> and specific to this use case though.
>>>>>>>>>>> 
>>>>>>>>>>> -Marshall
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <msa@schor.com <ma...@schor.com>
>>>>>>>>>>>> <mailto:msa@schor.com <ma...@schor.com>>> wrote:
>>>>>>>>>>>>> I can reproduce the problem, and see what is happening.  The deserialization
>>>>>>>>>>>>> code compares the two type systems, and allows for some mismatches (things
>>>>>>>>>>>>> present in one and not in the other), but it doesn't allow for having a
>>>>>>>>>>>>> feature
>>>>>>>>>>>>> whose range (value) is type XXXX in one type system and type YYYY in the
>>>>>>>>>>>>> other.
>>>>>>>>>>>>> See CasTypeSystemMapper lines 299 - 315.
>>>>>>>>>>>> Without reading the code in detail - could we not relax this check such
>>>>>>>>>>>> that the element type of FSArrays is not checked and the code simply
>>>>>>>>>>>> assumes that the source element type has the same features as the target
>>>>>>>>>>>> element type (with the usual lenient handling of missing features in the
>>>>>>>>>>>> target type)? - Kind of a "duck typing" approach?
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> 
>>>>>>>>>>>> -- Richard
>> 


Re: Migrating type system of form 6 compressed CAS binaries

Posted by Marshall Schor <ms...@schor.com>.
yes, makes sense, thanks for posting the Jira.

If no one else steps up to work on this, I'll probably take a look in a few
days. -Marshall

On 9/24/2019 6:47 AM, Mario Juric wrote:
> Hi Marshall,
>
> I added the following feature request to Apache Jira:
>
> https://issues.apache.org/jira/browse/UIMA-6128
>
> Hope it makes sense :)
>
> Thanks a lot for the help, it’s appreciated.
>
> Cheers,
> Mario
>
>
>
>
>
>
>
>
>
>
>
>
>
>> On 23 Sep 2019, at 16:33 , Marshall Schor <ms...@schor.com> wrote:
>>
>> Re: serializing using XML 1.1
>>
>> This was not thought of, when setting up the CasIOUtils.
>>
>> The way it was done (above) was using some more "primitive/lower level" APIs,
>> rather than the CasIOUtils.
>>
>> Please open a Jira ticket for this, with perhaps some suggestions on how it
>> might be specified in the CasIOUtils APIs.
>>
>> Thanks! -Marshall
>>
>> On 9/23/2019 3:45 AM, Mario Juric wrote:
>>> Hi Marshall,
>>>
>>> Thanks for the thorough and excellent investigation.
>>>
>>> We are looking into possible normalisation/cleanup of whitespace/invisible characters, but I don’t think we can necessarily do the same for some of the other characters. It sounds to me though that serialising to XML 1.1 could also be a simple fix right now, but can this be configured? CasIOUtils doesn’t seem to have an option for this, so I assume it’s something you have working in your branch.
>>>
>>> Regarding the other problem. It seems that the JDK bug is fixed from Java 9 and after. Do you think switching to a more recent Java version would make a difference? I think we can also try this out ourselves when we look into migrating to UIMA 3 once our current deliveries are complete. We also like to switch to Java 11, and like UIMA 3 migration it will require some thorough testing.
>>>
>>> Cheers,
>>> Mario
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>> On 20 Sep 2019, at 20:52 , Marshall Schor <ms...@schor.com> wrote:
>>>>
>>>> In the test "OddDocumentText", this produces a "throw" due to an invalid xml
>>>> char, which is the \u0002.
>>>>
>>>> This is in part because the xml version being used is xml 1.0.
>>>>
>>>> XML 1.1 expanded the set of valid characters to include \u0002.
>>>>
>>>> Here's a snip from the XmiCasSerializerTest class which serializes with xml 1.1:
>>>>
>>>>        XmiCasSerializer xmiCasSerializer = new
>>>> XmiCasSerializer(jCas.getTypeSystem());
>>>>        OutputStream out = new FileOutputStream(new File ("odd-doc-txt-v11.xmi"));
>>>>        try {
>>>>          XMLSerializer xml11Serializer = new XMLSerializer(out);
>>>>          xml11Serializer.setOutputProperty(OutputKeys.VERSION,"1.1");
>>>>          xmiCasSerializer.serialize(jCas.getCas(),
>>>> xml11Serializer.getContentHandler());
>>>>        }
>>>>        finally {
>>>>          out.close();
>>>>        }
>>>>
>>>> This succeeds and serializes this using xml 1.1.
>>>>
>>>> I also tried serializing some doc text which includes \u77987.  That did not
>>>> serialize correctly.
>>>> I could see it in the code while tracing up to some point down in the innards of
>>>> some internal
>>>> sax java code
>>>> com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize  where it was
>>>> "Correct" in the Java string.
>>>>
>>>> When serialized (as UTF-8) it came out as a 4 byte string E79E 9837.
>>>>
>>>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in utf8 is a 3 byte encoding:
>>>>        1110 xxxx 10xx xxxx 10xx xxxx
>>>>
>>>> of 0111 0111 1001 1000 which in hex is "7 7 9 8" so it looks fishy to me.
>>>>
>>>> But I think it's out of our hands - it's somewhere deep in the sax transform
>>>> java code.
>>>>
>>>> I looked for a bug report and found this one:
>>>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>>>
>>>> Bottom line is, I think, to clean out these characters early :-) .
>>>>
>>>> -Marshall
>>>>
>>>>
>>>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>>>> here's an idea.
>>>>>
>>>>> If you have a string with the surrogate pair for code point 77987 at position 10,
>>>>> and you have some Java code that iterates through the string and gets the
>>>>> code-point at each character offset, then that code will produce:
>>>>>
>>>>> at position 10:  the code-point 77987
>>>>> at position 11:  the code-point 56483
>>>>>
>>>>> Of course, it's a "bug" to iterate through a string assuming there is a complete
>>>>> character at each offset, if you don't handle surrogate pairs.
>>>>>
>>>>> The 56483 is just the lower 10 bits of the surrogate pair, added to 0xDC00 (see
>>>>> https://tools.ietf.org/html/rfc2781 )
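>>>>>
>>>>> The same arithmetic, as a small code sketch:
>>>>>
>>>>>        char[] pair = Character.toChars(77987);  // {'\uD80C', '\uDCA3'}
>>>>>        int low = pair[1];  // 0xDCA3 = 56483 = 0xDC00 + the low 10 bits
>>>>>
>>>>>        // iterating by code point instead of by char avoids the "bug" above:
>>>>>        new String(pair).codePoints()
>>>>>                        .forEach(System.out::println);  // prints 77987 once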
>>>>>
>>>>> I worry that even tools like the CVD or similar may not work properly, since
>>>>> they're not designed to handle surrogate pairs, I think, so I have no idea if
>>>>> they would work well enough for you.
>>>>>
>>>>> I'll poke around some more to see if I can enable the conversion for document
>>>>> strings.
>>>>>
>>>>> -Marshall
>>>>>
>>>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>>>> Thanks Marshall,
>>>>>>
>>>>>> Encoding the characters like you suggest should work just fine for us as long as we can serialize and deserialise the XMI, so that we can open the content in a tool like the CVD or similar. These characters are just noise from the original content that happen to remain in the CAS, but they are not visible in our final output because they are basically filtered out one way or the other by downstream components. They become a problem though when they make it more difficult for us to inspect the content.
>>>>>>
>>>>>> Regarding the feature name issue: might you have an idea why we are getting a different XMI output for the same character in our actual pipeline, where it results in "&#77987;&#56483;”? I investigated the value in the debugger again, and as you are illustrating it is also just a single codepoint with the value 77987. We are simply not able to load this XMI because of it, but unfortunately I couldn’t reproduce it in my small example.
>>>>>>
>>>>>> Cheers,
>>>>>> Mario
>>>>>>
>>>>>>> On 19 Sep 2019, at 22:41 , Marshall Schor <ms...@schor.com> wrote:
>>>>>>>
>>>>>>> The odd-feature-text seems to work OK, but has some unusual properties, due to
>>>>>>> that unicode character.
>>>>>>>
>>>>>>> Here's what I see:  The FeatureRecord "name" field is set to a single
>>>>>>> unicode character, which must be encoded as 2 Java characters.
>>>>>>>
>>>>>>> When output, it shows up in the xmi as <noNamespace:FeatureRecord xmi:id="18"
>>>>>>> name="&#77987;" value="1.0"/>
>>>>>>> which seems correct.  The name field only has 1 (extended) unicode character
>>>>>>> (taking 2 Java characters to represent),
>>>>>>> due to setting it with this code:   String oddName = "\uD80C\uDCA3";
>>>>>>>
>>>>>>> When read in, the name field is assigned to a String; that string says it has a
>>>>>>> length of 2 (but that's because it takes 2 Java chars to represent this char).
>>>>>>> If you have the name string in a variable "n", and do
>>>>>>> System.out.println(n.codePointAt(0)), it shows (correctly) 77987.
>>>>>>> n.codePointCount(0, n.length()) is, as expected, 1.
>>>>>>>
>>>>>>> So, the string value serialization and deserialization seems to be "working".
>>>>>>>
>>>>>>> The other code - for the sofa (document) serialization - is throwing that error
>>>>>>> because, as currently designed, the serialization code checks for these kinds
>>>>>>> of characters and, if found, throws that exception.  The checking code is in
>>>>>>> XMLUtils.checkForNonXmlCharacters.
>>>>>>>
>>>>>>> This is because it's highly likely that "fixing this" in the same way as the
>>>>>>> other would result in hard-to-diagnose future errors: the subject-of-analysis
>>>>>>> string is processed with begin / end offsets all over the place, under the
>>>>>>> assumption that none of the characters are coded as surrogate pairs.
>>>>>>>
>>>>>>> We could change the code to output these like the name, as, e.g.,  &#77987; 
>>>>>>>
>>>>>>> Would that help in your case, or do you imagine other kinds of things might
>>>>>>> break (due to begin/end offsets no longer being on character boundaries, for
>>>>>>> example)?
>>>>>>>
>>>>>>> -Marshall
>>>>>>>
>>>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I investigated the XMI issue as promised and these are my findings.
>>>>>>>>
>>>>>>>> It is related to special unicode characters that are not handled by XMI
>>>>>>>> serialisation, and there seems to be two distinct categories of issues we have
>>>>>>>> identified so far.
>>>>>>>>
>>>>>>>> 1) The document text of the CAS contains special unicode characters
>>>>>>>> 2) Annotations with String features have values containing special unicode
>>>>>>>> characters
>>>>>>>>
>>>>>>>> In both cases we could for sure solve the problem if we did a better clean up
>>>>>>>> job upstream, but with the amount and variety of data we receive there is
>>>>>>>> always a chance something passes through, and some of it may in the general
>>>>>>>> case even be valid content.
>>>>>>>>
>>>>>>>> The first case is easy to reproduce with the OddDocumentText example I
>>>>>>>> attached. In this example the text is a snippet taken from the content of a
>>>>>>>> parsed XML document.
>>>>>>>>
>>>>>>>> The other case was not possible to reproduce with the OddFeatureText example,
>>>>>>>> because I am getting slightly different output to what I have in our real
>>>>>>>> setup. The OddFeatureText example is based on the simple type system I shared
>>>>>>>> previously. The name value of a FeatureRecord contains special unicode
>>>>>>>> characters that I found in a similar data structure in our actual CAS. The
>>>>>>>> value comes from an external knowledge base holding some noisy strings, which
>>>>>>>> in this case is a hieroglyph entity. However, when I write the CAS to XMI
>>>>>>>> using the small example it only outputs the first of the two characters in
>>>>>>>> "\uD80C\uDCA3”, which yields the value "&#77987;” in the XMI, but in our
>>>>>>>> actual setup both character values are written as "&#77987;&#56483;”. This
>>>>>>>> means that the attached example will for some reason parse the XMI again, but
>>>>>>>> it will not work in the case where both characters are written the way we
>>>>>>>> experience it. The XMI can be manually changed so that both character values
>>>>>>>> are included the way it happens in our output, and in that case a
>>>>>>>> SAXParseException is thrown.
>>>>>>>>
>>>>>>>> I don’t know whether it is outside the scope of the XMI serialiser to handle
>>>>>>>> any of this, but it will be good to know in any case :)
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Mario
>>>>>>>>
>>>>>>>>> On 17 Sep 2019, at 09:36 , Mario Juric <mj@unsilo.ai> wrote:
>>>>>>>>>
>>>>>>>>> Thank you very much for looking into this. It is really appreciated and I
>>>>>>>>> think it touches upon something important, which is about data migration in
>>>>>>>>> general.
>>>>>>>>>
>>>>>>>>> I agree that some of these solutions can appear specific, awkward or complex
>>>>>>>>> and the way forward is not to address our use case alone. I think there is a
>>>>>>>>> need for a compact and efficient binary serialization format for the CAS when
>>>>>>>>> dealing with large amounts of data because this is directly visible in costs
>>>>>>>>> of processing and storing, and I found the compressed binary format to be
>>>>>>>>> much better than XMI in this regard, although I have to admit it’s been a
>>>>>>>>> while since I benchmarked this. Given that UIMA already has a well-described
>>>>>>>>> type system, maybe it just lacks a way to describe schema evolution
>>>>>>>>> similar to Apache Avro or other serialisation frameworks. I think a more
>>>>>>>>> formal approach to data migration would be critical to any larger operational
>>>>>>>>> setup.
>>>>>>>>>
>>>>>>>>> Regarding XMI, I would like to provide some input on the problem we are
>>>>>>>>> observing, so that it can be solved. We primarily use XMI for
>>>>>>>>> inspection/debugging purposes, and we are sometimes not able to do this
>>>>>>>>> because of this error. I will try to extract a minimal example that avoids
>>>>>>>>> involving parts that have to do with our pipeline and type system, and I
>>>>>>>>> think this would also be the best way to illustrate that the problem exists
>>>>>>>>> outside of this context. However, converting all our data to XMI first in
>>>>>>>>> order to do the conversion in our example would not be very practical for us,
>>>>>>>>> because it involves a large amount of data.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Mario
>>>>>>>>>
>>>>>>>>>> On 16 Sep 2019, at 23:02 , Marshall Schor <msa@schor.com> wrote:
>>>>>>>>>>
>>>>>>>>>> In this case, the original looks kind-of like this:
>>>>>>>>>>
>>>>>>>>>> Container
>>>>>>>>>>  features -> FSArray of FeatureAnnotation each of which
>>>>>>>>>>                            has 5 slots: sofaRef, begin, end, name, value
>>>>>>>>>>
>>>>>>>>>> the new TypeSystem has
>>>>>>>>>>
>>>>>>>>>> Container
>>>>>>>>>>  features -> FSArray of FeatureRecord each of which
>>>>>>>>>>                             has 2 slots: name, value
>>>>>>>>>>
>>>>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>>>>  1) create an FSArray of FeatureRecord,
>>>>>>>>>>  2) for each element,
>>>>>>>>>>     map the FeatureAnnotation to a new instance of FeatureRecord
>>>>>>>>>>
>>>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>>>>> 1) change the type from A to B
>>>>>>>>>> 2) set equal-named features from A to B, drop other features
>>>>>>>>>>
>>>>>>>>>> This mapping would need to apply to a subset of the A's and B's, namely, only
>>>>>>>>>> those referenced by the FSArray where the element type changed.  Seems complex
>>>>>>>>>> and specific to this use case though.
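>>>>>>>>>>
>>>>>>>>>> As a rough sketch of that default mapping for one element (a hypothetical
>>>>>>>>>> helper, not an existing API; it assumes both types are resolvable in one
>>>>>>>>>> type system and the shared features are primitive-valued, as name and
>>>>>>>>>> value are here; imports from org.apache.uima.cas elided):
>>>>>>>>>>
>>>>>>>>>>   FeatureStructure mapFs(CAS cas, FeatureStructure src, Type tgtType) {
>>>>>>>>>>     FeatureStructure tgt = cas.createFS(tgtType);
>>>>>>>>>>     for (Feature tf : tgtType.getFeatures()) {
>>>>>>>>>>       // copy equal-named features whose range types match; drop the rest
>>>>>>>>>>       Feature sf = src.getType().getFeatureByBaseName(tf.getShortName());
>>>>>>>>>>       if (sf != null && sf.getRange().equals(tf.getRange())
>>>>>>>>>>           && sf.getRange().isPrimitive()) {
>>>>>>>>>>         tgt.setFeatureValueFromString(tf, src.getFeatureValueAsString(sf));
>>>>>>>>>>       }
>>>>>>>>>>     }
>>>>>>>>>>     return tgt;
>>>>>>>>>>   }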
>>>>>>>>>>
>>>>>>>>>> -Marshall
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <msa@schor.com> wrote:
>>>>>>>>>>>> I can reproduce the problem, and see what is happening.  The deserialization
>>>>>>>>>>>> code compares the two type systems, and allows for some mismatches (things
>>>>>>>>>>>> present in one and not in the other), but it doesn't allow for having a
>>>>>>>>>>>> feature
>>>>>>>>>>>> whose range (value) is type XXXX in one type system and type YYYY in the
>>>>>>>>>>>> other.
>>>>>>>>>>>> See CasTypeSystemMapper lines 299 - 315.
>>>>>>>>>>> Without reading the code in detail - could we not relax this check such
>>>>>>>>>>> that the element type of FSArrays is not checked and the code simply
>>>>>>>>>>> assumes that the source element type has the same features as the target
>>>>>>>>>>> element type (with the usual lenient handling of missing features in the
>>>>>>>>>>> target type)? - Kind of a "duck typing" approach?
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> -- Richard
>

Re: Migrating type system of form 6 compressed CAS binaries

Posted by Mario Juric <mj...@unsilo.ai>.
Hi Marshall,

I added the following feature request to Apache Jira:

https://issues.apache.org/jira/browse/UIMA-6128

Hope it makes sense :)

Thanks a lot for the help, it’s appreciated.

Cheers,
Mario


Re: Migrating type system of form 6 compressed CAS binaries

Posted by Marshall Schor <ms...@schor.com>.
Re: serializing using XML 1.1

This was not thought of, when setting up the CasIOUtils.

The way it was done (above) was using some more "primitive/lower level" APIs,
rather than the CasIOUtils.

Please open a Jira ticket for this, with perhaps some suggestions on how it
might be specified in the CasIOUtils APIs.
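
One possible shape for such a helper, just wrapping the lower-level calls from
the snippet earlier in the thread (a sketch only - the class and method names
are hypothetical, not an existing CasIOUtils API):

    import java.io.OutputStream;
    import javax.xml.transform.OutputKeys;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.impl.XmiCasSerializer;
    import org.apache.uima.util.XMLSerializer;
    import org.xml.sax.SAXException;

    public final class Xmi11Utils {
      // Hypothetical helper: serialize a CAS as XMI with an XML 1.1 declaration,
      // so that control characters such as \u0002 survive serialization.
      public static void saveXmi11(CAS cas, OutputStream out) throws SAXException {
        XmiCasSerializer xmiSer = new XmiCasSerializer(cas.getTypeSystem());
        XMLSerializer xmlSer = new XMLSerializer(out);
        xmlSer.setOutputProperty(OutputKeys.VERSION, "1.1");
        xmiSer.serialize(cas, xmlSer.getContentHandler());
      }
    }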

Thanks! -Marshall


Re: Migrating type system of form 6 compressed CAS binaries

Posted by Marshall Schor <ms...@schor.com>.
re: using a later Java - that might make a difference, since fixes keep getting
added.

For some fixes, however, as you've noted, the fixes are backported to previous
versions.
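
For what it's worth, a quick way to confirm which JVM (and hence which set of
backported fixes) a given pipeline run actually gets:

    System.out.println(System.getProperty("java.version"));  // e.g. "1.8.0_212" or "11.0.4"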

-Marshall


Re: Migrating type system of form 6 compressed CAS binaries

Posted by Mario Juric <mj...@unsilo.ai>.
Hi Marshall,

Seems the bug was already resolved for 8u92 in one of the backports:

https://bugs.openjdk.java.net/browse/JDK-8141098

Cheers,
Mario













> On 23 Sep 2019, at 09:45 , Mario Juric <mj...@unsilo.ai> wrote:
> 
> Hi Marshall,
> 
> Thanks for the thorough and excellent investigation.
> 
> We are looking into possible normalisation/cleanup of whitespace/invisible characters, but I don’t think we can necessarily do the same for some of the other characters. It sounds to me though that serialising to XML 1.1 could also be a simple fix right now, but can this be configured? CasIOUtils doesn’t seem to have an option for this, so I assume it’s something you have working in your branch.
> 
> Regarding the other problem. It seems that the JDK bug is fixed from Java 9 and after. Do you think switching to a more recent Java version would make a difference? I think we can also try this out ourselves when we look into migrating to UIMA 3 once our current deliveries are complete. We also like to switch to Java 11, and like UIMA 3 migration it will require some thorough testing.
> 
> Cheers,
> Mario
> 
>> On 20 Sep 2019, at 20:52 , Marshall Schor <ms...@schor.com> wrote:
>> 
>> In the test "OddDocumentText", this produces a "throw" due to an invalid xml
>> char, which is the \u0002.
>> 
>> This is in part because the xml version being used is xml 1.0.
>> 
>> XML 1.1 expanded the set of valid characters to include \u0002.
>> 
>> Here's a snip from the XmiCasSerializerTest class which serializes with xml 1.1:
>> 
>>         XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
>>         OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>>         try {
>>           XMLSerializer xml11Serializer = new XMLSerializer(out);
>>           xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>>           xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
>>         } finally {
>>           out.close();
>>         }
>> 
>> This succeeds and serializes this using xml 1.1.
>> 
>> I also tried serializing some doc text which includes \u77987.  That did not
>> serialize correctly.
>> I could see it in the code while tracing up to some point down in the innards of
>> some internal
>> sax java code
>> com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize  where it was
>> "Correct" in the Java string.
>> 
>> When serialized (as UTF-8) it came out as a 4 byte string E79E 9837.
>> 
>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in utf8 is a 3 byte encoding:
>>         1110 xxxx 10xx xxxx 10xx xxxx
>> 
>> of 0111 0111 1001 1000 which in hex is "7 7 9 8" so it looks fishy to me.
>> 
>> But I think it's out of our hands - it's somewhere deep in the sax transform
>> java code.
>> 
>> I looked for a bug report and found some
>> https://bugs.openjdk.java.net/browse/JDK-8058175
>> 
>> Bottom line, is, I think to clean out these characters early :-) .
>> 
>> -Marshall
>> 
>> 
>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>> here's an idea.
>>> 
>>> If you have a string, with the surrogate pair &#77987 at position 10, and you
>>> have some Java code, which is iterating through the string and getting the
>>> code-point at each character offset, then that code will produce:
>>> 
>>> at position 10:  the code-point 77987
>>> at position 11:  the code-point 56483
>>> 
>>> Of course, it's a "bug" to iterate through a string of characters, assuming you
>>> have characters at each point, if you don't handle surrogate pairs.
>>> 
>>> The 56483 is just the lower bits of the surrogate pair, added to xDC00 (see
>>> https://tools.ietf.org/html/rfc2781 )
>>> 
>>> I worry that even tools like the CVD or similar may not work properly, since
>>> they're not designed to handle surrogate pairs, I think, so I have no idea if
>>> they would work well enough for you.
>>> 
>>> I'll poke around some more to see if I can enable the conversion for document
>>> strings.
>>> 
>>> -Marshall
>>> 
>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>> Thanks Marshall,
>>>> 
>>>> Encoding the characters like you suggest should work just fine for us as long as we can serialize and deserialise the XMI, so that we can open the content in a tool like the CVD or similar. These characters are just noise from the original content that happen to remain in the CAS, but they are not visible in our final output because they are basically filtered out one way or the other by downstream components. They become a problem though when they make it more difficult for us to inspect the content.
>>>> 
>>>> Regarding the feature name issue: Might you have an idea why we are getting a different XMI output for the same character in our actual pipeline, where it results in "&#77987;&#56483;”? I investigated the value in the debugger again, and like you are illustrating it is also just a single codepoint with the value 77987. We are simply not able to load this XMI because of this, but unfortunately I couldn’t reproduce it in my small example.
>>>> 
>>>> Cheers,
>>>> Mario
>>>> 
>>>>> On 19 Sep 2019, at 22:41 , Marshall Schor <ms...@schor.com> wrote:
>>>>> 
>>>>> The odd-feature-text seems to work OK, but has some unusual properties, due to
>>>>> that unicode character.
>>>>> 
>>>>> Here's what I see:  The FeatureRecord "name" field is set to a
>>>>> 1-unicode-character, that must be encoded as 2 java characters.
>>>>> 
>>>>> When output, it shows up in the xmi as <noNamespace:FeatureRecord xmi:id="18"
>>>>> name="&#77987;" value="1.0"/>
>>>>> which seems correct.  The name field only has 1 (extended)unicode character
>>>>> (taking 2 Java characters to represent),
>>>>> due to setting it with this code:   String oddName = "\uD80C\uDCA3";
>>>>> 
>>>>> When read in, the name field is assigned to a String, that string says it has a
>>>>> length of 2 (but that's because it takes 2 java chars to represent this char).
>>>>> If you have the name string in a variable "n", and do
>>>>> System.out.println(n.codePointAt(0)), it shows (correctly) 77987.
>>>>> n.codePointCount(0, n.length()) is, as expected, 1.
>>>>> 
>>>>> So, the string value serialization and deserialization seems to be "working".
>>>>> 
>>>>> The other code - for the sofa (document) serialization, is throwing that error,
>>>>> because as currently designed, the
>>>>> serialization code checks for these kinds of characters, and if found throws
>>>>> that exception.  The code checking is
>>>>> in XMLUtils.checkForNonXmlCharacters
>>>>> 
>>>>> This is because it's highly likely that "fixing this" in the same way as the
>>>>> other, would result in hard-to-diagnose
>>>>> future errors, because the subject of analysis string is processed with begin /
>>>>> end offset all over the place, and makes
>>>>> the assumption that the characters are all not coded as surrogate pairs.
>>>>> 
>>>>> We could change the code to output these like the name, as, e.g.,  &#77987; 
>>>>> 
>>>>> Would that help in your case, or do you imagine other kinds of things might
>>>>> break (due to begin/end offsets no longer
>>>>> being on character boundaries, for example).
>>>>> 
>>>>> -Marshall
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I investigated the XMI issue as promised and these are my findings.
>>>>>> 
>>>>>> It is related to special unicode characters that are not handled by XMI
>>>>>> serialisation, and there seem to be two distinct categories of issues we have
>>>>>> identified so far.
>>>>>> 
>>>>>> 1) The document text of the CAS contains special unicode characters
>>>>>> 2) Annotations with String features have values containing special unicode
>>>>>> characters
>>>>>> 
>>>>>> In both cases we could certainly solve the problem if we did a better clean-up
>>>>>> job upstream, but with the amount and variety of data we receive there is
>>>>>> always a chance that something passes through, and some of it may in the general
>>>>>> case even be valid content.
>>>>>> 
>>>>>> The first case is easy to reproduce with the OddDocumentText example I
>>>>>> attached. In this example the text is a snippet taken from the content of a
>>>>>> parsed XML document.
>>>>>> 
>>>>>> The other case was not possible to reproduce with the OddFeatureText example,
>>>>>> because I am getting slightly different output from what I have in our real
>>>>>> setup. The OddFeatureText example is based on the simple type system I shared
>>>>>> previously. The name value of a FeatureRecord contains special unicode
>>>>>> characters that I found in a similar data structure in our actual CAS. The
>>>>>> value comes from an external knowledge base holding some noisy strings, which
>>>>>> in this case is a hieroglyph entity. However, when I write the CAS to XMI
>>>>>> using the small example it only outputs the first of the two characters in
>>>>>> "\uD80C\uDCA3”, which yields the value "&#77987;” in the XMI, but in our
>>>>>> actual setup both character values are written as "&#77987;&#56483;”. This
>>>>>> means that the attached example can for some reason parse the XMI again, but
>>>>>> it will not work in the case where both characters are written the way we
>>>>>> experience it. The XMI can be manually changed so that both character values
>>>>>> are included the way it happens in our output, and in that case a
>>>>>> SAXParseException is thrown.
>>>>>> 
>>>>>> I don’t know whether it is outside the scope of the XMI serialiser to handle
>>>>>> any of this, but it will be good to know in any case :)
>>>>>> 
>>>>>> Cheers,
>>>>>> Mario
>>>>>> 
>>>>>>> On 17 Sep 2019, at 09:36 , Mario Juric <mj...@unsilo.ai> wrote:
>>>>>>> 
>>>>>>> Thank you very much for looking into this. It is really appreciated and I
>>>>>>> think it touches upon something important, which is about data migration in
>>>>>>> general.
>>>>>>> 
>>>>>>> I agree that some of these solutions can appear specific, awkward or complex
>>>>>>> and the way forward is not to address our use case alone. I think there is a
>>>>>>> need for a compact and efficient binary serialization format for the CAS when
>>>>>>> dealing with large amounts of data because this is directly visible in costs
>>>>>>> of processing and storing, and I found the compressed binary format to be
>>>>>>> much better than XMI in this regard, although I have to admit it’s been a
>>>>>>> while since I benchmarked this. Given that UIMA already has a well described
>>>>>>> type system then maybe it just lacks a way to describe schema evolution
>>>>>>> similar to Apache Avro or similar serialisation frameworks. I think a more
>>>>>>> formal approach to data migration would be critical to any larger operational
>>>>>>> setup.
>>>>>>> 
>>>>>>> Regarding XMI, I'd like to provide some input on the problem we are observing,
>>>>>>> so that it can be solved. We are primarily using XMI for inspection/debugging
>>>>>>> purposes, and we are sometimes not able to do this because of this error. I
>>>>>>> will try to extract a minimal example to avoid involving parts that have to do
>>>>>>> with our pipeline and type system, and I think this would also be the best
>>>>>>> way to illustrate that the problem exists outside of this context. However,
>>>>>>> converting all our data to XMI first in order to do the conversion in our
>>>>>>> example would not be very practical for us, because it involves a large
>>>>>>> amount of data.
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Mario
>>>>>>> 
>>>>>>>> On 16 Sep 2019, at 23:02 , Marshall Schor <ms...@schor.com> wrote:
>>>>>>>> 
>>>>>>>> In this case, the original looks kind-of like this:
>>>>>>>> 
>>>>>>>> Container
>>>>>>>>   features -> FSArray of FeatureAnnotation each of which
>>>>>>>>                             has 5 slots: sofaRef, begin, end, name, value
>>>>>>>> 
>>>>>>>> the new TypeSystem has
>>>>>>>> 
>>>>>>>> Container
>>>>>>>>   features -> FSArray of FeatureRecord each of which
>>>>>>>>                              has 2 slots: name, value
>>>>>>>> 
>>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>>   1) create an FSArray of FeatureRecord,
>>>>>>>>   2) for each element,
>>>>>>>>      map the FeatureAnnotation to a new instance of FeatureRecord
>>>>>>>> 
>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>>>  1) change the type from A to B
>>>>>>>>  2) set equal-named features from A to B, drop other features
>>>>>>>> 
>>>>>>>> This mapping would need to apply to a subset of the A's and B's, namely, only
>>>>>>>> those referenced by the FSArray where the element type changed.  Seems complex
>>>>>>>> and specific to this use case though.
>>>>>>>> 
>>>>>>>> -Marshall
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <ms...@schor.com> wrote:
>>>>>>>>>> I can reproduce the problem, and see what is happening.  The deserialization
>>>>>>>>>> code compares the two type systems, and allows for some mismatches (things
>>>>>>>>>> present in one and not in the other), but it doesn't allow for having a
>>>>>>>>>> feature
>>>>>>>>>> whose range (value) is type XXXX in one type system and type YYYY in the
>>>>>>>>>> other.
>>>>>>>>>> See CasTypeSystemMapper lines 299 - 315.
>>>>>>>>> Without reading the code in detail - could we not relax this check such
>>>>>>>>> that the element type of FSArrays is not checked and the code simply
>>>>>>>>> assumes that the source element type has the same features as the target
>>>>>>>>> element type (with the usual lenient handling of missing features in the
>>>>>>>>> target type)? - Kind of a "duck typing" approach?
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> 
>>>>>>>>> -- Richard
> 
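To make the default mapping quoted above concrete: the same effect can be had today as an explicit one-off conversion pass, run after deserializing into a CAS whose type system contains both the old and the new types. The sketch below is hypothetical - it uses only the standard CAS interfaces, the type and feature names are the ones from this thread's example, and it assumes the "name" and "value" features are Strings and that the Container instances are indexed.

    import org.apache.uima.cas.ArrayFS;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.FSIterator;
    import org.apache.uima.cas.Feature;
    import org.apache.uima.cas.FeatureStructure;
    import org.apache.uima.cas.Type;
    import org.apache.uima.cas.TypeSystem;

    // One-off migration pass: for every Container, replace the FSArray of
    // FeatureAnnotation with an FSArray of FeatureRecord, copying only the
    // equal-named features (name, value) and dropping sofaRef/begin/end.
    void migrate(CAS cas) {
      TypeSystem ts = cas.getTypeSystem();
      Type containerType = ts.getType("Container");
      Type recordType = ts.getType("FeatureRecord");
      Feature featuresF = containerType.getFeatureByBaseName("features");
      Feature recNameF = recordType.getFeatureByBaseName("name");
      Feature recValueF = recordType.getFeatureByBaseName("value");

      FSIterator<FeatureStructure> it =
          cas.getIndexRepository().getAllIndexedFS(containerType);
      while (it.hasNext()) {
        FeatureStructure container = it.next();
        ArrayFS oldArray = (ArrayFS) container.getFeatureValue(featuresF);
        if (oldArray == null) {
          continue;
        }
        ArrayFS newArray = cas.createArrayFS(oldArray.size());
        for (int i = 0; i < oldArray.size(); i++) {
          FeatureStructure old = oldArray.get(i);
          FeatureStructure rec = cas.createFS(recordType);
          Type oldType = old.getType();
          rec.setStringValue(recNameF,
              old.getStringValue(oldType.getFeatureByBaseName("name")));
          rec.setStringValue(recValueF,
              old.getStringValue(oldType.getFeatureByBaseName("value")));
          newArray.set(i, rec);
        }
        container.setFeatureValue(featuresF, newArray);
      }
    }

This still requires the two type systems to be loadable together, which is the crux of the thread; it only illustrates what the "set equal-named features, drop the rest" step would do once the feature structures are in memory.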


Re: Migrating type system of form 6 compressed CAS binaries

Posted by Mario Juric <mj...@unsilo.ai>.
Hi Marshall,

Thanks for the thorough and excellent investigation.

We are looking into possible normalisation/cleanup of whitespace/invisible characters, but I don’t think we can necessarily do the same for some of the other characters. It sounds to me though that serialising to XML 1.1 could also be a simple fix right now, but can this be configured? CasIOUtils doesn’t seem to have an option for this, so I assume it’s something you have working in your branch.

Regarding the other problem: it seems the JDK bug is fixed from Java 9 onward. Do you think switching to a more recent Java version would make a difference? I think we can also try this out ourselves when we look into migrating to UIMA 3 once our current deliveries are complete. We would also like to switch to Java 11, and like the UIMA 3 migration it will require some thorough testing.

Cheers,
Mario


Re: Migrating type system of form 6 compressed CAS binaries

Posted by Marshall Schor <ms...@schor.com>.
In the test "OddDocumentText", this produces a "throw" due to an invalid xml
char, which is the \u0002.

This is in part because the xml version being used is xml 1.0.

XML 1.1 expanded the set of valid characters to include \u0002.

Here's a snip from the XmiCasSerializerTest class which serializes with xml 1.1:

        import java.io.File;
        import java.io.FileOutputStream;
        import java.io.OutputStream;
        import javax.xml.transform.OutputKeys;
        import org.apache.uima.cas.impl.XmiCasSerializer;
        import org.apache.uima.util.XMLSerializer;

        // serialize the CAS as XMI, with the XML version forced to 1.1
        XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
        OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
        try {
          XMLSerializer xml11Serializer = new XMLSerializer(out);
          xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
          xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
        } finally {
          out.close();
        }
This succeeds and serializes the document using xml 1.1.

I also tried serializing some doc text which includes \u77987.  That did not
serialize correctly. I could see it while tracing down into the innards of the
internal sax java code
(com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize), where it
was still "Correct" in the Java string.

When serialized (as UTF-8) it came out as the 4-byte sequence E7 9E 98 37.

The first three bytes are 1110 0111 1001 1110 1001 1000, a 3-byte utf8 encoding
        1110 xxxx 10xx xxxx 10xx xxxx

of 0111 0111 1001 1000, which in hex is "7 7 9 8" (with the 0011 0111 left
over), so it looks fishy to me.

But I think it's out of our hands - it's somewhere deep in the sax transform
java code.

I looked for a bug report and found some, e.g.
https://bugs.openjdk.java.net/browse/JDK-8058175

Bottom line is, I think, to clean out these characters early :-) .

-Marshall
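
A small standalone check (hypothetical, not from the test case) suggests one possible explanation for those bytes: a Java unicode escape consumes exactly four hex digits, so a literal written as "\u77987" is the character U+7798 followed by the ASCII digit '7', and that pair encodes in UTF-8 as exactly E7 9E 98 37. The code point 77987 (hex 130A3) has to be written as its surrogate pair to mean the supplementary character.

    import java.nio.charset.StandardCharsets;

    public class Utf8Check {
      public static void main(String[] args) {
        // "\u77987" = U+7798 followed by '7' (escapes take exactly 4 hex digits)
        printUtf8("\u77987");        // prints: E7 9E 98 37
        // code point 77987 (U+130A3) written as its surrogate pair:
        printUtf8("\uD80C\uDCA3");   // prints: F0 93 82 A3
      }

      static void printUtf8(String s) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
          System.out.printf("%02X ", b);
        }
        System.out.println();
      }
    }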



Re: Migrating type system of form 6 compressed CAS binaries

Posted by Marshall Schor <ms...@schor.com>.
Here's an idea.

If you have a string with the surrogate pair for &#77987; at position 10, and you
have some Java code which iterates through the string and gets the
code-point at each character offset, then that code will produce:

at position 10:  the code-point 77987
at position 11:  the code-point 56483

Of course, it's a "bug" to iterate through a string assuming there is a complete
character at each index, if you don't handle surrogate pairs.

The 56483 is just the low 10 bits of the surrogate pair, added to 0xDC00 (see
https://tools.ietf.org/html/rfc2781 )

I worry that even tools like the CVD or similar may not work properly, since
they're not designed to handle surrogate pairs, I think, so I have no idea if
they would work well enough for you.

I'll poke around some more to see if I can enable the conversion for document
strings.

-Marshall
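
The pitfall described above is easy to reproduce in a few lines; a small hypothetical snippet, using the same code point as in this thread:

    // Iterating by char index splits a surrogate pair into two bogus values:
    String s = "\uD80C\uDCA3";            // one code point, two Java chars
    for (int i = 0; i < s.length(); i++) {
      System.out.println(i + ": " + s.codePointAt(i));
    }
    // prints  0: 77987
    //         1: 56483   <- unpaired low surrogate: 0xDC00 + the low 10 bits

    // Correct: advance by Character.charCount(codePoint)
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      System.out.println(i + ": " + cp);  // prints only 0: 77987
      i += Character.charCount(cp);
    }

    // Or, on Java 8 and later:
    s.codePoints().forEach(System.out::println);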


Re: Migrating type system of form 6 compressed CAS binaries

Posted by Mario Juric <mj...@unsilo.ai>.
Thanks Marshall,

Encoding the characters like you suggest should work just fine for us as long as we can serialise and deserialise the XMI, so that we can open the content in a tool like the CVD or similar. These characters are just noise from the original content that happen to remain in the CAS, but they are not visible in our final output because they are basically filtered out one way or the other by downstream components. They become a problem though when they make it more difficult for us to inspect the content.

Regarding the feature name issue: Might you have an idea why we are getting a different XMI output for the same character in our actual pipeline, where it results in "&#77987;&#56483;”? I investigated the value in the debugger again, and like you are illustrating it is also just a single codepoint with the value 77987. We are simply not able to load this XMI because of this, but unfortunately I couldn’t reproduce it in my small example.

Cheers,
Mario












> On 19 Sep 2019, at 22:41 , Marshall Schor <ms...@schor.com> wrote:
> 
> The odd-feature-text seems to work OK, but has some unusual properties, due to
> that unicode character.
> 
> Here's what I see:  The FeatureRecord "name" field is set to a
> 1-unicode-character, that must be encoded as 2 java characters.
> 
> When output, it shows up in the xmi as <noNamespace:FeatureRecord xmi:id="18"
> name="&#77987;" value="1.0"/>
> which seems correct.  The name field only has 1 (extended)unicode character
> (taking 2 Java characters to represent),
> due to setting it with this code:   String oddName = "\uD80C\uDCA3";
> 
> When read in, the name field is assigned to a String, that string says it has a
> length of 2 (but that's because it takes 2 java chars to represent this char).
> If you have the name string in a variable "n", and do
> System.out.println(n.codePointAt(0)), it shows (correctly) 77987.
> n.codePointCount(0, n.length()) is, as expected, 1.
> 
> So, the string value serialization and deserialization seems to be "working".
> 
> The other code - for the sofa (document) serialization, is throwing that error,
> because as currently designed, the
> serialization code checks for these kinds of characters, and if found throws
> that exception.  The code checking is
> in XMLUtils.checkForNonXmlCharacters
> 
> This is because it's highly likely that "fixing this" in the same way as the
> other, would result in hard-to-diagnose
> future errors, because the subject of analysis string is processed with begin /
> end offset all over the place, and makes
> the assumption that the characters are all not coded as surrogate pairs.
> 
> We could change the code to output these like the name, as, e.g.,  &#77987; 
> 
> Would that help in your case, or do you imagine other kinds of things might
> break (due to begin/end offsets no longer
> being on character boundaries, for example).
> 
> -Marshall
> 
> 
> 
> 
> 
> On 9/18/2019 11:41 AM, Mario Juric wrote:
>> Hi,
>> 
>> I investigated the XMI issue as promised and these are my findings.
>> 
>> It is related to special unicode characters that are not handled by XMI
>> serialisation, and there seems to be two distinct categories of issues we have
>> identified so far.
>> 
>> 1) The document text of the CAS contains special unicode characters
>> 2) Annotations with String features have values containing special unicode
>> characters
>> 
>> In both cases we could for sure solve the problem if we did a better clean up
>> job upstream, but with the amount and variety of data we receive there is
>> always a chance something passes through, and some of it may in the general
>> case even be valid content.
>> 
>> The first case is easy to reproduce with the OddDocumentText example I
>> attached. In this example the text is a snippet taken from the content of a
>> parsed XML document.
>> 
>> The other case was not possible to reproduce with the OddFeatureText example,
>> because I am getting slightly different output to what I have in our real
>> setup. The OddFeatureText example is based on the simple type system I shared
>> previously. The name value of a FeatureRecord contains special unicode
>> characters that I found in a similar data structure in our actual CAS. The
>> value comes from an external knowledge base holding some noisy strings, which
>> in this case is a hieroglyph entity. However, when I write the CAS to XMI
>> using the small example it only outputs the first of the two characters in
>> "\uD80C\uDCA3”, which yields the value "&#77987;” in the XMI, but in our
>> actual setup both character values are written as "&#77987;&#56483;”. This
>> means that the attached example will for some reason parse the XMI again, but
>> it will not work in the case where both characters are written the way we
>> experience it. The XMI can be manually changed, so that both character values
>> are included the way it happens in our output, and in this case a
>> SAXParserException happens.
>> 
>> I don’t know whether it is outside the scope of the XMI serialiser to handle
>> any of this, but it will be good to know in any case :)
>> 
>> Cheers,
>> Mario
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>> On 17 Sep 2019, at 09:36 , Mario Juric <mj@unsilo.ai <ma...@unsilo.ai> <mailto:mj@unsilo.ai <ma...@unsilo.ai>>>
>>> wrote:
>>> 
>>> Thank you very much for looking into this. It is really appreciated and I
>>> think it touches upon something important, which is about data migration in
>>> general.
>>> 
>>> I agree that some of these solutions can appear specific, awkward or complex
>>> and the way forward is not to address our use case alone. I think there is a
>>> need for a compact and efficient binary serialization format for the CAS when
>>> dealing with large amounts of data because this is directly visible in costs
>>> of processing and storing, and I found the compressed binary format to be
>>> much better than XMI in this regard, although I have to admit it’s been a
>>> while since I benchmarked this. Given that UIMA already has a well described
>>> type system then maybe it just lacks a way to describe schema evolution
>>> similar to Apache Avro or similar serialisation frameworks. I think a more
>>> formal approach to data migration would be critical to any larger operational
>>> setup.
>>> 
>>> Regarding XMI I like to provide some input to the problem we are observing,
>>> so that it can be solved. We are primarily using XMI for inspection/debugging
>>> purposes, and we are sometimes not able to do this because of this error. I
>>> will try to extract a minimum example to avoid involving parts that has to do
>>> with our pipeline and type system, and I think this would also be the best
>>> way to illustrate that the problem exists outside of this context. However,
>>> converting all our data to XMI first in order to do the conversion in our
>>> example would not be very practical for us, because it involves a large
>>> amount of data.
>>> 
>>> Cheers,
>>> Mario
>>> 
>>>> On 16 Sep 2019, at 23:02 , Marshall Schor <msa@schor.com <ma...@schor.com>
>>>> <mailto:msa@schor.com <ma...@schor.com>>> wrote:
>>>> 
>>>> In this case, the original looks kind-of like this:
>>>> 
>>>> Container
>>>>    features -> FSArray of FeatureAnnotation each of which
>>>>                              has 5 slots: sofaRef, begin, end, name, value
>>>> 
>>>> the new TypeSystem has
>>>> 
>>>> Container
>>>>    features -> FSArray of FeatureRecord each of which
>>>>                               has 2 slots: name, value
>>>> 
>>>> The deserializer code would need some way to decide how to
>>>>    1) create an FSArray of FeatureRecord,
>>>>    2) for each element,
>>>>       map the FeatureAnnotation to a new instance of FeatureRecord
>>>> 
>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>   1) change the type from A to B
>>>>   2) set equal-named features from A to B, drop other features
>>>> 
>>>> This mapping would need to apply to a subset of the A's and B's, namely, only
>>>> those referenced by the FSArray where the element type changed.  Seems complex
>>>> and specific to this use case though.
>>>> 
>>>> -Marshall
>>>> 
>>>> 
>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <msa@schor.com <ma...@schor.com>
>>>>> <mailto:msa@schor.com <ma...@schor.com>>> wrote:
>>>>>> I can reproduce the problem, and see what is happening.  The deserialization
>>>>>> code compares the two type systems, and allows for some mismatches (things
>>>>>> present in one and not in the other), but it doesn't allow for having a
>>>>>> feature
>>>>>> whose range (value) is type XXXX in one type system and type YYYY in the
>>>>>> other.
>>>>>> See CasTypeSystemMapper lines 299 - 315.
>>>>> Without reading the code in detail - could we not relax this check such
>>>>> that the element type of FSArrays is not checked and the code simply
>>>>> assumes that the source element type has the same features as the target
>>>>> element type (with the usual lenient handling of missing features in the
>>>>> target type)? - Kind of a "duck typing" approach?
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> -- Richard


Re: Migrating type system of form 6 compressed CAS binaries

Posted by Marshall Schor <ms...@schor.com>.
The odd-feature-text example seems to work OK, but has some unusual properties due
to that Unicode character.

Here's what I see: the FeatureRecord "name" field is set to a single Unicode
character that must be encoded as 2 Java characters.

When output, it shows up in the XMI as <noNamespace:FeatureRecord xmi:id="18"
name="&#77987;" value="1.0"/>, which seems correct. The name field only has 1
(extended) Unicode character (taking 2 Java characters to represent), due to
setting it with this code: String oddName = "\uD80C\uDCA3";

When read in, the name field is assigned to a String; that string reports a
length of 2 (but that's because it takes 2 Java chars to represent this character).
If you have the name string in a variable "n" and do
System.out.println(n.codePointAt(0)), it shows (correctly) 77987.
n.codePointCount(0, n.length()) is, as expected, 1.
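
To make that concrete, the behavior can be reproduced with plain Java,
independent of UIMA - a minimal sketch:

// one supplementary character, encoded as a high/low surrogate pair
String n = "\uD80C\uDCA3";
System.out.println(n.length());                            // 2 (UTF-16 code units)
System.out.println(n.codePointAt(0));                      // 77987 (0x130A3)
System.out.println(n.codePointCount(0, n.length()));       // 1 (one real character)
System.out.println(Character.isLowSurrogate(n.charAt(1))); // true - '\uDCA3' on its own is not a valid XML character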

So, the string value serialization and deserialization seems to be "working".

The other case - serialization of the sofa (document) text - throws that error
because, as currently designed, the serialization code checks for these kinds of
characters and, if one is found, throws that exception. The checking is done
in XMLUtils.checkForNonXmlCharacters.

This is because "fixing" this case in the same way as the other would very likely
result in hard-to-diagnose future errors: the subject-of-analysis string is
processed with begin/end offsets all over the place, and that code assumes the
characters are not coded as surrogate pairs.

We could change the code to output these the way the name is output, e.g. as &#77987;.

Would that help in your case, or do you imagine other kinds of things might
break (due to begin/end offsets no longer
being on character boundaries, for example)?

-Marshall





On 9/18/2019 11:41 AM, Mario Juric wrote:
> Hi,
>
> I investigated the XMI issue as promised and these are my findings.
>
> It is related to special unicode characters that are not handled by XMI
> serialisation, and there seem to be two distinct categories of issues we have
> identified so far.
>
> 1) The document text of the CAS contains special unicode characters
> 2) Annotations with String features have values containing special unicode
> characters
>
> In both cases we could for sure solve the problem if we did a better clean up
> job upstream, but with the amount and variety of data we receive there is
> always a chance something passes through, and some of it may in the general
> case even be valid content.
>
> The first case is easy to reproduce with the OddDocumentText example I
> attached. In this example the text is a snippet taken from the content of a
> parsed XML document.
>
> The other case was not possible to reproduce with the OddFeatureText example,
> because I am getting slightly different output to what I have in our real
> setup. The OddFeatureText example is based on the simple type system I shared
> previously. The name value of a FeatureRecord contains special unicode
> characters that I found in a similar data structure in our actual CAS. The
> value comes from an external knowledge base holding some noisy strings, which
> in this case is a hieroglyph entity. However, when I write the CAS to XMI
> using the small example it only outputs the first of the two characters in
> "\uD80C\uDCA3”, which yields the value "&#77987;” in the XMI, but in our
> actual setup both character values are written as "&#77987;&#56483;”. This
> means that the attached example will for some reason parse the XMI again, but
> it will not work in the case where both characters are written the way we
> experience it. The XMI can be manually changed, so that both character values
> are included the way it happens in our output, and in this case a
> SAXParserException happens.
>
> I don’t know whether it is outside the scope of the XMI serialiser to handle
> any of this, but it will be good to know in any case :)
>
> Cheers,
> Mario
>
>> On 17 Sep 2019, at 09:36 , Mario Juric <mj@unsilo.ai <ma...@unsilo.ai>>
>> wrote:
>>
>> Thank you very much for looking into this. It is really appreciated and I
>> think it touches upon something important, which is about data migration in
>> general.
>>
>> I agree that some of these solutions can appear specific, awkward or complex
>> and the way forward is not to address our use case alone. I think there is a
>> need for a compact and efficient binary serialization format for the CAS when
>> dealing with large amounts of data because this is directly visible in costs
>> of processing and storing, and I found the compressed binary format to be
>> much better than XMI in this regard, although I have to admit it’s been a
>> while since I benchmarked this. Given that UIMA already has a well described
>> type system then maybe it just lacks a way to describe schema evolution
>> similar to Apache Avro or similar serialisation frameworks. I think a more
>> formal approach to data migration would be critical to any larger operational
>> setup.
>>
>> Regarding XMI, I'd like to provide some input on the problem we are observing,
>> so that it can be solved. We are primarily using XMI for inspection/debugging
>> purposes, and we are sometimes not able to do this because of this error. I
>> will try to extract a minimum example to avoid involving parts that have to do
>> with our pipeline and type system, and I think this would also be the best
>> way to illustrate that the problem exists outside of this context. However,
>> converting all our data to XMI first in order to do the conversion in our
>> example would not be very practical for us, because it involves a large
>> amount of data.
>>
>> Cheers,
>> Mario
>>
>>> On 16 Sep 2019, at 23:02 , Marshall Schor <msa@schor.com
>>> <ma...@schor.com>> wrote:
>>>
>>> In this case, the original looks kind-of like this:
>>>
>>> Container
>>>    features -> FSArray of FeatureAnnotation each of which
>>>                              has 5 slots: sofaRef, begin, end, name, value
>>>
>>> the new TypeSystem has
>>>
>>> Container
>>>    features -> FSArray of FeatureRecord each of which
>>>                               has 2 slots: name, value
>>>
>>> The deserializer code would need some way to decide how to
>>>    1) create an FSArray of FeatureRecord,
>>>    2) for each element,
>>>       map the FeatureAnnotation to a new instance of FeatureRecord
>>>
>>> I guess I could imagine a default mapping (for item 2 above) of
>>>   1) change the type from A to B
>>>   2) set equal-named features from A to B, drop other features
>>>
>>> This mapping would need to apply to a subset of the A's and B's, namely, only
>>> those referenced by the FSArray where the element type changed.  Seems complex
>>> and specific to this use case though.
>>>
>>> -Marshall
>>>
>>>
>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>> On 16. Sep 2019, at 19:05, Marshall Schor <msa@schor.com
>>>> <ma...@schor.com>> wrote:
>>>>> I can reproduce the problem, and see what is happening.  The deserialization
>>>>> code compares the two type systems, and allows for some mismatches (things
>>>>> present in one and not in the other), but it doesn't allow for having a
>>>>> feature
>>>>> whose range (value) is type XXXX in one type system and type YYYY in the
>>>>> other.
>>>>> See CasTypeSystemMapper lines 299 - 315.
>>>> Without reading the code in detail - could we not relax this check such
>>>> that the element type of FSArrays is not checked and the code simply
>>>> assumes that the source element type has the same features as the target
>>>> element type (with the usual lenient handling of missing features in the
>>>> target type)? - Kind of a "duck typing" approach?
>>>>
>>>> Cheers,
>>>>
>>>> -- Richard
>>
>

Re: Migrating type system of form 6 compressed CAS binaries

Posted by Mario Juric <mj...@unsilo.ai>.
Hi,

I investigated the XMI issue as promised and these are my findings.

It is related to special Unicode characters that are not handled by XMI serialisation, and there seem to be two distinct categories of issues we have identified so far.

1) The document text of the CAS contains special unicode characters
2) Annotations with String features have values containing special unicode characters

In both cases we could for sure solve the problem if we did a better clean up job upstream, but with the amount and variety of data we receive there is always a chance something passes through, and some of it may in the general case even be valid content.

The first case is easy to reproduce with the OddDocumentText example I attached. In this example the text is a snippet taken from the content of a parsed XML document.

The other case was not possible to reproduce with the OddFeatureText example, because I am getting slightly different output to what I have in our real setup. The OddFeatureText example is based on the simple type system I shared previously. The name value of a FeatureRecord contains special unicode characters that I found in a similar data structure in our actual CAS. The value comes from an external knowledge base holding some noisy strings, which in this case is a hieroglyph entity. However, when I write the CAS to XMI using the small example it only outputs the first of the two characters in "\uD80C\uDCA3”, which yields the value "&#77987;” in the XMI, but in our actual setup both character values are written as "&#77987;&#56483;”. This means that the attached example will for some reason parse the XMI again, but it will not work in the case where both characters are written the way we experience it. The XMI can be manually changed, so that both character values are included the way it happens in our output, and in this case a SAXParserException happens.
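
For what it's worth, a small self-contained check could catch both cases before serialization (plain Java; the helper name is my own, not a UIMA API). A lone surrogate such as \uDCA3 (56483) falls outside every allowed range, which is why the patched XMI triggers a SAXParserException:

// Returns true if every code point in s is a legal XML 1.0 character.
// An unpaired surrogate shows up as a code point in 0xD800-0xDFFF,
// which none of the allowed ranges include.
static boolean isLegalXml10(String s) {
  for (int i = 0; i < s.length(); ) {
    int cp = s.codePointAt(i);
    boolean ok = cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
    if (!ok) {
      return false;
    }
    i += Character.charCount(cp);
  }
  return true;
}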

I don’t know whether it is outside the scope of the XMI serialiser to handle any of this, but it will be good to know in any case :)

Cheers,
Mario

> On 17 Sep 2019, at 09:36 , Mario Juric <mj...@unsilo.ai> wrote:
> 
> Thank you very much for looking into this. It is really appreciated and I think it touches upon something important, which is about data migration in general.
> 
> I agree that some of these solutions can appear specific, awkward or complex and the way forward is not to address our use case alone. I think there is a need for a compact and efficient binary serialization format for the CAS when dealing with large amounts of data because this is directly visible in costs of processing and storing, and I found the compressed binary format to be much better than XMI in this regard, although I have to admit it’s been a while since I benchmarked this. Given that UIMA already has a well described type system then maybe it just lacks a way to describe schema evolution similar to Apache Avro or similar serialisation frameworks. I think a more formal approach to data migration would be critical to any larger operational setup.
> 
> Regarding XMI, I'd like to provide some input on the problem we are observing, so that it can be solved. We are primarily using XMI for inspection/debugging purposes, and we are sometimes not able to do this because of this error. I will try to extract a minimum example to avoid involving parts that have to do with our pipeline and type system, and I think this would also be the best way to illustrate that the problem exists outside of this context. However, converting all our data to XMI first in order to do the conversion in our example would not be very practical for us, because it involves a large amount of data.
> 
> Cheers,
> Mario
> 
>> On 16 Sep 2019, at 23:02 , Marshall Schor <msa@schor.com <ma...@schor.com>> wrote:
>> 
>> In this case, the original looks kind-of like this:
>> 
>> Container
>>    features -> FSArray of FeatureAnnotation each of which
>>                              has 5 slots: sofaRef, begin, end, name, value
>> 
>> the new TypeSystem has
>> 
>> Container
>>    features -> FSArray of FeatureRecord each of which
>>                               has 2 slots: name, value
>> 
>> The deserializer code would need some way to decide how to
>>    1) create an FSArray of FeatureRecord,
>>    2) for each element,
>>       map the FeatureAnnotation to a new instance of FeatureRecord
>> 
>> I guess I could imagine a default mapping (for item 2 above) of
>>   1) change the type from A to B
>>   2) set equal-named features from A to B, drop other features
>> 
>> This mapping would need to apply to a subset of the A's and B's, namely, only
>> those referenced by the FSArray where the element type changed.  Seems complex
>> and specific to this use case though.
>> 
>> -Marshall
>> 
>> 
>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>> On 16. Sep 2019, at 19:05, Marshall Schor <msa@schor.com <ma...@schor.com>> wrote:
>>>> I can reproduce the problem, and see what is happening.  The deserialization
>>>> code compares the two type systems, and allows for some mismatches (things
>>>> present in one and not in the other), but it doesn't allow for having a feature
>>>> whose range (value) is type XXXX in one type system and type YYYY in the other. 
>>>> See CasTypeSystemMapper lines 299 - 315.
>>> Without reading the code in detail - could we not relax this check such that the element type of FSArrays is not checked and the code simply assumes that the source element type has the same features as the target element type (with the usual lenient handling of missing features in the target type)? - Kind of a "duck typing" approach?
>>> 
>>> Cheers,
>>> 
>>> -- Richard
> 


Re: Migrating type system of form 6 compressed CAS binaries

Posted by Mario Juric <mj...@unsilo.ai>.
Thank you very much for looking into this. It is really appreciated, and I think it touches upon something important, namely data migration in general.

I agree that some of these solutions can appear specific, awkward or complex, and the way forward is not to address our use case alone. I think there is a need for a compact and efficient binary serialization format for the CAS when dealing with large amounts of data, because this is directly visible in the costs of processing and storage, and I found the compressed binary format to be much better than XMI in this regard, although I have to admit it’s been a while since I benchmarked this. Given that UIMA already has a well-described type system, maybe it just lacks a way to describe schema evolution, similar to Apache Avro and other serialisation frameworks. I think a more formal approach to data migration would be critical to any larger operational setup.

Regarding XMI, I'd like to provide some input on the problem we are observing, so that it can be solved. We are primarily using XMI for inspection/debugging purposes, and we are sometimes not able to do this because of this error. I will try to extract a minimum example that avoids the parts that have to do with our pipeline and type system; I think this would also be the best way to illustrate that the problem exists outside of this context. However, converting all our data to XMI first in order to do the conversion would not be very practical for us, because it involves a large amount of data.

Cheers,
Mario

> On 16 Sep 2019, at 23:02 , Marshall Schor <ms...@schor.com> wrote:
> 
> In this case, the original looks kind-of like this:
> 
> Container
>    features -> FSArray of FeatureAnnotation each of which
>                              has 5 slots: sofaRef, begin, end, name, value
> 
> the new TypeSystem has
> 
> Container
>    features -> FSArray of FeatureRecord each of which
>                               has 2 slots: name, value
> 
> The deserializer code would need some way to decide how to
>    1) create an FSArray of FeatureRecord,
>    2) for each element,
>       map the FeatureAnnotation to a new instance of FeatureRecord
> 
> I guess I could imagine a default mapping (for item 2 above) of
>   1) change the type from A to B
>   2) set equal-named features from A to B, drop other features
> 
> This mapping would need to apply to a subset of the A's and B's, namely, only
> those referenced by the FSArray where the element type changed.  Seems complex
> and specific to this use case though.
> 
> -Marshall
> 
> 
> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>> On 16. Sep 2019, at 19:05, Marshall Schor <ms...@schor.com> wrote:
>>> I can reproduce the problem, and see what is happening.  The deserialization
>>> code compares the two type systems, and allows for some mismatches (things
>>> present in one and not in the other), but it doesn't allow for having a feature
>>> whose range (value) is type XXXX in one type system and type YYYY in the other. 
>>> See CasTypeSystemMapper lines 299 - 315.
>> Without reading the code in detail - could we not relax this check such that the element type of FSArrays is not checked and the code simply assumes that the source element type has the same features as the target element type (with the usual lenient handling of missing features in the target type)? - Kind of a "duck typing" approach?
>> 
>> Cheers,
>> 
>> -- Richard


Re: Migrating type system of form 6 compressed CAS binaries

Posted by Marshall Schor <ms...@schor.com>.
In this case, the original looks kind-of like this:

Container
   features -> FSArray of FeatureAnnotation each of which
                             has 5 slots: sofaRef, begin, end, name, value

the new TypeSystem has

Container
   features -> FSArray of FeatureRecord each of which
                              has 2 slots: name, value

The deserializer code would need some way to decide how to
   1) create an FSArray of FeatureRecord,
   2) for each element,
      map the FeatureAnnotation to a new instance of FeatureRecord

I guess I could imagine a default mapping (for item 2 above) of
  1) change the type from A to B
  2) set equal-named features from A to B, drop other features

This mapping would need to apply to a subset of the A's and B's, namely, only
those referenced by the FSArray where the element type changed.  Seems complex
and specific to this use case though.
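
Just to sketch that default mapping (nothing like this exists in UIMA today - it only handles primitive-valued features, and FS-valued features would need recursion), the per-element conversion could look roughly like:

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Feature;
import org.apache.uima.cas.FeatureStructure;
import org.apache.uima.cas.Type;

// Create a new FS of tgtType (e.g. FeatureRecord) from src (e.g. a
// FeatureAnnotation), copying equal-named, equal-ranged primitive
// features and dropping the rest.
static FeatureStructure mapTo(CAS cas, FeatureStructure src, Type tgtType) {
  FeatureStructure tgt = cas.createFS(tgtType);
  for (Feature tgtFeat : tgtType.getFeatures()) {
    Feature srcFeat = src.getType().getFeatureByBaseName(tgtFeat.getShortName());
    if (srcFeat != null
        && srcFeat.getRange().equals(tgtFeat.getRange())
        && tgtFeat.getRange().isPrimitive()) {
      tgt.setFeatureValueFromString(tgtFeat, src.getFeatureValueAsString(srcFeat));
    }
  }
  return tgt;
}

A CasCopier-style traversal would then rebuild the FSArray with the converted elements.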

-Marshall


On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
> On 16. Sep 2019, at 19:05, Marshall Schor <ms...@schor.com> wrote:
>> I can reproduce the problem, and see what is happening.  The deserialization
>> code compares the two type systems, and allows for some mismatches (things
>> present in one and not in the other), but it doesn't allow for having a feature
>> whose range (value) is type XXXX in one type system and type YYYY in the other. 
>> See CasTypeSystemMapper lines 299 - 315.
> Without reading the code in detail - could we not relax this check such that the element type of FSArrays is not checked and the code simply assumes that the source element type has the same features as the target element type (with the usual lenient handling of missing features in the target type)? - Kind of a "duck typing" approach?
>
> Cheers,
>
> -- Richard

Re: Migrating type system of form 6 compressed CAS binaries

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 16. Sep 2019, at 19:05, Marshall Schor <ms...@schor.com> wrote:
> 
> I can reproduce the problem, and see what is happening.  The deserialization
> code compares the two type systems, and allows for some mismatches (things
> present in one and not in the other), but it doesn't allow for having a feature
> whose range (value) is type XXXX in one type system and type YYYY in the other. 
> See CasTypeSystemMapper lines 299 - 315.

Without reading the code in detail - could we not relax this check such that the element type of FSArrays is not checked and the code simply assumes that the source element type has the same features as the target element type (with the usual lenient handling of missing features in the target type)? - Kind of a "duck typing" approach?

Cheers,

-- Richard

Re: Migrating type system of form 6 compressed CAS binaries

Posted by Marshall Schor <ms...@schor.com>.
I can reproduce the problem, and see what is happening.  The deserialization
code compares the two type systems, and allows for some mismatches (things
present in one and not in the other), but it doesn't allow for having a feature
whose range (value) is type XXXX in one type system and type YYYY in the other. 
See CasTypeSystemMapper lines 299 - 315.

It may not be easy to fix.  Basically, the deserialization routines are set up
with a lenient kind of accommodation for different type systems, where they can
"skip" over types and features that are missing. 

This particular transformation needs to run a value conversion - from
FeatureAnnotation to FeatureRecord. 

I'm thinking of various approaches, and putting these out for others to expand
upon, etc.

1) Along the lines of Richard's remark, fix the xmi serialization to work with
all binary data, perhaps by base-64 encoding problematic (or specified by
feature name, or all) values, or - if it turns out to just be some "bug" -
fixing the bug.
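
For the base-64 variant of 1), the encoding itself is just the JDK - where to hook it into the serializer is the open question; this only sketches the round trip:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

String risky = "\uD80C\uDCA3";  // a value the current check rejects
String safe = Base64.getEncoder()
    .encodeToString(risky.getBytes(StandardCharsets.UTF_8));  // plain ASCII, XML-safe
String back = new String(Base64.getDecoder().decode(safe), StandardCharsets.UTF_8);
// back.equals(risky) - the round trip is lossless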

2) Allow the user to specify some kind of call-back function, in the
deserializer, when the range of the feature doesn't match.  This would take some
kind of representation of the feature value in typesystem1, and the type of the
feature value in type system 2, and would need to produce the value in type
system 2.  This may be quite problematic/awkward to carry out in all the
generalized edge cases, for instance if there are "forward" references to things
not yet deserialized, etc.
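
To make 2) a bit more concrete, the hook might look something like the following - the interface name and signature are invented here purely for discussion:

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FeatureStructure;
import org.apache.uima.cas.Type;

// Hypothetical: the deserializer would call this whenever a feature's
// range type differs between the stored and the target type system.
interface FeatureRangeMismatchHandler {
  // srcValue is the value as found in the serialized data (old type system);
  // the result must be expressed in the target type system.
  FeatureStructure convert(FeatureStructure srcValue, Type targetRange, CAS targetCas);
}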

At this point, I think #1 could be quite feasible.  To investigate further, it
would help to have a small test case where the xmi serialization currently is
not readable (due to - as you think - character coding issues).

-Marshall

On 9/16/2019 8:11 AM, Mario Juric wrote:
>
> Best Regards,
>
> Mario Juric
> Principal Engineer
> *UNSILO.ai* <http://unsilo.ai/>
> mobile:  +45 3082 4100
>
>     skype: mario.juric.dk <http://mario.juric.dk>
>
>
>
>
> Hi Marshall,
>
> I have a small test case  with 3 files excluding any JCasGen generated types
> and UIMAfit types file.
>
> First you will have to generate the types and run the SaveCompressedBinary to
> produce the 3 binary forms I have been experimenting with. You should then be
> able to run LoadCompressedBinaries as expected.
>
> Next you need to change the element type of Container.features from
> FeatureAnnotation to FeatureRecord in the type system and generate the type
> system again. Also change the FeatureAnnotation reference In
> LoadCompressedBinaries l. 25 to FeatureRecord and then try to reload the
> previously stored binaries again without saving them first using the new type
> system.
>
> You can see I have played with different ways of loading just to see if
> anything worked, but much of it seems to result in exactly the same calls in
> the lower layers. I didn’t get entirely the same results with the CAS we
> actually store as in this example. E.g. I experienced some EOF with the
> compressed filtered whereas I only get a class cast exception during
> verification in this example. Note also that we keep both types in the new
> type system, but we want to change the element type of the FSArray in the
> Container.
>
> Hope this will yield some useful insights and thanks a lot :)
>
> Cheers
> Mario
>
>> On 13 Sep 2019, at 21:55 , Mario Juric <mj@unsilo.ai <ma...@unsilo.ai>>
>> wrote:
>>
>> Thanks Marshall,
>>
>> I’ll get back to you with a small sample as soon I get the time to do it.
>> This will also get me a better understanding of the the format.
>>
>>
>> Cheers,
>> Mario
>>
>>> On 13 Sep 2019, at 19:32 , Marshall Schor <msa@schor.com
>>> <ma...@schor.com>> wrote:
>>>
>>> I'm wondering if you could post a very small test case showing this problem with
>>> a small type system. 
>>>
>>> With that, I could run in the debugger and see exactly what was happening, and
>>> see whether or not some small fix would make this work.
>>>
>>> The Deserializer for this already supports a certain type of mismatch between
>>> type systems, but mainly one where one is a subset of the other - see the
>>> javadoc for the method
>>>
>>> org.apache.uima.cas.impl.BinaryCasSerDes6.java.
>>>
>>> But it must not currently cover this particular case.
>>>
>>> -Marshall
>>>
>>> On 9/13/2019 10:48 AM, Mario Juric wrote:
>>>> Just a quick follow up.
>>>>
>>>> I played a bit around with the CasIOUtils, and it seems that it is possible
>>>> to load and use the embedded type system, i.e. the old type system with X,
>>>> but I found no way to replace it with the new type system and make the
>>>> necessary mappings to Y. I tried to see if I could use the CasCopier in a
>>>> separate step but it expectedly fails when it reaches to the FSArray of X
>>>> in the source CAS because the destination type system requires elements of
>>>> type Y. I could make my own modified version of the CasCopier that could
>>>> take some mapping functions for each pair of source and destination types
>>>> that need to be mapped, but this is where it starts to get too complicated,
>>>> so I found it not to be worth it at this point, since we might then want to
>>>> reprocess everything from scratch anyway.
>>>>
>>>> Cheers,
>>>> Mario
>>>>
>>>>
>>>>> On 12 Sep 2019, at 10:41 , Mario Juric <mj@unsilo.ai
>>>>> <ma...@unsilo.ai>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We use form 6 compressed binaries to persist the CAS. We now want to make
>>>>> a change to the type system that is not directly compatible, although in
>>>>> principle the new type system is really a subset from a data perspective,
>>>>> so we want to migrate existing binaries to the new type system, but we
>>>>> don’t know how. The change is as follows:
>>>>>
>>>>> In the existing type system we have a type A with a FSArray feature of
>>>>> element type X, and we want to change X to Y where Y contains a genuine
>>>>> feature subset of X. This means we basically want to replace X with Y for
>>>>> the FSArray and ditch a few attributes of X when loading the CAS into the
>>>>> new type system.
>>>>>
>>>>> Had the CAS been stored in JSON this would be trivial by just mapping the
>>>>> attributes that they have in common, but when I try to load the CAS binary
>>>>> into the new target type system it chokes with an EOF, so I don’t know if
>>>>> that is at all possible with a form 6 compressed CAS binary?
>>>>>
>>>>> I poked a bit around in the reference, API and mailing list archive but I
>>>>> was not able to find anything useful. I can of course keep parallel
>>>>> attributes for both X and Y and then have a separate step that makes an
>>>>> explicit conversion/copy, but I prefer to avoid this. I would appreciate
>>>>> any input to the problem, thanks :)
>>>>>
>>>>> Cheers,
>>>>> Mario
>>>>>
>>>>>
>>>>
>>
>

Re: Migrating type system of form 6 compressed CAS binaries

Posted by Mario Juric <mj...@unsilo.ai>.
Yes, these were just generated from the type system file using JCasGen.



> On 16 Sep 2019, at 15:32 , Marshall Schor <ms...@schor.com> wrote:
> 
> oops, ignore that - I see Container is a JCas class ...  -M
> 
> On 9/16/2019 9:30 AM, Marshall Schor wrote:
>> I may have some version pblms.  The LoadCompressedBinary has refs to a class
>> "Container", but I don't seem to have that class - where is it coming from?
>> 
>> -Marshall
>> 
>> On 9/16/2019 8:11 AM, Mario Juric wrote:
>>> Best Regards,
>>> 
>>> Mario Juric
>>> Principal Engineer
>>> *UNSILO.ai* <http://unsilo.ai/>
>>> mobile:  +45 3082 4100
>>> 
>>>    skype: mario.juric.dk <http://mario.juric.dk>
>>> 
>>> 
>>> 
>>> 
>>> Hi Marshall,
>>> 
>>> I have a small test case  with 3 files excluding any JCasGen generated types
>>> and UIMAfit types file.
>>> 
>>> First you will have to generate the types and run the SaveCompressedBinary to
>>> produce the 3 binary forms I have been experimenting with. You should then be
>>> able to run LoadCompressedBinaries as expected.
>>> 
>>> Next you need to change the element type of Container.features from
>>> FeatureAnnotation to FeatureRecord in the type system and generate the type
>>> system again. Also change the FeatureAnnotation reference in
>>> LoadCompressedBinaries l. 25 to FeatureRecord and then try to reload the
>>> previously stored binaries again without saving them first using the new type
>>> system.
>>> 
>>> You can see I have played with different ways of loading just to see if
>>> anything worked, but much of it seems to result in exactly the same calls in
>>> the lower layers. I didn’t get entirely the same results with the CAS we
>>> actually store as in this example. E.g. I experienced some EOF with the
>>> compressed filtered whereas I only get a class cast exception during
>>> verification in this example. Note also that we keep both types in the new
>>> type system, but we want to change the element type of the FSArray in the
>>> Container.
>>> 
>>> Hope this will yield some useful insights and thanks a lot :)
>>> 
>>> Cheers
>>> Mario
>>> 
>>>> On 13 Sep 2019, at 21:55 , Mario Juric <mj@unsilo.ai <ma...@unsilo.ai>>
>>>> wrote:
>>>> 
>>>> Thanks Marshall,
>>>> 
>>>> I’ll get back to you with a small sample as soon as I get the time to do it.
>>>> This will also get me a better understanding of the format.
>>>> 
>>>> 
>>>> Cheers,
>>>> Mario
>>>> 
>>>>> On 13 Sep 2019, at 19:32 , Marshall Schor <msa@schor.com
>>>>> <ma...@schor.com>> wrote:
>>>>> 
>>>>> I'm wondering if you could post a very small test case showing this problem with
>>>>> a small type system. 
>>>>> 
>>>>> With that, I could run in the debugger and see exactly what was happening, and
>>>>> see whether or not some small fix would make this work.
>>>>> 
>>>>> The Deserializer for this already supports a certain type of mismatch between
>>>>> type systems, but mainly one where one is a subset of the other - see the
>>>>> javadoc for the method
>>>>> 
>>>>> org.apache.uima.cas.impl.BinaryCasSerDes6.java.
>>>>> 
>>>>> But it must not currently cover this particular case.
>>>>> 
>>>>> -Marshall
>>>>> 
>>>>> On 9/13/2019 10:48 AM, Mario Juric wrote:
>>>>>> Just a quick follow up.
>>>>>> 
>>>>>> I played a bit around with the CasIOUtils, and it seems that it is possible
>>>>>> to load and use the embedded type system, i.e. the old type system with X,
>>>>>> but I found no way to replace it with the new type system and make the
>>>>>> necessary mappings to Y. I tried to see if I could use the CasCopier in a
>>>>>> separate step but it expectedly fails when it reaches to the FSArray of X
>>>>>> in the source CAS because the destination type system requires elements of
>>>>>> type Y. I could make my own modified version of the CasCopier that could
>>>>>> take some mapping functions for each pair of source and destination types
>>>>>> that need to be mapped, but this is where it starts to get too complicated,
>>>>>> so I found it not to be worth it at this point, since we might then want to
>>>>>> reprocess everything from scratch anyway.
>>>>>> 
>>>>>> Cheers,
>>>>>> Mario
>>>>>> 
>>>>>>> On 12 Sep 2019, at 10:41 , Mario Juric <mj@unsilo.ai
>>>>>>> <ma...@unsilo.ai>> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> We use form 6 compressed binaries to persist the CAS. We now want to make
>>>>>>> a change to the type system that is not directly compatible, although in
>>>>>>> principle the new type system is really a subset from a data perspective,
>>>>>>> so we want to migrate existing binaries to the new type system, but we
>>>>>>> don’t know how. The change is as follows:
>>>>>>> 
>>>>>>> In the existing type system we have a type A with a FSArray feature of
>>>>>>> element type X, and we want to change X to Y where Y contains a genuine
>>>>>>> feature subset of X. This means we basically want to replace X with Y for
>>>>>>> the FSArray and ditch a few attributes of X when loading the CAS into the
>>>>>>> new type system.
>>>>>>> 
>>>>>>> Had the CAS been stored in JSON this would be trivial by just mapping the
>>>>>>> attributes that they have in common, but when I try to load the CAS binary
>>>>>>> into the new target type system it chokes with an EOF, so I don’t know if
>>>>>>> that is at all possible with a form 6 compressed CAS binary?
>>>>>>> 
>>>>>>> I poked a bit around in the reference, API and mailing list archive but I
>>>>>>> was not able to find anything useful. I can of course keep parallel
>>>>>>> attributes for both X and Y and then have a separate step that makes an
>>>>>>> explicit conversion/copy, but I prefer to avoid this. I would appreciate
>>>>>>> any input to the problem, thanks :)
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Mario
>>>>>>> 
>>>>>>> 


Re: Migrating type system of form 6 compressed CAS binaries

Posted by Marshall Schor <ms...@schor.com>.
oops, ignore that - I see Container is a JCas class ...  -M

On 9/16/2019 9:30 AM, Marshall Schor wrote:
> I may have some version problems.  The LoadCompressedBinary has refs to a class
> "Container", but I don't seem to have that class - where is it coming from?
>
> -Marshall
>
> On 9/16/2019 8:11 AM, Mario Juric wrote:
>> Best Regards,
>>
>> Mario Juric
>> Principal Engineer
>> *UNSILO.ai* <http://unsilo.ai/>
>> mobile:  +45 3082 4100
>>
>>     skype: mario.juric.dk <http://mario.juric.dk>
>>
>>
>>
>>
>> Hi Marshall,
>>
>> I have a small test case  with 3 files excluding any JCasGen generated types
>> and UIMAfit types file.
>>
>> First you will have to generate the types and run the SaveCompressedBinary to
>> produce the 3 binary forms I have been experimenting with. You should then be
>> able to run LoadCompressedBinaries as expected.
>>
>> Next you need to change the element type of Container.features from
>> FeatureAnnotation to FeatureRecord in the type system and generate the type
>> system again. Also change the FeatureAnnotation reference in
>> LoadCompressedBinaries l. 25 to FeatureRecord and then try to reload the
>> previously stored binaries again without saving them first using the new type
>> system.
>>
>> You can see I have played with different ways of loading just to see if
>> anything worked, but much of it seems to result in exactly the same calls in
>> the lower layers. I didn’t get entirely the same results with the CAS we
>> actually store as in this example. E.g. I experienced some EOF with the
>> compressed filtered whereas I only get a class cast exception during
>> verification in this example. Note also that we keep both types in the new
>> type system, but we want to change the element type of the FSArray in the
>> Container.
>>
>> Hope this will yield some useful insights and thanks a lot :)
>>
>> Cheers
>> Mario
>>
>>> On 13 Sep 2019, at 21:55 , Mario Juric <mj@unsilo.ai <ma...@unsilo.ai>>
>>> wrote:
>>>
>>> Thanks Marshall,
>>>
>>> I’ll get back to you with a small sample as soon as I get the time to do it.
>>> This will also get me a better understanding of the format.
>>>
>>>
>>> Cheers,
>>> Mario
>>>
>>>> On 13 Sep 2019, at 19:32 , Marshall Schor <msa@schor.com
>>>> <ma...@schor.com>> wrote:
>>>>
>>>> I'm wondering if you could post a very small test case showing this problem with
>>>> a small type system. 
>>>>
>>>> With that, I could run in the debugger and see exactly what was happening, and
>>>> see whether or not some small fix would make this work.
>>>>
>>>> The Deserializer for this already supports a certain type of mismatch between
>>>> type systems, but mainly one where one is a subset of the other - see the
>>>> javadoc for the method
>>>>
>>>> org.apache.uima.cas.impl.BinaryCasSerDes6.java.
>>>>
>>>> But it must not currently cover this particular case.
>>>>
>>>> -Marshall
>>>>
>>>> On 9/13/2019 10:48 AM, Mario Juric wrote:
>>>>> Just a quick follow up.
>>>>>
>>>>> I played a bit around with the CasIOUtils, and it seems that it is possible
>>>>> to load and use the embedded type system, i.e. the old type system with X,
>>>>> but I found no way to replace it with the new type system and make the
>>>>> necessary mappings to Y. I tried to see if I could use the CasCopier in a
>>>>> separate step but it expectedly fails when it reaches to the FSArray of X
>>>>> in the source CAS because the destination type system requires elements of
>>>>> type Y. I could make my own modified version of the CasCopier that could
>>>>> take some mapping functions for each pair of source and destination types
>>>>> that need to be mapped, but this is where it starts to get too complicated,
>>>>> so I found it not to be worth it at this point, since we might then want to
>>>>> reprocess everything from scratch anyway.
>>>>>
>>>>> Cheers,
>>>>> Mario
>>>>>
>>>>>> On 12 Sep 2019, at 10:41 , Mario Juric <mj@unsilo.ai
>>>>>> <ma...@unsilo.ai>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We use form 6 compressed binaries to persist the CAS. We now want to make
>>>>>> a change to the type system that is not directly compatible, although in
>>>>>> principle the new type system is really a subset from a data perspective,
>>>>>> so we want to migrate existing binaries to the new type system, but we
>>>>>> don’t know how. The change is as follows:
>>>>>>
>>>>>> In the existing type system we have a type A with a FSArray feature of
>>>>>> element type X, and we want to change X to Y where Y contains a genuine
>>>>>> feature subset of X. This means we basically want to replace X with Y for
>>>>>> the FSArray and ditch a few attributes of X when loading the CAS into the
>>>>>> new type system.
>>>>>>
>>>>>> Had the CAS been stored in JSON this would be trivial by just mapping the
>>>>>> attributes that they have in common, but when I try to load the CAS binary
>>>>>> into the new target type system it chokes with an EOF, so I don’t know if
>>>>>> that is at all possible with a form 6 compressed CAS binary?
>>>>>>
>>>>>> I poked a bit around in the reference, API and mailing list archive but I
>>>>>> was not able to find anything useful. I can of course keep parallel
>>>>>> attributes for both X and Y and then have a separate step that makes an
>>>>>> explicit conversion/copy, but I prefer to avoid this. I would appreciate
>>>>>> any input to the problem, thanks :)
>>>>>>
>>>>>> Cheers,
>>>>>> Mario
>>>>>>

Re: Migrating type system of form 6 compressed CAS binaries

Posted by Marshall Schor <ms...@schor.com>.
I may have some version problems.  The LoadCompressedBinary has refs to a class
"Container", but I don't seem to have that class - where is it coming from?

-Marshall

On 9/16/2019 8:11 AM, Mario Juric wrote:
>
> Best Regards,
>
> Mario Juric
> Principal Engineer
> *UNSILO.ai* <http://unsilo.ai/>
> mobile:  +45 3082 4100
>
>     skype: mario.juric.dk <http://mario.juric.dk>
>
>
>
>
> Hi Marshall,
>
> I have a small test case  with 3 files excluding any JCasGen generated types
> and UIMAfit types file.
>
> First you will have to generate the types and run the SaveCompressedBinary to
> produce the 3 binary forms I have been experimenting with. You should then be
> able to run LoadCompressedBinaries as expected.
>
> Next you need to change the element type of Container.features from
> FeatureAnnotation to FeatureRecord in the type system and generate the type
> system again. Also change the FeatureAnnotation reference in
> LoadCompressedBinaries l. 25 to FeatureRecord and then try to reload the
> previously stored binaries again without saving them first using the new type
> system.
>
> You can see I have played with different ways of loading just to see if
> anything worked, but much of it seems to result in exactly the same calls in
> the lower layers. I didn’t get entirely the same results with the CAS we
> actually store as in this example. E.g. I experienced some EOF with the
> compressed filtered whereas I only get a class cast exception during
> verification in this example. Note also that we keep both types in the new
> type system, but we want to change the element type of the FSArray in the
> Container.
>
> Hope this will yield some useful insights and thanks a lot :)
>
> Cheers
> Mario
>
>> On 13 Sep 2019, at 21:55 , Mario Juric <mj@unsilo.ai <ma...@unsilo.ai>>
>> wrote:
>>
>> Thanks Marshall,
>>
>> I’ll get back to you with a small sample as soon as I get the time to do it.
>> This will also get me a better understanding of the format.
>>
>>
>> Cheers,
>> Mario
>>
>>> On 13 Sep 2019, at 19:32 , Marshall Schor <msa@schor.com
>>> <ma...@schor.com>> wrote:
>>>
>>> I'm wondering if you could post a very small test case showing this problem with
>>> a small type system. 
>>>
>>> With that, I could run in the debugger and see exactly what was happening, and
>>> see whether or not some small fix would make this work.
>>>
>>> The Deserializer for this already supports a certain type of mismatch between
>>> type systems, but mainly one where one is a subset of the other - see the
>>> javadoc for the method
>>>
>>> org.apache.uima.cas.impl.BinaryCasSerDes6.java.
>>>
>>> But it must not currently cover this particular case.
>>>
>>> -Marshall
>>>
>>> On 9/13/2019 10:48 AM, Mario Juric wrote:
>>>> Just a quick follow up.
>>>>
>>>> I played a bit around with the CasIOUtils, and it seems that it is possible
>>>> to load and use the embedded type system, i.e. the old type system with X,
>>>> but I found no way to replace it with the new type system and make the
>>>> necessary mappings to Y. I tried to see if I could use the CasCopier in a
>>>> separate step but it expectedly fails when it reaches to the FSArray of X
>>>> in the source CAS because the destination type system requires elements of
>>>> type Y. I could make my own modified version of the CasCopier that could
>>>> take some mapping functions for each pair of source and destination types
>>>> that need to be mapped, but this is where it starts to get too complicated,
>>>> so I found it not to be worth it at this point, since we might then want to
>>>> reprocess everything from scratch anyway.
>>>>
>>>> Cheers,
>>>> Mario
>>>>
>>>>> On 12 Sep 2019, at 10:41 , Mario Juric <mj@unsilo.ai
>>>>> <ma...@unsilo.ai>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We use form 6 compressed binaries to persist the CAS. We now want to make
>>>>> a change to the type system that is not directly compatible, although in
>>>>> principle the new type system is really a subset from a data perspective,
>>>>> so we want to migrate existing binaries to the new type system, but we
>>>>> don’t know how. The change is as follows:
>>>>>
>>>>> In the existing type system we have a type A with a FSArray feature of
>>>>> element type X, and we want to change X to Y where Y contains a genuine
>>>>> feature subset of X. This means we basically want to replace X with Y for
>>>>> the FSArray and ditch a few attributes of X when loading the CAS into the
>>>>> new type system.
>>>>>
>>>>> Had the CAS been stored in JSON this would be trivial by just mapping the
>>>>> attributes that they have in common, but when I try to load the CAS binary
>>>>> into the new target type system it chokes with an EOF, so I don’t know if
>>>>> that is at all possible with a form 6 compressed CAS binary?
>>>>>
>>>>> I poked a bit around in the reference, API and mailing list archive but I
>>>>> was not able to find anything useful. I can of course keep parallel
>>>>> attributes for both X and Y and then have a separate step that makes an
>>>>> explicit conversion/copy, but I prefer to avoid this. I would appreciate
>>>>> any input to the problem, thanks :)
>>>>>
>>>>> Cheers,
>>>>> Mario
>>>>>
>>>>
>>
>

Re: Migrating type system of form 6 compressed CAS binaries

Posted by Mario Juric <mj...@unsilo.ai>.
Best Regards,

Mario Juric
Principal Engineer
UNSILO.ai <http://unsilo.ai/>
mobile:  +45 3082 4100
skype: mario.juric.dk



Hi Marshall,

I have a small test case with 3 files, excluding any JCasGen generated types and the UIMAfit types file.

First you will have to generate the types and run SaveCompressedBinary to produce the 3 binary forms I have been experimenting with. You should then be able to run LoadCompressedBinaries as expected.

Next you need to change the element type of Container.features from FeatureAnnotation to FeatureRecord in the type system and generate the type system again. Also change the FeatureAnnotation reference in LoadCompressedBinaries l. 25 to FeatureRecord, and then try to reload the previously stored binaries (without saving them first) using the new type system.

You can see I have played with different ways of loading, just to see if anything worked, but much of it seems to result in exactly the same calls in the lower layers. I didn’t get entirely the same results with the CAS we actually store as in this example: e.g., I experienced an EOF with the compressed filtered form, whereas I only get a class cast exception during verification in this example. Note also that we keep both types in the new type system, but we want to change the element type of the FSArray in the Container.
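
In case it helps while setting this up: the save side boils down to CasIOUtils (a sketch of what SaveCompressedBinary does - the real class is in the attachment, so the file name here is illustrative, and "cas" is the CAS populated by the test):

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.SerialFormat;
import org.apache.uima.util.CasIOUtils;

OutputStream out = new FileOutputStream("container-form6.bin");
try {
  // form 6 compressed, filtered, with the type system and index info embedded
  CasIOUtils.save(cas, out, SerialFormat.COMPRESSED_FILTERED_TSI);
} finally {
  out.close();
}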

Hope this will yield some useful insights and thanks a lot :)

Cheers
Mario

> On 13 Sep 2019, at 21:55 , Mario Juric <mj...@unsilo.ai> wrote:
> 
> Thanks Marshall,
> 
> I’ll get back to you with a small sample as soon as I get the time to do it. This will also get me a better understanding of the format.
> 
> 
> Cheers,
> Mario
> 
>> On 13 Sep 2019, at 19:32 , Marshall Schor <msa@schor.com <ma...@schor.com>> wrote:
>> 
>> I'm wondering if you could post a very small test case showing this problem with
>> a small type system. 
>> 
>> With that, I could run in the debugger and see exactly what was happening, and
>> see whether or not some small fix would make this work.
>> 
>> The Deserializer for this already supports a certain type of mismatch between
>> type systems, but mainly one where one is a subset of the other - see the
>> javadoc for the method
>> 
>> org.apache.uima.cas.impl.BinaryCasSerDes6.java.
>> 
>> But it must not currently cover this particular case.
>> 
>> -Marshall
>> 
>> On 9/13/2019 10:48 AM, Mario Juric wrote:
>>> Just a quick follow up.
>>> 
>>> I played a bit around with the CasIOUtils, and it seems that it is possible to load and use the embedded type system, i.e. the old type system with X, but I found no way to replace it with the new type system and make the necessary mappings to Y. I tried to see if I could use the CasCopier in a separate step but it expectedly fails when it reaches to the FSArray of X in the source CAS because the destination type system requires elements of type Y. I could make my own modified version of the CasCopier that could take some mapping functions for each pair of source and destination types that need to be mapped, but this is where it starts to get too complicated, so I found it not to be worth it at this point, since we might then want to reprocess everything from scratch anyway.
>>> 
>>> Cheers,
>>> Mario
>>> 
>>>> On 12 Sep 2019, at 10:41 , Mario Juric <mj@unsilo.ai <ma...@unsilo.ai>> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> We use form 6 compressed binaries to persist the CAS. We now want to make a change to the type system that is not directly compatible, although in principle the new type system is really a subset from a data perspective, so we want to migrate existing binaries to the new type system, but we don’t know how. The change is as follows:
>>>> 
>>>> In the existing type system we have a type A with a FSArray feature of element type X, and we want to change X to Y where Y contains a genuine feature subset of X. This means we basically want to replace X with Y for the FSArray and ditch a few attributes of X when loading the CAS into the new type system.
>>>> 
>>>> Had the CAS been stored in JSON this would be trivial by just mapping the attributes that they have in common, but when I try to load the CAS binary into the new target type system it chokes with an EOF, so I don’t know if that is at all possible with a form 6 compressed CAS binary?
>>>> 
>>>> I poked a bit around in the reference, API and mailing list archive but I was not able to find anything useful. I can of course keep parallel attributes for both X and Y and then have a separate step that makes an explicit conversion/copy, but I prefer to avoid this. I would appreciate any input to the problem, thanks :)
>>>> 
>>>> Cheers,
>>>> Mario
>>>> 
>>> 
> 


Re: Migrating type system of form 6 compressed CAS binaries

Posted by Mario Juric <mj...@unsilo.ai>.
Thanks Marshall,

I’ll get back to you with a small sample as soon as I get the time to do it. This will also get me a better understanding of the format.


Cheers,
Mario

> On 13 Sep 2019, at 19:32 , Marshall Schor <ms...@schor.com> wrote:
> 
> I'm wondering if you could post a very small test case showing this problem with
> a small type system. 
> 
> With that, I could run in the debugger and see exactly what was happening, and
> see whether or not some small fix would make this work.
> 
> The Deserializer for this already supports a certain type of mismatch between
> type systems, but mainly one where one is a subset of the other - see the
> javadoc for the method
> 
> org.apache.uima.cas.impl.BinaryCasSerDes6.java.
> 
> But it must not currently cover this particular case.
> 
> -Marshall
> 
> On 9/13/2019 10:48 AM, Mario Juric wrote:
>> Just a quick follow up.
>> 
>> I played a bit around with the CasIOUtils, and it seems that it is possible to load and use the embedded type system, i.e. the old type system with X, but I found no way to replace it with the new type system and make the necessary mappings to Y. I tried to see if I could use the CasCopier in a separate step but it expectedly fails when it reaches to the FSArray of X in the source CAS because the destination type system requires elements of type Y. I could make my own modified version of the CasCopier that could take some mapping functions for each pair of source and destination types that need to be mapped, but this is where it starts to get too complicated, so I found it not to be worth it at this point, since we might then want to reprocess everything from scratch anyway.
>> 
>> Cheers,
>> Mario
>> 
>>> On 12 Sep 2019, at 10:41 , Mario Juric <mj...@unsilo.ai> wrote:
>>> 
>>> Hi,
>>> 
>>> We use form 6 compressed binaries to persist the CAS. We now want to make a change to the type system that is not directly compatible, although in principle the new type system is really a subset from a data perspective, so we want to migrate existing binaries to the new type system, but we don’t know how. The change is as follows:
>>> 
>>> In the existing type system we have a type A with a FSArray feature of element type X, and we want to change X to Y where Y contains a genuine feature subset of X. This means we basically want to replace X with Y for the FSArray and ditch a few attributes of X when loading the CAS into the new type system.
>>> 
>>> Had the CAS been stored in JSON this would be trivial by just mapping the attributes that they have in common, but when I try to load the CAS binary into the new target type system it chokes with an EOF, so I don’t know if that is at all possible with a form 6 compressed CAS binary?
>>> 
>>> I poked a bit around in the reference, API and mailing list archive but I was not able to find anything useful. I can of course keep parallel attributes for both X and Y and then have a separate step that makes an explicit conversion/copy, but I prefer to avoid this. I would appreciate any input to the problem, thanks :)
>>> 
>>> Cheers,
>>> Mario
>>> 
>> 


Re: Migrating type system of form 6 compressed CAS binaries

Posted by Marshall Schor <ms...@schor.com>.
I'm wondering if you could post a very small test case showing this problem with
a small type system. 

With that, I could run in the debugger and see exactly what was happening, and
see whether or not some small fix would make this work.

The Deserializer for this already supports a certain type of mismatch between
type systems, but mainly one where one is a subset of the other - see the
javadoc for the method

org.apache.uima.cas.impl.BinaryCasSerDes6.java.

But it must not currently cover this particular case.

-Marshall

On 9/13/2019 10:48 AM, Mario Juric wrote:
> Just a quick follow up.
>
> I played a bit around with the CasIOUtils, and it seems that it is possible to load and use the embedded type system, i.e. the old type system with X, but I found no way to replace it with the new type system and make the necessary mappings to Y. I tried to see if I could use the CasCopier in a separate step but it expectedly fails when it reaches to the FSArray of X in the source CAS because the destination type system requires elements of type Y. I could make my own modified version of the CasCopier that could take some mapping functions for each pair of source and destination types that need to be mapped, but this is where it starts to get too complicated, so I found it not to be worth it at this point, since we might then want to reprocess everything from scratch anyway.
>
> Cheers,
> Mario
>
>> On 12 Sep 2019, at 10:41 , Mario Juric <mj...@unsilo.ai> wrote:
>>
>> Hi,
>>
>> We use form 6 compressed binaries to persist the CAS. We now want to make a change to the type system that is not directly compatible, although in principle the new type system is really a subset from a data perspective, so we want to migrate existing binaries to the new type system, but we don’t know how. The change is as follows:
>>
>> In the existing type system we have a type A with an FSArray feature of element type X, and we want to change X to Y, where Y contains a genuine subset of X's features. This means we basically want to replace X with Y for the FSArray and drop a few attributes of X when loading the CAS into the new type system.
>>
>> Had the CAS been stored in JSON this would be trivial by just mapping the attributes that they have in common, but when I try to load the CAS binary into the new target type system it chokes with an EOF, so I don’t know if that is at all possible with a form 6 compressed CAS binary?
>>
>> I poked around a bit in the reference documentation, the API and the mailing list archive, but I was not able to find anything useful. I can of course keep parallel attributes for both X and Y and add a separate step that makes an explicit conversion/copy, but I would prefer to avoid this. I would appreciate any input on the problem, thanks :)
>>
>> Cheers,
>> Mario
>

Re: Migrating type system of form 6 compressed CAS binaries

Posted by Mario Juric <mj...@unsilo.ai>.
Hi Richard,

Unfortunately not. We have experienced some instability with the XMI format where it wasn’t possible to read the data back after writing it, so we would probably not be able to convert a percentage of documents this way. Superficially it appears to be related to encoding issues, but I will try to recreate a small example at some point.

Cheers,
Mario

> On 14 Sep 2019, at 01:06 , Richard Eckart de Castilho <re...@apache.org> wrote:
> 
> Hi Mario,
> 
>> On 13. Sep 2019, at 16:48, Mario Juric <mj...@unsilo.ai> wrote:
>> 
>> I tried to see if I could use the CasCopier in a separate step, but it expectedly fails when it reaches the FSArray of X in the source CAS, because the destination type system requires elements of type Y.
> 
> How about converting your data to XMI, patching the name of the array element type from the old to the new name, and loading that data back in leniently?
> 
> -- Richard


Re: Migrating type system of form 6 compressed CAS binaries

Posted by Richard Eckart de Castilho <re...@apache.org>.
Hi Mario,

> On 13. Sep 2019, at 16:48, Mario Juric <mj...@unsilo.ai> wrote:
> 
> I tried to see if I could use the CasCopier in a separate step, but it expectedly fails when it reaches the FSArray of X in the source CAS, because the destination type system requires elements of type Y.

How about converting your data to XMI, patching the name of the array element type from the old to the new name, and loading that data back in leniently?
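
Roughly like this (untested sketch; "pkg", "X" and "Y" are placeholders for
your real namespace prefix and type names, and a robust version would patch
the XML with a proper parser rather than plain string replacement):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

import org.apache.uima.cas.impl.XmiCasDeserializer;
import org.apache.uima.cas.impl.XmiCasSerializer;

// srcCas was loaded with the old type system,
// destCas was created with the new one.
ByteArrayOutputStream buf = new ByteArrayOutputStream();
XmiCasSerializer.serialize(srcCas, buf);

// In XMI, a feature structure of type some.pkg.X appears as an element
// like <pkg:X .../>, so rename those elements from X to Y.
String xmi = buf.toString("UTF-8")
    .replace("<pkg:X ", "<pkg:Y ")
    .replace("</pkg:X>", "</pkg:Y>");

// Lenient deserialization ignores the features of X that Y doesn't define.
XmiCasDeserializer.deserialize(
    new ByteArrayInputStream(xmi.getBytes("UTF-8")), destCas, true);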

-- Richard

Re: Migrating type system of form 6 compressed CAS binaries

Posted by Mario Juric <mj...@unsilo.ai>.
Just a quick follow-up.

I played around a bit with CasIOUtils, and it seems possible to load and use the embedded type system, i.e. the old type system with X, but I found no way to replace it with the new type system and make the necessary mappings to Y. I tried to see if I could use the CasCopier in a separate step, but it expectedly fails when it reaches the FSArray of X in the source CAS, because the destination type system requires elements of type Y. I could write my own modified version of the CasCopier that takes a mapping function for each pair of source and destination types, but this is where it starts to get too complicated, so it does not seem worth it at this point, since we might end up reprocessing everything from scratch anyway.
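
For reference, the failing attempt looked roughly like this (simplified; the
descriptor and document file names are placeholders for our real ones):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.uima.UIMAFramework;
import org.apache.uima.cas.CAS;
import org.apache.uima.resource.metadata.TypeSystemDescription;
import org.apache.uima.util.CasCopier;
import org.apache.uima.util.CasCreationUtils;
import org.apache.uima.util.CasIOUtils;
import org.apache.uima.util.XMLInputSource;

// Source CAS set up with the old type system (type A, FSArray of X).
TypeSystemDescription oldTsd = UIMAFramework.getXMLParser()
    .parseTypeSystemDescription(new XMLInputSource("OldTypeSystem.xml"));
CAS srcCas = CasCreationUtils.createCas(oldTsd, null, null);
try (InputStream in = new FileInputStream("doc.bin")) {
  CasIOUtils.load(in, srcCas); // form 6 binary with embedded type system
}

// Destination CAS with the new type system (X replaced by Y).
TypeSystemDescription newTsd = UIMAFramework.getXMLParser()
    .parseTypeSystemDescription(new XMLInputSource("NewTypeSystem.xml"));
CAS destCas = CasCreationUtils.createCas(newTsd, null, null);

// This is the step that fails when the copier reaches the FSArray of X,
// because the destination type system only knows Y.
CasCopier.copyCas(srcCas, destCas, true);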

Cheers,
Mario

> On 12 Sep 2019, at 10:41 , Mario Juric <mj...@unsilo.ai> wrote:
> 
> Hi,
> 
> We use form 6 compressed binaries to persist the CAS. We now want to make a change to the type system that is not directly compatible, although in principle the new type system is really a subset from a data perspective, so we want to migrate existing binaries to the new type system, but we don’t know how. The change is as follows:
> 
> In the existing type system we have a type A with an FSArray feature of element type X, and we want to change X to Y, where Y contains a genuine subset of X's features. This means we basically want to replace X with Y for the FSArray and drop a few attributes of X when loading the CAS into the new type system.
> 
> Had the CAS been stored in JSON this would be trivial by just mapping the attributes that they have in common, but when I try to load the CAS binary into the new target type system it chokes with an EOF, so I don’t know if that is at all possible with a form 6 compressed CAS binary?
> 
> I poked around a bit in the reference documentation, the API and the mailing list archive, but I was not able to find anything useful. I can of course keep parallel attributes for both X and Y and add a separate step that makes an explicit conversion/copy, but I would prefer to avoid this. I would appreciate any input on the problem, thanks :)
> 
> Cheers,
> Mario
>