Posted to dev@uima.apache.org by "Rune Stilling (Jira)" <de...@uima.apache.org> on 2020/01/08 09:23:00 UTC

[jira] [Commented] (UIMA-6128) Allow XMI to be optionally serialized with XML 1.1 instead of only 1.0

    [ https://issues.apache.org/jira/browse/UIMA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010504#comment-17010504 ] 

Rune Stilling commented on UIMA-6128:
-------------------------------------

We have found the root of the problem.

When a CAS contains characters that UTF-16 encodes as surrogate pairs, serializing it produces invalid XML characters in the UTF-8 encoded output document, which makes the document unparsable.
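
To illustrate the failure mode, here is a minimal sketch in plain Java. It assumes the faulty serializer escapes each UTF-16 code unit separately; that is our reading of the symptom, not code taken from Xalan. The class and method names (SurrogateEscapeDemo, escapePerCodeUnit, escapePerCodePoint, parses) are hypothetical, and the escapers deliberately ignore markup characters such as < and &.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SurrogateEscapeDemo {

    // Hypothetical faulty escaper: emits one numeric character reference
    // per UTF-16 code unit, so a surrogate pair becomes two references.
    public static String escapePerCodeUnit(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Escaper that walks code points, so a surrogate pair becomes a single
    // reference to the supplementary character it encodes.
    public static String escapePerCodePoint(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp > 0x7F) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    // Returns true if the JAXP parser found on the classpath accepts the document.
    public static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the Basic Multilingual Plane, so a Java String
        // stores it as the surrogate pair 0xD83D 0xDE00.
        String text = new String(Character.toChars(0x1F600));

        String perUnit = "<a>" + escapePerCodeUnit(text) + "</a>";
        String perCodePoint = "<a>" + escapePerCodePoint(text) + "</a>";

        System.out.println(perUnit + " parses: " + parses(perUnit));
        System.out.println(perCodePoint + " parses: " + parses(perCodePoint));
    }
}
```

A lone surrogate is not a legal XML 1.0 character, so we would expect a conforming parser to reject the first document and accept the second.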
 
The problem comes from the Xalan serialization libraries, which UIMA may end up using via the edu.stanford.nlp:stanford-corenlp:3.9.1 and 3.9.2 dependencies (these depend on xalan:xalan:2.7.0).
 
The bug is described here (and has never been fixed in an official release):
 
https://issues.apache.org/jira/browse/XALANJ-2617
 
We found the solution to be quite straightforward: we simply excluded the Xalan and Xerces dependencies, so that the code uses the default Java implementation instead (org.xml.sax.ContentHandler::startElement()).
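
To check whether the exclusion took effect, one option is to print which JAXP implementations the classpath resolves to. This is only a sketch: it assumes the serialization path goes through the standard JAXP factory lookup, and that the JDK's bundled implementations live under the com.sun.org.apache package prefix, as they do on OpenJDK-based runtimes. The class and method names (JaxpProviderCheck, isJdkInternal) are hypothetical.

```java
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;

public class JaxpProviderCheck {

    // Class name of the TransformerFactory that JAXP resolves on this classpath.
    public static String transformerFactoryClass() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    // Class name of the SAXParserFactory that JAXP resolves on this classpath.
    public static String saxParserFactoryClass() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    // Assumption: JDK-bundled copies are repackaged under com.sun.org.apache,
    // while standalone Xalan and Xerces jars use org.apache.xalan / org.apache.xerces.
    public static boolean isJdkInternal(String className) {
        return className.startsWith("com.sun.org.apache.");
    }

    public static void main(String[] args) {
        String transformer = transformerFactoryClass();
        String parser = saxParserFactoryClass();
        System.out.println("TransformerFactory: " + transformer
                + " (JDK internal: " + isJdkInternal(transformer) + ")");
        System.out.println("SAXParserFactory:   " + parser
                + " (JDK internal: " + isJdkInternal(parser) + ")");
    }
}
```

If either line reports a class under org.apache.xalan or org.apache.xerces, a standalone jar is probably still being pulled in transitively.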
 
We have attached two files that may be used to reproduce the issue. If Xalan is included, the test code will throw an exception when loading the generated CAS XMI.

> Allow XMI to be optionally serialized with XML 1.1 instead of only 1.0
> ----------------------------------------------------------------------
>
>                 Key: UIMA-6128
>                 URL: https://issues.apache.org/jira/browse/UIMA-6128
>             Project: UIMA
>          Issue Type: New Feature
>          Components: UIMA
>            Reporter: Mario Juric
>            Priority: Major
>         Attachments: OddFeatureText.java, SimpleTypeSystem_TS.xml
>
>
> Some Unicode characters are not handled by XML 1.0, and some normalization or cleanup can be required before the CAS can be serialized to XMI, but requirements may not necessarily allow all such characters to be fully removed from the CAS. It can also be impossible to do such normalization/cleanup without a full reprocessing when converting data already stored as compressed binaries to XMI. Being able to optionally select XML 1.1 instead of the default XML 1.0 would be an easier way for some to bypass many of those Unicode issues.
> See also discussion on the UIMA mailing list:
> https://lists.apache.org/thread.html/7f8124b7be9ea20ab21dc616243e5661a0b7668a856532031fda71e3@%3Cuser.uima.apache.org%3E
> This feature request suggests that an additional SerialFormat is introduced, e.g. XMI_1_1, which can be selected as format parameter in the CasIOUtils.save methods.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)