Posted to dev@uima.apache.org by "Marshall Schor (JIRA)" <ui...@incubator.apache.org> on 2007/05/01 21:51:15 UTC

[jira] Commented: (UIMA-387) XMI Serializer can write invalid control characters

    [ https://issues.apache.org/jira/browse/UIMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492934 ] 

Marshall Schor commented on UIMA-387:
-------------------------------------

I don't think we should silently change user data (i.e., by replacing invalid characters with spaces).  I would prefer the XML 1.1 approach, unless someone has a reason that 1.0 is needed.  

That still leaves the 0x00 character, which is not valid in either XML version.  Could we output something that is valid XML but that our deserializer could convert back to 0x00 when it is read in?  I suppose if we came up with such a mechanism, it could also be used in XML 1.0 for all the "bad" characters.  Maybe something like outputting a special XML element that we define, which holds a hex representation of the bad character(s)?  
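
To make the round-trip idea concrete, here is a minimal sketch in Java. It is not the UIMA serializer and not an agreed design: instead of a dedicated XML element it uses a simpler text-level variant (a backslash-u marker followed by four hex digits), which would also work inside attribute values. The class and method names are hypothetical.

```java
// Hypothetical sketch: escape characters that XML 1.0 cannot carry (including 0x00)
// into a plain-text hex marker, and restore them on read. Not UIMA code.
final class ControlCharEscaper {

    // True if this UTF-16 code unit may appear in an XML 1.0 document.
    // Surrogates are passed through; validating their pairing is out of scope here.
    static boolean isXml10Char(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)
            || Character.isSurrogate(c);
    }

    // Replace each invalid character with a backslash, the letter u and four hex digits.
    // Literal backslashes are doubled so that decoding is unambiguous.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                sb.append("\\\\");
            } else if (!isXml10Char(c)) {
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Inverse of escape(): restore doubled backslashes and hex markers.
    static String unescape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next == '\\') {
                    sb.append('\\');
                    i += 2;
                    continue;
                }
                if (next == 'u' && i + 6 <= s.length()) {
                    sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                    i += 6;
                    continue;
                }
            }
            sb.append(c);
            i++;
        }
        return sb.toString();
    }
}
```

With this scheme the user data is never silently altered: unescape(escape(s)) returns the original string, at the cost that other consumers of the XMI see the marker text rather than the control character.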

How does EMF handle this?

-Marshall

> XMI Serializer can write invalid control characters
> ---------------------------------------------------
>
>                 Key: UIMA-387
>                 URL: https://issues.apache.org/jira/browse/UIMA-387
>             Project: UIMA
>          Issue Type: Bug
>          Components: Core Java Framework
>    Affects Versions: 2.1
>            Reporter: Adam Lally
>             Fix For: 2.2
>
>
> On 5/1/07, Leo Ferres <lf...@ccs.carleton.ca> wrote:
> > Hello,
> >
> > While trying to open an xmi file after processing in xml view, an
> > error pops up telling me that there is an invalid &#26 XML character.
> > The error comes from the SAX parser. Below is the stack trace. Thanks
> > very much for your help,
> >
> Most control characters are not allowed in XML 1.0, even if they are
> escaped with &#xxx.  If your input document contains such characters,
> the XMI CAS serializer is writing them to the output XMI document,
> making it unreadable.
> I checked that if you edit the XMI document and change the first line to:
> <?xml version="1.1" encoding="UTF-8"?>
> The problem goes away, because XML version 1.1 does allow escaped
> control characters.
> So one possibility for us to fix this in UIMA is to have the XMI CAS
> Serializer generate an XML version 1.1 declaration by default.  (I think
> we considered that before and decided against it for some reason; maybe
> we were worried that other applications might not be able to consume XML
> 1.1?  I can't remember. :)
> Another possibility would be to have the XMI serializer automatically
> replace these characters with spaces.  The XCAS (not XMI) serializer
> does that, but only for the document text, not for feature values.  We
> could also serialize the XMI using XML version 1.1, which allows
> escaped control characters (but still not the 0x00 character).
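
For reference, the character ranges behind the 1.0 versus 1.1 distinction can be sketched as a small Java check. The ranges follow the Char productions of the XML 1.0 and XML 1.1 specifications; the class and method names are hypothetical and this is not part of the UIMA serializer. Note that XML 1.1 accepts most control characters only when they are written as character references, which this sketch does not model.

```java
// Hypothetical helper: which code points each XML version can represent at all.
final class XmlCharCheck {

    // XML 1.0 Char production: tab, LF, CR, then everything from 0x20 up,
    // excluding surrogates, 0xFFFE and 0xFFFF.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // XML 1.1 Char production: additionally admits 0x1 to 0x1F, but never 0x0.
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Index of the first character that XML 1.0 cannot carry, or -1 if none.
    static int firstInvalidXml10Index(String s) {
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

The character from the report above, &#26 (0x1A), fails isXml10Char but passes isXml11Char, while 0x00 fails both.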

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.