Posted to dev@uima.apache.org by Jörn Kottmann <ko...@gmail.com> on 2011/07/13 16:21:10 UTC

Cas Editor problem with special characters

Hello,

a user tried to import a text file containing a backspace
character into the Cas Editor.

The Cas Editor treated it like any other Unicode character and
tried to write it into a CAS, which is then serialized as XMI.
But apparently the XMI serializer doesn't like that and
throws an exception.

Shouldn't we support any valid Unicode character?

Jörn

Stack trace:
Caused by: org.xml.sax.SAXParseException: Trying to serialize non-XML 1.0 character: , 0x8
     at org.apache.uima.util.XMLSerializer$CharacterValidatingContentHandler.checkForInvalidXmlChars(XMLSerializer.java:254)
     at org.apache.uima.util.XMLSerializer$CharacterValidatingContentHandler.startElement(XMLSerializer.java:174)
     at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.startElement(XmiCasSerializer.java:1003)
     at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeFS(XmiCasSerializer.java:755)
     at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeIndexed(XmiCasSerializer.java:700)
     at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.serialize(XmiCasSerializer.java:268)
     at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.access$700(XmiCasSerializer.java:108)
     at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1567)
     at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1638)
     at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1585)
     at org.apache.uima.caseditor.ide.wizards.DocumentImportStructureProvider.getDocument(DocumentImportStructureProvider.java:118)
     ... 9 more


Re: Cas Editor problem with special characters

Posted by Jörn Kottmann <ko...@gmail.com>.
On 7/13/11 5:10 PM, Thilo Götz wrote:
> No, because it's not valid XML.  If you just serialize it
> out regardless, you will have problems at the other end
> because your XML parser will barf.
>
> If I recall correctly, you can tell the serializer to
> replace illegal characters with the Unicode replacement
> char without complaining.  Would that be an option?
>

Yes, the user just needs to import plain text. If such automatic conversions
are a problem for him, he can always use a different tool to fix the text
and then provide the Cas Editor with XMI files.
Depending on the text, he might want to just delete the character instead
of replacing it.
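
The two options mentioned here (replace or delete) can be sketched as follows. This is only an illustration and not Cas Editor or UIMA code: the class name, the Mode enum and the sanitize method are hypothetical, and the validity test is written directly from the XML 1.0 "Char" production rather than taken from any UIMA API.

```java
class XmlTextSanitizerSketch {

    /** What to do with a character that XML 1.0 cannot represent (hypothetical enum). */
    enum Mode { REPLACE, DELETE }

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the text with every non-XML 1.0 character replaced or removed. */
    static String sanitize(String text, Mode mode) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            } else if (mode == Mode.REPLACE) {
                // U+FFFD is the Unicode replacement character.
                out.append('\uFFFD');
            }
            // In DELETE mode the offending character is simply skipped.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String imported = "ab\bcd"; // text with a backspace (0x8), as in the report
        System.out.println(sanitize(imported, Mode.REPLACE).length());
        System.out.println(sanitize(imported, Mode.DELETE).length());
    }
}
```

One difference between the two modes: every character rejected by this check occupies a single Java char, so replacing keeps the text length and all character offsets unchanged, while deleting shifts every offset after the removed character. That only matters if offsets into the original text are used elsewhere.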

I will have a look at the API and then enable this option.
Thanks for pointing that out.

I guess most people do not have this issue anyway, but the user here
just copied some text from a PDF, placed it in a text file and
imported it.

Jörn




Re: Cas Editor problem with special characters

Posted by Thilo Götz <tw...@gmx.de>.
On 14/07/11 12:35, Jörn Kottmann wrote:
> On 7/13/11 5:10 PM, Thilo Götz wrote:
>> If I recall correctly, you can tell the serializer to
>> replace illegal characters with the Unicode replacement
>> char without complaining.  Would that be an option?
> 
> Do you know how to do that?
> 
> I found some documentation from you about this issue:
> http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.xmi_emf.xml_character_issues
> 
> 
> There you suggest using XMLUtils.checkForNonXmlCharacters, which can locate
> non-XML characters in a String.
> Based on that information I could write a method to replace or remove them.
> 
> But it would be easier if I could just set a flag on some serializer to do this
> for me.
> 
> Jörn

Hm, I have no more information than this.  I guess we decided
not to implement a switch at that time :-(  The serialization
methods have so many overloads already, maybe we felt this was
getting too much.

--Thilo

Re: Cas Editor problem with special characters

Posted by Jörn Kottmann <ko...@gmail.com>.
On 7/13/11 5:10 PM, Thilo Götz wrote:
> If I recall correctly, you can tell the serializer to
> replace illegal characters with the Unicode replacement
> char without complaining.  Would that be an option?

Do you know how to do that?

I found some documentation from you about this issue:
http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.xmi_emf.xml_character_issues

There you suggest using XMLUtils.checkForNonXmlCharacters, which can
locate non-XML characters in a String.
Based on that information I could write a method to replace or remove them.

But it would be easier if I could just set a flag on some serializer to do
this for me.
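
A "locate, then decide" helper of the kind described above might look roughly like this. The sketch does not call UIMA: firstInvalidIndex is a hypothetical stand-in for a locating method such as XMLUtils.checkForNonXmlCharacters, whose exact signature and return convention should be checked in the UIMA Javadoc. Returning -1 for "nothing found" is an assumption made here.

```java
import java.util.ArrayList;
import java.util.List;

class NonXmlCharLocatorSketch {

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Hypothetical stand-in for a locating method: index of the first
     * non-XML 1.0 character at or after 'from', or -1 if there is none.
     */
    static int firstInvalidIndex(String s, int from) {
        for (int i = from; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    /** Collects the positions of all non-XML 1.0 characters by repeated lookups. */
    static List<Integer> allInvalidIndexes(String s) {
        List<Integer> hits = new ArrayList<>();
        int i = firstInvalidIndex(s, 0);
        while (i >= 0) {
            hits.add(i);
            // Every invalid character is a single Java char, so i + 1 is the next position.
            i = firstInvalidIndex(s, i + 1);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Backspace at index 1 and form feed at index 3 are both invalid in XML 1.0.
        System.out.println(allInvalidIndexes("a\bb\fc"));
    }
}
```

With the positions known, the caller can replace or remove the characters, or report them to the user before importing.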

Jörn

Re: Cas Editor problem with special characters

Posted by Thilo Götz <tw...@gmx.de>.
No, because it's not valid XML.  If you just serialize it
out regardless, you will have problems at the other end
because your XML parser will barf.
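
The problem "at the other end" can be reproduced with the XML parser that ships with the JDK. The sketch below is independent of UIMA, and the class and method names are made up for illustration. It feeds three small documents to a DocumentBuilder: one plain, one containing a raw backspace, and one containing the backspace as a numeric character reference.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

class BackspaceParseSketch {

    /** True if the JDK's XML parser accepts the given document text. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // The parser reports not-well-formed input as a SAXException.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<doc>ab</doc>"));      // well-formed
        System.out.println(parses("<doc>a\bb</doc>"));    // raw backspace, expected to be rejected
        System.out.println(parses("<doc>a&#8;b</doc>"));  // escaped backspace, also expected to be rejected
    }
}
```

XML 1.0 does not allow the character even in escaped form, so escaping it on output is not a way around the check in the serializer.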

If I recall correctly, you can tell the serializer to
replace illegal characters with the Unicode replacement
char without complaining.  Would that be an option?

If you work with XML for a while, you start realizing that
it's not made for serializing arbitrary text.  If I were
to do this again, I would base64 encode the text and get
around these issues.  We could include a plain text copy
as an FYI for humans, but really use the base64 version
as the authoritative one.
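
As a rough illustration of the base64 idea (this is not how XMI stores text; the class and method names are hypothetical), the document text could be encoded to UTF-8 bytes and then to base64, which uses only characters that are always legal in XML:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64TextSketch {

    /** Encodes arbitrary text as base64 over its UTF-8 bytes. */
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    /** Reverses encode(): base64 to UTF-8 bytes to text. */
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "ab\bcd"; // contains a backspace (0x8)
        String encoded = encode(text);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(text));
    }
}
```

The backspace survives the round trip because it never appears in the serialized form. One caveat of this particular sketch: a Java String containing an unpaired surrogate cannot be represented in UTF-8, so such text would not round-trip unchanged.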

--Thilo

On 13/07/11 16:21, Jörn Kottmann wrote:
> Hello,
> 
> a user tried to import a text file containing a backspace
> character into the Cas Editor.
> 
> The Cas Editor treated it like any other Unicode character and
> tried to write it into a CAS, which is then serialized as XMI.
> But apparently the XMI serializer doesn't like that and
> throws an exception.
> 
> Shouldn't we support any valid Unicode character?
> 
> Jörn
> 
> Stack trace:
> Caused by: org.xml.sax.SAXParseException: Trying to serialize non-XML 1.0 character: , 0x8
>     at org.apache.uima.util.XMLSerializer$CharacterValidatingContentHandler.checkForInvalidXmlChars(XMLSerializer.java:254)
>     at org.apache.uima.util.XMLSerializer$CharacterValidatingContentHandler.startElement(XMLSerializer.java:174)
>     at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.startElement(XmiCasSerializer.java:1003)
>     at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeFS(XmiCasSerializer.java:755)
>     at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeIndexed(XmiCasSerializer.java:700)
>     at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.serialize(XmiCasSerializer.java:268)
>     at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.access$700(XmiCasSerializer.java:108)
>     at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1567)
>     at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1638)
>     at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1585)
>     at org.apache.uima.caseditor.ide.wizards.DocumentImportStructureProvider.getDocument(DocumentImportStructureProvider.java:118)
>     ... 9 more
>