Posted to user@uima.apache.org by Leo Ferres <lf...@ccs.carleton.ca> on 2007/05/01 10:31:30 UTC

Invalid xml character

Hello,

While trying to open an XMI file after processing in XML view, an
error pops up telling me that there is an invalid &#26 XML character.
The error comes from the SAX parser. Below is the stack trace. Thanks
very much for your help,

Leo

[Fatal Error] :1:2830153: Character reference "&#26" is an invalid XML character.
org.xml.sax.SAXParseException: Character reference "&#26" is an invalid XML character.
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.uima.util.XmlCasDeserializer.deserialize(XmlCasDeserializer.java:83)
	at org.apache.uima.tools.docanalyzer.AnnotationViewerDialog.launchThatViewer(AnnotationViewerDialog.java:399)
	at org.apache.uima.tools.docanalyzer.AnnotationViewerDialog$ListMouseAdapter.mouseClicked(AnnotationViewerDialog.java:730)
	at java.awt.AWTEventMulticaster.mouseClicked(Unknown Source)
	at java.awt.Component.processMouseEvent(Unknown Source)
	at javax.swing.JComponent.processMouseEvent(Unknown Source)
	at java.awt.Component.processEvent(Unknown Source)
	at java.awt.Container.processEvent(Unknown Source)
	at java.awt.Component.dispatchEventImpl(Unknown Source)
	at java.awt.Container.dispatchEventImpl(Unknown Source)
	at java.awt.Component.dispatchEvent(Unknown Source)
	at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
	at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
	at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
	at java.awt.Container.dispatchEventImpl(Unknown Source)
	at java.awt.Window.dispatchEventImpl(Unknown Source)
	at java.awt.Component.dispatchEvent(Unknown Source)
	at java.awt.EventQueue.dispatchEvent(Unknown Source)
	at java.awt.EventDispatchThread.pumpOneEventForHierarchy(Unknown Source)
	at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
	at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
	at java.awt.Dialog$1.run(Unknown Source)
	at java.awt.Dialog$2.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.awt.Dialog.show(Unknown Source)
	at java.awt.Component.show(Unknown Source)
	at java.awt.Component.setVisible(Unknown Source)
	at org.apache.uima.tools.docanalyzer.DocumentAnalyzer.show_analysis(DocumentAnalyzer.java:832)
	at org.apache.uima.tools.docanalyzer.DocumentAnalyzer.showAnalysisResults(DocumentAnalyzer.java:767)
	at org.apache.uima.tools.docanalyzer.DocumentAnalyzer.actionPerformed(DocumentAnalyzer.java:499)
	at javax.swing.AbstractButton.fireActionPerformed(Unknown Source)
	at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source)
	at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source)
	at javax.swing.DefaultButtonModel.setPressed(Unknown Source)
	at javax.swing.plaf.basic.BasicButtonListener.mouseReleased(Unknown Source)
	at java.awt.Component.processMouseEvent(Unknown Source)
	at javax.swing.JComponent.processMouseEvent(Unknown Source)
	at java.awt.Component.processEvent(Unknown Source)
	at java.awt.Container.processEvent(Unknown Source)
	at java.awt.Component.dispatchEventImpl(Unknown Source)
	at java.awt.Container.dispatchEventImpl(Unknown Source)
	at java.awt.Component.dispatchEvent(Unknown Source)
	at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
	at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
	at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
	at java.awt.Container.dispatchEventImpl(Unknown Source)
	at java.awt.Window.dispatchEventImpl(Unknown Source)
	at java.awt.Component.dispatchEvent(Unknown Source)
	at java.awt.EventQueue.dispatchEvent(Unknown Source)
	at java.awt.EventDispatchThread.pumpOneEventForHierarchy(Unknown Source)
	at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
	at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
	at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
	at java.awt.EventDispatchThread.run(Unknown Source)


-- 
Leo Ferres, Ph.D.
Human-Oriented Technology Lab
Carleton University,
Ottawa, ON, Canada

Re: Invalid xml character

Posted by Thilo Goetz <tw...@gmx.de>.
Please note that discussion of this issue has shifted to the dev list. 
Adam has opened a Jira issue that you can track: 
https://issues.apache.org/jira/browse/UIMA-387

--Thilo


Re: Invalid xml character

Posted by Leo Ferres <lf...@ccs.carleton.ca>.
Dear Adam,

Thanks very much for your replies. Let me summarize. We have two
general options so far: (1) preprocessing documents outside UIMA
(a. manually "upgrading" XMIs to XML version 1.1 and/or b. manually
stripping offending character sequences), or (2) processing the input
docs within UIMA (a. making the XMI CAS serializer work with XML 1.1,
b. replacing offending sequences with spaces, or c. storing docs as
byte arrays).

I would assume that, in general, (2) will be preferred over (1), and
within (2) I'd prefer 2b over 2a over 2c. I agree with Adam that,
although a nice simple solution, XML 1.1 might prove "inconsumable" :)
for certain apps, and converting docs to byte arrays will add more
processing. Since it's probably safe to assume that &#26 carries very
little information when searching with regexps, and because
it is really simple, I'd go for 2b.

I hope this is of some use. Please let me know what you decide.

Thanks again for replying so fast.

My best regards,

Leo


Re: Invalid xml character

Posted by Eddie Epstein <ea...@gmail.com>.
Another possibility might be to store such raw documents with control
characters as binary data, for example as a byte array, and only store
cleaned up text as a String.

Eddie

Re: Invalid xml character

Posted by Adam Lally <al...@alum.rpi.edu>.
On 5/1/07, Leo Ferres <lf...@ccs.carleton.ca> wrote:
> Hello,
>
> While trying to open an xmi file after processing in xml view, an
> error pops up telling me that there is an invalid &#26 xml character.
> the error comes from the sax parser. Below is the stack trace. Thanks
> very much for your help,
>

Leo,

Hmm, looks like we have a bug here...

Most control characters are not allowed in XML 1.0, even if they are
escaped as &#xxx; character references.  If your input document contains
such characters, the XMI CAS serializer writes them to the output XMI
document, making it unreadable.

One workaround might be for you to strip control characters from your
input documents.  This test should return true for valid XML
characters, false for invalid ones:
(c >= 0x20 && c < 0xFFFE) || c == 0x09 || c == 0x0A || c == 0x0D
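
A minimal sketch of that stripping workaround in plain Java, not UIMA
code. The names XmlCharFilter, isValidXml10 and strip are made up for
illustration. Unlike the one-line char test above, this variant works on
code points and follows the XML 1.0 Char production, so it also drops
unpaired surrogates; that stricter choice is an assumption about what
you would want.

```java
final class XmlCharFilter {
    // True if the code point matches the XML 1.0 "Char" production.
    static boolean isValidXml10(int cp) {
        return cp == 0x09 || cp == 0x0A || cp == 0x0D
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every invalid code point dropped.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(XmlCharFilter::isValidXml10)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // &#26 is decimal 26, i.e. U+001A (SUB), a control character.
        String dirty = "before" + (char) 0x1A + "after";
        System.out.println(strip(dirty)); // prints "beforeafter"
    }
}
```

Note that dropping characters shortens the text, so any offsets computed
against the original document would no longer line up.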

Also, I checked that if you edit the XMI document and change the first line to:
<?xml version="1.1" encoding="UTF-8"?>

the problem goes away, because XML version 1.1 does allow escaped
control characters.
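
To see the difference between the two XML versions without going through
UIMA, here is a small sketch that feeds the same character reference to
the JDK's default SAX parser under both declarations. The class name
XmlVersionCheck, the parses helper and the <doc> element are
hypothetical, and the sketch assumes the bundled parser supports XML 1.1.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class XmlVersionCheck {
    // Returns true if the default SAX parser accepts the document,
    // false if it reports a parse error.
    static boolean parses(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            // Not a well-formedness problem; surface it to the caller.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same body, only the version in the XML declaration differs.
        String body = "<doc text=\"a&#26;b\"/>";
        System.out.println("1.0: " + parses("<?xml version=\"1.0\"?>" + body));
        System.out.println("1.1: " + parses("<?xml version=\"1.1\"?>" + body));
    }
}
```

If the parser behaves as described in this thread, the 1.0 document is
rejected and the 1.1 document is accepted.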


So one possibility for us to fix this in UIMA is to have the XMI CAS
Serializer generate an XML version 1.1 declaration by default.  (I think we
considered that before and decided not to for some reason, maybe we
were worried that other applications might not be able to consume XML
1.1?  I can't remember. :)

Another possibility would be to have the XMI serializer automatically
replace these characters with spaces.  The XCAS (not XMI) serializer
does that, but only for the document text, not for feature values.  We
could also serialize the XMI using XML version 1.1, which allows
escaped control characters (but still not the 0x00 character).
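
As a rough illustration of the replace-with-spaces idea, here is a plain
Java sketch; it is not the XCAS or XMI serializer code. The class and
method names (XmlCharReplacer, replaceWithSpaces) are made up. It applies
the char-level validity test from this message and swaps each failing
char for a single space, so the string keeps its length and begin/end
offsets computed against the original text still point at the same
positions. Because it works char by char, surrogates pass through
unchanged and unpaired ones are not detected.

```java
final class XmlCharReplacer {
    // Replaces each UTF-16 code unit that fails the XML 1.0 test with a space.
    static String replaceWithSpaces(String input) {
        char[] chars = input.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            boolean valid = (c >= 0x20 && c < 0xFFFE)
                || c == 0x09 || c == 0x0A || c == 0x0D;
            if (!valid) {
                chars[i] = ' ';
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String dirty = "ab" + (char) 0x1A + "cd";
        String clean = replaceWithSpaces(dirty);
        // One-for-one replacement, so both lengths are 5.
        System.out.println(dirty.length() + " " + clean.length());
    }
}
```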

-Adam