Posted to j-users@xerces.apache.org by Harald Wehr <hw...@hs-harz.de> on 2004/07/26 09:16:09 UTC

Ignore invalid bytes

I have to process UTF-8 documents. Sometimes a document contains an 
illegal character that causes a UTFDataFormatException due to the 
invalid byte.

Is it possible to tell Xerces just to ignore these bytes and to go on 
parsing the document?

There is no need to display these documents 100 % correctly. A missing 
character is more acceptable for us in this project than having the 
whole document fail with this exception.

Thanks for your help

Harald



---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: Ignore invalid bytes

Posted by Michael Glavassevich <mr...@apache.org>.
Hello Harald,

The answer to your question is no. This is a fatal error. A document is not 
well-formed if it contains malformed byte sequences [1].

[1] http://www.w3.org/TR/2004/REC-xml-20040204/#charencoding

On Mon, 26 Jul 2004, Harald Wehr wrote:

> I have to process UTF-8 documents. Sometimes a document contains an illegal 
> character that causes a UTFDataFormatException due to the invalid byte.
>
> Is it possible to tell Xerces just to ignore these bytes and to go on parsing 
> the document?
>
> There is no need to display these documents 100 % correctly. A missing 
> character is more acceptable for us in this project than having the 
> whole document fail with this exception.
>
> Thanks for your help
>
> Harald

---------------------------
Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org



Re: Ignore invalid bytes

Posted by Andy Clark <an...@apache.org>.
Simon Kitching wrote:
> Well, if you're using Java 1.4 or later, then presumably you could 
> implement your own character encoding scheme, register it, then tell
> the parser to use that scheme. Your scheme would delegate to the
> UTF-8 decoder, except that on an invalid char it returns "?" or similar.

We implement a custom UTF-8 reader in Xerces so that approach
probably won't work.

-- 
Andy Clark * andyc@apache.org



Re: Ignore invalid bytes

Posted by Simon Kitching <si...@ecnetwork.co.nz>.
On Tue, 2004-07-27 at 10:19, Andy Clark wrote:
> Harald Wehr wrote:
> > Is it possible to tell Xerces just to ignore these bytes and to go on 
> > parsing the document?
> 
> You really shouldn't ignore this type of error. And even though
> Xerces has a continue-after-fatal-error setting, you are likely
> to get caught in an infinite loop if you use it in this situation.
> 
> > There is no need to display these documents 100 % correctly. A missing 
> > character is more acceptable for us in this project than having the 
> > whole document fail with this exception.
> 
> Depending on the primary data in your document, a cheap trick is
> to use a Reader object with the input encoding set to ISO Latin 1
> because it uses the full eight bits in each byte and nothing is
> invalid. Of course, you should realize that every UTF-8 character
> after 127 will be corrupted using this trick.

Well, if you're using Java 1.4 or later, then presumably you could
implement your own character encoding scheme, register it, then tell the
parser to use that scheme. Your scheme would delegate to the UTF-8
decoder, except that on an invalid char it returns "?" or similar.

See java.nio.charset.Charset, java.nio.charset.CharsetDecoder and
java.nio.charset.spi.CharsetProvider.

But this does seem to be a lot of work. Pre-processing the document to
remove the invalid characters is probably easier. Why not write your own
filter around wherever the xml is coming from, and replace the problem
chars before they get fed into the xml parser? I presume you're aware
that the parse methods take input streams as well as filenames; you just
need to make sure the stream is "sanitized".
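
As a rough sketch of the "sanitized" input idea, the bytes can be decoded
before the parser sees them, using a java.nio.charset.CharsetDecoder that is
told to replace malformed sequences instead of failing. The class and method
names (LenientUtf8Parse, lenientReader, parse) are made up for illustration,
the example goes through JAXP's DocumentBuilderFactory rather than the Xerces
DOMParser directly, and StandardCharsets assumes Java 7 or later. Because the
parser is handed a Reader, it works on already-decoded characters and never
sees the bad bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Xerces.
final class LenientUtf8Parse {

    // Wrap the raw stream in a Reader whose UTF-8 decoder substitutes
    // U+FFFD for malformed byte sequences instead of throwing.
    static Reader lenientReader(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new InputStreamReader(in, decoder);
    }

    // Parse from the character stream; the parser does no byte decoding here.
    static Document parse(InputStream in) throws Exception {
        InputSource source = new InputSource(lenientReader(in));
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(source);
    }

    public static void main(String[] args) throws Exception {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bytes = {'<', 'r', '>', 'a', (byte) 0xFF, 'b', '<', '/', 'r', '>'};
        Document doc = parse(new ByteArrayInputStream(bytes));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Using CodingErrorAction.IGNORE instead of REPLACE would drop the bad bytes
silently, which is closer to the original request. Characters that decode
correctly but are not legal in XML (most control characters, for example)
would still be rejected by the parser and need a separate filter.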

Regards,

Simon






Re: Ignore invalid bytes

Posted by Andy Clark <an...@cyberneko.net>.
Harald Wehr wrote:
> Is it possible to tell Xerces just to ignore these bytes and to go on 
> parsing the document?

You really shouldn't ignore this type of error. And even though
Xerces has a continue-after-fatal-error setting, you are likely
to get caught in an infinite loop if you use it in this situation.

> There is no need to display these documents 100 % correctly. A missing 
> character is more acceptable for us in this project than having the 
> whole document fail with this exception.

Depending on the primary data in your document, a cheap trick is
to use a Reader object with the input encoding set to ISO Latin 1
because it uses the full eight bits in each byte and nothing is
invalid. Of course, you should realize that every UTF-8 character
after 127 will be corrupted using this trick.
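
To make the trade-off concrete, here is a small sketch of what decoding
UTF-8 bytes as ISO Latin 1 does. The class and method names (Latin1Trick,
decodeAsLatin1) are invented for the example, and StandardCharsets assumes
Java 7 or later. Every byte maps to exactly one character, so decoding never
fails, but each multi-byte UTF-8 sequence comes out as several unrelated
Latin 1 characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Xerces.
final class Latin1Trick {

    // ISO-8859-1 maps each byte 0x00-0xFF to the code point with the same
    // value, so this decoding step cannot hit an invalid byte.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "caf" plus e-acute; the accented letter is two bytes (C3 A9) in UTF-8.
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        String misread = decodeAsLatin1(utf8);

        // The ASCII part survives; the accented letter becomes two characters.
        System.out.println(misread);
        System.out.println(misread.length());
    }
}
```

In a parser setup, the same effect comes from wrapping the input stream in
an InputStreamReader created with the "ISO-8859-1" charset name and handing
that Reader to the parser.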

-- 
Andy Clark * andyc@apache.org



getting schema information

Posted by Gaston Escobar <ga...@yahoo.com.ar>.
Hello, 

I'm having a problem using the XML Schema API.
In the line:
XSModel schema = rootPSVI.getSchemaInformation();
the method getSchemaInformation() always returns null.
My code is below; I hope somebody can help me.
Thanks a lot!


I'm using the following code to retrieve the XML
schema information:

// Imports assume the Xerces 2.6+ package layout (org.apache.xerces.xs).
import java.io.IOException;
import org.apache.xerces.parsers.DOMParser;
import org.apache.xerces.xs.ElementPSVI;
import org.apache.xerces.xs.XSConstants;
import org.apache.xerces.xs.XSModel;
import org.apache.xerces.xs.XSNamedMap;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

try {
    DOMParser parser = new DOMParser();
    // turn on namespace support and schema validation
    parser.setFeature("http://xml.org/sax/features/namespaces", true);
    parser.setFeature("http://xml.org/sax/features/validation", true);
    parser.setFeature("http://apache.org/xml/features/validation/schema", true);
    parser.setFeature("http://apache.org/xml/features/validation/schema-full-checking", true);
    // ask for a DOM implementation that stores PSVI on its nodes
    String id = "http://apache.org/xml/properties/dom/document-class-name";
    Object value = "org.apache.xerces.dom.PSVIDocumentImpl";
    try {
        parser.setProperty(id, value);
    } catch (SAXException e) {
        System.err.println("could not set parser property");
    }
    parser.parse("NewFile.xml");
    Document document = parser.getDocument();
    Element root = document.getDocumentElement();
    // retrieve PSVI for the root element
    ElementPSVI rootPSVI = (ElementPSVI) root;
    // retrieve the schema used in validation of this document
    // (this is the call that returns null)
    XSModel schema = rootPSVI.getSchemaInformation();
    XSNamedMap elementDeclarations =
        schema.getComponents(XSConstants.ELEMENT_DECLARATION);
    // get schema normalized value
    String normalizedValue = rootPSVI.getSchemaNormalizedValue();
} catch (SAXException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}


This is my xml file:

<?xml version="1.0" encoding="UTF-8"?>
<personas
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="NewFile.xsd" >
	<persona>
		<nombre>hgh</nombre>
		<apellido>eee</apellido>
	</persona>
</personas>


and NewFile.xsd is like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="apellido" type="xsd:string"/>
    <xsd:element name="nombre" type="xsd:string"/>
    <xsd:element name="persona">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element ref="nombre"/>
                <xsd:element ref="apellido"/>
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>
    <xsd:element name="personas">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element ref="persona"/>
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>
</xsd:schema>




	
	
		
