Posted to general@xml.apache.org by Jonathan Cates <ca...@home.com> on 2001/05/01 02:37:22 UTC

Unicode problem

I am working on a project that uses the German language.  All our XML is
supposed to be headed with iso-8859-1.  Some data was recently loaded into the
database, and I am suddenly getting the following exception:

SystemId Unknown; Line 292; Column 24; ; Line#: 292; Column#: 24
javax.xml.transform.TransformerException: An invalid XML character (Unicode: 0x1e) was found in the element content of the document.
        at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:660)
        at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:1118)


Where the code looks like:
public void process(Source xml, Source xsl, Writer out) {
    try {
        TransformerFactory tFactory;
        Transformer serializer;

        tFactory = TransformerFactory.newInstance();

        serializer = tFactory.newTransformer(xsl);
        serializer.setOutputProperty(OutputKeys.ENCODING, "iso-8859-1");
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        serializer.transform(xml, new StreamResult(out));
    } catch (Exception ex) {
        ex.printStackTrace();

....

Is there something I have missed here?  If the doc doesn't have the
encoding="iso-8859-1" declaration, should this matter if I explicitly set it?
I am using v2 of xalan/xerces.  Any help is appreciated.

Thanks
Jon

Re: Unicode problem

Posted by James Scott <js...@hnt.com>.

I've found that it's useful to call String.trim() on the data before sending it to the XSLT engine/XML parser. We had a problem a while back with Oracle adding a 0x0 "bonus character" to the end of XML snippets extracted from the database. Trimming the snippets before inserting them into the document cured the problem.
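
As a rough illustration of the trimming approach (a minimal sketch only; the class name TrimDemo and the helper clean() are made up for this example and are not part of Xalan, Xerces, or any Oracle driver): String.trim() drops leading and trailing characters at or below U+0020, so it removes a trailing 0x0, but it leaves a control character in the middle of the text untouched.

```java
final class TrimDemo {

    // Hypothetical helper: trim a snippet before it is inserted into the document.
    // String.trim() removes leading and trailing chars whose code is <= U+0020,
    // which covers a trailing NUL as well as ordinary whitespace.
    static String clean(String snippet) {
        return snippet.trim();
    }

    public static void main(String[] args) {
        // Assumed sample data: a snippet with a stray NUL appended.
        String fromDb = "<name>Jon</name>\u0000";
        String cleaned = clean(fromDb);
        System.out.println(fromDb.length() + " chars before, " + cleaned.length() + " after");

        // A control character inside the text survives trim().
        String embedded = "<name>J\u001eon</name>";
        System.out.println("embedded char still present: "
                + (clean(embedded).indexOf('\u001e') >= 0));
    }
}
```

So trimming is probably enough when the stray character always sits at the very end of a snippet, but not when it appears in the middle of element content.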

JLS
  ----- Original Message ----- 
  From: Andy Heninger 
  To: general@xml.apache.org 
  Sent: Wednesday, May 02, 2001 11:58 AM
  Subject: Re: Unicode problem


  From the XML spec,  http://www.w3.org/TR/REC-xml#charsets
   
        [2]    Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */ 

   
  0x1e is not in the list, so if one of your new data files happens to contain one, an invalid XML character error would be the expected result.
   

  Andy Heninger
  IBM, Cupertino, CA
  heninger@us.ibm.com


Re: Unicode problem

Posted by Andy Heninger <an...@jtcsv.com>.

From the XML spec,  http://www.w3.org/TR/REC-xml#charsets

      [2]    Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */ 


0x1e is not in the list, so if one of your new data files happens to contain one, an invalid XML character error would be the expected result.
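
One possible way to apply production [2] in code is to filter the data before it reaches the parser. The sketch below is only an illustration: the class name XmlChars and both method names are invented for this example, and whether silently dropping the offending characters is acceptable (rather than replacing them or rejecting the record) depends on your data. Note that the ENCODING output property only affects how the result is serialized; 0x1e is not a legal XML character in any encoding, so the input has to be cleaned either way.

```java
final class XmlChars {

    // True if the code point matches production [2] Char of the XML 1.0 spec.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of s with every code point outside production [2] removed.
    // Unpaired surrogates are dropped too, since they fall in the excluded range.
    static String stripInvalid(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample data: German text with a stray 0x1e in the middle.
        String dirty = "Stra\u001e\u00dfe";
        String clean = stripInvalid(dirty);
        System.out.println(dirty.length() + " chars before, " + clean.length() + " after");
    }
}
```

Running the filter over each value as it is read from the database, before the document is assembled, keeps the check in one place.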


Andy Heninger
IBM, Cupertino, CA
heninger@us.ibm.com
