Posted to j-dev@xerces.apache.org by Yoga Balaji <y....@1internet.com> on 2000/07/20 18:13:17 UTC

XERCES Windows-1252 encoding problem!!! Pls Help

I'm trying to parse an XML file from MSNBC (daily news) and store the data in
the DB. I'm using the Xerces parser to parse the XML file I receive. I get the
following error when I run my program. Without "encoding=Windows-1252" it
works fine. I don't know how to resolve this problem.

[Fatal Error] :0:0: The encoding "Windows-1252" is not supported.
Exception in thread  "main" java.lang.NullPointerException
            at msnbcmain.main(msnbcmain.java, Compiled Code)

Initially I used JAXP (from www.java.sun.com.xml), and this problem didn't
occur, but I had an entity reference problem with it: it wasn't resolving
entity references in the XML file. Xerces does that automatically, but now
the encoding problem comes up.

Please help me! YoGA

RE: XERCES Windows-1252 encoding problem!!! Pls Help

Posted by Yoga Balaji <y....@1internet.com>.

Andy, I actually tried with a lower-case p as well, but it didn't work
initially. Now my problem is solved, since I added the following line:

parser.setFeature("http://apache.org/xml/features/allow-java-encodings", true);

and mapped Windows-1252 to Cp1252 in the MIME2Java.java source. I'm using the
DOMParserWrapper parser with Xerces 1.1.2. It's working fine now with
Windows-1252.
Thank you very much, Andy, for your support.

Jerzy, I tried your DOMParser (in all the versions of Xerces) and it works
fine for Cp1252, but it doesn't resolve entity references. The main reason I
replaced Java's JAXP with Xerces is to have entity references resolved
automatically by the parser.
Thanks a lot, Jerzy, for your response.

Yoga Balaji wrote:
>         s_enchash.put("WINDOWS-1252",   "CP1252");
>         s_revhash.put("CP1252", "WINDOWS-1252");

The mapping to "Cp1252" is CASE SENSITIVE because it is used
to dynamically locate an appropriate decoder class. So use
a lower-case 'p' and let me know if that works for you. If
not, let me know what version of Java you are using and on
what platform because Java JVMs are not required to include
decoders besides the ones for Unicode.

--
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org

Re: XERCES Windows-1252 encoding problem!!! Pls Help

Posted by Andy Clark <an...@apache.org>.

Yoga Balaji wrote:
>         s_enchash.put("WINDOWS-1252",   "CP1252");
>         s_revhash.put("CP1252", "WINDOWS-1252");

The mapping to "Cp1252" is CASE SENSITIVE because it is used
to dynamically locate an appropriate decoder class. So use
a lower-case 'p' and let me know if that works for you. If
not, let me know what version of Java you are using and on
what platform because Java JVMs are not required to include
decoders besides the ones for Unicode.
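
To illustrate why the case of the mapped value matters, here is a minimal
sketch of such a name table. It is a hypothetical stand-in, not the actual
MIME2Java source: the class name EncodingNameTable and its two methods are
made up for illustration, and it assumes that the declared name is
upper-cased before lookup while the Java name is stored and used exactly as
written. Only the field names s_enchash and s_revhash are taken from the
lines quoted above.

```java
import java.util.Hashtable;
import java.util.Locale;

/** Hypothetical stand-in for an encoding-name table; NOT the real MIME2Java source. */
class EncodingNameTable {

    // Declared XML encoding name (upper-cased) -> Java converter name (exact case).
    private static final Hashtable<String, String> s_enchash = new Hashtable<String, String>();
    // Java converter name (exact case) -> declared XML encoding name.
    private static final Hashtable<String, String> s_revhash = new Hashtable<String, String>();

    static {
        // Note the mixed-case "Cp1252": the value is handed on exactly as written.
        s_enchash.put("WINDOWS-1252", "Cp1252");
        s_revhash.put("Cp1252", "WINDOWS-1252");
    }

    /** Assumption: the declared name is matched case-insensitively by upper-casing it. */
    static String toJavaName(String declaredName) {
        return s_enchash.get(declaredName.toUpperCase(Locale.ENGLISH));
    }

    /** The Java name is looked up exactly as given, so its case matters. */
    static String toDeclaredName(String javaName) {
        return s_revhash.get(javaName);
    }

    public static void main(String[] args) {
        System.out.println(toJavaName("Windows-1252"));
        System.out.println(toDeclaredName("Cp1252"));
        System.out.println(toDeclaredName("CP1252"));
    }
}
```

In this sketch a lookup of "Windows-1252" finds "Cp1252", while a reverse
lookup with "CP1252" finds nothing, which mirrors the case sensitivity
described above.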

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

RE: XERCES Windows-1252 encoding problem!!! Pls Help

Posted by Yoga Balaji <y....@1internet.com>.

Andy, thanks for your response.

I prefer Solution #3.
I followed your instructions and added the following code in MIME2Java.java:

        s_enchash.put("WINDOWS-1252",   "CP1252");
        s_revhash.put("CP1252", "WINDOWS-1252");

But it still gives the same error. I tried both upper and lower case.
The URL in your solution #2 is not there.
Thanks. Please help!

Yoga Balaji wrote:
> I'm trying to parse a XML file from MSNBC (daily news) and store
> the data in the DB. I'm using Xerces parser to parse the XML file
> I rcv. I'm getting the following error when I run my program.
> Without "encoding=Windows-1252" it works fine. I donno how to
> resolve this problem.

The problem is that "Windows-1252" is *not* a valid encoding name.
The XML specification states that all encoding names must be IANA
names. However, "Windows-1252" is not. Unfortunately, when the
Microsoft XML parser writes an XML document, it automatically
includes this encoding.

There are several solutions:

1) Convert all of the incoming documents from "Windows-1252"
   encoding to a proper encoding. Make sure to modify the
   encoding line at the top of the file to reflect the change.

2) Modify the encoding line in all of your files to be
   "Cp1252" (case is important this time). Then turn on the
   following feature in the parser:

     http://apache.org/xml/features/allow-java-encodings

   Please note that your documents won't be portable in much
   the same way that using the "Windows-1252" encoding name
   doesn't work everywhere.

3) Modify the MIME2Java.java source file to include a mapping
   for "Windows-1252" to "Cp1252". Recompile and rebuild the
   Jar file. Use the new Jar file and you're done.

--
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org


Re: XERCES Windows-1252 encoding problem!!! Pls Help

Posted by Mike Pogue <mp...@apache.org>.

What makes you think that MS doesn't want that?

"Use MS Windows, and all of your problems are solved!  Oh, the XML documents you create
can't be used on other platforms?  Ooops!  Sorry!"  :-)

Right now, Xerces counts on the underlying JDK for all of its conversion support
(allow-java-encodings turns that on, so that anything the underlying JDK supports is
allowed, even though not necessarily portable).  

If the JDK supports the Windows encoding, then "allow-java-encodings" really implies
"allow-windows-encodings".  If the JDK does NOT support the Windows encoding, then to
implement "allow-windows-encodings" would require adding a  converter, I believe.

Right now, all the converters are in sun.io, if I remember correctly, so JDK implementors
are not required to implement the ones that Sun has.  (I think this was a mistake in the
long run, but I can certainly understand why it was done!)
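
Whether a particular JVM ships a converter for a given encoding can be probed
at run time. The following is a minimal sketch that assumes nothing beyond the
standard String(byte[], String) constructor; the class name EncodingProbe and
the method canDecode are hypothetical helpers, not part of Xerces or the JDK.

```java
import java.io.UnsupportedEncodingException;

/** Hypothetical helper: asks the running JVM whether it can decode an encoding. */
class EncodingProbe {

    /** Returns true if this JVM has a decoder for the given Java encoding name. */
    static boolean canDecode(String javaEncodingName) {
        try {
            // The constructor throws if the JVM has no converter for this name.
            new String(new byte[] { 0x41 }, javaEncodingName);
            return true;
        } catch (UnsupportedEncodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = { "UTF-8", "Cp1252", "windows-1252" };
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + " supported: " + canDecode(names[i]));
        }
    }
}
```

The result for the Windows names depends on the JVM in use, which is exactly
the portability concern raised above.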

So, for TRULY PORTABLE XML, you must use UTF-8.  I encourage everybody to use UTF-8,
because every parser is required to implement it, and it is also supported
nicely by all JDKs.

Using a Windows-specific encoding is only asking for your XML to be non-portable!  "Just
say no!"

Mike

Ed Staub wrote:
> 
> We have the feature "http://apache.org/xml/features/allow-java-encodings".
> Might we want to also have
> "http://apache.org/xml/features/allow-Windows-encodings"?
> 
> Has anyone approached the Windows XML team about this?
> It seems like a really stupid thing for them to do, strategically; I suspect
> it's just a decision that was made by default - "we just pass on what the
> operating system (Windows) tells us" or similar.
> 
> I don't think they'd want to make Windows an "XML Roach Motel": "Documents
> check in... they _don't check out!".
> 
> -Ed
> 
> -----Original Message-----
> From: Mike Pogue [mailto:mpogue@apache.org]
> Sent: Friday, July 21, 2000 8:48 PM
> To: xerces-j-dev@xml.apache.org
> Subject: Re: XERCES Windows-1252 encoding problem!!! Pls Help
> 
> I agree with Andy on this one.  In the XML world, strict is best, because it
> maximizes
> portability.
> 
> You wouldn't believe how many times people ask for a "near-XML parser",
> though, just so
> their particular platform works nicer (e.g. I've had people ask to allow
> mal-formed XML,
> because it's easier for them to create in their text editor).
> 
> "Just say no", is my opinion...
> 
> Mike
> 
> Andy Clark wrote:
> >
> > James Duncan Davidson wrote:
> > > Any thought that the parser should be lazy in what it accepts? Kind of
> like
> > > HTTP clients are supposed to be strict in what they send and servers are
> > > supposed to be loose in what they accept? Or is that just opening the
> door
> > > for more slop?
> >
> > Opens the door. ;) I like stricter grammars. Schema is too sloppy
> > for my tastes.
> >
> > --
> > Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

RE: XERCES Windows-1252 encoding problem!!! Pls Help

Posted by Ed Staub <es...@mediaone.net>.

We have the feature "http://apache.org/xml/features/allow-java-encodings".
Might we want to also have
"http://apache.org/xml/features/allow-Windows-encodings"?

Has anyone approached the Windows XML team about this?
It seems like a really stupid thing for them to do, strategically; I suspect
it's just a decision that was made by default - "we just pass on what the
operating system (Windows) tells us" or similar.

I don't think they'd want to make Windows an "XML Roach Motel": "Documents
check in... they _don't check out!".

-Ed

-----Original Message-----
From: Mike Pogue [mailto:mpogue@apache.org]
Sent: Friday, July 21, 2000 8:48 PM
To: xerces-j-dev@xml.apache.org
Subject: Re: XERCES Windows-1252 encoding problem!!! Pls Help

I agree with Andy on this one.  In the XML world, strict is best, because it
maximizes
portability.

You wouldn't believe how many times people ask for a "near-XML parser",
though, just so
their particular platform works nicer (e.g. I've had people ask to allow
mal-formed XML,
because it's easier for them to create in their text editor).

"Just say no", is my opinion...

Mike

Andy Clark wrote:
>
> James Duncan Davidson wrote:
> > Any thought that the parser should be lazy in what it accepts? Kind of
like
> > HTTP clients are supposed to be strict in what they send and servers are
> > supposed to be loose in what they accept? Or is that just opening the
door
> > for more slop?
>
> Opens the door. ;) I like stricter grammars. Schema is too sloppy
> for my tastes.
>
> --
> Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

Re: XERCES Windows-1252 encoding problem!!! Pls Help

Posted by Mike Pogue <mp...@apache.org>.

I agree with Andy on this one.  In the XML world, strict is best, because it maximizes
portability.

You wouldn't believe how many times people ask for a "near-XML parser", though, just so
their particular platform works nicer (e.g. I've had people ask to allow mal-formed XML,
because it's easier for them to create in their text editor).

"Just say no", is my opinion...

Mike

Andy Clark wrote:
> 
> James Duncan Davidson wrote:
> > Any thought that the parser should be lazy in what it accepts? Kind of like
> > HTTP clients are supposed to be strict in what they send and servers are
> > supposed to be loose in what they accept? Or is that just opening the door
> > for more slop?
> 
> Opens the door. ;) I like stricter grammars. Schema is too sloppy
> for my tastes.
> 
> --
> Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

Re: XERCES Windows-1252 encoding problem!!! Pls Help

Posted by Edwin Goei <Ed...@eng.sun.com>.

"Andy Clark" <an...@apache.org> wrote:
> James Duncan Davidson wrote:
> > Any thought that the parser should be lazy in what it accepts? Kind of
like
> > HTTP clients are supposed to be strict in what they send and servers are
> > supposed to be loose in what they accept? Or is that just opening the
door
> > for more slop?
>
> Opens the door. ;) I like stricter grammars. Schema is too sloppy
> for my tastes.

Ideally it would be a user configurable option, because sometimes you want
one or the other.  Sort of like the "strict" and the "transitional" XHTML
DTDs.  This makes the implementation more difficult though.

Re: XERCES Windows-1252 encoding problem!!! Pls Help

Posted by Andy Clark <an...@apache.org>.

James Duncan Davidson wrote:
> Any thought that the parser should be lazy in what it accepts? Kind of like
> HTTP clients are supposed to be strict in what they send and servers are
> supposed to be loose in what they accept? Or is that just opening the door
> for more slop?

Opens the door. ;) I like stricter grammars. Schema is too sloppy
for my tastes.

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

Re: XERCES Windows-1252 encoding problem!!! Pls Help

Posted by James Duncan Davidson <ja...@eng.sun.com>.

on 7/20/00 10:25 AM, Andy Clark at andyc@apache.org wrote:

> However, "Windows-1252" is not. Unfortunately, when the
> Microsoft XML parser writes an XML document, it automatically
> includes this encoding.

You just gotta hate that kind of thing. :(

Any thought that the parser should be lazy in what it accepts? Kind of like
HTTP clients are supposed to be strict in what they send and servers are
supposed to be loose in what they accept? Or is that just opening the door
for more slop?

.duncan

RE: Errata to my last (today) post

Posted by Yoga Balaji <y....@1internet.com>.

Hi Jerzy. I tried with Xerces111, and it still gives me the same problem. I
tried with both encoding = Cp1252 and encoding = windows-1252. Please help me
by giving more details on that. Thanks.

I just check my program with Xerces103 and Xerces111 and it works
fine. This means for me, that something is wrong in 1.1.2 version or I
am missing some important stuff.

Jerzy
--
+--------------------------------+
|         Jerzy Puchala          |
+--------------------------------+
|       jerzypuc@scdi.com        |
+--------------------------------+



How I did validation (was: Errata...) [long]

Posted by Jerzy Puchala <je...@scdi.com>.

First, I have only just started playing with the Xerces parser.
As I wrote in my previous message, this code does NOT work with version
1.1.2 of Xerces.
Here I include the part of the code that works on my machine.

<MAJOR PROGRAM CLASS>
import com.sun.javadoc.*;

import java.util.Collection;
import java.io.IOException;

import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import org.apache.xerces.parsers.DOMParser;

import org.xml.sax.SAXException;

public class MyClass2 {
	
public static void main(String arg[]) throws Exception {
  MyClass2 myClass = new MyClass2(arg[0]);
}
	
public MyClass2(String a){
  getWork(a);
}
	
private void getWork(String DDuri) {
  DOMParser parser = new DOMParser();
  try {
    parser.setErrorHandler(new MyErrorHandler());

    // Accept Java encoding names such as "Cp1252" in the XML declaration.
    parser.setFeature("http://apache.org/xml/features/allow-java-encodings", true);

    // Turn on validation.
    parser.setFeature("http://xml.org/sax/features/validation", true);

    parser.parse(DDuri);
    Document doc = parser.getDocument();

    // Here do something with your document; this is not the subject of
    // this post.
  }
  catch (SAXException e) {
    e.printStackTrace();
  }
  catch (IOException e) {
    e.printStackTrace();
  }
}

}
</MAJOR PROGRAM CLASS> 

<MyErrorHandler class>
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;


public class MyErrorHandler implements ErrorHandler {
	
/** Warning. */
public void warning(SAXParseException ex) {
  System.err.println("[Warning] "+
       getLocationString(ex)+": "+
           ex.getMessage());
}
	
/** Error. */
public void error(SAXParseException ex) {
  System.err.println("[Error] "+
       getLocationString(ex)+": "+
           ex.getMessage());
}
	
/** Fatal error. */
public void fatalError(SAXParseException ex) throws SAXException {
  System.err.println("[Fatal Error] "+
       getLocationString(ex)+": "+
           ex.getMessage());
  throw ex;
}

/** Returns a string of the location. */
private String getLocationString(SAXParseException ex) {
  StringBuffer str = new StringBuffer();

  String systemId = ex.getSystemId();
  if (systemId != null) {
    int index = systemId.lastIndexOf('/');
    if (index != -1)
      systemId = systemId.substring(index + 1);
    str.append(systemId);
  }
  str.append(':');
  str.append(ex.getLineNumber());
  str.append(':');
  str.append(ex.getColumnNumber());

  return str.toString();
} // getLocationString(SAXParseException):String
}
</MyErrorHandler class>

In addition, the exact encoding line in my XML is:

<?xml version="1.0" encoding="Cp1252"?>

This line is generated by the Deployment Descriptor Editor from Inprise
Application Server, and I cannot send the whole file. But this was the line
responsible for the error.

I did not make any mappings or changes in any files. I work on Windows
NT 4.0 Workstation. I have jdk1.2.2_6.

I had to cut a big part of my program from this letter, but I believe
that I did not cut out too much.

Once again, I have not worked with Xerces for long, and maybe this problem
can be solved in a better way. In addition, if there are any lines in the
code which are not necessary, I am sorry (especially in the imports there
may be too many).

I hope that this will help somebody. Sorry for my poor English.

Jerzy Puchala

-- 
+--------------------------------+
|         Jerzy Puchala          |
+--------------------------------+
|       jerzypuc@scdi.com        |
+--------------------------------+

RE: Errata to my last (today) post

Posted by Yoga Balaji <y....@1internet.com>.

Jerzy. As I mentioned in my earlier mail I wasn't successful with Xerces111
and Xerces103. Actually I'm using the following code.

            DOMParserWrapper parser =
                (DOMParserWrapper)Class.forName(parserName).newInstance();
            DOMCount counter = new DOMCount();
            long before = System.currentTimeMillis();
            doc = parser.parse(argv[0]);

Then I tried your code, as you described in your email. But I don't know how
to get hold of the document after this line:
parser.parse("myxml.xml");

Can you explain? Thanks. YoGA

I just check my program with Xerces103 and Xerces111 and it works
fine. This means for me, that something is wrong in 1.1.2 version or I
am missing some important stuff.

Jerzy
--
+--------------------------------+
|         Jerzy Puchala          |
+--------------------------------+
|       jerzypuc@scdi.com        |
+--------------------------------+



Errata to my last (today) post

Posted by Jerzy Puchala <je...@scdi.com>.

I just checked my program with Xerces103 and Xerces111 and it works
fine. To me this means that something is wrong in version 1.1.2, or I
am missing something important.

Jerzy
-- 
+--------------------------------+
|         Jerzy Puchala          |
+--------------------------------+
|       jerzypuc@scdi.com        |
+--------------------------------+

Re: XERCES Windows-1252 encoding problem!!! Pls Help

Posted by Jerzy Puchala <je...@scdi.com>.

On Thu, 20 Jul 2000, Andy Clark wrote:

> 2) Modify the encoding line in all of your files to be
>    "Cp1252" (case is important this time). Then turn on the
>    following feature in the parser:
> 
>      http://apache.org/xml/features/allow-java-encodings
> 
>    Please note that your documents won't be portable in much
>    the same way that using the "Windows-1252" encoding name
>    doesn't work everywhere.
> 
I have a similar problem.
The first line of the XML file is:
<?xml version="1.0" encoding="Cp1252"?>

and in my program I have:
DOMParser parser = new DOMParser();
try {
  parser.setErrorHandler(new MyErrorHandler());
  parser.setFeature("http://apache.org/xml/features/allow-java-encodings", true);
  parser.setFeature("http://xml.org/sax/features/validation", true);
  parser.parse("myxml.xml");
}
....//here I am catching exceptions.

Results are not nice ;)

[Fatal Error] :0:0: The encoding "Cp1252" is not supported.
//-------------------------------

I use Xerces 1.1.2 and JDK 1.2.2_6.
MyErrorHandler is my error handler, which is based on
dom.wrappers.DOMParser from the samples.

Thank you for any help.

Jerzy Puchala

-- 
+--------------------------------+
|         Jerzy Puchala          |
+--------------------------------+
|       jerzypuc@scdi.com        |
+--------------------------------+

Re: XERCES Windows-1252 encoding problem!!! Pls Help

Posted by Andy Clark <an...@apache.org>.

Yoga Balaji wrote:
> I'm trying to parse a XML file from MSNBC (daily news) and store 
> the data in the DB. I'm using Xerces parser to parse the XML file 
> I rcv. I'm getting the following error when I run my program. 
> Without "encoding=Windows-1252" it works fine. I donno how to 
> resolve this problem.

The problem is that "Windows-1252" is *not* a valid encoding name.
The XML specification states that all encoding names must be IANA
names. However, "Windows-1252" is not. Unfortunately, when the
Microsoft XML parser writes an XML document, it automatically
includes this encoding.

There are several solutions:

1) Convert all of the incoming documents from "Windows-1252"
   encoding to a proper encoding. Make sure to modify the
   encoding line at the top of the file to reflect the change.

2) Modify the encoding line in all of your files to be
   "Cp1252" (case is important this time). Then turn on the
   following feature in the parser:

     http://apache.org/xml/features/allow-java-encodings

   Please note that your documents won't be portable in much
   the same way that using the "Windows-1252" encoding name
   doesn't work everywhere.

3) Modify the MIME2Java.java source file to include a mapping
   for "Windows-1252" to "Cp1252". Recompile and rebuild the
   Jar file. Use the new Jar file and you're done.
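
Solution 1 can be sketched as a small transcoding step. This is a minimal
sketch, not a definitive implementation: it assumes the whole document fits
in memory, that the input really is Cp1252 and begins with an XML
declaration, and it uses java.nio.charset classes from a JDK newer than the
one discussed in this thread. The class name Cp1252ToUtf8 and the method
transcode are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: re-encodes a Cp1252 XML document as UTF-8. */
class Cp1252ToUtf8 {

    // "windows-1252" is the name the JDK uses for the Cp1252 code page.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Decodes the bytes as Cp1252, rewrites the declaration, and encodes as UTF-8. */
    static byte[] transcode(byte[] cp1252Bytes) {
        String text = new String(cp1252Bytes, CP1252);
        // Replace only the encoding value inside a leading <?xml ...?> declaration.
        String fixed = text.replaceFirst(
                "^(<\\?xml[^>]*?encoding\\s*=\\s*)([\"'])[^\"']*\\2",
                "$1\"UTF-8\"");
        return fixed.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] in = "<?xml version=\"1.0\" encoding=\"Windows-1252\"?><a/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(transcode(in), StandardCharsets.UTF_8));
    }
}
```

The rewritten bytes declare UTF-8, so the result no longer depends on the
allow-java-encodings feature or on a patched Jar file.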

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org