Posted to users@tomcat.apache.org by Andre John Mas <aj...@newtradetech.com> on 2003/04/10 16:40:43 UTC

UTF-8 end to end - what am I doing wrong?

Hi,

I am trying to create a solution which requires a SOAP message to be
sent from one party to another, in UTF-8. The set up is as follows:

    - Tomcat 4.1.18 at the server end, on MS-Win2k
    - Apache Commons HttpClient at the client end, on MS-Win2k
    - JDK 1.3.1

At the server end I have a servlet running that extends
JAXMServlet. Since we are required to be UTF-8 compliant, any stress
tests will involve Latin, Greek, Cyrillic and Japanese characters
being sent through. On Windows I have installed all the available
language sets to have access to as many 'alphabets' as possible.

On the client end, HttpClient sends the document as a POST
with the following content-type:

     text/xml; charset=UTF-8

Now when I receive the document, I just get question marks where there
should have been accented and other non-Roman characters. My first
analysis suggested that JAXMServlet might be at fault, but even when I
override the doPost method I still get mangled characters. If I write
both the original text (before sending) and the received text to a
file, the first contains UTF-8 characters that display correctly when
viewed with Mozilla, while the second contains either question marks
or mangled characters, depending on whether I specify "UTF-8" in the
OutputStreamWriter.
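
For reference, the two symptoms described above (question marks versus
mangled characters) can be reproduced without any network code. The
following is only a minimal sketch of one common cause, not a diagnosis
of this particular setup: the class name CharsetMismatchDemo and the
helper transcode are made up for illustration, and ISO-8859-1 stands in
for whatever non-UTF-8 encoding might be in play.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical demo class, not part of Tomcat, JAXM or HttpClient.
class CharsetMismatchDemo {

    // Writes text through an OutputStreamWriter using writeCharset, then
    // decodes the resulting bytes using readCharset.
    public static String transcode(String text, String writeCharset, String readCharset) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            Writer writer = new OutputStreamWriter(buffer, writeCharset);
            writer.write(text);
            writer.close(); // flushes the encoder into the buffer
            return new String(buffer.toByteArray(), readCharset);
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException for an unknown charset name.
            throw new IllegalArgumentException(e.toString());
        }
    }

    public static void main(String[] args) {
        // "cafe" with e-acute, a space, then Greek alpha and beta.
        String original = "caf\u00e9 \u03b1\u03b2";

        // Written and read as UTF-8: the text survives.
        String intact = transcode(original, "UTF-8", "UTF-8");
        // Written as UTF-8 but read as ISO-8859-1: each multi-byte
        // character turns into two unrelated characters (mangled text).
        String mangled = transcode(original, "UTF-8", "ISO-8859-1");
        // Written as ISO-8859-1, which has no Greek letters: the writer
        // substitutes '?' for each character it cannot encode.
        String lossy = transcode(original, "ISO-8859-1", "ISO-8859-1");

        System.out.println("intact equals original: " + intact.equals(original));
        System.out.println("mangled length: " + mangled.length()
                + " (original " + original.length() + ")");
        System.out.println("lossy contains '?': " + (lossy.indexOf('?') >= 0));
    }
}
```

The results are reported as comparisons rather than printed directly,
since the console encoding could itself distort the output.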

BTW, when I use tcpmon from Axis (see xml.apache.org), the accented
characters appear to be on the stream. I don't see the Japanese
characters, but that may be because the font used (Courier New) does
not include them.
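
Since a font can hide characters that are actually present, one way to
check what is on the wire is to compare raw bytes instead of glyphs.
The sketch below is an assumption about how one might do that: the
class Utf8HexDump and its toHex method are hypothetical helpers that
print the expected UTF-8 byte sequence for a string, which could then
be compared against a hex view of the captured traffic.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper, not part of tcpmon or Axis.
class Utf8HexDump {

    // Returns the UTF-8 encoding of text as space-separated uppercase hex bytes.
    public static String toHex(String text) {
        byte[] bytes;
        try {
            bytes = text.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e.toString());
        }
        StringBuffer hex = new StringBuffer();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                hex.append(' ');
            }
            int value = bytes[i] & 0xFF; // treat the byte as unsigned
            if (value < 0x10) {
                hex.append('0');
            }
            hex.append(Integer.toHexString(value).toUpperCase());
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex("\u00e9")); // e-acute, prints "C3 A9"
        System.out.println(toHex("\u65e5")); // kanji U+65E5, prints "E6 97 A5"
    }
}
```

If the captured bytes match these sequences, the characters are being
transmitted correctly and the display problem lies with the viewer.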

Does anyone have any solutions they could suggest?


---------------------------------------------------------------------
To unsubscribe, e-mail: tomcat-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: tomcat-user-help@jakarta.apache.org


Re: UTF-8 end to end - what am I doing wrong?

Posted by Andre John Mas <aj...@newtradetech.com>.
Further investigation shows that my problem is probably with the
Apache Commons HttpClient. Using an equivalent approach with
Java's URLConnection the data arrives uncorrupted. I will
continue the investigation on the HttpClient mailing list.

The following code, using Java's URLConnection, works at the
client end:

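  import java.io.BufferedReader;
  import java.io.InputStream;
  import java.io.InputStreamReader;
  import java.io.OutputStream;
  import java.io.OutputStreamWriter;
  import java.net.URL;
  import java.net.URLConnection;

  // Posts the message as UTF-8 text/xml and returns the response body.
  // Note: the timeout parameter is not used in this version.
  public String send(URL destinationUrl, int timeout, String message)
     throws Exception
   {
     URLConnection connection = destinationUrl.openConnection();
     connection.setRequestProperty("Content-type", "text/xml; charset=UTF-8");
     connection.setRequestProperty("user-agent", "myAgent");

     connection.setDoInput(true);
     connection.setDoOutput(true);

     // Encode the request body explicitly as UTF-8 rather than relying
     // on the platform default encoding.
     OutputStream out = connection.getOutputStream();
     OutputStreamWriter outw = new OutputStreamWriter(out, "utf-8");
     outw.write(message);
     outw.flush();
     outw.close();

     // Decode the response explicitly as UTF-8 as well.
     InputStream in = connection.getInputStream();
     InputStreamReader inr = new InputStreamReader(in, "utf-8");
     BufferedReader br = new BufferedReader(inr);

     // Line terminators in the response are normalized to '\n'.
     StringBuffer strBuf = new StringBuffer();
     String line = null;
     while ( (line = br.readLine()) != null ) {
       strBuf.append(line);
       strBuf.append('\n');
     }
     br.close();
     return strBuf.toString();
   }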


