Posted to dev@commons.apache.org by Eric Chow <ec...@macaucabletv.com> on 2003/09/18 09:03:11 UTC

HttpClient UTF-8 problem !!!

Hello,

When I use HttpClient to query a web site with a "UTF-8" parameter,
the query does not match anything.

When I copy and paste the character into the web site's form, it works,
but the same query fails when sent through HttpClient.

Is there a problem with how HttpClient handles UTF-8 parameters?


Web site: http://www.mandarintools.com/chardict_u8.html
In "Paste in Character", select "UTF-8", paste 与 (the Unicode code point
of this character is \u4E0E), and then click "Search by Character".

It should return something ....
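
A note on terminology, as a minimal sketch: \u4E0E is the Java escape for the character's Unicode code point (U+4E0E), not its UTF-8 form, which is a three-byte sequence. The class name and the hex helper below are made up for this illustration, and the code assumes a modern JDK (StandardCharsets did not exist in 2003):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows the code point and the UTF-8 bytes of one character.
class CodePointVsUtf8 {

    // Formats a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u4e0e";
        // The code point, as used in the Java escape: prints 4e0e
        System.out.println(Integer.toHexString(s.charAt(0)));
        // The UTF-8 encoding of the same character: prints E4 B8 8E
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The bytes E4 B8 8E are what a server expecting UTF-8 form data needs to receive for this character.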


The following is my source. Please show me how to pass UTF-8 query
parameters with HttpClient.

============================================================================
import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.*;
import org.apache.commons.httpclient.cookie.*;
import org.apache.commons.httpclient.util.*;
import java.io.*;
import java.util.*;


public class TradSimUnicode {
   private static String checkURL = "cgibin.erols.com";


   public static void main(String argv[]) {

      HttpClient client = new HttpClient();

      try {

         HostConfiguration hc = new HostConfiguration();
         hc.setHost(checkURL, 80, "http");
         client.setHostConfiguration(hc);
         client.getState().setCookiePolicy(CookiePolicy.COMPATIBILITY);

          check(client, '\u4e0e');

      } catch(Exception e) {
         e.printStackTrace();
      }
   }

   public static void check(HttpClient client, char c) throws Exception {

      String code = toFormat(c);
      String s = unicodeToString(code);


      PostMethod post = new PostMethod("/mandarintools/cgi-bin/charlook.pl");

      // Prepare login parameters
      NameValuePair v1     = new NameValuePair("searchmode", "standard");
      NameValuePair v2     = new NameValuePair("printtype", "utf8");
      NameValuePair v3     = new NameValuePair("chartype", "trad");
      NameValuePair v4     = new NameValuePair("ordering", "frequency");
      NameValuePair v5     = new NameValuePair("display", "char");
      NameValuePair v6     = new NameValuePair("display", "variants");
      NameValuePair v7     = new NameValuePair("display", "unicode");
      NameValuePair v8     = new NameValuePair("enctype", "utf8");
      NameValuePair v9     = new NameValuePair("whatchar", s);
      NameValuePair v10    = new NameValuePair("searchchar", "Search by Character");


      NameValuePair[] valPairs = { v1, v2, v3, v4, v5, v6, v7, v8, v9, v10 };

      post.setRequestBody(valPairs);


      client.executeMethod(post);

      String resp = post.getResponseBodyAsString();

      System.out.write(resp.getBytes("UTF-8"));

      post.releaseConnection();
   }

   public static String unicodeToString(String unicodeString) {
      if (unicodeString == null) {
         return null;
      }

      StringBuffer buf = new StringBuffer();

      StringTokenizer tokens = new StringTokenizer(unicodeString, "\\u");

      while(tokens.hasMoreTokens()) {
         String token = (String)tokens.nextToken();

         char oneUnicodeChar = (char)Integer.parseInt(token, 16);

         buf.append(oneUnicodeChar);
      }

      return buf.toString();
   }

   public static String toFormat(int n){
      String zeros = "000";
      String body = Integer.toHexString(n);
      return "\\u" + zeros.substring(0, 4-body.length()) + body;
   }
}

============================================================





==========================
If you know what you are doing,
it is not called RESEARCH!
==========================


---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: HttpClient UTF-8 problem !!!

Posted by Eric Chow <ec...@macaucabletv.com>.
Hi Odi,

It works now, thanks a lot.

Best regards,
Eric

----- Original Message -----
From: "Ortwin Glück" <or...@nose.ch>
To: "Jakarta Commons Developers List" <co...@jakarta.apache.org>
Sent: Thursday, September 18, 2003 4:14 PM
Subject: Re: HttpClient UTF-8 problem !!!


>
>
> Eric Chow wrote:
> > ============================================================================
> > import org.apache.commons.httpclient.*;
> > import org.apache.commons.httpclient.methods.*;
> > import org.apache.commons.httpclient.cookie.*;
> > import org.apache.commons.httpclient.util.*;
> > import java.io.*;
> > import java.util.*;
> >
> >
> > public class TradSimUnicode {
> >    private static String checkURL = "cgibin.erols.com";
> >
> >
> >    public static void main(String argv[]) {
> >
> >       HttpClient client = new HttpClient();
> >
> >       try {
> >
> >          HostConfiguration hc = new HostConfiguration();
> >          hc.setHost(checkURL, 80, "http");
> >          client.setHostConfiguration(hc);
> >          client.getState().setCookiePolicy(CookiePolicy.COMPATIBILITY);
> >
> >           check(client, '\u4e0e');
> >
> >       } catch(Exception e) {
> >          e.printStackTrace();
> >       }
> >    }
> >
> >    public static void check(HttpClient client, char c) throws Exception {
> >
> >       String code = toFormat(c);
> >       String s = unicodeToString(code);
> >
> >
> >       PostMethod post = new PostMethod("/mandarintools/cgi-bin/charlook.pl");
> >
> >       // Prepare login parameters
> >       NameValuePair v1     = new NameValuePair("searchmode", "standard");
> >       NameValuePair v2     = new NameValuePair("printtype", "utf8");
> >       NameValuePair v3     = new NameValuePair("chartype", "trad");
> >       NameValuePair v4     = new NameValuePair("ordering", "frequency");
> >       NameValuePair v5     = new NameValuePair("display", "char");
> >       NameValuePair v6     = new NameValuePair("display", "variants");
> >       NameValuePair v7     = new NameValuePair("display", "unicode");
> >       NameValuePair v8     = new NameValuePair("enctype", "utf8");
> >       NameValuePair v9     = new NameValuePair("whatchar", s);
> >       NameValuePair v10    = new NameValuePair("searchchar", "Search by Character");
> >
> >
> >       NameValuePair[] valPairs = { v1, v2, v3, v4, v5, v6, v7, v8, v9, v10 };
> >
> >       post.setRequestBody(valPairs);
> >
> >
> >       client.executeMethod(post);
> >
> >       String resp = post.getResponseBodyAsString();
> >
> >       System.out.write(resp.getBytes("UTF-8"));
> >
> >       post.releaseConnection();
> >    }
> >
> >    public static String unicodeToString(String unicodeString) {
> >   if (unicodeString == null) {
> >    return null;
> >   }
> >
> >   StringBuffer buf = new StringBuffer();
> >
> >   StringTokenizer tokens = new StringTokenizer(unicodeString, "\\u");
> >
> >   while(tokens.hasMoreTokens()) {
> >    String token = (String)tokens.nextToken();
> >
> >    char oneUnicodeChar = (char)Integer.parseInt(token, 16);
> >
> >    buf.append(oneUnicodeChar);
> >   }
> >
> >   return buf.toString();
> >  }
> >
> >  public static String toFormat(int n){
> >   String zeros = "000";
> >   String body = Integer.toHexString(n);
> >   return "\\u" + zeros.substring(0, 4-body.length()) + body;
> >  }
> > }
> >
> > ============================================================
>
> Eric,
>
> I think you have not fully understood how Java handles Unicode.
> You can drop your two methods unicodeToString and toFormat completely
> and pass the character in directly, without converting it:
>
> NameValuePair v9     = new NameValuePair("whatchar", String.valueOf(c));
>
> By default HttpClient uses ISO-8859-1 as the encoding for parameters
> passed into the POST method (which, by the way, is not consistent with
> the GET method, where UTF-8 is used by default). If you wish to use UTF-8
> you must set the Content-Type header manually:
>
> post.addRequestHeader("Content-Type", PostMethod.FORM_URL_ENCODED_CONTENT_TYPE + "; charset=UTF-8");
>
> hope that helps
>
> Odi



Re: HttpClient UTF-8 problem !!!

Posted by Ortwin Glück <or...@nose.ch>.

Eric Chow wrote:
> ============================================================================
> import org.apache.commons.httpclient.*;
> import org.apache.commons.httpclient.methods.*;
> import org.apache.commons.httpclient.cookie.*;
> import org.apache.commons.httpclient.util.*;
> import java.io.*;
> import java.util.*;
> 
> 
> public class TradSimUnicode {
>    private static String checkURL = "cgibin.erols.com";
> 
> 
>    public static void main(String argv[]) {
> 
>       HttpClient client = new HttpClient();
> 
>       try {
> 
>          HostConfiguration hc = new HostConfiguration();
>          hc.setHost(checkURL, 80, "http");
>          client.setHostConfiguration(hc);
>          client.getState().setCookiePolicy(CookiePolicy.COMPATIBILITY);
> 
>           check(client, '\u4e0e');
> 
>       } catch(Exception e) {
>          e.printStackTrace();
>       }
>    }
> 
>    public static void check(HttpClient client, char c) throws Exception {
> 
>       String code = toFormat(c);
>       String s = unicodeToString(code);
> 
> 
>       PostMethod post = new PostMethod("/mandarintools/cgi-bin/charlook.pl");
> 
>       // Prepare login parameters
>       NameValuePair v1     = new NameValuePair("searchmode", "standard");
>       NameValuePair v2     = new NameValuePair("printtype", "utf8");
>       NameValuePair v3     = new NameValuePair("chartype", "trad");
>       NameValuePair v4     = new NameValuePair("ordering", "frequency");
>       NameValuePair v5     = new NameValuePair("display", "char");
>       NameValuePair v6     = new NameValuePair("display", "variants");
>       NameValuePair v7     = new NameValuePair("display", "unicode");
>       NameValuePair v8     = new NameValuePair("enctype", "utf8");
>       NameValuePair v9     = new NameValuePair("whatchar", s);
>       NameValuePair v10    = new NameValuePair("searchchar", "Search by Character");
> 
> 
>       NameValuePair[] valPairs = { v1, v2, v3, v4, v5, v6, v7, v8, v9, v10 };
> 
>       post.setRequestBody(valPairs);
> 
> 
>       client.executeMethod(post);
> 
>       String resp = post.getResponseBodyAsString();
> 
>       System.out.write(resp.getBytes("UTF-8"));
> 
>       post.releaseConnection();
>    }
> 
>    public static String unicodeToString(String unicodeString) {
>   if (unicodeString == null) {
>    return null;
>   }
> 
>   StringBuffer buf = new StringBuffer();
> 
>   StringTokenizer tokens = new StringTokenizer(unicodeString, "\\u");
> 
>   while(tokens.hasMoreTokens()) {
>    String token = (String)tokens.nextToken();
> 
>    char oneUnicodeChar = (char)Integer.parseInt(token, 16);
> 
>    buf.append(oneUnicodeChar);
>   }
> 
>   return buf.toString();
>  }
> 
>  public static String toFormat(int n){
>   String zeros = "000";
>   String body = Integer.toHexString(n);
>   return "\\u" + zeros.substring(0, 4-body.length()) + body;
>  }
> }
> 
> ============================================================

Eric,

I think you have not fully understood how Java handles Unicode.
You can drop your two methods unicodeToString and toFormat completely
and pass the character in directly, without converting it:

NameValuePair v9     = new NameValuePair("whatchar", String.valueOf(c));
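
To see why the two helpers are redundant, here is a small sketch that runs a character through them and compares the result with String.valueOf(c). The helpers are copied from the posted source; the class name RoundTripDemo and the roundTrip method are made up for this illustration:

```java
import java.util.StringTokenizer;

// Hypothetical demo class wrapping the two helper methods from the posted source.
class RoundTripDemo {

    // Copied from the posting: formats a char value as a backslash-u escape string.
    static String toFormat(int n) {
        String zeros = "000";
        String body = Integer.toHexString(n);
        return "\\u" + zeros.substring(0, 4 - body.length()) + body;
    }

    // Copied from the posting: parses such escape strings back into characters.
    static String unicodeToString(String unicodeString) {
        if (unicodeString == null) {
            return null;
        }
        StringBuffer buf = new StringBuffer();
        StringTokenizer tokens = new StringTokenizer(unicodeString, "\\u");
        while (tokens.hasMoreTokens()) {
            buf.append((char) Integer.parseInt(tokens.nextToken(), 16));
        }
        return buf.toString();
    }

    // Format, then parse again: the detour taken in the posted check() method.
    static String roundTrip(char c) {
        return unicodeToString(toFormat(c));
    }

    public static void main(String[] args) {
        char c = '\u4e0e';
        // The detour yields the same one-character string: prints true
        System.out.println(roundTrip(c).equals(String.valueOf(c)));
    }
}
```

A Java char already holds the Unicode value, so formatting it as text and parsing it back only reproduces the original character.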

By default HttpClient uses ISO-8859-1 as the encoding for parameters
passed into the POST method (which, by the way, is not consistent with
the GET method, where UTF-8 is used by default). If you wish to use UTF-8
you must set the Content-Type header manually:

post.addRequestHeader("Content-Type", PostMethod.FORM_URL_ENCODED_CONTENT_TYPE + "; charset=UTF-8");
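
The effect of the charset on a form parameter can be illustrated with the JDK's own java.net.URLEncoder. This is only a sketch: it does not call HttpClient, and it assumes that HttpClient's form encoding behaves similarly to URLEncoder for characters that ISO-8859-1 cannot represent. The class name FormCharsetDemo and its encode helper are made up for this example:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class: percent-encodes one value under two different charsets.
class FormCharsetDemo {

    // Wraps URLEncoder.encode so that callers need not handle the checked exception.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String s = String.valueOf('\u4e0e');
        // ISO-8859-1 has no mapping for this character, so it degrades to '?': prints %3F
        System.out.println(encode(s, "ISO-8859-1"));
        // UTF-8 keeps the character as three percent-encoded bytes: prints %E4%B8%8E
        System.out.println(encode(s, "UTF-8"));
    }
}
```

With ISO-8859-1 the server would receive a literal question mark instead of the character, which would explain why the search did not match.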

hope that helps

Odi

