Posted to dev@tomcat.apache.org by Jun Inamori <j-...@osa.att.ne.jp> on 2000/05/04 22:31:30 UTC

Proposal(NEW) FixCharset

Hi Costin, Dmitry and Eugen
Thank you for your reply.

As I reported in my previous e-mail, I wrote a RequestInterceptor. This
class is based on the replies to the summary of my proposal. Most parts
are derived from Dmitry's code. (Thank you, Dmitry!) So, before I
explain my new proposal, I'd like to summarize the discussion about my
proposal once more.

We've reached the following conclusion:

1. To get the correct parameter string, we need to know the character
encoding by which the original string was encoded.

2. According to the HTTP spec, such an encoding should be specified by
the 'charset' attribute of the 'content-type' header. But WWW browsers
in the real world do not supply this attribute. Because there is NO
effective way to guess the correct 'charset' in all cases, it is
preferable that the webmaster supply this information rather than have
Tomcat guess it. (S/he knows in what 'charset' the parameter strings
are encoded.)

3. To convert the encoded character stream to a string, I should not
use 'ByteArrayOutputStream', because allocating 'byte[]' again and again
results in frequent GC and performance problems.

4. It is preferable that decoding is done by some sort of
'RequestInterceptor', rather than inside 'RequestImpl'. To do this,
'RequestInterceptor' should implement the new 'hook' method,
'afterRead()'.

5. When the Servlet 2.3 specification becomes available, we may have a
more sophisticated solution to these problems.
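
Points 1 and 2 can be illustrated with a small sketch. This is only an
illustration under stated assumptions: the class 'CharsetChooser' and
the method 'effectiveCharset' are hypothetical names, not part of
Tomcat. The method takes the 'charset' attribute from a 'content-type'
value when the browser sends one, and otherwise falls back to the value
supplied by the webmaster.

```java
import java.util.Locale;

// Hypothetical helper, not a Tomcat class.
class CharsetChooser {

    /**
     * Returns the 'charset' attribute of a Content-Type header value,
     * or the webmaster-supplied default when the attribute is absent.
     */
    public static String effectiveCharset(String contentType,
                                          String webmasterDefault) {
        if (contentType == null) {
            return webmasterDefault;
        }
        // Parameters follow the media type, separated by ';'.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim();
            // Attribute names are matched case-insensitively.
            if (p.toLowerCase(Locale.ENGLISH).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional double quotes around the value.
                if (value.length() >= 2 && value.startsWith("\"")
                        && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (value.length() > 0) {
                    return value;
                }
            }
        }
        return webmasterDefault;
    }

    public static void main(String[] args) {
        // No charset from the browser: the configured default is used.
        System.out.println(effectiveCharset(
            "application/x-www-form-urlencoded", "Shift_JIS"));
        // A charset from the browser overrides the default.
        System.out.println(effectiveCharset(
            "application/x-www-form-urlencoded; charset=EUC-JP",
            "Shift_JIS"));
    }
}
```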

The following are my comments on the recent postings:

Dmitry:
>> 3. The practical way to guess the encoding is to determine
>> the first language in the 'accept-language' values.

> How is that possible? Do you know any good solutions
> how to tell the encoding from the locale?
> There's often several charsets suitable for each language
> -- which one of them would you choose?

> Then, at my experience, LOTS of users never set the language
> properly. They just not even aware of the existance of these
> settings, or don't want to bother themselves with this kind
> of stuff.
> They already have the browser installer, then what do you
> expect them to do? Tune it? No bloody way!

I agree with you. Most users do not take care of this configuration.
Depending on the language listed first in 'accept-language' will not
help Tomcat determine the 'charset'.
BTW, the 'Locale' retrieved by 'HttpServletRequest.getLocale()' fully
depends on this information. Can we get any benefit from this 'Locale'?

Dmitry:
>> 4. To convert the encoded character stream to the string,
>> I should not use 'ByteArrayOutputStream', because
>> allocating 'byte[]' again and again results in the frequent
>> GC and performance issue.

> Yes, this approach seems to be wasteful a bit.
> I can present a piece of code we're using for the same
> purpose. Out goal was to get rid of the char, String and
> StringBuffer operations also, the ones which bring dynamic
> memory allocation, such as String concatenation.

Thank you again, Dmitry. My new class uses most of your code. I
modified it slightly to fit the 'hook'. But I have some questions about
it:

Code from Dmitry:
>   /** file storage */
>  Hashtable files = null;

What do we use 'files' for? Does it have something to do with
'multipart/form-data'?

Code from Dmitry:
>   /** the multipart form parameter encoding */
>  final static String ENCODING_MULTIPART = 
>  "multipart/form-data";

I don't know in which case this is sent and, of course, how to handle
it. According to
 
http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/fill-out-forms/overview.html
only 'application/x-www-form-urlencoded' is available.

Code from Dmitry:
>            case '%':
>              b2[n++] = ( byte )
>                ( ( parseHex( buffer[++i] ) << 4 ) |
>                parseHex( buffer[++i] ) );
>              break;

I don't know how 'parseHex()' is implemented, so I substituted an
alternative code fragment.

Dmitry:
> I don't know is this code readable enough or not :),
> but it is working. The processParameters() method is called
> to process the parameters at the first getParameter() call
> (similar to RequestImpl.handleParameters() in Tomcat).

Now your code is moved into 'afterRead()', which is invoked by
'ServletWrapper.handleRequest(Request,Response)'. The
'handleParameters()' method can be removed from 'RequestImpl'.

Dmitry:
> As for resources, we are happy with just two byte arrays,
> and no any extra object creation at all.

Cool!

Costin:
> I don't think tomcat should do the guessing
> - but we should provide the way for web sites to plug
> their own modules.

The webmaster can supply the 'charset' to 'FixCharset' at configuration
time. It looks like this:
    <RequestInterceptor
        className="org.apache.tomcat.request.FixCharset"
        Charset="Shift_JIS" />
The 'charset' in 'Content-Type' from the WWW browser can override this.

Costin:
> IMHO Byte-to-Char is a perfect candidate
> for a hook, it's just a matter of finding the time )

Don't worry. We can help each other.


Now I'll explain my new proposal:

Two new classes are available. They are:
    org.apache.tomcat.request.FixCharset
    org.apache.tomcat.util.CharsetToJavaEnc

The first class implements RequestInterceptor and is a subclass of
BaseInterceptor. The setCharset(String c_set) method is invoked by
XmlMapper when Tomcat boots. The afterRead(Request,Response)
method is invoked by ServletWrapper.handleRequest(Request,Response).
So the following related classes and interface must be modified:
   org.apache.tomcat.core.RequestInterceptor
   org.apache.tomcat.core.BaseInterceptor
   org.apache.tomcat.core.ServletWrapper
As for RequestImpl, handleParameters() can be removed and 
   getParameterValues(String name)
   getParameterNames()
must not call handleParameters().
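
To make the hook idea concrete, here is a minimal, self-contained
sketch of how an 'afterRead()' hook could be dispatched. The types
'Hook', 'BaseHook' and 'HookChain' are simplified stand-ins invented
for this illustration, not the real Tomcat classes, and a plain Map is
assumed in place of the Request object. Treating 0 or 200 as "continue"
follows the ServletWrapper diff attached below; stopping at the first
error status is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the interceptor classes, for illustration.
class HookChain {

    /** Simplified stand-in for RequestInterceptor. */
    public interface Hook {
        /** Returns 0 (or 200) to continue, or an HTTP error status. */
        int afterRead(Map<String, String> request);
    }

    /** Simplified stand-in for BaseInterceptor: a no-op default. */
    public static class BaseHook implements Hook {
        @Override
        public int afterRead(Map<String, String> request) {
            return 0;
        }
    }

    /** Runs the hooks in order; returns the first error status, else 0. */
    public static int runAfterRead(List<Hook> hooks,
                                   Map<String, String> request) {
        for (Hook h : hooks) {
            int status = h.afterRead(request);
            if (status != 0 && status != 200) {
                return status;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        List<Hook> hooks = new ArrayList<Hook>();
        hooks.add(new BaseHook());
        // A hook that records the charset it would use for decoding.
        hooks.add(new BaseHook() {
            @Override
            public int afterRead(Map<String, String> request) {
                request.put("charset", "Shift_JIS");
                return 0;
            }
        });
        Map<String, String> request = new HashMap<String, String>();
        int status = runAfterRead(hooks, request);
        System.out.println(status + " " + request.get("charset"));
    }
}
```

Because the base class supplies a no-op, existing interceptors need no
change; only those interested in the hook override it.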

The second class is responsible only for converting a 'charset' name to
the corresponding Java encoding name. The lookup is case-insensitive.
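
As a rough sketch of that lookup (the class 'EncodingLookup' and its
method are hypothetical names, and only three table entries are shown),
the name is upper-cased with a fixed locale before the table is
consulted, and a caller-supplied fallback is returned for unknown
names, much as setCharset() falls back to the platform encoding:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical, cut-down illustration of a charset-to-encoding table.
class EncodingLookup {

    private static final Map<String, String> TABLE =
        new HashMap<String, String>();
    static {
        // Keys are stored in upper case; only a few entries are shown.
        TABLE.put("UTF-8", "UTF8");
        TABLE.put("SHIFT_JIS", "SJIS");
        TABLE.put("EUC-JP", "EUCJIS");
    }

    /** Case-insensitive lookup; returns fallback for null or unknown names. */
    public static String javaEncoding(String charset, String fallback) {
        if (charset == null) {
            return fallback;
        }
        // A fixed locale keeps the upper-casing independent of the
        // platform default locale.
        String enc = TABLE.get(charset.trim().toUpperCase(Locale.ENGLISH));
        return enc != null ? enc : fallback;
    }

    public static void main(String[] args) {
        String platform = System.getProperty("file.encoding");
        System.out.println(javaEncoding("shift_jis", platform));
        System.out.println(javaEncoding("Shift_JIS", platform));
        System.out.println(javaEncoding("x-unknown", "8859_1"));
    }
}
```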

-- 

Happy Java programming!

Jun Inamori
E-mail: j-office@osa.att.ne.jp
URL:    http://www.oop-reserch.com

PS: 
I'll attach the source code of the 2 new classes and the diffs for the
related classes and interface:


*** Source ***
*** For org.apache.tomcat.request.FixCharset ***

package org.apache.tomcat.request;

import org.apache.tomcat.core.*;
import org.apache.tomcat.util.*;
import java.io.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;

// Should we use this?
//import sun.io.ByteToCharConverter;

/**
 * @author Dmitry I. Platonoff <dp...@descartes.com>
 * @author Jun Inamori <j-...@osa.att.ne.jp>
 */
public class FixCharset extends BaseInterceptor
    implements RequestInterceptor {

    static private int result=0;

    /**
     *  Charset name. Webmasters can override this default with
     *  the 'Charset' attribute in server.xml.
     *  It is in turn overridden by the 'charset' attribute
     *  of 'Content-Type' from the WWW browser,
     *  although at this time we don't know of such a browser.
     */
    static private String cs="US-ASCII";

    /**
     *  Java encoding name used for decoding.
     *  Defaults to the platform encoding. It is set together
     *  with 'cs' when the webmaster supplies a charset or when
     *  'Content-Type' from the WWW browser carries a 'charset'.
     */
    static private String enc=System.getProperty("file.encoding");
    // Or should we use this?
    // static private String enc=ByteToCharConverter.getDefault().getCharacterEncoding();


    /** an original request */
    static private Request req = null;
    
    /** parameter storage */
    static private Hashtable params = new Hashtable();
    /** file storage */
    
    /** the HTTP GET method signature */
    final static private String METHOD_GET = "GET";
    /** the HTTP POST method signature */
    final static private String METHOD_POST = "POST";
    /** the url-style form parameter encoding */
    final static private String ENCODING_URL =
	"application/x-www-form-urlencoded";

    public FixCharset() {
    }

    /**
     *  Invoked by XmlMapper at the boot time of Tomcat.
     *  We can supply the argument value by server.xml
     *  <RequestInterceptor
     *	className="org.apache.tomcat.request.FixCharset"
     *	Charset="Shift_JIS" />
     *  Value supplied by server.xml is overridden by
     *  'charset' attribute of 'Content-Type' sent from
     *  WWW browser.
     */
    public void setCharset(String c_set){
	cs=c_set;
	//Corresponding Java encoding is also set.
	enc=CharsetToJavaEnc.getJavaEncoding(cs);
	if(enc==null){
	    enc=System.getProperty("file.encoding");
	    // Or should we use this?
	    //enc=ByteToCharConverter.getDefault().getCharacterEncoding();
	}
    }

    public String getCharset(){
	return cs;
    }

    public String getJavaEnc(){
	return enc;
    }

    /**
     *  Called at first in ServletWrapper.handleRequest().
     *  In case of POST or GET, parameters must be parsed
     *  and decoded.
     *  For POST, parameters should be read from InputStream.
     */
    public int afterRead( Request request, Response response ) {
	req=request;
	result=0;
	params.clear();
	/**
	 *  Override the initial value
	 *  (which is specified by webmaster)
	 *  by the one from WWW browser.
	 */
        String c_set=req.getCharacterEncoding();
	if(c_set!=null){
	    cs=c_set;
	    String w_enc=CharsetToJavaEnc.getJavaEncoding(c_set);
	    if(w_enc!=null){
		enc=w_enc;
	    }
	}
	// Set the initial value to request
	else{
	    req.setCharEncoding(cs);
	}


	if(!req.getMethod().toUpperCase().equals(METHOD_GET)
	   && !req.getMethod().toUpperCase().equals(METHOD_POST)){
	    return 0;
	}
	else{
	    try{
		processParameters(getByteArray());
		req.setParameters(params);
	    }
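	    catch(Exception ex){
		// Nothing more to do here: when a status code applies,
		// getByteArray() or convertParameter() has already
		// stored it in 'result'.
	    }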
	}
	return result;
    }

    /**
     *  In case of POST, we have to read from InputStream.
     *  Otherwise we can use req.getQueryString.
     */
    private byte[] getByteArray()
	throws Exception{
	byte[] buffer=null;
	if(req.getMethod().toUpperCase().equals(METHOD_POST)){
	    try{
		ServletInputStream sis = req.getInputStream();
		if(sis!=null){
		    int length=req.getContentLength();
		    buffer=new byte[length];
		    int offset=0;
		    for(int readBytes=0; offset<length; offset+=readBytes){
			readBytes=sis.read(buffer,offset,length-offset );
			if(readBytes==-1){
			    break;
			}
		    }
		    sis.close();
		}
	    }
	    catch(Exception ex){
		result=HttpServletResponse.SC_INTERNAL_SERVER_ERROR;
		throw ex;
	    }
        }
	else{
	    String s=req.getQueryString();
	    if(s!=null){
		if(s.length() > 0 ){
		    buffer=s.getBytes();
		}
		else{
		    result=HttpServletResponse.SC_BAD_REQUEST;
		    throw (new TomcatException("No query string!"));
		}
	    }
	}
	return buffer;
    }

    /**
     *  Parameter processor core. Decodes the URL-encoded bytes
     *  ('+' and '%XX') and splits them into entries at '&'.
     */
    private void processParameters(byte[] buffer)
	throws Exception{

	//extract parameters
        byte[] b2 = new byte[buffer.length + 1];
        int n = 0;
        for ( int i = 0; i < buffer.length; i++ ){
	    char some=(char)buffer[i];
	    switch ( some ){
            case '+':
		b2[n++] =(byte) ' ';
		break;
            case '%':
		char[] digit=new char[2];
		digit[0]=(char)buffer[++i];
		digit[1]=(char)buffer[++i];
		int i_v=Integer.parseInt((new String(digit)),16);
		b2[n++]=(byte)i_v;
		break;
            case '&':
		convertParameter( b2, n );
		n = 0;
		break;
            default:
		b2[n++] = buffer[i];
		break;
	    }
        }
        convertParameter( b2, n );
    }

    /**
     * The parameter parser
     * @param bytes a byte array representing the parameter entry
     * @param n entry length (it's expected that the buffer is longer)
     */
    private void convertParameter( byte[] bytes, int n )
	throws Exception{
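	// Skip empty entries (an empty body, or "a=1&&b=2");
	// otherwise stale bytes left in the buffer would be parsed.
	if ( n == 0 ) return;
	int n2 = 0;
	// First, we have to find the index of '='.
	// Without String.indexOf()!
	while ( bytes[n2++] != '=' && n2 < n );
	
	// Then divide the given byte array into key and value.
	int del=(bytes[n2 - 1] == '=') ? n2 - 1 : n2;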
	try{
	    String key = new String(bytes, 0, del, enc);
	    String value = "";
	    if( n2 < n ){
		value=new String(bytes, n2, n - n2, enc);
	    }
	    putParameter( key, value );
	}
	catch(Exception ex){
	    result=HttpServletResponse.SC_INTERNAL_SERVER_ERROR;
	    throw ex;
	}
    }

    private void putParameter(String key, String value){
	String values[];
	if (params.containsKey(key)) {
	    String oldValues[] = (String[])params.get(key);
	    values = new String[oldValues.length + 1];
	    for (int i = 0; i < oldValues.length; i++) {
		values[i] = oldValues[i];
	    }
	    values[oldValues.length] = value;
	} else {
	    values = new String[1];
	    values[0] = value;
	}
	params.put(key, values);
    }
    
}

*** Source ***
*** For org.apache.tomcat.util.CharsetToJavaEnc ***

package org.apache.tomcat.util;

import java.util.*;

public class CharsetToJavaEnc {
    
    static private Map to_java_enc_dic;
    static {
	/*
	 * Key   :Character set name(case insensitive)
	 * Value :Java encoding name
	 *
	 * Note that character set name is case insensitive!
	 * See 'getJavaEncoding()' method.
	 */
        to_java_enc_dic = new HashMap();
        to_java_enc_dic.put("UTF-8", "UTF8");
        to_java_enc_dic.put("US-ASCII",        "8859_1");
        to_java_enc_dic.put("ISO-8859-1",      "8859_1");
        to_java_enc_dic.put("ISO-8859-2",      "8859_2");
        to_java_enc_dic.put("ISO-8859-3",      "8859_3");
        to_java_enc_dic.put("ISO-8859-4",      "8859_4");
        to_java_enc_dic.put("ISO-8859-5",      "8859_5");
        to_java_enc_dic.put("ISO-8859-6",      "8859_6");
        to_java_enc_dic.put("ISO-8859-7",      "8859_7");
        to_java_enc_dic.put("ISO-8859-8",      "8859_8");
        to_java_enc_dic.put("ISO-8859-9",      "8859_9");
        to_java_enc_dic.put("ISO-2022-JP",     "ISO2022JP");
        to_java_enc_dic.put("SHIFT_JIS",       "SJIS");
        to_java_enc_dic.put("EUC-JP",          "EUCJIS");
        to_java_enc_dic.put("GB2312",          "GB2312");
        to_java_enc_dic.put("BIG5",            "Big5");
        to_java_enc_dic.put("EUC-KR",          "KSC5601");
        to_java_enc_dic.put("ISO-2022-KR",     "ISO2022KR");
        to_java_enc_dic.put("KOI8-R",          "KOI8_R");
        to_java_enc_dic.put("EBCDIC-CP-US",    "CP037");
        to_java_enc_dic.put("EBCDIC-CP-CA",    "CP037");
        to_java_enc_dic.put("EBCDIC-CP-NL",    "CP037");
        to_java_enc_dic.put("EBCDIC-CP-DK",    "CP277");
        to_java_enc_dic.put("EBCDIC-CP-NO",    "CP277");
        to_java_enc_dic.put("EBCDIC-CP-FI",    "CP278");
        to_java_enc_dic.put("EBCDIC-CP-SE",    "CP278");
        to_java_enc_dic.put("EBCDIC-CP-IT",    "CP280");
        to_java_enc_dic.put("EBCDIC-CP-ES",    "CP284");
        to_java_enc_dic.put("EBCDIC-CP-GB",    "CP285");
        to_java_enc_dic.put("EBCDIC-CP-FR",    "CP297");
        to_java_enc_dic.put("EBCDIC-CP-AR1",   "CP420");
        to_java_enc_dic.put("EBCDIC-CP-HE",    "CP424");
        to_java_enc_dic.put("EBCDIC-CP-CH",    "CP500");
        to_java_enc_dic.put("EBCDIC-CP-ROECE", "CP870");
        to_java_enc_dic.put("EBCDIC-CP-YU",    "CP870");
        to_java_enc_dic.put("EBCDIC-CP-IS",    "CP871");
        to_java_enc_dic.put("EBCDIC-CP-AR2",   "CP918");
    }

    public CharsetToJavaEnc() {
	        
    }

    public static String getJavaEncoding(String c_set) {
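        // Use a fixed locale so that the upper-casing does not depend
        // on the platform default (e.g. the Turkish dotted 'i').
        return (String)to_java_enc_dic.get(c_set.toUpperCase(Locale.ENGLISH));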
    }

} // CharsetToJavaEnc


*** Diffs ***
*** For org.apache.tomcat.core.BaseInterceptor ***
diff -bwr modified_src//org/apache/tomcat/core/BaseInterceptor.java
modified_org//org/apache/tomcat/core/BaseInterceptor.java
100,104d99
<     //Start:Jun Inamori
<     public int afterRead(Request request, Response response) {
< 	return 0;
<     }
<     //End:Jun Inamori

*** Diffs ***
*** For org.apache.tomcat.core.RequestInterceptor ***
diff -bwr modified_src//org/apache/tomcat/core/RequestInterceptor.java
modified_org//org/apache/tomcat/core/RequestInterceptor.java
112,116d111
<     //Start:Jun Inamori modified
<     /** Called at first in ServletWrapper.handleRequest()
<      */
<     public int afterRead(Request request, Response response);
<     //End:Jun Inamori modified

*** Diffs ***
*** For org.apache.tomcat.core.ServletWrapper ***
diff -bwr modified_src//org/apache/tomcat/core/ServletWrapper.java
modified_org//org/apache/tomcat/core/ServletWrapper.java
389,401d388
< 	//Start:Jun Inamori modified
< 	RequestInterceptor cI[]=context.getRequestInterceptors();
< 	for( int i=0; i<cI.length; i++ ) {
< 	    //Parse and decode parameters.
< 	    //FixParameter is responsible for this.
< 	    int result=0;
< 	    result=cI[i].afterRead( req, res );
< 	    if(result!=0 && result!=200){
< 		contextM.handleError(req, res,null,result);
< 	    }
< 	}
< 	//End:Jun Inamori moodified
< 
505,509c492
< 	    //Start:Jun Inamori modified
< 	    // Array of RequestInterceptor already is retrieved
< 	    // at the top of this method
< 	    //RequestInterceptor cI[]=context.getRequestInterceptors();
< 	    //End:Jun Inamori modified
---
> 	    RequestInterceptor cI[]=context.getRequestInterceptors();

*** Diffs ***
*** For org.apache.tomcat.core.RequestImpl ***
diff -bwr modified_src//org/apache/tomcat/core/RequestImpl.java
modified_org//org/apache/tomcat/core/RequestImpl.java
219a220
> 	handleParameters();
223a225
> 	handleParameters();
564a567,582
>     private void handleParameters() {
>    	if(!didParameters) {
> 	    String qString=getQueryString();
> 	    if(qString!=null) {
> 		didParameters=true;
> 		RequestUtil.processFormData( qString, parameters );
> 	    }
> 	}
> 	if (!didReadFormData) {
> 	    didReadFormData = true;
> 	    Hashtable postParameters=RequestUtil.readFormData( this );
> 	    if(postParameters!=null)
> 		parameters = RequestUtil.mergeParameters(parameters, postParameters);
> 	}
>     }

Re: Proposal(NEW) FixCharset

Posted by "Dmitry I. Platonoff" <dp...@descartes.com>.
On Fri, 05 May 2000 05:31:30 +0900, Jun Inamori wrote:


 > Most part are derived from the code by Dmitry.
 > (Thank you Dmitry!)

My pleasure :)

 > BTW, 'Locale' retrieved by 'HttpServletRequest.getLocale()'
 > fully depends on this information. Can we take any benefit
 > from this 'Locale'?

Well... We're supposed to. :)  This information can be used, for example,
for choosing the language we use to present the information (in the case of
an international site). This is a very nice idea, when it's implemented
properly. For example, there's a content negotiation mechanism in Apache,
which already takes advantage of that. Then, as far as I remember, Eugen
also had some proposals about the same thing with JSPs and servlets. I'd
like him to share his ideas on this mailing list.

But everything could be spoiled, of course. The Microsoft website drives me
mad  -- I have the Russian language set as the preferred in my browser, and
they always have an outdated version of their pages in Russian, so I have
to change the preferred language to English every time I visit this site :)

 > >   /** file storage */
 > >  Hashtable files = null;
 > What do we use 'files'. Does it have something to do with
 > 'multipart/form-data' ?

Yes. The request wrapper we developed and have used since last year was
meant to deal not only with regular GET and POST requests, but also with
multipart data and file attachments. I just stripped that code out to
save on email size. Are you interested in this code too? It's pretty
simple, actually, and there are enough implementations already available
on the web, so I thought it would be one too many...

 > I don't know in which case this is sent and, of course
 > how to handle it.

It is sent when you have your form headers like this:
<FORM ACTION="/blahblahblah" METHOD=POST
  ENCTYPE="multipart/form-data">
With such a form you can allow the user to submit entire files. The
result, which is a regular MIME multipart stream, can be read like a
regular POST data stream, but you'll have to parse it differently.
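
Since the two form encodings must be parsed differently, a parser would
first have to tell them apart. Here is a minimal sketch of that check;
the class 'FormTypes' and its methods are hypothetical names for this
illustration, not taken from our request wrapper or from Tomcat.

```java
import java.util.Locale;

// Hypothetical helper for telling the two form encodings apart.
class FormTypes {

    public static final String URL_ENCODED =
        "application/x-www-form-urlencoded";
    public static final String MULTIPART = "multipart/form-data";

    /** Returns the media type without parameters, lower-cased ("" for null). */
    public static String mediaType(String contentType) {
        if (contentType == null) {
            return "";
        }
        // Drop parameters such as "; boundary=..." or "; charset=...".
        int semi = contentType.indexOf(';');
        String type = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return type.trim().toLowerCase(Locale.ENGLISH);
    }

    public static boolean isUrlEncoded(String contentType) {
        return URL_ENCODED.equals(mediaType(contentType));
    }

    public static boolean isMultipart(String contentType) {
        return MULTIPART.equals(mediaType(contentType));
    }

    public static void main(String[] args) {
        String ct = "multipart/form-data; boundary=----x";
        System.out.println(isMultipart(ct) + " " + isUrlEncoded(ct));
    }
}
```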


 > I don't know how to implement 'parseHex()'. So I put alternative
 > fragment of code.

This was a custom piece of code, a kind of hack to perform the hex
translation. It's supposed to be faster than the standard
Integer.parseInt(), which is designed to be universal, and is therefore
huge (the method is about 50 lines of code) and resource-consuming.

This method is rather simple:

  /**
    Hex value parser

    @param c a character of value "{0-9|A-F}" to parse
  */
  public static int parseHex( int c )
  {
    if ( c >= '0' && c <= '9' ) return c - '0';
    if ( ( c &= 0x5F ) >= 'A' && c <= 'F' ) return c - 'A' + 10;
    return 0;
  }

And I believe, that this piece, which uses it,

  b2[n++] = ( byte ) ( ( Util.parseHex( buffer[++i] ) << 4 ) |
    Util.parseHex( buffer[++i] ) );

might still be better than the following code

  char[] digit=new char[2];
  digit[0]=(char)buffer[++i];
  digit[1]=(char)buffer[++i];
  int i_v=Integer.parseInt((new String(digit)),16);

which allocates an array, then a String, and then calls a universal number
parser. Am I wrong? It might not be very easy to read :), but I wanted it
to be efficient.
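
For anyone who wants to try the two variants side by side, here is a
small self-contained sketch. The class 'PercentDecode' and the method
'decode' are names made up for this example; parseHex() is the method
as posted above. The bounds check on a truncated '%' escape and the
wrapping of the checked exception are additions for the sketch, not
part of the original code.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical test harness around parseHex(), for illustration only.
class PercentDecode {

    /** Hex digit parser as posted above; returns 0 for non-hex input. */
    public static int parseHex(int c) {
        if (c >= '0' && c <= '9') return c - '0';
        if ((c &= 0x5F) >= 'A' && c <= 'F') return c - 'A' + 10;
        return 0;
    }

    /** Decodes '+' and '%XX', then converts the bytes using 'enc'. */
    public static String decode(byte[] in, String enc) {
        byte[] out = new byte[in.length]; // decoded form is never longer
        int n = 0;
        for (int i = 0; i < in.length; i++) {
            switch (in[i]) {
            case '+':
                out[n++] = (byte) ' ';
                break;
            case '%':
                // Added guard: a truncated escape is copied through as-is.
                if (i + 2 < in.length) {
                    out[n++] = (byte) ((parseHex(in[i + 1]) << 4)
                        | parseHex(in[i + 2]));
                    i += 2;
                } else {
                    out[n++] = in[i];
                }
                break;
            default:
                out[n++] = in[i];
                break;
            }
        }
        try {
            return new String(out, 0, n, enc);
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalArgumentException("Unknown encoding: " + enc);
        }
    }

    public static void main(String[] args) {
        byte[] raw = "name=%E3%81%82+b".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decode(raw, "UTF-8"));
        // Both approaches yield the same value for a single escape.
        int viaParseInt = Integer.parseInt("E3", 16);
        int viaParseHex = (parseHex('E') << 4) | parseHex('3');
        System.out.println(viaParseInt == viaParseHex);
    }
}
```

The decoding loop itself creates no objects per escape; the only
allocations are the output array and the final String.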

 > Now your code is moved into 'afterRead()', which is invoked by
 > 'ServletWrapper.handleRequest(Request,Response)'. The
 > 'handleParameters()' method can be removed from 'RequestImple'.

Sounds cool.

You did a good job.


Sincerely,
Dmitry.