Posted to dev@tomcat.apache.org by "Clere, Jean-Frederic" <jf...@fujitsu.siemens.es> on 2000/03/24 17:16:38 UTC

[PATCH] Add EBCDIC support for text files in Tomcat.

I am porting Tomcat to a BS2000 (Siemens EBCDIC mainframe).
To do so I have modified:
./jakarta-tomcat/src/share/org/apache/tomcat/servlets/DefaultServlet.java
The problem is that text documents such as HTML files have to be editable on
the EBCDIC machine, which means they are stored in EBCDIC.
So two conversions are needed: native (EBCDIC) to Java, and then Java to ASCII.

Enclosed is the result of a diff -u -w between my patched file and the CVS file.

--- jakarta-tomcat/src/share/org/apache/tomcat/servlets/DefaultServlet.java	Wed Mar 22 22:52:15 2000
+++ jakarta-tomcat/src/share/org/apache/tomcat/servlets/DefaultServlet.java	Fri Mar 24 16:48:01 2000
@@ -289,6 +289,9 @@
 	FileInputStream in=null;
 	try {
 	    in = new FileInputStream(file);
+	    if (mimeType.startsWith("text"))
+	        serveStreamNative(in, request, response);
+	    else
 	    serveStream(in, request, response);
 	} catch (FileNotFoundException e) {
 	    // Figure out what we're serving
@@ -311,6 +314,22 @@
 	}
     }
 
+    // Sends text files, which are encoded in the machine's native encoding.
+    private void serveStreamNative(InputStream in, HttpServletRequest request,
+        HttpServletResponse response)
+    throws IOException {
+        // like the serveStream, we try the stream and the writer. 
+
+	try {
+	    ServletOutputStream out = response.getOutputStream();
+	    serveStreamAsStreamNative(in, out);
+	} catch (IllegalStateException ise) {
+	    PrintWriter out = response.getWriter();
+            // as it uses an InputStreamReader it does the conversion.
+	    serveStreamAsWriter(in, out);
+	}
+    }
+
     private void serveStream(InputStream in, HttpServletRequest request,
         HttpServletResponse response)
     throws IOException {
@@ -341,6 +360,27 @@
 	    out.write(buf, 0, read);
 	}
     }
+
+    private void serveStreamAsStreamNative(InputStream in, OutputStream out)
+    throws IOException {
+	char[] buf = new char[1024];
+	int read = 0;
+
+	// Here the input and the output have to be converted.
+        // input  default encoding output ASCII (ISO8859_1).
+        OutputStreamWriter output = null;
+	try {
+            output = new OutputStreamWriter(out,"ISO8859_1");
+	} catch(UnsupportedEncodingException e) {
+	    output = new OutputStreamWriter(out); // try without conversion.
+	}
+	InputStreamReader input = new InputStreamReader(in);
+
+	while ((read = input.read(buf)) != -1) {
+	    output.write(buf, 0, read);
+	}
+	output.flush();
+    }
 
     private void serveStreamAsWriter(InputStream in, PrintWriter out)
     throws IOException {



 <<DefaultServlet.patch.txt>>  

Jean-Frédéric Clère
EP LP DC22 (BCN)
Fujitsu Siemens Computers
Phone + 34 93 480 4209
Fax     + 34 93 480 4201
Mail mailto:jfrederic.clere@fujitsu.siemens.es
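
The two-step conversion the patch performs (native bytes to Java characters, then Java characters to web bytes) can be sketched on its own, outside of Tomcat. This is only an illustration and not the patch itself: the class name TranscodeSketch is made up, and IBM037 is used as a stand-in EBCDIC code page (an assumption; a BS2000 system would likely use a different one, and the charset may not be present in every Java runtime). Unlike the patch, both charsets are named explicitly, so the result does not depend on the JVM default.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeSketch {

    /** Copies text from in to out, decoding with 'from' and encoding with 'to'. */
    public static void transcode(InputStream in, Charset from,
                                 OutputStream out, Charset to) throws IOException {
        Reader reader = new InputStreamReader(in, from);   // bytes -> Java chars
        Writer writer = new OutputStreamWriter(out, to);   // Java chars -> bytes
        char[] buf = new char[1024];
        int read;
        while ((read = reader.read(buf)) != -1) {
            writer.write(buf, 0, read);
        }
        writer.flush(); // push any buffered characters through the encoder
    }

    public static void main(String[] args) throws IOException {
        // Assumption: IBM037 stands in for "the machine's EBCDIC encoding".
        String name = "IBM037";
        if (!Charset.isSupported(name)) {
            System.out.println(name + " is not available in this runtime");
            return;
        }
        Charset ebcdic = Charset.forName(name);

        // Simulate an HTML file stored on disk in EBCDIC.
        byte[] onDisk = "<html>Hello</html>".getBytes(ebcdic);

        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(onDisk), ebcdic,
                  wire, StandardCharsets.ISO_8859_1);

        System.out.println(new String(wire.toByteArray(), StandardCharsets.ISO_8859_1));
    }
}
```

The same method could be run once at publication time instead of on every request, which is the trade-off discussed in the replies.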


RE: [PATCH] Add EBCDIC support for text files in Tomcat.

Posted by "Preston L. Bannister" <pr...@home.com>.
From: Costin Manolache [mailto:costin@eng.sun.com]
> My vote:
> - for this release - do nothing, the problem is too deep and we may break
> something else ( it's not only about EBCDIC - internationalization is probably
> broken in a few places too )

It would be nice if we could make Tomcat work out of the box on EBCDIC systems.
As it is now - Tomcat doesn't work correctly on EBCDIC machines.


> - for next release - review the code and make sure it does the right thing. We
> need to use the local encoding ( we can't ask the user to convert everything to
> UTF8 ) 

Note that for us (spoiled) US-ASCII people no conversion is needed.

Adding "java -Dfile.encoding=8859_1" in fact changes nothing.

Adding "java -Dfile.encoding=UTF8" in practice requires no change to our existing
ASCII files, but removes a potential set of errors.

Note also that I think you are falling into the same trap we fell into (a dozen
years back) in assuming that one local encoding can apply to the entire system.
We were told by some of our European customers that they really wanted the 
language-specific settings to be associated with each *user*.

Given this bit of insight it makes sense that the *default* file encoding for 
a web application should not be specific to a locale.


> and keep track of this, add comments, and make sure we follow the rules (
> including the accept-encoding and all other relevant headers).
>
> I think we can clean up what's broken - if someone can spend some time to review,
> comment and clean everything. DefaultServlet is in a very bad shape, so it has to
> be rewritten anyway, and I think it is the most important variable.

Practically, I don't believe we can keep Tomcat constantly EBCDIC compatible.  It's just
too much work and too easy to break.  If there were someone who wanted to do this
nearly full-time it might be possible, but lacking such a volunteer it's just not
going to happen.

 
> ( Keep in mind that tomcat can be used "integrated" in Apache, IIS and NES, we
> need to follow the same rules as the static server - I don't think we have any
> choice or option in this area - we can't change the way people work because it's
> simpler for us )

I agree with this, but don't really see a conflict.  

I floated this idea on an IBM OS/390 mailing list, and was told that at least one
version of the OS/390 web server released by IBM is in fact an ASCII application.
If IBM also finds it pragmatic to make their web server an ASCII application on
their EBCDIC boxes...


From: Chris Janicki
> > Another point that bothered me was that if web pages are stored on
> > disk using the "native" character encoding then they would have to
> > be translated to Unicode (on read) and then to ASCII (on write).

> I may have only understood 75% of your argument, but my understanding is
> that your quote above is *exactly* what the Java designers had in mind. 
> This is necessary for *code* portability, right?  We shouldn't be
> worried about the format of content source... source files should always
> be stored natively!?  Otherwise the content source becomes closed to
> other applications.

This is *exactly* my point!!  :-)

What is the correct native encoding for a web application?

If a user types in a Kanji character, is it OK for the servlet engine to
take an exception when writing a file to disk just because it happens to 
be running on a machine in North America?

Web applications are by their very nature international.

Note that I'm suggesting that we choose the *default* encoding for Tomcat
to be the one most compatible choice.  A site or a vendor can override 
this choice if they so choose.  We are not talking about adding code to
Tomcat to *require* a particular encoding.  


> Consider a shared text-based (tab-delimited) database used by both some
> native app and available for raw display via web.  If we convert the
> source to some non-native standard for our web server, then the native
> app can't share.

In the case of EBCDIC applications, at least on OS/390 machines, the ability
to read and write ASCII data is a universal need, and has long been thoroughly
integrated into the system.


> It is a bit of overhead to do the conversion but there are more
> important areas for performance tuning.  Isn't that the whole tradeoff
> premise behind portable byte code?

I think building inefficiencies into the default implementation is not 
a good idea.  

I also think that a portable *web* application carries with it a slightly
different set of semantics - that leads to the original proposal!


Re: [PATCH] Add EBCDIC support for text files in Tomcat.

Posted by Costin Manolache <co...@eng.sun.com>.
My vote:
- for this release - do nothing, the problem is too deep and we may break
something else ( it's not only about EBCDIC - internationalization is probably
broken in a few places too )

- for next release - review the code and make sure it does the right thing. We
need to use the local encoding ( we can't ask the user to convert everything to
UTF8 ) and keep track of this, add comments, and make sure we follow the rules (
including the accept-encoding and all other relevant headers).

I think we can clean up what's broken - if someone can spend some time to review,
comment and clean everything. DefaultServlet is in very bad shape, so it has to
be rewritten anyway, and I think it is the most important variable.

( Keep in mind that tomcat can be used "integrated" in Apache, IIS and NES, we
need to follow the same rules as the static server - I don't think we have any
choice or option in this area - we can't change the way people work because it's
simpler for us )

Costin


Re: [PATCH] Add EBCDIC support for text files in Tomcat.

Posted by Chris Janicki <Ja...@ia-inc.com>.
> Another point that bothered me was that if web pages are stored on
> disk using the "native" character encoding then they would have to
> be translated to Unicode (on read) and then to ASCII (on write).
>

I may have only understood 75% of your argument, but my understanding is
that your quote above is *exactly* what the Java designers had in mind. 
This is necessary for *code* portability, right?  We shouldn't be
worried about the format of content source... source files should always
be stored natively!?  Otherwise the content source becomes closed to
other applications.

Consider a shared text-based (tab-delimited) database used by both some
native app and available for raw display via web.  If we convert the
source to some non-native standard for our web server, then the native
app can't share.

It is a bit of overhead to do the conversion but there are more
important areas for performance tuning.  Isn't that the whole tradeoff
premise behind portable byte code?

I'm a little green on these matters, so comments/corrections welcome.

-- 
Chris Janicki
Industrious Activities, Inc.
http://www.ia-inc.com
(1 617) 787-9479

RE: [PATCH] Add EBCDIC support for text files in Tomcat.

Posted by "Preston L. Bannister" <pr...@home.com>.
(I know this is long, but I'm soliciting opinions at the end :).

I am going to suggest that this patch not be incorporated.  The code is
nice and clean, but the strategy is probably wrong.

A couple months back I went through the exercise of making Tomcat work
on an EBCDIC machine (an IBM 390 box specifically).

The changes did get incorporated, and *most* things in Tomcat worked
correctly on an EBCDIC machine.  Then the very next major release broke
the EBCDIC support, or to be more exact the only-send-ASCII-to-the-web
support :).  The reason for this is that a number of commonly used
constructors have two forms: a constructor that takes an explicit
character encoding, and a more commonly used constructor that assumes
the character encoding is the value of the "file.encoding" property.
A programmer on an ASCII machine (just about everyone) will use the
default encoding form of the constructor without knowing that this
will pose a problem on non-ASCII machines.

In other words it is impractical to keep Tomcat compatible with a
default file.encoding of EBCDIC.
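
To make the two constructor forms concrete, here is a small sketch using InputStreamReader, one of the classes that has both. The class and method names (DefaultEncodingTrap, readWithDefault, readWithExplicit) are made up for illustration. On an ASCII-based machine both calls return the same text, which is exactly why the difference goes unnoticed; on a JVM whose default encoding is EBCDIC, only the explicit form would be expected to decode web bytes correctly.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultEncodingTrap {

    /** Decodes with whatever the JVM default charset happens to be. */
    public static String readWithDefault(byte[] bytes) throws IOException {
        return drain(new InputStreamReader(new ByteArrayInputStream(bytes)));
    }

    /** Decodes with a charset chosen by the caller, independent of the platform. */
    public static String readWithExplicit(byte[] bytes, Charset cs) throws IOException {
        return drain(new InputStreamReader(new ByteArrayInputStream(bytes), cs));
    }

    private static String drain(InputStreamReader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Bytes as they would arrive from the web.
        byte[] wire = "GET /index.html".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default form:    " + readWithDefault(wire));
        System.out.println("explicit form:   "
                + readWithExplicit(wire, StandardCharsets.ISO_8859_1));
    }
}
```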

Another point that bothered me was that if web pages are stored on
disk using the "native" character encoding then they would have to
be translated to Unicode (on read) and then to ASCII (on write).

For the vast majority of (ASCII) machines this is a complete waste
of time as the encoding on disk and on the web are the same.

For the much smaller number of EBCDIC machines this is exactly what
you need - but only if you are going to store your web pages as EBCDIC.

After thinking about the above I changed my strategy.

I would suggest that EBCDIC -> ASCII translation be done when the web
pages are first created, and that Tomcat always assume that text on
disk is in ASCII (or maybe UTF8).

For each time a web page is created or updated by its author,
it will be viewed hundreds, thousands or even millions of times.
It makes a lot more sense to do the EBCDIC -> ASCII translation at
the time of publication, rather than on each and every request.

I would also suggest that Tomcat always be run with a default character
encoding of ASCII (or maybe UTF8).  This means that the file.encoding
should be overridden when starting the JVM for Tomcat, like:

  java -Dfile.encoding=8859_1 ...(remaining options)

With this change Tomcat works very nicely on an EBCDIC machine, and
will continue to work just as well as on ASCII machines.
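
One way to verify at startup that the override took effect is to check whether the JVM default charset encodes the 128 ASCII characters as themselves. The sketch below is hypothetical (the class WebCompatibleCheck and its method are not part of Tomcat); it only illustrates what "web-compatible default encoding" could mean in code. As far as I know, recent JDKs (18 and later) already default to UTF-8 unless told otherwise, so the check matters mostly on older runtimes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WebCompatibleCheck {

    /**
     * Returns true if the charset encodes each of the 128 ASCII characters
     * as the single byte with the same numeric value.
     */
    public static boolean isAsciiCompatible(Charset cs) {
        if (!cs.canEncode()) {
            return false; // decode-only charsets cannot be checked this way
        }
        for (char c = 0; c < 128; c++) {
            byte[] encoded = String.valueOf(c).getBytes(cs);
            if (encoded.length != 1 || encoded[0] != (byte) c) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Charset def = Charset.defaultCharset();
        System.out.println("file.encoding property: " + System.getProperty("file.encoding"));
        System.out.println("default charset:        " + def);
        System.out.println("ASCII compatible:       " + isAsciiCompatible(def));
        System.out.println("UTF-8 compatible:       "
                + isAsciiCompatible(StandardCharsets.UTF_8));
    }
}
```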

If you still want to read EBCDIC web pages then I would suggest that
this be both an optional item and be made more general.  One possible
approach might be to *optionally* use a subclass of DefaultServlet that
would always do a disk-encoding to ASCII translation.  To be more general
a look-aside could be used to determine the character encoding for files
in a particular directory.
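
A rough sketch of such a look-aside follows. Everything in it is hypothetical (the class DirectoryEncodings, its methods and the example paths are invented for illustration, and nothing like it exists in Tomcat as far as this thread shows): directory prefixes are registered with a charset, and the longest matching prefix wins. A servlet subclass could consult it before deciding how to decode a file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class DirectoryEncodings {

    private final Map<String, Charset> byPrefix = new HashMap<>();
    private final Charset fallback;

    /** The fallback is used for paths with no registered directory. */
    public DirectoryEncodings(Charset fallback) {
        this.fallback = fallback;
    }

    /** Registers a directory prefix; it should end with '/' to match whole directories. */
    public void register(String directoryPrefix, Charset cs) {
        byPrefix.put(directoryPrefix, cs);
    }

    /** Returns the charset of the longest registered prefix that matches the path. */
    public Charset lookup(String path) {
        String best = null;
        for (String prefix : byPrefix.keySet()) {
            if (path.startsWith(prefix)
                    && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        return best == null ? fallback : byPrefix.get(best);
    }

    public static void main(String[] args) {
        DirectoryEncodings table = new DirectoryEncodings(StandardCharsets.ISO_8859_1);
        // On a mainframe the registered charset would be an EBCDIC code page,
        // obtained with Charset.forName(...); standard charsets are used here
        // so that the sketch runs on any JVM.
        table.register("/legacy/", StandardCharsets.US_ASCII);
        table.register("/legacy/intl/", StandardCharsets.UTF_8);

        System.out.println(table.lookup("/docs/index.html"));
        System.out.println(table.lookup("/legacy/report.txt"));
        System.out.println(table.lookup("/legacy/intl/page.html"));
    }
}
```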

I don't believe that this should be done by the default implementation.

----------------
Opinions please!
----------------

1.  We should recommend that Tomcat be run with a web-compatible default
    character encoding.

This means I'll alter tomcat.sh to always specify the value to the Java
interpreter, and check in some form of the above text with Tomcat for
future reference as (say) EBCDIC.txt.

2.  The default character encoding should be either ASCII (8859_1) or
    perhaps UTF8.  The web standard for HTTP is ASCII.  I would like to
    suggest that UTF8 might be a good default.

So far as I can tell UTF8 is a strict superset of ASCII.  So for the case
where the original data is ASCII the use of UTF8 wouldn't change anything.

In the case where the data is more than just ASCII, the 8859_1 encoding
will (I believe) cause exceptions to be thrown.  It is quite likely that
applications previously only exercised with ASCII will deal poorly with
the unexpected encoding exceptions.

If the default encoding is UTF8 then non-ASCII characters will be encoded
and decoded correctly.  Code used outside the ASCII-only world is much more
likely to "just work".
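
The claims in the last three paragraphs can be checked with a few lines of Java. The class name EncodingBehaviour is made up, and this is a sketch of how the standard library behaves as far as I know, not a statement about Tomcat. Two things are worth noting. ASCII text produces identical bytes under UTF-8 and ISO 8859-1 (which is itself a superset of ASCII). And whether a non-Latin-1 character causes an exception depends on the API: String.getBytes and OutputStreamWriter substitute '?', while a CharsetEncoder used directly reports the error.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingBehaviour {

    /** True if the text has the same byte form in UTF-8 and ISO 8859-1. */
    public static boolean sameBytesInBoth(String text) {
        return Arrays.equals(text.getBytes(StandardCharsets.UTF_8),
                             text.getBytes(StandardCharsets.ISO_8859_1));
    }

    /** True if a strict ISO 8859-1 encoder refuses the text. */
    public static boolean strictLatin1Rejects(String text) {
        try {
            // A fresh encoder reports unmappable characters by default.
            StandardCharsets.ISO_8859_1.newEncoder().encode(CharBuffer.wrap(text));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    /** True if the text survives an encode/decode round trip through UTF-8. */
    public static boolean roundTripsUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return text.equals(new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String ascii = "<p>plain</p>";
        String kanji = "\u6F22\u5B57"; // two kanji characters

        System.out.println("ASCII same in both:      " + sameBytesInBoth(ascii));
        System.out.println("kanji via getBytes:      "
                + Arrays.toString(kanji.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println("strict encoder rejects:  " + strictLatin1Rejects(kanji));
        System.out.println("kanji round-trips UTF-8: " + roundTripsUtf8(kanji));
    }
}
```

The silent '?' substitution is arguably worse than an exception, since the data loss goes unnoticed; it is another argument for the UTF8 default.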

Personally I feel that the only practical alternative for (1) is to use the
web encoding as the default encoding.  I suspect that the best choice for
the default web encoding (2) is UTF8, but I might have missed some downside
to deviating (even to a superset) from ASCII.

Opinions??

