Posted to dev@httpd.apache.org by Konstantin Chuguev <jo...@urc.ac.ru> on 1998/10/20 19:14:56 UTC

Multilingual Apache [Was: Re: mod_mime/3238: New directive suggestion: AddCharset (fwd)]

Hi, Apache users and developers.

Marc Slemko wrote:
> 
> This needs to be looked at by someone who understands the support for and
> use of charsets.
> 
Well, I seem to be one of those guys :-)

> The implementation may well need cleaning up, but the idea sounds like it
> may possibly have value if it isn't too expensive.
> 
Although my implementation is kind of expensive, I think it can
be useful for somebody...

MultiWeb is a multilingual extension of Apache. It supports
multiple character sets (including Unicode and several CJK ones),
their conversion for the GET, POST and PUT methods, and several
other features, AddCharset among them.

Just today the latest version is released: Apache-1.3.3-MultiWeb-3.2.

Some details are on http://multiweb.urc.ac.ru/

Unfortunately, not much documentation now, but I am working on it.

And please feel free to ask any questions.
mailto:joy@urc.ac.ru.

> ---------- Forwarded message ----------
> Date: 19 Oct 1998 04:59:39 -0000
> From: Youichirou Koga <y-...@jp.FreeBSD.org>
> To: apbugs@hyperreal.org
> Subject: mod_mime/3238: New directive suggestion: AddCharset
> 
> >Number:         3238
> >Category:       mod_mime
> >Synopsis:       New directive suggestion: AddCharset
> >Confidential:   no
> >Severity:       non-critical
> >Priority:       medium
> >Responsible:    apache
> >State:          open
> >Class:          change-request
> >Submitter-Id:   apache
> >Arrival-Date:   Sun Oct 18 23:10:00 PDT 1998
> >Last-Modified:
> >Originator:     y-koga@jp.FreeBSD.org
> >Organization:
> apache
> >Release:        1.3.3
> >Environment:
> FreeBSD 2.2.7-STABLE
> >Description:
> New directive suggestion: AddCharset
> 
> HTTP's default charset for Content-Type: is ISO-8859-1.
> This default is very convenient for languages that are
> written in Latin characters. However, it is not
> convenient for some other languages (e.g. Japanese),
> because we must specify a charset parameter explicitly, and
> we use not just one charset but many
> (e.g. iso-8859-1, iso-2022-jp, Shift_JIS, EUC-JP, and so on).
> 
> We can use the ForceType and AddType directives to specify
> a charset parameter. Currently, to set the correct charset,
> we must set it for all text/* media types using such directives,
> yet we do not use only one charset, as I have already written.
> 
> Now I propose adding a new directive, AddCharset, to mod_mime.
> The AddCharset directive allows us to set the charset easily.
> 
> I have already implemented it. The patch and its manual are
> available at the following URL:
> <http://www.isoternet.org/~y-koga/Apache/>
> 
> I hope you will understand its necessity and merge it in the next release.

--
	Konstantin V. Chuguev.		System administrator of Southern
	http://www.urc.ac.ru/~joy/	Ural Regional Center of FREEnet,
	mailto:joy@urc.ac.ru		Chelyabinsk, Russia.

Re: Multilingual Apache [Was: Re: mod_mime/3238: New directive suggestion: AddCharset (fwd)]

Posted by Dirk-Willem van Gulik <di...@jrc.it>.

On Sun, 25 Oct 1998, Konstantin Chuguev wrote:

> Hello Dirk-Willem. I recalled I already met your name somewhere in WWW
> i18n resources. Searching through various sources, I found
> http://www.ceo.org/winter/sevilla/Overview.html

> Of course, I read this earlier. Just have not remembered your name :-)
> Very useful paper. I would like to refer to it in the MultiWeb
> documentation.

Be careful there, it is rather old. We've got some revised documentation
here; see the next message.
 
> And I hope this paper helped me to understand better your point of view
> about Web multilinguism.
 
> Excuse me if I misunderstood some of your notices below, but it seems
> like you did not read parts of MultiWeb documentation included into
> Apache docs (coming with distribution; online at
> http://www.rnoc.urc.ac.ru/apache/manual/)  or, more correctly, like I
> did not write enough documentation yet :-)  Now I am doing exactly that
> (oh, writing papers is much more difficult than programming :-), but
> will try to explain some things here in the message. 

> > On Tue, 20 Oct 1998, Konstantin Chuguev wrote:
> > 
> > Actually, because of the braindead way MIME handles charsets (i.e. as part
> > of the content-type, rather than as an independent dimension or variant),
> > the way to do this in apache, since version 0.98, is to add either in your
> > mime.types file or with AddType something along the lines of
> > 
> > html_latin1     text/html;charset=iso-...
> > 
> > In fact, using AddCharset would be counter productive (believe me I
> > tried!) unless you fix the entire content struct; i.e. break the implicit

> Could you give us an example of counter productivity of AddCharset?

Apache takes the MIME approach to charsets; i.e. they are effectively tied
to the mime-type, even though the code has a separate field for it.

So if you have a document in two languages, each with a different charset,
your negotiation becomes horribly complex; you either have to use
a variant on just one dimension, the language, and hope that the recipient
has the font, or alternatively allow only a one-to-one relation between your
charset and language (i.e. a Russian document is always stored in koi8, a
French one in latin1 and a Czech one in latin2).
 
> Anyway, this method is not the only one. I have it in MultiWeb, but
> never use it myself. There is another method, which suits my need
> better.  It works if file doesn't have a charset suffix. Looks like
> that: 
 
> <Language ru>
> 	ServerCharset koi8-r
> </Language>
 
> It's true that practically all resources in the same language share the
> same charset on the server (or at least in some server's subdirectory:
> <Language> directives can have any Apache context - up to <Files ...>).
> There's no need to label the document with a charset suffix in that
> case.

Exactly; you've put it way better than I can put it; and thus... if you
just tie it to the mime type you are there.. without having to do the
above. Except.. and this is a nice idea which I like, when you want to
go as far down as per individual files. I had not realized that; and I
agree that it can be very useful.

> I don't. But someone might want to do that.
> 
> > 	home.en.latin1.html
> > 
> > Which today just work fine.

> I have avoided changing request_rec content struct by storing the
> charset information in the r->notes table. http_protocol.c is patched a
> bit to insert that information into the Content-Type response header

Ah, so you avoid using the content_* containers. This makes sense. But it
would not make the other modules aware of the charset.

> line.  Another change in the http_protocol.c file is turning on the
> charset converter in case of textual content (I cannot be sure that
> content type is textual in a fixup_handler, where the converter is set
> up, because CGI scripts can set it later).  This is the dirty hack, but
> it seem to be unavoidable if I need the functionality MultiWeb has.  I

Yes I agree.

> would like to have the standard mechanism of this in Apache.  Until it
> happened (I hope :-) I try to make the minimal changes of the original
> sources. 

Well, the real solution might be in apache 2.0; where we might just have
streamed layers to take care of just that.

> > > > The implementation may well need cleaning up, but the idea sounds like it
> > > > may possibly have value if it isn't too expensive.
> > 
> > > Just today the latest version is released: Apache-1.3.3-MultiWeb-3.2.
> > >
> > > Some details are on http://multiweb.urc.ac.ru/
> > >
> > > Unfortunately, not much documentation now, but I am working on it.
> > >
> > > Although my implementation is kind of expensive, I think it can
> > > be useful for somebody...
> > 
> > It is actually a nice piece of work; though I worry about the i18n side,
> > as it seems to have broken a server which does not have strictly
> > parallel text in it. And yes it is very expensive :-).

> If I understood it right, you are afraid about unilingual servers or
> ones having resources with different content in different languages? 

Well, about servers which _label_ what they send out, even when that label
does not quite apply; i.e. compare to the Accept header of Netscape,
which says */* as the first entry.

> I am ready to discuss the expensiveness and minimize it.
> I really wonder how there is still no public available charset
> conversion  API.

Actually, there is; see mod_i18n, which uses the CCC-API, which to the best
of my knowledge is public. It uses (non normalized :-() unicode as the
basis. You might want to look at it. It was a TERENA project. I think your
inet96 paper even pointed to it. But yes, those APIs do tend to mix
the concept of glyphs with charsets and languages, and the C3 one seems
to have never made it beyond ap45.

	http://www.nada.kth.se/i18n/c3/

Dw.





Re: Multilingual Apache [Was: Re: mod_mime/3238: New directive suggestion: AddCharset (fwd)]

Posted by Konstantin Chuguev <jo...@urc.ac.ru>.
Dirk-Willem van Gulik wrote:
> 
Hello Dirk-Willem. I recalled that I had already met your name somewhere
in WWW i18n resources. Searching through various sources, I found
http://www.ceo.org/winter/sevilla/Overview.html

Of course, I read this earlier. I just had not remembered your name :-)
Very useful paper. I would like to refer to it in the MultiWeb
documentation.

And I hope this paper helped me to better understand your point of view
on Web multilingualism.

Excuse me if I misunderstood some of your remarks below, but it seems
that you did not read the parts of the MultiWeb documentation included in
the Apache docs (coming with the distribution; online at
http://www.rnoc.urc.ac.ru/apache/manual/) or, more correctly, that I
did not write enough documentation yet :-)
Now I am doing exactly that (oh, writing papers is much more difficult
than programming :-), but I will try to explain some things here in the
message.

> On Tue, 20 Oct 1998, Konstantin Chuguev wrote:
> 
> Actually, because of the braindead way MIME handles charsets (i.e. as part
> of the content-type, rather than as an independent dimension or variant),
> the way to do this in apache, since version 0.98, is to add either in your
> mime.types file or with AddType something along the lines of
> 
> html_latin1     text/html;charset=iso-...
> 
> In fact, using AddCharset would be counter productive (believe me I
> tried!) unless you fix the entire content struct; i.e. break the implicit
Could you give us an example of counter productivity of AddCharset?
Anyway, this method is not the only one. I have it in MultiWeb, but
never use it myself. There is another method, which suits my need better.
It works if the file doesn't have a charset suffix. It looks like this:

<Language ru>
	ServerCharset koi8-r
</Language>

It's true that practically all resources in the same language share the
same charset on the server (or at least in some server's subdirectory:
<Language> directives can have any Apache context - up to <Files ...>).
There's no need to label the document with a charset suffix in that
case.
I don't. But someone might want to do that.
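
The lookup order described here (an explicit charset suffix on the file
name first, then the per-language default) can be sketched in a few lines
of Python. This is only an illustration and not MultiWeb code: both tables,
the suffix names and the function name resolve_charset are hypothetical.

```python
# Hypothetical per-language defaults, in the spirit of
# <Language ru> ServerCharset koi8-r </Language>.
SERVER_CHARSET = {"ru": "koi8-r", "fr": "iso-8859-1", "cs": "iso-8859-2"}

# Hypothetical charset suffixes that a file name may carry.
CHARSET_SUFFIXES = {
    "koi8": "koi8-r",
    "latin1": "iso-8859-1",
    "latin2": "iso-8859-2",
}


def resolve_charset(filename, language):
    """Return the charset label for a file, or None if nothing is known."""
    # An explicit charset suffix on the file name wins.
    for part in filename.split(".")[1:]:
        if part in CHARSET_SUFFIXES:
            return CHARSET_SUFFIXES[part]
    # Otherwise fall back to the default configured for the language.
    return SERVER_CHARSET.get(language)
```

With these tables, index.ru.html resolves to koi8-r without any charset
suffix, while a file that does carry a suffix keeps its own label.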

> link between charset, content-type (and q factor, etc) breaking just about
> every module and causing subtle issues with files like
> 
>         home.en.latin1.html
> 
> Which today just work fine.
> 
I have avoided changing the request_rec content struct by storing the
charset information in the r->notes table. http_protocol.c is patched a
bit to insert that information into the Content-Type response header line.
Another change in the http_protocol.c file is turning on the charset
converter in case of textual content (I cannot be sure that the content
type is textual in a fixup_handler, where the converter is set up,
because CGI scripts can set it later).
This is a dirty hack, but it seems to be unavoidable if I need the
functionality MultiWeb has.
I would like to have a standard mechanism for this in Apache.
Until that happens (I hope :-) I try to make minimal changes to the
original sources.
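
The reason for deciding late can be shown with a small Python sketch: the
conversion is applied only once the final Content-Type is known, and only
for text/* bodies. This is not the MultiWeb patch (which is C code inside
Apache); the function names and parameters are hypothetical, and the
charsets are looked up in Python's own codec tables.

```python
def is_textual(content_type):
    """True for text/* media types, ignoring parameters and case."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/")


def finish_response(body, content_type, source_charset, target_charset):
    """Convert the body only when the final content type is textual.

    body is a bytes object; returns (body, content_type).
    """
    if not is_textual(content_type) or source_charset == target_charset:
        # Binary content, or nothing to do: pass through untouched.
        return body, content_type
    text = body.decode(source_charset)
    # Characters missing from the target charset become "?".
    converted = text.encode(target_charset, errors="replace")
    media_type = content_type.split(";", 1)[0].strip()
    return converted, f"{media_type}; charset={target_charset}"
```

A CGI script that switches its output to image/png after the fixup stage
would pass through this function unconverted, which is the case the
fixup_handler alone cannot detect.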

> > > The implementation may well need cleaning up, but the idea sounds like it
> > > may possibly have value if it isn't too expensive.
> 
> > Just today the latest version is released: Apache-1.3.3-MultiWeb-3.2.
> >
> > Some details are on http://multiweb.urc.ac.ru/
> >
> > Unfortunately, not much documentation now, but I am working on it.
> >
> > Although my implementation is kind of expensive, I think it can
> > be useful for somebody...
> 
> It is actually a nice piece of work; though I worry about the i18n side,
> as it seems to have broken a server which does not have strictly
> parallel text in it. And yes it is very expensive :-).
If I understood it right, you are worried about unilingual servers or
ones having resources with different content in different languages?
What do you mean by "broken"?

I am ready to discuss the expensiveness and minimize it.
I really wonder why there is still no publicly available charset
conversion API.
Apache is now among the most portable of all software.
There are thoughts about making a separate library of Apache API
functions. Charset conversion functions would be very useful there :-)

> 
> I wonder if it could not be combined with mod_i18n, which uses the CCC-API,
> and be slotted in _before_ mod_negotiation; and then use fake q= factors for
> all Accept lines (thus circumventing Netscape's Accept */*).
> DW
I am sorry, what do you mean here?

--
	Konstantin V. Chuguev.		System administrator of Southern
	http://www.urc.ac.ru/~joy/	Ural Regional Center of FREEnet,
	mailto:joy@urc.ac.ru		Chelyabinsk, Russia.

Re: Multilingual Apache [Was: Re: mod_mime/3238: New directive suggestion: AddCharset (fwd)]

Posted by Dirk-Willem van Gulik <di...@jrc.it>.
On Tue, 20 Oct 1998, Konstantin Chuguev wrote:

Actually, because of the braindead way MIME handles charsets (i.e. as part
of the content-type, rather than as an independent dimension or variant),
the way to do this in apache, since version 0.98, is to add either in your
mime.types file or with AddType something along the lines of

html_latin1	text/html;charset=iso-...

In fact, using AddCharset would be counter productive (believe me I
tried!) unless you fix the entire content struct; i.e. break the implicit
link between charset, content-type (and q factor, etc) breaking just about
every module and causing subtle issues with files like

	home.en.latin1.html

Which today just work fine.
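
As a rough illustration of why such a name works when the charset is part
of the media type string, here is a Python sketch that maps each suffix to
either a type or a language. The tables are hypothetical stand-ins for
mime.types, AddType and AddLanguage; that "latin1" means iso-8859-1, and
that an entry carrying a charset is preferred over a bare type, are
assumptions of this sketch rather than a description of mod_mime.

```python
# Hypothetical suffix tables. The charset travels inside the media type.
TYPES = {
    "html": "text/html",
    "latin1": "text/html; charset=iso-8859-1",
}
LANGUAGES = {"en": "en", "fr": "fr", "ru": "ru"}


def describe(filename):
    """Return (content_type, language) derived from the file's suffixes."""
    content_type, language = None, None
    for suffix in filename.split(".")[1:]:
        if suffix in TYPES:
            candidate = TYPES[suffix]
            # Keep a type that already carries a charset parameter.
            if content_type is None or "charset=" in candidate:
                content_type = candidate
        elif suffix in LANGUAGES:
            language = LANGUAGES[suffix]
    return content_type, language
```

Because the charset is just a parameter on the type, the language suffix
stays the only separate dimension that has to be negotiated.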

> > The implementation may well need cleaning up, but the idea sounds like it
> > may possibly have value if it isn't too expensive.

> Just today the latest version is released: Apache-1.3.3-MultiWeb-3.2.
> 
> Some details are on http://multiweb.urc.ac.ru/
> 
> Unfortunately, not much documentation now, but I am working on it.
>
> Although my implementation is kind of expensive, I think it can
> be useful for somebody...

It is actually a nice piece of work; though I worry about the i18n side,
as it seems to have broken a server which does not have strictly
parallel text in it. And yes it is very expensive :-).

I wonder if it could not be combined with mod_i18n, which uses the CCC-API,
and be slotted in _before_ mod_negotiation; and then use fake q= factors for
all Accept lines (thus circumventing Netscape's Accept */*).
DW
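
The "fake q= factors" idea can be sketched in Python as follows. When a
client sends an Accept header with no q values at all, wildcard entries are
given a very low q so that a leading */* no longer outranks the specific
types. The function name and the exact values 0.01 and 0.02 are
illustrative assumptions, not taken from mod_i18n or from this thread.

```python
def parse_accept(header):
    """Parse an Accept header into (media_range, q) pairs, best first."""
    entries = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = None
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        entries.append((parts[0].lower(), q))

    # Did the client state any preference explicitly?
    explicit = any(q is not None for _, q in entries)
    result = []
    for media_range, q in entries:
        if q is None:
            if explicit:
                q = 1.0      # standard default when q is omitted
            elif media_range == "*/*":
                q = 0.01     # fake factor: full wildcard ranks last
            elif media_range.endswith("/*"):
                q = 0.02     # fake factor: type wildcard ranks just above
            else:
                q = 1.0
        result.append((media_range, q))
    # sorted() is stable, so equal q values keep the client's order.
    return sorted(result, key=lambda entry: -entry[1])
```

Headers that do carry q values are left as the client wrote them, so the
fake factors only affect browsers that state no preference at all.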