Posted to users@tapestry.apache.org by Denis Ponomarev <oz...@romsat.ua> on 2003/08/26 16:05:20 UTC

Re[2]: Non-latin charset in JavaScript messages

MB> Please add a bug with this, and I will make sure it is fixed. Also, will it
MB> be possible to attach a quick example to the bug entry if possible? (files
MB> in KOI8-R are okay -- they will make a good test).

I am ready to add one, but I don't know where to do it :)

MB> I have to admit this is a bit strange, though. toString() should return the
MB> string in Unicode, and printRaw() should output it in the encoding of the
MB> page. Is JavaScript not accepting characters in the page encoding? (it
MB> should, I think). What exactly do you see?

Of course toString() returns strings in Unicode, but the resulting
HTML contains one-byte characters, doesn't it?

When Tapestry generates pages, all non-Latin characters are replaced
with the well-known &#1234; stuff. The browser accepts it in any charset.

But when JavaScript is generated, printRaw() is used, which does not do any
magic: it simply prints the string. In this case I suppose that the default
platform charset is used to translate Unicode to one-byte characters
in the resulting HTML. (The server's charset, not the client's!)

So when the client browser receives the header with the charset information
(utf-8 by default), it does not know how to interpret my characters with
codes above 127. In my case alert("non-latin") either outputs something
like hieroglyphs :) or the page simply fails after loading with a
"not closed literal" message.
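The mismatch described above can be reproduced outside Tapestry. What follows is only a sketch, not Tapestry code: it assumes the server writes the string as windows-1251 (standing in for the platform default charset) while the browser decodes the bytes as UTF-8, as the header announces. The class name CharsetMismatchDemo and the sample word are invented for illustration, and the sketch assumes the JRE ships the windows-1251 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    /** Encodes text with one charset and decodes the resulting bytes with another. */
    public static String roundTrip(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        // Russian "Privet" ("hello"), written with Java escapes to keep this source ASCII.
        String message = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // Same charset on both sides: the text survives.
        System.out.println(message.equals(roundTrip(message, cp1251, cp1251)));

        // Written as windows-1251 but read as UTF-8: the text is damaged.
        System.out.println(message.equals(roundTrip(message, cp1251, StandardCharsets.UTF_8)));
    }
}
```

The first comparison holds and the second does not, because the windows-1251 bytes are not valid UTF-8 sequences.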

Of course I can substitute the content type by overriding the
getResponseWriter() method:

public IMarkupWriter getResponseWriter(OutputStream out) {
        return new HTMLWriter("text/html; charset=windows-1251", out);
}

After that the browser successfully accepts Cyrillic.

But doing so breaks the application's internationalization!
I can no longer support both Cyrillic and Japanese.

So, to be charset-independent, I propose translating non-Latin characters into
[\u + hexCode] sequences for JavaScript.
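A rough sketch of that proposal in plain Java, for illustration only. It is not what Tapestry does, and the class and method names (JsUnicodeEscaper, escape) are made up. It rewrites every character outside printable ASCII as a backslash-u sequence with four hex digits, and also escapes quotes and backslashes so the result can sit inside a JavaScript string literal. It is not hardened against other injection concerns such as a closing script tag inside the text.

```java
public class JsUnicodeEscaper {

    /** Returns an ASCII-only form of the text, usable inside a JavaScript string literal. */
    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() * 2);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"' || c == '\'' || c == '\\') {
                // Characters that would end or corrupt the literal get a backslash.
                out.append('\\').append(c);
            } else if (c >= 0x20 && c < 0x7f) {
                // Printable ASCII is copied through unchanged.
                out.append(c);
            } else {
                // Everything else becomes a four-digit hex escape. This works per
                // UTF-16 code unit, so a supplementary character yields two escapes.
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Russian "Pole" ("field"), written with Java escapes to keep this source ASCII.
        String message = "\u041f\u043e\u043b\u0435";
        System.out.println("alert(\"" + escape(message) + "\");");
    }
}
```

Because the output contains only ASCII, the script reads the same whichever charset the page declares.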

I hope my English was understandable :)

Re[8]: Non-latin charset in JavaScript messages

Posted by Denis Ponomarev <oz...@romsat.ua>.

AG> What are you using as a server, Tomcat, Websphere, etc?

AG> In Tomcat, you could simply create the directory structure
AG> org/apache/tapestry/valid under common/classes in the tomcat directory and
AG> put the language specific files there.

Right now I'm using JBoss with Jetty, but I'm going to download a newer
version of JBoss with Tomcat.

Thank you.

RE: Re[6]: Non-latin charset in JavaScript messages

Posted by Adam Greene <ag...@romulin.com>.

What are you using as a server, Tomcat, Websphere, etc?

In Tomcat, you could simply create the directory structure
org/apache/tapestry/valid under common/classes in the tomcat directory and
put the language specific files there.

---------------------------------------------------------------------
To unsubscribe, e-mail: tapestry-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: tapestry-user-help@jakarta.apache.org

Re[6]: Non-latin charset in JavaScript messages

Posted by Denis Ponomarev <oz...@romsat.ua>.

>> I am using 3.0.

MB> I should have specified, I guess -- please use 3.0 beta-2. The earlier
MB> releases of 3.0 do not have the encoding facilities.

In 3.0 beta-2 everything works fine! Thanks a lot!

And one last question on this topic: how can I add my own localized version
of the ValidationStrings.properties file?

I understand that I can bind messages from my Page.properties file to
the message properties of the validators, but it is inconvenient to do
that across my whole application.


I found this in the BaseValidator class:

ResourceBundle strings =
         ResourceBundle.getBundle("org.apache.tapestry.valid.ValidationStrings", locale);

So the default class loader is used to load the strings. As I understand it,
it can't see my application's classpath, can it?

Any suggestions?
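Whether a bundle is found depends on which class loader performs the lookup. The sketch below is not how BaseValidator works (it uses the two-argument call quoted above). It only illustrates the lookup mechanics with the JDK's three-argument ResourceBundle.getBundle(String, Locale, ClassLoader), pointing a loader at a directory that holds org/apache/tapestry/valid/ValidationStrings_ru.properties. The class name, the temporary directory and the key "demo-required" are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.ResourceBundle;

public class ValidationStringsLookupDemo {

    static final String BASE_NAME = "org.apache.tapestry.valid.ValidationStrings";

    /** Creates a temporary classpath root holding a Russian bundle; returns the root. */
    public static Path writeRussianBundle() {
        try {
            Path root = Files.createTempDirectory("bundle-demo");
            Path dir = Files.createDirectories(root.resolve("org/apache/tapestry/valid"));
            // Hypothetical key; the value is stored as ASCII-only escapes,
            // the form that native2ascii produces.
            String line = "demo-required=\\u041f\\u043e\\u043b\\u0435\n";
            Files.write(dir.resolve("ValidationStrings_ru.properties"),
                    line.getBytes(StandardCharsets.US_ASCII));
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Resolves the bundle through a class loader that can see the given root. */
    public static String lookup(Path classpathRoot, String key) {
        try (URLClassLoader loader =
                new URLClassLoader(new URL[] { classpathRoot.toUri().toURL() })) {
            ResourceBundle strings =
                    ResourceBundle.getBundle(BASE_NAME, Locale.forLanguageTag("ru"), loader);
            return strings.getString(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = lookup(writeRussianBundle(), "demo-required");
        // Compares against the same word written with Java escapes.
        System.out.println("\u041f\u043e\u043b\u0435".equals(text));
    }
}
```

The sketch leaves its temporary directory behind. In a real deployment the file would simply live somewhere the relevant class loader already searches.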

RE: Re[4]: Non-latin charset in JavaScript messages

Posted by Mind Bridge <mi...@yahoo.com>.

Hi Denis,

> I am using 3.0.

I should have specified, I guess -- please use 3.0 beta-2. The earlier
releases of 3.0 do not have the encoding facilities.


With that version you could try the following:

Run the Workbench.
Open a new IE window, go to Tools/Internet Options/Languages, add 'Chinese
(Taiwan)' (zh-tw) as a language and place it at the top of the language list
(with highest priority). I am choosing this language as a demo, since
Tapestry contains ValidationStrings translated into Chinese (Taiwanese) by
default.
Open the Workbench in that browser, go to the Fields page, and without
typing anything (leaving client-side validation enabled), click on Continue.
What do you see?

I personally see an error message (field is required) in Chinese on a number
of different machines. This message is in the JavaScript of the page, and
thus it has to have been encoded properly in utf-8.

This is similar to what you were suggesting, I believe, and it does seem to
work consistently.


> So to make the browser
> understand them you must specify the charset of the page, either in the HTTP
> Content-Type header or with http-equiv in a meta tag.

3.0 beta 2 does the following:
- always sets the content-type in the header with the encoding used
- the Shell component includes http-equiv by default as well, again including
the encoding (this is necessary for forms in IE)
- decodes the POST requests using that same encoding
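Why the last point matters can be sketched with the JDK's URLEncoder and URLDecoder alone. This is a simplified model rather than Tapestry or servlet code: it assumes the browser percent-encodes a form value using the page's encoding and the server then decodes it with some encoding of its own. The class and method names are invented.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class FormEncodingDemo {

    /** Percent-encodes a value as the page would, then decodes it as the server would. */
    public static String submitAndDecode(String value, String pageEncoding, String serverEncoding) {
        try {
            String onTheWire = URLEncoder.encode(value, pageEncoding);
            return URLDecoder.decode(onTheWire, serverEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Russian "Pole" ("field"), written with Java escapes to keep this source ASCII.
        String typed = "\u041f\u043e\u043b\u0435";

        // Server decodes with the encoding the page was sent in: the value survives.
        System.out.println(typed.equals(submitAndDecode(typed, "UTF-8", "UTF-8")));

        // Server decodes with a different encoding: the value is garbled.
        System.out.println(typed.equals(submitAndDecode(typed, "UTF-8", "ISO-8859-1")));
    }
}
```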


What you are seeing is definitely very weird. I cannot think of a specific
reason why it should occur. The 'usual suspects' are the following (but they
don't quite fit what you are saying):

  - the property file with the messages has not been run through
native2ascii -- this is a requirement of the standard Java Properties
implementation. As a result, the

  - getResponseWriter() was overridden incorrectly in the page



I am not sure if this helps. If it does not, please contact me directly --
if this is a problem, we should get to the bottom of this.

Best regards,
-mb



Re[4]: Non-latin charset in JavaScript messages

Posted by Denis Ponomarev <oz...@romsat.ua>.

>> I am ready to add one, but I don't know where to do it :)

MB> :)

MB> There is a link to the bug database on jakarta.apache.org

Probably I'll try it later...

>> Of course toString() returns strings in Unicode, but the resulting
>> HTML contains one-byte characters, doesn't it?

MB> Based on this and the rest of your message I can conclude that you are using
MB> Tapestry 2.x.

No!
I am using 3.0. It seems my previous explanation was not clear
enough :)
When I say "one-byte characters in HTML", I mean that when a page is
rendered you get HTML that contains no Unicode, only ASCII
characters! Any non-ASCII character is represented by a special form called
an "HTML entity". It consists of _several_ASCII_characters_ and looks like
&#1043;. It is not Unicode! There are several ASCII characters instead of
one non-ASCII character. Of course these ASCII characters describe the
non-ASCII character by its Unicode value, but this is not Unicode itself; it
is a representation of Unicode. In the case of &#1043; it uses 7 bytes (one
per character), not 2 bytes as Unicode does.
But no similar technique is applied when JavaScript is rendered.
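The byte counts above can be checked with a few lines of Java. This is only an illustration of the idea, not Tapestry's writer: it replaces each non-ASCII code point with its decimal reference and leaves ASCII untouched (so it does not escape markup characters such as < or &). The class and method names are invented.

```java
import java.nio.charset.StandardCharsets;

public class NumericReferenceDemo {

    /** Replaces every non-ASCII code point with a decimal numeric character reference. */
    public static String toNumericReferences(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (codePoint < 0x80) {
                out.append((char) codePoint);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Cyrillic capital letter GHE, code point 1043 in decimal.
        String ghe = "\u0413";
        String reference = toNumericReferences(ghe);

        System.out.println(reference);                                      // the reference itself
        System.out.println(reference.length());                             // ASCII characters used
        System.out.println(ghe.getBytes(StandardCharsets.UTF_8).length);    // bytes when sent as UTF-8
    }
}
```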

MB> Version 3.0 has a number of features that, I believe, resolve the
MB> encoding/internationalization problem completely.

I don't think so!

Consider a simple example:

Create a page with a simple form and put one ValidField on it.
Assign a StringValidator to it. Set the clientScriptingEnabled property to
true. (This will generate validation JavaScript.)
Now assume that the application should have multilanguage support.
So you have to provide localized versions of the validation messages
for the languages you want to support. For example, you can bind the
requiredMessage property of the validator to a message from the
Page.properties file. To support other languages you simply add
localized properties files, for example Page_ru.properties.
Try to do this for any language with non-Latin characters.

Now if you run the application and open the resulting HTML, you'll see in the
source that the validation script contains string literals with the messages.
They are non-ASCII! So to make the browser
understand them you must specify the charset of the page, either in the HTTP
Content-Type header or with http-equiv in a meta tag.

Even if you support only ONE language and it is non-Latin, you
must add a servlet filter to specify the request encoding if you want to
receive valid user input from the forms.

But we want to support _several_ languages. We can't just substitute the
content type of the output, because it must support _all_ our languages and
so must use a universal charset (utf-8).

So what can we do with Tapestry from this point? My answer is nothing.
And my proposed solution was quite obvious: change the rendering of
JavaScript. Just as it is done in HTML rendering (replacing
non-Latin characters with HTML entities), it must be done in JavaScript
rendering (replacing non-Latin characters with [\u + code] sequences).
After doing so we don't need to specify any charset other than utf-8. Our
pages will contain ASCII characters only!

RE: Re[2]: Non-latin charset in JavaScript messages

Posted by Mind Bridge <mi...@yahoo.com>.

Hi,

> I am ready to add one, but I don't know where to do it :)

:)

There is a link to the bug database on jakarta.apache.org

> Of course toString() returns strings in Unicode, but the resulting
> HTML contains one-byte characters, doesn't it?

Based on this and the rest of your message I can conclude that you are using
Tapestry 2.x.

Version 3.0 has a number of features that, I believe, resolve the
encoding/internationalization problem completely. From the 'What's new' page:

- The character encoding used for a component template can now be defined
using the property 'org.apache.tapestry.template-encoding'. The property is
localizable, so you can define 'org.apache.tapestry.template-encoding_ru' to
specify the encoding for all Russian templates, for example.

- The character encoding used to generate the response can now be specified
using the property 'org.apache.tapestry.output-encoding'. It is UTF-8 by
default. (that should not normally need to change)

- The Shell component now defines an http-equiv tag with the content type of
the response.


In other words, even without changing the default settings, the generated
HTML is encoded in UTF-8, form parameters are decoded correctly, etc.

You could separately define what encoding your templates use. That encoding
can be different for different templates (one can be ISO..., another Cp...,
a third Big5, etc), and that has no bearing at all on the charset used to
generate the final page.

Here are some ways to define the template-encoding:

	<property name="org.apache.tapestry.template-encoding" value="ISO-8859-1"/>
	<property name="org.apache.tapestry.template-encoding_ru" value="windows-1251"/>

You can place those in a .jwc, .page, .library, .application, depending on
the scope in which you want them to be active. You can also define them like
other properties in the web.xml or using -D.
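The point that the template encoding has no bearing on the output charset can be sketched in plain Java. This is not Tapestry's template loader, just the underlying idea: bytes are decoded to a String with the template's charset and encoded again with the output charset. The class and method names are invented, and the sketch assumes the JRE provides windows-1251 and KOI8-R.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TemplateTranscodingDemo {

    /** Decodes template bytes with the template's charset and re-encodes them for the response. */
    public static byte[] transcode(byte[] templateBytes, Charset templateEncoding,
                                   Charset outputEncoding) {
        String text = new String(templateBytes, templateEncoding);
        return text.getBytes(outputEncoding);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        Charset koi8r = Charset.forName("KOI8-R");
        // Russian "Pole" ("field"), written with Java escapes to keep this source ASCII.
        String text = "\u041f\u043e\u043b\u0435";

        // Two "templates" holding the same text in different encodings.
        byte[] templateA = text.getBytes(cp1251);
        byte[] templateB = text.getBytes(koi8r);

        // The stored bytes differ, but the UTF-8 output is identical.
        System.out.println(Arrays.equals(templateA, templateB));
        System.out.println(Arrays.equals(
                transcode(templateA, cp1251, StandardCharsets.UTF_8),
                transcode(templateB, koi8r, StandardCharsets.UTF_8)));
    }
}
```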

Please try 3.0 if you can -- you should not get one-byte chars in the output
for non-English symbols and the JavaScript should work fine.

> I hope my English was understandable :)

It was, certainly.

Best regards,
-mb
