Posted to websh-dev@tcl.apache.org by Ronnie Brunner <ro...@netcetera.ch> on 2005/08/26 17:00:22 UTC

i18n problems in Websh (multibyte charsets) (was Re: switching to apache)

Hi Taguchi

I tried to apply your patches, but it was not as straightforward as I
thought: many of your changes (e.g. Tcl 8.4 compliance) have already
been made in the HEAD version of Websh (your patch was against the
original websh3.5.0).

Now I have the following problems:
- I needed to modify Makefile.in, because it would no longer compile
  on Solaris (I can find out what to do there)
- The test suite no longer passes: the web_out_eval_tag function
  does not work as expected. Did the tests work in your environment?
  (I'd be surprised.) The way it looks, web::putx is broken when used
  recursively, and escaping no longer works the same way. The
  following tests fail:
  ==== putx-2.3 web::putx nested FAILED
  ==== putx-2.4 special syntax FAILED
  ==== putx-2.5 escaping FAILED
  ==== putx-2.6 escaped paren FAILED
  ==== putx-3.0d <? ?> syntax escaping FAILED
  ==== putx-3.1 nested <? ?> syntax FAILED
  ==== putx-3.2 gloabally switch to tag syntax FAILED
  ==== putx-3.3 gloabally switch back to normal eval syntax FAILED
  ==== putx-3.6 putx of empty string before first brace does not send headers FAILED
  ==== putx-3.7 putx of empty string before first brace does not send headers FAILED

Obviously some work is needed to make this work properly. I'm not
sure if I'll get to it, but if you manage to get our tests
running, I wouldn't mind ;-)

I'll keep the issue open until either someone sends a solution or I
have time to look at it.

Best regards
Ronnie

> > Please send your bug reports to websh-dev@tcl.apache.org (with me in
> > CC). I don't remember any mails regarding these issues, except for a
> > supposedly unsolved "special characters" issue that I could
> > not reproduce.
> 
> See following URL:
> http://mail-archives.apache.org/mod_mbox/tcl-websh-dev/200505.mbox/%3c429C6300.1040408@ff.iij4u.or.jp%3e
> 
> This patch solved the following problems for me:
>   a. websh cannot handle pages written with multi-byte characters
>     such as Japanese, Chinese, Korean, and so on.
>   b. Broken pages.
>     For example, translate the strings in the websh live examples
>     from English to Japanese, then push the submit button.
>     You will get broken pages.
> 
> The original web_eval_tag() assumes every string is single-byte.
> But Unicode strings are multi-byte,
> so multi-byte pages get broken.
> 
> But note: web_eval_tag() in this patch is a modified version of
> the tcl-rivet one.
> 
> Additionally, parseUrlEncodedFormData() in formdata.c sets the channel option
> -translation to "binary". This also implicitly sets the -encoding option to
> binary. See the following example.


-----------------------------------------------------------------------
Ronnie Brunner                              ronnie.brunner@netcetera.ch
Netcetera AG, 8040 Zuerich, phone +41 44 247 79 79 fax +41 44 247 70 75

---------------------------------------------------------------------
To unsubscribe, e-mail: websh-dev-unsubscribe@tcl.apache.org
For additional commands, e-mail: websh-dev-help@tcl.apache.org


Re: i18n problems in Websh (multibyte charsets)

Posted by Ronnie Brunner <ro...@netcetera.ch>.
Hi Taguchi

> I've finished cleaning up my patch.
> I believe the web::putx and web::htmlify problems are solved.
> Now they can handle not only single-byte strings, but also
> multi-byte strings.

I applied your patch and the tests run fine. Would it be possible for you
to add some tests that confirm the new compliance with other
encodings? I would like to add some, so that we won't break things
again when we add new features or fix bugs.

> Sorry, I'm still confused about parseUrlEncodedFormData().
> Is this 'Tcl_Channel channel' used as an output channel?
> 'Output' means that web::putx or web::put writes to this channel.

Well, the problem is the following: in parseUrlEncodedFormData, we get
URI-encoded form fields. On the wire they are plain ASCII, but only
because they are percent-encoded. The actual content might be in a
different charset altogether. Right now, we set the channel to binary and
read the ASCII data, then we set the channel back to what it was
and call web::uri2list, which decodes the actual form fields. At
that point they can have different encodings and, unfortunately, I'm
not really sure whether it works under all combinations.
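
To make the point concrete, here is a minimal, self-contained C sketch. It is not the web::uri2list implementation; url_decode() and hex_value() are hypothetical names used only for illustration. The submitted field is pure ASCII, and percent-decoding yields raw bytes whose character set the decoder itself cannot know.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns the value of a hex digit, or -1 if c is not one. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = tolower(c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical helper: decodes a www-form-urlencoded value into out,
 * which must have room for strlen(in) + 1 bytes. Returns the number of
 * decoded bytes. An incomplete or invalid escape is copied literally. */
static size_t url_decode(const char *in, char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (*in == '%' && hex_value((unsigned char)in[1]) >= 0
                       && hex_value((unsigned char)in[2]) >= 0) {
            /* "%E3" becomes the single raw byte 0xE3. */
            out[n++] = (char)(hex_value((unsigned char)in[1]) * 16
                              + hex_value((unsigned char)in[2]));
            in += 3;
        } else if (*in == '+') {
            out[n++] = ' ';   /* '+' stands for a space in form data */
            in++;
        } else {
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

Decoding "%E3%81%82" gives the three bytes E3 81 82. Read as UTF-8 they are the single character \u3042; under euc-jp the same character would have been submitted as different bytes, so the application still has to know which encoding the browser used.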

> If yes, its encoding option should be backed up, because
> Tcl_SetChannelOption(interp, channel, "-translation", "binary");
> also sets its encoding option as a side effect.

OK, I finally found out what you mean: setting the translation to binary
really does drop the encoding information (which I didn't know and which
is, as far as I know, not documented anywhere...)

> All data from Apache is ASCII-encoded. But output from mod_websh
> to Apache might be in another encoding, including multibyte ones.
> I'd forgotten this, sorry.

The encoding of data from Apache actually varies and is not necessarily
ASCII (look at multipart forms: the encoding might be part of the form
data, and binary files can be uploaded as well).

So far, we have always treated all data as binary, so it is the
responsibility of the application to convert if necessary. I'm not
sure this works with all encodings, but obviously you now manage
to handle your multi-byte character set properly, even though Websh
does not really treat multipart form data in the correct encoding, but
handles it as binary. If you have an example of what a browser submits
and what Websh has to do with it, and we can create some tests from it, I
would very much like to add these tests to our test suite (something
similar to the tests we have in src/tests/dispatch.test or
src/tests/formdata.test).

I will look at the code more closely soon and if everything looks fine
and we have some more tests for multibyte character sets, I'd like to
commit your proposed changes.

Thank you very much so far for your efforts. I appreciate it.

Regards
Ronnie
-----------------------------------------------------------------------
Ronnie Brunner                              ronnie.brunner@netcetera.ch
Netcetera AG, 8040 Zuerich, phone +41 44 247 79 79 fax +41 44 247 70 75



Re: i18n problems in Websh (multibyte charsets)

Posted by Ronnie Brunner <ro...@netcetera.ch>.
Hi Taguchi

> I believe the web::putx and web::htmlify problems are solved.
> Now they can handle not only single-byte strings, but also
> multi-byte strings.

I have a question regarding htmlify: would it not make sense to encode
all multibyte characters as &#<numeric>; as well? So the result is
always ASCII compatible? On the other hand web::dehtmlify already
handles this correctly.
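
A minimal sketch of that idea in self-contained C. This is not Websh's actual htmlify code: utf8_decode() and entity_encode() are hypothetical names, the input is assumed to be well-formed UTF-8, and escaping of ASCII characters such as < and & is deliberately left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Decodes one UTF-8 sequence starting at s into *cp and returns its
 * length in bytes (1 to 4). Assumes well-formed input. */
static int utf8_decode(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((unsigned long)(s[0] & 0x0F) << 12)
            | ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    *cp = ((unsigned long)(s[0] & 0x07) << 18)
        | ((unsigned long)(s[1] & 0x3F) << 12)
        | ((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

/* Copies src to dst, replacing every non-ASCII character by "&#N;".
 * Returns 0 on success, -1 if dst (dstlen bytes) is too small. */
static int entity_encode(const char *src, char *dst, size_t dstlen)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t used = 0;

    assert(src != NULL && dst != NULL);
    while (*p != '\0') {
        unsigned long cp;
        char buf[16];
        int n;

        p += utf8_decode(p, &cp);
        if (cp < 0x80) {
            buf[0] = (char)cp;
            buf[1] = '\0';
            n = 1;
        } else {
            n = sprintf(buf, "&#%lu;", cp);   /* decimal code point */
        }
        if (used + (size_t)n + 1 > dstlen) return -1;
        memcpy(dst + used, buf, (size_t)n + 1);
        used += (size_t)n;
    }
    if (dstlen == 0) return -1;
    dst[used] = '\0';
    return 0;
}
```

With this approach the output contains only ASCII bytes, so it survives any output channel encoding, at the cost of a longer page.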

I committed a new config option that allows you to set the
permissions of all created files:

	    web::config filepermissions 0600

Default is still 0644, but now you can set it explicitly, so you don't
need the hack in formdata.c anymore.

Please update from CVS and check if everything works for you as
intended: except for the Makefile.in it should include all your
suggestions, some tests and fixes to the quickref.xml.

Regards
Ronnie
-----------------------------------------------------------------------
Ronnie Brunner                              ronnie.brunner@netcetera.ch
Netcetera AG, 8040 Zuerich, phone +41 44 247 79 79 fax +41 44 247 70 75



Re: i18n problems in Websh (multibyte charsets)

Posted by Taguchi Takeshi <ta...@ff.iij4u.or.jp>.
Hi,

I've finished cleaning up my patch.
I believe the web::putx and web::htmlify problems are solved.
Now they can handle not only single-byte strings, but also
multi-byte strings.

Sorry, I'm still confused about parseUrlEncodedFormData().
Is this 'Tcl_Channel channel' used as an output channel?
'Output' means that web::putx or web::put writes to this channel.

If yes, its encoding option should be backed up,
because
Tcl_SetChannelOption(interp, channel, "-translation", "binary");
also sets its encoding option as a side effect.

If no, please forget this part.

All data from Apache is ASCII-encoded. But output from mod_websh
to Apache might be in another encoding, including multibyte ones.
I'd forgotten this, sorry.

Additionally, this patch is still dirty. Actually, I'm not good at the C
language. I hate pointers. So I love Tcl ;-)

Thanks,
Taguchi,T.
---

Re: i18n problems in Websh (multibyte charsets)

Posted by ta...@iij.ad.jp.
Hi,

The Rivet parser cannot handle "{}" as script start/end tags.
I think that is why the first patch produced so many errors.

I've made a third patch for web_eval_tag().
It is written from scratch.

All putx tests in test/webout.test pass.

I'm now trying to make a patch for htmlify.

Thanks.
Taguchi,T.



Re: i18n problems in Websh (multibyte charsets)

Posted by "David N. Welton" <da...@dedasys.com>.
From a very quick glance at this thread, I think what you might want to
consider doing is using next = (char *)Tcl_UtfNext(cur); in the parser.
weboutint.c uses DStrings.  In Rivet, the rivetParser.c file uses the
Utf stuff in order to parse up the <? ?> tags.  I think it's done
correctly but you'd probably have to try it with some non-European
encodings...

One thing that would be handy would be a small test case in Tcl that
demonstrates the problem (if I haven't missed it... apologies if I
didn't see it).

-- 
David N. Welton
- http://www.dedasys.com/davidw/

Apache, Linux, Tcl Consulting
- http://www.dedasys.com/



Re: i18n problems in Websh (multibyte charsets)

Posted by ta...@iij.ad.jp.
Hi, 

I found another problem.

web::htmlify cannot handle multibyte strings.
For example:

> websh3.5.1a
% web::htmlify abcd
abcd                     ; # it seems to work fine.
% web::htmlify "\u3042\u3044\u3046\u3048\u304a"
                         ; # returns an empty string!!

"\u3042\u3044\u3046\u3048\u304a" is a valid Japanese string,
so I think web::htmlify should return the string converted
from it.

I'm reading webHtmlify() in htmlify.c now .......

Thanks,
Taguchi,T.



Re: i18n problems in Websh (multibyte charsets)

Posted by ta...@iij.ad.jp.
Hi,

Tcl has an encoding mechanism.
Tcl assumes that every string read from an input channel is written in
that channel's encoding. The default encoding can be queried with the
"encoding system" command.
Tcl converts such a string from the channel's encoding to its internal
UTF-8 representation.
For output, Tcl converts the internal UTF-8 string back to the channel's
encoding.

The exception is a channel whose encoding is "binary": then no
conversion occurs.

Many "multibyte problems" occur at this point.

Imagine a system with a multibyte system encoding such as euc-jp,
and that the websh test/ scripts contain raw UTF-8(?) multibyte strings.

If a websh running with such a system encoding reads those test scripts,
it tries to convert them from its system encoding to the internal
UTF-8 encoding. But the input is already UTF-8, so it gets broken.
So all raw 8-bit strings must be written in "\uXXXX" notation,
and they must be valid strings in the system encoding.
For example, Tcl can read any Chinese string written in "\uXXXX"
notation. But if that Tcl has "euc-jp" as its system encoding, it cannot
output the string: the output channel must use a Chinese encoding for a
Chinese string.
So I think the test scripts must be evaluated under the correct encoding.
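
The double conversion described above can be reproduced without Tcl. The following self-contained C sketch treats bytes that are already UTF-8 as if they were in the system encoding and converts them to UTF-8 a second time. latin1_to_utf8() is a hypothetical helper, and iso8859-1 is assumed as the system encoding only because that conversion fits in a few lines; with euc-jp the damage is of the same kind.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: converts iso8859-1 bytes to UTF-8.
 * out must have room for 2 * strlen(in) + 1 bytes.
 * Returns the number of bytes written, not counting the NUL. */
static size_t latin1_to_utf8(const char *in, char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        if (c < 0x80) {
            out[n++] = (char)c;                    /* ASCII is unchanged */
        } else {
            out[n++] = (char)(0xC0 | (c >> 6));    /* lead byte */
            out[n++] = (char)(0x80 | (c & 0x3F));  /* continuation byte */
        }
    }
    out[n] = '\0';
    return n;
}
```

Feeding it the UTF-8 bytes E3 81 82 (the single character \u3042) produces six bytes that represent three unrelated characters, which is the kind of breakage seen when a script that is already UTF-8 is read under a different system encoding. Pure ASCII input, such as "\uXXXX" escapes, passes through unchanged.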

On the other hand, someone might think the binary encoding is a good
solution. But it is not a good idea.

Tcl can read a string from a binary-encoded channel and can write such a
string out again. But Tcl cannot operate on such a string.
For example,
% fconfigure stdin -encoding binary
% set rawStr [gets stdin] 
% set splitStr [split $rawStr {}]; # splitStr will be broken.

I think websh tries to handle multibyte strings as single-byte strings.
Additionally, I think websh also has the encoding-related problems
described above.

> If you talk about scripts that are sourced from mod_websh, you have to
> look at src/generic/webinterp.c: in readWebInterpCode() we basically
src/generic/interpool.c ?
> do the following:
> 
> Tcl_Obj *objPtr = Tcl_NewObj();
> chan = Tcl_OpenFileChannel(interp, filename, "r", 0644);
> Tcl_ReadChars(chan, objPtr, -1, 0);
> Tcl_Close(interp, chan);
> -> objPtr is the code object that is later eval'ed using Tcl_EvalObjEx
> 
> Hope that helps

Thanks, Ronnie! I was looking for this, but I could not find it...

Note that the encoding of this channel "chan" is the default system encoding.
So websh can read a ws3 script that is written in its system encoding.

But I think the channel for form data has the "binary" encoding,
so websh cannot handle multibyte form data.
Of course,
  web::put [encoding convertfrom [encoding system] [web::formvar varName]]
works fine.

Thanks.
Taguchi,T.



Re: i18n problems in Websh (multibyte charsets)

Posted by Ronnie Brunner <ro...@netcetera.ch>.
Hi

I looked at your patch. As you already know, it does not really work
yet. But can you tell me what this modification is for?

diff -ur tcl-websh.orig/src/generic/formdata.c tcl-websh/src/generic/formdata.c
--- tcl-websh.orig/src/generic/formdata.c       Mon Aug 29 13:24:13 2005
+++ tcl-websh/src/generic/formdata.c    Mon Aug 29 13:30:19 2005
@@ -41,6 +41,7 @@
     int readToEnd = 0;
     int content_length = 0;
     Tcl_DString translation;
+    Tcl_DString encoding;

     channel = Web_GetChannelOrVarChannel(interp, channelName, &mode);
     if (channel == NULL) {
@@ -63,7 +64,9 @@
     }

     Tcl_DStringInit(&translation);
+    Tcl_DStringInit(&encoding);
     Tcl_GetChannelOption(interp, channel, "-translation", &translation);
+    Tcl_GetChannelOption(interp, channel, "-encoding", &encoding);
     Tcl_SetChannelOption(interp, channel, "-translation", "binary");

     /* ------------------------------------------------------------------------
@@ -88,7 +91,9 @@
            if (Tcl_GetIntFromObj(interp, len, &content_length) != TCL_OK) {

                Tcl_SetChannelOption(interp, channel, "-translation", Tcl_DStringValue(&translation));
+               Tcl_SetChannelOption(interp, channel, "-encoding", Tcl_DStringValue(&encoding));
                Tcl_DStringFree(&translation);
+               Tcl_DStringFree(&encoding);
                /* unregister if was a varchannel */
                Web_UnregisterVarChannel(interp, channelName, channel);
                return TCL_ERROR;
@@ -122,7 +127,9 @@
            Tcl_DecrRefCount(formData);

            Tcl_SetChannelOption(interp, channel, "-translation", Tcl_DStringValue(&translation));
+           Tcl_SetChannelOption(interp, channel, "-encoding", Tcl_DStringValue(&encoding));
            Tcl_DStringFree(&translation);
+           Tcl_DStringFree(&encoding);
            /* unregister if was a varchannel */
            Web_UnregisterVarChannel(interp, channelName, channel);

@@ -131,7 +138,9 @@
     }

     Tcl_SetChannelOption(interp, channel, "-translation", Tcl_DStringValue(&translation));
+    Tcl_SetChannelOption(interp, channel, "-encoding", Tcl_DStringValue(&encoding));
     Tcl_DStringFree(&translation);
+    Tcl_DStringFree(&encoding);
     /* unregister if was a varchannel */
     Web_UnregisterVarChannel(interp, channelName, channel);

As far as I can see, this doesn't do anything: it just saves the
encoding, then doesn't do anything with it and before returning sets
it to the value it's already set to. Did I miss something?

> Does websh get contents from apache server as unicode string,
> or binary string?

I'm not sure if I understand your question correctly: The strings are
always the same (binary?). The encoding is just how you interpret
them. If you are talking about multipart form data, you get the encoding
within the data; otherwise (www-form-urlencoded) it is just a defined 8-bit
encoding (I would not know how multibyte character sets are posted in
that encoding).

If you talk about scripts that are sourced from mod_websh, you have to
look at src/generic/webinterp.c: in readWebInterpCode() we basically
do the following:

Tcl_Obj *objPtr = Tcl_NewObj();
chan = Tcl_OpenFileChannel(interp, filename, "r", 0644);
Tcl_ReadChars(chan, objPtr, -1, 0);
Tcl_Close(interp, chan);
-> objPtr is the code object that is later eval'ed using Tcl_EvalObjEx

Hope that helps

Best regards
Ronnie
-----------------------------------------------------------------------
Ronnie Brunner                              ronnie.brunner@netcetera.ch
Netcetera AG, 8040 Zuerich, phone +41 44 247 79 79 fax +41 44 247 70 75



Re: i18n problems in Websh (multibyte charsets) (was Re: switching to apache)

Posted by ta...@iij.ad.jp.
> I've read the code more carefully. And I think I found the
> solution.

Sorry, not yet.

Multibyte strings get broken in webout_eval_tag().
Everything seems fine at the top of webout_eval_tag():
the variable cur contains the correct string.

But the variable dstr does not contain the correct string.
It's broken.

I think quote_append() cannot handle multi-byte strings,
so I made a patch.

But it is not good enough....

I have a question.

Does websh get content from the Apache server as a Unicode string
or as a binary string?



Re: i18n problems in Websh (multibyte charsets) (was Re: switching to apache)

Posted by ta...@iij.ad.jp.
Hi, Ronnie,

About the webout_eval_tag() patch: I got confused.
It was a bad idea to import the related code from tcl-rivet.
I should have adapted it more for websh.

I've read the code more carefully. And I think I found the
solution.

The current webout_eval_tag() contains the following code:
	....
	prev = cur;
	cur++;
	continue;
	....
char *cur holds the content, which may be a Unicode string.
That means it may contain multi-byte characters, not only single-byte ones,
so cur++ may not point to the next character.

I think the lines above should be as follows:
	....
	prev = cur;
	cur = (char *)Tcl_UtfNext(cur);
	continue;
	....
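
To show the difference between stepping by byte and stepping by character, here is a minimal, self-contained C sketch. It does not use the Tcl library: utf8_next() is a hypothetical stand-in for Tcl_UtfNext(), utf8_char_count() is only an illustration, and both assume well-formed UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Tcl_UtfNext(): returns a pointer to the
 * first byte of the character that follows the one at p.
 * Assumes p points at the start of a character in well-formed UTF-8. */
static const char *utf8_next(const char *p)
{
    assert(p != NULL && *p != '\0');
    p++;
    /* Continuation bytes all have the bit pattern 10xxxxxx; skip them. */
    while (((unsigned char)*p & 0xC0) == 0x80) {
        p++;
    }
    return p;
}

/* Counts characters by advancing with utf8_next() instead of cur++. */
static size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (*s != '\0') {
        s = utf8_next(s);
        n++;
    }
    return n;
}
```

For the five bytes 'a', E3, 81, 82, 'b' this counts three characters, while a loop using cur++ would visit five positions, two of them in the middle of a character.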

Attached is a new patch against CVS HEAD.
I think it works fine.

But I found other problems.

The scripts under test/ contain raw 8-bit strings,
and I think some of them are raw 8-bit Unicode strings ....

That requires "encoding system utf-8".
But many systems have another encoding such as iso8859-1, euc-jp,
and so on.
Such systems cannot read raw 8-bit Unicode strings.
I think the scripts should use \uXXXX notation.

Thanks.
Taguchi,T.
---