Posted to rivet-dev@tcl.apache.org by Massimo Manghi <ma...@alice.it> on 2019/09/23 00:17:20 UTC

UTF8 handling by Apache and Tcl

Does anyone have experience with the Apache handling of UTF8 strings?

I have two Apache instances running the same code with largely the same
configuration. The first, which I built from source, works as expected,
while the second tampers with URLs whose arguments are strings containing
UTF-8 encoded accented characters, like those widely used in French and,
to a lesser extent, in Italian. Those characters seem to go through some
sort of double UTF-8 conversion, because they can be forced to display
correctly by calling 'encoding' on them beforehand:

set my_string [encoding convertfrom utf-8 $my_string]

There is also some interaction with ::rivet::escape_string, which fails
to recognize those characters as alphanumeric and escapes them. What
bothers me, though, is that I couldn't find an obvious explanation for
the two Apache instances behaving differently (AddDefaultCharset is left
unset in both configurations; the Apache2 versions are different but very
close to each other).


  -- Massimo



---------------------------------------------------------------------
To unsubscribe, e-mail: rivet-dev-unsubscribe@tcl.apache.org
For additional commands, e-mail: rivet-dev-help@tcl.apache.org


Re: UTF8 handling by Apache and Tcl

Posted by Massimo Manghi <ma...@alice.it>.

On 9/25/19 7:58 PM, Georgios Petasis wrote:
> Dear Massimo,
> 
> My advice is to use "encoding system" in your code, and act accordingly 
> in the code (use or not use encoding convertfrom).
> This way, the code will work even in cases you cannot control the 
> settings apache runs with.
> 
> Best,
> George
>

Hi George

As I hinted in my first message in this thread, strings with accented
characters were handled consistently until they went through
::rivet::escape_string, before making it into a URL.

The problem seems to be related to the byte string returned by this call
in ::rivet::escape_string:

origString = Tcl_GetStringFromObj( objv[1], &origLength );

With both utf-8 and iso8859-1 system encodings the returned string is
invariably the utf-8 byte representation, which at first made sense to
me because I know that Tcl handles strings as utf-8 internally. I'm not
questioning what Tcl_GetStringFromObj does, but shouldn't it be replaced
at this point by some function that returns a byte string consistent
with the locale?

For example the accented character 'é', which has code 0xe9 as byte
representation in latin1 (and the same Unicode code point, U+00E9), is
represented as 0xc3 0xa9 (utf-8 byte string) and it becomes %c3%a9.
After this sequence of bytes has been unescaped it's returned by calling

Tcl_SetObjResult( interp, Tcl_NewStringObj( newString, -1 ) );

and the iso8859-1 machine displays it as the two characters 'Ã©'

I'm trying to replace Tcl_GetString... with Tcl_GetByteArrayFromObj (and
Tcl_NewByteArrayObj). The sequence of characters is correct, but there
is some extra stuff in it that breaks things.

Still working (and wasting time) on it


  -- Massimo







Re: UTF8 handling by Apache and Tcl

Posted by Georgios Petasis <pe...@apache.org>.
Dear Massimo,

My advice is to check "encoding system" in your code and act accordingly
(use encoding convertfrom or not).
This way, the code will work even when you cannot control the settings
Apache runs with.

Best,
George

On 25/9/2019 11:11, Massimo Manghi wrote:
>
>
> On 9/24/19 9:02 PM, Georgios Petasis wrote:
>> Dear Massimo,
>>
>> I have somewhat used utf-8. But never on the arguments passed to the 
>> url.
>> I think you should start debugging with a simple check on "encoding 
>> system" (just write a rivet script that prints the result).
>
> Actually I'm not sending them on the URL, ::rivet::escape_string is 
> escaping those characters, and I'm happy with it because browsers are 
> smart enough to understand the escaping and display the right symbol.
>>
>> Ideally, you must get the same result on both systems (although I 
>> doubt).
>> If encoding system within rivet returns utf-8, probably you can use 
>> the parameters unconverted. If it returns any other encoding,
>> encoding convertfrom utf-8 may be needed.
>>
>
> well, this explains everything, the command on the system stock apache 
> instance (the failing one) returns iso8859-1 while the other returns 
> utf-8. I had been misled by the locale of the user running Apache
>
> # sudo --user www-data locale
> LANG=en_US.UTF-8
> LANGUAGE=en_US:en
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=
>
> but looking harder in the configuration I found
>
> ## The locale used by some modules like mod_dav
> export LANG=C
> ## Uncomment the following line to use the system default locale instead:
> #. /etc/default/locale
> export LANG
>
> which changed what I assumed was the default
>
>  -- Massimo
>




Re: UTF8 handling by Apache and Tcl

Posted by Massimo Manghi <ma...@alice.it>.

On 9/24/19 9:02 PM, Georgios Petasis wrote:
> Dear Massimo,
> 
> I have somewhat used utf-8. But never on the arguments passed to the url.
> I think you should start debugging with a simple check on "encoding 
> system" (just write a rivet script that prints the result).

Actually I'm not sending them on the URL; ::rivet::escape_string is
escaping those characters, and I'm happy with it because browsers are
smart enough to understand the escaping and display the right symbol.
> 
> Ideally, you must get the same result on both systems (although I doubt).
> If encoding system within rivet returns utf-8, probably you can use the 
> parameters unconverted. If it returns any other encoding,
> encoding convertfrom utf-8 may be needed.
> 

Well, this explains everything: the command on the system's stock Apache
instance (the failing one) returns iso8859-1, while the other returns
utf-8. I had been misled by the locale of the user running Apache:

# sudo --user www-data locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

but looking harder in the configuration I found

## The locale used by some modules like mod_dav
export LANG=C
## Uncomment the following line to use the system default locale instead:
#. /etc/default/locale
export LANG

which changed what I assumed was the default

  -- Massimo



Re: UTF8 handling by Apache and Tcl

Posted by Georgios Petasis <pe...@apache.org>.
Dear Massimo,

I have used utf-8 to some extent, but never in the arguments passed in
the URL. I think you should start debugging with a simple check on
"encoding system" (just write a rivet script that prints the result).

Ideally, you should get the same result on both systems (although I
doubt it). If encoding system within rivet returns utf-8, you can
probably use the parameters unconverted. If it returns any other
encoding, encoding convertfrom utf-8 may be needed.

Best,

George

On 23/9/2019 03:17, Massimo Manghi wrote:
> Does anyone have experience with the Apache handling of UTF8 strings?
>
> I have two Apache instances running the same code and having largely 
> the same configuration but, while the first I built from source works 
> as expected the latter tampers with URLs where arguments are strings 
> having UTF8 encoded accented characters, like those widely used in 
> French and to a minor extent in Italian. Those characters go through 
> some sort of double UTF8 conversion because they can be forced to 
> display correctly if transformed calling 'encoding' beforehand
>
> set my_string [encoding convertfrom utf-8 $my_string]
>
> There is also some interaction with ::rivet::escape_string that fails 
> to recognize those characters as alphanumeric and escapes them, but 
> what bothers me is that I couldn't find an obvious explanation for the 
> 2 Apache instances behaving in a different way (AddDefaultCharset is 
> left unset in both configurations, Apache2 versions are different but 
> very close to each other)
>
>
>  -- Massimo
>
>
>

