Posted to users@httpd.apache.org by ej...@seznam.cz on 2009/07/01 00:02:22 UTC

[users@httpd] Wrong charset convert

I have installed Apache 2.2.11 with PHP 5.2.8 on Windows XP SP3. 
Windows uses the Windows-1250 charset (Czech localization). I want 
to install the MediaWiki software, which uses the utf-8 charset.

When I upload a file with non-English characters in its name, the 
name is saved in utf-8 format. When I try to open such a file in a 
web browser, the server returns a 404 Not Found status.

Example:

Upload a file using a simple HTML upload form, which is encoded in 
utf-8:

<!-- this is only part of the whole code -->
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
</head>
<body>

<form enctype="multipart/form-data" action="uploader.php" method="POST">
<input type="hidden" name="MAX_FILE_SIZE" value="100000" />
Choose a file to upload: <input name="uploadedfile" type="file" /><br />
<input type="submit" value="Upload File" />

</form>
</body>
</html>

A file named, for example, "složka.png" is saved to the hard drive 
under the name "sloĹľka.png" (as displayed in Windows-1250 encoding). 
If the upload form were encoded with charset=Windows-1250, the file 
would be named correctly, "složka.png", but the charset must be utf-8.
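
The renaming described above can be reproduced outside PHP. This is only a minimal sketch in Python (not part of the original setup; the variable names are illustrative), assuming the browser sent the name as UTF-8 bytes and something on the server read those bytes as Windows-1250:

```python
# What the browser sent: the file name encoded as UTF-8 bytes.
sent_bytes = "složka.png".encode("utf-8")

# What a Windows-1250 reader makes of exactly the same bytes.
seen_as_cp1250 = sent_bytes.decode("cp1250")

print(sent_bytes)      # b'slo\xc5\xbeka.png'
print(seen_as_cp1250)  # sloĹľka.png
```

The two bytes C5 BE are one character ("ž") in UTF-8, but two characters ("Ĺ" and "ľ") in Windows-1250.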

So suppose that we have a server with the uploaded file: 
http://something.com/složka.png. On Linux it works fine. But on a 
Windows server you must use an address like 
http://something.com/sloĹľka.png, and that's no good for MediaWiki.

I don't know if this is clear enough. I need to set up Apache to 
ignore the windows-1250 charset and use the original utf-8 for 
decoding the URL. httpd.conf is the original one (with the PHP 
installation).

Thanks for help
Jiri Eichler



Re: [users@httpd] Wrong charset convert SOLVED

Posted by André Warnier <aw...@ice-sa.com>.
Jiří Eichler wrote:
> I didn't program MediaWiki, but on Wikipedia it seems to work 
> well. I just realized that we haven't solved the charset problem; I 
> have only changed the charset sent by PHP. You're right about the 
> "double encoding" to utf-8: Apache/PHP thinks the name is in some 
> other charset and encodes it once more. But how can we tell PHP that 
> it is already in utf-8? I don't know. :-D    Can it be called a bug 
> when there is no way to detect the charset of an uploaded filename?
> 
Well...
One basic problem is that there are "holes" in the HTTP 1.x 
specification, at least when considering the multi-lingual, increasingly 
Unicode-centric world in which we are living.
The next problem is that browsers do not always respect even the HTTP 
1.x specification.
The third problem is that not all browsers fail to respect it in the 
same way (but they are getting better at this).
The next issue is that, the WWW being like it is, with clients that the 
server does not control, you can never be sure of anything.
The next issue is that programming languages like PHP, do not 
necessarily offer very good tools to "mark" a string as being in any 
particular encoding.
Another issue is that it is relatively easy to check if a received text 
is valid UTF-8; but it is very hard to check if a received text is valid 
iso-8859-1 or iso-8859-2 or cp-1250, or any of the 8-bit character sets; 
and it is even harder to find out which one of them it is.

And one overall issue, is that it is not always easy to change any of 
the above, without suddenly breaking many WWW applications.

Taking all the above into account however, there are some things which 
you can do in your applications, to minimise the consequences.

One first thing is to be correct, consistent, and precise in what you 
send to the browser.
("Be strict in what you send, and tolerant in what you receive")

So if you have chosen Unicode/UTF-8 for your basic charset and encoding 
(the best choice nowadays), make sure that :
- each time your server sends some text page to the client, there is a 
proper "Content-type: xxx/yyyy; charset=utf-8" HTTP header with the 
response (see *1 below)
- each time your server sends some HTML or XML page to the client, make 
sure that it has an explicit charset declaration inside
- always verify that your pages *are* encoded in UTF-8, and that 
nobody has been editing them with an old editor which knows 
only iso-latin-2 or cp-1250.
- when you send a <form> to the client, specify the
accept-charset="utf-8" attribute in the <form> tag
- when you send a <form> to the client, which will be later submitted 
back, include some
<input name="test-encoding" type="hidden" value="xxxxxxxxxx">
where "xxxxxxxxxx" is a valid UTF-8 string containing non US-ASCII 
characters.
Then, in the script that receives the data from this form, test this 
parameter, to see if what you received is indeed UTF-8 or not.
The way to do that varies depending on the programming language.
(Maybe you can compare the length in bytes and/or the length in 
characters, or compare it with an internal identical string known to be 
UTF-8.)
- be "defensive" in your cgi-bin scripts. Everything you receive from 
the client is suspect.
- never forget that on the WWW, "the client is king". The user /can/ 
change the charset of his browser, no matter what the server tells it.
(Firefox 3.1 : View..Character encoding; IE 7 : same)
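
The hidden "test-encoding" field suggested in the list above could be checked along these lines. This is a hedged sketch in Python rather than PHP; the probe string and the function name are invented for the example, and it assumes the receiving script can get at the raw bytes of the field:

```python
# Hypothetical probe value; any string with non US-ASCII characters would do.
PROBE = "žluťoučký kůň"

def form_was_utf8(received: bytes) -> bool:
    """Compare the raw bytes of the hidden field with the UTF-8 form of the probe."""
    return received == PROBE.encode("utf-8")

print(form_was_utf8(PROBE.encode("utf-8")))   # True: browser submitted UTF-8
print(form_was_utf8(PROBE.encode("cp1250")))  # False: browser submitted cp-1250
```

Comparing against a known string is simpler than trying to guess the charset of arbitrary input, because the expected bytes are known in advance.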



(*1) :
when I use your PHP upload page, the response page that I get from your 
server has these HTTP headers :
HTTP/1.1 200 OK
Date: Wed, 01 Jul 2009 19:44:31 GMT
Server: Apache/2.2.11 (Win32) PHP/5.2.8
X-Powered-By: PHP/5.2.8
Content-Length: 716
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=windows-1250


However, the html page itself contains :
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

That is /not/ consistent.
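
This kind of mismatch is easy to check mechanically. A small Python sketch follows; the helper is illustrative only and uses a deliberately simple regular expression, not a full Content-Type parser:

```python
import re

def charset_of(content_type: str) -> str:
    """Extract the charset parameter from a Content-Type value, lowercased."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else ""

header = "text/html; charset=windows-1250"  # from the HTTP response header
meta = "text/html;charset=utf-8"            # from the <meta> tag in the page

print(charset_of(header))                      # windows-1250
print(charset_of(meta))                        # utf-8
print(charset_of(header) == charset_of(meta))  # False
```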

On the other hand, the index page received after you click on the /data 
link, has the following HTTP headers :

HTTP/1.1 200 OK
Date: Wed, 01 Jul 2009 19:54:01 GMT
Server: Apache/2.2.11 (Win32) PHP/5.2.8
Content-Length: 264
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: text/html;charset=UTF-8



---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@httpd.apache.org
   "   from the digest: users-digest-unsubscribe@httpd.apache.org
For additional commands, e-mail: users-help@httpd.apache.org


Re: [users@httpd] Wrong charset convert SOLVED

Posted by Jiří Eichler <ej...@seznam.cz>.
I didn't program MediaWiki, but on Wikipedia it seems to work 
well. I just realized that we haven't solved the charset problem; I 
have only changed the charset sent by PHP. You're right about the 
"double encoding" to utf-8: Apache/PHP thinks the name is in some 
other charset and encodes it once more. But how can we tell PHP that 
it is already in utf-8? I don't know. :-D    Can it be called a bug 
when there is no way to detect the charset of an uploaded filename?



Re: [users@httpd] Wrong charset convert SOLVED

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.
Jiří Eichler wrote:
> OK, I added "AddDefaultCharset off" to httpd.conf, and the charset spec
> in the header disappeared in both cases. That must be enough for now.
> As regards file uploads, I really want to use utf-8 for multilingual
> support, and I believe that it is technically possible even on Windows.
> The server just needs to be configured to know that the received
> filename is in utf-8. Windows 'reputedly' uses something like Unicode.
> Sometimes I wonder why there are so many problems with charsets. Maybe
> it is because it is hard to recognize which charset a text is written in.

All filesystem URIs for Apache HTTPD on Windows are in utf-8.

But all URIs should be given spelled out in %XX form, so the encoding
itself on the page doesn't matter.
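
As an illustration of the %XX form, here is a short Python sketch using the standard library (the path is the example file name from this thread):

```python
from urllib.parse import quote, unquote

path = "/složka.png"
encoded = quote(path)    # quote() percent-encodes non-ASCII characters as UTF-8 by default

print(encoded)           # /slo%C5%BEka.png
print(unquote(encoded))  # /složka.png
```

Once the link is written as "/slo%C5%BEka.png", it consists of plain ASCII, so the charset of the page that carries it no longer affects it.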

It is so much of a problem because users and developers think only in terms of their own region.



Re: [users@httpd] Wrong charset convert SOLVED

Posted by Jiří Eichler <ej...@seznam.cz>.
OK, I added "AddDefaultCharset off" to httpd.conf, and the charset spec 
in the header disappeared in both cases. That must be enough for now. 
As regards file uploads, I really want to use utf-8 for multilingual 
support, and I believe that it is technically possible even on Windows. 
The server just needs to be configured to know that the received 
filename is in utf-8. Windows 'reputedly' uses something like Unicode. 
Sometimes I wonder why there are so many problems with charsets. Maybe 
it is because it is hard to recognize which charset a text is written in.



Re: [users@httpd] Wrong charset convert SOLVED

Posted by André Warnier <aw...@ice-sa.com>.
Jiří Eichler wrote:
> Man you are incredible. 
I had to leave that part..
...
(I removed the part about the idiot however..)
...
> I didn't think that it was a problem with PHP; it ran well on Linux.

It runs under Linux well, probably /only/ because the locale of the 
process under which Apache + PHP is started, is a UTF-8 locale.
So by default, PHP is considering the filename string as UTF-8, and you 
do not see the problem.
But if you want to make this really portable, you should also make sure 
it always does it right under whatever OS and whatever locale.

Unfortunately, it is not easy, because the browser does not actually 
tell you in which character set it sends the filename.  So you have to 
"believe" that this is /your/ <form>, and that the browser does it 
correctly.

I still think that it is a bad idea to save the file under the original 
name given by the browser, for a number of reasons.
Let me give you a couple more reasons :

1) It is easy for a hacker, to create his own "HTTP agent" (browser).
He does not even have to create one, there are many programs available 
that do that.
This client could send you a file named
"myfile.txt > /etc/passwd"
or "file.txt ; rm -r /*"
Then you, on the server, use that filename in another command, like
system("cat " . $filename . " > myotherfile");
Got the idea ?
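
To make the danger concrete, here is a small Python sketch (Python instead of PHP, and nothing is actually executed). It only builds the command strings, once naively and once with the name quoted for a POSIX shell via shlex.quote:

```python
import shlex

hostile_name = "file.txt ; rm -r /*"

# Naive concatenation: the shell would see two commands.
unsafe_cmd = "cat " + hostile_name + " > myotherfile"

# Quoted: the whole name is passed to cat as one argument.
safer_cmd = "cat " + shlex.quote(hostile_name) + " > myotherfile"

print(unsafe_cmd)  # cat file.txt ; rm -r /* > myotherfile
print(safer_cmd)   # cat 'file.txt ; rm -r /*' > myotherfile
```

Quoting helps only for POSIX shells; not passing client-supplied names to a shell at all is the more robust choice.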

2) I once designed an application like that, for normal users, not 
hackers.  And they used it for a long time, without problems.
Then one day, years later, I had to move all these thousands of uploaded 
files to another system.  So, I used "tar" to create an archive of these 
files, to move them to the other system.
Unfortunately, tar was crashing about every 50 files, because it got a 
filename that it could not handle, like
"My grand-mother At the <Pizza Hut>. Near the place of John & Maria".png
(that being a simple case)
So it took me a lot of hours to move these files.
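
One way to avoid both problems is to store each upload under a name generated by the server, and to keep the original name only as data (in a database, for instance). A minimal Python sketch of that idea follows; the function name and the extension whitelist are assumptions made for this example:

```python
import uuid
from pathlib import PurePosixPath

def storage_name(original_name: str) -> str:
    """Build a server-side name from a random ID plus a whitelisted extension."""
    allowed = {".png", ".jpg", ".gif", ".txt"}  # hypothetical whitelist
    ext = PurePosixPath(original_name).suffix.lower()
    if ext not in allowed:
        ext = ".bin"
    return uuid.uuid4().hex + ext

print(storage_name('"My grand-mother At the <Pizza Hut>. Near the place of John & Maria".png'))
print(storage_name("file.txt ; rm -r /*"))
```

The stored names then contain only hexadecimal digits and a known extension, whatever the client sent.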







Re: [users@httpd] Wrong charset convert SOLVED

Posted by Jiří Eichler <ej...@seznam.cz>.
Man, you are incredible. Thank you. And I'm an idiot, because I searched 
for the mistake where it wasn't :) I just had to use:

$target_path = utf8_decode($target_path);

Maybe it will also be possible to change php.ini somehow; I'm going to 
look into it. I didn't think that it was a problem with PHP; it ran well 
on Linux. And adding one line to MediaWiki won't be a problem.

Thank you so much for your time.
Jiri Eichler



Re: [users@httpd] Wrong charset convert

Posted by André Warnier <aw...@ice-sa.com>.
Jiří Eichler wrote:
..
I just checked your on-line example.
I used Firefox 3.1, with the "HttpFox" add-on (recommended).
This shows exactly what the browser is sending to the server.
In this case, the form does a POST, in the "multipart/form-data" encoding.
I sent a small test file, which I created on my disk under Windows XP 
(German, so basically latin-1, not latin-2 like yours).
I used cut-and-paste from your email, to copy the filename.
The file name is thus the same as your example, but with a .txt extension.
In the Windows Explorer (not IE), the file name on my disk looks like :
složka.txt
(I used cut-and-paste again in the Explorer to copy this into this email).

I used your (nice) example form to send this file to your server, and 
traced it with HttpFox.
This is actually what the browser is sending, as part of that multipart 
POST :

-----------------------------20037128598723
Content-Disposition: form-data; name="uploadedfile"; filename="složka.txt"
Content-Type: text/plain

test složka

-----------------------------20037128598723--

Important : the browser is NOT sending this filename as a part of the 
URL.  It is sending it in the BODY of the POST request.
It is also not sending it encoded as "slo%C5%BEka". It /is/ sending the 
filename encoded as UTF-8.
That means, that if there is a translation going on here, it is NOT at 
the level of the upload URL.

Now the question is to know how your PHP script really interprets this 
filename.  As UTF-8 ?  How do you know for sure ?
(I am not a PHP specialist, at all)
I mean, precisely :
the PHP script "uploader.php", somehow "gets" the value of the parameter 
"uploadedfile" as a string, representing the filename that the browser 
uploaded.
In which encoding (in PHP) /is/ that string ? does PHP know that this is 
Unicode/UTF-8 ?

Or does /PHP/ (which runs under Apache, which runs under Windows, on a 
Windows system where the default charset is cp-1250) think that this 
string is encoded in cp-1250 ?

And then, when PHP writes this file to the disk, it encodes the filename 
/again/ into Unicode, and thus this time the "ž" (which is originally 2 
bytes, C5 BE, representing 1 Unicode character), now becomes 4 bytes 
representing the UTF-8 encoding of the two characters those bytes stand 
for in the wrong charset ("Å" and "¾" in latin-1; "Ĺ" and "ľ" in 
cp-1250) ....

... and then, PHP generates the index listing.  And, in this index page, 
it generates the "href" as
<a href="slo%c4%b9%c4%beka.txt">
which looks very much like it could be the Unicode/UTF-8 encoding of 
"sloĹľka.txt" (the file name mis-read as cp-1250), but not like the 
Unicode/UTF-8 encoding of "složka.txt" (which would be "slo%c5%beka.txt").
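
The suspected double encoding can be replayed in a few lines of Python. This is a sketch under the assumption that the misreading happened in cp-1250; it says nothing about where in PHP or Windows it actually occurs:

```python
from urllib.parse import quote, unquote

original = "složka.txt"
once = original.encode("utf-8")                # what the browser sent
twice = once.decode("cp1250").encode("utf-8")  # mis-read as cp-1250, then encoded again

print(len("ž".encode("utf-8")))                                   # 2
print(len("ž".encode("utf-8").decode("cp1250").encode("utf-8")))  # 4
print(quote(twice).lower())                                       # slo%c4%b9%c4%beka.txt
print(unquote("slo%c4%b9%c4%beka.txt"))                           # sloĹľka.txt
```

The third line reproduces the href from the index page, which supports the double-encoding explanation.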





Re: [users@httpd] Wrong charset convert

Posted by Jiří Eichler <ej...@seznam.cz>.
Another correction: the IE setting DOES have an effect, but it changes 
nothing; it is still not working. Sorry for the confusion.


Re: [users@httpd] Wrong charset convert

Posted by Jiří Eichler <ej...@seznam.cz>.
Sorry, the validator of course sends: GET /slo%C5%BEka.png HTTP/1.1
I selected the wrong TCP session, sorry :)


Re: [users@httpd] Wrong charset convert

Posted by Jiří Eichler <ej...@seznam.cz>.
All of them: Opera, IE, and even http://validator.w3.org, which sends: GET 
/check?uri=http%3A%2F%2Fsgo.happyforever.com%2Fslo%C5%BEka.png&charset=%28detect+automatically%29&doctype=Inline&group=0 
HTTP/1.1
The IE setting "send URLs as UTF-8" has no effect.



Re: [users@httpd] Wrong charset convert

Posted by André Warnier <aw...@ice-sa.com>.
Jiří Eichler wrote:
> Thank you André for perfect explanation. Web browser converts 'ž' to
> %C5%BE, which are two bytes,

Which web browser ?
And if it is IE, then is the "send URLs as UTF-8" box checked in 
"Internet options" ?
And what happens if you uncheck it ?

(Will read the rest of your post later)




Re: [users@httpd] Wrong charset convert

Posted by Jiří Eichler <ej...@seznam.cz>.
I understand you very well. I can try to look at the Apache source code 
and recompile it with some changes.
Otherwise, thank you for your time. I have been trying to resolve this 
for three days :-D

Have a nice day,
Jiri Eichler

André Warnier wrote:
> Jiří Eichler wrote:
> ...
> Hi.
> I do not know the answer precisely either.
> But I know enough to tell you that in such matters, you must be 
> /extremely/ careful in interpreting what is really going on, at each 
> level.
> Just as a stupid example : when you look at a log file, you must know 
> : - has the process that writes that logfile transformed the data into 
> some encoding already, when writing it to the logfile ?
> - is the editor which I am using aware of the logfile encoding ?
> etc...
> Because otherwise what you see, and what is really there, may be 
> different things.
>
> For example, I think I remember that, internally, in the Windows NTFS 
> filesystem, file names are stored as Unicode (not necessarily UTF-8, 
> it could also be UTF-16 or another Unicode encoding).
> (See for example here : http://www.ntfsrecovery.com/a-ntfs.php)
> But when you look at a directory through the Explorer, these internal 
> filenames /may/ get transformed according to your PC's codepage, just 
> to display it to you.
> So what you think you see, is not necessarily what is really there.
> Understand what I'm saying ?
>
> Just some elements :
> - Apache should not "translate" or "encode" the received URL, because 
> basically it does not know if this URL is in UTF-8, ASCII, or any 
> other encoding. There is no "flag" or "header" in a HTTP request, that 
> says in which encoding the "GET" line comes in. (e.g. it may also be 
> some Japanese or Chinese encoding).
> So it /must/ take it as bytes.
> - then Apache calls the OS to find the file.  There may, or may not, 
> be some translation there, I really don't know.  It may depend on what 
> API call the program uses to read the directory, and I don't know what 
> Apache uses.
> - it's the same for your C program.  I don't know if the OpenFile() 
> call interprets "name" as a pure byte sequence, or if it converts it 
> internally, or whatever.
> - and we don't know if Apache and your program use the same API calls.
>
> For example, in Java or Perl, there are different ways to open a file 
> and to read/write from it, some with encoding/decoding going on, some 
> not.  Unfortunately, I am incompetent in C and Windows API, so I don't 
> know in that case.
>
> Obviously something is happening somewhere, and obviously it happens 
> differently under Unix and under Windows.
>
> Under Unix/Linux, most of these things are influenced by the "locale" 
> under which the process is running.  Under Windows, it is usually the 
> whole system-wide "International settings" which count.
>
> I think we need an Apache/Windows developer here, to really tell us 
> what is going on.
>
>


Re: [users@httpd] Wrong charset convert

Posted by André Warnier <aw...@ice-sa.com>.
Jiří Eichler wrote:
...
Hi.
I do not know the precise answer either.
But I know enough to tell you that in such matters you must be
/extremely/ careful in interpreting what is really going on at each level.
As a simple example: when you look at a log file, you must know:
- has the process that writes that log file already transformed the data
into some encoding when writing it?
- is the editor you are using aware of the log file's encoding?
etc.
Otherwise, what you see and what is really there may be different
things.

For example, as far as I remember, the Windows NTFS filesystem stores
file names internally as Unicode, in 16-bit (UTF-16) form rather than
UTF-8.
(See for example here: http://www.ntfsrecovery.com/a-ntfs.php)
But when you look at a directory through Explorer, these internal
file names /may/ get transformed according to your PC's code page, just
to display them to you.
So what you think you see is not necessarily what is really there.
Do you see what I mean?

Some points to consider:
- Apache should not "translate" or "encode" the received URL, because
it does not know whether this URL is in UTF-8, ASCII, or any other
encoding. There is no "flag" or "header" in an HTTP request that says in
which encoding the "GET" line arrives (it could just as well be some
Japanese or Chinese encoding).
So it /must/ take it as bytes.
- Then Apache calls the OS to find the file. There may or may not be
some translation there; I really don't know. It may depend on which API
call the program uses to read the directory, and I don't know what
Apache uses.
- It's the same for your C program. I don't know whether the OpenFile()
call interprets "name" as a pure byte sequence or converts it
internally.
- And we don't know whether Apache and your program use the same API calls.
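
To illustrate the first point, here is a minimal Python sketch. The path
/slo%BEka.png is a made-up example (it does not appear in this thread),
and try_decode is a hypothetical helper. It shows that the same received
byte can be read as different characters, or as no valid text at all,
depending on which encoding one assumes:

```python
from urllib.parse import unquote_to_bytes

# Hypothetical request path with one escaped byte (an assumption for
# illustration, not a request taken from this thread).
path_bytes = unquote_to_bytes("/slo%BEka.png")


def try_decode(data: bytes, encoding: str) -> str:
    """Decode data, or report that the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return "<not valid " + encoding + ">"


# The same bytes, read under three different assumptions.
readings = {enc: try_decode(path_bytes, enc)
            for enc in ("iso-8859-2", "cp1250", "utf-8")}

for enc, text in readings.items():
    # ascii() keeps the output printable on any console.
    print(enc, ascii(text))
```

Nothing in the request says which of these readings the client meant,
which is why treating the path as plain bytes is the only safe choice.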

For example, in Java or Perl there are different ways to open a file
and to read from or write to it, some with encoding/decoding going on,
some not. Unfortunately, I have no expertise in C or the Windows API, so
I don't know what happens in that case.

Obviously something is happening somewhere, and obviously it happens
differently under Unix and under Windows.

Under Unix/Linux, most of these things are influenced by the "locale"
under which the process is running. Under Windows, it is usually the
system-wide "International settings" that count.

I think we need an Apache/Windows developer here to really tell us what
is going on.



Re: [users@httpd] Wrong charset convert

Posted by Jiří Eichler <ej...@seznam.cz>.
Thank you, André, for the perfect explanation. The web browser converts
'ž' to %C5%BE, which is two bytes, and this is what is sent to Apache:
GET /slo%C5%BEka.png HTTP/1.1. In its "Not found" message, Apache
translates it to /složka.png, which is probably right (ASCII). But it
seems really strange :) I don't think the OS changes the name during
upload; it has to save it in UTF-8 format, and it is saved correctly
with the bytes C5 BE ('ž'), even though for Windows that is of course
the wrong charset. I tried to open such a file from a C program:

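#include <windows.h>

/* "slo", then the UTF-8 bytes of 'ž' (C5 BE), then "ka.png" */
char name[] = {'s', 'l', 'o', (char)0xC5, (char)0xBE, 'k', 'a', '.', 'p', 'n', 'g', 0};
OFSTRUCT o;
HFILE f = OpenFile(name, &o, OF_READ);  /* returns HFILE_ERROR on failure */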

And it opened the file. Windows didn't change anything. Apache
receives a GET with exactly the same bytes as are in the file system on
the hard drive. I would expect Apache not to convert anything and simply
to call OpenFile or something similar.
When I try to load the file "/sloĹľka.png", Apache finds it. Apache
receives from the browser: GET /slo%C4%B9%C4%BEka.png HTTP/1.1, and that
works. This file name works with the Windows API OpenFile function too,
but only because Windows tries to convert it; it is not saved on the
hard drive this way.

Apache must be converting the received request somehow...
You wrote: "The webserver should take this path exactly as received, and
look for a file on disk whose name matches exactly that path, byte by byte."
If that were so, then it MUST work with the bytes C5 BE, since that
works with the Windows API and the Hexplorer view shows only C5 BE on
the hard drive, not C4 B9 C4 BE.
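
The byte values above are at least consistent with one explanation,
sketched here in Python. This is only an assumption about what happens,
not something confirmed in this thread: if the two UTF-8 bytes C5 BE are
somewhere read as Windows-1250 text and then written out as UTF-8 again,
the result is exactly the four bytes C4 B9 C4 BE:

```python
# U+017E is the letter z with caron; as UTF-8 it is the bytes C5 BE.
utf8_bytes = "\u017e".encode("utf-8")

# The same two bytes read as Windows-1250 text give two characters.
misread = utf8_bytes.decode("cp1250")

# Those two characters encoded as UTF-8 again give four bytes.
double_encoded = misread.encode("utf-8")

print(utf8_bytes.hex(), len(misread), double_encoded.hex())
```

So a request for /slo%C4%B9%C4%BEka.png names the two-character
misreading of the original name, which would explain why that URL finds
the file while /slo%C5%BEka.png does not.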

I hope the 'ž' character is displayed correctly, and sorry for my
English; I'm sure I made a lot of mistakes :-)




Re: [users@httpd] Wrong charset convert

Posted by André Warnier <aw...@ice-sa.com>.
ejirkae@seznam.cz wrote:
> This is that problem: http://sgo.happyforever.com/test.php
> (http://sgo.happyforever.com/test.php)
> Try it please, thanks.
> 
> ------------ Original message ------------
> From: <ej...@seznam.cz>
> Subject: [users@httpd] Wrong charset convert
> Date: 01.7.2009 00:03:06
> ---------------------------------------------
> I have installed Apache 2.2.11 with PHP 5.2.8 on Windows XP SP3. 
> Windows are using Windows-1250 charset (Czech localization). I want 
> to install MediaWiki software which uses utf-8 charset.
> 
> When I upload a file with non-english characters in its name, then 
> its name is saved in utf-8 format. When I try to open such file in 
> web browser it sends 404 not found status.
> 
> Example:
> 
> Upload a file by using simple html upload form, which is encoded in 
> utf-8:
> 
> <!-- this is only part of whole code --!>
> <html>
> <head>
> <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
> </head>
> <body>
> 
> <form enctype="multipart/form-data" action="uploader.php" method=
> "POST">
> <input type="hidden" name="MAX_FILE_SIZE" value="100000" />
> Choose a file to upload: <input name="uploadedfile" type="file" /><
> br />
> <input type="submit" value="Upload File" />
> 
> </form>
> </body>
> </html>
> 
> File named for example "složka.png" is saved to hard drive with name
> "sloĹľka.png" in Windows-1250 encoding.
(This is not true, see below)

> If that upload form was
> encoded with charset=Windows-1250 then it'll be right named "složka.
> png", but charset must be utf-8.
> 
> So suppose that we have server with uploaded file: http://something.
> com/složka.png. On linux it is working fine. But on Windows server 
> you must use address like that: http://something.com/sloĹľka.png and
> that's not good for MediaWiki.
> 
> I don't know if it's understandably enough, I need set up Apache to 
> ignore windows-1250 charset and use original utf-8 for decoding URL.
> httpd.conf is original (with php installation).
> 
> Thanks for help
> Jiri Eichler
> 
Jiri,
the issue you describe above is not an easy one.
It will only really be solved when the powers-that-be on the
Internet finally decide to move to an HTTP version 2.0 in which
everything is, by default, Unicode encoded as UTF-8.
Until then, there will be confusion and difficulties for anyone whose
main language is not English.

--- Part I -------

First, about your last paragraph:
Apache will not use UTF-8 to decode a URL, because that would be wrong
according to the RFCs that specify how the WWW works.
The "law" in that respect is defined here:
http://www.ietf.org/rfc/rfc2396.txt
See section 1.5, "URI Transcribability".

It is all a bit obscure, but basically it boils down to this:
when a server receives a URL:
- it first decodes the URL, converting the "percent-escaped" sequences
back into single bytes. That means, for instance, that a "%20" is
decoded into a space.
- then it does *no further decoding*; it takes the bytes *as they are*.
They are *not supposed* to be decoded any further, using iso-8859-1,
cp-1250, UTF-8 or whatever.
(If Apache did that, then Apache would not respect the RFC.)
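
The first step can be sketched in a few lines of Python. The function
name percent_decode and the sample path are my own, for illustration
only; this is not how Apache implements it. The point is that each %XX
becomes exactly one byte and the result stays a byte string:

```python
import re


def percent_decode(path: str) -> bytes:
    """Turn each %XX escape into the single byte 0xXX; leave the rest alone."""
    raw = path.encode("ascii")  # the request line itself is plain ASCII
    return re.sub(rb"%([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]),
                  raw)


decoded = percent_decode("/my%20files/slo%C5%BEka.png")
print(decoded)
```

Note that the function returns bytes and never calls .decode(): choosing
a character encoding would be the "further decoding" that the server is
not supposed to do.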

Now, let's say that this URL contains a path pointing to some resource,
which in this case is a file on disk.
Well then, the webserver should take this path exactly as received, and
look for a file on disk whose name matches exactly that path, byte by byte.

But between the webserver and the disk there is an operating system.
The webserver does not read the disk directly; it does so through the
OS I/O interface calls. So it is possible that when the webserver
looks for a file called "xyz123.html", the OS interface translates that
to "XYZ123.HTML", for example, and returns /that/ file.
That is the case on Windows, for example. For "xyz123.html", Windows
will return any file named "Xyz123.html", "xYz123.html",
"XYZ123.html", etc., because Windows is case-insensitive when looking
for files. If the webserver does not double-check this (some do),
it may thus return the wrong file.
The same kind of thing can happen with "diacritic" characters, such as
the one in your "složka.png".

--------- Part II -----------

Uploading files and writing them to disk.
This is a separate issue.

The script that handles the <form> used to upload the file
knows that the filename is Unicode, encoded as UTF-8.
(It knows that because you wrote the <form> and the script, and in your
<form> you have told the browser to send information in UTF-8.)

In the UTF-8 encoding, the filename "složka.png" consists of *10
characters*, but of *11 bytes*. That is because the "ž" in the middle
is encoded using 2 bytes in UTF-8.
If you look at this filename with an editor that understands UTF-8, you
will see it as "složka.png".
If you look at the same filename with an editor that does not
understand UTF-8 (or is set to iso-8859-2), you will see the same
string as something like "sloĹľka.png" (or something similar; I
have not really checked).
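
The counts and the "wrong editor" effect can be checked with a short
Python sketch (Python is used here only for illustration; the variable
names are my own):

```python
# "slozka.png" with U+017E (z with caron) as the fourth letter.
name = "slo\u017eka.png"
utf8_bytes = name.encode("utf-8")

char_count = len(name)        # number of characters
byte_count = len(utf8_bytes)  # number of bytes in UTF-8

# What a viewer shows if it assumes a single-byte Central European charset.
as_cp1250 = utf8_bytes.decode("cp1250")
as_latin2 = utf8_bytes.decode("iso-8859-2")

print(char_count, byte_count)
print(ascii(as_cp1250), ascii(as_latin2))
```

Under Windows-1250 the two bytes C5 BE show up as the pair "Ĺľ", as in
the name reported in this thread; under iso-8859-2 the second character
differs, because byte BE is "ž" in that charset.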

But back to your upload script.

It has the uploaded file name, in Unicode UTF-8, as "složka.png".
Now it wants to create this file on disk.
For that, it tells the OS: create file "složka.png".
The OS takes this file name and, depending on several conditions (**),
understands it literally either as a series of *bytes* (11 of
them) or as a series of *characters* (10 of them) in UTF-8 encoding.
And the OS, according to its understanding, creates a directory entry on
disk for this filename.
In your case, it creates an entry in the disk directory containing the
/bytes/ (or /characters/) "sloĹľka.png".

It does that because your script gets this wrong:
The script "knows" that this filename is encoded in UTF-8.
But the OS does not know that.
The script /should know/ how the OS is going to interpret the name, and
should, if needed, re-encode it in the proper encoding, so
that the OS understands it correctly and creates a file named "složka.png".
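
A minimal sketch of such a re-encoding step, in Python rather than PHP.
The function name reencode_filename is hypothetical, and the choice of
cp1250 as the target is an assumption based on the Czech Windows setup
described in this thread:

```python
def reencode_filename(utf8_name: bytes, target_encoding: str = "cp1250") -> bytes:
    """Convert a UTF-8 encoded file name to the encoding the OS expects.

    Raises UnicodeEncodeError when a character has no representation
    in the target encoding.
    """
    text = utf8_name.decode("utf-8")      # bytes -> characters
    return text.encode(target_encoding)   # characters -> OS bytes


# U+017E (z with caron) is the two bytes C5 BE in UTF-8.
uploaded = "slo\u017eka.png".encode("utf-8")
print(reencode_filename(uploaded))
```

The limitation is visible in the docstring: a name containing, say,
Japanese characters cannot be expressed in cp1250 at all, so this
approach only works for names that fit the server's code page.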

It is not that a file named "sloĹľka.png" is wrong. It is, in itself, a
perfectly valid filename.
But the problem is that, considering Part I above:
- your users are going to type a URL in the location bar of their browser
- for that, they are going to use the keyboard that they have, on their
workstation, with their OS and their browser, etc.
(for example, I could never type it, because I don't have a key for "ž"
on my keyboard; so I have to cut and paste from your email ;-))
- so they are going to type, for example:
http://yourhost.yourcompany.com/uploadedfiles/složka.png

- the browser is going to URL-encode that, probably replacing the "ž" by
a 3-character "percent-sequence" like %9E, its Windows-1250 code (or
even by two 3-character sequences, %C5%BE, if the browser thinks it must
encode the URL as UTF-8).
- the browser is then going to "send this URL" to Apache.
- Apache will receive this URL, decode the %-sequences into *bytes*, and
ask the OS for this file.
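
How the escaped form depends on the encoding the browser picks can be
sketched with Python's urllib.parse.quote. The helper name escape_path
is my own, and the three encodings are just the ones mentioned in this
thread:

```python
from urllib.parse import quote


def escape_path(path: str, encoding: str) -> str:
    """Percent-escape a URL path after encoding it with the given charset."""
    return quote(path, safe="/", encoding=encoding)


# U+017E is the letter z with caron.
path = "/uploadedfiles/slo\u017eka.png"

for enc in ("utf-8", "cp1250", "iso-8859-2"):
    print(enc, escape_path(path, enc))
```

The same typed URL thus reaches the server as %C5%BE, %9E or %BE, and
the server has no way of knowing which charset produced it.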

------ Part III ----

Now, IF the two translations match (the one that happened when you
uploaded the file, and the one that happens between the user and the
server disk), then the file will be found.
Otherwise, it will not be.

In your case, the two translations do not match.

----- Part IV : how to resolve this --------

My suggestion:
do /not/ allow users to decide under which name the file is really
stored on disk.
Create an "alias" for the filename, containing only US-ASCII characters,
and store the file under that name.
Then arrange that when users ask for the file "složka.png"
(this name appears, for example, on an index page that you create),
your webserver is in reality looking for the alias name. (*)
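
One possible shape for this, as a Python sketch. The names make_alias
and index_link, the use of a random UUID, and the /files/ prefix are all
assumptions for illustration; any scheme that yields unique ASCII-only
names would do:

```python
import html
import uuid


def make_alias(original_name: str) -> str:
    """Return an ASCII-only storage name, keeping only a plain ASCII extension."""
    ext = ""
    if "." in original_name:
        candidate = original_name.rsplit(".", 1)[1].lower()
        if candidate.isascii() and candidate.isalnum():
            ext = "." + candidate
    return uuid.uuid4().hex + ext


def index_link(alias: str, display_name: str) -> str:
    """Build the index-page link: ASCII alias in the URL, real name as text."""
    return '<a href="/files/{}">{}</a>'.format(alias, html.escape(display_name))


alias = make_alias("slo\u017eka.png")
print(alias)
```

The original name is then kept only as display text (in a database or
on the index page), so it never has to survive a round trip through the
URL and the file system.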

This is the only way to make your application really portable, because
in the end, on the WWW, you never know who or where the user is, what
his workstation is, what his OS is, etc.
So a user could upload a file under a name that gives you a lot of
trouble on your server (as you have already discovered, though not
entirely).
For example, one user could upload a file named "složka.png", and
another user could upload a file called "Složka.png". If your server
runs Windows, and if you are not careful, the second file will overwrite
the first.
There are many other such problematic cases.
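
If user-supplied names are kept anyway, an upload script can at least
check for this particular clash before writing. The sketch below uses
Python's str.casefold() as an approximation of case-insensitive
matching; that it agrees with Windows' own rules in every case is an
assumption, and find_case_collisions is a hypothetical helper:

```python
def find_case_collisions(names):
    """Return pairs of names that differ only by letter case."""
    seen = {}        # case-folded name -> first name seen with that folding
    collisions = []
    for name in names:
        key = name.casefold()
        if key in seen:
            collisions.append((seen[key], name))
        else:
            seen[key] = name
    return collisions


# U+017E is the letter z with caron.
uploads = ["slo\u017eka.png", "Slo\u017eka.png", "other.png"]
print(len(find_case_collisions(uploads)))
```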

And if MediaWiki does not do that, then MediaWiki is not a portable
application, sorry. The problem is not the webserver; the problem is
the application.

(and, in part, HTTP 1.x)

(*) for example, you show an index page like:
<a href="/files/20090630-180667-123456.png">složka.png</a>

(**) which can be, for example, the "locale" under which the Apache
process is running.



Re: [users@httpd] Wrong charset convert

Posted by ej...@seznam.cz.
Here is a demonstration of the problem:
http://sgo.happyforever.com/test.php
Please try it, thanks.

------------ Original message ------------
From: <ej...@seznam.cz>
Subject: [users@httpd] Wrong charset convert
Date: 01.7.2009 00:03:06
---------------------------------------------
I have installed Apache 2.2.11 with PHP 5.2.8 on Windows XP SP3.
Windows is using the Windows-1250 charset (Czech localization). I want
to install the MediaWiki software, which uses the UTF-8 charset.

When I upload a file with non-English characters in its name, the
name is saved in UTF-8 format. When I try to open such a file in a
web browser, the server returns a 404 Not Found status.

Example:

Upload a file using a simple HTML upload form, which is encoded in
UTF-8:

<!-- this is only part of the whole code -->
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
</head>
<body>

<form enctype="multipart/form-data" action="uploader.php" method="POST">
<input type="hidden" name="MAX_FILE_SIZE" value="100000" />
Choose a file to upload: <input name="uploadedfile" type="file" /><br />
<input type="submit" value="Upload File" />

</form>
</body>
</html>

A file named, for example, "složka.png" is saved to the hard drive under
the name "sloĹľka.png" in Windows-1250 encoding. If the upload form were
encoded with charset=Windows-1250, the file would be correctly named
"složka.png", but the charset must be UTF-8.

So suppose that we have a server with the uploaded file at
http://something.com/složka.png. On Linux this works fine. But on a
Windows server you must use an address like
http://something.com/sloĹľka.png, and that's no good for MediaWiki.

I don't know whether this is understandable enough. I need to set up
Apache to ignore the Windows-1250 charset and use the original UTF-8 for
decoding the URL. httpd.conf is the original one (with the PHP
installation).

Thanks for help
Jiri Eichler