Posted to users@httpd.apache.org by Chris Biggs <ch...@aptus.co.uk> on 2009/10/07 10:20:01 UTC

[users@httpd] Using SSI to include a UTF-8 encoded file causes a strange character to be sent to the browser

Hi,

Environment:

    Apache 2.2.13
    Windows XP SP3

Background:

    My intention is to use SSI to include one html file in another. The text in both can be 
    either English or Russian - or a combination. So I used UTF-8 encoding for the files as 
    it seemed to be the right thing to do.

Problem:

    The two HTML files that can be used to demonstrate this problem are as follows:

       outer.shtml:

          <html>
          <body>
          <table border="1"><tr><td nowrap><!--#include virtual="/en_US/Pages/utf8.utf8" --></td></tr></table>
          </body>
          </html>

       utf8.utf8:
 
          <table><tr><td>xxxxxs</td></tr></table>

    As can be seen outer.shtml includes the file utf8.utf8.

    When these files are saved as "ANSI" (using Notepad), with the included file named "something.htm",
    they display correctly in a browser, as you would expect. However, when I save them as UTF-8 
    (again using Notepad), a strange character is sent to the browser just before the 
    start of the included file. I have attached the file received by the browser (see outer[1].txt
    in the attached zip file). As you may be able to see, there is an "odd" character between 
    "<td nowrap>" and "<table>". This has the effect of pushing the inner table down a line, which 
    is not wanted.

I have modified the httpd.conf only to allow me to have 4 virtual hosts, enable SSI and, when I 
noticed the problem, I also uncommented the line:

    Include conf/extra/httpd-languages.conf

As part of my attempt to fix the problem, I have also added .shtml to a line in the above file 
as follows:

    AddCharset UTF-8   .utf8 .shtml

However, nothing seems to stop this odd character being presented to the browser.

If you could provide some help or guidance to solve this problem - or perhaps suggest whether 
this is a bug and should be logged, I would be grateful.


Many Thanks for your attention thus far.
Regards,
Christopher Biggs

Re: [users@httpd] Using SSI to include a UTF-8 encoded file causes a strange character to be sent to the browser

Posted by Jan Ingvoldstad <fr...@gmail.com>.
On Wed, Oct 7, 2009 at 10:55 AM, André Warnier <aw...@ice-sa.com> wrote:

> 1) *don't use Notepad to edit HTML pages*.  Use a real editor, properly
> aware of character sets and encodings, and which will highlight incorrect
> UTF-8 characters.
> Notepad has a big problem when saving UTF-8 encoded files : it writes a
> "BOM" at the beginning of the file, which is not only totally unnecessary
> for UTF-8, but also confuses other programs.
> A BOM is a sequence of 2 or 3 bytes, meant in some cases to indicate the
> "byte order" of the file that follows.
>

Just for the sake of information, DreamWeaver MX also pulls this nice stunt,
_also when editing PHP files_.

This can lead to annoying problems when a PHP script tries to modify the
HTTP headers, since the headers will already have been written, and
(depending on PHP solution; mod_php, fastcgi, suphp etc.) will produce nasty
errors.

When you open a file with a BOM in a UTF-8 aware editor, the BOM is hidden.
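
Because the editor hides it, looking at the raw bytes is a more reliable
check. A minimal Python sketch, assuming the file is UTF-8 and that the
UTF-8 BOM is the three bytes EF BB BF; the function name has_utf8_bom is
made up for this illustration:

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM (EF BB BF)."""
    # Open in binary mode so that no decoding layer can hide the BOM.
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

Calling has_utf8_bom("utf8.utf8") on the included file from the original
post should show whether Notepad added a BOM when saving it.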

Software producing BOM makes things go BOOM.
-- 
Jan

Re: [users@httpd] Using SSI to include a UTF-8 encoded file causes a strange character to be sent to the browser

Posted by André Warnier <aw...@ice-sa.com>.
Hi.

Chris Biggs wrote:
...
>     When these files are saved as "ANSI" (using Notepad) 
(or rather in this case, as UTF-8)

Tips :
1) *don't use Notepad to edit HTML pages*.  Use a real editor, properly 
aware of character sets and encodings, and which will highlight 
incorrect UTF-8 characters.
Notepad has a big problem when saving UTF-8 encoded files : it writes a 
"BOM" at the beginning of the file, which is not only totally 
unnecessary for UTF-8, but also confuses other programs.
A BOM is a sequence of 2 or 3 bytes, meant in some cases to indicate the 
"byte order" of the file that follows.
For UTF-8, there is only one valid byte order, so the BOM is not 
necessary and could/should be ignored.
However, when such a file with a BOM prefix is being included by some 
software in the middle of another file (as you do with SSI), it usually 
causes the kind of problem you are seeing : "bizarre" characters in the 
middle.
2) use a proper <meta http-equiv="Content-Type" content="text/html; 
charset=UTF-8" /> in the <head> section of your html files.  That should 
tell the browser what the encoding of the page is.
3) But this is really only a substitute for the real standard-conformant 
way of indicating the encoding to the browser : the webserver should 
send, with each html page, a HTTP header like :
Content-type: text/html; charset=UTF-8
Unfortunately, MS's IE (all versions and sub-versions) has a long 
history of ignoring or misinterpreting this part of the HTTP RFC, and 
deciding for itself what content the document has.
This is *wrong*, but unfortunately also, in the real world IE is much 
used, so one has to learn to work around this.
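
Regarding tip 1: if re-saving the file in another editor is not practical,
the BOM can also be removed with a small script. The following is only a
sketch in Python, not something Apache or SSI provides; the function names
strip_utf8_bom and strip_bom_in_place are invented for this example, and it
assumes the file really is UTF-8.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_place(path: str) -> bool:
    """Rewrite the file without its BOM. Return True if a BOM was removed."""
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_utf8_bom(data)
    if cleaned == data:
        return False  # no BOM found, leave the file untouched
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

For example, strip_bom_in_place("utf8.utf8") would rewrite the included
file from the original post, so that the bytes inserted after
"<td nowrap>" start directly with "<table>".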


---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@httpd.apache.org
   "   from the digest: users-digest-unsubscribe@httpd.apache.org
For additional commands, e-mail: users-help@httpd.apache.org