Posted to users@httpd.apache.org by greg wm <ap...@nvpf.org> on 2005/08/21 08:11:53 UTC
[users@httpd] Re: meta http-equiv useless??
greg wm wrote:
> i used wget to copy the entire http://nonviolentpeaceforce.org site to
> http://nvpf.org/np. the former is in m$ asp, the latter captured as html.
>
> for example, http://nonviolentpeaceforce.org/spanish/welcome.asp was
> captured to http://nvpf.org/np/spanish/welcome.asp.html
>
> as you can see, the capture is mostly fine, including spanish characters
> in the text (eg año), however the spanish characters in the menus didn't
> do quite so well (eg Misi?n)
fixed! see below.
> in the file año appears as año which is apparently "good", but
> Misi?n appears as Misión, which is apparently "bad".
>
> first question: why is that bad?
>
> if i tell galeon, instead of automatic encoding, use western iso-8859-1,
> then, presto, the page appears nicely. but i don't have to do that to
> see the original, nor do i have to do that for anybody else's pages, and
> of course i can't expect our audience to go and fiddle with that in
> their browsers.
>
> second question: why doesn't the meta http-equiv header do anything?
>
> right after the title the file says <meta http-equiv="Content-Type"
> content="text/html; charset=iso-8859-1">. why isn't that good enough?
> why does it make no difference at all what i change it to? i tried
> utf-8, Utf-8, UTF-8, Windows-1252, none have any effect tho i can see
> them if i tell my browser to view source.
overridden by apache's http headers, apparently. see below.
> fourth question: can wget be tweaked to do better?
>
> i think those menus were rendered out of some .asp database or
> whatever, differently than the rest of the text of the page. but so
> what? why didn't wget capture something identical to what my
> browser shows?
>
> the command i ran was
> wget -ENKkrl19 -nH -w2 -owget.log http://nonviolentpeaceforce.org
my locale is en_IE.UTF-8, so why did wget save in latin-1 format?
the wget manual page mentions nothing at all about character sets.
> well whatever, thunk i, no problem, i'll just find and replace. well
> ha. i haven't yet managed to craft sed to capture the buggers! it's
> all making me feel dang defeated..
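The "Misi?n" symptom can be reproduced in a few lines. This is a minimal sketch, assuming the captured file stores the accented letter as a single ISO-8859-1 byte while the browser is told to decode UTF-8; the sample word and the function name `decode_both_ways` are illustrative, not taken from the actual file.

```python
def decode_both_ways(raw: bytes):
    """Decode the same bytes as ISO-8859-1 and as UTF-8 (with replacement)."""
    as_latin1 = raw.decode("iso-8859-1")
    # Bytes that are not valid UTF-8 become U+FFFD, which many
    # browsers draw as "?" or a boxed question mark.
    as_utf8 = raw.decode("utf-8", errors="replace")
    return as_latin1, as_utf8


# Hypothetical sample: "Misi\u00f3n" (o with acute accent) stored as latin-1.
raw = "Misi\u00f3n".encode("iso-8859-1")   # the accented o is the single byte 0xF3
as_latin1, as_utf8 = decode_both_ways(raw)
```

Here `as_latin1` is the intended word, while `as_utf8` has U+FFFD where the accented letter should be. A lone 0xF3 byte is not a valid UTF-8 sequence, which would also explain why a pattern typed in a UTF-8 locale does not match it.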
Brian Foster wrote:
> there are other alternatives. e.g., convert the page
> (file) to UTF-8 (e.g., using iconv(1)), being sure to
> change the meta charset setting to utf-8.
>
> finally, vim(1): vim confuses things here (I am _not_
> trying to start an editor war!). vim guesses what the
> file's charset is, and adjusts accordingly so that you
> can view/edit it in a locale using a different charset.
> hence, a lot of things that cat displays as rubbish
> display Ok in vim. if, in vim, you do a ":set" command,
> you'll probably see an entry like "fileencoding=utf-8".
> that means vim thinks the file is UTF-8. (probably,
> for the ISO-8859-1 files, it says "fileencoding=latin1";
> Latin1 is a common informal(?) name for ISO-8859-1.)
> ":help fileencoding" for more information.
thank you brian! perhaps iconv might have done the trick, anyway i used
vim. vim :se fileencoding revealed that wget saved the files in
latin-1, and :se fileencoding=utf-8 for each file cleaned up the mess.
wasn't even a big job after using :map such that each file was fixed
with a single keystroke.
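For a whole tree of captured files, the same latin-1 to UTF-8 conversion can also be scripted. This is a minimal sketch, assuming every file really is ISO-8859-1 and has not been converted yet; the function names and the `*.html` pattern are made up for illustration.

```python
import re
from pathlib import Path


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 and update the meta charset."""
    # Every byte value is a valid ISO-8859-1 character, so this never fails.
    text = data.decode("iso-8859-1")
    # Keep the meta declaration in step with the new encoding.
    text = re.sub(r"charset=iso-8859-1", "charset=utf-8", text, flags=re.IGNORECASE)
    return text.encode("utf-8")


def convert_tree(root: str) -> None:
    """Rewrite every *.html file under root in place (hypothetical layout)."""
    for path in Path(root).rglob("*.html"):
        path.write_bytes(latin1_to_utf8(path.read_bytes()))
```

Running this twice on the same file would double-encode the accented characters, so it should only be applied once, to files known to be latin-1.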
William A. Rowe, Jr. wrote:
> What happens if you remove the defaultcharset entirely; have Apache
> provide no hinting at the encoding; does the browser respect the meta
> tag?
>
> The http headers are authoritative, and override any metadata. If you
> would rather control your encoding with meta tags, turn off charsets entirely.
that is probably the winning answer. i already applied the above
solution so i dunno for sure, but look..
wget --save-headers from the original m$ .asp server:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Sat, 20 Aug 2005 21:18:55 GMT
Connection: keep-alive
Connection: Keep-Alive
Content-Length: 11003
Content-Type: text/html
Set-Cookie: ASPSESSIONIDAQQBRDRA=KNIPOCMDJPKMMANLNHFMKMGH; path=/
Cache-control: private
wget --save-headers from my apache server:
HTTP/1.1 200 OK
Date: Sun, 21 Aug 2005 04:10:34 GMT
Server: Apache/2.0.52 (CentOS)
Last-Modified: Sun, 21 Aug 2005 01:34:43 GMT
ETag: "260261-2b33-9134b2c0"
Accept-Ranges: bytes
Content-Length: 11059
Connection: close
Content-Type: text/html; charset=UTF-8
now i wouldn't have thought that the following httpd.conf directive
would result in overriding the meta http-equiv headers, but, there does
seem to be a strong odor..
# Specify a default charset for all pages sent out. This is
# always a good idea and opens the door for future internationalisation
# of your web site, should you ever want it. Specifying it as
# a default does little harm; as the standard dictates that a page
# is in iso-8859-1 (latin1) unless specified otherwise i.e. you
# are merely stating the obvious. There are also some security
# reasons in browsers, related to javascript and URL parsing
# which encourage you to always set a default char set.
#
AddDefaultCharset UTF-8
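The precedence at work here can be modelled in a few lines. This is a simplified sketch of the rule "the HTTP header wins, the meta tag is only a fallback"; real browsers do more (byte-order marks, user overrides, heuristics), and the function names are hypothetical.

```python
import re
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Pull the charset parameter out of a Content-Type value, if any."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)', value, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def effective_charset(http_content_type: str, html: bytes) -> Optional[str]:
    """Simplified model: use the header's charset, else the meta tag's."""
    from_header = charset_from_content_type(http_content_type)
    if from_header:
        return from_header
    # The meta declaration is plain ASCII, so a latin-1 decode is safe for scanning.
    head = html[:1024].decode("iso-8859-1")
    meta = re.search(
        r"""<meta[^>]+content\s*=\s*["']([^"']*)["']""", head, flags=re.IGNORECASE
    )
    return charset_from_content_type(meta.group(1)) if meta else None
```

Fed the two Content-Type values from the header dumps above, together with a page carrying the iso-8859-1 meta tag, this model yields utf-8 for the apache response and iso-8859-1 for the IIS one, which sent no charset at all.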
greg
> Greg Whitley Mott
> IT Coordinator
> NonviolentPeaceforce.org