Posted to dev@httpd.apache.org by Tony Sanders <sa...@bsdi.com> on 1995/10/02 19:13:42 UTC

The Encoding Problem (was # in file names...)

> > > > [# in a directory index]

For reference I've included my encoding functions below.  They are
in perl but are obvious enough that I think anyone can read them.


Now, some information, history and why ';' is also a reserved character...

WARNING: encode_attribute is tricky.  As the spec says, you must
use SGML entities for escaping markup inside markup attributes
(this is how SGML works -- unless you specify your own escaping
scheme of course, like % does for URLs) but as you might guess,
many browsers get this wrong (including W3's own browsers: www
linemode and arena).  Netscape and Mosaic however do work correctly
in this respect.

There is a serious problem with the current state of things.

Example -- a simple form fragment:
	<INPUT NAME="type" VALUE="foo">
	<INPUT NAME="amp" VALUE="bar">
Now -- let's look at the URL:
	http://whatever/myform?type=foo&amp=bar
And then the user cuts and pastes this URL into a hypertext link:
        <A HREF="http://whatever/myform?type=foo&amp=bar">
And guess what!!!

Netscape and Mosaic will correctly (according to the spec) treat
``&amp'' as an entity reference and send the server something you
didn't quite expect:
	GET /myform?type=foo&=bar HTTP/1.0

Yikes!

And in this case you cannot escape the & with %26 either because that
hides the & from the form processing software on the server.

So you simply *must* use ``&amp;'' here to escape the ``&'' in the
form query if you store it as a URL in an HREF (because HTML doesn't
define any alternate encoding besides % and, as I've shown, that
doesn't work in this case).
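
To make that concrete, here is a minimal sketch (not part of the
functions at the end of this message).  Both escape_href and
browser_decode are hypothetical helpers made up for this example, and
browser_decode only approximates what a spec-following browser does,
handling nothing but the ``amp'' entity:

```perl
use strict;
use warnings;

# Hypothetical helper: escape a URL for use inside HREF="...".
# Only & is handled here; a real encoder must cover more characters.
sub escape_href {
    my ($url) = @_;
    $url =~ s/&/&amp;/g;
    return $url;
}

# Hypothetical helper: approximate how a spec-following browser reads
# the attribute value back.  An entity reference ends at ';' or at the
# first character that cannot be part of an entity name.
sub browser_decode {
    my ($attr) = @_;
    $attr =~ s/&amp(?:;|(?![A-Za-z0-9.-]))/&/g;
    return $attr;
}

my $url = 'http://whatever/myform?type=foo&amp=bar';

# Pasted raw into the HREF, the field name "amp" gets eaten:
print browser_decode($url), "\n";

# Escaped first, the original URL survives the round trip:
print escape_href($url), "\n";
print browser_decode(escape_href($url)), "\n";
```

The point of the sketch is only that escaping with ``&amp;'' before
storing the URL is what makes the browser's decoding step give back
the query you started with.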

This is where `;' enters the picture.  The current HTML spec recommends
that the server accept `;' in place of `&' as the separator between
form fields -- this would at least allow users to portably store form
queries in their HREFs (and since this is server specific there is no
harm if my server does and yours doesn't -- users just have to know
which works and which doesn't).
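
On the server side, accepting `;' is a small change to the query
parser.  The following is only a sketch under my own assumptions:
parse_query is a hypothetical name, not a function from any existing
server or library, and it keeps just the last value when a field name
repeats:

```perl
use strict;
use warnings;

# Hypothetical helper: split a query string into name/value pairs,
# accepting either '&' or ';' as the field separator.
sub parse_query {
    my ($query) = @_;
    my %form;
    foreach my $pair (split /[;&]/, $query) {
        next if $pair eq '';
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        # Undo the form encoding: '+' means space, %xx is a hex escape.
        foreach ($name, $value) {
            tr/+/ /;
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
        }
        $form{$name} = $value;   # last value wins on repeated names
    }
    return %form;
}

# Both spellings of the same query give the same fields:
my %a = parse_query('type=foo&amp=bar');
my %b = parse_query('type=foo;amp=bar');
print "$a{type} $a{amp}\n";
print "$b{type} $b{amp}\n";
```

With a parser like this, a link such as
<A HREF="http://whatever/myform?type=foo;amp=bar"> contains no `&' at
all, so there is nothing for the browser to mistake for an entity
reference.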


Of course, as you already know, if the entity reference isn't a
defined one (say ``&foo=bar'') then Netscape and Mosaic will pass it
through untouched and send what you intended.  This is another
fine example of how ``forgiving'' software gets us into a horrible
situation -- if everyone would just stick to the spec and reject
all the crap out there then these kinds of problems would be
fixed as soon as they popped up and we wouldn't be stuck
with all the crap we currently have.


----- cut here -----

# encode unknown data for use in a URL <A HREF="...">
sub encode_url {
    local($_) = @_;
    # rfc1738 says that ";"|"/"|"?"|":"|"@"|"&"|"=" may be reserved
    # and that "#" is unsafe.  And % is the escape character so we
    # escape it along with single-quote('), double-quote("),
    # grave accent(`), less than(<), greater than(>), non-US-ASCII
    # characters (binary data), and white space.  Whew.
    # The ranges are octal: \040 is space and \177 is DEL.
    s/([\000-\040\#\;\/\?\:\@\&\=\%\'\"\`\<\>\177-\377])/sprintf('%%%02x',ord($1))/eg;
    $_;
}
# encode unknown data for use in <TITLE>...</TITLE>
sub encode_title {
    # like encode_url but less strict (I couldn't find docs on this)
    local($_) = @_;
    # \000-\037 (octal) covers the control characters below space
    s/([\000-\037\%\&\<\>\177-\377])/sprintf('%%%02x',ord($1))/eg;
    $_;
}
# encode unknown data for use inside markup attributes <MARKUP ATTR="...">
sub encode_attribute {
    # the HTML spec says to use SGML entity references here
    local($_) = @_;
    # \000-\037 (octal) covers the control characters below space
    s/([\000-\037\"\'\`\%\&\<\>\177-\377])/sprintf('&#%03d;',ord($1))/eg;
    $_;
}
# encode unknown text data for using as HTML,
# treats ^H as overstrike ala nroff.
sub encode_data {
    local($_) = @_;
    local($str);
    # Escape binary data except for ^H which we process below
    # \375 gets turned into the & for the entity reference
    # (double quotes, so that \375 is the single character \375)
    s/([^\010\012\015\040-\176])/sprintf("\375#%03d;",ord($1))/eg;

    # Process ^H sequences, we use \376 and \377 (already escaped
    # above) to stand in for < and > until those characters can
    # be properly escaped below.
    s,((_\010.)+),($str = $1) =~ s/.\010//g; "\376I\377$str\376/I\377";,ge;
    s,((.\010.)+),($str = $1) =~ s/.\010//g; "\376B\377$str\376/B\377";,ge;
    s,\376[IB]\377_\376/[IB]\377,,g;
    s/.[\b]//g;                 # just do an erase for anything else
    
    # Escape &, < and >
    s/\&/\&amp\;/g; s/\</\&lt\;/g; s/\>/\&gt\;/g;
    
    # Now convert our magic chars into our tag markers
    s/\375/\&/g; s/\376/</g; s/\377/>/g;

    $_;
}