Posted to dev@xalan.apache.org by Holger Flörke <fl...@doctronic.de> on 2001/03/15 08:12:28 UTC
URI encoding in formattertohtml.cpp
Hi,
after a bug in Xalan-C1.1 caused a percent sign in a URI to be encoded
twice under some circumstances when producing HTML, I looked a little
deeper into the RFC and the code. I found that Xalan is *not* encoding all
the characters it should encode with a percent sign (and it even encodes a
legal character). It is not a serious bug, because most web servers
'understand' the characters that are illegal in URIs. I have applied the
changes to formattertohtml.cpp and attached a diff of the patch to this
message. One question is still open for me: the RFC defines some
"excluded" characters and says they are disallowed within a URI (control
codes, spaces, delimiters, ...). What about these characters? I found that
I must not encode the "#", because otherwise the HTML link breaks: the
fragment is no longer treated as a fragment.
Does anybody know the right way to treat these 'excluded' characters?
HolgeR
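For illustration only, the rule the patch applies (copy unreserved characters, reserved characters, "%" and "#" unchanged, percent-encode everything else) can be sketched as a standalone function. The names isLeftAsIs and escapeUri are hypothetical and are not part of Xalan, and the sketch assumes 8-bit input, whereas the real code works on Xalan's own character type.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper, not part of Xalan: true if ch is copied to the
// output unchanged under the rule used in the patch.
static bool isLeftAsIs(unsigned char ch)
{
    if ((ch >= 'a' && ch <= 'z') ||
        (ch >= 'A' && ch <= 'Z') ||
        (ch >= '0' && ch <= '9'))
    {
        return true;    // unreserved alphanumerics
    }

    // RFC 2396 reserved characters and unreserved marks, plus '%' and '#'.
    // The ch != 0 test is needed because strchr also matches the terminator.
    return ch != 0 && std::strchr(";/?:@&=+$,-_.!~*'()%#", ch) != 0;
}

// Hypothetical helper: percent-encode every byte that is not left as is.
static std::string escapeUri(const std::string& in)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;

    for (std::string::size_type i = 0; i < in.size(); ++i)
    {
        const unsigned char ch = static_cast<unsigned char>(in[i]);

        if (isLeftAsIs(ch))
        {
            out += static_cast<char>(ch);
        }
        else
        {
            out += '%';
            out += hex[ch >> 4];      // high nibble
            out += hex[ch & 0x0F];    // low nibble
        }
    }

    return out;
}
```

With this sketch a space or an angle bracket is escaped, while "x.html#top" and "q?a=1&b=2" pass through unchanged. Whether the remaining excluded characters should be handled the same way is exactly the open question above.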
diff formattertohtml_old.cpp formattertohtml.cpp
855c855,908
< if (ch < 33 || ch > 126)
---
> //
> // Unreserved characters are allowed in the URI.
> // Reserved characters have a special meaning in the URI
> // (eg "=" separates a parameter name from its value). Because
> // we do *not* know anything about the meaning of the characters
> // in the URI (is this "=" a parameter name-value-separator or
> // is it contained in a parameter value?) we have to
> // leave these characters as they are.
> // The encoding symbol "%" is *not* encoded in any way.
> // WATCH OUT: What about the excluded characters? I know it
> // is a problem to encode the "#" sign, so I decided
> // to exclude the "#" from encoding. What about the other
> // characters?
> //
> // http://www.ietf.org/rfc/rfc2396.txt says:
> // 2.4.3. Excluded US-ASCII Characters
> // Although they are disallowed within the URI syntax, we include
> // here a description of those US-ASCII characters that have been
> // excluded and the reasons for their exclusion.
> // [...]
> // The angle-bracket "<" and ">" and double-quote (") characters are
> // excluded because they are often used as the delimiters around URI
> // in text documents and protocol fields. The character "#" is
> // excluded because it is used to delimit a URI from a fragment
> // identifier in URI references (Section 4). The percent character
> // "%" is excluded because it is used for the encoding of escaped
> // characters.
> //
> // delims = "<" | ">" | "#" | "%" | <">
> //
> if((ch < 'a' || ch > 'z') && // unreserved
> (ch < 'A' || ch > 'Z') && // unreserved
> (ch < '0' || ch > '9') && // unreserved
> (ch != '%') && // encoding symbol
> (ch != '=') && // reserved
> (ch != '&') && // reserved
> (ch != '+') && // reserved
> (ch != '@') && // reserved
> (ch != '?') && // reserved
> (ch != '/') && // reserved
> (ch != ':') && // reserved
> (ch != ';') && // reserved
> (ch != '$') && // reserved
> (ch != ',') && // reserved
> (ch != '#') && // fragment delimiter
> (ch != '-') && // unreserved
> (ch != '_') && // unreserved
> (ch != '.') && // unreserved
> (ch != '!') && // unreserved
> (ch != '~') && // unreserved
> (ch != '*') && // unreserved
> (ch != '\'') && // unreserved
> (ch != '(') && // unreserved
> (ch != ')')) // unreserved
958,980c1011
< }
< else if(ch == XalanUnicode::charPercentSign)
< {
< // If the character is a '%' number number, try to avoid double-escaping.
< // There is a question if this is legal behavior.
< if (i + 2 < len &&
< XalanXMLChar::isDigit(string[i + 1]) == true &&
< XalanXMLChar::isDigit(string[i + 2]) == true)
< {
< accumContent(ch);
< }
< else
< {
< if (m_escapeURLs == true)
< {
< accumHexNumber(ch);
< }
< else
< {
< accumContent(ch);
< }
< }
< }
---
> }
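
A side note on the block removed above: it tried to recognise an existing escape by testing the two characters after "%" with XalanXMLChar::isDigit. RFC 2396 defines an escape as "%" hex hex, so a sequence such as "%2F" contains a letter. If one wanted to keep such a check rather than drop it as the patch does, a minimal sketch could look like this. The helper isPercentEscape is hypothetical, not Xalan code, and assumes 8-bit input.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, not Xalan code: true if s[i] is a '%' followed by
// two hexadecimal digits, i.e. an escape of the form "%" hex hex.
static bool isPercentEscape(const std::string& s, std::string::size_type i)
{
    return i + 2 < s.size() &&
           s[i] == '%' &&
           std::isxdigit(static_cast<unsigned char>(s[i + 1])) != 0 &&
           std::isxdigit(static_cast<unsigned char>(s[i + 2])) != 0;
}
```

The bounds test mirrors the "i + 2 < len" condition in the removed code. The patch itself avoids the question by never encoding "%" at all.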
--
holger floerke d o c t r o n i c
email floerke@doctronic.de information publishing + retrieval
phone +49 (0) 2222 9292 90 http://www.doctronic.de
fax +49 (0) 2222 9292 99