Posted to dev@xalan.apache.org by Holger Flörke <fl...@doctronic.de> on 2001/03/15 08:12:28 UTC

URI encoding in formattertohtml.cpp

Hi,

  after a bug in Xalan-C 1.1 caused a percent sign in a URI to be encoded 
twice under some circumstances when producing HTML, I looked a little 
deeper into the RFC and the code. I found that Xalan does *not* 
percent-encode all the characters it should (and it even encodes a legal 
character). It is not a serious bug, because most web servers 
'understand' the characters that are illegal in URIs. I have applied my 
changes to formattertohtml.cpp and attached a diff of the patch to this 
message. There is one open question for me: the RFC defines some 
"excluded" characters and says they are disallowed within a URI (control 
codes, spaces, delimiters, ...). What about these characters? I found that 
I must not encode the "#", because otherwise the HTML link breaks: the 
fragment is no longer treated as a fragment.

Does anybody know the right way to treat these 'excluded' characters?
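
To make the question concrete, here is a minimal standalone sketch of the rule the attached patch applies. It is not Xalan code: the names needsEscape and escapeUri are hypothetical, and it assumes the URI is a string of 8-bit bytes, whereas the real formatter works on Xalan's own character type. Under this rule every excluded character except "#" and "%" ends up escaped, simply because it is not on the keep list.

```cpp
#include <string>

// Hypothetical helper (not Xalan code): mirrors the condition in the patch.
// Returns true if the byte should be written as %XX.
bool needsEscape(unsigned char ch)
{
    if ((ch >= 'a' && ch <= 'z') ||
        (ch >= 'A' && ch <= 'Z') ||
        (ch >= '0' && ch <= '9'))
    {
        return false;   // unreserved alphanumerics
    }

    // Left alone: '%' (escape symbol), '#' (fragment delimiter),
    // the reserved set, and the unreserved marks of RFC 2396.
    static const std::string keep("%#" "=&+@?/:;$," "-_.!~*'()");
    return keep.find(static_cast<char>(ch)) == std::string::npos;
}

// Hypothetical helper: escapes a byte string using the rule above.
std::string escapeUri(const std::string& in)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i)
    {
        const unsigned char ch = static_cast<unsigned char>(in[i]);
        if (needsEscape(ch))
        {
            out += '%';
            out += hex[ch >> 4];     // high nibble
            out += hex[ch & 0x0F];   // low nibble
        }
        else
        {
            out += static_cast<char>(ch);
        }
    }
    return out;
}
```

With this rule "a b" becomes "a%20b" and "x<y>" becomes "x%3Cy%3E", while "page.html#top" and an already escaped "a%20b" pass through unchanged.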

HolgeR

diff formattertohtml_old.cpp formattertohtml.cpp
855c855,908
< 		if (ch < 33 || ch > 126)
---
 >     //
 >     // Unreserved characters are allowed in the URI.
 >     // Reserved characters have a special meaning in the URI
 >     //  (e.g. "=" separates a parameter name from its value). Because
 >     //  we do *not* know anything about the meaning of the characters
 >     //  in the URI (is this "=" a parameter name-value-separator or
 >     //  is it contained in a parameter value?) we have to
 >     //  leave these characters as they are.
 >     // The encoding symbol "%" is *not* encoded in any way.
 >     // WATCH OUT: What about the excluded characters? I know it
 >     //  is a problem to encode the "#" sign, so I decided
 >     //  to exclude the "#" from encoding. What about the other
 >     //  characters?
 >     //
 >     //  http://www.ietf.org/rfc/rfc2396.txt says:
 >     //  2.4.3. Excluded US-ASCII Characters
 >     //   Although they are disallowed within the URI syntax, we include
 >     //   here a description of those US-ASCII characters that have been
 >     //   excluded and the reasons for their exclusion.
 >     //   [...]
 >     //   The angle-bracket "<" and ">" and double-quote (") characters are
 >     //   excluded because they are often used as the delimiters around URI
 >     //   in text documents and protocol fields.  The character "#" is
 >     //   excluded because it is used to delimit a URI from a fragment
 >     //   identifier in URI references (Section 4). The percent character
 >     //   "%" is excluded because it is used for the encoding of escaped
 >     //   characters.
 >     //
 >     //   delims      = "<" | ">" | "#" | "%" | <">
 >     //
 >     if((ch < 'a' || ch > 'z') && // unreserved
 >        (ch < 'A' || ch > 'Z') && // unreserved
 >        (ch < '0' || ch > '9') && // unreserved
 >        (ch != '%') && // encoding symbol
 >        (ch != '=') && // reserved
 >        (ch != '&') && // reserved
 >        (ch != '+') && // reserved
 >        (ch != '@') && // reserved
 >        (ch != '?') && // reserved
 >        (ch != '/') && // reserved
 >        (ch != ':') && // reserved
 >        (ch != ';') && // reserved
 >        (ch != '$') && // reserved
 >        (ch != ',') && // reserved
 >        (ch != '#') && // fragment delimiter
 >        (ch != '-') && // unreserved
 >        (ch != '_') && // unreserved
 >        (ch != '.') && // unreserved
 >        (ch != '!') && // unreserved
 >        (ch != '~') && // unreserved
 >        (ch != '*') && // unreserved
 >        (ch != '\'') && // unreserved
 >        (ch != '(') &&  // unreserved
 >        (ch != ')'))    // unreserved
958,980c1011
< 		}
< 		else if(ch == XalanUnicode::charPercentSign)
< 		{
< 			// If the character is a '%' number number, try to avoid double-escaping.
< 			// There is a question if this is legal behavior.
< 			if (i + 2 < len &&
< 				XalanXMLChar::isDigit(string[i + 1]) == true &&
< 				XalanXMLChar::isDigit(string[i + 2]) == true)
< 			{
< 				accumContent(ch);
< 			}
< 			else
< 			{
< 				if (m_escapeURLs == true)
< 				{
< 					accumHexNumber(ch);
< 				}
< 				else
< 				{
< 					accumContent(ch);
< 				}
< 			}
< 		}
---
 >     }
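
A note on the branch removed above: it treats "%" as an existing escape only when the next two characters pass XalanXMLChar::isDigit, so as far as I can tell an escape containing a hex letter, such as "%2F", falls through to the re-encoding branch. The patch avoids this by never encoding "%" at all, which also means a stray literal "%" is left as it is. An alternative, sketched below under assumptions, would be to test for two hexadecimal digits instead. The helper name isPercentEscape is hypothetical and the sketch works on a plain std::string rather than Xalan's string type.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not Xalan code): true if s[i] is '%' and is
// followed by two hexadecimal digits, i.e. it already starts an escape.
bool isPercentEscape(const std::string& s, std::size_t i)
{
    return i + 2 < s.size() &&
           s[i] == '%' &&
           std::isxdigit(static_cast<unsigned char>(s[i + 1])) != 0 &&
           std::isxdigit(static_cast<unsigned char>(s[i + 2])) != 0;
}
```

A caller could copy the "%" unchanged when this returns true and escape it as "%25" otherwise. Whether that is preferable to leaving every "%" alone is exactly the open question.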


-- 
holger floerke                      d  o  c  t  r  o  n  i  c
email floerke@doctronic.de          information publishing + retrieval
phone +49 (0) 2222 9292 90          http://www.doctronic.de
fax   +49 (0) 2222 9292 99