Posted to dev@struts.apache.org by Matthias Kerkhoff <ma...@BESToffers.de> on 2000/11/13 15:22:53 UTC

Beanutils.filter() and URLs

Hi all,

this time I would like to draw your attention to (as yet undetected?)
problems with BeanUtils.filter() when it is used to encode URLs.


The situation:
--------------
This method is used by various tags in the Struts codebase. From
the way it is used, it seems that most developers think of this
method as a way to safely encode characters that have a special
meaning in/for HTML _and_ HTTP. Typical uses of filter() include
the encoding of query parameters and the encoding of HTML content.


The problem:
------------
The set of characters with a special meaning depends largely on the
context in which the string is used.
Some examples:
The '#' character is used as the delimiter for anchors in URLs, but
has no special meaning in HTML content. filter() does not encode
the '#' character.
The '%' character is used to mark encoded characters in URLs, but
has no special meaning in HTML content. filter() does not encode
the percent sign.
The '&' character marks the beginning of a character entity in HTML.
In URLs its meaning varies: it is, for example, used as a separator
in query strings but is otherwise (mostly) allowed.
filter() always encodes the ampersand.
The '[' and ']' characters, which are used internally as index markers
in Struts, are listed as 'unwise' in the RFC spec, that is, they
should preferably be encoded. filter() does not encode these characters.
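
To make the context dependence concrete, here is a minimal Java sketch.
It is not Struts code; the class and method names are made up for
illustration. It uses only java.net.URI and java.net.URLEncoder to show
that a raw '#' splits off a fragment when a URL is parsed, and that '#',
'%', '[' and ']' are all escaped when they are meant as plain data inside
a query value.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

// Hypothetical demo class, not part of Struts.
class UrlContextDemo
{
    // Returns the fragment (anchor) that a URL parser sees, or null.
    static String fragmentOf(String url)
    {
        return URI.create(url).getFragment();
    }

    // Encodes a string for use as data inside a query parameter.
    static String asQueryValue(String value)
    {
        try
        {
            return URLEncoder.encode(value, "UTF-8");
        }
        catch (UnsupportedEncodingException e)
        {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args)
    {
        // A raw '#' starts the fragment: prints "b"
        System.out.println(fragmentOf("http://example.org/list?item=a#b"));
        // As data, '#' must be escaped: prints "a%23b"
        System.out.println(asQueryValue("a#b"));
        // '%' itself is escaped: prints "50%25"
        System.out.println(asQueryValue("50%"));
        // The 'unwise' brackets are escaped: prints "items%5B0%5D"
        System.out.println(asQueryValue("items[0]"));
    }
}
```

None of these characters would be touched by a filter that only knows
about HTML markup characters.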


How it manifests:
----------------
A good example to illustrate the problem is the link tag. The link
tag contains some code that builds a URL and optionally appends
some bean properties after URL-encoding the values.

After the URL is built, the whole URL is filtered. This results in
% signs being encoded twice (they are already URL-encoded);
& being encoded once as &amp; (may cause problems depending on the
  server software, which may or may not recognize &amp; as a query
  parameter separator);
[] not being encoded at all (this may become important when Struts
  supports nested properties in the form tags).


(Possible) solution:
--------------------
Add a static method String filterFor(int context, String value){..}
that accepts an additional context argument. This argument
indicates the intended use of the filtered string. The
method should properly encode the given value with regard to the
specified context. Some candidates for context types are ...
- HTML
  (should resolve to the current filter() implementation)
- URIPATH
  (encodes the path component of a URL)
- QUERYPARAM
  (encodes the name or value of a query argument)
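
A rough sketch of what such a method could look like follows. This is
only an illustration under stated assumptions, not existing Struts code:
the class name, the constant values and the three helper methods are
hypothetical, the HTML branch covers just the four basic markup
characters, the path branch leans on the quoting done by the
multi-argument java.net.URI constructor, and the query branch uses
java.net.URLEncoder with UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

// Hypothetical sketch of the proposed API, not part of Struts.
class FilterSketch
{
    // Assumed context constants, named after the candidates above.
    static final int HTML = 0;
    static final int URIPATH = 1;
    static final int QUERYPARAM = 2;

    static String filterFor(int context, String value)
    {
        if (value == null)
        {
            return null;
        }
        switch (context)
        {
        case HTML:       return escapeHtml(value);
        case URIPATH:    return escapePath(value);
        case QUERYPARAM: return escapeQueryParam(value);
        default:
            throw new IllegalArgumentException("Unknown context: " + context);
        }
    }

    // Replaces the characters that are significant in HTML markup.
    private static String escapeHtml(String value)
    {
        StringBuilder buf = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++)
        {
            char c = value.charAt(i);
            switch (c)
            {
            case '&': buf.append("&amp;");  break;
            case '<': buf.append("&lt;");   break;
            case '>': buf.append("&gt;");   break;
            case '"': buf.append("&quot;"); break;
            default:  buf.append(c);        break;
            }
        }
        return buf.toString();
    }

    // Lets java.net.URI quote characters that are illegal in a path.
    // Intended for paths that start with '/'.
    private static String escapePath(String value)
    {
        try
        {
            return new URI(null, null, value, null).getRawPath();
        }
        catch (URISyntaxException e)
        {
            throw new IllegalArgumentException(e);
        }
    }

    // Encodes a query parameter name or value (space becomes '+').
    private static String escapeQueryParam(String value)
    {
        try
        {
            return URLEncoder.encode(value, "UTF-8");
        }
        catch (UnsupportedEncodingException e)
        {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

With this sketch, filterFor(FilterSketch.QUERYPARAM, "a&b=c d") gives
"a%26b%3Dc+d", while filterFor(FilterSketch.HTML, "a&b") gives
"a&amp;b". A tag such as link would then encode each parameter with
QUERYPARAM while building the URL and apply HTML only once, when the
finished URL is written into an attribute.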


-- 
Matthias                          mailto:make@BESToffers.de



Re: Beanutils.filter() and URLs

Posted by Pierre Métras <ge...@sympatico.ca>.
Hi,

I don't know whether the problem analysed by Matthias (described in his
original post) has been solved yet (at least it was not in last week's
version 1.0), but I needed to handle the encoding of characters correctly,
in HTML files or URLs.

The following is the code I use in my private HtmlUtil.java file.
    - HtmlUtil.escape(s) corresponds to BeanUtils.filter(s), but takes more
characters into account. It also supports an escape character '$', which I
use when I don't want the following character to be translated (for instance,
when I retrieve a message string with embedded HTML formatting).
    - HtmlUtil.escapeURI(s) is used to encode characters used in a URI, as
defined in RFC 2396.
    - HtmlUtil.escapeParameter(s) is used to encode characters to be used as
parameter values.
These functions can be used as a basis for writing the filter function
proposed by Matthias.

Pierre Métras


/**
 * HTML utils and URL encoding
 *
 * @see RFC 2396
 *
 * @author  Pierre Métras
 * @version 1.0
 */

public class HtmlUtil
{
    // HTML characters
    protected static final String namedCharacters[] =
    {
        "",         // 0
        "",         // 1
        "",         // 2
        "",         // 3
        "",         // 4
        "",         // 5
        "",         // 6
        "",         // 7
        "",         // 8 \b
        "&nbsp;",   // 9 \t
        "<P>",      // 10 \n
        "",         // 11
        "",         // 12 \f
        "<BR>",     // 13 \r
        "",         // 14
        "",         // 15
        "",         // 16
        "",         // 17
        "",         // 18
        "",         // 19
        "",         // 20
        "",         // 21
        "",         // 22
        "",         // 23
        "",         // 24
        "",         // 25
        "",         // 26
        "",         // 27
        "",         // 28
        "",         // 29
        "",         // 30
        "",         // 31
        " ",        // 32
        "!",        // 33 !
        "&quot;",   // 34 "
        "#",        // 35 #
        "$",        // 36 $
        "%",        // 37 %
        "&amp;",    // 38 &
        "'",        // 39 '
        "(",        // 40 (
        ")",        // 41 )
        "*",        // 42 *
        "+",        // 43 +
        ",",        // 44 ,
        "-",        // 45 -
        ".",        // 46 .
        "/",        // 47 /
        "0",        // 48 0
        "1",        // 49 1
        "2",        // 50 2
        "3",        // 51 3
        "4",        // 52 4
        "5",        // 53 5
        "6",        // 54 6
        "7",        // 55 7
        "8",        // 56 8
        "9",        // 57 9
        ":",        // 58 :
        ";",        // 59 ;
        "&lt;",     // 60 <
        "=",        // 61 =
        "&gt;",     // 62 >
        "?",        // 63 ?
        "@",        // 64 @
        "A",        // 65 A
        "B",        // 66 B
        "C",        // 67 C
        "D",        // 68 D
        "E",        // 69 E
        "F",        // 70 F
        "G",        // 71 G
        "H",        // 72 H
        "I",        // 73 I
        "J",        // 74 J
        "K",        // 75 K
        "L",        // 76 L
        "M",        // 77 M
        "N",        // 78 N
        "O",        // 79 O
        "P",        // 80 P
        "Q",        // 81 Q
        "R",        // 82 R
        "S",        // 83 S
        "T",        // 84 T
        "U",        // 85 U
        "V",        // 86 V
        "W",        // 87 W
        "X",        // 88 X
        "Y",        // 89 Y
        "Z",        // 90 Z
        "[",        // 91 [
        "\\",       // 92 \
        "]",        // 93 ]
        "^",        // 94 ^
        "_",        // 95 _
        "`",        // 96 `
        "a",        // 97 a
        "b",        // 98 b
        "c",        // 99 c
        "d",        // 100 d
        "e",        // 101 e
        "f",        // 102 f
        "g",        // 103 g
        "h",        // 104 h
        "i",        // 105 i
        "j",        // 106 j
        "k",        // 107 k
        "l",        // 108 l
        "m",        // 109 m
        "n",        // 110 n
        "o",        // 111 o
        "p",        // 112 p
        "q",        // 113 q
        "r",        // 114 r
        "s",        // 115 s
        "t",        // 116 t
        "u",        // 117 u
        "v",        // 118 v
        "w",        // 119 w
        "x",        // 120 x
        "y",        // 121 y
        "z",        // 122 z
        "{",        // 123 {
        "|",        // 124 |
        "}",        // 125 }
        "~",        // 126 ~
        "&#127;",   // 127
        "&#128;",   // 128
        "&#129;",   // 129
        "&#130;",   // 130 ??? Right apostroph \u2019
        "&#131;",   // 131 ??? Florin
        "&#132;",   // 132 ??? Right double quote \u201d
        "&#133;",   // 133 ??? Ellipsis \u2026
        "&#134;",   // 134 ??? Dagger \u2020
        "&#135;",   // 135 ??? Double dagger \u2021
        "&#136;",   // 136 ??? Circumflex
        "&#137;",   // 137 ??? Permil \u2030
        "&#138;",   // 138 ???
        "&#139;",   // 139 ??? Less than sign
        "&#140;",   // 140 ??? Capital OE ligature
        "&#141;",   // 141
        "&#142;",   // 142
        "&#143;",   // 143
        "&#144;",   // 144
        "&#145;",   // 145 ??? Left single quote
        "&#146;",   // 146 ??? Right single quote
        "&#147;",   // 147 ??? Left double quote
        "&#148;",   // 148 ??? Right double quote
        "&#149;",   // 149 ??? Bullet
        "&#150;",   // 150 ??? En dash
        "&#151;",   // 151 ??? Em dash
        "&#152;",   // 152 ??? Tilde
        "&#153;",   // 153 ??? Trademark
        "&#154;",   // 154 ???
        "&#155;",   // 155 ??? Greater than sign
        "&#156;",   // 156 ??? Small oe ligature
        "&#157;",   // 157
        "&#158;",   // 158
        "&#159;",   // 159 ??? Capital Y, umlaut
        "&nbsp;",   // 160 Non breaking space
        "&iexcl;",  // 161 Inverted exclamation point
        "&cent",    // 162 Cent sign
        "&pound;",  // 163 Pound sign
        "&curren;", // 164 General currency sign
        "&yen;",    // 165 Yen sign
        "&brvbar;", // 166 Broken vertical bar
        "&sect;",   // 167 Section sign
        "&uml;",    // 168 Umlaut
        "&copy;",   // 169 Copyright
        "&ordf;",   // 170 Feminine ordinal
        "&laquo;",  // 171 Left angle quote
        "&not;",    // 172 Not sign
        "&shy;",    // 173 Soft hyphen
        "&reg;",    // 174 Registred trademark
        "&macr;",   // 175 Macron accent
        "&deg;",    // 176 Degree sign
        "&plusmn;", // 177 Plus or minus
        "&sup2;",   // 178 Superscript 2
        "&sup3;",   // 179 Superscript 3
        "&acute;",  // 180 Acute accent
        "&micro;",  // 181 Greek Mu
        "&para;",   // 182 Paragraph sign
        "&middot;", // 183 Middle dot
        "&cedil;",  // 184 Cedilla
        "&sup1;",   // 185 Superscript 1
        "&ordm;",   // 186 Masculine ordinal
        "&raquo;",  // 187 Right angle quote
        "&frac14;", // 188 Fraction one-fourth
        "&frac12;", // 189 Fraction one-half
        "&frac34;", // 190 Fraction three-fourths
        "&iquest;", // 191 Inverted question mark
        "&Agrave;", // 192 A grace accent
        "&Aacute;", // 193 A acute accent
        "&Acirc;",  // 194 A circumflex accent
        "&Atilde;", // 195 A tilde
        "&Auml;",   // 196 A umlaut
        "&Aring;",  // 197 A ring
        "&AElig;",  // 198 AE ligature
        "&Ccedil;", // 199 C cedilla
        "&Egrave;", // 200 E grave accent
        "&Eacute;", // 201 E acute accent
        "&Ecirc;",  // 202 E circumflex
        "&Euml;",   // 203 E umlaut
        "&Igrave;", // 204 I grave accent
        "&Iacute;", // 205 I acute accent
        "&Icirc;",  // 206 I circumflex
        "&Iuml;",   // 207 I umlaut
        "&ETH;",    // 208 Eth Icelandic
        "&Ntilde;", // 209 N tilde
        "&Ograve;", // 210 O grave accent
        "&Oacute;", // 211 O acute accent
        "&Ocirc;",  // 212 O circumflex
        "&Otilde;", // 213 O tilde
        "&Ouml;",   // 214 O umlaut
        "&times;",  // 215 Multiply sign
        "&Oslash;", // 216 O slash
        "&Ugrave;", // 217 U grave accent
        "&Uacute;", // 218 U acute accent
        "&Ucirc;",  // 219 U circumflex
        "&Uuml;",   // 220 U umlaut
        "&Yacute;", // 221 Y acute accent
        "&THORN;",  // 222 Thorn Icelandic
        "&szlig;",  // 223 sz ligature
        "&agrave;", // 224 a grave accent
        "&aacute;", // 225 a acute accent
        "&acirc;",  // 226 a circumflex
        "&atilde;", // 227 a tilde
        "&auml;",   // 228 a umlaut
        "&aring;",  // 229 a ring
        "&aelig;",  // 230 ae ligature
        "&ccedil;", // 231 c cedilla
        "&egrave;", // 232 e grave accent
        "&eacute;", // 233 e acute accent
        "&ecirc;",  // 234 e circumflex
        "&euml;",   // 235 e umlaut
        "&igrave;", // 236 i grave accent
        "&iacute;", // 237 i accute accent
        "&icirc;",  // 238 i circumflex
        "&iuml;",   // 239 i umlaut
        "&eth;",    // 240 eth Icelandic
        "&ntilde;", // 241 n tilde
        "&ograve;", // 242 o grave accent
        "&oacute;", // 243 o acute accent
        "&ocirc;",  // 244 o circumflex
        "&otilde;", // 245 o tilde
        "&ouml;",   // 246 o umlaut
        "&divide;", // 247 Division sign
        "&oslash;", // 248 o slash
        "&ugrave;", // 249 u grave accent
        "&uacute;", // 250 u acute accent
        "&ucirc;",  // 251 u circumflex
        "&uuml;",   // 252 u umlaut
        "&yacute;", // 253 y acute accent
        "&thorn;",  // 254 thorn Icelandic
        "&yuml;"    // 255 y umlaut
    };


    /**
     * Escape a Java string to valid HTML characters.
     *
     * Special characters are converted to named HTML characters.
     * For instance, '&eacute;' is replaced by '&amp;eacute;'.
     * Basic formatting allows multi-line sentences to be entered in
     * property files:
     *     \n   ==> &lt;P&gt;
     *     \r   ==> &lt;BR&gt;
     *     \t   ==> &amp;nbsp;
     * Other characters can be escaped with a leading '$' so that they
     * are not translated. For example, '$&lt;' gives '&lt;' in the
     * resulting string instead of '&amp;lt;'.
     *
     * Other control characters (below 32) are dropped.
     *
     * @param s         The string to translate to HTML characters.
     *
     * @return          The escaped string; a null or empty string is
     *                  returned unchanged.
     */
    public static String escape(final String s)
    {
        // Special cases
        if (s == null || s.length() == 0)
        {
            return s;
        }

        char characters[] = s.toCharArray();
        StringBuffer buf = new StringBuffer((int) (characters.length * 1.5));

        for (int i = 0; i < characters.length; i++)
        {
            char c = characters[i];
            if (c == '$' && i + 1 < characters.length)
            {
                // '$' escapes the next character: copy it untranslated.
                // A trailing '$' has nothing to escape and is kept as is.
                i++;
                buf.append(characters[i]);
            }
            else if (c < namedCharacters.length)
            {
                buf.append(namedCharacters[c]);
            }
            else
            {
                buf.append(c);
            }
        }

        return buf.toString();
    }





    /**
     * Escape the characters to be used in the URI part of the URL.
     *
     * Reserved characters are not translated:
     * Reserved = ";" | "/" | "?" | ":" | "@" | "&" | "="
     *                | "+" | "$" | ","
     *
     * Unreserved characters are not translated either.
     * Unreserved = Alphanum | "-" | "_" | "." | "!"
     *                       | "~" | "*" | "'" | "(" | ")"
     *
     * Example of URL
     *      http://www.myweb.org/~joe/joe's CV.html#companies
     *
     * Escaped
     *      http://www.myweb.org/~joe/joe's%20CV.html%23companies
     *
     * Caution: the sharp '#' character is encoded.
     *
     * @param s             The string to escape
     *
     * @return              The escaped string
     *
     * @see HtmlUtil#escapeParameter
     * @see RFC 2396
     */
    public static String escapeURI(final String s)
    {
        // Special cases
        if (s == null || s.length() == 0)
        {
            return s;
        }

        StringBuffer buf = new StringBuffer((int) (s.length() * 1.5));

        for (int i = 0; i < s.length(); i++)
        {
            char c = s.charAt(i);

            // ASCII Alphanum characters
            if ((48 <= c && c <= 57) || (65 <= c && c <= 90)
                    || (97 <= c && c <= 122))
            {
                buf.append(c);
            }

            // Special characters
            else
            {
                switch (c)
                {
                // Reserved characters
                case ';':
                case '/':
                case '?':
                case ':':
                case '@':
                case '&':
                case '=':
                case '+':
                case '$':
                case ',':

                // Mark characters
                case '-':
                case '_':
                case '.':
                case '!':
                case '~':
                case '*':
                case '\'':
                case '(':
                case ')':
                    buf.append(c);
                    break;

                default:
                    // Other US-ASCII
                    if (32 <= c && c <= 127)
                    {
                        buf.append('%');
                        buf.append(Integer.toHexString(c));
                    }
                    // Other Unicode or control chars
                    else
                    {
                        buf.append(c);
                    }
                    break;

                }
            }
        }

        return buf.toString();
    }



    /**
     * Escape a string to be used as a parameter value in a URL.
     *
     * This should not be used on the full URL, because some
     * parts would be escaped and would change its meaning!
     *
     * Alphanum values are not translated.
     * Other US-ASCII characters are transformed to their
     * escaped value:
     *
     * <code>
     *      ' ' ==> '%20'
     *      '%' ==> '%25'
     * </code>
     *
     * Other characters (non US-ASCII) are taken as is.
     *
     * @param s             The string to encode as a value
     *
     * @return              The encoded parameter value
     *
     * @see RFC 2396
     */
    public static String escapeParameter(final String s)
    {
        // Special cases
        if (s == null || s.length() == 0)
        {
            return s;
        }

        StringBuffer buf = new StringBuffer((int) (s.length() * 1.5));

        for (int i = 0; i < s.length(); i++)
        {
            char c = s.charAt(i);

            // ASCII Alphanum characters
            if ((48 <= c && c <= 57) || (65 <= c && c <= 90)
                    || (97 <= c && c <= 122))
            {
                buf.append(c);
            }

            // Other ASCII characters
            else if (c <= 127)
            {
                buf.append('%');
                if (c < 16)
                {
                    // Pad to two hex digits, e.g. '\n' ==> '%0a'
                    buf.append('0');
                }
                buf.append(Integer.toHexString(c));
            }

            // Other characters. Don't know what to do for the
            // moment because it depends on the character set.
            // I decide not to escape them, letting the application
            // decide what to do...
            else
            {
                buf.append(c);
            }
        }

        return buf.toString();
    }
}
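
Both escapeURI() and escapeParameter() depend on writing exactly two hex
digits after the '%'. As a minimal standalone sketch of that one step
(the class and method names here are hypothetical and not part of
HtmlUtil), a lookup table makes the zero padding implicit:

```java
// Hypothetical helper, not part of HtmlUtil.
class PercentEncodeSketch
{
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Returns '%' followed by exactly two hex digits for a character
    // in the range 0..255.
    static String percentEncode(char c)
    {
        if (c > 0xFF)
        {
            throw new IllegalArgumentException(
                "Only single-byte characters are handled in this sketch");
        }
        char[] out = { '%', HEX[(c >> 4) & 0xF], HEX[c & 0xF] };
        return new String(out);
    }

    public static void main(String[] args)
    {
        System.out.println(percentEncode(' '));   // prints %20
        System.out.println(percentEncode('%'));   // prints %25
        System.out.println(percentEncode('\n'));  // prints %0A
    }
}
```

Characters above 255 are rejected here on purpose, because their escaped
form depends on the character set, as noted in escapeParameter().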





