Posted to dev@commons.apache.org by Gary Gregory <GG...@seagullsoftware.com> on 2011/07/19 16:28:15 UTC

LANG-728 to work with Lang 3.0 way of using escapeXml with > 0x7f characters [WAS RE: svn commit: r1148162 - /commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java]

Hi All:

I am glad to know there is a 3.0 way of doing that, which is:

    @Test
    public void testEscapeXmlSupplementaryCharacters() {
        CharSequenceTranslator escapeXml = 
            StringEscapeUtils.ESCAPE_XML.with( NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE) );

        assertEquals("Supplementary character must be represented using a single escape", "&#144308;",
                escapeXml.translate("\uD84C\uDFB4"));
    }

but what about the test the way it was originally written?

        // Example from https://issues.apache.org/jira/browse/LANG-728
        assertEquals("Supplementary character must be represented using a single escape", "&#144308;",
                StringEscapeUtils.escapeXml("\uD84C\uDFB4"));
        // Example from See http://www.w3.org/International/questions/qa-escapes
        assertEquals("Supplementary character must be represented using a single escape", "&#x233B4;",
                StringEscapeUtils.escapeXml("\uD84C;\uDFB4;"));

It still fails. 

Shouldn't the API be changed to work for this case too? The W3C seems to say so: "you must use the single, code point value for that character" in:

     * From http://www.w3.org/International/questions/qa-escapes
     * </p>
     * <blockquote>
     * Supplementary characters are those Unicode characters that have code points higher than the characters in
     * the Basic Multilingual Plane (BMP). In UTF-16 a supplementary character is encoded using two 16-bit surrogate code points from the
     * BMP. Because of this, some people think that supplementary characters need to be represented using two escapes, but this is incorrect
     * – you must use the single, code point value for that character. For example, use &#x233B4; rather than &#xD84C;&#xDFB4;.
     * </blockquote>
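
To make the W3C point concrete, here is a minimal sketch in plain Java (no Commons Lang) that walks a string by code point, so a surrogate pair always becomes a single numeric reference. The class and method names (SupplementaryEscapeSketch, escapeAbove) and the threshold parameter are made up for illustration; this is not how StringEscapeUtils is implemented, and it does not handle &, < or >.

```java
import java.util.Locale;

// Sketch only: SupplementaryEscapeSketch and escapeAbove are hypothetical names,
// not part of Commons Lang.
final class SupplementaryEscapeSketch {

    /**
     * Replaces every code point greater than threshold with one numeric
     * character reference, decimal or hexadecimal.
     */
    static String escapeAbove(String input, int threshold, boolean hex) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // codePointAt joins a valid surrogate pair into one code point.
            int cp = input.codePointAt(i);
            if (cp > threshold) {
                String digits = hex
                        ? "x" + Integer.toHexString(cp).toUpperCase(Locale.ROOT)
                        : Integer.toString(cp);
                out.append("&#").append(digits).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "\uD84C\uDFB4"; // U+233B4 as a UTF-16 surrogate pair
        System.out.println(escapeAbove(s, 0x7f, false)); // &#144308;
        System.out.println(escapeAbove(s, 0x7f, true));  // &#x233B4;
    }
}
```

Iterating by char instead of by code point is what would yield the two-escape form (&#xD84C;&#xDFB4;) that the W3C text warns against.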

Gary

-----Original Message-----
From: bayard@apache.org [mailto:bayard@apache.org] 
Sent: Tuesday, July 19, 2011 0:58 AM
To: commits@commons.apache.org
Subject: svn commit: r1148162 - /commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java

Author: bayard
Date: Tue Jul 19 04:58:03 2011
New Revision: 1148162

URL: http://svn.apache.org/viewvc?rev=1148162&view=rev
Log:
Updating unit test for LANG-728 to work with Lang 3.0 way of using escapeXml with > 0x7f characters

Modified:
    commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java

Modified: commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java
URL: http://svn.apache.org/viewvc/commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java?rev=1148162&r1=1148161&r2=1148162&view=diff
==============================================================================
--- commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java (original)
+++ commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java Tue Jul 19 04:58:03 2011
@@ -31,6 +31,9 @@ import org.apache.commons.io.IOUtils;
 import org.junit.Ignore;
 import org.junit.Test;
 
+import org.apache.commons.lang3.text.translate.CharSequenceTranslator;
+import org.apache.commons.lang3.text.translate.UnicodeEscaper;
+
 /**
  * Unit tests for {@link StringEscapeUtils}.
  *
@@ -333,15 +336,13 @@ public class StringEscapeUtilsTest {
      * @see <a href="http://www.w3.org/International/questions/qa-escapes">Using character escapes in markup and CSS</a>
      * @see <a href="https://issues.apache.org/jira/browse/LANG-728">LANG-728</a>
      */
-    @Ignore
     @Test
     public void testEscapeXmlSupplementaryCharacters() {
-        // Example from https://issues.apache.org/jira/browse/LANG-728
-        assertEquals("Supplementary character must be represented using a single escape", "&#144308;",
-                StringEscapeUtils.escapeXml("\uD84C\uDFB4"));
-        // Example from See http://www.w3.org/International/questions/qa-escapes
-        assertEquals("Supplementary character must be represented using a single escape", "&#x233B4;",
-                StringEscapeUtils.escapeXml("\uD84C;\uDFB4;"));
+        CharSequenceTranslator escapeXml = 
+            StringEscapeUtils.ESCAPE_XML.with( UnicodeEscaper.between(0x7f, Integer.MAX_VALUE) );
+
+        assertEquals("Supplementary character must be represented using a single escape", "\u233B4",
+                escapeXml.translate("\uD84C\uDFB4"));
     }
     
     // Tests issue #38569
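
One detail that may be worth checking in the new assertion: a Java \u escape consumes exactly four hex digits, so the literal "\u233B4" is U+233B followed by the character '4', not the code point U+233B4. The sketch below only demonstrates that language rule; the class and helper names are made up for illustration, and it says nothing about what UnicodeEscaper actually returns.

```java
// Sketch only: UnicodeEscapeLiteralSketch and fromCodePoint are hypothetical names.
final class UnicodeEscapeLiteralSketch {

    /** Builds the UTF-16 string for a single code point. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // Four hex digits are consumed by the escape; the trailing 4 is a separate char.
        String literal = "\u233B4";
        System.out.println(literal.length());           // 2
        System.out.println(literal.charAt(1));          // 4

        // The supplementary character itself needs a surrogate pair in a Java literal.
        String supplementary = fromCodePoint(0x233B4);
        System.out.println(supplementary.equals("\uD84C\uDFB4")); // true
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1
    }
}
```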

Re: LANG-728 to work with Lang 3.0 way of using escapeXml with > 0x7f characters [WAS RE: svn commit: r1148162 - /commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java]

Posted by Gary Gregory <ga...@gmail.com>.

Hi Hen,

I have more questions than answers...

On Tue, Jul 19, 2011 at 12:35 PM, Henri Yandell <fl...@gmail.com> wrote:
>
> So you're not saying that we have to escape > 0x7f (old behaviour),

Yeah, the way I read the W3C site, I thought we'd need to escape code
points > 65,536 (above the BMP)

>
> but that we have to escape any supplementary characters?

Yes, in particular, an escaped code point > 65,536 must be written
as one escape (&#x233B4; rather than &#xD84C;&#xDFB4;)

The way I read the site is that IF you are going to escape > 65,536,
then you MUST use a single code point value.

What is not clear to me yet is if/when you must escape > 65,536.

The XML 1.0 spec reads:

[2]   Char   ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]/* any Unicode character, excluding the surrogate
blocks, FFFE, and FFFF. */
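
Read as code, the production above is just a range check on a code point. A minimal sketch, with a hypothetical class and method name (nothing like this is claimed to exist in Commons Lang):

```java
// Sketch only: Xml10CharSketch and isXml10Char are hypothetical names.
final class Xml10CharSketch {

    /** True if cp matches the XML 1.0 Char production [2] quoted above. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(0x233B4)); // true: supplementary code point
        System.out.println(isXml10Char(0xD84C));  // false: lone high surrogate
        System.out.println(isXml10Char(0xFFFE));  // false: excluded explicitly
    }
}
```

Note that the check only makes sense on code points: the two surrogate halves of U+233B4 each fail it, while the combined code point passes.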

So does that mean that we should make sure we do NOT escape an XML
Char (aside from & > < and so on?)

Then what about XML 1.1?

The XML 1.1 spec reads:

[2]   Char   ::=   [#x1-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]/* any Unicode character, excluding the surrogate
blocks, FFFE, and FFFF. */
[2a]   RestrictedChar   ::=   [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] |
[#x7F-#x84] | [#x86-#x9F]
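
The same kind of sketch for the two XML 1.1 productions, again with hypothetical names, makes the difference from 1.0 easier to see: most control characters become Chars, but also fall into RestrictedChar.

```java
// Sketch only: Xml11CharSketch and its methods are hypothetical names.
final class Xml11CharSketch {

    /** True if cp matches the XML 1.1 Char production [2] quoted above. */
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True if cp matches the XML 1.1 RestrictedChar production [2a] quoted above. */
    static boolean isXml11RestrictedChar(int cp) {
        return (cp >= 0x1 && cp <= 0x8)
                || (cp >= 0xB && cp <= 0xC)
                || (cp >= 0xE && cp <= 0x1F)
                || (cp >= 0x7F && cp <= 0x84)
                || (cp >= 0x86 && cp <= 0x9F);
    }

    public static void main(String[] args) {
        System.out.println(isXml11Char(0x1));           // true
        System.out.println(isXml11RestrictedChar(0x1)); // true
        System.out.println(isXml11RestrictedChar(0x9)); // false: tab is not restricted
        System.out.println(isXml11RestrictedChar(0x85)); // false: gap between the last two ranges
    }
}
```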

The more I look at this, the more confusing it gets!

Gary



--
Thank you,
Gary

http://garygregory.wordpress.com/
http://garygregory.com/
http://people.apache.org/~ggregory/
http://twitter.com/GaryGregory

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: LANG-728 to work with Lang 3.0 way of using escapeXml with > 0x7f characters [WAS RE: svn commit: r1148162 - /commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java]

Posted by Henri Yandell <fl...@gmail.com>.

So you're not saying that we have to escape > 0x7f (old behaviour),
but that we have to escape any supplementary characters?

Hen
