Posted to dev@commons.apache.org by Gary Gregory <GG...@seagullsoftware.com> on 2011/07/19 16:28:15 UTC
LANG-728 to work with Lang 3.0 way of using escapeXml with > 0x7f
characters [WAS RE: svn commit: r1148162 -
/commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java]
Hi All:
I am glad to know there is a 3.0 way of doing that, which is:
@Test
public void testEscapeXmlSupplementaryCharacters() {
CharSequenceTranslator escapeXml =
StringEscapeUtils.ESCAPE_XML.with( NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE) );
assertEquals("Supplementary character must be represented using a single escape", "&#144308;",
escapeXml.translate("\uD84C\uDFB4"));
}
but what about the test the way it was originally written?
// Example from https://issues.apache.org/jira/browse/LANG-728
assertEquals("Supplementary character must be represented using a single escape", "𣎴",
StringEscapeUtils.escapeXml("\uD84C\uDFB4"));
// Example from See http://www.w3.org/International/questions/qa-escapes
assertEquals("Supplementary character must be represented using a single escape", "𣎴",
StringEscapeUtils.escapeXml("\uD84C;\uDFB4;"));
It still fails.
Shouldn't the API be changed to work for this case too? The W3C seems to say so: "you must use the single, code point value for that character" in:
* From http://www.w3.org/International/questions/qa-escapes
* </p>
* <blockquote>
* Supplementary characters are those Unicode characters that have code points higher than the characters in
* the Basic Multilingual Plane (BMP). In UTF-16 a supplementary character is encoded using two 16-bit surrogate code points from the
* BMP. Because of this, some people think that supplementary characters need to be represented using two escapes, but this is incorrect
* – you must use the single, code point value for that character. For example, use &#x233B4; rather than &#xD84C;&#xDFB4;.
* </blockquote>
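To make the W3C point concrete, here is a minimal sketch in plain Java (no Commons Lang) of what "a single escape" means. The class and method names are hypothetical, and the 0x7F threshold is only borrowed from the old escapeXml behaviour discussed in this thread. It walks the string by code point rather than by char, so a surrogate pair becomes one numeric character reference:

```java
// Hypothetical helper for illustration only; this is not the Commons Lang API.
final class SupplementaryEscapeSketch {

    /** Writes every code point above 0x7F as one hexadecimal numeric character reference. */
    static String escapeAbove7F(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            // codePointAt merges a valid high/low surrogate pair into a single code point.
            // An unpaired surrogate is returned as-is; this sketch does not reject it.
            int codePoint = input.codePointAt(i);
            if (codePoint > 0x7F) {
                String hex = Integer.toHexString(codePoint).toUpperCase(java.util.Locale.ROOT);
                out.append("&#x").append(hex).append(';');
            } else {
                out.append((char) codePoint);
            }
            // Step over 2 chars for a supplementary code point, 1 otherwise.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The pair D84C/DFB4 combines to code point 0x233B4 (144308 decimal).
        System.out.println(Integer.toHexString(Character.toCodePoint('\uD84C', '\uDFB4')));
        System.out.println(escapeAbove7F("a\uD84C\uDFB4"));
    }
}
```

Iterating char by char instead would see the two surrogates separately, which is how the incorrect two-escape form comes about.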
Gary
-----Original Message-----
From: bayard@apache.org [mailto:bayard@apache.org]
Sent: Tuesday, July 19, 2011 0:58 AM
To: commits@commons.apache.org
Subject: svn commit: r1148162 - /commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java
Author: bayard
Date: Tue Jul 19 04:58:03 2011
New Revision: 1148162
URL: http://svn.apache.org/viewvc?rev=1148162&view=rev
Log:
Updating unit test for LANG-728 to work with Lang 3.0 way of using escapeXml with > 0x7f characters
Modified:
commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java
Modified: commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java
URL: http://svn.apache.org/viewvc/commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java?rev=1148162&r1=1148161&r2=1148162&view=diff
==============================================================================
--- commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java (original)
+++ commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java Tue Jul 19 04:58:03 2011
@@ -31,6 +31,9 @@
import org.apache.commons.io.IOUtils;
import org.junit.Ignore;
import org.junit.Test;
+import org.apache.commons.lang3.text.translate.CharSequenceTranslator;
+import org.apache.commons.lang3.text.translate.UnicodeEscaper;
+
/**
* Unit tests for {@link StringEscapeUtils}.
*
@@ -333,15 +336,13 @@ public class StringEscapeUtilsTest {
* @see <a href="http://www.w3.org/International/questions/qa-escapes">Using character escapes in markup and CSS</a>
* @see <a href="https://issues.apache.org/jira/browse/LANG-728">LANG-728</a>
*/
- @Ignore
@Test
public void testEscapeXmlSupplementaryCharacters() {
- // Example from https://issues.apache.org/jira/browse/LANG-728
- assertEquals("Supplementary character must be represented using a single escape", "𣎴",
- StringEscapeUtils.escapeXml("\uD84C\uDFB4"));
- // Example from See http://www.w3.org/International/questions/qa-escapes
- assertEquals("Supplementary character must be represented using a single escape", "𣎴",
- StringEscapeUtils.escapeXml("\uD84C;\uDFB4;"));
+ CharSequenceTranslator escapeXml =
+ StringEscapeUtils.ESCAPE_XML.with(
+ UnicodeEscaper.between(0x7f, Integer.MAX_VALUE) );
+
+ assertEquals("Supplementary character must be represented using a single escape", "\u233B4",
+ escapeXml.translate("\uD84C\uDFB4"));
}
// Tests issue #38569
Re: LANG-728 to work with Lang 3.0 way of using escapeXml with > 0x7f
characters [WAS RE: svn commit: r1148162 - /commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java]
Posted by Gary Gregory <ga...@gmail.com>.
Hi Hen,
I have more questions than answers...
On Tue, Jul 19, 2011 at 12:35 PM, Henri Yandell <fl...@gmail.com> wrote:
>
> So you're not saying that we have to escape > 0x7f (old behaviour),
Yeah, the way I read the W3C site, I thought we'd need to escape code
points above 65,535 (0xFFFF), that is, those above the BMP.
>
> but that we have to escape any supplementary characters?
Yes, in particular, an escaped code point above 65,535 must be escaped
with one escape (&#x233B4; rather than &#xD84C;&#xDFB4;).
The way I read the site is that IF you are going to escape a code point
above 65,535, then you MUST use a single code point value.
What is not clear to me yet is if/when you must escape code points above 65,535.
The XML 1.0 spec reads:
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
So does that mean we should make sure we do NOT escape an XML
Char (aside from &, >, <, and so on)?
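For reference, the XML 1.0 Char production quoted above can be written as a simple predicate over code points. This is only a sketch with a hypothetical class and method name. It does not settle whether such characters should be escaped, only whether they are legal in a document:

```java
// Hypothetical helper mirroring production [2] Char of XML 1.0; not part of Commons Lang.
final class Xml10CharSketch {

    /** Returns true if the code point matches the XML 1.0 Char production. */
    static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(0x233B4)); // supplementary code point from LANG-728
        System.out.println(isXml10Char(0xD84C));  // lone high surrogate
    }
}
```

Under this production the supplementary code point 0x233B4 is a legal Char, while the individual surrogates 0xD84C and 0xDFB4 are not. That matches the W3C advice that the two-escape form is wrong.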
Then what about XML 1.1?
The XML 1.1 spec reads:
[2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
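The two XML 1.1 productions can be sketched the same way. The names below are hypothetical. As far as I understand the 1.1 spec, a RestrictedChar may appear only as a character reference, so this is the one range where escaping would be mandatory rather than optional, but treat that reading as an assumption:

```java
// Hypothetical helpers mirroring productions [2] and [2a] of XML 1.1; not part of Commons Lang.
final class Xml11CharSketch {

    /** Returns true if the code point matches the XML 1.1 Char production. */
    static boolean isXml11Char(int codePoint) {
        return (codePoint >= 0x1 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Returns true if the code point matches the XML 1.1 RestrictedChar production. */
    static boolean isRestrictedChar(int codePoint) {
        return (codePoint >= 0x1 && codePoint <= 0x8)
            || (codePoint >= 0xB && codePoint <= 0xC)
            || (codePoint >= 0xE && codePoint <= 0x1F)
            || (codePoint >= 0x7F && codePoint <= 0x84)
            || (codePoint >= 0x86 && codePoint <= 0x9F);
    }

    public static void main(String[] args) {
        System.out.println(isXml11Char(0x1));      // a Char in 1.1, unlike in 1.0
        System.out.println(isRestrictedChar(0x1)); // but also a RestrictedChar
    }
}
```

Note that 0x7F starts a restricted range in XML 1.1, which sits right on the boundary used by the "> 0x7f" escaping in the commit under discussion.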
The more I look at this, the more confusing it gets!
Gary
>
> Hen
>
--
Thank you,
Gary
http://garygregory.wordpress.com/
http://garygregory.com/
http://people.apache.org/~ggregory/
http://twitter.com/GaryGregory
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org
Re: LANG-728 to work with Lang 3.0 way of using escapeXml with > 0x7f
characters [WAS RE: svn commit: r1148162 - /commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java]
Posted by Henri Yandell <fl...@gmail.com>.
So you're not saying that we have to escape > 0x7f (old behaviour),
but that we have to escape any supplementary characters?
Hen