You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@commons.apache.org by Lawrence Angrave <an...@illinois.edu> on 2013/03/05 23:28:19 UTC

[lang] Fix for Apache2,Apache3 StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii char) & simpler unicode hex logic

Hi,

Some comments that are relevant to  Apache3 UnicodeEscaper and Apache2's 
StringEscapeUtils.java
Summary-

  * I noticed the current Apache code creates three String objects each
    time it writes a unicode hexadecimal  value.
  * Apache3 can also create a char[] array per character translation
    (but I do include a fix for that)
  * This is a easy-to-fix performance bottleneck when writing many
    non-ascii characters.
  * The logic to test for unicode values of different magnitudes can
    also be simplified.
  * Benchmark and code fixes for Apache2 and Apache 3 are included. I do
    not have time to become an Apache maintainer. use or ignore at your
    choice.
  * I'm not interested in being a developer for Commons Lang  Use it or
    not  - that's a choice for Commons Lang developers.

A simple fix more than doubles the string escape speed (40 ms v 100ms to 
translate all unicode characters) for Apache3.
The older Apache2-style implementation can now translate all unicode 
characters in 8ms.

The existing Apache3/Apache2 write unicode hex values like this-
         if (codepoint > 0xfff) {
                 out.write("\\u" + hex(codepoint));
             } else if (codepoint > 0xff) {
                 out.write("\\u0" + hex(codepoint));
             } else if (codepoint > 0xf) {
                 out.write("\\u00" + hex(codepoint));
             } else {
                 out.write("\\u000" + hex(codepoint));
             }
The hex() function,
//hex(): return Integer.toHexString(codepoint).toUpperCase(Locale.ENGLISH);
also creates two string objects, so we have 3 objects per unicode hex value.

FIX:

The padding logic can be simplified and per-character object creation 
can be eliminated by writing hex digits directly
             out.write("\\u");
             out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
             out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
             out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
             out.write(HEX_DIGIT[(codepoint) & 15]);


where  HEX_DIGIT  is
public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
I believe this is safe for all Locales.


When benchmarked it was disconcerting that Apache3 is still five times 
slower (40ms instead of 8ms) than my rewritten Apache2 version (included 
below).
My guess is that there are other unnecessary per-character object 
creation issues still lurking Here's one example -
CharSequenceTranslator.translate(CharSequence input, Writer out) :
        char[] c = *Character.toChars*(Character.codePointAt(input, pos))

For better performance this should use toChars(int codePoint,  char[] 
dst, int dstIndex) , which can re-use the dst char array



The benchmark, my version of a  Apache2-style escapeJavaStyleString 
implementation and the code fix for UnicodeEscaper.java  are included below.
If you do find this useful, some source-code attribution would be 
appreciated.
Legal:  I, Lawrence Angrave (email address, angrave@illinois.edu 
<ma...@illinois.edu>), am the author and copyright holder of 
the new code provided in this email, and share and re-license this code 
under the current Apache2 license.

I hope this email does not go into a blackhole... Feel free to forward 
it to the relevant maintainers.

Regards,
Lawrence.

     public static final char[] HEX_DIGIT = 
"0123456789ABCDEF".toCharArray();
     public static final char[] CONTROL_CHARS; // non-zero entries for 
special case control characters
     static {
         CONTROL_CHARS = new char[32];
         CONTROL_CHARS['\b'] = 'b';
         CONTROL_CHARS['\n'] = 'n';
         CONTROL_CHARS['\t'] = 't';
         CONTROL_CHARS['\f'] = 'f';
         CONTROL_CHARS['\r'] = 'r';
     }
     public static void  escapeJavaStyleString(Writer out, String s, 
boolean escapeSingleQuote) throws IOException {
// Apache2 makes the following checks, so we will too-
     if(out==null) throw new IllegalArgumentException("The Writer must 
not be null");
         if(s == null) return;

         final int len = s.length();
         for(int i =0; i < len;i++)
             escapeChar(out,s.charAt(i), escapeSingleQuote);
     }
     public static void escapeChar(Writer out, char c, boolean 
escapeSingleQuote)
             throws IOException {
         // Most common case
         if (c >= 32 && c < 127) {
             if (c == '\\' || c == '"' || (c == '\'' && escapeSingleQuote))
                 out.write('\\');
             out.write(c);
             return;
         }
         out.write('\\');
         if (c < 32 && CONTROL_CHARS[c] != 0) {
             out.write(CONTROL_CHARS[c]);
             return;
         }
         // Fast 4 digit uppercase hexadecimal without object creation
         out.write('u');
         out.write(HEX_DIGIT[(c >> 12) & 15]);
         out.write(HEX_DIGIT[(c >> 8) & 15]);
         out.write(HEX_DIGIT[(c >> 4) & 15]);
         out.write(HEX_DIGIT[(c) & 15]);
     }




FYI The benchmark test just writes all possible unicode characters into 
a null writer:
             Writer nullWriter = new Writer() {
             public void write(String s) {
             };

             public void write(int c) {
             }

             public void close() throws IOException {
             }

             public void flush() throws IOException {
             }

             public void write(char[] cbuf, int off, int len) throws 
IOException {
             }
         };
         StringBuilder sb = new StringBuilder(0x10000);
         for (int i = 0; i <= 0xffff; i++)
             sb.append((char) i);
         String allChars = sb.toString();

         long t1 = System.currentTimeMillis();
         StringEscaper.escapeJavaStyleString(nullWriter, allChars, true);
         long t2 = System.currentTimeMillis();
         System.out.println(t2 - t1);

         long t3 = System.currentTimeMillis();
         CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_JAVA;
         translator.translate(allChars, nullWriter);
         long t4 = System.currentTimeMillis();
         System.out.println(t4 - t3);



The modification to Apache3 UnicodeEscaper :

         if (codepoint > 0xffff) {
             // TODO: Figure out what to do. Output as two Unicodes?
             // Does this make this a Java-specific output class?
             out.write("\\u" + hex(codepoint));
         } else if (1 == 0) { //*OLD SLOW CODE* (can be removed)
*if (codepoint > 0xfff) {
                 out.write("\\u" + hex(codepoint));
             } else if (codepoint > 0xff) {
                 out.write("\\u0" + hex(codepoint));
             } else if (codepoint > 0xf) {
                 out.write("\\u00" + hex(codepoint));
             } else {
                 out.write("\\u000" + hex(codepoint));
             }*
         } else { // *NEW FAST CODE*
*            out.write("\\u");
             out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
             out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
             out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
             out.write(HEX_DIGIT[(codepoint) & 15]);*
         }

*and add    public static final char[] HEX_DIGIT = 
"0123456789ABCDEF".toCharArray();**
*

ps.
The home page for Commons lang
http://commons.apache.org/proper/commons-lang/)
  has the following 404 link on the left
     Contributing Patches
http://commons.apache.org/proper/patches.html
(I believe the 'proper' is unnecessary)

Re: [lang] Fix for Apache2,Apache3 StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii char) & simpler unicode hex logic

Posted by sebb <se...@gmail.com>.

On 5 March 2013 22:28, Lawrence Angrave <an...@illinois.edu> wrote:
> Hi,

Thanks for the feedback.

> Some comments that are relevant to  Apache3 UnicodeEscaper and Apache2's
> StringEscapeUtils.java
> Summary-
>
>  * I noticed the current Apache code creates three String objects each
>    time it writes a unicode hexadecimal  value.
>  * Apache3 can also create a char[] array per character translation
>    (but I do include a fix for that)
>  * This is a easy-to-fix performance bottleneck when writing many
>    non-ascii characters.
>  * The logic to test for unicode values of different magnitudes can
>    also be simplified.
>  * Benchmark and code fixes for Apache2 and Apache 3 are included. I do
>    not have time to become an Apache maintainer. use or ignore at your
>    choice.
>  * I'm not interested in being a developer for Commons Lang  Use it or
>    not  - that's a choice for Commons Lang developers.
>
> A simple fix more than doubles the string escape speed (40 ms v 100ms to
> translate all unicode characters) for Apache3.
> The older Apache2-style implementation can now translate all unicode
> characters in 8ms.
>
> The existing Apache3/Apache2 write unicode hex values like this-
>         if (codepoint > 0xfff) {
>                 out.write("\\u" + hex(codepoint));
>             } else if (codepoint > 0xff) {
>                 out.write("\\u0" + hex(codepoint));
>             } else if (codepoint > 0xf) {
>                 out.write("\\u00" + hex(codepoint));
>             } else {
>                 out.write("\\u000" + hex(codepoint));
>             }
> The hex() function,
> //hex(): return Integer.toHexString(codepoint).toUpperCase(Locale.ENGLISH);
> also creates two string objects, so we have 3 objects per unicode hex value.
>
> FIX:
>
> The padding logic can be simplified and per-character object creation can be
> eliminated by writing hex digits directly
>             out.write("\\u");
>             out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>             out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>             out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>             out.write(HEX_DIGIT[(codepoint) & 15]);
>
>
> where  HEX_DIGIT  is
> public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
> I believe this is safe for all Locales.
>
>
> When benchmarked it was disconcerting that Apache3 is still five times
> slower (40ms instead of 8ms) than my rewritten Apache2 version (included
> below).
> My guess is that there are other unnecessary per-character object creation
> issues still lurking Here's one example -
> CharSequenceTranslator.translate(CharSequence input, Writer out) :
>        char[] c = *Character.toChars*(Character.codePointAt(input, pos))
>
> For better performance this should use toChars(int codePoint,  char[] dst,
> int dstIndex) , which can re-use the dst char array
>
>
>
> The benchmark, my version of a  Apache2-style escapeJavaStyleString
> implementation and the code fix for UnicodeEscaper.java  are included below.
> If you do find this useful, some source-code attribution would be
> appreciated.
> Legal:  I, Lawrence Angrave (email address, angrave@illinois.edu
> <ma...@illinois.edu>), am the author and copyright holder of the
> new code provided in this email, and share and re-license this code under
> the current Apache2 license.
>
> I hope this email does not go into a blackhole... Feel free to forward it to
> the relevant maintainers.

This is the correct list for the maintainers, however bug reports and
patches are normally handled via JIRA, in this case

https://issues.apache.org/jira/browse/LANG

It's quite difficult to keep track of patches on mailing lists, and
easy for them to get forgotten or overlooked.

> Regards,
> Lawrence.
>
>     public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
>     public static final char[] CONTROL_CHARS; // non-zero entries for
> special case control characters
>     static {
>         CONTROL_CHARS = new char[32];
>         CONTROL_CHARS['\b'] = 'b';
>         CONTROL_CHARS['\n'] = 'n';
>         CONTROL_CHARS['\t'] = 't';
>         CONTROL_CHARS['\f'] = 'f';
>         CONTROL_CHARS['\r'] = 'r';
>     }
>     public static void  escapeJavaStyleString(Writer out, String s, boolean
> escapeSingleQuote) throws IOException {
> // Apache2 makes the following checks, so we will too-
>     if(out==null) throw new IllegalArgumentException("The Writer must not be
> null");
>         if(s == null) return;
>
>         final int len = s.length();
>         for(int i =0; i < len;i++)
>             escapeChar(out,s.charAt(i), escapeSingleQuote);
>     }
>     public static void escapeChar(Writer out, char c, boolean
> escapeSingleQuote)
>             throws IOException {
>         // Most common case
>         if (c >= 32 && c < 127) {
>             if (c == '\\' || c == '"' || (c == '\'' && escapeSingleQuote))
>                 out.write('\\');
>             out.write(c);
>             return;
>         }
>         out.write('\\');
>         if (c < 32 && CONTROL_CHARS[c] != 0) {
>             out.write(CONTROL_CHARS[c]);
>             return;
>         }
>         // Fast 4 digit uppercase hexadecimal without object creation
>         out.write('u');
>         out.write(HEX_DIGIT[(c >> 12) & 15]);
>         out.write(HEX_DIGIT[(c >> 8) & 15]);
>         out.write(HEX_DIGIT[(c >> 4) & 15]);
>         out.write(HEX_DIGIT[(c) & 15]);
>     }
>
>
>
>
> FYI The benchmark test just writes all possible unicode characters into a
> null writer:
>             Writer nullWriter = new Writer() {
>             public void write(String s) {
>             };
>
>             public void write(int c) {
>             }
>
>             public void close() throws IOException {
>             }
>
>             public void flush() throws IOException {
>             }
>
>             public void write(char[] cbuf, int off, int len) throws
> IOException {
>             }
>         };
>         StringBuilder sb = new StringBuilder(0x10000);
>         for (int i = 0; i <= 0xffff; i++)
>             sb.append((char) i);
>         String allChars = sb.toString();
>
>         long t1 = System.currentTimeMillis();
>         StringEscaper.escapeJavaStyleString(nullWriter, allChars, true);
>         long t2 = System.currentTimeMillis();
>         System.out.println(t2 - t1);
>
>         long t3 = System.currentTimeMillis();
>         CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_JAVA;
>         translator.translate(allChars, nullWriter);
>         long t4 = System.currentTimeMillis();
>         System.out.println(t4 - t3);
>
>
>
> The modification to Apache3 UnicodeEscaper :
>
>         if (codepoint > 0xffff) {
>             // TODO: Figure out what to do. Output as two Unicodes?
>             // Does this make this a Java-specific output class?
>             out.write("\\u" + hex(codepoint));
>         } else if (1 == 0) { //*OLD SLOW CODE* (can be removed)
> *if (codepoint > 0xfff) {
>                 out.write("\\u" + hex(codepoint));
>             } else if (codepoint > 0xff) {
>                 out.write("\\u0" + hex(codepoint));
>             } else if (codepoint > 0xf) {
>                 out.write("\\u00" + hex(codepoint));
>             } else {
>                 out.write("\\u000" + hex(codepoint));
>             }*
>         } else { // *NEW FAST CODE*
> *            out.write("\\u");
>             out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>             out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>             out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>             out.write(HEX_DIGIT[(codepoint) & 15]);*
>         }
>
> *and add    public static final char[] HEX_DIGIT =
> "0123456789ABCDEF".toCharArray();**
> *
>
> ps.
> The home page for Commons lang
> http://commons.apache.org/proper/commons-lang/)
>  has the following 404 link on the left
>     Contributing Patches
> http://commons.apache.org/proper/patches.html
> (I believe the 'proper' is unnecessary)
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [lang] Fix for Apache2,Apache3 StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii char) & simpler unicode hex logic

Posted by Henri Yandell <fl...@gmail.com>.

https://issues.apache.org/jira/browse/LANG-877

On Fri, Mar 8, 2013 at 3:21 PM, Henri Yandell <fl...@gmail.com> wrote:
> Noting Lawrence's response. I'll get his email's contents added to JIRA :)
>
> ---------- Forwarded message ----------
> From: Lawrence Angrave <an...@illinois.edu>
> Date: Fri, Mar 8, 2013 at 2:36 PM
> Subject: Re: [lang] Fix for Apache2,Apache3
> StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii
> char) & simpler unicode hex logic
> To: Henri Yandell <fl...@gmail.com>
>
>
> On 3/8/13 12:51 PM, Henri Yandell wrote:
>>
>> Thanks for the report Lawrence.
>>
>> With regards to Sebb's comment, I'm happy to put in a JIRA, but wanted
>> to resolve the attribution comment first:
>>
>> "> Legal:  I, Lawrence Angrave (email address, angrave@illinois.edu
>>>
>>> <ma...@illinois.edu>), am the author and copyright holder of the
>>> new code provided in this email, and share and re-license this code under
>>> the current Apache2 license."
>>
>> Our solution on attribution of contributions is to add them to the
>> contributor list (inside the pom.xml in the source and published on
>> this page http://commons.apache.org/proper/commons-lang/team-list.html).
>>
>> Can you confirm that's acceptable?
>
> Yes! Thanks!
>
>
>
>
>> Typically we don't put the mail address in for contributors to reduce
>> spam hitting them.
>
> Very sensible. Glad you found it helpful.
> Best,
> Lawrence.
>
>> Thanks again,
>>
>> Hen
>>
>> On Tue, Mar 5, 2013 at 2:28 PM, Lawrence Angrave <an...@illinois.edu> wrote:
>>>
>>> Hi,
>>>
>>> Some comments that are relevant to  Apache3 UnicodeEscaper and Apache2's
>>> StringEscapeUtils.java
>>> Summary-
>>>
>>>   * I noticed the current Apache code creates three String objects each
>>>     time it writes a unicode hexadecimal  value.
>>>   * Apache3 can also create a char[] array per character translation
>>>     (but I do include a fix for that)
>>>   * This is a easy-to-fix performance bottleneck when writing many
>>>     non-ascii characters.
>>>   * The logic to test for unicode values of different magnitudes can
>>>     also be simplified.
>>>   * Benchmark and code fixes for Apache2 and Apache 3 are included. I do
>>>     not have time to become an Apache maintainer. use or ignore at your
>>>     choice.
>>>   * I'm not interested in being a developer for Commons Lang  Use it or
>>>     not  - that's a choice for Commons Lang developers.
>>>
>>> A simple fix more than doubles the string escape speed (40 ms v 100ms to
>>> translate all unicode characters) for Apache3.
>>> The older Apache2-style implementation can now translate all unicode
>>> characters in 8ms.
>>>
>>> The existing Apache3/Apache2 write unicode hex values like this-
>>>          if (codepoint > 0xfff) {
>>>                  out.write("\\u" + hex(codepoint));
>>>              } else if (codepoint > 0xff) {
>>>                  out.write("\\u0" + hex(codepoint));
>>>              } else if (codepoint > 0xf) {
>>>                  out.write("\\u00" + hex(codepoint));
>>>              } else {
>>>                  out.write("\\u000" + hex(codepoint));
>>>              }
>>> The hex() function,
>>> //hex(): return Integer.toHexString(codepoint).toUpperCase(Locale.ENGLISH);
>>> also creates two string objects, so we have 3 objects per unicode hex value.
>>>
>>> FIX:
>>>
>>> The padding logic can be simplified and per-character object creation can be
>>> eliminated by writing hex digits directly
>>>              out.write("\\u");
>>>              out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>>>              out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>>>              out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>>>              out.write(HEX_DIGIT[(codepoint) & 15]);
>>>
>>>
>>> where  HEX_DIGIT  is
>>> public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
>>> I believe this is safe for all Locales.
>>>
>>>
>>> When benchmarked it was disconcerting that Apache3 is still five times
>>> slower (40ms instead of 8ms) than my rewritten Apache2 version (included
>>> below).
>>> My guess is that there are other unnecessary per-character object creation
>>> issues still lurking Here's one example -
>>> CharSequenceTranslator.translate(CharSequence input, Writer out) :
>>>         char[] c = *Character.toChars*(Character.codePointAt(input, pos))
>>>
>>> For better performance this should use toChars(int codePoint,  char[] dst,
>>> int dstIndex) , which can re-use the dst char array
>>>
>>>
>>>
>>> The benchmark, my version of a  Apache2-style escapeJavaStyleString
>>> implementation and the code fix for UnicodeEscaper.java  are included below.
>>> If you do find this useful, some source-code attribution would be
>>> appreciated.
>>> Legal:  I, Lawrence Angrave (email address, angrave@illinois.edu
>>> <ma...@illinois.edu>), am the author and copyright holder of the
>>> new code provided in this email, and share and re-license this code under
>>> the current Apache2 license.
>>>
>>> I hope this email does not go into a blackhole... Feel free to forward it to
>>> the relevant maintainers.
>>>
>>> Regards,
>>> Lawrence.
>>>
>>>      public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
>>>      public static final char[] CONTROL_CHARS; // non-zero entries for
>>> special case control characters
>>>      static {
>>>          CONTROL_CHARS = new char[32];
>>>          CONTROL_CHARS['\b'] = 'b';
>>>          CONTROL_CHARS['\n'] = 'n';
>>>          CONTROL_CHARS['\t'] = 't';
>>>          CONTROL_CHARS['\f'] = 'f';
>>>          CONTROL_CHARS['\r'] = 'r';
>>>      }
>>>      public static void  escapeJavaStyleString(Writer out, String s, boolean
>>> escapeSingleQuote) throws IOException {
>>> // Apache2 makes the following checks, so we will too-
>>>      if(out==null) throw new IllegalArgumentException("The Writer must not be
>>> null");
>>>          if(s == null) return;
>>>
>>>          final int len = s.length();
>>>          for(int i =0; i < len;i++)
>>>              escapeChar(out,s.charAt(i), escapeSingleQuote);
>>>      }
>>>      public static void escapeChar(Writer out, char c, boolean
>>> escapeSingleQuote)
>>>              throws IOException {
>>>          // Most common case
>>>          if (c >= 32 && c < 127) {
>>>              if (c == '\\' || c == '"' || (c == '\'' && escapeSingleQuote))
>>>                  out.write('\\');
>>>              out.write(c);
>>>              return;
>>>          }
>>>          out.write('\\');
>>>          if (c < 32 && CONTROL_CHARS[c] != 0) {
>>>              out.write(CONTROL_CHARS[c]);
>>>              return;
>>>          }
>>>          // Fast 4 digit uppercase hexadecimal without object creation
>>>          out.write('u');
>>>          out.write(HEX_DIGIT[(c >> 12) & 15]);
>>>          out.write(HEX_DIGIT[(c >> 8) & 15]);
>>>          out.write(HEX_DIGIT[(c >> 4) & 15]);
>>>          out.write(HEX_DIGIT[(c) & 15]);
>>>      }
>>>
>>>
>>>
>>>
>>> FYI The benchmark test just writes all possible unicode characters into a
>>> null writer:
>>>              Writer nullWriter = new Writer() {
>>>              public void write(String s) {
>>>              };
>>>
>>>              public void write(int c) {
>>>              }
>>>
>>>              public void close() throws IOException {
>>>              }
>>>
>>>              public void flush() throws IOException {
>>>              }
>>>
>>>              public void write(char[] cbuf, int off, int len) throws
>>> IOException {
>>>              }
>>>          };
>>>          StringBuilder sb = new StringBuilder(0x10000);
>>>          for (int i = 0; i <= 0xffff; i++)
>>>              sb.append((char) i);
>>>          String allChars = sb.toString();
>>>
>>>          long t1 = System.currentTimeMillis();
>>>          StringEscaper.escapeJavaStyleString(nullWriter, allChars, true);
>>>          long t2 = System.currentTimeMillis();
>>>          System.out.println(t2 - t1);
>>>
>>>          long t3 = System.currentTimeMillis();
>>>          CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_JAVA;
>>>          translator.translate(allChars, nullWriter);
>>>          long t4 = System.currentTimeMillis();
>>>          System.out.println(t4 - t3);
>>>
>>>
>>>
>>> The modification to Apache3 UnicodeEscaper :
>>>
>>>          if (codepoint > 0xffff) {
>>>              // TODO: Figure out what to do. Output as two Unicodes?
>>>              // Does this make this a Java-specific output class?
>>>              out.write("\\u" + hex(codepoint));
>>>          } else if (1 == 0) { //*OLD SLOW CODE* (can be removed)
>>> *if (codepoint > 0xfff) {
>>>                  out.write("\\u" + hex(codepoint));
>>>              } else if (codepoint > 0xff) {
>>>                  out.write("\\u0" + hex(codepoint));
>>>              } else if (codepoint > 0xf) {
>>>                  out.write("\\u00" + hex(codepoint));
>>>              } else {
>>>                  out.write("\\u000" + hex(codepoint));
>>>              }*
>>>          } else { // *NEW FAST CODE*
>>> *            out.write("\\u");
>>>              out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>>>              out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>>>              out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>>>              out.write(HEX_DIGIT[(codepoint) & 15]);*
>>>          }
>>>
>>> *and add    public static final char[] HEX_DIGIT =
>>> "0123456789ABCDEF".toCharArray();**
>>> *
>>>
>>> ps.
>>> The home page for Commons lang
>>> http://commons.apache.org/proper/commons-lang/)
>>>   has the following 404 link on the left
>>>      Contributing Patches
>>> http://commons.apache.org/proper/patches.html
>>> (I believe the 'proper' is unnecessary)
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Fwd: [lang] Fix for Apache2,Apache3 StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii char) & simpler unicode hex logic

Posted by Henri Yandell <fl...@gmail.com>.

Noting Lawrence's response. I'll get his email's contents added to JIRA :)

---------- Forwarded message ----------
From: Lawrence Angrave <an...@illinois.edu>
Date: Fri, Mar 8, 2013 at 2:36 PM
Subject: Re: [lang] Fix for Apache2,Apache3
StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii
char) & simpler unicode hex logic
To: Henri Yandell <fl...@gmail.com>


On 3/8/13 12:51 PM, Henri Yandell wrote:
>
> Thanks for the report Lawrence.
>
> With regards to Sebb's comment, I'm happy to put in a JIRA, but wanted
> to resolve the attribution comment first:
>
> "> Legal:  I, Lawrence Angrave (email address, angrave@illinois.edu
>>
>> <ma...@illinois.edu>), am the author and copyright holder of the
>> new code provided in this email, and share and re-license this code under
>> the current Apache2 license."
>
> Our solution on attribution of contributions is to add them to the
> contributor list (inside the pom.xml in the source and published on
> this page http://commons.apache.org/proper/commons-lang/team-list.html).
>
> Can you confirm that's acceptable?

Yes! Thanks!




> Typically we don't put the mail address in for contributors to reduce
> spam hitting them.

Very sensible. Glad you found it helpful.
Best,
Lawrence.

> Thanks again,
>
> Hen
>
> On Tue, Mar 5, 2013 at 2:28 PM, Lawrence Angrave <an...@illinois.edu> wrote:
>>
>> Hi,
>>
>> Some comments that are relevant to  Apache3 UnicodeEscaper and Apache2's
>> StringEscapeUtils.java
>> Summary-
>>
>>   * I noticed the current Apache code creates three String objects each
>>     time it writes a unicode hexadecimal  value.
>>   * Apache3 can also create a char[] array per character translation
>>     (but I do include a fix for that)
>>   * This is a easy-to-fix performance bottleneck when writing many
>>     non-ascii characters.
>>   * The logic to test for unicode values of different magnitudes can
>>     also be simplified.
>>   * Benchmark and code fixes for Apache2 and Apache 3 are included. I do
>>     not have time to become an Apache maintainer. use or ignore at your
>>     choice.
>>   * I'm not interested in being a developer for Commons Lang  Use it or
>>     not  - that's a choice for Commons Lang developers.
>>
>> A simple fix more than doubles the string escape speed (40 ms v 100ms to
>> translate all unicode characters) for Apache3.
>> The older Apache2-style implementation can now translate all unicode
>> characters in 8ms.
>>
>> The existing Apache3/Apache2 write unicode hex values like this-
>>          if (codepoint > 0xfff) {
>>                  out.write("\\u" + hex(codepoint));
>>              } else if (codepoint > 0xff) {
>>                  out.write("\\u0" + hex(codepoint));
>>              } else if (codepoint > 0xf) {
>>                  out.write("\\u00" + hex(codepoint));
>>              } else {
>>                  out.write("\\u000" + hex(codepoint));
>>              }
>> The hex() function,
>> //hex(): return Integer.toHexString(codepoint).toUpperCase(Locale.ENGLISH);
>> also creates two string objects, so we have 3 objects per unicode hex value.
>>
>> FIX:
>>
>> The padding logic can be simplified and per-character object creation can be
>> eliminated by writing hex digits directly
>>              out.write("\\u");
>>              out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>>              out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>>              out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>>              out.write(HEX_DIGIT[(codepoint) & 15]);
>>
>>
>> where  HEX_DIGIT  is
>> public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
>> I believe this is safe for all Locales.
>>
>>
>> When benchmarked it was disconcerting that Apache3 is still five times
>> slower (40ms instead of 8ms) than my rewritten Apache2 version (included
>> below).
>> My guess is that there are other unnecessary per-character object creation
>> issues still lurking Here's one example -
>> CharSequenceTranslator.translate(CharSequence input, Writer out) :
>>         char[] c = *Character.toChars*(Character.codePointAt(input, pos))
>>
>> For better performance this should use toChars(int codePoint,  char[] dst,
>> int dstIndex) , which can re-use the dst char array
>>
>>
>>
>> The benchmark, my version of a  Apache2-style escapeJavaStyleString
>> implementation and the code fix for UnicodeEscaper.java  are included below.
>> If you do find this useful, some source-code attribution would be
>> appreciated.
>> Legal:  I, Lawrence Angrave (email address, angrave@illinois.edu
>> <ma...@illinois.edu>), am the author and copyright holder of the
>> new code provided in this email, and share and re-license this code under
>> the current Apache2 license.
>>
>> I hope this email does not go into a blackhole... Feel free to forward it to
>> the relevant maintainers.
>>
>> Regards,
>> Lawrence.
>>
>>      public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
>>      public static final char[] CONTROL_CHARS; // non-zero entries for
>> special case control characters
>>      static {
>>          CONTROL_CHARS = new char[32];
>>          CONTROL_CHARS['\b'] = 'b';
>>          CONTROL_CHARS['\n'] = 'n';
>>          CONTROL_CHARS['\t'] = 't';
>>          CONTROL_CHARS['\f'] = 'f';
>>          CONTROL_CHARS['\r'] = 'r';
>>      }
>>      public static void  escapeJavaStyleString(Writer out, String s, boolean
>> escapeSingleQuote) throws IOException {
>> // Apache2 makes the following checks, so we will too-
>>      if(out==null) throw new IllegalArgumentException("The Writer must not be
>> null");
>>          if(s == null) return;
>>
>>          final int len = s.length();
>>          for(int i =0; i < len;i++)
>>              escapeChar(out,s.charAt(i), escapeSingleQuote);
>>      }
>>      public static void escapeChar(Writer out, char c, boolean
>> escapeSingleQuote)
>>              throws IOException {
>>          // Most common case
>>          if (c >= 32 && c < 127) {
>>              if (c == '\\' || c == '"' || (c == '\'' && escapeSingleQuote))
>>                  out.write('\\');
>>              out.write(c);
>>              return;
>>          }
>>          out.write('\\');
>>          if (c < 32 && CONTROL_CHARS[c] != 0) {
>>              out.write(CONTROL_CHARS[c]);
>>              return;
>>          }
>>          // Fast 4 digit uppercase hexadecimal without object creation
>>          out.write('u');
>>          out.write(HEX_DIGIT[(c >> 12) & 15]);
>>          out.write(HEX_DIGIT[(c >> 8) & 15]);
>>          out.write(HEX_DIGIT[(c >> 4) & 15]);
>>          out.write(HEX_DIGIT[(c) & 15]);
>>      }
>>
>>
>>
>>
>> FYI The benchmark test just writes all possible unicode characters into a
>> null writer:
>>              Writer nullWriter = new Writer() {
>>              public void write(String s) {
>>              };
>>
>>              public void write(int c) {
>>              }
>>
>>              public void close() throws IOException {
>>              }
>>
>>              public void flush() throws IOException {
>>              }
>>
>>              public void write(char[] cbuf, int off, int len) throws
>> IOException {
>>              }
>>          };
>>          StringBuilder sb = new StringBuilder(0x10000);
>>          for (int i = 0; i <= 0xffff; i++)
>>              sb.append((char) i);
>>          String allChars = sb.toString();
>>
>>          long t1 = System.currentTimeMillis();
>>          StringEscaper.escapeJavaStyleString(nullWriter, allChars, true);
>>          long t2 = System.currentTimeMillis();
>>          System.out.println(t2 - t1);
>>
>>          long t3 = System.currentTimeMillis();
>>          CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_JAVA;
>>          translator.translate(allChars, nullWriter);
>>          long t4 = System.currentTimeMillis();
>>          System.out.println(t4 - t3);
>>
>>
>>
>> The modification to Apache3 UnicodeEscaper :
>>
>>          if (codepoint > 0xffff) {
>>              // TODO: Figure out what to do. Output as two Unicodes?
>>              // Does this make this a Java-specific output class?
>>              out.write("\\u" + hex(codepoint));
>>          } else if (1 == 0) { //*OLD SLOW CODE* (can be removed)
>> *if (codepoint > 0xfff) {
>>                  out.write("\\u" + hex(codepoint));
>>              } else if (codepoint > 0xff) {
>>                  out.write("\\u0" + hex(codepoint));
>>              } else if (codepoint > 0xf) {
>>                  out.write("\\u00" + hex(codepoint));
>>              } else {
>>                  out.write("\\u000" + hex(codepoint));
>>              }*
>>          } else { // *NEW FAST CODE*
>> *            out.write("\\u");
>>              out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>>              out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>>              out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>>              out.write(HEX_DIGIT[(codepoint) & 15]);*
>>          }
>>
>> *and add    public static final char[] HEX_DIGIT =
>> "0123456789ABCDEF".toCharArray();**
>> *
>>
>> ps.
>> The home page for Commons lang
>> http://commons.apache.org/proper/commons-lang/)
>>   has the following 404 link on the left
>>      Contributing Patches
>> http://commons.apache.org/proper/patches.html
>> (I believe the 'proper' is unnecessary)
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [lang] Fix for Apache2,Apache3 StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii char) & simpler unicode hex logic

Posted by Henri Yandell <fl...@gmail.com>.

Thanks for the report Lawrence.

With regards to Sebb's comment, I'm happy to put in a JIRA, but wanted
to resolve the attribution comment first:

"> Legal:  I, Lawrence Angrave (email address, angrave@illinois.edu
> <ma...@illinois.edu>), am the author and copyright holder of the
> new code provided in this email, and share and re-license this code under
> the current Apache2 license."

Our solution on attribution of contributions is to add them to the
contributor list (inside the pom.xml in the source and published on
this page http://commons.apache.org/proper/commons-lang/team-list.html).

Can you confirm that's acceptable?

Typically we don't put the mail address in for contributors to reduce
spam hitting them.

Thanks again,

Hen

On Tue, Mar 5, 2013 at 2:28 PM, Lawrence Angrave <an...@illinois.edu> wrote:
> Hi,
>
> Some comments that are relevant to  Apache3 UnicodeEscaper and Apache2's
> StringEscapeUtils.java
> Summary-
>
>  * I noticed the current Apache code creates three String objects each
>    time it writes a unicode hexadecimal  value.
>  * Apache3 can also create a char[] array per character translation
>    (but I do include a fix for that)
>  * This is a easy-to-fix performance bottleneck when writing many
>    non-ascii characters.
>  * The logic to test for unicode values of different magnitudes can
>    also be simplified.
>  * Benchmark and code fixes for Apache2 and Apache 3 are included. I do
>    not have time to become an Apache maintainer. use or ignore at your
>    choice.
>  * I'm not interested in being a developer for Commons Lang  Use it or
>    not  - that's a choice for Commons Lang developers.
>
> A simple fix more than doubles the string escape speed (40 ms v 100ms to
> translate all unicode characters) for Apache3.
> The older Apache2-style implementation can now translate all unicode
> characters in 8ms.
>
> The existing Apache3/Apache2 write unicode hex values like this-
>         if (codepoint > 0xfff) {
>                 out.write("\\u" + hex(codepoint));
>             } else if (codepoint > 0xff) {
>                 out.write("\\u0" + hex(codepoint));
>             } else if (codepoint > 0xf) {
>                 out.write("\\u00" + hex(codepoint));
>             } else {
>                 out.write("\\u000" + hex(codepoint));
>             }
> The hex() function,
> //hex(): return Integer.toHexString(codepoint).toUpperCase(Locale.ENGLISH);
> also creates two string objects, so we have 3 objects per unicode hex value.
>
> FIX:
>
> The padding logic can be simplified and per-character object creation can be
> eliminated by writing hex digits directly
>             out.write("\\u");
>             out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>             out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>             out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>             out.write(HEX_DIGIT[(codepoint) & 15]);
>
>
> where  HEX_DIGIT  is
> public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
> I believe this is safe for all Locales.
>
>
> When benchmarked it was disconcerting that Apache3 is still five times
> slower (40ms instead of 8ms) than my rewritten Apache2 version (included
> below).
> My guess is that there are other unnecessary per-character object creation
> issues still lurking Here's one example -
> CharSequenceTranslator.translate(CharSequence input, Writer out) :
>        char[] c = *Character.toChars*(Character.codePointAt(input, pos))
>
> For better performance this should use toChars(int codePoint,  char[] dst,
> int dstIndex) , which can re-use the dst char array
>
>
>
> The benchmark, my version of a  Apache2-style escapeJavaStyleString
> implementation and the code fix for UnicodeEscaper.java  are included below.
> If you do find this useful, some source-code attribution would be
> appreciated.
> Legal:  I, Lawrence Angrave (email address, angrave@illinois.edu
> <ma...@illinois.edu>), am the author and copyright holder of the
> new code provided in this email, and share and re-license this code under
> the current Apache2 license.
>
> I hope this email does not go into a blackhole... Feel free to forward it to
> the relevant maintainers.
>
> Regards,
> Lawrence.
>
>     public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
>     public static final char[] CONTROL_CHARS; // non-zero entries for
> special case control characters
>     static {
>         CONTROL_CHARS = new char[32];
>         CONTROL_CHARS['\b'] = 'b';
>         CONTROL_CHARS['\n'] = 'n';
>         CONTROL_CHARS['\t'] = 't';
>         CONTROL_CHARS['\f'] = 'f';
>         CONTROL_CHARS['\r'] = 'r';
>     }
>     public static void  escapeJavaStyleString(Writer out, String s, boolean
> escapeSingleQuote) throws IOException {
> // Apache2 makes the following checks, so we will too-
>     if(out==null) throw new IllegalArgumentException("The Writer must not be
> null");
>         if(s == null) return;
>
>         final int len = s.length();
>         for(int i =0; i < len;i++)
>             escapeChar(out,s.charAt(i), escapeSingleQuote);
>     }
>     public static void escapeChar(Writer out, char c, boolean
> escapeSingleQuote)
>             throws IOException {
>         // Most common case
>         if (c >= 32 && c < 127) {
>             if (c == '\\' || c == '"' || (c == '\'' && escapeSingleQuote))
>                 out.write('\\');
>             out.write(c);
>             return;
>         }
>         out.write('\\');
>         if (c < 32 && CONTROL_CHARS[c] != 0) {
>             out.write(CONTROL_CHARS[c]);
>             return;
>         }
>         // Fast 4 digit uppercase hexadecimal without object creation
>         out.write('u');
>         out.write(HEX_DIGIT[(c >> 12) & 15]);
>         out.write(HEX_DIGIT[(c >> 8) & 15]);
>         out.write(HEX_DIGIT[(c >> 4) & 15]);
>         out.write(HEX_DIGIT[(c) & 15]);
>     }
>
>
>
>
> FYI The benchmark test just writes all possible unicode characters into a
> null writer:
>             Writer nullWriter = new Writer() {
>             public void write(String s) {
>             };
>
>             public void write(int c) {
>             }
>
>             public void close() throws IOException {
>             }
>
>             public void flush() throws IOException {
>             }
>
>             public void write(char[] cbuf, int off, int len) throws
> IOException {
>             }
>         };
>         StringBuilder sb = new StringBuilder(0x10000);
>         for (int i = 0; i <= 0xffff; i++)
>             sb.append((char) i);
>         String allChars = sb.toString();
>
>         long t1 = System.currentTimeMillis();
>         StringEscaper.escapeJavaStyleString(nullWriter, allChars, true);
>         long t2 = System.currentTimeMillis();
>         System.out.println(t2 - t1);
>
>         long t3 = System.currentTimeMillis();
>         CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_JAVA;
>         translator.translate(allChars, nullWriter);
>         long t4 = System.currentTimeMillis();
>         System.out.println(t4 - t3);
>
>
>
> The modification to Apache3 UnicodeEscaper :
>
>         if (codepoint > 0xffff) {
>             // TODO: Figure out what to do. Output as two Unicodes?
>             // Does this make this a Java-specific output class?
>             out.write("\\u" + hex(codepoint));
>         } else if (1 == 0) { //*OLD SLOW CODE* (can be removed)
> *if (codepoint > 0xfff) {
>                 out.write("\\u" + hex(codepoint));
>             } else if (codepoint > 0xff) {
>                 out.write("\\u0" + hex(codepoint));
>             } else if (codepoint > 0xf) {
>                 out.write("\\u00" + hex(codepoint));
>             } else {
>                 out.write("\\u000" + hex(codepoint));
>             }*
>         } else { // *NEW FAST CODE*
> *            out.write("\\u");
>             out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>             out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>             out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>             out.write(HEX_DIGIT[(codepoint) & 15]);*
>         }
>
> *and add    public static final char[] HEX_DIGIT =
> "0123456789ABCDEF".toCharArray();**
> *
>
> ps.
> The home page for Commons lang
> http://commons.apache.org/proper/commons-lang/)
>  has the following 404 link on the left
>     Contributing Patches
> http://commons.apache.org/proper/patches.html
> (I believe the 'proper' is unnecessary)
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org