Posted to issues@commons.apache.org by "Benedikt Ritter (JIRA)" <ji...@apache.org> on 2015/03/13 19:20:38 UTC
[jira] [Resolved] (LANG-877) Performance improvements for StringEscapeUtils
[ https://issues.apache.org/jira/browse/LANG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benedikt Ritter resolved LANG-877.
----------------------------------
Resolution: Fixed
Fix Version/s: 3.4 (was: Review Patch)
> Performance improvements for StringEscapeUtils
> ----------------------------------------------
>
> Key: LANG-877
> URL: https://issues.apache.org/jira/browse/LANG-877
> Project: Commons Lang
> Issue Type: Improvement
> Components: lang.text.translate.*
> Affects Versions: 3.1
> Reporter: Henri Yandell
> Assignee: Benedikt Ritter
> Labels: github
> Fix For: 3.4
>
>
> An email on the list from Lawrence Angrave:
> Hi,
> Some comments that are relevant to Apache3 UnicodeEscaper and Apache2's StringEscapeUtils.java
> Summary-
> * I noticed the current Apache code creates three String objects each
> time it writes a unicode hexadecimal value.
> * Apache3 can also create a char[] array per character translation
> (but I do not include a fix for that)
> * This is an easy-to-fix performance bottleneck when writing many
> non-ASCII characters.
> * The logic to test for unicode values of different magnitudes can
> also be simplified.
> * Benchmark and code fixes for Apache2 and Apache3 are included. I do
> not have time to become an Apache maintainer. Use or ignore at your
> choice.
> * I'm not interested in being a developer for Commons Lang. Use it or
> not - that's a choice for Commons Lang developers.
> A simple fix more than doubles the string escape speed (40 ms vs. 100 ms to translate all unicode characters) for Apache3.
> The older Apache2-style implementation can now translate all unicode characters in 8 ms.
> The existing Apache3/Apache2 code writes unicode hex values like this:
> {code}
> if (codepoint > 0xfff) {
>     out.write("\\u" + hex(codepoint));
> } else if (codepoint > 0xff) {
>     out.write("\\u0" + hex(codepoint));
> } else if (codepoint > 0xf) {
>     out.write("\\u00" + hex(codepoint));
> } else {
>     out.write("\\u000" + hex(codepoint));
> }
> {code}
> The hex() function,
> {code}
> //hex(): return Integer.toHexString(codepoint).toUpperCase(Locale.ENGLISH);
> {code}
> also creates two string objects, so we have 3 objects per unicode hex value.
> FIX:
> The padding logic can be simplified and per-character object creation can be eliminated by writing hex digits directly:
> {code}
> out.write("\\u");
> out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
> out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
> out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
> out.write(HEX_DIGIT[(codepoint) & 15]);
> {code}
> where HEX_DIGIT is
> {code}
> public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
> {code}
> I believe this is safe for all Locales.
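As a rough illustration of the fix quoted above, the following self-contained sketch puts the string-concatenating variant next to the direct digit-writing variant and checks that both produce the same text for every value from 0x0000 to 0xFFFF. The class name `HexEscapeSketch` and its method names are hypothetical and are not part of Commons Lang.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Locale;

// Hypothetical sketch class; not part of Commons Lang.
final class HexEscapeSketch {
    private static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();

    // Old style: builds temporary Strings for every escaped value.
    static void writeWithStrings(Writer out, int codepoint) throws IOException {
        String hex = Integer.toHexString(codepoint).toUpperCase(Locale.ENGLISH);
        if (codepoint > 0xfff) {
            out.write("\\u" + hex);
        } else if (codepoint > 0xff) {
            out.write("\\u0" + hex);
        } else if (codepoint > 0xf) {
            out.write("\\u00" + hex);
        } else {
            out.write("\\u000" + hex);
        }
    }

    // New style: writes the prefix and four hex digits without building Strings.
    static void writeDirect(Writer out, int codepoint) throws IOException {
        out.write("\\u");
        out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
        out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
        out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
        out.write(HEX_DIGIT[codepoint & 15]);
    }

    // Convenience wrappers that return the escaped text as a String.
    static String escapeDirect(int codepoint) {
        StringWriter out = new StringWriter();
        try {
            writeDirect(out, codepoint);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    static String escapeWithStrings(int codepoint) {
        StringWriter out = new StringWriter();
        try {
            writeWithStrings(out, codepoint);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Both variants should agree on the whole 16-bit range.
        for (int cp = 0; cp <= 0xffff; cp++) {
            if (!escapeDirect(cp).equals(escapeWithStrings(cp))) {
                throw new IllegalStateException("mismatch at " + cp);
            }
        }
        System.out.println("both variants agree for 0x0000..0xFFFF");
    }
}
```

The direct variant only masks the low 16 bits, so it is equivalent to the old code only for values up to 0xFFFF; larger code points need separate handling.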
> When benchmarked it was disconcerting that Apache3 is still five times slower (40ms instead of 8ms) than my rewritten Apache2 version (included below).
> My guess is that there are other unnecessary per-character object creation issues still lurking. Here's one example:
> {code}
> // In CharSequenceTranslator.translate(CharSequence input, Writer out):
> char[] c = Character.toChars(Character.codePointAt(input, pos));
> {code}
> For better performance this should use {{toChars(int codePoint, char[] dst, int dstIndex)}}, which can re-use the dst char array.
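A minimal sketch of that suggestion, assuming a simple pass-through loop rather than the actual `CharSequenceTranslator` code: one two-element buffer is allocated up front and `Character.toChars(int, char[], int)` fills it for each code point. The class `ReusedBufferSketch` and its methods are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical sketch class; not the Commons Lang implementation.
final class ReusedBufferSketch {

    // Copies input to out code point by code point, reusing one char[2] buffer.
    static void copyCodePoints(CharSequence input, Writer out) throws IOException {
        final char[] buf = new char[2]; // a code point needs at most two chars
        final int len = input.length();
        int pos = 0;
        while (pos < len) {
            int codepoint = Character.codePointAt(input, pos);
            // Returns the number of chars written into buf (1 or 2).
            int count = Character.toChars(codepoint, buf, 0);
            out.write(buf, 0, count);
            pos += count;
        }
    }

    // Convenience wrapper that returns the copied text as a String.
    static String roundTrip(String s) {
        StringWriter out = new StringWriter();
        try {
            copyCodePoints(s, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample with one supplementary code point between two ASCII letters.
        String sample = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println(roundTrip(sample).equals(sample));
    }
}
```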
> The benchmark, my version of an Apache2-style escapeJavaStyleString implementation and the code fix for UnicodeEscaper.java are included below.
> I hope this email does not go into a blackhole... Feel free to forward it to the relevant maintainers.
> Regards,
> Lawrence.
> {code}
> public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
> public static final char[] CONTROL_CHARS; // non-zero entries for special case control characters
> static {
>     CONTROL_CHARS = new char[32];
>     CONTROL_CHARS['\b'] = 'b';
>     CONTROL_CHARS['\n'] = 'n';
>     CONTROL_CHARS['\t'] = 't';
>     CONTROL_CHARS['\f'] = 'f';
>     CONTROL_CHARS['\r'] = 'r';
> }
> public static void escapeJavaStyleString(Writer out, String s, boolean escapeSingleQuote) throws IOException {
>     // Apache2 makes the following checks, so we will too-
>     if (out == null) throw new IllegalArgumentException("The Writer must not be null");
>     if (s == null) return;
>     final int len = s.length();
>     for (int i = 0; i < len; i++)
>         escapeChar(out, s.charAt(i), escapeSingleQuote);
> }
> public static void escapeChar(Writer out, char c, boolean escapeSingleQuote)
>         throws IOException {
>     // Most common case
>     if (c >= 32 && c < 127) {
>         if (c == '\\' || c == '"' || (c == '\'' && escapeSingleQuote))
>             out.write('\\');
>         out.write(c);
>         return;
>     }
>     out.write('\\');
>     if (c < 32 && CONTROL_CHARS[c] != 0) {
>         out.write(CONTROL_CHARS[c]);
>         return;
>     }
>     // Fast 4 digit uppercase hexadecimal without object creation
>     out.write('u');
>     out.write(HEX_DIGIT[(c >> 12) & 15]);
>     out.write(HEX_DIGIT[(c >> 8) & 15]);
>     out.write(HEX_DIGIT[(c >> 4) & 15]);
>     out.write(HEX_DIGIT[(c) & 15]);
> }
> {code}
> FYI The benchmark test just writes all possible unicode characters into a null writer:
> {code}
> Writer nullWriter = new Writer() {
>     public void write(String s) {
>     }
>     public void write(int c) {
>     }
>     public void close() throws IOException {
>     }
>     public void flush() throws IOException {
>     }
>     public void write(char[] cbuf, int off, int len) throws IOException {
>     }
> };
> StringBuilder sb = new StringBuilder(0x10000);
> for (int i = 0; i <= 0xffff; i++)
>     sb.append((char) i);
> String allChars = sb.toString();
> long t1 = System.currentTimeMillis();
> StringEscaper.escapeJavaStyleString(nullWriter, allChars, true);
> long t2 = System.currentTimeMillis();
> System.out.println(t2 - t1);
> long t3 = System.currentTimeMillis();
> CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_JAVA;
> translator.translate(allChars, nullWriter);
> long t4 = System.currentTimeMillis();
> System.out.println(t4 - t3);
> {code}
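The benchmark quoted above times a single cold run with millisecond resolution, so JIT warm-up can dominate the numbers. The following is a hedged sketch of a slightly more repeatable variant: it warms up first and reports the best of several runs using `System.nanoTime()`. The `Escaper` interface, the class `EscapeTimingSketch`, and the pass-through escaper in `main` are all hypothetical stand-ins; a dedicated harness such as JMH would be more rigorous.

```java
import java.io.IOException;
import java.io.Writer;

// Hypothetical timing sketch; names here are illustrative only.
final class EscapeTimingSketch {

    // Stand-in for whichever escaping routine is being measured.
    interface Escaper {
        void escape(Writer out, String s) throws IOException;
    }

    // Writer that discards everything, as in the quoted benchmark.
    static final Writer NULL_WRITER = new Writer() {
        @Override public void write(char[] cbuf, int off, int len) { }
        @Override public void write(int c) { }
        @Override public void write(String s) { }
        @Override public void flush() { }
        @Override public void close() { }
    };

    // Builds a string containing every 16-bit char value once.
    static String allChars() {
        StringBuilder sb = new StringBuilder(0x10000);
        for (int i = 0; i <= 0xffff; i++) {
            sb.append((char) i);
        }
        return sb.toString();
    }

    // Runs warm-up iterations, then returns the fastest timed run in nanoseconds.
    static long bestOfNanos(Escaper escaper, String input, int warmups, int runs)
            throws IOException {
        for (int i = 0; i < warmups; i++) {
            escaper.escape(NULL_WRITER, input);
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            escaper.escape(NULL_WRITER, input);
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder escaper that just copies chars; swap in the code under test.
        Escaper passThrough = (out, s) -> {
            for (int i = 0; i < s.length(); i++) {
                out.write(s.charAt(i));
            }
        };
        long nanos = bestOfNanos(passThrough, allChars(), 20, 20);
        System.out.println("best run: " + (nanos / 1_000) + " us");
    }
}
```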
> The modification to Apache3 UnicodeEscaper :
> {code}
> if (codepoint > 0xffff) {
>     // TODO: Figure out what to do. Output as two Unicodes?
>     // Does this make this a Java-specific output class?
>     out.write("\\u" + hex(codepoint));
> } else if (1 == 0) { // OLD SLOW CODE (can be removed)
>     if (codepoint > 0xfff) {
>         out.write("\\u" + hex(codepoint));
>     } else if (codepoint > 0xff) {
>         out.write("\\u0" + hex(codepoint));
>     } else if (codepoint > 0xf) {
>         out.write("\\u00" + hex(codepoint));
>     } else {
>         out.write("\\u000" + hex(codepoint));
>     }
> } else { // NEW FAST CODE
>     out.write("\\u");
>     out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>     out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>     out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>     out.write(HEX_DIGIT[(codepoint) & 15]);
> }
> // and add:
> public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
> {code}
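The TODO in the quoted patch asks what to do for code points above 0xFFFF. One possible answer, sketched here under the assumption that Java-style output is wanted, is to emit the UTF-16 surrogate pair as two four-digit escapes. This is only an illustration, not necessarily what the resolved issue implements; `SurrogateEscapeSketch` and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical sketch class; one possible handling of supplementary code points.
final class SurrogateEscapeSketch {
    private static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();

    // Writes one 16-bit unit as a backslash-u escape with four uppercase hex digits.
    private static void writeHex4(Writer out, int unit) throws IOException {
        out.write("\\u");
        out.write(HEX_DIGIT[(unit >> 12) & 15]);
        out.write(HEX_DIGIT[(unit >> 8) & 15]);
        out.write(HEX_DIGIT[(unit >> 4) & 15]);
        out.write(HEX_DIGIT[unit & 15]);
    }

    // Assumption: supplementary code points are written as two surrogate escapes.
    static void writeEscaped(Writer out, int codepoint) throws IOException {
        if (codepoint > 0xffff) {
            writeHex4(out, Character.highSurrogate(codepoint));
            writeHex4(out, Character.lowSurrogate(codepoint));
        } else {
            writeHex4(out, codepoint);
        }
    }

    // Convenience wrapper that returns the escaped text as a String.
    static String escape(int codepoint) {
        StringWriter out = new StringWriter();
        try {
            writeEscaped(out, codepoint);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape(0x00E9));
        System.out.println(escape(0x1F600));
    }
}
```

Writing a surrogate pair does make the output specific to Java-like escape syntax, which is the concern the original TODO raises.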
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)