Posted to dev@commons.apache.org by Peter Wall <pw...@pwall.net> on 2013/12/10 05:14:43 UTC

[lang] Suggested alternatives for escape functions

Hi, I'm new here, so please forgive me if I'm duplicating a previous 
discussion (I looked back through several months of archives for 
something related, before suffering a near-fatal attack of tl;dr).

I have a toolbox of functions that I have accumulated over the years 
and among them are "escape" functions for converting, for example, XML 
"&" to "&amp;" etc.  When I showed these to a colleague he asked why I 
didn't use the Apache Commons utilities, so I benchmarked my functions 
against the Commons versions and found that mine were approximately 10 
times faster.  At which point the same colleague suggested submitting my 
versions to Apache, so here goes.

The code in org.apache.commons.lang3.text.translate is very elegant in 
the way it uses the same code and the same initialisation character 
arrays for both the escape and the unescape functions, but this elegance 
comes at a cost.  The unescape will need to look up multi-character 
sequences, but the escape code will ALWAYS be looking up single 
characters, and this can be made much simpler than a string match.  And 
in my view the function should never allocate a new object until it 
finds that it needs to do so - in many cases the string will not need to 
be modified at all so the original string should be returned.

The escape function is:

     public static final String escape(String s, CharMapper mapper) {
         for (int i = 0, n = s.length(); i < n; ) {
             char ch = s.charAt(i++);
             String mapped = mapper.map(ch);
             if (mapped != null) {
                 StringBuilder sb = new StringBuilder();
                 for (int j = 0, k = i - 1; j < k; ++j)
                     sb.append(s.charAt(j));
                 sb.append(mapped);
                 while (i < n) {
                     ch = s.charAt(i++);
                     mapped = mapper.map(ch);
                     if (mapped != null)
                         sb.append(mapped);
                     else
                         sb.append(ch);
                 }
                 return sb.toString();
             }
         }
         return s;
     }

Where CharMapper is:

     public interface CharMapper {
         String map(int codePoint);
     }

and the implementation for XML is:

     private static final CharMapper allCharMapper = new CharMapper() {
         @Override
         public String map(int codePoint) {
             if (codePoint == '<')
                 return "&lt;";
             if (codePoint == '>')
                 return "&gt;";
             if (codePoint == '&')
                 return "&amp;";
             if (codePoint == '"')
                 return "&quot;";
             if (codePoint == '\'')
                 return "&apos;";
             if (codePoint < ' ' && !isWhiteSpace(codePoint) ||
                     codePoint >= 0x7F) {
                 // isWhiteSpace checks for XML whitespace characters,
                 // \n \r etc.
                 StringBuilder sb = new StringBuilder(10);
                 sb.append("&#");
                 sb.append(codePoint);
                 sb.append(';');
                 return sb.toString();
             }
             return null;
         }
     };

The whole thing can be wrapped in a simple function like:

     public static String escapeAll(String s) {
         return escape(s, allCharMapper);
     }

I have versions for Java string escapes, XML, HTML (including the full 
range of entity names) and URI percent encoding, and I have versions 
that handle UTF-16 surrogate codes.  They all perform approximately an 
order of magnitude better than the existing Apache Commons functions.  
They are currently under LGPL and I have JUnit tests for all of them.

One thing to note is that my versions convert all characters from 0x7F 
upwards to numeric character references, thus sidestepping any concerns 
over UTF-8 or ISO-8859-1 character set encoding.
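
As a rough illustration of that last point, here is a minimal sketch. It is 
not the posted code, and the class and method names are hypothetical. It 
replaces every code point from 0x7F upwards with a decimal numeric character 
reference and checks that the result, being pure ASCII, encodes to the same 
bytes in UTF-8 and ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the toolbox described above.
final class NumericReferenceDemo {

    // Replaces each code point >= 0x7F with a decimal reference such as &#233;
    static String toNumericReferences(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0, n = s.length(); i < n; ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp >= 0x7F)
                sb.append("&#").append(cp).append(';');
            else
                sb.append((char) cp);
        }
        return sb.toString();
    }

    // True if the string encodes to identical bytes in UTF-8 and ISO-8859-1.
    static boolean sameBytesInBothCharsets(String s) {
        return Arrays.equals(s.getBytes(StandardCharsets.UTF_8),
                             s.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        String escaped = toNumericReferences("caf\u00E9");
        System.out.println(escaped);                           // caf&#233;
        System.out.println(sameBytesInBothCharsets(escaped));  // true
    }
}
```

The unescaped input would encode differently in the two character sets, 
which is the concern the numeric references avoid.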

Is anyone interested?

Regards,
Peter Wall


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [lang] Suggested alternatives for escape functions

Posted by Peter Wall <pw...@pwall.net>.

Hi Bernd,

Thank you for taking the time to look at my submission.  Let me see if 
I can answer your comments:

1.  I have a separate version (which I did not include in my original 
email; I thought it was already long enough) which handles UTF-16 
strings, that is, strings which could include Unicode surrogate 
sequences:

     public static final String escapeUTF16(String s, CharMapper mapper) {
         // initialised to avoid "possibly uninitialised" errors
         char ch1 = '\0', ch2 = '\0';
         for (int i = 0, n = s.length(); i < n; ) {
             int k = i;
             ch1 = s.charAt(i++);
             String mapped;
             if (Character.isHighSurrogate(ch1)) {
                 if (i >= n || !Character.isLowSurrogate(ch2 = s.charAt(i++)))
                     throw new IllegalArgumentException(
                             "Illegal surrogate sequence");
                 mapped = mapper.map(Character.toCodePoint(ch1, ch2));
             }
             else
                 mapped = mapper.map(ch1);
             if (mapped != null) {
                 StringBuilder sb = new StringBuilder();
                 for (int j = 0; j < k; ++j)
                     sb.append(s.charAt(j));
                 sb.append(mapped);
                 while (i < n) {
                     ch1 = s.charAt(i++);
                     if (Character.isHighSurrogate(ch1)) {
                         if (i >= n ||
                                 !Character.isLowSurrogate(ch2 = s.charAt(i++)))
                             throw new IllegalArgumentException(
                                     "Illegal surrogate sequence");
                         mapped = mapper.map(Character.toCodePoint(ch1, ch2));
                     }
                     else
                         mapped = mapper.map(ch1);
                     if (mapped != null)
                         sb.append(mapped);
                     else if (Character.isHighSurrogate(ch1))
                         sb.append(ch1).append(ch2);
                     else
                         sb.append(ch1);
                 }
                 return sb.toString();
             }
         }
         return s;
     }

As you can see, this uses the same CharMapper, and in this case it is 
called with a full Unicode code point.  Whether to throw an exception or 
simply to process the characters anyway in the case of an erroneous 
surrogate sequence is a matter of debate; I have chosen the former in 
this case but I could be persuaded otherwise.
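
The two policies can be compared side by side in a small sketch. The class 
and method names here are hypothetical and exist only for illustration. The 
strict variant mirrors the check in escapeUTF16 above; the lenient variant 
relies on String.codePointAt, which returns an unpaired surrogate unchanged 
instead of failing.

```java
// Hypothetical demo class comparing strict and lenient surrogate handling.
final class SurrogatePolicyDemo {

    // Strict: reject a high surrogate that is not followed by a low surrogate.
    static int strictCodePointAt(String s, int i) {
        char ch1 = s.charAt(i);
        if (Character.isHighSurrogate(ch1)) {
            if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1)))
                throw new IllegalArgumentException("Illegal surrogate sequence");
            return Character.toCodePoint(ch1, s.charAt(i + 1));
        }
        return ch1;
    }

    // Lenient: an unpaired surrogate is returned as its own char value.
    static int lenientCodePointAt(String s, int i) {
        return s.codePointAt(i);
    }

    public static void main(String[] args) {
        String valid = "\uD83D\uDE00";  // surrogate pair for U+1F600
        String broken = "\uD83Dx";      // high surrogate followed by 'x'
        System.out.println(Integer.toHexString(strictCodePointAt(valid, 0)));   // 1f600
        System.out.println(Integer.toHexString(lenientCodePointAt(broken, 0))); // d83d
        try {
            strictCodePointAt(broken, 0);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // Illegal surrogate sequence
        }
    }
}
```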

2.  In different iterations of this code I have attempted to estimate 
the output length and pre-allocate the StringBuilder, but estimates are 
difficult.  My most recent attempt used double the input string length, 
but for a 2-character string, where both characters convert to 
8-character sequences, this would be worse than the StringBuilder 
default (of 16).  Perhaps double the input string length plus 20 would 
be a good estimate.  I'm happy to take suggestions on this point.
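
For discussion, here is a sketch of what that estimate might look like in 
code. It is only one possibility: the class name is hypothetical, the 
CharMapper interface is repeated so the block compiles on its own, and the 
lambda in main assumes Java 8 or later. The unchanged prefix is copied with 
a single append(CharSequence, int, int) call rather than one character at a 
time.

```java
// Hypothetical demo class; a sketch of pre-sizing, not the posted code.
final class PresizedEscapeDemo {

    // Same shape as the CharMapper interface from the original post.
    interface CharMapper {
        String map(int codePoint);
    }

    static String escape(String s, CharMapper mapper) {
        for (int i = 0, n = s.length(); i < n; i++) {
            String mapped = mapper.map(s.charAt(i));
            if (mapped != null) {
                // Estimate under discussion: double the input length plus 20.
                StringBuilder sb = new StringBuilder(2 * n + 20);
                sb.append(s, 0, i);  // bulk copy of the unchanged prefix
                sb.append(mapped);
                for (int j = i + 1; j < n; j++) {
                    char ch = s.charAt(j);
                    mapped = mapper.map(ch);
                    if (mapped != null)
                        sb.append(mapped);
                    else
                        sb.append(ch);
                }
                return sb.toString();
            }
        }
        return s;  // nothing to escape, so no allocation at all
    }

    public static void main(String[] args) {
        CharMapper ampersand = cp -> cp == '&' ? "&amp;" : null;
        System.out.println(escape("a&b", ampersand));           // a&amp;b
        String plain = "abc";
        System.out.println(escape(plain, ampersand) == plain);  // true
    }
}
```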

3.  I have a separate version of escape (and escapeUTF16) which takes a 
CharSequence and returns a CharSequence as output (in line with my 
principle of returning the input object unmodified if it needs no 
conversion).  The code is identical except that 'return sb.toString();' 
becomes 'return sb;'.  I realise that calling toString() on a String 
would return 'this' so there would be no unnecessary object allocation 
if I were to take a CharSequence as input and return a String.  Again, I 
am happy to take suggestions.
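
The CharSequence-in, String-out option might look roughly like the sketch 
below. The class name and the initial capacity are assumptions made for 
illustration, and the lambda in main assumes Java 8 or later. When nothing 
needs escaping the method returns s.toString(), which for a String argument 
is the same object.

```java
// Hypothetical demo class; one possible shape for the CharSequence variant.
final class CharSequenceEscapeDemo {

    // Same shape as the CharMapper interface from the original post.
    interface CharMapper {
        String map(int codePoint);
    }

    static String escape(CharSequence s, CharMapper mapper) {
        for (int i = 0, n = s.length(); i < n; i++) {
            String mapped = mapper.map(s.charAt(i));
            if (mapped == null)
                continue;
            // Capacity is an arbitrary guess for this sketch.
            StringBuilder sb = new StringBuilder(n + 16);
            sb.append(s, 0, i);
            sb.append(mapped);
            for (int j = i + 1; j < n; j++) {
                char ch = s.charAt(j);
                mapped = mapper.map(ch);
                if (mapped != null)
                    sb.append(mapped);
                else
                    sb.append(ch);
            }
            return sb.toString();
        }
        // String.toString() returns this, so a String input is not copied.
        return s.toString();
    }

    public static void main(String[] args) {
        CharMapper lessThan = cp -> cp == '<' ? "&lt;" : null;
        String plain = "plain";
        System.out.println(escape(plain, lessThan) == plain);             // true
        System.out.println(escape(new StringBuilder("a<b"), lessThan));   // a&lt;b
    }
}
```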

Regards,
Peter


On 2013-12-11 03:39, Bernd Eckenfels wrote:
> Hello,
>
> it depends on what you want to escape: a single Unicode character
> could take two UTF-16 code units (a single UTF-16 code unit can only
> cover the BMP). So having a String-typed needle can be helpful. But of
> course all the usual things are single-code-unit characters (<>&"...).
> Having said that, is there any reason why CharMapper takes an integer
> and not a char? That's misleading in this context if someone expects
> it to be a real code point, which it is not (using charAt()).
>
> Besides that, the implementation copies single characters to the new
> StringBuilder and produces multiple string buffers in a loop without
> guessing the initial length. That does not look like an efficient
> solution to the problem to me. Not sure where I have seen the
> functions which handle that, maybe in one of the XML parsers.
>
> BTW: maybe also the input should be a CharSequence not a String?
>
> Greetings
> Bernd



Re: [lang] Suggested alternatives for escape functions

Posted by Bernd Eckenfels <ec...@zusammenkunft.net>.

Hello,

it depends on what you want to escape: a single Unicode character could
take two UTF-16 code units (a single UTF-16 code unit can only cover the
BMP). So having a String-typed needle can be helpful. But of course all
the usual things are single-code-unit characters (<>&"...). Having said
that, is there any reason why CharMapper takes an integer and not a char?
That's misleading in this context if someone expects it to be a real code
point, which it is not (using charAt()).
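
The difference can be seen in a small sketch (class and method names are 
hypothetical, for illustration only): walking a string with charAt() yields 
UTF-16 code units, while codePointAt() combined with Character.charCount() 
yields whole code points.

```java
// Hypothetical demo class contrasting code units with code points.
final class CodeUnitDemo {

    // Hex values seen when walking the string with charAt(): UTF-16 code units.
    static String hexCodeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0)
                sb.append(' ');
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    // Hex values seen when walking the string with codePointAt(): code points.
    static String hexCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (i > 0)
                sb.append(' ');
            sb.append(Integer.toHexString(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00";            // 'a' followed by U+1F600
        System.out.println(s.length());        // 3
        System.out.println(hexCodeUnits(s));   // 61 d83d de00
        System.out.println(hexCodePoints(s));  // 61 1f600
    }
}
```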

Besides that, the implementation copies single characters to the new
StringBuilder and produces multiple string buffers in a loop without
guessing the initial length. That does not look like an efficient
solution to the problem to me. Not sure where I have seen the functions
which handle that, maybe in one of the XML parsers.

BTW: maybe also the input should be a CharSequence not a String?

Greetings
Bernd


-- 
http://www.zusammenkunft.net
