You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@shindig.apache.org by "Henning P. Schmiedehausen" <hp...@intermeta.de> on 2008/05/08 17:44:12 UTC

usage of String.getBytes()

Having been burned far too many times by far too many Java
applications that assume platform encoding == UTF-8 and running
applications between different platforms, the free usage of the
getBytes() method inside the Shindig code base concerns me a lot.

A simple example: Apply the following patch:

Index: java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
===================================================================
--- java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java  (revision 654541)
+++ java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java  (working copy)
@@ -37,14 +37,17 @@
     crypter = new BasicBlobCrypter("0123456789abcdef".getBytes());
     crypter.timeSource = new FakeTimeSource();
   }
+
+
+
   
   @Test
   public void testHmacSha1() throws Exception { 
     String key = "abcd1234";
-    String val = "your mother is a hedgehog";
+    String val = "your mother is a hedgehog (\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc)";
     byte[] expected = new byte[] {
-        -21, 2, 47, -101, 9, -40, 18, 43, 76, 117,
-        -51, 115, -122, -91, 39, 26, -18, 122, 30, 90,     
+        -45, -20, 16, -21, -64, 8, 79, -41, -28, -101,
+        -108, 73, -113, 79, 57, 40, 107, -1, 107, -61,
     };
     byte[] hmac = Crypto.hmacSha1(key.getBytes(), val.getBytes());
     assertArrayEquals(expected, hmac);
@@ -53,10 +56,10 @@
   @Test
   public void testHmacSha1Verify() throws Exception { 
     String key = "abcd1234";
-    String val = "your mother is a hedgehog";
+    String val = "your mother is a hedgehog (\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc)";
     byte[] expected = new byte[] {
-        -21, 2, 47, -101, 9, -40, 18, 43, 76, 117,
-        -51, 115, -122, -91, 39, 26, -18, 122, 30, 90,     
+        -45, -20, 16, -21, -64, 8, 79, -41, -28, -101,
+        -108, 73, -113, 79, 57, 40, 107, -1, 107, -61,
     };
     Crypto.hmacSha1Verify(key.getBytes(), val.getBytes(), expected);
   }
@@ -65,10 +68,10 @@
   @Test
   public void testHmacSha1VerifyTampered() throws Exception { 
     String key = "abcd1234";
-    String val = "your mother is a hedgehog";
+    String val = "your mother is a hedgehog (\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc)";
     byte[] expected = new byte[] {
-        -21, 2, 47, -101, 9, -40, 18, 43, 76, 117,
-        -51, 115, -122, -91, 39, 0, -18, 122, 30, 90,     
+        -45, -20, 16, -21, -64, 15, 79, -41, -28, -101,
+        -108, 73, -113, 79, 57, 40, 107, -1, 107, -61,
     };
     try {
       Crypto.hmacSha1Verify(key.getBytes(), val.getBytes(), expected);


now run 

export LANG=en_US.UTF-8 ; mvn clean ; mvn

and

export LANG=en_US.ISO-8859-1 ; mvn clean ; mvn

The second one then gives me errors in CryptoTest, which means that
the actual bytes depend on the platform encoding. Which is bad if you
happen to live outside US-ASCII. :-)

I have a largeish patch to make sure that everywhere where getBytes()
is used actually getBytes("UTF-8") is used (the only place where this
is ok is the BasicBlobCrypter, there is only a single bug in
there... :-) ), however this needs to deal with useless
UnsupportedEncodingException (but be honest: If getBytes("UTF-8")
fails, then this is the smallest of your problems. ;-) ).

there is a good article from Joel on Software that deals in depth with
the whole encoding shebang. Very readable:

http://www.joelonsoftware.com/articles/Unicode.html

You can also apply this patch:

Index: java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
===================================================================
--- java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java  (revision 654541)
+++ java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java  (working copy)
@@ -39,6 +39,17 @@
   }
   
   @Test
+  public void testCharsetEncoding() throws Exception {
+      String str = "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc";
+
+      assertEquals(12, str.getBytes("UTF-8").length);
+      assertEquals(6, str.getBytes("ISO-8859-1").length);
+
+      assertEquals(12, str.getBytes().length);
+  }
+
+  
+  @Test
   public void testHmacSha1() throws Exception { 
     String key = "abcd1234";
     String val = "your mother is a hedgehog";

and run with ISO-8859-1 and UTF-8 platform encodings to illustrate the
problem.

	Best regards
	     Henning (living in äöü country. ;-) )

-- 
Henning P. Schmiedehausen  -- hps@intermeta.de | J2EE, Linux,
91054 Buckenhof, Germany   -- +49 9131 506540  | Apache person
Open Source Consulting, Development, Design    | Velocity - Turbine guy

INTERMETA - Gesellschaft fuer Mehrwertdienste mbH - RG Fuerth, HRB 7350
Sitz der Gesellschaft: Buckenhof. Geschaeftsfuehrer: Henning Schmiedehausen

           "...it's good to be a lunatic..." -- 10th doctor

Re: usage of String.getBytes()

Posted by Santiago Gala <sa...@gmail.com>.
El vie, 16-05-2008 a las 21:04 +0000, Henning P. Schmiedehausen escribió:
> Santiago Gala <sa...@gmail.com> writes:
> 
> >I've found a few other places where it is missing, as I outlined in the
> >gitweb link. I will initially patch them to use getBytes("utf-8") or
> >getBytes("ASCII") to clearly state what is the intention of the code.
> 
> >Base64 decode expects "ASCII", for instance. Most other cases expect "utf-8".
> 
> According to
> http://java.sun.com/j2se/1.5.0/docs/api/java/nio/charset/Charset.html,
> the official names in Java are "UTF-8" (upper case) and
> "US-ASCII". Please make sure that you don't use variants as they are
> not guaranteed to work.

ok, thanks. Done.

Regards
Santiago

> 
>     Best regards
>     	 Henning
> 
> 
-- 
Santiago Gala
http://memojo.com/~sgala/blog/


Re: usage of String.getBytes()

Posted by "Henning P. Schmiedehausen" <hp...@intermeta.de>.
Santiago Gala <sa...@gmail.com> writes:

>I've found a few other places where it is missing, as I outlined in the
>gitweb link. I will initially patch them to use getBytes("utf-8") or
>getBytes("ASCII") to clearly state what is the intention of the code.

>Base64 decode expects "ASCII", for instance. Most other cases expect "utf-8".

According to
http://java.sun.com/j2se/1.5.0/docs/api/java/nio/charset/Charset.html,
the official names in Java are "UTF-8" (upper case) and
"US-ASCII". Please make sure that you don't use variants as they are
not guaranteed to work.

    Best regards
    	 Henning


-- 
Henning P. Schmiedehausen  -- hps@intermeta.de | J2EE, Linux,
91054 Buckenhof, Germany   -- +49 9131 506540  | Apache person
Open Source Consulting, Development, Design    | Velocity - Turbine guy

INTERMETA - Gesellschaft fuer Mehrwertdienste mbH - RG Fuerth, HRB 7350
Sitz der Gesellschaft: Buckenhof. Geschaeftsfuehrer: Henning Schmiedehausen

           "...it's good to be a lunatic..." -- 10th doctor

Re: usage of String.getBytes()

Posted by Santiago Gala <sa...@gmail.com>.
El jue, 08-05-2008 a las 12:06 -0700, Kevin Brown escribió:
> I'm pretty sure the majority of cases where getBytes is used without
> specifying an encoding, it's in test code. Most everywhere else has gone out
> of the way to litter the useless UnsupportedEncodingExceptions all over the
> place.

I've found a few other places where it is missing, as I outlined in the
gitweb link. I will initially patch them to use getBytes("utf-8") or
getBytes("ASCII") to clearly state what is the intention of the code.

Base64 decode expects "ASCII", for instance. Most other cases expect "utf-8".

Later I'll apply Paul's ideas to get static encoders/decoders and get
better performance. Paul, are those encoders thread safe?

Y'all What do you think?

Regards
Santiago

> 
> On Thu, May 8, 2008 at 11:58 AM, Paul Lindner <pl...@hi5.com> wrote:
> 
> > I agree with this String.getBytes() is evil, especially for performance
> > reasons.  See my post about it here:
> >
> >
> > http://paul.vox.com/library/post/the-mysteries-of-java-character-set-perform
> > ance.html<http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html>
> >
> > Here's some code chunks that use Java NIO to convert between character sets
> > that don't exhibit the performance problems with looking up character sets
> > all the time...
> >
> >
> >    private static final Charset UTF8 = Charset.forName("UTF-8");
> >
> >    try {
> >        CharsetEncoder toUTF8Bytes = UTF8.newEncoder()
> >                     .onMalformedInput(CodingErrorAction.REPORT)
> >                     .onUnmappableCharacter(CodingErrorAction.REPORT);
> >
> >       return toUTF8Bytes.encode(CharBuffer.wrap(str))
> >    } catch (Exception ex) {
> >        // do something else
> >    }
> >
> >    String s = UTF8.decode(ByteBuffer.wrap(output)).toString();
> >
> >
> >
> > On 5/8/08 8:44 AM, "Henning P. Schmiedehausen" <hp...@intermeta.de> wrote:
> >
> > > Having been burned far too many times by far too many Java
> > > applications that assume platform encoding == UTF-8 and running
> > > applications between different platforms, the free usage of the
> > > getBytes() method inside the Shindig code base concerns me a lot.
> > >
> > > A simple example: Apply the following patch:
> > >
> > > Index: java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > > ===================================================================
> > > --- java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > > (revision 654541)
> > > +++ java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > > (working copy)
> > > @@ -37,14 +37,17 @@
> > >      crypter = new BasicBlobCrypter("0123456789abcdef".getBytes());
> > >      crypter.timeSource = new FakeTimeSource();
> > >    }
> > > +
> > > +
> > > +
> > >
> > >    @Test
> > >    public void testHmacSha1() throws Exception {
> > >      String key = "abcd1234";
> > > -    String val = "your mother is a hedgehog";
> > > +    String val = "your mother is a hedgehog
> > > (\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc)";
> > >      byte[] expected = new byte[] {
> > > -        -21, 2, 47, -101, 9, -40, 18, 43, 76, 117,
> > > -        -51, 115, -122, -91, 39, 26, -18, 122, 30, 90,
> > > +        -45, -20, 16, -21, -64, 8, 79, -41, -28, -101,
> > > +        -108, 73, -113, 79, 57, 40, 107, -1, 107, -61,
> > >      };
> > >      byte[] hmac = Crypto.hmacSha1(key.getBytes(), val.getBytes());
> > >      assertArrayEquals(expected, hmac);
> > > @@ -53,10 +56,10 @@
> > >    @Test
> > >    public void testHmacSha1Verify() throws Exception {
> > >      String key = "abcd1234";
> > > -    String val = "your mother is a hedgehog";
> > > +    String val = "your mother is a hedgehog
> > > (\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc)";
> > >      byte[] expected = new byte[] {
> > > -        -21, 2, 47, -101, 9, -40, 18, 43, 76, 117,
> > > -        -51, 115, -122, -91, 39, 26, -18, 122, 30, 90,
> > > +        -45, -20, 16, -21, -64, 8, 79, -41, -28, -101,
> > > +        -108, 73, -113, 79, 57, 40, 107, -1, 107, -61,
> > >      };
> > >      Crypto.hmacSha1Verify(key.getBytes(), val.getBytes(), expected);
> > >    }
> > > @@ -65,10 +68,10 @@
> > >    @Test
> > >    public void testHmacSha1VerifyTampered() throws Exception {
> > >      String key = "abcd1234";
> > > -    String val = "your mother is a hedgehog";
> > > +    String val = "your mother is a hedgehog
> > > (\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc)";
> > >      byte[] expected = new byte[] {
> > > -        -21, 2, 47, -101, 9, -40, 18, 43, 76, 117,
> > > -        -51, 115, -122, -91, 39, 0, -18, 122, 30, 90,
> > > +        -45, -20, 16, -21, -64, 15, 79, -41, -28, -101,
> > > +        -108, 73, -113, 79, 57, 40, 107, -1, 107, -61,
> > >      };
> > >      try {
> > >        Crypto.hmacSha1Verify(key.getBytes(), val.getBytes(), expected);
> > >
> > >
> > > now run
> > >
> > > export LANG=en_US.UTF-8 ; mvn clean ; mvn
> > >
> > > and
> > >
> > > export LANG=en_US.ISO-8859-1 ; mvn clean ; mvn
> > >
> > > The second one then gives me errors in CryptoTest, which means that
> > > the actual bytes depend on the platform encoding. Which is bad if you
> > > happen to live outside US-ASCII. :-)
> > >
> > > I have a largeish patch to make sure that everywhere where getBytes()
> > > is used actually getBytes("UTF-8") is used (the only place where this
> > > is ok is the BasicBlobCrypter, there is only a single bug in
> > > there... :-) ), however this needs to deal with useless
> > > UnsupportedEncodingException (but be honest: If getBytes("UTF-8")
> > > fails, then this is the smallest of your problems. ;-) ).
> > >
> > > there is a good article from Joel on Software that deals in depth with
> > > the whole encoding shebang. Very readable:
> > >
> > > http://www.joelonsoftware.com/articles/Unicode.html
> > >
> > > You can also apply this patch:
> > >
> > > Index: java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > > ===================================================================
> > > --- java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > > (revision 654541)
> > > +++ java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > > (working copy)
> > > @@ -39,6 +39,17 @@
> > >    }
> > >
> > >    @Test
> > > +  public void testCharsetEncoding() throws Exception {
> > > +      String str = "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc";
> > > +
> > > +      assertEquals(12, str.getBytes("UTF-8").length);
> > > +      assertEquals(6, str.getBytes("ISO-8859-1").length);
> > > +
> > > +      assertEquals(12, str.getBytes().length);
> > > +  }
> > > +
> > > +
> > > +  @Test
> > >    public void testHmacSha1() throws Exception {
> > >      String key = "abcd1234";
> > >      String val = "your mother is a hedgehog";
> > >
> > > and run with ISO-8859-1 and UTF-8 platform encodings to illustrate the
> > > problem.
> > >
> > > Best regards
> > >     Henning (living in ÀöÃπ country. ;-) )
> >
> >
-- 
Santiago Gala
http://memojo.com/~sgala/blog/


Re: usage of String.getBytes()

Posted by Santiago Gala <sa...@gmail.com>.
El jue, 08-05-2008 a las 12:06 -0700, Kevin Brown escribió:
> I'm pretty sure the majority of cases where getBytes is used without
> specifying an encoding, it's in test code. Most everywhere else has gone out
> of the way to litter the useless UnsupportedEncodingExceptions all over the
> place.
> 

FYI: I had an "encoding" branch in my git repo that seems to be alive,
as rebasing it didn't actually break it:

I think similar issues to what Henning is reporting got fixed there:

http://people.apache.org/~sgala/git/?p=shindig.git;a=shortlog;h=refs/heads/encoding

(first three commits)


Regards
Santiago

> On Thu, May 8, 2008 at 11:58 AM, Paul Lindner <pl...@hi5.com> wrote:
> 
> > I agree with this String.getBytes() is evil, especially for performance
> > reasons.  See my post about it here:
> >
> >
> > http://paul.vox.com/library/post/the-mysteries-of-java-character-set-perform
> > ance.html<http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html>
> >
> > Here's some code chunks that use Java NIO to convert between character sets
> > that don't exhibit the performance problems with looking up character sets
> > all the time...
> >
> >
> >    private static final Charset UTF8 = Charset.forName("UTF-8");
> >
> >    try {
> >        CharsetEncoder toUTF8Bytes = UTF8.newEncoder()
> >                     .onMalformedInput(CodingErrorAction.REPORT)
> >                     .onUnmappableCharacter(CodingErrorAction.REPORT);
> >
> >       return toUTF8Bytes.encode(CharBuffer.wrap(str))
> >    } catch (Exception ex) {
> >        // do something else
> >    }
> >
> >    String s = UTF8.decode(ByteBuffer.wrap(output)).toString();
> >
> >
> >
> > On 5/8/08 8:44 AM, "Henning P. Schmiedehausen" <hp...@intermeta.de> wrote:
> >
> > > Having been burned far too many times by far too many Java
> > > applications that assume platform encoding == UTF-8 and running
> > > applications between different platforms, the free usage of the
> > > getBytes() method inside the Shindig code base concerns me a lot.
> > >
> > > A simple example: Apply the following patch:
> > >
> > > Index: java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > > ===================================================================
> > > --- java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > > (revision 654541)
> > > +++ java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > > (working copy)
> > > @@ -37,14 +37,17 @@
> > >      crypter = new BasicBlobCrypter("0123456789abcdef".getBytes());
> > >      crypter.timeSource = new FakeTimeSource();
> > >    }
> > > +
> > > +
> > > +
> > >
> > >    @Test
> > >    public void testHmacSha1() throws Exception {
> > >      String key = "abcd1234";
> > > -    String val = "your mother is a hedgehog";
> > > +    String val = "your mother is a hedgehog
> > > (\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc)";
> > >      byte[] expected = new byte[] {
> > > -        -21, 2, 47, -101, 9, -40, 18, 43, 76, 117,
> > > -        -51, 115, -122, -91, 39, 26, -18, 122, 30, 90,
> > > +        -45, -20, 16, -21, -64, 8, 79, -41, -28, -101,
> > > +        -108, 73, -113, 79, 57, 40, 107, -1, 107, -61,
> > >      };
> > >      byte[] hmac = Crypto.hmacSha1(key.getBytes(), val.getBytes());
> > >      assertArrayEquals(expected, hmac);
> > > @@ -53,10 +56,10 @@
> > >    @Test
> > >    public void testHmacSha1Verify() throws Exception {
> > >      String key = "abcd1234";
> > > -    String val = "your mother is a hedgehog";
> > > +    String val = "your mother is a hedgehog
> > > (\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc)";
> > >      byte[] expected = new byte[] {
> > > -        -21, 2, 47, -101, 9, -40, 18, 43, 76, 117,
> > > -        -51, 115, -122, -91, 39, 26, -18, 122, 30, 90,
> > > +        -45, -20, 16, -21, -64, 8, 79, -41, -28, -101,
> > > +        -108, 73, -113, 79, 57, 40, 107, -1, 107, -61,
> > >      };
> > >      Crypto.hmacSha1Verify(key.getBytes(), val.getBytes(), expected);
> > >    }
> > > @@ -65,10 +68,10 @@
> > >    @Test
> > >    public void testHmacSha1VerifyTampered() throws Exception {
> > >      String key = "abcd1234";
> > > -    String val = "your mother is a hedgehog";
> > > +    String val = "your mother is a hedgehog
> > > (\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc)";
> > >      byte[] expected = new byte[] {
> > > -        -21, 2, 47, -101, 9, -40, 18, 43, 76, 117,
> > > -        -51, 115, -122, -91, 39, 0, -18, 122, 30, 90,
> > > +        -45, -20, 16, -21, -64, 15, 79, -41, -28, -101,
> > > +        -108, 73, -113, 79, 57, 40, 107, -1, 107, -61,
> > >      };
> > >      try {
> > >        Crypto.hmacSha1Verify(key.getBytes(), val.getBytes(), expected);
> > >
> > >
> > > now run
> > >
> > > export LANG=en_US.UTF-8 ; mvn clean ; mvn
> > >
> > > and
> > >
> > > export LANG=en_US.ISO-8859-1 ; mvn clean ; mvn
> > >
> > > The second one then gives me errors in CryptoTest, which means that
> > > the actual bytes depend on the platform encoding. Which is bad if you
> > > happen to live outside US-ASCII. :-)
> > >
> > > I have a largeish patch to make sure that everywhere where getBytes()
> > > is used actually getBytes("UTF-8") is used (the only place where this
> > > is ok is the BasicBlobCrypter, there is only a single bug in
> > > there... :-) ), however this needs to deal with useless
> > > UnsupportedEncodingException (but be honest: If getBytes("UTF-8")
> > > fails, then this is the smallest of your problems. ;-) ).
> > >
> > > there is a good article from Joel on Software that deals in depth with
> > > the whole encoding shebang. Very readable:
> > >
> > > http://www.joelonsoftware.com/articles/Unicode.html
> > >
> > > You can also apply this patch:
> > >
> > > Index: java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > > ===================================================================
> > > --- java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > > (revision 654541)
> > > +++ java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > > (working copy)
> > > @@ -39,6 +39,17 @@
> > >    }
> > >
> > >    @Test
> > > +  public void testCharsetEncoding() throws Exception {
> > > +      String str = "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc";
> > > +
> > > +      assertEquals(12, str.getBytes("UTF-8").length);
> > > +      assertEquals(6, str.getBytes("ISO-8859-1").length);
> > > +
> > > +      assertEquals(12, str.getBytes().length);
> > > +  }
> > > +
> > > +
> > > +  @Test
> > >    public void testHmacSha1() throws Exception {
> > >      String key = "abcd1234";
> > >      String val = "your mother is a hedgehog";
> > >
> > > and run with ISO-8859-1 and UTF-8 platform encodings to illustrate the
> > > problem.
> > >
> > > Best regards
> > >     Henning (living in ÀöÃπ country. ;-) )
> >
> >
-- 
Santiago Gala
http://memojo.com/~sgala/blog/


Re: usage of String.getBytes()

Posted by Kevin Brown <et...@google.com>.
I'm pretty sure the majority of cases where getBytes is used without
specifying an encoding, it's in test code. Most everywhere else has gone out
of the way to litter the useless UnsupportedEncodingExceptions all over the
place.

On Thu, May 8, 2008 at 11:58 AM, Paul Lindner <pl...@hi5.com> wrote:

> I agree with this String.getBytes() is evil, especially for performance
> reasons.  See my post about it here:
>
>
> http://paul.vox.com/library/post/the-mysteries-of-java-character-set-perform
> ance.html<http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html>
>
> Here's some code chunks that use Java NIO to convert between character sets
> that don't exhibit the performance problems with looking up character sets
> all the time...
>
>
>    private static final Charset UTF8 = Charset.forName("UTF-8");
>
>    try {
>        CharsetEncoder toUTF8Bytes = UTF8.newEncoder()
>                     .onMalformedInput(CodingErrorAction.REPORT)
>                     .onUnmappableCharacter(CodingErrorAction.REPORT);
>
>       return toUTF8Bytes.encode(CharBuffer.wrap(str))
>    } catch (Exception ex) {
>        // do something else
>    }
>
>    String s = UTF8.decode(ByteBuffer.wrap(output)).toString();
>
>
>
> On 5/8/08 8:44 AM, "Henning P. Schmiedehausen" <hp...@intermeta.de> wrote:
>
> > Having been burned far too many times by far too many Java
> > applications that assume platform encoding == UTF-8 and running
> > applications between different platforms, the free usage of the
> > getBytes() method inside the Shindig code base concerns me a lot.
> >
> > A simple example: Apply the following patch:
> >
> > Index: java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > ===================================================================
> > --- java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > (revision 654541)
> > +++ java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > (working copy)
> > @@ -37,14 +37,17 @@
> >      crypter = new BasicBlobCrypter("0123456789abcdef".getBytes());
> >      crypter.timeSource = new FakeTimeSource();
> >    }
> > +
> > +
> > +
> >
> >    @Test
> >    public void testHmacSha1() throws Exception {
> >      String key = "abcd1234";
> > -    String val = "your mother is a hedgehog";
> > +    String val = "your mother is a hedgehog
> > (\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc)";
> >      byte[] expected = new byte[] {
> > -        -21, 2, 47, -101, 9, -40, 18, 43, 76, 117,
> > -        -51, 115, -122, -91, 39, 26, -18, 122, 30, 90,
> > +        -45, -20, 16, -21, -64, 8, 79, -41, -28, -101,
> > +        -108, 73, -113, 79, 57, 40, 107, -1, 107, -61,
> >      };
> >      byte[] hmac = Crypto.hmacSha1(key.getBytes(), val.getBytes());
> >      assertArrayEquals(expected, hmac);
> > @@ -53,10 +56,10 @@
> >    @Test
> >    public void testHmacSha1Verify() throws Exception {
> >      String key = "abcd1234";
> > -    String val = "your mother is a hedgehog";
> > +    String val = "your mother is a hedgehog
> > (\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc)";
> >      byte[] expected = new byte[] {
> > -        -21, 2, 47, -101, 9, -40, 18, 43, 76, 117,
> > -        -51, 115, -122, -91, 39, 26, -18, 122, 30, 90,
> > +        -45, -20, 16, -21, -64, 8, 79, -41, -28, -101,
> > +        -108, 73, -113, 79, 57, 40, 107, -1, 107, -61,
> >      };
> >      Crypto.hmacSha1Verify(key.getBytes(), val.getBytes(), expected);
> >    }
> > @@ -65,10 +68,10 @@
> >    @Test
> >    public void testHmacSha1VerifyTampered() throws Exception {
> >      String key = "abcd1234";
> > -    String val = "your mother is a hedgehog";
> > +    String val = "your mother is a hedgehog
> > (\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc)";
> >      byte[] expected = new byte[] {
> > -        -21, 2, 47, -101, 9, -40, 18, 43, 76, 117,
> > -        -51, 115, -122, -91, 39, 0, -18, 122, 30, 90,
> > +        -45, -20, 16, -21, -64, 15, 79, -41, -28, -101,
> > +        -108, 73, -113, 79, 57, 40, 107, -1, 107, -61,
> >      };
> >      try {
> >        Crypto.hmacSha1Verify(key.getBytes(), val.getBytes(), expected);
> >
> >
> > now run
> >
> > export LANG=en_US.UTF-8 ; mvn clean ; mvn
> >
> > and
> >
> > export LANG=en_US.ISO-8859-1 ; mvn clean ; mvn
> >
> > The second one then gives me errors in CryptoTest, which means that
> > the actual bytes depend on the platform encoding. Which is bad if you
> > happen to live outside US-ASCII. :-)
> >
> > I have a largeish patch to make sure that everywhere where getBytes()
> > is used actually getBytes("UTF-8") is used (the only place where this
> > is ok is the BasicBlobCrypter, there is only a single bug in
> > there... :-) ), however this needs to deal with useless
> > UnsupportedEncodingException (but be honest: If getBytes("UTF-8")
> > fails, then this is the smallest of your problems. ;-) ).
> >
> > there is a good article from Joel on Software that deals in depth with
> > the whole encoding shebang. Very readable:
> >
> > http://www.joelonsoftware.com/articles/Unicode.html
> >
> > You can also apply this patch:
> >
> > Index: java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > ===================================================================
> > --- java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > (revision 654541)
> > +++ java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> > (working copy)
> > @@ -39,6 +39,17 @@
> >    }
> >
> >    @Test
> > +  public void testCharsetEncoding() throws Exception {
> > +      String str = "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc";
> > +
> > +      assertEquals(12, str.getBytes("UTF-8").length);
> > +      assertEquals(6, str.getBytes("ISO-8859-1").length);
> > +
> > +      assertEquals(12, str.getBytes().length);
> > +  }
> > +
> > +
> > +  @Test
> >    public void testHmacSha1() throws Exception {
> >      String key = "abcd1234";
> >      String val = "your mother is a hedgehog";
> >
> > and run with ISO-8859-1 and UTF-8 platform encodings to illustrate the
> > problem.
> >
> > Best regards
> >     Henning (living in ÀöÃπ country. ;-) )
>
>

Re: usage of String.getBytes()

Posted by Paul Lindner <pl...@hi5.com>.
I agree with this String.getBytes() is evil, especially for performance
reasons.  See my post about it here:

http://paul.vox.com/library/post/the-mysteries-of-java-character-set-perform
ance.html

Here's some code chunks that use Java NIO to convert between character sets
that don't exhibit the performance problems with looking up character sets
all the time...


    private static final Charset UTF8 = Charset.forName("UTF-8");

    try {
        CharsetEncoder toUTF8Bytes = UTF8.newEncoder()
                     .onMalformedInput(CodingErrorAction.REPORT)
                     .onUnmappableCharacter(CodingErrorAction.REPORT);

       return toUTF8Bytes.encode(CharBuffer.wrap(str))
    } catch (Exception ex) {
        // do something else
    }

    String s = UTF8.decode(ByteBuffer.wrap(output)).toString();



On 5/8/08 8:44 AM, "Henning P. Schmiedehausen" <hp...@intermeta.de> wrote:

> Having been burned far too many times by far too many Java
> applications that assume platform encoding == UTF-8 and running
> applications between different platforms, the free usage of the
> getBytes() method inside the Shindig code base concerns me a lot.
> 
> A simple example: Apply the following patch:
> 
> Index: java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> ===================================================================
> --- java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> (revision 654541)
> +++ java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> (working copy)
> @@ -37,14 +37,17 @@
>      crypter = new BasicBlobCrypter("0123456789abcdef".getBytes());
>      crypter.timeSource = new FakeTimeSource();
>    }
> +
> +
> +
>    
>    @Test
>    public void testHmacSha1() throws Exception {
>      String key = "abcd1234";
> -    String val = "your mother is a hedgehog";
> +    String val = "your mother is a hedgehog
> (\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc)";
>      byte[] expected = new byte[] {
> -        -21, 2, 47, -101, 9, -40, 18, 43, 76, 117,
> -        -51, 115, -122, -91, 39, 26, -18, 122, 30, 90,
> +        -45, -20, 16, -21, -64, 8, 79, -41, -28, -101,
> +        -108, 73, -113, 79, 57, 40, 107, -1, 107, -61,
>      };
>      byte[] hmac = Crypto.hmacSha1(key.getBytes(), val.getBytes());
>      assertArrayEquals(expected, hmac);
> @@ -53,10 +56,10 @@
>    @Test
>    public void testHmacSha1Verify() throws Exception {
>      String key = "abcd1234";
> -    String val = "your mother is a hedgehog";
> +    String val = "your mother is a hedgehog
> (\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc)";
>      byte[] expected = new byte[] {
> -        -21, 2, 47, -101, 9, -40, 18, 43, 76, 117,
> -        -51, 115, -122, -91, 39, 26, -18, 122, 30, 90,
> +        -45, -20, 16, -21, -64, 8, 79, -41, -28, -101,
> +        -108, 73, -113, 79, 57, 40, 107, -1, 107, -61,
>      };
>      Crypto.hmacSha1Verify(key.getBytes(), val.getBytes(), expected);
>    }
> @@ -65,10 +68,10 @@
>    @Test
>    public void testHmacSha1VerifyTampered() throws Exception {
>      String key = "abcd1234";
> -    String val = "your mother is a hedgehog";
> +    String val = "your mother is a hedgehog
> (\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc)";
>      byte[] expected = new byte[] {
> -        -21, 2, 47, -101, 9, -40, 18, 43, 76, 117,
> -        -51, 115, -122, -91, 39, 0, -18, 122, 30, 90,
> +        -45, -20, 16, -21, -64, 15, 79, -41, -28, -101,
> +        -108, 73, -113, 79, 57, 40, 107, -1, 107, -61,
>      };
>      try {
>        Crypto.hmacSha1Verify(key.getBytes(), val.getBytes(), expected);
> 
> 
> now run 
> 
> export LANG=en_US.UTF-8 ; mvn clean ; mvn
> 
> and
> 
> export LANG=en_US.ISO-8859-1 ; mvn clean ; mvn
> 
> The second one then gives me errors in CryptoTest, which means that
> the actual bytes depend on the platform encoding. Which is bad if you
> happen to live outside US-ASCII. :-)
> 
> I have a largeish patch to make sure that everywhere where getBytes()
> is used actually getBytes("UTF-8") is used (the only place where this
> is ok is the BasicBlobCrypter, there is only a single bug in
> there... :-) ), however this needs to deal with useless
> UnsupportedEncodingException (but be honest: If getBytes("UTF-8")
> fails, then this is the smallest of your problems. ;-) ).
> 
> there is a good article from Joel on Software that deals in depth with
> the whole encoding shebang. Very readable:
> 
> http://www.joelonsoftware.com/articles/Unicode.html
> 
> You can also apply this patch:
> 
> Index: java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> ===================================================================
> --- java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> (revision 654541)
> +++ java/gadgets/src/test/java/org/apache/shindig/util/CryptoTest.java
> (working copy)
> @@ -39,6 +39,17 @@
>    }
>    
>    @Test
> +  public void testCharsetEncoding() throws Exception {
> +      String str = "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc";
> +
> +      assertEquals(12, str.getBytes("UTF-8").length);
> +      assertEquals(6, str.getBytes("ISO-8859-1").length);
> +
> +      assertEquals(12, str.getBytes().length);
> +  }
> +
> +  
> +  @Test
>    public void testHmacSha1() throws Exception {
>      String key = "abcd1234";
>      String val = "your mother is a hedgehog";
> 
> and run with ISO-8859-1 and UTF-8 platform encodings to illustrate the
> problem.
> 
> Best regards
>     Henning (living in ÀöÃπ country. ;-) )