You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@thrift.apache.org by "Bryan Duxbury (JIRA)" <ji...@apache.org> on 2010/04/24 22:09:49 UTC

[jira] Created: (THRIFT-765) Improved string encoding and decoding performance

Improved string encoding and decoding performance
-------------------------------------------------

                 Key: THRIFT-765
                 URL: https://issues.apache.org/jira/browse/THRIFT-765
             Project: Thrift
          Issue Type: Improvement
          Components: Library (Java)
    Affects Versions: 0.2
            Reporter: Bryan Duxbury
            Assignee: Bryan Duxbury
             Fix For: 0.3


One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 

However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863458#action_12863458 ] 

Bryan Duxbury commented on THRIFT-765:
--------------------------------------

Firstly, as far as I can tell, the built-in Java version is *not* implemented in C. I don't know all of the reasons why it's faster, but a few of them are: avoid byte array allocation/copy, more specific implementation, and less modular code with fewer intermediate layers.

It seems like many other projects that do UTF-8 encoding do so via a custom method, some better than others. I'll admit it's tricky, but the performance benefit is so huge that I'm willing to pursue it more deeply.

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.4
>
>         Attachments: thrift-765-redux-v2.patch, thrift-765-redux.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Reopened: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Duxbury reopened THRIFT-765:
----------------------------------


There still seems to be a bug in the mix somewhere. I am working to chase it down, but I don't think this patch will make it in for 0.3.

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.4
>
>         Attachments: thrift-765-redux-v2.patch, thrift-765-redux.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906212#action_12906212 ] 

Robert Muir commented on THRIFT-765:
------------------------------------

bq. Do you think we should try some invalid strings as well? It should just lead to encoding exceptions, right?

The (commented-out test) generates purely random bytes for decode, and purely random chars for encode.

you can safely enable the random chars encode test, if you think there are no encode bugs in your jdk (sun is ok)
you cannot safely enable the random bytes decode test, because of the sun decode bug i mentioned (see testJDKCorrectness to test your jdk).

I reported the bug to sun here: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6982052

by the way: you can speed up decode further for the multibyte case:
{code}
      if (state != UTF8_ACCEPT) {
        // non-starter, or following illegal byte sequence
        if (state == UTF8_REJECT)               // remove this line
          throw new CharacterCodingException(); // remove this line also
{code}

this is because the DFA is designed such that, if the decoder enters a reject state it will never leave.
so this code is redundant, it will throw the exception at the end anyway (sorry for leaving the check in the code)


> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Java - Library
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.4
>
>         Attachments: thrift-765-redux-v2.patch, thrift-765-redux.patch, THRIFT-765.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861998#action_12861998 ] 

Doug Cutting commented on THRIFT-765:
-------------------------------------

It doesn't look like this handles surrogate pairs correctly.  Java strings are not arrays of unicode codepoints, but are rather UTF-16 encoded.  UTF-8 can have 4-byte values that must be converted to two java characters.

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862360#action_12862360 ] 

Yonik Seeley commented on THRIFT-765:
-------------------------------------

Doug, I think you were right the first time.
I took a quick glance at the code, and it doesn't handle surrogate pairs.  The resulting output will be CESU-8, not UTF-8.

Taking Ken's example, that code point encoded in UTF-8 is 4 bytes, while the encoding this patch implements will yield 6 bytes of CESU-8.  A round-trip test is not sufficient to test for correct UTF-8 (they will both decode to the same Java String).

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905933#action_12905933 ] 

Bryan Duxbury commented on THRIFT-765:
--------------------------------------

I just had a look at this patch and it seems really promising. The performance improvement is great (cuts about 30% from encoding and 60% from decoding) so I'm encouraged. 

I'm thinking that maybe we should give the Random instance in the test a consistent seed so that the test is deterministic. Do you think we should try some invalid strings as well? It should just lead to encoding exceptions, right?



> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Java - Library
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.4
>
>         Attachments: thrift-765-redux-v2.patch, thrift-765-redux.patch, THRIFT-765.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "David Reiss (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861079#action_12861079 ] 

David Reiss commented on THRIFT-765:
------------------------------------

Why is there a custom UTF-8 codec in the patch while your description referenced a Java library class.  Did you copy the implementation?  Are there license restrictions on it?

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862028#action_12862028 ] 

Bryan Duxbury commented on THRIFT-765:
--------------------------------------

I added a testcase that included the example character you just provided, and successfully encoded and decoded it. Seems like it works ok.

Anyone have others they'd like to see in the tests?

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "David Dabbs (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863918#action_12863918 ] 

David Dabbs commented on THRIFT-765:
------------------------------------

Lucene's codebase has moved on from the revision link referenced above. If I can get the time I will compare to ICU4J, etc.  
This is a common interest -.Thrift, Avro and protobuf all have similar issues filed.

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.4
>
>         Attachments: thrift-765-redux-v2.patch, thrift-765-redux.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Closed: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Duxbury closed THRIFT-765.
--------------------------------

    Resolution: Won't Fix

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.4
>
>         Attachments: thrift-765-redux-v2.patch, thrift-765-redux.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Duxbury updated THRIFT-765:
---------------------------------

    Attachment: thrift-765-redux-v2.patch

OK, after looking over Lucene's implementation, I've managed to make an approach that reclaims the performance benefits of the original broken version.

Protocols that use this could probably still benefit a little bit by reusing the byte and char buffers used during encoding and decoding, but I haven't thought all the way through the memory implications, so I'm leaving that out for now.

If no one has any objections, I will commit this patch shortly.

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765-redux-v2.patch, thrift-765-redux.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861105#action_12861105 ] 

Bryan Duxbury commented on THRIFT-765:
--------------------------------------

We can't actually use the method that already exists on DataOutputStream, because it wants to wrap a stream object, and implies a Java-specific frame size that doesn't meet our needs. 

I did not copy the implementation. I adapted the technique found in DataOutputStream#writeUTF (and a similar version found in Kryo's StringSerializer) to work with Thrift. I could include the license notice from Kryo if you think that we could be in license trouble here.

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Duxbury updated THRIFT-765:
---------------------------------

    Fix Version/s: 0.4
                       (was: 0.3)

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.4
>
>         Attachments: thrift-765-redux-v2.patch, thrift-765-redux.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Duxbury updated THRIFT-765:
---------------------------------

    Attachment: thrift-765-redux.patch

I had a look at Hadoop's RecordIO implementation, and this did the trick. After adapting it to Thrift, I found that it was pretty slow, so I monkeyed around a bit until I'd basically regained all of the encoding performance, though not very much of the decoding performance. It's still faster, but I just can't seem to shake the StringBuilder overhead even when doing nothing but ASCII strings. 

All the test pass. What do you guys think?

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765-redux.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Closed: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Duxbury closed THRIFT-765.
--------------------------------

    Resolution: Fixed

I just committed this (again).

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765-redux-v2.patch, thrift-765-redux.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Closed: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Duxbury closed THRIFT-765.
--------------------------------

    Resolution: Fixed

I just committed this.

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Reopened: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Duxbury reopened THRIFT-765:
----------------------------------


> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862375#action_12862375 ] 

Doug Cutting commented on THRIFT-765:
-------------------------------------

Milind claims in AVRO-532 that Hadoop's record io correctlly handled surrogate pairs.

https://svn.apache.org/repos/asf/hadoop/common/branches/pre-HADOOP-4687/src/core/org/apache/hadoop/record/Utils.java

That code is quite different from the code in this patch.  Notably, it calls String#getCodePoint() and StringBuffer#appendCodePoint() to let Java implement surrogate pairs.


> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862386#action_12862386 ] 

Bryan Duxbury commented on THRIFT-765:
--------------------------------------

Hm, it looks like I made mistake in transcribing that earlier test case, and it does indeed fail to encode correctly.

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862011#action_12862011 ] 

Bryan Duxbury commented on THRIFT-765:
--------------------------------------

Could you provide some example strings?

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862456#action_12862456 ] 

Doug Cutting commented on THRIFT-765:
-------------------------------------

Does using a switch instead of an 'if else' in decode help any?

Another thing to try might be to use a char[] buffer instead of StringBuffer and directly implement UTF-16 coding.  Codepoints less than 0x10000 are encoded as one Java char.  Codepoints over that are encoded in two chars, with the first containing 0xD8 plus the high-order ten bits of codepoint minus 0x10000, and the second 0xDC plus the low-order ten bits of codepoint minus 0x10000.

The one-, two- and three- byte cases of UTF-8 all produce characters less than 0x10000 and so will always generate a single char.  And the four-byte case of UTF-8 will only and always generate a character greater than or equal to 0x10000.  So you don't need to test that if you combine the UTF-16 encoding with the UTF-8 decoding.

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765-redux.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862373#action_12862373 ] 

Bryan Duxbury commented on THRIFT-765:
--------------------------------------

It's not just a round-trip test. I tested that the strings serialized to the *same bytes* with either encoder.

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "David Reiss (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863450#action_12863450 ] 

David Reiss commented on THRIFT-765:
------------------------------------

Any idea why this is faster than the built-in Java codec (which is probably implemented in C)?  Reinventing the wheel here seems like it's just asking for problems.  The surrogate pair bug is just the first example.  This implementation also appears to allow overlong forms.

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765-redux-v2.patch, thrift-765-redux.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated THRIFT-765:
-------------------------------

    Attachment: THRIFT-765.patch

just found this issue randomly, late to the game.

I don't think you should give up here, besides the performance problems,
there are also bugs in Sun (and other JDK) UTF-8 implementations.

Included in the patch is a commented-out test showing a bug in my 
Sun JDK's UTF-8 decode (1.6.0_19), just as an example.

This patch includes bulk UTF-8 encode/decode with full error-checking.
Ensuring correctness doesn't need to be slow:
* For encode its trivial: just surrogate validation
* For decode its tricky, but the code can be simple with a DFA.

I used the DFA approach here: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
In my opinion, its the only way to go without a lot of complexity.

I implemented as char[] -> byte[] and back, and throw a checked 
CharacterCodingException on invalid inputs. 

I didn't modify any thrift code to actually use this, except the benchmark
from the previous patch.
Hopefully its easy adjust to your needs.


> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Java - Library
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.4
>
>         Attachments: thrift-765-redux-v2.patch, thrift-765-redux.patch, THRIFT-765.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862016#action_12862016 ] 

Ken Krugler commented on THRIFT-765:
------------------------------------

For example, code point U+20491 is two UTF-16 characters (\uD841 \uDC91) and thus two Java characters. The equivalent in UTF-8 is four bytes (0xF0 0xA0 0x92 0x91). Just to see how well Jira & your browser handle this, the actual character is 𠒑


> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862385#action_12862385 ] 

Yonik Seeley commented on THRIFT-765:
-------------------------------------

bq. It's not just a round-trip test. I tested that the strings serialized to the same bytes with either encoder.

I don't see any code points outside the BMP in your test.
Try something like "\uD801\uDC00", which is one character.

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863870#action_12863870 ] 

Bryan Duxbury commented on THRIFT-765:
--------------------------------------

I've spent a bunch of time playing around with this. I discovered that the "bugs" with the new encoder/decoder only occur when decoding invalid UTF-8 strings. In particular, the Java standard decoder seems to take a lot more care with validating every byte it examines before combining it into the codepoint and converting it to UTF-16. 

While the decoding function I have right now is still about 2x the speed of Java's, adding in all the checks required to get us to parity of functionality will almost certainly evaporate the performance benefits. Even worse, it seems like the performance benefits of encoding have somehow disappeared during the process of implemented surrogate pair support. (This one confuses me - my benchmark doesn't even cover a string that contains surrogate pairs. Maybe my original tests were flawed?)

The bottom line is that it looks like this is a dead end, unless we are willing to sacrifice "correctness" when decoding invalid UTF-8 encoded strings. You could argue that if it's a bad encoding already, it might be best to detect that and throw rather than silently convert like Java does, but that's a debate for another time.

The only way that reviving this issue would make sense is if we are willing to support a separate encoding mechanism for the purpose of avoiding buffer allocation and copies during write. At the moment, we're not equipped to benefit from that, so maybe I'll reevaluate later.


> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.4
>
>         Attachments: thrift-765-redux-v2.patch, thrift-765-redux.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Duxbury updated THRIFT-765:
---------------------------------

    Attachment: thrift-765.patch

This patch adds a new class, Utf8Helper, that implements the fast encoding and decoding methods. It also changes Binary and Compact protocols to use this method of encoding.

I haven't benchmarked the overall performance improvements to the respective protocols, but I did perform a micro-benchmark of string encoding and decoding, which showed the new methods take about half as long. The benefit you see in your actual serialization/deserialization will depend on how many strings you use.

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862317#action_12862317 ] 

Doug Cutting commented on THRIFT-765:
-------------------------------------

Mea culpa.  It looks like it actually does handle surrogate pairs.  (I didn't realize that they don't require special case code for this conversion.)  It doesn't detect invalid inputs but that's probably okay.  Nevermind.

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862496#action_12862496 ] 

Yonik Seeley commented on THRIFT-765:
-------------------------------------

You can also check out Lucene's code for UTF8 encoding/decoding:
http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/UnicodeUtil.java?revision=931668&view=markup

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765-redux.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.