
Posted to dev@thrift.apache.org by "Will Pierce (JIRA)" <ji...@apache.org> on 2011/03/21 02:49:06 UTC

[jira] [Created] (THRIFT-1103) TZlibTransport for python, a zlib compressed transport

TZlibTransport for python, a zlib compressed transport
------------------------------------------------------

Key: THRIFT-1103
URL: https://issues.apache.org/jira/browse/THRIFT-1103
Project: Thrift
Issue Type: New Feature
Components: Python - Library
Reporter: Will Pierce
Assignee: Will Pierce

New implementation of zlib compressed transport for python.

The attached patch provides a zlib compressed transport wrapper for python. It is similar to the TFramedTransport, in that it wraps another transport, implementing the data compression as a transformation layer on top of the underlying transport that it wraps.

The compression level is configurable in the constructor, from 0 (none) to 9 (best) and defaults to 9 for best compression. The way this works is that every write() to the transport appends more data to the internal cStringIO write buffer. When the transport's flush() method is called, the buffered bytes are then passed to a zlib Compressor object and flush()ed with zlib.Z_SYNC_FLUSH.
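The buffering described above can be sketched roughly as follows. This is a toy stand-in, not the code from the patch: the class name BufferedZlibWriter and the in-memory sink are made up for illustration, and io.BytesIO is used where the patch uses cStringIO. Only the zlib calls (compressobj, compress, flush with Z_SYNC_FLUSH) are taken from the description.

```python
import io
import zlib


class BufferedZlibWriter:
    """Toy compressing wrapper; hypothetical name, not the patch's class."""

    def __init__(self, sink, compresslevel=9):
        self._sink = sink                    # any object with write(bytes)
        self._wbuf = io.BytesIO()            # pending uncompressed bytes
        self._compressor = zlib.compressobj(compresslevel)

    def write(self, data):
        # Nothing is compressed yet; bytes only accumulate in the buffer.
        self._wbuf.write(data)

    def flush(self):
        pending = self._wbuf.getvalue()
        self._wbuf = io.BytesIO()
        # Compress everything buffered so far, then emit a sync point so the
        # receiver can decode all of it without waiting for more input.
        out = self._compressor.compress(pending)
        out += self._compressor.flush(zlib.Z_SYNC_FLUSH)
        self._sink.write(out)


sink = io.BytesIO()
writer = BufferedZlibWriter(sink)
writer.write(b"first half, ")
writer.write(b"second half")
bytes_before_flush = sink.getvalue()     # still empty at this point
writer.flush()

# One long-lived decompressor on the reading side mirrors the compressor.
decoder = zlib.decompressobj()
roundtrip = decoder.decompress(sink.getvalue())
```

Because the compressor object lives across flush() calls, later messages can reuse the dictionary built up by earlier ones, which is why Z_SYNC_FLUSH is used here rather than finishing the stream each time.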

Because the thrift API calls the transport's flush() after writeMessageEnd(), this means very small thrift RPC calls don't get compressed well. This transport works best on thrift protocols where the payload contains strings longer than 10 characters. As with all data compression, the more redundancy in the uncompressed input, the greater the resulting compression.
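A quick way to see the effect on small messages, assuming one compress-then-sync-flush cycle per message as described. The two payloads are made up for illustration, and the helper name sync_flushed_size is hypothetical.

```python
import zlib


def sync_flushed_size(payload, level=9):
    """Return how many bytes one write-then-flush cycle would emit."""
    compressor = zlib.compressobj(level)
    out = compressor.compress(payload) + compressor.flush(zlib.Z_SYNC_FLUSH)
    return len(out)


# Made-up 12-byte message: too short for zlib to find any redundancy.
tiny = b"\x80\x01\x00\x01\x00\x00\x00\x04ping"
# Made-up payload with heavy repetition.
redundant = b"the quick brown fox jumps over the lazy dog " * 50

tiny_size = sync_flushed_size(tiny)
redundant_size = sync_flushed_size(redundant)

print(len(tiny), "->", tiny_size)
print(len(redundant), "->", redundant_size)
```

The tiny message grows because the zlib header and the sync-flush marker cost more than the compression saves, while the repetitive payload shrinks substantially.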

The TZlibTransport class also implements some basic statistics that track the number of raw bytes written and read, versus the decompressed equivalent. The getCompRatio() method returns a tuple of (readCompressionRatio,writeCompressionRatio) where ratio is computed using: compressed_bytes/uncompressed_bytes. (So data that compresses to a tenth of its original size has a ratio of 0.10, meaning smaller numbers are better.) The getCompSavings() method returns the actual number of (saved_read_bytes,saved_write_bytes), which might be negative when the compression of non-compressible data ends up expanding the data. So hopefully, anyone who uses this transport will be able to tell whether the compression is saving bandwidth or not.
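The arithmetic behind those two statistics, as a small worked sketch. The function names comp_ratio and comp_savings are hypothetical helpers, not the transport's methods, and returning 0.0 when nothing has been transferred yet is an assumption about the edge case.

```python
def comp_ratio(compressed_bytes, uncompressed_bytes):
    """compressed/uncompressed; smaller is better, above 1.0 means expansion."""
    if uncompressed_bytes == 0:
        return 0.0  # assumption: no traffic yet, so report a neutral value
    return compressed_bytes / uncompressed_bytes


def comp_savings(compressed_bytes, uncompressed_bytes):
    """Bytes saved on the wire; negative when compression expanded the data."""
    return uncompressed_bytes - compressed_bytes


# 1000 bytes squeezed into 100: ratio 0.1, 900 bytes saved.
good_ratio = comp_ratio(100, 1000)
good_savings = comp_savings(100, 1000)

# 12 incompressible bytes that became 21 on the wire: a loss of 9 bytes.
bad_savings = comp_savings(21, 12)
```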

I will add the patch in a few minutes.

I haven't tested this against the C++ TZlibTransport, only against itself.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (THRIFT-1103) TZlibTransport for python, a zlib compressed transport

Posted by "Will Pierce (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009544#comment-13009544 ] 

Will Pierce commented on THRIFT-1103:
-------------------------------------

I updated the test suite to include running every valid combination of server, protocol and wrapping transports (both ssl and zlib).  For python2.4, this is 30 combinations and runs in about 24 seconds.  For python2.7, there is an extra server type (TProcessPool, which uses the multiprocessing module) and the SSL transport (unavailable in py2.4), which adds up to 66 combinations of tests, running in ~95 seconds.  The 4 nested for-loops significantly expand the code test coverage.
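The shape of such a test matrix might look like the sketch below. The server and protocol lists are illustrative placeholders rather than the lists in the patch, so the counts it produces will not match the 30 and 66 quoted above; the point is only how unavailable features prune the combinations.

```python
# Placeholder lists for illustration; the real test script has its own.
SERVERS = ["TSimpleServer", "TThreadedServer", "TProcessPoolServer"]
PROTOCOLS = ["binary", "compact"]


def valid_combinations(servers, protocols, have_ssl, have_multiprocessing):
    """Enumerate (server, protocol, use_zlib, use_ssl) tuples worth running."""
    combos = []
    for server in servers:
        if server == "TProcessPoolServer" and not have_multiprocessing:
            continue  # this server type needs the multiprocessing module
        for protocol in protocols:
            for use_zlib in (False, True):
                for use_ssl in (False, True):
                    if use_ssl and not have_ssl:
                        continue  # no ssl module, so skip SSL variants
                    combos.append((server, protocol, use_zlib, use_ssl))
    return combos


full = valid_combinations(SERVERS, PROTOCOLS, True, True)
reduced = valid_combinations(SERVERS, PROTOCOLS, False, False)
print(len(full), len(reduced))
```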

In addition to everything in the _v1 of this patch, the _v2 version also has:

Updated test code:
* added testing of TSSLServer, an alternate socket transport
* added testing of TZlibTransport, a wrapping transport
* added a self-signed cert in test/py/test_cert.pem with a cautionary .readme to allow testing of the TSSLServerSocket (it needs a certificate file)
* fixed -q (quiet) and -v (verbose) options to RunClientServer/TestServer/TestClient to lower and raise the verbosity

Fixed two problems in lib/py/src/transport/TSSLSocket.py and one enhancement:
* fixed confusing parameters to both client and server constructors, removing the overly ornate \*args and \*\*kwargs which made the constructor behave poorly when used with just (host,port) as arguments.  The constructors better match the TSocket and TServerSocket constructor parameters now.
* fixed logic in TSSLServerSocket parameter checking: if validate=True and ca_certs=None, it now raises an exception, as the docstring claims it should.
* made TSSLServerSocket more robust on failed SSL handshake by closing socket connection and returning None from accept() call, which is better than terminating the entire server in some cases
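The last two items can be pictured with the following sketch. It is not the TSSLSocket.py code: check_server_params and accept_or_none are hypothetical helpers, the choice of ValueError is an assumption, and the handshake is simulated with a fake socket so the example runs without a network.

```python
import ssl


def check_server_params(validate, ca_certs):
    """Reject the inconsistent case up front instead of failing later."""
    if validate and ca_certs is None:
        # Assumption: ValueError; the patch may raise a different type.
        raise ValueError("ca_certs is required when validate=True")
    return ssl.CERT_REQUIRED if validate else ssl.CERT_NONE


def accept_or_none(plain_sock, wrap):
    """Wrap an accepted socket; on a failed handshake, close it, return None."""
    try:
        return wrap(plain_sock)
    except ssl.SSLError:
        plain_sock.close()
        return None


class FakeSocket:
    """Stand-in for an accepted client socket."""
    closed = False

    def close(self):
        self.closed = True


def failing_wrap(sock):
    # Simulates a client that connects but fails the SSL handshake.
    raise ssl.SSLError("simulated handshake failure")


bad_client = FakeSocket()
result = accept_or_none(bad_client, failing_wrap)
```

A server loop that receives None from accept() can skip that connection and keep serving, rather than letting the handshake error propagate and stop the server.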

I will attach the _v2 patch in a moment.



[jira] [Closed] (THRIFT-1103) TZlibTransport for python, a zlib compressed transport

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/THRIFT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Duxbury closed THRIFT-1103.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 0.7

I just committed this. Thanks Will!


[jira] [Updated] (THRIFT-1103) TZlibTransport for python, a zlib compressed transport

Posted by "Will Pierce (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/THRIFT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Will Pierce updated THRIFT-1103:
--------------------------------

    Attachment: THRIFT-1103.tzlibtransport_for_python_v2.patch

version 2 of patch attached  (obsoletes the v1 patch)


[jira] [Commented] (THRIFT-1103) TZlibTransport for python, a zlib compressed transport

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009243#comment-13009243 ] 

Bryan Duxbury commented on THRIFT-1103:
---------------------------------------

THRIFT-1094 is committed. Do you want to revise your patch before I evaluate this ticket for commit?


[jira] [Commented] (THRIFT-1103) TZlibTransport for python, a zlib compressed transport

Posted by "Will Pierce (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/THRIFT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009283#comment-13009283 ] 

Will Pierce commented on THRIFT-1103:
-------------------------------------

Thanks for committing THRIFT-1094. I'll update the patch this evening (and attach it as _v2) to include testing TZlibTransport wrapping in the TestServer.py/TestClient.py code (as a cmdline --zlib argument to both scripts).

FYI, currently the hudson (jenkins) build seems to be failing on the javascript jslint tasks, which stops the build tests from progressing past the 'test/js' directory...


[jira] [Updated] (THRIFT-1103) TZlibTransport for python, a zlib compressed transport

Posted by "Will Pierce (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/THRIFT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Will Pierce updated THRIFT-1103:
--------------------------------

    Attachment: THRIFT-1103.tzlibtransport_for_python_v1.patch

Patch attached.  Adds TZlibTransport.py into ./lib/py/src/transport/ and adds TZlibTransport into the transport/__init__.py module's __all__ list.

I tested this on python 2.4 and 2.7.  The zlib module is present and provides the same API in python 2.4 as 2.7 for our needs.

If the patch for THRIFT-1094 is good and can be committed, it would be easier for me to extend the RunClientServer.py/TestServer.py/TestClient.py code to include testing that exercises the TZlibTransport code.  (I did it locally in my copy of thrift-svn/trunk to test this code, but didn't want to submit a patch that requires another patch (THRIFT-1094) which hasn't been approved yet.)


