Posted to commits@cassandra.apache.org by "Muga Nishizawa (JIRA)" <ji...@apache.org> on 2010/11/12 10:35:13 UTC

[jira] Created: (CASSANDRA-1735) Using MessagePack for reducing data size

Using MessagePack for reducing data size
----------------------------------------

                 Key: CASSANDRA-1735
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1735
             Project: Cassandra
          Issue Type: New Feature
          Components: API
    Affects Versions: 0.7 beta 3
         Environment: Fedora11,  JDK1.6.0_20
            Reporter: Muga Nishizawa


To improve Cassandra performance, I implemented the Cassandra RPC layer with MessagePack.
The implementation is attached as a patch, which applies to Cassandra 0.7.0-beta3.
Please take a look.

MessagePack is a cross-language object serialization library, like Thrift and
Protocol Buffers, but it is fast, compact, and easy to use.  MessagePack reduces
serialization cost and data size, both on the network and on disk.
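
To make the size claim concrete, here is a minimal sketch of a hand-written encoder for a small subset of the MessagePack format (nil, booleans, small non-negative integers, short strings, small maps), compared against compact JSON. It is not the MessagePack library and not part of the patch; the function name `pack` and the sample document are illustrative assumptions.

```python
import json
import struct


def pack(obj):
    """Encode a small subset of Python values as MessagePack bytes (sketch only)."""
    if obj is None:
        return b"\xc0"                      # nil
    if obj is True:
        return b"\xc3"                      # true
    if obj is False:
        return b"\xc2"                      # false
    if isinstance(obj, int):
        if 0 <= obj <= 0x7F:
            return bytes([obj])             # positive fixint: the byte is the value
        if 0 <= obj <= 0xFF:
            return b"\xcc" + bytes([obj])   # uint 8
        if 0 <= obj <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", obj)  # uint 16, big-endian
        raise ValueError("integer outside the range handled by this sketch")
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        if len(data) > 31:
            raise ValueError("string too long for this sketch")
        return bytes([0xA0 | len(data)]) + data      # length lives in the type byte
    if isinstance(obj, dict):
        if len(obj) > 15:
            raise ValueError("map too large for this sketch")
        out = bytes([0x80 | len(obj)])               # fixmap: entry count in the type byte
        for key, value in obj.items():
            out += pack(key) + pack(value)
        return out
    raise TypeError("type not handled by this sketch")


doc = {"compact": True, "schema": 0}
packed = pack(doc)
as_json = json.dumps(doc, separators=(",", ":")).encode("utf-8")
print(len(packed), len(as_json))  # 18 27
```

The saving comes from folding lengths and small values into the type byte instead of spending separate characters on quotes, colons, and braces.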

MessagePack resources:
    * website: http://msgpack.org/
        This website compares MessagePack, Thrift and JSON.
    * design details: http://redmine.msgpack.org/projects/msgpack/wiki/FormatDesign
    * source code: https://github.com/msgpack/msgpack/

The performance of the data serialization library is one of the most important
issues when developing a distributed database in Java.  If serialization is
slow, it significantly reduces overall database performance, and Java's GC
also runs more often.  Cassandra has this problem as well.

To reduce the size of the data sent over the network between a client and
Cassandra, I prototyped an implementation of the Cassandra RPC layer with
MessagePack and MessagePack-RPC.  The implementation is very simple.  It reuses the
existing Thrift-based CassandraServer (org.apache.cassandra.thrift.CassandraServer)
and adapts only the communication protocol and data serialization to MessagePack.
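
The idea of putting a new RPC front end on an existing handler can be sketched as follows. This is only an illustration of the dispatch step, assuming the MessagePack-RPC message layout of `[0, msgid, method, params]` for requests and `[1, msgid, error, result]` for responses. `ExistingHandler`, its `insert`/`get` methods, and `dispatch` are hypothetical names, not the real CassandraServer API and not code from the patch.

```python
REQUEST, RESPONSE = 0, 1


class ExistingHandler:
    """Hypothetical stand-in for an existing service object such as CassandraServer."""

    def __init__(self):
        self._rows = {}

    def insert(self, key, value):
        self._rows[key] = value

    def get(self, key):
        return self._rows[key]


def dispatch(handler, message):
    """Route one already-decoded request to the handler and build the response."""
    msg_type, msgid, method, params = message
    if msg_type != REQUEST:
        raise ValueError("not a request message")
    target = getattr(handler, method, None)
    # Refuse private attributes and anything that is not a method.
    if method.startswith("_") or not callable(target):
        return [RESPONSE, msgid, "unknown method: " + method, None]
    try:
        return [RESPONSE, msgid, None, target(*params)]
    except Exception as exc:  # report handler failures in the error slot
        return [RESPONSE, msgid, repr(exc), None]


handler = ExistingHandler()
print(dispatch(handler, [REQUEST, 1, "insert", ["k1", "v1"]]))
print(dispatch(handler, [REQUEST, 2, "get", ["k1"]]))
```

The handler itself is untouched; only the code that decodes requests and encodes responses changes, which is why such a front end can stay small.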

Major features of MessagePack-RPC are
    * Asynchronous RPC
    * Parallel Pipelining
    * Connection pooling
    * Delayed return
    * Event-driven I/O

MessagePack-RPC resources:
    * more details: http://redmine.msgpack.org/projects/msgpack/wiki/RPCDesign
    * source code: https://github.com/msgpack/msgpack-rpc/
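
Asynchronous RPC and pipelining both rely on matching each response to its request by message id, so several calls can be in flight on one connection and complete in any order. The sketch below shows only that bookkeeping, with no networking. `PendingCalls` is a hypothetical helper, not a class from msgpack-rpc, and it creates `Future` objects directly purely for illustration.

```python
from concurrent.futures import Future


class PendingCalls:
    """Track in-flight requests so responses can arrive in any order (sketch)."""

    def __init__(self):
        self._next_id = 0
        self._pending = {}

    def start(self, method, params):
        """Return the request message to send and a future for its result."""
        msgid = self._next_id
        self._next_id += 1
        future = Future()
        self._pending[msgid] = future
        return [0, msgid, method, params], future

    def on_response(self, message):
        """Complete the future that belongs to a [1, msgid, error, result] message."""
        _msg_type, msgid, error, result = message
        future = self._pending.pop(msgid)
        if error is not None:
            future.set_exception(RuntimeError(error))
        else:
            future.set_result(result)


calls = PendingCalls()
req_a, fut_a = calls.start("get", ["a"])
req_b, fut_b = calls.start("get", ["b"])

# Responses arrive in the opposite order to the requests.
calls.on_response([1, req_b[1], None, "value-b"])
calls.on_response([1, req_a[1], None, "value-a"])
print(fut_a.result(), fut_b.result())
```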

The attached patch also includes a ring cache program for MessagePack and a test program for it.
You can use them to check the behavior of the Cassandra RPC with MessagePack.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-1735) Using MessagePack for reducing data size

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931844#action_12931844 ] 

Jonathan Ellis commented on CASSANDRA-1735:
-------------------------------------------

Thanks, this is exciting!

What kind of performance improvement do you see with this patch?

> Using MessagePack for reducing data size
> ----------------------------------------
>
>                 Key: CASSANDRA-1735
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1735
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API
>    Affects Versions: 0.7 beta 3
>         Environment: Fedora11,  JDK1.6.0_20
>            Reporter: Muga Nishizawa
>         Attachments: 0001-implement-a-Cassandra-RPC-part-with-MessagePack.patch, dependency_libs.zip
>
>
> For improving Cassandra performance, I implemented a Cassandra RPC part with MessagePack.  The implementation details are attached as a patch.  The patch works on Cassandra 0.7.0-beta3.  Please check it.  
> MessagePack is one of object serialization libraries for cross-languages like Thrift and Protocol Buffers but it is much faster, small, and easy to implement.  MessagePack allows reducing serialization cost and data size in network and disk.  
> MessagePack websites are
>     * website: http://msgpack.org/
>         This website compares MessagePack, Thrift and JSON.  
>     * design details: http://redmine.msgpack.org/projects/msgpack/wiki/FormatDesign
>     * source code: https://github.com/msgpack/msgpack/
> Performance of the data serialization library is one of the most important issues for developing a distributed database in Java.  If the performance is bad, it significantly reduces the overall database performance.  Java's GC also runs many times.  Cassandra has this problem as well.  
> For reducing data size in network between a client and Cassandra, I prototyped the implementation of a Cassandra RPC part with MessagePack and MessagePack-RPC.  The implementation is very simple.  MessagePack-RPC can reuse the existing Thrift based CassandraServer (org.apache.cassandra.thrift.CassandraServer)
> while adapting MessagePack's communication protocol and data serialization.  
> Major features of MessagePack-RPC are 
>     * Asynchronous RPC
>     * Parallel Pipelining
>     * Connection pooling
>     * Delayed return
>     * Event-driven I/O
>     * more details: http://redmine.msgpack.org/projects/msgpack/wiki/RPCDesign
>     * source code: https://github.com/msgpack/msgpack-rpc/
> The attached patch includes a ring cache program for MessagePack and its test program.  
> You can check the behavior of the Cassandra RPC with MessagePack.  
> Thanks in advance, 



[jira] Issue Comment Edited: (CASSANDRA-1735) Using MessagePack for reducing data size

Posted by "Muga Nishizawa (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932822#action_12932822 ] 

Muga Nishizawa edited comment on CASSANDRA-1735 at 11/17/10 12:40 AM:
----------------------------------------------------------------------

Jonathan,

Thanks for your response.

>What kind of performance improvement do you see with this patch?

This patch is expected to bring the following performance improvements:
* Reduced serialization cost and data size
* Increased throughput between clients and a Cassandra node

I measured MessagePack's performance from both viewpoints: serialization cost and throughput.  Details follow.

== Reduction of serialization cost and the data size ==

(Summary)
In the test below, MessagePack had the lowest serialization cost of the libraries compared, and its serialized data was smaller than that of Protocol Buffers and Thrift.

(Test environment)
I used "jvm-serializers", a well-known benchmark, and compared MessagePack's performance with that of Protocol Buffers, Thrift, and Avro.  The machine used for this benchmark has a 2GHz Core2 Duo and 1GB of RAM.

(Results)
           create    ser  +same  deser  +shal  +deep  total  size  +dfl
protobuf      683   6016   2973   3338   3454   3759   9775   239   149
thrift        572   6287   5565   3479   3616   3770  10057   349   197
msgpack       291   4935   4750   3468   3545   3708   8748   236   150
avro         2698   6409   3623   7480   9301  10481  16890   221   133

(Comments)
It might be better to compare serialization cost using Cassandra's own objects, such as a Column object.  But such objects and their sizes vary from user to user, so they are not suitable for a general comparison of serialization cost.  According to the results above, MessagePack's serialized data is slightly larger than Avro's, but MessagePack has a significantly lower serialization cost than Avro and Thrift.
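
As a quick cross-check of the comparison above, the sketch below ranks the four libraries by a chosen column. The numbers are copied from the results table; the dictionary layout and the `ranked` helper are illustrative assumptions.

```python
# "ser", "total" and "size" columns copied from the jvm-serializers results above.
results = {
    "protobuf": {"ser": 6016, "total": 9775, "size": 239},
    "thrift": {"ser": 6287, "total": 10057, "size": 349},
    "msgpack": {"ser": 4935, "total": 8748, "size": 236},
    "avro": {"ser": 6409, "total": 16890, "size": 221},
}


def ranked(column):
    """Library names ordered from the smallest to the largest value in `column`."""
    return sorted(results, key=lambda name: results[name][column])


for column in ("ser", "total", "size"):
    print(column, ranked(column))
```

In this data MessagePack comes first for `ser` and `total`, and second after Avro for `size`.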

== Increasing throughput ==

(Summary)
I compared the MessagePack-based RPC of Cassandra with the Thrift-based one.  Random read throughput of the MessagePack-based RPC is 15% higher than that of Thrift, and random write throughput is 21% higher.

(Test environment)
In this evaluation, a single standalone Cassandra node ran on a machine with a 2GHz Core2 Duo and 1GB of RAM.  The client programs ran on two machines, each with a 2GHz Core2 Duo and 1GB of RAM.  The client program was based on the ring cache.  It created 100 threads per JVM on each machine and accessed the Cassandra node through the ring cache.

(Results)
* Thrift-based RPC part of Cassandra (read: 5,200 queries/sec, write: 11,200 queries/sec)
* MessagePack-based RPC part of Cassandra (read: 6,000 queries/sec, write: 13,600 queries/sec)
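
The percentages in the summary follow directly from these figures. A small sketch of the arithmetic, with the throughput values copied from the results above (the `gain_percent` helper is an illustrative name):

```python
# Queries per second, copied from the results above.
throughput = {
    "thrift": {"read": 5200, "write": 11200},
    "msgpack": {"read": 6000, "write": 13600},
}


def gain_percent(operation):
    """Relative throughput gain of the MessagePack RPC over the Thrift RPC, in percent."""
    base = throughput["thrift"][operation]
    return (throughput["msgpack"][operation] - base) / base * 100.0


for operation in ("read", "write"):
    print(operation, round(gain_percent(operation), 1))
```

This gives about 15.4% for reads and 21.4% for writes, which round to the 15% and 21% quoted in the summary.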

(Comments)
I measured the maximum throughput of random access (read/write) after storing 100 small items in the Cassandra node.  I did this to make the CPU the bottleneck on the Cassandra node.  If disk I/O were the bottleneck, I would not be able to properly evaluate the maximum throughput of the RPC part.
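
The measurement method described here (a small, fixed key set and many threads issuing random accesses as fast as they can) can be sketched as below. This is not the benchmark client from the patch: `measure_throughput` is a hypothetical helper, an in-memory dictionary stands in for the Cassandra node, and Python threads are used only to show the shape of the loop.

```python
import random
import threading
import time


def measure_throughput(operation, keys, threads=4, seconds=0.2):
    """Call operation(key) on random keys from several threads; return calls per second."""
    counts = [0] * threads
    deadline = time.monotonic() + seconds

    def worker(index):
        rng = random.Random(index)       # independent generator per thread
        done = 0
        while time.monotonic() < deadline:
            operation(rng.choice(keys))
            done += 1
        counts[index] = done             # each thread writes only its own slot

    pool = [threading.Thread(target=worker, args=(i,)) for i in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return sum(counts) / seconds


# 100 small items, so the whole data set stays in memory.
store = {"key%d" % i: b"small value" for i in range(100)}
rate = measure_throughput(store.__getitem__, list(store))
print("random reads per second:", int(rate))
```

Keeping the data set tiny means every access is served from memory, so the measured rate reflects the request path rather than storage.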

I did not directly measure the amount of data transferred over the network during the evaluation.  But based on the jvm-serializers benchmark results, I believe that MessagePack-based Cassandra transfers less data than the Thrift-based one.


  


[jira] Commented: (CASSANDRA-1735) Using MessagePack for reducing data size

Posted by "Terje Marthinussen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932756#action_12932756 ] 

Terje Marthinussen commented on CASSANDRA-1735:
-----------------------------------------------

I am very curious how MessagePack's serialization compares with the serialization Cassandra uses on the data side (SSTables), and how we could benefit from using the same serialization in both places.

Does anyone have any thoughts?





[jira] [Commented] (CASSANDRA-1735) Using MessagePack for reducing data size

Posted by "Parlo Mendez (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084871#comment-13084871 ] 

Parlo Mendez commented on CASSANDRA-1735:
-----------------------------------------

The last post was some time ago. What is the current status of the MessagePack implementation in Cassandra? I think it would be very nice.

Parlo


        

[jira] Commented: (CASSANDRA-1735) Using MessagePack for reducing data size

Posted by "Muga Nishizawa (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988655#comment-12988655 ] 

Muga Nishizawa commented on CASSANDRA-1735:
-------------------------------------------

Hi T Jake Luciani,

I would like to let you know that we have cleared up the license issues with MessagePack.

As you pointed out earlier, MessagePack used to require XNIO (LGPL) for network communication.  We have replaced XNIO with Apache MINA (Apache License) in MessagePack.  Javassist, which was the other issue, is dual-licensed (LGPL and MPL) and is used by other Apache products under the MPL.

So we believe that the license-related issues are resolved for now.

Please check the URLs below for more details.
https://github.com/msgpack/msgpack/
https://github.com/msgpack/msgpack-rpc/ 


        

[jira] [Resolved] (CASSANDRA-1735) Using MessagePack for reducing data size

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-1735.
---------------------------------------

    Resolution: Won't Fix

Gary did some tests in CASSANDRA-1765 and found no significant advantage over Thrift. Given that, and our brief experience supporting a second rpc protocol (Avro in the 0.7 series), I don't think this is going anywhere.


        

[jira] Commented: (CASSANDRA-1735) Using MessagePack for reducing data size

Posted by "Muga Nishizawa (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932822#action_12932822 ] 

Muga Nishizawa commented on CASSANDRA-1735:
-------------------------------------------

Jonathan,

Thanks for your response.

>What kind of performance improvement do you see with this patch?

Performance improvement available with this patch will be the following:
* Reducing serialization cost and the data size
* Increase throughput between clients and a Cassandra node

I have also measured the performance of MessagePack, from the viewpoints of reducing serialization cost and throughput.  I will discuss details below.

== Reduction of serialization cost and the data size ==

(Summary)
MessagePack has proved to be better in reducing serialzation cost and the data size compared to other serialization libraries in the test below.  

(Test environment)
I used "jvm-serializers" which is a well-known benchmark and compared performances with Protocol Buffers, Thrift, and Avro.  Machine used for this benchmark has Core2 Duo 2GHz with 1GB RAM.

(Results)
                                 create     ser   +same   deser   +shal   +deep   total   size  +dfl
protobuf                         683    6016    2973    3338   3454    3759   9775    239   149
thrift                              572    6287    5565    3479   3616    3770 10057    349   197
msgpack                         291    4935    4750    3468   3545    3708   8748    236   150
avro                             2698    6409    3623    7480   9301   10481 16890    221   133

(Comments)
It may be better to compare serialization cost using objects with Cassandra like a Column object.  But such objects and sizes vary by users, and is not suitable for comparing serialization cost of various data.  According to the above result, the size of MessagePack's serialized data is slightly larger than Avro.  But MessagePack has significantly low serialization cost compared to Avro and Thrift.  

== Increasing throughput ==

(Summary)
I compared MessagePack based RPC of Cassandra to that of Thrift.  Random read throughput of MessagePack based RPC is 15% higher than that of Thrift and random write throughput is 21% higher.  

(Test environment)
In this evaluation, Cassandra node ran as a standalone on a machine with Core2 Duo 2GHz and 1GB RAM.  Client programs ran on two machines both with Core2 Duo 2GHz and 1GB RAM.  Client program was based on ring cache.  It created 100 threads per a JVM on each machine and accesses to a Cassandra node with ring cache.  

(Results)
* Thrift based RPC part of Cassandra
  * Random read: 5,200 query/sec.
  * Random write: 11,200 query/sec.
* MessagePack based RPC part of Cassandra
  * Random read: 6,000 query/sec.
  * Random write: 13,600 query/sec.

(Comments)
I measured the max throughput of random access (read/write) after 100 items (size of each item is small) were stored in the Cassandra node.  The reason is because I wanted to make the state of CPU bottle neck for the Cassandra node.  If the Cassandra node is the state of Disk IO bottle neck, I thought that I cannot properly evaluate max throughput of the RPC part.  

I did not measure the amount of data transferred in network during the evaluation directly.  But from the benchmark result of jvm-serializers, I believe that the amount of transferred data for MessagePack-based Cassandra would be reduced compared to that of Thrift.  


> Using MessagePack for reducing data size
> ----------------------------------------
>
>                 Key: CASSANDRA-1735
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1735
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API
>    Affects Versions: 0.7 beta 3
>         Environment: Fedora11,  JDK1.6.0_20
>            Reporter: Muga Nishizawa
>         Attachments: 0001-implement-a-Cassandra-RPC-part-with-MessagePack.patch, dependency_libs.zip
>
>
> For improving Cassandra performance, I implemented a Cassandra RPC part with MessagePack.  The implementation details are attached as a patch.  The patch works on Cassandra 0.7.0-beta3.  Please check it.  
> MessagePack is a cross-language object serialization library like Thrift and Protocol Buffers, but it is much faster, smaller, and easier to implement.  MessagePack allows reducing serialization cost and data size in network and disk.  
> MessagePack websites are
>     * website: http://msgpack.org/
>         This website compares MessagePack, Thrift and JSON.  
>     * design details: http://redmine.msgpack.org/projects/msgpack/wiki/FormatDesign
>     * source code: https://github.com/msgpack/msgpack/
> Performance of the data serialization library is one of the most important issues for developing a distributed database in Java.  If the performance is bad, it significantly reduces the overall database performance.  Java's GC also runs many times.  Cassandra has this problem as well.  
> For reducing data size in network between a client and Cassandra, I prototyped the implementation of a Cassandra RPC part with MessagePack and MessagePack-RPC.  The implementation is very simple.  MessagePack-RPC can reuse the existing Thrift based CassandraServer (org.apache.cassandra.thrift.CassandraServer)
> while adapting MessagePack's communication protocol and data serialization.  
> Major features of MessagePack-RPC are 
>     * Asynchronous RPC
>     * Parallel Pipelining
>     * Connection pooling
>     * Delayed return
>     * Event-driven I/O
>     * more details: http://redmine.msgpack.org/projects/msgpack/wiki/RPCDesign
>     * source code: https://github.com/msgpack/msgpack-rpc/
> The attached patch includes a ring cache program for MessagePack and its test program.  
> You can check the behavior of the Cassandra RPC with MessagePack.  
> Thanks in advance, 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-1735) Using MessagePack for reducing data size

Posted by "Muga Nishizawa (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Muga Nishizawa updated CASSANDRA-1735:
--------------------------------------

    Attachment: dependency_libs.zip
                0001-implement-a-Cassandra-RPC-part-with-MessagePack.patch

1) Cassandra RPC with MessagePack
2) dependency libraries



[jira] Commented: (CASSANDRA-1735) Using MessagePack for reducing data size

Posted by "Muga Nishizawa (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933387#action_12933387 ] 

Muga Nishizawa commented on CASSANDRA-1735:
-------------------------------------------

Additionally, I compared the amount of data transferred over the network with the MessagePack protocol against that with Thrift.  

(Summary)

The amount of transferred data using MessagePack (or its protocol) is more than 20% less than that of Thrift.

(Test environment)

I used ifconfig on the machine where the Cassandra node runs.  While the client program accessed the Cassandra node, I monitored the RX (downloading) and TX (uploading) byte counts displayed by ifconfig.  The client program was based on ring cache and executed random read and write requests 10,000 times.  

(Results)

* Random read with MessagePack (RX: 1722828 bytes, TX: 1369345 bytes)
* Random write with MessagePack (RX: 1831990 bytes, TX: 1228501 bytes)
* Random read with Thrift (RX: 2232822 bytes, TX: 1987473 bytes)
* Random write with Thrift (RX: 2522280 bytes, TX: 1607606 bytes)

Of course, the objects stored in Cassandra and their sizes vary by user.  In this evaluation, the data I used was small, so MessagePack significantly reduced the amount of transferred data compared to Thrift.  
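
The reduction can be recomputed from the byte counts listed in the results.  A minimal Python sketch (names are illustrative only; the numbers are the RX/TX values reported above):

```python
# Byte counts copied from the results above: (RX, TX) as reported by ifconfig.
TRAFFIC = {
    "read": {"msgpack": (1722828, 1369345), "thrift": (2232822, 1987473)},
    "write": {"msgpack": (1831990, 1228501), "thrift": (2522280, 1607606)},
}


def reduction_percent(thrift_bytes, msgpack_bytes):
    """Return how much smaller msgpack_bytes is than thrift_bytes, in percent."""
    return 100.0 * (thrift_bytes - msgpack_bytes) / thrift_bytes


for op, counts in TRAFFIC.items():
    m_rx, m_tx = counts["msgpack"]
    t_rx, t_tx = counts["thrift"]
    print("random %s: RX -%.1f%%, TX -%.1f%%, total -%.1f%%" % (
        op,
        reduction_percent(t_rx, m_rx),
        reduction_percent(t_tx, m_tx),
        reduction_percent(t_rx + t_tx, m_rx + m_tx),
    ))
```

With these figures, each per-direction reduction falls between roughly 23% and 31%, and the combined RX+TX reduction is about 26% to 27% for both reads and writes.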



[jira] Commented: (CASSANDRA-1735) Using MessagePack for reducing data size

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934520#action_12934520 ] 

Jonathan Ellis commented on CASSANDRA-1735:
-------------------------------------------

Gary wrote some performance tests in CASSANDRA-1765 and saw MessagePack perform worse than Thrift.  Is something wrong with his code?



[jira] Commented: (CASSANDRA-1735) Using MessagePack for reducing data size

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964798#action_12964798 ] 

T Jake Luciani commented on CASSANDRA-1735:
-------------------------------------------

It appears msgpack requires jassist and xnio, both of which are LGPL.

This means we can't include msgpack support in our distribution; see http://www.apache.org/legal/3party.html



[jira] Updated: (CASSANDRA-1735) Using MessagePack for reducing data size

Posted by "Muga Nishizawa (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Muga Nishizawa updated CASSANDRA-1735:
--------------------------------------

    Description: 
For improving Cassandra performance, I implemented a Cassandra RPC part with MessagePack.  The implementation details are attached as a patch.  The patch works on Cassandra 0.7.0-beta3.  Please check it.  

MessagePack is a cross-language object serialization library like Thrift and Protocol Buffers, but it is much faster, smaller, and easier to implement.  MessagePack allows reducing serialization cost and data size in network and disk.  

MessagePack websites are
    * website: http://msgpack.org/
        This website compares MessagePack, Thrift and JSON.  
    * design details: http://redmine.msgpack.org/projects/msgpack/wiki/FormatDesign
    * source code: https://github.com/msgpack/msgpack/
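
To illustrate why MessagePack output is small, here is a minimal Python sketch that hand-encodes a tiny subset of the format (nil, booleans, integers 0..127, short strings, small maps) and compares the size with compact JSON.  The function name `pack` and the sample record are illustrative assumptions; this is not the encoder used by the patch, and a real client would use a MessagePack library.

```python
import json


def pack(obj):
    """Encode a small subset of types in the MessagePack format.

    Covers only None, True/False, integers 0..127, strings shorter than
    32 bytes, and dicts with fewer than 16 entries.
    """
    if obj is None:
        return b"\xc0"                               # nil
    if obj is True:
        return b"\xc3"                               # true
    if obj is False:
        return b"\xc2"                               # false
    if isinstance(obj, int) and 0 <= obj <= 0x7F:
        return bytes([obj])                          # positive fixint
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) < 32:
            return bytes([0xA0 | len(raw)]) + raw    # fixstr ("fixraw" in older specs)
    if isinstance(obj, dict) and len(obj) < 16:
        body = b"".join(pack(k) + pack(v) for k, v in obj.items())
        return bytes([0x80 | len(obj)]) + body       # fixmap
    raise TypeError("type not covered by this sketch: %r" % (obj,))


record = {"compact": True, "schema": 0}
packed = pack(record)
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
print(len(packed), len(as_json))  # 18 27
```

The type tag and the length share a single byte for small values, so this record takes 18 bytes in MessagePack against 27 bytes as compact JSON.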

Performance of the data serialization library is one of the most important issues for developing a distributed database in Java.  If the performance is bad, it significantly reduces the overall database performance.  Java's GC also runs many times.  Cassandra has this problem as well.  

For reducing data size in network between a client and Cassandra, I prototyped the implementation of a Cassandra RPC part with MessagePack and MessagePack-RPC.  The implementation is very simple.  MessagePack-RPC can reuse the existing Thrift based CassandraServer (org.apache.cassandra.thrift.CassandraServer)
while adapting MessagePack's communication protocol and data serialization.  

Major features of MessagePack-RPC are 
    * Asynchronous RPC
    * Parallel Pipelining
    * Connection pooling
    * Delayed return
    * Event-driven I/O
    * more details: http://redmine.msgpack.org/projects/msgpack/wiki/RPCDesign
    * source code: https://github.com/msgpack/msgpack-rpc/
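
As a rough illustration of asynchronous, pipelined RPC: in the MessagePack-RPC message format a request is the array [0, msgid, method, params] and a response is [1, msgid, error, result], so a client can send several requests without waiting and match responses by msgid.  The Python sketch below models this with plain lists instead of encoded bytes; the class name, the method name "get", and the keys are hypothetical and are not taken from the patch.

```python
import itertools

REQUEST, RESPONSE = 0, 1  # message type tags used by MessagePack-RPC


class PendingCalls:
    """Tracks in-flight requests so responses can arrive in any order."""

    def __init__(self):
        self._next_id = itertools.count()
        self._callbacks = {}

    def request(self, method, params, callback):
        """Build a request message and remember who is waiting for it."""
        msgid = next(self._next_id)
        self._callbacks[msgid] = callback
        return [REQUEST, msgid, method, params]

    def on_response(self, message):
        """Dispatch a response message to the matching callback."""
        kind, msgid, error, result = message
        if kind != RESPONSE:
            raise ValueError("not a response: %r" % (message,))
        self._callbacks.pop(msgid)(error, result)


calls = PendingCalls()
results = {}
# Two requests are issued back to back, without waiting for a reply.
first = calls.request("get", ["key1"], lambda err, res: results.update(a=res))
second = calls.request("get", ["key2"], lambda err, res: results.update(b=res))
# Responses may come back out of order; msgid ties each one to its request.
calls.on_response([RESPONSE, second[1], None, "value2"])
calls.on_response([RESPONSE, first[1], None, "value1"])
print(results)
```

On the wire each of these lists would be a MessagePack-encoded array; the bookkeeping by msgid is what lets one connection carry many outstanding calls.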

The attached patch includes a ring cache program for MessagePack and its test program.  
You can check the behavior of the Cassandra RPC with MessagePack.  

Thanks in advance, 

  was:
For improving Cassandra performance, I implemented a Cassandra RPC part with MessagePack.
The implementation details are attached as a patch.  The patch works on Cassandra 0.7.0-beta3.  
Please check it.  

MessagePack is one of object serialization libraries for cross-languages like Thrift, 
Protocol Buffers but it is fast, small, and easy.  MessagePack allows reducing 
serialization cost and data size in network and disk.  

MessagePack websites are
    * website: http://msgpack.org/
        This website compares MessagePack, Thrift and JSON.  
    * design details: http://redmine.msgpack.org/projects/msgpack/wiki/FormatDesign
    * source code: https://github.com/msgpack/msgpack/

Performance of the data serialization library is one of the most important 
issues for developing a distributed database in Java.  If the performance is 
bad, it significantly reduces the overall database performance.  Java's GC 
also runs many times.  Cassandra has this problem as well.  

For reducing data size in network between a client and Cassandra, I prototyped 
the implementation of a Cassandra RPC part with MessagePack and MessagePack-RPC.  
The implementation is very simple.  MessagePack-RPC enables reuse of the 
existing Thrift based CassandraServer (org.apache.cassandra.thrift.CassandraServer)
and adapt communication protocol and data serialization to MessagePack.  

Major features of MessagePack-RPC are 
    * Asynchronous RPC
    * Parallel Pipelining
    * Connection pooling
    * Delayed return
    * Event-driven I/O
    * more details: http://redmine.msgpack.org/projects/msgpack/wiki/RPCDesign
    * source code: https://github.com/msgpack/msgpack-rpc/

The attached patch includes a ring cache program for MessagePack and its test program.  
You can check the behavior of the Cassandra RPC with MessagePack. 

