You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@storm.apache.org by revans2 <gi...@git.apache.org> on 2014/06/27 20:49:55 UTC

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

GitHub user revans2 opened a pull request:

    https://github.com/apache/incubator-storm/pull/168

    STORM-376: Add compression to serialization.

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/revans2/incubator-storm compression

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-storm/pull/168.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #168
    
----
commit 041b11c1966e3bb0b29ba31d366315b8aced19d7
Author: Robert (Bobby) Evans <ev...@yahoo-inc.com>
Date:   2014-06-27T18:48:30Z

    Add compression to serialization.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-52508878
  
    Removed the tags.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-51806966
  
    @danehammer we ran a number of stress tests with this, but we have not been running it in production.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by ptgoetz <gi...@git.apache.org>.
Github user ptgoetz commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-48561812
  
    Looks fine to me. I'd  be interested in hearing what change in ZK-related  performance characteristics you've seen.
    
    Should we consider a configuration switch for this?
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by danehammer <gi...@git.apache.org>.
Github user danehammer commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-52072250
  
    @revans2 false alarm, found a serialization bug in our code (was using `ObjectInputStream#read(byte[])` instead of `ObjectInputStream#readFully(byte[])`)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by danehammer <gi...@git.apache.org>.
Github user danehammer commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-50664897
  
    @ptgoetz Are you saying this can be merged and we should log a JIRA to make upgrades easier? Or are you thinking we can't merge this without the facilitation?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-49617431
  
    @xiaokang you are completely correct about gzip compression making small data bigger.  I need to find some time to address Taylor's concern about configuration and your concern about size for small data.
    
    And yes you are correct about a risk for data loss with ZK when forceSync=no.  This would generally only happen when there is a power outage to a majority of nodes in the ensemble, but it is still a possibility you need to take into account.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by ptgoetz <gi...@git.apache.org>.
Github user ptgoetz commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-50667008
  
    @danehammer I should have been more clear... I was saying that I'm +1 for merging this pull request as it is. And that it would be good if we could address serialization to make upgrades less painful/disruptive, but that's a separate issue/JIRA.
    
    To merge this we need additional +1s from committers.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by danehammer <gi...@git.apache.org>.
Github user danehammer commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-50350332
  
    > It might be worth it to try and detect if the incoming data is compressed or not, and only decompress it if it is compressed.
    
    I don't think the recent commit is tenable. It should be all or nothing, as it was originally, or we'll have to add smarts like you mentioned. This makes me wonder - what happens when someone wants compression but not gzip? That thought train leads me to adding said smarts, and making it somewhat pluggable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by ptgoetz <gi...@git.apache.org>.
Github user ptgoetz commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-49950186
  
    Given the points brought up in the discussion, I think making it configurable, with a default to "off" would be the best way to go.
    
    Upgrading is hard enough... But that's another discussion. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by d2r <gi...@git.apache.org>.
Github user d2r commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-52248156
  
    +1, and if we could remove the `@author` javadoc items, that would be good.  The git commits should attribute the author already.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-50199048
  
    I agree, now that I have a bit of free time, I'll update the PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by danehammer <gi...@git.apache.org>.
Github user danehammer commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-51805865
  
    I have deployed this in a development cluster, and inconsistently see errors around the dreaded "IllegalStateException: unread block data", as well as some issues where my custom deserialization code in classes is appearing to receive incomplete blocks. I didn't see it for every topology, several went along fine, even after several redeploys. But occasionally I would hit deserialization problems. It's only with the gzip implementation. I have configured the DefaultSerializationDelegate and not seen any issues since.
    
    @revans2 I believe you said you've been running this at your place for a while now (before the pull request), did you have any similar experiences?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-53283553
  
    The author tags were just javadoc comments, so I am going to take the +1 from @d2r and merge this in.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by danehammer <gi...@git.apache.org>.
Github user danehammer commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-50368957
  
    What do you think about this approach? https://github.com/revans2/incubator-storm/pull/2


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-50205339
  
    I updated the code to make it configurable, but I am still a bit hesitant about this.  I looked at everywhere that Utils/serialize and Utils/deserialize are being used.  Most places seem fine because they are fairly transient, but it is also used for storing data to zookeeper for trident.  This makes me a bit nervous because the trident state is meant to live beyond a single topology, so it makes changing this a lot more difficult.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by danehammer <gi...@git.apache.org>.
Github user danehammer commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-50688995
  
    Logged https://issues.apache.org/jira/browse/STORM-432


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by Gvain <gi...@git.apache.org>.
Github user Gvain commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-49121945
  
    @revans2 Thanks a lot. That's really valuable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-storm/pull/168


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by Gvain <gi...@git.apache.org>.
Github user Gvain commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-48815091
  
    @revans2  According to 
     
    heartbeatWritesPerSecond = supervisors/5 seconds + (supervisors * workersPerSupervisor)/3 seconds.
    
    with a fully loaded 1965 nodes, the ZK write operation per seconds should be nearly 6900. Could you share the limit of the ZK write ops if we turned the force sync off, or can we have some estimation.
    
    Thanks 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-48752031
  
    With a basic ZK setup, spinning disk, ext4, and forcesync enabled by default, we saw about 800 write operations per second even with ZK batching writes.  Heartbeats were the cause of most of the writes.
    
    heartbeatWritesPerSecond = supervisors/5 seconds + (supervisors * workersPerSupervisor)/3 seconds.
    
    So it scales linearly with the number of supervisors you have. For 10 workers per supervisor that ends up at about 225 supervisors max, for a smooth running cluster it is actually closer to 200.
    
    If we turned the force sync off, and let the OS batch/reorder writes the new bottleneck was the outbound link of the gigabit ethernet.  We maxed out the traffic at about 800 simulated nodes, but could not fill the cluster completely. The formula for this bottleneck is:
    
    downloadTraffic = (supervisors * topologies * avgTopologyAssignmentSize)/10 seconds + (supervisors * workersPerSupervisor * avgTopologyAssignmentSize)/10 seconds
    
    For this one it scales with the number of supervisors and the number of topologies and the size of the assignments.  topologies * avgTopologyAssignmentSize feels like it would scale fairly linearly with the size of the cluster. A fully loaded larger cluster has more and/or bigger topologies running compared to a fully loaded smaller cluster. This actually scales quadratically with the size of the cluster.
    
    Turning on just gzip compression reduced the traffic to 1/5 th what we saw originally, and we were able to run 1179 simulated nodes fully loaded.  But to scale much further we needed to change fundamentally how we interacted with ZK.  That is why we wrote STORM-375.
    
    With a not complete version of STORM-375 that only updated the supervisors not the workers and gzip compression we were able to go to a fully loaded 1965 nodes, and the download link from ZK was only 20.44MB/sec.  We expect that we could scale even higher, but we ran out of time on the cluster and we were hitting ulimit restrictions that would require us to reconfigure all of the nodes.
    
    I am not really sure what the next bottleneck is now.  Looking at the rudimentary latency data collected by ZK for each of the three nodes in the ensemble I can see that the worst case latencies are about 1.5 seconds now, and the average latency is about 2ms.  This is a lot worse then the 10ms worst case and 0 ms average I would see on an idle cluster. 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by xiaokang <gi...@git.apache.org>.
Github user xiaokang commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-49539373
  
    @revans2 and @Gvain , we also have used forceSync=no to reduce disk io of zk servers. IO util jumped from nearly 100% to less than 10% for a 500 node cluster. However it's a risk to lost data according to zk's document.
    
    We'll try this gzip compression code to reduce the network flow of zk servers.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by Gvain <gi...@git.apache.org>.
Github user Gvain commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-49578380
  
    @xiaokang  Thanks for your sharing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-49057769
  
    I don't know hard limits when force sync is off.  It depends on the operating system, file system, disk RPM/type, the amount of free memory, and what else is going on in the system.  
    
    When force sync is off ZK will not wait for the edit to hit the platter, it just waits for the data to hit the page cache in the OS before going on.  The OS decides when to write the dirty pages back to disk.  Linux typically is configured to make sure dirty pages are not around for more than a couple of mins.  But even a few seconds worth of edits can become a rather large batch that can usually be written in a single large extent to disk, so seeks are greatly reduced.  For us we regularly see disk utilization hovering around 5% and occasionally spiking to 20% for a 400 node cluster.  It was very similar on the 800 simulated nodes with disk utilization about 7%.  I did not save the simulated 2000 node metrics so I don't feel comfortable giving hard numbers.  But off the top of my head I seem to remember it being 10 to 15%, but I really don't know for sure.
    
    Even if utilization doubled to 10% going from 400 to 800 nodes and it grew linearly there after it would become a bottleneck again at 6500+ nodes (It is really hard to actually hit 100% disk utilization in the real world. I picked 80% for this number).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by xiaokang <gi...@git.apache.org>.
Github user xiaokang commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-49539497
  
    @revans2 , Just a little concern about using gzip compression as the default behavior of Utils.serialize. If it's used for little object, the result size will be bigger than original since gzip can not compress on it and add a extra header.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-50424297
  
    @danehammer fixed my simple code and made it a lot better.  I pulled in his changes and updated the headers with his OK.  My reservations about merging this in are gone now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-48752485
  
    As for making this configurable I am fine with that.  It just makes means that if you have that config wrong you will not be able to read any of the data in ZK.  In my testing the CPU utilization did not go up noticeably so I didn't see much of a reason to make it configurable.  The only think I would be nervous about is if this code is being used for the transactional zookeeper too.  That would mean you would not just have to wipe clean the current state of the cluster, but also the state of the topologies.
    
    It might be worth it to try and detect if the incoming data is compressed or not, and only decompress it if it is compressed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-51082311
  
    @danehammer 
    I addressed your comments.  If you have any more please let me know.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-storm pull request: STORM-376: Add compression to serial...

Posted by ptgoetz <gi...@git.apache.org>.
Github user ptgoetz commented on the pull request:

    https://github.com/apache/incubator-storm/pull/168#issuecomment-50433312
  
    Nice @revans2 and @danehammer 
    I think this is a sound approach. Next up is likely version-agnostic serialization to facilitate upgrades.
    
    +1
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---