Posted to dev@storm.apache.org by "Robert Joseph Evans (JIRA)" <ji...@apache.org> on 2015/11/09 15:19:11 UTC

[jira] [Commented] (STORM-1190) System load spikes in recent snapshot

    [ https://issues.apache.org/jira/browse/STORM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996578#comment-14996578 ] 

Robert Joseph Evans commented on STORM-1190:
--------------------------------------------

This is most likely due to the disruptor queue batching.

https://github.com/apache/storm/pull/765

The experiments showed that the CPU utilization under light load increased significantly, but the throughput at higher loads doubled.  

https://github.com/apache/storm/pull/765#issuecomment-149987537

You can try to mitigate this by setting topology.disruptor.batch.size to 1 and topology.disruptor.batch.timeout.millis to something large, like 1000.  If this works for you, I will put in some special-case code for a batch size of 1. That should drop the CPU utilization back to where it was before, but you will also lose the increased throughput.
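
For illustration, here is a minimal sketch of the two overrides as entries in a per-topology config map. The two key names and values come from the suggestion above; the class and method names are hypothetical, and a plain java.util.Map stands in for Storm's Config (which is itself a HashMap<String, Object>) so the snippet runs without Storm on the classpath. Whether you set these per topology or cluster-wide is up to your deployment.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of Storm.
public class DisruptorBatchMitigation {
    // Key names as given in the comment above.
    static final String BATCH_SIZE = "topology.disruptor.batch.size";
    static final String BATCH_TIMEOUT_MILLIS = "topology.disruptor.batch.timeout.millis";

    /** Returns a copy of the given topology config with the two suggested overrides applied. */
    static Map<String, Object> withMitigation(Map<String, Object> base) {
        Map<String, Object> conf = new HashMap<>(base);
        conf.put(BATCH_SIZE, 1);              // batch size of 1
        conf.put(BATCH_TIMEOUT_MILLIS, 1000); // "something large like 1000"
        return conf;
    }

    public static void main(String[] args) {
        Map<String, Object> conf = withMitigation(new HashMap<>());
        // In a real topology these entries would go into the config
        // object that is submitted along with the topology.
        System.out.println(BATCH_SIZE + " = " + conf.get(BATCH_SIZE));
        System.out.println(BATCH_TIMEOUT_MILLIS + " = " + conf.get(BATCH_TIMEOUT_MILLIS));
    }
}
```

The helper copies the base map instead of mutating it, so the same base settings can be reused to compare system load with and without the overrides.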

> System load spikes in recent snapshot
> -------------------------------------
>
>                 Key: STORM-1190
>                 URL: https://issues.apache.org/jira/browse/STORM-1190
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 0.11.0
>         Environment: 10x (CoreOS stable (766.4.0) / k8s 1.0.1 / docker running on Azure VMs)
>            Reporter: Michael Schonfeld
>            Priority: Critical
>         Attachments: Screenshot 2015-11-08 22.17.57.png, Screenshot 2015-11-08 22.18.06.png
>
>
> We've been running Storm's snapshots on our production cluster for a little while now (that back pressure support really helped us), and we've noticed a sudden spike in system load when going from commit@ba1250993d10ffc523c9f5464371fbeb406d216f to the current latest commit@c12e28c829fcfabc0a3a775fb9714968b7e3e349. Both versions were running the exact same topologies, and there was no significant change in workload. Not exactly sure how to even begin to debug this, so we ended up just rolling back. Thoughts?
> Stats screenshots attached



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)