Posted to issues@storm.apache.org by "Adam (JIRA)" <ji...@apache.org> on 2018/02/02 12:15:00 UTC

[jira] [Comment Edited] (STORM-2853) Deactivated topologies cause high cpu utilization

    [ https://issues.apache.org/jira/browse/STORM-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350212#comment-16350212 ] 

Adam edited comment on STORM-2853 at 2/2/18 12:14 PM:
------------------------------------------------------

[Jungtaek Lim|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=kabhwan] I tested the patch. In our use case it resolves the issue. Thank you.

My only reservation is that we removed one of the symptoms, but the original inconsistency/root cause remains: the behavior of a deactivated topology is not consistent before and after a worker JVM restart.
 As an example: as you pointed out, when a topology is deactivated we want tick tuples to keep coming, since there may be remaining tuples to process/flush. Now assume the worker process dies or is killed before that processing/flush can finish. Normally, stateful bolts would resume that work on the next tick tuples, but currently they won't get the chance (a rough sketch of what I mean is below).
 Well, it's just a consideration. I do understand the bigger-picture perspective here, and that the risk of making the larger functional change that deals with the root cause could be too big.
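
To illustrate (a minimal made-up bolt, not our actual code; the class and field names are invented): a stateful/aggregating bolt typically buffers work and only flushes it when a tick tuple arrives, so if ticks stop being delivered to the restarted worker of a deactivated topology, whatever was buffered stays stuck:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.TupleUtils;

    public class FlushingCounterBolt extends BaseBasicBolt {
        private final Map<String, Long> pending = new HashMap<>();

        @Override
        public Map<String, Object> getComponentConfiguration() {
            // ask Storm to deliver a tick tuple to this bolt every 5 seconds
            Map<String, Object> conf = new HashMap<>();
            conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5);
            return conf;
        }

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            if (TupleUtils.isTick(tuple)) {
                // flush buffered counts; if ticks never arrive after the
                // restart of a deactivated topology, this branch never runs
                for (Map.Entry<String, Long> e : pending.entrySet()) {
                    collector.emit(new Values(e.getKey(), e.getValue()));
                }
                pending.clear();
            } else {
                pending.merge(tuple.getString(0), 1L, Long::sum);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }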

Also, thanks for the explanation; it cleared up a lot. Quite an interesting story with Clojure and JStorm as well :-)



> Deactivated topologies cause high cpu utilization
> -------------------------------------------------
>
>                 Key: STORM-2853
>                 URL: https://issues.apache.org/jira/browse/STORM-2853
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 1.1.0
>            Reporter: Stuart
>            Assignee: Jungtaek Lim
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0, 1.2.0, 1.1.2, 1.0.6
>
>         Attachments: exclamation.zip
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The issue is high CPU usage for deactivated Apache Storm topologies.  I can reliably re-create it using the steps below, but I haven't identified the exact cause or a solution yet.
> The environment is a Storm cluster running 1 topology (the topology is extremely simple; I used the exclamation example), and it is INACTIVE.  Initially CPU usage is normal.  However, when I kill all topology JVM processes on all supervisors and let Storm restart them, I find that some time later (~9 hours) the CPU usage per JVM process rises to nearly 100%.  This does not happen with an ACTIVE topology.  I have also tested more than one topology and observed the same results when they're INACTIVE.
> ***Steps to re-create:***
>  1. Run 1 topology on an Apache Storm cluster
>  2. Deactivate it
>  3. Kill **all** topology JVM processes on all supervisors (Storm will restart them)
>  4. Observe the CPU usage on Supervisors rise to nearly 100% for all **INACTIVE** topology JVM processes.
> ***Environment***
> Apache Storm 1.1.0 running on 3 VMs (1 nimbus and 2 supervisors).
> Cluster Summary:
>  - Supervisors: 2 
>  - Used Slots: 2 
>  - Available Slots: 38 
>  - Total Slots: 40
>  - Executors: 50 
>  - Tasks: 50
> The topology has 2 workers and 50 executors/tasks (threads).
> ***Investigation so far:***
> Apart from being able to reliably re-create the issue, I have identified the threads using the most CPU in the affected topology JVM process.  There are 102 threads in total in the process: 97 BLOCKED and 5 IN_NATIVE.  The threads using the most CPU all share the identical stack below; there are 23 of them (all in the BLOCKED state):
>     Thread 28558: (state = BLOCKED)
>      - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>      - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 (Compiled frame)
>      - com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136 (Compiled frame)
>      - com.lmax.disruptor.RingBuffer.next(int) @bci=5, line=260 (Interpreted frame)
>      - org.apache.storm.utils.DisruptorQueue.publishDirect(java.util.ArrayList, boolean) @bci=18, line=517 (Interpreted frame)
>      - org.apache.storm.utils.DisruptorQueue.access$1000(org.apache.storm.utils.DisruptorQueue, java.util.ArrayList, boolean) @bci=3, line=61 (Interpreted frame)
>      - org.apache.storm.utils.DisruptorQueue$ThreadLocalBatcher.flush(boolean) @bci=50, line=280 (Interpreted frame)
>      - org.apache.storm.utils.DisruptorQueue$Flusher.run() @bci=55, line=303 (Interpreted frame)
>      - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 (Compiled frame)
>      - java.util.concurrent.FutureTask.run() @bci=42, line=266 (Compiled frame)
>      - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1142 (Compiled frame)
>      - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=617 (Interpreted frame)
>      - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
> I identified these threads by using `jstack` to get a thread dump of the process:
>  
>     jstack -F <pid> > jstack<pid>.txt
> and `top` to identify the threads within the process using the most CPU:
>     top -H -p <pid> 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)