Posted to issues@cloudstack.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/11/08 11:32:00 UTC

[jira] [Commented] (CLOUDSTACK-10136) Fix thread growth/leak issue

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-10136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243765#comment-16243765 ] 

ASF GitHub Bot commented on CLOUDSTACK-10136:
---------------------------------------------

rhtyd opened a new pull request #2314: CLOUDSTACK-10136: Fix RemoteHostEndPoint thread growth
URL: https://github.com/apache/cloudstack/pull/2314
 
 
   This fixes the following:
   - Unchecked thread growth in RemoteHostEndPoint
   - Potential NPE while finding EP for a storage/scope
   
   Unbounded thread growth can be reproduced, with the following findings:
   - Every unreachable template produces 6 new threads (in a single
   ScheduledExecutorService instance), spaced 10 seconds apart.
   - Every reachable template URL where the template does not exist produces 1 new
   thread (and one ScheduledExecutorService instance); it errors out quickly without
   causing further thread growth.
   - Every valid URL produces up to 10 threads, as the same EP (endpoint
   instance) is reused to query upload/download (async callback)
   progress.
   
   Every RemoteHostEndPoint instance creates its own
   ScheduledExecutorService instance, which is why the jstack dump
   shows several threads sharing the prefix RemoteHostEndPoint-{1..10}
   (the pool size is defined as 10, so it uses suffixes 1-10).
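
   The growth pattern described above can be illustrated in isolation. The
   following is a minimal sketch, not the CloudStack code: the class names
   `PerInstancePoolEndPoint` and `ThreadGrowthDemo`, the method names, and the
   10 ms delay are all hypothetical. It only assumes what is stated above, namely
   one scheduled pool of size 10 per endpoint instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an endpoint that owns its own scheduled pool.
class PerInstancePoolEndPoint {
    static final int POOL_SIZE = 10;
    private final ScheduledExecutorService executor;

    PerInstancePoolEndPoint(String prefix) {
        AtomicInteger suffix = new AtomicInteger(1);
        // Daemon threads named <prefix>1 .. <prefix>10, one pool per instance.
        executor = Executors.newScheduledThreadPool(POOL_SIZE, r -> {
            Thread t = new Thread(r, prefix + suffix.getAndIncrement());
            t.setDaemon(true);
            return t;
        });
    }

    // Each call starts one more core thread until POOL_SIZE is reached;
    // core threads of a scheduled pool do not time out by default.
    void scheduleProgressCheck() {
        executor.schedule(() -> { }, 10, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        executor.shutdownNow();
    }
}

class ThreadGrowthDemo {
    // Counts live threads whose name starts with the given prefix.
    static long countThreads(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().startsWith(prefix))
                .count();
    }

    // Creates `endpoints` instances, schedules `checks` tasks on each,
    // and returns how many pool threads exist before cleanup.
    static long grow(String prefix, int endpoints, int checks) {
        List<PerInstancePoolEndPoint> eps = new ArrayList<>();
        for (int i = 0; i < endpoints; i++) {
            PerInstancePoolEndPoint ep = new PerInstancePoolEndPoint(prefix);
            for (int c = 0; c < checks; c++) {
                ep.scheduleProgressCheck();
            }
            eps.add(ep);
        }
        long count = countThreads(prefix);
        eps.forEach(PerInstancePoolEndPoint::shutdown);
        return count;
    }
}
```

   In this sketch the thread count scales with the number of endpoint instances
   and is capped only per instance, which is consistent with the repeated
   1-10 suffixes seen in a thread dump.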
   
   This fixes the discovered thread leak, with the following notes:
   - Instead of a per-instance ScheduledExecutorService, a cached thread pool is
   now used, with `static` scope so that it is shared
   among all RemoteHostEndPoint instances.
   - It was not clear why we would want to wait once Answers have been returned
   from the remote EP, so a scheduled/delayed Runnable is
   not required for processing answers. ScheduledExecutorService
   was therefore unnecessary and has been replaced with ExecutorService.
   - Another benefit of a cached pool is that it terminates
   threads that have been idle for 60 seconds, and reuses idle threads for
   future Runnable submissions.
   - Caveat: the executor service is still unbounded; however, this is acceptable
   because it is only used for short jobs that check upload/download
   progress.
   - Refactored CmdRunner to not use/reference objects from the parent class.
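
   The shape of the change described in these notes can be sketched as follows.
   This is an illustration under assumptions, not the code in the pull request:
   `SharedPoolEndPoint`, `processAnswer`, the `String` answer type and the
   `Consumer` callback are hypothetical, and the `CmdRunner` here only shares its
   name with the real class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical endpoint that shares one cached pool across all instances.
class SharedPoolEndPoint {
    static final String PREFIX = "SharedPoolEndPoint-";
    private static final AtomicInteger SEQ = new AtomicInteger(1);

    // Static: created once and reused by every instance. A cached pool
    // reuses idle threads and terminates them after 60 seconds of idleness.
    private static final ExecutorService EXECUTOR = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, PREFIX + SEQ.getAndIncrement());
        t.setDaemon(true);
        return t;
    });

    // Static nested class: holds only what it is given and keeps no
    // implicit reference to the enclosing endpoint instance.
    static final class CmdRunner implements Runnable {
        private final String answer;
        private final Consumer<String> callback;

        CmdRunner(String answer, Consumer<String> callback) {
            this.answer = answer;
            this.callback = callback;
        }

        @Override
        public void run() {
            callback.accept(answer);
        }
    }

    // Submits immediately; no delay is needed once the answer is available.
    Future<?> processAnswer(String answer, Consumer<String> callback) {
        return EXECUTOR.submit(new CmdRunner(answer, callback));
    }
}

class SharedPoolDemo {
    // Uses a fresh endpoint per answer and waits for each callback to finish.
    static List<String> runAll(int n) {
        List<String> seen = Collections.synchronizedList(new ArrayList<>());
        for (int i = 0; i < n; i++) {
            try {
                new SharedPoolEndPoint().processAnswer("answer-" + i, seen::add).get();
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }
        return seen;
    }

    static long liveThreads() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().startsWith(SharedPoolEndPoint.PREFIX))
                .count();
    }
}
```

   Because the pool is static, creating more endpoint instances does not by
   itself create more pools; the number of threads depends on how many tasks
   run concurrently, which is the unbounded caveat noted above.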
   
   Screenshots showing deterministic thread growth for template with an invalid/unreachable URL:
   ![screenshot from 2017-11-08 13-40-59](https://user-images.githubusercontent.com/95203/32542409-893496a6-c498-11e7-8afb-1b2e1a46e710.png)
   
   Screenshot showing threads transitioning from waiting->stopped (and re-use) with this fix:
   ![screenshot from 2017-11-08 14-49-10](https://user-images.githubusercontent.com/95203/32542430-996e0638-c498-11e7-89d9-432b2d0afa89.png)
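
   The waiting->stopped transition can also be observed outside CloudStack.
   `Executors.newCachedThreadPool()` is documented as a `ThreadPoolExecutor` with
   zero core threads, a 60 second keep-alive and a `SynchronousQueue`. The sketch
   below builds the same configuration with a shorter keep-alive so that the
   effect is visible quickly; `KeepAliveDemo` and `poolSizes` are hypothetical
   names, and the 5 second upper bound on waiting is an arbitrary choice.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class KeepAliveDemo {
    // Returns {pool size right after a task finished, pool size after idling}.
    static int[] poolSizes(long keepAliveMillis) {
        // Same shape as a cached pool, but with a configurable keep-alive.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, Integer.MAX_VALUE,
                keepAliveMillis, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>());
        try {
            pool.submit(() -> { }).get();
            int afterTask = pool.getPoolSize();

            // Wait (at most 5 seconds) for the idle worker to time out.
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (pool.getPoolSize() > 0 && System.nanoTime() < deadline) {
                Thread.sleep(20);
            }
            return new int[] {afterTask, pool.getPoolSize()};
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

   With the default 60 second keep-alive the same behaviour would be expected,
   only later, which matches the idle threads disappearing from the thread dump
   some time after the last progress check.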
   
   To verify, the following can be tried:
   - Before applying this fix, in a test environment register two templates such that (1) one has a reachable IP/domain but the resource does not exist (causing a 404) and (2) the other uses a domain/IP that is not reachable at all.
   - Thread growth can be checked using `jstack -l <mgmt server PID> | grep RemoteHostEndPoint`, or with a visual tool such as VisualVM.
   - With the fix + restart, the mgmt server will reattempt to download those templates, but no large thread growth will be seen; after about 2-4 minutes all the threads should shut down, and `jstack -l <mgmt server PID> | grep RemoteHostEndPoint` will show no threads.
   
   Pinging for review - @DaanHoogland @nvazquez @borisstoyanov @PaulAngus @wido @mlsorensen @marcaurele and others
   
   @blueorangutan package



> Fix thread growth/leak issue
> ----------------------------
>
>                 Key: CLOUDSTACK-10136
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10136
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>    Affects Versions: 4.5.2, 4.6.2, 4.7.1, 4.10.0.0, 4.9.2.0, 4.8.1.1, 4.9.3.0
>            Reporter: Rohit Yadav
>            Assignee: Rohit Yadav
>             Fix For: 4.11.0.0
>
>
> For a long-running mgmt server with a large number of templates etc., a large number of waiting threads are seen whose names start with the 'RemoteHostEndPoint-' prefix. These async threads are mostly responsible for checking template/volume upload/download progress/states. They kick in every time a template is being checked, downloaded, set up, etc.


