Posted to dev@metron.apache.org by nickwallen <gi...@git.apache.org> on 2017/09/11 18:41:33 UTC

[GitHub] metron pull request #748: METRON-1177 Stale running topologies seen post-ker...

GitHub user nickwallen opened a pull request:

    https://github.com/apache/metron/pull/748

    METRON-1177 Stale running topologies seen post-kerberization and cause exceptions

    [METRON-1177](https://issues.apache.org/jira/browse/METRON-1177)
    
    ### Problem
    
    After running the Ambari Kerberization process on a cluster where Metron was installed with the MPack, the process would often complete successfully, but the running Metron topologies were left stale because they had not been restarted properly after all Kerberization steps completed.  In other cases, the Metron service check would fail when Ambari began restarting all cluster services.
    
    One clue that this has occurred: querying Storm through the Thrift API to check topology status after Kerberization results in the following error.
    ```
    AuthorizationException(msg:getTopologyInfo on topology snort is not authorized)
    ```
    
    ### Solution
    
    * All Metron services have to be started before performing a Metron service check.
    * All external dependencies, such as Storm, HBase, and Kafka, must complete their service checks before the service check of any Metron service that depends on them.
    * Added Storm as a start dependency for the Metron Profiler.
    * Metron Profiler has to be stopped before Storm is stopped.
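
The ordering rules above can be sketched as data plus a check over an observed command sequence, which is roughly what the manual monitoring in testing amounts to. This is a minimal sketch, not the MPack's actual configuration: the component and command names in `RULES` are illustrative assumptions, and `find_violations` is a hypothetical helper.

```python
# Hypothetical ordering rules in the style of an Ambari role command order:
# each key may run only after every command in its list has completed.
# Component names are illustrative assumptions, not copied from the MPack.
RULES = {
    "METRON_SERVICE_CHECK-SERVICE_CHECK": [
        "METRON_PARSERS-START",
        "METRON_PROFILER-START",
        "STORM_SERVICE_CHECK-SERVICE_CHECK",
        "KAFKA_SERVICE_CHECK-SERVICE_CHECK",
        "HBASE_SERVICE_CHECK-SERVICE_CHECK",
    ],
    "METRON_PROFILER-START": ["NIMBUS-START"],
    "NIMBUS-STOP": ["METRON_PROFILER-STOP"],
}


def find_violations(observed, rules):
    """Return (command, prerequisite) pairs where the prerequisite ran too late.

    Prerequisites that never appear in `observed` are ignored; this is a
    design choice of the sketch, since not every operation touches every
    component.
    """
    position = {command: index for index, command in enumerate(observed)}
    violations = []
    for command, prerequisites in rules.items():
        if command not in position:
            continue
        for prerequisite in prerequisites:
            if prerequisite in position and position[prerequisite] > position[command]:
                violations.append((command, prerequisite))
    return violations
```

Feeding it the sequence of actions recorded during a Kerberization run should return an empty list if the ordering held.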
    
    ### Testing
    
    This was tested by launching the Full Dev environment, kerberizing the environment, and then monitoring the order in which each of the service start, stop and status check actions occurred.  I was not able to replicate the failure condition with this fix.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nickwallen/metron METRON-1177

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/metron/pull/748.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #748
    
----
commit a115f726eb5ccd3d215a7282871d290317cc1052
Author: Nick Allen <ni...@nickallen.org>
Date:   2017-09-08T15:23:49Z

    METRON-1177 Stale running topologies seen post-kerberization and cause exceptions

----


---

[GitHub] metron issue #748: METRON-1177 Stale running topologies seen post-kerberizat...

Posted by anandsubbu <gi...@git.apache.org>.
Github user anandsubbu commented on the issue:

    https://github.com/apache/metron/pull/748
  
    Hi @nickwallen, the latest version of your patch works perfectly. I can see that the STORM_UI_SERVER waits until all of the Metron topologies are stopped (i.e., killed) and then shuts down. This allows all of the topologies to come up gracefully during the kerberization process. Thank you for the fix!
    
    +1 (non-binding).


---

[GitHub] metron issue #748: METRON-1177 Stale running topologies seen post-kerberizat...

Posted by mmiklavc <gi...@git.apache.org>.
Github user mmiklavc commented on the issue:

    https://github.com/apache/metron/pull/748
  
    +1. Thanks for the thorough explanation and post-mortem on this!! That was extremely helpful.


---

[GitHub] metron issue #748: METRON-1177 Stale running topologies seen post-kerberizat...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on the issue:

    https://github.com/apache/metron/pull/748
  
    I committed another change.  Everything seems to be working with this additional fix.  Would like @anandsubbu to add his experiences working with this patch.
    
    This problem started with #660.  I added a conditional to make sure that a topology is actually running before trying to stop it, and similar logic for starting topologies.  This introduced a subtle dependency in that the "is running" conditional depends on the Storm UI/API.  If the Storm UI/API is not running, then it assumes the topology is already stopped.  This is only a problem if the Storm UI/API is stopped before the Metron components.
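
The pitfall described above can be illustrated with a small sketch. It is not the MPack's code: `fetch_summary`, `topology_status`, and `is_running_naive` are hypothetical names, and the sketch assumes the Storm UI REST endpoint `/api/v1/topology/summary` is reachable at a caller-supplied base URL without authentication.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def fetch_summary(base_url):
    """Fetch the topology summary from the Storm UI REST API (base_url is an assumption)."""
    with urlopen(base_url + "/api/v1/topology/summary", timeout=10) as response:
        return json.load(response)


def topology_status(name, fetch):
    """Return the topology's status, "NOT_FOUND", or "UNKNOWN" if the API is unreachable."""
    try:
        summary = fetch()
    except (URLError, OSError):
        return "UNKNOWN"
    for topology in summary.get("topologies", []):
        if topology.get("name") == name:
            return topology.get("status", "UNKNOWN")
    return "NOT_FOUND"


def is_running_naive(name, fetch):
    """The pitfall: an unreachable API looks exactly like a stopped topology."""
    return topology_status(name, fetch) == "ACTIVE"
```

Passing the fetch step in as a callable, for example `lambda: fetch_summary(base_url)`, keeps the status logic testable without a live cluster. Distinguishing "UNKNOWN" from "NOT_FOUND" is one way to avoid silently skipping a stop.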
    
    We then attempted a fix in #680.  There we added dependencies to `STORM_REST_API-STOP` to ensure that all Metron topologies are stopped prior to the Storm UI.  This is necessary since the Metron MPack uses the Storm API to check the status of topologies before shutting them down.  
    
    Unfortunately, this did not fix the problem, at least not in all cases. We found instances where the Metron MPack was still unable to stop topologies before Kerberization because the Storm UI/API had already been stopped.  
    
    After some sleuthing, it seems that the status checks depend on a process running on port 8744.  I found that the Storm component that listens on port 8744 is actually called `STORM_UI_SERVER` by Ambari.  So it is the `STORM_UI_SERVER` component that we need to add the dependencies to.  And this is what I have added in the latest commit.
    
    I am assuming that this is needed in addition to `STORM_REST_API`, but I am not completely sure.  So I left the existing `STORM_REST_API` dependencies.
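
As a rough sketch of the change described above, the same Metron stop commands can be made prerequisites of both Storm stop commands. The helper `require_stopped_first` and the Metron component names in the usage example are assumptions for illustration; the real change is an edit to the MPack's role command order, not Python code.

```python
def require_stopped_first(role_command_order, blocked_commands, metron_components):
    """Return a copy in which each blocked command waits for every Metron STOP.

    `role_command_order` maps a command to the list of commands that must
    complete before it. The input mapping is left unmodified.
    """
    updated = {command: list(prereqs) for command, prereqs in role_command_order.items()}
    for command in blocked_commands:
        prereqs = updated.setdefault(command, [])
        for component in metron_components:
            stop = component + "-STOP"
            if stop not in prereqs:
                prereqs.append(stop)
    return updated


# Usage with assumed component names: keep the existing STORM_REST_API
# dependencies and add the same ones to STORM_UI_SERVER.
existing = {"STORM_REST_API-STOP": ["METRON_PARSERS-STOP"]}
patched = require_stopped_first(
    existing,
    ["STORM_REST_API-STOP", "STORM_UI_SERVER-STOP"],
    ["METRON_PARSERS", "METRON_INDEXING"],
)
```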


---

[GitHub] metron pull request #748: METRON-1177 Stale running topologies seen post-ker...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/metron/pull/748


---

[GitHub] metron issue #748: METRON-1177 Stale running topologies seen post-kerberizat...

Posted by mmiklavc <gi...@git.apache.org>.
Github user mmiklavc commented on the issue:

    https://github.com/apache/metron/pull/748
  
    Oh man, after looking at this it dawns on me that a dependency tree analyzer for the Ambari role command order would be extremely useful.
    
    +1 by inspection. Thanks for fixing this @nickwallen.
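
A dependency analyzer of the kind suggested above could start from something like the following. This is only a sketch under assumptions: `prerequisites` is a hypothetical helper, and `rules` is assumed to be a mapping from each command to the commands that must complete before it, in the spirit of the role command order.

```python
def prerequisites(command, rules):
    """Return every command that must finish, directly or transitively, before `command`.

    The result lists prerequisites in an order that satisfies the rules.
    Raises ValueError if a cycle is reachable from `command`.
    """
    resolved = []        # finished commands, prerequisites first
    in_progress = set()  # commands on the current depth-first path

    def visit(current):
        if current in resolved:
            return
        if current in in_progress:
            raise ValueError("cycle detected at " + current)
        in_progress.add(current)
        for prerequisite in rules.get(current, []):
            visit(prerequisite)
        in_progress.discard(current)
        resolved.append(current)

    visit(command)
    return resolved[:-1]  # drop `command` itself
```

Running it for a stop command such as `STORM_UI_SERVER-STOP` would show at a glance which Metron components are expected to stop first.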


---

[GitHub] metron issue #748: METRON-1177 Stale running topologies seen post-kerberizat...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on the issue:

    https://github.com/apache/metron/pull/748
  
    @mmiklavc Let me know if you can re-up on your previous +1.  I added another commit since you originally looked at it. It should be ready to go now.  Thanks for the review.


---

[GitHub] metron issue #748: METRON-1177 Stale running topologies seen post-kerberizat...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on the issue:

    https://github.com/apache/metron/pull/748
  
    Thanks. We are still testing so I'll hold off on committing it.  I think we found some other items to address.


---