You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@storm.apache.org by zd-project <gi...@git.apache.org> on 2018/06/08 15:45:34 UTC

[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor and worke...

GitHub user zd-project opened a pull request:

    https://github.com/apache/storm/pull/2710

    STORM-3099: Extend metrics on supervisor and workers

    The first commit refactored and commented code in Slot.java

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zd-project/storm STORM-3099

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/2710.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2710
    
----
commit ce2723860323e8fca2afcb42688da6df516157d6
Author: Zhengdai Hu <hu...@...>
Date:   2018-06-08T15:24:09Z

    STORM-3099: Refactored and commented code in Slot.java

----


---

[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...

Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2710#discussion_r208286166
  
    --- Diff: storm-server/src/main/java/org/apache/storm/daemon/supervisor/Container.java ---
    @@ -346,6 +358,10 @@ public void cleanUp() throws IOException {
                 _usedMemory.remove(_port);
                 _reservedMemory.remove(_port);
                 cleanUpForRestart();
    +        } catch (IOException e) {
    +            //This may or may not be reported depending on when process exits
    --- End diff --
    
    My bad. I thought Container runs on it's own process when I started working on this. This should not be a problem if it's just a daemon thread.


---

[GitHub] storm issue #2710: [WIP] STORM-3099: Extend metrics on supervisor and worker...

Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on the issue:

    https://github.com/apache/storm/pull/2710
  
    New supervisor level metrics: 
    
    - [ ] Worker Kill/Restart Statistics
    	- [x] Kill Count by Category - assignment change/HB too old/Heap space (memory limit?)
    		- [x] blob change?
    	- [ ] Worker Suicide Cnt - category:  internal error or Assignment Change
    		- [x] - Implemented based on running status the container's main process. Does not actually reflect suicide count because it counts the normal exit as well.
    	- [x] Worker idle period
    		- The metrics records the duration machines spent in each state (in histogram) and how many times it transition into/out to a certain state.
    	- [x] Time to Actually Kill worker (from identifying need by supervisor and actual change in the state of the worker) - (This is only an estimation, accuracy affected by SleepTime)
    	- [x] Time to start worker for topology from reading assignment for the first time.
    	- [x] Worker cleanup time
    - [x] Supervisor Level Metrics:
    	- [x] Supervisor restart Count
    		- simply report everytime it restarts.
    	- [x] Blobstore (Request to download time)
    		- [x] download time individual blob (inside localizer) localizer gettting requst to actually download hdfs request to finish
    			- I assume this to be [the complete process] from initiating download to commit download to local blob cache and inform relative workers
    		- [x] download rate individual blob (inside localizer)
    			- This is tracks the actual download rate of a blob retrieval, in MB/s
    		- [x] supervisor localizer thread blob download - how long (outside localizer)
    			- I put this inside async localizer as it turns out to be better suited for purpose. This tracks the time for a topology blob download request to be completely processed.
    		- [x] Blob update is also considered.
    	- [x] Blobstore Update due to Version change Cnts


---

[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...

Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on the issue:

    https://github.com/apache/storm/pull/2710
  
    #2739 


---

[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...

Posted by Ethanlm <gi...@git.apache.org>.
Github user Ethanlm commented on the issue:

    https://github.com/apache/storm/pull/2710
  
    @zd-project  Could you squash all the commits? Will merge this in.


---

[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...

Posted by Ethanlm <gi...@git.apache.org>.
Github user Ethanlm commented on the issue:

    https://github.com/apache/storm/pull/2710
  
    @HeartSaVioR @danny0405 Do you want to take a look again? Otherwise I think it's good to merge it.


---

[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...

Posted by Ethanlm <gi...@git.apache.org>.
Github user Ethanlm commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2710#discussion_r208279317
  
    --- Diff: storm-client/src/jvm/org/apache/storm/daemon/supervisor/ClientSupervisorUtils.java ---
    @@ -28,11 +29,14 @@
     import org.apache.storm.shade.org.apache.commons.lang.StringUtils;
     import org.apache.storm.utils.ConfigUtils;
     import org.apache.storm.utils.ObjectReader;
    +import org.apache.storm.utils.ShellUtils;
     import org.apache.storm.utils.Utils;
     import org.slf4j.Logger;
     import org.slf4j.LoggerFactory;
     
     public class ClientSupervisorUtils {
    +    public static final Meter numWorkerLaunchExceptions = ShellUtils.numShellExceptions;
    --- End diff --
    
    I would like to see some comments so people can understand why `ClientSupervisorUtils` is using the same meter with `ShellUtils`


---

[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor and worke...

Posted by danny0405 <gi...@git.apache.org>.
Github user danny0405 commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2710#discussion_r194212295
  
    --- Diff: storm-server/src/main/java/org/apache/storm/daemon/supervisor/Slot.java ---
    @@ -410,9 +420,11 @@ static DynamicState handleWaitingForBlobLocalization(DynamicState dynamicState,
             assert (dynamicState.pendingDownload != null);
             assert (dynamicState.container == null);
     
    -        //Ignore changes to scheduling while downloading the topology blobs
    +        // Ignore changes to scheduling while downloading the topology blobs
             // We don't support canceling the download through the future yet,
    --- End diff --
    
    nit: whitespace


---

[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...

Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2710#discussion_r207645153
  
    --- Diff: storm-client/src/jvm/org/apache/storm/daemon/supervisor/ClientSupervisorUtils.java ---
    @@ -28,11 +29,14 @@
     import org.apache.storm.shade.org.apache.commons.lang.StringUtils;
     import org.apache.storm.utils.ConfigUtils;
     import org.apache.storm.utils.ObjectReader;
    +import org.apache.storm.utils.ShellUtils;
     import org.apache.storm.utils.Utils;
     import org.slf4j.Logger;
     import org.slf4j.LoggerFactory;
     
     public class ClientSupervisorUtils {
    +    public static final Meter numWorkerLaunchExceptions = ShellUtils.numShellExceptions;
    --- End diff --
    
    Logged in Jira


---

[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor and workers

Posted by HeartSaVioR <gi...@git.apache.org>.
Github user HeartSaVioR commented on the issue:

    https://github.com/apache/storm/pull/2710
  
    @zd-project Please append "[WIP]" in the title and remove when you finished.


---

[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...

Posted by HeartSaVioR <gi...@git.apache.org>.
Github user HeartSaVioR commented on the issue:

    https://github.com/apache/storm/pull/2710
  
    Could you please rebase? I'm expecting the code diff would be reduced after that.


---

[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...

Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on the issue:

    https://github.com/apache/storm/pull/2710
  
    I'm kind of wondering whether decorator pattern is better than subclassing or not, for TimedPortAndAssignment, because itself is already a wrapper.


---

[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...

Posted by Ethanlm <gi...@git.apache.org>.
Github user Ethanlm commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2710#discussion_r208279468
  
    --- Diff: storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java ---
    @@ -251,11 +261,18 @@ private LocalizedResource getUserFile(String user, String key) {
                                     long localVersion = blob.getLocalVersion();
                                     long remoteVersion = blob.getRemoteVersion(blobStore);
                                     if (localVersion != remoteVersion || !blob.isFullyDownloaded()) {
    +                                    if (!blob.isFullyDownloaded()) {
    --- End diff --
    
    Should be `if (!blob.isFullyDownloaded()) {`


---

[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...

Posted by Ethanlm <gi...@git.apache.org>.
Github user Ethanlm commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2710#discussion_r208095554
  
    --- Diff: storm-server/src/main/java/org/apache/storm/daemon/supervisor/Slot.java ---
    @@ -1008,6 +1015,13 @@ public void close() throws Exception {
         }
     
         static class DynamicState {
    +        private static final Map<MachineState, Meter> workerStateTransition = EnumUtil.toEnumMap(MachineState.class,
    +            machineState -> StormMetricsRegistry.registerMeter("supervisor:num-transitions-into-" + machineState.toString()));
    +        private static final Map<MachineState, Timer> workerStateDuration = EnumUtil.toEnumMap(MachineState.class,
    +            machineState -> StormMetricsRegistry.registerTimer(
    +                    "supervisor:num-transitions-out-" + machineState.toString() + "-and-duration-ms")
    --- End diff --
    
    It would be good if we can find a better name


---

[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2710#discussion_r207576736
  
    --- Diff: storm-client/src/jvm/org/apache/storm/daemon/supervisor/ClientSupervisorUtils.java ---
    @@ -28,11 +29,14 @@
     import org.apache.storm.shade.org.apache.commons.lang.StringUtils;
     import org.apache.storm.utils.ConfigUtils;
     import org.apache.storm.utils.ObjectReader;
    +import org.apache.storm.utils.ShellUtils;
     import org.apache.storm.utils.Utils;
     import org.slf4j.Logger;
     import org.slf4j.LoggerFactory;
     
     public class ClientSupervisorUtils {
    +    public static final Meter numWorkerLaunchExceptions = ShellUtils.numShellExceptions;
    --- End diff --
    
    This all feels a bit too confusing to me.
    
    `ShellUtils.numShellExceptions` is static and pulled in from multiple different locations, but only registered once in the supervisor. I personally would rather see it where `ClientSupervisorUtils` registers the Meter, as it is tied to the supervisor having it do it here is fine.
    
    For ShellUtils, I would like to see it not have a static meter, but instead optionally pass a meter in when calling run, or perhaps optionally include it in the constructor. 


---

[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...

Posted by Ethanlm <gi...@git.apache.org>.
Github user Ethanlm commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2710#discussion_r208340502
  
    --- Diff: storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java ---
    @@ -365,6 +384,7 @@ void addReferencesToBlobs(PortAndAssignment pna, BlobChangingCallback cb)
             if (!localResourceList.isEmpty()) {
                 getBlobs(localResourceList, pna, cb);
             }
    +        pna.complete();
    --- End diff --
    
    just one more nit: put this outside of this method since the method `addReferencesToBlobs` doesn't indicate `complete`


---

[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...

Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on the issue:

    https://github.com/apache/storm/pull/2710
  
    Should I squash to one per daemon?


---

[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/storm/pull/2710


---

[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...

Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on the issue:

    https://github.com/apache/storm/pull/2710
  
    DRPC metrics:
    
    - [x] Avg/Max Time to respond to Http Request



---

[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...

Posted by Ethanlm <gi...@git.apache.org>.
Github user Ethanlm commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2710#discussion_r208096352
  
    --- Diff: storm-server/src/main/java/org/apache/storm/daemon/supervisor/Slot.java ---
    @@ -1147,8 +1162,19 @@ public DynamicState withPendingLocalization(Future<Void> pendingDownload) {
              */
             public DynamicState withState(final MachineState state) {
                 long newStartTime = Time.currentTimeMillis();
    +            //We may (though unlikely) lose metering here if state transition is too frequent (less than a millisecond)
    +            workerStateDuration.get(this.state).update(newStartTime - startTime, TimeUnit.MILLISECONDS);
    +            workerStateTransition.get(state).mark();
    +
    +            LocalAssignment assignment = this.currentAssignment;
    +            if (this.state != MachineState.RUNNING && state == MachineState.RUNNING
    +                && this.currentAssignment instanceof TimerDecoratedAssignment) {
    +                ((TimerDecoratedAssignment) assignment).stopTiming();
    +                assignment = new LocalAssignment(this.currentAssignment);
    --- End diff --
    
    A few comments would be helpful.


---

[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...

Posted by HeartSaVioR <gi...@git.apache.org>.
Github user HeartSaVioR commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2710#discussion_r199710919
  
    --- Diff: storm-server/src/main/java/org/apache/storm/daemon/drpc/DRPC.java ---
    @@ -208,6 +208,7 @@ public void failRequest(String id, DRPCExecutionException e) throws Authorizatio
             }
         }
     
    +    //Should this be changed to private?
    --- End diff --
    
    Leave this to the comment on PR or file an issue. TODO-like comment is tend to not be addressed afterwards.


---

[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor and workers

Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on the issue:

    https://github.com/apache/storm/pull/2710
  
    I’ll make incremental commits to this pull request as it involves multiple changes, as described in the original apache issue page. Thanks. 


---

[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...

Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2710#discussion_r208292571
  
    --- Diff: storm-server/src/main/java/org/apache/storm/daemon/supervisor/Container.java ---
    @@ -346,6 +358,10 @@ public void cleanUp() throws IOException {
                 _usedMemory.remove(_port);
                 _reservedMemory.remove(_port);
                 cleanUpForRestart();
    +        } catch (IOException e) {
    +            //This may or may not be reported depending on when process exits
    --- End diff --
    
    I'll remove this after review is completed


---

[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...

Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on the issue:

    https://github.com/apache/storm/pull/2710
  
    Rebase completed with one new commit added. The new TimerDecoratedAssignment is incorporated.


---

[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...

Posted by HeartSaVioR <gi...@git.apache.org>.
Github user HeartSaVioR commented on the issue:

    https://github.com/apache/storm/pull/2710
  
    @zd-project 
    Just 2 cents: I can see couple of other changes than the issue itself: please try to avoid side-effect on the patch. Let's file issues and raise pull requests if they're real issues. It's OK to just raise a pull request if you think the patch is minor.


---

[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor and workers

Posted by danny0405 <gi...@git.apache.org>.
Github user danny0405 commented on the issue:

    https://github.com/apache/storm/pull/2710
  
    Sorry, i didn't see any practical changes for this patch but only some comment java doc. Can you describe it?


---

[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...

Posted by Ethanlm <gi...@git.apache.org>.
Github user Ethanlm commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2710#discussion_r208092337
  
    --- Diff: storm-server/src/main/java/org/apache/storm/daemon/supervisor/Container.java ---
    @@ -346,6 +358,10 @@ public void cleanUp() throws IOException {
                 _usedMemory.remove(_port);
                 _reservedMemory.remove(_port);
                 cleanUpForRestart();
    +        } catch (IOException e) {
    +            //This may or may not be reported depending on when process exits
    --- End diff --
    
    I would like to better understand this. In which case the meter will not be reported?


---

[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2710
  
    Also could we squash the commits some?


---

[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2710#discussion_r207578433
  
    --- Diff: storm-server/src/main/java/org/apache/storm/daemon/supervisor/Slot.java ---
    @@ -1147,8 +1162,19 @@ public DynamicState withPendingLocalization(Future<Void> pendingDownload) {
              */
             public DynamicState withState(final MachineState state) {
                 long newStartTime = Time.currentTimeMillis();
    +            //TODO: potential lost metrics due to timing accuracy (Timer only tracks one call per millisecond)
    --- End diff --
    
    Please don't leave TODOs int he code.  Either fix it, file a follow on JIRA to fix it, or accept it and just have it be a comment and not a TODO.


---

[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...

Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on the issue:

    https://github.com/apache/storm/pull/2710
  
    Two more metrics for Supervisor
    
    - [x] Worker exceptions count for clean-up, kill and force-kill.
    - [x] Internal Error/Exception Cnt. on running external process/linux commands


---