You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@storm.apache.org by zd-project <gi...@git.apache.org> on 2018/06/08 15:45:34 UTC
[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor and worke...
GitHub user zd-project opened a pull request:
https://github.com/apache/storm/pull/2710
STORM-3099: Extend metrics on supervisor and workers
The first commit refactored and commented code in Slot.java
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zd-project/storm STORM-3099
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/storm/pull/2710.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2710
----
commit ce2723860323e8fca2afcb42688da6df516157d6
Author: Zhengdai Hu <hu...@...>
Date: 2018-06-08T15:24:09Z
STORM-3099: Refactored and commented code in Slot.java
----
---
[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...
Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on a diff in the pull request:
https://github.com/apache/storm/pull/2710#discussion_r208286166
--- Diff: storm-server/src/main/java/org/apache/storm/daemon/supervisor/Container.java ---
@@ -346,6 +358,10 @@ public void cleanUp() throws IOException {
_usedMemory.remove(_port);
_reservedMemory.remove(_port);
cleanUpForRestart();
+ } catch (IOException e) {
+ //This may or may not be reported depending on when process exits
--- End diff --
My bad. I thought Container runs on it's own process when I started working on this. This should not be a problem if it's just a daemon thread.
---
[GitHub] storm issue #2710: [WIP] STORM-3099: Extend metrics on supervisor and worker...
Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on the issue:
https://github.com/apache/storm/pull/2710
New supervisor level metrics:
- [ ] Worker Kill/Restart Statistics
- [x] Kill Count by Category - assignment change/HB too old/Heap space (memory limit?)
- [x] blob change?
- [ ] Worker Suicide Cnt - category: internal error or Assignment Change
- [x] - Implemented based on running status the container's main process. Does not actually reflect suicide count because it counts the normal exit as well.
- [x] Worker idle period
- The metrics records the duration machines spent in each state (in histogram) and how many times it transition into/out to a certain state.
- [x] Time to Actually Kill worker (from identifying need by supervisor and actual change in the state of the worker) - (This is only an estimation, accuracy affected by SleepTime)
- [x] Time to start worker for topology from reading assignment for the first time.
- [x] Worker cleanup time
- [x] Supervisor Level Metrics:
- [x] Supervisor restart Count
- simply report everytime it restarts.
- [x] Blobstore (Request to download time)
- [x] download time individual blob (inside localizer) localizer gettting requst to actually download hdfs request to finish
- I assume this to be [the complete process] from initiating download to commit download to local blob cache and inform relative workers
- [x] download rate individual blob (inside localizer)
- This is tracks the actual download rate of a blob retrieval, in MB/s
- [x] supervisor localizer thread blob download - how long (outside localizer)
- I put this inside async localizer as it turns out to be better suited for purpose. This tracks the time for a topology blob download request to be completely processed.
- [x] Blob update is also considered.
- [x] Blobstore Update due to Version change Cnts
---
[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...
Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on the issue:
https://github.com/apache/storm/pull/2710
#2739
---
[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...
Posted by Ethanlm <gi...@git.apache.org>.
Github user Ethanlm commented on the issue:
https://github.com/apache/storm/pull/2710
@zd-project Could you squash all the commits? Will merge this in.
---
[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...
Posted by Ethanlm <gi...@git.apache.org>.
Github user Ethanlm commented on the issue:
https://github.com/apache/storm/pull/2710
@HeartSaVioR @danny0405 Do you want to take a look again? Otherwise I think it's good to merge it.
---
[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...
Posted by Ethanlm <gi...@git.apache.org>.
Github user Ethanlm commented on a diff in the pull request:
https://github.com/apache/storm/pull/2710#discussion_r208279317
--- Diff: storm-client/src/jvm/org/apache/storm/daemon/supervisor/ClientSupervisorUtils.java ---
@@ -28,11 +29,14 @@
import org.apache.storm.shade.org.apache.commons.lang.StringUtils;
import org.apache.storm.utils.ConfigUtils;
import org.apache.storm.utils.ObjectReader;
+import org.apache.storm.utils.ShellUtils;
import org.apache.storm.utils.Utils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class ClientSupervisorUtils {
+ public static final Meter numWorkerLaunchExceptions = ShellUtils.numShellExceptions;
--- End diff --
I would like to see some comments so people can understand why `ClientSupervisorUtils` is using the same meter with `ShellUtils`
---
[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor and worke...
Posted by danny0405 <gi...@git.apache.org>.
Github user danny0405 commented on a diff in the pull request:
https://github.com/apache/storm/pull/2710#discussion_r194212295
--- Diff: storm-server/src/main/java/org/apache/storm/daemon/supervisor/Slot.java ---
@@ -410,9 +420,11 @@ static DynamicState handleWaitingForBlobLocalization(DynamicState dynamicState,
assert (dynamicState.pendingDownload != null);
assert (dynamicState.container == null);
- //Ignore changes to scheduling while downloading the topology blobs
+ // Ignore changes to scheduling while downloading the topology blobs
// We don't support canceling the download through the future yet,
--- End diff --
nit: whitespace
---
[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...
Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on a diff in the pull request:
https://github.com/apache/storm/pull/2710#discussion_r207645153
--- Diff: storm-client/src/jvm/org/apache/storm/daemon/supervisor/ClientSupervisorUtils.java ---
@@ -28,11 +29,14 @@
import org.apache.storm.shade.org.apache.commons.lang.StringUtils;
import org.apache.storm.utils.ConfigUtils;
import org.apache.storm.utils.ObjectReader;
+import org.apache.storm.utils.ShellUtils;
import org.apache.storm.utils.Utils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class ClientSupervisorUtils {
+ public static final Meter numWorkerLaunchExceptions = ShellUtils.numShellExceptions;
--- End diff --
Logged in Jira
---
[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor and workers
Posted by HeartSaVioR <gi...@git.apache.org>.
Github user HeartSaVioR commented on the issue:
https://github.com/apache/storm/pull/2710
@zd-project Please append "[WIP]" in the title and remove when you finished.
---
[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...
Posted by HeartSaVioR <gi...@git.apache.org>.
Github user HeartSaVioR commented on the issue:
https://github.com/apache/storm/pull/2710
Could you please rebase? I'm expecting the code diff would be reduced after that.
---
[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...
Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on the issue:
https://github.com/apache/storm/pull/2710
I'm kind of wondering whether decorator pattern is better than subclassing or not, for TimedPortAndAssignment, because itself is already a wrapper.
---
[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...
Posted by Ethanlm <gi...@git.apache.org>.
Github user Ethanlm commented on a diff in the pull request:
https://github.com/apache/storm/pull/2710#discussion_r208279468
--- Diff: storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java ---
@@ -251,11 +261,18 @@ private LocalizedResource getUserFile(String user, String key) {
long localVersion = blob.getLocalVersion();
long remoteVersion = blob.getRemoteVersion(blobStore);
if (localVersion != remoteVersion || !blob.isFullyDownloaded()) {
+ if (!blob.isFullyDownloaded()) {
--- End diff --
Should be `if (!blob.isFullyDownloaded()) {`
---
[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...
Posted by Ethanlm <gi...@git.apache.org>.
Github user Ethanlm commented on a diff in the pull request:
https://github.com/apache/storm/pull/2710#discussion_r208095554
--- Diff: storm-server/src/main/java/org/apache/storm/daemon/supervisor/Slot.java ---
@@ -1008,6 +1015,13 @@ public void close() throws Exception {
}
static class DynamicState {
+ private static final Map<MachineState, Meter> workerStateTransition = EnumUtil.toEnumMap(MachineState.class,
+ machineState -> StormMetricsRegistry.registerMeter("supervisor:num-transitions-into-" + machineState.toString()));
+ private static final Map<MachineState, Timer> workerStateDuration = EnumUtil.toEnumMap(MachineState.class,
+ machineState -> StormMetricsRegistry.registerTimer(
+ "supervisor:num-transitions-out-" + machineState.toString() + "-and-duration-ms")
--- End diff --
It would be good if we can find a better name
---
[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...
Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on a diff in the pull request:
https://github.com/apache/storm/pull/2710#discussion_r207576736
--- Diff: storm-client/src/jvm/org/apache/storm/daemon/supervisor/ClientSupervisorUtils.java ---
@@ -28,11 +29,14 @@
import org.apache.storm.shade.org.apache.commons.lang.StringUtils;
import org.apache.storm.utils.ConfigUtils;
import org.apache.storm.utils.ObjectReader;
+import org.apache.storm.utils.ShellUtils;
import org.apache.storm.utils.Utils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class ClientSupervisorUtils {
+ public static final Meter numWorkerLaunchExceptions = ShellUtils.numShellExceptions;
--- End diff --
This all feels a bit too confusing to me.
`ShellUtils.numShellExceptions` is static and pulled in from multiple different locations, but only registered once in the supervisor. I personally would rather see it where `ClientSupervisorUtils` registers the Meter, as it is tied to the supervisor having it do it here is fine.
For ShellUtils, I would like to see it not have a static meter, but instead optionally pass a meter in when calling run, or perhaps optionally include it in the constructor.
---
[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...
Posted by Ethanlm <gi...@git.apache.org>.
Github user Ethanlm commented on a diff in the pull request:
https://github.com/apache/storm/pull/2710#discussion_r208340502
--- Diff: storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java ---
@@ -365,6 +384,7 @@ void addReferencesToBlobs(PortAndAssignment pna, BlobChangingCallback cb)
if (!localResourceList.isEmpty()) {
getBlobs(localResourceList, pna, cb);
}
+ pna.complete();
--- End diff --
just one more nit: put this outside of this method since the method `addReferencesToBlobs` doesn't indicate `complete`
---
[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...
Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on the issue:
https://github.com/apache/storm/pull/2710
Should I squash to one per daemon?
---
[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/storm/pull/2710
---
[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...
Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on the issue:
https://github.com/apache/storm/pull/2710
DRPC metrics:
- [x] Avg/Max Time to respond to Http Request
---
[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...
Posted by Ethanlm <gi...@git.apache.org>.
Github user Ethanlm commented on a diff in the pull request:
https://github.com/apache/storm/pull/2710#discussion_r208096352
--- Diff: storm-server/src/main/java/org/apache/storm/daemon/supervisor/Slot.java ---
@@ -1147,8 +1162,19 @@ public DynamicState withPendingLocalization(Future<Void> pendingDownload) {
*/
public DynamicState withState(final MachineState state) {
long newStartTime = Time.currentTimeMillis();
+ //We may (though unlikely) lose metering here if state transition is too frequent (less than a millisecond)
+ workerStateDuration.get(this.state).update(newStartTime - startTime, TimeUnit.MILLISECONDS);
+ workerStateTransition.get(state).mark();
+
+ LocalAssignment assignment = this.currentAssignment;
+ if (this.state != MachineState.RUNNING && state == MachineState.RUNNING
+ && this.currentAssignment instanceof TimerDecoratedAssignment) {
+ ((TimerDecoratedAssignment) assignment).stopTiming();
+ assignment = new LocalAssignment(this.currentAssignment);
--- End diff --
A few comments would be helpful.
---
[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...
Posted by HeartSaVioR <gi...@git.apache.org>.
Github user HeartSaVioR commented on a diff in the pull request:
https://github.com/apache/storm/pull/2710#discussion_r199710919
--- Diff: storm-server/src/main/java/org/apache/storm/daemon/drpc/DRPC.java ---
@@ -208,6 +208,7 @@ public void failRequest(String id, DRPCExecutionException e) throws Authorizatio
}
}
+ //Should this be changed to private?
--- End diff --
Leave this to the comment on PR or file an issue. TODO-like comment is tend to not be addressed afterwards.
---
[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor and workers
Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on the issue:
https://github.com/apache/storm/pull/2710
I’ll make incremental commits to this pull request as it involves multiple changes, as described in the original apache issue page. Thanks.
---
[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...
Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on a diff in the pull request:
https://github.com/apache/storm/pull/2710#discussion_r208292571
--- Diff: storm-server/src/main/java/org/apache/storm/daemon/supervisor/Container.java ---
@@ -346,6 +358,10 @@ public void cleanUp() throws IOException {
_usedMemory.remove(_port);
_reservedMemory.remove(_port);
cleanUpForRestart();
+ } catch (IOException e) {
+ //This may or may not be reported depending on when process exits
--- End diff --
I'll remove this after review is completed
---
[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...
Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on the issue:
https://github.com/apache/storm/pull/2710
Rebase completed with one new commit added. The new TimerDecoratedAssignment is incorporated.
---
[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...
Posted by HeartSaVioR <gi...@git.apache.org>.
Github user HeartSaVioR commented on the issue:
https://github.com/apache/storm/pull/2710
@zd-project
Just 2 cents: I can see couple of other changes than the issue itself: please try to avoid side-effect on the patch. Let's file issues and raise pull requests if they're real issues. It's OK to just raise a pull request if you think the patch is minor.
---
[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor and workers
Posted by danny0405 <gi...@git.apache.org>.
Github user danny0405 commented on the issue:
https://github.com/apache/storm/pull/2710
Sorry, i didn't see any practical changes for this patch but only some comment java doc. Can you describe it?
---
[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...
Posted by Ethanlm <gi...@git.apache.org>.
Github user Ethanlm commented on a diff in the pull request:
https://github.com/apache/storm/pull/2710#discussion_r208092337
--- Diff: storm-server/src/main/java/org/apache/storm/daemon/supervisor/Container.java ---
@@ -346,6 +358,10 @@ public void cleanUp() throws IOException {
_usedMemory.remove(_port);
_reservedMemory.remove(_port);
cleanUpForRestart();
+ } catch (IOException e) {
+ //This may or may not be reported depending on when process exits
--- End diff --
I would like to better understand this. In which case the meter will not be reported?
---
[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...
Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the issue:
https://github.com/apache/storm/pull/2710
Also could we squash the commits some?
---
[GitHub] storm pull request #2710: STORM-3099: Extend metrics on supervisor, workers ...
Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on a diff in the pull request:
https://github.com/apache/storm/pull/2710#discussion_r207578433
--- Diff: storm-server/src/main/java/org/apache/storm/daemon/supervisor/Slot.java ---
@@ -1147,8 +1162,19 @@ public DynamicState withPendingLocalization(Future<Void> pendingDownload) {
*/
public DynamicState withState(final MachineState state) {
long newStartTime = Time.currentTimeMillis();
+ //TODO: potential lost metrics due to timing accuracy (Timer only tracks one call per millisecond)
--- End diff --
Please don't leave TODOs int he code. Either fix it, file a follow on JIRA to fix it, or accept it and just have it be a comment and not a TODO.
---
[GitHub] storm issue #2710: STORM-3099: Extend metrics on supervisor, workers and DRP...
Posted by zd-project <gi...@git.apache.org>.
Github user zd-project commented on the issue:
https://github.com/apache/storm/pull/2710
Two more metrics for Supervisor
- [x] Worker exceptions count for clean-up, kill and force-kill.
- [x] Internal Error/Exception Cnt. on running external process/linux commands
---