Posted to dev@storm.apache.org by "James Xu (JIRA)" <ji...@apache.org> on 2013/12/15 07:27:06 UTC

[jira] [Created] (STORM-148) Track Nimbus actions in UI

James Xu created STORM-148:
------------------------------

             Summary: Track Nimbus actions in UI
                 Key: STORM-148
                 URL: https://issues.apache.org/jira/browse/STORM-148
             Project: Apache Storm (Incubating)
          Issue Type: New Feature
            Reporter: James Xu
            Priority: Minor


https://github.com/nathanmarz/storm/issues/77

1. Worker reassignment history
2. Task timeout history
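
A minimal Java sketch of the kind of record such a history might track; the class and field names here are hypothetical, not part of Storm:

    // Hypothetical record of a Nimbus action for the UI to display.
    // None of these names exist in Storm; this is only a sketch.
    class NimbusActionRecord {
        enum ActionType { WORKER_REASSIGNMENT, TASK_TIMEOUT }

        final ActionType type;
        final String topologyId;
        final String supervisorId;   // node involved in the action
        final int port;              // worker slot involved
        final long timestampMillis;  // when Nimbus took the action

        NimbusActionRecord(ActionType type, String topologyId,
                           String supervisorId, int port, long timestampMillis) {
            this.type = type;
            this.topologyId = topologyId;
            this.supervisorId = supervisorId;
            this.port = port;
            this.timestampMillis = timestampMillis;
        }
    }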

----------
danehammer: I feel like the logical next step would be to click on a supervisor from the main page and get details about that supervisor node's goings-on: workers running, their uptime, and the history you mention. You could even go one step further in, clicking on a worker to see the executors/tasks running on that worker.

----------
cnardi: It would be really nice. Sometimes a worker is not behaving as expected (memory or CPU problems) and it's important to know what is being executed there. The only way so far is to go through all the bolts/spouts and see where each is being executed.

----------
danehammer: I've started familiarizing myself with what would be required to implement this. It feels like the part I'm thinking about, making the workers for every supervisor known, would require changes to the Thrift API. I currently have no way of identifying an individual worker. I can get a supervisor, it can tell me how many worker slots it has and how many are in use, and executors know their host and port, but it feels like there should be a worker object between these two. A supervisor has a set of workers, and an executor lives on a worker. The worker has an uptime, port, host, and id, as well as an understanding of its executors.

Sound right?
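
A rough Java sketch of the model being proposed here; all the names are hypothetical, since Storm's Thrift API exposes no worker-level object:

    import java.util.List;

    // Hypothetical sketch of the supervisor -> worker -> executor relationship
    // described above; no such worker object exists in Storm's Thrift API today.
    class Executor {
        int startTaskId;             // executors cover a contiguous range of task ids
        int endTaskId;
    }

    class Worker {
        String supervisorId;         // with port, uniquely identifies the worker
        int port;
        String host;
        long uptimeSecs;
        List<Executor> executors;    // the executors running in this worker
    }

    class Supervisor {
        String id;
        String host;
        List<Worker> workers;        // a supervisor owns a set of worker slots
    }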

----------
nathanmarz: A worker is identified by its [supervisor id, port] pair. The uptime for a worker is the same as that of all its executors.

It would be useful to have a new Thrift method that gets the list of all workers in the cluster, including information such as:

- Supervisor id and port
- Host it's running on
- Executors running in the worker

Once you have that method, you can easily implement supervisor pages. I think you should leave uptime out for now, as that would require fetching the executor heartbeats, which means a very large number of ZooKeeper calls.
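
Rendered as Java, the method described here might look something like this; the struct and method names are invented for illustration:

    import java.util.List;

    // Hypothetical Java rendering of the Thrift method described above.
    // Struct/method names are made up; uptime is deliberately left out, since
    // computing it would mean fetching executor heartbeats from ZooKeeper.
    class WorkerSummary {
        String supervisorId;         // [supervisor id, port] identifies the worker
        int port;
        String host;
        List<String> executorIds;    // executors running in the worker
    }

    interface ClusterWorkers {
        // One entry per worker slot in use across the cluster.
        List<WorkerSummary> getClusterWorkers();
    }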

----------
danehammer: I would love to see from the UI if a worker's uptime is abnormal. Today if I hit the Storm UI and a supervisor has recently gone down, it stands out immediately: its uptime is way lower than the other supervisors'. I would imagine the same sort of "one of these does not belong" would be easily recognizable on a supervisor page.

> The uptime for a worker is the same as that of all its executors

Would looking up one of these executors' heartbeats be a valid test of the worker's uptime? I take it this means the executors' heartbeats are what tell the supervisor the worker is up, and that the worker does not have a heartbeat of its own.

----------
nathanmarz: Well, all the executor heartbeats are kept in worker heartbeats. Fetching all the worker heartbeats for every topology is just going to be too expensive.

We can solve the heartbeat problem in the future by having the supervisor keep the uptime stats (from its perspective) in the supervisor heartbeat.
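
A sketch of what that could look like, assuming a per-worker uptime map were added to the supervisor heartbeat (names hypothetical):

    import java.util.Map;

    // Hypothetical extension of the supervisor heartbeat: the supervisor records
    // the uptime it observes for each local worker, so the UI can read a handful
    // of supervisor heartbeats instead of every worker heartbeat in ZooKeeper.
    class SupervisorHeartbeat {
        String supervisorId;
        long timeSecs;                        // when the heartbeat was written
        Map<Integer, Long> workerUptimeSecs;  // port -> uptime seen by supervisor
    }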




--
This message was sent by Atlassian JIRA
(v6.1.4#6159)