Posted to commits@oozie.apache.org by "Hadoop QA (JIRA)" <ji...@apache.org> on 2011/09/08 06:29:09 UTC

[jira] [Created] (OOZIE-103) GH-68: Better reporting/handling of problems in Hadoop

GH-68: Better reporting/handling of problems in Hadoop
------------------------------------------------------

                 Key: OOZIE-103
                 URL: https://issues.apache.org/jira/browse/OOZIE-103
             Project: Oozie
          Issue Type: Bug
            Reporter: Hadoop QA


Add instrumentation to track performance stats of NN and JT (how long to get directory listing on hdfs; how long to submit a job or query JT queue)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (OOZIE-103) GH-68: Better reporting/handling of problems in Hadoop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/OOZIE-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hadoop QA resolved OOZIE-103.
-----------------------------

    Resolution: Fixed


[jira] [Commented] (OOZIE-103) GH-68: Better reporting/handling of problems in Hadoop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/OOZIE-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13099801#comment-13099801 ] 

Hadoop QA commented on OOZIE-103:
---------------------------------

mislam77 remarked:
Objective :
---------------
1. The long-term objective: when Hadoop is slow, Oozie should be able to throttle the load on the JT/NN, e.g. by submitting fewer jobs.

2. In the short term, we want to instrument Oozie so that it can report the response time of the JT/NN at any time. How the value will be used or presented is outside the scope of this short-term goal.

3. The design for the short-term objective is expected to be extensible and reusable for the long-term objective.

Solution:
------------
The following ideas were discussed internally at Y!.
Approach 1:
Use a separate monitoring thread that periodically pings the Hadoop server with a representative command. For example, for the namenode, the thread would invoke a command like "ls /tmp".

Pros & Cons :
* This thread will add extra overhead to Hadoop as well as to Oozie.
* Finding a command that represents the actual health of Hadoop might not be trivial.

Approach 2:
When Oozie calls the NN or JT, it could instrument the turn-around time of that call. The benefit is that no extra command is sent.

Pros & Cons :
* There are different types of commands, and their normal response times also vary. In this case, Oozie could restrict the instrumentation to a subset of commonly used commands. Each command type will have its own instrumented value.

* When Oozie is idle, it will have no data for that period.

Comments please.
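
To make Approach 2 concrete, here is a minimal sketch of what timing the existing calls per command type could look like. It is not Oozie code: the class name CallTimer, the Stat holder, and command-type labels such as "NN.listStatus" are all hypothetical, and a real version would presumably plug into whatever instrumentation Oozie already has.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical helper, not part of Oozie: records the turn-around time of
// calls Oozie already makes, keyed by command type (e.g. "NN.listStatus").
class CallTimer {
    // Running totals for one command type.
    static final class Stat {
        final AtomicLong count = new AtomicLong();
        final AtomicLong totalNanos = new AtomicLong();

        double averageMillis() {
            long n = count.get();
            return n == 0 ? 0.0 : totalNanos.get() / 1_000_000.0 / n;
        }
    }

    private final Map<String, Stat> stats = new ConcurrentHashMap<>();

    // Runs the call and records how long it took, whether it returns or throws.
    <T> T time(String commandType, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            Stat s = stats.computeIfAbsent(commandType, k -> new Stat());
            s.count.incrementAndGet();
            s.totalNanos.addAndGet(System.nanoTime() - start);
        }
    }

    // Returns the statistic for a command type, or null if it was never timed.
    Stat stat(String commandType) {
        return stats.get(commandType);
    }
}
```

A caller would wrap an existing NN or JT call, for example timer.time("NN.listStatus", () -> someExistingCall()), where someExistingCall is a placeholder. Note that a command type that has not been called yet simply has no entry, which is the "idle" gap mentioned above.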


[jira] [Reopened] (OOZIE-103) GH-68: Better reporting/handling of problems in Hadoop

Posted by "Roman Shaposhnik (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/OOZIE-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roman Shaposhnik reopened OOZIE-103:
------------------------------------



[jira] [Commented] (OOZIE-103) GH-68: Better reporting/handling of problems in Hadoop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/OOZIE-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101844#comment-13101844 ] 

Hadoop QA commented on OOZIE-103:
---------------------------------

mislam77 remarked:
Another use case we considered:
The admin wants to keep Hadoop running while making sure that no jobs are submitted through Oozie. In that case, Oozie should not even ping Hadoop.

To keep it simple, we propose this protocol: if a resource has been blacklisted by the admin, Oozie should not automatically try to take it off the list. It is the admin's responsibility to remove it from the blacklist.

In the above case, initialization at startup might not work.


[jira] [Commented] (OOZIE-103) GH-68: Better reporting/handling of problems in Hadoop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/OOZIE-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101842#comment-13101842 ] 

Hadoop QA commented on OOZIE-103:
---------------------------------

anew remarked:
My understanding of this issue is that we want to avoid failing workflows when Hadoop is down. Instead, we want to defer the submission of the workflow until Hadoop is back. What this requires is:

* Every time a workflow (action) is submitted, we need to know whether Hadoop is up. If the daemon thread pings Hadoop - say - every minute, then there is a window of time of 59 seconds where jobs will fail. How do we deal with that?
* When Oozie comes up, the daemon will need up to - say - a minute to detect that Hadoop is down. Same thing - how do we prevent job submission in that window? If we persist the blacklist in the DB, then Oozie will remember that a cluster was down before. If the cluster has come back in the meantime, it will take up to - say - 1 minute until jobs are submitted again, but I think that is acceptable.
* I am not sure whether a configuration at start-up is a good idea. That would require an admin to create that config before he restarts Oozie, and hence this would not work, for instance, for automatic failover or restart of Oozie by a monitoring system. Therefore I would prefer to persist the last known state.


[jira] [Commented] (OOZIE-103) GH-68: Better reporting/handling of problems in Hadoop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/OOZIE-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101843#comment-13101843 ] 

Hadoop QA commented on OOZIE-103:
---------------------------------

tucu00 remarked:
Agree on the configuration startup burden.

Building on your rationale that at startup Oozie will determine whether a cluster is down, why not avoid the DB completely, since Oozie will blacklist the cluster at the first try?


[jira] [Commented] (OOZIE-103) GH-68: Better reporting/handling of problems in Hadoop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/OOZIE-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101838#comment-13101838 ] 

Hadoop QA commented on OOZIE-103:
---------------------------------

anew remarked:
In general, I believe that Hadoop should protect itself from overload. At the same time, Oozie should minimize the load it puts on Hadoop, regardless of whether Hadoop is under stress. Mainly this means Oozie should minimize the amount of polling. Reliable notification from the job tracker would help with that. Active notification for new data (from HDFS or from a metadata system such as Howl) would help, too.

When Oozie is idle, then it does not matter whether it has no knowledge of Hadoop load. Because when Oozie is idle, then there is nothing to throttle. So for the short term, I would prefer approach 2.


[jira] [Commented] (OOZIE-103) GH-68: Better reporting/handling of problems in Hadoop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/OOZIE-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101839#comment-13101839 ] 

Hadoop QA commented on OOZIE-103:
---------------------------------

tucu00 remarked:
Option #2 seems a better approach.

Mohammad, we've discussed this issue in the past and the idea was:

* find API calls to JT/NN that are lightweight and require a fixed amount of processing: we identified one JT API call and one NN API call with fixed processing on the JT and NN, namely fetching JT queue info and listing the NN root directory contents.

* find the response time of those API calls under normal load and under overload. This has to be done for the JT and NN, and it may differ for each JT/NN installation depending on the machine size and cluster size.

* determine the response time threshold for JT and NN for Oozie to do back-off.

* In HadoopAccessorService, before trying to get a FileSystem or a JobClient handle, check the response time of the above API calls first. If the values are below the threshold, retrieve the FS or JC handle; otherwise back off by throwing an exception for a transient error.

* to optimize the above logic, the HadoopAccessorService should do the response check only if the last check was done more than X secs (default 60) ago. And if at some point the JT/NN is overloaded, HadoopAccessorService should back off for the next Y secs (default 60) without even trying to hit the JT/NN.
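
The check-interval and back-off logic in the last two bullets could be sketched roughly as follows. This is only an illustration under assumptions: ResponseTimeGate, TransientBackoffException and the injected probe and clock are hypothetical names, not existing Oozie or Hadoop APIs, and the probe stands in for the lightweight JT/NN call identified above.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch, not Oozie code: gates access to a JT/NN handle on the
// response time of a lightweight probe call.
class ResponseTimeGate {
    // Stand-in for the "transient error" exception mentioned in the comment.
    static class TransientBackoffException extends RuntimeException {
        TransientBackoffException(String msg) { super(msg); }
    }

    private final LongSupplier probeMillis;  // runs the probe, returns its latency in ms
    private final LongSupplier clockMillis;  // injectable clock so the logic can be tested
    private final long thresholdMillis;
    private final long checkIntervalMillis;  // "X secs"
    private final long backoffMillis;        // "Y secs"

    private long nextCheckAt = Long.MIN_VALUE;
    private long backoffUntil = Long.MIN_VALUE;

    ResponseTimeGate(LongSupplier probeMillis, LongSupplier clockMillis,
                     long thresholdMillis, long checkIntervalMillis, long backoffMillis) {
        this.probeMillis = probeMillis;
        this.clockMillis = clockMillis;
        this.thresholdMillis = thresholdMillis;
        this.checkIntervalMillis = checkIntervalMillis;
        this.backoffMillis = backoffMillis;
    }

    // Call before handing out a FileSystem/JobClient handle.
    synchronized void checkBeforeHandle() {
        long now = clockMillis.getAsLong();
        if (now < backoffUntil) {
            // Still backing off: fail fast without touching the JT/NN.
            throw new TransientBackoffException("backing off");
        }
        if (now < nextCheckAt) {
            return; // last successful check is recent enough
        }
        long latency = probeMillis.getAsLong();
        if (latency >= thresholdMillis) {
            backoffUntil = now + backoffMillis;
            throw new TransientBackoffException("probe took " + latency + " ms");
        }
        nextCheckAt = now + checkIntervalMillis;
    }
}
```

With both X and Y at the suggested default of 60 seconds, the JT/NN sees at most one probe per minute per gate, and none at all while a back-off window is open.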


[jira] [Commented] (OOZIE-103) GH-68: Better reporting/handling of problems in Hadoop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/OOZIE-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101841#comment-13101841 ] 

Hadoop QA commented on OOZIE-103:
---------------------------------

tucu00 remarked:
Functionality-wise it makes sense as an initial approach.

Implementation-wise I would prefer not using a DB table, instead I suggest a simpler approach:

* Using a memory structure to keep tabs on the blacklisted JTs/NNs, times and mode (auto/admin).
* A configuration property can provide a list of blacklisted JTs/NNs at startup.
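
A rough sketch of such a memory structure is below. All names (InMemoryBlacklist, Mode, fromProperty) are made up for illustration, and the comma-separated property format is an assumption. Treating startup entries as admin-made, and letting only the admin remove admin entries, follows the rule stated elsewhere in this thread rather than anything in Oozie today.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory blacklist; names are illustrative, not Oozie's.
class InMemoryBlacklist {
    enum Mode { AUTO, ADMIN }

    static final class Entry {
        final Mode mode;
        final long sinceMillis;

        Entry(Mode mode, long sinceMillis) {
            this.mode = mode;
            this.sinceMillis = sinceMillis;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    // Seeds the list from a comma-separated property value (assumed format);
    // entries configured at startup are treated as admin-made.
    static InMemoryBlacklist fromProperty(String commaSeparated, long nowMillis) {
        InMemoryBlacklist list = new InMemoryBlacklist();
        if (commaSeparated != null) {
            for (String item : commaSeparated.split(",")) {
                String resource = item.trim();
                if (!resource.isEmpty()) {
                    list.add(resource, Mode.ADMIN, nowMillis);
                }
            }
        }
        return list;
    }

    // Adds an entry; an existing entry is replaced only when AUTO is upgraded to ADMIN.
    void add(String resource, Mode mode, long nowMillis) {
        entries.merge(resource, new Entry(mode, nowMillis),
            (old, neu) -> (old.mode == Mode.ADMIN || neu.mode == Mode.AUTO) ? old : neu);
    }

    // Removes an entry; an ADMIN entry can only be removed by an ADMIN requester.
    boolean remove(String resource, Mode requester) {
        Entry e = entries.get(resource);
        if (e == null) return false;
        if (e.mode == Mode.ADMIN && requester != Mode.ADMIN) return false;
        return entries.remove(resource, e);
    }

    boolean isBlacklisted(String resource) {
        return entries.containsKey(resource);
    }
}
```

The trade-off against a DB table is the one discussed in this thread: nothing survives a restart except what the configuration property supplies.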


[jira] [Commented] (OOZIE-103) GH-68: Better reporting/handling of problems in Hadoop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/OOZIE-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101840#comment-13101840 ] 

Hadoop QA commented on OOZIE-103:
---------------------------------

mislam77 remarked:
Oozie team at Yahoo plans to implement the following idea:

* Oozie will only monitor whether Hadoop is up or down. In this task, it will not consider whether Hadoop is slow.
* Oozie will provide a way of blacklisting any Hadoop JT or NN. There will be two ways to blacklist a Hadoop server:
   * By admin: the admin can blacklist any Hadoop server through the command-line interface, most often for an upgrade or maintenance.
   * By the Oozie daemon service: a dedicated thread will ping each Hadoop server (that is not blacklisted by admin) and determine whether the server needs to be blacklisted.

* How to remove items from the blacklist? There are also two ways:
   * The admin can send a WS request to take the server off the blacklist. A server blacklisted by admin should be taken off by the admin only.
   * The Oozie daemon thread will periodically ping the Hadoop servers (those not blacklisted by admin), and if it sees that a server is up, it removes it from the list.


Implementations:
===============
* A new WS endpoint for AdminServlet to allow the admin to add/remove a blacklist item.
* A new table (the name could be black_list_resources) will be needed, with the following columns:
   * Resource Name (e.g. http://localhost:9000/jobtracker)
   * Resource Type (hadoop-JT)
   * Creator (Admin/oozie)
   * Created time (UTC)
   * Last modified time (UTC)
   * More...
* A new monitor service needs to be implemented. It will do the following tasks:
   * At initialization, it will read the black-list-resources table and populate an in-memory object that represents its current contents.
   * It will provide APIs to add or remove a blacklisted item to/from the table, updating the in-memory object at the same time.
   * It will provide an API to check whether an item is blacklisted.
   * It will periodically ping all the whitelisted servers to see whether each is up. If a server is down, it will put that server on the blacklist. Question: what is an easy way of pinging Hadoop? Action item: talk to the Hadoop team to get a non-blocking API call, if any.

* HadoopAccessorService (HAS) will need to check whether the NN/JT is already in the blacklist. If it is, HAS returns an error/exception.

* Each caller of HAS.createJobClient or HAS.createFileSystem needs to handle the above error case. For the time being, it will update the job/action record with the last modified time without doing any Hadoop action. It relies on the RecoveryService to pick the record up later for a retry.

* Currently, any job submission will fail immediately if the corresponding NN is in the blacklist.

Note: This implementation is not to solve the whole problem. Some future work will be needed to improve the functionality.
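
For discussion, one pass of the monitor service described above might look like the sketch below. Everything here is hypothetical (BlacklistMonitor, Creator, the isUp predicate); in particular the predicate stands in for the Hadoop ping call, which is still an open question, and the DB table is left out so that only the in-memory rules are shown.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical monitor sketch; isUp stands in for whatever lightweight
// Hadoop call ends up being used as the ping.
class BlacklistMonitor {
    enum Creator { ADMIN, OOZIE }

    private final Map<String, Creator> blacklist = new ConcurrentHashMap<>();
    private final Predicate<String> isUp;

    BlacklistMonitor(Predicate<String> isUp) {
        this.isUp = isUp;
    }

    void adminAdd(String server) { blacklist.put(server, Creator.ADMIN); }

    void adminRemove(String server) { blacklist.remove(server); }

    boolean isBlacklisted(String server) { return blacklist.containsKey(server); }

    // One monitoring pass over all known servers.
    void runOnce(List<String> knownServers) {
        for (String server : knownServers) {
            if (blacklist.get(server) == Creator.ADMIN) {
                continue; // admin-blacklisted servers are never pinged
            }
            if (isUp.test(server)) {
                // Only entries created by the monitor itself are removed.
                blacklist.remove(server, Creator.OOZIE);
            } else {
                // Does not overwrite an admin entry added in the meantime.
                blacklist.putIfAbsent(server, Creator.OOZIE);
            }
        }
    }
}
```

In a running service, runOnce would presumably be scheduled periodically, for example with a ScheduledExecutorService, and HAS would call isBlacklisted before creating a handle.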
