Posted to common-dev@hadoop.apache.org by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org> on 2008/05/13 10:58:55 UTC

[jira] Updated: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HADOOP-3376:
--------------------------------------------

    Attachment: HADOOP-3376

Attaching a patch.

 - This implements changes required in HOD to deal better with clusters exceeding resource manager or scheduler limits.
 - After this change, every time HOD detects that the cluster is still queued, it calls the isJobFeasible method of the resource manager interface (src/contrib/hod/hodlib/Hod/nodePool.py) to check whether the job can run at all.
 - The Torque implementation of isJobFeasible (src/contrib/hod/hodlib/NodePools/torque.py) uses the comment field in qstat output. When this comment field becomes equal to hodlib.Common.util.TORQUE_USER_LIMITS_COMMENT_FIELD, HOD deallocates the cluster with the error message "Request exceeded maximum user limits. Cluster will not be allocated." As it stands, this is still only part of the solution: the Torque comment field has to be set to the above string either by a scheduler or by an external tool.
 - Also introduces a HOD config parameter, check-job-feasibility, that enables the above checking. It defaults to false and specifies whether or not to check job feasibility against resource manager and/or scheduler limits.
 - This patch also replaces a few occurrences of the string 'job' with the string 'cluster'.
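
The feasibility check described above can be sketched as follows. This is a minimal illustration, not the patch itself: the helper names, the assumed value of TORQUE_USER_LIMITS_COMMENT_FIELD, and the qstat output format are all assumptions for the sake of the example.

```python
# Illustrative sketch of the Torque-side feasibility check. The real method
# lives in src/contrib/hod/hodlib/NodePools/torque.py; names and the qstat
# parsing below are assumptions, not the actual HOD code.

# Assumed value; the real constant is hodlib.Common.util.TORQUE_USER_LIMITS_COMMENT_FIELD.
TORQUE_USER_LIMITS_COMMENT_FIELD = "User limits exceeded. Job held."

def parse_qstat_comment(qstat_output):
    """Extract the 'comment' attribute from `qstat -f`-style output
    (assumed 'comment = ...' line format)."""
    for line in qstat_output.splitlines():
        line = line.strip()
        if line.startswith("comment ="):
            return line.split("=", 1)[1].strip()
    return None

def is_job_feasible(qstat_output):
    """Return False once the scheduler (or an external tool) has set the
    comment field to the agreed limits-exceeded marker string."""
    comment = parse_qstat_comment(qstat_output)
    return comment != TORQUE_USER_LIMITS_COMMENT_FIELD
```

When is_job_feasible returns False while the cluster is still queued, HOD can deallocate it and report the limits error to the user instead of waiting indefinitely.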

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: HADOOP-3376
>
>
> Currently, if we set up resource manager/scheduler limits on the jobs submitted, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) stay blocked till resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This also means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler: a max node limit in Torque (per job), a maxproc limit in Maui (per user/class), a maxjob limit in Maui (per user/class), etc. But there is one problem with the current setup. For example, if we set up a maxproc limit in Maui to cap the aggregate number of nodes used by any user over all jobs, 1) a job gets queued indefinitely if it exceeds the max limit, and 2) it gets blocked if it asks for fewer nodes than the max limit but some of the resources are already used by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.