Posted to common-dev@hadoop.apache.org by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org> on 2008/05/13 07:04:55 UTC

[jira] Created: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

[HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
--------------------------------------------------------------------------------------------------------

                 Key: HADOOP-3376
                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
             Project: Hadoop Core
          Issue Type: Bug
          Components: contrib/hod
            Reporter: Vinod Kumar Vavilapalli
            Assignee: Hemanth Yamijala


Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.

(Internal) Use Case:
     If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.
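
As a concrete illustration of the two failure modes (a worked example assuming a per-user MAXPROC limit of 10; the numbers are invented): a request for 11 nodes can never run, since 11 alone exceeds the limit, while a user already holding 5 nodes who requests 6 more is within the per-request limit but stays queued until the first 5 are freed, because 5 + 6 > 10. A minimal Python sketch of that distinction:

    # Sketch of the feasibility distinction described above; names are illustrative.
    def classify_request(requested, used, max_limit):
        if requested > max_limit:
            # Can never be satisfied, no matter what gets freed.
            return "infeasible: exceeds maximum user limit"
        if used + requested > max_limit:
            # Feasible later, once older clusters release nodes.
            return "queued: will run after other clusters are deallocated"
        return "feasible now"

    print(classify_request(11, 0, 10))  # infeasible: exceeds maximum user limit
    print(classify_request(6, 5, 10))   # queued: will run after other clusters are deallocated
    print(classify_request(4, 5, 10))   # feasible now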

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HADOOP-3376:
--------------------------------------------

    Status: Open  (was: Patch Available)

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1, HADOOP-3376.2
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli reassigned HADOOP-3376:
-----------------------------------------------

    Assignee: Vinod Kumar Vavilapalli  (was: Hemanth Yamijala)

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HADOOP-3376:
--------------------------------------------

    Status: Patch Available  (was: Open)

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1, HADOOP-3376.2
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3376:
-------------------------------------

    Fix Version/s: 0.18.0

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>             Fix For: 0.18.0
>
>         Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1, HADOOP-3376.2
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596391#action_12596391 ] 

Hadoop QA commented on HADOOP-3376:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12381947/checklimits.sh
  against trunk revision 655674.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    -1 patch.  The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2457/console

This message is automatically generated.

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: checklimits.sh, HADOOP-3376
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Karam Singh (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601616#action_12601616 ] 

Karam Singh commented on HADOOP-3376:
-------------------------------------

To check this issue, after setting a MAXPROC limit (say 10) in maui.cfg, I did the following:

Added the line job-feasibility-attr = User-limits exceeded. Requested:([0-9]*) Used:([0-9]*) MaxLimit:([0-9]*) under the hod section in hodrc. HOD requires this exact string in hodrc (see the sketch below).

1. Tried hod allocate with a number of nodes greater than the MAXPROC limit (say 11). Verified that hod exits with exit code 4 and a proper error message: CRITICAL/50 hadoop:216 - Requested number of nodes exceeded maximum user limits. Current Usage:0, Requested:11, Maximum Limit:10 This cluster cannot be allocated now.

2. Tried a combination: first ran hod allocate with 5 nodes, then again with 6 nodes. Verified that the second job got queued with the message:
CRITICAL/50 hadoop:216 - Requested number of nodes exceeded maximum user limits. Current Usage:5, Requested:6, Maximum Limit:10 This cluster allocation will succeed only after other clusters are deallocated.
Also verified that after the first cluster was deallocated, the second cluster got allocated.

Repeated this with more hod allocate combinations.
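
For reference, a minimal hodrc fragment for this setup might look like the following. The section name and attribute value come straight from this comment; the file layout around them is an assumption:

    [hod]
    job-feasibility-attr = User-limits exceeded. Requested:([0-9]*) Used:([0-9]*) MaxLimit:([0-9]*)

The allocations above would then be attempted with the usual client invocation, e.g. hod allocate -d ~/hodtest -n 11, where the cluster directory ~/hodtest is hypothetical.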


> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1, HADOOP-3376.2
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HADOOP-3376:
--------------------------------------------

    Status: Patch Available  (was: Open)

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HADOOP-3376:
--------------------------------------------

    Attachment: checklimits.sh

Attaching checklimits.sh. This is the utility that updates the Torque comment field. Uploading for review.

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: checklimits.sh, HADOOP-3376
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597978#action_12597978 ] 

Hadoop QA commented on HADOOP-3376:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12382301/HADOOP-3376.1
  against trunk revision 656939.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2498/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2498/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2498/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2498/console

This message is automatically generated.

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601476#action_12601476 ] 

Hadoop QA commented on HADOOP-3376:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12382640/HADOOP-3376.2
  against trunk revision 661918.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2529/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2529/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2529/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2529/console

This message is automatically generated.

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1, HADOOP-3376.2
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HADOOP-3376:
--------------------------------------------

    Attachment: HADOOP-3376.2

Incorporated the above changes. Reattaching.

This cannot have really useful test cases - it is a matter of system testing and is too tightly integrated with Torque/Maui.

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1, HADOOP-3376.2
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-3376:
--------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Vinod!

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>             Fix For: 0.18.0
>
>         Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1, HADOOP-3376.2
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HADOOP-3376:
--------------------------------------------

    Status: Open  (was: Patch Available)

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HADOOP-3376:
--------------------------------------------

    Status: Patch Available  (was: Open)

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1, HADOOP-3376.2
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-3376:
------------------------------------

    Release Note: Modified HOD client to look for specific messages related to resource limit overruns and take appropriate actions - such as either failing to allocate the cluster, or issuing a warning to the user. A tool is provided, specific to Maui and Torque, that will set these specific messages.  (was: HOD client was modified to look for specific messages related to resource limit overruns and take appropriate actions - such as either failing to allocate the cluster, or issuing a warning to the user. A tool is provided, specific to Maui and Torque, that will set these specific messages.)

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>             Fix For: 0.18.0
>
>         Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1, HADOOP-3376.2
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3376:
-------------------------------------

    Release Note: HOD client was modified to look for specific messages related to resource limit overruns and take appropriate actions - such as either failing to allocate the cluster, or issuing a warning to the user. A tool is provided, specific to Maui and Torque, that will set these specific messages.
    Hadoop Flags: [Reviewed]

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>             Fix For: 0.18.0
>
>         Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1, HADOOP-3376.2
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HADOOP-3376:
--------------------------------------------

    Attachment: HADOOP-3376

Attaching a patch.

 - This implements the changes required in HOD to deal better with clusters exceeding resource manager or scheduler limits.
 - After this, every time HOD detects that the cluster is still queued, HOD calls the isJobFeasible method of the resource manager interface (src/contrib/hod/hodlib/Hod/nodePool.py) to check whether the job can run at all.
 - The Torque implementation of isJobFeasible (src/contrib/hod/hodlib/NodePools/torque.py) uses the comment field in qstat output. When this comment field becomes equal to hodlib.Common.util.TORQUE_USER_LIMITS_COMMENT_FIELD, HOD deallocates the cluster with the error message "Request exceeded maximum user limits. Cluster will not be allocated." As it is, this is still only part of the solution - the Torque comment field has to be set to the above string either by a scheduler or by an external tool (see the sketch below).
 - Also introducing a HOD config parameter that enables the above checking: check-job-feasibility. It defaults to false and specifies whether or not to check job feasibility against resource manager and/or scheduler limits.
 - This patch also replaces a few 'job' strings with the string 'cluster'.
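
For intuition, here is a simplified sketch of how a Torque-side feasibility check of that shape could read the comment field. This is an illustration, not the patch code; it assumes qstat -f output contains a 'comment = ...' attribute line, and the sentinel value shown is an assumption:

    # Illustrative sketch only -- not the actual torque.py code from the patch.
    import subprocess

    TORQUE_USER_LIMITS_COMMENT_FIELD = "User-limits exceeded"  # assumed value

    def is_job_feasible(job_id):
        # Return False once a scheduler or external tool has flagged the job.
        out = subprocess.run(["qstat", "-f", job_id],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            line = line.strip()
            if line.startswith("comment ="):
                comment = line.split("=", 1)[1].strip()
                return not comment.startswith(TORQUE_USER_LIMITS_COMMENT_FIELD)
        return True  # no comment set: assume the job is feasible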

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: HADOOP-3376
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HADOOP-3376:
--------------------------------------------

    Status: Open  (was: Patch Available)

Cancelling the patch to incorporate Hemanth's comments. The following things need to be done:

 - Each time the cluster is checked for feasibility, two qstats are run - reduce this so that the required information is obtained in a single trip to the resource manager.
 - There are two ways in which user limits can be crossed - requesting resources beyond the max limit, and cumulative usage crossing the max limit. These two scenarios should be dealt with separately - in the first case the cluster should be deallocated, while in the second the cluster should not be deallocated, but users should be appropriately informed.
 - Do away with the configuration variable check-job-feasibility. Instead have the variable job-feasibility-comment, which will 1) indicate whether the user-limits functionality is to be used and 2) give the comment field that will be set by checklimits.sh - currently checkjob (used by checklimits.sh) prints "job [0-9]* violates active HARD MAXPROC limit of [0-9]* for user [a-z]*  (R: [0-9]*, U: [0-9]*])"
 - This patch changes the behavior of getJobState. It should only return True or False in all code paths.
 - Modify the error message TORQUE_USER_LIMITS_EXCEEDED_MSG so that it also prints the max limits, so that the user can modify his request.
 - checklimits.sh: 1) submit it too within the patch, as part of src/contrib/hod/support, and 2) checklimits.sh should do only one iteration over all incomplete jobs, modifying the comment field according to whether each job crosses the user limits; it should be left to some outside mechanism (like cron) to run checklimits.sh repeatedly at regular intervals (see the sketch after this list).
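
A pure-logic sketch of that single-pass idea (hypothetical code standing in for the real checklimits.sh, with the qstat/checkjob data collection elided): given one snapshot of incomplete jobs, decide in a single iteration which queued jobs should have the limits comment set.

    # Hypothetical single-pass limit check; fetching job data is elided.
    def jobs_to_flag(jobs, max_limit):
        # jobs: list of dicts with 'id', 'user', 'procs', 'state' ('R' or 'Q').
        # Returns ids of queued jobs that currently violate the per-user limit.
        running = {}
        for j in jobs:
            if j["state"] == "R":
                running[j["user"]] = running.get(j["user"], 0) + j["procs"]
        flagged = []
        for j in jobs:
            if j["state"] == "Q":
                used = running.get(j["user"], 0)
                if j["procs"] > max_limit or used + j["procs"] > max_limit:
                    flagged.append(j["id"])
        return flagged

    example = [
        {"id": "1", "user": "a", "procs": 5, "state": "R"},
        {"id": "2", "user": "a", "procs": 6, "state": "Q"},
        {"id": "3", "user": "a", "procs": 11, "state": "Q"},
    ]
    print(jobs_to_flag(example, 10))  # ['2', '3']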

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: checklimits.sh, HADOOP-3376
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598949#action_12598949 ] 

Hemanth Yamijala commented on HADOOP-3376:
------------------------------------------

Some comments:

- I think job-feasibility-attr should be optional. Code that depends on this attribute needs to check for it, or be changed to handle the case where it is not defined (see the sketch at the end of this comment):
In torque.py's isJobFeasible, if job-feasibility-attr is not defined, we would get an exception, and the info message printed is not going to be very descriptive. I think it would just print 'job-feasibility-attr' and no information about what the error is.
__check_job_state: doesn't handle the case where job-feasibility-attr is not defined.

- The messages now read as follows:
(In the case of requested resources > max resources):
Request exceeded maximum user limits. CurentUsage:%s, Requested:%s, MaxLimit:%s
(In the other case):
Request exceeded maximum user limits. CurentUsage:3, Requested:3, MaxLimit:3 This cluster will remain queued till old clusters free resources.
The messages still do not clarify which resource is being exceeded.

I suggest the following:
Requested number of nodes exceeded maximum user limits. Current Usage:%s, Requested:%s, Maximum Limit:%s. This cluster cannot be allocated now.

and

Requested number of nodes exceeded maximum user limits. Current Usage:%s, Requested:%s, Maximum Limit:%s. This cluster allocation will succeed only after other clusters are deallocated.
(Note: I also corrected some typos in the messages.)

- The executable bit is not being turned on for support/checklimits.sh. This is mostly due to a bug in the ant script: for code under the contrib projects, only files under the bin/ folder are made executable when packaged. As this is not a bug in HOD, I think we should leave this as it is, but update the usage documentation to say the script must be made executable.

- In checklimits.sh, the sleep at the end is not required.

- In the case where current usage plus requested usage exceeds the limits, the critical message is printed every 10 seconds. It should be printed only once.

Other than these, I tested checklimits and hod for both scenarios, and they work fine.
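
A guard of the kind suggested in the first point might look like the following. This is purely illustrative; the config dictionary and the way the Torque comment is obtained are assumptions, not HOD's actual internals:

    import re

    # Illustrative guard for an optional attribute; the names are assumed.
    def is_job_feasible(hod_section, job_comment):
        # hod_section: dict of the hod config section; job_comment: Torque
        # comment string for the job, or None if no comment is set.
        attr = hod_section.get("job-feasibility-attr")
        if not attr:
            # Attribute not configured: skip the feasibility check entirely,
            # rather than raising an exception with an unhelpful message.
            return True
        return job_comment is None or re.match(attr, job_comment) is None

    print(is_job_feasible({}, None))  # True: checking disabled
    print(is_job_feasible(
        {"job-feasibility-attr": "User-limits exceeded.*"},
        "User-limits exceeded. Requested:6 Used:5 MaxLimit:10"))  # False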

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HADOOP-3376:
--------------------------------------------

    Attachment: HADOOP-3376.1

Made the suggested changes. Also updated the documentation.
  - When asking for resources > max limits, it prints "Request exceeded maximum user limits. CurentUsage:%s, Requested:%s, MaxLimit:%s" at critical log level and deletes the cluster.
  - When the request is within limits but cumulative usage crosses the limits, it prints "Request exceeded maximum user limits. CurentUsage:%s, Requested:%s, MaxLimit:%s. This cluster will remain queued till old clusters free resources" at info level and the cluster stays in the queued state.
  - Replaced the check-job-feasibility config parameter with job-feasibility-attr: it specifies whether to check job feasibility - resource manager and/or scheduler limits - and also gives the attribute value. It defaults to TORQUE_USER_LIMITS_COMMENT_FIELD, which is "User-limits exceeded. Requested:([0-9]*) Used:([0-9]*) MaxLimit:([0-9]*)" (see the sketch below).
  - Made the necessary changes in checklimits.sh, which now lives in the hod/support dir.
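
For illustration, this is how the default attribute value behaves when used as a regular expression - a sketch using only the pattern quoted above, with the surrounding HOD plumbing omitted:

    import re

    # Default pattern quoted in the comment above.
    JOB_FEASIBILITY_ATTR = r"User-limits exceeded. Requested:([0-9]*) Used:([0-9]*) MaxLimit:([0-9]*)"

    comment = "User-limits exceeded. Requested:6 Used:5 MaxLimit:10"
    m = re.match(JOB_FEASIBILITY_ATTR, comment)
    if m:
        requested, used, max_limit = (int(g) for g in m.groups())
        if requested > max_limit:
            print("critical: request can never be satisfied; delete the cluster")
        else:
            print("info: cluster stays queued till old clusters free resources")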

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HADOOP-3376:
--------------------------------------------

    Status: Patch Available  (was: Open)

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: HADOOP-3376
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, any HOD cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely, or 2) be blocked until resources occupied by old clusters are freed. HOD should detect these scenarios and deal with them intelligently, instead of just waiting for a long time or forever. This means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, we could set up various types of limits in either the resource manager or a scheduler - a max node limit in Torque (per-job), a MAXPROC limit in Maui (per user/class), a MAXJOB limit in Maui (per user/class), etc. But there is one problem with the current setup - for example, if we set up a MAXPROC limit in Maui to cap the aggregate number of nodes used by any user across all jobs, 1) jobs get queued indefinitely if they exceed the max limit, and 2) jobs get blocked if they ask for fewer nodes than the max limit but some of the resources are already in use by jobs from the same user. This issue addresses how to deal with scenarios like these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.