Posted to common-dev@hadoop.apache.org by "Hemanth Yamijala (JIRA)" <ji...@apache.org> on 2008/04/09 06:05:24 UTC

[jira] Created: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

[HOD] Be less aggressive when querying job status from resource manager.
------------------------------------------------------------------------

                 Key: HADOOP-3217
                 URL: https://issues.apache.org/jira/browse/HADOOP-3217
             Project: Hadoop Core
          Issue Type: Bug
          Components: contrib/hod
    Affects Versions: 0.16.2
            Reporter: Hemanth Yamijala


After a job is submitted, HOD queries Torque periodically until it finds the job to be running or completed (due to an error). The initial query rate is once every 0.5 seconds for the first 20 queries, and once every 10 seconds after that. This is probably too aggressive: we find that Torque sometimes returns odd errors under heavy load in the cluster (HADOOP-3216). It may be better to query at a more relaxed rate.
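
To put numbers on the schedule described above, here is a small Python sketch. It is not HOD code: `poll_times`, `queries_within`, and the flat 30-second comparison schedule are illustrative assumptions, used only to count how many status queries fall within a given window.

```python
from itertools import takewhile

def poll_times(fast_interval=0.5, fast_count=20, slow_interval=10.0):
    """Yield the time (seconds after submission) of each status query.

    Models the schedule in the description: `fast_count` queries spaced
    `fast_interval` apart, then one query every `slow_interval` seconds.
    """
    t = 0.0
    for _ in range(fast_count):
        t += fast_interval
        yield t
    while True:
        t += slow_interval
        yield t

def queries_within(window, **schedule):
    """Count the queries issued in the first `window` seconds."""
    times = takewhile(lambda t: t <= window, poll_times(**schedule))
    return sum(1 for _ in times)

# Schedule from the description: 20 queries in the first 10 seconds,
# then one query every 10 seconds.
current = queries_within(60)
# Hypothetical relaxed schedule: a flat 30-second interval from the start.
relaxed = queries_within(60, fast_count=0, slow_interval=30.0)
print(current, relaxed)  # prints: 25 2
```

Under these assumptions, one job issues 25 queries in its first minute with the described schedule, against 2 with a flat 30-second interval. The load on the resource manager scales with the number of jobs being submitted at the same time.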

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3217:
-------------------------------------

    Attachment: HADOOP-3217.patch.0.17

> [HOD] Be less aggressive when querying job status from resource manager.
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-3217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3217
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>    Affects Versions: 0.16.2
>            Reporter: Hemanth Yamijala
>            Assignee: Hemanth Yamijala
>            Priority: Blocker
>             Fix For: 0.17.3, 0.18.2, 0.19.0, 0.20.0
>
>         Attachments: HADOOP-3217.patch.0.17
>
>
> After a job is submitted, HOD queries torque periodically until it finds the job to be running / completed (due to error). The initial rate of query is once every 0.5 seconds for 20 times, and then once every 10 seconds. This is probably a tad too aggressive as we find that Torque sometimes returns some odd errors under heavy load in the cluster (HADOOP-3216). It may be better to query at a more relaxed rate. 

[jira] Commented: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640326#action_12640326 ] 

Hadoop QA commented on HADOOP-3217:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12392183/HADOOP-3217
  against trunk revision 705215.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3476/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3476/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3476/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3476/console

This message is automatically generated.

[jira] Commented: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640139#action_12640139 ] 

Hemanth Yamijala commented on HADOOP-3217:
------------------------------------------

> There was a minor issue: when an invalid tarball is specified with the tarball option, a ringmaster failure occurs and the return code is 5, whereas for 0.17.3 without the patch the return code is 6 in the same scenario.
I discussed this with Vinod and Suman offline. There has always been a timing issue in this part of the code, and the patch makes it more likely that 5 is returned in this case. Note that both codes indicate a ringmaster failure, so I think it is OK to ignore this minor anomaly for now.

[jira] Commented: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639844#action_12639844 ] 

Hemanth Yamijala commented on HADOOP-3217:
------------------------------------------

I found a bug while testing the previous patch. In a corner case, when we are retrying qsub operations and the user presses Ctrl-C, we don't break out of the loop immediately. This is fixed in the last patch I uploaded. The rest of the patch is the same.
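
One way to make a retry loop respond promptly to an interrupt is sketched below in Python. This is an assumption about the general technique, not a description of how the patch fixes it: `retry_until_interrupted` is a hypothetical name, and `interrupted` is assumed to be a `threading.Event` set by a SIGINT handler that is not shown.

```python
import threading

def retry_until_interrupted(operation, interrupted, max_retries=3, wait=10):
    """Retry `operation`, but stop as soon as `interrupted` is set.

    `operation` is a callable returning True on success.
    `interrupted` is a threading.Event assumed to be set when the user
    presses Ctrl-C.
    """
    for _ in range(max_retries):
        if interrupted.is_set():
            return "interrupted"
        if operation():
            return "ok"
        # Event.wait returns True as soon as the event is set, so the
        # loop does not sleep through the whole interval after Ctrl-C.
        if interrupted.wait(timeout=wait):
            return "interrupted"
    return "failed"
```

The design point is that the wait between retries doubles as the interrupt check, so the user never has to sit through a full retry interval.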

[jira] Commented: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Suman Sehgal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640137#action_12640137 ] 

Suman Sehgal commented on HADOOP-3217:
--------------------------------------

Verified the patch for 0.17.0.
This is working fine for dynamic hdfs & mapred. There was a minor issue: when an invalid tarball is specified with the tarball option, a ringmaster failure occurs and the return code is 5, whereas for 0.17.3 without the patch the return code is 6 in the same scenario.

[jira] Updated: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3217:
-------------------------------------

    Attachment: HADOOP-3217

Patch for Hadoop 0.18. I've verified this also applies to 0.19 and trunk.

[jira] Updated: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3217:
-------------------------------------

         Priority: Blocker  (was: Major)
    Fix Version/s: 0.20.0
                   0.19.0
                   0.18.2
                   0.17.3
         Assignee: Hemanth Yamijala

This is causing some issues in a production deployment. Will work on a fix asap.

[jira] Updated: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-3217:
----------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Hemanth!

[jira] Updated: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3217:
-------------------------------------

    Attachment: HADOOP-3217.patch.0.17

[jira] Updated: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3217:
-------------------------------------

    Status: Patch Available  (was: Open)

[jira] Commented: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639805#action_12639805 ] 

Hemanth Yamijala commented on HADOOP-3217:
------------------------------------------

Good idea on the loop. I've uploaded a new patch incorporating the comments.

[jira] Commented: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640138#action_12640138 ] 

Vinod K V commented on HADOOP-3217:
-----------------------------------

+1 for both the patches.

[jira] Commented: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639793#action_12639793 ] 

Vinod K V commented on HADOOP-3217:
-----------------------------------

Some comments:
 - You've accidentally removed a documentation comment line in hodlib/Hod/hadoop.py(+16)
 - Instead of the big while loop for retrying submitNodeSet, you could just call submitNodeSet repeatedly up to max-failures times and break appropriately. This greatly reduces the size of the patch.
 - Should we change the issue heading/description? We are really dealing with two issues here: 1) increasing the retry interval for qstat and 2) retrying on qsub/qstat failures.
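
The bounded-loop shape suggested in the second comment could look roughly like the Python sketch below. It is illustrative only: `submit_with_retries`, the outcome labels, and the `submit` callable standing in for submitNodeSet are hypothetical, and the real method's signature and return values are not shown in this thread.

```python
import time

# Hypothetical outcome labels; the real return values are not shown here.
OK, RETRYABLE, FATAL = "ok", "retryable", "fatal"

def submit_with_retries(submit, max_failures=3, wait=10, sleep=time.sleep):
    """Call `submit` until it succeeds, fails fatally, or the limit is hit.

    `submit` is a stand-in for submitNodeSet and returns one of the
    outcome labels above. Returns the last outcome seen, or None if
    `max_failures` is zero and `submit` was never called.
    """
    outcome = None
    for attempt in range(1, max_failures + 1):
        outcome = submit()
        if outcome != RETRYABLE:
            break            # success or a non-retryable error: stop now
        if attempt < max_failures:
            sleep(wait)      # pause only when another attempt follows
    return outcome
```

A single `for` loop with an early `break` keeps the retry policy in one place, which is the reduction in patch size the comment is after.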

[jira] Updated: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Peeyush Bishnoi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peeyush Bishnoi updated HADOOP-3217:
------------------------------------


Suman and I verified the patch with Hadoop 0.18, and it is working fine.

[jira] Commented: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645226#action_12645226 ] 

Arun C Murthy commented on HADOOP-3217:
---------------------------------------

I messed up the commit msg, so the subversion commits are available here:
trunk: http://svn.apache.org/viewcvs?view=rev&rev=705420 
branch-19: http://svn.apache.org/viewcvs?view=rev&rev=705422
branch-18: http://svn.apache.org/viewcvs?view=rev&rev=705423
branch-17: http://svn.apache.org/viewcvs?view=rev&rev=705426

[jira] Commented: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640569#action_12640569 ] 

Hudson commented on HADOOP-3217:
--------------------------------

Integrated in Hadoop-trunk #636 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/636/])
    Decrease the rate at which the hod queries the resource manager for job status. Contributed by Hemanth Yamijala.


[jira] Commented: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639758#action_12639758 ] 

Hemanth Yamijala commented on HADOOP-3217:
------------------------------------------

Attached a patch for Hadoop 0.17. The following are the changes:

- For relevant qsub failures, that is, failures other than a qsub options error or insufficient resources, we retry a configurable number of times (default 3), with a configurable wait interval between retries (default 10 seconds).
- For all qstat errors, we retry a configurable number of times (default 3), with a configurable wait interval between retries (default 10 seconds).
- For successful qstat queries, where we poll for the job state to become running or completed, the polling interval is now configurable (default 30 seconds).
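
The qstat side of the changes listed above can be sketched in Python as follows. This is not the patch itself: `wait_for_job`, `QstatError`, the state strings, and the parameter names are hypothetical, and resetting the failure count after a successful query is an assumption. Only the defaults (3 retries, 10-second retry wait, 30-second poll interval) come from the list.

```python
import time

class QstatError(Exception):
    """Hypothetical: raised when a qstat invocation itself fails."""

def wait_for_job(query_state, max_retries=3, retry_wait=10,
                 poll_interval=30, sleep=time.sleep):
    """Poll until the job is running or completed.

    `query_state` is a caller-supplied callable standing in for qstat;
    it returns a state string or raises QstatError.
    """
    failures = 0
    while True:
        try:
            state = query_state()
        except QstatError:
            failures += 1
            if failures > max_retries:
                raise              # retries exhausted: give up
            sleep(retry_wait)      # short wait before retrying a failure
            continue
        failures = 0               # assumption: success resets the count
        if state in ("running", "completed"):
            return state
        sleep(poll_interval)       # relaxed wait between successful polls
```

Keeping the retry wait separate from the poll interval lets a transient qstat error be retried fairly quickly while the steady-state polling stays relaxed.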

Patch for other branches in progress.

[jira] Updated: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-3217:
-------------------------------------

    Attachment: HADOOP-3217.patch.0.17
