Posted to common-dev@hadoop.apache.org by "Hemanth Yamijala (JIRA)" <ji...@apache.org> on 2008/10/15 10:21:44 UTC

[jira] Commented: (HADOOP-3217) [HOD] Be less aggressive when querying job status from resource manager.

    [ https://issues.apache.org/jira/browse/HADOOP-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639758#action_12639758 ] 

Hemanth Yamijala commented on HADOOP-3217:
------------------------------------------

Attached a patch for Hadoop 0.17. The following are the changes:

- For qsub failures other than a qsub options error or insufficient resources, we retry a configurable number of times (default 3), waiting a configurable interval between retries (default 10 seconds).
- For all qstat errors, we retry a configurable number of times (default 3), waiting a configurable interval between retries (default 10 seconds).
- For successful qstat queries, where we poll until the job state becomes running or completed, the polling interval is now configurable (default 30 seconds).
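
To illustrate the behaviour described in the list above, here is a minimal Python sketch. It is not the code in the attached patch. The function names (retry, poll_until_started), the (exit_code, output) return convention, and the "running" / "completed" state strings are all hypothetical, and it assumes that "retry 3 times" means up to 3 further attempts after the first failure.

```python
import time


def retry(operation, is_retryable, max_retries=3, wait_seconds=10,
          sleep=time.sleep):
    """Run operation(), retrying retryable failures after a fixed wait.

    operation is assumed to return (exit_code, output), with exit code 0
    meaning success. is_retryable(exit_code) decides whether a failure
    is worth retrying (for qsub, an options error would not be).
    """
    retries_done = 0
    while True:
        exit_code, output = operation()
        # Stop on success, or on a failure that retrying cannot fix.
        if exit_code == 0 or not is_retryable(exit_code):
            return exit_code, output
        # Stop once the retry budget is used up.
        if retries_done >= max_retries:
            return exit_code, output
        retries_done += 1
        sleep(wait_seconds)


def poll_until_started(query_state, poll_interval=30, sleep=time.sleep):
    """Poll query_state() until the job is running or completed.

    query_state is assumed to return a state string; the two terminal
    values used here are placeholders, not real Torque state codes.
    """
    while True:
        state = query_state()
        if state in ("running", "completed"):
            return state
        sleep(poll_interval)
```

A caller would wrap the actual qsub or qstat invocation in the operation callable. The sleep parameter exists only so the waits can be replaced in tests.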

Patches for the other branches are in progress.

> [HOD] Be less aggressive when querying job status from resource manager.
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-3217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3217
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>    Affects Versions: 0.16.2
>            Reporter: Hemanth Yamijala
>            Assignee: Hemanth Yamijala
>            Priority: Blocker
>             Fix For: 0.17.3, 0.18.2, 0.19.0, 0.20.0
>
>         Attachments: HADOOP-3217.patch.0.17
>
>
> After a job is submitted, HOD queries Torque periodically until it finds the job to be running or completed (due to an error). The initial query rate is once every 0.5 seconds for 20 queries, and then once every 10 seconds. This is probably a tad too aggressive, as we find that Torque sometimes returns odd errors when the cluster is under heavy load (HADOOP-3216). It may be better to query at a more relaxed rate.
