Posted to common-dev@hadoop.apache.org by "Vinod K V (JIRA)" <ji...@apache.org> on 2008/09/26 11:05:44 UTC

[jira] Created: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

[mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
-------------------------------------------------------------------------------------------------------------

                 Key: HADOOP-4287
                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
             Project: Hadoop Core
          Issue Type: Bug
            Reporter: Vinod K V


This I observed while running a job that always fails because of reduce failures. Need to investigate this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-4287:
--------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Sreekanth!

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Assignee: Sreekanth Ramakrishnan
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: HADOOP-4287-1.patch, HADOOP-4287-2.patch, HADOOP-4287-3.patch
>
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Commented: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Sreekanth Ramakrishnan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637436#action_12637436 ] 

Sreekanth Ramakrishnan commented on HADOOP-4287:
------------------------------------------------

After an offline discussion with Amareshwari, Hemanth and Vinod, we agreed that it is OK to remove the assignment in terminate(), because failedTask() is called after the TIPs have been issued a kill, so the counts will be brought back to zero anyway.

So I am removing the assignment in terminate() and leaving the failedTask() method as is.
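A minimal Java sketch of the reasoning above, assuming that every attempt still running when the job is terminated is later reported exactly once through failedTask(). RunningCounter and TerminateDemo are hypothetical classes, not the real JobInProgress code; the resetOnTerminate flag only exists to contrast keeping and removing the assignment in terminate().

```java
/**
 * Hypothetical stand-in for a job's running-task counter (not JobInProgress).
 * Assumption: each attempt running at termination time is later reported
 * exactly once through failedTask().
 */
final class RunningCounter {
    private final boolean resetOnTerminate;
    private int running;

    RunningCounter(boolean resetOnTerminate) {
        this.resetOnTerminate = resetOnTerminate;
    }

    void launched() {
        running++;
    }

    /** Called once per failed or killed attempt, possibly after terminate(). */
    void failedTask() {
        running--;
    }

    /** Issues kills for the running attempts; optionally zeroes the counter too. */
    void terminate() {
        if (resetOnTerminate) {
            running = 0; // the assignment under discussion
        }
    }

    int running() {
        return running;
    }
}

final class TerminateDemo {
    /** Two running attempts, job terminated, then both kills are reported. */
    static int runScenario(boolean resetOnTerminate) {
        RunningCounter counter = new RunningCounter(resetOnTerminate);
        counter.launched();
        counter.launched();
        counter.terminate();
        counter.failedTask();
        counter.failedTask();
        return counter.running();
    }

    public static void main(String[] args) {
        System.out.println(runScenario(true));  // prints -2
        System.out.println(runScenario(false)); // prints 0
    }
}
```

With the reset, the two kill reports that arrive after termination drive the count to -2; without it, the same reports bring the count back to 0.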

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Assignee: Sreekanth Ramakrishnan
>            Priority: Blocker
>         Attachments: HADOOP-4287-1.patch
>
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Commented: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637418#action_12637418 ] 

Amareshwari Sriramadasu commented on HADOOP-4287:
-------------------------------------------------

I think this JIRA can remove the counter assignments from the terminate() method.
terminate() sets *runningMapTasks* and *runningReduceTasks* to zero, which is not necessary; they eventually become zero later.

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Assignee: Sreekanth Ramakrishnan
>            Priority: Blocker
>         Attachments: HADOOP-4287-1.patch
>
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Commented: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639079#action_12639079 ] 

Hadoop QA commented on HADOOP-4287:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12391878/HADOOP-4287-3.patch
  against trunk revision 703923.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3450/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3450/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3450/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3450/console

This message is automatically generated.

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Assignee: Sreekanth Ramakrishnan
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: HADOOP-4287-1.patch, HADOOP-4287-2.patch, HADOOP-4287-3.patch
>
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Commented: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638484#action_12638484 ] 

Vinod K V commented on HADOOP-4287:
-----------------------------------

The fix is fine. I have a few test-case related comments.
 - You only have a test for failing maps; you should also have a test for failing reduces.
 - You should check for non-negative counts not just once but throughout the lifetime of the job, so you need to check them in a loop until the job completes. Otherwise, test-case success or failure would just be a matter of timing.
 - I think you can rename the test case to TestJobInProgress, because that is part of what we are really testing here.
 - Minor : You shouldn't catch and ignore any exception thrown by RunningJob.runJob(). If something abnormal happens, let the test-case fail.
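A minimal Java sketch of the polling check suggested in the second bullet. It is written against a hypothetical JobCountsView interface rather than any real Hadoop API; in an actual test the view would be backed by the job's counters, and ScriptedJob is only an illustrative fake that replays fixed snapshots.

```java
/** Hypothetical view of a job's counters; not a real Hadoop interface. */
interface JobCountsView {
    boolean isComplete();
    int runningTasks();
    int pendingTasks();
}

final class CountChecker {
    /**
     * Polls the counters until the job completes and fails on the first
     * negative value seen, so the outcome does not depend on when a single
     * check happens to run. Returns the number of samples taken.
     */
    static int checkUntilComplete(JobCountsView job, long pollMillis) {
        int samples = 0;
        while (true) {
            // Read completion first so the counts are still sampled once after completion.
            boolean done = job.isComplete();
            int running = job.runningTasks();
            int pending = job.pendingTasks();
            samples++;
            if (running < 0 || pending < 0) {
                throw new AssertionError(
                    "negative count: running=" + running + ", pending=" + pending);
            }
            if (done) {
                return samples;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while polling", e);
            }
        }
    }
}

/** Scripted fake: each isComplete() call advances to the next {running, pending} row. */
final class ScriptedJob implements JobCountsView {
    private final int[][] snapshots;
    private int index = -1;

    ScriptedJob(int[][] snapshots) {
        this.snapshots = snapshots;
    }

    @Override
    public boolean isComplete() {
        index++;
        return index >= snapshots.length - 1;
    }

    @Override
    public int runningTasks() {
        return snapshots[index][0];
    }

    @Override
    public int pendingTasks() {
        return snapshots[index][1];
    }
}
```

Checking in a loop catches a transient negative value (for example a late failure report) that a single check at a fixed point could miss.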

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Assignee: Sreekanth Ramakrishnan
>            Priority: Blocker
>         Attachments: HADOOP-4287-1.patch, HADOOP-4287-2.patch
>
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Updated: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod K V updated HADOOP-4287:
------------------------------

    Component/s: contrib/capacity-sched

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Commented: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Sreekanth Ramakrishnan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634803#action_12634803 ] 

Sreekanth Ramakrishnan commented on HADOOP-4287:
------------------------------------------------

_CapacityTaskScheduler_ uses _JobInProgress.getPendingMaps()_ and _JobInProgress.getPendingReduces()_. The way the pending maps and reduces are calculated in JIP can return negative values.

That is, in the case of an always-failing task, the pending count is computed as:

number of tasks - number of running tasks - number of failed tasks - number of finished tasks + number of speculative tasks

When the number of failures becomes large, the method returns a negative value. I am not sure whether this is expected.
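A minimal Java sketch of how the formula above goes negative, assuming the failed count is incremented once per failed attempt while the task count is per TIP. TaskCounts and its fields are hypothetical names, not the actual JobInProgress members, and the limit of 4 attempts per TIP is only an illustrative number.

```java
/** Hypothetical snapshot of one job's reduce counters. */
final class TaskCounts {
    int numTasks;         // TIPs in the job
    int runningTasks;     // attempts currently running
    int failedTasks;      // incremented once per failed *attempt*
    int finishedTasks;    // TIPs that completed successfully
    int speculativeTasks; // extra speculative attempts currently running

    /** pending = tasks - running - failed - finished + speculative */
    int pending() {
        return numTasks - runningTasks - failedTasks - finishedTasks + speculativeTasks;
    }
}

final class NegativePendingDemo {
    public static void main(String[] args) {
        TaskCounts reduces = new TaskCounts();
        reduces.numTasks = 2;
        // Both reduce TIPs always fail; with 4 attempts each, 8 failures are
        // subtracted from a job that only has 2 reduces.
        reduces.failedTasks = 8;
        System.out.println(reduces.pending()); // prints -6
    }
}
```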



> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Updated: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Sreekanth Ramakrishnan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sreekanth Ramakrishnan updated HADOOP-4287:
-------------------------------------------

    Attachment: HADOOP-4287-2.patch

Attaching a patch incorporating the comments, along with a test case which verifies that the pending and running task counts do not go negative in the case of an always-failing task.

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Assignee: Sreekanth Ramakrishnan
>            Priority: Blocker
>         Attachments: HADOOP-4287-1.patch, HADOOP-4287-2.patch
>
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Commented: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Sreekanth Ramakrishnan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635712#action_12635712 ] 

Sreekanth Ramakrishnan commented on HADOOP-4287:
------------------------------------------------

The formula for determining the pending tasks has to be changed to the following in order to compute the correct number of pending tasks:

number of desired tasks - number of running tasks - number of failed TIPs - number of finished tasks + number of speculative tasks
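The corrected formula can be sketched in Java as below, assuming failedTips counts only TIPs that have permanently failed (exhausted their attempts) rather than individual failed attempts. PendingCounts is a hypothetical helper for illustration, not the committed patch.

```java
/** Hypothetical helper applying the corrected pending-task formula. */
final class PendingCounts {
    /**
     * desired - running - failed TIPs - finished + speculative.
     * A TIP whose attempt failed but which will be retried is not in failedTips,
     * so it is counted as pending instead of being subtracted again.
     */
    static int pending(int desiredTasks, int runningTasks, int failedTips,
                       int finishedTasks, int speculativeTasks) {
        return desiredTasks - runningTasks - failedTips - finishedTasks + speculativeTasks;
    }

    public static void main(String[] args) {
        // 2 reduces that have both exhausted their attempts: nothing left to run.
        System.out.println(pending(2, 0, 2, 0, 0));  // prints 0
        // 10 maps: 5 finished, 2 TIPs running 3 attempts (1 of them speculative).
        System.out.println(pending(10, 3, 0, 5, 1)); // prints 3
    }
}
```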

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Assignee: Sreekanth Ramakrishnan
>            Priority: Blocker
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Updated: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Sreekanth Ramakrishnan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sreekanth Ramakrishnan updated HADOOP-4287:
-------------------------------------------

    Attachment: HADOOP-4287-1.patch

Attaching a patch that fixes the negative values in the running and pending counts. I have not been able to reproduce the issue for the running counts, but I have made modifications according to Vinod's comments. The negative waiting counts have been fixed.

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Assignee: Sreekanth Ramakrishnan
>            Priority: Blocker
>         Attachments: HADOOP-4287-1.patch
>
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Commented: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634813#action_12634813 ] 

Vinod K V commented on HADOOP-4287:
-----------------------------------

This happens not just with tasks that always fail, but with any TIPs that fail a few times before finishing or failing. I guess we should also maintain mapRepeatAttempts and reduceRepeatAttempts to track repeated task execution because of failures. These would be similar to speculative{Map|Reduce}Tasks, which track speculative task execution.

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Commented: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634794#action_12634794 ] 

Vinod K V commented on HADOOP-4287:
-----------------------------------

This is seen in all places where queue scheduling info is used: jobtracker.jsp, jobqueue_details.jsp and the hadoop CLI. It seems to be a problem in our calculations in CapacityTaskScheduler itself.

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Commented: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Sreekanth Ramakrishnan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639286#action_12639286 ] 

Sreekanth Ramakrishnan commented on HADOOP-4287:
------------------------------------------------

The failed test case org.apache.hadoop.hdfs.TestLeaseRecovery.testBlockSynchronization is not related to this patch, which modifies only the mapred code in core. Hemanth has already reported this issue in the following JIRA: [HADOOP-4403|https://issues.apache.org/jira/browse/HADOOP-4403]

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Assignee: Sreekanth Ramakrishnan
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: HADOOP-4287-1.patch, HADOOP-4287-2.patch, HADOOP-4287-3.patch
>
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Updated: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Sreekanth Ramakrishnan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sreekanth Ramakrishnan updated HADOOP-4287:
-------------------------------------------

    Attachment: HADOOP-4287-3.patch

Attaching the latest patch. It contains a modified test case with the following changes:
- A new test case to check for failing reduces.
- Test case now waits till all the tasks of the job have reported failures.

Note that the test case takes nearly _125 seconds_ to complete.

bq.Minor : You shouldn't catch and ignore any exception thrown by RunningJob.runJob(). If something abnormal happens, let the test-case fail.

The exception is caught because, when a job fails, an exception is thrown back from JobClient.runJob(). We catch it so that we can continue checking the task counts.


> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Assignee: Sreekanth Ramakrishnan
>            Priority: Blocker
>         Attachments: HADOOP-4287-1.patch, HADOOP-4287-2.patch, HADOOP-4287-3.patch
>
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639859#action_12639859 ] 

Hudson commented on HADOOP-4287:
--------------------------------

Integrated in Hadoop-trunk #634 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/634/])
    

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Assignee: Sreekanth Ramakrishnan
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: HADOOP-4287-1.patch, HADOOP-4287-2.patch, HADOOP-4287-3.patch
>
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Commented: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Sreekanth Ramakrishnan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637421#action_12637421 ] 

Sreekanth Ramakrishnan commented on HADOOP-4287:
------------------------------------------------

The counter assignments in the terminate() method should remain, because the only thing we are handling is preventing the counters from being updated after the job has been terminated because the maximum number of failures was reached, as Vinod mentioned above.

If we remove the assignment, the running count would not fall back to zero. For example, if a job with 5 maps in total is terminated while 2 maps are running, then without the assignment in terminate() the JIP's running-task count would stay at 2 and the waiting count would always show 3.

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Assignee: Sreekanth Ramakrishnan
>            Priority: Blocker
>         Attachments: HADOOP-4287-1.patch
>
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Commented: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Sreekanth Ramakrishnan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635288#action_12635288 ] 

Sreekanth Ramakrishnan commented on HADOOP-4287:
------------------------------------------------

I think it would be simpler to deduce the number of pending tasks by subtracting _finishedMaps()_ from _desiredMaps()_, and similarly for reduces. I am assuming that the values of _finishedMapTasks_ and _finishedReduceTasks_ are bumped up by one whenever a speculative task has finished. This way, we don't need to keep track of the number of failed maps or reduces, since our scheduling algorithm does not assign any weight to the number of failures.

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Updated: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated HADOOP-4287:
-------------------------------------

    Priority: Blocker  (was: Major)

Marking as a blocker, as it gives users incorrect feedback in a common scenario.

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Priority: Blocker
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635949#action_12635949 ] 

Vinod K V commented on HADOOP-4287:
-----------------------------------

After going through the logs, I've realized that the negative count of running tasks is caused by TTs that report failed tasks *after* the job has been killed because the maximum number of TIP failures was hit. A killed job resets its running-task count to zero, so any task failures reported after the job is marked as killed bring the count down to a negative value. I don't know for sure, but I guess we should not update the task status of jobs that are already marked succeeded/failed/killed.
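A minimal Java sketch of the guard suggested above, under the assumption that task-status updates for a job that is already succeeded/failed/killed can simply be dropped. RunState and GuardedJob are hypothetical names, not the real JobTracker or JobInProgress classes.

```java
/** Hypothetical job run states. */
enum RunState {
    RUNNING, SUCCEEDED, FAILED, KILLED;

    boolean isComplete() {
        return this != RUNNING;
    }
}

/** Hypothetical job that ignores task reports once it is complete. */
final class GuardedJob {
    private RunState state = RunState.RUNNING;
    private int runningTasks;

    void taskLaunched() {
        runningTasks++;
    }

    /** Killing the job resets its running count, as described above. */
    void kill() {
        state = RunState.KILLED;
        runningTasks = 0;
    }

    /** Failure reported by a TaskTracker, possibly after the job was killed. */
    void taskFailed() {
        if (state.isComplete()) {
            return; // late report: do not touch the counters of a finished job
        }
        runningTasks--;
    }

    int runningTasks() {
        return runningTasks;
    }

    RunState state() {
        return state;
    }
}
```

Without the early return in taskFailed(), two failures reported after kill() would leave runningTasks at -2, which matches the symptom in the issue title.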

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Assignee: Sreekanth Ramakrishnan
>            Priority: Blocker
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638993#action_12638993 ] 

Vinod K V commented on HADOOP-4287:
-----------------------------------

bq. Test case now waits till all the tasks of the job have reported failures.
It actually waits until all the tasks that were already launched have finished. This makes sure that tasks reporting back after the job has failed don't decrement the count below zero.
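The test strategy described here (wait for every already-launched task to report, not just for the job to fail, and only then check the count) can be sketched with plain threads standing in for TaskTrackers. This is a hypothetical illustration, not the test in the attached patches: `Job`, `taskFailed`, `runTest`, and the "fail after `maxFailures` task failures" rule are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: wait for all launched tasks before asserting the count. */
public class WaitForLaunchedTasksSketch {

    /** Illustrative job that fails once maxFailures task failures are reported. */
    static class Job {
        private final int maxFailures;
        private int running;
        private int failures = 0;
        private boolean failed = false;

        Job(int launched, int maxFailures) {
            this.running = launched;
            this.maxFailures = maxFailures;
        }

        synchronized void taskFailed() {
            if (failed) {
                return; // late report for an already-failed job: ignore it
            }
            running--;
            failures++;
            if (failures >= maxFailures) {
                failed = true;
                running = 0; // a failed job resets its running count
            }
        }

        synchronized int running() {
            return running;
        }
    }

    static int runTest(int launched, int maxFailures) throws InterruptedException {
        Job job = new Job(launched, maxFailures);
        List<Thread> trackers = new ArrayList<>();
        for (int i = 0; i < launched; i++) {
            Thread t = new Thread(job::taskFailed); // each fake TT reports one failure
            trackers.add(t);
            t.start();
        }
        // Wait until every launched task has reported, not just until the job fails.
        for (Thread t : trackers) {
            t.join();
        }
        return job.running();
    }

    public static void main(String[] args) throws InterruptedException {
        int running = runTest(4, 2);
        if (running < 0) {
            throw new AssertionError("running count went negative: " + running);
        }
        System.out.println("running after all reports: " + running); // prints 0
    }
}
```

Because the fake trackers are in-process threads, a check like this runs in milliseconds, which is the kind of saving the review below alludes to when it mentions fake objects for the JT and TT.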

I reviewed the patch. The test case now looks fine. Its running time could have been reduced drastically had we used fake objects for the JT, TT, etc. instead, but I think it's fine as it is for now.

+1.

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Assignee: Sreekanth Ramakrishnan
>            Priority: Blocker
>         Attachments: HADOOP-4287-1.patch, HADOOP-4287-2.patch, HADOOP-4287-3.patch
>
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Updated: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Sreekanth Ramakrishnan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sreekanth Ramakrishnan updated HADOOP-4287:
-------------------------------------------

    Fix Version/s: 0.19.0
           Status: Patch Available  (was: Open)

Attaching the local test-patch output:

{noformat}
     [exec]
     [exec] +1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec]

{noformat}

> [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4287
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Assignee: Sreekanth Ramakrishnan
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: HADOOP-4287-1.patch, HADOOP-4287-2.patch, HADOOP-4287-3.patch
>
>
> This I observed while running a job that always fails because of reduce failures. Need to investigate this.



[jira] Assigned: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Sreekanth Ramakrishnan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sreekanth Ramakrishnan reassigned HADOOP-4287:
----------------------------------------------

    Assignee: Sreekanth Ramakrishnan




[jira] Commented: (HADOOP-4287) [mapred] jobqueue_details.jsp shows negative count of running and waiting reduces with CapacityTaskScheduler.

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635704#action_12635704 ] 

Vinod K V commented on HADOOP-4287:
-----------------------------------

Unlinked HADOOP-4289, which is related only to HADOOP-4288. Sorry for any confusion I may have caused.

However, I still see a negative count of running reduces, although I have repeatedly seen only -1. This should be investigated as part of this JIRA.

