Posted to common-dev@hadoop.apache.org by "Hemanth Yamijala (JIRA)" <ji...@apache.org> on 2009/06/17 09:46:07 UTC

[jira] Created: (HADOOP-6064) Rewrite TestQueueCapacities to make it simpler and avoid timeout errors

Rewrite TestQueueCapacities to make it simpler and avoid timeout errors
-----------------------------------------------------------------------

                 Key: HADOOP-6064
                 URL: https://issues.apache.org/jira/browse/HADOOP-6064
             Project: Hadoop Core
          Issue Type: Bug
          Components: contrib/capacity-sched, test
    Affects Versions: 0.20.0
            Reporter: Hemanth Yamijala


TestQueueCapacities has failed periodically, and on a couple of occasions fixes have addressed the problem only partially, the most recent being HADOOP-5869. I found another failure while running tests locally against a different patch, with a symptom different from the ones we've seen before. The core problem is that the test is too complex and relies on too many things working correctly to be useful. It would make sense to revisit the purpose of the test and see if a simpler model can serve it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6064) Rewrite TestQueueCapacities to make it simpler and avoid timeout errors

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720536#action_12720536 ] 

Hemanth Yamijala commented on HADOOP-6064:
------------------------------------------

Just for information, the failure this time around happened as follows:

- The test timed out in multipleQsWithOneQBeyondCapacity, while waiting for 5 map tasks to complete.
- The check for completion assumes that all map tasks in ControlledMapReduceJob run successfully. Note that the check is made on jip.finishedMaps(), which does not count failed tasks.
- However, one of the map tasks failed this time, with the following stack trace:
{noformat}
    [junit] 09/06/17 12:49:20 INFO mapred.TaskInProgress: Error from attempt_200906171248_0001_m_000003_0: java.io.FileNotFoundException: File signalFileDir-7646601804912829477/MAPS_0 does not exist.
    [junit]   at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:383)
    [junit]   at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:301)
    [junit]   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:746)
    [junit]   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:771)
    [junit]   at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:465)
    [junit]   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:746)
    [junit]   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:806)
    [junit]   at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:936)
    [junit]   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:891)
    [junit]   at org.apache.hadoop.mapred.ControlledMapReduceJob.listSignalFiles(ControlledMapReduceJob.java:278)
    [junit]   at org.apache.hadoop.mapred.ControlledMapReduceJob.map(ControlledMapReduceJob.java:318)
    [junit]   at org.apache.hadoop.mapred.ControlledMapReduceJob.map(ControlledMapReduceJob.java:60)
    [junit]   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    [junit]   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
    [junit]   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
    [junit]   at org.apache.hadoop.mapred.Child.main(Child.java:159)
{noformat}
- This, in turn, seems to relate to the problem described in HADOOP-4167. The mappers all list contents of a filesystem looking for 'signal' files. These signal files are renamed and therefore go missing asynchronously.
- The test therefore keeps waiting for a finished-map count that is never reached, and eventually times out.
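
One way a simpler test could avoid the open-ended wait described above is to poll with a deadline and to treat a failed task as a reason to stop immediately. The following is only a minimal sketch under stated assumptions: the class BoundedWait, the method waitForTasks and the Outcome enum are hypothetical names, and the two suppliers merely stand in for a finished-task counter (such as jip.finishedMaps()) and for some failed-task counter, which is assumed to be available to the test. It does not address the signal-file listing race itself.

```java
import java.util.function.IntSupplier;

public class BoundedWait {
    /** Possible results of a bounded wait (hypothetical names). */
    public enum Outcome { COMPLETED, TASK_FAILED, TIMED_OUT }

    /**
     * Polls two counters until the expected number of tasks has finished,
     * any task has failed, or the timeout elapses, whichever comes first.
     *
     * @param finished      supplies the number of successfully finished tasks
     * @param failed        supplies the number of failed tasks (assumed to exist)
     * @param expected      number of finished tasks to wait for
     * @param timeoutMillis upper bound on the total wait
     * @param pollMillis    sleep between two polls
     */
    public static Outcome waitForTasks(IntSupplier finished, IntSupplier failed,
                                       int expected, long timeoutMillis,
                                       long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            // A failure is reported first, so the caller can fail fast
            // with a clear reason instead of running into the timeout.
            if (failed.getAsInt() > 0) {
                return Outcome.TASK_FAILED;
            }
            if (finished.getAsInt() >= expected) {
                return Outcome.COMPLETED;
            }
            if (System.nanoTime() - deadline >= 0) {
                return Outcome.TIMED_OUT;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }
}
```

A test built on such a helper could assert that the outcome is COMPLETED and report TASK_FAILED or TIMED_OUT with a specific message, so that a failed map attempt shows up as a distinct failure rather than as a generic timeout.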
