Posted to common-issues@hadoop.apache.org by "Ivan Mitic (JIRA)" <ji...@apache.org> on 2012/08/27 19:40:07 UTC

[jira] [Created] (HADOOP-8732) Address intermittent test failures on Windows

Ivan Mitic created HADOOP-8732:
----------------------------------

             Summary: Address intermittent test failures on Windows
                 Key: HADOOP-8732
                 URL: https://issues.apache.org/jira/browse/HADOOP-8732
             Project: Hadoop Common
          Issue Type: Bug
          Components: util
            Reporter: Ivan Mitic
            Assignee: Ivan Mitic


A few tests fail intermittently on Windows with a timeout error. The timeout means that the test was killed from the outside; it would otherwise have continued to run.

The following are examples of such tests (there might be others):
 - TestJobInProgress (the issue reproduces fairly consistently in Eclipse with this one)
 - TestControlledMapReduceJob
 - TestServiceLevelAuthorization


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HADOOP-8732) Address intermittent test failures on Windows

Posted by "Ivan Mitic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442548#comment-13442548 ] 

Ivan Mitic commented on HADOOP-8732:
------------------------------------

Root cause:
When a child process is created with the CreateProcess function in a multithreaded environment, the child may inherit handles that were not intended for it (there is a race condition here). In our case, Hadoop is consistently calling CreateProcess on winutils.exe, and as part of the preparation for CreateProcess, read/write handles are created on the pipes used to redirect stdout/stderr. In a scenario where we create, for example, one ShortLived and one LongLived child process, the LongLived process can end up inheriting handles of the ShortLived process. Because the LongLived process then holds the write end of those pipes open, ReadFile on the ShortLived process’ stdout/stderr does not return until the LongLived process terminates, which is the behavior we observed.

Before Windows Vista, the only way to mitigate the problem was to serialize all calls to CreateProcess. On Vista and later, the list of handles that a child should inherit can be specified explicitly via PROC_THREAD_ATTRIBUTE_HANDLE_LIST in STARTUPINFOEX.

This KB article nicely explains the issue:
http://support.microsoft.com/kb/315939 - PRB: Child Inherits Unintended Handles During CreateProcess Call

I looked over the OpenJDK implementation of ProcessBuilder#start(), and this is exactly what is going on. Since we can reproduce the problem on the Oracle JDK, it should be safe to assume that it has the same issue.

The suggested workaround is to serialize all calls to CreateProcess. In the Java world, this boils down to synchronizing calls to ProcessBuilder#start(), since on Windows this call ultimately delegates to CreateProcess. I tested this and it worked fine.
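
As a rough illustration of the workaround described above, here is a minimal sketch of serializing process creation behind a single JVM-wide lock. This is not necessarily the code in the patch; the class name SerializedProcessLauncher, the LAUNCH_LOCK field and the os.name check are assumptions made for the example.

```java
import java.io.IOException;

/**
 * Hypothetical helper (illustrative names only): funnels every child-process
 * launch through one lock so that no two launches overlap on Windows.
 */
final class SerializedProcessLauncher {
    // One lock for the whole JVM; all launches must go through start() below.
    private static final Object LAUNCH_LOCK = new Object();

    // Assumption: detect Windows via the standard os.name system property.
    private static final boolean WINDOWS =
        System.getProperty("os.name", "").startsWith("Windows");

    private SerializedProcessLauncher() {}

    /** Starts the process, serializing the launch itself when on Windows. */
    static Process start(ProcessBuilder builder) throws IOException {
        if (WINDOWS) {
            synchronized (LAUNCH_LOCK) {
                // Only the launch is serialized; the child runs concurrently
                // with other children once start() has returned.
                return builder.start();
            }
        }
        return builder.start();
    }
}
```

Callers would use SerializedProcessLauncher.start(builder) in place of builder.start(). The lock only helps if every process launch in the JVM goes through it; any code path that calls ProcessBuilder#start() directly can still race with the others.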

                

[jira] [Commented] (HADOOP-8732) Address intermittent test failures on Windows

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444309#comment-13444309 ] 

Bikas Saha commented on HADOOP-8732:
------------------------------------

+1. After this fix I don't see the intermittent failures across multiple runs.
                

[jira] [Resolved] (HADOOP-8732) Address intermittent test failures on Windows

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy resolved HADOOP-8732.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 1-win

I just committed this. Thanks Ivan for the fix and Bikas for the review.
                

[jira] [Updated] (HADOOP-8732) Address intermittent test failures on Windows

Posted by "Ivan Mitic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Mitic updated HADOOP-8732:
-------------------------------

    Attachment: HADOOP-8732-IntermittentFailures.patch

Attaching the patch with the fix.
                