Posted to common-dev@hadoop.apache.org by "Hairong Kuang (JIRA)" <ji...@apache.org> on 2008/11/18 20:27:44 UTC

[jira] Created: (HADOOP-4679) Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX

Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX
------------------------------------------------------------------------------------------

                 Key: HADOOP-4679
                 URL: https://issues.apache.org/jira/browse/HADOOP-4679
             Project: Hadoop Core
          Issue Type: Bug
            Reporter: Hairong Kuang


When a data receiver thread sees a disk error, it immediately calls shutdown() to shut down the DataNode. But shutdown() does not return until all data receiver threads have exited, which can never happen because the calling thread is itself one of them. Therefore the DataNode gets stuck in a deadlock/livelock state, emitting tons of log messages: Waiting for threadgroup to exit, active threads is XX.
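
To make the failure mode concrete, here is a minimal Java sketch of the safer pattern, not the actual DataNode code. The names ShutdownSketch, receivers, onDiskError and runOnce are hypothetical; only the shouldRun flag and the idea of shutting down from a single service thread come from this issue. The receiver only clears a flag, and the thread that waits for the group to drain is not a member of that group.

```java
public class ShutdownSketch {
    // Hypothetical stand-ins for the DataNode's run flag and receiver thread group.
    static volatile boolean shouldRun = true;
    static final ThreadGroup receivers = new ThreadGroup("dataReceivers");

    // Sleeps briefly; turns an unexpected interrupt into an unchecked exception.
    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    // Waits until every thread in the group has exited. If a thread that belongs
    // to the group called this, activeCount() could never reach 0, which is the
    // hang described in this report.
    static void shutdown() {
        shouldRun = false;
        receivers.interrupt();
        while (receivers.activeCount() > 0) {
            pause(10);
        }
    }

    // A receiver that hits a disk error only clears the flag and returns.
    static void onDiskError() {
        shouldRun = false;
    }

    // Starts one receiver that reports a disk error, then lets the calling
    // (service) thread perform the shutdown. Returns true once the group is empty.
    static boolean runOnce() {
        shouldRun = true;
        new Thread(receivers, ShutdownSketch::onDiskError, "receiver-1").start();
        while (shouldRun) {
            pause(10);      // stand-in for the service loop's normal work
        }
        shutdown();         // safe: the calling thread is not in the receiver group
        return receivers.activeCount() == 0;
    }

    public static void main(String[] args) {
        System.out.println("clean shutdown: " + runOnce());
    }
}
```

Calling shutdown() from inside onDiskError() instead would reproduce the hang, since the receiver would then be waiting for its own exit.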

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (HADOOP-4679) Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang resolved HADOOP-4679.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.18.3
     Release Note: 
1. Only the datanode's offerService thread shuts down the datanode, to avoid deadlock;
2. The datanode checks the disk when creating a block file fails.
     Hadoop Flags: [Reviewed]

I just committed this.

> Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4679
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4679
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.18.3
>
>         Attachments: diskError.patch, diskError1.patch, diskError2.patch, diskError3-br18.patch, diskError3.patch
>
>
> When a data receiver thread sees a disk error, it immediately calls shutdown() to shut down the DataNode. But shutdown() does not return until all data receiver threads have exited, which will never happen. Therefore the DataNode gets stuck in a deadlock/livelock state, emitting tons of log messages: Waiting for threadgroup to exit, active threads is XX.


[jira] Commented: (HADOOP-4679) Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650684#action_12650684 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-4679:
------------------------------------------------

Could you also include the cause e in the new DiskOutOfSpaceException in checkDiskError(...)?
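
For illustration, one way to keep the original exception reachable is Throwable.initCause(). The sketch below uses a hypothetical stand-in class that is assumed to offer only a message constructor; the real DiskOutOfSpaceException and checkDiskError(...) may look different.

```java
import java.io.IOException;

public class CauseSketch {
    // Hypothetical stand-in for Hadoop's DiskOutOfSpaceException, assumed here
    // to provide only a message constructor.
    static class DiskOutOfSpaceException extends IOException {
        DiskOutOfSpaceException(String msg) {
            super(msg);
        }
    }

    // Wraps a low-level failure and keeps it reachable through getCause().
    static DiskOutOfSpaceException wrap(IOException e) {
        DiskOutOfSpaceException wrapped =
            new DiskOutOfSpaceException("No space left on device");
        wrapped.initCause(e);   // allowed once, because no cause was set at construction
        return wrapped;
    }

    public static void main(String[] args) {
        IOException root = new IOException("write failed");
        System.out.println("cause: " + wrap(root).getCause().getMessage());
    }
}
```

With the cause attached, the stack trace of the wrapped exception also shows the original failure.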


[jira] Commented: (HADOOP-4679) Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650704#action_12650704 ] 

Raghu Angadi commented on HADOOP-4679:
--------------------------------------

After talking to Hairong:

  # DataXceiverServer should handle SocketTimeoutException. Right now an idle DN prints an exception every 10 seconds.
  # The timeout for the server socket could be lower, so that the test finishes faster.
  # The unit test need not create files in a tight loop.
  # immedateShutdown is not really necessary. The way shutdown() works, it should only be called from the offerService() thread. I think the JavaDoc should state this explicitly.
  # The reason the log was printed in a tight infinite loop (without sleep) is that the thread interrupts itself before calling sleep(), so sleep() returns immediately!

I think this should go into 0.18. No one likes disks filling up with these log messages.
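
The last point in the list above can be reproduced in isolation. This is a standalone sketch, not the DataNode code; the class and method names are made up. In Java, Thread.sleep() throws InterruptedException at once if the calling thread's interrupt flag is already set, so a loop that catches the exception and keeps going never actually pauses.

```java
public class SelfInterruptSketch {
    // Returns how many milliseconds a requested 2-second sleep actually lasted
    // when the calling thread had interrupted itself first.
    static long sleepAfterSelfInterrupt() {
        Thread.currentThread().interrupt();   // sets this thread's interrupt flag
        long start = System.nanoTime();
        try {
            Thread.sleep(2000);               // throws immediately because the flag is set
        } catch (InterruptedException e) {
            // sleep() clears the flag when it throws; a loop that ignores this
            // exception simply carries on without having waited
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("slept for about " + sleepAfterSelfInterrupt() + " ms");
    }
}
```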
  


[jira] Assigned: (HADOOP-4679) Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang reassigned HADOOP-4679:
-------------------------------------

    Assignee: Hairong Kuang


[jira] Updated: (HADOOP-4679) Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-4679:
----------------------------------

    Component/s: dfs


[jira] Updated: (HADOOP-4679) Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-4679:
----------------------------------

    Attachment: diskError2.patch

This patch incorporates Raghu's comments except for comment 3. The unit test does not create files in a tight loop: it waits until all replicas have been created before moving to the next iteration. I tried a few other ways of writing this test, and the current one seems to be the most efficient.

In addition, I made a change to BlockReceiver. If the BlockReceiver constructor fails, it checks whether the failure was caused by a read-only disk. Since checking for read-only disks is an expensive operation, it is performed only when creating the temporary block file fails.
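
The "check only on failure" idea can be sketched as follows. This is an illustrative example rather than the BlockReceiver code from the patch: LazyDiskCheckSketch, dirIsWritable and createTmpBlockFile are hypothetical names, and the writability test is a simple stand-in for whatever disk check the DataNode performs.

```java
import java.io.File;
import java.io.IOException;

public class LazyDiskCheckSketch {
    // Counts how often the (assumed expensive) check actually runs.
    static int expensiveChecks = 0;

    // Stand-in for an expensive disk check.
    static boolean dirIsWritable(File dir) {
        expensiveChecks++;
        return dir.isDirectory() && dir.canWrite();
    }

    // Creates the file; the disk check runs only if creation fails.
    static File createTmpBlockFile(File dir, String name) throws IOException {
        File f = new File(dir, name);
        try {
            if (!f.createNewFile()) {
                throw new IOException("file already exists: " + f);
            }
            return f;
        } catch (IOException e) {
            if (!dirIsWritable(dir)) {
                throw new IOException("disk error under " + dir, e);
            }
            throw e;   // the directory looks fine, so report the original failure
        }
    }

    // Convenience wrapper: true if the file could be created (it is then deleted).
    static boolean tryCreate(File dir, String name) {
        try {
            createTmpBlockFile(dir, name).delete();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        System.out.println("created in tmp dir: " + tryCreate(tmp, "blk_sketch.tmp"));
        System.out.println("expensive checks so far: " + expensiveChecks);
    }
}
```

On the success path the check never runs, so normal block creation pays no extra cost.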


[jira] Updated: (HADOOP-4679) Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-4679:
----------------------------------

    Attachment: diskError3-br18.patch

A patch for branch 0.18.


[jira] Commented: (HADOOP-4679) Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652891#action_12652891 ] 

Hairong Kuang commented on HADOOP-4679:
---------------------------------------

ant test-core passed:
BUILD SUCCESSFUL
Total time: 118 minutes 28 seconds

and so did ant patch:
     [exec] +1 overall.

     [exec]     +1 @author.  The patch does not contain any @author tags.

     [exec]     +1 tests included.  The patch appears to include 4 new or modified tests.

     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.

     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.

     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.



[jira] Commented: (HADOOP-4679) Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652529#action_12652529 ] 

Hairong Kuang commented on HADOOP-4679:
---------------------------------------

I do not think it is necessary to check for a read-only disk for both the block file and the metadata file. Checking the block file is good enough. I will update the javadoc for shutdown.


[jira] Updated: (HADOOP-4679) Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-4679:
----------------------------------

    Attachment: diskError1.patch

A new patch with a minor change to handle a failed test.


[jira] Commented: (HADOOP-4679) Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654072#action_12654072 ] 

Hudson commented on HADOOP-4679:
--------------------------------

Integrated in Hadoop-trunk #680 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/680/])
    


[jira] Updated: (HADOOP-4679) Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-4679:
----------------------------------

    Attachment: diskError.patch

This patch sets DataNode.shouldRun to false when a disk error is detected while receiving a block. It also sets a timeout of 10s on DataXceiverServer's server socket so that DataXceiverServer can wake up periodically to check whether it should continue to run.
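
The accept-with-timeout loop described here can be sketched roughly as below. It is a simplified illustration, not DataXceiverServer itself: AcceptLoopSketch, serve and demo are hypothetical names, and the demo uses a 50 ms timeout instead of 10 s so that it finishes quickly. SocketTimeoutException is treated as the normal idle case rather than as an error to log.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class AcceptLoopSketch {
    // Hypothetical stand-in for DataNode.shouldRun.
    static volatile boolean shouldRun = true;

    // Accepts connections until shouldRun becomes false. Returns how many times
    // accept() timed out, i.e. how often the loop woke up with no client.
    static int serve(ServerSocket server, int timeoutMillis) throws IOException {
        server.setSoTimeout(timeoutMillis);
        int idleWakeups = 0;
        while (shouldRun) {
            try (Socket client = server.accept()) {
                // a real server would hand the socket to a worker thread here
            } catch (SocketTimeoutException e) {
                idleWakeups++;   // expected when idle; loop back and re-check the flag
            }
        }
        return idleWakeups;
    }

    // Runs the loop on an ephemeral port and clears the flag after about 300 ms.
    static int demo() {
        shouldRun = true;
        Thread stopper = new Thread(() -> {
            try {
                Thread.sleep(300);
            } catch (InterruptedException ignored) {
                // fall through and stop the server anyway
            }
            shouldRun = false;
        });
        stopper.start();
        try (ServerSocket server = new ServerSocket(0)) {
            return serve(server, 50);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("idle wakeups before stop: " + demo());
    }
}
```

Without the timeout, accept() would block indefinitely and the loop would not notice that shouldRun had changed until the next client connected.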


[jira] Commented: (HADOOP-4679) Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652467#action_12652467 ] 

Raghu Angadi commented on HADOOP-4679:
--------------------------------------


# writeToBlock() creates files in two places. The patch catches only one of them.
# There is an inherent requirement that shutdown() should only be called from the offerService thread. It would be better if the JavaDoc for shutdown() said this explicitly. Otherwise, this deadlock and the logging in a tight infinite loop could occur again with future changes.


[jira] Updated: (HADOOP-4679) Datanode prints tons of log messages: Waiting for threadgroup to exit, active theads is XX

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-4679:
----------------------------------

    Attachment: diskError3.patch
