You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org> on 2008/09/25 23:05:44 UTC

[jira] Created: (HADOOP-4278) TestDatanodeDeath failed occasionally

TestDatanodeDeath failed occasionally
-------------------------------------

                 Key: HADOOP-4278
                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
             Project: Hadoop Core
          Issue Type: Bug
            Reporter: Tsz Wo (Nicholas), SZE


TestDatanodeDeath keeps failing occasionally.  For example, see
http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4278:
-------------------------------------

    Attachment: primaryDeath.patch

If the recoverBlock RPC to the primary datanode fails, then remove mark it as bad and continue recovery.

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: deathlog.patch, primaryDeath.patch
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639683#action_12639683 ] 

Raghu Angadi commented on HADOOP-4278:
--------------------------------------

HADOOP-3416 does not explain this. 3416 is a bug on DFSClient which results in marking first datanode as the bad one. I think dependency of this jira on HADOOP-3416 should be removed.


> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640198#action_12640198 ] 

Hudson commented on HADOOP-4278:
--------------------------------

Integrated in Hadoop-trunk #635 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/635/])
    . Increase debug logging for unit test TestDatanodeDeath.
(dhruba)


> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: deathlog.patch, primaryDeath.patch, primaryDeath.patch
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-4278:
-------------------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I just committed this.  Thanks Dhruba!

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: deathlog.patch, primaryDeath.patch, primaryDeath.patch
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639867#action_12639867 ] 

Sameer Paranjpye commented on HADOOP-4278:
------------------------------------------

> It is likely to be a problem with the test. In real-life, if a datanode in the oipleine dies, it is very likely that all the threads in that Datanode become 
> inactive, so it will not matter whether the block-receiver thread or the packer responder thread exited first.

The client should be able to do a better job of detecting which Datanode died, no? The current implementation means that if a Datanode in a write pipeline dies the writer wil frequently fail. Datanode death is not a very frequent event so this is likely not a blocker, but I'd say that our test is triggering a known issue rather than saying it has a bug.

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639694#action_12639694 ] 

Raghu Angadi commented on HADOOP-4278:
--------------------------------------

Also, do we know why multiple attempts to write after that have failed?

 e.g. {noformat} 2008-10-14 01:10:56,862 WARN  hdfs.DFSClient (DFSClient.java:processDatanodeError(2469)) - Error 
Recovery for block  blk_-6934002956802922056_1001 failed  because recovery from primary datanode 127.0.0.1:61531 
failed 3 times. Will retry...{noformat}


> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639514#action_12639514 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-4278:
------------------------------------------------

It turns out that TestDatanodeDeath.testSimple1() fails.  Below are some recent failures in Hudson.
http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3456/testReport/
http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3453/testReport/

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-4278:
------------------------------------

    Component/s: dfs

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-4278:
------------------------------------

    Fix Version/s: 0.19.0

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4278:
-------------------------------------

    Attachment: primaryDeath.patch

Merged patch with trunk. 

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: deathlog.patch, primaryDeath.patch, primaryDeath.patch
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634631#action_12634631 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-4278:
------------------------------------------------

Another example, http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3344/

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Tsz Wo (Nicholas), SZE
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-4278:
------------------------------------

    Priority: Blocker  (was: Major)

Gotta be in 0.19!

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Tsz Wo (Nicholas), SZE
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640264#action_12640264 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-4278:
------------------------------------------------

+1 patch looks good.  I ran the test 10 times and all of them have been passed.

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: deathlog.patch, primaryDeath.patch, primaryDeath.patch
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639929#action_12639929 ] 

Raghu Angadi commented on HADOOP-4278:
--------------------------------------

The fact that TestDatanodeDeath is unpredictable is probably a good thing, though it is painful to diagnose. It points out rare race conditions.

HADOOP-3416 is about an actual bug in DFSClient and not a bug in the test. I don't think this jira is a bug in the test either. I see couple of issues pointed by this failure : 

- Datanode (not client) does not detect the problem datanode in the pipeline some cases. 
     -- We could live with this since write should succeed (though it got replicated to fewer datanodes than what was possible). This is also the reason why HADOOP-3416 was not blocker but HADOOP-3339 was.

- In such cases, DFSClient does not seem to be able to recover. I think this is a pretty important bug to fix since it will result in hard failures on the write.  

Does this sound correct?

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE reassigned HADOOP-4278:
----------------------------------------------

    Assignee: dhruba borthakur  (was: Tsz Wo (Nicholas), SZE)

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639688#action_12639688 ] 

Raghu Angadi commented on HADOOP-4278:
--------------------------------------

That probably what has happened.  That also implies this is unrelated to HADOOP-3416, no?

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640088#action_12640088 ] 

dhruba borthakur commented on HADOOP-4278:
------------------------------------------

Thanks Sameer for your suggestion. 

I followed Raghu's suggestion of investigating why the client never recovered. There were three datanodes A, B and C in the pipeline. The test killed B. The client thought that C was killed. 

The client designated A as the primary datanode. It made a recoverBlock RPC to the primary datanode. The primary datanode, in turn , should have made a recoverBlock RPC to itslf and B. However, this did not occur. Looking at the logs, it appears that the primary datanode tried to make a RPC only to B. This failed (again and again) and the primary datanode returned error to the client. The client than aborted.

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4278:
-------------------------------------

    Status: Patch Available  (was: Open)

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: deathlog.patch, primaryDeath.patch, primaryDeath.patch
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652268#action_12652268 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-4278:
------------------------------------------------

primaryDeath_0.18.patch does not make sense unless the DFSClient.processDatanodeError(..) codes in HADOOP-1700 are checked in to 0.18.

Dhruba, should the DFSClient.processDatanodeError(..) in HADOOP-1700 codes should be committed to 0.18?

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: deathlog.patch, primaryDeath.patch, primaryDeath.patch, primaryDeath_0.18.patch
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639682#action_12639682 ] 

dhruba borthakur commented on HADOOP-4278:
------------------------------------------

It is likely to be a problem with the test. In real-life, if a datanode in the oipleine dies, it is very likely that all the threads in that Datanode become inactive, so it will not matter whether the block-receiver thread or the packer responder thread exited first.

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639671#action_12639671 ] 

dhruba borthakur commented on HADOOP-4278:
------------------------------------------

This test failed because the second datanode was killed but the client wrongly detected that some other datanode was dead.  This is documented in HADOOP-3416

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639925#action_12639925 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-4278:
------------------------------------------------

It seems that it will be more problematic if replication is > 3 since it is harder for client to tell which datanode is failing.

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640262#action_12640262 ] 

Hadoop QA commented on HADOOP-4278:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12392231/primaryDeath.patch
  against trunk revision 705215.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3475/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3475/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3475/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3475/console

This message is automatically generated.

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: deathlog.patch, primaryDeath.patch, primaryDeath.patch
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639923#action_12639923 ] 

dhruba borthakur commented on HADOOP-4278:
------------------------------------------

@Sameer: I agree that out test is triggering a known issue that is more likely to occur when the Blockreceiver thread dies but the PacketResponder thread is still able to communicate a response to the upstream datanode (or client). This case typically does not occur in real life, because when a datanode dies all its threads cannot communicate with other datanodes in the pipeline. I am saying that this is not likely to be a blocker for 0.19; but on the other hand, if it is occuring frequently whiel running unit tests it is better to fix it quickly. I am thinking of changing the way the unit test kills a datanode. BTW, do you ever see this scenario being triggered in real-life cases, e.g. GridMix and or perforamance benchmarks?

@Raghu: I was thinking that both this one and HADOOP-3416 is refers to the fact that the method employed by the unit test to kill datanodes is not very deterministic and that is the reason why this JIRA is related to 3416. Fixing one issue probably fixes the other one too. Let me do some investigation on how to fix it.

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4278:
-------------------------------------

    Attachment: deathlog.patch

I am switching on additional debugging for the unit test. I will commit it soon so that we can get to the bottom of the real problem.

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: deathlog.patch
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639684#action_12639684 ] 

dhruba borthakur commented on HADOOP-4278:
------------------------------------------

If you look at the attached log to this JIRA, one can clearly see that the second datanode is being killed by the test. Hoqeer, the killing is done by sending interrupts to the second datanode. The PacketResponder thread of the second datanode sends back -2 to the first datanode beucase it encounters this InterruptedException. The firstdatanode thinks that the second datanode is good beucase it did send back a response, so the bad datanode must be the third one in the list.


> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640872#action_12640872 ] 

Hudson commented on HADOOP-4278:
--------------------------------

Integrated in Hadoop-trunk #637 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/637/])
    . If the primary datanode fails in DFSClent, remove it from the pipe line.  (dhruba via szetszwo)


> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: deathlog.patch, primaryDeath.patch, primaryDeath.patch
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639965#action_12639965 ] 

Sameer Paranjpye commented on HADOOP-4278:
------------------------------------------

@Dhruba: I agree that this is not a blocker for 0.19. The out of phase thread deaths don't occur typically in real deployments. Also we haven't yet observed this condition occurring frequently on our grids.

However, I think there are real deficiencies in error recovery for HDFS writes. 
# the client does not correctly detect which link in the write pipeline failed
# the client tries to initiate block recovery from the dead Datanode, fails to do so and causes the write to fail. This is mostly due to 1. but can also occur if the recovery primary fails following a link failure.

Ideally, a writer should fail only if
# the writer itself dies for some reason
# the writer loses all it's replicas

This should be the subject of a different JIRA but I think we should spend some energy making it happen. For this issue, the best course might be to disable testSimple until we have a complete recovery story.


> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-4278:
-------------------------------------------

    Attachment: primaryDeath_0.18.patch

primaryDeath_0.18.patch: for 0.18

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: deathlog.patch, primaryDeath.patch, primaryDeath.patch, primaryDeath_0.18.patch
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler reassigned HADOOP-4278:
---------------------------------------

    Assignee: Tsz Wo (Nicholas), SZE

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4278) TestDatanodeDeath failed occasionally

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639678#action_12639678 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-4278:
------------------------------------------------

So, are you saying that this is a bug in the test, Dhruba?  If yes, it is a good news.

> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.