Posted to common-dev@hadoop.apache.org by "Ramya R (JIRA)" <ji...@apache.org> on 2009/03/31 16:52:55 UTC

[jira] Created: (HADOOP-5599) Unable to run jobs when all the nodes in rack are down

Unable to run jobs when all the nodes in rack are down
------------------------------------------------------

                 Key: HADOOP-5599
                 URL: https://issues.apache.org/jira/browse/HADOOP-5599
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.20.0
            Reporter: Ramya R
             Fix For: 0.20.0


Jobs such as randomwriter, sort, and validator fail when all the datanodes in a rack are down.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5599) Unable to run jobs when all the nodes in rack are down

Posted by "Nigel Daley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694147#action_12694147 ] 

Nigel Daley commented on HADOOP-5599:
-------------------------------------

Does your randomwriter.cfg lower the default replication from 3 to 2 or 1?  Or does the randomwriter code?
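
One way to answer this question for a given job config is to read the replication setting directly. The following is a minimal sketch, assuming the config uses the usual Hadoop `<configuration>/<property>` XML layout and the `dfs.replication` key; the helper name `get_property` and the sample XML are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as ET


def get_property(xml_text, name, default=None):
    """Return the value of a named property from Hadoop-style config XML."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return default


# Hypothetical config contents; a real file would be read from disk instead.
SAMPLE = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>"""

# Fall back to 3, the default replication mentioned in the question above.
replication = int(get_property(SAMPLE, "dfs.replication", default="3"))
print("dfs.replication =", replication)
```

If the key is absent from every config the job loads, the cluster default applies.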

[jira] Commented: (HADOOP-5599) Unable to run jobs when all the nodes in rack are down

Posted by "Ramya R (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694173#action_12694173 ] 

Ramya R commented on HADOOP-5599:
---------------------------------

I ran the above job on a 480-node cluster with many racks, and the JT was brought up using the fairshare scheduler.

A similar kind of behavior is observed in the following case as well:
* Generate data and sort it
* Datanodes in a given rack go down
* Run testmapredsort. The job fails
* The filesystem is declared CORRUPT


[jira] Commented: (HADOOP-5599) Unable to run jobs when all the nodes in rack are down

Posted by "Iyappan Srinivasan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695292#action_12695292 ] 

Iyappan Srinivasan commented on HADOOP-5599:
--------------------------------------------

I was able to reproduce the issue described above.

1) In the cluster, generate data using randomwriter.
2) Pick one rack and kill all the datanodes in that rack only.
3) Run a sort job. It fails.
4) Run fsck from root. It reports the data as corrupt.

I have attached the logs of these runs.
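
For anyone scripting the fsck check in the steps above, here is a small sketch that classifies captured fsck console output. It assumes the report contains a summary line ending in "is HEALTHY" or "is CORRUPT"; the function name `fsck_status` and the sample text are hypothetical, and the exact wording should be checked against the output of your own Hadoop version.

```python
def fsck_status(report):
    """Classify captured fsck console output as HEALTHY, CORRUPT, or UNKNOWN."""
    # Scan from the end, since the summary is assumed to be near the bottom.
    for line in reversed(report.strip().splitlines()):
        if "is HEALTHY" in line:
            return "HEALTHY"
        if "is CORRUPT" in line:
            return "CORRUPT"
    return "UNKNOWN"


# Hypothetical captured output, not taken from the attached logs.
sample_report = """Status: CORRUPT
The filesystem under path '/' is CORRUPT
"""

print(fsck_status(sample_report))
```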

[jira] Commented: (HADOOP-5599) Unable to run jobs when all the nodes in rack are down

Posted by "Koji Noguchi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694189#action_12694189 ] 

Koji Noguchi commented on HADOOP-5599:
--------------------------------------

The only open case I know of for blocks getting corrupted due to a lost rack is HADOOP-4477.

[jira] Commented: (HADOOP-5599) Unable to run jobs when all the nodes in rack are down

Posted by "Ramya R (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694138#action_12694138 ] 

Ramya R commented on HADOOP-5599:
---------------------------------

Consider the following simple scenario:
* Generate data
* Datanodes in a given rack go down (2 replicas of many blocks are lost)
* Run sort on the generated data
* Sort job fails
* The filesystem is declared CORRUPT

However, the expected behavior would be to sort the data successfully using the third replica of each block.

[jira] Updated: (HADOOP-5599) Unable to run jobs when all the nodes in rack are down

Posted by "Iyappan Srinivasan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Iyappan Srinivasan updated HADOOP-5599:
---------------------------------------

    Attachment: 5599log.txt

Randomwriter console output, fsck console output, and namenode logs.

[jira] Commented: (HADOOP-5599) Unable to run jobs when all the nodes in rack are down

Posted by "Ramya R (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694157#action_12694157 ] 

Ramya R commented on HADOOP-5599:
---------------------------------

> Does your randomwriter.cfg lower the default replication from 3 to 2 or 1? Or does the randomwriter code?

Neither of these is the case.

[jira] Commented: (HADOOP-5599) Unable to run jobs when all the nodes in rack are down

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696299#action_12696299 ] 

Owen O'Malley commented on HADOOP-5599:
---------------------------------------

Did you have the topology defined? If not, there is nothing DFS can do. If so, the question becomes why blocks ended up with all the replicas on one rack.
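
For reference, the topology is usually defined by pointing Hadoop at a script that maps hosts to rack paths (in 0.20 the property is, as far as I know, `topology.script.file.name`). The script receives one or more host names or IP addresses as arguments and prints one rack path per argument. A minimal sketch follows; the host-to-rack table is hypothetical and would need to reflect the real cluster layout.

```python
#!/usr/bin/env python
import sys

# Rack reported for hosts that are not in the table below.
DEFAULT_RACK = "/default-rack"

# Hypothetical host-to-rack table; replace with the actual cluster layout.
RACK_BY_HOST = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.11": "/rack2",
    "10.0.2.12": "/rack2",
}


def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # One rack path per argument, separated by spaces.
    print(" ".join(resolve(sys.argv[1:])))
```

Hosts missing from the table fall back to a single default rack, which is effectively the situation when no topology is defined at all.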

[jira] Resolved: (HADOOP-5599) Unable to run jobs when all the nodes in rack are down

Posted by "Ramya R (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ramya R resolved HADOOP-5599.
-----------------------------

    Resolution: Invalid

[jira] Commented: (HADOOP-5599) Unable to run jobs when all the nodes in rack are down

Posted by "Ramya R (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696387#action_12696387 ] 

Ramya R commented on HADOOP-5599:
---------------------------------

No, the topology was not defined. That's why all the replicas were placed in the "default" rack. After defining the network topology, the jobs completed successfully using the remaining replica. Thanks, Owen.
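
To illustrate why the missing topology mattered, here is a toy simulation. It is not HDFS's actual placement code, only a simplified model that assumes the second replica goes to a different known rack when one exists and the third goes to the second replica's rack. All names (`place_block`, `blocks_lost`, the node and rack labels) are hypothetical.

```python
import random

DEFAULT_RACK = "/default-rack"


def place_block(nodes, known_rack, rng, replication=3):
    """Pick replica nodes for one block using a simplified rack-aware rule."""
    first = rng.choice(nodes)
    replicas = [first]
    # Second replica: prefer a node on a different known rack than the first.
    off_rack = [n for n in nodes if known_rack[n] != known_rack[first]]
    candidates = off_rack if off_rack else [n for n in nodes if n != first]
    second = rng.choice(candidates)
    replicas.append(second)
    # Remaining replicas: prefer the second replica's rack, else any unused node.
    while len(replicas) < replication:
        same_rack = [n for n in nodes
                     if known_rack[n] == known_rack[second] and n not in replicas]
        pool = same_rack if same_rack else [n for n in nodes if n not in replicas]
        replicas.append(rng.choice(pool))
    return replicas


def blocks_lost(physical_rack, known_rack, dead_rack, num_blocks=2000, seed=0):
    """Count blocks whose replicas all sit on the physical rack that died."""
    rng = random.Random(seed)
    nodes = sorted(physical_rack)
    lost = 0
    for _ in range(num_blocks):
        replicas = place_block(nodes, known_rack, rng)
        if all(physical_rack[n] == dead_rack for n in replicas):
            lost += 1
    return lost


# Four physical racks with five nodes each (hypothetical layout).
physical = {"node%d-%d" % (r, i): "/rack%d" % r for r in range(4) for i in range(5)}
# With no topology defined, every node appears to be on the same rack.
no_topology = {node: DEFAULT_RACK for node in physical}

print("lost without topology:", blocks_lost(physical, no_topology, "/rack0"))
print("lost with topology:   ", blocks_lost(physical, physical, "/rack0"))
```

In this model, placement with a defined topology always spans at least two racks, so losing one rack never removes every replica of a block. Without a topology, placement is effectively random across nodes, and some blocks can end up entirely on the rack that goes down.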

