Posted to common-dev@hadoop.apache.org by "Hairong Kuang (JIRA)" <ji...@apache.org> on 2007/04/03 20:40:32 UTC

[jira] Created: (HADOOP-1200) Datanode should periodically do a disk check

Datanode should periodically do a disk check
--------------------------------------------

                 Key: HADOOP-1200
                 URL: https://issues.apache.org/jira/browse/HADOOP-1200
             Project: Hadoop
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.12.2
            Reporter: Hairong Kuang
             Fix For: 0.13.0


HADOOP-1170 removed the disk-checking feature, but this feature is needed for maintaining a large cluster. I agree that checking the disk on every I/O is too costly. A nicer approach is to have a thread that periodically does a disk check; the datanode then automatically decommissions itself when an error occurs.
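The proposal above can be sketched as a background thread that verifies each data directory and stops the node on failure. This is only an illustrative sketch, not Hadoop's actual code: the class and method names (DiskCheckThread, isDirHealthy, shutdown) are hypothetical, and the health test only approximates what Hadoop's DiskChecker verifies.

```java
import java.io.File;

// Hypothetical sketch of the proposed periodic disk check.
public class DiskCheckThread implements Runnable {
    private final File[] dataDirs;
    private final long intervalMs;
    private volatile boolean shouldRun = true;

    public DiskCheckThread(File[] dataDirs, long intervalMs) {
        this.dataDirs = dataDirs;
        this.intervalMs = intervalMs;
    }

    // A directory is healthy if it exists (or can be created) and is
    // readable and writable -- roughly the checks DiskChecker performs.
    static boolean isDirHealthy(File dir) {
        if (!dir.exists() && !dir.mkdirs()) {
            return false;
        }
        return dir.isDirectory() && dir.canRead() && dir.canWrite();
    }

    public void run() {
        while (shouldRun) {
            for (File dir : dataDirs) {
                if (!isDirHealthy(dir)) {
                    // On a disk error, the datanode would decommission
                    // itself rather than keep serving from a bad disk.
                    shutdown();
                    return;
                }
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException ie) {
                return;
            }
        }
    }

    void shutdown() {
        shouldRun = false;
    }
}
```

As the rest of the thread shows, the committed patch ultimately took a different route (checking on I/O errors rather than on a timer), so this sketch reflects only the initial proposal.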

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-1200) Datanode should periodically do a disk check

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493796 ] 

Hairong Kuang commented on HADOOP-1200:
---------------------------------------

> You have replaced 'shutdown(); throw iex;' with 'checkDiskError(iex); throw iex;', which does not shut down if checkDirs() does not throw DiskErrorException. Is this functionality change intentional?
Yes, the idea is that we do not shut down the datanode if the IOException is caused by a temporary error.

> Another functionality change is that data.invalidate(toDelete)'s exception is ignored in processCommand(). This change is probably not necessary since offerService() already handles the exception.
You are right. The new patch will rethrow the exception.
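The shutdown policy being discussed can be sketched as follows. This is an illustrative stand-in, not the committed patch: DiskErrorPolicy and its fields are hypothetical names, the nested DiskErrorException merely stands in for org.apache.hadoop.util.DiskChecker.DiskErrorException, and the directory test approximates the real checkDirs().

```java
import java.io.File;
import java.io.IOException;

// Hypothetical sketch: shut down only on a confirmed disk error,
// treating other IOExceptions as possibly transient.
public class DiskErrorPolicy {
    // Stand-in for DiskChecker.DiskErrorException.
    static class DiskErrorException extends IOException {
        DiskErrorException(String msg) { super(msg); }
    }

    private final File[] dataDirs;
    private boolean shutDown = false;

    DiskErrorPolicy(File[] dataDirs) { this.dataDirs = dataDirs; }

    // Re-check every data directory; throw only on a genuine disk error.
    void checkDirs() throws DiskErrorException {
        for (File dir : dataDirs) {
            if (!(dir.isDirectory() && dir.canRead() && dir.canWrite())) {
                throw new DiskErrorException("bad data dir: " + dir);
            }
        }
    }

    // Called when an I/O operation fails: shut down only if re-checking
    // the directories shows they are really broken; otherwise assume the
    // IOException was transient and keep running.
    void checkDiskError(IOException cause) {
        try {
            checkDirs();
        } catch (DiskErrorException dee) {
            shutDown = true;   // the datanode decommissions itself
        }
    }

    boolean isShutDown() { return shutDown; }
}
```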



[jira] Updated: (HADOOP-1200) Datanode should periodically do a disk check

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1200:
----------------------------------

    Attachment: diskCheck.patch

This patch checks if the disk is read-only whenever an IOException occurs.
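A read-only data directory can be approximated in plain Java as shown below. This is a hypothetical probe, not the patch's actual check: File.canWrite() only approximates a read-only mount, and the real patch rechecks the data directories through DiskChecker.

```java
import java.io.File;

public class ReadOnlyProbe {
    // Returns true if the directory appears read-only: it exists as a
    // directory but the current user cannot write to it.
    public static boolean isReadOnly(File dir) {
        return dir.isDirectory() && !dir.canWrite();
    }
}
```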



[jira] Commented: (HADOOP-1200) Datanode should periodically do a disk check

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490136 ] 

dhruba borthakur commented on HADOOP-1200:
------------------------------------------

Instead of a periodic thread, we can invoke checkDir when an IO error occurs.



[jira] Updated: (HADOOP-1200) Datanode should periodically do a disk check

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1200:
---------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Hairong!



[jira] Commented: (HADOOP-1200) Datanode should periodically do a disk check

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493182 ] 

Raghu Angadi commented on HADOOP-1200:
--------------------------------------


You have replaced '{{shutdown(); throw iex;}}' with '{{checkDiskError(iex); throw iex;}}'. {{checkDiskError(iex)}} does not shut down if checkDirs() does not throw DiskErrorException. Is this functionality change intentional?

I am not sure why there was {{shutdown();}} in the first place, so this change might be OK.

Another functionality change is that {{data.invalidate(toDelete)}}'s exception is ignored in {{processCommand()}}. This change is probably not necessary, since {{offerService()}} already handles the exception.







[jira] Commented: (HADOOP-1200) Datanode should periodically do a disk check

Posted by "Koji Noguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486465 ] 

Koji Noguchi commented on HADOOP-1200:
--------------------------------------

I've seen many occasions when one of the disks becomes read-only and the TaskTracker stays up and heartbeats, but makes no progress, which hangs the job.
The datanode, on the other hand, cleverly stops itself, leaving a log on the namenode:

2007-04-03 13:13:48,997 WARN org.apache.hadoop.dfs.NameNode: Report from __.__.__.__:____: can not create directory: /___/dfs/data/data/subdir0
2007-04-03 13:13:48,998 WARN org.apache.hadoop.dfs.NameNode: Report from __.__.__.__:____: directory is not writable: /___/dfs/data/data
2007-04-03 13:13:49,024 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /__.__.__.__/__.__.__.__:____







[jira] Commented: (HADOOP-1200) Datanode should periodically do a disk check

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490138 ] 

Raghu Angadi commented on HADOOP-1200:
--------------------------------------

bq. Instead of a periodic thread, we can invoke checkDir when an IO error occurs.
(y) 

We can add a thread later if we still feel it's useful.



[jira] Updated: (HADOOP-1200) Datanode should periodically do a disk check

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1200:
----------------------------------

    Attachment: diskCheck1.patch

The new patch incorporates Raghu's comment.



[jira] Commented: (HADOOP-1200) Datanode should periodically do a disk check

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494251 ] 

Hadoop QA commented on HADOOP-1200:
-----------------------------------

Integrated in Hadoop-Nightly #82 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/82/)



[jira] Updated: (HADOOP-1200) Datanode should periodically do a disk check

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-1200:
----------------------------------

    Assignee: dhruba borthakur
    Priority: Blocker  (was: Major)



[jira] Updated: (HADOOP-1200) Datanode should periodically do a disk check

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1200:
----------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-1200) Datanode should periodically do a disk check

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493818 ] 

Hadoop QA commented on HADOOP-1200:
-----------------------------------

+1

http://issues.apache.org/jira/secure/attachment/12356806/diskCheck1.patch applied and successfully tested against trunk revision r534975.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/119/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/119/console



[jira] Assigned: (HADOOP-1200) Datanode should periodically do a disk check

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang reassigned HADOOP-1200:
-------------------------------------

    Assignee: Hairong Kuang  (was: dhruba borthakur)
