Posted to common-dev@hadoop.apache.org by "Igor Bolotin (JIRA)" <ji...@apache.org> on 2007/03/28 06:46:32 UTC

[jira] Created: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect
--------------------------------------------------------------------------------------

                 Key: HADOOP-1170
                 URL: https://issues.apache.org/jira/browse/HADOOP-1170
             Project: Hadoop
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.11.2
            Reporter: Igor Bolotin


While investigating performance issues in our Hadoop DFS/MapReduce cluster, I saw very high CPU usage by the DataNode processes.

Stack traces showed the following on most of the data nodes:
"org.apache.hadoop.dfs.DataNode$DataXceiveServer@528acf6e" daemon prio=1 tid=0x00002aaacb5b7bd0 nid=0x5940 runnable [0x000000004166a000..0x000000004166ac00]
        at java.io.UnixFileSystem.checkAccess(Native Method)
        at java.io.File.canRead(File.java:660)
        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:34)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:164)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSVolume.checkDirs(FSDataset.java:258)
        at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:339)
        - locked <0x00002aaab6fb8960> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
        at org.apache.hadoop.dfs.FSDataset.checkDataDir(FSDataset.java:544)
        at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:535)
        at java.lang.Thread.run(Thread.java:595)

I understand that it would take a while to check the entire data directory, as we have some 180,000 blocks/files in there. But what really bothers me is that, from the code, this check is executed for every client connection to the DataNode - which also means for every task executed in the cluster. Once I commented out the check and restarted the datanodes, performance went up and CPU usage dropped to a reasonable level.
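
For readers without the source handy, here is a minimal sketch of the pattern the stack trace implies - the structure is assumed from the trace, not copied from the actual Hadoop code. The recursive directory check sits inside the accept loop, so every client connection pays for a walk of the entire block tree:

    import java.io.File;
    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Minimal sketch (assumed structure, not the actual Hadoop source) of
    // the shape the stack trace implies: every accepted connection triggers
    // a full recursive scan of the data directory tree.
    class DataXceiveServerSketch implements Runnable {
        private final ServerSocket ss;
        private final File dataDir;          // root of one dfs.data.dir volume
        private volatile boolean shouldRun = true;

        DataXceiveServerSketch(ServerSocket ss, File dataDir) {
            this.ss = ss;
            this.dataDir = dataDir;
        }

        public void run() {
            while (shouldRun) {
                try {
                    Socket s = ss.accept();
                    checkDirTree(dataDir);   // O(block files) on every connect
                    s.close();               // real code hands s to a DataXceiver thread
                } catch (IOException e) {
                    shouldRun = false;       // bad volume or closed socket: stop serving
                }
            }
        }

        // Mirrors the shape of FSDir.checkDirTree(): recursive permission
        // checks and directory listings over the whole tree. With ~180,000
        // block files, this is where the CPU time goes.
        private static void checkDirTree(File dir) throws IOException {
            if (!dir.canRead() || !dir.canWrite()) {
                throw new IOException("data dir not accessible: " + dir);
            }
            File[] children = dir.listFiles();
            if (children == null) {
                return;
            }
            for (File child : children) {
                if (child.isDirectory()) {
                    checkDirTree(child);
                }
            }
        }
    }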


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490132 ] 

Doug Cutting commented on HADOOP-1170:
--------------------------------------

> we should either implement a background thread to call checkDirs() before this patch can be deployed on a real cluster

Please file a new issue for this, to be fixed in 0.13.



[jira] Commented: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484961 ] 

Hairong Kuang commented on HADOOP-1170:
---------------------------------------

I agree that it is too costly to call checkDirs on every I/O operation. A background thread that periodically does the sanity check would be nicer.

The patch should also clean up the code that does the error handling.
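
Such a checker might look roughly like the sketch below. This is hypothetical, not code from the attached patches: FSDatasetLike stands in for the real FSDataset, and what to do on a failed check (drop the volume, shut down the datanode) is left to the caller.

    // Hypothetical background checker: runs the expensive scan off the
    // I/O path, on a fixed interval, instead of once per connection.
    class PeriodicDirChecker extends Thread {
        interface FSDatasetLike {
            void checkDataDir() throws java.io.IOException;
        }

        private final FSDatasetLike data;
        private final long intervalMs;

        PeriodicDirChecker(FSDatasetLike data, long intervalMs) {
            this.data = data;
            this.intervalMs = intervalMs;
            setDaemon(true);                 // don't keep the JVM alive
            setName("data-dir-checker");
        }

        public void run() {
            while (!isInterrupted()) {
                try {
                    data.checkDataDir();     // full scan, but off the I/O path
                    Thread.sleep(intervalMs);
                } catch (java.io.IOException e) {
                    // bad/inaccessible volume detected: the datanode could
                    // drop the volume or shut down here, instead of letting
                    // client I/O fail later
                    System.err.println("data dir check failed: " + e);
                    return;
                } catch (InterruptedException e) {
                    return;
                }
            }
        }
    }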



[jira] Updated: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Igor Bolotin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Igor Bolotin updated HADOOP-1170:
---------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Igor Bolotin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490139 ] 

Igor Bolotin commented on HADOOP-1170:
--------------------------------------

There is another issue, HADOOP-1200, that was opened exactly for this.



[jira] Commented: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484955 ] 

Raghu Angadi commented on HADOOP-1170:
--------------------------------------


It is invoked in two more places in DataNode.java, though not this often. Should we remove those as well? It is called once before sending a block report, and when a command is received from the namenode (e.g., a block-invalidate command in response to a heartbeat).





[jira] Commented: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485718 ] 

Hadoop QA commented on HADOOP-1170:
-----------------------------------

Integrated in Hadoop-Nightly #43 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/43/)



[jira] Commented: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485636 ] 

Hadoop QA commented on HADOOP-1170:
-----------------------------------

+1, because http://issues.apache.org/jira/secure/attachment/12354634/1170-v2.patch applied and successfully tested against trunk revision http://svn.apache.org/repos/asf/lucene/hadoop/trunk/524205. Results are at http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch



[jira] Updated: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Igor Bolotin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Igor Bolotin updated HADOOP-1170:
---------------------------------

    Attachment: 1170.patch

The attached patch removes the checkDataDir() calls from the DataXceiveServer.run() method.



[jira] Updated: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Igor Bolotin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Igor Bolotin updated HADOOP-1170:
---------------------------------

    Status: Patch Available  (was: Open)



[jira] Updated: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1170:
---------------------------------

       Resolution: Fixed
    Fix Version/s: 0.13.0
           Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Igor.



[jira] Commented: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494258 ] 

Hadoop QA commented on HADOOP-1170:
-----------------------------------

Integrated in Hadoop-Nightly #82 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/82/)



[jira] Commented: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484967 ] 

dhruba borthakur commented on HADOOP-1170:
------------------------------------------

I like the idea of a background thread that periodically checks the data directories. The idea is to detect bad/inaccessible data directories and shut down the datanode if this occurs, right?



[jira] Commented: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "eric baldeschwieler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490156 ] 

eric baldeschwieler commented on HADOOP-1170:
---------------------------------------------

The thing to understand is that we cannot upgrade our cluster to HEAD with this patch committed.  This patch breaks us.  We'll try to move forward in the new issue rather than advocating rolling this back, but this patch did not address the concerns we raised in this bug, and so we have a problem.  I hope we can avoid this in the future.

I'm not advocating rolling back, because I agree that these checks were not the appropriate solution to the disk problems they solved.

In case the context isn't clear, we frequently see individual drives go read-only on our machines.  This check was inserted to detect that problem early and avoid failed jobs caused by write failures.
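
For what it's worth, a constant-cost write probe per volume could catch the read-only case without walking the block tree. This is a hypothetical sketch, not something proposed in the attached patches:

    import java.io.File;
    import java.io.IOException;

    // Hypothetical per-volume write probe: creating and deleting a small
    // file fails fast on a read-only remount (createNewFile throws
    // IOException), at O(1) cost per volume instead of a recursive scan.
    final class WriteProbe {
        static void probe(File volumeRoot) throws IOException {
            File f = new File(volumeRoot, ".probe");
            f.delete();                      // clear any stale probe file
            if (!f.createNewFile()) {
                throw new IOException("could not create probe file on " + volumeRoot);
            }
            if (!f.delete()) {
                throw new IOException("could not delete probe file on " + volumeRoot);
            }
        }

        public static void main(String[] args) throws IOException {
            probe(new File(args[0]));        // e.g. a dfs.data.dir volume root
            System.out.println("volume is writable");
        }
    }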



[jira] Updated: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Igor Bolotin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Igor Bolotin updated HADOOP-1170:
---------------------------------

    Attachment: 1170-v2.patch

This patch removes all FSDataset.checkDataDir() calls from the DataNode, as well as the DiskErrorException handling in the DataXceiveServer.run() method. I decided not to touch the DiskErrorException handling in DataNode.offerService() - I just don't know whether it's possible to get it there.



[jira] Commented: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484711 ] 

Hadoop QA commented on HADOOP-1170:
-----------------------------------

+1, because http://issues.apache.org/jira/secure/attachment/12354393/1170.patch applied and successfully tested against trunk revision http://svn.apache.org/repos/asf/lucene/hadoop/trunk/523072. Results are at http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch



[jira] Commented: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484971 ] 

Raghu Angadi commented on HADOOP-1170:
--------------------------------------


There is going to be a periodic checker for all the blocks. The same thread could check some of these conditions too. For this jira, I vote for removing all calls to checkDirs in DataNode.java.




[jira] Updated: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Igor Bolotin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Igor Bolotin updated HADOOP-1170:
---------------------------------

    Status: Open  (was: Patch Available)



[jira] Commented: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485578 ] 

Doug Cutting commented on HADOOP-1170:
--------------------------------------

Is there a consensus to commit this as-is, or is someone working on an improved version?



[jira] Commented: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490129 ] 

dhruba borthakur commented on HADOOP-1170:
------------------------------------------

This patch improves the performance situation, but it removes all checkDirs calls from the datanode. This introduces the problem that disks might not get checked for a long time, which is dangerous for a cluster where disks go bad. I think we should implement a background thread to call checkDirs() before this patch can be deployed on a real cluster.
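
One shape such a background check could take - a minimal sketch under stated assumptions: the 10-minute interval and the checkDataDirs() stand-in are illustrative, not taken from any attached patch:

    public class PeriodicDiskChecker extends Thread {
        private final long intervalMs;

        PeriodicDiskChecker(long intervalMs) {
            this.intervalMs = intervalMs;
            setName("periodic-disk-checker");
            setDaemon(true); // must not block datanode shutdown
        }

        @Override
        public void run() {
            while (!isInterrupted()) {
                checkDataDirs();
                try {
                    Thread.sleep(intervalMs);
                } catch (InterruptedException e) {
                    return; // shutting down
                }
            }
        }

        // Stand-in for FSDataset.checkDataDir(): scan each configured
        // data directory and take failed volumes out of service.
        private void checkDataDirs() {
        }

        public static void main(String[] args) throws InterruptedException {
            new PeriodicDiskChecker(10 * 60 * 1000L).start(); // every 10 minutes
            Thread.sleep(1000); // demo only: let the first check run
        }
    }

This keeps the expensive directory walk off the per-connection path while still bounding how long a bad disk can go unnoticed.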



[jira] Commented: (HADOOP-1170) Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

Posted by "Igor Bolotin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485583 ] 

Igor Bolotin commented on HADOOP-1170:
--------------------------------------

I'll prepare a patch with all calls removed later today.
