Posted to common-dev@hadoop.apache.org by "Hairong Kuang (JIRA)" <ji...@apache.org> on 2007/06/05 20:09:25 UTC

[jira] Created: (HADOOP-1463) dfs should report total size of all the space that dfs is using

dfs should report total size of all the space that dfs is using
---------------------------------------------------------------

                 Key: HADOOP-1463
                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
             Project: Hadoop
          Issue Type: Improvement
          Components: dfs
    Affects Versions: 0.12.3
            Reporter: Hairong Kuang
             Fix For: 0.14.0


Currently the namenode reports two statistics back to the client:
1. The total capacity of dfs. This is the sum of all datanodes' capacities; each datanode calculates its capacity by summing the disk space of all its data directories.
2. The total remaining space of dfs. This is the sum of all datanodes' remaining space. Each datanode calculates its remaining space with the following formula: remaining space = unused space - capacity*unusableDiskPercentage - reserved space. So the remaining space shows how much space dfs can still use, but it does not show the size of the unused space.

Each dfs client calculates the total dfs used space by subtracting the remaining space from the total capacity. So the used space does not accurately show the space that dfs is using. However, it is a very important number that dfs should provide.
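
To make the problem concrete, here is a minimal Java sketch of the two calculations described above. It is an illustration only, not Hadoop code: the class name, method names, and sample numbers are all assumptions.

```java
// Illustration only: names and sample numbers are assumptions, not Hadoop code.
final class LegacySpaceReport {
    /** remaining space = unused space - capacity*unusableDiskPercentage - reserved space */
    static long remaining(long capacity, long unused, double unusableDiskPercentage, long reserved) {
        return unused - (long) (capacity * unusableDiskPercentage) - reserved;
    }

    /** What a client derives today: total capacity minus remaining space. */
    static long derivedUsed(long capacity, long remaining) {
        return capacity - remaining;
    }

    public static void main(String[] args) {
        long capacity = 1000;   // arbitrary units
        long dfsBlocks = 200;   // space really held by dfs block files
        long nonDfs = 100;      // other files on the same disks
        long unused = capacity - dfsBlocks - nonDfs;
        long rem = remaining(capacity, unused, 0.02, 50);
        // The derived figure also counts non-dfs files, the percentage cut,
        // and the reserve, so it is larger than dfsBlocks.
        System.out.println("real dfs used = " + dfsBlocks
                + ", derived used = " + derivedUsed(capacity, rem));
    }
}
```

The derived value equals the real dfs usage only when there are no non-dfs files, no unusable percentage, and no reserve.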

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1463:
---------------------------------

    Status: Open  (was: Patch Available)

Unfortunately this patch no longer applies cleanly to trunk.

> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.14.0
>
>         Attachments: usedSpace.patch
>
>


[jira] Updated: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1463:
----------------------------------

    Status: Patch Available  (was: Open)

> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.15.0
>
>         Attachments: usedSpace.patch, usedSpace.patch
>
>


[jira] Updated: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1463:
----------------------------------

    Attachment: usedSpace.patch

This patch makes the following changes:
1. dfs_used_space is set by running "du" every block report interval (default 1 hour) and is then adjusted whenever a block is written or deleted.
2. dfs_remaining_space is calculated by the following formula:
MIN( Total_Capacity - dfs_used_space - dfs.datanode.du.reserved, cur_disk_available) * dfs.datanode.du.pct
where Total_Capacity and cur_disk_available are set by running "df" every df interval (default 1 min).
3. Every heartbeat sends 3 disk metrics, total_capacity, dfs_used_space, and dfs_remaining_space, to the namenode.
4. The namenode keeps track of all 3 metrics and reports all 3 to the dfsadmin shell and the Web UI.
5. Fixes some minor number formatting bugs in FSShell.
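
The formula in item 2 can be sketched in Java as follows. This is a reading of the formula as written in this comment, not the code in usedSpace.patch; the class and parameter names are assumptions, and the sketch does not clamp negative results because the comment does not say whether the patch does.

```java
// Sketch of the formula in item 2 only; names are assumptions, not the patch itself.
final class PatchedRemaining {
    /**
     * MIN(Total_Capacity - dfs_used_space - dfs.datanode.du.reserved,
     *     cur_disk_available) * dfs.datanode.du.pct
     */
    static long remaining(long totalCapacity, long dfsUsed, long duReserved,
                          long curDiskAvailable, double duPct) {
        long byAccounting = totalCapacity - dfsUsed - duReserved;
        long usable = Math.min(byAccounting, curDiskAvailable);
        return (long) (usable * duPct);
    }

    public static void main(String[] args) {
        // Non-dfs files have eaten into the disk, so "df" is the tighter bound here.
        System.out.println(remaining(1000, 200, 100, 600, 0.5));
    }
}
```

Taking the minimum means that non-dfs files filling the disk lower the remaining space without being counted as dfs used space.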

> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>         Attachments: usedSpace.patch
>
>


[jira] Updated: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-1463:
-------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this. Thanks Hairong!

> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.15.0
>
>         Attachments: usedSpace.patch, usedSpace.patch
>
>


[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513314 ] 

Hairong Kuang commented on HADOOP-1463:
---------------------------------------

> Can I ask for 
> o Capacity (df total)
> o DFS Used 
> o DFS Remaining ?

Koji's ask makes sense. I thought about the same thing. But this feature requires a heartbeat protocol change. Currently each heartbeat sends the capacity and the remaining space, and the namenode interprets dfs used as the difference between capacity and remaining. For this ask, we can either
1. have each datanode send its disk capacity at registration time and then send dfs used space and remaining space in each heartbeat; or
2. have each datanode's heartbeat send disk capacity (although it does not change most of the time), dfs used space, and dfs remaining space.
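
As a rough picture of option 2, the sketch below models a heartbeat that carries all three numbers and a namenode-side sum over the latest heartbeat from each datanode. The types and method here are hypothetical and are not the actual Hadoop heartbeat protocol.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical types for illustration; not the Hadoop heartbeat protocol.
final class HeartbeatSketch {
    /** Option 2 payload: every heartbeat carries capacity, dfs used, and remaining. */
    static final class Heartbeat {
        final long capacity;
        final long dfsUsed;
        final long remaining;

        Heartbeat(long capacity, long dfsUsed, long remaining) {
            this.capacity = capacity;
            this.dfsUsed = dfsUsed;
            this.remaining = remaining;
        }
    }

    /** Cluster totals as {capacity, dfsUsed, remaining}: plain sums over datanodes. */
    static long[] totals(List<Heartbeat> latestPerDatanode) {
        long[] t = new long[3];
        for (Heartbeat h : latestPerDatanode) {
            t[0] += h.capacity;
            t[1] += h.dfsUsed;
            t[2] += h.remaining;
        }
        return t;
    }

    public static void main(String[] args) {
        long[] t = totals(Arrays.asList(new Heartbeat(100, 20, 60), new Heartbeat(200, 50, 120)));
        System.out.println(Arrays.toString(t));
    }
}
```

With three independent numbers, dfs used plus remaining no longer has to equal capacity, which is what lets the report separate dfs usage from everything else on the disks.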


> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.14.0
>
>         Attachments: usedSpace.patch
>
>


[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511864 ] 

Hairong Kuang commented on HADOOP-1463:
---------------------------------------

The current implementation seems to define reserved space for future non-dfs usage.

> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.14.0
>
>


[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514063 ] 

Hadoop QA commented on HADOOP-1463:
-----------------------------------

-1, build or testing failed

2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12362176/usedSpace.patch against trunk revision r557790.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/436/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/436/console

Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.14.0
>
>         Attachments: usedSpace.patch, usedSpace.patch
>
>


[jira] Updated: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1463:
----------------------------------

    Attachment:     (was: usedSpace.patch)

> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>         Attachments: usedSpace.patch
>
>


[jira] Updated: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1463:
----------------------------------

    Status: Open  (was: Patch Available)

> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.15.0
>
>         Attachments: usedSpace.patch, usedSpace.patch
>
>


[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Koji Noguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502138 ] 

Koji Noguchi commented on HADOOP-1463:
--------------------------------------

> if we want the latter,

Yes I prefer the latter.  (Datanode utilizing another 10G of space)

As a side note, we might want to update the config description for "dfs.datanode.du.pct".
It seems to be different from what it actually does.

<property>
  <name>dfs.datanode.du.pct</name>
  <value>0.98f</value>
  <description>When calculating remaining space, only use this percentage of the real available space
  </description>
</property>



> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>             Fix For: 0.14.0
>
>


[jira] Updated: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1463:
----------------------------------

    Fix Version/s: 0.14.0
           Status: Patch Available  (was: Open)

> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.14.0
>
>         Attachments: usedSpace.patch
>
>


[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502033 ] 

Hairong Kuang commented on HADOOP-1463:
---------------------------------------

I feel that running "du" to get the size of the data directories is too costly. The current code does this every 3 seconds. What we can do instead is keep a counter that tracks the total size of all blocks at each datanode. It gets updated whenever a block is written or deleted, and it gets reset by summing up all block sizes when a block report is sent.
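
The counter idea might look roughly like the Java sketch below. The class and method names are assumptions made for illustration and do not correspond to existing datanode code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the proposed counter; names are assumptions.
final class DfsUsedCounter {
    private final AtomicLong usedBytes = new AtomicLong();

    /** Called when a block has been written. */
    void blockWritten(long bytes) {
        usedBytes.addAndGet(bytes);
    }

    /** Called when a block has been deleted. */
    void blockDeleted(long bytes) {
        usedBytes.addAndGet(-bytes);
    }

    /** Called when a block report is sent: replace the running value with a fresh sum. */
    void resetFromBlockReport(long[] blockSizes) {
        long sum = 0;
        for (long size : blockSizes) {
            sum += size;
        }
        usedBytes.set(sum);
    }

    long get() {
        return usedBytes.get();
    }

    public static void main(String[] args) {
        DfsUsedCounter counter = new DfsUsedCounter();
        counter.blockWritten(64);
        counter.blockDeleted(64);
        counter.resetFromBlockReport(new long[] {10, 20});
        System.out.println(counter.get());
    }
}
```

The periodic reset bounds any drift that accumulates if an update is missed between block reports.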

> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>             Fix For: 0.14.0
>
>


[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Koji Noguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513154 ] 

Koji Noguchi commented on HADOOP-1463:
--------------------------------------

bq. The current implementation interprets the reserved space as the space reserved per volume. We want it to be the space reserved per datanode, right?

I'm sorry that I missed this.
At least for me, I'd like to reserve the space for each volume so that mapreduce can utilize multiple drives.


bq. 1. datanode sends namenode (dfs used space + remaining space, remaining space) per heartbeat. 

The namenode web UI and "dfsadmin -report" now show the total (dfs used space + remaining space) as the 'capacity'. This may be more accurate as the dfs capacity, but it may confuse users.
If reserved space is set to 0, the capacity will keep changing as mapreduce uses that space.

Can I ask for 
o Capacity (df total)
o DFS Used 
o DFS Remaining  ?






> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.14.0
>
>         Attachments: usedSpace.patch
>
>


[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515381 ] 

Hadoop QA commented on HADOOP-1463:
-----------------------------------

-1, build or testing failed

2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12362176/usedSpace.patch against trunk revision r559068.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/462/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/462/console

Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.15.0
>
>         Attachments: usedSpace.patch, usedSpace.patch
>
>


[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508333 ] 

Raghu Angadi commented on HADOOP-1463:
--------------------------------------

'df' is very cheap so calculating at every heart beat not bad either.

> Shall we enforce that only one of them is non-zero? 
Not necessarily; it depends on how 'remaining space' and 'reserved space' are calculated in your equations above. How are those calculated?

According to the requirements of this jira, this is how I understand the calculation of available space for a datanode:
{noformat}
remaining space for Datanode = 
      Min( (dfs.datanode.du.pct * Total_Capacity -  cur_space_used_by_Datanode), 
               cur_disk_available) - dfs.datanode.du.reserved
{noformat}
cur_disk_available (and total capacity) comes from 'df' and cur_space_used_by_Datanode is based on 'du'.
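
A minimal sketch of the formula above, written in Python purely for illustration. The function name, parameter names, and the clamp at zero are assumptions of this sketch and are not taken from the patch; all sizes are in bytes.

```python
def remaining_space_raghu(total_capacity, used_by_datanode, disk_available,
                          du_pct, du_reserved):
    """Remaining space for a datanode following the formula quoted above."""
    # Space still allowed to DFS if it may occupy at most du_pct of the disk.
    quota_left = du_pct * total_capacity - used_by_datanode
    # Never promise more than 'df' says is free, then subtract the reserve.
    remaining = min(quota_left, disk_available) - du_reserved
    # Clamping at zero is an assumption of this sketch, not part of the formula.
    return max(0, int(remaining))
```

With a 100 GB disk, du_pct = 0.5, 40 GB used by the datanode, 25 GB free and no absolute reserve, this returns 10 GB, because the percentage quota (10 GB) is smaller than the free space.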

The descriptions of these config variables should probably change. 




[jira] Updated: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1463:
----------------------------------

    Status: Open  (was: Patch Available)



[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511887 ] 

Hairong Kuang commented on HADOOP-1463:
---------------------------------------

+1 on Koji's proposal 2.

I am reading more of the code. The current implementation interprets the reserved space as the space reserved per volume. We want it to be the space reserved per datanode, right?
I also found out that the period for running "df" is configurable in dfs by setting the value of "dfs.df.interval". The default value is 3000 msec. Should we change the default value to 1 min?
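
To illustrate what a refresh interval such as "dfs.df.interval" controls, here is a small Python sketch of a cached free-space reading. It is not Hadoop's implementation: the class name, the injectable clock and probe, and the use of shutil.disk_usage in place of running 'df' are all assumptions made for the example.

```python
import shutil
import time


class CachedDiskFree:
    """Caches the free-space reading for a path, refreshing at most once per interval."""

    def __init__(self, path, interval_ms=3000, clock=time.monotonic, probe=None):
        self._path = path
        self._interval_s = interval_ms / 1000.0
        self._clock = clock
        # Default probe uses the standard library instead of shelling out to 'df'.
        self._probe = probe or (lambda p: shutil.disk_usage(p).free)
        self._last_refresh = None
        self._value = None

    def available(self):
        now = self._clock()
        # Refresh on first use, or once the interval has elapsed.
        if self._last_refresh is None or now - self._last_refresh >= self._interval_s:
            self._value = self._probe(self._path)
            self._last_refresh = now
        return self._value
```

A longer interval means fewer probes but a staler value between refreshes, which is the trade-off behind the question about the default.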



[jira] Updated: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1463:
----------------------------------

    Attachment: usedSpace.patch



[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502045 ] 

Raghu Angadi commented on HADOOP-1463:
--------------------------------------

+1, except that during a block report we should probably reset with 'du' instead of summing over block sizes, so that it takes all the other overhead of the Datanode directory into account (native filesystem, directories, 'previous' directories, metadata files, tmp directory, etc.). But it could be updated with block sizes as you described. The accrued error until the next block report would be small.
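
The idea above (cheap incremental updates between block reports, with a periodic reset from a real measurement) can be sketched as follows. This is an illustrative Python sketch only; the class and method names are hypothetical, and measure_du stands in for whatever actually runs 'du' on the data directories.

```python
class UsedSpaceTracker:
    """Tracks DFS used space incrementally and resets it from a real measurement."""

    def __init__(self, measure_du):
        # measure_du: callable returning the bytes currently used on disk.
        self._measure_du = measure_du
        self.used = measure_du()

    def block_added(self, size):
        # Cheap update between block reports; ignores metadata and directory overhead.
        self.used += size

    def block_removed(self, size):
        self.used = max(0, self.used - size)

    def on_block_report(self):
        # Discard any accrued error by re-measuring the whole directory.
        self.used = self._measure_du()
```

Between block reports the figure drifts only by the overhead not captured in block sizes, and each block report brings it back to the measured value.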




[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512685 ] 

Hadoop QA commented on HADOOP-1463:
-----------------------------------

-1, build or testing failed

2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12361823/usedSpace.patch against trunk revision r555813.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/413/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/413/console

Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.



[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514028 ] 

dhruba borthakur commented on HADOOP-1463:
------------------------------------------

+1. Code looks good. 

1. This patch invokes 'du', so this utility should be in the default PATH of the hadoop daemons. This is probably true because 'df' is already in the path.
2. The format of the output from du on cygwin seems to be the same as that on Linux. So, it should work on Windows too.
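
For readers unfamiliar with the 'du' output format being discussed, here is a small Python sketch of invoking it and parsing the result. The thread does not show which flags the patch passes, so the use of 'du -sk' (one summary line, size in kilobytes) and the function names are assumptions of this example.

```python
import subprocess


def parse_du_output(output):
    """Return bytes used from 'du -sk' output of the form '<kilobytes><whitespace><path>'."""
    first_line = output.strip().splitlines()[0]
    kilobytes = int(first_line.split(None, 1)[0])
    return kilobytes * 1024


def dir_used_bytes(path):
    """Run 'du -sk' on path; requires 'du' to be on the PATH of the calling process."""
    result = subprocess.run(["du", "-sk", path],
                            capture_output=True, text=True, check=True)
    return parse_du_output(result.stdout)
```

Because only the first whitespace-separated field is read, the parser works as long as the platform prints the size first and the path second.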



[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Koji Noguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511871 ] 

Koji Noguchi commented on HADOOP-1463:
--------------------------------------

> From my understanding dfs.datanode.du.pct and dfs.datanode.du.reserved are two different ways of specifying reserved space.
>
I used to think the same. But there shouldn't be two config variables that serve the same purpose.
So, assuming "dfs.datanode.du.reserved" is the one for "space reserved for non-dfs usage *whether it is used or unused*",
I'd want:

{noformat}
MIN( Total_Capacity -  cur_space_used_by_Datanode - dfs.datanode.du.reserved, cur_disk_available)
{noformat}

I'm not sure where "dfs.datanode.du.pct" should fit.  Maybe 

{noformat}
MIN( Total_Capacity -  cur_space_used_by_Datanode - dfs.datanode.du.reserved, cur_disk_available) * dfs.datanode.du.pct
{noformat}
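
The two candidate formulas above can be sketched as one Python function, where du_pct = 1.0 gives the first variant and any other value gives the second. The function name, parameter names, and the clamp at zero are assumptions of this sketch; all sizes are in bytes.

```python
def remaining_space_koji(total_capacity, used_by_datanode, disk_available,
                         du_reserved, du_pct=1.0):
    """Remaining space when the reserve counts whether it is used or unused."""
    # Space DFS may still take once its own usage and the reserve are excluded.
    budget = total_capacity - used_by_datanode - du_reserved
    # Cap by what is actually free, then optionally scale by the percentage.
    remaining = min(budget, disk_available) * du_pct
    # Clamping at zero is an assumption of this sketch.
    return max(0, int(remaining))
```

Unlike a formula based only on free space, this one needs cur_space_used_by_Datanode, so it depends on a 'du'-style measurement.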




[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514330 ] 

Doug Cutting commented on HADOOP-1463:
--------------------------------------

> Hmm.. TestDecommission passed on my machine.

Passes on mine too.  I think the patch is fine.



[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511860 ] 

Hairong Kuang commented on HADOOP-1463:
---------------------------------------

Not sure if Raghu's formula is correct. The key question is how reserved space is defined. From my understanding, dfs.datanode.du.pct and dfs.datanode.du.reserved are two different ways of specifying reserved space. dfs.datanode.du.reserved gives an absolute value while dfs.datanode.du.pct gives a percentage. That's why I said that we need only one of them. Reserved space is for non-dfs usage, including space used by map/reduce.



[jira] Assigned: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang reassigned HADOOP-1463:
-------------------------------------

    Assignee: Hairong Kuang



[jira] Updated: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1463:
----------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513108 ] 

Hairong Kuang commented on HADOOP-1463:
---------------------------------------

Koji is examining the interface and I am waiting for his comment.



[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514074 ] 

Hairong Kuang commented on HADOOP-1463:
---------------------------------------

Hmm.. TestDecommission passed on my machine.



[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513079 ] 

dhruba borthakur commented on HADOOP-1463:
------------------------------------------

+1. Code looks good. Is it possible to write a unit test (maybe later) for this one?



[jira] Updated: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1463:
----------------------------------

    Attachment: usedSpace.patch

A patch for review:
1. The datanode sends the namenode (dfs used space + remaining space, remaining space) in each heartbeat. The dfs remaining space and used space are calculated as we discussed.
2. Fix a bug in printing a double in FSShell.
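
A rough Python sketch of the heartbeat pair described in item 1. The type and function names are hypothetical, and the assumption that the receiving side keeps deriving used space as capacity minus remaining comes from the issue description, not from the patch itself.

```python
from typing import NamedTuple


class Heartbeat(NamedTuple):
    capacity: int   # reported as dfs used space + remaining space
    remaining: int  # space dfs can still use


def make_heartbeat(dfs_used, remaining):
    """Build the (capacity, remaining) pair a datanode would report."""
    return Heartbeat(capacity=dfs_used + remaining, remaining=remaining)


def cluster_used(heartbeats):
    """Used space as derived on the receiving side: capacity minus remaining."""
    return sum(hb.capacity - hb.remaining for hb in heartbeats)
```

Because capacity is defined as used plus remaining, the existing subtraction yields exactly the space dfs occupies, without adding a third field to the heartbeat.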

> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.14.0
>
>         Attachments: usedSpace.patch



[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502000 ] 

Raghu Angadi commented on HADOOP-1463:
--------------------------------------


The meaning of "Reserved" that the current code implements is: "Space that the Datanode should try to keep *free* for immediate or future use by either the Datanode or some other application". In that sense I think the current code calculates it well. Note that this calculation does not require a costly du.

After chatting with Hairong, I think Koji's impression of "Reserved" is: "Disk space that the Datanode excludes from its calculations whether it is used or free". E.g. if there is a partition of 100 GB, 50% is reserved, DFS occupies 40 GB, and 25 GB is "available", then "remaining" should be 10 GB. This requires the equivalent of 'du' for the Datanode's data. With the previous interpretation, remaining would be zero.

If we want the latter, then we need to either run 'du' or keep track of the disk space used, so that it can be sent with every heartbeat.
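
To make the difference between the two readings concrete, here is a small Python sketch of the arithmetic for the 100 GB example. The function names are made up for illustration, and clamping negative results to zero is an assumption rather than something stated above.

```python
GB = 1024 ** 3


def remaining_keep_free(available: int, reserved: int) -> int:
    """Current reading: 'reserved' is space that must stay free on the disk."""
    return max(0, available - reserved)  # clamp at zero (assumption)


def remaining_set_aside(capacity: int, reserved: int, dfs_used: int) -> int:
    """Alternative reading: 'reserved' is excluded from dfs whether used or free.

    Needs dfs_used, i.e. the equivalent of 'du' over the datanode's data.
    """
    return max(0, capacity - reserved - dfs_used)  # clamp at zero (assumption)


# The example above: 100 GB partition, 50% reserved, 40 GB of dfs data,
# 25 GB reported as available.
capacity, reserved = 100 * GB, 50 * GB
print(remaining_keep_free(25 * GB, reserved) // GB)            # 0
print(remaining_set_aside(capacity, reserved, 40 * GB) // GB)  # 10
```

Only the second function needs to know how much space dfs itself occupies, which is where the cost of 'du' (or of maintaining a running total) comes in.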




> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>             Fix For: 0.14.0



[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513085 ] 

dhruba borthakur commented on HADOOP-1463:
------------------------------------------

Hairong is uploading a new patch that is merged with the latest trunk and includes a small bug fix.

> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.14.0
>
>         Attachments: usedSpace.patch



[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508301 ] 

Doug Cutting commented on HADOOP-1463:
--------------------------------------

> run "df" when datanode gets started to get the data node's disk space and the reserved space

We should also run "df" periodically, since other applications (like MapReduce) may be using disk space too. Once per heartbeat is probably excessive, but it should run at least once every few minutes.
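
One possible way to get "periodically, but not on every heartbeat" is to cache the result and re-probe only when it is older than some interval. The Python sketch below shows the idea. CachedDiskUsage and the 300-second default are hypothetical, and shutil.disk_usage merely stands in for running "df".

```python
import shutil
import time
from typing import Callable, Optional, Tuple


class CachedDiskUsage:
    """Hypothetical helper: re-probe the disk at most once per interval."""

    def __init__(self, path: str, interval_s: float = 300.0,
                 probe: Callable[[str], Tuple[int, int, int]] = shutil.disk_usage,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._path = path
        self._interval_s = interval_s
        self._probe = probe
        self._clock = clock
        self._last_probe: Optional[float] = None  # None until the first probe
        self._free = 0

    def free_bytes(self) -> int:
        """Return free space, refreshing it only when the cached value is stale."""
        now = self._clock()
        if self._last_probe is None or now - self._last_probe >= self._interval_s:
            # shutil.disk_usage returns (total, used, free), much like "df".
            self._free = self._probe(self._path)[2]
            self._last_probe = now
        return self._free
```

A heartbeat can then call free_bytes() every time without paying for a disk probe on each call.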

> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.14.0



[jira] Updated: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1463:
----------------------------------

    Status: Patch Available  (was: Open)

Although this issue is still awaiting review comments, I am marking it as Patch Available so that it can be committed for release 0.14.

> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.14.0
>
>         Attachments: usedSpace.patch



[jira] Updated: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Nigel Daley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nigel Daley updated HADOOP-1463:
--------------------------------

    Fix Version/s:     (was: 0.14.0)
                   0.15.0

> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.15.0
>
>         Attachments: usedSpace.patch, usedSpace.patch



[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Koji Noguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513958 ] 

Koji Noguchi commented on HADOOP-1463:
--------------------------------------

I really like the fact that "DFS remaining" is no longer affected by non-dfs usage as long as that usage stays within the reserved space.
As for the webUI, I'm not sure we want two lines for "DFS Used".




> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>         Attachments: usedSpace.patch



[jira] Commented: (HADOOP-1463) dfs should report total size of all the space that dfs is using

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508291 ] 

Hairong Kuang commented on HADOOP-1463:
---------------------------------------

To summarize what we have discussed:

each data node's disk space = dfs used space + reserved space + remaining space

where dfs used space is the sum of the sizes of all data directories, reserved space is set aside for non-dfs usage whether or not it is actually used, and remaining space is what is left for future dfs usage.

dfs capacity = dfs used space + remaining space

The data node sends dfs capacity and remaining space to the namenode with each heartbeat.

I plan to run "df" when the datanode starts to get the data node's disk space and the reserved space. I plan to keep track of dfs used space by running "du" when a block report is sent and adjusting the result whenever a block is written or deleted.
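
The bookkeeping described above might be sketched in Python as follows. This is an illustration of the two identities and the "du plus adjustments" idea only. DfsUsedTracker and heartbeat_pair are hypothetical names, and clamping at zero is an assumption.

```python
from typing import Tuple


class DfsUsedTracker:
    """Hypothetical running total of the bytes held in the data directories."""

    def __init__(self) -> None:
        self.used = 0

    def resync(self, du_bytes: int) -> None:
        # Called when a block report is sent: trust a fresh "du" result.
        self.used = du_bytes

    def block_written(self, size: int) -> None:
        self.used += size

    def block_deleted(self, size: int) -> None:
        self.used = max(0, self.used - size)  # clamp at zero (assumption)


def heartbeat_pair(disk_space: int, reserved: int, dfs_used: int) -> Tuple[int, int]:
    """Return (dfs capacity, remaining space) from the identities above."""
    # disk space = dfs used + reserved + remaining, solved for remaining.
    remaining = max(0, disk_space - reserved - dfs_used)  # clamp is an assumption
    # dfs capacity = dfs used + remaining.
    capacity = dfs_used + remaining
    return capacity, remaining
```

Between block reports the total drifts only by whatever the write and delete adjustments miss, and the next resync corrects it.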

Please comment if you have any other opinion.

Regarding the reserved space, currently hadoop-default.xml supports the following two properties. Shall we enforce that only one of them is non-zero?
<code>
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>0</value>
  <description>Reserved space in bytes. Always leave this much space free for non dfs use
  </description>
</property>

<property>
  <name>dfs.datanode.du.pct</name>
  <value>0.98f</value>
  <description>When calculating remaining space, only use this percentage of the real available space
  </description>
</property>
</code>




> dfs should report total size of all the space that dfs is using
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1463
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1463
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.14.0
