Posted to hdfs-user@hadoop.apache.org by Yogini Gulkotwar <yo...@flutura.com> on 2013/08/09 10:58:17 UTC

Discrepancy in the values of consumed disk space by hadoop

Hi All,

I have a CDH4 hadoop cluster setup with 3 datanodes and a data replication
factor of 2.

When I check the consumed DFS space, I get different values from the
"hdfs dfsadmin -report" and "hdfs fsck" commands.
Could anyone please help me understand the reason behind the discrepancy in
the values?

 I get the following output:

# sudo -u hdfs hdfs dfsadmin -report


Configured Capacity: 321252989337600 (292.18 TB)
Present Capacity: 264896108259328 (240.92 TB)
DFS Remaining: 264665811648512 (240.71 TB)
DFS Used: 230296610816 (214.48 GB)
DFS Used%: 0.09%
Under replicated blocks: 19
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 3 (3 total, 0 dead)

Live datanodes:
Name: (slave1)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 107084329779200 (97.39 TB)
DFS Used: 77728510976 (72.39 GB)
Non DFS Used: 18784664751104 (17.08 TB)
DFS Remaining: 88221936517120 (80.24 TB)
DFS Used%: 0.07%
DFS Remaining%: 82.39%
Last contact: Fri Aug 09 13:26:38 IST 2013


Name: (slave3)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 107084329779200 (97.39 TB)
DFS Used: 76206287872 (70.97 GB)
Non DFS Used: 18786185925632 (17.09 TB)
DFS Remaining: 88221937565696 (80.24 TB)
DFS Used%: 0.07%
DFS Remaining%: 82.39%
Last contact: Fri Aug 09 13:26:37 IST 2013


Name: (slave2)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 107084329779200 (97.39 TB)
DFS Used: 76361811968 (71.12 GB)
Non DFS Used: 18786030401536 (17.09 TB)
DFS Remaining: 88221937565696 (80.24 TB)
DFS Used%: 0.07%
DFS Remaining%: 82.39%

--------------------------------------------------------------------------------------------------------------------------
# sudo -u hdfs hadoop fsck /


Connecting to namenode via http://master1:50070


Status: HEALTHY
 Total size: 75245213337 B
 Total dirs: 3203
 Total files: 7893
 Total blocks (validated): 7642 (avg. block size 9846272 B)
 Minimally replicated blocks: 7642 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 19 (0.24862601 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 2
 Average block replication: 2.0024862
 Corrupt blocks: 0
 Missing replicas: 133 (0.86162215 %)
 Number of data-nodes: 3
 Number of racks: 1
FSCK ended at Fri Aug 09 14:01:47 IST 2013 in 266 milliseconds


The filesystem under path '/' is HEALTHY

----------------------------------------------------------------------------------------------------------------------------------------------------


# sudo -u hdfs hadoop fs -count -q /
  2147483647      2147472547            none             inf         3203            7897     75245470999 /



Thanks & Regards,
Yogini Gulkotwar
Flutura Decision Sciences & Analytics, Bangalore
Email: yogini.gulkotwar@flutura.com
Website: www.fluturasolutions.com

Re: Discrepancy in the values of consumed disk space by hadoop

Posted by Harsh J <ha...@cloudera.com>.
There isn't a "discrepancy" as such, but read on: DFS Used counts disk
space consumed across the DataNodes, while fsck counts the logical
lengths of files stored in HDFS. The former includes every replica plus
the space taken by block checksum metadata; the latter does not.

A small (but evidently significant) fraction of your files use a
replication factor higher than the default of 2, so simply dividing
DFS Used by 2 will not reproduce the fsck total.
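As a back-of-the-envelope check (a sketch, not an exact accounting: it
assumes the default checksum overhead of a 4-byte CRC per 512 bytes of
block data), the fsck figures above can be combined to estimate what
dfsadmin should report:

```python
# Reconcile "DFS Used" (dfsadmin) with "Total size" (fsck) using the
# numbers from the report above. Checksum overhead assumes the HDFS
# default of one 4-byte CRC per 512 bytes of block data (~0.78%).

fsck_total_size = 75_245_213_337   # logical bytes, fsck "Total size"
avg_replication = 2.0024862        # fsck "Average block replication"
dfs_used = 230_296_610_816         # bytes, dfsadmin "DFS Used"

replicated_bytes = fsck_total_size * avg_replication
checksum_bytes = replicated_bytes * 4 / 512
expected = replicated_bytes + checksum_bytes

gib = 1024 ** 3
print(f"expected on-disk usage: {expected / gib:.1f} GiB")   # ~141.4 GiB
print(f"reported DFS Used:      {dfs_used / gib:.1f} GiB")   # ~214.5 GiB
print(f"unexplained:            {(dfs_used - expected) / gib:.1f} GiB")
```

Roughly 73 GiB is left unexplained by replication and checksums alone,
which is what makes the stale-directory check below worth doing.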

It is also worth checking whether the data directories configured on
your DataNodes contain older subdirectories left over from past
installs, especially if you are confident that the few files with
higher replication factors are too small to account for the extra
space.
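One way to hunt for such leftovers is to list whatever sits in a data
directory besides the layout a healthy DataNode writes. This is a
hypothetical sketch: the expected top-level entries (`current`,
`VERSION`, `in_use.lock`) match CDH4-era DataNodes, and the script
builds a mock directory in a temp location purely to demonstrate;
point `suspicious_entries` at your real `dfs.datanode.data.dir` paths
instead.

```python
# Flag top-level entries in a DataNode data dir that a healthy DN
# would not have, and measure how much disk they consume.
import os
import tempfile

EXPECTED = {"current", "VERSION", "in_use.lock"}

def suspicious_entries(data_dir):
    """Top-level entries not part of the expected DN layout."""
    return sorted(set(os.listdir(data_dir)) - EXPECTED)

def du_bytes(path):
    """Sum file sizes under path, like `du -sb`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# Mock data dir: a live "current" tree plus a stale dir from an old install.
data_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(data_dir, "current", "finalized"))
os.makedirs(os.path.join(data_dir, "hadoop-0.20-data"))  # leftover
with open(os.path.join(data_dir, "hadoop-0.20-data", "blk_old"), "wb") as f:
    f.write(b"x" * 4096)

print(suspicious_entries(data_dir))  # ['hadoop-0.20-data']
print(du_bytes(os.path.join(data_dir, "hadoop-0.20-data")))  # 4096
```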




-- 
Harsh J
