Posted to dev@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2008/10/03 18:52:44 UTC

[jira] Commented: (HBASE-911) Minimize filesystem footprint

    [ https://issues.apache.org/jira/browse/HBASE-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636673#action_12636673 ] 

stack commented on HBASE-911:
-----------------------------

I took a look.  Blocks are not all 64MB in size.  The last block in a file is the size of the file's tail.
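
A quick way to see this for any multi-block file is fsck, which prints each block with its byte length (the path below is just a placeholder):

{code}
[branch-0.18]$ ./bin/hadoop fsck /some/big/file -files -blocks
{code}

Every block but the last should report the full 64MB; the tail block reports the remainder.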

I set up a clean hdfs on four nodes.  I took the size of the dfs directory:

{code}
[branch-0.18]$ for i in `cat conf/slaves`; do ssh $i "du -sb  /bfd/hadoop-stack/dfs"; done
37527   /bfd/hadoop-stack/dfs
20795   /bfd/hadoop-stack/dfs
20795   /bfd/hadoop-stack/dfs
20794   /bfd/hadoop-stack/dfs
{code}

Next I uploaded a 98-byte file into hdfs:

{code}
[branch-0.18]$ ls -la /tmp/xxxx.txt
-rw-r--r-- 1 stack powerset 98 Sep 26 23:54 /tmp/xxxx.txt
[stack@aa0-000-12 branch-0.18]$ ./bin/hadoop fs -put /tmp/xxxx.txt /
{code}

Then I did a new listing:
{code}
[branch-0.18]$ for i in `cat conf/slaves`; do ssh $i "du -sb  /bfd/hadoop-stack/dfs"; done
37840   /bfd/hadoop-stack/dfs
20904   /bfd/hadoop-stack/dfs
20904   /bfd/hadoop-stack/dfs
20794   /bfd/hadoop-stack/dfs
{code}

Sizes changed in three locations, one per replica (the default replication factor is 3).
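
Running fsck on the uploaded file confirms the placement; with -files -blocks -locations (standard fsck options), it should list the single 98-byte block together with its three datanode locations:

{code}
[branch-0.18]$ ./bin/hadoop fsck /xxxx.txt -files -blocks -locations
{code}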

Listing the dfs data directory on one of the replicas, I see a block of size 98 bytes and some accompanying metadata:

{code}
[branch-0.18]$ ls -la /bfd/hadoop-stack/dfs/data/current/
total 20
drwxr-sr-x 2 stack powerset 4096 Oct  3 16:40 .
drwxr-sr-x 5 stack powerset 4096 Oct  3 16:39 ..
-rw-r--r-- 1 stack powerset  158 Oct  3 16:39 VERSION
-rw-r--r-- 1 stack powerset   98 Oct  3 16:40 blk_-343955609951300745
-rw-r--r-- 1 stack powerset   11 Oct  3 16:40 blk_-343955609951300745_1001.meta
-rw-r--r-- 1 stack powerset    0 Oct  3 16:39 dncp_block_verification.log.curr
{code}
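
The 11-byte .meta file looks like checksum metadata.  If I read the on-disk format right, it is a small header (2-byte version, 1-byte checksum type, 4-byte bytesPerChecksum) followed by one 4-byte CRC32 per 512-byte chunk of block data, and a 98-byte block needs exactly one checksum:

{code}
[branch-0.18]$ echo $((2 + 1 + 4 + 4))
11
{code}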

> Minimize filesystem footprint
> -----------------------------
>
>                 Key: HBASE-911
>                 URL: https://issues.apache.org/jira/browse/HBASE-911
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>
> This issue is about looking into how much space in the filesystem hbase uses.  Daniel Ploeg suggests that hbase is profligate in its use of space in hdfs.  Given that block sizes by default are 64MB, and that every time hbase writes a store file it is accompanied by an index file and a very small metadata file, that's 3*64MB even if the file is empty (TODO: Prove this).  The situation is aggravated by the fact that hbase flushes whatever is in memory every 30 minutes to minimize loss in the absence of appends; this latter action makes for lots of small files.
> The solution to the above is to implement append, so the optional flush is not necessary, and a file format that aggregates info, index, and data all in one file.  Short-term, we should set the block size on the info/metadata file down to 4k or some such small size and look into doing likewise for the mapfile index.
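
On the short-term suggestion above, block size in hdfs is a per-file property that can be set at write time.  Assuming FsShell honors the generic -D option (it is run through ToolRunner), something like the following would upload a file with a 4k block size:

{code}
[branch-0.18]$ ./bin/hadoop fs -D dfs.block.size=4096 -put /tmp/xxxx.txt /
{code}

Programmatically, FileSystem.create takes a blockSize argument, so the info/index writers could pass a small value there directly.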

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.