Posted to user@hbase.apache.org by Sever Fundatureanu <fu...@gmail.com> on 2012/07/03 11:54:51 UTC

HBase table disk usage

Hello,

I have a simple table with 1.5 billion rows and one column family 'F'.
Each row key is 33 bytes and the cell values are empty. Doing the math
(33-byte row key + 1-byte family name per row), I would expect this table
to take up (33+1)x1.5*10^9 = 51GB. However, if I do a "hadoop dfs -du" I
get that the table takes up ~82GB. This is after running major compactions
a couple of times. Can someone explain where this difference comes from?
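Back-of-the-envelope, with a placeholder table name:

  # expected: 33-byte row key + 1-byte family name per row
  $ echo "(33 + 1) * 1500000000" | bc
  51000000000
  # observed (prints per-file sizes in bytes):
  $ hadoop dfs -du /hbase/MYTABLE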

Regards,
-- 
Sever Fundatureanu

Vrije Universiteit Amsterdam
E-mail: fundatureanu.sever@gmail.com

RE: HBase table disk usage

Posted by Anoop Sam John <an...@huawei.com>.
Hi,

The KV storage will be like

KeyLength (4 bytes)
+ ValueLength (4 bytes)
+ RowKeyLength (2 bytes)
+ RowKey (... bytes)
+ CFLength (1 byte)
+ CF (... bytes)
+ Qualifier (... bytes)
+ Timestamp (8 bytes)
+ Type (1 byte)
+ Value (... bytes)

If you are using HFile V2, a memstoreTS is also stored with every KV. This adds 1 to 4 bytes per KV (mostly 1 byte, since the value is reset to 0 during compaction).

Now you can check whether the size you found matches the expected total.

If you are using 0.94, there is a block encoding feature with which most of these extra bytes, other than the key and value themselves, can be encoded down to a smaller size.
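A quick sketch of the expected size for your schema (33-byte row key, 1-byte family 'F', empty qualifier and value; plain shell arithmetic, nothing HBase specific):

  # per-KV size following the layout above:
  # 4 (KeyLength) + 4 (ValueLength) + 2 (RowKeyLength) + 33 (RowKey)
  # + 1 (CFLength) + 1 (CF 'F') + 0 (Qualifier) + 8 (Timestamp)
  # + 1 (Type) + 0 (Value)
  $ echo "4 + 4 + 2 + 33 + 1 + 1 + 0 + 8 + 1" | bc
  54
  # across 1.5 billion KVs:
  $ echo "54 * 1500000000" | bc
  81000000000

Add ~1 byte of memstoreTS per KV under HFile V2 and you land around 82.5*10^9 bytes, which is right in the range of the ~82GB you are seeing.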

-Anoop-
________________________________________
From: Sever Fundatureanu [fundatureanu.sever@gmail.com]
Sent: Tuesday, July 03, 2012 8:36 PM
To: user@hbase.apache.org
Subject: Re: HBase table disk usage

I was only du'ing the table dir. The tmp dirs only had a couple of hundred
bytes in my case.
The HFile tool only gives avgKeyLen=46, which does not include the 4-byte
KeyLength and 4-byte ValueLength fields. Adding those, I indeed get a
total of 54 bytes/KV x 1.5 billion ~= 81GB. The remaining ~1GB is probably
per-KV overhead such as the HFile V2 memstoreTS, plus block index and
other HFile metadata; "hadoop dfs -du" reports actual file lengths, so
partially filled HDFS blocks should not show up in it.

Thanks,
Sever


On Tue, Jul 3, 2012 at 2:29 PM, Stack <st...@duboce.net> wrote:

> On Tue, Jul 3, 2012 at 2:17 PM, Sever Fundatureanu
> <fu...@gmail.com> wrote:
> > Right, I forgot about the timestamps. These should be a long each, so
> > 8 bytes. Max versions is set to 1, so versioning shouldn't add
> > anything. Note the column qualifier is also empty on each entry.
> >
> > So now we get (33+1+8)x1.5*10^9 = 63GB, still a 19GB difference...
> >
>
> What about regionserver WAL logs?  Are you including these in your math
> or are you just du'ing the table dir?  The table dir can have tmp dirs
> for compaction and split work.  And following on from Michael Segel's
> point, the KV has a type byte as well as some length fields used for
> finding offsets within the KV; take a looksee w/ the hfile tool:
> http://hbase.apache.org/book.html#hfile_tool2
>
> St.Ack
>



--
Sever Fundatureanu

Vrije Universiteit Amsterdam
E-mail: fundatureanu.sever@gmail.com

Re: HBase table disk usage

Posted by Sever Fundatureanu <fu...@gmail.com>.
I was only du'ing the table dir. The tmp dirs only had a couple of hundred
bytes in my case.
The HFile tool only gives avgKeyLen=46, which does not include the 4-byte
KeyLength and 4-byte ValueLength fields. Adding those, I indeed get a
total of 54 bytes/KV x 1.5 billion ~= 81GB. The remaining ~1GB is probably
per-KV overhead such as the HFile V2 memstoreTS, plus block index and
other HFile metadata; "hadoop dfs -du" reports actual file lengths, so
partially filled HDFS blocks should not show up in it.
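For anyone re-checking this, the numbers can be pulled straight off HDFS (the table path below is just a placeholder):

  # per-file sizes (bytes) under the table dir
  $ hadoop dfs -du /hbase/MYTABLE
  # file lengths plus the HDFS blocks backing each file
  $ hadoop fsck /hbase/MYTABLE -files -blocks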

Thanks,
Sever


On Tue, Jul 3, 2012 at 2:29 PM, Stack <st...@duboce.net> wrote:

> On Tue, Jul 3, 2012 at 2:17 PM, Sever Fundatureanu
> <fu...@gmail.com> wrote:
> > Right, I forgot about the timestamps. These should be a long each, so
> > 8 bytes. Max versions is set to 1, so versioning shouldn't add
> > anything. Note the column qualifier is also empty on each entry.
> >
> > So now we get (33+1+8)x1.5*10^9 = 63GB, still a 19GB difference...
> >
>
> What about regionserver WAL logs?  Are you including these in your math
> or are you just du'ing the table dir?  The table dir can have tmp dirs
> for compaction and split work.  And following on from Michael Segel's
> point, the KV has a type byte as well as some length fields used for
> finding offsets within the KV; take a looksee w/ the hfile tool:
> http://hbase.apache.org/book.html#hfile_tool2
>
> St.Ack
>



-- 
Sever Fundatureanu

Vrije Universiteit Amsterdam
E-mail: fundatureanu.sever@gmail.com

Re: HBase table disk usage

Posted by Stack <st...@duboce.net>.
On Tue, Jul 3, 2012 at 2:17 PM, Sever Fundatureanu
<fu...@gmail.com> wrote:
> Right, I forgot about the timestamps. These should be a long each, so 8
> bytes. Max versions is set to 1, so versioning shouldn't add anything.
> Note the column qualifier is also empty on each entry.
>
> So now we get (33+1+8)x1.5*10^9 = 63GB, still a 19GB difference...
>

What about regionserver WAL logs?  Are you including these in your math
or are you just du'ing the table dir?  The table dir can have tmp dirs
for compaction and split work.  And following on from Michael Segel's
point, the KV has a type byte as well as some length fields used for
finding offsets within the KV; take a looksee w/ the hfile tool:
http://hbase.apache.org/book.html#hfile_tool2
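Something along these lines should dump the file-info meta (avgKeyLen and
friends); the path below is made up:

  $ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -m -f \
      hdfs://namenode:8020/hbase/MYTABLE/REGION/F/HFILE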

St.Ack

Re: HBase table disk usage

Posted by Sever Fundatureanu <fu...@gmail.com>.
Right, I forgot about the timestamps. These should be a long each, so 8
bytes. Max versions is set to 1, so versioning shouldn't add anything.
Note the column qualifier is also empty on each entry.

So now we get (33+1+8)x1.5*10^9 = 63GB, still a 19GB difference...
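Rough check in shell:

  $ echo "(33 + 1 + 8) * 1500000000" | bc
  63000000000
  # vs ~82*10^9 bytes from du, so ~19*10^9 still unexplained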

Thanks,
Sever

On Tue, Jul 3, 2012 at 1:48 PM, Michael Segel <mi...@hotmail.com> wrote:

> Timestamps on the cells themselves?
> # Versions?
>
> On Jul 3, 2012, at 4:54 AM, Sever Fundatureanu wrote:
>
> > Hello,
> >
> > I have a simple table with 1.5 billion rows and one column family 'F'.
> > Each row key is 33 bytes and the cell values are empty. Doing the math
> > (33-byte row key + 1-byte family name per row), I would expect this
> > table to take up (33+1)x1.5*10^9 = 51GB. However, if I do a "hadoop
> > dfs -du" I get that the table takes up ~82GB. This is after running
> > major compactions a couple of times. Can someone explain where this
> > difference comes from?
> >
> > Regards,
> > --
> > Sever Fundatureanu
> >
> > Vrije Universiteit Amsterdam
> > E-mail: fundatureanu.sever@gmail.com
>
>


-- 
Sever Fundatureanu

Vrije Universiteit Amsterdam
E-mail: fundatureanu.sever@gmail.com

Re: HBase table disk usage

Posted by Michael Segel <mi...@hotmail.com>.
Timestamps on the cells themselves? 
# Versions? 

On Jul 3, 2012, at 4:54 AM, Sever Fundatureanu wrote:

> Hello,
> 
> I have a simple table with 1.5 billion rows and one column family 'F'.
> Each row key is 33 bytes and the cell values are empty. Doing the math
> (33-byte row key + 1-byte family name per row), I would expect this table
> to take up (33+1)x1.5*10^9 = 51GB. However, if I do a "hadoop dfs -du" I
> get that the table takes up ~82GB. This is after running major compactions
> a couple of times. Can someone explain where this difference comes from?
> 
> Regards,
> -- 
> Sever Fundatureanu
> 
> Vrije Universiteit Amsterdam
> E-mail: fundatureanu.sever@gmail.com