Posted to user@hbase.apache.org by Norbert Burger <no...@gmail.com> on 2011/01/24 18:37:27 UTC

estimate HBase DFS filesystem usage

Hi folks - is there a recommended way of estimating HBase HDFS usage for a
new environment?

We have a DEV HBase cluster in place, and from this, I'm trying to estimate
the specs of our not-yet-built PROD environment.  One of the variables we're
considering is HBase usage of HDFS.  What I've just tried is to calculate an
average bytes/record ratio by using "hadoop dfs -du /hbase", and dividing by
the number of records/table.  But this ignores any kind of fixed overhead,
so I have concerns about it.
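
For concreteness, the rough calculation I'm doing looks something like this (just a
sketch; TABLE and ROW_COUNT are placeholders, and the column layout of the -dus
output differs between Hadoop versions, hence the awk):

    TABLE=mytable            # placeholder table name
    ROW_COUNT=1000000        # record count for that table, from our own bookkeeping
    # -dus prints the path plus the aggregate size; pull out the numeric field
    BYTES=$(hadoop dfs -dus /hbase/$TABLE | awk '{for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+$/) print $i}')
    echo "approx bytes/record: $((BYTES / ROW_COUNT))"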

Is there a better way?

Norbert

Re: estimate HBase DFS filesystem usage

Posted by Lars George <la...@gmail.com>.
Benoit,

This is probably what tripped you up: https://issues.apache.org/jira/browse/HBASE-3476

Lars

On Wed, Jan 26, 2011 at 5:53 AM, tsuna <ts...@gmail.com> wrote:
> You can run ``hbase org.apache.hadoop.hbase.io.hfile.HFile -f
> "$region" -m'' where $region is every HFile (located under
> /hbase/$table/*/$family).  This is rather slow [1] for some reason I
> don't quite understand, but it's many orders of magnitude faster than
> MapReducing the entire table.  The output will have information like
> "entryCount" (number of cells in this file), "totalBytes" (size of the
> uncompressed data), "length" (actual size on disk), "avgKeyLen"
> (average number of bytes in a key), "avgValueLen" (average number of
> bytes stored in a cell).
>
> This way you can get detailed information about your table.  The
> results won't be up-to-date to the second, but they'll be pretty
> close.
>
>
>  [1] I recently ran this at SU on a table with about 1200 regions and
> it took 1h 15m to read the meta data of every HFile.  I don't
> understand how this can take so much time.
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com
>

Re: estimate HBase DFS filesystem usage

Posted by tsuna <ts...@gmail.com>.
You can run ``hbase org.apache.hadoop.hbase.io.hfile.HFile -f
"$region" -m'' where $region is every HFile (located under
/hbase/$table/*/$family).  This is rather slow [1] for some reason I
don't quite understand, but it's many orders of magnitude faster than
MapReducing the entire table.  The output will have information like
"entryCount" (number of cells in this file), "totalBytes" (size of the
uncompressed data), "length" (actual size on disk), "avgKeyLen"
(average number of bytes in a key), "avgValueLen" (average number of
bytes stored in a cell).
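
In case a concrete loop helps, something along these lines works (sketch only:
$TABLE and $FAMILY are placeholders, and the grep/awk assumes the older FsShell
-ls output where file lines start with the permission string and the path is the
last column):

    TABLE=mytable        # placeholder
    FAMILY=mycf          # placeholder column family
    # Dump the metadata block of every HFile under the table's regions.
    for f in $(hadoop fs -ls "/hbase/$TABLE/*/$FAMILY" | grep '^-' | awk '{print $NF}'); do
      echo "=== $f"
      hbase org.apache.hadoop.hbase.io.hfile.HFile -f "$f" -m
    done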

This way you can get detailed information about your table.  The
results won't be up-to-date to the second, but they'll be pretty
close.


  [1] I recently ran this at SU on a table with about 1200 regions and
it took 1h 15m to read the meta data of every HFile.  I don't
understand how this can take so much time.

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

Re: estimate HBase DFS filesystem usage

Posted by Norbert Burger <no...@gmail.com>.
Thanks Xavier.  I'll give that a shot.

Norbert

On Mon, Jan 24, 2011 at 1:33 PM, Xavier Stevens <xs...@mozilla.com> wrote:

> Not sure if there is a way to do that.  You could get a really rough
> estimate if you did the job I described and subtracted the total bytes
> calculated for the records from the "hadoop fs -dus /hbase/<table_name>"
> bytes.  Then that would give an idea of the amount of overhead.  I have
> a feeling it is negligible in the grand scheme of things.
>
> -Xavier
>
> On 1/24/11 10:23 AM, Norbert Burger wrote:
> > Good idea.  But it seems like this approach would give me the size of just
> > the raw data itself, ignoring any kind of container (like HFiles) that are
> > used to store the data.  What I'd like ideally is to get an idea of what the
> > fixed cost (in terms of bytes) is for each of my tables, and then understand
> > how I can calculate a variable bytes/record cost.
> >
> > Is this feasible?
> >
> > Norbert
> >
> > On Mon, Jan 24, 2011 at 1:16 PM, Xavier Stevens <xstevens@mozilla.com> wrote:
> >
> >> Norbert,
> >>
> >> It would probably be best if you wrote a quick MapReduce job that
> >> iterates over those records and outputs the sum of bytes for each one.
> >> Then you could use that output and get some general descriptive
> >> statistics based on it.
> >>
> >> Cheers,
> >>
> >>
> >> -Xavier
> >>
> >>
> >> On 1/24/11 9:37 AM, Norbert Burger wrote:
> >>> Hi folks - is there a recommended way of estimating HBase HDFS usage for a
> >>> new environment?
> >>>
> >>> We have a DEV HBase cluster in place, and from this, I'm trying to estimate
> >>> the specs of our not-yet-built PROD environment.  One of the variables we're
> >>> considering is HBase usage of HDFS.  What I've just tried is to calculate an
> >>> average bytes/record ratio by using "hadoop dfs -du /hbase", and dividing by
> >>> the number of records/table.  But this ignores any kind of fixed overhead,
> >>> so I have concerns about it.
> >>>
> >>> Is there a better way?
> >>>
> >>> Norbert
> >>>
>

Re: estimate HBase DFS filesystem usage

Posted by Xavier Stevens <xs...@mozilla.com>.
Not sure if there is a way to do that.  You could get a really rough
estimate if you did the job I described and subtracted the total bytes
calculated for the records from the "hadoop fs -dus /hbase/<table_name>"
bytes.  Then that would give an idea of the amount of overhead.  I have
a feeling it is negligible in the grand scheme of things.
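
Concretely, once the job has reported the summed record bytes, the back-of-the-envelope
version is just (sketch; RECORD_BYTES is whatever total your job printed, and the awk
is only there because the -dus column layout varies by Hadoop version):

    TABLE=mytable                 # placeholder
    RECORD_BYTES=123456789        # total bytes summed by the MapReduce job (placeholder)
    TOTAL_BYTES=$(hadoop fs -dus /hbase/$TABLE | awk '{for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+$/) print $i}')
    echo "approx overhead bytes: $((TOTAL_BYTES - RECORD_BYTES))"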

-Xavier

On 1/24/11 10:23 AM, Norbert Burger wrote:
> Good idea.  But it seems like this approach would give me the size of just
> the raw data itself, ignoring any kind of container (like HFiles) that are
> used to store the data.  What I'd like ideally is to get an idea of what the
> fixed cost (in terms of bytes) is for each of my tables, and then understand
> how I can calculate a variable bytes/record cost.
>
> Is this feasible?
>
> Norbert
>
> On Mon, Jan 24, 2011 at 1:16 PM, Xavier Stevens <xs...@mozilla.com> wrote:
>
>> Norbert,
>>
>> It would probably be best if you wrote a quick MapReduce job that
>> iterates over those records and outputs the sum of bytes for each one.
>> Then you could use that output and get some general descriptive
>> statistics based on it.
>>
>> Cheers,
>>
>>
>> -Xavier
>>
>>
>> On 1/24/11 9:37 AM, Norbert Burger wrote:
>>> Hi folks - is there a recommended way of estimating HBase HDFS usage for a
>>> new environment?
>>>
>>> We have a DEV HBase cluster in place, and from this, I'm trying to estimate
>>> the specs of our not-yet-built PROD environment.  One of the variables we're
>>> considering is HBase usage of HDFS.  What I've just tried is to calculate an
>>> average bytes/record ratio by using "hadoop dfs -du /hbase", and dividing by
>>> the number of records/table.  But this ignores any kind of fixed overhead,
>>> so I have concerns about it.
>>>
>>> Is there a better way?
>>>
>>> Norbert
>>>

Re: estimate HBase DFS filesystem usage

Posted by Norbert Burger <no...@gmail.com>.
Good idea.  But it seems like this approach would give me the size of just
the raw data itself, ignoring any kind of container (like HFiles) that are
used to store the data.  What I'd like ideally is to get an idea of what the
fixed cost (in terms of bytes) is for each of my tables, and then understand
how I can calculate a variable bytes/record cost.

Is this feasible?

Norbert

On Mon, Jan 24, 2011 at 1:16 PM, Xavier Stevens <xs...@mozilla.com> wrote:

> Norbert,
>
> It would probably be best if you wrote a quick MapReduce job that
> iterates over those records and outputs the sum of bytes for each one.
> Then you could use that output and get some general descriptive
> statistics based on it.
>
> Cheers,
>
>
> -Xavier
>
>
> On 1/24/11 9:37 AM, Norbert Burger wrote:
> > Hi folks - is there a recommended way of estimating HBase HDFS usage for a
> > new environment?
> >
> > We have a DEV HBase cluster in place, and from this, I'm trying to estimate
> > the specs of our not-yet-built PROD environment.  One of the variables we're
> > considering is HBase usage of HDFS.  What I've just tried is to calculate an
> > average bytes/record ratio by using "hadoop dfs -du /hbase", and dividing by
> > the number of records/table.  But this ignores any kind of fixed overhead,
> > so I have concerns about it.
> >
> > Is there a better way?
> >
> > Norbert
> >
>

Re: estimate HBase DFS filesystem usage

Posted by Xavier Stevens <xs...@mozilla.com>.
Norbert,

It would probably be best if you wrote a quick MapReduce job that
iterates over those records and outputs the sum of bytes for each one. 
Then you could use that output and get some general descriptive
statistics based on it.

Cheers,


-Xavier


On 1/24/11 9:37 AM, Norbert Burger wrote:
> Hi folks - is there a recommended way of estimating HBase HDFS usage for a
> new environment?
>
> We have a DEV HBase cluster in place, and from this, I'm trying to estimate
> the specs of our not-yet-built PROD environment.  One of the variables we're
> considering is HBase usage of HDFS.  What I've just tried is to calculate an
> average bytes/record ratio by using "hadoop dfs -du /hbase", and dividing by
> the number of records/table.  But this ignores any kind of fixed overhead,
> so I have concerns about it.
>
> Is there a better way?
>
> Norbert
>