Posted to user@hbase.apache.org by Daniel <da...@abde.me> on 2016/02/21 16:22:10 UTC

Two questions about the maximum number of versions of a column family

Hi, I have two questions about the maximum number of versions of a column family:

(1) Is it OK to set a very large (>100,000) maximum number of versions for a column family?

The reference guide says "It is not recommended setting the number of max versions to an exceedingly high level (e.g., hundreds or more) unless those old values are very dear to you because this will greatly increase StoreFile size." (Chapter 36.1)

I'm new to the Hadoop ecosystem, and have no idea about the consequences of a very large StoreFile size.

Furthermore, is it OK to set a large maximum number of versions but insert only a few versions? Does it waste space?

(2) How much performance overhead does it cause to increase the maximum number of versions of a column family after an enormous number of rows (e.g., billions) have been inserted?

Regards,

Daniel

Re: Two questions about the maximum number of versions of a column family

Posted by Ted Yu <yu...@gmail.com>.
Thanks for sharing, Stephen.

bq. scan performance on the region servers needing to scan over all that
data you may not need

When the number of versions is large, try to utilize Filters (where
appropriate) that implement:

  public Cell getNextCellHint(Cell currentKV)

See MultiRowRangeFilter for an example.
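
As an illustrative sketch only (the table name, row keys, and ranges below are made up), a scan restricted with MultiRowRangeFilter lets the region server use getNextCellHint() to seek directly to the next requested range instead of reading every cell in between:

  import java.io.IOException;
  import java.util.Arrays;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.filter.MultiRowRangeFilter;
  import org.apache.hadoop.hbase.filter.MultiRowRangeFilter.RowRange;

  public class RangeScanSketch {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Table table = conn.getTable(TableName.valueOf("mytable"))) {

        // Only these key ranges are scanned; the filter's getNextCellHint()
        // lets the scanner seek past everything outside of them.
        MultiRowRangeFilter filter = new MultiRowRangeFilter(Arrays.asList(
            new RowRange("row-0100", true, "row-0200", false),
            new RowRange("row-0500", true, "row-0600", false)));

        Scan scan = new Scan();
        scan.setFilter(filter);
        scan.setMaxVersions(3);  // also cap how many versions come back per cell

        try (ResultScanner scanner = table.getScanner(scan)) {
          for (Result r : scanner) {
            // process r
          }
        }
      }
    }
  }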


Please see hbase-shell/src/main/ruby/shell/commands/alter.rb for the syntax on
how to alter a table. When "hbase.online.schema.update.enable" is true, the table
can stay online during the change.

Cheers


Re: Two questions about the maximum number of versions of a column family

Posted by Stephen Durfey <sj...@gmail.com>.
Someone please correct me if I am wrong. 
I've looked into this recently due to some performance reasons with my tables in a production environment. Like the book says, I don't recommend keeping this many versions around unless you really need them. Telling HBase to keep around a very large number doesn't waste space; that's just a value in the table descriptor. So, I wouldn't worry about that. The problems are going to come in when you actually write out those versions. 
My tables currently have max_versions set, and roughly 40% of the table data is due to historical versions. So, one table in particular is around 25 TB. I don't have a need to keep this many versions, so I am working on changing the max versions back to the default of 3 (some cells are hundreds or thousands of versions deep). The issue you'll run into is scan performance, with the region servers needing to scan over all that data you may not need (due to large store files). This could lead to increased scan time and potentially scanner timeouts, depending upon how large your batch size is set on the scan. 
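
A small sketch of the scan-side settings implied here (the family name and numbers are arbitrary): read only the versions you need and cap how many cells come back per next() call, so one very deep row cannot run past the scanner timeout.

  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class DeepRowScanSettings {
    public static Scan buildScan() {
      Scan scan = new Scan();
      scan.addFamily(Bytes.toBytes("cf")); // placeholder family
      scan.setMaxVersions(3);              // read only the versions you need
      scan.setBatch(100);                  // cells returned per next() on deep rows
      scan.setCaching(20);                 // rows fetched per RPC; keep low for heavy rows
      return scan;
    }
  }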
I assume this has some performance impact on compactions, both minor and major, and potentially on the write path, but I didn't investigate either. 
Changing the number of versions after the table has been created doesn't have a performance impact, since it's just a metadata change. The table will need to be disabled, changed, and re-enabled. If this is done through a script, the table could be offline for only a couple of seconds. The only concern around that is users of the table: if they have scheduled jobs that hit that table, those would break if they try to read from it while the table is disabled. The only performance impact I can think of around this change would be major compaction of the table, but even that shouldn't be an issue. 
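
For illustration only (table and family names are placeholders), the disable/change/re-enable script described above amounts to roughly the following with the Java Admin API; with online schema update enabled, the disable/enable steps can be dropped.

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.util.Bytes;

  public class OfflineAlterSketch {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Admin admin = conn.getAdmin()) {
        TableName table = TableName.valueOf("mytable");       // placeholder
        HColumnDescriptor cf = admin.getTableDescriptor(table)
            .getFamily(Bytes.toBytes("cf"));                  // placeholder family
        cf.setMaxVersions(3);          // back to the default of 3

        admin.disableTable(table);     // table briefly unavailable to readers
        admin.modifyColumn(table, cf); // metadata-only change
        admin.enableTable(table);      // clients can read again
      }
    }
  }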

