Posted to user@hbase.apache.org by Christian Schäfer <sy...@yahoo.de> on 2012/08/20 13:47:13 UTC

Schema Design - Move second column family to new table

I'm currently designing HBase tables.

In my case, table1 has CF1 holding millions to billions of rows and CF2 holding only hundreds of rows.
Read use cases include reading both CFs by key, or reading only one CF.

Referring to http://hbase.apache.org/book/number.of.cfs.html

Because of the cardinality difference, I would change the schema design by putting CF2 in a separate table (table2), right?
After that, there would be table1 and table2, each with one CF and the same row key.
Any objections to that?

Can anyone recommend resources that explain HBase schema design across different use cases,
beyond "HBase: The Definitive Guide" and the HBase online reference?

regards,
Christian

Re: Schema Design - Move second column family to new table

Posted by Christian Schäfer <sy...@yahoo.de>.

Just a short follow-up.

As noted, I will now use two column families (instead of an additional table) to achieve row-level atomicity.

Because CF1 has a much higher cardinality than CF2, flushes will most likely always be triggered by CF1's memstore reaching the configured flush size.
CF2 will then be flushed as well, resulting in very small HFiles, because for every 1000 rows set in CF1 there is roughly 1 row in CF2.

Does anyone have experience with whether this becomes a performance problem for scans restricted to CF2 (which means checking many small HFiles), assuming bloom filters are applied?
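
To put rough numbers on the small-HFile concern, here is a back-of-the-envelope sketch. Only the ~1000:1 row ratio comes from the message above; the 128 MB flush size, the 1 KB average row size, and the function name are illustrative assumptions. It also ignores compression, per-cell overhead, and later compactions that merge small files.

```python
def cf2_bytes_per_flush(flush_size_bytes, cf1_row_bytes, cf2_row_bytes, cf1_rows_per_cf2_row):
    """Estimate how much CF2 data has accumulated by the time CF1 fills the memstore."""
    # Rows CF1 can buffer before its memstore reaches the flush threshold.
    cf1_rows = flush_size_bytes // cf1_row_bytes
    # With a ~1000:1 ratio, only a small fraction of those rows also carry CF2 data.
    cf2_rows = cf1_rows // cf1_rows_per_cf2_row
    return cf2_rows * cf2_row_bytes


# Assumed values: 128 MB flush size, 1 KB per row in either family.
flush_size = 128 * 1024 * 1024
estimate = cf2_bytes_per_flush(flush_size, 1024, 1024, 1000)
print(f"CF2 data per flush: ~{estimate / 1024:.0f} KB")
```

Under these assumed values each flush would write on the order of a hundred kilobytes for CF2, so a scan restricted to CF2 would touch many such files until a compaction merges them.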

regards,
Christian

________________________________
From: Ian Varley <iv...@salesforce.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
CC: Christian Schäfer <sy...@yahoo.de>
Sent: 16:37 Monday, 20 August 2012
Subject: Re: Schema Design - Move second column family to new table

Christian,

Column families are really more "within" rows, not the other way around (they're really just a way to physically partition sets of columns in a table). In your example, then, it's more correct to say that table1 has millions / billions of rows, but only hundreds of them have any columns in CF2. I'm not exactly sure how much of a penalty that 2nd column family imposes in this case--if you don't include it as a part of your scans / gets, then you won't pay any penalty at read time; but if you're reading from both "just in case" the row has data there, you'll always take a hit. I think the same goes for writes. (Question for the list: does adding a column family that you *never* use impose any penalties?)

The downside to moving it to another table is that writes will no longer be transactionally protected (i.e. if you're trying to write to both, it could fail after one and before the other). Conversely, if you put them as column families in the same row, writes to a single row are transactional. You may or may not care about that.

So, putting the lower cardinality data in another table with the same row key might be a performance win, or it might not, depending on your read & write patterns. Try it both ways and compare, and let us know what you find.
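
The atomicity trade-off can be illustrated with a toy in-memory model. This is not HBase client code: `ToyTable`, `put_row`, `write_two_tables`, and the simulated failure are hypothetical names invented for this sketch. It only mirrors the behaviour described above, where a single-row write covering two column families applies entirely or not at all, while two separate writes to two tables can be interrupted between them.

```python
class ToyTable:
    """Hypothetical stand-in for a table: row key -> {column family -> {qualifier: value}}."""

    def __init__(self):
        self.rows = {}

    def put_row(self, row_key, families):
        # Build the new row state on the side, then swap it in with one assignment,
        # so the row ends up with all of the families' updates or none of them.
        updated = {cf: dict(cols) for cf, cols in self.rows.get(row_key, {}).items()}
        for cf, cols in families.items():
            updated.setdefault(cf, {}).update(cols)
        self.rows[row_key] = updated


def write_two_tables(table1, table2, row_key, cf1_cols, cf2_cols, fail_between=False):
    """Two independent writes: a failure after the first leaves table2 without the row."""
    table1.put_row(row_key, {"CF1": cf1_cols})
    if fail_between:
        raise RuntimeError("simulated client failure between the two writes")
    table2.put_row(row_key, {"CF2": cf2_cols})


# One table, two families: both families land together.
single = ToyTable()
single.put_row("row-1", {"CF1": {"a": "1"}, "CF2": {"b": "2"}})

# Two tables: a failure in the middle leaves them inconsistent.
t1, t2 = ToyTable(), ToyTable()
try:
    write_two_tables(t1, t2, "row-1", {"a": "1"}, {"b": "2"}, fail_between=True)
except RuntimeError:
    pass

print("row-1" in t1.rows, "row-1" in t2.rows)
```

In the two-table case the application itself would have to detect and repair the half-written state, which is the cost to weigh against any read-performance gain.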

Ian

Re: Substring comparator for column key

Posted by jmozah <jm...@gmail.com>.

For filtering rows based on the column key (I hope that's what you asked), there is no direct filter as far as I know.
But I think you can use "ColumnPrefixFilter", which selects only those columns whose name starts with a particular prefix (a prefix match rather than a general substring or regex match).
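
The difference between prefix and substring matching on column qualifiers can be shown with a small plain-Python sketch. It is not HBase client code, and the qualifier names and helper functions are made up for illustration. On the HBase side, `ColumnPrefixFilter` corresponds to the prefix case; for the substring case, `QualifierFilter` combined with `SubstringComparator` may be worth checking against the API docs for your HBase version.

```python
def columns_with_prefix(row, prefix):
    """Keep only columns whose qualifier starts with the given prefix."""
    return {q: v for q, v in row.items() if q.startswith(prefix)}


def columns_with_substring(row, substring):
    """Keep only columns whose qualifier contains the substring anywhere.

    Both sides are lowercased, on the assumption that the match should be
    case-insensitive.
    """
    needle = substring.lower()
    return {q: v for q, v in row.items() if needle in q.lower()}


# Hypothetical row: qualifier -> value.
row = {"user_name": "alice", "user_mail": "a@example.org", "last_user_login": "2012-08-21"}

print(sorted(columns_with_prefix(row, "user")))     # qualifiers starting with "user"
print(sorted(columns_with_substring(row, "user")))  # qualifiers containing "user"
```

A prefix match misses `last_user_login` because the qualifier does not start with `user`, which is why a prefix filter only partly answers the substring question.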

./Zahoor
HBase Musings

Substring comparator for column key

Posted by Shagun Agarwal <sh...@yahoo-inc.com>.

Hi,

SubstringComparator can be used with SingleColumnValueFilter for substring filtering; however, this only works on the value. Is there any way to do substring filtering on the column key?

Thanks
Shagun

RE: Schema Design - Move second column family to new table

Posted by Christian Schäfer <sy...@yahoo.de>.

Thanks, Pranav, for the schema design resource. I will check it soon.

Thanks, Ian, for your thoughts. You're right that the point about transactions is really important.

On the other hand, due to per-region compaction, big scans over CF2 (the CF with only a few rows set) would result in several disk seeks.

So I still have to find out whether big scans over CF2 are really as important as I currently expect.
My guess, though, is that in our use case transactional safety is more important than analytics speed.

regards
Chris.

Re: Schema Design - Move second column family to new table

Posted by Ian Varley <iv...@salesforce.com>.

Christian,

Column families are really more "within" rows, not the other way around (they're really just a way to physically partition sets of columns in a table). In your example, then, it's more correct to say that table1 has millions / billions of rows, but only hundreds of them have any columns in CF2. I'm not exactly sure how much of a penalty that 2nd column family imposes in this case--if you don't include it as a part of your scans / gets, then you won't pay any penalty at read time; but if you're reading from both "just in case" the row has data there, you'll always take a hit. I think the same goes for writes. (Question for the list: does adding a column family that you *never* use impose any penalties?)

Ian

Re: Schema Design - Move second column family to new table

Posted by Pranav Modi <pr...@runa.com>.

This might be useful -
http://java.dzone.com/videos/hbase-schema-design-things-you
