Posted to user@hbase.apache.org by innowireless TaeYun Kim <ta...@innowireless.co.kr> on 2014/08/05 13:10:31 UTC

Question on the number of column families

Hi,

 

According to http://hbase.apache.org/book/number.of.cfs.html, having more
than 2~3 column families is strongly discouraged.

 

However, in my case, records in the table have the following characteristics:

 

- The table is read-only. It is bulk-loaded once. When new data is ready,
a new table is created and the old table is deleted.

- The size of the source data can be hundreds of gigabytes.

- A record has about 130 fields. 

- The number of fields in a record is fixed.

- The names of the fields are also fixed. (It's like a table in an RDBMS.)

- About 40 fields (the number varies) mostly have values, while the other
fields are mostly empty (null in RDBMS terms).

- It is unknown in advance which fields will be dense; it depends on the source data.

- Fields are accessed independently. Normally a user requests just one
field, though a user can request several fields.

- The range of a range query is the same for all fields. (No wider, no
narrower, regardless of the data density.)

To me, it seems that it would be more efficient to have one column
family for each field, since only the needed column data would be read,
costing less disk I/O.
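To make that intuition concrete, here is a toy cost sketch in plain Java (all row counts and cell sizes below are made-up illustration values, not measurements from my data):

```java
public class ScanCostSketch {
    // Bytes a range scan must read when all fields share one column family:
    // every field's cells sit in the same store files, so all are read.
    static long oneCfBytes(long rows, int fields, long bytesPerField) {
        return rows * fields * bytesPerField;
    }

    // Bytes read when the queried field lives in its own column family:
    // only that family's store files are touched.
    static long perFieldCfBytes(long rows, long bytesPerField) {
        return rows * bytesPerField;
    }

    public static void main(String[] args) {
        long rows = 1_000_000;      // rows in the query range (hypothetical)
        int fields = 130;           // fields per record
        long bytesPerField = 200;   // average cell size (hypothetical)

        System.out.println("one CF:       " + oneCfBytes(rows, fields, bytesPerField));
        System.out.println("CF per field: " + perFieldCfBytes(rows, bytesPerField));
        // The ratio is simply the number of fields, i.e. 130x less I/O
        // for a single-field query under this naive model.
    }
}
```

This ignores block-cache effects and per-family overhead, so it only captures the raw-I/O side of the argument.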

 

Can the table have 130 column families in this case?

Or must all the columns be in one column family?

 

Thanks.

 


Re: Question on the number of column families

Posted by Qiang Tian <ti...@gmail.com>.
Hi TaeYun,
thanks for the explanation.




On Thu, Aug 7, 2014 at 12:50 PM, innowireless TaeYun Kim <
taeyun.kim@innowireless.co.kr> wrote:

> Hi Qiang,
> thank you for your help.
>
> 1. Regarding HBASE-5416, I think its purpose is simple:
>
> "Avoid loading column families that are irrelevant to filtering while
> scanning."
> So, it can be applied to my 'dummy CF' case.
> That is, a dummy CF can act like a 'relevant' CF for filtering, provided
> that HBase can select it while applying a rowkey filter, since a dummy CF
> has the rowkey data in its 'dummy' KeyValue object.
>
> 2. About rowkey.
>
> What I meant is that I would include the field name as a component when the
> byte array for a rowkey is constructed.
>
> 3. About read-only-ness and the number of CF.
>
> Thank you for your suggestion.
> But since the MemStore and BlockCache are managed separately for each
> column family, I'm a little concerned about the memory footprint.
>
> Thank you.
>
> -----Original Message-----
> From: Qiang Tian [mailto:tianq01@gmail.com]
> Sent: Thursday, August 07, 2014 11:43 AM
> To: user@hbase.apache.org
> Subject: Re: Question on the number of column families
>
> Hi,
> the description of HBASE-5416 states why it was introduced. If you only
> have 1 CF, a dummy CF does not help; it is helpful for the multi-CF case, e.g.
> "putting them in one column family. And "Non frequently" ones in another. "
>
> bq. "Field name will be included in rowkey."
> Please read chapter 9, "Advanced Usage", in the book "HBase: The Definitive
> Guide" about how HBase stores data on disk and how to design a rowkey for a
> specific scenario. (The rowkey is the only index you can use, so take care.)
>
> bq. "The table is read-only. It is bulk-loaded once. When new data is
> ready, a new table is created and the old table is deleted."
> the scenario is quite different, as HBase is designed for random
> read/write. The limitation described at
> http://hbase.apache.org/book/number.of.cfs.html considers the write
> case (flush & compaction). Perhaps you could try 140 CFs, as long as you
> can presplit your regions well? After that, since there are no writes,
> there will be no flushes/compactions... Anyway, any idea had better be
> tested with your real data.
>
>
>
>
>
>
>
>
> On Wed, Aug 6, 2014 at 7:00 PM, innowireless TaeYun Kim <
> taeyun.kim@innowireless.co.kr> wrote:
>
> > Hi Ted,
> >
> > Now I finished reading the filtering section and the source code of
> > TestJoinedScanners(0.94).
> >
> > Facts learned:
> >
> > - While scanning, an entire row will be read even for rowkey filtering.
> > (Since a rowkey is not a physically separate entity and is stored in a
> > KeyValue object, that's natural. Am I right?)
> > - The key API for the essential column family support is
> > setLoadColumnFamiliesOnDemand().
> >
> > So, now I have questions:
> >
> > On rowkey filtering, which column family's KeyValue object is read?
> > If HBase just reads a KeyValue from a randomly selected (or just the
> > first) column family, how is setLoadColumnFamiliesOnDemand() affected?
> > Can HBase select a smaller column family intelligently?
> >
> > If setLoadColumnFamiliesOnDemand() can be applied to a rowkey
> > filtering, a 'dummy' column family can be used to minimize the scan cost.
> >
> > Thank you.
> >
> >
> > -----Original Message-----
> > From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
> > Sent: Wednesday, August 06, 2014 1:48 PM
> > To: user@hbase.apache.org
> > Subject: RE: Question on the number of column families
> >
> > Thank you.
> >
> > The 'dummy' column will always hold the value '1' (or even an empty
> > string), which only signifies that this row exists. (The real value
> > is in the other 'big' column family.) The value is irrelevant since
> > with the current schema the filtering will be done by rowkey components
> > alone. No column value is needed. (I will begin reading the filtering
> > section shortly - it is only 6 pages ahead, so sorry for my premature
> > thoughts.)
> >
> >
> > -----Original Message-----
> > From: Ted Yu [mailto:yuzhihong@gmail.com]
> > Sent: Wednesday, August 06, 2014 1:38 PM
> > To: user@hbase.apache.org
> > Subject: Re: Question on the number of column families
> >
> > bq. add a 'dummy' column family and apply HBASE-5416 technique
> >
> > Adding a dummy column family is not the way to utilize essential column
> > family support - what would this dummy column family hold?
> >
> > bq. since I have not read the filtering section of the book I'm
> > reading yet
> >
> > Once you finish reading, you can look at the unit test
> > (TestJoinedScanners) from HBASE-5416. You would understand this
> > feature better.
> >
> > Cheers
> >
> >
> > On Tue, Aug 5, 2014 at 9:21 PM, innowireless TaeYun Kim <
> > taeyun.kim@innowireless.co.kr> wrote:
> >
> > > Thank you all.
> > >
> > > Facts learned:
> > >
> > > - Having 130 column families is too much. Don't do that.
> > > - While scanning, an entire row will be read for filtering, unless the
> > > HBASE-5416 technique is applied, which makes only the relevant column
> > > family be loaded. (But it seems that one still can't load just the
> > > column needed while scanning.)
> > > - A big row size is maybe not good.
> > >
> > > Currently it seems appropriate to follow the one-column solution
> > > that Alok Singh suggested, in part since currently there is no
> > > reasonable grouping of the fields.
> > >
> > > Here is my current thinking:
> > >
> > > - One column family, one column. The field name will be included in the
> > > rowkey.
> > > - Eliminate filtering altogether (in most cases) by properly ordering
> > > rowkey components.
> > > - If filtering is absolutely needed, add a 'dummy' column family
> > > and apply the HBASE-5416 technique to minimize disk reads, since the
> > > field value can be large (~5MB). (This dummy column thing may not be
> > > right, I'm not sure, since I have not yet read the filtering section of
> > > the book I'm reading.)
> > >
> > > Hope that I am not missing or misunderstanding something...
> > > (I'm a total newbie. I've started to read a HBase book since last
> > > week...)
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >
>
>

RE: Question on the number of column families

Posted by innowireless TaeYun Kim <ta...@innowireless.co.kr>.
Hi Qiang,
thank you for your help.

1. Regarding HBASE-5416, I think its purpose is simple:

"Avoid loading column families that are irrelevant to filtering while scanning."
So, it can be applied to my 'dummy CF' case.
That is, a dummy CF can act like a 'relevant' CF for filtering, provided that HBase can select it while applying a rowkey filter, since a dummy CF has the rowkey data in its 'dummy' KeyValue object.

2. About rowkey.

What I meant is that I would include the field name as a component when the byte array for a rowkey is constructed.
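For example, such a composite rowkey could be built like this sketch in plain Java (the layout below - UTF-8 field name, a 0x00 separator, then a big-endian timestamp - is just one hypothetical choice; the real schema would likely use HBase's Bytes utility instead of hand-rolled code):

```java
import java.nio.charset.StandardCharsets;

public class RowKeyBuilder {
    // Compose a rowkey as <fieldName> 0x00 <timestamp, big-endian 8 bytes>,
    // so rows for the same field sort together, ordered by timestamp.
    static byte[] buildRowKey(String fieldName, long timestamp) {
        byte[] field = fieldName.getBytes(StandardCharsets.UTF_8);
        byte[] key = new byte[field.length + 1 + 8];
        System.arraycopy(field, 0, key, 0, field.length);
        key[field.length] = 0x00; // separator between rowkey components
        for (int i = 0; i < 8; i++) {
            key[field.length + 1 + i] = (byte) (timestamp >>> (8 * (7 - i)));
        }
        return key;
    }

    // Unsigned lexicographic comparison, the order HBase uses for rowkeys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Same field: ordered by timestamp.
        System.out.println(compare(buildRowKey("speed", 1L), buildRowKey("speed", 2L)) < 0);
        // Different fields: grouped by field name first.
        System.out.println(compare(buildRowKey("speed", 9L), buildRowKey("temp", 1L)) < 0);
    }
}
```

Putting the field name first means a scan over one field's range never touches another field's rows.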

3. About read-only-ness and the number of CF.

Thank you for your suggestion.
But since the MemStore and BlockCache are managed separately for each column family, I'm a little concerned about the memory footprint.

Thank you.



Re: Question on the number of column families

Posted by Qiang Tian <ti...@gmail.com>.
Hi,
the description of HBASE-5416 states why it was introduced. If you only
have 1 CF, a dummy CF does not help; it is helpful for the multi-CF case,
e.g. "putting them in one column family. And "Non frequently" ones in
another. "

bq. "Field name will be included in rowkey."
Please read chapter 9, "Advanced Usage", in the book "HBase: The Definitive
Guide" about how HBase stores data on disk and how to design a rowkey for a
specific scenario. (The rowkey is the only index you can use, so take care.)

bq. "The table is read-only. It is bulk-loaded once. When new data is
ready, a new table is created and the old table is deleted."
the scenario is quite different, as HBase is designed for random
read/write. The limitation described at
http://hbase.apache.org/book/number.of.cfs.html considers the write case
(flush & compaction). Perhaps you could try 140 CFs, as long as you can
presplit your regions well? After that, since there are no writes, there
will be no flushes/compactions... Anyway, any idea had better be tested
with your real data.
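On presplitting: one simple scheme is to spread split points evenly over the first rowkey byte. The following is only a sketch in plain Java (the region count is hypothetical, and the actual createTable(descriptor, splitKeys) admin call is shown as a comment since it needs a live cluster):

```java
public class PresplitSketch {
    // Evenly spaced split keys over the first rowkey byte (0x00..0xFF).
    // For N regions we need N-1 split points.
    static byte[][] evenSplits(int regions) {
        byte[][] splits = new byte[regions - 1][];
        for (int i = 1; i < regions; i++) {
            splits[i - 1] = new byte[] { (byte) (i * 256 / regions) };
        }
        return splits;
    }

    public static void main(String[] args) {
        for (byte[] s : evenSplits(4)) {
            System.out.printf("split at 0x%02X%n", s[0] & 0xFF);
        }
        // prints: split at 0x40, 0x80, 0xC0
        // With the HBase admin API this would feed into something like:
        //   admin.createTable(tableDescriptor, evenSplits(regionCount));
    }
}
```

Even splits only balance load if rowkeys are uniformly distributed over the first byte; otherwise the split points should be sampled from the actual bulk-load data.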









Re: Question on the number of column families

Posted by Ted Yu <yu...@gmail.com>.
bq. no built-in filter intelligently determines which column family is
essential, except for SingleColumnValueFilter

Mostly right - don't forget about SingleColumnValueExcludeFilter, which
extends SingleColumnValueFilter.

Cheers



RE: Question on the number of column families

Posted by innowireless TaeYun Kim <ta...@innowireless.co.kr>.
Thank you Ted.

But the RowFilter class has no method that can be used to set which column family is essential. (Actually, no built-in filter class provides such a method.)

So, if I (ever) want to apply the 'dummy' column family technique, it seems that I must do the following:

- Write my own filter as a subclass of RowFilter.
- In that filter class, override the isFamilyEssential() method to return true only when the name of the 'dummy' column family is passed as an argument.

Then HBase calls the isFamilyEssential() method of my filter object for all the column families, including the 'dummy' column family, and as a result loads only the 'dummy' column family and happily filters rowkeys using the KeyValue objects from the 'dummy' column family's HFile(s).

Am I right?
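If that is right, the expected behavior can be modeled with a toy simulation in plain Java (this is NOT the real HBase API - the Filter interface, family names, and sizes below are all made up just to illustrate the on-demand loading order):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EssentialCfModel {
    // Toy stand-in for the two filter hooks discussed in this thread.
    interface Filter {
        boolean isFamilyEssential(String family);
        boolean rowPasses(String rowkey);
    }

    // Mimics the HBASE-5416 joined scanner: read essential families first,
    // evaluate the filter, and load the remaining families only on a match.
    static long scan(Map<String, Long> familySizes, Filter f, String rowkey) {
        long bytesRead = 0;
        for (Map.Entry<String, Long> e : familySizes.entrySet()) {
            if (f.isFamilyEssential(e.getKey())) bytesRead += e.getValue();
        }
        if (f.rowPasses(rowkey)) {
            for (Map.Entry<String, Long> e : familySizes.entrySet()) {
                if (!f.isFamilyEssential(e.getKey())) bytesRead += e.getValue();
            }
        }
        return bytesRead;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("dummy", 1L);        // tiny marker family
        sizes.put("big", 5_000_000L);  // the ~5MB field value

        Filter dummyOnly = new Filter() {
            public boolean isFamilyEssential(String family) {
                return family.equals("dummy");
            }
            public boolean rowPasses(String rowkey) {
                return rowkey.startsWith("wanted");
            }
        };

        // Filtered-out row: only the dummy family's bytes are read.
        System.out.println(scan(sizes, dummyOnly, "other-row"));   // 1
        // Matching row: dummy plus the big family are read.
        System.out.println(scan(sizes, dummyOnly, "wanted-row"));  // 5000001
    }
}
```

Under this model, rows rejected by the filter cost only the tiny dummy read, which is the whole point of the technique.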

BTW, it would be nice to have a method like 'setEssentialColumnFamilies(byte[][] names)' to set the essential families manually, since no built-in filter intelligently determines which column family is essential, except for SingleColumnValueFilter.

Thanks.



Re: Question on the number of column families

Posted by Ted Yu <yu...@gmail.com>.
bq. While scanning, an entire row will be read even for a rowkey filtering

If you specify an essential column family in your filter, the above would not
be true - only the essential column family would be loaded into memory
first. Once the filter passes, the other families would be loaded.

Cheers



RE: Question on the number of column families

Posted by innowireless TaeYun Kim <ta...@innowireless.co.kr>.
Hi Ted,

I have now finished reading the filtering section and the source code of TestJoinedScanners (0.94).

Facts learned:

- While scanning, an entire row will be read even when filtering on the rowkey alone. (Since a rowkey is not a physically separate entity but is stored in each KeyValue object, this seems natural. Am I right?)
- The key API for the essential column family support is setLoadColumnFamiliesOnDemand().

So, now I have questions:

When filtering on the rowkey, which column family's KeyValue object is read?
If HBase just reads a KeyValue from a randomly selected (or simply the first) column family, how does setLoadColumnFamiliesOnDemand() come into play? Can HBase intelligently select a smaller column family?

If setLoadColumnFamiliesOnDemand() can be applied to rowkey filtering, a 'dummy' column family could be used to minimize the scan cost.

Thank you.
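For what it's worth, the joined-scanner idea can be sketched outside HBase. The following is a rough Python toy model of what HBASE-5416 aims at, not HBase internals; the family names, data, and counter mechanism are all made up for illustration:

```python
# Toy illustration of "essential column family" scanning (HBASE-5416):
# the filter is evaluated against a small, essential family first, and
# the large family is loaded only for rows that pass.

def joined_scan(rows, essential_cf, big_cf, row_filter, load_count):
    """rows: dict rowkey -> {family: value}. Returns big-family values
    for rows whose (rowkey, essential value) pass row_filter."""
    results = {}
    for rowkey in sorted(rows):
        # The filter sees only the cheap, essential family.
        if not row_filter(rowkey, rows[rowkey].get(essential_cf)):
            continue
        # Only now is the expensive family "loaded" (think ~5MB values).
        load_count[0] += 1
        results[rowkey] = rows[rowkey][big_cf]
    return results

rows = {
    "r1": {"d": "1", "big": "payload-1"},
    "r2": {"d": "1", "big": "payload-2"},
    "r3": {"d": "1", "big": "payload-3"},
}
loads = [0]
hits = joined_scan(rows, "d", "big", lambda key, _d: key >= "r2", loads)
# The big family is touched only for the two matching rows.
```

In real HBase this behavior is requested via Scan.setLoadColumnFamiliesOnDemand(true) together with a filter; the sketch only shows why the dummy (essential) family can keep large values off the read path.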


-----Original Message-----
From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr] 
Sent: Wednesday, August 06, 2014 1:48 PM
To: user@hbase.apache.org
Subject: RE: Question on the number of column families

Thank you.

The 'dummy' column will always hold the value '1' (or even an empty string), that only signifies that this row exists. (And the real value is in the other 'big' column family) The value is irrelevant since with current schema the filtering will be done by rowkey components alone. No column value is needed. (I will begin reading the filtering section shortly - it is only 6 pages ahead. So sorry for my premature thoughts)


-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com]
Sent: Wednesday, August 06, 2014 1:38 PM
To: user@hbase.apache.org
Subject: Re: Question on the number of column families

bq. add a 'dummy' column family and apply HBASE-5416 technique

Adding dummy column family is not the way to utilize essential column family support - what would this dummy column family hold ?

bq. since I have not read the filtering section of the book I'm reading yet

Once you finish reading, you can look at the unit test (TestJoinedScanners) from HBASE-5416. You would understand this feature better.

Cheers


On Tue, Aug 5, 2014 at 9:21 PM, innowireless TaeYun Kim < taeyun.kim@innowireless.co.kr> wrote:

> Thank you all.
>
> Facts learned:
>
> - Having 130 column families is too much. Don't do that.
> - While scanning, an entire row will be read for filtering, unless
> HBASE-5416 technique is applied which makes only relevant column 
> family is loaded. (But it seems that still one can't load just a 
> column needed while
> scanning)
> - Big row size is maybe not good.
>
> Currently it seems appropriate to follow the one-column solution that 
> Alok Singh suggested, in part since currently there is no reasonable 
> grouping of the fields.
>
> Here is my current thinking:
>
> - One column family, one column. Field name will be included in rowkey.
> - Eliminate filtering altogether (in most case) by properly ordering 
> rowkey components.
> - If a filtering is absolutely needed, add a 'dummy' column family and 
> apply HBASE-5416 technique to minimize disk read, since the field 
> value can be large(~5MB). (This dummy column thing may not be right, 
> I'm not sure, since I have not read the filtering section of the book 
> I'm reading yet)
>
> Hope that I am not missing or misunderstanding something...
> (I'm a total newbie. I've started to read a HBase book since last
> week...)
>
>
>
>
>
>


RE: Question on the number of column families

Posted by innowireless TaeYun Kim <ta...@innowireless.co.kr>.
Thank you.

The 'dummy' column will always hold the value '1' (or even an empty string); it only signifies that the row exists. (The real value is in the other 'big' column family.)
The value is irrelevant since, with the current schema, the filtering will be done by rowkey components alone; no column value is needed. (I will begin reading the filtering section shortly - it is only 6 pages ahead, so sorry for my premature thoughts.)


-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: Wednesday, August 06, 2014 1:38 PM
To: user@hbase.apache.org
Subject: Re: Question on the number of column families

bq. add a 'dummy' column family and apply HBASE-5416 technique

Adding dummy column family is not the way to utilize essential column family support - what would this dummy column family hold ?

bq. since I have not read the filtering section of the book I'm reading yet

Once you finish reading, you can look at the unit test (TestJoinedScanners) from HBASE-5416. You would understand this feature better.

Cheers


On Tue, Aug 5, 2014 at 9:21 PM, innowireless TaeYun Kim < taeyun.kim@innowireless.co.kr> wrote:

> Thank you all.
>
> Facts learned:
>
> - Having 130 column families is too much. Don't do that.
> - While scanning, an entire row will be read for filtering, unless
> HBASE-5416 technique is applied which makes only relevant column 
> family is loaded. (But it seems that still one can't load just a 
> column needed while
> scanning)
> - Big row size is maybe not good.
>
> Currently it seems appropriate to follow the one-column solution that 
> Alok Singh suggested, in part since currently there is no reasonable 
> grouping of the fields.
>
> Here is my current thinking:
>
> - One column family, one column. Field name will be included in rowkey.
> - Eliminate filtering altogether (in most case) by properly ordering 
> rowkey components.
> - If a filtering is absolutely needed, add a 'dummy' column family and 
> apply HBASE-5416 technique to minimize disk read, since the field 
> value can be large(~5MB). (This dummy column thing may not be right, 
> I'm not sure, since I have not read the filtering section of the book 
> I'm reading yet)
>
> Hope that I am not missing or misunderstanding something...
> (I'm a total newbie. I've started to read a HBase book since last 
> week...)
>
>
>
>
>
>


Re: Question on the number of column families

Posted by Ted Yu <yu...@gmail.com>.
bq. add a 'dummy' column family and apply HBASE-5416 technique

Adding a dummy column family is not the way to utilize essential column
family support - what would this dummy column family hold?

bq. since I have not read the filtering section of the book I'm reading yet

Once you finish reading, you can look at the unit test (TestJoinedScanners)
from HBASE-5416. You would understand this feature better.

Cheers


On Tue, Aug 5, 2014 at 9:21 PM, innowireless TaeYun Kim <
taeyun.kim@innowireless.co.kr> wrote:

> Thank you all.
>
> Facts learned:
>
> - Having 130 column families is too much. Don't do that.
> - While scanning, an entire row will be read for filtering, unless
> HBASE-5416 technique is applied which makes only relevant column family is
> loaded. (But it seems that still one can't load just a column needed while
> scanning)
> - Big row size is maybe not good.
>
> Currently it seems appropriate to follow the one-column solution that Alok
> Singh suggested, in part since currently there is no reasonable grouping of
> the fields.
>
> Here is my current thinking:
>
> - One column family, one column. Field name will be included in rowkey.
> - Eliminate filtering altogether (in most case) by properly ordering
> rowkey components.
> - If a filtering is absolutely needed, add a 'dummy' column family and
> apply HBASE-5416 technique to minimize disk read, since the field value can
> be large(~5MB). (This dummy column thing may not be right, I'm not sure,
> since I have not read the filtering section of the book I'm reading yet)
>
> Hope that I am not missing or misunderstanding something...
> (I'm a total newbie. I've started to read a HBase book since last week...)
>
>
>
>
>
>

RE: Question on the number of column families

Posted by innowireless TaeYun Kim <ta...@innowireless.co.kr>.
Thank you all.

Facts learned:

- Having 130 column families is too much. Don't do that.
- While scanning, an entire row will be read for filtering, unless the HBASE-5416 technique is applied, which loads only the relevant column families. (But it seems that one still can't load just the needed column while scanning.)
- A big row size is probably not good.

Currently it seems appropriate to follow the one-column solution that Alok Singh suggested, partly because there is currently no reasonable grouping of the fields.

Here is my current thinking:

- One column family, one column. The field name will be included in the rowkey.
- Eliminate filtering altogether (in most cases) by properly ordering the rowkey components.
- If filtering is absolutely needed, add a 'dummy' column family and apply the HBASE-5416 technique to minimize disk reads, since the field value can be large (~5MB). (This dummy column idea may not be right; I'm not sure, since I have not yet read the filtering section of the book I'm reading.)

Hope that I am not missing or misunderstanding something...
(I'm a total newbie; I only started reading an HBase book last week...)
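The "one column family, one column, field name in the rowkey" plan above can be sketched in miniature. This hypothetical Python snippet only illustrates why ordering rowkey components turns common queries into pure range scans with no filter; the field names and key layout are invented, and real access would go through the HBase client Scan API:

```python
# Rowkey = field name first, then a zero-padded index, so "one field
# over a contiguous index range" is one contiguous slice of key space.

def rowkey(field, idx):
    return f"{field}|{idx:08d}"

def scan(sorted_keys, start, stop):
    # Mimic an HBase scan: start inclusive, stop exclusive, lexicographic order.
    return [k for k in sorted_keys if start <= k < stop]

# Invented field names; 5 grid indices per field.
keys = sorted(rowkey(f, i) for f in ("rsrp", "rsrq", "sinr") for i in range(5))

# All of field "rsrq" for indices 1..3 is one contiguous slice: no filter needed.
got = scan(keys, rowkey("rsrq", 1), rowkey("rsrq", 4))
assert got == [rowkey("rsrq", i) for i in (1, 2, 3)]
```

Zero-padding matters: without it, "10" would sort before "2" and the range scan would return the wrong slice.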






Re: Question on the number of column families

Posted by Alok Singh <al...@gmail.com>.
One way to model the data would be to use a composite key made up of the
RDBMS primary_key + "." + field_name. Then just have a single column that
contains the value of the field.
An individual field lookup is then a simple Get; to fetch all fields of a
record, you would do a scan with startrow => primary_key + ".!",
endrow => primary_key + ".~".

Alok
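As a sketch of why the "!" and "~" sentinels work: HBase orders rows lexicographically by byte, and "!" (0x21) sorts below all alphanumerics while "~" (0x7E) sorts above them. The toy Python below mimics such a scan over composite keys; the record keys and field names are invented:

```python
# Composite rowkey layout: primary_key + "." + field_name.

def scan(sorted_keys, startrow, endrow):
    # Mimic an HBase scan: startrow inclusive, endrow exclusive.
    return [k for k in sorted_keys if startrow <= k < endrow]

table = sorted([
    "order42.amount", "order42.currency", "order42.status",
    "order43.amount", "order7.amount",
])

# All fields of record "order42" come back from one range scan; the
# sentinels bracket every printable field name without touching order43.
fields = scan(table, "order42.!", "order42.~")
```

A single-field lookup is just an exact-key Get on, say, "order42.status". Note the scheme assumes field names contain only characters that sort between "!" and "~" (printable ASCII) and that "." never appears inside the primary key.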

Re: Question on the number of column families

Posted by Michael Segel <mi...@hotmail.com>.
I think you need to step back a bit from the problem and ask yourself when you would want the same row key used for disjoint data. That is, data that refers to the same object, yet the data in each column family is never or rarely used with data from another column family.

To give you a concrete example, one that I've used in a class: an order entry system.

Think of the life cycle of your order.

You enter the order, the company then generates pick slips for the warehouse(s), then the warehouse(s) issue shipping slips, then as the product ships, invoices are issued and the billing process occurs.

In each part of the process, information that could be shared can be copied, so that when you make an inquiry into the order you can see what was done and when, but within each process, like managing the pick slips, you don't need to bring up the entire order.

Does that make sense? 

In that example, you have 4 column families. 

There are other examples, but that should help you put column families in perspective. 

HTH
-Mike

On Aug 5, 2014, at 11:52 AM, Ted Yu <yu...@gmail.com> wrote:

> As Alok mentioned previously, once columns are grouped into several column
> families, you would be able to leverage essential column family feature
> introduced by this JIRA:
> 
> HBASE-5416 Improve performance of scans with some kind of filters
> 
> Cheers
> 
> 
> On Tue, Aug 5, 2014 at 5:26 AM, Alok Kumar <al...@gmail.com> wrote:
> 
>> You could narrow the number of rows to scan by using Filters. I don't
>> think, you could reach/optimize to column level I/O.
>> 
>> Block Cache is related to actual data read from HDFS per column family. If
>> your scan is fetching random (all) columns, then you are any way going to
>> hit all the column-family-blocks and "irrelevant" data in block cache!!
>> You could limit or set columns you want to fetch on client side after scan,
>> that will save network IO.
>> 
>> Do you have 130 * 5 = 650MB of row size?
>> 
>> Thanks
>> Alok
>> 
>> On Tue, Aug 5, 2014 at 5:17 PM, innowireless TaeYun Kim <
>> taeyun.kim@innowireless.co.kr> wrote:
>> 
>>> Plus,
>>> Since most of the time a client will display the area that does not fit
>> in
>>> 500x500, Scan operations are required. (Get is not enough)
>>> So, I'm worried that on scanning, many irrelevant column data (those have
>>> the same rowkey, which is the position on the grid) would be read into
>> the
>>> block cache, unless the columns are separated by individual column
>> family.
>>> 
>>> 
>>> -----Original Message-----
>>> From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
>>> Sent: Tuesday, August 05, 2014 8:36 PM
>>> To: user@hbase.apache.org
>>> Subject: RE: Question on the number of column families
>>> 
>>> Thank you for your reply.
>>> 
>>> I can decrease the size of column value if it's not good for HBase.
>>> BTW, The values are for a point on a grid cell on a map.
>>> 250000 is 500x500, and 500x500 is somewhat related to the size of the
>>> client screen that displays the values on a map.
>>> Normally a client requests the values for the area that is displayed on
>>> the screen.
>>> 
>>> 
>>> -----Original Message-----
>>> From: Alok Kumar [mailto:alokawi@gmail.com]
>>> Sent: Tuesday, August 05, 2014 8:24 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: Question on the number of column families
>>> 
>>> Hi,
>>> 
>>> Hbase creates HFile per column-family. Having 130 column-family is really
>>> not recommended.
>>> It will increase number of file pointer ( open file count) underneath.
>>> 
>>> If you are sure which columns are "frequently" accessed by users, you
>>> could consider putting them in one column family. And "Non frequently"
>> ones
>>> in another.
>>> Btw, ~5MB size of column value is something to consider. We should wait
>>> for some expert advise here!!
>>> 
>>> 
>>> Thanks
>>> Alok
>>> 
>>> 
>>> On Tue, Aug 5, 2014 at 4:50 PM, innowireless TaeYun Kim <
>>> taeyun.kim@innowireless.co.kr> wrote:
>>> 
>>>> Plus,
>>>> the size of the value of each field can be ~5MB, since max 250000
>>>> lines of the source data will be merged into one record, to match the
>>>> request pattern.
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
>>>> Sent: Tuesday, August 05, 2014 8:11 PM
>>>> To: user@hbase.apache.org
>>>> Subject: Question on the number of column families
>>>> 
>>>> Hi,
>>>> 
>>>> 
>>>> 
>>>> According to http://hbase.apache.org/book/number.of.cfs.html, having
>>>> more than 2~3 column families are strongly discouraged.
>>>> 
>>>> 
>>>> 
>>>> BTW, in my case, records on a table have the following characteristics:
>>>> 
>>>> 
>>>> 
>>>> - The table is read-only. It is bulk-loaded once. When a new data is
>>>> ready, A new table is created and the old table is deleted.
>>>> 
>>>> - The size of the source data can be hundreds of gigabytes.
>>>> 
>>>> - A record has about 130 fields.
>>>> 
>>>> - The number of fields in a record is fixed.
>>>> 
>>>> - The names of the fields are also fixed. (it's like a table in RDBMS)
>>>> 
>>>> - About 40(it varies) fields mostly have value, while other fields are
>>>> mostly empty(null in RDBMS).
>>>> 
>>>> - It is unknown which field will be dense. It depends on the source
>> data.
>>>> 
>>>> - Fields are accessed independently. Normally a user requests just one
>>>> field. A user can request several fields.
>>>> 
>>>> - The range on the range query is the same for all fields. (No wider,
>>>> no narrower, regardless the data density)
>>>> 
>>>> For me, it seems that it would be more efficient if there is one
>>>> column family for each field, since it would cost less disk I/O, for
>>>> only the needed column data will be read.
>>>> 
>>>> 
>>>> 
>>>> Can the table have 130 column families for this case?
>>>> 
>>>> Or the whole columns must be in one column family?
>>>> 
>>>> 
>>>> 
>>>> Thanks.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Alok Kumar
>>> Email : alokawi@gmail.com
>>> http://sharepointorange.blogspot.in/
>>> http://www.linkedin.com/in/alokawi
>>> 
>>> 
>> 

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com






Re: Question on the number of column families

Posted by Ted Yu <yu...@gmail.com>.
As Alok mentioned previously, once columns are grouped into several column
families, you would be able to leverage essential column family feature
introduced by this JIRA:

HBASE-5416 Improve performance of scans with some kind of filters

Cheers


On Tue, Aug 5, 2014 at 5:26 AM, Alok Kumar <al...@gmail.com> wrote:

> You could narrow the number of rows to scan by using Filters. I don't
> think, you could reach/optimize to column level I/O.
>
> Block Cache is related to actual data read from HDFS per column family. If
> your scan is fetching random (all) columns, then you are any way going to
> hit all the column-family-blocks and "irrelevant" data in block cache!!
> You could limit or set columns you want to fetch on client side after scan,
> that will save network IO.
>
> Do you have 130 * 5 = 650MB of row size?
>
> Thanks
> Alok
>
> On Tue, Aug 5, 2014 at 5:17 PM, innowireless TaeYun Kim <
> taeyun.kim@innowireless.co.kr> wrote:
>
> > Plus,
> > Since most of the time a client will display the area that does not fit
> in
> > 500x500, Scan operations are required. (Get is not enough)
> > So, I'm worried that on scanning, many irrelevant column data (those have
> > the same rowkey, which is the position on the grid) would be read into
> the
> > block cache, unless the columns are separated by individual column
> family.
> >
> >
> > -----Original Message-----
> > From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
> > Sent: Tuesday, August 05, 2014 8:36 PM
> > To: user@hbase.apache.org
> > Subject: RE: Question on the number of column families
> >
> > Thank you for your reply.
> >
> > I can decrease the size of column value if it's not good for HBase.
> > BTW, The values are for a point on a grid cell on a map.
> > 250000 is 500x500, and 500x500 is somewhat related to the size of the
> > client screen that displays the values on a map.
> > Normally a client requests the values for the area that is displayed on
> > the screen.
> >
> >
> > -----Original Message-----
> > From: Alok Kumar [mailto:alokawi@gmail.com]
> > Sent: Tuesday, August 05, 2014 8:24 PM
> > To: user@hbase.apache.org
> > Subject: Re: Question on the number of column families
> >
> > Hi,
> >
> > Hbase creates HFile per column-family. Having 130 column-family is really
> > not recommended.
> > It will increase number of file pointer ( open file count) underneath.
> >
> > If you are sure which columns are "frequently" accessed by users, you
> > could consider putting them in one column family. And "Non frequently"
> ones
> > in another.
> > Btw, ~5MB size of column value is something to consider. We should wait
> > for some expert advise here!!
> >
> >
> > Thanks
> > Alok
> >
> >
> > On Tue, Aug 5, 2014 at 4:50 PM, innowireless TaeYun Kim <
> > taeyun.kim@innowireless.co.kr> wrote:
> >
> > > Plus,
> > > the size of the value of each field can be ~5MB, since max 250000
> > > lines of the source data will be merged into one record, to match the
> > > request pattern.
> > >
> > >
> > > -----Original Message-----
> > > From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
> > > Sent: Tuesday, August 05, 2014 8:11 PM
> > > To: user@hbase.apache.org
> > > Subject: Question on the number of column families
> > >
> > > Hi,
> > >
> > >
> > >
> > > According to http://hbase.apache.org/book/number.of.cfs.html, having
> > > more than 2~3 column families are strongly discouraged.
> > >
> > >
> > >
> > > BTW, in my case, records on a table have the following characteristics:
> > >
> > >
> > >
> > > - The table is read-only. It is bulk-loaded once. When a new data is
> > > ready, A new table is created and the old table is deleted.
> > >
> > > - The size of the source data can be hundreds of gigabytes.
> > >
> > > - A record has about 130 fields.
> > >
> > > - The number of fields in a record is fixed.
> > >
> > > - The names of the fields are also fixed. (it's like a table in RDBMS)
> > >
> > > - About 40(it varies) fields mostly have value, while other fields are
> > > mostly empty(null in RDBMS).
> > >
> > > - It is unknown which field will be dense. It depends on the source
> data.
> > >
> > > - Fields are accessed independently. Normally a user requests just one
> > > field. A user can request several fields.
> > >
> > > - The range on the range query is the same for all fields. (No wider,
> > > no narrower, regardless the data density)
> > >
> > > For me, it seems that it would be more efficient if there is one
> > > column family for each field, since it would cost less disk I/O, for
> > > only the needed column data will be read.
> > >
> > >
> > >
> > > Can the table have 130 column families for this case?
> > >
> > > Or the whole columns must be in one column family?
> > >
> > >
> > >
> > > Thanks.
> > >
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Alok Kumar
> > Email : alokawi@gmail.com
> > http://sharepointorange.blogspot.in/
> > http://www.linkedin.com/in/alokawi
> >
> >
>

Re: Question on the number of column families

Posted by Alok Kumar <al...@gmail.com>.
You could narrow the number of rows to scan by using filters, but I don't
think you can optimize down to column-level I/O.

The block cache holds the actual data read from HDFS per column family. If
your scan fetches random (all) columns, then you are going to hit all the
column-family blocks anyway, "irrelevant" data in the block cache included!!
You could limit or set the columns you want to fetch on the client side after the scan;
that will save network I/O.

Do you have 130 * 5 MB = 650 MB of row size?

Thanks
Alok

On Tue, Aug 5, 2014 at 5:17 PM, innowireless TaeYun Kim <
taeyun.kim@innowireless.co.kr> wrote:

> Plus,
> Since most of the time a client will display the area that does not fit in
> 500x500, Scan operations are required. (Get is not enough)
> So, I'm worried that on scanning, many irrelevant column data (those have
> the same rowkey, which is the position on the grid) would be read into the
> block cache, unless the columns are separated by individual column family.
>
>
> -----Original Message-----
> From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
> Sent: Tuesday, August 05, 2014 8:36 PM
> To: user@hbase.apache.org
> Subject: RE: Question on the number of column families
>
> Thank you for your reply.
>
> I can decrease the size of column value if it's not good for HBase.
> BTW, The values are for a point on a grid cell on a map.
> 250000 is 500x500, and 500x500 is somewhat related to the size of the
> client screen that displays the values on a map.
> Normally a client requests the values for the area that is displayed on
> the screen.
>
>
> -----Original Message-----
> From: Alok Kumar [mailto:alokawi@gmail.com]
> Sent: Tuesday, August 05, 2014 8:24 PM
> To: user@hbase.apache.org
> Subject: Re: Question on the number of column families
>
> Hi,
>
> Hbase creates HFile per column-family. Having 130 column-family is really
> not recommended.
> It will increase number of file pointer ( open file count) underneath.
>
> If you are sure which columns are "frequently" accessed by users, you
> could consider putting them in one column family. And "Non frequently" ones
> in another.
> Btw, ~5MB size of column value is something to consider. We should wait
> for some expert advise here!!
>
>
> Thanks
> Alok
>
>
> On Tue, Aug 5, 2014 at 4:50 PM, innowireless TaeYun Kim <
> taeyun.kim@innowireless.co.kr> wrote:
>
> > Plus,
> > the size of the value of each field can be ~5MB, since max 250000
> > lines of the source data will be merged into one record, to match the
> > request pattern.
> >
> >
> > -----Original Message-----
> > From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
> > Sent: Tuesday, August 05, 2014 8:11 PM
> > To: user@hbase.apache.org
> > Subject: Question on the number of column families
> >
> > Hi,
> >
> >
> >
> > According to http://hbase.apache.org/book/number.of.cfs.html, having
> > more than 2~3 column families are strongly discouraged.
> >
> >
> >
> > BTW, in my case, records on a table have the following characteristics:
> >
> >
> >
> > - The table is read-only. It is bulk-loaded once. When a new data is
> > ready, A new table is created and the old table is deleted.
> >
> > - The size of the source data can be hundreds of gigabytes.
> >
> > - A record has about 130 fields.
> >
> > - The number of fields in a record is fixed.
> >
> > - The names of the fields are also fixed. (it's like a table in RDBMS)
> >
> > - About 40(it varies) fields mostly have value, while other fields are
> > mostly empty(null in RDBMS).
> >
> > - It is unknown which field will be dense. It depends on the source data.
> >
> > - Fields are accessed independently. Normally a user requests just one
> > field. A user can request several fields.
> >
> > - The range on the range query is the same for all fields. (No wider,
> > no narrower, regardless the data density)
> >
> > For me, it seems that it would be more efficient if there is one
> > column family for each field, since it would cost less disk I/O, for
> > only the needed column data will be read.
> >
> >
> >
> > Can the table have 130 column families for this case?
> >
> > Or the whole columns must be in one column family?
> >
> >
> >
> > Thanks.
> >
> >
> >
> >
> >
>
>
> --
> Alok Kumar
> Email : alokawi@gmail.com
> http://sharepointorange.blogspot.in/
> http://www.linkedin.com/in/alokawi
>
>

RE: Question on the number of column families

Posted by innowireless TaeYun Kim <ta...@innowireless.co.kr>.
Plus,
Since most of the time a client will display an area that does not fit in 500x500, Scan operations are required. (A Get is not enough.)
So I'm worried that, when scanning, a lot of irrelevant column data (cells with the same rowkey, which is the position on the grid) would be read into the block cache unless the columns are separated into individual column families.


-----Original Message-----
From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr] 
Sent: Tuesday, August 05, 2014 8:36 PM
To: user@hbase.apache.org
Subject: RE: Question on the number of column families

Thank you for your reply.

I can decrease the size of column value if it's not good for HBase.
BTW, The values are for a point on a grid cell on a map.
250000 is 500x500, and 500x500 is somewhat related to the size of the client screen that displays the values on a map.
Normally a client requests the values for the area that is displayed on the screen.


-----Original Message-----
From: Alok Kumar [mailto:alokawi@gmail.com]
Sent: Tuesday, August 05, 2014 8:24 PM
To: user@hbase.apache.org
Subject: Re: Question on the number of column families

Hi,

Hbase creates HFile per column-family. Having 130 column-family is really not recommended.
It will increase number of file pointer ( open file count) underneath.

If you are sure which columns are "frequently" accessed by users, you could consider putting them in one column family. And "Non frequently" ones in another.
Btw, ~5MB size of column value is something to consider. We should wait for some expert advise here!!


Thanks
Alok


On Tue, Aug 5, 2014 at 4:50 PM, innowireless TaeYun Kim < taeyun.kim@innowireless.co.kr> wrote:

> Plus,
> the size of the value of each field can be ~5MB, since max 250000 
> lines of the source data will be merged into one record, to match the 
> request pattern.
>
>
> -----Original Message-----
> From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
> Sent: Tuesday, August 05, 2014 8:11 PM
> To: user@hbase.apache.org
> Subject: Question on the number of column families
>
> Hi,
>
>
>
> According to http://hbase.apache.org/book/number.of.cfs.html, having 
> more than 2~3 column families are strongly discouraged.
>
>
>
> BTW, in my case, records on a table have the following characteristics:
>
>
>
> - The table is read-only. It is bulk-loaded once. When a new data is 
> ready, A new table is created and the old table is deleted.
>
> - The size of the source data can be hundreds of gigabytes.
>
> - A record has about 130 fields.
>
> - The number of fields in a record is fixed.
>
> - The names of the fields are also fixed. (it's like a table in RDBMS)
>
> - About 40(it varies) fields mostly have value, while other fields are 
> mostly empty(null in RDBMS).
>
> - It is unknown which field will be dense. It depends on the source data.
>
> - Fields are accessed independently. Normally a user requests just one 
> field. A user can request several fields.
>
> - The range on the range query is the same for all fields. (No wider, 
> no narrower, regardless the data density)
>
> For me, it seems that it would be more efficient if there is one 
> column family for each field, since it would cost less disk I/O, for 
> only the needed column data will be read.
>
>
>
> Can the table have 130 column families for this case?
>
> Or the whole columns must be in one column family?
>
>
>
> Thanks.
>
>
>
>
>


--
Alok Kumar
Email : alokawi@gmail.com
http://sharepointorange.blogspot.in/
http://www.linkedin.com/in/alokawi


RE: Question on the number of column families

Posted by innowireless TaeYun Kim <ta...@innowireless.co.kr>.
Thank you for your reply.

I can decrease the size of the column value if it's not good for HBase.
BTW, the values are for points on grid cells on a map.
250000 is 500x500, and 500x500 roughly corresponds to the size of the client screen that displays the values on a map.
Normally a client requests the values for the area that is displayed on the screen.
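One hypothetical way to key such grid-cell values (not stated anywhere in this thread; the 6-digit width and key layout are my assumptions) is a fixed-width, zero-padded (x, y) rowkey, so that lexicographic order matches numeric order and a rectangular screen area becomes one scan range per x column:

```python
# Fixed-width grid rowkeys: zero-padding keeps byte order == numeric order.

def grid_rowkey(x, y):
    return f"{x:06d}{y:06d}"

def scan_ranges_for_area(x0, y0, x1, y1):
    """One (startrow, stoprow) pair per x column; stoprow is exclusive."""
    return [(grid_rowkey(x, y0), grid_rowkey(x, y1 + 1))
            for x in range(x0, x1 + 1)]

ranges = scan_ranges_for_area(2, 5, 3, 6)
# Two ranges, one per x column, each covering y in [5, 7).
```

Each range maps to one HBase Scan, so the visible screen area is covered by a handful of contiguous scans rather than one wide scan plus filtering.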


-----Original Message-----
From: Alok Kumar [mailto:alokawi@gmail.com] 
Sent: Tuesday, August 05, 2014 8:24 PM
To: user@hbase.apache.org
Subject: Re: Question on the number of column families

Hi,

Hbase creates HFile per column-family. Having 130 column-family is really not recommended.
It will increase number of file pointer ( open file count) underneath.

If you are sure which columns are "frequently" accessed by users, you could consider putting them in one column family. And "Non frequently" ones in another.
Btw, ~5MB size of column value is something to consider. We should wait for some expert advise here!!


Thanks
Alok


On Tue, Aug 5, 2014 at 4:50 PM, innowireless TaeYun Kim < taeyun.kim@innowireless.co.kr> wrote:

> Plus,
> the size of the value of each field can be ~5MB, since max 250000 
> lines of the source data will be merged into one record, to match the 
> request pattern.
>
>
> -----Original Message-----
> From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
> Sent: Tuesday, August 05, 2014 8:11 PM
> To: user@hbase.apache.org
> Subject: Question on the number of column families
>
> Hi,
>
>
>
> According to http://hbase.apache.org/book/number.of.cfs.html, having 
> more than 2~3 column families is strongly discouraged.
>
>
>
> BTW, in my case, records on a table have the following characteristics:
>
>
>
> - The table is read-only. It is bulk-loaded once. When new data is 
> ready, a new table is created and the old table is deleted.
>
> - The size of the source data can be hundreds of gigabytes.
>
> - A record has about 130 fields.
>
> - The number of fields in a record is fixed.
>
> - The names of the fields are also fixed. (it's like a table in an RDBMS)
>
> - About 40 fields (it varies) mostly have a value, while other fields are 
> mostly empty (null in RDBMS).
>
> - It is unknown which field will be dense. It depends on the source data.
>
> - Fields are accessed independently. Normally a user requests just one 
> field. A user can request several fields.
>
> - The range on the range query is the same for all fields. (No wider, 
> no narrower, regardless of the data density)
>
> For me, it seems that it would be more efficient if there were one 
> column family for each field, since it would cost less disk I/O, as 
> only the needed column data would be read.
>
>
>
> Can the table have 130 column families for this case?
>
> Or the whole columns must be in one column family?
>
>
>
> Thanks.
>
>
>
>
>


--
Alok Kumar
Email : alokawi@gmail.com
http://sharepointorange.blogspot.in/
http://www.linkedin.com/in/alokawi


Re: Question on the number of column families

Posted by Alok Kumar <al...@gmail.com>.
Hi,

HBase creates an HFile per column family. Having 130 column families is really
not recommended.
It will increase the number of file pointers (open file count) underneath.

If you are sure which columns are "frequently" accessed by users, you could
consider putting them in one column family, and the "non-frequently" accessed
ones in another.
Btw, a ~5MB column value is something to consider. We should wait for
some expert advice here!!
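
For what it's worth, a minimal sketch of that two-family split using the happybase Python client (table name, family names, and host below are hypothetical, not from this thread):

```python
# Frequently read fields go in a 'hot' family, the rarely read ones in 'cold'.
# The option dicts follow happybase's create_table() conventions.
families = {
    "hot":  dict(max_versions=1),   # ~40 dense, frequently requested fields
    "cold": dict(max_versions=1),   # the remaining, mostly empty fields
}

# Against a live cluster the schema would be applied roughly as:
#   import happybase
#   conn = happybase.Connection("thrift-host")        # hypothetical host
#   conn.create_table("measurements", families)
#   conn.table("measurements").row(b"row-1", columns=[b"hot:field7"])
print(sorted(families))   # ['cold', 'hot']
```

Reading a field from 'hot' would then touch only that family's store files, which is the point of the split.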


Thanks
Alok


On Tue, Aug 5, 2014 at 4:50 PM, innowireless TaeYun Kim <
taeyun.kim@innowireless.co.kr> wrote:

> Plus,
> the size of the value of each field can be ~5MB, since max 250000 lines of
> the source data will be merged into one record, to match the request
> pattern.
>
>
> -----Original Message-----
> From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
> Sent: Tuesday, August 05, 2014 8:11 PM
> To: user@hbase.apache.org
> Subject: Question on the number of column families
>
> Hi,
>
>
>
> According to http://hbase.apache.org/book/number.of.cfs.html, having more
> than 2~3 column families is strongly discouraged.
>
>
>
> BTW, in my case, records on a table have the following characteristics:
>
>
>
> - The table is read-only. It is bulk-loaded once. When new data is ready,
> a new table is created and the old table is deleted.
>
> - The size of the source data can be hundreds of gigabytes.
>
> - A record has about 130 fields.
>
> - The number of fields in a record is fixed.
>
> - The names of the fields are also fixed. (it's like a table in an RDBMS)
>
> - About 40 fields (it varies) mostly have a value, while other fields are
> mostly empty (null in RDBMS).
>
> - It is unknown which field will be dense. It depends on the source data.
>
> - Fields are accessed independently. Normally a user requests just one
> field. A user can request several fields.
>
> - The range on the range query is the same for all fields. (No wider, no
> narrower, regardless of the data density)
>
> For me, it seems that it would be more efficient if there were one column
> family for each field, since it would cost less disk I/O, as only the
> needed column data would be read.
>
>
>
> Can the table have 130 column families for this case?
>
> Or the whole columns must be in one column family?
>
>
>
> Thanks.
>
>
>
>
>


-- 
Alok Kumar
Email : alokawi@gmail.com
http://sharepointorange.blogspot.in/
http://www.linkedin.com/in/alokawi

RE: Question on the number of column families

Posted by innowireless TaeYun Kim <ta...@innowireless.co.kr>.
Plus,
the size of the value of each field can be ~5MB, since up to 250000 lines of
the source data will be merged into one record to match the request
pattern.


-----Original Message-----
From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr] 
Sent: Tuesday, August 05, 2014 8:11 PM
To: user@hbase.apache.org
Subject: Question on the number of column families

Hi,

 

According to http://hbase.apache.org/book/number.of.cfs.html, having more
than 2~3 column families is strongly discouraged.

 

BTW, in my case, records on a table have the following characteristics:

 

- The table is read-only. It is bulk-loaded once. When new data is ready,
a new table is created and the old table is deleted.

- The size of the source data can be hundreds of gigabytes.

- A record has about 130 fields. 

- The number of fields in a record is fixed.

- The names of the fields are also fixed. (it's like a table in an RDBMS)

- About 40 fields (it varies) mostly have a value, while other fields are
mostly empty (null in RDBMS).

- It is unknown which field will be dense. It depends on the source data.

- Fields are accessed independently. Normally a user requests just one
field. A user can request several fields.

- The range on the range query is the same for all fields. (No wider, no
narrower, regardless of the data density)

For me, it seems that it would be more efficient if there were one column
family for each field, since it would cost less disk I/O, as only the
needed column data would be read.
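
To make that trade-off concrete (a sketch only, using the happybase Python client; all names are hypothetical): a client can name a single field in either layout, but the column family is the unit of storage, so it is the family, not the qualifier, that decides which store files back the read.

```python
# Column specs in HBase are 'family:qualifier'. With a family per field,
# the family itself selects the store files to read; with one family,
# the qualifier only filters what is returned, not what is opened.
per_field_cf = b"field7:v"        # family-per-field layout (hypothetical names)
single_cf    = b"cf:field7"       # single-family layout

# Against a live cluster, either spec is fetched the same way:
#   import happybase
#   table = happybase.Connection("thrift-host").table("measurements")
#   data = table.row(b"row-1", columns=[single_cf])
family, qualifier = single_cf.split(b":")
print(family, qualifier)          # b'cf' b'field7'
```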

 

Can the table have 130 column families for this case?

Or the whole columns must be in one column family?

 

Thanks.