Posted to user@hbase.apache.org by shawn du <sh...@gmail.com> on 2013/04/07 10:03:24 UTC

schema design: rows vs wide columns

Hello,

I am new to HBase, but I have some experience with Cassandra. The
official documentation says to prefer rows over columns. I
don't know whether I should follow that advice.
This is my use case:
I have hundreds of services. Each service is identified by a
number (the service id). We want to store the number of user registrations
for each service per day.
There are two possible designs:
rows:
rowkey: month (e.g. 2013-03); the columns are the service ids, and the values
are the registration counts for each service.
wide columns:
rowkey: serviceId; the columns are months, and the values are the counts.

Query requirement:
We only query for a specific service id and a time range between a start
time and an end time.

So which solution is better?

Another question:
It is said that we had better design tables with fewer than 3 column
families. Is that true? Can I create as many tables as I need in HBase?

Thanks in advance.

BR.Shawn
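
For readers comparing the two layouts in the question above, here is a minimal sketch in plain Python (it does not use the HBase client API). Nested dicts stand in for tables, the counts are made up, and the helper names query_month_rows and query_service_rows are hypothetical. It only illustrates how the stated query (one service id, a range of months) touches each layout.

```python
# Toy model only: nested dicts stand in for HBase tables, and the counts
# below are made-up illustration data, not real measurements.

# Layout 1 ("rows"): rowkey = month, one column per service id.
by_month = {
    "2013-01": {"svc1": 10, "svc2": 7},
    "2013-02": {"svc1": 12, "svc2": 9},
    "2013-03": {"svc1": 15, "svc2": 4},
}

# Layout 2 ("wide columns"): rowkey = service id, one column per month.
by_service = {
    "svc1": {"2013-01": 10, "2013-02": 12, "2013-03": 15},
    "svc2": {"2013-01": 7, "2013-02": 9, "2013-03": 4},
}


def query_month_rows(table, service_id, start, end):
    """Scan every month row in [start, end] and pick one column from each."""
    return {
        month: columns[service_id]
        for month, columns in sorted(table.items())
        if start <= month <= end and service_id in columns
    }


def query_service_rows(table, service_id, start, end):
    """Read a single row and keep only the month columns in [start, end]."""
    row = table.get(service_id, {})
    return {m: n for m, n in sorted(row.items()) if start <= m <= end}


a = query_month_rows(by_month, "svc1", "2013-02", "2013-03")
b = query_service_rows(by_service, "svc1", "2013-02", "2013-03")
print(a == b, a)
```

Both layouts return the same answer. The first touches one row per month in the range, while the second reads a single row for the service. Which is preferable on a real cluster depends on row sizes and access patterns, which this sketch does not model.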

Re: schema design: rows vs wide columns

Posted by ramkrishna vasudevan <ra...@gmail.com>.

I agree with Andrew here, and Stack's comment about FB running with 15 CFs
is interesting.
Whenever people read that line in the doc, they ask why it is
so. I also think the restriction of at most 3 CFs is one
factor that sometimes makes schema design a bit challenging.

Regards
Ram


On Mon, Apr 8, 2013 at 5:21 AM, Viral Bajaria <vi...@gmail.com> wrote:

> I think the whole idea of not going over a certain number of column
> families is a story that is 2+ years old. I remember hearing numbers like
> 5 or 6 (not 3) come up when talking at Hadoop conferences with engineers
> at companies that were heavy HBase users. I agree with Andrew's suggestion
> that we should remove that text and replace it with benchmarks. Obviously
> we need to provide disclaimers that these are benchmarks based on a
> specific schema design and so YMMV.
>
> I have run a cluster with some tables having upwards of 5 CFs, but the data
> was evenly spread across them. I don't think I saw any performance issues
> as such, or maybe they got masked, but 5 CFs was not a problem at all.
>
> Stack puts out an interesting stat, i.e. ~15 CFs at FB. Do they run their
> own HBase version? I feel they do, and so they might have some enhancements
> which are not available to the community. Or is that no longer the case?
>
> Thanks,
> Viral

Re: schema design: rows vs wide columns

Posted by Viral Bajaria <vi...@gmail.com>.

I think the whole idea of not going over a certain number of column
families is a story that is 2+ years old. I remember hearing numbers like
5 or 6 (not 3) come up when talking at Hadoop conferences with engineers
at companies that were heavy HBase users. I agree with Andrew's suggestion
that we should remove that text and replace it with benchmarks. Obviously
we need to provide disclaimers that these are benchmarks based on a
specific schema design and so YMMV.

I have run a cluster with some tables having upwards of 5 CFs, but the data
was evenly spread across them. I don't think I saw any performance issues
as such, or maybe they got masked, but 5 CFs was not a problem at all.

Stack puts out an interesting stat, i.e. ~15 CFs at FB. Do they run their
own HBase version? I feel they do, and so they might have some enhancements
which are not available to the community. Or is that no longer the case?

Thanks,
Viral

On Sun, Apr 7, 2013 at 3:52 PM, Andrew Purtell <ap...@apache.org> wrote:

> Is there a pointer to evidence/experiment backed analysis of this question?
> I'm sure there is some basis for this text in the book but I recommend we
> strike it. We could replace it with YCSB or LoadTestTool driven latency
> graphs for different workloads maybe. Although that would also be a big
> simplification of 'schema design' considerations, it would not be so
> starkly lacking background.
>
> On Sunday, April 7, 2013, Ted Yu wrote:
>
> > From http://hbase.apache.org/book.html#number.of.cfs :
> >
> > HBase currently does not do well with anything above two or three column
> > families so keep the number of column families in your schema low.
> >
> > Cheers
> >
> > On Sun, Apr 7, 2013 at 3:04 PM, Stack <stack@duboce.net> wrote:
> >
> > > On Sun, Apr 7, 2013 at 11:58 AM, Ted <yuzhihong@gmail.com> wrote:
> > >
> > > > With regard to number of column families, 3 is the recommended maximum.
> > > >
> > >
> > > How did you come up w/ the number '3'?  Is it a 'hard' 3? Or does it
> > > depend?  If the latter, on what does it depend?
> > > Thanks,
> > > St.Ack
> > >
> >
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>

Re: schema design: rows vs wide columns

Posted by Adrien Mogenet <ad...@gmail.com>.

Wide area :-)

I agree with Michael; perhaps the best explanation would be to make explicit
*WHEN* adding an extra CF makes perfect sense.


On Tue, Apr 16, 2013 at 4:35 PM, Michael Segel <mi...@hotmail.com> wrote:

> I think the important thing about Column Families is understanding
> how to use them properly in a design.
>
> Sparse data may make sense. It depends on the use case and an
> understanding of the trade-offs.
>
> It all depends on how the data breaks down into specific use cases.
>
> Keeping CFs to a minimum makes sense. However, what that minimum is remains
> to be seen.
>
> It depends....

-- 
Adrien Mogenet
http://www.borntosegfault.com

Re: schema design: rows vs wide columns

Posted by Michael Segel <mi...@hotmail.com>.

I think the important thing about Column Families is understanding how to use them properly in a design.

Sparse data may make sense. It depends on the use case and an understanding of the trade-offs.

It all depends on how the data breaks down into specific use cases.

Keeping CFs to a minimum makes sense. However, what that minimum is remains to be seen.

It depends....


On Apr 16, 2013, at 9:08 AM, Ted Yu <yu...@gmail.com> wrote:

> bq. Maybe we can explain why there are some impacts, or what to consider?
> 
> The above would be covered in the JIRA.
> 
> Thanks

Re: schema design: rows vs wide columns

Posted by Ted Yu <yu...@gmail.com>.

bq. Maybe we can explain why there are some impacts, or what to consider?

The above would be covered in the JIRA.

Thanks

On Tue, Apr 16, 2013 at 7:04 AM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Can we add more details than just changing the maximum CF number? Maybe we
> can explain why there are some impacts, or what to consider?
>
> JM

Re: schema design: rows vs wide columns

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

Can we add more details than just changing the maximum CF number? Maybe we
can explain why there are some impacts, or what to consider?

JM

2013/4/16 Ted Yu <yu...@gmail.com>

> If there is no objection, I will create a JIRA to increase the maximum
> number of column families described here:
>
> http://hbase.apache.org/book.html#number.of.cfs
>
> Cheers

Re: schema design: rows vs wide columns

Posted by Ted Yu <yu...@gmail.com>.

If there is no objection, I will create a JIRA to increase the maximum
number of column families described here:

http://hbase.apache.org/book.html#number.of.cfs

Cheers

On Mon, Apr 8, 2013 at 7:21 AM, Doug Meil <do...@explorysmedical.com> wrote:

> For the record, the refGuide mentions potential issues of CF lumpiness
> that you mentioned:
>
> http://hbase.apache.org/book.html#number.of.cfs
>
> 6.2.1. Cardinality of ColumnFamilies
>
> Where multiple ColumnFamilies exist in a single table, be aware of the
> cardinality (i.e., number of rows). If ColumnFamilyA has 1 million rows
> and ColumnFamilyB has 1 billion rows, ColumnFamilyA's data will likely be
> spread across many, many regions (and RegionServers). This makes mass
> scans for ColumnFamilyA less efficient.
>
> ... anything that needs to be updated/added for this?

Re: schema design: rows vs wide columns

Posted by ramkrishna vasudevan <ra...@gmail.com>.

"So things are fine as long as all CFs have roughly the same size. But if
you have one that gets a lot of data and many others that are smaller, we'd
end up with a lot of unnecessary and small store files from the smaller
CFs."

This is true. I am not sure of other reasons. We ensure cross-CF
atomicity within a single row anyway.


Regards
Ram


On Mon, Apr 8, 2013 at 10:09 AM, lars hofhansl <la...@apache.org> wrote:

> I think the main problem is that all CFs have to be flushed if one gets
> large enough to require a flush.
> (Does anyone remember why exactly that is? And do we still need that now
> that the memstoreTS is stored in the HFiles?)
>
>
> So things are fine as long as all CFs have roughly the same size. But if
> you have one that gets a lot of data and many others that are smaller, we'd
> end up with a lot of unnecessary and small store files from the smaller CFs.
>
> Anything else known that is bad about many column families?
>
>
> -- Lars

Re: schema design: rows vs wide columns

Posted by Doug Meil <do...@explorysmedical.com>.


For the record, the refGuide mentions potential issues of CF lumpiness
that you mentioned:

http://hbase.apache.org/book.html#number.of.cfs

6.2.1. Cardinality of ColumnFamilies

Where multiple ColumnFamilies exist in a single table, be aware of the
cardinality (i.e., number of rows). If ColumnFamilyA has 1 million rows
and ColumnFamilyB has 1 billion rows, ColumnFamilyA's data will likely be
spread across many, many regions (and RegionServers). This makes mass
scans for ColumnFamilyA less efficient.

... anything that needs to be updated/added for this?
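
To put rough numbers on the refGuide excerpt above, here is a back-of-the-envelope sketch in Python. Only the two row counts come from the excerpt. The rows-per-region figure is invented for illustration, and the sketch assumes that region boundaries are driven by the larger family and that ColumnFamilyA's row keys are spread evenly through the key space.

```python
# Back-of-the-envelope arithmetic with assumed numbers; only the two row
# counts come from the refGuide excerpt.
ROWS_A = 1_000_000            # ColumnFamilyA rows (from the excerpt)
ROWS_B = 1_000_000_000        # ColumnFamilyB rows (from the excerpt)
ROWS_PER_REGION = 10_000_000  # assumption: made-up region capacity in rows


def regions_needed(rows, rows_per_region=ROWS_PER_REGION):
    """Ceiling division: how many regions are needed to hold `rows` rows."""
    return -(-rows // rows_per_region)


# Regions of the shared table, sized by the larger family.
shared_regions = regions_needed(ROWS_B)
# ColumnFamilyA rows landing in each region, if keys are spread evenly.
a_rows_per_region = ROWS_A // shared_regions
# Regions ColumnFamilyA would need in a table of its own.
a_alone_regions = regions_needed(ROWS_A)

print(shared_regions, a_rows_per_region, a_alone_regions)
```

Under these assumptions a full scan of ColumnFamilyA has to visit every region of the shared table to collect a comparatively small number of rows from each, which is the inefficiency the excerpt describes.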





On 4/8/13 12:39 AM, "lars hofhansl" <la...@apache.org> wrote:

>I think the main problem is that all CFs have to be flushed if one gets
>large enough to require a flush.
>(Does anyone remember why exactly that is? And do we still need that now
>that the memstoreTS is stored in the HFiles?)
>
>
>So things are fine as long as all CFs have roughly the same size. But if
>you have one that gets a lot of data and many others that are smaller,
>we'd end up with a lot of unnecessary and small store files from the
>smaller CFs.
>
>Anything else known that is bad about many column families?
>
>
>-- Lars

Re: schema design: rows vs wide columns

Posted by lars hofhansl <la...@apache.org>.

I think the main problem is that all CFs have to be flushed if one gets large enough to require a flush.
(Does anyone remember why exactly that is? And do we still need that now that the memstoreTS is stored in the HFiles?)

So things are fine as long as all CFs have roughly the same size. But if you have one that gets a lot of data and many others that are smaller, we'd end up with a lot of unnecessary and small store files from the smaller CFs.
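
To make that concrete, here is a toy model in Python (not HBase code). It assumes only the rule stated above: when any one CF's memstore reaches a threshold, every CF with buffered data is flushed. The threshold, family names and write sizes are all invented for illustration.

```python
FLUSH_THRESHOLD = 128  # hypothetical per-CF memstore limit, arbitrary units


def simulate(writes, families, threshold=FLUSH_THRESHOLD):
    """Return {family: [flushed file sizes]} for a stream of (family, size) writes.

    Toy rule: when any one family's memstore reaches the threshold,
    every family that has buffered data is flushed to a new store file.
    """
    memstore = {cf: 0 for cf in families}
    store_files = {cf: [] for cf in families}
    for cf, size in writes:
        memstore[cf] += size
        if memstore[cf] >= threshold:
            for name in families:
                if memstore[name] > 0:
                    store_files[name].append(memstore[name])
                    memstore[name] = 0
    return store_files


# One busy family and two quiet ones; names and sizes are made up.
families = ["big", "small1", "small2"]
writes = []
for _ in range(40):
    writes += [("big", 16), ("small1", 1), ("small2", 1)]

files = simulate(writes, families)
for cf in families:
    print(cf, len(files[cf]), "store files, sizes:", files[cf])
```

In this model the quiet families get exactly as many store files as the busy one, but each file is only a small fraction of the size, which is the "unnecessary and small store files" effect.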

Anything else known that is bad about many column families?

-- Lars

________________________________
 From: Andrew Purtell <ap...@apache.org>
To: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Sunday, April 7, 2013 3:52 PM
Subject: Re: schema design: rows vs wide columns

Is there a pointer to evidence/experiment backed analysis of this question?
I'm sure there is some basis for this text in the book but I recommend we
strike it. We could replace it with YCSB or LoadTestTool driven latency
graphs for different workloads maybe. Although that would also be a big
simplification of 'schema design' considerations, it would not be so
starkly lacking background.

On Sunday, April 7, 2013, Ted Yu wrote:

> From http://hbase.apache.org/book.html#number.of.cfs :
>
> HBase currently does not do well with anything above two or three column
> families so keep the number of column families in your schema low.
>
> Cheers
>
> On Sun, Apr 7, 2013 at 3:04 PM, Stack <stack@duboce.net> wrote:
>
> > On Sun, Apr 7, 2013 at 11:58 AM, Ted <yuzhihong@gmail.com> wrote:
> >
> > > With regard to number of column families, 3 is the recommended maximum.
> > >
> >
> > How did you come up w/ the number '3'?  Is it a 'hard' 3? Or does it
> > depend?  If the latter, on what does it depend?
> > Thanks,
> > St.Ack
> >
>


Re: schema design: rows vs wide columns

Posted by Andrew Purtell <ap...@apache.org>.

Is there a pointer to evidence/experiment backed analysis of this question?
I'm sure there is some basis for this text in the book but I recommend we
strike it. We could replace it with YCSB or LoadTestTool driven latency
graphs for different workloads maybe. Although that would also be a big
simplification of 'schema design' considerations, it would not be so
starkly lacking background.

On Sunday, April 7, 2013, Ted Yu wrote:

> From http://hbase.apache.org/book.html#number.of.cfs :
>
> HBase currently does not do well with anything above two or three column
> families so keep the number of column families in your schema low.
>
> Cheers
>
> On Sun, Apr 7, 2013 at 3:04 PM, Stack <stack@duboce.net> wrote:
>
> > On Sun, Apr 7, 2013 at 11:58 AM, Ted <yuzhihong@gmail.com> wrote:
> >
> > > With regard to number of column families, 3 is the recommended maximum.
> > >
> >
> > How did you come up w/ the number '3'?  Is it a 'hard' 3? Or does it
> > depend?  If the latter, on what does it depend?
> > Thanks,
> > St.Ack
> >
>

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: schema design: rows vs wide columns

Posted by Michael Segel <mi...@hotmail.com>.

StAck, 

Just because FB does something doesn't mean it's necessarily a good idea for others to do the same. FB designs specifically for its own needs, and its use cases may not match those of others.

To your point though, I agree that Ted's number of 3 is more of a rule of thumb and not a hard and fast number. I think that the wording in that section should be changed.  (I may take a stab at it later today...) 

In our HBase course, I teach an example of an Order entry system. (Order, Pick, Ship, Invoice) There are 4 column families in that example. To your point, in the use cases, the CFs are usually used in an atomic fashion. When I do a pick slip, I don't need to constantly reference the order, except when I initially create the Pick Slip(s). 

The larger design question is this: should you use a CF to segment your data if you're constantly pulling data from both CFs in your main use case, or should they be part of the same table?

 -Mike

On Apr 7, 2013, at 5:45 PM, Stack <st...@duboce.net> wrote:

> On Sun, Apr 7, 2013 at 3:27 PM, Ted Yu <yu...@gmail.com> wrote:
> 
>> From http://hbase.apache.org/book.html#number.of.cfs :
>> 
>> HBase currently does not do well with anything above two or three column
>> families so keep the number of column families in your schema low.
>> 
> 
> We should add more to that section.  FB run w/ ~15 and purportedly it works
> with appropriate write and query pattern.
> St.Ack

Re: schema design: rows vs wide columns

Posted by Stack <st...@duboce.net>.

On Sun, Apr 7, 2013 at 3:27 PM, Ted Yu <yu...@gmail.com> wrote:

> From http://hbase.apache.org/book.html#number.of.cfs :
>
> HBase currently does not do well with anything above two or three column
> families so keep the number of column families in your schema low.
>

We should add more to that section.  FB runs w/ ~15 CFs and purportedly it works
given an appropriate write and query pattern.
St.Ack

Re: schema design: rows vs wide columns

Posted by Ted Yu <yu...@gmail.com>.

From http://hbase.apache.org/book.html#number.of.cfs :

HBase currently does not do well with anything above two or three column
families so keep the number of column families in your schema low.

Cheers

On Sun, Apr 7, 2013 at 3:04 PM, Stack <st...@duboce.net> wrote:

> On Sun, Apr 7, 2013 at 11:58 AM, Ted <yu...@gmail.com> wrote:
>
> > With regard to number of column families, 3 is the recommended maximum.
> >
>
> How did you come up w/ the number '3'?  Is it a 'hard' 3? Or does it
> depend?  If the latter, on what does it depend?
> Thanks,
> St.Ack
>

Re: schema design: rows vs wide columns

Posted by Stack <st...@duboce.net>.

On Sun, Apr 7, 2013 at 11:58 AM, Ted <yu...@gmail.com> wrote:

> With regard to number of column families, 3 is the recommended maximum.
>

How did you come up w/ the number '3'?  Is it a 'hard' 3? Or does it
depend?  If the latter, on what does it depend?
Thanks,
St.Ack

Re: schema design: rows vs wide columns

Posted by Ted <yu...@gmail.com>.

If you store the counts per service id by month, how do you deal with a query whose time range spans partial month(s)?
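
One possible way around that, sketched in Python rather than against the HBase client API: put the day into the row key after the service id, so a query for one service and a date range becomes a single contiguous range scan. The key layout (zero-padded id, '#', ISO date), the helper names and the sample counts are assumptions for illustration only. The sketch relies on row keys being sorted lexicographically and on a scan taking an inclusive start row and an exclusive stop row.

```python
from bisect import bisect_left
from datetime import date, timedelta


def row_key(service_id: int, day: date) -> str:
    # Zero-padding keeps lexicographic order equal to numeric order;
    # ISO dates (YYYY-MM-DD) already sort chronologically as strings.
    return f"{service_id:06d}#{day.isoformat()}"


def scan_bounds(service_id: int, start: date, end: date):
    """Start row (inclusive) and stop row (exclusive) covering start..end."""
    return row_key(service_id, start), row_key(service_id, end + timedelta(days=1))


def scan(table: dict, start_row: str, stop_row: str):
    """Stand-in for a range scan over a table sorted by row key."""
    keys = sorted(table)
    lo, hi = bisect_left(keys, start_row), bisect_left(keys, stop_row)
    return [(k, table[k]) for k in keys[lo:hi]]


# Invented daily registration counts for two services, March and April 2013.
table = {}
for sid in (7, 42):
    d = date(2013, 3, 1)
    while d <= date(2013, 4, 30):
        table[row_key(sid, d)] = sid + d.day
        d += timedelta(days=1)

# A range that covers part of March and part of April for service 42.
start_row, stop_row = scan_bounds(42, date(2013, 3, 20), date(2013, 4, 10))
rows = scan(table, start_row, stop_row)
total = sum(count for _, count in rows)
print(len(rows), "rows,", total, "registrations")
```

With monthly buckets the same query would have to read two whole months and could not separate out the partial ones; with daily rows the start and stop keys cut the range exactly.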

With regard to number of column families, 3 is the recommended maximum. 

Cheers

On Apr 7, 2013, at 1:03 AM, shawn du <sh...@gmail.com> wrote:

> Hello,
> 
> I am newer for hbase, but i have some experience on cassandra. In the
> official document, it is said prefer to use rows instead of columns. I
> don't know whether I should follow.
> This is my user case:
> I have about hundreds of services. each service is stored by a
> number(service id). we try to store users registration for specific service
> in a day.
> so there are two solutions for this:
> rows:
> rowkey: month(2013-03) columns will be each service ids. values will be the
> number for each service.
> wide columns:
> rowkey: serviceId, columns/values will be months and numbers.
> 
> Query requirement:
> we only query for a specific service id and time between a start time and
> end time.
> 
> so which solution is better?
> 
> also another question:
> it is said that we 'd better desgin less than 3 column families. it is
> true? can i create as many as tables i need in hbase?
> 
> Thanks in advance.
> 
> BR.Shawn