
Posted to user@hbase.apache.org by Naama Kraus <na...@gmail.com> on 2008/05/18 13:01:15 UTC

Scheme design questions

Hi,

I am trying to figure out how I should design HBase tables and I have a couple
of questions. I'd appreciate some assistance.

Say I have data about students consisting of -
Student id and some basic information such as first name, last name, gender,
address, date she started her studies, hobbies and some areas of interest.
Additionally, for each student there is information on the courses she has
taken and the final grade for each.

My Questions:
1. Should the basic attributes (first name, last name, gender ...) share a
common column family, or should each have a different family? If the second
is the way to go, would it harm HBase's flexibility, which allows adding a
new type of attribute that may pop up after I have defined the table schema?
E.g. a new data source comes in with the 'age' attribute, which was not
known when the schema was defined.

2. For attributes which may have multiple values, would it make sense to
define a common column family and add a column for each value?
2.1 For hobbies - I'd define a 'hobby' column family under which I put each
hobby in a separate column. Should I use hobby_i (i being incremented by 1
for each new hobby inserted in the row) as the column name and the actual
hobby as the value? Or should I rather use the hobby name as the column name
and some arbitrary value (e.g. 1) as the cell value?
2.2 Similarly, for grades there could be a common grades family. For each
course grade, I could put the course id as the column name and the course
grade as the value. Does that make sense?

3. Say there is the 'zipcode' attribute, and a student may have multiple zip
codes associated with her. So far, this is similar to question 2. But
what if for each zip I have the matching city and state information? Should
I create a separate table with each row containing a zip and the
corresponding city and state, and use a join at query time if needed? Or is
there a way to denormalize the data and somehow integrate the multiple
zips, plus the city and state of each, within the original students table?
To what extent should I aspire to denormalize data?

4. Can columns of different types (numbers/text/date) share the same column
family ?

Thanks for any help, Naama

-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)

Re: Scheme design questions

Posted by Naama Kraus <na...@gmail.com>.

Thanks again, Naama

On Mon, May 19, 2008 at 9:32 AM, Jim Kellerman <ji...@powerset.com> wrote:


RE: Scheme design questions

Posted by Jim Kellerman <ji...@powerset.com>.

Comments inline.

---
Jim Kellerman, Senior Engineer; Powerset


> -----Original Message-----
> From: Naama Kraus [mailto:naamakraus@gmail.com]
> Sent: Sunday, May 18, 2008 9:03 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Scheme design questions
>
> Thank you very much Jim for the useful information.
> My further questions inlined (within <<< ... >>>).
>
> Another question - what are the limits on number of families
> and number of family members within one table ?

Currently, there are no limits to the number of column families you
can create. However, Google's Bigtable paper says that you should expect
some limit (in the hundreds at most), but neither Bigtable nor HBase
limits the number of family members. See below for an explanation.

> Are there any limits to the overall size of a data stored in a table ?

There are no architectural limits to the size of a table.


More below

> > This kind of depends on the access pattern. For example in the
> > Webtable example, one column contains page content which is usually
> > processed together and another column contains page attributes such as
> > encoding, mime-type, etc.
> >
> > My guess is that your information should share a column family.
>
> <<< So does this mean that a column family is stored together?
> In the documentation I read that regions are stored together,
> but I thought regions are bunch of rows, each containing all
> columns. So I am now confused, rows or columns? Could you
> please explain? >>>

Yes, HBase is a column-oriented data store, just like Bigtable: the data
for each column family is stored together. Adding new family members is
cheap; adding new column families is expensive.

Regions are indeed a bunch of rows. A single region represents
a row range [low-key, high-key). For each region there is
an HStore for each column family that has data in the region's
row range.
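
To make the layout above concrete, here is a rough sketch in plain Python.
It is not HBase code: the split points, the region_for and put helpers and
the sample rows are all made up for illustration. It only shows the two
ideas above: a row lands in the region whose half-open key range contains
it, and within a region the data is kept in one store per column family.

```python
from bisect import bisect_right

# Hypothetical split points: three regions covering the row ranges
# ["", "g"), ["g", "p") and ["p", end of table).
REGION_START_KEYS = ["", "g", "p"]

def region_for(row_key):
    """Index of the region whose [low-key, high-key) range holds row_key."""
    return bisect_right(REGION_START_KEYS, row_key) - 1

# (region index, column family) -> {row key: {family member: value}}
stores = {}

def put(row_key, column, value):
    """Store a cell; a store appears only once a family has data in a region."""
    family, member = column.split(":", 1)
    store = stores.setdefault((region_for(row_key), family), {})
    store.setdefault(row_key, {})[member] = value

put("alice", "info:first_name", b"Alice")
put("alice", "hobby:tennis", b"1")
put("gina", "info:first_name", b"Gina")

# Region 0 has stores for two families, region 1 for one, region 2 for none.
print(sorted(stores))
```

In this toy model a new family member is just another key inside an
existing store, while a new family would need a new store in every region
that receives data for it.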

> > This is the disadvantage of the one column per attribute approach.
> > It is expensive to add a new column, but new family members can be
> > added at any time.
>
>
> <<< Can a column be added to an existing table then, or only
> prior to create ? In what sense is it expensive to add a new
> column ? >>>

You can add a new column to an existing table, but you must first
'disable' the table (take it offline). It is expensive, because adding
a new column family means creating a new HStore for each existing region.

> > There is no join operation in HBase. However, you could run a
> > map/reduce job to do something like a join.
>
>
> <<< Is there a code sample somewhere for doing a map/reduce
> join-like operation on top of HBase? >>>

The best examples we have available for using HBase with map/reduce are
in the test cases (see org.apache.hadoop.hbase.mapred.*)


Re: Scheme design questions

Posted by Naama Kraus <na...@gmail.com>.

Thank you very much Jim for the useful information.
My further questions inlined (within <<< ... >>>).

Another question - what are the limits on the number of families and the
number of family members within one table?
Are there any limits to the overall size of the data stored in a table?

Naama

On Sun, May 18, 2008 at 8:08 PM, Jim Kellerman <ji...@powerset.com> wrote:

> This kind of depends on the access pattern. For example in the
> Webtable example, one column contains page content which is usually
> processed together and another column contains page attributes
> such as encoding, mime-type, etc.
>
> My guess is that your information should share a column family.


<<< So does this mean that a column family is stored together? In the
documentation I read that regions are stored together, but I thought regions
are a bunch of rows, each containing all columns. So I am now confused: rows
or columns? Could you please explain? >>>

> This is the disadvantage of the one column per attribute approach.
> It is expensive to add a new column, but new family members
> can be added at any time.


<<< Can a column be added to an existing table then, or only before the
table is created? In what sense is it expensive to add a new column? >>>

> There is no join operation in HBase. However, you could run
> a map/reduce job to do something like a join.


<<< Is there a code sample somewhere for doing a map/reduce join-like
operation on top of HBase? >>>


RE: Scheme design questions

Posted by Jim Kellerman <ji...@powerset.com>.

Comments in-line below

---
Jim Kellerman, Senior Engineer; Powerset


> -----Original Message-----
> From: Naama Kraus [mailto:naamakraus@gmail.com]
> Sent: Sunday, May 18, 2008 4:01 AM
> To: hbase-user@hadoop.apache.org
> Subject: Scheme design questions
>
> Hi,
>
> I am trying to figure out how I should design HBase tables
> and I have a couple of questions. I'd appreciate some assistance.
>
> Say I have data about students consisting of - Student id and
> some basic information such as first name, last name, gender,
> address, date she started her studies, hobbies and some areas
> of interest.
> Additionally, for each student there is information on the
> courses she has taken and the final grade for each.
>
> My Questions:
> 1. Should the basic attributes (first name, last name, gender
> ...) share a common column family or each should have a
> different family ?

This depends on the access pattern. For example, in the
Webtable example, one column family contains page content, which is
usually processed together, and another contains page attributes
such as encoding, mime-type, etc.

My guess is that your information should share a column family.

> If the second is the way to go, would it
> harm HBase flexibility characteristic which allows adding a
> new type of attribute that may pop up after I defined the
> table scheme? E.g. new data source comes in with the 'age'
> attribute, that was not known upon defining the scheme.

This is the disadvantage of the one column family per attribute approach.
It is expensive to add a new column family, but new family members
can be added at any time.

> 2. For attributes which may have multiple values, would it
> make sense to define a common column family and add a column
> for each value ?

It might make sense in this case to have a family for the
multi-valued attribute and just add a new member for each new
value.

> 2.1 For hobbies - I'd define a 'hobby' column family under
> which I put each hobby in a separate column. hobby_i (i being
> incremented by 1 for each new hobby being inserted in the
> row) as a column name and the actual hobby as a value ? Or
> I'd rather have the hobby name as a column name and some
> arbitrary value (e.g. 1) as cell value ?

I'd define a family, hobby, and use a new family member for
each value, for example:

hobby:video-games
hobby:tennis
hobby:floral-arranging
etc.
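
As a rough sketch of that layout (plain Python, not HBase API; the row
dictionary and the add_hobby and hobbies helpers are made up for
illustration), the hobby name itself is the family member and the cell
value is only a placeholder. The placeholder b"1" is an assumption taken
from the question, not something HBase requires.

```python
# One student row, modelled as {column name: value bytes}.
row = {}

def add_hobby(row, hobby):
    """Record a hobby as a family member; the cell value is a placeholder."""
    row["hobby:" + hobby] = b"1"

# Adding the same hobby twice writes the same cell again, so no duplicate
# appears and no hobby_i counter has to be maintained.
for name in ["video-games", "tennis", "floral-arranging", "tennis"]:
    add_hobby(row, name)

def hobbies(row):
    """List the hobbies by reading the member names of the hobby family."""
    return sorted(c.split(":", 1)[1] for c in row if c.startswith("hobby:"))

print(hobbies(row))
```

The design choice being illustrated is that the set of member names is the
data; checking whether a student has a given hobby is a lookup of a single
known column name.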

> 2.2 Similarly, for grades there could be a common grades
> family. For each course grade, I could put the course id as a
> column name and the course grade as a value. Does it make sense ?

Yes. For example:

Family course:

course:math101 (value) B
course:economics203 (value) C
etc.

> 3. Say there is the 'zipcode' attribute, and a student may
> have multiple zip codes associated with her. By now, it is a
> case similar to question 2. But what if for each zip I have
> the matching city and state information. Should I create a
> separate table with each row containing a zip and the
> corresponding city and state and use join at query time if
> needed ?

There is no join operation in HBase. However, you could run
a map/reduce job to do something like a join.
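
To show what "something like a join" could look like, here is a small
in-memory sketch of a reduce-side join in plain Python. It does not use
the HBase or Hadoop APIs; the two tables, their contents, the loc: family
and all function names are hypothetical. The map step tags each record
with the table it came from and keys it by zip code, and the reduce step
pairs students with locations that share a key.

```python
from collections import defaultdict

# Hypothetical table contents: row key -> {column: value}.
students = {
    "alice": {"zip:12345": "home", "zip:09876": "school"},
    "bob": {"zip:12345": "home"},
}
zipcodes = {
    "12345": {"loc:city": "City A", "loc:state": "ST"},
    "09876": {"loc:city": "City B", "loc:state": "ST"},
}

def map_students(row_key, columns):
    """Emit (zip code, tagged student) for every zip: member in the row."""
    for column in columns:
        if column.startswith("zip:"):
            yield column.split(":", 1)[1], ("student", row_key)

def map_zipcodes(row_key, columns):
    """Emit (zip code, tagged location) for a row of the zip table."""
    yield row_key, ("location", (columns["loc:city"], columns["loc:state"]))

def reduce_join(zip_code, tagged_values):
    """Pair every student with every location that shares the zip code."""
    locations = [v for tag, v in tagged_values if tag == "location"]
    names = [v for tag, v in tagged_values if tag == "student"]
    for city, state in locations:
        for name in names:
            yield name, zip_code, city, state

# Stand-in for the shuffle phase: group all mapped values by key.
groups = defaultdict(list)
for key, columns in students.items():
    for zip_code, value in map_students(key, columns):
        groups[zip_code].append(value)
for key, columns in zipcodes.items():
    for zip_code, value in map_zipcodes(key, columns):
        groups[zip_code].append(value)

joined = sorted(r for z, vals in groups.items() for r in reduce_join(z, vals))
print(joined)
```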

For zipcode, I might do something like:

Family zip:

zip:12345 (value) home
zip:09876 (value) school
etc.

> Or is there a way to de-normalize the data and
> somehow integrate the multiple zip-s plus the city and state
> of each within the original students table ?

It is a little tricky to store multi-valued attributes in a
column that is multivalued.

For example if the row key is the student name, you could
have something like:

Family info:
info:id
info:address
info:zip1
info:zip2

or:

info:id
info:address
info:zip (value is a serialized map of zipcode, location)
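
A rough sketch of the second option, in plain Python rather than HBase
API. JSON is only one possible serialization and is my assumption here;
the helper names and the sample values are made up for illustration.

```python
import json

def encode_zip_map(zip_to_location):
    """Serialize {zipcode: [city, state]} into the bytes of a single cell."""
    return json.dumps(zip_to_location, sort_keys=True).encode("utf-8")

def decode_zip_map(cell):
    """Inverse of encode_zip_map."""
    return json.loads(cell.decode("utf-8"))

# One student row, modelled as {column name: value bytes}.
row = {
    "info:id": b"1001",
    "info:address": b"1 Example Street",
    "info:zip": encode_zip_map(
        {"12345": ["City A", "ST"], "09876": ["City B", "ST"]}
    ),
}

zips = decode_zip_map(row["info:zip"])
print(zips["12345"])
```

The trade-off compared with info:zip1, info:zip2 is that adding one more
zip code means reading, decoding, changing and rewriting the whole cell.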

> To what extent should I aspire to denormalize data ?

Again, it depends on your access patterns. If the data is going
to be accessed together, it is probably better to put it
in the same family. If you know that some data will never (or
rarely) be accessed together, then put it in separate
column families.

> 4. Can columns of different types (numbers/text/date) share
> the same column family ?

There are no data types in HBase. All values are byte[].
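
In practice this means the application chooses how to turn each value
into bytes and back. A minimal sketch in plain Python (not HBase API; the
encodings chosen, the helper names and the sample row are assumptions for
illustration only):

```python
import struct
from datetime import date

def encode_int(n):
    """Encode a number as an 8-byte big-endian signed integer."""
    return struct.pack(">q", n)

def decode_int(cell):
    return struct.unpack(">q", cell)[0]

def encode_text(s):
    return s.encode("utf-8")

def decode_text(cell):
    return cell.decode("utf-8")

def encode_date(d):
    """Encode a date as its ISO 8601 text form, e.g. b"2007-10-01"."""
    return d.isoformat().encode("ascii")

def decode_date(cell):
    return date.fromisoformat(cell.decode("ascii"))

# Numbers, text and dates side by side in one family: every cell is bytes.
row = {
    "info:age": encode_int(21),
    "info:first_name": encode_text("Ada"),
    "info:start_date": encode_date(date(2007, 10, 1)),
}

print(decode_int(row["info:age"]),
      decode_text(row["info:first_name"]),
      decode_date(row["info:start_date"]))
```

Nothing in the stored bytes says which decoder to use, so the reader has
to know the encoding of each column by convention.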

> Thanks for any help, Naama