You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hbase.apache.org by java8964 <ja...@hotmail.com> on 2014/02/14 03:49:37 UTC

some data modeling questions related to Hbase

Hi, 
I am researching a project to load data from MySQL into HBase.
The data is our user data. We want to analyze the user data with its activities and tracking. The user data is big. The record count reaches billion and whole data set size is about more than several Ts.
The purpose of this data in this system is pure analyzing. The reason I am thinking of using HBase instead of HDFS is because of daily change of the data. Every day, there could be about 1 to 5% of user having any update/delete activities. Because HDFS is a write-once system, I have to regenerate the whole latest snapshot just for small change of the data, it is kind of wasting.
So I am thinking storing the data into HBase. All other activities/tracking data (Fact data) will store in HDFS. I hope I can get good performance joining the data between HBase and HDFS for different datasets. I test the HBase scan speed, not as good as HDFS, but maybe acceptable for my case. HBase here is only a sink/source of MR jobs in Hadoop. So I want to give it a try.
This system has to support Hive, as most of users prefer SQL. When I want to store the data into HBase, I am not sure what is the best way to store them.
As sample example, if I have data from the users table, and also have the data from addresses table, it is one-to-many relationship between them. Let's say I have user with (userid, name, ...)and address with (addressId, userId, street, state ...). I think I have following 2 options:
Store them as 2 tables in HBase, using userid as row key of Users table is obvious. For address table, I can use either addressId as row key, or (userId + addressId) as row key. The benefits of this way is that it is a straight forward, and I have very easy mapping both tables into Hive external table. But I have following doubts:The data between 2 tables will be join again and again. If I can store them together during ETL, I can avoid thatEven if I choose (userId + addressId) as row key (I can map this row key into 2 columns in Hive), it won't really help me during the join, right? If I query like this (select * from addresses where userId = 'xxx'), the HBase still need to search across all the regions, as the row key is (userid + addressed), correct? Can HBase utilize that first part of the row key being queried?In this case, there is not much optimizing can use from HBase. It will be pure data source of MR job, right?Another way is to store them in one table. But since I don't know how many addresses a user could have, even if I create an address column family in user table, how do I create column name? I thought of using addressId as prefix of the column name in this CF, but from what I found out so far, it is impossible to map dynamic column names of HBase in Hive, especially when the count of columns is unknown. I would prefer to store this way, but I have to support Hive, and don't know how to archive that. Besides there are 7-8 user related tables, I am not sure store all of them together is a good idea.
I want to know if anyone here can share some good/bad of each way. Since I am still very new to HBase, I have also some additional questions:
1) I know I should use short CF names and column names, so I am thinking just use index number as column names, like (1, 2, 3 ...). Is that fine?2) I am thinking storing frequently queried columns into one CF, the rest in another. How many CFs I can create? From HBase wiki page, it said only 2 or 3 CFs are enough. Is that kind of too small? 		 	   		  

Re: some data modeling questions related to Hbase

Posted by Ted Yu <yu...@gmail.com>.
This post would give you some more idea about how Phoenix is used:

http://search-hadoop.com/m/zS4qCsW1Ry1/james+phoenix+create+table&subj=Re+Rowkey+design+and+presplit+table

Cheers


On Fri, Feb 14, 2014 at 6:36 AM, java8964 <ja...@hotmail.com> wrote:

> Hi, Ted:
> I will take a look.
> Thanks
> Yong
>
> > Date: Thu, 13 Feb 2014 20:39:18 -0800
> > Subject: Re: some data modeling questions related to Hbase
> > From: yuzhihong@gmail.com
> > To: user@hbase.apache.org
> >
> > bq. This system has to support Hive, as most of users prefer SQL
> >
> > Have you considered Apache Phoenix ?
> > See http://incubator.apache.org/projects/phoenix.html
> >
> > Cheers
> >
> >
> > On Thu, Feb 13, 2014 at 6:49 PM, java8964 <ja...@hotmail.com> wrote:
> >
> > > Hi,
> > > I am researching a project to load data from MySQL into HBase.
> > > The data is our user data. We want to analyze the user data with its
> > > activities and tracking. The user data is big. The record count reaches
> > > billion and whole data set size is about more than several Ts.
> > > The purpose of this data in this system is pure analyzing. The reason
> I am
> > > thinking of using HBase instead of HDFS is because of daily change of
> the
> > > data. Every day, there could be about 1 to 5% of user having any
> > > update/delete activities. Because HDFS is a write-once system, I have
> to
> > > regenerate the whole latest snapshot just for small change of the
> data, it
> > > is kind of wasting.
> > > So I am thinking storing the data into HBase. All other
> > > activities/tracking data (Fact data) will store in HDFS. I hope I can
> get
> > > good performance joining the data between HBase and HDFS for different
> > > datasets. I test the HBase scan speed, not as good as HDFS, but maybe
> > > acceptable for my case. HBase here is only a sink/source of MR jobs in
> > > Hadoop. So I want to give it a try.
> > > This system has to support Hive, as most of users prefer SQL. When I
> want
> > > to store the data into HBase, I am not sure what is the best way to
> store
> > > them.
> > > As sample example, if I have data from the users table, and also have
> the
> > > data from addresses table, it is one-to-many relationship between them.
> > > Let's say I have user with (userid, name, ...)and address with
> (addressId,
> > > userId, street, state ...). I think I have following 2 options:
> > > Store them as 2 tables in HBase, using userid as row key of Users
> table is
> > > obvious. For address table, I can use either addressId as row key, or
> > > (userId + addressId) as row key. The benefits of this way is that it
> is a
> > > straight forward, and I have very easy mapping both tables into Hive
> > > external table. But I have following doubts:The data between 2 tables
> will
> > > be join again and again. If I can store them together during ETL, I can
> > > avoid thatEven if I choose (userId + addressId) as row key (I can map
> this
> > > row key into 2 columns in Hive), it won't really help me during the
> join,
> > > right? If I query like this (select * from addresses where userId =
> 'xxx'),
> > > the HBase still need to search across all the regions, as the row key
> is
> > > (userid + addressed), correct? Can HBase utilize that first part of
> the row
> > > key being queried?In this case, there is not much optimizing can use
> from
> > > HBase. It will be pure data source of MR job, right?Another way is to
> store
> > > them in one table. But since I don't know how many addresses a user
> could
> > > have, even if I create an address column family in user table, how do I
> > > create column name? I thought of using addressId as prefix of the
> column
> > > name in this CF, but from what I found out so far, it is impossible to
> map
> > > dynamic column names of HBase in Hive, especially when the count of
> columns
> > > is unknown. I would prefer to store this way, but I have to support
> Hive,
> > > and don't know how to archive that. Besides there are 7-8 user related
> > > tables, I am not sure store all of them together is a good idea.
> > > I want to know if anyone here can share some good/bad of each way.
> Since I
> > > am still very new to HBase, I have also some additional questions:
> > > 1) I know I should use short CF names and column names, so I am
> thinking
> > > just use index number as column names, like (1, 2, 3 ...). Is that
> fine?2)
> > > I am thinking storing frequently queried columns into one CF, the rest
> in
> > > another. How many CFs I can create? From HBase wiki page, it said only
> 2 or
> > > 3 CFs are enough. Is that kind of too small?
> > >
>
>

RE: some data modeling questions related to Hbase

Posted by java8964 <ja...@hotmail.com>.
Hi, Ted:
I will take a look.
Thanks
Yong

> Date: Thu, 13 Feb 2014 20:39:18 -0800
> Subject: Re: some data modeling questions related to Hbase
> From: yuzhihong@gmail.com
> To: user@hbase.apache.org
> 
> bq. This system has to support Hive, as most of users prefer SQL
> 
> Have you considered Apache Phoenix ?
> See http://incubator.apache.org/projects/phoenix.html
> 
> Cheers
> 
> 
> On Thu, Feb 13, 2014 at 6:49 PM, java8964 <ja...@hotmail.com> wrote:
> 
> > Hi,
> > I am researching a project to load data from MySQL into HBase.
> > The data is our user data. We want to analyze the user data with its
> > activities and tracking. The user data is big. The record count reaches
> > billion and whole data set size is about more than several Ts.
> > The purpose of this data in this system is pure analyzing. The reason I am
> > thinking of using HBase instead of HDFS is because of daily change of the
> > data. Every day, there could be about 1 to 5% of user having any
> > update/delete activities. Because HDFS is a write-once system, I have to
> > regenerate the whole latest snapshot just for small change of the data, it
> > is kind of wasting.
> > So I am thinking storing the data into HBase. All other
> > activities/tracking data (Fact data) will store in HDFS. I hope I can get
> > good performance joining the data between HBase and HDFS for different
> > datasets. I test the HBase scan speed, not as good as HDFS, but maybe
> > acceptable for my case. HBase here is only a sink/source of MR jobs in
> > Hadoop. So I want to give it a try.
> > This system has to support Hive, as most of users prefer SQL. When I want
> > to store the data into HBase, I am not sure what is the best way to store
> > them.
> > As sample example, if I have data from the users table, and also have the
> > data from addresses table, it is one-to-many relationship between them.
> > Let's say I have user with (userid, name, ...)and address with (addressId,
> > userId, street, state ...). I think I have following 2 options:
> > Store them as 2 tables in HBase, using userid as row key of Users table is
> > obvious. For address table, I can use either addressId as row key, or
> > (userId + addressId) as row key. The benefits of this way is that it is a
> > straight forward, and I have very easy mapping both tables into Hive
> > external table. But I have following doubts:The data between 2 tables will
> > be join again and again. If I can store them together during ETL, I can
> > avoid thatEven if I choose (userId + addressId) as row key (I can map this
> > row key into 2 columns in Hive), it won't really help me during the join,
> > right? If I query like this (select * from addresses where userId = 'xxx'),
> > the HBase still need to search across all the regions, as the row key is
> > (userid + addressed), correct? Can HBase utilize that first part of the row
> > key being queried?In this case, there is not much optimizing can use from
> > HBase. It will be pure data source of MR job, right?Another way is to store
> > them in one table. But since I don't know how many addresses a user could
> > have, even if I create an address column family in user table, how do I
> > create column name? I thought of using addressId as prefix of the column
> > name in this CF, but from what I found out so far, it is impossible to map
> > dynamic column names of HBase in Hive, especially when the count of columns
> > is unknown. I would prefer to store this way, but I have to support Hive,
> > and don't know how to archive that. Besides there are 7-8 user related
> > tables, I am not sure store all of them together is a good idea.
> > I want to know if anyone here can share some good/bad of each way. Since I
> > am still very new to HBase, I have also some additional questions:
> > 1) I know I should use short CF names and column names, so I am thinking
> > just use index number as column names, like (1, 2, 3 ...). Is that fine?2)
> > I am thinking storing frequently queried columns into one CF, the rest in
> > another. How many CFs I can create? From HBase wiki page, it said only 2 or
> > 3 CFs are enough. Is that kind of too small?
> >
 		 	   		  

Re: some data modeling questions related to Hbase

Posted by Ted Yu <yu...@gmail.com>.
bq. This system has to support Hive, as most of users prefer SQL

Have you considered Apache Phoenix ?
See http://incubator.apache.org/projects/phoenix.html

Cheers


On Thu, Feb 13, 2014 at 6:49 PM, java8964 <ja...@hotmail.com> wrote:

> Hi,
> I am researching a project to load data from MySQL into HBase.
> The data is our user data. We want to analyze the user data with its
> activities and tracking. The user data is big. The record count reaches
> billion and whole data set size is about more than several Ts.
> The purpose of this data in this system is pure analyzing. The reason I am
> thinking of using HBase instead of HDFS is because of daily change of the
> data. Every day, there could be about 1 to 5% of user having any
> update/delete activities. Because HDFS is a write-once system, I have to
> regenerate the whole latest snapshot just for small change of the data, it
> is kind of wasting.
> So I am thinking storing the data into HBase. All other
> activities/tracking data (Fact data) will store in HDFS. I hope I can get
> good performance joining the data between HBase and HDFS for different
> datasets. I test the HBase scan speed, not as good as HDFS, but maybe
> acceptable for my case. HBase here is only a sink/source of MR jobs in
> Hadoop. So I want to give it a try.
> This system has to support Hive, as most of users prefer SQL. When I want
> to store the data into HBase, I am not sure what is the best way to store
> them.
> As sample example, if I have data from the users table, and also have the
> data from addresses table, it is one-to-many relationship between them.
> Let's say I have user with (userid, name, ...)and address with (addressId,
> userId, street, state ...). I think I have following 2 options:
> Store them as 2 tables in HBase, using userid as row key of Users table is
> obvious. For address table, I can use either addressId as row key, or
> (userId + addressId) as row key. The benefits of this way is that it is a
> straight forward, and I have very easy mapping both tables into Hive
> external table. But I have following doubts:The data between 2 tables will
> be join again and again. If I can store them together during ETL, I can
> avoid thatEven if I choose (userId + addressId) as row key (I can map this
> row key into 2 columns in Hive), it won't really help me during the join,
> right? If I query like this (select * from addresses where userId = 'xxx'),
> the HBase still need to search across all the regions, as the row key is
> (userid + addressed), correct? Can HBase utilize that first part of the row
> key being queried?In this case, there is not much optimizing can use from
> HBase. It will be pure data source of MR job, right?Another way is to store
> them in one table. But since I don't know how many addresses a user could
> have, even if I create an address column family in user table, how do I
> create column name? I thought of using addressId as prefix of the column
> name in this CF, but from what I found out so far, it is impossible to map
> dynamic column names of HBase in Hive, especially when the count of columns
> is unknown. I would prefer to store this way, but I have to support Hive,
> and don't know how to archive that. Besides there are 7-8 user related
> tables, I am not sure store all of them together is a good idea.
> I want to know if anyone here can share some good/bad of each way. Since I
> am still very new to HBase, I have also some additional questions:
> 1) I know I should use short CF names and column names, so I am thinking
> just use index number as column names, like (1, 2, 3 ...). Is that fine?2)
> I am thinking storing frequently queried columns into one CF, the rest in
> another. How many CFs I can create? From HBase wiki page, it said only 2 or
> 3 CFs are enough. Is that kind of too small?
>