Posted to user@hbase.apache.org by bigdata <bi...@outlook.com> on 2012/12/13 06:57:06 UTC

How to design a data warehouse in HBase?

Dear all,
We have a traditional star-schema data warehouse in an RDBMS, and now we want to move it to HBase. After studying HBase, I learned that it is normally queried by rowkey:
1. full rowkey (fastest)
2. rowkey filter (fast)
3. column family/qualifier filter (slow)
How can I design the HBase tables to implement the warehouse functions, such as:
1. Query by DimensionA
2. Query by DimensionA and DimensionB
3. Sum, count, distinct ...
In my opinion, I would have to create several HBase tables, with every combination of dimensions as the rowkey. This solution would lead to huge data duplication. Are there any good suggestions for solving it?
Thanks a lot!
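
The three access paths listed in the question can be illustrated with a toy model. This is only a sketch under stated assumptions: `SortedTable`, the `date|deviceid` rowkeys and the `dim:`/`fact:` column names are hypothetical, and the class is a plain Python stand-in for a table sorted by rowkey, not the HBase client API.

```python
import bisect


class SortedTable:
    """Toy model of a table whose rows are kept sorted by rowkey (not the HBase client API)."""

    def __init__(self):
        self._keys = []  # rowkeys in sorted order
        self._rows = {}  # rowkey -> {"family:qualifier": value}

    def put(self, rowkey, columns):
        if rowkey not in self._rows:
            bisect.insort(self._keys, rowkey)
            self._rows[rowkey] = {}
        self._rows[rowkey].update(columns)

    def get(self, rowkey):
        # 1. Full rowkey: a single direct lookup.
        return self._rows.get(rowkey)

    def prefix_scan(self, prefix):
        # 2. Rowkey prefix: seek to the first candidate key and stop at the
        #    first key that no longer matches, so only one key range is read.
        matches = []
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            matches.append(self._keys[i])
            i += 1
        return matches

    def column_scan(self, column, value):
        # 3. Column filter: the value is not part of the key, so every row
        #    has to be inspected.
        return [k for k in self._keys if self._rows[k].get(column) == value]


table = SortedTable()
table.put("2012-12-12|abcdef", {"dim:country": "US", "fact:logins": 2})
table.put("2012-12-13|defdaf", {"dim:country": "DE", "fact:logins": 1})
table.put("2012-12-13|abcdef", {"dim:country": "US", "fact:logins": 1})

one_row = table.get("2012-12-13|defdaf")             # access path 1
one_day = table.prefix_scan("2012-12-13|")           # access path 2
by_country = table.column_scan("dim:country", "US")  # access path 3
```

In this model, only the dimension that leads the rowkey benefits from the cheap prefix scan, which is why one table per dimension combination looks tempting and why it duplicates data.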

Re: How to design a data warehouse in HBase?

Posted by Asaf Mesika <as...@gmail.com>.

Here's my take on this matter:

As things stand, there isn't any good solution for the kind of data warehousing you want at large scale. Impala and Drill are both projects heading in this direction, but they still have a way to go and are not production ready yet. If you can stay on MySQL for the moment, then stay there, or go for Hive, but prepare a very large cluster of computers to handle the load.

A normal data warehouse as you describe is composed of DIMS (dimension) and FACT tables. Representing this as-is in HBase is a mess, since it will require you to do joins across the cluster - i.e. RPC calls, and lots of them, between Region Servers - which will slow your queries to a crawl (unless you want your users to wait 10-15 minutes).

The saner approach, then, is to denormalize the data - i.e. have one big fat FACT table that contains the attributes of all the dimensions - and save it to HDFS or HBase. Both have a partition key - your primary key to query upon (e.g. timestamp-customerId, timestamp-deviceId). You first filter the data by the partition key, thus scanning only a portion of it, and then, on each datanode/RS, filter by the dimension attributes as required by your query. If your data is distributed evenly across your cluster, running this query on multiple nodes at the same time can overcome the downside of fully reading the files/rows belonging to the partition key. You can add the statistical functions you require, such as sum and count, and send back only the rolled-up results, thus saving bandwidth.
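
The flow described above (prune by partition key, filter by dimension attributes on each node, send back only rolled-up results) can be sketched as follows. This is a minimal single-process illustration, not an HBase or HDFS program: the `NODES` data, the `date|customerId` key layout and both function names are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical denormalized fact rows: (partition_key, dimension attributes, measure).
# Each inner list stands for the rows held by one node (datanode / region server).
NODES = [
    [("2012-12-12|cust1", {"country": "US", "device": "phone"}, 10),
     ("2012-12-12|cust2", {"country": "DE", "device": "tablet"}, 5)],
    [("2012-12-12|cust3", {"country": "US", "device": "tablet"}, 7),
     ("2012-12-13|cust1", {"country": "US", "device": "phone"}, 3)],
]


def local_rollup(rows, key_prefix, filters, group_by):
    """Runs on one node: prune by partition key, filter by dimensions, pre-aggregate."""
    partial = defaultdict(lambda: [0, 0])  # group value -> [sum, count]
    for key, dims, measure in rows:
        if not key.startswith(key_prefix):
            continue  # outside the requested partition
        if any(dims.get(name) != wanted for name, wanted in filters.items()):
            continue  # dimension attributes do not match the query
        acc = partial[dims[group_by]]
        acc[0] += measure
        acc[1] += 1
    return dict(partial)


def merge(partials):
    """Runs on the client: combine the small per-node results."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for group, (subtotal, count) in partial.items():
            total[group][0] += subtotal
            total[group][1] += count
    return {group: tuple(acc) for group, acc in total.items()}


# Sum and count of the measure per device, for US customers on 2012-12-12.
result = merge(
    local_rollup(rows, "2012-12-12|", {"country": "US"}, "device") for rows in NODES
)
```

Sums and counts merge by simple addition. A distinct count does not, so each node would have to return the set of values it saw (or a compact summary of it) rather than a number.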

The problem with current software stacks is that none of them actually does what is stated above. Impala is heading in the right direction, but it has yet to reach a production state, from what I've read. Drill is just starting. Thus you end up having to write MapReduce jobs that implement the solution described above, either by employing Hive to read the HDFS files stored by partition key and translate your query into an MR job, or by using other open source solutions such as Cascading to ease the burden of writing your own MR job code.

So in summary, I would stay on Oracle/MySQL until a decent open source solution answering your needs arrives - which I guess will happen during 2013/2014. If you can't, you will be forced to write your own custom solution, tailored to your queries, based on MR jobs. You can take a look at Trecul (https://github.com/akamai-tech/trecul) to boost the processing speed of your MapReduce jobs.

Asaf


Re: How to design a data warehouse in HBase?

Posted by Michel Segel <mi...@hotmail.com>.

I don't know that I would recommend Impala at this stage in its development.
Sorry, it has a bit of growing up to do.

It's interesting, but no UDFs, right?

Sent from a remote device. Please excuse any typos...

Mike Segel

On Dec 13, 2012, at 4:42 PM, "Kevin O'dell" <ke...@cloudera.com> wrote:

> Correct, Impala relies on the Hive Metastore.
> 
> On Thu, Dec 13, 2012 at 11:38 AM, Manoj Babu <ma...@gmail.com> wrote:
> 
>> Kevin,
>> 
>> Impala requires Hive, right?
>> So to get the advantages of Impala, do we need to go with Hive?
>> 
>> 
>> Cheers!
>> Manoj.
>> 
>> 
>> 
>> On Thu, Dec 13, 2012 at 9:03 PM, Mohammad Tariq <do...@gmail.com> wrote:
>> 
>>> Thank you so much for the clarification Kevin.
>>> 
>>> Regards,
>>>    Mohammad Tariq
>>> 
>>> 
>>> 
>>> On Thu, Dec 13, 2012 at 9:00 PM, Kevin O'dell <kevin.odell@cloudera.com> wrote:
>>>
>>>> Mohammad,
>>>>
>>>> I am not sure you are thinking about Impala correctly. It still uses
>>>> HDFS, so your data increasing over time is fine. You are not going to
>>>> need to tune for special CPU, storage, or network. Typically with Impala
>>>> you are going to be bound at the disks, as it functions off of data
>>>> locality. You can also use Snappy, GZip, or BZip compression to help
>>>> with the amount of data you are storing. You will not need to
>>>> frequently update your hardware.
>>>> 
>>>> On Thu, Dec 13, 2012 at 10:06 AM, Mohammad Tariq <do...@gmail.com> wrote:
>>>>
>>>>> Oh yes..Impala..good point by Kevin.
>>>>>
>>>>> Kevin: Would it be appropriate to say that I should go for Impala if
>>>>> my data is not going to increase dramatically over time, or if I have
>>>>> to work on only a subset of my BigData? Since Impala uses MPP, it may
>>>>> require specialized hardware tuned for CPU, storage and network
>>>>> performance for better results, which could become a problem if I have
>>>>> to upgrade the hardware frequently because of the growing data.
>>>>> 
>>>>> Regards,
>>>>>    Mohammad Tariq
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Dec 13, 2012 at 8:17 PM, Kevin O'dell <kevin.odell@cloudera.com> wrote:
>>>>>
>>>>>> To Mohammad's point: you can use HBase for quick scans of the data,
>>>>>> Hive for your longer-running jobs, and Impala over the two for quick
>>>>>> ad hoc searches.
>>>>>> 
>>>>>> On Thu, Dec 13, 2012 at 9:44 AM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>>>
>>>>>>> I am not saying HBase is not good. My point was to consider Hive as
>>>>>>> well. Think about the approach keeping both tools in mind, and then
>>>>>>> decide. I just provided an option, keeping in mind the available
>>>>>>> built-in Hive features. I would like to add one more point here: you
>>>>>>> can map your HBase tables to Hive.
>>>>>>> 
>>>>>>> Regards,
>>>>>>>    Mohammad Tariq
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Dec 13, 2012 at 7:58 PM, bigdata <bigdatabase@outlook.com> wrote:
>>>>>>>
>>>>>>>> Hi, Tariq
>>>>>>>> Thanks for your feedback. Actually, we now have two ways to reach
>>>>>>>> the target: Hive and HBase. Could you tell me why HBase is not good
>>>>>>>> for my requirements? Or what is the problem with my solution?
>>>>>>>> Thanks.
>>>>>>>> 
>>>>>>>>> From: dontariq@gmail.com
>>>>>>>>> Date: Thu, 13 Dec 2012 15:43:25 +0530
>>>>>>>>> Subject: Re: How to design a data warehouse in HBase?
>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>> 
>>>>>>>>> Both have different purposes. People normally say that Hive is
>>>>>>>>> slow, but that's just because it uses MapReduce under the hood. And
>>>>>>>>> I'm sure that if the data stored in HBase is very large, nobody
>>>>>>>>> would write sequential programs for Get or Scan. Instead they would
>>>>>>>>> write MR jobs or do something similar.
>>>>>>>>>
>>>>>>>>> My point is that nothing can be 100% real time. Is that what you
>>>>>>>>> want? If that is the case, I would never suggest Hadoop in the first
>>>>>>>>> place, as it's a batch processing system and cannot be used like an
>>>>>>>>> OLTP system unless you have thought of some additional stuff. Since
>>>>>>>>> you are talking about a warehouse, I am assuming you are going to
>>>>>>>>> store and process gigantic amounts of data. That's the only reason I
>>>>>>>>> suggested Hive.
>>>>>>>>>
>>>>>>>>> The whole point is that not everything is a solution for
>>>>>>>>> everything. One size doesn't fit all. First, we need to analyze our
>>>>>>>>> particular use case. The person who says Hive is slow might be
>>>>>>>>> correct, but only for his scenario.
>>>>>>>>> 
>>>>>>>>> HTH
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>>    Mohammad Tariq
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Dec 13, 2012 at 3:17 PM, bigdata <bigdatabase@outlook.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> I've been told that Hive's performance is too low: it accesses HDFS
>>>>>>>>>> files and scans all the data to find one record. Is that true? And
>>>>>>>>>> that HBase is much faster than it?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> From: dontariq@gmail.com
>>>>>>>>>>> Date: Thu, 13 Dec 2012 15:12:25 +0530
>>>>>>>>>>> Subject: Re: How to design a data warehouse in HBase?
>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>> 
>>>>>>>>>>> Hi there,
>>>>>>>>>>>
>>>>>>>>>>> If you are really planning a warehousing solution, then I would
>>>>>>>>>>> suggest you have a look at Apache Hive. It provides warehousing
>>>>>>>>>>> capabilities on top of a Hadoop cluster. Along with that, it also
>>>>>>>>>>> provides an SQL-like interface to the data stored in your
>>>>>>>>>>> warehouse, which would be very helpful to you in case you are
>>>>>>>>>>> coming from an SQL background.
>>>>>>>>>>> 
>>>>>>>>>>> HTH
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>>    Mohammad Tariq
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Dec 13, 2012 at 2:43 PM, bigdata <bigdatabase@outlook.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks. I think a real example would help me understand your
>>>>>>>>>>>> suggestions.
>>>>>>>>>>>> Now I have a relational table:
>>>>>>>>>>>>
>>>>>>>>>>>> ID   LoginTime             DeviceID
>>>>>>>>>>>> 1    2012-12-12 12:12:12   abcdef
>>>>>>>>>>>> 2    2012-12-12 19:12:12   abcdef
>>>>>>>>>>>> 3    2012-12-13 10:10:10   defdaf
>>>>>>>>>>>>
>>>>>>>>>>>> There are several requirements for this table:
>>>>>>>>>>>> 1. How many devices log in each day?
>>>>>>>>>>>> 2. For one day, how many new devices log in (devices that never
>>>>>>>>>>>>    logged in before)?
>>>>>>>>>>>> 3. For one day, how many accumulated devices have logged in?
>>>>>>>>>>>> How can I design HBase tables to calculate these data? My
>>>>>>>>>>>> current solution is:
>>>>>>>>>>>>
>>>>>>>>>>>> table A:
>>>>>>>>>>>>   rowkey: date-deviceid
>>>>>>>>>>>>   column family: login
>>>>>>>>>>>>   column qualifier: 2012-12-12 12:12:12/2012-12-12 19:12:12....
>>>>>>>>>>>> table B:
>>>>>>>>>>>>   rowkey: deviceid
>>>>>>>>>>>>   column family: null or any value
>>>>>>>>>>>>
>>>>>>>>>>>> For req#1, I can scan table A, using prefixfilter(rowkey) to
>>>>>>>>>>>> select one particular date, and count the records.
>>>>>>>>>>>> For req#2, I do a get on table B for each deviceid and count the
>>>>>>>>>>>> result.
>>>>>>>>>>>> For req#3, I count table A with a prefixfilter, like req#1.
>>>>>>>>>>>> Is that OK? Or are there better solutions?
>>>>>>>>>>>> Thanks!!
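
The two-table design proposed above can be tried out on the three sample rows with a small in-memory sketch. Python dictionaries stand in for the HBase tables, and one detail is an assumption that the message leaves open: table B is given the device's first login date as its value, which is what makes req#2 and req#3 answerable from it. The function names are hypothetical.

```python
import bisect

# The three sample rows from the relational table: (LoginTime, DeviceID).
logins = [
    ("2012-12-12 12:12:12", "abcdef"),
    ("2012-12-12 19:12:12", "abcdef"),
    ("2012-12-13 10:10:10", "defdaf"),
]

table_a = {}  # rowkey "date-deviceid" -> login times (the column qualifiers)
table_b = {}  # rowkey deviceid -> first login date (assumed value, see above)

for login_time, device_id in sorted(logins):
    day = login_time[:10]
    table_a.setdefault(f"{day}-{device_id}", []).append(login_time)
    table_b.setdefault(device_id, day)  # only the first login sets the value

sorted_keys = sorted(table_a)  # HBase keeps rowkeys sorted; emulate that here


def devices_on(day):
    """req#1: count the table A rows whose rowkey starts with the date."""
    prefix = day + "-"
    i = bisect.bisect_left(sorted_keys, prefix)
    count = 0
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        count += 1
        i += 1
    return count


def new_devices_on(day):
    """req#2: devices whose first login falls on that day."""
    return sum(1 for first_day in table_b.values() if first_day == day)


def accumulated_devices_until(day):
    """req#3: devices first seen on or before that day."""
    return sum(1 for first_day in table_b.values() if first_day <= day)
```

Note that in this sketch req#2 and req#3 walk all of table B, the equivalent of a full table scan, whereas req#1 only touches the rows for one date prefix.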
>>>>>>>>>>>> 
>>>>>>>>>>>>> CC: user@hbase.apache.org
>>>>>>>>>>>>> From: michael_segel@hotmail.com
>>>>>>>>>>>>> Subject: Re: How to design a data warehouse in HBase?
>>>>>>>>>>>>> Date: Thu, 13 Dec 2012 08:43:31 +0000
>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>> 
>>>>>>>>>>>>> You need to spend a bit of time on schema design.
>>>>>>>>>>>>> You need to flatten your schema...
>>>>>>>>>>>>> Implement some secondary indexing to improve join performance...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Depends on what you want to do... There are other options too...
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Mike Segel
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Dec 13, 2012, at 7:09 AM, lars hofhansl <lhofhansl@yahoo.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> For OLAP-type queries you will generally be better off with a
>>>>>>>>>>>>>> truly column-oriented database.
>>>>>>>>>>>>>> You can probably shoehorn HBase into this, but it wasn't really
>>>>>>>>>>>>>> designed with raw scan performance along single columns in mind.
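
The point about scanning along a single column can be made concrete with a deliberately crude model. It only counts how many stored values must be read to sum one column under a row-wise versus a column-wise layout. The records, column names and functions are invented for the illustration and say nothing about HBase's actual storage format.

```python
records = [
    {"device": "abcdef", "country": "US", "logins": 2},
    {"device": "defdaf", "country": "DE", "logins": 1},
    {"device": "ghijkl", "country": "US", "logins": 4},
]
COLUMNS = ["device", "country", "logins"]

# Row-oriented layout: all values of one record are stored together.
row_layout = [[rec[c] for c in COLUMNS] for rec in records]

# Column-oriented layout: all values of one column are stored together.
column_layout = {c: [rec[c] for rec in records] for c in COLUMNS}


def sum_from_rows(rows, column):
    """Sum one column from the row layout; whole rows are read to reach it."""
    idx = COLUMNS.index(column)
    total = values_read = 0
    for row in rows:
        values_read += len(row)
        total += row[idx]
    return total, values_read


def sum_from_columns(columns, column):
    """Sum one column from the column layout; only that column is read."""
    values = columns[column]
    return sum(values), len(values)
```

Both functions return the same total, but the row layout reads every value of every record, while the column layout reads one value per record. The gap widens as the fact table gets wider.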
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
> 
> 
> 
> -- 
> Kevin O'Dell
> Customer Operations Engineer, Cloudera

Re: How to design a data warehouse in HBase?

Posted by Kevin O'dell <ke...@cloudera.com>.

Correct, Impala relies on the Hive Metastore.




-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera

Re: How to design a data warehouse in HBase?

Posted by Manoj Babu <ma...@gmail.com>.

Kevin,

Impala requires Hive, right?
So to get the advantages of Impala, do we need to go with Hive?


Cheers!
Manoj.



> > > > > > > > > > Thanks!!
> > > > > > > > > >
> > > > > > > > > > > CC: user@hbase.apache.org
> > > > > > > > > > > From: michael_segel@hotmail.com
> > > > > > > > > > > Subject: Re: How to design a data warehouse in HBase?
> > > > > > > > > > > Date: Thu, 13 Dec 2012 08:43:31 +0000
> > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > >
> > > > > > > > > > > You need to spend a bit of time on Schema design.
> > > > > > > > > > > You need to flatten your Schema...
> > > > > > > > > > > Implement some secondary indexing to improve join
> > > > > performance...
> > > > > > > > > > >
> > > > > > > > > > > Depends on what you want to do... There are other
> options
> > > > > too...
> > > > > > > > > > >
> > > > > > > > > > > Sent from a remote device. Please excuse any typos...
> > > > > > > > > > >
> > > > > > > > > > > Mike Segel
> > > > > > > > > > >
> > > > > > > > > > > On Dec 13, 2012, at 7:09 AM, lars hofhansl <
> > > > > lhofhansl@yahoo.com>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > For OLAP type queries you will generally be better
> off
> > > > with a
> > > > > > truly
> > > > > > > > > > column oriented database.
> > > > > > > > > > > > You can probably shoehorn HBase into this, but it
> > wasn't
> > > > > really
> > > > > > > > > > designed with raw scan performance along single columns
> in
> > > > mind.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > ________________________________
> > > > > > > > > > > > From: bigdata <bi...@outlook.com>
> > > > > > > > > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > > > > > Sent: Wednesday, December 12, 2012 9:57 PM
> > > > > > > > > > > > Subject: How to design a data warehouse in HBase?
> > > > > > > > > > > >
> > > > > > > > > > > > Dear all,
> > > > > > > > > > > > We have a traditional star-model data warehouse in
> > RDBMS,
> > > > now
> > > > > > we
> > > > > > > > want
> > > > > > > > > > to transfer it to HBase. After study HBase, I learn that
> > > HBase
> > > > is
> > > > > > > > normally
> > > > > > > > > > can be query by rowkey.
> > > > > > > > > > > > 1.full rowkey (fastest)2.rowkey filter (fast)3.column
> > > > > > > > family/qualifier
> > > > > > > > > > filter (slow)
> > > > > > > > > > > > How can I design the HBase tables to implement the
> > > > warehouse
> > > > > > > > > > functions, like:1.Query by DimensionA2.Query by
> DimensionA
> > > and
> > > > > > > > > > DimensionB3.Sum, count, distinct ...
> > > > > > > > > > > > From my opinion, I should create several HBase tables
> > > with
> > > > > all
> > > > > > > > > > combinations of different dimensions as the rowkey. This
> > > > solution
> > > > > > will
> > > > > > > > lead
> > > > > > > > > > to huge data duplication. Is there any good suggestions
> to
> > > > solve
> > > > > > it?
> > > > > > > > > > > > Thanks a lot!
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Kevin O'Dell
> > > > Customer Operations Engineer, Cloudera
> > > >
> > >
> >
> >
> >
> > --
> > Kevin O'Dell
> > Customer Operations Engineer, Cloudera
> >
>

Re: How to design a data warehouse in HBase?

Posted by Mohammad Tariq <do...@gmail.com>.

Thank you so much for the clarification, Kevin.

Regards,
    Mohammad Tariq



On Thu, Dec 13, 2012 at 9:00 PM, Kevin O'dell <ke...@cloudera.com> wrote:

> Mohammad,
>
>   I am not sure you are thinking about Impala correctly.  It still uses
> HDFS, so your data growing over time is fine.  You are not going to need
> to tune for special CPU, storage, or network.  Typically with Impala you
> are going to be bound at the disks, as it relies on data locality.
> You can also use Snappy, GZip, or BZip2 compression to help with the
> amount of data you are storing.  You will not need to upgrade your
> hardware frequently.

Re: How to design a data warehouse in HBase?

Posted by Kevin O'dell <ke...@cloudera.com>.

Mohammad,

  I am not sure you are thinking about Impala correctly.  It still uses
HDFS, so your data growing over time is fine.  You are not going to need
to tune for special CPU, storage, or network.  Typically with Impala you
are going to be bound at the disks, as it relies on data locality.
You can also use Snappy, GZip, or BZip2 compression to help with the
amount of data you are storing.  You will not need to upgrade your
hardware frequently.
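
To get a rough feel for why compression helps with warehouse-style data,
here is a small sketch using only Python's standard library. It compresses
some made-up, repetitive login rows with gzip and bz2 and compares the sizes.
Snappy is left out because it is not in the standard library, and nothing
here reflects how Impala or HDFS actually store or compress files. The sample
rows and the names compressed_sizes and sample_rows are assumptions for
illustration only.

```python
import bz2
import gzip


def compressed_sizes(raw: bytes) -> dict:
    """Return the byte size of the raw data and of its gzip and bz2 forms."""
    return {
        "raw": len(raw),
        "gzip": len(gzip.compress(raw)),
        "bz2": len(bz2.compress(raw)),
    }


def sample_rows(n: int) -> bytes:
    """Build n made-up login rows: id, login time, device id (hypothetical data)."""
    lines = [
        f"{i},2012-12-12 12:{i % 60:02d}:00,device{i % 50:04d}"
        for i in range(n)
    ]
    return "\n".join(lines).encode("utf-8")


if __name__ == "__main__":
    # Print one line per format so the sizes can be compared by eye.
    for name, size in compressed_sizes(sample_rows(10_000)).items():
        print(f"{name}: {size} bytes")
```

The actual ratios depend heavily on the data and the codec, so treat the
numbers this prints as an illustration rather than a benchmark.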

On Thu, Dec 13, 2012 at 10:06 AM, Mohammad Tariq <do...@gmail.com> wrote:

> Oh yes, Impala. Good point by Kevin.
>
> Kevin: Would it be appropriate to say that I should go for Impala if my
> data is not going to increase dramatically over time, or if I have to work
> on only a subset of my BigData? Since Impala uses MPP, it may
> require specialized hardware tuned for CPU, storage and network performance
> for better results, which could become a problem if I have to upgrade the
> hardware frequently because of the growing data.
>
> Regards,
>     Mohammad Tariq



-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera

Re: How to design a data warehouse in HBase?

Posted by Mohammad Tariq <do...@gmail.com>.

Oh yes, Impala. Good point by Kevin.

Kevin: Would it be appropriate to say that I should go for Impala if my
data is not going to increase dramatically over time, or if I have to work
on only a subset of my BigData? Since Impala uses MPP, it may
require specialized hardware tuned for CPU, storage and network performance
for better results, which could become a problem if I have to upgrade the
hardware frequently because of the growing data.

Regards,
    Mohammad Tariq



On Thu, Dec 13, 2012 at 8:17 PM, Kevin O'dell <ke...@cloudera.com> wrote:

> To Mohammad's point: you can use HBase for quick scans of the data, Hive
> for your longer-running jobs, and Impala over the two for quick ad hoc
> searches.

Re: How to design a data warehouse in HBase?

Posted by Kevin O'dell <ke...@cloudera.com>.

To Mohammad's point: you can use HBase for quick scans of the data, Hive
for your longer-running jobs, and Impala over the two for quick ad hoc searches.

On Thu, Dec 13, 2012 at 9:44 AM, Mohammad Tariq <do...@gmail.com> wrote:

> I am not saying HBase is not good. My point was to consider Hive as well.
> Think about the approach keeping both tools in mind, and then decide. I just
> provided an option, keeping in mind the available built-in Hive features. I
> would like to add one more point here: you can map your HBase tables to
> Hive.
>
> Regards,
>     Mohammad Tariq



-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera

Re: How to design a data warehouse in HBase?

Posted by Mohammad Tariq <do...@gmail.com>.

I am not saying HBase is not good. My point was to consider Hive as well.
Think about the approach keeping both tools in mind, and then decide. I just
provided an option, keeping in mind the available built-in Hive features. I
would like to add one more point here: you can map your HBase tables to
Hive.
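
For reference, mapping an existing HBase table into Hive is usually done
with Hive's HBase storage handler. The sketch below only renders the DDL as
a string so its shape is visible. The table names (device_login, table_a),
the column family "login" and the qualifier "last" are assumptions made for
illustration, not details from this thread.

```python
def hive_ddl_for_hbase(hive_table, hbase_table, columns):
    """Render a CREATE EXTERNAL TABLE statement mapping an HBase table into Hive.

    columns: list of (hive_column, hive_type, hbase_mapping) tuples. The HBase
    row key is mapped with the special ":key" entry.
    """
    col_defs = ", ".join(f"{name} {typ}" for name, typ, _ in columns)
    mapping = ",".join(m for _, _, m in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({col_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}');"
    )


# Hypothetical names: Hive table "device_login" over HBase table "table_a".
ddl = hive_ddl_for_hbase(
    "device_login",
    "table_a",
    [("rowkey", "string", ":key"), ("last_login", "string", "login:last")],
)
print(ddl)
```

The statement would be run from the Hive shell. Queries against the mapped
table still read from HBase, so row-key design keeps mattering.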

Regards,
    Mohammad Tariq



On Thu, Dec 13, 2012 at 7:58 PM, bigdata <bi...@outlook.com> wrote:

> Hi Tariq,
> Thanks for your feedback. We now have two ways to reach the target: Hive
> and HBase. Could you tell me why HBase is not a good fit for my
> requirements, or what the problem is with my solution?
> Thanks.

RE: How to design a data warehouse in HBase?

Posted by bigdata <bi...@outlook.com>.

Hi Tariq,
Thanks for your feedback. We now have two ways to reach the target: Hive and HBase. Could you tell me why HBase is not a good fit for my requirements, or what the problem is with my solution?
Thanks.

> From: dontariq@gmail.com
> Date: Thu, 13 Dec 2012 15:43:25 +0530
> Subject: Re: How to design a data warehouse in HBase?
> To: user@hbase.apache.org
> 
> Both serve different purposes. People often say that Hive is slow, but
> that is just because it uses MapReduce under the hood. And I'm sure that if
> the data stored in HBase is huge, nobody would write sequential
> programs for Get or Scan. Instead, they would write MapReduce jobs or do
> something similar.
> 
> My point is that nothing can be 100% real time. Is that what you want? If
> that is the case, I would never suggest Hadoop in the first place, as it's a
> batch processing system and cannot be used like an OLTP system unless you
> have thought of some additional stuff. Since you are talking about a
> warehouse, I am assuming you are going to store and process gigantic
> amounts of data. That's the only reason I suggested Hive.
> 
> The whole point is that not everything is a solution for everything. One
> size doesn't fit all. First, we need to analyze our particular use case.
> The person who says Hive is slow might be correct, but only for their
> scenario.
> 
> HTH
> 
> Regards,
>     Mohammad Tariq

Re: How to design a data warehouse in HBase?

Posted by Mohammad Tariq <do...@gmail.com>.

Both serve different purposes. People often say that Hive is slow, but
that is just because it uses MapReduce under the hood. And I'm sure that if
the data stored in HBase is huge, nobody would write sequential
programs for Get or Scan. Instead, they would write MapReduce jobs or do
something similar.
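
To illustrate what "write MapReduce jobs" means here, the following is a
minimal in-memory sketch of the map, shuffle and reduce steps, counting
distinct devices per day. The three sample rows come from the login example
elsewhere in this thread. The function names are hypothetical, and no Hadoop
or HBase API is used.

```python
from collections import defaultdict

# Sample rows (ID, LoginTime, DeviceID) from the login example in this thread.
rows = [
    (1, "2012-12-12 12:12:12", "abcdef"),
    (2, "2012-12-12 19:12:12", "abcdef"),
    (3, "2012-12-13 10:10:10", "defdaf"),
]


def map_phase(row):
    """Emit (day, device_id) for one login record."""
    _id, login_time, device_id = row
    yield login_time[:10], device_id


def reduce_phase(day, device_ids):
    """Count the distinct devices seen on one day."""
    return day, len(set(device_ids))


# Shuffle: group mapper output by key, as the framework would between phases.
groups = defaultdict(list)
for row in rows:
    for day, device_id in map_phase(row):
        groups[day].append(device_id)

devices_per_day = dict(reduce_phase(day, ids) for day, ids in sorted(groups.items()))
print(devices_per_day)
```

In a real job the mappers would run in parallel next to the data, which is
where the throughput on large inputs comes from. The per-job startup cost is
also why small interactive queries feel slow.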

My point is that nothing can be 100% real time. Is that what you want? If
that is the case, I would never suggest Hadoop in the first place, as it's a
batch processing system and cannot be used like an OLTP system unless you
have thought of some additional stuff. Since you are talking about a
warehouse, I am assuming you are going to store and process gigantic
amounts of data. That's the only reason I suggested Hive.

The whole point is that not everything is a solution for everything. One
size doesn't fit all. First, we need to analyze our particular use case.
The person who says Hive is slow might be correct, but only for their
scenario.

HTH

Regards,
    Mohammad Tariq



On Thu, Dec 13, 2012 at 3:17 PM, bigdata <bi...@outlook.com> wrote:

> Hi,
> I've heard that Hive's performance is too low, because it reads HDFS files
> and scans all of the data to find one record, and that HBase is much
> faster. Is that true?

Re: How to design a data warehouse in HBase?

Posted by Michael Segel <mi...@hotmail.com>.

I think you need to level-set your expectations. Hive is good if you're working with a large portion of your underlying data set.

HBase is better if you're looking at a relatively small subset of the overall data.

In both cases, joins are expensive, and if you flatten your data against your dominant use case, you can get decent performance. Again, this is where secondary indexes, including search, can help.
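
As a rough illustration of flattening plus a secondary index, here is a small
in-memory sketch using plain dictionaries in place of tables. The dimension
attributes (model, country), the row-key layout and the function name are
assumptions made for the example. Only the login rows come from this thread.

```python
# Hypothetical star schema: one dimension table and one fact table.
device_dim = {
    "abcdef": {"model": "phone-x", "country": "US"},
    "defdaf": {"model": "tablet-y", "country": "DE"},
}
login_facts = [
    ("2012-12-12 12:12:12", "abcdef"),
    ("2012-12-12 19:12:12", "abcdef"),
    ("2012-12-13 10:10:10", "defdaf"),
]

# Flatten: copy the dimension attributes into every fact row, keyed by a
# composite row key chosen for the dominant query (by date, then device).
flat = {}
for login_time, device_id in login_facts:
    rowkey = f"{login_time[:10]}-{device_id}-{login_time[11:]}"
    flat[rowkey] = {"device_id": device_id, **device_dim[device_id]}

# Secondary index: a second "table" keyed by country, storing only row keys.
index_by_country = {}
for rowkey, row in flat.items():
    index_by_country.setdefault(row["country"], []).append(rowkey)


def logins_for_country(country):
    """Resolve index entries to flattened rows, with no join at query time."""
    return [flat[k] for k in sorted(index_by_country.get(country, []))]


print(len(logins_for_country("US")))
```

The trade-off is the one raised in the original question: the dimension
attributes are duplicated into every row, and the index has to be kept in
sync on every write.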

On Dec 13, 2012, at 3:47 AM, bigdata <bi...@outlook.com> wrote:

> Hi,
> I've heard that Hive's performance is too low, because it reads HDFS files and scans all of the data to find one record, and that HBase is much faster. Is that true?

RE: How to design a data warehouse in HBase?

Posted by bigdata <bi...@outlook.com>.

Hi,
I've heard that Hive's performance is too low, because it reads HDFS files and scans all of the data to find one record, and that HBase is much faster. Is that true?
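
To make the contrast in this question concrete, here is a simplified sketch
of the two access patterns: reading every record until a match is found,
versus seeking into keys that are kept sorted, which is the idea behind a
row-key Get or a prefix scan. It is a toy model with made-up values, not a
benchmark of Hive or HBase.

```python
import bisect

# Row keys kept in sorted order, as HBase stores them. Values are hypothetical.
rowkeys = [
    "2012-12-12-abcdef",
    "2012-12-12-bbbbbb",
    "2012-12-13-defdaf",
]
values = {k: {"login": "1"} for k in rowkeys}


def full_scan(target):
    """Read records one by one; returns (row, number_of_records_read)."""
    for read, key in enumerate(rowkeys, start=1):
        if key == target:
            return values[key], read
    return None, len(rowkeys)


def get_by_rowkey(target):
    """Binary search on the sorted keys instead of reading everything."""
    i = bisect.bisect_left(rowkeys, target)
    if i < len(rowkeys) and rowkeys[i] == target:
        return values[target]
    return None


def prefix_scan(prefix):
    """Seek to the first key with the prefix, then read only matching keys."""
    i = bisect.bisect_left(rowkeys, prefix)
    matched = []
    while i < len(rowkeys) and rowkeys[i].startswith(prefix):
        matched.append(rowkeys[i])
        i += 1
    return matched


print(full_scan("2012-12-13-defdaf")[1])
print(prefix_scan("2012-12-12"))
```

The sorted-key lookup only helps when the query can be expressed on the row
key. A filter on any other attribute falls back to reading everything, which
is the slow case mentioned at the start of the thread.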


> From: dontariq@gmail.com
> Date: Thu, 13 Dec 2012 15:12:25 +0530
> Subject: Re: How to design a data warehouse in HBase?
> To: user@hbase.apache.org
> 
> Hi there,
> 
>    If you are really planning a warehousing solution, then I would
> suggest you have a look at Apache Hive. It provides warehousing
> capabilities on top of a Hadoop cluster. Along with that, it also provides
> an SQL-like interface to the data stored in your warehouse, which would be
> very helpful to you if you are coming from an SQL background.
> 
> HTH
> 
> 
> 
> Regards,
>     Mohammad Tariq

Re: How to design a data warehouse in HBase?

Posted by Mohammad Tariq <do...@gmail.com>.

Hi there,

   If you are really planning a warehousing solution, then I would
suggest you have a look at Apache Hive. It provides warehousing
capabilities on top of a Hadoop cluster. It also provides
an SQL-like interface to the data stored in your warehouse, which would be
very helpful if you are coming from an SQL background.

HTH



Regards,
    Mohammad Tariq




RE: How to design a data warehouse in HBase?

Posted by bigdata <bi...@outlook.com>.

Thanks. I think a real example would help me understand your suggestions.

Now I have a relational table:

ID   LoginTime             DeviceID
1    2012-12-12 12:12:12   abcdef
2    2012-12-12 19:12:12   abcdef
3    2012-12-13 10:10:10   defdaf

There are several requirements for this table:
1. How many devices log in each day?
2. For one day, how many new devices log in (devices that have never logged in before)?
3. For one day, how many accumulated devices have logged in?

How can I design HBase tables to calculate these figures? My current solution is:

table A:
rowkey: date-deviceid
column family: login
column qualifier: 2012-12-12 12:12:12/2012-12-12 19:12:12....

table B:
rowkey: deviceid
column family: null or any value

For req#1, I can scan table A with a PrefixFilter on the rowkey to select one specific date, and count the records.
For req#2, I get table B with each deviceid, and count the result.
For req#3, I count table A with a PrefixFilter, as in req#1.
Is this OK? Or are there better solutions?
Thanks!!
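
To make the two-table idea concrete, here is a minimal in-memory sketch in Python. It is not HBase client code: plain dicts stand in for the tables, and a sorted key list emulates a rowkey prefix scan. Two things are assumptions rather than part of the proposal above: table B stores the device's first login date as its value, and "accumulated" is read as "distinct devices seen on or before that day". All function names are hypothetical.

```python
from bisect import bisect_left

# Sample rows from the relational table above: (LoginTime, DeviceID).
LOGINS = [
    ("2012-12-12 12:12:12", "abcdef"),
    ("2012-12-12 19:12:12", "abcdef"),
    ("2012-12-13 10:10:10", "defdaf"),
]


def build_tables(logins):
    """Build in-memory stand-ins for table A and table B."""
    table_a = {}  # rowkey "date-deviceid" -> list of login timestamps
    table_b = {}  # rowkey deviceid -> first login date (an assumption)
    for login_time, device_id in sorted(logins):
        date = login_time[:10]
        table_a.setdefault(f"{date}-{device_id}", []).append(login_time)
        table_b.setdefault(device_id, date)  # keeps the earliest date only
    return table_a, table_b


def prefix_scan(table, prefix):
    """Emulate a rowkey prefix scan over lexicographically sorted keys."""
    keys = sorted(table)
    matches = []
    for key in keys[bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches


def devices_on(table_a, date):
    """Req 1: distinct devices that logged in on `date`."""
    return len(prefix_scan(table_a, f"{date}-"))


def new_devices_on(table_b, date):
    """Req 2: devices whose first login happened on `date`."""
    return sum(1 for first in table_b.values() if first == date)


def accumulated_devices(table_b, date):
    """Req 3: distinct devices seen on or before `date`."""
    return sum(1 for first in table_b.values() if first <= date)


table_a, table_b = build_tables(LOGINS)
for day in ("2012-12-12", "2012-12-13"):
    print(day, devices_on(table_a, day), new_devices_on(table_b, day),
          accumulated_devices(table_b, day))
```

Note that req#2 and req#3 walk all of table B in this sketch, so on a large table they would probably be better served by pre-aggregated daily counters. Under this reading, req#3 cannot be answered by a one-day prefix scan of table A alone, because that scan only sees devices active on that day.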


Re: How to design a data warehouse in HBase?

Posted by Michel Segel <mi...@hotmail.com>.

You need to spend a bit of time on Schema design.
You need to flatten your Schema...
Implement some secondary indexing to improve join performance...

Depends on what you want to do... There are other options too...
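
As a rough illustration of what flattening plus a secondary index could look like, here is a small in-memory Python sketch. It is not HBase client code, and every name in it (the dimension attributes, the `|` separator, the helper functions) is hypothetical. The index is modelled as a second table whose rowkey starts with the attribute value and whose value points back to the main-table rowkey.

```python
# Hypothetical star-schema data; the attribute names are illustrative only.
DIM_DEVICE = {
    "abcdef": {"model": "phone-x", "country": "US"},
    "defdaf": {"model": "tablet-y", "country": "DE"},
}
FACT_LOGIN = [
    ("2012-12-12 12:12:12", "abcdef"),
    ("2012-12-12 19:12:12", "abcdef"),
    ("2012-12-13 10:10:10", "defdaf"),
]


def flatten(facts, dim_device):
    """Denormalize: copy the dimension attributes into every fact row."""
    rows = {}
    for login_time, device_id in facts:
        rowkey = f"{login_time}|{device_id}"
        rows[rowkey] = {"device_id": device_id, **dim_device[device_id]}
    return rows


def build_index(rows, attribute):
    """Index table: rowkey 'value|main-rowkey' -> main-table rowkey."""
    return {f"{row[attribute]}|{rowkey}": rowkey for rowkey, row in rows.items()}


def lookup(index, rows, value):
    """Find main-table rows by attribute value via the index table.

    This sketch checks every index key; a real index table would be
    range-scanned by prefix instead.
    """
    prefix = f"{value}|"
    return [rows[index[key]] for key in sorted(index) if key.startswith(prefix)]


flat = flatten(FACT_LOGIN, DIM_DEVICE)
by_country = build_index(flat, "country")
print(len(lookup(by_country, flat, "US")))
```

The flattened rows carry the dimension attributes with them, so no join is needed at query time. The price is that the index table has to be kept in sync with the main table on every write.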

Sent from a remote device. Please excuse any typos...

Mike Segel


Re: How to design a data warehouse in HBase?

Posted by lars hofhansl <lh...@yahoo.com>.

For OLAP-type queries you will generally be better off with a truly column-oriented database.
You can probably shoehorn HBase into this, but it wasn't really designed with raw scan performance along single columns in mind.

________________________________
 From: bigdata <bi...@outlook.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Wednesday, December 12, 2012 9:57 PM
Subject: How to design a data warehouse in HBase?

Dear all,
We have a traditional star-schema data warehouse in an RDBMS, and now we want to move it to HBase. After studying HBase, I have learned that it is normally queried by rowkey:
1. full rowkey (fastest)
2. rowkey filter (fast)
3. column family/qualifier filter (slow)
How can I design the HBase tables to implement the warehouse functions, such as:
1. Query by DimensionA
2. Query by DimensionA and DimensionB
3. Sum, count, distinct ...
In my opinion, I should create several HBase tables with all combinations of the different dimensions as the rowkey. This solution will lead to huge data duplication. Are there any good suggestions to solve it?
Thanks a lot!
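
The three access paths listed in the question can be sketched with a small in-memory model in Python. This is not HBase client code: a dict plus a sorted key list stands in for a table stored in rowkey order, and the composite rowkey layout `dimA|dimB|id`, the sample rows, and the function names are all assumptions made for illustration.

```python
from bisect import bisect_left

# Hypothetical fact rows keyed by a composite rowkey "dimA|dimB|id".
ROWS = {
    "a1|b1|001": {"dimA": "a1", "dimB": "b1", "amount": 10},
    "a1|b2|002": {"dimA": "a1", "dimB": "b2", "amount": 20},
    "a2|b1|003": {"dimA": "a2", "dimB": "b1", "amount": 30},
    "a2|b2|004": {"dimA": "a2", "dimB": "b2", "amount": 40},
}
KEYS = sorted(ROWS)  # rows are kept in rowkey order


def get(rowkey):
    """1. Full rowkey: a direct lookup of one row."""
    return ROWS.get(rowkey)


def scan_prefix(prefix):
    """2. Rowkey prefix: read only the contiguous key range that matches."""
    out = []
    for key in KEYS[bisect_left(KEYS, prefix):]:
        if not key.startswith(prefix):
            break  # keys are sorted, so the matching range has ended
        out.append(ROWS[key])
    return out


def scan_filter(column, value):
    """3. Column filter: every row is read, then filtered."""
    return [row for row in ROWS.values() if row[column] == value]


def summarize(rows):
    """Sum, count and distinct over the selected rows."""
    return {
        "sum": sum(r["amount"] for r in rows),
        "count": len(rows),
        "distinct_dimB": len({r["dimB"] for r in rows}),
    }


print(summarize(scan_prefix("a1|")))         # query by DimensionA
print(summarize(scan_prefix("a1|b1|")))      # query by DimensionA and DimensionB
print(summarize(scan_filter("dimB", "b1")))  # DimensionB alone needs a full scan
```

With this key layout, queries on DimensionA, or on DimensionA plus DimensionB, stay on the prefix path, while a query on DimensionB alone falls back to the full scan. That gap is what pushes the design towards either a second table keyed by DimensionB or some form of index, which is the duplication trade-off raised in the question.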