Posted to user@hive.apache.org by Edward Capriolo <ed...@gmail.com> on 2012/05/10 15:53:51 UTC

Re: Dimensional Data Model on Hive

On Thu, May 10, 2012 at 9:26 AM, Kuldeep Chitrakar
<ku...@synechron.com> wrote:
> Hi
>
>
>
> I have a data warehouse implementation for clickstream data analysis on an
> RDBMS. It's a star schema (dimensions and facts).
>
>
>
> Now if I want to move to Hive, do I need to create the same data model with
> dimensions and facts and join them?
>
>
>
> Or should I create a big de-normalized table which contains all textual
> attributes from all dimensions? If so, how do we handle SCD Type 2 dimensions
> in Hive?
>
>
>
> It's a very basic question, but I am just confused about this.
>
>
>
>
>
> Thanks,
>
> Kuldeep

While Hive is sometimes referred to as a data warehouse, you usually
want to avoid data warehouse concepts like the star schema. There are a
number of reasons for this:
1) No unique constraints
2) Limited index capabilities
3) Map-side joins are only optimal when a single table is small
4) Most join types, while they generalize into MapReduce, behave much
differently than a join in single-node databases

In most situations I advise going the "NoSQL route" and de-normalizing
almost everything. Optimize for scanning; see the sketch below.
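As a rough HiveQL illustration of that advice (the table and column names
here are invented for the example, not from the thread): dimension
attributes are copied into one wide clickstream table at load time, so each
row records them as of the event, which also largely sidesteps the SCD
Type 2 question for historical queries.

  -- Hypothetical de-normalized clickstream table, partitioned by day
  -- so queries scan only the dates they need.
  CREATE TABLE clicks_denorm (
    event_ts      STRING,
    user_id       STRING,
    user_country  STRING,
    page_url      STRING,
    page_category STRING,
    referrer      STRING
  )
  PARTITIONED BY (dt STRING)
  STORED AS SEQUENCEFILE;

  -- If one small dimension must still be joined, a map-side join
  -- (point 3 above) works well when that table fits in memory:
  SELECT /*+ MAPJOIN(d) */ c.page_url, d.campaign_name
  FROM clicks_denorm c
  JOIN campaign_dim d ON (c.referrer = d.referrer);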

Re: Dimensional Data Model on Hive

Posted by Ashish Thusoo <at...@qubole.com>.
Also, if most of the things that you will be doing are full scans as opposed
to needle-in-a-haystack queries, there is usually no point in paying the
overhead of running HBase region servers. Only if your data is heavily
accessed by a key is the overhead of HBase justified. Another case could be
when parts of your data are updated heavily, again by a predominant key.

Ashish
On May 10, 2012 10:25 AM, "Edward Capriolo" <ed...@gmail.com> wrote:

> On Thu, May 10, 2012 at 10:16 AM, Kuldeep Chitrakar
> <ku...@synechron.com> wrote:
> > Does that mean all data in one big table in de-normalized form? Then
> > what's the main benefit of using Hive over HBase, as HBase also recommends
> > a highly de-normalized, BigTable-style model?
> >
> >
> > Thanks,
> > Kuldeep
> > -----Original Message-----
> > From: Edward Capriolo [mailto:edlinuxguru@gmail.com]
> > Sent: 10 May 2012 19:24
> > To: user@hive.apache.org
> > Subject: Re: Dimensional Data Model on Hive
> >
> > On Thu, May 10, 2012 at 9:26 AM, Kuldeep Chitrakar
> > <ku...@synechron.com> wrote:
> >> Hi
> >>
> >>
> >>
> >> I have a data warehouse implementation for clickstream data analysis on
> >> an RDBMS. It's a star schema (dimensions and facts).
> >>
> >>
> >>
> >> Now if I want to move to Hive, do I need to create the same data model
> >> with dimensions and facts and join them?
> >>
> >>
> >>
> >> Or should I create a big de-normalized table which contains all textual
> >> attributes from all dimensions? If so, how do we handle SCD Type 2
> >> dimensions in Hive?
> >>
> >>
> >>
> >> It's a very basic question, but I am just confused about this.
> >>
> >>
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Kuldeep
> >
> > While Hive is sometimes referred to as a data warehouse, you usually
> > want to avoid data warehouse concepts like the star schema. There are a
> > number of reasons for this:
> > 1) No unique constraints
> > 2) Limited index capabilities
> > 3) Map-side joins are only optimal when a single table is small
> > 4) Most join types, while they generalize into MapReduce, behave much
> > differently than a join in single-node databases
> >
> > In most situations I advise going the "NoSQL route" and de-normalizing
> > almost everything. Optimize for scanning.
>
> Q: Does that mean all data in one big table in de-normalized form?
> A: No. I qualified this by saying "most". I am not advocating one
> large table; every situation is different. But generally a star schema
> is going to be very difficult to implement and will have fewer benefits
> than it would in most RDBMS systems.
>
> Q: What is the main benefit of using Hive over HBase?
> A: I am not sure what you mean by "against". If you mean why you would
> choose one and not the other: HBase is designed for low-latency (< 20 ms)
> put, get, and scan operations. Hive is a declarative, SQL-like language
> that "queries" multi-GB or TB-sized files in Hadoop. There is also a
> storage handler implementation that allows you to query HBase data
> from Hive, if that is what you mean by "against".
>

Re: Security and permissions within Hive

Posted by shashwat shriparv <dw...@gmail.com>.
Check out this:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Authorization


Anyhow, something needs to be written as a middle layer; you cannot expect a
foolproof solution from the default Hive authentication and roles.
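For reference, the role-based statements described on that page look roughly
like this (the role, table, and user names below are made up for the example):

  -- Create a role, grant it a privilege, and assign it to a user.
  CREATE ROLE analyst;
  GRANT SELECT ON TABLE clicks TO ROLE analyst;
  GRANT ROLE analyst TO USER ranjith;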

On Thu, May 10, 2012 at 8:22 PM, Raghunath, Ranjith <
Ranjith.Raghunath1@usaa.com> wrote:

> Anyone implementing authorization and roles within their Hive environment?
> If so, how successful has it been?
>
> Thanks,
> Ranjith
>



-- 


∞
Shashwat Shriparv

Security and permissions within Hive

Posted by "Raghunath, Ranjith" <Ra...@usaa.com>.
Anyone implementing authorization and roles within their Hive environment? If so, how successful has it been?

Thanks,
Ranjith

Re: Dimensional Data Model on Hive

Posted by Edward Capriolo <ed...@gmail.com>.
On Thu, May 10, 2012 at 10:16 AM, Kuldeep Chitrakar
<ku...@synechron.com> wrote:
> Does that mean all data in one big table in de-normalized form? Then what's the main benefit of using Hive over HBase, as HBase also recommends a highly de-normalized, BigTable-style model?
>
>
> Thanks,
> Kuldeep
> -----Original Message-----
> From: Edward Capriolo [mailto:edlinuxguru@gmail.com]
> Sent: 10 May 2012 19:24
> To: user@hive.apache.org
> Subject: Re: Dimensional Data Model on Hive
>
> On Thu, May 10, 2012 at 9:26 AM, Kuldeep Chitrakar
> <ku...@synechron.com> wrote:
>> Hi
>>
>>
>>
>> I have a data warehouse implementation for clickstream data analysis on an
>> RDBMS. It's a star schema (dimensions and facts).
>>
>>
>>
>> Now if I want to move to Hive, do I need to create the same data model
>> with dimensions and facts and join them?
>>
>>
>>
>> Or should I create a big de-normalized table which contains all textual
>> attributes from all dimensions? If so, how do we handle SCD Type 2
>> dimensions in Hive?
>>
>>
>>
>> It's a very basic question, but I am just confused about this.
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Kuldeep
>
> While Hive is sometimes referred to as a data warehouse, you usually
> want to avoid data warehouse concepts like the star schema. There are a
> number of reasons for this:
> 1) No unique constraints
> 2) Limited index capabilities
> 3) Map-side joins are only optimal when a single table is small
> 4) Most join types, while they generalize into MapReduce, behave much
> differently than a join in single-node databases
>
> In most situations I advise going the "NoSQL route" and de-normalizing
> almost everything. Optimize for scanning.

Q: Does that mean all data in one big table in de-normalized form?
A: No. I qualified this by saying "most". I am not advocating one
large table; every situation is different. But generally a star schema
is going to be very difficult to implement and will have fewer benefits
than it would in most RDBMS systems.

Q: What is the main benefit of using Hive over HBase?
A: I am not sure what you mean by "against". If you mean why you would
choose one and not the other: HBase is designed for low-latency (< 20 ms)
put, get, and scan operations. Hive is a declarative, SQL-like language
that "queries" multi-GB or TB-sized files in Hadoop. There is also a
storage handler implementation that allows you to query HBase data
from Hive, if that is what you mean by "against".
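
For what it is worth, the storage handler mentioned above is used with DDL
along these lines (the Hive table name, HBase table name, and column
mappings here are invented for the example):

  -- Expose an existing HBase table to Hive queries through the
  -- HBase storage handler, mapping the row key and one column family.
  CREATE EXTERNAL TABLE hbase_clicks (rowkey STRING, url STRING, ts STRING)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:url,cf:ts")
  TBLPROPERTIES ("hbase.table.name" = "clicks");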

RE: Dimensional Data Model on Hive

Posted by Kuldeep Chitrakar <ku...@synechron.com>.
Does that mean all data in one big table in de-normalized form? Then what's the main benefit of using Hive over HBase, as HBase also recommends a highly de-normalized, BigTable-style model?


Thanks,
Kuldeep
-----Original Message-----
From: Edward Capriolo [mailto:edlinuxguru@gmail.com] 
Sent: 10 May 2012 19:24
To: user@hive.apache.org
Subject: Re: Dimensional Data Model on Hive

On Thu, May 10, 2012 at 9:26 AM, Kuldeep Chitrakar
<ku...@synechron.com> wrote:
> Hi
>
>
>
> I have a data warehouse implementation for clickstream data analysis on an
> RDBMS. It's a star schema (dimensions and facts).
>
>
>
> Now if I want to move to Hive, do I need to create the same data model with
> dimensions and facts and join them?
>
>
>
> Or should I create a big de-normalized table which contains all textual
> attributes from all dimensions? If so, how do we handle SCD Type 2 dimensions
> in Hive?
>
>
>
> It's a very basic question, but I am just confused about this.
>
>
>
>
>
> Thanks,
>
> Kuldeep

While Hive is sometimes referred to as a data warehouse, you usually
want to avoid data warehouse concepts like the star schema. There are a
number of reasons for this:
1) No unique constraints
2) Limited index capabilities
3) Map-side joins are only optimal when a single table is small
4) Most join types, while they generalize into MapReduce, behave much
differently than a join in single-node databases

In most situations I advise going the "NoSQL route" and de-normalizing
almost everything. Optimize for scanning.