Posted to user@hive.apache.org by Renato Marroquín Mogrovejo <re...@gmail.com> on 2010/07/15 23:52:47 UTC

Hive Usage

Hi there, I would like to know if there is anyone who has done some kind of
modelling on Hive and is willing to share some experiences, please.
Today is my first day with Hive, and I have several doubts regarding the
modelling: would I have to do special modelling, or regular DW modelling?
Another thing I wanted to know is whether Hive already has the append option
enabled, because I know there is a Hadoop branch with append enabled, and
one of the Cloudera releases has it as well (I think it is CDH3).
Any suggestions or opinions are highly appreciated.


Renato M.

Fwd: Hive Usage

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi, I am forwarding my original message to the dev list because I didn't
get any response on the user list.
Any ideas are highly appreciated.

---------- Forwarded message ----------
From: Renato Marroquín Mogrovejo <re...@gmail.com>
Date: 2010/7/15
Subject: Hive Usage
To: hive-user@hadoop.apache.org


Hi there, I would like to know if there is anyone who has done some kind of
modelling on Hive and is willing to share some experiences, please.
Today is my first day with Hive, and I have several doubts regarding the
modelling: would I have to do special modelling, or regular DW modelling?
Another thing I wanted to know is whether Hive already has the append option
enabled, because I know there is a Hadoop branch with append enabled, and
one of the Cloudera releases has it as well (I think it is CDH3).
Any suggestions or opinions are highly appreciated.


Renato M.

Re: Hive Usage

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Jul 21, 2010 at 3:38 PM, Renato Marroquín Mogrovejo
<re...@gmail.com> wrote:
> Does anybody else have any Hive experience they would like to share? Please?
> Any kind of suggestion is highly appreciated.
> Thanks in advance.
>
>
> Renato M.
>
> 2010/7/19 Renato Marroquín Mogrovejo <re...@gmail.com>
>>
>> Hi Ashish,
>>
>> I mean, are there modeling best practices for getting better
>> performance (related to buckets, partitions, and tables)? For example,
>> creating partitions based not just on time frames but maybe also on
>> partition size, or the list partitioning described in Hive's paper that
>> the compiler uses to know where to look for the data, or other
>> modeling-related considerations along those lines.
>> Or is it just a matter of choosing between the well-known Kimball and Inmon
>> approaches?
>> Thanks in advance.
>>
>>
>> Renato M.
>>
>> 2010/7/16 Ashish Thusoo <at...@facebook.com>
>>>
>>> Hi Renato,
>>>
>>> Can you expand more on what exactly you mean by modelling?
>>>
>>> On the append side, Hive does not really support appends, though you can
>>> create a new partition within the table for every run, and that can be used
>>> as a workaround for appends.
>>>
>>> Ashish
>>> ________________________________
>>> From: Renato Marroquín Mogrovejo [mailto:renatoj.marroquin@gmail.com]
>>> Sent: Thursday, July 15, 2010 2:53 PM
>>> To: hive-user@hadoop.apache.org
>>> Subject: Hive Usage
>>>
>>> Hi there, I would like to know if there is anyone who has done some kind
>>> of modelling on Hive and is willing to share some experiences, please.
>>> Today is my first day with Hive, and I have several doubts regarding
>>> the modelling: would I have to do special modelling, or regular DW
>>> modelling?
>>> Another thing I wanted to know is whether Hive already has the append
>>> option enabled, because I know there is a Hadoop branch with append
>>> enabled, and one of the Cloudera releases has it as well (I think it is CDH3).
>>> Any suggestions or opinions are highly appreciated.
>>>
>>>
>>> Renato M.
>>
>
>

Most people partition data by time. For example, if your table is
partitioned by day, you may not need "append": you can insert your data
into a partition (day=20101005). With this, a query that uses day
anywhere in the WHERE clause (WHERE day=20101005) can use partition
pruning and scan only the data for that day. Over-partitioning with a
scheme like day/month/hour/minute is not good if you end up with
numerous partitions, each holding less data than the smallest
DFS block (128 MB).
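
A minimal sketch of what that might look like in HiveQL (the table, column, and
staging-table names here are invented for illustration):

CREATE TABLE pageviews (
  view_time STRING,
  url       STRING,
  user_id   BIGINT
)
PARTITIONED BY (day INT);

-- Each load writes (or rewrites) one day's partition instead of appending to
-- existing files.
INSERT OVERWRITE TABLE pageviews PARTITION (day=20101005)
SELECT view_time, url, user_id
FROM pageviews_staging;

-- The filter on the partition column lets Hive prune to just that day's data.
SELECT count(1) FROM pageviews WHERE day=20101005;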

Buckets are ideal if you have to join data. If you have two tables
bucketed on the same column and you join on that column, or wish to
sample, this is very efficient. Bucketing is implemented as multiple
files in the table or partition; getting the files into the buckets
usually requires a query run with the number of reducers equal to the
number of buckets. We do not do many joins or samples, so we do not need
this as much.
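
A hedged sketch of bucketing in HiveQL (the table names and the bucket count of
32 are invented; newer Hive versions can also populate buckets automatically
with set hive.enforce.bucketing=true):

CREATE TABLE users_bucketed (
  user_id BIGINT,
  country STRING
)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Populate the buckets with one reducer per bucket, distributing rows by the
-- bucketing column so each reducer writes exactly one bucket file.
set mapred.reduce.tasks=32;
INSERT OVERWRITE TABLE users_bucketed
SELECT user_id, country
FROM users_raw
CLUSTER BY user_id;

-- Sampling can then read a subset of the bucket files instead of the whole table.
SELECT * FROM users_bucketed TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);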

Hive uses row-oriented storage (with the exception of the columnar
SerDe), and I would guess that most schemas are very denormalized. This is
probably because Hive does not have indexes or constraints, so keeping
normalized data in sync is difficult. While map-reduce joins are
efficient for large data sets, a star schema is probably not where you
want to be.
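
For example, a denormalized layout might copy user attributes straight into the
fact table at load time so queries never need a join (a sketch with invented
names and columns):

CREATE TABLE pageviews_wide (
  view_time    STRING,
  url          STRING,
  user_id      BIGINT,
  -- Denormalized: copied from the user record at load time instead of being
  -- joined in from a separate dimension table at query time.
  user_country STRING,
  user_agent   STRING
)
PARTITIONED BY (day INT);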

In general, I find that since Hive is great with raw data, you should not
over-think your modelling. Just get your raw data in there. Reports are
aggregations that come from querying your raw data, and the output
tables for those are where your thinking comes in. A common pattern is
hourly or daily roll-ups of the reports that matter to people, for example:

INSERT OVERWRITE TABLE pageviewreport PARTITION (day=20100101)
SELECT url, count(1) FROM pageviews WHERE day=20100101 GROUP BY url;
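
For that statement to run, the target table has to exist and be partitioned by
day as well; a hedged sketch of what it might look like (the column types are
assumptions):

CREATE TABLE pageviewreport (
  url   STRING,
  views BIGINT
)
PARTITIONED BY (day INT);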

Hope this helps,
Edward
Hive FTW

Re: Hive Usage

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Does anybody else have any Hive experience they would like to share? Please?
Any kind of suggestion is highly appreciated.
Thanks in advance.


Renato M.

2010/7/19 Renato Marroquín Mogrovejo <re...@gmail.com>

> Hi Ashish,
>
> I mean, are there modeling best practices for getting better
> performance (related to buckets, partitions, and tables)? For example,
> creating partitions based not just on time frames but maybe also on
> partition size, or the list partitioning described in Hive's paper that
> the compiler uses to know where to look for the data, or other
> modeling-related considerations along those lines.
> Or is it just a matter of choosing between the well-known Kimball and Inmon
> approaches?
> Thanks in advance.
>
>
> Renato M.
>
> 2010/7/16 Ashish Thusoo <at...@facebook.com>
>
>> Hi Renato,
>>
>> Can you expand more on what exactly you mean by modelling?
>>
>> On the append side, Hive does not really support appends, though you can
>> create a new partition within the table for every run, and that can be used
>> as a workaround for appends.
>>
>> Ashish
>>
>> ________________________________
>> From: Renato Marroquín Mogrovejo [mailto:renatoj.marroquin@gmail.com]
>> Sent: Thursday, July 15, 2010 2:53 PM
>> To: hive-user@hadoop.apache.org
>> Subject: Hive Usage
>>
>> Hi there, I would like to know if there is anyone who has done some kind of
>> modelling on Hive and is willing to share some experiences, please.
>> Today is my first day with Hive, and I have several doubts regarding
>> the modelling: would I have to do special modelling, or regular DW
>> modelling?
>> Another thing I wanted to know is whether Hive already has the append
>> option enabled, because I know there is a Hadoop branch with append
>> enabled, and one of the Cloudera releases has it as well (I think it is CDH3).
>> Any suggestions or opinions are highly appreciated.
>>
>>
>> Renato M.
>>
>
>

Re: Hive Usage

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi Ashish,

I mean, are there modeling best practices for getting better
performance (related to buckets, partitions, and tables)? For example,
creating partitions based not just on time frames but maybe also on
partition size, or the list partitioning described in Hive's paper that
the compiler uses to know where to look for the data, or other
modeling-related considerations along those lines.
Or is it just a matter of choosing between the well-known Kimball and Inmon approaches?
Thanks in advance.


Renato M.

2010/7/16 Ashish Thusoo <at...@facebook.com>

> Hi Renato,
>
> Can you expand more on what exactly you mean by modelling?
>
> On the append side, Hive does not really support appends, though you can
> create a new partition within the table for every run, and that can be used
> as a workaround for appends.
>
> Ashish
>
> ________________________________
> From: Renato Marroquín Mogrovejo [mailto:renatoj.marroquin@gmail.com]
> Sent: Thursday, July 15, 2010 2:53 PM
> To: hive-user@hadoop.apache.org
> Subject: Hive Usage
>
> Hi there, I would like to know if there is anyone who has done some kind of
> modelling on Hive and is willing to share some experiences, please.
> Today is my first day with Hive, and I have several doubts regarding the
> modelling: would I have to do special modelling, or regular DW modelling?
> Another thing I wanted to know is whether Hive already has the append option
> enabled, because I know there is a Hadoop branch with append enabled, and
> one of the Cloudera releases has it as well (I think it is CDH3).
> Any suggestions or opinions are highly appreciated.
>
>
> Renato M.
>

RE: Hive Usage

Posted by Ashish Thusoo <at...@facebook.com>.
Hi Renato,

Can you expand more on what exactly you mean by modelling?

On the append side, Hive does not really support appends, though you can create a new partition within the table for every run, and that can be used as a workaround for appends.
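
A minimal sketch of that workaround in HiveQL (the table, column, and partition names are invented): each run writes its own partition, so nothing is ever appended to an existing file, yet queries over the table still see every run loaded so far.

CREATE TABLE events (
  event_time STRING,
  payload    STRING
)
PARTITIONED BY (run_id STRING);

-- Each run writes into its own partition...
INSERT OVERWRITE TABLE events PARTITION (run_id='2010-07-16-run1')
SELECT event_time, payload FROM events_staging;

-- ...and a query over the whole table sees the data from all runs.
SELECT count(1) FROM events;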

Ashish

________________________________
From: Renato Marroquín Mogrovejo [mailto:renatoj.marroquin@gmail.com]
Sent: Thursday, July 15, 2010 2:53 PM
To: hive-user@hadoop.apache.org
Subject: Hive Usage

Hi there, I would like to know if there is anyone who has done some kind of modelling on Hive and is willing to share some experiences, please.
Today is my first day with Hive, and I have several doubts regarding the modelling: would I have to do special modelling, or regular DW modelling?
Another thing I wanted to know is whether Hive already has the append option enabled, because I know there is a Hadoop branch with append enabled, and one of the Cloudera releases has it as well (I think it is CDH3).
Any suggestions or opinions are highly appreciated.


Renato M.