You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by ri...@nokia.com on 2012/06/27 17:47:54 UTC

Obvious and not so obvious query optimzations in Hive

Hey Hivers,

I am trying to understand what are some of the obvious and not so obvious optimization I could do for a Hive Query on AWS EMR cluster. I know the answer for some of these questions but want to know what do you guys think and by what factor it affects the performance over the other approach.


1.       Having my external table data gzipped and reading it in the table v/s no compression at all.

2.       Having the external table data on S3 v/s having it on HDFS?

3.       Creating intermediate external tables v/s non external tables v/s creating views?

4.       Storing the external table as Textfile v/s Sequence file. I know sequence file compresses the data, but in what format? I read about RC files and how efficient they are, how to use them?

5.       How are number of reducers get set for a Hive query (The way group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.

6.       Any other optimizations/ best practices?

Thanks a lot in advance.
Richin

Re: Obvious and not so obvious query optimzations in Hive

Posted by Igor Tatarinov <ig...@decide.com>.

If you are optimizing for latency (running time) as opposed to throughput,
it's best to have a single "wave" of reducers. So if your cluster is setup
with a limit of, say, 2 reducers per node using 2*N reduce tasks would work
best (for large queries). You have to specify that in your script using
SET mapred.reduce.tasks = ...;

GroupBy doesn't limit the number of reducers but OrderBy does use a single
reducer - so that's slow. I never use OrderBy though (Unix's sort is
probably faster). For analytics queries I need Distribute/Sort By (with
UDFs), which can use multiple reducers.

Hope this helps.
igor
decide.com

On Wed, Jun 27, 2012 at 8:47 AM, <ri...@nokia.com> wrote:

> 5.       **How are number of reducers get set for a Hive query (The way
> group by and order by sets the number of reducers to 1) ? If I am not
> changing it explicitly does it pick it from the underlying Hadoop cluster?
> I am trying to understand the bottleneck between query and cluster size.**
> **
>
>

Re: Obvious and not so obvious query optimzations in Hive

Posted by Bejoy KS <be...@yahoo.com>.

Hi Richin

In most cases to tune up queries you may have to explicitly specify the number of reduces based on the data size and available reduce slots in cluster. Reduce tasks and number of reducers are the same.

Reduce slots means the total number of reduce tasks your cluster can run in parallel. 

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: <ri...@nokia.com>
Date: Thu, 28 Jun 2012 19:10:21 
To: <us...@hive.apache.org>
Reply-To: user@hive.apache.org
Subject: RE: Obvious and not so obvious query optimzations in Hive

Igor,Bejoy - thanks a lot, that helps.

He, I am running the query on Amazon EMR cluster and based on the type of instances I pick, default number of mappers and reducers are set. Now I would expect Hive to generate that many number of output files as there are number of reducers (since I am not using order by clause or setting it explicitly). If Hive is setting lower number of reducers for itself than there is no point using a high end EMR cluster and pay for it.
Also I can only set number of reduce tasks explicitly through  SET mapred.reduce.tasks = ... , how to set number of reducers itself? I am confused between number of reduce tasks and reducers, can you please explain?

Thanks,
Richin

============
If you are optimizing for latency (running time) as opposed to throughput, it's best to have a single "wave" of reducers. So if your cluster is setup with a limit of, say, 2 reducers per node using 2*N reduce tasks would work best (for large queries). You have to specify that in your script using 
SET mapred.reduce.tasks = ...;

GroupBy doesn't limit the number of reducers but OrderBy does use a single reducer - so that's slow. I never use OrderBy though (Unix's sort is probably faster). For analytics queries I need Distribute/Sort By (with UDFs), which can use multiple reducers.

Hope this helps.
igor
decide.com

On Wed, Jun 27, 2012 at 8:47 AM, <ri...@nokia.com> wrote:
5.       How are number of reducers get set for a Hive query (The way group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.

-----Original Message-----
From: ext yongqiang he [mailto:heyongqiangict@gmail.com] 
Sent: Wednesday, June 27, 2012 6:32 PM
To: user@hive.apache.org
Subject: Re: Obvious and not so obvious query optimzations in Hive

1.       Having my external table data gzipped and reading it in the
table v/s no compression at all.

You may want GZip your data since it is offline. But space is not a concern and you want to optimize CPU, use snappy.

With snappy, there is no reason to go with no compression.

3.       Creating intermediate external tables v/s non external tables
v/s creating views?

First go with normal tables. External tables are hard to manage.
Views are there for complex things which are hard to do with 'managed table'.

4.       Storing the external table as Textfile v/s Sequence file. I
know sequence file compresses the data, but in what format? I read about RC files and how efficient they are, how to use them?

rcfile if you query your data with Hive. 'create table xxx(xxx) stored as rcfile'

5.       How are number of reducers get set for a Hive query (The way
group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.

Can you say more about your concern about "query and cluster size"?

On Wed, Jun 27, 2012 at 11:46 AM, Bejoy KS <be...@yahoo.com> wrote:
> Hi Richin
>
> I'm not an AWS guy but still lemme try answering a few questions in 
> general (not wrt AWS EMR)
>
>
> 1.       Having my external table data gzipped and reading it in the 
> table v/s no compression at all.
>
> Bejoy: When the data volume is large compression saves up the disk 
> space and hence it is recommended. Gzip is not splittable, means one 
> file can't be distributed across map tasks. Go in with Lzo or Snappy.
>
>
> 2.       Having the external table data on S3 v/s having it on HDFS?
>
> 3.       Creating intermediate external tables v/s non external tables 
> v/s creating views?
>
> Bejoy: External Tables are not much different than managed tables. In 
> managed tables when you drop the table using HQL the underlying data 
> in hdfs is also deleted but in case of External tables only the table 
> defn is dropped and the data in hdfs is preserved. Views are not 
> evaluated on its creation time, but when it is used MR jobs are 
> triggered and the required data is extracted out of source tables. So 
> if you are planning to reuse a view n number of times it'll be better 
> creating a table and using it else the view query will be evaluated n times.
>
>
> 4.       Storing the external table as Textfile v/s Sequence file. I 
> know sequence file compresses the data, but in what format? I read 
> about RC files and how efficient they are, how to use them?
>
> Bejoy: Sequence files are splitable on its own so it is a good choice 
> when using with Snappy. In sequence file, compression happens either 
> at record level or block level which is configurable at the time of 
> compressing. If you are using Gzip, TextFiles offer more compression 
> ratio as the whole data is compressed in a go compared with Sequence 
> files where it happens per record/block. But then you compromise on 
> splitability. RC files are a good choice when your queries involve 
> querying only a few columns rather than all columns in a row.
>
> 5.       How are number of reducers get set for a Hive query (The way 
> group by and order by sets the number of reducers to 1) ? If I am not 
> changing it explicitly does it pick it from the underlying Hadoop 
> cluster? I am trying to understand the bottleneck between query and cluster size.
>
> Bejoy: Some queries in hive are forced to have just a single reducer 
> like Order By. In case of other queries hive determines the number of reducers.
> However you can always specify the number of reducer on a per query 
> basis based on the data it process.
>
> 6.       Any other optimizations/ best practices?
>
> Bejoy: There are lots of other optimizations in hive which can be 
> injected based on the query. There are various join optimizations, 
> group by optimizations etc suited for specific needs.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ________________________________
> From: <ri...@nokia.com>
> Date: Wed, 27 Jun 2012 15:47:54 +0000
> To: <us...@hive.apache.org>
> ReplyTo: user@hive.apache.org
> Subject: Obvious and not so obvious query optimzations in Hive
>
> Hey Hivers,
>
>
>
> I am trying to understand what are some of the obvious and not so 
> obvious optimization I could do for a Hive Query on AWS EMR cluster. I 
> know the answer for some of these questions but want to know what do 
> you guys think and by what factor it affects the performance over the other approach.
>
>
>
> 1.       Having my external table data gzipped and reading it in the 
> table v/s no compression at all.
>
> 2.       Having the external table data on S3 v/s having it on HDFS?
>
> 3.       Creating intermediate external tables v/s non external tables 
> v/s creating views?
>
> 4.       Storing the external table as Textfile v/s Sequence file. I 
> know sequence file compresses the data, but in what format? I read 
> about RC files and how efficient they are, how to use them?
>
> 5.       How are number of reducers get set for a Hive query (The way 
> group by and order by sets the number of reducers to 1) ? If I am not 
> changing it explicitly does it pick it from the underlying Hadoop 
> cluster? I am trying to understand the bottleneck between query and cluster size.
>
> 6.       Any other optimizations/ best practices?
>
>
>
> Thanks a lot in advance.
>
> Richin

RE: Obvious and not so obvious query optimzations in Hive

Posted by ri...@nokia.com.

Thanks Bejoy, that is really helpful.

From: ext Bejoy KS [mailto:bejoy_ks@yahoo.com]
Sent: Thursday, June 28, 2012 4:12 PM
To: Jain Richin (Nokia-HR/Boston); user@hive.apache.org
Subject: Re: Obvious and not so obvious query optimzations in Hive

Hi Richin

The Keys vary based on your queries on the same table. You can always see how your query will be parsed to map reduce jobs using EXPLAIN Statement. It'll give you which columns are chosen as keys and values on each MR job for a query.
Regards
Bejoy KS

Sent from handheld, please excuse typos.
________________________________
From: <ri...@nokia.com>>
Date: Thu, 28 Jun 2012 20:08:14 +0000
To: <us...@hive.apache.org>>; <be...@yahoo.com>>
Subject: RE: Obvious and not so obvious query optimzations in Hive

Bejoy, thanks again. This might be the silliest question but what are the keys in a hive query. Is it the fields we pick in select clause or the one we define with the group by clause.
Can you tell me what the keys will be for reducers for my query down below

CREATE EXTERNAL TABLE extlog
       (id string,
       ts string,
       metric string,
       min double,
       max double,
       avg double,
       sum double,
       sample double,
        unit string)
        ROW FORMAT DELIMITED
       FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
       LOCATION 's3n://xxx/xxx/';

CREATE TABLE range
       (ts string,
       id string)
       ROW FORMAT DELIMITED
       FIELDS TERMINATED BY ',';

CREATE EXTERNAL TABLE timeline
       (ts string,
       instancecount int)
       ROW FORMAT DELIMITED
       FIELDS TERMINATED BY ','
       STORED AS SEQUENCEFILE
       LOCATION 's3n://xxx/xx/xx/';

insert overwrite table range
       select from_unixtime
       ((unix_timestamp(ts, "MM/dd/yyyy HH:mm") - ((unix_timestamp(ts, "MM/dd/yyyy HH:mm"))%600)),"MM/dd/yyyy HH:mm"),
       id
       from extlog;

insert overwrite table timeline select ts,count(distinct id) from timelinerange group by ts;

Thanks,
Richin

From: ext Bejoy KS [mailto:bejoy_ks@yahoo.com]<mailto:[mailto:bejoy_ks@yahoo.com]>
Sent: Thursday, June 28, 2012 3:47 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: Obvious and not so obvious query optimzations in Hive

Hi Richin

The keys are chosen by hive based on the input unless you specify certain clauses as DISTRIBUTE BY in your query which gives the flexibility for user to decide on the key/columns the data has to be distributed across reducers.

You are right In MR if a reducer writes out no data then you should get an empty file. Need to check source code to see whether hive deletes those files.
Regards
Bejoy KS

Sent from handheld, please excuse typos.
________________________________
From: <ri...@nokia.com>>
Date: Thu, 28 Jun 2012 19:23:46 +0000
To: <us...@hive.apache.org>>
ReplyTo: user@hive.apache.org<ma...@hive.apache.org>
Subject: RE: Obvious and not so obvious query optimzations in Hive

Thanks Nitin.
Depending on how I design my keys they might go to one or more reducers, but shouldn't I be seeing empty files for the reducers which did not get any data to reduce (because of the design of keys) ?
Or does hive clean all the empty files at the end of the query?

Richin

From: ext Nitin Pawar [mailto:nitinpawar432@gmail.com]<mailto:[mailto:nitinpawar432@gmail.com]>
Sent: Thursday, June 28, 2012 3:15 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: Obvious and not so obvious query optimzations in Hive

Richin,

even if you set number of reducers to be launched it does not guarantee u to that it will generate those many files.

based on your query and data only the reducers which got keys to process will generate the files
so when you have hive query with large number of keys but with lesser number in spilt size it will need large maps but then reducers will always depend on the keys emitted by the mappers and all the extra reducers will be a burden to the system
On Fri, Jun 29, 2012 at 12:40 AM, <ri...@nokia.com>> wrote:
Igor,Bejoy - thanks a lot, that helps.

He, I am running the query on Amazon EMR cluster and based on the type of instances I pick, default number of mappers and reducers are set. Now I would expect Hive to generate that many number of output files as there are number of reducers (since I am not using order by clause or setting it explicitly). If Hive is setting lower number of reducers for itself than there is no point using a high end EMR cluster and pay for it.
Also I can only set number of reduce tasks explicitly through  SET mapred.reduce.tasks = ... , how to set number of reducers itself? I am confused between number of reduce tasks and reducers, can you please explain?

Thanks,
Richin

============
If you are optimizing for latency (running time) as opposed to throughput, it's best to have a single "wave" of reducers. So if your cluster is setup with a limit of, say, 2 reducers per node using 2*N reduce tasks would work best (for large queries). You have to specify that in your script using
SET mapred.reduce.tasks = ...;

GroupBy doesn't limit the number of reducers but OrderBy does use a single reducer - so that's slow. I never use OrderBy though (Unix's sort is probably faster). For analytics queries I need Distribute/Sort By (with UDFs), which can use multiple reducers.

Hope this helps.
igor
decide.com<http://decide.com>

On Wed, Jun 27, 2012 at 8:47 AM, <ri...@nokia.com>> wrote:
5.       How are number of reducers get set for a Hive query (The way group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.
-----Original Message-----
From: ext yongqiang he [mailto:heyongqiangict@gmail.com<ma...@gmail.com>]
Sent: Wednesday, June 27, 2012 6:32 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: Obvious and not so obvious query optimzations in Hive

1.       Having my external table data gzipped and reading it in the
table v/s no compression at all.

You may want GZip your data since it is offline. But space is not a concern and you want to optimize CPU, use snappy.

With snappy, there is no reason to go with no compression.

3.       Creating intermediate external tables v/s non external tables
v/s creating views?

First go with normal tables. External tables are hard to manage.
Views are there for complex things which are hard to do with 'managed table'.

4.       Storing the external table as Textfile v/s Sequence file. I
know sequence file compresses the data, but in what format? I read about RC files and how efficient they are, how to use them?

rcfile if you query your data with Hive. 'create table xxx(xxx) stored as rcfile'

5.       How are number of reducers get set for a Hive query (The way
group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.

Can you say more about your concern about "query and cluster size"?

On Wed, Jun 27, 2012 at 11:46 AM, Bejoy KS <be...@yahoo.com>> wrote:
> Hi Richin
>
> I'm not an AWS guy but still lemme try answering a few questions in
> general (not wrt AWS EMR)
>
>
> 1.       Having my external table data gzipped and reading it in the
> table v/s no compression at all.
>
> Bejoy: When the data volume is large compression saves up the disk
> space and hence it is recommended. Gzip is not splittable, means one
> file can't be distributed across map tasks. Go in with Lzo or Snappy.
>
>
> 2.       Having the external table data on S3 v/s having it on HDFS?
>
> 3.       Creating intermediate external tables v/s non external tables
> v/s creating views?
>
> Bejoy: External Tables are not much different than managed tables. In
> managed tables when you drop the table using HQL the underlying data
> in hdfs is also deleted but in case of External tables only the table
> defn is dropped and the data in hdfs is preserved. Views are not
> evaluated on its creation time, but when it is used MR jobs are
> triggered and the required data is extracted out of source tables. So
> if you are planning to reuse a view n number of times it'll be better
> creating a table and using it else the view query will be evaluated n times.
>
>
> 4.       Storing the external table as Textfile v/s Sequence file. I
> know sequence file compresses the data, but in what format? I read
> about RC files and how efficient they are, how to use them?
>
> Bejoy: Sequence files are splitable on its own so it is a good choice
> when using with Snappy. In sequence file, compression happens either
> at record level or block level which is configurable at the time of
> compressing. If you are using Gzip, TextFiles offer more compression
> ratio as the whole data is compressed in a go compared with Sequence
> files where it happens per record/block. But then you compromise on
> splitability. RC files are a good choice when your queries involve
> querying only a few columns rather than all columns in a row.
>
> 5.       How are number of reducers get set for a Hive query (The way
> group by and order by sets the number of reducers to 1) ? If I am not
> changing it explicitly does it pick it from the underlying Hadoop
> cluster? I am trying to understand the bottleneck between query and cluster size.
>
> Bejoy: Some queries in hive are forced to have just a single reducer
> like Order By. In case of other queries hive determines the number of reducers.
> However you can always specify the number of reducer on a per query
> basis based on the data it process.
>
> 6.       Any other optimizations/ best practices?
>
> Bejoy: There are lots of other optimizations in hive which can be
> injected based on the query. There are various join optimizations,
> group by optimizations etc suited for specific needs.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ________________________________
> From: <ri...@nokia.com>>
> Date: Wed, 27 Jun 2012 15:47:54 +0000
> To: <us...@hive.apache.org>>
> ReplyTo: user@hive.apache.org<ma...@hive.apache.org>
> Subject: Obvious and not so obvious query optimzations in Hive
>
> Hey Hivers,
>
>
>
> I am trying to understand what are some of the obvious and not so
> obvious optimization I could do for a Hive Query on AWS EMR cluster. I
> know the answer for some of these questions but want to know what do
> you guys think and by what factor it affects the performance over the other approach.
>
>
>
> 1.       Having my external table data gzipped and reading it in the
> table v/s no compression at all.
>
> 2.       Having the external table data on S3 v/s having it on HDFS?
>
> 3.       Creating intermediate external tables v/s non external tables
> v/s creating views?
>
> 4.       Storing the external table as Textfile v/s Sequence file. I
> know sequence file compresses the data, but in what format? I read
> about RC files and how efficient they are, how to use them?
>
> 5.       How are number of reducers get set for a Hive query (The way
> group by and order by sets the number of reducers to 1) ? If I am not
> changing it explicitly does it pick it from the underlying Hadoop
> cluster? I am trying to understand the bottleneck between query and cluster size.
>
> 6.       Any other optimizations/ best practices?
>
>
>
> Thanks a lot in advance.
>
> Richin

--
Nitin Pawar

Re: Obvious and not so obvious query optimzations in Hive

Posted by Bejoy KS <be...@yahoo.com>.

Hi Richin

The Keys vary based on your queries on the same table. You can always see how your query will be parsed to map reduce jobs using EXPLAIN Statement. It'll give you which columns are chosen as keys and values on each MR job for a query.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: <ri...@nokia.com>
Date: Thu, 28 Jun 2012 20:08:14 
To: <us...@hive.apache.org>; <be...@yahoo.com>
Subject: RE: Obvious and not so obvious query optimzations in Hive

Bejoy, thanks again. This might be the silliest question but what are the keys in a hive query. Is it the fields we pick in select clause or the one we define with the group by clause.
Can you tell me what the keys will be for reducers for my query down below

CREATE EXTERNAL TABLE extlog
       (id string,
       ts string,
       metric string,
       min double,
       max double,
       avg double,
       sum double,
       sample double,
        unit string)
        ROW FORMAT DELIMITED
       FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
       LOCATION 's3n://xxx/xxx/';

CREATE TABLE range
       (ts string,
       id string)
       ROW FORMAT DELIMITED
       FIELDS TERMINATED BY ',';

CREATE EXTERNAL TABLE timeline
       (ts string,
       instancecount int)
       ROW FORMAT DELIMITED
       FIELDS TERMINATED BY ','
       STORED AS SEQUENCEFILE
       LOCATION 's3n://xxx/xx/xx/';

insert overwrite table range
       select from_unixtime
       ((unix_timestamp(ts, "MM/dd/yyyy HH:mm") - ((unix_timestamp(ts, "MM/dd/yyyy HH:mm"))%600)),"MM/dd/yyyy HH:mm"),
       id
       from extlog;

insert overwrite table timeline select ts,count(distinct id) from timelinerange group by ts;

Thanks,
Richin

From: ext Bejoy KS [mailto:bejoy_ks@yahoo.com]
Sent: Thursday, June 28, 2012 3:47 PM
To: user@hive.apache.org
Subject: Re: Obvious and not so obvious query optimzations in Hive

Hi Richin

The keys are chosen by hive based on the input unless you specify certain clauses as DISTRIBUTE BY in your query which gives the flexibility for user to decide on the key/columns the data has to be distributed across reducers.

You are right In MR if a reducer writes out no data then you should get an empty file. Need to check source code to see whether hive deletes those files.
Regards
Bejoy KS

Sent from handheld, please excuse typos.
________________________________
From: <ri...@nokia.com>>
Date: Thu, 28 Jun 2012 19:23:46 +0000
To: <us...@hive.apache.org>>
ReplyTo: user@hive.apache.org<ma...@hive.apache.org>
Subject: RE: Obvious and not so obvious query optimzations in Hive

Thanks Nitin.
Depending on how I design my keys they might go to one or more reducers, but shouldn't I be seeing empty files for the reducers which did not get any data to reduce (because of the design of keys) ?
Or does hive clean all the empty files at the end of the query?

Richin

From: ext Nitin Pawar [mailto:nitinpawar432@gmail.com]<mailto:[mailto:nitinpawar432@gmail.com]>
Sent: Thursday, June 28, 2012 3:15 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: Obvious and not so obvious query optimzations in Hive

Richin,

even if you set number of reducers to be launched it does not guarantee u to that it will generate those many files.

based on your query and data only the reducers which got keys to process will generate the files
so when you have hive query with large number of keys but with lesser number in spilt size it will need large maps but then reducers will always depend on the keys emitted by the mappers and all the extra reducers will be a burden to the system
On Fri, Jun 29, 2012 at 12:40 AM, <ri...@nokia.com>> wrote:
Igor,Bejoy - thanks a lot, that helps.

He, I am running the query on Amazon EMR cluster and based on the type of instances I pick, default number of mappers and reducers are set. Now I would expect Hive to generate that many number of output files as there are number of reducers (since I am not using order by clause or setting it explicitly). If Hive is setting lower number of reducers for itself than there is no point using a high end EMR cluster and pay for it.
Also I can only set number of reduce tasks explicitly through  SET mapred.reduce.tasks = ... , how to set number of reducers itself? I am confused between number of reduce tasks and reducers, can you please explain?

Thanks,
Richin

============
If you are optimizing for latency (running time) as opposed to throughput, it's best to have a single "wave" of reducers. So if your cluster is setup with a limit of, say, 2 reducers per node using 2*N reduce tasks would work best (for large queries). You have to specify that in your script using
SET mapred.reduce.tasks = ...;

GroupBy doesn't limit the number of reducers but OrderBy does use a single reducer - so that's slow. I never use OrderBy though (Unix's sort is probably faster). For analytics queries I need Distribute/Sort By (with UDFs), which can use multiple reducers.

Hope this helps.
igor
decide.com<http://decide.com>

On Wed, Jun 27, 2012 at 8:47 AM, <ri...@nokia.com>> wrote:
5.       How are number of reducers get set for a Hive query (The way group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.
-----Original Message-----
From: ext yongqiang he [mailto:heyongqiangict@gmail.com<ma...@gmail.com>]
Sent: Wednesday, June 27, 2012 6:32 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: Obvious and not so obvious query optimzations in Hive

1.       Having my external table data gzipped and reading it in the
table v/s no compression at all.

You may want GZip your data since it is offline. But space is not a concern and you want to optimize CPU, use snappy.

With snappy, there is no reason to go with no compression.

3.       Creating intermediate external tables v/s non external tables
v/s creating views?

First go with normal tables. External tables are hard to manage.
Views are there for complex things which are hard to do with 'managed table'.

4.       Storing the external table as Textfile v/s Sequence file. I
know sequence file compresses the data, but in what format? I read about RC files and how efficient they are, how to use them?

rcfile if you query your data with Hive. 'create table xxx(xxx) stored as rcfile'

5.       How are number of reducers get set for a Hive query (The way
group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.

Can you say more about your concern about "query and cluster size"?

On Wed, Jun 27, 2012 at 11:46 AM, Bejoy KS <be...@yahoo.com>> wrote:
> Hi Richin
>
> I'm not an AWS guy but still lemme try answering a few questions in
> general (not wrt AWS EMR)
>
>
> 1.       Having my external table data gzipped and reading it in the
> table v/s no compression at all.
>
> Bejoy: When the data volume is large compression saves up the disk
> space and hence it is recommended. Gzip is not splittable, means one
> file can't be distributed across map tasks. Go in with Lzo or Snappy.
>
>
> 2.       Having the external table data on S3 v/s having it on HDFS?
>
> 3.       Creating intermediate external tables v/s non external tables
> v/s creating views?
>
> Bejoy: External Tables are not much different than managed tables. In
> managed tables when you drop the table using HQL the underlying data
> in hdfs is also deleted but in case of External tables only the table
> defn is dropped and the data in hdfs is preserved. Views are not
> evaluated on its creation time, but when it is used MR jobs are
> triggered and the required data is extracted out of source tables. So
> if you are planning to reuse a view n number of times it'll be better
> creating a table and using it else the view query will be evaluated n times.
>
>
> 4.       Storing the external table as Textfile v/s Sequence file. I
> know sequence file compresses the data, but in what format? I read
> about RC files and how efficient they are, how to use them?
>
> Bejoy: Sequence files are splitable on its own so it is a good choice
> when using with Snappy. In sequence file, compression happens either
> at record level or block level which is configurable at the time of
> compressing. If you are using Gzip, TextFiles offer more compression
> ratio as the whole data is compressed in a go compared with Sequence
> files where it happens per record/block. But then you compromise on
> splitability. RC files are a good choice when your queries involve
> querying only a few columns rather than all columns in a row.
>
> 5.       How are number of reducers get set for a Hive query (The way
> group by and order by sets the number of reducers to 1) ? If I am not
> changing it explicitly does it pick it from the underlying Hadoop
> cluster? I am trying to understand the bottleneck between query and cluster size.
>
> Bejoy: Some queries in hive are forced to have just a single reducer
> like Order By. In case of other queries hive determines the number of reducers.
> However you can always specify the number of reducer on a per query
> basis based on the data it process.
>
> 6.       Any other optimizations/ best practices?
>
> Bejoy: There are lots of other optimizations in hive which can be
> injected based on the query. There are various join optimizations,
> group by optimizations etc suited for specific needs.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ________________________________
> From: <ri...@nokia.com>>
> Date: Wed, 27 Jun 2012 15:47:54 +0000
> To: <us...@hive.apache.org>>
> ReplyTo: user@hive.apache.org<ma...@hive.apache.org>
> Subject: Obvious and not so obvious query optimzations in Hive
>
> Hey Hivers,
>
>
>
> I am trying to understand what are some of the obvious and not so
> obvious optimization I could do for a Hive Query on AWS EMR cluster. I
> know the answer for some of these questions but want to know what do
> you guys think and by what factor it affects the performance over the other approach.
>
>
>
> 1.       Having my external table data gzipped and reading it in the
> table v/s no compression at all.
>
> 2.       Having the external table data on S3 v/s having it on HDFS?
>
> 3.       Creating intermediate external tables v/s non external tables
> v/s creating views?
>
> 4.       Storing the external table as Textfile v/s Sequence file. I
> know sequence file compresses the data, but in what format? I read
> about RC files and how efficient they are, how to use them?
>
> 5.       How are number of reducers get set for a Hive query (The way
> group by and order by sets the number of reducers to 1) ? If I am not
> changing it explicitly does it pick it from the underlying Hadoop
> cluster? I am trying to understand the bottleneck between query and cluster size.
>
> 6.       Any other optimizations/ best practices?
>
>
>
> Thanks a lot in advance.
>
> Richin

--
Nitin Pawar

RE: Obvious and not so obvious query optimzations in Hive

Posted by ri...@nokia.com.

Bejoy, thanks again. This might be the silliest question but what are the keys in a hive query. Is it the fields we pick in select clause or the one we define with the group by clause.
Can you tell me what the keys will be for reducers for my query down below

CREATE EXTERNAL TABLE extlog
       (id string,
       ts string,
       metric string,
       min double,
       max double,
       avg double,
       sum double,
       sample double,
        unit string)
        ROW FORMAT DELIMITED
       FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
       LOCATION 's3n://xxx/xxx/';

CREATE TABLE range
       (ts string,
       id string)
       ROW FORMAT DELIMITED
       FIELDS TERMINATED BY ',';

CREATE EXTERNAL TABLE timeline
       (ts string,
       instancecount int)
       ROW FORMAT DELIMITED
       FIELDS TERMINATED BY ','
       STORED AS SEQUENCEFILE
       LOCATION 's3n://xxx/xx/xx/';

insert overwrite table range
       select from_unixtime
       ((unix_timestamp(ts, "MM/dd/yyyy HH:mm") - ((unix_timestamp(ts, "MM/dd/yyyy HH:mm"))%600)),"MM/dd/yyyy HH:mm"),
       id
       from extlog;

insert overwrite table timeline select ts,count(distinct id) from timelinerange group by ts;

Thanks,
Richin

From: ext Bejoy KS [mailto:bejoy_ks@yahoo.com]
Sent: Thursday, June 28, 2012 3:47 PM
To: user@hive.apache.org
Subject: Re: Obvious and not so obvious query optimzations in Hive

Hi Richin

The keys are chosen by hive based on the input unless you specify certain clauses as DISTRIBUTE BY in your query which gives the flexibility for user to decide on the key/columns the data has to be distributed across reducers.

You are right In MR if a reducer writes out no data then you should get an empty file. Need to check source code to see whether hive deletes those files.
Regards
Bejoy KS

Sent from handheld, please excuse typos.
________________________________
From: <ri...@nokia.com>>
Date: Thu, 28 Jun 2012 19:23:46 +0000
To: <us...@hive.apache.org>>
ReplyTo: user@hive.apache.org<ma...@hive.apache.org>
Subject: RE: Obvious and not so obvious query optimzations in Hive

Thanks Nitin.
Depending on how I design my keys they might go to one or more reducers, but shouldn't I be seeing empty files for the reducers which did not get any data to reduce (because of the design of keys) ?
Or does hive clean all the empty files at the end of the query?

Richin

From: ext Nitin Pawar [mailto:nitinpawar432@gmail.com]<mailto:[mailto:nitinpawar432@gmail.com]>
Sent: Thursday, June 28, 2012 3:15 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: Obvious and not so obvious query optimzations in Hive

Richin,

even if you set number of reducers to be launched it does not guarantee u to that it will generate those many files.

based on your query and data only the reducers which got keys to process will generate the files
so when you have hive query with large number of keys but with lesser number in spilt size it will need large maps but then reducers will always depend on the keys emitted by the mappers and all the extra reducers will be a burden to the system
On Fri, Jun 29, 2012 at 12:40 AM, <ri...@nokia.com>> wrote:
Igor,Bejoy - thanks a lot, that helps.

He, I am running the query on Amazon EMR cluster and based on the type of instances I pick, default number of mappers and reducers are set. Now I would expect Hive to generate that many number of output files as there are number of reducers (since I am not using order by clause or setting it explicitly). If Hive is setting lower number of reducers for itself than there is no point using a high end EMR cluster and pay for it.
Also I can only set number of reduce tasks explicitly through  SET mapred.reduce.tasks = ... , how to set number of reducers itself? I am confused between number of reduce tasks and reducers, can you please explain?

Thanks,
Richin

============
If you are optimizing for latency (running time) as opposed to throughput, it's best to have a single "wave" of reducers. So if your cluster is setup with a limit of, say, 2 reducers per node using 2*N reduce tasks would work best (for large queries). You have to specify that in your script using
SET mapred.reduce.tasks = ...;

GroupBy doesn't limit the number of reducers but OrderBy does use a single reducer - so that's slow. I never use OrderBy though (Unix's sort is probably faster). For analytics queries I need Distribute/Sort By (with UDFs), which can use multiple reducers.

Hope this helps.
igor
decide.com<http://decide.com>

On Wed, Jun 27, 2012 at 8:47 AM, <ri...@nokia.com>> wrote:
5.       How are number of reducers get set for a Hive query (The way group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.
-----Original Message-----
From: ext yongqiang he [mailto:heyongqiangict@gmail.com<ma...@gmail.com>]
Sent: Wednesday, June 27, 2012 6:32 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: Obvious and not so obvious query optimzations in Hive

1.       Having my external table data gzipped and reading it in the
table v/s no compression at all.

You may want GZip your data since it is offline. But space is not a concern and you want to optimize CPU, use snappy.

With snappy, there is no reason to go with no compression.

3.       Creating intermediate external tables v/s non external tables
v/s creating views?

First go with normal tables. External tables are hard to manage.
Views are there for complex things which are hard to do with 'managed table'.

4.       Storing the external table as Textfile v/s Sequence file. I
know sequence file compresses the data, but in what format? I read about RC files and how efficient they are, how to use them?

rcfile if you query your data with Hive. 'create table xxx(xxx) stored as rcfile'

5.       How are number of reducers get set for a Hive query (The way
group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.

Can you say more about your concern about "query and cluster size"?

On Wed, Jun 27, 2012 at 11:46 AM, Bejoy KS <be...@yahoo.com>> wrote:
> Hi Richin
>
> I'm not an AWS guy but still lemme try answering a few questions in
> general (not wrt AWS EMR)
>
>
> 1.       Having my external table data gzipped and reading it in the
> table v/s no compression at all.
>
> Bejoy: When the data volume is large compression saves up the disk
> space and hence it is recommended. Gzip is not splittable, means one
> file can't be distributed across map tasks. Go in with Lzo or Snappy.
>
>
> 2.       Having the external table data on S3 v/s having it on HDFS?
>
> 3.       Creating intermediate external tables v/s non external tables
> v/s creating views?
>
> Bejoy: External Tables are not much different than managed tables. In
> managed tables when you drop the table using HQL the underlying data
> in hdfs is also deleted but in case of External tables only the table
> defn is dropped and the data in hdfs is preserved. Views are not
> evaluated on its creation time, but when it is used MR jobs are
> triggered and the required data is extracted out of source tables. So
> if you are planning to reuse a view n number of times it'll be better
> creating a table and using it else the view query will be evaluated n times.
>
>
> 4.       Storing the external table as Textfile v/s Sequence file. I
> know sequence file compresses the data, but in what format? I read
> about RC files and how efficient they are, how to use them?
>
> Bejoy: Sequence files are splitable on its own so it is a good choice
> when using with Snappy. In sequence file, compression happens either
> at record level or block level which is configurable at the time of
> compressing. If you are using Gzip, TextFiles offer more compression
> ratio as the whole data is compressed in a go compared with Sequence
> files where it happens per record/block. But then you compromise on
> splitability. RC files are a good choice when your queries involve
> querying only a few columns rather than all columns in a row.
>
> 5.       How are number of reducers get set for a Hive query (The way
> group by and order by sets the number of reducers to 1) ? If I am not
> changing it explicitly does it pick it from the underlying Hadoop
> cluster? I am trying to understand the bottleneck between query and cluster size.
>
> Bejoy: Some queries in hive are forced to have just a single reducer
> like Order By. In case of other queries hive determines the number of reducers.
> However you can always specify the number of reducer on a per query
> basis based on the data it process.
>
> 6.       Any other optimizations/ best practices?
>
> Bejoy: There are lots of other optimizations in hive which can be
> injected based on the query. There are various join optimizations,
> group by optimizations etc suited for specific needs.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ________________________________
> From: <ri...@nokia.com>>
> Date: Wed, 27 Jun 2012 15:47:54 +0000
> To: <us...@hive.apache.org>>
> ReplyTo: user@hive.apache.org<ma...@hive.apache.org>
> Subject: Obvious and not so obvious query optimzations in Hive
>
> Hey Hivers,
>
>
>
> I am trying to understand what are some of the obvious and not so
> obvious optimization I could do for a Hive Query on AWS EMR cluster. I
> know the answer for some of these questions but want to know what do
> you guys think and by what factor it affects the performance over the other approach.
>
>
>
> 1.       Having my external table data gzipped and reading it in the
> table v/s no compression at all.
>
> 2.       Having the external table data on S3 v/s having it on HDFS?
>
> 3.       Creating intermediate external tables v/s non external tables
> v/s creating views?
>
> 4.       Storing the external table as Textfile v/s Sequence file. I
> know sequence file compresses the data, but in what format? I read
> about RC files and how efficient they are, how to use them?
>
> 5.       How are number of reducers get set for a Hive query (The way
> group by and order by sets the number of reducers to 1) ? If I am not
> changing it explicitly does it pick it from the underlying Hadoop
> cluster? I am trying to understand the bottleneck between query and cluster size.
>
> 6.       Any other optimizations/ best practices?
>
>
>
> Thanks a lot in advance.
>
> Richin

--
Nitin Pawar

Re: Obvious and not so obvious query optimzations in Hive

Posted by Bejoy KS <be...@yahoo.com>.

Hi Richin

The keys are chosen by hive based on the input unless you specify certain clauses as DISTRIBUTE BY in your query which gives the flexibility for user to decide on the key/columns the data has to be distributed across reducers. 

You are right In MR if a reducer writes out no data then you should get an empty file. Need to check source code to see whether hive deletes those files.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: <ri...@nokia.com>
Date: Thu, 28 Jun 2012 19:23:46 
To: <us...@hive.apache.org>
Reply-To: user@hive.apache.org
Subject: RE: Obvious and not so obvious query optimzations in Hive

Thanks Nitin.
Depending on how I design my keys they might go to one or more reducers, but shouldn't I be seeing empty files for the reducers which did not get any data to reduce (because of the design of keys) ?
Or does hive clean all the empty files at the end of the query?

Richin

From: ext Nitin Pawar [mailto:nitinpawar432@gmail.com]
Sent: Thursday, June 28, 2012 3:15 PM
To: user@hive.apache.org
Subject: Re: Obvious and not so obvious query optimzations in Hive

Richin,

even if you set number of reducers to be launched it does not guarantee u to that it will generate those many files.

based on your query and data only the reducers which got keys to process will generate the files
so when you have hive query with large number of keys but with lesser number in spilt size it will need large maps but then reducers will always depend on the keys emitted by the mappers and all the extra reducers will be a burden to the system
On Fri, Jun 29, 2012 at 12:40 AM, <ri...@nokia.com>> wrote:
Igor,Bejoy - thanks a lot, that helps.

He, I am running the query on Amazon EMR cluster and based on the type of instances I pick, default number of mappers and reducers are set. Now I would expect Hive to generate that many number of output files as there are number of reducers (since I am not using order by clause or setting it explicitly). If Hive is setting lower number of reducers for itself than there is no point using a high end EMR cluster and pay for it.
Also I can only set number of reduce tasks explicitly through  SET mapred.reduce.tasks = ... , how to set number of reducers itself? I am confused between number of reduce tasks and reducers, can you please explain?

Thanks,
Richin

============
If you are optimizing for latency (running time) as opposed to throughput, it's best to have a single "wave" of reducers. So if your cluster is setup with a limit of, say, 2 reducers per node using 2*N reduce tasks would work best (for large queries). You have to specify that in your script using
SET mapred.reduce.tasks = ...;

GroupBy doesn't limit the number of reducers but OrderBy does use a single reducer - so that's slow. I never use OrderBy though (Unix's sort is probably faster). For analytics queries I need Distribute/Sort By (with UDFs), which can use multiple reducers.

Hope this helps.
igor
decide.com<http://decide.com>

On Wed, Jun 27, 2012 at 8:47 AM, <ri...@nokia.com>> wrote:
5.       How are number of reducers get set for a Hive query (The way group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.

-----Original Message-----
From: ext yongqiang he [mailto:heyongqiangict@gmail.com<ma...@gmail.com>]
Sent: Wednesday, June 27, 2012 6:32 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: Obvious and not so obvious query optimzations in Hive

1.       Having my external table data gzipped and reading it in the
table v/s no compression at all.

You may want GZip your data since it is offline. But space is not a concern and you want to optimize CPU, use snappy.

With snappy, there is no reason to go with no compression.

3.       Creating intermediate external tables v/s non external tables
v/s creating views?

First go with normal tables. External tables are hard to manage.
Views are there for complex things which are hard to do with 'managed table'.

4.       Storing the external table as Textfile v/s Sequence file. I
know sequence file compresses the data, but in what format? I read about RC files and how efficient they are, how to use them?

rcfile if you query your data with Hive. 'create table xxx(xxx) stored as rcfile'

5.       How are number of reducers get set for a Hive query (The way
group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.

Can you say more about your concern about "query and cluster size"?

On Wed, Jun 27, 2012 at 11:46 AM, Bejoy KS <be...@yahoo.com>> wrote:
> Hi Richin
>
> I'm not an AWS guy but still lemme try answering a few questions in
> general (not wrt AWS EMR)
>
>
> 1.       Having my external table data gzipped and reading it in the
> table v/s no compression at all.
>
> Bejoy: When the data volume is large compression saves up the disk
> space and hence it is recommended. Gzip is not splittable, means one
> file can't be distributed across map tasks. Go in with Lzo or Snappy.
>
>
> 2.       Having the external table data on S3 v/s having it on HDFS?
>
> 3.       Creating intermediate external tables v/s non external tables
> v/s creating views?
>
> Bejoy: External Tables are not much different than managed tables. In
> managed tables when you drop the table using HQL the underlying data
> in hdfs is also deleted but in case of External tables only the table
> defn is dropped and the data in hdfs is preserved. Views are not
> evaluated on its creation time, but when it is used MR jobs are
> triggered and the required data is extracted out of source tables. So
> if you are planning to reuse a view n number of times it'll be better
> creating a table and using it else the view query will be evaluated n times.
>
>
> 4.       Storing the external table as Textfile v/s Sequence file. I
> know sequence file compresses the data, but in what format? I read
> about RC files and how efficient they are, how to use them?
>
> Bejoy: Sequence files are splitable on its own so it is a good choice
> when using with Snappy. In sequence file, compression happens either
> at record level or block level which is configurable at the time of
> compressing. If you are using Gzip, TextFiles offer more compression
> ratio as the whole data is compressed in a go compared with Sequence
> files where it happens per record/block. But then you compromise on
> splitability. RC files are a good choice when your queries involve
> querying only a few columns rather than all columns in a row.
>
> 5.       How are number of reducers get set for a Hive query (The way
> group by and order by sets the number of reducers to 1) ? If I am not
> changing it explicitly does it pick it from the underlying Hadoop
> cluster? I am trying to understand the bottleneck between query and cluster size.
>
> Bejoy: Some queries in hive are forced to have just a single reducer
> like Order By. In case of other queries hive determines the number of reducers.
> However you can always specify the number of reducer on a per query
> basis based on the data it process.
>
> 6.       Any other optimizations/ best practices?
>
> Bejoy: There are lots of other optimizations in hive which can be
> injected based on the query. There are various join optimizations,
> group by optimizations etc suited for specific needs.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ________________________________
> From: <ri...@nokia.com>>
> Date: Wed, 27 Jun 2012 15:47:54 +0000
> To: <us...@hive.apache.org>>
> ReplyTo: user@hive.apache.org<ma...@hive.apache.org>
> Subject: Obvious and not so obvious query optimzations in Hive
>
> Hey Hivers,
>
>
>
> I am trying to understand what are some of the obvious and not so
> obvious optimization I could do for a Hive Query on AWS EMR cluster. I
> know the answer for some of these questions but want to know what do
> you guys think and by what factor it affects the performance over the other approach.
>
>
>
> 1.       Having my external table data gzipped and reading it in the
> table v/s no compression at all.
>
> 2.       Having the external table data on S3 v/s having it on HDFS?
>
> 3.       Creating intermediate external tables v/s non external tables
> v/s creating views?
>
> 4.       Storing the external table as Textfile v/s Sequence file. I
> know sequence file compresses the data, but in what format? I read
> about RC files and how efficient they are, how to use them?
>
> 5.       How are number of reducers get set for a Hive query (The way
> group by and order by sets the number of reducers to 1) ? If I am not
> changing it explicitly does it pick it from the underlying Hadoop
> cluster? I am trying to understand the bottleneck between query and cluster size.
>
> 6.       Any other optimizations/ best practices?
>
>
>
> Thanks a lot in advance.
>
> Richin

--
Nitin Pawar

RE: Obvious and not so obvious query optimzations in Hive

Posted by ri...@nokia.com.

Thanks Nitin.
Depending on how I design my keys they might go to one or more reducers, but shouldn't I be seeing empty files for the reducers which did not get any data to reduce (because of the design of keys) ?
Or does hive clean all the empty files at the end of the query?

Richin

From: ext Nitin Pawar [mailto:nitinpawar432@gmail.com]
Sent: Thursday, June 28, 2012 3:15 PM
To: user@hive.apache.org
Subject: Re: Obvious and not so obvious query optimzations in Hive

Richin,

even if you set number of reducers to be launched it does not guarantee u to that it will generate those many files.

based on your query and data only the reducers which got keys to process will generate the files
so when you have hive query with large number of keys but with lesser number in spilt size it will need large maps but then reducers will always depend on the keys emitted by the mappers and all the extra reducers will be a burden to the system
On Fri, Jun 29, 2012 at 12:40 AM, <ri...@nokia.com>> wrote:
Igor,Bejoy - thanks a lot, that helps.

He, I am running the query on Amazon EMR cluster and based on the type of instances I pick, default number of mappers and reducers are set. Now I would expect Hive to generate that many number of output files as there are number of reducers (since I am not using order by clause or setting it explicitly). If Hive is setting lower number of reducers for itself than there is no point using a high end EMR cluster and pay for it.
Also I can only set number of reduce tasks explicitly through  SET mapred.reduce.tasks = ... , how to set number of reducers itself? I am confused between number of reduce tasks and reducers, can you please explain?

Thanks,
Richin

============
If you are optimizing for latency (running time) as opposed to throughput, it's best to have a single "wave" of reducers. So if your cluster is setup with a limit of, say, 2 reducers per node using 2*N reduce tasks would work best (for large queries). You have to specify that in your script using
SET mapred.reduce.tasks = ...;

GroupBy doesn't limit the number of reducers but OrderBy does use a single reducer - so that's slow. I never use OrderBy though (Unix's sort is probably faster). For analytics queries I need Distribute/Sort By (with UDFs), which can use multiple reducers.

Hope this helps.
igor
decide.com<http://decide.com>

On Wed, Jun 27, 2012 at 8:47 AM, <ri...@nokia.com>> wrote:
5.       How are number of reducers get set for a Hive query (The way group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.

-----Original Message-----
From: ext yongqiang he [mailto:heyongqiangict@gmail.com<ma...@gmail.com>]
Sent: Wednesday, June 27, 2012 6:32 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: Obvious and not so obvious query optimzations in Hive

1.       Having my external table data gzipped and reading it in the
table v/s no compression at all.

You may want GZip your data since it is offline. But space is not a concern and you want to optimize CPU, use snappy.

With snappy, there is no reason to go with no compression.

3.       Creating intermediate external tables v/s non external tables
v/s creating views?

First go with normal tables. External tables are hard to manage.
Views are there for complex things which are hard to do with 'managed table'.

4.       Storing the external table as Textfile v/s Sequence file. I
know sequence file compresses the data, but in what format? I read about RC files and how efficient they are, how to use them?

rcfile if you query your data with Hive. 'create table xxx(xxx) stored as rcfile'

5.       How are number of reducers get set for a Hive query (The way
group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.

Can you say more about your concern about "query and cluster size"?

On Wed, Jun 27, 2012 at 11:46 AM, Bejoy KS <be...@yahoo.com>> wrote:
> Hi Richin
>
> I'm not an AWS guy but still lemme try answering a few questions in
> general (not wrt AWS EMR)
>
>
> 1.       Having my external table data gzipped and reading it in the
> table v/s no compression at all.
>
> Bejoy: When the data volume is large compression saves up the disk
> space and hence it is recommended. Gzip is not splittable, means one
> file can't be distributed across map tasks. Go in with Lzo or Snappy.
>
>
> 2.       Having the external table data on S3 v/s having it on HDFS?
>
> 3.       Creating intermediate external tables v/s non external tables
> v/s creating views?
>
> Bejoy: External Tables are not much different than managed tables. In
> managed tables when you drop the table using HQL the underlying data
> in hdfs is also deleted but in case of External tables only the table
> defn is dropped and the data in hdfs is preserved. Views are not
> evaluated on its creation time, but when it is used MR jobs are
> triggered and the required data is extracted out of source tables. So
> if you are planning to reuse a view n number of times it'll be better
> creating a table and using it else the view query will be evaluated n times.
>
>
> 4.       Storing the external table as Textfile v/s Sequence file. I
> know sequence file compresses the data, but in what format? I read
> about RC files and how efficient they are, how to use them?
>
> Bejoy: Sequence files are splitable on its own so it is a good choice
> when using with Snappy. In sequence file, compression happens either
> at record level or block level which is configurable at the time of
> compressing. If you are using Gzip, TextFiles offer more compression
> ratio as the whole data is compressed in a go compared with Sequence
> files where it happens per record/block. But then you compromise on
> splitability. RC files are a good choice when your queries involve
> querying only a few columns rather than all columns in a row.
>
> 5.       How are number of reducers get set for a Hive query (The way
> group by and order by sets the number of reducers to 1) ? If I am not
> changing it explicitly does it pick it from the underlying Hadoop
> cluster? I am trying to understand the bottleneck between query and cluster size.
>
> Bejoy: Some queries in hive are forced to have just a single reducer
> like Order By. In case of other queries hive determines the number of reducers.
> However you can always specify the number of reducer on a per query
> basis based on the data it process.
>
> 6.       Any other optimizations/ best practices?
>
> Bejoy: There are lots of other optimizations in hive which can be
> injected based on the query. There are various join optimizations,
> group by optimizations etc suited for specific needs.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ________________________________
> From: <ri...@nokia.com>>
> Date: Wed, 27 Jun 2012 15:47:54 +0000
> To: <us...@hive.apache.org>>
> ReplyTo: user@hive.apache.org<ma...@hive.apache.org>
> Subject: Obvious and not so obvious query optimzations in Hive
>
> Hey Hivers,
>
>
>
> I am trying to understand what are some of the obvious and not so
> obvious optimization I could do for a Hive Query on AWS EMR cluster. I
> know the answer for some of these questions but want to know what do
> you guys think and by what factor it affects the performance over the other approach.
>
>
>
> 1.       Having my external table data gzipped and reading it in the
> table v/s no compression at all.
>
> 2.       Having the external table data on S3 v/s having it on HDFS?
>
> 3.       Creating intermediate external tables v/s non external tables
> v/s creating views?
>
> 4.       Storing the external table as Textfile v/s Sequence file. I
> know sequence file compresses the data, but in what format? I read
> about RC files and how efficient they are, how to use them?
>
> 5.       How are number of reducers get set for a Hive query (The way
> group by and order by sets the number of reducers to 1) ? If I am not
> changing it explicitly does it pick it from the underlying Hadoop
> cluster? I am trying to understand the bottleneck between query and cluster size.
>
> 6.       Any other optimizations/ best practices?
>
>
>
> Thanks a lot in advance.
>
> Richin

--
Nitin Pawar

Re: Obvious and not so obvious query optimzations in Hive

Posted by Nitin Pawar <ni...@gmail.com>.

Richin,

even if you set number of reducers to be launched it does not guarantee u
to that it will generate those many files.

based on your query and data only the reducers which got keys to process
will generate the files
so when you have hive query with large number of keys but with lesser
number in spilt size it will need large maps but then reducers will always
depend on the keys emitted by the mappers and all the extra reducers will
be a burden to the system

On Fri, Jun 29, 2012 at 12:40 AM, <ri...@nokia.com> wrote:

> Igor,Bejoy - thanks a lot, that helps.
>
> He, I am running the query on Amazon EMR cluster and based on the type of
> instances I pick, default number of mappers and reducers are set. Now I
> would expect Hive to generate that many number of output files as there are
> number of reducers (since I am not using order by clause or setting it
> explicitly). If Hive is setting lower number of reducers for itself than
> there is no point using a high end EMR cluster and pay for it.
> Also I can only set number of reduce tasks explicitly through  SET
> mapred.reduce.tasks = ... , how to set number of reducers itself? I am
> confused between number of reduce tasks and reducers, can you please
> explain?
>
> Thanks,
> Richin
>
> ============
> If you are optimizing for latency (running time) as opposed to throughput,
> it's best to have a single "wave" of reducers. So if your cluster is setup
> with a limit of, say, 2 reducers per node using 2*N reduce tasks would work
> best (for large queries). You have to specify that in your script using
> SET mapred.reduce.tasks = ...;
>
> GroupBy doesn't limit the number of reducers but OrderBy does use a single
> reducer - so that's slow. I never use OrderBy though (Unix's sort is
> probably faster). For analytics queries I need Distribute/Sort By (with
> UDFs), which can use multiple reducers.
>
>
> Hope this helps.
> igor
> decide.com
>
> On Wed, Jun 27, 2012 at 8:47 AM, <ri...@nokia.com> wrote:
> 5.       How are number of reducers get set for a Hive query (The way
> group by and order by sets the number of reducers to 1) ? If I am not
> changing it explicitly does it pick it from the underlying Hadoop cluster?
> I am trying to understand the bottleneck between query and cluster size.
>
>
> -----Original Message-----
> From: ext yongqiang he [mailto:heyongqiangict@gmail.com]
> Sent: Wednesday, June 27, 2012 6:32 PM
> To: user@hive.apache.org
> Subject: Re: Obvious and not so obvious query optimzations in Hive
>
> 1.       Having my external table data gzipped and reading it in the
> table v/s no compression at all.
>
> You may want GZip your data since it is offline. But space is not a
> concern and you want to optimize CPU, use snappy.
>
> With snappy, there is no reason to go with no compression.
>
>
> 3.       Creating intermediate external tables v/s non external tables
> v/s creating views?
>
> First go with normal tables. External tables are hard to manage.
> Views are there for complex things which are hard to do with 'managed
> table'.
>
> 4.       Storing the external table as Textfile v/s Sequence file. I
> know sequence file compresses the data, but in what format? I read about
> RC files and how efficient they are, how to use them?
>
> rcfile if you query your data with Hive. 'create table xxx(xxx) stored as
> rcfile'
>
> 5.       How are number of reducers get set for a Hive query (The way
> group by and order by sets the number of reducers to 1) ? If I am not
> changing it explicitly does it pick it from the underlying Hadoop cluster?
> I am trying to understand the bottleneck between query and cluster size.
>
> Can you say more about your concern about "query and cluster size"?
>
> On Wed, Jun 27, 2012 at 11:46 AM, Bejoy KS <be...@yahoo.com> wrote:
> > Hi Richin
> >
> > I'm not an AWS guy but still lemme try answering a few questions in
> > general (not wrt AWS EMR)
> >
> >
> > 1.       Having my external table data gzipped and reading it in the
> > table v/s no compression at all.
> >
> > Bejoy: When the data volume is large compression saves up the disk
> > space and hence it is recommended. Gzip is not splittable, means one
> > file can't be distributed across map tasks. Go in with Lzo or Snappy.
> >
> >
> > 2.       Having the external table data on S3 v/s having it on HDFS?
> >
> > 3.       Creating intermediate external tables v/s non external tables
> > v/s creating views?
> >
> > Bejoy: External Tables are not much different than managed tables. In
> > managed tables when you drop the table using HQL the underlying data
> > in hdfs is also deleted but in case of External tables only the table
> > defn is dropped and the data in hdfs is preserved. Views are not
> > evaluated on its creation time, but when it is used MR jobs are
> > triggered and the required data is extracted out of source tables. So
> > if you are planning to reuse a view n number of times it'll be better
> > creating a table and using it else the view query will be evaluated n
> times.
> >
> >
> > 4.       Storing the external table as Textfile v/s Sequence file. I
> > know sequence file compresses the data, but in what format? I read
> > about RC files and how efficient they are, how to use them?
> >
> > Bejoy: Sequence files are splitable on its own so it is a good choice
> > when using with Snappy. In sequence file, compression happens either
> > at record level or block level which is configurable at the time of
> > compressing. If you are using Gzip, TextFiles offer more compression
> > ratio as the whole data is compressed in a go compared with Sequence
> > files where it happens per record/block. But then you compromise on
> > splitability. RC files are a good choice when your queries involve
> > querying only a few columns rather than all columns in a row.
> >
> > 5.       How are number of reducers get set for a Hive query (The way
> > group by and order by sets the number of reducers to 1) ? If I am not
> > changing it explicitly does it pick it from the underlying Hadoop
> > cluster? I am trying to understand the bottleneck between query and
> cluster size.
> >
> > Bejoy: Some queries in hive are forced to have just a single reducer
> > like Order By. In case of other queries hive determines the number of
> reducers.
> > However you can always specify the number of reducer on a per query
> > basis based on the data it process.
> >
> > 6.       Any other optimizations/ best practices?
> >
> > Bejoy: There are lots of other optimizations in hive which can be
> > injected based on the query. There are various join optimizations,
> > group by optimizations etc suited for specific needs.
> >
> >
> > Regards
> > Bejoy KS
> >
> > Sent from handheld, please excuse typos.
> > ________________________________
> > From: <ri...@nokia.com>
> > Date: Wed, 27 Jun 2012 15:47:54 +0000
> > To: <us...@hive.apache.org>
> > ReplyTo: user@hive.apache.org
> > Subject: Obvious and not so obvious query optimzations in Hive
> >
> > Hey Hivers,
> >
> >
> >
> > I am trying to understand what are some of the obvious and not so
> > obvious optimization I could do for a Hive Query on AWS EMR cluster. I
> > know the answer for some of these questions but want to know what do
> > you guys think and by what factor it affects the performance over the
> other approach.
> >
> >
> >
> > 1.       Having my external table data gzipped and reading it in the
> > table v/s no compression at all.
> >
> > 2.       Having the external table data on S3 v/s having it on HDFS?
> >
> > 3.       Creating intermediate external tables v/s non external tables
> > v/s creating views?
> >
> > 4.       Storing the external table as Textfile v/s Sequence file. I
> > know sequence file compresses the data, but in what format? I read
> > about RC files and how efficient they are, how to use them?
> >
> > 5.       How are number of reducers get set for a Hive query (The way
> > group by and order by sets the number of reducers to 1) ? If I am not
> > changing it explicitly does it pick it from the underlying Hadoop
> > cluster? I am trying to understand the bottleneck between query and
> cluster size.
> >
> > 6.       Any other optimizations/ best practices?
> >
> >
> >
> > Thanks a lot in advance.
> >
> > Richin
>



-- 
Nitin Pawar

RE: Obvious and not so obvious query optimzations in Hive

Posted by ri...@nokia.com.

Igor,Bejoy - thanks a lot, that helps.

He, I am running the query on Amazon EMR cluster and based on the type of instances I pick, default number of mappers and reducers are set. Now I would expect Hive to generate that many number of output files as there are number of reducers (since I am not using order by clause or setting it explicitly). If Hive is setting lower number of reducers for itself than there is no point using a high end EMR cluster and pay for it.
Also I can only set number of reduce tasks explicitly through  SET mapred.reduce.tasks = ... , how to set number of reducers itself? I am confused between number of reduce tasks and reducers, can you please explain?

Thanks,
Richin

============
If you are optimizing for latency (running time) as opposed to throughput, it's best to have a single "wave" of reducers. So if your cluster is setup with a limit of, say, 2 reducers per node using 2*N reduce tasks would work best (for large queries). You have to specify that in your script using 
SET mapred.reduce.tasks = ...;

GroupBy doesn't limit the number of reducers but OrderBy does use a single reducer - so that's slow. I never use OrderBy though (Unix's sort is probably faster). For analytics queries I need Distribute/Sort By (with UDFs), which can use multiple reducers.

Hope this helps.
igor
decide.com

On Wed, Jun 27, 2012 at 8:47 AM, <ri...@nokia.com> wrote:
5.       How are number of reducers get set for a Hive query (The way group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.

-----Original Message-----
From: ext yongqiang he [mailto:heyongqiangict@gmail.com] 
Sent: Wednesday, June 27, 2012 6:32 PM
To: user@hive.apache.org
Subject: Re: Obvious and not so obvious query optimzations in Hive

1.       Having my external table data gzipped and reading it in the
table v/s no compression at all.

You may want GZip your data since it is offline. But space is not a concern and you want to optimize CPU, use snappy.

With snappy, there is no reason to go with no compression.

3.       Creating intermediate external tables v/s non external tables
v/s creating views?

First go with normal tables. External tables are hard to manage.
Views are there for complex things which are hard to do with 'managed table'.

4.       Storing the external table as Textfile v/s Sequence file. I
know sequence file compresses the data, but in what format? I read about RC files and how efficient they are, how to use them?

rcfile if you query your data with Hive. 'create table xxx(xxx) stored as rcfile'

5.       How are number of reducers get set for a Hive query (The way
group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.

Can you say more about your concern about "query and cluster size"?

On Wed, Jun 27, 2012 at 11:46 AM, Bejoy KS <be...@yahoo.com> wrote:
> Hi Richin
>
> I'm not an AWS guy but still lemme try answering a few questions in 
> general (not wrt AWS EMR)
>
>
> 1.       Having my external table data gzipped and reading it in the 
> table v/s no compression at all.
>
> Bejoy: When the data volume is large compression saves up the disk 
> space and hence it is recommended. Gzip is not splittable, means one 
> file can't be distributed across map tasks. Go in with Lzo or Snappy.
>
>
> 2.       Having the external table data on S3 v/s having it on HDFS?
>
> 3.       Creating intermediate external tables v/s non external tables 
> v/s creating views?
>
> Bejoy: External Tables are not much different than managed tables. In 
> managed tables when you drop the table using HQL the underlying data 
> in hdfs is also deleted but in case of External tables only the table 
> defn is dropped and the data in hdfs is preserved. Views are not 
> evaluated on its creation time, but when it is used MR jobs are 
> triggered and the required data is extracted out of source tables. So 
> if you are planning to reuse a view n number of times it'll be better 
> creating a table and using it else the view query will be evaluated n times.
>
>
> 4.       Storing the external table as Textfile v/s Sequence file. I 
> know sequence file compresses the data, but in what format? I read 
> about RC files and how efficient they are, how to use them?
>
> Bejoy: Sequence files are splitable on its own so it is a good choice 
> when using with Snappy. In sequence file, compression happens either 
> at record level or block level which is configurable at the time of 
> compressing. If you are using Gzip, TextFiles offer more compression 
> ratio as the whole data is compressed in a go compared with Sequence 
> files where it happens per record/block. But then you compromise on 
> splitability. RC files are a good choice when your queries involve 
> querying only a few columns rather than all columns in a row.
>
> 5.       How are number of reducers get set for a Hive query (The way 
> group by and order by sets the number of reducers to 1) ? If I am not 
> changing it explicitly does it pick it from the underlying Hadoop 
> cluster? I am trying to understand the bottleneck between query and cluster size.
>
> Bejoy: Some queries in hive are forced to have just a single reducer 
> like Order By. In case of other queries hive determines the number of reducers.
> However you can always specify the number of reducer on a per query 
> basis based on the data it process.
>
> 6.       Any other optimizations/ best practices?
>
> Bejoy: There are lots of other optimizations in hive which can be 
> injected based on the query. There are various join optimizations, 
> group by optimizations etc suited for specific needs.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ________________________________
> From: <ri...@nokia.com>
> Date: Wed, 27 Jun 2012 15:47:54 +0000
> To: <us...@hive.apache.org>
> ReplyTo: user@hive.apache.org
> Subject: Obvious and not so obvious query optimzations in Hive
>
> Hey Hivers,
>
>
>
> I am trying to understand what are some of the obvious and not so 
> obvious optimization I could do for a Hive Query on AWS EMR cluster. I 
> know the answer for some of these questions but want to know what do 
> you guys think and by what factor it affects the performance over the other approach.
>
>
>
> 1.       Having my external table data gzipped and reading it in the 
> table v/s no compression at all.
>
> 2.       Having the external table data on S3 v/s having it on HDFS?
>
> 3.       Creating intermediate external tables v/s non external tables 
> v/s creating views?
>
> 4.       Storing the external table as Textfile v/s Sequence file. I 
> know sequence file compresses the data, but in what format? I read 
> about RC files and how efficient they are, how to use them?
>
> 5.       How are number of reducers get set for a Hive query (The way 
> group by and order by sets the number of reducers to 1) ? If I am not 
> changing it explicitly does it pick it from the underlying Hadoop 
> cluster? I am trying to understand the bottleneck between query and cluster size.
>
> 6.       Any other optimizations/ best practices?
>
>
>
> Thanks a lot in advance.
>
> Richin

Re: Obvious and not so obvious query optimzations in Hive

Posted by yongqiang he <he...@gmail.com>.

1.       Having my external table data gzipped and reading it in the
table v/s no compression at all.

You may want GZip your data since it is offline. But space is not a
concern and you want to optimize CPU, use snappy.

With snappy, there is no reason to go with no compression.


3.       Creating intermediate external tables v/s non external tables
v/s creating views?

First go with normal tables. External tables are hard to manage.
Views are there for complex things which are hard to do with 'managed table'.

4.       Storing the external table as Textfile v/s Sequence file. I
know sequence file compresses the data, but in what format? I read
about RC files and how efficient they are, how to use them?

rcfile if you query your data with Hive. 'create table xxx(xxx) stored
as rcfile'

5.       How are number of reducers get set for a Hive query (The way
group by and order by sets the number of reducers to 1) ? If I am not
changing it explicitly does it pick it from the underlying Hadoop
cluster? I am trying to understand the bottleneck between query and
cluster size.

Can you say more about your concern about "query and cluster size"?

On Wed, Jun 27, 2012 at 11:46 AM, Bejoy KS <be...@yahoo.com> wrote:
> Hi Richin
>
> I'm not an AWS guy but still lemme try answering a few questions in general
> (not wrt AWS EMR)
>
>
> 1.       Having my external table data gzipped and reading it in the table
> v/s no compression at all.
>
> Bejoy: When the data volume is large compression saves up the disk space and
> hence it is recommended. Gzip is not splittable, means one file can't be
> distributed across map tasks. Go in with Lzo or Snappy.
>
>
> 2.       Having the external table data on S3 v/s having it on HDFS?
>
> 3.       Creating intermediate external tables v/s non external tables v/s
> creating views?
>
> Bejoy: External Tables are not much different than managed tables. In
> managed tables when you drop the table using HQL the underlying data in hdfs
> is also deleted but in case of External tables only the table defn is
> dropped and the data in hdfs is preserved. Views are not evaluated on its
> creation time, but when it is used MR jobs are triggered and the required
> data is extracted out of source tables. So if you are planning to reuse a
> view n number of times it'll be better creating a table and using it else
> the view query will be evaluated n times.
>
>
> 4.       Storing the external table as Textfile v/s Sequence file. I know
> sequence file compresses the data, but in what format? I read about RC files
> and how efficient they are, how to use them?
>
> Bejoy: Sequence files are splitable on its own so it is a good choice when
> using with Snappy. In sequence file, compression happens either at record
> level or block level which is configurable at the time of compressing. If
> you are using Gzip, TextFiles offer more compression ratio as the whole data
> is compressed in a go compared with Sequence files where it happens per
> record/block. But then you compromise on splitability. RC files are a good
> choice when your queries involve querying only a few columns rather than all
> columns in a row.
>
> 5.       How are number of reducers get set for a Hive query (The way group
> by and order by sets the number of reducers to 1) ? If I am not changing it
> explicitly does it pick it from the underlying Hadoop cluster? I am trying
> to understand the bottleneck between query and cluster size.
>
> Bejoy: Some queries in hive are forced to have just a single reducer like
> Order By. In case of other queries hive determines the number of reducers.
> However you can always specify the number of reducer on a per query basis
> based on the data it process.
>
> 6.       Any other optimizations/ best practices?
>
> Bejoy: There are lots of other optimizations in hive which can be injected
> based on the query. There are various join optimizations, group by
> optimizations etc suited for specific needs.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ________________________________
> From: <ri...@nokia.com>
> Date: Wed, 27 Jun 2012 15:47:54 +0000
> To: <us...@hive.apache.org>
> ReplyTo: user@hive.apache.org
> Subject: Obvious and not so obvious query optimzations in Hive
>
> Hey Hivers,
>
>
>
> I am trying to understand what are some of the obvious and not so obvious
> optimization I could do for a Hive Query on AWS EMR cluster. I know the
> answer for some of these questions but want to know what do you guys think
> and by what factor it affects the performance over the other approach.
>
>
>
> 1.       Having my external table data gzipped and reading it in the table
> v/s no compression at all.
>
> 2.       Having the external table data on S3 v/s having it on HDFS?
>
> 3.       Creating intermediate external tables v/s non external tables v/s
> creating views?
>
> 4.       Storing the external table as Textfile v/s Sequence file. I know
> sequence file compresses the data, but in what format? I read about RC files
> and how efficient they are, how to use them?
>
> 5.       How are number of reducers get set for a Hive query (The way group
> by and order by sets the number of reducers to 1) ? If I am not changing it
> explicitly does it pick it from the underlying Hadoop cluster? I am trying
> to understand the bottleneck between query and cluster size.
>
> 6.       Any other optimizations/ best practices?
>
>
>
> Thanks a lot in advance.
>
> Richin

Re: Obvious and not so obvious query optimzations in Hive

Posted by Bejoy KS <be...@yahoo.com>.

Hi Richin

I'm not an AWS guy but still lemme try answering a few questions in general (not wrt AWS EMR)

1.       Having my external table data gzipped and reading it in the table v/s no compression at all.

Bejoy: When the data volume is large compression saves up the disk space and hence it is recommended. Gzip is not splittable, means one file can't be distributed across map tasks. Go in with Lzo or Snappy.

2.       Having the external table data on S3 v/s having it on HDFS?

3.       Creating intermediate external tables v/s non external tables v/s creating views?

Bejoy: External Tables are not much different than managed tables. In managed tables when you drop the table using HQL the underlying data in hdfs is also deleted but in case of External tables only the table defn is dropped and the data in hdfs is preserved. Views are not evaluated on its creation time, but when it is used MR jobs are triggered and the required data is extracted out of source tables. So if you are planning to reuse a view n number of times it'll be better creating a table and using it else the view query will be evaluated n times.

4.       Storing the external table as Textfile v/s Sequence file. I know sequence file compresses the data, but in what format? I read about RC files and how efficient they are, how to use them?

Bejoy: Sequence files are splitable on its own so it is a good choice when using with Snappy. In sequence file, compression happens either at record level or block level which is configurable at the time of compressing. If you are using Gzip, TextFiles offer more compression ratio as the whole data is compressed in a go compared with Sequence files where it happens per record/block. But then you compromise on splitability. RC files are a good choice when your queries involve querying only a few columns rather than all columns in a row. 

5.       How are number of reducers get set for a Hive query (The way group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.

Bejoy: Some queries in hive are forced to have just a single reducer like Order By. In case of other queries hive determines the number of reducers. However you can always specify the number of reducer on a per query basis based on the data it process. 

6.       Any other optimizations/ best practices?

Bejoy: There are lots of other optimizations in hive which can be injected based on the query. There are various join optimizations, group by optimizations etc suited for specific needs.



Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: <ri...@nokia.com>
Date: Wed, 27 Jun 2012 15:47:54 
To: <us...@hive.apache.org>
Reply-To: user@hive.apache.org
Subject: Obvious and not so obvious query optimzations in Hive

Hey Hivers,

I am trying to understand what are some of the obvious and not so obvious optimization I could do for a Hive Query on AWS EMR cluster. I know the answer for some of these questions but want to know what do you guys think and by what factor it affects the performance over the other approach.


1.       Having my external table data gzipped and reading it in the table v/s no compression at all.

2.       Having the external table data on S3 v/s having it on HDFS?

3.       Creating intermediate external tables v/s non external tables v/s creating views?

4.       Storing the external table as Textfile v/s Sequence file. I know sequence file compresses the data, but in what format? I read about RC files and how efficient they are, how to use them?

5.       How are number of reducers get set for a Hive query (The way group by and order by sets the number of reducers to 1) ? If I am not changing it explicitly does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.

6.       Any other optimizations/ best practices?

Thanks a lot in advance.
Richin