You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Udit Mehta <um...@groupon.com> on 2016/04/27 03:01:04 UTC

Disable Hive autogather optimization

Hi,

We need to disable the Hive autogather stats optimization by disabling "
*hive.stats.autogather*" but for some reason, the config change doesnt seem
to go through. We modified this config in the hive-site.xml and restarted
the Hive metastore. We also made this change explicitly in the job but it
doesnt seem to help.



*set hive.stats.autogather=false;*
Does anyone know the right way to disable this config since we dont want to
compute stats in out jobs.

Thanks,
Udit

Re: Disable Hive autogather optimization

Posted by Udit Mehta <um...@groupon.com>.
thanks Mich. I will test this out and get back to you!

On Fri, Apr 29, 2016 at 4:42 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> apologies should read "Udit"
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 30 April 2016 at 00:35, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>> Hi Unit,
>>
>> *For new tables*
>>
>> Disable stats autogathering in Hive when creating a new table  and
>> populating it
>>
>> SET hive.stats.autogather=false;
>>
>> *Already existing tables*
>>
>> As a work-around you can try this on the already existing tables  by manually
>> alter the numRows to -1
>>
>> ALTER TABLE <table_name> PARTITION <partition_spec> SET TBLPROPERTIES
>> ('numRows'='-1');
>>
>> Example
>>
>> 0: jdbc:hive2://rhes564:10010/default> create table testme as select *
>> from sales_staging limit 1000;
>>
>> 0: jdbc:hive2://rhes564:10010/default> desc formatted  testme;
>>
>>
>> +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
>> |           col_name            |
>> data_type                            |           comment           |
>>
>> +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
>> | # col_name                    |
>> data_type                                                       |
>> comment                     |
>> |                               |
>> NULL                                                            |
>> NULL                        |
>> | prod_id                       |
>> double
>> |                             |
>> | cust_id                       |
>> double
>> |                             |
>> | time_id                       |
>> string
>> |                             |
>> | channel_id                    |
>> double
>> |                             |
>> | promo_id                      |
>> double
>> |                             |
>> | quantity_sold                 |
>> double
>> |                             |
>> | amount_sold                   |
>> double
>> |                             |
>> |                               |
>> NULL                                                            |
>> NULL                        |
>> | # Detailed Table Information  |
>> NULL                                                            |
>> NULL                        |
>> | Database:                     |
>> oraclehadoop                                                    |
>> NULL                        |
>> | Owner:                        |
>> hduser                                                          |
>> NULL                        |
>> | CreateTime:                   | Sat Apr 30 00:31:17 BST
>> 2016                                    | NULL                        |
>> | LastAccessTime:               |
>> UNKNOWN                                                         |
>> NULL                        |
>> | Retention:                    |
>> 0                                                               |
>> NULL                        |
>> | Location:                     |
>> hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme  |
>> NULL                        |
>> | Table Type:                   |
>> MANAGED_TABLE                                                   |
>> NULL                        |
>> | Table Parameters:             |
>> NULL                                                            |
>> NULL                        |
>> |                               |
>> COLUMN_STATS_ACCURATE                                           |
>> {\"BASIC_STATS\":\"true\"}  |
>> |                               |
>> numFiles                                                        |
>> 1                           |
>> |                               |
>> numRows                                                         |
>> 1000                        |
>> |                               |
>> rawDataSize                                                     |
>> 54853                       |
>> |                               |
>> totalSize                                                       |
>> 55853                       |
>> |                               |
>> transient_lastDdlTime                                           |
>> 1461972677                  |
>> |                               |
>> NULL                                                            |
>> NULL                        |
>> | # Storage Information         |
>> NULL                                                            |
>> NULL                        |
>> | SerDe Library:                |
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe              |
>> NULL                        |
>> | InputFormat:                  |
>> org.apache.hadoop.mapred.TextInputFormat                        |
>> NULL                        |
>> | OutputFormat:                 |
>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      |
>> NULL                        |
>> | Compressed:                   |
>> No                                                              |
>> NULL                        |
>> | Num Buckets:                  |
>> -1                                                              |
>> NULL                        |
>> | Bucket Columns:               |
>> []                                                              |
>> NULL                        |
>> | Sort Columns:                 |
>> []                                                              |
>> NULL                        |
>> | Storage Desc Params:          |
>> NULL                                                            |
>> NULL                        |
>> |                               |
>> serialization.format                                            |
>> 1                           |
>>
>> +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
>>
>> 0: jdbc:hive2://rhes564:10010/default>* ALTER TABLE testme  SET
>> TBLPROPERTIES ('numRows'='-1');*
>>
>> 0: jdbc:hive2://rhes564:10010/default> desc formatted  testme;
>>
>>
>> +-------------------------------+-----------------------------------------------------------------+-----------------------+--+
>> |           col_name            |
>> data_type                            |        comment        |
>>
>> +-------------------------------+-----------------------------------------------------------------+-----------------------+--+
>> | # col_name                    |
>> data_type                                                       |
>> comment               |
>> |                               |
>> NULL                                                            |
>> NULL                  |
>> | prod_id                       |
>> double
>> |                       |
>> | cust_id                       |
>> double
>> |                       |
>> | time_id                       |
>> string
>> |                       |
>> | channel_id                    |
>> double
>> |                       |
>> | promo_id                      |
>> double
>> |                       |
>> | quantity_sold                 |
>> double
>> |                       |
>> | amount_sold                   |
>> double
>> |                       |
>> |                               |
>> NULL                                                            |
>> NULL                  |
>> | # Detailed Table Information  |
>> NULL                                                            |
>> NULL                  |
>> | Database:                     |
>> oraclehadoop                                                    |
>> NULL                  |
>> | Owner:                        |
>> hduser                                                          |
>> NULL                  |
>> | CreateTime:                   | Sat Apr 30 00:31:17 BST
>> 2016                                    | NULL                  |
>> | LastAccessTime:               |
>> UNKNOWN                                                         |
>> NULL                  |
>> | Retention:                    |
>> 0                                                               |
>> NULL                  |
>> | Location:                     |
>> hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme  |
>> NULL                  |
>> | Table Type:                   |
>> MANAGED_TABLE                                                   |
>> NULL                  |
>> | Table Parameters:             |
>> NULL                                                            |
>> NULL                  |
>> |                               |
>> last_modified_by                                                |
>> hduser                |
>> |                               |
>> last_modified_time                                              |
>> 1461973002            |
>> |                               |
>> numFiles                                                        |
>> 1                     |
>> |                               |
>> numRows                                                         |
>> -1                    |
>> |                               |
>> rawDataSize                                                     |
>> 54853                 |
>> |                               |
>> totalSize                                                       |
>> 55853                 |
>> |                               |
>> transient_lastDdlTime                                           |
>> 1461973002            |
>> |                               |
>> NULL                                                            |
>> NULL                  |
>> | # Storage Information         |
>> NULL                                                            |
>> NULL                  |
>> | SerDe Library:                |
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe              |
>> NULL                  |
>> | InputFormat:                  |
>> org.apache.hadoop.mapred.TextInputFormat                        |
>> NULL                  |
>> | OutputFormat:                 |
>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      |
>> NULL                  |
>> | Compressed:                   |
>> No                                                              |
>> NULL                  |
>> | Num Buckets:                  |
>> -1                                                              |
>> NULL                  |
>> | Bucket Columns:               |
>> []                                                              |
>> NULL                  |
>> | Sort Columns:                 |
>> []                                                              |
>> NULL                  |
>> | Storage Desc Params:          |
>> NULL                                                            |
>> NULL                  |
>> |                               |
>> serialization.format                                            |
>> 1                     |
>>
>> +-------------------------------+-----------------------------------------------------------------+-----------------------+--+
>>
>> Hopefully that will turn off the autogather feature for existing tables.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 29 April 2016 at 23:32, Udit Mehta <um...@groupon.com> wrote:
>>
>>> Hi,
>>>
>>> Thanks for the replies.
>>> We have a scenario where we have an ETL job inserting into a table with
>>> thousands of partitions using dynamic partitioning. We have certain SLA's
>>> within which we would like the job to finish and sometimes there are
>>> scenarios where they are missed (extra data or a busy cluster). I
>>> understand that stats are essential for Hive CBO but we are trying to
>>> explore how much overhead do these stats collection add to the job runtime.
>>> A lot of these tables are intermediary tables so having stats for them
>>> might not be entirely necessary.
>>>
>>> I just wanted to figure if there was a easy way to disable the stats and
>>> then compare the performance.
>>>
>>> Mich, can you give more information on how to disable it in the table
>>> struct as I cant find any documentation on it.
>>>
>>> Thanks again.
>>> Udit
>>>
>>> On Fri, Apr 29, 2016 at 10:42 AM, Pengcheng Xiong <px...@apache.org>
>>> wrote:
>>>
>>>> Hi Udit,
>>>>
>>>>     Could u be more specific about your problem? Like, what settings
>>>> you have, what query you run and what is the result and what result do you
>>>> expect?
>>>>
>>>>     From what you said, my understanding is that, you want to wipe out
>>>> the basic stats for existing tables? And, could u also let us know why you
>>>> would like to get rid of the stats? Stats is crucial for Hive CBO to work
>>>> and we are moving towards the direction to make table/column stats
>>>> collection automatically. It seems that you prefer an opposite direction.
>>>> There is nothing wrong here and we would like to listen to your idea and
>>>> motivation so that we can better design Hive stats collection. Thanks!
>>>>
>>>> Best
>>>> Pengcheng
>>>>
>>>>
>>>> On Thu, Apr 28, 2016 at 4:12 PM, Udit Mehta <um...@groupon.com> wrote:
>>>>
>>>>> Any insights on this?
>>>>>
>>>>> On Tue, Apr 26, 2016 at 7:32 PM, Udit Mehta <um...@groupon.com>
>>>>> wrote:
>>>>>
>>>>>> Update: Realized this works if we create a fresh table with this
>>>>>> config already disabled but does not work if there is already a table
>>>>>> created when this config was enabled. We now need to figure out how to
>>>>>> disable this config for a table created when this config was true.
>>>>>>
>>>>>> On Tue, Apr 26, 2016 at 6:16 PM, Udit Mehta <um...@groupon.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hive version we are using is 1.2.1.
>>>>>>>
>>>>>>> On Tue, Apr 26, 2016 at 6:01 PM, Udit Mehta <um...@groupon.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We need to disable the Hive autogather stats optimization by
>>>>>>>> disabling "*hive.stats.autogather*" but for some reason, the
>>>>>>>> config change doesnt seem to go through. We modified this config in the
>>>>>>>> hive-site.xml and restarted the Hive metastore. We also made this change
>>>>>>>> explicitly in the job but it doesnt seem to help.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *set hive.stats.autogather=false;*
>>>>>>>> Does anyone know the right way to disable this config since we dont
>>>>>>>> want to compute stats in out jobs.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Udit
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Disable Hive autogather optimization

Posted by Mich Talebzadeh <mi...@gmail.com>.
apologies should read "Udit"

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 30 April 2016 at 00:35, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Hi Unit,
>
> *For new tables*
>
> Disable stats autogathering in Hive when creating a new table  and
> populating it
>
> SET hive.stats.autogather=false;
>
> *Already existing tables*
>
> As a work-around you can try this on the already existing tables  by manually
> alter the numRows to -1
>
> ALTER TABLE <table_name> PARTITION <partition_spec> SET TBLPROPERTIES
> ('numRows'='-1');
>
> Example
>
> 0: jdbc:hive2://rhes564:10010/default> create table testme as select *
> from sales_staging limit 1000;
>
> 0: jdbc:hive2://rhes564:10010/default> desc formatted  testme;
>
>
> +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
> |           col_name            |
> data_type                            |           comment           |
>
> +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
> | # col_name                    |
> data_type                                                       |
> comment                     |
> |                               |
> NULL                                                            |
> NULL                        |
> | prod_id                       |
> double
> |                             |
> | cust_id                       |
> double
> |                             |
> | time_id                       |
> string
> |                             |
> | channel_id                    |
> double
> |                             |
> | promo_id                      |
> double
> |                             |
> | quantity_sold                 |
> double
> |                             |
> | amount_sold                   |
> double
> |                             |
> |                               |
> NULL                                                            |
> NULL                        |
> | # Detailed Table Information  |
> NULL                                                            |
> NULL                        |
> | Database:                     |
> oraclehadoop                                                    |
> NULL                        |
> | Owner:                        |
> hduser                                                          |
> NULL                        |
> | CreateTime:                   | Sat Apr 30 00:31:17 BST
> 2016                                    | NULL                        |
> | LastAccessTime:               |
> UNKNOWN                                                         |
> NULL                        |
> | Retention:                    |
> 0                                                               |
> NULL                        |
> | Location:                     |
> hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme  |
> NULL                        |
> | Table Type:                   |
> MANAGED_TABLE                                                   |
> NULL                        |
> | Table Parameters:             |
> NULL                                                            |
> NULL                        |
> |                               |
> COLUMN_STATS_ACCURATE                                           |
> {\"BASIC_STATS\":\"true\"}  |
> |                               |
> numFiles                                                        |
> 1                           |
> |                               |
> numRows                                                         |
> 1000                        |
> |                               |
> rawDataSize                                                     |
> 54853                       |
> |                               |
> totalSize                                                       |
> 55853                       |
> |                               |
> transient_lastDdlTime                                           |
> 1461972677                  |
> |                               |
> NULL                                                            |
> NULL                        |
> | # Storage Information         |
> NULL                                                            |
> NULL                        |
> | SerDe Library:                |
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe              |
> NULL                        |
> | InputFormat:                  |
> org.apache.hadoop.mapred.TextInputFormat                        |
> NULL                        |
> | OutputFormat:                 |
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      |
> NULL                        |
> | Compressed:                   |
> No                                                              |
> NULL                        |
> | Num Buckets:                  |
> -1                                                              |
> NULL                        |
> | Bucket Columns:               |
> []                                                              |
> NULL                        |
> | Sort Columns:                 |
> []                                                              |
> NULL                        |
> | Storage Desc Params:          |
> NULL                                                            |
> NULL                        |
> |                               |
> serialization.format                                            |
> 1                           |
>
> +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
>
> 0: jdbc:hive2://rhes564:10010/default>* ALTER TABLE testme  SET
> TBLPROPERTIES ('numRows'='-1');*
>
> 0: jdbc:hive2://rhes564:10010/default> desc formatted  testme;
>
>
> +-------------------------------+-----------------------------------------------------------------+-----------------------+--+
> |           col_name            |
> data_type                            |        comment        |
>
> +-------------------------------+-----------------------------------------------------------------+-----------------------+--+
> | # col_name                    |
> data_type                                                       |
> comment               |
> |                               |
> NULL                                                            |
> NULL                  |
> | prod_id                       |
> double
> |                       |
> | cust_id                       |
> double
> |                       |
> | time_id                       |
> string
> |                       |
> | channel_id                    |
> double
> |                       |
> | promo_id                      |
> double
> |                       |
> | quantity_sold                 |
> double
> |                       |
> | amount_sold                   |
> double
> |                       |
> |                               |
> NULL                                                            |
> NULL                  |
> | # Detailed Table Information  |
> NULL                                                            |
> NULL                  |
> | Database:                     |
> oraclehadoop                                                    |
> NULL                  |
> | Owner:                        |
> hduser                                                          |
> NULL                  |
> | CreateTime:                   | Sat Apr 30 00:31:17 BST
> 2016                                    | NULL                  |
> | LastAccessTime:               |
> UNKNOWN                                                         |
> NULL                  |
> | Retention:                    |
> 0                                                               |
> NULL                  |
> | Location:                     |
> hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme  |
> NULL                  |
> | Table Type:                   |
> MANAGED_TABLE                                                   |
> NULL                  |
> | Table Parameters:             |
> NULL                                                            |
> NULL                  |
> |                               |
> last_modified_by                                                |
> hduser                |
> |                               |
> last_modified_time                                              |
> 1461973002            |
> |                               |
> numFiles                                                        |
> 1                     |
> |                               |
> numRows                                                         |
> -1                    |
> |                               |
> rawDataSize                                                     |
> 54853                 |
> |                               |
> totalSize                                                       |
> 55853                 |
> |                               |
> transient_lastDdlTime                                           |
> 1461973002            |
> |                               |
> NULL                                                            |
> NULL                  |
> | # Storage Information         |
> NULL                                                            |
> NULL                  |
> | SerDe Library:                |
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe              |
> NULL                  |
> | InputFormat:                  |
> org.apache.hadoop.mapred.TextInputFormat                        |
> NULL                  |
> | OutputFormat:                 |
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      |
> NULL                  |
> | Compressed:                   |
> No                                                              |
> NULL                  |
> | Num Buckets:                  |
> -1                                                              |
> NULL                  |
> | Bucket Columns:               |
> []                                                              |
> NULL                  |
> | Sort Columns:                 |
> []                                                              |
> NULL                  |
> | Storage Desc Params:          |
> NULL                                                            |
> NULL                  |
> |                               |
> serialization.format                                            |
> 1                     |
>
> +-------------------------------+-----------------------------------------------------------------+-----------------------+--+
>
> Hopefully that will turn off the autogather feature for existing tables.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 29 April 2016 at 23:32, Udit Mehta <um...@groupon.com> wrote:
>
>> Hi,
>>
>> Thanks for the replies.
>> We have a scenario where we have an ETL job inserting into a table with
>> thousands of partitions using dynamic partitioning. We have certain SLA's
>> within which we would like the job to finish and sometimes there are
>> scenarios where they are missed (extra data or a busy cluster). I
>> understand that stats are essential for Hive CBO but we are trying to
>> explore how much overhead do these stats collection add to the job runtime.
>> A lot of these tables are intermediary tables so having stats for them
>> might not be entirely necessary.
>>
>> I just wanted to figure if there was a easy way to disable the stats and
>> then compare the performance.
>>
>> Mich, can you give more information on how to disable it in the table
>> struct as I cant find any documentation on it.
>>
>> Thanks again.
>> Udit
>>
>> On Fri, Apr 29, 2016 at 10:42 AM, Pengcheng Xiong <px...@apache.org>
>> wrote:
>>
>>> Hi Udit,
>>>
>>>     Could u be more specific about your problem? Like, what settings you
>>> have, what query you run and what is the result and what result do you
>>> expect?
>>>
>>>     From what you said, my understanding is that, you want to wipe out
>>> the basic stats for existing tables? And, could u also let us know why you
>>> would like to get rid of the stats? Stats is crucial for Hive CBO to work
>>> and we are moving towards the direction to make table/column stats
>>> collection automatically. It seems that you prefer an opposite direction.
>>> There is nothing wrong here and we would like to listen to your idea and
>>> motivation so that we can better design Hive stats collection. Thanks!
>>>
>>> Best
>>> Pengcheng
>>>
>>>
>>> On Thu, Apr 28, 2016 at 4:12 PM, Udit Mehta <um...@groupon.com> wrote:
>>>
>>>> Any insights on this?
>>>>
>>>> On Tue, Apr 26, 2016 at 7:32 PM, Udit Mehta <um...@groupon.com> wrote:
>>>>
>>>>> Update: Realized this works if we create a fresh table with this
>>>>> config already disabled but does not work if there is already a table
>>>>> created when this config was enabled. We now need to figure out how to
>>>>> disable this config for a table created when this config was true.
>>>>>
>>>>> On Tue, Apr 26, 2016 at 6:16 PM, Udit Mehta <um...@groupon.com>
>>>>> wrote:
>>>>>
>>>>>> Hive version we are using is 1.2.1.
>>>>>>
>>>>>> On Tue, Apr 26, 2016 at 6:01 PM, Udit Mehta <um...@groupon.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We need to disable the Hive autogather stats optimization by
>>>>>>> disabling "*hive.stats.autogather*" but for some reason, the config
>>>>>>> change doesnt seem to go through. We modified this config in the
>>>>>>> hive-site.xml and restarted the Hive metastore. We also made this change
>>>>>>> explicitly in the job but it doesnt seem to help.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *set hive.stats.autogather=false;*
>>>>>>> Does anyone know the right way to disable this config since we dont
>>>>>>> want to compute stats in out jobs.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Udit
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Disable Hive autogather optimization

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Unit,

*For new tables*

Disable stats autogathering in Hive when creating a new table  and
populating it

SET hive.stats.autogather=false;

*Already existing tables*

As a work-around you can try this on the already existing tables  by manually
alter the numRows to -1

ALTER TABLE <table_name> PARTITION <partition_spec> SET TBLPROPERTIES
('numRows'='-1');

Example

0: jdbc:hive2://rhes564:10010/default> create table testme as select * from
sales_staging limit 1000;

0: jdbc:hive2://rhes564:10010/default> desc formatted  testme;

+-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
|           col_name            |
data_type                            |           comment           |
+-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
| # col_name                    |
data_type                                                       |
comment                     |
|                               |
NULL                                                            |
NULL                        |
| prod_id                       |
double
|                             |
| cust_id                       |
double
|                             |
| time_id                       |
string
|                             |
| channel_id                    |
double
|                             |
| promo_id                      |
double
|                             |
| quantity_sold                 |
double
|                             |
| amount_sold                   |
double
|                             |
|                               |
NULL                                                            |
NULL                        |
| # Detailed Table Information  |
NULL                                                            |
NULL                        |
| Database:                     |
oraclehadoop                                                    |
NULL                        |
| Owner:                        |
hduser                                                          |
NULL                        |
| CreateTime:                   | Sat Apr 30 00:31:17 BST
2016                                    | NULL                        |
| LastAccessTime:               |
UNKNOWN                                                         |
NULL                        |
| Retention:                    |
0                                                               |
NULL                        |
| Location:                     |
hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme  |
NULL                        |
| Table Type:                   |
MANAGED_TABLE                                                   |
NULL                        |
| Table Parameters:             |
NULL                                                            |
NULL                        |
|                               |
COLUMN_STATS_ACCURATE                                           |
{\"BASIC_STATS\":\"true\"}  |
|                               |
numFiles                                                        |
1                           |
|                               |
numRows                                                         |
1000                        |
|                               |
rawDataSize                                                     |
54853                       |
|                               |
totalSize                                                       |
55853                       |
|                               |
transient_lastDdlTime                                           |
1461972677                  |
|                               |
NULL                                                            |
NULL                        |
| # Storage Information         |
NULL                                                            |
NULL                        |
| SerDe Library:                |
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe              |
NULL                        |
| InputFormat:                  |
org.apache.hadoop.mapred.TextInputFormat                        |
NULL                        |
| OutputFormat:                 |
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      |
NULL                        |
| Compressed:                   |
No                                                              |
NULL                        |
| Num Buckets:                  |
-1                                                              |
NULL                        |
| Bucket Columns:               |
[]                                                              |
NULL                        |
| Sort Columns:                 |
[]                                                              |
NULL                        |
| Storage Desc Params:          |
NULL                                                            |
NULL                        |
|                               |
serialization.format                                            |
1                           |
+-------------------------------+-----------------------------------------------------------------+-----------------------------+--+

0: jdbc:hive2://rhes564:10010/default>* ALTER TABLE testme  SET
TBLPROPERTIES ('numRows'='-1');*

0: jdbc:hive2://rhes564:10010/default> desc formatted  testme;

+-------------------------------+-----------------------------------------------------------------+-----------------------+--+
|           col_name            |
data_type                            |        comment        |
+-------------------------------+-----------------------------------------------------------------+-----------------------+--+
| # col_name                    |
data_type                                                       |
comment               |
|                               |
NULL                                                            |
NULL                  |
| prod_id                       |
double
|                       |
| cust_id                       |
double
|                       |
| time_id                       |
string
|                       |
| channel_id                    |
double
|                       |
| promo_id                      |
double
|                       |
| quantity_sold                 |
double
|                       |
| amount_sold                   |
double
|                       |
|                               |
NULL                                                            |
NULL                  |
| # Detailed Table Information  |
NULL                                                            |
NULL                  |
| Database:                     |
oraclehadoop                                                    |
NULL                  |
| Owner:                        |
hduser                                                          |
NULL                  |
| CreateTime:                   | Sat Apr 30 00:31:17 BST
2016                                    | NULL                  |
| LastAccessTime:               |
UNKNOWN                                                         |
NULL                  |
| Retention:                    |
0                                                               |
NULL                  |
| Location:                     |
hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme  |
NULL                  |
| Table Type:                   |
MANAGED_TABLE                                                   |
NULL                  |
| Table Parameters:             |
NULL                                                            |
NULL                  |
|                               |
last_modified_by                                                |
hduser                |
|                               |
last_modified_time                                              |
1461973002            |
|                               |
numFiles                                                        |
1                     |
|                               |
numRows                                                         |
-1                    |
|                               |
rawDataSize                                                     |
54853                 |
|                               |
totalSize                                                       |
55853                 |
|                               |
transient_lastDdlTime                                           |
1461973002            |
|                               |
NULL                                                            |
NULL                  |
| # Storage Information         |
NULL                                                            |
NULL                  |
| SerDe Library:                |
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe              |
NULL                  |
| InputFormat:                  |
org.apache.hadoop.mapred.TextInputFormat                        |
NULL                  |
| OutputFormat:                 |
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      |
NULL                  |
| Compressed:                   |
No                                                              |
NULL                  |
| Num Buckets:                  |
-1                                                              |
NULL                  |
| Bucket Columns:               |
[]                                                              |
NULL                  |
| Sort Columns:                 |
[]                                                              |
NULL                  |
| Storage Desc Params:          |
NULL                                                            |
NULL                  |
|                               |
serialization.format                                            |
1                     |
+-------------------------------+-----------------------------------------------------------------+-----------------------+--+

Hopefully that will turn off the autogather feature for existing tables.

HTH

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 29 April 2016 at 23:32, Udit Mehta <um...@groupon.com> wrote:

> Hi,
>
> Thanks for the replies.
> We have a scenario where we have an ETL job inserting into a table with
> thousands of partitions using dynamic partitioning. We have certain SLA's
> within which we would like the job to finish and sometimes there are
> scenarios where they are missed (extra data or a busy cluster). I
> understand that stats are essential for Hive CBO but we are trying to
> explore how much overhead do these stats collection add to the job runtime.
> A lot of these tables are intermediary tables so having stats for them
> might not be entirely necessary.
>
> I just wanted to figure if there was a easy way to disable the stats and
> then compare the performance.
>
> Mich, can you give more information on how to disable it in the table
> struct as I cant find any documentation on it.
>
> Thanks again.
> Udit
>
> On Fri, Apr 29, 2016 at 10:42 AM, Pengcheng Xiong <px...@apache.org>
> wrote:
>
>> Hi Udit,
>>
>>     Could u be more specific about your problem? Like, what settings you
>> have, what query you run and what is the result and what result do you
>> expect?
>>
>>     From what you said, my understanding is that, you want to wipe out
>> the basic stats for existing tables? And, could u also let us know why you
>> would like to get rid of the stats? Stats is crucial for Hive CBO to work
>> and we are moving towards the direction to make table/column stats
>> collection automatically. It seems that you prefer an opposite direction.
>> There is nothing wrong here and we would like to listen to your idea and
>> motivation so that we can better design Hive stats collection. Thanks!
>>
>> Best
>> Pengcheng
>>
>>
>> On Thu, Apr 28, 2016 at 4:12 PM, Udit Mehta <um...@groupon.com> wrote:
>>
>>> Any insights on this?
>>>
>>> On Tue, Apr 26, 2016 at 7:32 PM, Udit Mehta <um...@groupon.com> wrote:
>>>
>>>> Update: Realized this works if we create a fresh table with this config
>>>> already disabled but does not work if there is already a table created when
>>>> this config was enabled. We now need to figure out how to disable this
>>>> config for a table created when this config was true.
>>>>
>>>> On Tue, Apr 26, 2016 at 6:16 PM, Udit Mehta <um...@groupon.com> wrote:
>>>>
>>>>> Hive version we are using is 1.2.1.
>>>>>
>>>>> On Tue, Apr 26, 2016 at 6:01 PM, Udit Mehta <um...@groupon.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We need to disable the Hive autogather stats optimization by
>>>>>> disabling "*hive.stats.autogather*" but for some reason, the config
>>>>>> change doesnt seem to go through. We modified this config in the
>>>>>> hive-site.xml and restarted the Hive metastore. We also made this change
>>>>>> explicitly in the job but it doesnt seem to help.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *set hive.stats.autogather=false;*
>>>>>> Does anyone know the right way to disable this config since we dont
>>>>>> want to compute stats in out jobs.
>>>>>>
>>>>>> Thanks,
>>>>>> Udit
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Disable Hive autogather optimization

Posted by Udit Mehta <um...@groupon.com>.
Hi,

Thanks for the replies.
We have a scenario where we have an ETL job inserting into a table with
thousands of partitions using dynamic partitioning. We have certain SLA's
within which we would like the job to finish and sometimes there are
scenarios where they are missed (extra data or a busy cluster). I
understand that stats are essential for Hive CBO but we are trying to
explore how much overhead do these stats collection add to the job runtime.
A lot of these tables are intermediary tables so having stats for them
might not be entirely necessary.

I just wanted to figure if there was a easy way to disable the stats and
then compare the performance.

Mich, can you give more information on how to disable it in the table
struct as I cant find any documentation on it.

Thanks again.
Udit

On Fri, Apr 29, 2016 at 10:42 AM, Pengcheng Xiong <px...@apache.org> wrote:

> Hi Udit,
>
>     Could u be more specific about your problem? Like, what settings you
> have, what query you run and what is the result and what result do you
> expect?
>
>     From what you said, my understanding is that, you want to wipe out the
> basic stats for existing tables? And, could u also let us know why you
> would like to get rid of the stats? Stats is crucial for Hive CBO to work
> and we are moving towards the direction to make table/column stats
> collection automatically. It seems that you prefer an opposite direction.
> There is nothing wrong here and we would like to listen to your idea and
> motivation so that we can better design Hive stats collection. Thanks!
>
> Best
> Pengcheng
>
>
> On Thu, Apr 28, 2016 at 4:12 PM, Udit Mehta <um...@groupon.com> wrote:
>
>> Any insights on this?
>>
>> On Tue, Apr 26, 2016 at 7:32 PM, Udit Mehta <um...@groupon.com> wrote:
>>
>>> Update: Realized this works if we create a fresh table with this config
>>> already disabled but does not work if there is already a table created when
>>> this config was enabled. We now need to figure out how to disable this
>>> config for a table created when this config was true.
>>>
>>> On Tue, Apr 26, 2016 at 6:16 PM, Udit Mehta <um...@groupon.com> wrote:
>>>
>>>> Hive version we are using is 1.2.1.
>>>>
>>>> On Tue, Apr 26, 2016 at 6:01 PM, Udit Mehta <um...@groupon.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We need to disable the Hive autogather stats optimization by disabling
>>>>> "*hive.stats.autogather*" but for some reason, the config change
>>>>> doesnt seem to go through. We modified this config in the hive-site.xml and
>>>>> restarted the Hive metastore. We also made this change explicitly in the
>>>>> job but it doesnt seem to help.
>>>>>
>>>>>
>>>>>
>>>>> *set hive.stats.autogather=false;*
>>>>> Does anyone know the right way to disable this config since we dont
>>>>> want to compute stats in out jobs.
>>>>>
>>>>> Thanks,
>>>>> Udit
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Disable Hive autogather optimization

Posted by Pengcheng Xiong <px...@apache.org>.
Hi Udit,

    Could u be more specific about your problem? Like, what settings you
have, what query you run and what is the result and what result do you
expect?

    From what you said, my understanding is that, you want to wipe out the
basic stats for existing tables? And, could u also let us know why you
would like to get rid of the stats? Stats is crucial for Hive CBO to work
and we are moving towards the direction to make table/column stats
collection automatically. It seems that you prefer an opposite direction.
There is nothing wrong here and we would like to listen to your idea and
motivation so that we can better design Hive stats collection. Thanks!

Best
Pengcheng


On Thu, Apr 28, 2016 at 4:12 PM, Udit Mehta <um...@groupon.com> wrote:

> Any insights on this?
>
> On Tue, Apr 26, 2016 at 7:32 PM, Udit Mehta <um...@groupon.com> wrote:
>
>> Update: Realized this works if we create a fresh table with this config
>> already disabled but does not work if there is already a table created when
>> this config was enabled. We now need to figure out how to disable this
>> config for a table created when this config was true.
>>
>> On Tue, Apr 26, 2016 at 6:16 PM, Udit Mehta <um...@groupon.com> wrote:
>>
>>> Hive version we are using is 1.2.1.
>>>
>>> On Tue, Apr 26, 2016 at 6:01 PM, Udit Mehta <um...@groupon.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> We need to disable the Hive autogather stats optimization by disabling "
>>>> *hive.stats.autogather*" but for some reason, the config change doesnt
>>>> seem to go through. We modified this config in the hive-site.xml and
>>>> restarted the Hive metastore. We also made this change explicitly in the
>>>> job but it doesnt seem to help.
>>>>
>>>>
>>>>
>>>> *set hive.stats.autogather=false;*
>>>> Does anyone know the right way to disable this config since we dont
>>>> want to compute stats in out jobs.
>>>>
>>>> Thanks,
>>>> Udit
>>>>
>>>
>>>
>>
>

Re: Disable Hive autogather optimization

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi
Is this what is detailed in the following Jira
<https://issues.apache.org/jira/browse/HIVE-11160>

Description

Hive will collect table stats when set hive.stats.autogather=true during
the INSERT OVERWRITE command. And then the users need to collect the column
stats themselves using "Analyze" command. In this patch, the column stats
will also be collected automatically. More specifically, INSERT OVERWRITE
will automatically create new column stats. INSERT INTO will automatically
merge new column stats with existing ones.


Ok the issue you are having is when INSERT OVERWRITE operation is involved
in an existing table, then column stats kicks in and that adds to timing
process?


Sounds like it is a  general feature and can be disabled as part of table
struct.





Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 29 April 2016 at 00:12, Udit Mehta <um...@groupon.com> wrote:

> Any insights on this?
>
> On Tue, Apr 26, 2016 at 7:32 PM, Udit Mehta <um...@groupon.com> wrote:
>
>> Update: Realized this works if we create a fresh table with this config
>> already disabled but does not work if there is already a table created when
>> this config was enabled. We now need to figure out how to disable this
>> config for a table created when this config was true.
>>
>> On Tue, Apr 26, 2016 at 6:16 PM, Udit Mehta <um...@groupon.com> wrote:
>>
>>> Hive version we are using is 1.2.1.
>>>
>>> On Tue, Apr 26, 2016 at 6:01 PM, Udit Mehta <um...@groupon.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> We need to disable the Hive autogather stats optimization by disabling "
>>>> *hive.stats.autogather*" but for some reason, the config change doesnt
>>>> seem to go through. We modified this config in the hive-site.xml and
>>>> restarted the Hive metastore. We also made this change explicitly in the
>>>> job but it doesnt seem to help.
>>>>
>>>>
>>>>
>>>> *set hive.stats.autogather=false;*
>>>> Does anyone know the right way to disable this config since we dont
>>>> want to compute stats in out jobs.
>>>>
>>>> Thanks,
>>>> Udit
>>>>
>>>
>>>
>>
>

Re: Disable Hive autogather optimization

Posted by Udit Mehta <um...@groupon.com>.
Any insights on this?

On Tue, Apr 26, 2016 at 7:32 PM, Udit Mehta <um...@groupon.com> wrote:

> Update: Realized this works if we create a fresh table with this config
> already disabled but does not work if there is already a table created when
> this config was enabled. We now need to figure out how to disable this
> config for a table created when this config was true.
>
> On Tue, Apr 26, 2016 at 6:16 PM, Udit Mehta <um...@groupon.com> wrote:
>
>> Hive version we are using is 1.2.1.
>>
>> On Tue, Apr 26, 2016 at 6:01 PM, Udit Mehta <um...@groupon.com> wrote:
>>
>>> Hi,
>>>
>>> We need to disable the Hive autogather stats optimization by disabling "
>>> *hive.stats.autogather*" but for some reason, the config change doesnt
>>> seem to go through. We modified this config in the hive-site.xml and
>>> restarted the Hive metastore. We also made this change explicitly in the
>>> job but it doesnt seem to help.
>>>
>>>
>>>
>>> *set hive.stats.autogather=false;*
>>> Does anyone know the right way to disable this config since we dont want
>>> to compute stats in out jobs.
>>>
>>> Thanks,
>>> Udit
>>>
>>
>>
>

Re: Disable Hive autogather optimization

Posted by Udit Mehta <um...@groupon.com>.
Update: Realized this works if we create a fresh table with this config
already disabled but does not work if there is already a table created when
this config was enabled. We now need to figure out how to disable this
config for a table created when this config was true.

On Tue, Apr 26, 2016 at 6:16 PM, Udit Mehta <um...@groupon.com> wrote:

> Hive version we are using is 1.2.1.
>
> On Tue, Apr 26, 2016 at 6:01 PM, Udit Mehta <um...@groupon.com> wrote:
>
>> Hi,
>>
>> We need to disable the Hive autogather stats optimization by disabling "
>> *hive.stats.autogather*" but for some reason, the config change doesnt
>> seem to go through. We modified this config in the hive-site.xml and
>> restarted the Hive metastore. We also made this change explicitly in the
>> job but it doesnt seem to help.
>>
>>
>>
>> *set hive.stats.autogather=false;*
>> Does anyone know the right way to disable this config since we dont want
>> to compute stats in out jobs.
>>
>> Thanks,
>> Udit
>>
>
>

Re: Disable Hive autogather optimization

Posted by Udit Mehta <um...@groupon.com>.
Hive version we are using is 1.2.1.

On Tue, Apr 26, 2016 at 6:01 PM, Udit Mehta <um...@groupon.com> wrote:

> Hi,
>
> We need to disable the Hive autogather stats optimization by disabling "
> *hive.stats.autogather*" but for some reason, the config change doesnt
> seem to go through. We modified this config in the hive-site.xml and
> restarted the Hive metastore. We also made this change explicitly in the
> job but it doesnt seem to help.
>
>
>
> *set hive.stats.autogather=false;*
> Does anyone know the right way to disable this config since we dont want
> to compute stats in out jobs.
>
> Thanks,
> Udit
>