You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Michael Häusler <mi...@akatose.de> on 2016/06/13 15:00:49 UTC

column statistics for non-primitive types

Hi there,


when testing column statistics I stumbled upon the following error message:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar STRUCT<key:STRING,value:STRING>);

ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.

ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but struct<key:string,value:string> is passed.


1) Basically, it seems that column statistics don't work for non-primitive types. Are there any workarounds or any plans to change this?

2) Furthermore, the convenience syntax to compute statistics for all columns does not work as soon as there is a non-supported column. Are there any plans to change this, so it is easier to compute statistics for all supported columns?

3) ANALYZE TABLE will only provide the first failing *type* in the error message. Especially for wide tables it would be much easier if all non-supported column *names* would be printed.


Best regards
Michael

Re: column statistics for non-primitive types

Posted by Mich Talebzadeh <mi...@gmail.com>.

hm,

I am on Hive 2 and still the same

hive> create table testme as select * from oraclehadoop.sales_staging where
1 = 2;

hive> insert into testme select * from sales_staging limit  100000;

hive> desc formatted testme;
OK
# col_name              data_type               comment
prod_id                 bigint
cust_id                 bigint
time_id                 timestamp
channel_id              bigint
promo_id                bigint
quantity_sold           decimal(10,0)
amount_sold             decimal(10,0)
# Detailed Table Information
Database:               test
Owner:                  hduser
CreateTime:             Tue Jun 14 23:00:55 BST 2016
LastAccessTime:         UNKNOWN
Retention:              0
Location:
hdfs://rhes564:9000/user/hive/warehouse/test.db/testme
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                2
        numRows                 100000
        rawDataSize             3848068
        totalSize               3948068
        transient_lastDdlTime   1465941690

hive> analyze table testme compute statistics
*for columns;*OK
hive> desc formatted testme;
OK
# col_name              data_type               comment
prod_id                 bigint
cust_id                 bigint
time_id                 timestamp
channel_id              bigint
promo_id                bigint
quantity_sold           decimal(10,0)
amount_sold             decimal(10,0)
# Detailed Table Information
Database:               test
Owner:                  hduser
CreateTime:             Tue Jun 14 23:00:55 BST 2016
LastAccessTime:         UNKNOWN
Retention:              0
Location:
hdfs://rhes564:9000/user/hive/warehouse/test.db/testme
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE
{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"prod_id\":\"true\",\"cust_id\":\"true\",\"time_id\":\"true\",\"channel_id\":\"true\",\"promo_id\":\"true\",\"quantity_sold\":\"true\",\"amount_sold\":\"true\"}}
        numFiles                2
        numRows                 100000
        rawDataSize             3848068
        totalSize               3948068
        transient_lastDdlTime   1465941690


Although there is gain to be made by having up-to-date stats, your quickest
performance buck is going ton come by running Hive on Spark engine (order
of magnitude) or using Spark on Hive tables. As ever your mileage varies
depending on the availability of RAM on your cluster. Having external
indexes visible to Hive optimizer will help but I suppose that is another
discussion

HTH

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 22:51, Michael Häusler <mi...@akatose.de> wrote:

> Hi Mich,
>
> as we are still on Hive 1.2.1, it is only working like this for basic
> stats.
> I would welcome it though, if it would work for column statistics as well
> - and it seems this feature is coming via HIVE-11160.
>
> Best
> Michael
>
> On 2016-06-14, at 23:42, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> Hi,
>
> Is this automatic stats update is basic statistics or for all columns?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 June 2016 at 22:10, Michael Häusler <mi...@akatose.de> wrote:
>
>> Hi Mich,
>>
>> I agree with Pengcheng here. Automatic stats gathering can be extremely
>> useful - and it is configurable.
>>
>> E.g., initial import into Hive happens as CSV or AVRO. Then you might
>> want to do a conversion to ORC within Hive via create-table-as-select. At
>> that point Hive is reading all the records anyway and we might just as well
>> get as much useful information as possible for stats.
>>
>> Best
>> Michael
>>
>>
>> On 2016-06-14, at 23:05, Pengcheng Xiong <px...@apache.org> wrote:
>>
>> Hi Mich,
>>
>>     I agree with you that column stats gathering in Hive is not cheap and
>> comes with overheads. This is due to the large volume of data that Hive has
>> to process. However, this is the price you have to pay anyway even with
>> current "analyze table" solution.
>>
>>    The new feature not only provides a way to make users have column
>> stats automatically, but also saves overhead for the "insert into" case. In
>> this case, the new column stats are generated incrementally, i.e., by
>> merging with the existing stats. Without this feature, you have to scan the
>> whole table and compute stats.
>>
>>    In conclusion, this new feature should not have any more overhead than
>> the current solution.
>>
>> Best
>> Pengcheng
>>
>> On Tue, Jun 14, 2016 at 1:41 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> hi,
>>>
>>> (2) There is a configuration "hive.stats.fetch.column.stats". If you set
>>> it to true, it will automatically collect column stats for you when you
>>> insert into/overwrite a new table. You can refer to HIVE-11160 for more
>>> details.
>>>
>>> Not without its overheads.
>>>
>>> Automatic gather stats is not new. Has been around for a good time in
>>> RDBMS and can impact the performance of other queries running. So I am not
>>> sure it can be considered as blessing.
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 14 June 2016 at 21:25, Pengcheng Xiong <px...@apache.org> wrote:
>>>
>>>> Exactly, "the useful or meaningful these statistics is going to be"
>>>> (The motivation behind).
>>>>
>>>> Best
>>>> Pengcheng
>>>>
>>>>
>>>> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> My point was we are where we are and in this juncture there is no
>>>>> collection of statistics for complex columns. That may be a future
>>>>> enhancement.
>>>>>
>>>>> But then the obvious question is how useful or meaningful these
>>>>> statistics is going to be?
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 14 June 2016 at 21:03, Michael Häusler <mi...@akatose.de> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> there might be two topics here:
>>>>>>
>>>>>> 1) feasibility of stats for non-primitive columns
>>>>>> 2) ease of use
>>>>>>
>>>>>>
>>>>>> 1) feasibility of stats for non-primitive columns:
>>>>>>
>>>>>> Hive currently collects different kind of statistics for different
>>>>>> kind of types:
>>>>>> numeric values: min, max, #nulls, #distincts
>>>>>> boolean values: #nulls, #trues, #falses
>>>>>> string values: #nulls, #distincts, avgLength, maxLength
>>>>>>
>>>>>> So, it seems quite possible to also collect at least partial stats
>>>>>> for top-level non-primitive columns, e.g.:
>>>>>> array values: #nulls, #distincts, avgLength, maxLength
>>>>>> map values: #nulls, #distincts, avgLength, maxLength
>>>>>> struct values: #nulls, #distincts
>>>>>> union values: #nulls, #distincts
>>>>>>
>>>>>>
>>>>>> 2) ease of use
>>>>>>
>>>>>> The presence of a single non-primitive column currently breaks the
>>>>>> use of the convenience shorthand to gather statistics for all columns
>>>>>> (ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down
>>>>>> adoption of column statistics for hive users.
>>>>>>
>>>>>> Best regards
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2016-06-14, at 12:04, Mich Talebzadeh <mi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> Statistics for columns in Hive are kept in Hive metadata table
>>>>>> tab_col_stats.
>>>>>>
>>>>>> When I am looking at this table in Oracle, I only see statistics for
>>>>>> primitives columns here. STRUCT columns do not have it as a STRUCT column
>>>>>> will have to be broken into its primitive columns.  I don't think Hive has
>>>>>> the means to do that.
>>>>>>
>>>>>> desc tab_col_stats;
>>>>>>  Name
>>>>>> Null?    Type
>>>>>>  ------------------------------------------------------------------------
>>>>>> -------- -------------------------------------------------
>>>>>>  CS_ID
>>>>>> NOT NULL NUMBER
>>>>>>  DB_NAME
>>>>>> NOT NULL VARCHAR2(128)
>>>>>>  TABLE_NAME
>>>>>> NOT NULL VARCHAR2(128)
>>>>>>  COLUMN_NAME
>>>>>> NOT NULL VARCHAR2(1000)
>>>>>>  COLUMN_TYPE
>>>>>> NOT NULL VARCHAR2(128)
>>>>>>  TBL_ID
>>>>>> NOT NULL NUMBER
>>>>>>  LONG_LOW_VALUE
>>>>>> NUMBER
>>>>>>  LONG_HIGH_VALUE
>>>>>> NUMBER
>>>>>>  DOUBLE_LOW_VALUE
>>>>>> NUMBER
>>>>>>  DOUBLE_HIGH_VALUE
>>>>>> NUMBER
>>>>>>  BIG_DECIMAL_LOW_VALUE
>>>>>> VARCHAR2(4000)
>>>>>>  BIG_DECIMAL_HIGH_VALUE
>>>>>> VARCHAR2(4000)
>>>>>>  NUM_NULLS
>>>>>> NOT NULL NUMBER
>>>>>>  NUM_DISTINCTS
>>>>>> NUMBER
>>>>>>  AVG_COL_LEN
>>>>>> NUMBER
>>>>>>  MAX_COL_LEN
>>>>>> NUMBER
>>>>>>  NUM_TRUES
>>>>>> NUMBER
>>>>>>  NUM_FALSES
>>>>>> NUMBER
>>>>>>  LAST_ANALYZED
>>>>>> NOT NULL NUMBER
>>>>>>
>>>>>>
>>>>>>
>>>>>>  So in summary although column type STRUCT do exit, I don't think
>>>>>> Hive can cater for their statistics. Actually I don't think Oracle itself
>>>>>> does it.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> P.S. I am on Hive 2 and it does not.
>>>>>>
>>>>>> hive> analyze table foo compute statistics for columns;
>>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>>> accepted but array<bigint> is passed.
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 14 June 2016 at 09:57, Michael Häusler <mi...@akatose.de> wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> you can reproduce the messages below with Hive 1.2.1.
>>>>>>>
>>>>>>> Best regards
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mi...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> which version of Hive are you using?
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 13 June 2016 at 16:00, Michael Häusler <mi...@akatose.de>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>>
>>>>>>>> when testing column statistics I stumbled upon the following error
>>>>>>>> message:
>>>>>>>>
>>>>>>>> DROP TABLE IF EXISTS foo;
>>>>>>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>>>>>>>> STRUCT<key:STRING,value:STRING>);
>>>>>>>>
>>>>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>>>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>>>>> accepted but array<bigint> is passed.
>>>>>>>>
>>>>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>>>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>>>>> accepted but struct<key:string,value:string> is passed.
>>>>>>>>
>>>>>>>>
>>>>>>>> 1) Basically, it seems that column statistics don't work for
>>>>>>>> non-primitive types. Are there any workarounds or any plans to change this?
>>>>>>>>
>>>>>>>> 2) Furthermore, the convenience syntax to compute statistics for
>>>>>>>> all columns does not work as soon as there is a non-supported column. Are
>>>>>>>> there any plans to change this, so it is easier to compute statistics for
>>>>>>>> all supported columns?
>>>>>>>>
>>>>>>>> 3) ANALYZE TABLE will only provide the first failing *type* in the
>>>>>>>> error message. Especially for wide tables it would be much easier if all
>>>>>>>> non-supported column *names* would be printed.
>>>>>>>>
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>

Re: column statistics for non-primitive types

Posted by Michael Häusler <mi...@akatose.de>.

Hi Mich,

as we are still on Hive 1.2.1, it is only working like this for basic stats.
I would welcome it though, if it would work for column statistics as well - and it seems this feature is coming via HIVE-11160.

Best
Michael

> On 2016-06-14, at 23:42, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> Hi,
> 
> Is this automatic stats update is basic statistics or for all columns?
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 14 June 2016 at 22:10, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
> Hi Mich,
> 
> I agree with Pengcheng here. Automatic stats gathering can be extremely useful - and it is configurable.
> 
> E.g., initial import into Hive happens as CSV or AVRO. Then you might want to do a conversion to ORC within Hive via create-table-as-select. At that point Hive is reading all the records anyway and we might just as well get as much useful information as possible for stats.
> 
> Best
> Michael
> 
> 
>> On 2016-06-14, at 23:05, Pengcheng Xiong <pxiong@apache.org <ma...@apache.org>> wrote:
>> 
>> Hi Mich,
>> 
>>     I agree with you that column stats gathering in Hive is not cheap and comes with overheads. This is due to the large volume of data that Hive has to process. However, this is the price you have to pay anyway even with current "analyze table" solution.
>> 
>>    The new feature not only provides a way to make users have column stats automatically, but also saves overhead for the "insert into" case. In this case, the new column stats are generated incrementally, i.e., by merging with the existing stats. Without this feature, you have to scan the whole table and compute stats. 
>> 
>>    In conclusion, this new feature should not have any more overhead than the current solution.  
>> 
>> Best
>> Pengcheng
>> 
>> On Tue, Jun 14, 2016 at 1:41 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> hi,
>> 
>> (2) There is a configuration "hive.stats.fetch.column.stats". If you set it to true, it will automatically collect column stats for you when you insert into/overwrite a new table. You can refer to HIVE-11160 for more details.
>> 
>> Not without its overheads.
>> 
>> Automatic gather stats is not new. Has been around for a good time in RDBMS and can impact the performance of other queries running. So I am not sure it can be considered as blessing.
>> 
>> HTH
>> 
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 14 June 2016 at 21:25, Pengcheng Xiong <pxiong@apache.org <ma...@apache.org>> wrote:
>> Exactly, "the useful or meaningful these statistics is going to be" (The motivation behind).
>> 
>> Best
>> Pengcheng
>> 
>> 
>> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Hi,
>> 
>> My point was we are where we are and in this juncture there is no collection of statistics for complex columns. That may be a future enhancement.
>> 
>> But then the obvious question is how useful or meaningful these statistics is going to be?
>> 
>> HTH 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 14 June 2016 at 21:03, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
>> Hi there,
>> 
>> there might be two topics here:
>> 
>> 1) feasibility of stats for non-primitive columns
>> 2) ease of use
>> 
>> 
>> 1) feasibility of stats for non-primitive columns:
>> 
>> Hive currently collects different kind of statistics for different kind of types:
>> numeric values:	min, max, #nulls, #distincts
>> boolean values:	#nulls, #trues, #falses
>> string values:		#nulls, #distincts, avgLength, maxLength
>> 
>> So, it seems quite possible to also collect at least partial stats for top-level non-primitive columns, e.g.:
>> array values:		#nulls, #distincts, avgLength, maxLength 
>> map values:		#nulls, #distincts, avgLength, maxLength
>> struct values:		#nulls, #distincts
>> union values:		#nulls, #distincts
>> 
>> 
>> 2) ease of use
>> 
>> The presence of a single non-primitive column currently breaks the use of the convenience shorthand to gather statistics for all columns (ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption of column statistics for hive users.
>> 
>> Best regards
>> Michael
>> 
>> 
>> 
>>> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> Hi Michael,
>>> 
>>> Statistics for columns in Hive are kept in Hive metadata table tab_col_stats.
>>> 
>>> When I am looking at this table in Oracle, I only see statistics for primitives columns here. STRUCT columns do not have it as a STRUCT column will have to be broken into its primitive columns.  I don't think Hive has the means to do that.
>>> 
>>> desc tab_col_stats;
>>>  Name                                                                     Null?    Type
>>>  ------------------------------------------------------------------------ -------- -------------------------------------------------
>>>  CS_ID                                                                    NOT NULL NUMBER
>>>  DB_NAME                                                                  NOT NULL VARCHAR2(128)
>>>  TABLE_NAME                                                               NOT NULL VARCHAR2(128)
>>>  COLUMN_NAME                                                              NOT NULL VARCHAR2(1000)
>>>  COLUMN_TYPE                                                              NOT NULL VARCHAR2(128)
>>>  TBL_ID                                                                   NOT NULL NUMBER
>>>  LONG_LOW_VALUE                                                                    NUMBER
>>>  LONG_HIGH_VALUE                                                                   NUMBER
>>>  DOUBLE_LOW_VALUE                                                                  NUMBER
>>>  DOUBLE_HIGH_VALUE                                                                 NUMBER
>>>  BIG_DECIMAL_LOW_VALUE                                                             VARCHAR2(4000)
>>>  BIG_DECIMAL_HIGH_VALUE                                                            VARCHAR2(4000)
>>>  NUM_NULLS                                                                NOT NULL NUMBER
>>>  NUM_DISTINCTS                                                                     NUMBER
>>>  AVG_COL_LEN                                                                       NUMBER
>>>  MAX_COL_LEN                                                                       NUMBER
>>>  NUM_TRUES                                                                         NUMBER
>>>  NUM_FALSES                                                                        NUMBER
>>>  LAST_ANALYZED                                                            NOT NULL NUMBER
>>> 
>>> 
>>> 
>>>  So in summary although column type STRUCT do exit, I don't think Hive can cater for their statistics. Actually I don't think Oracle itself does it.
>>> 
>>> HTH
>>> 
>>> P.S. I am on Hive 2 and it does not.
>>> 
>>> hive> analyze table foo compute statistics for columns;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>  
>>> 
>>> On 14 June 2016 at 09:57, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
>>> Hi there,
>>> 
>>> you can reproduce the messages below with Hive 1.2.1.
>>> 
>>> Best regards
>>> Michael
>>> 
>>> 
>>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> which version of Hive are you using?
>>>> 
>>>> Dr Mich Talebzadeh
>>>>  
>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>>  
>>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>>  
>>>> 
>>>> On 13 June 2016 at 16:00, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
>>>> Hi there,
>>>> 
>>>> 
>>>> when testing column statistics I stumbled upon the following error message:
>>>> 
>>>> DROP TABLE IF EXISTS foo;
>>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar STRUCT<key:STRING,value:STRING>);
>>>> 
>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.
>>>> 
>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but struct<key:string,value:string> is passed.
>>>> 
>>>> 
>>>> 1) Basically, it seems that column statistics don't work for non-primitive types. Are there any workarounds or any plans to change this?
>>>> 
>>>> 2) Furthermore, the convenience syntax to compute statistics for all columns does not work as soon as there is a non-supported column. Are there any plans to change this, so it is easier to compute statistics for all supported columns?
>>>> 
>>>> 3) ANALYZE TABLE will only provide the first failing *type* in the error message. Especially for wide tables it would be much easier if all non-supported column *names* would be printed.
>>>> 
>>>> 
>>>> Best regards
>>>> Michael
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
>> 
> 
>

Re: column statistics for non-primitive types

Posted by Mich Talebzadeh <mi...@gmail.com>.

Hi,

Is this automatic stats update is basic statistics or for all columns?

Thanks

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 22:10, Michael Häusler <mi...@akatose.de> wrote:

> Hi Mich,
>
> I agree with Pengcheng here. Automatic stats gathering can be extremely
> useful - and it is configurable.
>
> E.g., initial import into Hive happens as CSV or AVRO. Then you might want
> to do a conversion to ORC within Hive via create-table-as-select. At that
> point Hive is reading all the records anyway and we might just as well get
> as much useful information as possible for stats.
>
> Best
> Michael
>
>
> On 2016-06-14, at 23:05, Pengcheng Xiong <px...@apache.org> wrote:
>
> Hi Mich,
>
>     I agree with you that column stats gathering in Hive is not cheap and
> comes with overheads. This is due to the large volume of data that Hive has
> to process. However, this is the price you have to pay anyway even with
> current "analyze table" solution.
>
>    The new feature not only provides a way to make users have column stats
> automatically, but also saves overhead for the "insert into" case. In this
> case, the new column stats are generated incrementally, i.e., by merging
> with the existing stats. Without this feature, you have to scan the whole
> table and compute stats.
>
>    In conclusion, this new feature should not have any more overhead than
> the current solution.
>
> Best
> Pengcheng
>
> On Tue, Jun 14, 2016 at 1:41 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> hi,
>>
>> (2) There is a configuration "hive.stats.fetch.column.stats". If you set
>> it to true, it will automatically collect column stats for you when you
>> insert into/overwrite a new table. You can refer to HIVE-11160 for more
>> details.
>>
>> Not without its overheads.
>>
>> Automatic gather stats is not new. Has been around for a good time in
>> RDBMS and can impact the performance of other queries running. So I am not
>> sure it can be considered as blessing.
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 June 2016 at 21:25, Pengcheng Xiong <px...@apache.org> wrote:
>>
>>> Exactly, "the useful or meaningful these statistics is going to be" (The
>>> motivation behind).
>>>
>>> Best
>>> Pengcheng
>>>
>>>
>>> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> My point was we are where we are and in this juncture there is no
>>>> collection of statistics for complex columns. That may be a future
>>>> enhancement.
>>>>
>>>> But then the obvious question is how useful or meaningful these
>>>> statistics is going to be?
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 14 June 2016 at 21:03, Michael Häusler <mi...@akatose.de> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> there might be two topics here:
>>>>>
>>>>> 1) feasibility of stats for non-primitive columns
>>>>> 2) ease of use
>>>>>
>>>>>
>>>>> 1) feasibility of stats for non-primitive columns:
>>>>>
>>>>> Hive currently collects different kind of statistics for different
>>>>> kind of types:
>>>>> numeric values: min, max, #nulls, #distincts
>>>>> boolean values: #nulls, #trues, #falses
>>>>> string values: #nulls, #distincts, avgLength, maxLength
>>>>>
>>>>> So, it seems quite possible to also collect at least partial stats for
>>>>> top-level non-primitive columns, e.g.:
>>>>> array values: #nulls, #distincts, avgLength, maxLength
>>>>> map values: #nulls, #distincts, avgLength, maxLength
>>>>> struct values: #nulls, #distincts
>>>>> union values: #nulls, #distincts
>>>>>
>>>>>
>>>>> 2) ease of use
>>>>>
>>>>> The presence of a single non-primitive column currently breaks the use
>>>>> of the convenience shorthand to gather statistics for all columns (ANALYZE
>>>>> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption
>>>>> of column statistics for hive users.
>>>>>
>>>>> Best regards
>>>>> Michael
>>>>>
>>>>>
>>>>>
>>>>> On 2016-06-14, at 12:04, Mich Talebzadeh <mi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Statistics for columns in Hive are kept in Hive metadata table
>>>>> tab_col_stats.
>>>>>
>>>>> When I am looking at this table in Oracle, I only see statistics for
>>>>> primitives columns here. STRUCT columns do not have it as a STRUCT column
>>>>> will have to be broken into its primitive columns.  I don't think Hive has
>>>>> the means to do that.
>>>>>
>>>>> desc tab_col_stats;
>>>>>  Name
>>>>> Null?    Type
>>>>>  ------------------------------------------------------------------------
>>>>> -------- -------------------------------------------------
>>>>>  CS_ID
>>>>> NOT NULL NUMBER
>>>>>  DB_NAME
>>>>> NOT NULL VARCHAR2(128)
>>>>>  TABLE_NAME
>>>>> NOT NULL VARCHAR2(128)
>>>>>  COLUMN_NAME
>>>>> NOT NULL VARCHAR2(1000)
>>>>>  COLUMN_TYPE
>>>>> NOT NULL VARCHAR2(128)
>>>>>  TBL_ID
>>>>> NOT NULL NUMBER
>>>>>  LONG_LOW_VALUE
>>>>> NUMBER
>>>>>  LONG_HIGH_VALUE
>>>>> NUMBER
>>>>>  DOUBLE_LOW_VALUE
>>>>> NUMBER
>>>>>  DOUBLE_HIGH_VALUE
>>>>> NUMBER
>>>>>  BIG_DECIMAL_LOW_VALUE
>>>>> VARCHAR2(4000)
>>>>>  BIG_DECIMAL_HIGH_VALUE
>>>>> VARCHAR2(4000)
>>>>>  NUM_NULLS
>>>>> NOT NULL NUMBER
>>>>>  NUM_DISTINCTS
>>>>> NUMBER
>>>>>  AVG_COL_LEN
>>>>> NUMBER
>>>>>  MAX_COL_LEN
>>>>> NUMBER
>>>>>  NUM_TRUES
>>>>> NUMBER
>>>>>  NUM_FALSES
>>>>> NUMBER
>>>>>  LAST_ANALYZED
>>>>> NOT NULL NUMBER
>>>>>
>>>>>
>>>>>
>>>>>  So in summary although column type STRUCT do exit, I don't think Hive
>>>>> can cater for their statistics. Actually I don't think Oracle itself does
>>>>> it.
>>>>>
>>>>> HTH
>>>>>
>>>>> P.S. I am on Hive 2 and it does not.
>>>>>
>>>>> hive> analyze table foo compute statistics for columns;
>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>> accepted but array<bigint> is passed.
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 14 June 2016 at 09:57, Michael Häusler <mi...@akatose.de> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> you can reproduce the messages below with Hive 1.2.1.
>>>>>>
>>>>>> Best regards
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> which version of Hive are you using?
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 13 June 2016 at 16:00, Michael Häusler <mi...@akatose.de> wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>>
>>>>>>> when testing column statistics I stumbled upon the following error
>>>>>>> message:
>>>>>>>
>>>>>>> DROP TABLE IF EXISTS foo;
>>>>>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>>>>>>> STRUCT<key:STRING,value:STRING>);
>>>>>>>
>>>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>>>> accepted but array<bigint> is passed.
>>>>>>>
>>>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>>>> accepted but struct<key:string,value:string> is passed.
>>>>>>>
>>>>>>>
>>>>>>> 1) Basically, it seems that column statistics don't work for
>>>>>>> non-primitive types. Are there any workarounds or any plans to change this?
>>>>>>>
>>>>>>> 2) Furthermore, the convenience syntax to compute statistics for all
>>>>>>> columns does not work as soon as there is a non-supported column. Are there
>>>>>>> any plans to change this, so it is easier to compute statistics for all
>>>>>>> supported columns?
>>>>>>>
>>>>>>> 3) ANALYZE TABLE will only provide the first failing *type* in the
>>>>>>> error message. Especially for wide tables it would be much easier if all
>>>>>>> non-supported column *names* would be printed.
>>>>>>>
>>>>>>>
>>>>>>> Best regards
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>

RE: column statistics for non-primitive types

Posted by "Markovitz, Dudu" <dm...@paypal.com>.

Guys,

While some types of statistics are quite cheap to compute (min / max / count etc.) some are quite expensive (distinct values / histograms etc.).
Furthermore, some of my experience involves working with tables of hundreds and thousands of columns, where due to usage, only few of them are relevant for statistics collection.
Therefore I would consider the following:


1)      An option to declare a list of columns for automatic statistics collection.

Every insert into a table will automatically update the statistics for those columns and those columns only.



2)      An option to declare the level (cheep / expensive) for automatic statistics collection.

Only the requested level will be collected in case hive.stats.fetch.column.stats = true

What say you?

Dudu

From: Michael Häusler [mailto:michael@akatose.de]
Sent: Wednesday, June 15, 2016 12:11 AM
To: user@hive.apache.org
Subject: Re: column statistics for non-primitive types

Hi Mich,

I agree with Pengcheng here. Automatic stats gathering can be extremely useful - and it is configurable.

E.g., initial import into Hive happens as CSV or AVRO. Then you might want to do a conversion to ORC within Hive via create-table-as-select. At that point Hive is reading all the records anyway and we might just as well get as much useful information as possible for stats.

Best
Michael


On 2016-06-14, at 23:05, Pengcheng Xiong <px...@apache.org>> wrote:

Hi Mich,

    I agree with you that column stats gathering in Hive is not cheap and comes with overheads. This is due to the large volume of data that Hive has to process. However, this is the price you have to pay anyway even with current "analyze table" solution.

   The new feature not only provides a way to make users have column stats automatically, but also saves overhead for the "insert into" case. In this case, the new column stats are generated incrementally, i.e., by merging with the existing stats. Without this feature, you have to scan the whole table and compute stats.

   In conclusion, this new feature should not have any more overhead than the current solution.

Best
Pengcheng

On Tue, Jun 14, 2016 at 1:41 PM, Mich Talebzadeh <mi...@gmail.com>> wrote:
hi,

(2) There is a configuration "hive.stats.fetch.column.stats". If you set it to true, it will automatically collect column stats for you when you insert into/overwrite a new table. You can refer to HIVE-11160 for more details.

Not without its overheads.

Automatic gather stats is not new. Has been around for a good time in RDBMS and can impact the performance of other queries running. So I am not sure it can be considered as blessing.

HTH




Dr Mich Talebzadeh

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>


On 14 June 2016 at 21:25, Pengcheng Xiong <px...@apache.org>> wrote:
Exactly, "the useful or meaningful these statistics is going to be" (The motivation behind).

Best
Pengcheng


On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <mi...@gmail.com>> wrote:

Hi,

My point was we are where we are and in this juncture there is no collection of statistics for complex columns. That may be a future enhancement.

But then the obvious question is how useful or meaningful these statistics is going to be?

HTH

Dr Mich Talebzadeh

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>


On 14 June 2016 at 21:03, Michael Häusler <mi...@akatose.de>> wrote:
Hi there,

there might be two topics here:

1) feasibility of stats for non-primitive columns
2) ease of use


1) feasibility of stats for non-primitive columns:

Hive currently collects different kind of statistics for different kind of types:
numeric values: min, max, #nulls, #distincts
boolean values: #nulls, #trues, #falses
string values: #nulls, #distincts, avgLength, maxLength

So, it seems quite possible to also collect at least partial stats for top-level non-primitive columns, e.g.:
array values: #nulls, #distincts, avgLength, maxLength
map values: #nulls, #distincts, avgLength, maxLength
struct values: #nulls, #distincts
union values: #nulls, #distincts


2) ease of use

The presence of a single non-primitive column currently breaks the use of the convenience shorthand to gather statistics for all columns (ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption of column statistics for hive users.

Best regards
Michael



On 2016-06-14, at 12:04, Mich Talebzadeh <mi...@gmail.com>> wrote:

Hi Michael,

Statistics for columns in Hive are kept in Hive metadata table tab_col_stats.

When I am looking at this table in Oracle, I only see statistics for primitives columns here. STRUCT columns do not have it as a STRUCT column will have to be broken into its primitive columns.  I don't think Hive has the means to do that.

desc tab_col_stats;
 Name                                                                     Null?    Type
 ------------------------------------------------------------------------ -------- -------------------------------------------------
 CS_ID                                                                    NOT NULL NUMBER
 DB_NAME                                                                  NOT NULL VARCHAR2(128)
 TABLE_NAME                                                               NOT NULL VARCHAR2(128)
 COLUMN_NAME                                                              NOT NULL VARCHAR2(1000)
 COLUMN_TYPE                                                              NOT NULL VARCHAR2(128)
 TBL_ID                                                                   NOT NULL NUMBER
 LONG_LOW_VALUE                                                                    NUMBER
 LONG_HIGH_VALUE                                                                   NUMBER
 DOUBLE_LOW_VALUE                                                                  NUMBER
 DOUBLE_HIGH_VALUE                                                                 NUMBER
 BIG_DECIMAL_LOW_VALUE                                                             VARCHAR2(4000)
 BIG_DECIMAL_HIGH_VALUE                                                            VARCHAR2(4000)
 NUM_NULLS                                                                NOT NULL NUMBER
 NUM_DISTINCTS                                                                     NUMBER
 AVG_COL_LEN                                                                       NUMBER
 MAX_COL_LEN                                                                       NUMBER
 NUM_TRUES                                                                         NUMBER
 NUM_FALSES                                                                        NUMBER
 LAST_ANALYZED                                                            NOT NULL NUMBER



 So in summary although column type STRUCT do exit, I don't think Hive can cater for their statistics. Actually I don't think Oracle itself does it.

HTH

P.S. I am on Hive 2 and it does not.

hive> analyze table foo compute statistics for columns;
FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.


Dr Mich Talebzadeh

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>


On 14 June 2016 at 09:57, Michael Häusler <mi...@akatose.de>> wrote:
Hi there,

you can reproduce the messages below with Hive 1.2.1.

Best regards
Michael


On 2016-06-13, at 22:21, Mich Talebzadeh <mi...@gmail.com>> wrote:

which version of Hive are you using?

Dr Mich Talebzadeh

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>


On 13 June 2016 at 16:00, Michael Häusler <mi...@akatose.de>> wrote:
Hi there,


when testing column statistics I stumbled upon the following error message:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar STRUCT<key:STRING,value:STRING>);

ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.

ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but struct<key:string,value:string> is passed.


1) Basically, it seems that column statistics don't work for non-primitive types. Are there any workarounds or any plans to change this?

2) Furthermore, the convenience syntax to compute statistics for all columns does not work as soon as there is a non-supported column. Are there any plans to change this, so it is easier to compute statistics for all supported columns?

3) ANALYZE TABLE will only provide the first failing *type* in the error message. Especially for wide tables it would be much easier if all non-supported column *names* would be printed.


Best regards
Michael

Re: column statistics for non-primitive types

Posted by Michael Häusler <mi...@akatose.de>.

Hi Mich,

I agree with Pengcheng here. Automatic stats gathering can be extremely useful - and it is configurable.

E.g., initial import into Hive happens as CSV or AVRO. Then you might want to do a conversion to ORC within Hive via create-table-as-select. At that point Hive is reading all the records anyway and we might just as well get as much useful information as possible for stats.

Best
Michael


> On 2016-06-14, at 23:05, Pengcheng Xiong <px...@apache.org> wrote:
> 
> Hi Mich,
> 
>     I agree with you that column stats gathering in Hive is not cheap and comes with overheads. This is due to the large volume of data that Hive has to process. However, this is the price you have to pay anyway even with current "analyze table" solution.
> 
>    The new feature not only provides a way to make users have column stats automatically, but also saves overhead for the "insert into" case. In this case, the new column stats are generated incrementally, i.e., by merging with the existing stats. Without this feature, you have to scan the whole table and compute stats. 
> 
>    In conclusion, this new feature should not have any more overhead than the current solution.  
> 
> Best
> Pengcheng
> 
> On Tue, Jun 14, 2016 at 1:41 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
> hi,
> 
> (2) There is a configuration "hive.stats.fetch.column.stats". If you set it to true, it will automatically collect column stats for you when you insert into/overwrite a new table. You can refer to HIVE-11160 for more details.
> 
> Not without its overheads.
> 
> Automatic gather stats is not new. Has been around for a good time in RDBMS and can impact the performance of other queries running. So I am not sure it can be considered as blessing.
> 
> HTH
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 14 June 2016 at 21:25, Pengcheng Xiong <pxiong@apache.org <ma...@apache.org>> wrote:
> Exactly, "the useful or meaningful these statistics is going to be" (The motivation behind).
> 
> Best
> Pengcheng
> 
> 
> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
> 
> Hi,
> 
> My point was we are where we are and in this juncture there is no collection of statistics for complex columns. That may be a future enhancement.
> 
> But then the obvious question is how useful or meaningful these statistics is going to be?
> 
> HTH 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 14 June 2016 at 21:03, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
> Hi there,
> 
> there might be two topics here:
> 
> 1) feasibility of stats for non-primitive columns
> 2) ease of use
> 
> 
> 1) feasibility of stats for non-primitive columns:
> 
> Hive currently collects different kind of statistics for different kind of types:
> numeric values:	min, max, #nulls, #distincts
> boolean values:	#nulls, #trues, #falses
> string values:		#nulls, #distincts, avgLength, maxLength
> 
> So, it seems quite possible to also collect at least partial stats for top-level non-primitive columns, e.g.:
> array values:		#nulls, #distincts, avgLength, maxLength 
> map values:		#nulls, #distincts, avgLength, maxLength
> struct values:		#nulls, #distincts
> union values:		#nulls, #distincts
> 
> 
> 2) ease of use
> 
> The presence of a single non-primitive column currently breaks the use of the convenience shorthand to gather statistics for all columns (ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption of column statistics for hive users.
> 
> Best regards
> Michael
> 
> 
> 
>> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Hi Michael,
>> 
>> Statistics for columns in Hive are kept in Hive metadata table tab_col_stats.
>> 
>> When I am looking at this table in Oracle, I only see statistics for primitives columns here. STRUCT columns do not have it as a STRUCT column will have to be broken into its primitive columns.  I don't think Hive has the means to do that.
>> 
>> desc tab_col_stats;
>>  Name                                                                     Null?    Type
>>  ------------------------------------------------------------------------ -------- -------------------------------------------------
>>  CS_ID                                                                    NOT NULL NUMBER
>>  DB_NAME                                                                  NOT NULL VARCHAR2(128)
>>  TABLE_NAME                                                               NOT NULL VARCHAR2(128)
>>  COLUMN_NAME                                                              NOT NULL VARCHAR2(1000)
>>  COLUMN_TYPE                                                              NOT NULL VARCHAR2(128)
>>  TBL_ID                                                                   NOT NULL NUMBER
>>  LONG_LOW_VALUE                                                                    NUMBER
>>  LONG_HIGH_VALUE                                                                   NUMBER
>>  DOUBLE_LOW_VALUE                                                                  NUMBER
>>  DOUBLE_HIGH_VALUE                                                                 NUMBER
>>  BIG_DECIMAL_LOW_VALUE                                                             VARCHAR2(4000)
>>  BIG_DECIMAL_HIGH_VALUE                                                            VARCHAR2(4000)
>>  NUM_NULLS                                                                NOT NULL NUMBER
>>  NUM_DISTINCTS                                                                     NUMBER
>>  AVG_COL_LEN                                                                       NUMBER
>>  MAX_COL_LEN                                                                       NUMBER
>>  NUM_TRUES                                                                         NUMBER
>>  NUM_FALSES                                                                        NUMBER
>>  LAST_ANALYZED                                                            NOT NULL NUMBER
>> 
>> 
>> 
>>  So in summary although column type STRUCT do exit, I don't think Hive can cater for their statistics. Actually I don't think Oracle itself does it.
>> 
>> HTH
>> 
>> P.S. I am on Hive 2 and it does not.
>> 
>> hive> analyze table foo compute statistics for columns;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 14 June 2016 at 09:57, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
>> Hi there,
>> 
>> you can reproduce the messages below with Hive 1.2.1.
>> 
>> Best regards
>> Michael
>> 
>> 
>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> which version of Hive are you using?
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>  
>>> 
>>> On 13 June 2016 at 16:00, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
>>> Hi there,
>>> 
>>> 
>>> when testing column statistics I stumbled upon the following error message:
>>> 
>>> DROP TABLE IF EXISTS foo;
>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar STRUCT<key:STRING,value:STRING>);
>>> 
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.
>>> 
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but struct<key:string,value:string> is passed.
>>> 
>>> 
>>> 1) Basically, it seems that column statistics don't work for non-primitive types. Are there any workarounds or any plans to change this?
>>> 
>>> 2) Furthermore, the convenience syntax to compute statistics for all columns does not work as soon as there is a non-supported column. Are there any plans to change this, so it is easier to compute statistics for all supported columns?
>>> 
>>> 3) ANALYZE TABLE will only provide the first failing *type* in the error message. Especially for wide tables it would be much easier if all non-supported column *names* would be printed.
>>> 
>>> 
>>> Best regards
>>> Michael
>>> 
>>> 
>> 
>> 
> 
> 
> 
> 
>

Re: column statistics for non-primitive types

Posted by Mich Talebzadeh <mi...@gmail.com>.

Hi,

there is another approach to reduce time for analyzing stats and that is
sampling, i.e. looking at the fraction of data. For example in Oracle one
can do that

EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname=>user,tabname=>'T1',*estimate_percent=>
50%*)

In general Hive statistics is pretty straight forward. This is because
unless a table is ORC and transactional then update/deletes won't happen.
Only new inserts for most of data (immutable).

My opinion on this is also mixed. Does not apply to Hive. Most RDNMS
provide something like datachange() function that recommends analysing
table if the underlying table size is changed.

I gather with Hive any new insert will trigger an automatic analyze stats.
new data coming in to an existing table with data does not imply lack of
quality statistics. If the distribution remain more and less the same for a
given column, updating statistics is not going to make that much
difference. I would rather spend more time on making external indexes
useful for the optimizer.


 HTH




Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 22:05, Pengcheng Xiong <px...@apache.org> wrote:

> Hi Mich,
>
>     I agree with you that column stats gathering in Hive is not cheap and
> comes with overheads. This is due to the large volume of data that Hive has
> to process. However, this is the price you have to pay anyway even with
> current "analyze table" solution.
>
>    The new feature not only provides a way to make users have column stats
> automatically, but also saves overhead for the "insert into" case. In this
> case, the new column stats are generated incrementally, i.e., by merging
> with the existing stats. Without this feature, you have to scan the whole
> table and compute stats.
>
>    In conclusion, this new feature should not have any more overhead than
> the current solution.
>
> Best
> Pengcheng
>
> On Tue, Jun 14, 2016 at 1:41 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> hi,
>>
>> (2) There is a configuration "hive.stats.fetch.column.stats". If you set
>> it to true, it will automatically collect column stats for you when you
>> insert into/overwrite a new table. You can refer to HIVE-11160 for more
>> details.
>>
>> Not without its overheads.
>>
>> Automatic gather stats is not new. Has been around for a good time in
>> RDBMS and can impact the performance of other queries running. So I am not
>> sure it can be considered as blessing.
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 June 2016 at 21:25, Pengcheng Xiong <px...@apache.org> wrote:
>>
>>> Exactly, "the useful or meaningful these statistics is going to be" (The
>>> motivation behind).
>>>
>>> Best
>>> Pengcheng
>>>
>>>
>>> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> My point was we are where we are and in this juncture there is no
>>>> collection of statistics for complex columns. That may be a future
>>>> enhancement.
>>>>
>>>> But then the obvious question is how useful or meaningful these
>>>> statistics is going to be?
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 14 June 2016 at 21:03, Michael Häusler <mi...@akatose.de> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> there might be two topics here:
>>>>>
>>>>> 1) feasibility of stats for non-primitive columns
>>>>> 2) ease of use
>>>>>
>>>>>
>>>>> 1) feasibility of stats for non-primitive columns:
>>>>>
>>>>> Hive currently collects different kind of statistics for different
>>>>> kind of types:
>>>>> numeric values: min, max, #nulls, #distincts
>>>>> boolean values: #nulls, #trues, #falses
>>>>> string values: #nulls, #distincts, avgLength, maxLength
>>>>>
>>>>> So, it seems quite possible to also collect at least partial stats for
>>>>> top-level non-primitive columns, e.g.:
>>>>> array values: #nulls, #distincts, avgLength, maxLength
>>>>> map values: #nulls, #distincts, avgLength, maxLength
>>>>> struct values: #nulls, #distincts
>>>>> union values: #nulls, #distincts
>>>>>
>>>>>
>>>>> 2) ease of use
>>>>>
>>>>> The presence of a single non-primitive column currently breaks the use
>>>>> of the convenience shorthand to gather statistics for all columns (ANALYZE
>>>>> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption
>>>>> of column statistics for hive users.
>>>>>
>>>>> Best regards
>>>>> Michael
>>>>>
>>>>>
>>>>>
>>>>> On 2016-06-14, at 12:04, Mich Talebzadeh <mi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Statistics for columns in Hive are kept in Hive metadata table
>>>>> tab_col_stats.
>>>>>
>>>>> When I am looking at this table in Oracle, I only see statistics for
>>>>> primitives columns here. STRUCT columns do not have it as a STRUCT column
>>>>> will have to be broken into its primitive columns.  I don't think Hive has
>>>>> the means to do that.
>>>>>
>>>>> desc tab_col_stats;
>>>>>  Name
>>>>> Null?    Type
>>>>>  ------------------------------------------------------------------------
>>>>> -------- -------------------------------------------------
>>>>>  CS_ID
>>>>> NOT NULL NUMBER
>>>>>  DB_NAME
>>>>> NOT NULL VARCHAR2(128)
>>>>>  TABLE_NAME
>>>>> NOT NULL VARCHAR2(128)
>>>>>  COLUMN_NAME
>>>>> NOT NULL VARCHAR2(1000)
>>>>>  COLUMN_TYPE
>>>>> NOT NULL VARCHAR2(128)
>>>>>  TBL_ID
>>>>> NOT NULL NUMBER
>>>>>  LONG_LOW_VALUE
>>>>> NUMBER
>>>>>  LONG_HIGH_VALUE
>>>>> NUMBER
>>>>>  DOUBLE_LOW_VALUE
>>>>> NUMBER
>>>>>  DOUBLE_HIGH_VALUE
>>>>> NUMBER
>>>>>  BIG_DECIMAL_LOW_VALUE
>>>>> VARCHAR2(4000)
>>>>>  BIG_DECIMAL_HIGH_VALUE
>>>>> VARCHAR2(4000)
>>>>>  NUM_NULLS
>>>>> NOT NULL NUMBER
>>>>>  NUM_DISTINCTS
>>>>> NUMBER
>>>>>  AVG_COL_LEN
>>>>> NUMBER
>>>>>  MAX_COL_LEN
>>>>> NUMBER
>>>>>  NUM_TRUES
>>>>> NUMBER
>>>>>  NUM_FALSES
>>>>> NUMBER
>>>>>  LAST_ANALYZED
>>>>> NOT NULL NUMBER
>>>>>
>>>>>
>>>>>
>>>>>  So in summary although column type STRUCT do exit, I don't think Hive
>>>>> can cater for their statistics. Actually I don't think Oracle itself does
>>>>> it.
>>>>>
>>>>> HTH
>>>>>
>>>>> P.S. I am on Hive 2 and it does not.
>>>>>
>>>>> hive> analyze table foo compute statistics for columns;
>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>> accepted but array<bigint> is passed.
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 14 June 2016 at 09:57, Michael Häusler <mi...@akatose.de> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> you can reproduce the messages below with Hive 1.2.1.
>>>>>>
>>>>>> Best regards
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> which version of Hive are you using?
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 13 June 2016 at 16:00, Michael Häusler <mi...@akatose.de> wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>>
>>>>>>> when testing column statistics I stumbled upon the following error
>>>>>>> message:
>>>>>>>
>>>>>>> DROP TABLE IF EXISTS foo;
>>>>>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>>>>>>> STRUCT<key:STRING,value:STRING>);
>>>>>>>
>>>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>>>> accepted but array<bigint> is passed.
>>>>>>>
>>>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>>>> accepted but struct<key:string,value:string> is passed.
>>>>>>>
>>>>>>>
>>>>>>> 1) Basically, it seems that column statistics don't work for
>>>>>>> non-primitive types. Are there any workarounds or any plans to change this?
>>>>>>>
>>>>>>> 2) Furthermore, the convenience syntax to compute statistics for all
>>>>>>> columns does not work as soon as there is a non-supported column. Are there
>>>>>>> any plans to change this, so it is easier to compute statistics for all
>>>>>>> supported columns?
>>>>>>>
>>>>>>> 3) ANALYZE TABLE will only provide the first failing *type* in the
>>>>>>> error message. Especially for wide tables it would be much easier if all
>>>>>>> non-supported column *names* would be printed.
>>>>>>>
>>>>>>>
>>>>>>> Best regards
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: column statistics for non-primitive types

Posted by Pengcheng Xiong <px...@apache.org>.

Hi Mich,

    I agree with you that column stats gathering in Hive is not cheap and
comes with overheads. This is due to the large volume of data that Hive has
to process. However, this is the price you have to pay anyway even with
current "analyze table" solution.

   The new feature not only provides a way to make users have column stats
automatically, but also saves overhead for the "insert into" case. In this
case, the new column stats are generated incrementally, i.e., by merging
with the existing stats. Without this feature, you have to scan the whole
table and compute stats.

   In conclusion, this new feature should not have any more overhead than
the current solution.

Best
Pengcheng

On Tue, Jun 14, 2016 at 1:41 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> hi,
>
> (2) There is a configuration "hive.stats.fetch.column.stats". If you set
> it to true, it will automatically collect column stats for you when you
> insert into/overwrite a new table. You can refer to HIVE-11160 for more
> details.
>
> Not without its overheads.
>
> Automatic gather stats is not new. Has been around for a good time in
> RDBMS and can impact the performance of other queries running. So I am not
> sure it can be considered as blessing.
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 June 2016 at 21:25, Pengcheng Xiong <px...@apache.org> wrote:
>
>> Exactly, "the useful or meaningful these statistics is going to be" (The
>> motivation behind).
>>
>> Best
>> Pengcheng
>>
>>
>> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>>
>>> Hi,
>>>
>>> My point was we are where we are and in this juncture there is no
>>> collection of statistics for complex columns. That may be a future
>>> enhancement.
>>>
>>> But then the obvious question is how useful or meaningful these
>>> statistics is going to be?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 14 June 2016 at 21:03, Michael Häusler <mi...@akatose.de> wrote:
>>>
>>>> Hi there,
>>>>
>>>> there might be two topics here:
>>>>
>>>> 1) feasibility of stats for non-primitive columns
>>>> 2) ease of use
>>>>
>>>>
>>>> 1) feasibility of stats for non-primitive columns:
>>>>
>>>> Hive currently collects different kind of statistics for different kind
>>>> of types:
>>>> numeric values: min, max, #nulls, #distincts
>>>> boolean values: #nulls, #trues, #falses
>>>> string values: #nulls, #distincts, avgLength, maxLength
>>>>
>>>> So, it seems quite possible to also collect at least partial stats for
>>>> top-level non-primitive columns, e.g.:
>>>> array values: #nulls, #distincts, avgLength, maxLength
>>>> map values: #nulls, #distincts, avgLength, maxLength
>>>> struct values: #nulls, #distincts
>>>> union values: #nulls, #distincts
>>>>
>>>>
>>>> 2) ease of use
>>>>
>>>> The presence of a single non-primitive column currently breaks the use
>>>> of the convenience shorthand to gather statistics for all columns (ANALYZE
>>>> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption
>>>> of column statistics for hive users.
>>>>
>>>> Best regards
>>>> Michael
>>>>
>>>>
>>>>
>>>> On 2016-06-14, at 12:04, Mich Talebzadeh <mi...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Michael,
>>>>
>>>> Statistics for columns in Hive are kept in Hive metadata table
>>>> tab_col_stats.
>>>>
>>>> When I am looking at this table in Oracle, I only see statistics for
>>>> primitives columns here. STRUCT columns do not have it as a STRUCT column
>>>> will have to be broken into its primitive columns.  I don't think Hive has
>>>> the means to do that.
>>>>
>>>> desc tab_col_stats;
>>>>  Name
>>>> Null?    Type
>>>>  ------------------------------------------------------------------------
>>>> -------- -------------------------------------------------
>>>>  CS_ID
>>>> NOT NULL NUMBER
>>>>  DB_NAME
>>>> NOT NULL VARCHAR2(128)
>>>>  TABLE_NAME
>>>> NOT NULL VARCHAR2(128)
>>>>  COLUMN_NAME
>>>> NOT NULL VARCHAR2(1000)
>>>>  COLUMN_TYPE
>>>> NOT NULL VARCHAR2(128)
>>>>  TBL_ID
>>>> NOT NULL NUMBER
>>>>  LONG_LOW_VALUE
>>>> NUMBER
>>>>  LONG_HIGH_VALUE
>>>> NUMBER
>>>>  DOUBLE_LOW_VALUE
>>>> NUMBER
>>>>  DOUBLE_HIGH_VALUE
>>>> NUMBER
>>>>  BIG_DECIMAL_LOW_VALUE
>>>> VARCHAR2(4000)
>>>>  BIG_DECIMAL_HIGH_VALUE
>>>> VARCHAR2(4000)
>>>>  NUM_NULLS
>>>> NOT NULL NUMBER
>>>>  NUM_DISTINCTS
>>>> NUMBER
>>>>  AVG_COL_LEN
>>>> NUMBER
>>>>  MAX_COL_LEN
>>>> NUMBER
>>>>  NUM_TRUES
>>>> NUMBER
>>>>  NUM_FALSES
>>>> NUMBER
>>>>  LAST_ANALYZED
>>>> NOT NULL NUMBER
>>>>
>>>>
>>>>
>>>>  So in summary although column type STRUCT do exit, I don't think Hive
>>>> can cater for their statistics. Actually I don't think Oracle itself does
>>>> it.
>>>>
>>>> HTH
>>>>
>>>> P.S. I am on Hive 2 and it does not.
>>>>
>>>> hive> analyze table foo compute statistics for columns;
>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>> accepted but array<bigint> is passed.
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 14 June 2016 at 09:57, Michael Häusler <mi...@akatose.de> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> you can reproduce the messages below with Hive 1.2.1.
>>>>>
>>>>> Best regards
>>>>> Michael
>>>>>
>>>>>
>>>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> which version of Hive are you using?
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 13 June 2016 at 16:00, Michael Häusler <mi...@akatose.de> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>>
>>>>>> when testing column statistics I stumbled upon the following error
>>>>>> message:
>>>>>>
>>>>>> DROP TABLE IF EXISTS foo;
>>>>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>>>>>> STRUCT<key:STRING,value:STRING>);
>>>>>>
>>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>>> accepted but array<bigint> is passed.
>>>>>>
>>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>>> accepted but struct<key:string,value:string> is passed.
>>>>>>
>>>>>>
>>>>>> 1) Basically, it seems that column statistics don't work for
>>>>>> non-primitive types. Are there any workarounds or any plans to change this?
>>>>>>
>>>>>> 2) Furthermore, the convenience syntax to compute statistics for all
>>>>>> columns does not work as soon as there is a non-supported column. Are there
>>>>>> any plans to change this, so it is easier to compute statistics for all
>>>>>> supported columns?
>>>>>>
>>>>>> 3) ANALYZE TABLE will only provide the first failing *type* in the
>>>>>> error message. Especially for wide tables it would be much easier if all
>>>>>> non-supported column *names* would be printed.
>>>>>>
>>>>>>
>>>>>> Best regards
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Re: column statistics for non-primitive types

Posted by Mich Talebzadeh <mi...@gmail.com>.

hi,

(2) There is a configuration "hive.stats.fetch.column.stats". If you set it
to true, it will automatically collect column stats for you when you insert
into/overwrite a new table. You can refer to HIVE-11160 for more details.

Not without its overheads.

Automatic gather stats is not new. Has been around for a good time in
RDBMS and can impact the performance of other queries running. So I am not
sure it can be considered as blessing.

HTH




Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 21:25, Pengcheng Xiong <px...@apache.org> wrote:

> Exactly, "the useful or meaningful these statistics is going to be" (The
> motivation behind).
>
> Best
> Pengcheng
>
>
> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>>
>> Hi,
>>
>> My point was we are where we are and in this juncture there is no
>> collection of statistics for complex columns. That may be a future
>> enhancement.
>>
>> But then the obvious question is how useful or meaningful these
>> statistics is going to be?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 June 2016 at 21:03, Michael Häusler <mi...@akatose.de> wrote:
>>
>>> Hi there,
>>>
>>> there might be two topics here:
>>>
>>> 1) feasibility of stats for non-primitive columns
>>> 2) ease of use
>>>
>>>
>>> 1) feasibility of stats for non-primitive columns:
>>>
>>> Hive currently collects different kind of statistics for different kind
>>> of types:
>>> numeric values: min, max, #nulls, #distincts
>>> boolean values: #nulls, #trues, #falses
>>> string values: #nulls, #distincts, avgLength, maxLength
>>>
>>> So, it seems quite possible to also collect at least partial stats for
>>> top-level non-primitive columns, e.g.:
>>> array values: #nulls, #distincts, avgLength, maxLength
>>> map values: #nulls, #distincts, avgLength, maxLength
>>> struct values: #nulls, #distincts
>>> union values: #nulls, #distincts
>>>
>>>
>>> 2) ease of use
>>>
>>> The presence of a single non-primitive column currently breaks the use
>>> of the convenience shorthand to gather statistics for all columns (ANALYZE
>>> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption
>>> of column statistics for hive users.
>>>
>>> Best regards
>>> Michael
>>>
>>>
>>>
>>> On 2016-06-14, at 12:04, Mich Talebzadeh <mi...@gmail.com>
>>> wrote:
>>>
>>> Hi Michael,
>>>
>>> Statistics for columns in Hive are kept in Hive metadata table
>>> tab_col_stats.
>>>
>>> When I am looking at this table in Oracle, I only see statistics for
>>> primitives columns here. STRUCT columns do not have it as a STRUCT column
>>> will have to be broken into its primitive columns.  I don't think Hive has
>>> the means to do that.
>>>
>>> desc tab_col_stats;
>>>  Name
>>> Null?    Type
>>>  ------------------------------------------------------------------------
>>> -------- -------------------------------------------------
>>>  CS_ID
>>> NOT NULL NUMBER
>>>  DB_NAME
>>> NOT NULL VARCHAR2(128)
>>>  TABLE_NAME
>>> NOT NULL VARCHAR2(128)
>>>  COLUMN_NAME
>>> NOT NULL VARCHAR2(1000)
>>>  COLUMN_TYPE
>>> NOT NULL VARCHAR2(128)
>>>  TBL_ID
>>> NOT NULL NUMBER
>>>  LONG_LOW_VALUE
>>> NUMBER
>>>  LONG_HIGH_VALUE
>>> NUMBER
>>>  DOUBLE_LOW_VALUE
>>> NUMBER
>>>  DOUBLE_HIGH_VALUE
>>> NUMBER
>>>  BIG_DECIMAL_LOW_VALUE
>>> VARCHAR2(4000)
>>>  BIG_DECIMAL_HIGH_VALUE
>>> VARCHAR2(4000)
>>>  NUM_NULLS
>>> NOT NULL NUMBER
>>>  NUM_DISTINCTS
>>> NUMBER
>>>  AVG_COL_LEN
>>> NUMBER
>>>  MAX_COL_LEN
>>> NUMBER
>>>  NUM_TRUES
>>> NUMBER
>>>  NUM_FALSES
>>> NUMBER
>>>  LAST_ANALYZED
>>> NOT NULL NUMBER
>>>
>>>
>>>
>>>  So in summary although column type STRUCT do exit, I don't think Hive
>>> can cater for their statistics. Actually I don't think Oracle itself does
>>> it.
>>>
>>> HTH
>>>
>>> P.S. I am on Hive 2 and it does not.
>>>
>>> hive> analyze table foo compute statistics for columns;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>> accepted but array<bigint> is passed.
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 14 June 2016 at 09:57, Michael Häusler <mi...@akatose.de> wrote:
>>>
>>>> Hi there,
>>>>
>>>> you can reproduce the messages below with Hive 1.2.1.
>>>>
>>>> Best regards
>>>> Michael
>>>>
>>>>
>>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mi...@gmail.com>
>>>> wrote:
>>>>
>>>> which version of Hive are you using?
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 13 June 2016 at 16:00, Michael Häusler <mi...@akatose.de> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>>
>>>>> when testing column statistics I stumbled upon the following error
>>>>> message:
>>>>>
>>>>> DROP TABLE IF EXISTS foo;
>>>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>>>>> STRUCT<key:STRING,value:STRING>);
>>>>>
>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>> accepted but array<bigint> is passed.
>>>>>
>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>> accepted but struct<key:string,value:string> is passed.
>>>>>
>>>>>
>>>>> 1) Basically, it seems that column statistics don't work for
>>>>> non-primitive types. Are there any workarounds or any plans to change this?
>>>>>
>>>>> 2) Furthermore, the convenience syntax to compute statistics for all
>>>>> columns does not work as soon as there is a non-supported column. Are there
>>>>> any plans to change this, so it is easier to compute statistics for all
>>>>> supported columns?
>>>>>
>>>>> 3) ANALYZE TABLE will only provide the first failing *type* in the
>>>>> error message. Especially for wide tables it would be much easier if all
>>>>> non-supported column *names* would be printed.
>>>>>
>>>>>
>>>>> Best regards
>>>>> Michael
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: column statistics for non-primitive types

Posted by Pengcheng Xiong <px...@apache.org>.

Exactly, "the useful or meaningful these statistics is going to be" (The
motivation behind).

Best
Pengcheng

On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

>
> Hi,
>
> My point was we are where we are and in this juncture there is no
> collection of statistics for complex columns. That may be a future
> enhancement.
>
> But then the obvious question is how useful or meaningful these statistics
> is going to be?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 June 2016 at 21:03, Michael Häusler <mi...@akatose.de> wrote:
>
>> Hi there,
>>
>> there might be two topics here:
>>
>> 1) feasibility of stats for non-primitive columns
>> 2) ease of use
>>
>>
>> 1) feasibility of stats for non-primitive columns:
>>
>> Hive currently collects different kind of statistics for different kind
>> of types:
>> numeric values: min, max, #nulls, #distincts
>> boolean values: #nulls, #trues, #falses
>> string values: #nulls, #distincts, avgLength, maxLength
>>
>> So, it seems quite possible to also collect at least partial stats for
>> top-level non-primitive columns, e.g.:
>> array values: #nulls, #distincts, avgLength, maxLength
>> map values: #nulls, #distincts, avgLength, maxLength
>> struct values: #nulls, #distincts
>> union values: #nulls, #distincts
>>
>>
>> 2) ease of use
>>
>> The presence of a single non-primitive column currently breaks the use of
>> the convenience shorthand to gather statistics for all columns (ANALYZE
>> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption
>> of column statistics for hive users.
>>
>> Best regards
>> Michael
>>
>>
>>
>> On 2016-06-14, at 12:04, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>> Hi Michael,
>>
>> Statistics for columns in Hive are kept in Hive metadata table
>> tab_col_stats.
>>
>> When I am looking at this table in Oracle, I only see statistics for
>> primitives columns here. STRUCT columns do not have it as a STRUCT column
>> will have to be broken into its primitive columns.  I don't think Hive has
>> the means to do that.
>>
>> desc tab_col_stats;
>>  Name
>> Null?    Type
>>  ------------------------------------------------------------------------
>> -------- -------------------------------------------------
>>  CS_ID
>> NOT NULL NUMBER
>>  DB_NAME
>> NOT NULL VARCHAR2(128)
>>  TABLE_NAME
>> NOT NULL VARCHAR2(128)
>>  COLUMN_NAME
>> NOT NULL VARCHAR2(1000)
>>  COLUMN_TYPE
>> NOT NULL VARCHAR2(128)
>>  TBL_ID
>> NOT NULL NUMBER
>>  LONG_LOW_VALUE
>> NUMBER
>>  LONG_HIGH_VALUE
>> NUMBER
>>  DOUBLE_LOW_VALUE
>> NUMBER
>>  DOUBLE_HIGH_VALUE
>> NUMBER
>>  BIG_DECIMAL_LOW_VALUE
>> VARCHAR2(4000)
>>  BIG_DECIMAL_HIGH_VALUE
>> VARCHAR2(4000)
>>  NUM_NULLS
>> NOT NULL NUMBER
>>  NUM_DISTINCTS
>> NUMBER
>>  AVG_COL_LEN
>> NUMBER
>>  MAX_COL_LEN
>> NUMBER
>>  NUM_TRUES
>> NUMBER
>>  NUM_FALSES
>> NUMBER
>>  LAST_ANALYZED
>> NOT NULL NUMBER
>>
>>
>>
>>  So in summary although column type STRUCT do exit, I don't think Hive
>> can cater for their statistics. Actually I don't think Oracle itself does
>> it.
>>
>> HTH
>>
>> P.S. I am on Hive 2 and it does not.
>>
>> hive> analyze table foo compute statistics for columns;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>> accepted but array<bigint> is passed.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 June 2016 at 09:57, Michael Häusler <mi...@akatose.de> wrote:
>>
>>> Hi there,
>>>
>>> you can reproduce the messages below with Hive 1.2.1.
>>>
>>> Best regards
>>> Michael
>>>
>>>
>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mi...@gmail.com>
>>> wrote:
>>>
>>> which version of Hive are you using?
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 13 June 2016 at 16:00, Michael Häusler <mi...@akatose.de> wrote:
>>>
>>>> Hi there,
>>>>
>>>>
>>>> when testing column statistics I stumbled upon the following error
>>>> message:
>>>>
>>>> DROP TABLE IF EXISTS foo;
>>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>>>> STRUCT<key:STRING,value:STRING>);
>>>>
>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>> accepted but array<bigint> is passed.
>>>>
>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>> accepted but struct<key:string,value:string> is passed.
>>>>
>>>>
>>>> 1) Basically, it seems that column statistics don't work for
>>>> non-primitive types. Are there any workarounds or any plans to change this?
>>>>
>>>> 2) Furthermore, the convenience syntax to compute statistics for all
>>>> columns does not work as soon as there is a non-supported column. Are there
>>>> any plans to change this, so it is easier to compute statistics for all
>>>> supported columns?
>>>>
>>>> 3) ANALYZE TABLE will only provide the first failing *type* in the
>>>> error message. Especially for wide tables it would be much easier if all
>>>> non-supported column *names* would be printed.
>>>>
>>>>
>>>> Best regards
>>>> Michael
>>>>
>>>>
>>>
>>>
>>
>>
>

Re: column statistics for non-primitive types

Posted by Mich Talebzadeh <mi...@gmail.com>.

Hi,

My point was we are where we are and in this juncture there is no
collection of statistics for complex columns. That may be a future
enhancement.

But then the obvious question is how useful or meaningful these statistics
is going to be?

HTH

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 21:03, Michael Häusler <mi...@akatose.de> wrote:

> Hi there,
>
> there might be two topics here:
>
> 1) feasibility of stats for non-primitive columns
> 2) ease of use
>
>
> 1) feasibility of stats for non-primitive columns:
>
> Hive currently collects different kind of statistics for different kind of
> types:
> numeric values: min, max, #nulls, #distincts
> boolean values: #nulls, #trues, #falses
> string values: #nulls, #distincts, avgLength, maxLength
>
> So, it seems quite possible to also collect at least partial stats for
> top-level non-primitive columns, e.g.:
> array values: #nulls, #distincts, avgLength, maxLength
> map values: #nulls, #distincts, avgLength, maxLength
> struct values: #nulls, #distincts
> union values: #nulls, #distincts
>
>
> 2) ease of use
>
> The presence of a single non-primitive column currently breaks the use of
> the convenience shorthand to gather statistics for all columns (ANALYZE
> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption
> of column statistics for hive users.
>
> Best regards
> Michael
>
>
>
> On 2016-06-14, at 12:04, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> Hi Michael,
>
> Statistics for columns in Hive are kept in Hive metadata table
> tab_col_stats.
>
> When I am looking at this table in Oracle, I only see statistics for
> primitives columns here. STRUCT columns do not have it as a STRUCT column
> will have to be broken into its primitive columns.  I don't think Hive has
> the means to do that.
>
> desc tab_col_stats;
>  Name
> Null?    Type
>  ------------------------------------------------------------------------
> -------- -------------------------------------------------
>  CS_ID
> NOT NULL NUMBER
>  DB_NAME
> NOT NULL VARCHAR2(128)
>  TABLE_NAME
> NOT NULL VARCHAR2(128)
>  COLUMN_NAME
> NOT NULL VARCHAR2(1000)
>  COLUMN_TYPE
> NOT NULL VARCHAR2(128)
>  TBL_ID
> NOT NULL NUMBER
>  LONG_LOW_VALUE
> NUMBER
>  LONG_HIGH_VALUE
> NUMBER
>  DOUBLE_LOW_VALUE
> NUMBER
>  DOUBLE_HIGH_VALUE
> NUMBER
>  BIG_DECIMAL_LOW_VALUE
> VARCHAR2(4000)
>  BIG_DECIMAL_HIGH_VALUE
> VARCHAR2(4000)
>  NUM_NULLS
> NOT NULL NUMBER
>  NUM_DISTINCTS
> NUMBER
>  AVG_COL_LEN
> NUMBER
>  MAX_COL_LEN
> NUMBER
>  NUM_TRUES
> NUMBER
>  NUM_FALSES
> NUMBER
>  LAST_ANALYZED
> NOT NULL NUMBER
>
>
>
>  So in summary although column type STRUCT do exit, I don't think Hive can
> cater for their statistics. Actually I don't think Oracle itself does it.
>
> HTH
>
> P.S. I am on Hive 2 and it does not.
>
> hive> analyze table foo compute statistics for columns;
> FAILED: UDFArgumentTypeException Only primitive type arguments are
> accepted but array<bigint> is passed.
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 June 2016 at 09:57, Michael Häusler <mi...@akatose.de> wrote:
>
>> Hi there,
>>
>> you can reproduce the messages below with Hive 1.2.1.
>>
>> Best regards
>> Michael
>>
>>
>> On 2016-06-13, at 22:21, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>> which version of Hive are you using?
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 13 June 2016 at 16:00, Michael Häusler <mi...@akatose.de> wrote:
>>
>>> Hi there,
>>>
>>>
>>> when testing column statistics I stumbled upon the following error
>>> message:
>>>
>>> DROP TABLE IF EXISTS foo;
>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>>> STRUCT<key:STRING,value:STRING>);
>>>
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>> accepted but array<bigint> is passed.
>>>
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>> accepted but struct<key:string,value:string> is passed.
>>>
>>>
>>> 1) Basically, it seems that column statistics don't work for
>>> non-primitive types. Are there any workarounds or any plans to change this?
>>>
>>> 2) Furthermore, the convenience syntax to compute statistics for all
>>> columns does not work as soon as there is a non-supported column. Are there
>>> any plans to change this, so it is easier to compute statistics for all
>>> supported columns?
>>>
>>> 3) ANALYZE TABLE will only provide the first failing *type* in the error
>>> message. Especially for wide tables it would be much easier if all
>>> non-supported column *names* would be printed.
>>>
>>>
>>> Best regards
>>> Michael
>>>
>>>
>>
>>
>
>

RE: column statistics for non-primitive types

Posted by "Markovitz, Dudu" <dm...@paypal.com>.

Hi Michael

Case ‘b’ (“answer query directly”) seems to be risky in an open system.
Files/directories can be deleted directly in the filesystem without Hive having any knowledge about it which will lead to wrong queries results.

Dudu

From: Michael Häusler [mailto:michael@akatose.de]
Sent: Tuesday, June 14, 2016 11:43 PM
To: user@hive.apache.org
Subject: Re: column statistics for non-primitive types

Hi Pengcheng,

(1)
statistics on non-primitive columns can be just as useful as on primitive columns, e.g.,
DROP TABLE IF EXISTS foo;
CREATE TABLE foo (id BIGINT, someArray ARRAY<BIGINT>, someStruct STRUCT<key:STRING,value:STRING>);

a) query optimization
Let foo be a huge table that needs to be joined with another huge table bar like this

SELECT
            f.id
FROM
            foo f
JOIN
            bar b
ON
            f.id = b.id
WHERE
            f.someArray IS NOT NULL

If statistics tell us that #nulls in someArray is small, we could apply a different join strategy (e.g., map-side join, bar main table, filtered foo as hash table)

b) answer query directly

SELECT
            COUNT(DISTINCT someStruct)
FROM
            foo;

Such a query can easily be answered directly from stats.

(2)

Do you happen to know, whether HIVE-11160 also works for CTAS?
Because a quick test of the configuration property did not work for me:

hive> SET hive.stats.fetch.column.stats=true;
hive> DROP TABLE IF EXISTS foo;
OK
Time taken: 6.585 seconds
hive> CREATE TABLE foo AS
    > SELECT
    >     1 AS foo;
Query ID = haeusler_20160614203002_7a47459d-349b-4012-ac7f-b2cc867b87ef
Total jobs = 1
Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1465334589772_15920)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 2.89 s
--------------------------------------------------------------------------------
Moving data to: hdfs://invcluster/user/hive/warehouse/haeusler.db/foo
Table haeusler.foo stats: [numFiles=1, numRows=1, totalSize=194, rawDataSize=4]
OK
Time taken: 8.088 seconds
hive> DESCRIBE FORMATTED foo.foo;
OK
# col_name              data_type               min                     max                     num_nulls               distinct_count          avg_col_len             max_col_len             num_trues               num_falses              comment

foo                     int                                                                                                                                                                                                                     from deserializer
Time taken: 0.197 seconds, Fetched: 3 row(s)

^^^ the table creation works, but I don't get any column stats.

Best regards
Michael

On 2016-06-14, at 22:23, Pengcheng Xiong <px...@apache.org>> wrote:

Hi Michael,

    (1) We collect columns stats for the following purpose (a) Query optimization, esp. join reordering and big/small table size estimation. More recently, we also use it to remove filters. You can refer to Calcite rules. (b) Answer query directly through metaStore. You can refer to the configuration of HIVEOPTIMIZEMETADATAQUERIES("hive.compute.query.using.stats").

    We can do stats for non-primitive columns, but we need to know the motivation to do so before we do it. If you can, could you please list some?

   (2) There is a configuration "hive.stats.fetch.column.stats". If you set it to true, it will automatically collect column stats for you when you insert into/overwrite a new table. You can refer to HIVE-11160 for more details.

   Hope my answers help.

Thanks

Best.
Pengcheng

On Tue, Jun 14, 2016 at 1:03 PM, Michael Häusler <mi...@akatose.de>> wrote:
Hi there,

there might be two topics here:

1) feasibility of stats for non-primitive columns
2) ease of use

1) feasibility of stats for non-primitive columns:

Hive currently collects different kind of statistics for different kind of types:
numeric values: min, max, #nulls, #distincts
boolean values: #nulls, #trues, #falses
string values: #nulls, #distincts, avgLength, maxLength

So, it seems quite possible to also collect at least partial stats for top-level non-primitive columns, e.g.:
array values: #nulls, #distincts, avgLength, maxLength
map values: #nulls, #distincts, avgLength, maxLength
struct values: #nulls, #distincts
union values: #nulls, #distincts

2) ease of use

The presence of a single non-primitive column currently breaks the use of the convenience shorthand to gather statistics for all columns (ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption of column statistics for hive users.

Best regards
Michael

On 2016-06-14, at 12:04, Mich Talebzadeh <mi...@gmail.com>> wrote:

Hi Michael,

Statistics for columns in Hive are kept in Hive metadata table tab_col_stats.

When I am looking at this table in Oracle, I only see statistics for primitives columns here. STRUCT columns do not have it as a STRUCT column will have to be broken into its primitive columns.  I don't think Hive has the means to do that.

desc tab_col_stats;
 Name                                                                     Null?    Type
 ------------------------------------------------------------------------ -------- -------------------------------------------------
 CS_ID                                                                    NOT NULL NUMBER
 DB_NAME                                                                  NOT NULL VARCHAR2(128)
 TABLE_NAME                                                               NOT NULL VARCHAR2(128)
 COLUMN_NAME                                                              NOT NULL VARCHAR2(1000)
 COLUMN_TYPE                                                              NOT NULL VARCHAR2(128)
 TBL_ID                                                                   NOT NULL NUMBER
 LONG_LOW_VALUE                                                                    NUMBER
 LONG_HIGH_VALUE                                                                   NUMBER
 DOUBLE_LOW_VALUE                                                                  NUMBER
 DOUBLE_HIGH_VALUE                                                                 NUMBER
 BIG_DECIMAL_LOW_VALUE                                                             VARCHAR2(4000)
 BIG_DECIMAL_HIGH_VALUE                                                            VARCHAR2(4000)
 NUM_NULLS                                                                NOT NULL NUMBER
 NUM_DISTINCTS                                                                     NUMBER
 AVG_COL_LEN                                                                       NUMBER
 MAX_COL_LEN                                                                       NUMBER
 NUM_TRUES                                                                         NUMBER
 NUM_FALSES                                                                        NUMBER
 LAST_ANALYZED                                                            NOT NULL NUMBER

 So in summary although column type STRUCT do exit, I don't think Hive can cater for their statistics. Actually I don't think Oracle itself does it.

HTH

P.S. I am on Hive 2 and it does not.

hive> analyze table foo compute statistics for columns;
FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.

Dr Mich Talebzadeh

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

On 14 June 2016 at 09:57, Michael Häusler <mi...@akatose.de>> wrote:
Hi there,

you can reproduce the messages below with Hive 1.2.1.

Best regards
Michael

On 2016-06-13, at 22:21, Mich Talebzadeh <mi...@gmail.com>> wrote:

which version of Hive are you using?

Dr Mich Talebzadeh

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

On 13 June 2016 at 16:00, Michael Häusler <mi...@akatose.de>> wrote:
Hi there,

when testing column statistics I stumbled upon the following error message:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar STRUCT<key:STRING,value:STRING>);

ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.

ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but struct<key:string,value:string> is passed.

1) Basically, it seems that column statistics don't work for non-primitive types. Are there any workarounds or any plans to change this?

2) Furthermore, the convenience syntax to compute statistics for all columns does not work as soon as there is a non-supported column. Are there any plans to change this, so it is easier to compute statistics for all supported columns?

3) ANALYZE TABLE will only provide the first failing *type* in the error message. Especially for wide tables it would be much easier if all non-supported column *names* would be printed.

Best regards
Michael

Re: column statistics for non-primitive types

Posted by Michael Häusler <mi...@akatose.de>.

Hi Pengcheng,

(1)
Filtering on IS NULL / IS NOT NULL is imho a very frequent use case in all kind of queries, including joins.

We also see it frequently in queries like
SELECT COUNT(*) FROM foo WHERE foo.someComplex IS NULL;
SELECT COUNT(*) FROM foo WHERE foo.someComplex IS NOT NULL;

We also see quite some queries like
SELECT MAX(SIZE(someArray)) FROM foo;
SELECT COUNT(DISTINCT someStruct.someField) FROM foo;

Rarely we see queries like
SELECT COUNT(DISTINCT someStruct) FROM foo;

(3)
I'll see if I can get it compiled, so that I can test locally.

Best
Michael


> On 2016-06-14, at 23:19, Pengcheng Xiong <px...@apache.org> wrote:
> 
> Hi Michael,
> 
>     Thanks for your reply.
> 
>     (1) We have considered your first point before as well. However, join reordering is now done by Calcite in Hive optimizer. Calcite can not deal with complex/structure types. Thus, we may need to first make improvement in Calcite and then make the column stats for complex/structure type useful. Your second point to use stats directly is valid. But we would like to know how many use cases are there and how popular it is.
> 
>     (2) AFAIK, it only supports insert overwrite and insert into. I will investigate the support of CTAS.
> 
>     (3) The new feature is in Hive 2.1. If possible, could u upgrade and try?
> 
> Thanks.
> 
> Best
> Pengcheng 
> 
> On Tue, Jun 14, 2016 at 1:42 PM, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
> Hi Pengcheng,
> 
> (1)
> statistics on non-primitive columns can be just as useful as on primitive columns, e.g.,
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (id BIGINT, someArray ARRAY<BIGINT>, someStruct STRUCT<key:STRING,value:STRING>);
> 
> a) query optimization
> Let foo be a huge table that needs to be joined with another huge table bar like this
> 
> SELECT
> 	f.id <http://f.id/>
> FROM
> 	foo f
> JOIN
> 	bar b
> ON
> 	f.id <http://f.id/> = b.id <http://b.id/>
> WHERE
> 	f.someArray IS NOT NULL
> 
> If statistics tell us that #nulls in someArray is small, we could apply a different join strategy (e.g., map-side join, bar main table, filtered foo as hash table)
> 
> b) answer query directly
> 
> SELECT
> 	COUNT(DISTINCT someStruct)
> FROM
> 	foo;
> 
> Such a query can easily be answered directly from stats.
> 
> 
> 
> (2)
> 
> Do you happen to know, whether HIVE-11160 also works for CTAS?
> Because a quick test of the configuration property did not work for me:
> 
> hive> SET hive.stats.fetch.column.stats=true;
> hive> DROP TABLE IF EXISTS foo;
> OK
> Time taken: 6.585 seconds
> hive> CREATE TABLE foo AS
>     > SELECT
>     >     1 AS foo;
> Query ID = haeusler_20160614203002_7a47459d-349b-4012-ac7f-b2cc867b87ef
> Total jobs = 1
> Launching Job 1 out of 1
> 
> 
> Status: Running (Executing on YARN cluster with App id application_1465334589772_15920)
> 
> --------------------------------------------------------------------------------
>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> --------------------------------------------------------------------------------
> Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
> --------------------------------------------------------------------------------
> VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 2.89 s
> --------------------------------------------------------------------------------
> Moving data to: hdfs://invcluster/user/hive/warehouse/haeusler.db/foo <>
> Table haeusler.foo stats: [numFiles=1, numRows=1, totalSize=194, rawDataSize=4]
> OK
> Time taken: 8.088 seconds
> hive> DESCRIBE FORMATTED foo.foo;
> OK
> # col_name              data_type               min                     max                     num_nulls               distinct_count          avg_col_len             max_col_len             num_trues               num_falses              comment
> 
> foo                     int                                                                                                                                                                                                                     from deserializer
> Time taken: 0.197 seconds, Fetched: 3 row(s)
> 
> ^^^ the table creation works, but I don't get any column stats.
> 
> 
> Best regards
> Michael
> 
> 
> 
>> On 2016-06-14, at 22:23, Pengcheng Xiong <pxiong@apache.org <ma...@apache.org>> wrote:
>> 
>> Hi Michael,
>> 
>>     (1) We collect columns stats for the following purpose (a) Query optimization, esp. join reordering and big/small table size estimation. More recently, we also use it to remove filters. You can refer to Calcite rules. (b) Answer query directly through metaStore. You can refer to the configuration of HIVEOPTIMIZEMETADATAQUERIES("hive.compute.query.using.stats").
>>     
>>     We can do stats for non-primitive columns, but we need to know the motivation to do so before we do it. If you can, could you please list some?
>> 
>>    (2) There is a configuration "hive.stats.fetch.column.stats". If you set it to true, it will automatically collect column stats for you when you insert into/overwrite a new table. You can refer to HIVE-11160 for more details.
>> 
>>    Hope my answers help.
>> 
>> Thanks
>> 
>> Best.
>> Pengcheng
>>     
>> 
>> On Tue, Jun 14, 2016 at 1:03 PM, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
>> Hi there,
>> 
>> there might be two topics here:
>> 
>> 1) feasibility of stats for non-primitive columns
>> 2) ease of use
>> 
>> 
>> 1) feasibility of stats for non-primitive columns:
>> 
>> Hive currently collects different kind of statistics for different kind of types:
>> numeric values:	min, max, #nulls, #distincts
>> boolean values:	#nulls, #trues, #falses
>> string values:		#nulls, #distincts, avgLength, maxLength
>> 
>> So, it seems quite possible to also collect at least partial stats for top-level non-primitive columns, e.g.:
>> array values:		#nulls, #distincts, avgLength, maxLength 
>> map values:		#nulls, #distincts, avgLength, maxLength
>> struct values:		#nulls, #distincts
>> union values:		#nulls, #distincts
>> 
>> 
>> 2) ease of use
>> 
>> The presence of a single non-primitive column currently breaks the use of the convenience shorthand to gather statistics for all columns (ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption of column statistics for hive users.
>> 
>> Best regards
>> Michael
>> 
>> 
>> 
>>> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> Hi Michael,
>>> 
>>> Statistics for columns in Hive are kept in Hive metadata table tab_col_stats.
>>> 
>>> When I am looking at this table in Oracle, I only see statistics for primitives columns here. STRUCT columns do not have it as a STRUCT column will have to be broken into its primitive columns.  I don't think Hive has the means to do that.
>>> 
>>> desc tab_col_stats;
>>>  Name                                                                     Null?    Type
>>>  ------------------------------------------------------------------------ -------- -------------------------------------------------
>>>  CS_ID                                                                    NOT NULL NUMBER
>>>  DB_NAME                                                                  NOT NULL VARCHAR2(128)
>>>  TABLE_NAME                                                               NOT NULL VARCHAR2(128)
>>>  COLUMN_NAME                                                              NOT NULL VARCHAR2(1000)
>>>  COLUMN_TYPE                                                              NOT NULL VARCHAR2(128)
>>>  TBL_ID                                                                   NOT NULL NUMBER
>>>  LONG_LOW_VALUE                                                                    NUMBER
>>>  LONG_HIGH_VALUE                                                                   NUMBER
>>>  DOUBLE_LOW_VALUE                                                                  NUMBER
>>>  DOUBLE_HIGH_VALUE                                                                 NUMBER
>>>  BIG_DECIMAL_LOW_VALUE                                                             VARCHAR2(4000)
>>>  BIG_DECIMAL_HIGH_VALUE                                                            VARCHAR2(4000)
>>>  NUM_NULLS                                                                NOT NULL NUMBER
>>>  NUM_DISTINCTS                                                                     NUMBER
>>>  AVG_COL_LEN                                                                       NUMBER
>>>  MAX_COL_LEN                                                                       NUMBER
>>>  NUM_TRUES                                                                         NUMBER
>>>  NUM_FALSES                                                                        NUMBER
>>>  LAST_ANALYZED                                                            NOT NULL NUMBER
>>> 
>>> 
>>> 
>>>  So in summary although column type STRUCT do exit, I don't think Hive can cater for their statistics. Actually I don't think Oracle itself does it.
>>> 
>>> HTH
>>> 
>>> P.S. I am on Hive 2 and it does not.
>>> 
>>> hive> analyze table foo compute statistics for columns;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>  
>>> 
>>> On 14 June 2016 at 09:57, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
>>> Hi there,
>>> 
>>> you can reproduce the messages below with Hive 1.2.1.
>>> 
>>> Best regards
>>> Michael
>>> 
>>> 
>>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> which version of Hive are you using?
>>>> 
>>>> Dr Mich Talebzadeh
>>>>  
>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>>  
>>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>>  
>>>> 
>>>> On 13 June 2016 at 16:00, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
>>>> Hi there,
>>>> 
>>>> 
>>>> when testing column statistics I stumbled upon the following error message:
>>>> 
>>>> DROP TABLE IF EXISTS foo;
>>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar STRUCT<key:STRING,value:STRING>);
>>>> 
>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.
>>>> 
>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but struct<key:string,value:string> is passed.
>>>> 
>>>> 
>>>> 1) Basically, it seems that column statistics don't work for non-primitive types. Are there any workarounds or any plans to change this?
>>>> 
>>>> 2) Furthermore, the convenience syntax to compute statistics for all columns does not work as soon as there is a non-supported column. Are there any plans to change this, so it is easier to compute statistics for all supported columns?
>>>> 
>>>> 3) ANALYZE TABLE will only provide the first failing *type* in the error message. Especially for wide tables it would be much easier if all non-supported column *names* would be printed.
>>>> 
>>>> 
>>>> Best regards
>>>> Michael
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: column statistics for non-primitive types

Posted by Pengcheng Xiong <px...@apache.org>.

Hi Michael,

    Thanks for your reply.

    (1) We have considered your first point before as well. However, join
reordering is now done by Calcite in Hive optimizer. Calcite can not deal
with complex/structure types. Thus, we may need to first make improvement
in Calcite and then make the column stats for complex/structure type
useful. Your second point to use stats directly is valid. But we would like
to know how many use cases are there and how popular it is.

    (2) AFAIK, it only supports insert overwrite and insert into. I will
investigate the support of CTAS.

    (3) The new feature is in Hive 2.1. If possible, could u upgrade and
try?

Thanks.

Best
Pengcheng

On Tue, Jun 14, 2016 at 1:42 PM, Michael Häusler <mi...@akatose.de> wrote:

> Hi Pengcheng,
>
> (1)
> statistics on non-primitive columns can be just as useful as on primitive
> columns, e.g.,
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (id BIGINT, someArray ARRAY<BIGINT>, someStruct
> STRUCT<key:STRING,value:STRING>);
>
> a) query optimization
> Let foo be a huge table that needs to be joined with another huge table
> bar like this
>
> SELECT
> f.id
> FROM
> foo f
> JOIN
> bar b
> ON
> f.id = b.id
> WHERE
> f.someArray IS NOT NULL
>
> If statistics tell us that #nulls in someArray is small, we could apply a
> different join strategy (e.g., map-side join, bar main table, filtered foo
> as hash table)
>
> b) answer query directly
>
> SELECT
> COUNT(DISTINCT someStruct)
> FROM
> foo;
>
> Such a query can easily be answered directly from stats.
>
>
>
> (2)
>
> Do you happen to know, whether HIVE-11160 also works for CTAS?
> Because a quick test of the configuration property did not work for me:
>
> hive> SET hive.stats.fetch.column.stats=true;
> hive> DROP TABLE IF EXISTS foo;
> OK
> Time taken: 6.585 seconds
> hive> CREATE TABLE foo AS
>     > SELECT
>     >     1 AS foo;
> Query ID = haeusler_20160614203002_7a47459d-349b-4012-ac7f-b2cc867b87ef
> Total jobs = 1
> Launching Job 1 out of 1
>
>
> Status: Running (Executing on YARN cluster with App id
> application_1465334589772_15920)
>
>
> --------------------------------------------------------------------------------
>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
>  KILLED
>
> --------------------------------------------------------------------------------
> Map 1 ..........   SUCCEEDED      1          1        0        0       0
>     0
>
> --------------------------------------------------------------------------------
> VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 2.89 s
>
> --------------------------------------------------------------------------------
> Moving data to: hdfs://invcluster/user/hive/warehouse/haeusler.db/foo
> Table haeusler.foo stats: [numFiles=1, numRows=1, totalSize=194,
> rawDataSize=4]
> OK
> Time taken: 8.088 seconds
> hive> DESCRIBE FORMATTED foo.foo;
> OK
> # col_name              data_type               min
> max                     num_nulls               distinct_count
>  avg_col_len             max_col_len             num_trues
> num_falses              comment
>
> foo                     int
>
>
>               from deserializer
> Time taken: 0.197 seconds, Fetched: 3 row(s)
>
> ^^^ the table creation works, but I don't get any column stats.
>
>
> Best regards
> Michael
>
>
>
> On 2016-06-14, at 22:23, Pengcheng Xiong <px...@apache.org> wrote:
>
> Hi Michael,
>
>     (1) We collect columns stats for the following purpose (a) Query
> optimization, esp. join reordering and big/small table size estimation.
> More recently, we also use it to remove filters. You can refer to Calcite
> rules. (b) Answer query directly through metaStore. You can refer to the
> configuration of HIVEOPTIMIZEMETADATAQUERIES(
> "hive.compute.query.using.stats").
>
>     We can do stats for non-primitive columns, but we need to know the
> motivation to do so before we do it. If you can, could you please list some?
>
>    (2) There is a configuration "hive.stats.fetch.column.stats". If you
> set it to true, it will automatically collect column stats for you when you
> insert into/overwrite a new table. You can refer to HIVE-11160 for more
> details.
>
>    Hope my answers help.
>
> Thanks
>
> Best.
> Pengcheng
>
>
> On Tue, Jun 14, 2016 at 1:03 PM, Michael Häusler <mi...@akatose.de>
> wrote:
>
>> Hi there,
>>
>> there might be two topics here:
>>
>> 1) feasibility of stats for non-primitive columns
>> 2) ease of use
>>
>>
>> 1) feasibility of stats for non-primitive columns:
>>
>> Hive currently collects different kind of statistics for different kind
>> of types:
>> numeric values: min, max, #nulls, #distincts
>> boolean values: #nulls, #trues, #falses
>> string values: #nulls, #distincts, avgLength, maxLength
>>
>> So, it seems quite possible to also collect at least partial stats for
>> top-level non-primitive columns, e.g.:
>> array values: #nulls, #distincts, avgLength, maxLength
>> map values: #nulls, #distincts, avgLength, maxLength
>> struct values: #nulls, #distincts
>> union values: #nulls, #distincts
>>
>>
>> 2) ease of use
>>
>> The presence of a single non-primitive column currently breaks the use of
>> the convenience shorthand to gather statistics for all columns (ANALYZE
>> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption
>> of column statistics for hive users.
>>
>> Best regards
>> Michael
>>
>>
>>
>> On 2016-06-14, at 12:04, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>> Hi Michael,
>>
>> Statistics for columns in Hive are kept in Hive metadata table
>> tab_col_stats.
>>
>> When I am looking at this table in Oracle, I only see statistics for
>> primitives columns here. STRUCT columns do not have it as a STRUCT column
>> will have to be broken into its primitive columns.  I don't think Hive has
>> the means to do that.
>>
>> desc tab_col_stats;
>>  Name
>> Null?    Type
>>  ------------------------------------------------------------------------
>> -------- -------------------------------------------------
>>  CS_ID
>> NOT NULL NUMBER
>>  DB_NAME
>> NOT NULL VARCHAR2(128)
>>  TABLE_NAME
>> NOT NULL VARCHAR2(128)
>>  COLUMN_NAME
>> NOT NULL VARCHAR2(1000)
>>  COLUMN_TYPE
>> NOT NULL VARCHAR2(128)
>>  TBL_ID
>> NOT NULL NUMBER
>>  LONG_LOW_VALUE
>> NUMBER
>>  LONG_HIGH_VALUE
>> NUMBER
>>  DOUBLE_LOW_VALUE
>> NUMBER
>>  DOUBLE_HIGH_VALUE
>> NUMBER
>>  BIG_DECIMAL_LOW_VALUE
>> VARCHAR2(4000)
>>  BIG_DECIMAL_HIGH_VALUE
>> VARCHAR2(4000)
>>  NUM_NULLS
>> NOT NULL NUMBER
>>  NUM_DISTINCTS
>> NUMBER
>>  AVG_COL_LEN
>> NUMBER
>>  MAX_COL_LEN
>> NUMBER
>>  NUM_TRUES
>> NUMBER
>>  NUM_FALSES
>> NUMBER
>>  LAST_ANALYZED
>> NOT NULL NUMBER
>>
>>
>>
>>  So in summary although column type STRUCT do exit, I don't think Hive
>> can cater for their statistics. Actually I don't think Oracle itself does
>> it.
>>
>> HTH
>>
>> P.S. I am on Hive 2 and it does not.
>>
>> hive> analyze table foo compute statistics for columns;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>> accepted but array<bigint> is passed.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 June 2016 at 09:57, Michael Häusler <mi...@akatose.de> wrote:
>>
>>> Hi there,
>>>
>>> you can reproduce the messages below with Hive 1.2.1.
>>>
>>> Best regards
>>> Michael
>>>
>>>
>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mi...@gmail.com>
>>> wrote:
>>>
>>> which version of Hive are you using?
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 13 June 2016 at 16:00, Michael Häusler <mi...@akatose.de> wrote:
>>>
>>>> Hi there,
>>>>
>>>>
>>>> when testing column statistics I stumbled upon the following error
>>>> message:
>>>>
>>>> DROP TABLE IF EXISTS foo;
>>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>>>> STRUCT<key:STRING,value:STRING>);
>>>>
>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>> accepted but array<bigint> is passed.
>>>>
>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>> accepted but struct<key:string,value:string> is passed.
>>>>
>>>>
>>>> 1) Basically, it seems that column statistics don't work for
>>>> non-primitive types. Are there any workarounds or any plans to change this?
>>>>
>>>> 2) Furthermore, the convenience syntax to compute statistics for all
>>>> columns does not work as soon as there is a non-supported column. Are there
>>>> any plans to change this, so it is easier to compute statistics for all
>>>> supported columns?
>>>>
>>>> 3) ANALYZE TABLE will only provide the first failing *type* in the
>>>> error message. Especially for wide tables it would be much easier if all
>>>> non-supported column *names* would be printed.
>>>>
>>>>
>>>> Best regards
>>>> Michael
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Re: column statistics for non-primitive types

Posted by Michael Häusler <mi...@akatose.de>.

Hi there,

> On 2016-06-14, at 22:42, Michael Häusler <mi...@akatose.de> wrote:
> 
> (2)
> Do you happen to know, whether HIVE-11160 also works for CTAS?
> Because a quick test of the configuration property did not work for me:

My bad - HIVE-11160 has target version 2.2.0, so it won't work yet on my installation of Hive 1.2.1.

Best regards
Michael

Re: column statistics for non-primitive types

Posted by Michael Häusler <mi...@akatose.de>.

Hi Pengcheng,

(1)
statistics on non-primitive columns can be just as useful as on primitive columns, e.g.,
DROP TABLE IF EXISTS foo;
CREATE TABLE foo (id BIGINT, someArray ARRAY<BIGINT>, someStruct STRUCT<key:STRING,value:STRING>);

a) query optimization
Let foo be a huge table that needs to be joined with another huge table bar like this

SELECT
	f.id
FROM
	foo f
JOIN
	bar b
ON
	f.id = b.id
WHERE
	f.someArray IS NOT NULL

If statistics tell us that #nulls in someArray is small, we could apply a different join strategy (e.g., map-side join, bar main table, filtered foo as hash table)

b) answer query directly

SELECT
	COUNT(DISTINCT someStruct)
FROM
	foo;

Such a query can easily be answered directly from stats.



(2)

Do you happen to know, whether HIVE-11160 also works for CTAS?
Because a quick test of the configuration property did not work for me:

hive> SET hive.stats.fetch.column.stats=true;
hive> DROP TABLE IF EXISTS foo;
OK
Time taken: 6.585 seconds
hive> CREATE TABLE foo AS
    > SELECT
    >     1 AS foo;
Query ID = haeusler_20160614203002_7a47459d-349b-4012-ac7f-b2cc867b87ef
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1465334589772_15920)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 2.89 s
--------------------------------------------------------------------------------
Moving data to: hdfs://invcluster/user/hive/warehouse/haeusler.db/foo
Table haeusler.foo stats: [numFiles=1, numRows=1, totalSize=194, rawDataSize=4]
OK
Time taken: 8.088 seconds
hive> DESCRIBE FORMATTED foo.foo;
OK
# col_name              data_type               min                     max                     num_nulls               distinct_count          avg_col_len             max_col_len             num_trues               num_falses              comment

foo                     int                                                                                                                                                                                                                     from deserializer
Time taken: 0.197 seconds, Fetched: 3 row(s)

^^^ the table creation works, but I don't get any column stats.


Best regards
Michael



> On 2016-06-14, at 22:23, Pengcheng Xiong <px...@apache.org> wrote:
> 
> Hi Michael,
> 
>     (1) We collect columns stats for the following purpose (a) Query optimization, esp. join reordering and big/small table size estimation. More recently, we also use it to remove filters. You can refer to Calcite rules. (b) Answer query directly through metaStore. You can refer to the configuration of HIVEOPTIMIZEMETADATAQUERIES("hive.compute.query.using.stats").
>     
>     We can do stats for non-primitive columns, but we need to know the motivation to do so before we do it. If you can, could you please list some?
> 
>    (2) There is a configuration "hive.stats.fetch.column.stats". If you set it to true, it will automatically collect column stats for you when you insert into/overwrite a new table. You can refer to HIVE-11160 for more details.
> 
>    Hope my answers help.
> 
> Thanks
> 
> Best.
> Pengcheng
>     
> 
> On Tue, Jun 14, 2016 at 1:03 PM, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
> Hi there,
> 
> there might be two topics here:
> 
> 1) feasibility of stats for non-primitive columns
> 2) ease of use
> 
> 
> 1) feasibility of stats for non-primitive columns:
> 
> Hive currently collects different kind of statistics for different kind of types:
> numeric values:	min, max, #nulls, #distincts
> boolean values:	#nulls, #trues, #falses
> string values:		#nulls, #distincts, avgLength, maxLength
> 
> So, it seems quite possible to also collect at least partial stats for top-level non-primitive columns, e.g.:
> array values:		#nulls, #distincts, avgLength, maxLength 
> map values:		#nulls, #distincts, avgLength, maxLength
> struct values:		#nulls, #distincts
> union values:		#nulls, #distincts
> 
> 
> 2) ease of use
> 
> The presence of a single non-primitive column currently breaks the use of the convenience shorthand to gather statistics for all columns (ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption of column statistics for hive users.
> 
> Best regards
> Michael
> 
> 
> 
>> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Hi Michael,
>> 
>> Statistics for columns in Hive are kept in Hive metadata table tab_col_stats.
>> 
>> When I am looking at this table in Oracle, I only see statistics for primitives columns here. STRUCT columns do not have it as a STRUCT column will have to be broken into its primitive columns.  I don't think Hive has the means to do that.
>> 
>> desc tab_col_stats;
>>  Name                                                                     Null?    Type
>>  ------------------------------------------------------------------------ -------- -------------------------------------------------
>>  CS_ID                                                                    NOT NULL NUMBER
>>  DB_NAME                                                                  NOT NULL VARCHAR2(128)
>>  TABLE_NAME                                                               NOT NULL VARCHAR2(128)
>>  COLUMN_NAME                                                              NOT NULL VARCHAR2(1000)
>>  COLUMN_TYPE                                                              NOT NULL VARCHAR2(128)
>>  TBL_ID                                                                   NOT NULL NUMBER
>>  LONG_LOW_VALUE                                                                    NUMBER
>>  LONG_HIGH_VALUE                                                                   NUMBER
>>  DOUBLE_LOW_VALUE                                                                  NUMBER
>>  DOUBLE_HIGH_VALUE                                                                 NUMBER
>>  BIG_DECIMAL_LOW_VALUE                                                             VARCHAR2(4000)
>>  BIG_DECIMAL_HIGH_VALUE                                                            VARCHAR2(4000)
>>  NUM_NULLS                                                                NOT NULL NUMBER
>>  NUM_DISTINCTS                                                                     NUMBER
>>  AVG_COL_LEN                                                                       NUMBER
>>  MAX_COL_LEN                                                                       NUMBER
>>  NUM_TRUES                                                                         NUMBER
>>  NUM_FALSES                                                                        NUMBER
>>  LAST_ANALYZED                                                            NOT NULL NUMBER
>> 
>> 
>> 
>>  So in summary although column type STRUCT do exit, I don't think Hive can cater for their statistics. Actually I don't think Oracle itself does it.
>> 
>> HTH
>> 
>> P.S. I am on Hive 2 and it does not.
>> 
>> hive> analyze table foo compute statistics for columns;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 14 June 2016 at 09:57, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
>> Hi there,
>> 
>> you can reproduce the messages below with Hive 1.2.1.
>> 
>> Best regards
>> Michael
>> 
>> 
>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> which version of Hive are you using?
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>  
>>> 
>>> On 13 June 2016 at 16:00, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
>>> Hi there,
>>> 
>>> 
>>> when testing column statistics I stumbled upon the following error message:
>>> 
>>> DROP TABLE IF EXISTS foo;
>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar STRUCT<key:STRING,value:STRING>);
>>> 
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.
>>> 
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but struct<key:string,value:string> is passed.
>>> 
>>> 
>>> 1) Basically, it seems that column statistics don't work for non-primitive types. Are there any workarounds or any plans to change this?
>>> 
>>> 2) Furthermore, the convenience syntax to compute statistics for all columns does not work as soon as there is a non-supported column. Are there any plans to change this, so it is easier to compute statistics for all supported columns?
>>> 
>>> 3) ANALYZE TABLE will only provide the first failing *type* in the error message. Especially for wide tables it would be much easier if all non-supported column *names* would be printed.
>>> 
>>> 
>>> Best regards
>>> Michael
>>> 
>>> 
>> 
>> 
> 
>

Re: column statistics for non-primitive types

Posted by Pengcheng Xiong <px...@apache.org>.

Hi Michael,

    (1) We collect columns stats for the following purpose (a) Query
optimization, esp. join reordering and big/small table size estimation.
More recently, we also use it to remove filters. You can refer to Calcite
rules. (b) Answer query directly through metaStore. You can refer to the
configuration of HIVEOPTIMIZEMETADATAQUERIES(
"hive.compute.query.using.stats").

    We can do stats for non-primitive columns, but we need to know the
motivation to do so before we do it. If you can, could you please list some?

   (2) There is a configuration "hive.stats.fetch.column.stats". If you set
it to true, it will automatically collect column stats for you when you
insert into/overwrite a new table. You can refer to HIVE-11160 for more
details.

   Hope my answers help.

Thanks

Best.
Pengcheng


On Tue, Jun 14, 2016 at 1:03 PM, Michael Häusler <mi...@akatose.de> wrote:

> Hi there,
>
> there might be two topics here:
>
> 1) feasibility of stats for non-primitive columns
> 2) ease of use
>
>
> 1) feasibility of stats for non-primitive columns:
>
> Hive currently collects different kind of statistics for different kind of
> types:
> numeric values: min, max, #nulls, #distincts
> boolean values: #nulls, #trues, #falses
> string values: #nulls, #distincts, avgLength, maxLength
>
> So, it seems quite possible to also collect at least partial stats for
> top-level non-primitive columns, e.g.:
> array values: #nulls, #distincts, avgLength, maxLength
> map values: #nulls, #distincts, avgLength, maxLength
> struct values: #nulls, #distincts
> union values: #nulls, #distincts
>
>
> 2) ease of use
>
> The presence of a single non-primitive column currently breaks the use of
> the convenience shorthand to gather statistics for all columns (ANALYZE
> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption
> of column statistics for hive users.
>
> Best regards
> Michael
>
>
>
> On 2016-06-14, at 12:04, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> Hi Michael,
>
> Statistics for columns in Hive are kept in Hive metadata table
> tab_col_stats.
>
> When I am looking at this table in Oracle, I only see statistics for
> primitives columns here. STRUCT columns do not have it as a STRUCT column
> will have to be broken into its primitive columns.  I don't think Hive has
> the means to do that.
>
> desc tab_col_stats;
>  Name
> Null?    Type
>  ------------------------------------------------------------------------
> -------- -------------------------------------------------
>  CS_ID
> NOT NULL NUMBER
>  DB_NAME
> NOT NULL VARCHAR2(128)
>  TABLE_NAME
> NOT NULL VARCHAR2(128)
>  COLUMN_NAME
> NOT NULL VARCHAR2(1000)
>  COLUMN_TYPE
> NOT NULL VARCHAR2(128)
>  TBL_ID
> NOT NULL NUMBER
>  LONG_LOW_VALUE
> NUMBER
>  LONG_HIGH_VALUE
> NUMBER
>  DOUBLE_LOW_VALUE
> NUMBER
>  DOUBLE_HIGH_VALUE
> NUMBER
>  BIG_DECIMAL_LOW_VALUE
> VARCHAR2(4000)
>  BIG_DECIMAL_HIGH_VALUE
> VARCHAR2(4000)
>  NUM_NULLS
> NOT NULL NUMBER
>  NUM_DISTINCTS
> NUMBER
>  AVG_COL_LEN
> NUMBER
>  MAX_COL_LEN
> NUMBER
>  NUM_TRUES
> NUMBER
>  NUM_FALSES
> NUMBER
>  LAST_ANALYZED
> NOT NULL NUMBER
>
>
>
>  So in summary although column type STRUCT do exit, I don't think Hive can
> cater for their statistics. Actually I don't think Oracle itself does it.
>
> HTH
>
> P.S. I am on Hive 2 and it does not.
>
> hive> analyze table foo compute statistics for columns;
> FAILED: UDFArgumentTypeException Only primitive type arguments are
> accepted but array<bigint> is passed.
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 June 2016 at 09:57, Michael Häusler <mi...@akatose.de> wrote:
>
>> Hi there,
>>
>> you can reproduce the messages below with Hive 1.2.1.
>>
>> Best regards
>> Michael
>>
>>
>> On 2016-06-13, at 22:21, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>> which version of Hive are you using?
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 13 June 2016 at 16:00, Michael Häusler <mi...@akatose.de> wrote:
>>
>>> Hi there,
>>>
>>>
>>> when testing column statistics I stumbled upon the following error
>>> message:
>>>
>>> DROP TABLE IF EXISTS foo;
>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>>> STRUCT<key:STRING,value:STRING>);
>>>
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>> accepted but array<bigint> is passed.
>>>
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>> accepted but struct<key:string,value:string> is passed.
>>>
>>>
>>> 1) Basically, it seems that column statistics don't work for
>>> non-primitive types. Are there any workarounds or any plans to change this?
>>>
>>> 2) Furthermore, the convenience syntax to compute statistics for all
>>> columns does not work as soon as there is a non-supported column. Are there
>>> any plans to change this, so it is easier to compute statistics for all
>>> supported columns?
>>>
>>> 3) ANALYZE TABLE will only provide the first failing *type* in the error
>>> message. Especially for wide tables it would be much easier if all
>>> non-supported column *names* would be printed.
>>>
>>>
>>> Best regards
>>> Michael
>>>
>>>
>>
>>
>
>

Re: column statistics for non-primitive types

Posted by Michael Häusler <mi...@akatose.de>.

Hi there,

there might be two topics here:

1) feasibility of stats for non-primitive columns
2) ease of use


1) feasibility of stats for non-primitive columns:

Hive currently collects different kind of statistics for different kind of types:
numeric values:	min, max, #nulls, #distincts
boolean values:	#nulls, #trues, #falses
string values:		#nulls, #distincts, avgLength, maxLength

So, it seems quite possible to also collect at least partial stats for top-level non-primitive columns, e.g.:
array values:		#nulls, #distincts, avgLength, maxLength 
map values:		#nulls, #distincts, avgLength, maxLength
struct values:		#nulls, #distincts
union values:		#nulls, #distincts


2) ease of use

The presence of a single non-primitive column currently breaks the use of the convenience shorthand to gather statistics for all columns (ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption of column statistics for hive users.

Best regards
Michael



> On 2016-06-14, at 12:04, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> Hi Michael,
> 
> Statistics for columns in Hive are kept in Hive metadata table tab_col_stats.
> 
> When I am looking at this table in Oracle, I only see statistics for primitives columns here. STRUCT columns do not have it as a STRUCT column will have to be broken into its primitive columns.  I don't think Hive has the means to do that.
> 
> desc tab_col_stats;
>  Name                                                                     Null?    Type
>  ------------------------------------------------------------------------ -------- -------------------------------------------------
>  CS_ID                                                                    NOT NULL NUMBER
>  DB_NAME                                                                  NOT NULL VARCHAR2(128)
>  TABLE_NAME                                                               NOT NULL VARCHAR2(128)
>  COLUMN_NAME                                                              NOT NULL VARCHAR2(1000)
>  COLUMN_TYPE                                                              NOT NULL VARCHAR2(128)
>  TBL_ID                                                                   NOT NULL NUMBER
>  LONG_LOW_VALUE                                                                    NUMBER
>  LONG_HIGH_VALUE                                                                   NUMBER
>  DOUBLE_LOW_VALUE                                                                  NUMBER
>  DOUBLE_HIGH_VALUE                                                                 NUMBER
>  BIG_DECIMAL_LOW_VALUE                                                             VARCHAR2(4000)
>  BIG_DECIMAL_HIGH_VALUE                                                            VARCHAR2(4000)
>  NUM_NULLS                                                                NOT NULL NUMBER
>  NUM_DISTINCTS                                                                     NUMBER
>  AVG_COL_LEN                                                                       NUMBER
>  MAX_COL_LEN                                                                       NUMBER
>  NUM_TRUES                                                                         NUMBER
>  NUM_FALSES                                                                        NUMBER
>  LAST_ANALYZED                                                            NOT NULL NUMBER
> 
> 
> 
>  So in summary although column type STRUCT do exit, I don't think Hive can cater for their statistics. Actually I don't think Oracle itself does it.
> 
> HTH
> 
> P.S. I am on Hive 2 and it does not.
> 
> hive> analyze table foo compute statistics for columns;
> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 14 June 2016 at 09:57, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
> Hi there,
> 
> you can reproduce the messages below with Hive 1.2.1.
> 
> Best regards
> Michael
> 
> 
>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> 
>> which version of Hive are you using?
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 13 June 2016 at 16:00, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
>> Hi there,
>> 
>> 
>> when testing column statistics I stumbled upon the following error message:
>> 
>> DROP TABLE IF EXISTS foo;
>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar STRUCT<key:STRING,value:STRING>);
>> 
>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.
>> 
>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but struct<key:string,value:string> is passed.
>> 
>> 
>> 1) Basically, it seems that column statistics don't work for non-primitive types. Are there any workarounds or any plans to change this?
>> 
>> 2) Furthermore, the convenience syntax to compute statistics for all columns does not work as soon as there is a non-supported column. Are there any plans to change this, so it is easier to compute statistics for all supported columns?
>> 
>> 3) ANALYZE TABLE will only provide the first failing *type* in the error message. Especially for wide tables it would be much easier if all non-supported column *names* would be printed.
>> 
>> 
>> Best regards
>> Michael
>> 
>> 
> 
>

Re: column statistics for non-primitive types

Posted by Mich Talebzadeh <mi...@gmail.com>.

Hi Michael,

Statistics for columns in Hive are kept in Hive metadata table
tab_col_stats.

When I am looking at this table in Oracle, I only see statistics for
primitives columns here. STRUCT columns do not have it as a STRUCT column
will have to be broken into its primitive columns.  I don't think Hive has
the means to do that.

desc tab_col_stats;
 Name
Null?    Type
 ------------------------------------------------------------------------
-------- -------------------------------------------------
 CS_ID
NOT NULL NUMBER
 DB_NAME
NOT NULL VARCHAR2(128)
 TABLE_NAME
NOT NULL VARCHAR2(128)
 COLUMN_NAME
NOT NULL VARCHAR2(1000)
 COLUMN_TYPE
NOT NULL VARCHAR2(128)
 TBL_ID
NOT NULL NUMBER
 LONG_LOW_VALUE
NUMBER
 LONG_HIGH_VALUE
NUMBER
 DOUBLE_LOW_VALUE
NUMBER
 DOUBLE_HIGH_VALUE
NUMBER
 BIG_DECIMAL_LOW_VALUE
VARCHAR2(4000)
 BIG_DECIMAL_HIGH_VALUE
VARCHAR2(4000)
 NUM_NULLS
NOT NULL NUMBER
 NUM_DISTINCTS
NUMBER
 AVG_COL_LEN
NUMBER
 MAX_COL_LEN
NUMBER
 NUM_TRUES
NUMBER
 NUM_FALSES
NUMBER
 LAST_ANALYZED
NOT NULL NUMBER



 So in summary although column type STRUCT do exit, I don't think Hive can
cater for their statistics. Actually I don't think Oracle itself does it.

HTH

P.S. I am on Hive 2 and it does not.

hive> analyze table foo compute statistics for columns;
FAILED: UDFArgumentTypeException Only primitive type arguments are accepted
but array<bigint> is passed.


Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 09:57, Michael Häusler <mi...@akatose.de> wrote:

> Hi there,
>
> you can reproduce the messages below with Hive 1.2.1.
>
> Best regards
> Michael
>
>
> On 2016-06-13, at 22:21, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> which version of Hive are you using?
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 13 June 2016 at 16:00, Michael Häusler <mi...@akatose.de> wrote:
>
>> Hi there,
>>
>>
>> when testing column statistics I stumbled upon the following error
>> message:
>>
>> DROP TABLE IF EXISTS foo;
>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>> STRUCT<key:STRING,value:STRING>);
>>
>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>> accepted but array<bigint> is passed.
>>
>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>> accepted but struct<key:string,value:string> is passed.
>>
>>
>> 1) Basically, it seems that column statistics don't work for
>> non-primitive types. Are there any workarounds or any plans to change this?
>>
>> 2) Furthermore, the convenience syntax to compute statistics for all
>> columns does not work as soon as there is a non-supported column. Are there
>> any plans to change this, so it is easier to compute statistics for all
>> supported columns?
>>
>> 3) ANALYZE TABLE will only provide the first failing *type* in the error
>> message. Especially for wide tables it would be much easier if all
>> non-supported column *names* would be printed.
>>
>>
>> Best regards
>> Michael
>>
>>
>
>

Re: column statistics for non-primitive types

Posted by Michael Häusler <mi...@akatose.de>.

Hi there,

you can reproduce the messages below with Hive 1.2.1.

Best regards
Michael


> On 2016-06-13, at 22:21, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> which version of Hive are you using?
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 13 June 2016 at 16:00, Michael Häusler <michael@akatose.de <ma...@akatose.de>> wrote:
> Hi there,
> 
> 
> when testing column statistics I stumbled upon the following error message:
> 
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar STRUCT<key:STRING,value:STRING>);
> 
> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.
> 
> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but struct<key:string,value:string> is passed.
> 
> 
> 1) Basically, it seems that column statistics don't work for non-primitive types. Are there any workarounds or any plans to change this?
> 
> 2) Furthermore, the convenience syntax to compute statistics for all columns does not work as soon as there is a non-supported column. Are there any plans to change this, so it is easier to compute statistics for all supported columns?
> 
> 3) ANALYZE TABLE will only provide the first failing *type* in the error message. Especially for wide tables it would be much easier if all non-supported column *names* would be printed.
> 
> 
> Best regards
> Michael
> 
>

Re: column statistics for non-primitive types

Posted by Mich Talebzadeh <mi...@gmail.com>.

which version of Hive are you using?

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 13 June 2016 at 16:00, Michael Häusler <mi...@akatose.de> wrote:

> Hi there,
>
>
> when testing column statistics I stumbled upon the following error message:
>
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
> STRUCT<key:STRING,value:STRING>);
>
> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
> FAILED: UDFArgumentTypeException Only primitive type arguments are
> accepted but array<bigint> is passed.
>
> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
> FAILED: UDFArgumentTypeException Only primitive type arguments are
> accepted but struct<key:string,value:string> is passed.
>
>
> 1) Basically, it seems that column statistics don't work for non-primitive
> types. Are there any workarounds or any plans to change this?
>
> 2) Furthermore, the convenience syntax to compute statistics for all
> columns does not work as soon as there is a non-supported column. Are there
> any plans to change this, so it is easier to compute statistics for all
> supported columns?
>
> 3) ANALYZE TABLE will only provide the first failing *type* in the error
> message. Especially for wide tables it would be much easier if all
> non-supported column *names* would be printed.
>
>
> Best regards
> Michael
>
>