Posted to issues@impala.apache.org by "Antoni Ivanov (JIRA)" <ji...@apache.org> on 2018/01/04 16:53:00 UTC

[jira] [Created] (IMPALA-6367) Compute stats do not update statistics for big tables

Antoni Ivanov created IMPALA-6367:
-------------------------------------

             Summary: Compute stats do not update statistics for big tables
                 Key: IMPALA-6367
                 URL: https://issues.apache.org/jira/browse/IMPALA-6367
             Project: IMPALA
          Issue Type: Bug
          Components: Backend, Catalog
    Affects Versions: Impala 2.8.0
         Environment: Impala v2.8.0-cdh5.11.1
Hive Metastore database: the embedded PostgreSQL 8.4.20 provided by Cloudera
OS: CentOS

            Reporter: Antoni Ivanov


The table has at least 10000 partitions and 100 columns.
It is partitioned by day (bigint) and by a string column (the cardinality of the string partition is no greater than 100).

Executing compute incremental stats without dynamic partitioning takes about 1 hour.
So we use partitioning:

compute incremental stats table stats partition (some-condition)
(I tried (day = X), (day = X, string_part = Y), and (day < X and day > X - 3 days).)

It finishes successfully, but when I run show table stats, every partition in the range shows the following:
day			string_part		#Rows		Incremental stats

1409529600		foo1		0		false
1409529600		foo2		0		false

#Rows is 0 (although the partition is not empty) and the "Incremental stats" column is set to false.
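
For reference, the statements in this first case might look like the sketch below. The table name big_table is hypothetical, the partition column names are taken from the output above, and the three-day range assumes that day holds a Unix timestamp in seconds, which the report does not state.

```sql
-- Hypothetical table name: big_table. Column names come from the output above.

-- Variant 1: all partitions for one value of day
COMPUTE INCREMENTAL STATS big_table PARTITION (day = 1409529600);

-- Variant 2: a single fully specified partition
COMPUTE INCREMENTAL STATS big_table PARTITION (day = 1409529600, string_part = 'foo1');

-- Variant 3: a range of days, mirroring (day < X and day > X - 3 days);
-- 259200 = 3 * 86400 seconds (assumption: day is in seconds)
COMPUTE INCREMENTAL STATS big_table PARTITION (day < 1409529600 AND day > 1409529600 - 259200);

-- Inspect #Rows and the "Incremental stats" column afterwards
SHOW TABLE STATS big_table;
```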


Another case:
If I execute compute incremental stats table stats partition

and then show table stats, I get:

day			string_part		#Rows		Incremental stats

1409529600		foo1		13		false
1409529600		foo2		13		false


#Rows is updated, but "Incremental stats" remains false.
This usually happens for smaller tables.

Note that the same happens if I do not use a partition clause.
Note also that I ran compute stats (without incremental) only for the big table (on our test server), and it had the same effect.

Note that on production it happens intermittently (not always) for small tables (#Rows is 0 after compute stats),
but for the biggest tables it always happens.

In Impala there are 2500 tables with almost 900,000 partitions (across all tables) and an average of 20 columns per table (or 90,000 columns across all tables). The biggest table has about 35000 partitions.
We are using the PostgreSQL provided by Cloudera as the Hive Metastore backend.

I am able to reproduce the issue in our testing setup. It has fewer than 100 tables and only one is big: 35000 partitions (which I copied from prod).


 


