Posted to issues@kylin.apache.org by "Shaofeng SHI (JIRA)" <ji...@apache.org> on 2016/11/06 08:11:58 UTC

[jira] [Created] (KYLIN-2165) Use hive table statistics data to get the total count

Shaofeng SHI created KYLIN-2165:
-----------------------------------

             Summary: Use hive table statistics data to get the total count
                 Key: KYLIN-2165
                 URL: https://issues.apache.org/jira/browse/KYLIN-2165
             Project: Kylin
          Issue Type: Improvement
          Components: Job Engine
            Reporter: Shaofeng SHI
            Assignee: Shaofeng SHI
             Fix For: v1.6.0


Kylin runs a count on the intermediate flat Hive table to get the total number of rows, and then uses that number to redistribute the table.

According to Hive's wiki, Hive automatically collects table statistics when it runs an "insert overwrite" statement, so a subsequent "select count(*)" is very fast. Kylin, however, executes "INSERT OVERWRITE DIRECTORY '/kylin/row_count' SELECT count(*) from", which still starts an MR/Tez job and makes the step take longer.

Changing the SQL to a plain "select count(*)", or using the Hive API to read the statistics, would save that cost.
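
A rough sketch of the "read the statistics" option, not Kylin's actual implementation. It assumes the table's parameter map has already been fetched from the Hive metastore and that the row count is stored there under the key "numRows". The class and method names are hypothetical, and the metastore call itself is left out. When the statistic is missing or unusable, the caller would fall back to running the count query.

```java
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical helper, for illustration only.
class HiveRowCountStats {
    // Assumed key under which Hive keeps the row-count statistic
    // in a table's parameter map.
    static final String NUM_ROWS_KEY = "numRows";

    // Returns the row count recorded in the table parameters, or an
    // empty value if it is absent, negative, or not a number.
    static OptionalLong rowCountFromStats(Map<String, String> tableParams) {
        if (tableParams == null) {
            return OptionalLong.empty();
        }
        String raw = tableParams.get(NUM_ROWS_KEY);
        if (raw == null) {
            return OptionalLong.empty();
        }
        try {
            long n = Long.parseLong(raw.trim());
            // A negative value is treated here as "statistic not collected".
            return n >= 0 ? OptionalLong.of(n) : OptionalLong.empty();
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }
}
```

An empty result means the statistics cannot be used, so the existing count step would still be needed in that case.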


