You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@kylin.apache.org by "Shaofeng SHI (JIRA)" <ji...@apache.org> on 2016/11/06 08:13:58 UTC
[jira] [Updated] (KYLIN-2165) Use hive table statistics data to get the total count

     [ https://issues.apache.org/jira/browse/KYLIN-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shaofeng SHI updated KYLIN-2165:
--------------------------------
    Description: 
Kylin will count on the intermediate flat hive table to get the total row number, then to redistribute that.

From hive's wiki, hive will automatically collect the table statistics when run a "insert overwrite" statement, then the subsequent "select count(*)" will be very fast. While, Kylin is executing "INSERT OVERWRITE DIRECTORY '/kylin/row_count' SELECT count(*) from", which still cause MR/Tez job be started, this will cause the step take longer time.

Just change the SQL to "select count(*)" or using Hive API to get the statistic, the cost will be saved. 

Here is a sample, the table 'kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5' is an intermediate table :

If directly run "count(*)", it is pretty fast:
hive> select count(*) from kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
OK
970033
Time taken: 0.112 seconds, Fetched: 1 row(s)


While today Kylin's SQL will cause a job be started:

hive> INSERT OVERWRITE DIRECTORY '/kylin/row_count' select count(*) from kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
Query ID = root_20161106080808_0099b622-c0bd-41da-aee5-2321adf7bdda
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1463701915919_46208, Tracking URL =



  was:
Kylin will count on the intermediate flat hive table to get the total row number, then to redistribute that.

From hive's wiki, hive will automatically collect the table statistics when run a "insert overwrite" statement, then the subsequent "select count(*)" will be very fast. While, Kylin is executing "INSERT OVERWRITE DIRECTORY '/kylin/row_count' SELECT count(*) from", which still cause MR/Tez job be started, this will cause the step take longer time.

Just change the SQL to "select count(*)" or using Hive API to get the statistic, the cost will be saved. 


> Use hive table statistics data to get the total count
> -----------------------------------------------------
>
>                 Key: KYLIN-2165
>                 URL: https://issues.apache.org/jira/browse/KYLIN-2165
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Job Engine
>            Reporter: Shaofeng SHI
>            Assignee: Shaofeng SHI
>             Fix For: v1.6.0
>
>
> Kylin will count on the intermediate flat hive table to get the total row number, then to redistribute that.
> From hive's wiki, hive will automatically collect the table statistics when run a "insert overwrite" statement, then the subsequent "select count(*)" will be very fast. While, Kylin is executing "INSERT OVERWRITE DIRECTORY '/kylin/row_count' SELECT count(*) from", which still cause MR/Tez job be started, this will cause the step take longer time.
> Just change the SQL to "select count(*)" or using Hive API to get the statistic, the cost will be saved. 
> Here is a sample, the table 'kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5' is an intermediate table :
> If directly run "count(*)", it is pretty fast:
> hive> select count(*) from kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
> OK
> 970033
> Time taken: 0.112 seconds, Fetched: 1 row(s)
> While today Kylin's SQL will cause a job be started:
> hive> INSERT OVERWRITE DIRECTORY '/kylin/row_count' select count(*) from kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
> Query ID = root_20161106080808_0099b622-c0bd-41da-aee5-2321adf7bdda
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> Starting Job = job_1463701915919_46208, Tracking URL =



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)