Posted to issues@kylin.apache.org by "Dayue Gao (JIRA)" <ji...@apache.org> on 2017/02/08 14:04:41 UTC

[jira] [Updated] (KYLIN-2438) replace scan threshold with max scan bytes

     [ https://issues.apache.org/jira/browse/KYLIN-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dayue Gao updated KYLIN-2438:
-----------------------------
    Description: 
To guard against bad queries that can consume lots of memory and potentially crash the Kylin or HBase server, Kylin limits the maximum number of rows a query can scan. The maximum value is chosen based on two configs:
# *kylin.query.scan.threshold* is used if the query doesn't contain memory-hungry metrics
# otherwise, *kylin.query.mem.budget* / estimated_row_size is used as the per-region maximum.
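
The selection rule above can be sketched in Python as follows. This is only an illustration of the behaviour described here, not Kylin's actual code: the function and parameter names are hypothetical, and the numbers in the usage lines are made-up values rather than Kylin defaults.

```python
def max_rows_to_scan(scan_threshold, mem_budget_bytes,
                     estimated_row_size, has_memory_hungry_metrics):
    """Return the row limit implied by the two configs (hypothetical helper)."""
    if estimated_row_size <= 0:
        raise ValueError("estimated_row_size must be positive")
    if not has_memory_hungry_metrics:
        # Corresponds to kylin.query.scan.threshold
        return scan_threshold
    # Corresponds to kylin.query.mem.budget / estimated_row_size (per region)
    return mem_budget_bytes // estimated_row_size


# Illustrative values only (assumptions, not defaults)
plain_limit = max_rows_to_scan(10_000_000, 3 * 1024**3, 1024, False)
heavy_limit = max_rows_to_scan(10_000_000, 3 * 1024**3, 1024, True)
print(plain_limit, heavy_limit)
```

Note that the second branch is only as good as estimated_row_size: if real rows are ten times larger than the estimate, the same row limit allows roughly ten times the intended memory use.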

This approach, however, has several deficiencies:
* It doesn't work well with complex, varlen metrics. The estimated threshold can be either too small or too large. If it's too small, good queries are killed. If it's too large, bad queries are not blocked.
* Row count doesn't correspond to memory consumption, so it's difficult to determine how large the scan threshold should be.
* kylin.query.scan.threshold can't be overridden at cube level.

In this JIRA, I propose to replace the current row-count-based threshold with a more intuitive size-based threshold:
* KYLIN-2437 will collect the number of bytes scanned at both region and query level
* A new configuration *kylin.query.max-scan-bytes* will be added to limit the maximum number of bytes a query can scan
* *kylin.query.mem.budget* will be renamed to *kylin.storage.hbase.coprocessor-max-scan-bytes*, which applies the limit at region level
* The above two configs can be overridden at cube level
* The old *kylin.query.scan.threshold* will be deprecated
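
A minimal Python sketch of how the proposed size-based limit with a cube-level override might behave is given below. It is an assumption-laden illustration, not the planned implementation: ScanLimitExceeded, effective_limit and ScanBudget are hypothetical names, the byte values are arbitrary, and only the two config keys come from this proposal.

```python
class ScanLimitExceeded(Exception):
    """Raised when a scan reads more bytes than its limit allows."""


def effective_limit(key, global_config, cube_overrides):
    # A cube-level value, when present, wins over the global one.
    return cube_overrides.get(key, global_config[key])


class ScanBudget:
    """Tracks bytes scanned so far against a maximum (hypothetical class)."""

    def __init__(self, max_scan_bytes):
        self.max_scan_bytes = max_scan_bytes
        self.scanned_bytes = 0

    def add(self, num_bytes):
        self.scanned_bytes += num_bytes
        if self.scanned_bytes > self.max_scan_bytes:
            raise ScanLimitExceeded(
                f"scanned {self.scanned_bytes} bytes, "
                f"limit is {self.max_scan_bytes}"
            )


# Arbitrary illustrative values, not defaults
global_config = {
    "kylin.query.max-scan-bytes": 1024 * 1024,
    "kylin.storage.hbase.coprocessor-max-scan-bytes": 256 * 1024,
}
cube_overrides = {"kylin.query.max-scan-bytes": 512 * 1024}

query_budget = ScanBudget(
    effective_limit("kylin.query.max-scan-bytes", global_config, cube_overrides)
)
query_budget.add(300 * 1024)  # within the 512 KiB cube-level limit
```

Because the check counts bytes rather than rows, it does not depend on any estimate of row size; a second add of 300 KiB in this sketch would raise ScanLimitExceeded.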

  was:
To guard against bad queries that can consume too much memory and then crash the Kylin or HBase server, Kylin limits the maximum number of rows a query can scan. The maximum value is determined by two configs:
# *kylin.query.scan.threshold* is used if the query doesn't contain memory-hungry metrics
# otherwise, *kylin.query.mem.budget* / estimated_row_size is used as the maximum per region.

This approach, however, has several deficiencies:
* It doesn't work well with complex, variable-length metrics. The estimated threshold can be either too small or too large. If it's too small, good queries are killed. If it's too large, bad queries are not blocked.
* Row count doesn't correspond to memory consumption, so it's difficult to determine how large the scan threshold should be.
* kylin.query.scan.threshold can't be overridden at cube level.

In this JIRA, I propose to replace the current row-count-based threshold with a more intuitive size-based threshold:
* KYLIN-2437 will collect the number of bytes scanned at both region and query level
* A new configuration *kylin.query.max-scan-bytes* will be added to limit the maximum number of bytes a query can scan in total
* *kylin.query.mem.budget* will be renamed to *kylin.storage.hbase.coprocessor-max-scan-bytes*, which applies the limit at region level
* The old *kylin.query.scan.threshold* will be deprecated


> replace scan threshold with max scan bytes
> ------------------------------------------
>
>                 Key: KYLIN-2438
>                 URL: https://issues.apache.org/jira/browse/KYLIN-2438
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Query Engine, Storage - HBase
>    Affects Versions: v1.6.0
>            Reporter: Dayue Gao
>            Assignee: Dayue Gao
>
> To guard against bad queries that can consume lots of memory and potentially crash the Kylin or HBase server, Kylin limits the maximum number of rows a query can scan. The maximum value is chosen based on two configs:
> # *kylin.query.scan.threshold* is used if the query doesn't contain memory-hungry metrics
> # otherwise, *kylin.query.mem.budget* / estimated_row_size is used as the per-region maximum.
> This approach, however, has several deficiencies:
> * It doesn't work well with complex, varlen metrics. The estimated threshold can be either too small or too large. If it's too small, good queries are killed. If it's too large, bad queries are not blocked.
> * Row count doesn't correspond to memory consumption, so it's difficult to determine how large the scan threshold should be.
> * kylin.query.scan.threshold can't be overridden at cube level.
> In this JIRA, I propose to replace the current row-count-based threshold with a more intuitive size-based threshold:
> * KYLIN-2437 will collect the number of bytes scanned at both region and query level
> * A new configuration *kylin.query.max-scan-bytes* will be added to limit the maximum number of bytes a query can scan
> * *kylin.query.mem.budget* will be renamed to *kylin.storage.hbase.coprocessor-max-scan-bytes*, which applies the limit at region level
> * The above two configs can be overridden at cube level
> * The old *kylin.query.scan.threshold* will be deprecated



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)