Posted to dev@kylin.apache.org by 张磊 <12...@qq.com> on 2016/11/01 06:49:51 UTC

Re: Exceed scan threshold at 10000001

Hi,


There is something I cannot see; could you send it again?




Query:
select sum(m1) from table group by dim1,dim2 limit 10001
select sum(m1) from table group by dim1,dim2 limit 10000


Cube:
dimensions contain only dim1 and dim2


count(distinct dim1) = 200000, count(distinct dim2) = 20000


The query results are:
limit 10000  --> scans 1000 rows

limit 10001  --> scans millions of rows


I do not know why. Is the data not balanced? How can I avoid this?


Who can help me? Thanks!








------------------ Original Message ------------------
From: "a.ramonportoles" <a....@gmail.com>
Sent: Friday, October 28, 2016, 6:51 PM
To: "dev" <de...@kylin.apache.org>

Subject: Re: Exceed scan threshold at 10000001



Q1: If I query select count(1) from table group by letter,number limit 2, should it scan only the first two rows (the letter,number Agg group)?


A1: Kylin builds the HBase row key from the dimensions.


Kylin is "smart" when composing the HBase row key:
grouping / filtering by Dim1 is not the same as by Dim3   :)

Dim1: range scan --> you read only what you need --> fast

Dim3: full scan --> you read more rows than you need --> slow


How to solve it? (I think) you can build several cubes / use different aggregation groups in the same project.
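
For example (just a sketch, not tested: Dim1/Dim3 follow the naming above, table and m1 come from the first mail, and the filter values are invented; the real behavior depends on the cube's row key order and aggregation groups):

select sum(m1) from table where Dim1 = 'A' group by Dim1
   -- Dim1 leads the row key --> range scan, fast

select sum(m1) from table where Dim3 = 'X' group by Dim3
   -- Dim3 sits near the tail of the row key --> (nearly) full scan, slow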
 




Q2: When I query select count(1) from table group by letter limit 2, should it scan just the two rows (the letter Agg group)?


A2: Yes, if you define count(1) as a measure and letter as a dimension, you will have pre-calculated results.
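
To put numbers on it (a sketch only, using the Letter / Number example data quoted further down in this thread):

select count(1), Letter, Number from TB group by Letter, Number
   -- served by the (Letter, Number) cuboid: 6 pre-calculated rows

select count(1), Letter from TB group by Letter
   -- served by the (Letter) cuboid: only 2 pre-calculated rows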



Also: check the cardinality of your data. This isn't normal:

limit 10000  --> scans 1000 rows

limit 10001  --> scans millions of rows

If this is true, your data isn't balanced; I don't know of any solution for that.


Alb















2016-10-28 10:01 GMT+02:00 张磊 <12...@qq.com>:
Kylin does put pre-calculated results in HBase. If the cube desc is as below:
 Dimensions: letter, number
 Measures: count
 then in HBase the results are:
 count letter number
 1         A        1
 1         A        2
 1         B        1
 1         B        2
 1         B        3
 1         B        4
 count  letter
 2          A
 4          B
 If I query select count(1) from table group by letter,number limit 2, it should scan only the first two rows (the letter,number Agg group)?
 When I query select count(1) from table group by letter limit 2, it should scan just the two rows (the letter Agg group).
 Am I right?
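
 For instance (a sketch of the expected answers, derived from the rows listed above):
 select count(1), letter, number from table group by letter, number limit 2  -->  (1, A, 1), (1, A, 2)
 select count(1), letter from table group by letter limit 2                  -->  (2, A), (4, B)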
 
 
 ------------------ Original Message ------------------
 From: "a.ramonportoles" <a....@gmail.com>
 Sent: Friday, October 28, 2016, 3:43 PM
 To: "dev" <de...@kylin.apache.org>
 
 Subject: Re: Exceed scan threshold at 10000001
 
 
 
 Hmmm,
 but you are using "group by LO_CUSTKEY,LO_PARTKEY",
 
 and LIMIT applies to the final result, not to the scanned rows.
 
 Example:
 table with two columns Letter / Number
 A:1
 A:2
 B:1
 B:2
 B:3
 B:4
 
 select count (1), Letter from TB group by Letter limit 1
    Result: 2:A
    Scans 2 rows
 
 select count (1), Letter from TB group by Letter limit 2
    Result: 2:A
                4:B
    Scans 2 + 4 rows
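 
 A third case (a sketch, same data), where the group by matches the stored (Letter, Number) combination exactly, so (as asked in Q1 above) the limit can in principle stop the scan early:
 
 select count (1), Letter, Number from TB group by Letter, Number limit 2
    Result: 1:A,1
                1:A,2
    Scans 2 rows, because each stored row is already one final result row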
 
 
 Alb
 
 
 
 2016-10-28 8:33 GMT+02:00 张磊 <12...@qq.com>:
 
 > Query1: select count(1),sum(LO_REVENUE) from lineorder group by
 > LO_CUSTKEY,LO_PARTKEY
 > LIMIT 10000
 >
 >
 > I find it scans 10000 rows from HBase.
 >
 >
 > Query2: select count(1),sum(LO_REVENUE) from lineorder group by
 > LO_CUSTKEY,LO_PARTKEY
 > LIMIT 10001
 >
 >
 > I find it scans 10000001 rows from HBase.
 >
 >
 > I do not know why. Shouldn't it scan only 10001 rows?
 >
 >
 > Both queries scan the same HTable KYLIN_78ROC49NQY.
 > Kylin log: Endpoint RPC returned from HTable KYLIN_78ROC49NQY
 >
 >
 >
 >
 > ------------------ Original Message ------------------
 > From: "ShaoFeng Shi" <sh...@apache.org>
 > Sent: Friday, October 28, 2016, 11:20 AM
 > To: "dev" <de...@kylin.apache.org>
 >
 > Subject: Re: Exceed scan threshold at 10000001
 >
 >
 >
 > Alberto, thanks for your explanation; you got the point and are already a
 > Kylin expert, I believe.
 >
 > In order to protect HBase and Kylin from being crashed by bad queries (which
 > scan too many rows), Kylin adds this mechanism to interrupt the query when it
 > reaches a certain threshold. Usually in an OLAP scenario the result wouldn't be
 > too large. This is also a reminder for the user to rethink the design; if you
 > really want the threshold to be enlarged, you can allocate more memory to Kylin
 > and set "kylin.query.mem.budget" to a bigger value.
 >
 > 2016-10-27 18:39 GMT+08:00 Alberto Ramón <a....@gmail.com>:
 >
 > > NOTE: I'm not an expert on Kylin  ;)
 > >
 > > Is the where clause mandatory? No.
 > > Is it recommended? Yes.
 > > Does it bypass the threshold? No, I think this limit is hardcoded (?)
 > >
 > > The real question should be: why does this limit exist? (opinion)
 > > - The target of Kylin is real / near-real-time; limiting rows --> limiting response time
 > > - If you are using JDBC, this is not a good option performance-wise
 > > - It protects the HBase coprocessor
 > > - Perhaps you need a new Dim, to pre-calculate this aggregate or to filter by this new Dim
 > >
 > > For extra-large queries, you can also check:
 > >  - kylin.query.mem.budget = 3GB
 > >  - hbase.server.scanner.max.result.size = 100MB  (a limit from HBase; you can disable it with -1)
 > >
 > > Good Luck, Alb
 > >
 > > 2016-10-27 11:56 GMT+02:00 张磊 <12...@qq.com>:
 > >
 > > > Do you mean that when I query, I should add a where clause?
 > > > But in some cases the number of records > threshold; what can I do then?
 > > > For example, ordering by all groups when the number of all the groups >
 > > > threshold.
 > > >
 > > >
 > > >
 > > >
 > > > ------------------ Original Message ------------------
 > > > From: "Alberto Ramón" <a....@gmail.com>
 > > > Sent: Thursday, October 27, 2016, 5:47 PM
 > > > To: "dev" <de...@kylin.apache.org>
 > > >
 > > > Subject: Re: Exceed scan threshold at 10000001
 > > >
 > > >
 > > >
 > > >  ERROR: Scan row count exceeded threshold
 > > >
 > > > MailList:
 > > > <http://mail-archives.apache.org/mod_mbox/kylin-user/201608.mbox/%3CCALjEW7M_YYi7Xs55OqPdxS6pzNvD0%2BamN2AX3hetnF0%3D9uFnow%40mail.gmail.com%3E>
 > > > KYLIN-1787 <https://issues.apache.org/jira/browse/KYLIN-1787>, v1.5.3
 > > >
 > > > *Scan row count exceeded threshold: 1000000, please add filter condition to narrow down backend scan range, like where clause*
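 > > >
 > > > For example (illustrative only; the filter range is invented, and it helps most
 > > > when the filtered column is a leading dimension in the cube's row key), the
 > > > lineorder query quoted below could be narrowed like this:
 > > >
 > > > select count(1), sum(LO_REVENUE) from lineorder
 > > > where LO_CUSTKEY between 1 and 1000
 > > > group by LO_CUSTKEY, LO_PARTKEY
 > > > order by LO_CUSTKEY, LO_PARTKEY
 > > > limit 50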
 > > >
 > > >
 > > > BR, Alb
 > > >
 > > > 2016-10-27 11:40 GMT+02:00 张磊 <12...@qq.com>:
 > > >
 > > > > Hi
 > > > >
 > > > >
 > > > > When I run a SQL query, I do not know why it needs to scan so many rows
 > > > > from HBase. What can I do?
 > > > > Thanks!
 > > > >
 > > > >
 > > > > Table: lineorder, 12,000,000 rows
 > > > > Dimensions: LO_CUSTKEY,LO_PARTKEY
 > > > > Measures: count(1), sum(LO_REVENUE)
 > > > >
 > > > >
 > > > > Query SQL: select count(1),sum(LO_REVENUE) from lineorder group by
 > > > > LO_CUSTKEY,LO_PARTKEY order by LO_CUSTKEY,LO_PARTKEY limit 50
 > > > >
 > > > >
 > > > > I built a cube with the two dimensions and the two measures (count and
 > > > > sum); the size of the HTable is 98 MB. When I execute the query in Insight,
 > > > > it shows "Error in coprocessor", and when I check the HBase log I find the
 > > > > messages below:
 > > > >
 > > > >
 > > > > 2016-10-27 02:06:13,470 INFO  [B.defaultRpcServer.handler=4,queue=1,port=16020] gridtable.GTScanRequest: pre aggregation is not beneficial, skip it
 > > > > 2016-10-27 02:06:13,470 INFO  [B.defaultRpcServer.handler=4,queue=1,port=16020] endpoint.CubeVisitService: Scanned 1 rows from HBase.
 > > > >
 > > > >
 > > > > 2016-10-27 02:24:20,884 INFO  [B.defaultRpcServer.handler=6,queue=0,port=16020] endpoint.CubeVisitService: Scanned 9999001 rows from HBase.
 > > > > 2016-10-27 02:24:20,889 INFO  [B.defaultRpcServer.handler=6,queue=0,port=16020] endpoint.CubeVisitService: The cube visit did not finish normally because scan num exceeds threshold
 > > > > org.apache.kylin.gridtable.GTScanExceedThresholdException: Exceed scan threshold at 10000001
 > > > >         at org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService$2.hasNext(CubeVisitService.java:267)
 > > > >         at org.apache.kylin.storage.hbase.cube.v2.HBaseReadonlyStore$1$1.hasNext(HBaseReadonlyStore.java:111)
 > > > >         at org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService.visitCube(CubeVisitService.java:299)
 > > > >         at org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.generated.CubeVisitProtos$CubeVisitService.callMethod(CubeVisitProtos.java:3952)
 > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7815)
 > > > >         at org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1986)
 > > > >         at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1968)
 > > > >         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33652)
 > > > >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2178)
 > > > >         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
 > > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
 > > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
 > > > >         at java.lang.Thread.run(Thread.java:745)
 > > >
 > >
 >
 >
 >
 > --
 > Best regards,
 >
 > Shaofeng Shi 史少锋
 >