Posted to user@kylin.apache.org by cheney <53...@qq.com> on 2018/11/05 10:55:02 UTC
Re: doubt about measure of processedRowCount
Yes. The log is as follows.
2018-11-02 22:25:34,980 DEBUG [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] gtrecord.StorageResponseGTScatter:88 : Using SortMergedPartitionResultIterator to merge 103 partition results
2018-11-02 22:25:34,982 INFO [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] gtrecord.SequentialCubeTupleIterator:73 : Using Iterators.concat to merge segment results
2018-11-02 22:25:34,982 DEBUG [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] enumerator.OLAPEnumerator:122 : return TupleIterator...
2018-11-02 22:25:34,991 INFO [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] service.QueryService:897 : Processed rows for each storageContext: 366
2018-11-02 22:25:34,991 INFO [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] service.QueryService:422 : Stats of SQL response: isException: false, duration: 20, total scan count 1552
According to the log, valueA = 366. valueB = (total scan count) 1552 - (total aggregated/filtered in HBase) 270 = 1282.
valueB is much larger than valueA.
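The arithmetic above can be reproduced with a small sketch. It is only an illustration: the two log lines are copied from this message, the helper name extract_int is made up for the example, and the value 270 is entered by hand because the HBase coprocessor log lines it came from are not shown in the thread.

```python
import re

# Two log lines copied from the message above.
LOG = """\
2018-11-02 22:25:34,991 INFO [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] service.QueryService:897 : Processed rows for each storageContext: 366
2018-11-02 22:25:34,991 INFO [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] service.QueryService:422 : Stats of SQL response: isException: false, duration: 20, total scan count 1552
"""


def extract_int(pattern, text):
    """Return the first integer captured by `pattern`, or raise if it is absent."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"pattern not found: {pattern!r}")
    return int(match.group(1))


# valueA: processedRowCount as logged by QueryService.
value_a = extract_int(r"Processed rows for each storageContext: (\d+)", LOG)
total_scan_count = extract_int(r"total scan count (\d+)", LOG)

# Assumption: 270 is the filtered/aggregated total read from the HBase
# coprocessor logs; those lines are not part of the excerpt above.
filtered_or_aggregated_in_hbase = 270

# valueB: rows returned by HBase = scanned rows - rows filtered/aggregated there.
value_b = total_scan_count - filtered_or_aggregated_in_hbase
gap = value_b - value_a

print(value_a, value_b, gap)  # 366 1282 916
```

The gap of 916 rows is what still needs explaining: rows that left HBase but were not counted as processed rows.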
------------------ Original Message ------------------
From: "JiaTao Tao"<ta...@gmail.com>;
Sent: Monday, Nov 5, 2018, 2:41 PM
To: "user"<us...@kylin.apache.org>;
Subject: Re: doubt about measure of processedRowCount
Can you grep logs like "to merge segment results" in that scenario?
cheney <53...@qq.com> wrote on Sat, Nov 3, 2018 at 4:15 PM:
Thanks for your reply, but I am sure there is only one OlapContext in the query in my scenario.
---Original---
From: "JiaTao Tao"<ta...@gmail.com>
Date: Sat, Nov 3, 2018 10:42 AM
To: "user"<us...@kylin.apache.org>;
Subject: Re: doubt about measure of processedRowCount
Maybe adding up all the valueA values would be more appropriate, because there may be more than one OlapContext in the query (one OlapContext corresponds to one storageContext).
There are two good blog posts about Kylin's query engine that you may want to look at :).
https://blog.csdn.net/yu616568/article/details/50838504
https://zhuanlan.zhihu.com/p/30613434
cheney <53...@qq.com> wrote on Fri, Nov 2, 2018 at 11:10 PM:
Hi, guys
When I execute a SQL query in Kylin, the Kylin server logs some query statistics. For example, the log looks like this:
"Processed rows for each storageContext: valueA". valueA is processedRowCount.
My understanding is that processedRowCount is the number of records returned by HBase.
The HBase coprocessor logs per-region stats, including "Total scanned row" and "Total filtered/aggred row".
For one region, final records returned by HBase = Total scanned row - Total filtered/aggred row.
Suppose this query needs to scan 10 regions in HBase. We can get the stats for every region, and we can get the total number of records returned by HBase, valueB, by summing the final records of the 10 regions.
In general, valueA is equal to valueB, but sometimes valueB is much larger than valueA. Why?
--
Regards!
Aron Tao
Re: doubt about measure of processedRowCount
Posted by JiaTao Tao <ta...@gmail.com>.
Thanks, Shaofeng, for your affirmation :).
ShaoFeng Shi <sh...@apache.org> wrote on Wed, Nov 7, 2018 at 9:29 AM:
> Good job Jiatao! I appreciate your support to the community!
--
Regards!
Aron Tao
Re: doubt about measure of processedRowCount
Posted by ShaoFeng Shi <sh...@apache.org>.
Good job Jiatao! I appreciate your support to the community!
JiaTao Tao <ta...@gmail.com> wrote on Wed, Nov 7, 2018 at 9:17 AM:
> Very glad that my reply was helpful. I have already opened a JIRA to add logs
> for GTStreamAggregateScanner, so next time this should be much easier to
> track down :).
--
Best regards,
Shaofeng Shi 史少锋
Re: doubt about measure of processedRowCount
Posted by JiaTao Tao <ta...@gmail.com>.
Very glad that my reply was helpful. I have already opened a JIRA to add logs for
GTStreamAggregateScanner, so next time this should be much easier to
track down :).
cheney <53...@qq.com> wrote on Tue, Nov 6, 2018 at 11:57 PM:
> Hi, JiaTao, thank you very much! The statistics are correct when I set "kylin.query.stream-aggregate-enabled=false".
> You are right. Records are pre-aggregated by GTStreamAggregateScanner.
--
Regards!
Aron Tao
Re: doubt about measure of processedRowCount
Posted by cheney <53...@qq.com>.
Hi, JiaTao, thank you very much! The statistics are correct when I set "kylin.query.stream-aggregate-enabled=false". You are right. Records are pre-aggregated by GTStreamAggregateScanner.
------------------ Original Message ------------------
From: "JiaTao Tao"<ta...@gmail.com>;
Sent: Tuesday, Nov 6, 2018, 10:50 PM
To: "user"<us...@kylin.apache.org>;
Subject: Re: doubt about measure of processedRowCount
One possible place I can find in the code is the use of GTStreamAggregateScanner (in "SegmentCubeTupleIterator.java#111"). You can see that it does aggregate in "GTStreamAggregateScanner.AbstractStreamMergeIterator#next", so it will reduce the number of input records. But there is no logging in this class, as you can see, so it is pretty hard to confirm. Try "kylin.query.stream-aggregate-enabled=false" and run the scenario again to see whether anything changes.
Re: doubt about measure of processedRowCount
Posted by JiaTao Tao <ta...@gmail.com>.
One possible place I can find in the code is the use of GTStreamAggregateScanner
(in "SegmentCubeTupleIterator.java#111"). You can see that it does
aggregate in "GTStreamAggregateScanner.AbstractStreamMergeIterator#next",
so it will reduce the number of input records. But there is no logging in this
class, as you can see, so it is pretty hard to confirm. Try
"kylin.query.stream-aggregate-enabled=false" and run the scenario again to
see whether anything changes.
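To illustrate why a stream-merge aggregation step can make the processed row count smaller than the number of rows returned by storage, here is a minimal sketch. It is not Kylin's implementation: the function stream_aggregate, the sample partitions, and the use of a plain sum as the measure are all assumptions made for the example. It only shows the general idea of merging key-sorted partition results and collapsing records that share a key.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def stream_aggregate(partitions):
    """Merge key-sorted partitions and sum the measures of records with equal keys."""
    # heapq.merge lazily interleaves the already sorted inputs by key.
    merged = heapq.merge(*partitions, key=itemgetter(0))
    # Equal keys are now adjacent, so groupby can collapse them in one pass.
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


# Hypothetical partition results: (dimension key, measure) pairs, each sorted by key.
partitions = [
    [("a", 1), ("b", 2), ("c", 3)],
    [("a", 10), ("c", 30)],
    [("b", 200), ("d", 400)],
]

rows_in = sum(len(p) for p in partitions)   # rows returned by storage
rows_out = list(stream_aggregate(partitions))  # rows seen after the merge step

print(rows_in, len(rows_out))  # 7 4
```

In this toy case 7 rows come back from storage but only 4 rows remain after the merge, which is the same kind of gap as between valueB and valueA in the log.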
cheney <53...@qq.com> wrote on Mon, Nov 5, 2018 at 6:55 PM:
> Yes. The log is as follows.
>
> 2018-11-02 22:25:34,980 DEBUG [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] gtrecord.StorageResponseGTScatter:88 : Using SortMergedPartitionResultIterator to merge 103 partition results
> 2018-11-02 22:25:34,982 INFO [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] gtrecord.SequentialCubeTupleIterator:73 : Using Iterators.concat to merge segment results
> 2018-11-02 22:25:34,982 DEBUG [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] enumerator.OLAPEnumerator:122 : return TupleIterator...
> 2018-11-02 22:25:34,991 INFO [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] service.QueryService:897 : Processed rows for each storageContext: 366
> 2018-11-02 22:25:34,991 INFO [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] service.QueryService:422 : Stats of SQL response: isException: false, duration: 20, total scan count 1552
>
> According to the log, valueA = 366. valueB = (total scan count) 1552 -
> (total aggregated/filtered in HBase) 270 = 1282.
> valueB is much larger than valueA.
--
Regards!
Aron Tao