Posted to user@kylin.apache.org by cheney <53...@qq.com> on 2018/11/05 10:55:02 UTC

Re: doubt about measure of processedRowCount

Yes. The log is as follows.


2018-11-02 22:25:34,980 DEBUG [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] gtrecord.StorageResponseGTScatter:88 : Using SortMergedPartitionResultIterator to merge 103 partition results
2018-11-02 22:25:34,982 INFO  [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] gtrecord.SequentialCubeTupleIterator:73 : Using Iterators.concat to merge segment results
2018-11-02 22:25:34,982 DEBUG [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] enumerator.OLAPEnumerator:122 : return TupleIterator...
2018-11-02 22:25:34,991 INFO  [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] service.QueryService:897 : Processed rows for each storageContext: 366
2018-11-02 22:25:34,991 INFO  [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] service.QueryService:422 : Stats of SQL response: isException: false, duration: 20, total scan count 1552



According to the log, valueA = 366 and valueB = (total scan count) 1552 - (total aggregated/filtered in HBase) 270 = 1282.
valueB is much larger than valueA.
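
The arithmetic above can be sketched in Python as a quick check. This is only an illustration: the function name `rows_returned` is hypothetical, and the totals quoted in this thread (1552 scanned, 270 filtered/aggregated) are treated as one combined entry rather than as real per-region stats.

```python
def rows_returned(region_stats):
    """Sum (scanned - filtered/aggregated) over all regions.

    region_stats is a list of (scanned, filtered_or_aggregated) pairs,
    one pair per region (hypothetical structure, for illustration only).
    """
    return sum(scanned - filtered for scanned, filtered in region_stats)


# Totals quoted in the thread, used here as a single combined entry.
value_b = rows_returned([(1552, 270)])
value_a = 366  # "Processed rows for each storageContext" from the log

print(value_b)            # rows expected back from HBase
print(value_b - value_a)  # rows unaccounted for by processedRowCount
```

With per-region stats, each region would contribute its own pair to the list and the sum would be taken across all of them.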



             


------------------ Original Message ------------------
From: "JiaTao Tao"<ta...@gmail.com>;
Sent: Monday, November 5, 2018, 2:41 PM
To: "user"<us...@kylin.apache.org>;

Subject: Re: doubt about measure of processedRowCount



Can you grep logs like "to merge segment results" in that scenario?


cheney <53...@qq.com> wrote on Sat, Nov 3, 2018 at 4:15 PM:

Thanks for your reply, but I am sure there's only one OlapContext in the query in my scenario.
 
---Original---
From: "JiaTao Tao"<ta...@gmail.com>
Date: Sat, Nov 3, 2018 10:42 AM
To: "user"<us...@kylin.apache.org>;
Subject: Re: doubt about measure of processedRowCount


Maybe adding up all the valueA values would be more appropriate, since there may be more than one OlapContext in the query (one OlapContext corresponds to one storageContext).


There are two good blogs about Kylin's query engine; you may want to take a look :).


https://blog.csdn.net/yu616568/article/details/50838504



https://zhuanlan.zhihu.com/p/30613434







cheney <53...@qq.com> wrote on Fri, Nov 2, 2018 at 11:10 PM:

Hi, guys
       When I execute a SQL query in Kylin, the Kylin server logs some query statistics. For example:

       "Processed rows for each storageContext: valueA". valueA is processedRowCount.
       My understanding is that processedRowCount is the number of records returned by HBase.


       The HBase coprocessor logs region stats, including "Total scanned row" and "Total filtered/aggred row".

       For one region, final records returned by HBase = Total scanned row - Total filtered/aggred row.
       Suppose this query needs to scan 10 regions in HBase. We can get the stats for every region, and we can get
       valueB, the total number of records returned by HBase, by summing the final records of all 10 regions.

       In general, valueA is equal to valueB, but sometimes valueB is much larger than valueA. Why?
       
       




-- 




Regards!

Aron Tao





 




Re: doubt about measure of processedRowCount

Posted by JiaTao Tao <ta...@gmail.com>.
Thanks, Shaofeng, for your affirmation :).

ShaoFeng Shi <sh...@apache.org> wrote on Wed, Nov 7, 2018 at 9:29 AM:

> Good job Jiatao! I appreciate your support to the community!
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>

-- 


Regards!

Aron Tao

Re: doubt about measure of processedRowCount

Posted by ShaoFeng Shi <sh...@apache.org>.
Good job Jiatao! I appreciate your support to the community!

JiaTao Tao <ta...@gmail.com> wrote on Wed, Nov 7, 2018 at 9:17 AM:

> Very glad that my reply was helpful. I have already opened a JIRA to add logs
> for "*GTStreamAggregateScanner*", so next time this should be much easier to
> track down :).
>
>
> --
>
>
> Regards!
>
> Aron Tao
>


-- 
Best regards,

Shaofeng Shi 史少锋

Re: doubt about measure of processedRowCount

Posted by JiaTao Tao <ta...@gmail.com>.
Very glad that my reply was helpful. I have already opened a JIRA to add logs for
"*GTStreamAggregateScanner*", so next time this should be much easier to
track down :).

cheney <53...@qq.com> wrote on Tue, Nov 6, 2018 at 11:57 PM:

> Hi JiaTao, thank you very much! The statistics are correct when I set "kylin.query.stream-aggregate-enabled=false".
> You are right: records are pre-aggregated by GTStreamAggregateScanner.


-- 


Regards!

Aron Tao

Re: doubt about measure of processedRowCount

Posted by cheney <53...@qq.com>.
Hi JiaTao, thank you very much! The statistics are correct when I set "kylin.query.stream-aggregate-enabled=false". You are right: records are pre-aggregated by GTStreamAggregateScanner.




------------------ Original Message ------------------
From: "JiaTao Tao"<ta...@gmail.com>;
Sent: Tuesday, November 6, 2018, 10:50 PM
To: "user"<us...@kylin.apache.org>;

Subject: Re: doubt about measure of processedRowCount



One possible place I can find in the code is the use of GTStreamAggregateScanner (in "SegmentCubeTupleIterator.java#111"). You can see that it does aggregate in "GTStreamAggregateScanner.AbstractStreamMergeIterator#next", so it will reduce the number of input records. But there is no logging in this class, as you can see, so it's pretty hard to confirm. Try "kylin.query.stream-aggregate-enabled=false" and run the scenario again to see whether anything changes.








Re: doubt about measure of processedRowCount

Posted by JiaTao Tao <ta...@gmail.com>.
One possible place I can find in the code is the use of *GTStreamAggregateScanner*
(in "*SegmentCubeTupleIterator.java#111*"). You can see that it does
aggregate in "*GTStreamAggregateScanner.AbstractStreamMergeIterator#next*",
so it will reduce the number of input records. But there is no logging in this
class, as you can see, so it's pretty hard to confirm. Try
"kylin.query.stream-aggregate-enabled=false" and run the scenario again to
see whether anything changes.
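
To illustrate why a merging step like this lowers the row count seen downstream, here is a minimal Python sketch of stream aggregation over key-sorted input. It is not Kylin's implementation: the function name `stream_aggregate`, the row layout, and the sample rows are all hypothetical, and the measure is simply summed.

```python
from itertools import groupby
from operator import itemgetter


def stream_aggregate(sorted_rows):
    """Merge consecutive rows that share a key by summing their measures.

    sorted_rows must already be ordered by key, as after a sort-merge of
    partition results (an assumption of this sketch).
    """
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        yield key, sum(measure for _, measure in group)


# Hypothetical (dimension key, measure) rows, already sorted by key.
rows = [("a", 1), ("a", 2), ("b", 5), ("c", 1), ("c", 4), ("c", 2)]
merged = list(stream_aggregate(rows))

print(len(rows), len(merged))  # input rows vs. rows left after merging
```

A counter placed after such a step counts the merged rows, so it can be much smaller than the number of rows the storage layer actually returned.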

cheney <53...@qq.com> wrote on Mon, Nov 5, 2018 at 6:55 PM:

> Yes. The log is as follows.
>
> 2018-11-02 22:25:34,980 DEBUG [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] gtrecord.StorageResponseGTScatter:88 : Using SortMergedPartitionResultIterator to merge 103 partition results
> 2018-11-02 22:25:34,982 INFO  [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] gtrecord.SequentialCubeTupleIterator:73 : Using Iterators.concat to merge segment results
> 2018-11-02 22:25:34,982 DEBUG [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] enumerator.OLAPEnumerator:122 : return TupleIterator...
> 2018-11-02 22:25:34,991 INFO  [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] service.QueryService:897 : Processed rows for each storageContext: 366
> 2018-11-02 22:25:34,991 INFO  [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] service.QueryService:422 : Stats of SQL response: isException: false, duration: 20, total scan count 1552
>
> According to the log, *valueA* = 366 and *valueB* = (total scan count) 1552 -
> (total aggregated/filtered in HBase) 270 = 1282.
> *valueB* is much larger than *valueA*.
>
>
> --
>
>
> Regards!
>
> Aron Tao
>


-- 


Regards!

Aron Tao