Posted to user@kylin.apache.org by Zhou Kang <zh...@outlook.com> on 2020/01/03 02:32:22 UTC

[DISCUSS] Cost-benefit of HBase scan result compression

Hi all,

kylin.storage.hbase.endpoint-compress-result defaults to TRUE.
At Xiaomi Group, we found that compression can add 30 seconds or more to query latency. After analyzing the HBase logs, we found that compression is useless in most situations.

Details are in https://issues.apache.org/jira/browse/KYLIN-4322

In addition, in our environment:

1.     Only 0.05% of the data is larger than 1M

2.     Almost 70% of the compressed data is larger than the source data.

So, should we change the default of this config to FALSE?
Also, kylin.storage.hbase.endpoint-compress-result should be overridable at the cube or project level, which CubeVisitService:visitCube currently forbids.
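
The second observation (compressed output larger than the source) is typical of small or high-entropy payloads. Below is a minimal Python sketch of the effect, together with a guard that falls back to the raw bytes when compression does not help. It uses zlib purely as a stand-in: the codec used by Kylin's coprocessor is not shown in this thread, and `compress_if_smaller` is a hypothetical helper, not a Kylin API.

```python
import random
import zlib
from typing import Tuple


def compress_if_smaller(payload: bytes) -> Tuple[bool, bytes]:
    """Return (was_compressed, data), keeping the raw bytes when zlib does not shrink them.

    Hypothetical helper for illustration; zlib stands in for the server-side codec.
    """
    candidate = zlib.compress(payload)
    if len(candidate) < len(payload):
        return True, candidate
    return False, payload


# High-entropy bytes (seeded so the run is repeatable): zlib cannot shrink these.
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(4096))

# Highly repetitive bytes: zlib shrinks these substantially.
repetitive = b"kylin" * 800

noisy_flag, noisy_out = compress_if_smaller(noisy)
rep_flag, rep_out = compress_if_smaller(repetitive)

# Print raw size, size actually sent, and whether compression was kept.
print(len(noisy), len(noisy_out), noisy_flag)
print(len(repetitive), len(rep_out), rep_flag)
```

A guard like this costs one extra length comparison per result, but the compression time itself is still spent, so it only addresses the size problem, not the latency problem described above.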


Re: [DISCUSS] Cost-benefit of HBase scan result compression

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Chunen and Kang, is there any follow-up JIRA for this?

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org




nichunen <ni...@apache.org> wrote on Mon, Jan 13, 2020 at 10:46 AM:

> Agree with Yaqian, we may set the default value to FALSE
>
> Best regards,
>
> Ni Chunen / George

Re: [DISCUSS] Cost-benefit of HBase scan result compression

Posted by nichunen <ni...@apache.org>.
Agree with Yaqian, we may set the default value to FALSE



Best regards,

 

Ni Chunen / George



On 01/9/2020 10:41, Zhou Kang <zh...@outlook.com> wrote:
( ̄▽ ̄)” It seems the mailing list disables rich text.
kylin sample data
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| Query Result Size  | Compress Time  | Query Duration(Compress)  | Query Duration(Uncompressed)  |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| 0.25M              | 5ms            | 0.18s                     | 0.23s                         |
| 0.5M               | 20ms           | 0.38s                     | 0.38s                         |
| 0.7M               | 25ms           | 0.52s                     | 0.45s                         |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+

SSB data
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| Query Result Size  | Compress Time  | Query Duration(Compress)  | Query Duration(Uncompressed)  |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| 0.25M              | 4ms            | 0.12s                     | 0.15s                         |
| 0.5M               | 7ms            | 0.25s                     | 0.24s                         |
| 0.7M               | 10ms           | 0.35s                     | 0.35s                         |
| 1M                 | 13ms           | 0.41s                     | 0.39s                         |
| 5M                 | 63ms           | 2.26s                     | 2.27s                         |
| 10M                | 135ms          | 5.10s                     | 4.90s                         |
| 16M                | 215ms          | 7.89s                     | 7.60s                         |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+



Re: [DISCUSS] Cost-benefit of HBase scan result compression

Posted by Zhou Kang <zh...@outlook.com>.
( ̄▽ ̄)” It seems the mailing list disables rich text.
kylin sample data
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| Query Result Size  | Compress Time  | Query Duration(Compress)  | Query Duration(Uncompressed)  |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| 0.25M              | 5ms            | 0.18s                     | 0.23s                         |
| 0.5M               | 20ms           | 0.38s                     | 0.38s                         |
| 0.7M               | 25ms           | 0.52s                     | 0.45s                         |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+

SSB data
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| Query Result Size  | Compress Time  | Query Duration(Compress)  | Query Duration(Uncompressed)  |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| 0.25M              | 4ms            | 0.12s                     | 0.15s                         |
| 0.5M               | 7ms            | 0.25s                     | 0.24s                         |
| 0.7M               | 10ms           | 0.35s                     | 0.35s                         |
| 1M                 | 13ms           | 0.41s                     | 0.39s                         |
| 5M                 | 63ms           | 2.26s                     | 2.27s                         |
| 10M                | 135ms          | 5.10s                     | 4.90s                         |
| 16M                | 215ms          | 7.89s                     | 7.60s                         |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
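From: Zhou Kang <zh...@outlook.com>
Reply-To: "dev@kylin.apache.org" <de...@kylin.apache.org>
Date: Thursday, January 9, 2020, 10:34 AM
To: "dev@kylin.apache.org" <de...@kylin.apache.org>, Yaqian Zhang <Ya...@126.com>
Subject: Re: [DISCUSS] Cost-benefit of HBase scan result compression

Hi, Yaqian Zhang:

Thanks for your query latency tests.

I retyped the test data for easy reading.

kylin sample data
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| Query Result Size  | Compress Time  | Query Duration(Compress)  | Query Duration(Uncompressed)  |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| 0.25M              | 5ms            | 0.18s                     | 0.23s                         |
| 0.5M               | 20ms           | 0.38s                     | 0.38s                         |
| 0.7M               | 25ms           | 0.52s                     | 0.45s                         |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+

SSB data
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| Query Result Size  | Compress Time  | Query Duration(Compress)  | Query Duration(Uncompressed)  |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| 0.25M              | 4ms            | 0.12s                     | 0.15s                         |
| 0.5M               | 7ms            | 0.25s                     | 0.24s                         |
| 0.7M               | 10ms           | 0.35s                     | 0.35s                         |
| 1M                 | 13ms           | 0.41s                     | 0.39s                         |
| 5M                 | 63ms           | 2.26s                     | 2.27s                         |
| 10M                | 135ms          | 5.10s                     | 4.90s                         |
| 16M                | 215ms          | 7.89s                     | 7.60s                         |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+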






Re: [DISCUSS] Cost-benefit of HBase scan result compression

Posted by Zhou Kang <zh...@outlook.com>.
Hi, Yaqian Zhang:

Thanks for your query latency tests.

I retyped the test data for easy reading

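kylin sample data
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| Query Result Size  | Compress Time  | Query Duration(Compress)  | Query Duration(Uncompressed)  |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| 0.25M              | 5ms            | 0.18s                     | 0.23s                         |
| 0.5M               | 20ms           | 0.38s                     | 0.38s                         |
| 0.7M               | 25ms           | 0.52s                     | 0.45s                         |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+

SSB data
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| Query Result Size  | Compress Time  | Query Duration(Compress)  | Query Duration(Uncompressed)  |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| 0.25M              | 4ms            | 0.12s                     | 0.15s                         |
| 0.5M               | 7ms            | 0.25s                     | 0.24s                         |
| 0.7M               | 10ms           | 0.35s                     | 0.35s                         |
| 1M                 | 13ms           | 0.41s                     | 0.39s                         |
| 5M                 | 63ms           | 2.26s                     | 2.27s                         |
| 10M                | 135ms          | 5.10s                     | 4.90s                         |
| 16M                | 215ms          | 7.89s                     | 7.60s                         |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+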


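From: Yaqian Zhang <Ya...@126.com>
Reply-To: "dev@kylin.apache.org" <de...@kylin.apache.org>
Date: Wednesday, January 8, 2020, 8:04 PM
To: "dev@kylin.apache.org" <de...@kylin.apache.org>
Subject: Re: [DISCUSS] Cost-benefit of HBase scan result compression

Hi:

I have tested the query latency in both cases.

In our CDH cluster environment, I got the following experimental results.

kylin sample data
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| Query Result Size  | Compress Time  | Query Duration(Compress)  | Query Duration(Uncompressed)  |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| 0.25M              | 5ms            | 0.18s                     | 0.23s                         |
| 0.5M               | 20ms           | 0.38s                     | 0.38s                         |
| 0.7M               | 25ms           | 0.52s                     | 0.45s                         |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+

SSB data
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| Query Result Size  | Compress Time  | Query Duration(Compress)  | Query Duration(Uncompressed)  |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| 0.25M              | 4ms            | 0.12s                     | 0.15s                         |
| 0.5M               | 7ms            | 0.25s                     | 0.24s                         |
| 0.7M               | 10ms           | 0.35s                     | 0.35s                         |
| 1M                 | 13ms           | 0.41s                     | 0.39s                         |
| 5M                 | 63ms           | 2.26s                     | 2.27s                         |
| 10M                | 135ms          | 5.10s                     | 4.90s                         |
| 16M                | 215ms          | 7.89s                     | 7.60s                         |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+

Conclusion:
Enabling compression improves query speed when the result size is below 0.5M.
Turning on compression generally reduces query speed when the result size is above 1M.

So it is recommended to set the default value of kylin.storage.hbase.endpoint-compress-result to false.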





Re: [DISCUSS] Cost-benefit of HBase scan result compression

Posted by Yaqian Zhang <Ya...@126.com>.
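Hi:

I have tested the query latency in both cases.

In our CDH cluster environment, I got the following experimental results.

kylin sample data
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| Query Result Size  | Compress Time  | Query Duration(Compress)  | Query Duration(Uncompressed)  |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| 0.25M              | 5ms            | 0.18s                     | 0.23s                         |
| 0.5M               | 20ms           | 0.38s                     | 0.38s                         |
| 0.7M               | 25ms           | 0.52s                     | 0.45s                         |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+

SSB data
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| Query Result Size  | Compress Time  | Query Duration(Compress)  | Query Duration(Uncompressed)  |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+
| 0.25M              | 4ms            | 0.12s                     | 0.15s                         |
| 0.5M               | 7ms            | 0.25s                     | 0.24s                         |
| 0.7M               | 10ms           | 0.35s                     | 0.35s                         |
| 1M                 | 13ms           | 0.41s                     | 0.39s                         |
| 5M                 | 63ms           | 2.26s                     | 2.27s                         |
| 10M                | 135ms          | 5.10s                     | 4.90s                         |
| 16M                | 215ms          | 7.89s                     | 7.60s                         |
+────────────────────+────────────────+───────────────────────────+───────────────────────────────+

Conclusion:
Enabling compression improves query speed when the result size is below 0.5M.
Turning on compression generally reduces query speed when the result size is above 1M.

So it is recommended to set the default value of kylin.storage.hbase.endpoint-compress-result to false.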
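
The conclusion can be checked directly against the figures above. The following Python sketch recomputes, for each row, how much slower (positive) or faster (negative) the compressed query was. The tuple layout and the names `SAMPLE`, `SSB`, and `deltas_ms` are choices made for this illustration only.

```python
# Each row: (result size in M, compressed duration in ms, uncompressed duration in ms),
# transcribed from the tables above with seconds converted to integer milliseconds.
SAMPLE = [(0.25, 180, 230), (0.5, 380, 380), (0.7, 520, 450)]
SSB = [
    (0.25, 120, 150), (0.5, 250, 240), (0.7, 350, 350), (1, 410, 390),
    (5, 2260, 2270), (10, 5100, 4900), (16, 7890, 7600),
]


def deltas_ms(rows):
    """Map result size -> compressed minus uncompressed duration, in milliseconds."""
    return {size: compressed - uncompressed for size, compressed, uncompressed in rows}


for name, rows in (("kylin sample", SAMPLE), ("SSB", SSB)):
    for size, delta in deltas_ms(rows).items():
        verdict = "slower" if delta > 0 else "faster" if delta < 0 else "same"
        print(f"{name:>12} {size:>5}M  {delta:+5d} ms  ({verdict} with compression)")
```

On these numbers, compression wins by 50 ms and 30 ms at 0.25M, the 5M SSB row is 10 ms faster with compression, and the 10M and 16M rows are 200 ms and 290 ms slower.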


> On Jan 4, 2020, at 19:35, Yaqian Zhang <Ya...@126.com> wrote:
> 
> Hi Kang:
> 
> Thank you for the comparison and report!
> 
> I will test and verify the query latency for this!


Re: [DISCUSS] Cost-benefit of HBase scan result compression

Posted by Yaqian Zhang <Ya...@126.com>.
Hi Kang:

Thank you for the comparison and report!

I will test and verify the query latency for this!

> On Jan 3, 2020, at 10:32, Zhou Kang <zh...@outlook.com> wrote:
> 
> Hi all,
> 
> kylin.storage.hbase.endpoint-compress-result defaults to TRUE.
> At Xiaomi Group, we found that compression can add 30 seconds or more to query latency. After analyzing the HBase logs, we found that compression is useless in most situations.
> 
> Details are in https://issues.apache.org/jira/browse/KYLIN-4322
> 
> In addition, in our environment:
> 
> 1.     Only 0.05% of the data is larger than 1M
> 
> 2.     Almost 70% of the compressed data is larger than the source data.
> 
> So, should we change the default of this config to FALSE?
> Also, kylin.storage.hbase.endpoint-compress-result should be overridable at the cube or project level, which CubeVisitService:visitCube currently forbids.
>