Posted to user@kylin.apache.org by 苏启龙 <su...@qiyi.com> on 2018/01/22 08:54:02 UTC

segment size estimate when merging

Hi,

We have some questions about how the segment size is estimated when merging multiple segments.

We found that the segment merge job still uses CubeStatsReader::getCuboidSizeMap to estimate the total size of the merged segment. As we understand it, this approach is fine when building a new segment, since there is no other information to turn to. But when merging, we could sum the table sizes of the segments to be merged, which should be more accurate.

So what is the reason for this design?



Su Qilong

Re: segment size estimate when merging

Posted by 苏启龙 <su...@qiyi.com>.
Thanks a lot for the reply, Alberto!

But this parameter was only introduced in v2.0.0+, and sadly we are currently on 1.6.0.



From: Alberto Ramón <a....@gmail.com>
Reply-To: "user@kylin.apache.org" <us...@kylin.apache.org>
Date: Saturday, January 27, 2018 19:50
To: user <us...@kylin.apache.org>
Cc: 林豪 (linhao) - Technology Product Center <li...@qiyi.com>
Subject: Re: segment size estimate when merging

Could this be related? KYLIN-2779 <https://issues.apache.org/jira/browse/KYLIN-2779>. This JIRA makes a lot of sense.

On 24 January 2018 at 13:43, ShaoFeng Shi <sh...@apache.org> wrote:
Hi Qilong,

If segment A's estimated size is 10 GB but its real size is 5 GB, then when merging or building another segment we can adjust the estimated size by dividing it by 2. It should then be closer to the real size.

2018-01-24 9:49 GMT+08:00 苏启龙 <su...@qiyi.com>:
Many thanks, Shaofeng! We'll look further into these parameters to see how to improve this.

BTW, what do you mean by the last line? That is, how can I feed the actual size into Kylin to help it adjust the estimate? Currently I can only set the max-regions parameter manually, which is not convenient for auto-merging.

Qilong

From: ShaoFeng Shi <sh...@apache.org>
Reply-To: "user@kylin.apache.org" <us...@kylin.apache.org>
Date: Tuesday, January 23, 2018 21:49
To: user <us...@kylin.apache.org>
Cc: 林豪 (linhao) - Technology Product Center <li...@qiyi.com>
Subject: Re: segment size estimate when merging

Hi Qilong,

Does your cube have count-distinct or Top-N measures?

If you observe that there are too many HBase regions, or that the regions are too small, you can adjust some parameters:

kylin.cube.size-estimate-ratio=0.25
kylin.cube.size-estimate-countdistinct-ratio=0.05

The default ratio for the common case is 0.25; you can set it lower if the estimated size is larger than the actual size. Both parameters can be set at the cube level.

A better way would be, when merging, to use the actual size of the existing segments to adjust the estimated size and so get a closer result.

2018-01-23 14:47 GMT+08:00 苏启龙 <su...@qiyi.com>:
Hi Shaofeng,

Yes, it is usually smaller than the sum of the segments, but typically only by a small amount compared with the total size.

The statistics-based estimate, however, is usually N times larger than the actual size, which results in a huge waste of HBase regions.


  1.  Do you have any data on the deviation of the two approaches? That is, which one is generally closer?
  2.  Is there any plan on the roadmap to improve this? Or any consideration of giving users more options to select their own estimation algorithm?

Thanks

Qilong

From: ShaoFeng Shi <sh...@apache.org>
Reply-To: "user@kylin.apache.org" <us...@kylin.apache.org>
Date: Tuesday, January 23, 2018 09:43
To: user <us...@kylin.apache.org>
Cc: 林豪 (linhao) - Technology Product Center <li...@qiyi.com>
Subject: Re: segment size estimate when merging

Hi Qilong,

When merging segments, the dimension-measure values (key-value pairs) are reorganized and identical keys are merged, so the merged size is not simply the sum of the segments; usually it is smaller.

The statistics are always used to estimate the size for the sake of consistency. Of course, there is room to improve the estimation accuracy.



2018-01-22 16:54 GMT+08:00 苏启龙 <su...@qiyi.com>:

Hi,

We have some questions about how the segment size is estimated when merging multiple segments.

We found that the segment merge job still uses CubeStatsReader::getCuboidSizeMap to estimate the total size of the merged segment. As we understand it, this approach is fine when building a new segment, since there is no other information to turn to. But when merging, we could sum the table sizes of the segments to be merged, which should be more accurate.

So what is the reason for this design?



Su Qilong



--
Best regards,

Shaofeng Shi 史少锋




--
Best regards,

Shaofeng Shi 史少锋




--
Best regards,

Shaofeng Shi 史少锋



Re: segment size estimate when merging

Posted by Alberto Ramón <a....@gmail.com>.
Could this be related? KYLIN-2779
<https://issues.apache.org/jira/browse/KYLIN-2779>. This JIRA makes a lot of
sense.


Re: segment size estimate when merging

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Qilong,

If segment A's estimated size is 10 GB but its real size is 5 GB, then when
merging or building another segment we can adjust the estimated size by
dividing it by 2. It should then be closer to the real size.
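
The adjustment described above could be sketched roughly as follows. This is only an illustration in Python, not Kylin code: the function names and the 18 GB raw estimate are hypothetical, and the only figures taken from this thread are the 10 GB estimate and 5 GB real size.

```python
def correction_factor(history):
    """Ratio of actual to estimated size over already-built segments.

    history: list of (estimated_gb, actual_gb) pairs (hypothetical input).
    """
    total_estimated = sum(est for est, _ in history)
    total_actual = sum(act for _, act in history)
    if total_estimated <= 0:
        return 1.0  # nothing to calibrate against, leave the estimate as is
    return total_actual / total_estimated


def adjusted_estimate(raw_estimate_gb, history):
    """Scale a statistics-based estimate by the observed correction factor."""
    return raw_estimate_gb * correction_factor(history)


# Segment A from the example: estimated 10 GB, real size 5 GB.
history = [(10.0, 5.0)]
factor = correction_factor(history)          # 0.5, i.e. "divide by 2"
adjusted = adjusted_estimate(18.0, history)  # hypothetical 18 GB raw estimate
```

With more than one existing segment, the factor is computed over their totals, so a single unusual segment has less influence on the result.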



-- 
Best regards,

Shaofeng Shi 史少锋

Re: segment size estimate when merging

Posted by 苏启龙 <su...@qiyi.com>.
Many thanks, Shaofeng! We'll look further into these parameters to see how to improve this.

BTW, what do you mean by the last line? That is, how can I feed the actual size into Kylin to help it adjust the estimate? Currently I can only set the max-regions parameter manually, which is not convenient for auto-merging.

Qilong



Re: segment size estimate when merging

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Qilong,

Does your cube have count-distinct or Top-N measures?

If you observe that there are too many HBase regions, or that the regions
are too small, you can adjust some parameters:

kylin.cube.size-estimate-ratio=0.25
kylin.cube.size-estimate-countdistinct-ratio=0.05

The default ratio for the common case is 0.25; you can set it lower if the
estimated size is larger than the actual size. Both parameters can be set
at the cube level.
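
As a rough guide to choosing a new value, here is a minimal Python sketch. It assumes the estimate scales linearly with the ratio, which is a simplification and not a statement about Kylin's internals; the function names and the raw size are hypothetical.

```python
def estimated_mb(raw_mb, ratio):
    """Assumed model: estimated size = raw size * size-estimate ratio."""
    return raw_mb * ratio


def suggest_ratio(current_ratio, estimated, actual):
    """Ratio that would have made the estimate match the observed size."""
    return current_ratio * actual / estimated


raw_mb = 40960.0                                 # hypothetical raw size
estimate = estimated_mb(raw_mb, 0.25)            # with the default ratio
new_ratio = suggest_ratio(0.25, estimate, 5120.0)  # observed size: 5120 MB
```

In this made-up case the estimate is twice the observed size, so the suggested ratio is half the default.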

A better way would be, when merging, to use the actual size of the existing
segments to adjust the estimated size and so get a closer result.



-- 
Best regards,

Shaofeng Shi 史少锋

Re: segment size estimate when merging

Posted by 苏启龙 <su...@qiyi.com>.
Hi shaofeng,

Yes, it is usually smaller than the sum of the segments, but typically only by a small amount compared with the total size.

The statistics-based estimate, however, is usually N times larger than the actual size, which results in a huge waste of HBase regions.


  1.  Do you have any data on the deviation of the two approaches? That is, which one is generally closer?
  2.  Is there any plan on the roadmap to improve this? Or any consideration of giving users more options to select their own estimation algorithm?
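
To illustrate the region waste mentioned above, here is a small Python sketch. It assumes the region count is simply the estimated size divided by a fixed cut size, which is a simplification; the 5 GB cut size, the 20 GB segment, and the factor of 4 are hypothetical numbers, not measurements from our cluster.

```python
import math


def region_count(size_gb, region_cut_gb=5.0, min_regions=1):
    """Assumed model: one region per region_cut_gb of data, rounded up."""
    return max(min_regions, math.ceil(size_gb / region_cut_gb))


actual_gb = 20.0                  # hypothetical real size of the merged segment
overestimate_factor = 4           # the "N times larger" case with N = 4
estimated_gb = actual_gb * overestimate_factor

regions_needed = region_count(actual_gb)      # based on the real size
regions_created = region_count(estimated_gb)  # based on the estimate
wasted_regions = regions_created - regions_needed
```

Under this model the number of extra regions grows in proportion to the overestimate, which is why the accuracy matters for auto-merging.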

Thanks

Qilong



Re: segment size estimate when merging

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Qilong,

When merging segments, the dimension-measure values (key-value pairs) are
reorganized and identical keys are merged, so the merged size is not simply
the sum of the segments; usually it is smaller.

The statistics are always used to estimate the size for the sake of
consistency. Of course, there is room to improve the estimation accuracy.
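
A toy Python sketch of why the merged size is not a plain sum: rows with the same key collapse into one. The data and names here are made up for illustration, and the measure is assumed to be a simple SUM.

```python
from collections import Counter


def merge_segments(segments):
    """Merge segments by summing the measure of rows that share a key."""
    merged = Counter()
    for segment in segments:
        merged.update(segment)  # adds values for keys that already exist
    return dict(merged)


# Hypothetical rows of one cuboid: key = dimension values, value = SUM measure.
seg_a = {("CN", "mobile"): 10, ("US", "mobile"): 4}
seg_b = {("CN", "mobile"): 7, ("CN", "desktop"): 3}

merged = merge_segments([seg_a, seg_b])
rows_before = len(seg_a) + len(seg_b)  # plain sum of the two segments
rows_after = len(merged)               # fewer, since one key is shared
```

How much smaller the result is depends on how many keys the segments share, so the saving varies from cube to cube.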






-- 
Best regards,

Shaofeng Shi 史少锋