Posted to user@kylin.apache.org by Alexander Sterligov <st...@joom.it> on 2017/08/06 17:15:58 UTC

Data disappears if hbase splits region

Hi,

I noticed a very large HBase region for one segment (more than 20 GB, while
kylin.storage.hbase.region-cut-gb=5). I don't know why it is so large, but in
any case it degraded performance a lot, so I decided to split it in HBase.
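
For reference, a manual split of this kind is typically run from the HBase
shell; a minimal sketch, with a placeholder table name:

# split every region of the segment's table (KYLIN_SEGMENT_TABLE is a placeholder)
hbase> split 'KYLIN_SEGMENT_TABLE'
# or split at an explicit row key
hbase> split 'KYLIN_SEGMENT_TABLE', 'some-row-key'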

As soon as the split started, Kylin began returning empty results for queries
to this segment.

Why can that happen?

PS
It seems to me that kylin.storage.hbase.region-cut-gb doesn't work when an
external HBase cluster is used.
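
For context, that setting lives in kylin.properties; a minimal sketch of the
relevant line (5 GB is the value used here):

# kylin.properties: target size, in GB, of each HBase region Kylin pre-creates per segment
kylin.storage.hbase.region-cut-gb=5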

Re: Data disappears if hbase splits region

Posted by Alexander Sterligov <st...@joom.it>.
Done!
https://issues.apache.org/jira/browse/KYLIN-2779

On Tue, Aug 8, 2017 at 10:02 AM, ShaoFeng Shi <sh...@apache.org>
wrote:

> Okay, so the estimation ratio is too small for bitmap-type measures. Could
> you please open a JIRA with your findings? We can enhance that in a future
> release. Thanks!

Re: Data disappears if hbase splits region

Posted by ShaoFeng Shi <sh...@apache.org>.
Okay, so the estimation ratio is too small for bitmap-type measures. Could you
please open a JIRA with your findings? We can enhance that in a future
release. Thanks!

2017-08-08 12:56 GMT+08:00 Alexander Sterligov <st...@joom.it>:

> Yes, I'm using lz4.


-- 
Best regards,

Shaofeng Shi 史少锋

Re: Data disappears if hbase splits region

Posted by Alexander Sterligov <st...@joom.it>.
Yes, I'm using lz4.
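
For reference, a sketch of where that codec is usually configured (the
property name below assumes a recent Kylin version, and the LZ4 codec must
also be available on the HBase side):

# kylin.properties (sketch): compression codec Kylin requests for cube tables in HBase
kylin.storage.hbase.compression-codec=lz4
# to double-check what a cube table actually uses (table name is a placeholder):
hbase> describe 'KYLIN_SEGMENT_TABLE'   # look for COMPRESSION => 'LZ4' on each column family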

On Tue, Aug 8, 2017 at 4:15 AM, ShaoFeng Shi <sh...@apache.org> wrote:

> Thanks for the input. Did you enable any compression (e.g., LZO or
> Snappy) for HBase?

Re: Data disappears if hbase splits region

Posted by ShaoFeng Shi <sh...@apache.org>.
Thanks for the input. Did you enable any compression (e.g., LZO or Snappy) for
HBase?

2017-08-08 0:49 GMT+08:00 Alexander Sterligov <st...@joom.it>:

> All parameters were at their defaults. I've found out that it is really
> related to the size estimation of the count-distinct measure: the F2 column
> family was underestimated by about a factor of 4.
>
> After I set kylin.cube.size-estimate-countdistinct-ratio=0.2, the
> estimations are good and it works much better.
>
> It looks like the default value of 0.05 is too low for bitmap count
> distinct with a global dictionary.
>
> Cube description is attached.


-- 
Best regards,

Shaofeng Shi 史少锋

Re: Data disappears if hbase splits region

Posted by Alexander Sterligov <st...@joom.it>.
All parameters were at their defaults. I've found out that it is really
related to the size estimation of the count-distinct measure: the F2 column
family was underestimated by about a factor of 4.

After I set kylin.cube.size-estimate-countdistinct-ratio=0.2, the estimations
are good and it works much better.

It looks like the default value of 0.05 is too low for bitmap count distinct
with a global dictionary.
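
A minimal sketch of the override in kylin.properties (0.2 is simply the value
that matched our data; the right ratio depends on the cardinality behind the
bitmaps):

# kylin.properties: ratio used to estimate the storage size of count-distinct measures;
# the default 0.05 underestimated our bitmap (F2) family by roughly 4x
kylin.cube.size-estimate-countdistinct-ratio=0.2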

Cube description is attached.

On Mon, Aug 7, 2017 at 6:21 AM, ShaoFeng Shi <sh...@apache.org> wrote:

> Hi Alexander,
>
> Sometimes the size will be over-estimated if the Cube has some complex
> measures like count distinct and TopN, but I have seldom heard of
> under-estimation. Did you change any other parameters in kylin.properties
> that may impact the estimation? Besides, if you can share the Cube
> definition, that would help (information like dimensions/measures and
> rowkey encoding will also impact the region split).

Re: Data disappears if hbase splits region

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Alexander,

Sometimes the size will be over-estimated if the Cube has some complex
measures like count distinct and TopN, but I have seldom heard of
under-estimation. Did you change any other parameters in kylin.properties that
may impact the estimation? Besides, if you can share the Cube definition, that
would help (information like dimensions/measures and rowkey encoding will also
impact the region split).
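
For context, the kind of measure that usually drives this is a precise count
distinct; in a cube definition it looks roughly like the following (the
measure and column names are made up):

{
  "name": "DISTINCT_USERS",
  "function": {
    "expression": "COUNT_DISTINCT",
    "parameter": { "type": "column", "value": "USER_ID" },
    "returntype": "bitmap"
  }
}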

2017-08-07 3:03 GMT+08:00 Alexander Sterligov <st...@joom.it>:

> I've found out that Kylin manages the sharding itself, so running a split
> in the hbase shell breaks the data.
>
> So the main problem is that the region cut doesn't work on HBase with S3.
> I can see in the log that it computes the shards properly:
>
> 2017-08-05 20:54:48,709 INFO  [Job 1175d3ed-504f-4eb0-a973-d57338fdff2c-892] steps.CreateHTableJob:192 : Total size 21334.075368547456M (estimated)
> 2017-08-05 20:54:48,709 INFO  [Job 1175d3ed-504f-4eb0-a973-d57338fdff2c-892] steps.CreateHTableJob:193 : Expecting 4 regions.
> 2017-08-05 20:54:48,709 INFO  [Job 1175d3ed-504f-4eb0-a973-d57338fdff2c-892] steps.CreateHTableJob:194 : Expecting 5333 MB per region.
>
> But then I get a single 20 GB region.
>
> Has anyone seen the same behaviour?


-- 
Best regards,

Shaofeng Shi 史少锋

Re: Data disappears if hbase splits region

Posted by Alexander Sterligov <st...@joom.it>.
I've found out that Kylin manages the sharding itself, so running a split in
the hbase shell breaks the data.

So the main problem is that the region cut doesn't work on HBase with S3. I
can see in the log that it computes the shards properly:

2017-08-05 20:54:48,709 INFO  [Job 1175d3ed-504f-4eb0-a973-d57338fdff2c-892] steps.CreateHTableJob:192 : Total size 21334.075368547456M (estimated)
2017-08-05 20:54:48,709 INFO  [Job 1175d3ed-504f-4eb0-a973-d57338fdff2c-892] steps.CreateHTableJob:193 : Expecting 4 regions.
2017-08-05 20:54:48,709 INFO  [Job 1175d3ed-504f-4eb0-a973-d57338fdff2c-892] steps.CreateHTableJob:194 : Expecting 5333 MB per region.

But then I get a single 20 GB region.
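
The planned numbers are at least self-consistent: 21334 MB at a 5 GB cut gives
4 regions of roughly 5333 MB each. One way to check how many regions the
segment table really ended up with is from the HBase shell (the command
depends on the HBase version, and the table name is a placeholder):

hbase> list_regions 'KYLIN_SEGMENT_TABLE'   # newer HBase shells; shows start/end keys and sizes
# on older versions, the table's page in the HBase Master web UI shows the same region list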

Has anyone seen the same behaviour?

On Sun, Aug 6, 2017 at 8:15 PM, Alexander Sterligov <st...@joom.it>
wrote:

> Hi,
>
> I noticed a very large HBase region for one segment (more than 20 GB, while
> kylin.storage.hbase.region-cut-gb=5). I don't know why it is so large, but
> in any case it degraded performance a lot, so I decided to split it in
> HBase.
>
> As soon as the split started, Kylin began returning empty results for
> queries to this segment.
>
> Why can that happen?
>
> PS
> It seems to me that kylin.storage.hbase.region-cut-gb doesn't work when an
> external HBase cluster is used.
>