Posted to dev@kylin.apache.org by vipul jhawar <vi...@gmail.com> on 2015/07/24 16:22:35 UTC

data retention

Hi

I would be interested to know what solutions you would recommend for
implementing data retention. Say we want to retain only the last 90 days
of data in the cube; what is the best option?

Our daily size is > 60 GB, so we cannot store data forever and want to
limit the cube to a time range that still supports advanced analysis.

Thanks

Re: data retention

Posted by "Shi, Shaofeng" <sh...@ebay.com>.
Bijeet, your understanding is correct; thanks for the comment.

We had planned to release this feature in 0.8 for the streaming case. Now
that we see the community has the need, we will backport it to 0.7 and
release it in 0.7.3 or 0.7.4.

Here are the retention-related JIRAs; I will link them together:
https://issues.apache.org/jira/browse/KYLIN-886

https://issues.apache.org/jira/browse/KYLIN-895

https://issues.apache.org/jira/browse/KYLIN-906




Re: data retention

Posted by Bijeet Singh <bi...@gmail.com>.
From what I understand, a cube comprises multiple segments, and each
segment is effectively a table in HBase. While querying, an HBaseKeyRange
is created for each matching segment of the cube, and the results from the
segments are finally merged. So it seems that truncating the HBase table
corresponding to an older segment will not affect the other segments.
Please correct me if I am wrong here.
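A toy sketch of that per-segment scan and merge (pure illustration; the
segment layout and row format here are invented, not Kylin's actual
storage):

```python
# Each segment covers a half-open date range and holds (date, value) rows,
# standing in for one HBase table per segment.
segments = {
    ("2015-07-01", "2015-07-07"): [("2015-07-03", 10)],
    ("2015-07-07", "2015-07-13"): [("2015-07-08", 5), ("2015-07-12", 7)],
    ("2015-07-13", "2015-07-19"): [("2015-07-15", 2)],
}

def query(segments, start, end):
    """Scan only the segments overlapping [start, end) and merge their
    rows, mimicking the per-segment key range plus final merge."""
    rows = []
    for (seg_start, seg_end), data in segments.items():
        if seg_start < end and start < seg_end:  # segment matches the range
            rows.extend(r for r in data if start <= r[0] < end)
    return sorted(rows)

# Dropping (truncating) the oldest segment leaves queries over the
# remaining range untouched:
del segments[("2015-07-01", "2015-07-07")]
```

Queries that only touch the surviving segments return exactly the same
rows before and after the deletion, which is the property the retention
scheme relies on.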

If it is indeed possible to truncate the older segments while maintaining
the correctness of the cube, then the older data can effectively be
deleted from the cube by truncating the corresponding HBase tables.

This way, if I want to retain data for, say, around 60 days, I can have 10
segments (given that 10 seems to be the optimal number of segments), each
holding 6 days' worth of data. And once I have the 11th segment ready for
the most recent 6 days, I can truncate the oldest segment.
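The 60-day rolling window could be scheduled roughly like this (the epoch
alignment point and helper names are illustrative assumptions, not
anything Kylin provides):

```python
from datetime import date, timedelta

SEGMENT_DAYS = 6    # each segment holds 6 days of data
MAX_SEGMENTS = 10   # 10 segments ~= a 60-day retention window

def segment_start(day):
    """Start date of the fixed 6-day window containing `day`."""
    epoch = date(2015, 1, 1)  # arbitrary alignment point (an assumption)
    offset = (day - epoch).days // SEGMENT_DAYS
    return epoch + timedelta(days=offset * SEGMENT_DAYS)

def retained_segments(today):
    """Start dates of the segments to keep, oldest first."""
    newest = segment_start(today)
    return [newest - timedelta(days=i * SEGMENT_DAYS)
            for i in range(MAX_SEGMENTS - 1, -1, -1)]

def segment_to_truncate(today):
    """Once the newest segment is built, this one falls out of the window."""
    return retained_segments(today)[0] - timedelta(days=SEGMENT_DAYS)
```

A daily job could call `segment_to_truncate` and drop the matching HBase
table once the newest segment finishes building.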

Please let me know if it looks feasible.

Thanks,
Bijeet


Re: data retention

Posted by vipul jhawar <vi...@gmail.com>.
Sure, I will open a JIRA.

So, at eBay you are storing the data forever in the cubes?

Rebuilding the cube every several days seems very suboptimal, as it means
we have to spend a lot more resources again.
Even if I partitioned my cubes by time, such as cube_01, cube_02 by month,
I would have to run parallel queries against all of them whenever my date
range spans months, and then re-aggregate in memory.
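The cross-cube fan-out and in-memory re-aggregation described above would
look roughly like this (the cube names, metrics, and numbers are invented
for illustration):

```python
# Partial aggregates held by each monthly cube (invented numbers).
monthly_cubes = {
    "cube_2015_06": {"clicks": 100, "views": 400},
    "cube_2015_07": {"clicks": 150, "views": 500},
}

def query_across(cubes, months, metric):
    """Fan the query out to every cube in the date range, then re-aggregate
    the partial results in memory -- the extra work described above."""
    return sum(cubes[m].get(metric, 0) for m in months)
```

Every multi-month query pays this fan-out and merge cost, which is why a
retention feature inside a single cube is preferable to manual
partitioning.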


Re: data retention

Posted by "Han, Luke" <lu...@ebay.com>.
Could you please open a JIRA for this? We have one for the streaming case, but I think it makes sense to enable retention for batch as well.

Currently, I would say you have to rebuild the cube every several days to discard old data.
To minimize the impact, you can define two cubes with the same logic: build one first, then build the other one, say, 7 days later; once the new one is done, disable the old one and purge its data, then repeat.

Thanks.

Sent from my iPhone
