Posted to user@kudu.apache.org by Sand Stone <sa...@gmail.com> on 2016/05/12 15:16:36 UTC

best practices to remove/retire data

Hi. Presumably I need to write a program to delete the unwanted rows, say,
remove all data older than 3 days, while the table is still ingesting new
data.

How well will this perform for large tables? Both deletion and ingestion
wise.

Or, for this specific case where I retire data by day, I could create a new
table per day. However, the users would then have to be aware of the table
naming scheme somehow. If the retention policy is changed, all the
client-side code might have to change (sure, we can have one level of
indirection to minimize the pain).

Thanks.

Re: best practices to remove/retire data

Posted by Dan Burkert <da...@cloudera.com>.
On Thu, May 12, 2016 at 8:32 AM, Chris George <Ch...@rms.com>
wrote:

> How hard would a predicate-based delete be?
> I.e., ScanDelete or something.
> -Chris George
>

That might be pretty difficult, since it implicitly assumes cross-row
transactional consistency. If consistency isn't required, you can simulate
it today by starting a scan and issuing a delete for each result.
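
A minimal sketch of that scan-then-delete workaround, written in Python
against a hypothetical in-memory stand-in. InMemoryTable and
delete_older_than are illustrative names, not Kudu client APIs; with a real
client the scan and the deletes would go through the client's own scanner
and write session. Rows written between the scan and the deletes are not
covered, which is the consistency gap mentioned above.

```python
from datetime import datetime, timedelta


class InMemoryTable:
    """Hypothetical stand-in for a table keyed by primary key (not a Kudu API)."""

    def __init__(self):
        self.rows = {}  # primary key -> row dict

    def insert(self, key, ts):
        self.rows[key] = {"key": key, "ts": ts}

    def scan(self, predicate):
        # Materialize the matches first so deleting while looping is safe.
        return [row for row in list(self.rows.values()) if predicate(row)]

    def delete(self, key):
        # Returns True only if the row was still there to delete.
        return self.rows.pop(key, None) is not None


def delete_older_than(table, cutoff):
    """Scan for rows older than `cutoff` and issue one delete per result."""
    deleted = 0
    for row in table.scan(lambda r: r["ts"] < cutoff):
        if table.delete(row["key"]):
            deleted += 1
    return deleted
```

Usage would be something like delete_older_than(table, now - timedelta(days=3)),
run periodically while ingestion continues.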

- Dan



>
> On 5/12/16, 9:24 AM, "Jean-Daniel Cryans" <jd...@apache.org> wrote:
>
> Hi,
>
> Right now this use case is more difficult than it needs to be. In your
> previous thread, "Partition and Split rows", we talked about non-covering
> range partitions, and this is something that would help your use case a lot.
> Basically, you could create partitions that cover full days, and every day
> you could delete the old partitions while creating the next day's. Deleting
> a partition is really quick and efficient compared to manually deleting
> individual rows.
>
> Until this is available, I'd do this with multiple tables, but it's a mess
> to handle as you described.
>
> Hope this helps,
>
> J-D
>
> On Thu, May 12, 2016 at 8:16 AM, Sand Stone <sa...@gmail.com>
> wrote:
>
>> Hi. Presumably I need to write a program to delete the unwanted rows,
>> say, remove all data older than 3 days, while the table is still ingesting
>> new data.
>>
>> How well will this perform for large tables? Both deletion and ingestion
>> wise.
>>
>> Or, for this specific case where I retire data by day, I could create a
>> new table per day. However, the users would then have to be aware of the
>> table naming scheme somehow. If the retention policy is changed, all the
>> client-side code might have to change (sure, we can have one level of
>> indirection to minimize the pain).
>>
>> Thanks.
>>
>
>

Re: best practices to remove/retire data

Posted by Chris George <Ch...@rms.com>.
How hard would a predicate-based delete be?
I.e., ScanDelete or something.
-Chris George

On 5/12/16, 9:24 AM, "Jean-Daniel Cryans" <jd...@apache.org> wrote:

Hi,

Right now this use case is more difficult than it needs to be. In your previous thread, "Partition and Split rows", we talked about non-covering range partitions, and this is something that would help your use case a lot. Basically, you could create partitions that cover full days, and every day you could delete the old partitions while creating the next day's. Deleting a partition is really quick and efficient compared to manually deleting individual rows.

Until this is available, I'd do this with multiple tables, but it's a mess to handle as you described.

Hope this helps,

J-D

On Thu, May 12, 2016 at 8:16 AM, Sand Stone <sa...@gmail.com> wrote:
Hi. Presumably I need to write a program to delete the unwanted rows, say, remove all data older than 3 days, while the table is still ingesting new data.

How well will this perform for large tables? Both deletion and ingestion wise.

Or, for this specific case where I retire data by day, I could create a new table per day. However, the users would then have to be aware of the table naming scheme somehow. If the retention policy is changed, all the client-side code might have to change (sure, we can have one level of indirection to minimize the pain).

Thanks.


Re: best practices to remove/retire data

Posted by Jean-Daniel Cryans <jd...@apache.org>.
It should be fully implemented for 1.0, which we're aiming for in August. You
can follow this JIRA: https://issues.apache.org/jira/browse/KUDU-1306

J-D

On Thu, May 12, 2016 at 10:10 AM, Sand Stone <sa...@gmail.com> wrote:

> Thanks J-D.
>
> Any idea when partition-level deletion will be implemented?
>
> On Thu, May 12, 2016 at 8:24 AM, Jean-Daniel Cryans <jd...@apache.org>
> wrote:
>
>> Hi,
>>
>> Right now this use case is more difficult than it needs to be. In your
>> previous thread, "Partition and Split rows", we talked about non-covering
>> range partitions, and this is something that would help your use case a lot.
>> Basically, you could create partitions that cover full days, and every day
>> you could delete the old partitions while creating the next day's. Deleting
>> a partition is really quick and efficient compared to manually deleting
>> individual rows.
>>
>> Until this is available, I'd do this with multiple tables, but it's a mess
>> to handle as you described.
>>
>> Hope this helps,
>>
>> J-D
>>
>> On Thu, May 12, 2016 at 8:16 AM, Sand Stone <sa...@gmail.com>
>> wrote:
>>
>>> Hi. Presumably I need to write a program to delete the unwanted rows,
>>> say, remove all data older than 3 days, while the table is still ingesting
>>> new data.
>>>
>>> How well will this perform for large tables? Both deletion and ingestion
>>> wise.
>>>
>>> Or, for this specific case where I retire data by day, I could create a
>>> new table per day. However, the users would then have to be aware of the
>>> table naming scheme somehow. If the retention policy is changed, all the
>>> client-side code might have to change (sure, we can have one level of
>>> indirection to minimize the pain).
>>>
>>> Thanks.
>>>
>>
>>
>

Re: best practices to remove/retire data

Posted by Sand Stone <sa...@gmail.com>.
Thanks J-D.

Any idea when partition-level deletion will be implemented?

On Thu, May 12, 2016 at 8:24 AM, Jean-Daniel Cryans <jd...@apache.org>
wrote:

> Hi,
>
> Right now this use case is more difficult than it needs to be. In your
> previous thread, "Partition and Split rows", we talked about non-covering
> range partitions, and this is something that would help your use case a lot.
> Basically, you could create partitions that cover full days, and every day
> you could delete the old partitions while creating the next day's. Deleting
> a partition is really quick and efficient compared to manually deleting
> individual rows.
>
> Until this is available, I'd do this with multiple tables, but it's a mess
> to handle as you described.
>
> Hope this helps,
>
> J-D
>
> On Thu, May 12, 2016 at 8:16 AM, Sand Stone <sa...@gmail.com>
> wrote:
>
>> Hi. Presumably I need to write a program to delete the unwanted rows,
>> say, remove all data older than 3 days, while the table is still ingesting
>> new data.
>>
>> How well will this perform for large tables? Both deletion and ingestion
>> wise.
>>
>> Or, for this specific case where I retire data by day, I could create a
>> new table per day. However, the users would then have to be aware of the
>> table naming scheme somehow. If the retention policy is changed, all the
>> client-side code might have to change (sure, we can have one level of
>> indirection to minimize the pain).
>>
>> Thanks.
>>
>
>

Re: best practices to remove/retire data

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Hi,

Right now this use case is more difficult than it needs to be. In your
previous thread, "Partition and Split rows", we talked about non-covering
range partitions, and this is something that would help your use case a lot.
Basically, you could create partitions that cover full days, and every day
you could delete the old partitions while creating the next day's. Deleting
a partition is really quick and efficient compared to manually deleting
individual rows.
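
To make the daily rotation concrete, here is a rough Python sketch of the
bookkeeping only: which day-sized partitions to drop and which to create on
a given day. day_bounds and roll_partitions are hypothetical helper names,
and a 3-day retention window is assumed from the original question. The
actual calls that add or drop a range partition are left out, since they
depend on the client once the feature is available.

```python
from datetime import date, timedelta


def day_bounds(day):
    """Return the [lower, upper) range bounds covering one full day."""
    return (day, day + timedelta(days=1))


def roll_partitions(existing_days, today, retention_days=3):
    """Decide which daily partitions to drop and which to create.

    Keeps the most recent `retention_days` days (including today) and
    makes sure today's and tomorrow's partitions exist.
    """
    existing = set(existing_days)
    oldest_kept = today - timedelta(days=retention_days - 1)
    tomorrow = today + timedelta(days=1)

    # Anything strictly older than the retention window is retired.
    to_drop = sorted(d for d in existing if d < oldest_kept)
    # Pre-create the next day's partition so ingestion never lacks a target.
    to_add = [d for d in (today, tomorrow) if d not in existing]
    return to_drop, to_add
```

Each entry in to_drop and to_add can be turned into partition bounds with
day_bounds before being handed to whatever performs the schema change.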

Until this is available, I'd do this with multiple tables, but it's a mess to
handle as you described.
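
For the multiple-table route, the level of indirection could be as small as
one helper that owns the naming scheme and the retention policy, so client
code never builds table names itself. This is only a sketch: the
DailyTableResolver class, the "events" prefix, and the prefix_YYYYMMDD
naming are all made up for illustration.

```python
from datetime import date, timedelta


class DailyTableResolver:
    """Hypothetical helper hiding a table-per-day naming scheme."""

    def __init__(self, prefix="events", retention_days=3):
        self.prefix = prefix
        self.retention_days = retention_days

    def table_for(self, day):
        # Assumed naming scheme: <prefix>_YYYYMMDD
        return f"{self.prefix}_{day:%Y%m%d}"

    def live_tables(self, today):
        """Tables inside the retention window, newest first."""
        return [
            self.table_for(today - timedelta(days=i))
            for i in range(self.retention_days)
        ]

    def expired_tables(self, existing_names, today):
        """Tables following the scheme that fall outside the window."""
        oldest_live = self.live_tables(today)[-1]
        # YYYYMMDD suffixes sort lexicographically in date order.
        return sorted(
            name
            for name in existing_names
            if name.startswith(self.prefix + "_") and name < oldest_live
        )
```

Changing the retention policy then means changing retention_days in one
place; readers query live_tables(today) and a cleanup job drops whatever
expired_tables returns.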

Hope this helps,

J-D

On Thu, May 12, 2016 at 8:16 AM, Sand Stone <sa...@gmail.com> wrote:

> Hi. Presumably I need to write a program to delete the unwanted rows, say,
> remove all data older than 3 days, while the table is still ingesting new
> data.
>
> How well will this perform for large tables? Both deletion and ingestion
> wise.
>
> Or, for this specific case where I retire data by day, I could create a new
> table per day. However, the users would then have to be aware of the table
> naming scheme somehow. If the retention policy is changed, all the
> client-side code might have to change (sure, we can have one level of
> indirection to minimize the pain).
>
> Thanks.
>