Posted to dev@kudu.apache.org by Scott Reynolds <sd...@gmail.com> on 2019/08/14 16:42:15 UTC

Dimension table delete and recreate

Hi developers,

I have a dimension table that is generated by a Spark job and written to
Kudu. I would like to remove the rows in the table that were not found by
the Spark job.

To do this, I was thinking of renaming the existing table so it keeps
the UUID for existing queries, then creating the table again and loading the
rows into it. An hour later, I would come back through and delete the old table.

If I were to do that, what would your three highest concerns be? How would
this affect the Kudu master process?
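
For illustration, the rename, recreate, and delayed-drop plan above can be
sketched as follows. This is only a toy in-memory model: FakeCatalog,
swap_in_fresh_table, drop_retired_table, and the table name dim_customer are
all hypothetical names made up for this sketch, not the Kudu client API. It
shows only the ordering of the steps and that the renamed table keeps its
original id.

```python
class FakeCatalog:
    """In-memory stand-in for a table catalog (hypothetical, NOT the Kudu API)."""

    def __init__(self):
        self.tables = {}   # table name -> {"id": int, "rows": dict}
        self._next_id = 0  # source of unique table ids

    def create_table(self, name):
        if name in self.tables:
            raise ValueError(f"table {name!r} already exists")
        self.tables[name] = {"id": self._next_id, "rows": {}}
        self._next_id += 1

    def rename_table(self, old_name, new_name):
        if new_name in self.tables:
            raise ValueError(f"table {new_name!r} already exists")
        # The entry (including its id) moves to the new name unchanged.
        self.tables[new_name] = self.tables.pop(old_name)

    def delete_table(self, name):
        del self.tables[name]


def swap_in_fresh_table(catalog, name, fresh_rows, suffix="old"):
    """Rename the live table aside, recreate it, and load the fresh rows."""
    retired_name = f"{name}_{suffix}"
    catalog.rename_table(name, retired_name)
    catalog.create_table(name)
    catalog.tables[name]["rows"].update(fresh_rows)
    return retired_name


def drop_retired_table(catalog, retired_name):
    """Run later (for example an hour on) once old queries have finished."""
    catalog.delete_table(retired_name)


catalog = FakeCatalog()
catalog.create_table("dim_customer")
catalog.tables["dim_customer"]["rows"].update({1: "alice", 2: "bob"})
original_id = catalog.tables["dim_customer"]["id"]

# The new job run only found row 1, so row 2 should disappear.
retired = swap_in_fresh_table(catalog, "dim_customer", {1: "alice"})
retired_id = catalog.tables[retired]["id"]
live_rows = dict(catalog.tables["dim_customer"]["rows"])

drop_retired_table(catalog, retired)
```

In a real deployment the rename, create, and delete steps would go through
whichever Kudu client the job already uses; the sketch does not attempt to
model that.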

Re: Dimension table delete and recreate

Posted by Adar Lieber-Dembo <ad...@cloudera.com>.
(+user, -dev, as this is more appropriate for the users list)

The Kudu master currently keeps a record of all tables and partitions,
including those that have been deleted. With a high enough rate of
table deletion it's theoretically possible for that to consume a lot
of disk space or memory. In practice (and since you mentioned you'd do
it once an hour) I wouldn't expect it to be a problem.

There shouldn't be any long-lasting impact on the tablet servers
though; tablets belonging to deleted tables are completely expunged
from disk.

Alternatively, you may find it more intuitive to model the "create
new, wait, then drop old" data motion via range partitions in a single
table.
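
One possible reading of the range-partition suggestion above, sketched as a
toy in-memory model: each job run gets its own range partition keyed on a
"generation" number, and the previous generation's partition is dropped after
the wait. The RangePartitionedTable class, its methods, and the generation
column are assumptions made up for this sketch, not the Kudu client API.

```python
class RangePartitionedTable:
    """Toy model of a single table range-partitioned on a leading
    'generation' key column (hypothetical, NOT the Kudu API)."""

    def __init__(self):
        # (lower inclusive, upper exclusive) -> {(generation, key): value}
        self.partitions = {}

    def add_range_partition(self, lower, upper):
        for lo, hi in self.partitions:
            if lower < hi and lo < upper:
                raise ValueError("range overlaps an existing partition")
        self.partitions[(lower, upper)] = {}

    def drop_range_partition(self, lower, upper):
        # Dropping the partition discards every row stored in it.
        del self.partitions[(lower, upper)]

    def insert(self, generation, key, value):
        for (lo, hi), rows in self.partitions.items():
            if lo <= generation < hi:
                rows[(generation, key)] = value
                return
        raise KeyError(f"no range partition covers generation {generation}")

    def scan(self, generation):
        """Return {key: value} for all rows of one generation."""
        result = {}
        for rows in self.partitions.values():
            for (gen, key), value in rows.items():
                if gen == generation:
                    result[key] = value
        return result


table = RangePartitionedTable()

# First job run: generation 1.
table.add_range_partition(1, 2)
table.insert(1, "a", "alice")
table.insert(1, "b", "bob")

# Next job run: "create new" is a new range partition for generation 2.
table.add_range_partition(2, 3)
table.insert(2, "a", "alice")

# "Wait, then drop old": readers have moved to generation 2 by now.
table.drop_range_partition(1, 2)
```

In this reading, queries would have to filter on the current generation, which
is a design cost the rename approach does not have.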

On Wed, Aug 14, 2019 at 9:42 AM Scott Reynolds <sd...@gmail.com> wrote:
>
> Hi developers,
>
> I have a dimension table that is generated by a Spark job and written to
> Kudu. I would like to remove the rows in the table that were not found by
> the Spark job.
>
> To do this, I was thinking of renaming the existing table so it keeps
> the UUID for existing queries, then creating the table again and loading the
> rows into it. An hour later, I would come back through and delete the old table.
>
> If I were to do that, what would your three highest concerns be? How would
> this affect the Kudu master process?
