Posted to dev@kudu.apache.org by Todd Lipcon <to...@apache.org> on 2016/04/25 19:54:22 UTC

Weekly update 4/25

Hey Kudu-ers,

For the last month and a half, I've been posting weekly summaries of
community development activity on the Kudu blog. In case you aren't on
Twitter or Slack, you might not have seen the posts, so I'm going to start
emailing them to the list as well.

Here's this week's update:
http://getkudu.io/2016/04/25/weekly-update.html

Feel free to reply to this mail if you have any questions or would like to
get involved in development.

-Todd

Re: Weekly update 4/25

Posted by Mike Percy <mp...@apache.org>.
Thanks for filing it, Jordan. Great writeup too.

Mike


RE: Weekly update 4/25

Posted by Jordan Birdsell <jo...@statefarm.com>.
Opened KUDU-1431 <https://issues.apache.org/jira/browse/KUDU-1431>



Re: Weekly update 4/25

Posted by Mike Percy <mp...@apache.org>.
Hey Jordan,
It would definitely be helpful if you could file a JIRA to track this.

The initial version of tablet history GC that I am currently working on
as part of KUDU-236 won't yet support this type of SLA-based removal: the
current changes are much simpler and are more in line with how we
currently schedule background maintenance tasks, which prioritize work
that is estimated to provide the greatest performance or space
improvement. Still, this is something we should look at more closely to
support compliance-based use cases like the one it sounds like you're
describing.

Mike
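
As a rough illustration of the scheduling idea described above: among the
candidate background maintenance operations, run whichever one is estimated
to give the biggest payoff. The sketch below is a toy only; the interface
and class names are made up and none of it comes from Kudu's actual
maintenance manager code.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    // Toy model of a background maintenance operation: each op estimates
    // how much space and performance benefit it would provide if run now.
    interface MaintenanceOp {
        long estimatedBytesReclaimed();  // expected space improvement
        double estimatedPerfGain();      // expected performance improvement (arbitrary score)
        void perform();
    }

    // Pick and run the op with the largest combined estimated benefit.
    class ToyMaintenanceScheduler {
        void runBestOp(List<MaintenanceOp> candidates) {
            Optional<MaintenanceOp> best = candidates.stream()
                .max(Comparator.comparingDouble(
                        op -> op.estimatedBytesReclaimed() + op.estimatedPerfGain()));
            best.ifPresent(MaintenanceOp::perform);
        }
    }

An SLA- or compliance-driven policy, as Jordan is asking for, would need a
different input to this ranking (a deadline rather than an estimated
benefit), which is why the first version described above doesn't cover it.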


RE: Weekly update 4/25

Posted by Jordan Birdsell <jo...@statefarm.com>.
Todd,

Should a JIRA be opened to track this?


RE: Weekly update 4/25

Posted by Jordan Birdsell <jo...@statefarm.com>.
Today we solve this on an RDBMS (DB2) platform; however, when data is replicated to the cluster, we need to be able to address such deletes that occur after replication so that we don’t have to continue to replicate petabytes across the network.  We’ve experimented with HBase and some HDFS solutions (Hive transactions), but neither really seems to be ideal.


Re: Weekly update 4/25

Posted by Todd Lipcon <to...@cloudera.com>.
On Tue, Apr 26, 2016 at 10:14 AM, Jordan Birdsell <
jordan.birdsell.kdvm@statefarm.com> wrote:

> If we had to go less frequently than a day I’m sure it’d be acceptable.
> The volume of deletes is very low in this case.  In some tables we can just
> “erase” a column’s data but in others, based on the data design, we must
> delete the entire row or group of rows.
>

Thanks for the details.

I'm curious: are you solving this use case with an existing system today
(e.g. HBase, HDFS, or some RDBMS)? I'd like to compare our planned
implementation with whatever that system is doing to make sure it's at
least as good.

-Todd



RE: Weekly update 4/25

Posted by Jordan Birdsell <jo...@statefarm.com>.
If we had to go less frequently than a day I’m sure it’d be acceptable.  The volume of deletes is very low in this case.  In some tables we can just “erase” a column’s data but in others, based on the data design, we must delete the entire row or group of rows.
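
In Kudu terms, the two operations Jordan describes ("erasing" a column's
data versus deleting the whole row) would look roughly like the sketch
below using the Kudu Java client. The table, column, and key values are
made up, the column being nulled out is assumed to be nullable, and at the
time of this thread the client still lived in the org.kududb package rather
than org.apache.kudu. Note that both operations only write new entries to
the row's history; the older versions stay on disk until tablet history GC
removes them, which is exactly the gap being discussed in this thread.

    import org.apache.kudu.client.Delete;
    import org.apache.kudu.client.KuduClient;
    import org.apache.kudu.client.KuduException;
    import org.apache.kudu.client.KuduSession;
    import org.apache.kudu.client.KuduTable;
    import org.apache.kudu.client.Update;

    public class ComplianceDeleteSketch {
        public static void main(String[] args) throws KuduException {
            KuduClient client =
                new KuduClient.KuduClientBuilder("kudu-master:7051").build();
            KuduTable table = client.openTable("customer_profile");
            KuduSession session = client.newSession();

            // "Erase" just the sensitive column for one row by writing a
            // new version with that column set to null.
            Update update = table.newUpdate();
            update.getRow().addString("customer_id", "12345");  // primary key
            update.getRow().setNull("ssn");                     // sensitive column
            session.apply(update);

            // Delete an entire row.
            Delete delete = table.newDelete();
            delete.getRow().addString("customer_id", "67890");
            session.apply(delete);

            session.flush();
            client.shutdown();
        }
    }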


Re: Weekly update 4/25

Posted by Todd Lipcon <to...@cloudera.com>.
On Tue, Apr 26, 2016 at 8:28 AM, Jordan Birdsell <
jordan.birdsell.kdvm@statefarm.com> wrote:

> Yes, this is exactly what we need to do.  Not immediately is OK for our
> current requirements; I’d say within a day would be ideal.
>
>
Even within a day can be tricky for this kind of system if you have a
fairly uniform random delete workload. That would imply that you're
rewriting _all_ of your data every day, which uses a fair amount of IO.

Are deletes extremely rare for your use case?

Is it the entire row of data that has to be deleted or would it be
sufficient to "X out" some particularly sensitive column?

-Todd



RE: Weekly update 4/25

Posted by Jordan Birdsell <jo...@statefarm.com>.
Yes, this is exactly what we need to do.  Not immediately is OK for our current requirements; I’d say within a day would be ideal.


Re: Weekly update 4/25

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Oh I see, so this is in order to comply with asks such as "make sure that
data for some user/customer is 100% deleted"? We'll still have the problem
where we don't want to rewrite all the base data files (GBs/TBs) to clean
up KBs of data, although since a single row is always only part of one row
set, it's at most 64MB that you'd be rewriting.

BTW is it ok if the data isn't immediately deleted? How long is it
acceptable to wait for before it happens?

J-D


RE: Weekly update 4/25

Posted by Jordan Birdsell <jo...@statefarm.com>.
Correct.  As for the “latest version”, if a row is deleted in the latest version then removing the old versions where it existed is exactly what we’re looking to do.  Basically, we need a way to physically get rid of select rows (or data within a column for that matter) and all versions of that row or column data.


Re: Weekly update 4/25

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Hi Jordan,

In other words, you'd like to tag specific rows to be excluded from the
default data history retention?

Also, keep in mind that this improvement is about removing old versions of
the data; it will not delete the latest version. If you are used to HBase,
it's like specifying some TTL plus MIN_VERSIONS=1 so it doesn't completely
age out a row.

Hope this helps,

J-D
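
For readers less familiar with HBase, the analogy J-D draws maps to setting
a column family's TTL together with MIN_VERSIONS=1. A minimal sketch with
the HBase 1.x Java API follows; the table and family names are made up.

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;

    public class TtlAnalogySketch {
        public static void main(String[] args) {
            HColumnDescriptor family = new HColumnDescriptor("cf");
            // Old cell versions become eligible for removal after 7 days...
            family.setTimeToLive(7 * 24 * 60 * 60);
            // ...but at least one version is always kept, so the row itself
            // never ages out completely. This is the behavior J-D compares
            // Kudu's planned history GC to.
            family.setMinVersions(1);

            HTableDescriptor tableDesc =
                new HTableDescriptor(TableName.valueOf("events"));
            tableDesc.addFamily(family);
            // tableDesc would then be passed to Admin.createTable(tableDesc).
        }
    }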


RE: Weekly update 4/25

Posted by Jordan Birdsell <jo...@statefarm.com>.
Hi,

Regarding row GC, I see in the design document that the tablet history max age will be set at the table level.  Would it be possible to make this something that can be overridden for specific transactions?  We have some use cases that would require accelerated removal of data from disk and other use cases that would not have the same requirement.  Unfortunately, these different use cases often apply to the same tables.

Thanks,
Jordan Birdsell

From: Todd Lipcon [mailto:todd@apache.org]
Sent: Monday, April 25, 2016 1:54 PM
To: dev@kudu.incubator.apache.org; user@kudu.incubator.apache.org
Subject: Weekly update 4/25

Hey Kudu-ers,

For the last month and a half, I've been posting weekly summaries of community development activity on the Kudu blog. In case you aren't on twitter or slack you might not have seen the posts, so I'm going to start emailing them to the list as well.

Here's this week's update:
http://getkudu.io/2016/04/25/weekly-update.html

Feel free to reply to this mail if you have any questions or would like to get involved in development.

-Todd