Posted to dev@hbase.apache.org by Jean-Daniel Cryans <jd...@apache.org> on 2013/06/19 14:31:49 UTC

Efficiently wiping out random data?

Hey devs,

I was presenting at GOTO Amsterdam yesterday and I got a question
about a scenario that I've never thought about before. I'm wondering
what others think.

How do you efficiently wipe out random data in HBase?

For example, you have a website and a user asks you to close their
account and get rid of the data.

Would you say "sure can do, lemme just issue a couple of Deletes!" and
call it a day? What if you really have to delete the data, not just
mask it, because of contractual obligations or local laws?

Major compacting is the obvious solution but it seems really
inefficient. Let's say you've got some truly random data to delete and
it happens so that you have at least one row per region to get rid
of... then you need to basically rewrite the whole table?

My answer was along those lines: I told the attendee that it's not an easy
use case to manage in HBase.

Thoughts?

J-D

Re: Efficiently wiping out random data?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Thank you so much for all the answers guys, looks like I should write
up something for the ref guide!

J-D

On Sun, Jun 23, 2013 at 3:31 PM, Andrew Purtell <ap...@apache.org> wrote:
> Right, compaction followed by a 'secure' HDFS level delete by random
> rewrites to nuke the blocks of the remnant. Even then it's difficult to say
> something recoverable does not remain, though in practical terms the
> hypothetical user here could be assured no API of HBase or HDFS could
> ever retrieve the data.
>
> Or burn the platters to ash.
>
> On Sunday, June 23, 2013, Ian Varley wrote:
>
>> One more followup on this, after talking to some security-types:
>>
>>  - The issue isn't wiping out all data for a customer; it's wiping out
>> *specific* data. Using the "forget an encryption key" method would then
>> mean separate encryption keys per row, which isn't feasible most of the
>> time. (Consider information that becomes classified but didn't used to be,
>> for example.)
>>  - In some cases, decryption can still happen without keys, by brute force
>> or from finding weaknesses in the algorithms down the road. Yes, I know
>> that the brute force CPU time is measured in eons, but never say never; we
>> can easily decrypt things now that were encrypted with the best available
>> algorithms and keys 40 years ago. :)
>>
>> So for cases where it counts, a "secure delete" means no less than writing
>> over the data with random strings. It would be interesting to add features
>> to HBase / HDFS that passed muster for stuff like this; for example, an
>> HDFS secure-delete<
>> http://www.ghacks.net/2010/08/26/securely-delete-files-with-secure-delete/>
>> command, and an HBase secure-delete that does all of: add delete marker,
>> force major compaction, and run HDFS secure-delete.
>>
>> Ian
>>
>> On Jun 20, 2013, at 7:39 AM, Jean-Marc Spaggiari wrote:
>>
>> Correct, that's another way. You just need one encryption key per
>> customer, and everything written into HBase, across all the tables, is
>> encrypted with that key.
>>
>> If the customer wants all their data erased, you just erase the key,
>> and there is no way to retrieve anything from HBase even though it's still
>> sitting in all the tables. You can then emit all the deletes required, and
>> the data will be fully removed on the next regular major compaction...
>>
>> There will be a small impact on regular reads/writes since you will
>> need to read the key first, but a user delete will be far more
>> efficient.
>>
>>
>> 2013/6/20 lars hofhansl <larsh@apache.org>:
>> IMHO the "proper" way of doing such things is encryption.
>>
>> 0-ing the values or even overwriting with a pattern typically leaves
>> traces of the old data on a magnetic platter that can be retrieved with
>> proper forensics. (Secure erase of SSD is typically pretty secure, though).
>>
>>
>> For such use cases, files (HFiles) should be encrypted and the decryption
>> keys should just be forgotten at the appropriate times.
>> I realize that for J-D's specific use case doing this at the HFile level
>> would be very difficult.
>>
>> Maybe the KVs' values could be stored encrypted with a user specific key.
>> Deleting the user's data then means to forget that users key.
>>
>>
>> -- Lars
>>
>> ________________________________
>> From: Matt Corgan <mcorgan@hotpads.com>
>> To: dev <dev@hbase.apache.org>
>> Sent: Wednesday, June 19, 2013 2:15 PM
>> Subject: Re: Efficiently wiping out random data?
>>
>>
>> Would it be possible to zero-out all the value bytes for cells in existing
>> HFiles?  The keys would remain, but if you knew that ahead of time you
>> could design your keys so they don't contain important info.
>>
>>
>> On Wed, Jun 19, 2013 at 11:28 AM, Ian Varley <ivarley@salesforce.com> wrote:
>>
>> At least in some cases, the answer to that question ("do you even have to
>> destroy your tapes?") is a resounding "yes". For some extreme cases (think
>> health care, privacy, etc), companies do all RDBMS backups to disk instead
>> of tape for that reason. (Transaction logs are considered different, I
>> guess because they're inherently transient? Who knows.)
>>
>> The "no time travel" fix doesn't work, because you could still change that
>> code or ACL in the future and get back to the data. In these cases, one
>> must provably destroy the data.
>>
>> That said, forcing full compactions (especially if they can be targeted
>> via stripes or levels or something) is an OK way to handle it, maybe
>> eventually with more ways to nice it down so it doesn't hose your cluster.
>>
>> Ian
>>
>> On Jun 19, 2013, at 11:27 AM, Todd Lipcon wrote:
>>
>> I'd also question what exactly the regulatory requirements for deletion
>> are. For example, if you had tape backups of your Oracle DB, would you have
>> to drive to your off-site storage facility, grab every tape you ever made,
>> and zero out the user's data as well? I doubt it, considering tapes have
>> basically the same storage characteristics as HDFS in terms of inability to
>> random write.
>>
>> Another example: deletes work the same way in most databases -- eg in
>> postgres, deletion of a record just consists of setting a record's "xmax"
>> column to the current transaction ID. This is equivalent to a tombstone,
>> and you have to wait for a VACUUM process to come along and actually delete
>> the record entry. In Oracle, the record will persist in a rollback segment
>> for a configurable amount of time, and you can use a Flashback query to
>> time travel and see it again. In Vertica, you also set an "xmax" entry and
>> wait until the next merge-out (like a major compaction).
>>
>> Even in a filesystem, deletion doesn't typically remove data, unless you
>> use a tool like srm. It just unlinks the inode from the directory tree.
>>
>> So, if any of the above systems satisfy their use case, then HBase ought to
>> as well. Perhaps there's an ACL we could add which would allow/disallow
>> users from doing time travel more than N seconds in the past..  maybe that
>> would help allay fears?
>>
>> -Todd
>>
>> On Wed, Jun 19, 2013 at 8:12 AM, Jesse Yates <jesse.k.yates@gmail.com> wrote:
>>
>> Chances are that data isn't completely "random". For instance, with a user
>> they are likely to have an id in their row key, so doing a filtering (with
>> a custom scanner) major compaction would clean that up. With Sergey's
>> compaction stuff coming in you could break that out even further and only
>> have to compact a small set of files to get that removal.
>>
>> So it's hard, but as it's not our direct use case, it's gonna be a few extra
>> hoops.
>>
>> On Wednesday, June 19, 2013, Kevin O'dell wrote:
>>
>> Yeah, the immutable nature of HDFS is biting us here.
>>
>>
>> On Wed, Jun 19, 2013 at 8:46 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>
>> That sounds like a very effective way for developers to kill clusters
>> with compactions :)
>>
>> J-D
>>
>> On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'dell <kevin.odell@cloudera.com> wrote:
>> JD,
>>
>>  What about adding a flag for the delete, something like -full or
>> -true (it is early).  Once we issue the delete to the proper row/region we
>> run a flush, then execute a single region major compaction.  That way, if
>> it is a single record, or a subset of data, the impact is minimal.  If the
>> delete happens to hit every region we will compact every region (not
>> ideal).  Another thought would be an overwrite, but with versions this
>> logic becomes more complicated.
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)

Re: Efficiently wiping out random data?

Posted by Andrew Purtell <ap...@apache.org>.
Right, compaction followed by a 'secure' HDFS level delete by random
rewrites to nuke the blocks of the remnant. Even then it's difficult to say
something recoverable does not remain, though in practical terms the
hypothetical user here could be assured no API of HBase or HDFS could
ever retrieve the data.

Or burn the platters to ash.
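
There is no HDFS-level API for that today, so the "random rewrites" would
have to happen out of band on each DataNode, against the local block replica
files of the compacted-away HFiles, before the blocks are released. A rough
sketch of what scrubbing one such file could look like (the block path is
purely hypothetical, and as discussed further down the thread an overwrite
still may not defeat forensic recovery):

import java.io.RandomAccessFile;
import java.security.SecureRandom;

public class ScrubBlockFile {

  // Overwrite a local block replica file with random bytes before it is
  // deleted. Illustration only: HDFS does not expose block paths, so a real
  // tool would have to run on every DataNode holding a replica.
  public static void scrub(String path) throws Exception {
    SecureRandom rng = new SecureRandom();
    RandomAccessFile file = new RandomAccessFile(path, "rws"); // "rws" syncs data and metadata
    try {
      byte[] junk = new byte[64 * 1024];
      long remaining = file.length();
      while (remaining > 0) {
        rng.nextBytes(junk);
        int n = (int) Math.min(junk.length, remaining);
        file.write(junk, 0, n);
        remaining -= n;
      }
    } finally {
      file.close();
    }
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical path of a block replica inside a DataNode data directory.
    scrub("/data/1/dfs/dn/current/blk_1073741900");
  }
}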

On Sunday, June 23, 2013, Ian Varley wrote:

> One more followup on this, after talking to some security-types:
>
>  - The issue isn't wiping out all data for a customer; it's wiping out
> *specific* data. Using the "forget an encryption key" method would then
> mean separate encryption keys per row, which isn't feasible most of the
> time. (Consider information that becomes classified but didn't used to be,
> for example.)
>  - In some cases, decryption can still happen without keys, by brute force
> or from finding weaknesses in the algorithms down the road. Yes, I know
> that the brute force CPU time is measured in eons, but never say never; we
> can easily decrypt things now that were encrypted with the best available
> algorithms and keys 40 years ago. :)
>
> So for cases where it counts, a "secure delete" means no less than writing
> over the data with random strings. It would be interesting to add features
> to HBase / HDFS that passed muster for stuff like this; for example, an
> HDFS secure-delete<
> http://www.ghacks.net/2010/08/26/securely-delete-files-with-secure-delete/>
> command, and an HBase secure-delete that does all of: add delete marker,
> force major compaction, and run HDFS secure-delete.
>
> Ian
>
> On Jun 20, 2013, at 7:39 AM, Jean-Marc Spaggiari wrote:
>
> Correct, that's another way. Just need to have one encryption key per
> customer. And all what is written into HBase, over all the tables, is
> encrypted with that key.
>
> If the customer want to have all its data erased, just erased the key,
> and you have no way to retrieve anything from HBase even if it's still
> into all the tables. So now you can emit all the deletes required, and
> that will be totally deleted on the next regular major compaction...
>
> There will be a small impact on regular reads/write since you will
> need to read the key first, but them a user delete will be way more
> efficient.
>
>
> 2013/6/20 lars hofhansl <larsh@apache.org <javascript:;><mailto:
> larsh@apache.org <javascript:;>>>:
> IMHO the "proper" of doing such things is encryption.
>
> 0-ing the values or even overwriting with a pattern typically leaves
> traces of the old data on a magnetic platter that can be retrieved with
> proper forensics. (Secure erase of SSD is typically pretty secure, though).
>
>
> For such use cases, files (HFiles) should be encrypted and the decryption
> keys should just be forgotten at the appropriate times.
> I realize that for J-D's specific use case doing this at the HFile level
> would be very difficult.
>
> Maybe the KVs' values could be stored encrypted with a user specific key.
> Deleting the user's data then means to forget that users key.
>
>
> -- Lars
>
> ________________________________
> From: Matt Corgan <mcorgan@hotpads.com <javascript:;><mailto:
> mcorgan@hotpads.com <javascript:;>>>
> To: dev <dev@hbase.apache.org <javascript:;><mailto:dev@hbase.apache.org<javascript:;>
> >>
> Sent: Wednesday, June 19, 2013 2:15 PM
> Subject: Re: Efficiently wiping out random data?
>
>
> Would it be possible to zero-out all the value bytes for cells in existing
> HFiles?  They keys would remain, but if you knew that ahead of time you
> could design your keys so they don't contain important info.
>
>
> On Wed, Jun 19, 2013 at 11:28 AM, Ian Varley <ivarley@salesforce.com
> <ma...@salesforce.com>> wrote:
>
> At least in some cases, the answer to that question ("do you even have to
> destroy your tapes?") is a resounding "yes". For some extreme cases (think
> health care, privacy, etc), companies do all RDBMS backups to disk instead
> of tape for that reason. (Transaction logs are considered different, I
> guess because they're inherently transient? Who knows.)
>
> The "no time travel" fix doesn't work, because you could still change that
> code or ACL in the future and get back to the data. In these cases, one
> must provably destroy the data.
>
> That said, forcing full compactions (especially if they can be targeted
> via stripes or levels or something) is an OK way to handle it, maybe
> eventually with more ways to nice it down so it doesn't hose your cluster.
>
> Ian
>
> On Jun 19, 2013, at 11:27 AM, Todd Lipcon wrote:
>
> I'd also question what exactly the regulatory requirements for deletion
> are. For example, if you had tape backups of your Oracle DB, would you have
> to drive to your off-site storage facility, grab every tape you ever made,
> and zero out the user's data as well? I doubt it, considering tapes have
> basically the same storage characteristics as HDFS in terms of inability to
> random write.
>
> Another example: deletes work the same way in most databases -- eg in
> postgres, deletion of a record just consists of setting a record's "xmax"
> column to the current transaction ID. This is equivalent to a tombstone,
> and you have to wait for a VACUUM process to come along and actually delete
> the record entry. In Oracle, the record will persist in a rollback segment
> for a configurable amount of time, and you can use a Flashback query to
> time travel and see it again. In Vertica, you also set an "xmax" entry and
> wait until the next merge-out (like a major compaction).
>
> Even in a filesystem, deletion doesn't typically remove data, unless you
> use a tool like srm. It just unlinks the inode from the directory tree.
>
> So, if any of the above systems satisfy their use case, then HBase ought to
> as well. Perhaps there's an ACL we could add which would allow/disallow
> users from doing time travel more than N seconds in the past..  maybe that
> would help allay fears?
>
> -Todd
>
> On Wed, Jun 19, 2013 at 8:12 AM, Jesse Yates <jesse.k.yates@gmail.com
> <ma...@gmail.com>
> <ma...@gmail.com>>wrote:
>
> Chances are that date isn't completely "random". For instance, with a user
> they are likely to have an id in their row key, so doing a filtering (with
> a custom scanner) major compaction would clean that up. With Sergey's
> compaction stuff coming in you could break that out even further and only
> have to compact a small set of files to get that removal.
>
> So it's hard, but as its not our direct use case, it's gonna be a few extra
> hoops.
>
> On Wednesday, June 19, 2013, Kevin O'dell wrote:
>
> Yeah, the immutable nature of HDFS is biting us here.
>
>
> On Wed, Jun 19, 2013 at 8:46 AM, Jean-Daniel Cryans <jdcryans@apache.org
> <ma...@apache.org>
> <ma...@apache.org>
> <javascript:;>
> wrote:
>
> That sounds like a very effective way for developers to kill clusters
> with compactions :)
>
> J-D
>
> On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'dell <
> kevin.odell@cloudera.com<javascript:;>
>
> wrote:
> JD,
>
>  What about adding a flag for the delete, something like -full or
> -true(it is early).  Once we issue the delete to the proper
> row/region
> we
> run a flush, then execute a single region major compaction.  That
> way,
> if
> it is a single record, or a subset of data the impact is minimal.  If
> the
> delete happens to hit every region we will compact every region(not
> ideal).
> Another thought would be an overwrite



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Efficiently wiping out random data?

Posted by Ian Varley <iv...@salesforce.com>.
One more followup on this, after talking to some security-types:

 - The issue isn't wiping out all data for a customer; it's wiping out *specific* data. Using the "forget an encryption key" method would then mean separate encryption keys per row, which isn't feasible most of the time. (Consider information that becomes classified but didn't used to be, for example.)
 - In some cases, decryption can still happen without keys, by brute force or from finding weaknesses in the algorithms down the road. Yes, I know that the brute force CPU time is measured in eons, but never say never; we can easily decrypt things now that were encrypted with the best available algorithms and keys 40 years ago. :)

So for cases where it counts, a "secure delete" means no less than writing over the data with random strings. It would be interesting to add features to HBase / HDFS that passed muster for stuff like this; for example, an HDFS secure-delete<http://www.ghacks.net/2010/08/26/securely-delete-files-with-secure-delete/> command, and an HBase secure-delete that does all of: add delete marker, force major compaction, and run HDFS secure-delete.
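
A sketch of what that combined sequence could look like from a client today
with the 0.94-era API (the table name and row key are made up, and step 3 is
the part that does not exist yet):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class SecureDeleteSketch {

  public static void secureDelete(Configuration conf, String tableName, byte[] row)
      throws Exception {
    HTable table = new HTable(conf, tableName);
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // 1. Add the delete marker (tombstone).
      table.delete(new Delete(row));

      // 2. Flush and major compact only the region holding the row, so the
      //    old cells are dropped from the rewritten HFiles without touching
      //    the rest of the table.
      String region =
          table.getRegionLocation(row).getRegionInfo().getRegionNameAsString();
      admin.flush(region);
      admin.majorCompact(region); // the compaction request is asynchronous

      // 3. An "HDFS secure-delete" of the replaced HFile blocks would go
      //    here; no such command exists, which is what is being proposed.
    } finally {
      admin.close();
      table.close();
    }
  }

  public static void main(String[] args) throws Exception {
    secureDelete(HBaseConfiguration.create(), "users", Bytes.toBytes("user1234"));
  }
}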

Ian

On Jun 20, 2013, at 7:39 AM, Jean-Marc Spaggiari wrote:

Correct, that's another way. Just need to have one encryption key per
customer. And all what is written into HBase, over all the tables, is
encrypted with that key.

If the customer want to have all its data erased, just erased the key,
and you have no way to retrieve anything from HBase even if it's still
into all the tables. So now you can emit all the deletes required, and
that will be totally deleted on the next regular major compaction...

There will be a small impact on regular reads/write since you will
need to read the key first, but them a user delete will be way more
efficient.


2013/6/20 lars hofhansl <la...@apache.org>>:
IMHO the "proper" of doing such things is encryption.

0-ing the values or even overwriting with a pattern typically leaves traces of the old data on a magnetic platter that can be retrieved with proper forensics. (Secure erase of SSD is typically pretty secure, though).


For such use cases, files (HFiles) should be encrypted and the decryption keys should just be forgotten at the appropriate times.
I realize that for J-D's specific use case doing this at the HFile level would be very difficult.

Maybe the KVs' values could be stored encrypted with a user specific key. Deleting the user's data then means to forget that users key.


-- Lars

________________________________
From: Matt Corgan <mc...@hotpads.com>>
To: dev <de...@hbase.apache.org>>
Sent: Wednesday, June 19, 2013 2:15 PM
Subject: Re: Efficiently wiping out random data?


Would it be possible to zero-out all the value bytes for cells in existing
HFiles?  They keys would remain, but if you knew that ahead of time you
could design your keys so they don't contain important info.


On Wed, Jun 19, 2013 at 11:28 AM, Ian Varley <iv...@salesforce.com>> wrote:

At least in some cases, the answer to that question ("do you even have to
destroy your tapes?") is a resounding "yes". For some extreme cases (think
health care, privacy, etc), companies do all RDBMS backups to disk instead
of tape for that reason. (Transaction logs are considered different, I
guess because they're inherently transient? Who knows.)

The "no time travel" fix doesn't work, because you could still change that
code or ACL in the future and get back to the data. In these cases, one
must provably destroy the data.

That said, forcing full compactions (especially if they can be targeted
via stripes or levels or something) is an OK way to handle it, maybe
eventually with more ways to nice it down so it doesn't hose your cluster.

Ian

On Jun 19, 2013, at 11:27 AM, Todd Lipcon wrote:

I'd also question what exactly the regulatory requirements for deletion
are. For example, if you had tape backups of your Oracle DB, would you have
to drive to your off-site storage facility, grab every tape you ever made,
and zero out the user's data as well? I doubt it, considering tapes have
basically the same storage characteristics as HDFS in terms of inability to
random write.

Another example: deletes work the same way in most databases -- eg in
postgres, deletion of a record just consists of setting a record's "xmax"
column to the current transaction ID. This is equivalent to a tombstone,
and you have to wait for a VACUUM process to come along and actually delete
the record entry. In Oracle, the record will persist in a rollback segment
for a configurable amount of time, and you can use a Flashback query to
time travel and see it again. In Vertica, you also set an "xmax" entry and
wait until the next merge-out (like a major compaction).

Even in a filesystem, deletion doesn't typically remove data, unless you
use a tool like srm. It just unlinks the inode from the directory tree.

So, if any of the above systems satisfy their use case, then HBase ought to
as well. Perhaps there's an ACL we could add which would allow/disallow
users from doing time travel more than N seconds in the past..  maybe that
would help allay fears?

-Todd

On Wed, Jun 19, 2013 at 8:12 AM, Jesse Yates <je...@gmail.com>
<ma...@gmail.com>>wrote:

Chances are that date isn't completely "random". For instance, with a user
they are likely to have an id in their row key, so doing a filtering (with
a custom scanner) major compaction would clean that up. With Sergey's
compaction stuff coming in you could break that out even further and only
have to compact a small set of files to get that removal.

So it's hard, but as its not our direct use case, it's gonna be a few extra
hoops.
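
The custom-scanner compaction described here isn't an existing API, but a
client-side approximation is straightforward: scan the user's row-key prefix,
delete what comes back, then compact the affected regions. A minimal sketch,
assuming the user id really is a row-key prefix (the table name and key
layout are made up):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteByUserPrefix {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");    // hypothetical table
    byte[] prefix = Bytes.toBytes("user1234|");   // hypothetical key layout

    // Find every row belonging to the user.
    Scan scan = new Scan(prefix);
    scan.setFilter(new PrefixFilter(prefix));

    List<Delete> deletes = new ArrayList<Delete>();
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        deletes.add(new Delete(r.getRow()));
      }
    } finally {
      scanner.close();
    }

    // This only writes tombstones; the cells stay in the HFiles until the
    // affected regions are major compacted.
    table.delete(deletes);
    table.close();
  }
}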

On Wednesday, June 19, 2013, Kevin O'dell wrote:

Yeah, the immutable nature of HDFS is biting us here.


On Wed, Jun 19, 2013 at 8:46 AM, Jean-Daniel Cryans <jd...@apache.org>
<ma...@apache.org>
<javascript:;>
wrote:

That sounds like a very effective way for developers to kill clusters
with compactions :)

J-D

On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'dell <
kevin.odell@cloudera.com<javascript:;>

wrote:
JD,

 What about adding a flag for the delete, something like -full or
-true(it is early).  Once we issue the delete to the proper
row/region
we
run a flush, then execute a single region major compaction.  That
way,
if
it is a single record, or a subset of data the impact is minimal.  If
the
delete happens to hit every region we will compact every region(not
ideal).
Another thought would be an overwrite, but with versions this logic
becomes more complicated.


On Wed, Jun 19, 2013 at 8:31 AM, Jean-Daniel Cryans <
jdcryans@apache.org<ma...@apache.org> <javascript:;>
wrote:

Hey devs,

I was presenting at GOTO Amsterdam yesterday and I got a question
about a scenario that I've never thought about before. I'm wondering
what others think.

How do you efficiently wipe out random data in HBase?

For example, you have a website and a user asks you to close their
account and get rid of the data.

Would you say "sure can do, lemme just issue a couple of Deletes!"
and
call it a day? What if you really have to delete the data, not just
mask it, because of contractual obligations or local laws?

Major compacting is the obvious solution but it seems really
inefficient. Let's say you've got some truly random data to delete
and
it happens so that you have at least one row per region to get rid
of... then you need to basically rewrite the whole table?

My answer was such, and I told the attendee that it's not an easy
use
case to manage in HBase.

Thoughts?

J-D




--
Kevin O'Dell
Systems Engineer, Cloudera




--
Kevin O'Dell
Systems Engineer, Cloudera



--
-------------------
Jesse Yates
@jesse_yates
jyates.github.com<http://jyates.github.com>




--
Todd Lipcon
Software Engineer, Cloudera




Re: Efficiently wiping out random data?

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Correct, that's another way. You just need one encryption key per
customer, and everything written into HBase, across all the tables, is
encrypted with that key.

If the customer wants all their data erased, you just erase the key,
and there is no way to retrieve anything from HBase even though it's still
sitting in all the tables. You can then emit all the deletes required, and
the data will be fully removed on the next regular major compaction...

There will be a small impact on regular reads/writes since you will
need to read the key first, but a user delete will be far more
efficient.
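
A minimal sketch of that idea with plain javax.crypto, assuming some external
per-customer key store (the key handling, table, and column names here are
all made up). Once the customer's key is destroyed, whatever ciphertext is
still sitting in HFiles, WALs, or backups is unreadable:

import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PerCustomerEncryptedPut {
  public static void main(String[] args) throws Exception {
    // Hypothetical: one AES key per customer, held outside HBase.
    SecretKey customerKey = KeyGenerator.getInstance("AES").generateKey();

    // Encrypt the value before it ever reaches HBase; keep the IV with it.
    byte[] iv = new byte[16];
    new SecureRandom().nextBytes(iv);
    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(Cipher.ENCRYPT_MODE, customerKey, new IvParameterSpec(iv));
    byte[] ciphertext = cipher.doFinal(Bytes.toBytes("some sensitive value"));

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "customer_data");   // hypothetical table
    Put put = new Put(Bytes.toBytes("user1234"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("iv"), iv);
    put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), ciphertext);
    table.put(put);
    table.close();

    // "Erasing" the customer then means destroying customerKey in the key
    // store; regular deletes plus the next major compaction clean up the
    // now-useless ciphertext at leisure.
  }
}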


2013/6/20 lars hofhansl <la...@apache.org>:
> IMHO the "proper" of doing such things is encryption.
>
> 0-ing the values or even overwriting with a pattern typically leaves traces of the old data on a magnetic platter that can be retrieved with proper forensics. (Secure erase of SSD is typically pretty secure, though).
>
>
> For such use cases, files (HFiles) should be encrypted and the decryption keys should just be forgotten at the appropriate times.
> I realize that for J-D's specific use case doing this at the HFile level would be very difficult.
>
> Maybe the KVs' values could be stored encrypted with a user specific key. Deleting the user's data then means to forget that users key.
>
>
> -- Lars
>
> ________________________________
> From: Matt Corgan <mc...@hotpads.com>
> To: dev <de...@hbase.apache.org>
> Sent: Wednesday, June 19, 2013 2:15 PM
> Subject: Re: Efficiently wiping out random data?
>
>
> Would it be possible to zero-out all the value bytes for cells in existing
> HFiles?  They keys would remain, but if you knew that ahead of time you
> could design your keys so they don't contain important info.
>
>
> On Wed, Jun 19, 2013 at 11:28 AM, Ian Varley <iv...@salesforce.com> wrote:
>
>> At least in some cases, the answer to that question ("do you even have to
>> destroy your tapes?") is a resounding "yes". For some extreme cases (think
>> health care, privacy, etc), companies do all RDBMS backups to disk instead
>> of tape for that reason. (Transaction logs are considered different, I
>> guess because they're inherently transient? Who knows.)
>>
>> The "no time travel" fix doesn't work, because you could still change that
>> code or ACL in the future and get back to the data. In these cases, one
>> must provably destroy the data.
>>
>> That said, forcing full compactions (especially if they can be targeted
>> via stripes or levels or something) is an OK way to handle it, maybe
>> eventually with more ways to nice it down so it doesn't hose your cluster.
>>
>> Ian
>>
>> On Jun 19, 2013, at 11:27 AM, Todd Lipcon wrote:
>>
>> I'd also question what exactly the regulatory requirements for deletion
>> are. For example, if you had tape backups of your Oracle DB, would you have
>> to drive to your off-site storage facility, grab every tape you ever made,
>> and zero out the user's data as well? I doubt it, considering tapes have
>> basically the same storage characteristics as HDFS in terms of inability to
>> random write.
>>
>> Another example: deletes work the same way in most databases -- eg in
>> postgres, deletion of a record just consists of setting a record's "xmax"
>> column to the current transaction ID. This is equivalent to a tombstone,
>> and you have to wait for a VACUUM process to come along and actually delete
>> the record entry. In Oracle, the record will persist in a rollback segment
>> for a configurable amount of time, and you can use a Flashback query to
>> time travel and see it again. In Vertica, you also set an "xmax" entry and
>> wait until the next merge-out (like a major compaction).
>>
>> Even in a filesystem, deletion doesn't typically remove data, unless you
>> use a tool like srm. It just unlinks the inode from the directory tree.
>>
>> So, if any of the above systems satisfy their use case, then HBase ought to
>> as well. Perhaps there's an ACL we could add which would allow/disallow
>> users from doing time travel more than N seconds in the past..  maybe that
>> would help allay fears?
>>
>> -Todd
>>
>> On Wed, Jun 19, 2013 at 8:12 AM, Jesse Yates <jesse.k.yates@gmail.com
>> <ma...@gmail.com>>wrote:
>>
>> Chances are that date isn't completely "random". For instance, with a user
>> they are likely to have an id in their row key, so doing a filtering (with
>> a custom scanner) major compaction would clean that up. With Sergey's
>> compaction stuff coming in you could break that out even further and only
>> have to compact a small set of files to get that removal.
>>
>> So it's hard, but as its not our direct use case, it's gonna be a few extra
>> hoops.
>>
>> On Wednesday, June 19, 2013, Kevin O'dell wrote:
>>
>> Yeah, the immutable nature of HDFS is biting us here.
>>
>>
>> On Wed, Jun 19, 2013 at 8:46 AM, Jean-Daniel Cryans <jdcryans@apache.org
>> <ma...@apache.org>
>> <javascript:;>
>> wrote:
>>
>> That sounds like a very effective way for developers to kill clusters
>> with compactions :)
>>
>> J-D
>>
>> On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'dell <
>> kevin.odell@cloudera.com<javascript:;>
>>
>> wrote:
>> JD,
>>
>>   What about adding a flag for the delete, something like -full or
>> -true(it is early).  Once we issue the delete to the proper
>> row/region
>> we
>> run a flush, then execute a single region major compaction.  That
>> way,
>> if
>> it is a single record, or a subset of data the impact is minimal.  If
>> the
>> delete happens to hit every region we will compact every region(not
>> ideal).
>> Another thought would be an overwrite, but with versions this logic
>> becomes more complicated.
>>
>>
>> On Wed, Jun 19, 2013 at 8:31 AM, Jean-Daniel Cryans <
>> jdcryans@apache.org<ma...@apache.org> <javascript:;>
>> wrote:
>>
>> Hey devs,
>>
>> I was presenting at GOTO Amsterdam yesterday and I got a question
>> about a scenario that I've never thought about before. I'm wondering
>> what others think.
>>
>> How do you efficiently wipe out random data in HBase?
>>
>> For example, you have a website and a user asks you to close their
>> account and get rid of the data.
>>
>> Would you say "sure can do, lemme just issue a couple of Deletes!"
>> and
>> call it a day? What if you really have to delete the data, not just
>> mask it, because of contractual obligations or local laws?
>>
>> Major compacting is the obvious solution but it seems really
>> inefficient. Let's say you've got some truly random data to delete
>> and
>> it happens so that you have at least one row per region to get rid
>> of... then you need to basically rewrite the whole table?
>>
>> My answer was such, and I told the attendee that it's not an easy
>> use
>> case to manage in HBase.
>>
>> Thoughts?
>>
>> J-D
>>
>>
>>
>>
>> --
>> Kevin O'Dell
>> Systems Engineer, Cloudera
>>
>>
>>
>>
>> --
>> Kevin O'Dell
>> Systems Engineer, Cloudera
>>
>>
>>
>> --
>> -------------------
>> Jesse Yates
>> @jesse_yates
>> jyates.github.com<http://jyates.github.com>
>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>>

Re: Efficiently wiping out random data?

Posted by lars hofhansl <la...@apache.org>.
IMHO the "proper" of doing such things is encryption.

0-ing the values or even overwriting with a pattern typically leaves traces of the old data on a magnetic platter that can be retrieved with proper forensics. (Secure erase of SSD is typically pretty secure, though).


For such use cases, files (HFiles) should be encrypted and the decryption keys should just be forgotten at the appropriate times.
I realize that for J-D's specific use case doing this at the HFile level would be very difficult.

Maybe the KVs' values could be stored encrypted with a user specific key. Deleting the user's data then means to forget that users key.


-- Lars

________________________________
From: Matt Corgan <mc...@hotpads.com>
To: dev <de...@hbase.apache.org> 
Sent: Wednesday, June 19, 2013 2:15 PM
Subject: Re: Efficiently wiping out random data?


Would it be possible to zero-out all the value bytes for cells in existing
HFiles?  They keys would remain, but if you knew that ahead of time you
could design your keys so they don't contain important info.


On Wed, Jun 19, 2013 at 11:28 AM, Ian Varley <iv...@salesforce.com> wrote:

> At least in some cases, the answer to that question ("do you even have to
> destroy your tapes?") is a resounding "yes". For some extreme cases (think
> health care, privacy, etc), companies do all RDBMS backups to disk instead
> of tape for that reason. (Transaction logs are considered different, I
> guess because they're inherently transient? Who knows.)
>
> The "no time travel" fix doesn't work, because you could still change that
> code or ACL in the future and get back to the data. In these cases, one
> must provably destroy the data.
>
> That said, forcing full compactions (especially if they can be targeted
> via stripes or levels or something) is an OK way to handle it, maybe
> eventually with more ways to nice it down so it doesn't hose your cluster.
>
> Ian
>
> On Jun 19, 2013, at 11:27 AM, Todd Lipcon wrote:
>
> I'd also question what exactly the regulatory requirements for deletion
> are. For example, if you had tape backups of your Oracle DB, would you have
> to drive to your off-site storage facility, grab every tape you ever made,
> and zero out the user's data as well? I doubt it, considering tapes have
> basically the same storage characteristics as HDFS in terms of inability to
> random write.
>
> Another example: deletes work the same way in most databases -- eg in
> postgres, deletion of a record just consists of setting a record's "xmax"
> column to the current transaction ID. This is equivalent to a tombstone,
> and you have to wait for a VACUUM process to come along and actually delete
> the record entry. In Oracle, the record will persist in a rollback segment
> for a configurable amount of time, and you can use a Flashback query to
> time travel and see it again. In Vertica, you also set an "xmax" entry and
> wait until the next merge-out (like a major compaction).
>
> Even in a filesystem, deletion doesn't typically remove data, unless you
> use a tool like srm. It just unlinks the inode from the directory tree.
>
> So, if any of the above systems satisfy their use case, then HBase ought to
> as well. Perhaps there's an ACL we could add which would allow/disallow
> users from doing time travel more than N seconds in the past..  maybe that
> would help allay fears?
>
> -Todd
>
> On Wed, Jun 19, 2013 at 8:12 AM, Jesse Yates <jesse.k.yates@gmail.com
> <ma...@gmail.com>>wrote:
>
> Chances are that date isn't completely "random". For instance, with a user
> they are likely to have an id in their row key, so doing a filtering (with
> a custom scanner) major compaction would clean that up. With Sergey's
> compaction stuff coming in you could break that out even further and only
> have to compact a small set of files to get that removal.
>
> So it's hard, but as its not our direct use case, it's gonna be a few extra
> hoops.
>
> On Wednesday, June 19, 2013, Kevin O'dell wrote:
>
> Yeah, the immutable nature of HDFS is biting us here.
>
>
> On Wed, Jun 19, 2013 at 8:46 AM, Jean-Daniel Cryans <jdcryans@apache.org
> <ma...@apache.org>
> <javascript:;>
> wrote:
>
> That sounds like a very effective way for developers to kill clusters
> with compactions :)
>
> J-D
>
> On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'dell <
> kevin.odell@cloudera.com<javascript:;>
>
> wrote:
> JD,
>
>   What about adding a flag for the delete, something like -full or
> -true(it is early).  Once we issue the delete to the proper
> row/region
> we
> run a flush, then execute a single region major compaction.  That
> way,
> if
> it is a single record, or a subset of data the impact is minimal.  If
> the
> delete happens to hit every region we will compact every region(not
> ideal).
> Another thought would be an overwrite, but with versions this logic
> becomes more complicated.
>
>
> On Wed, Jun 19, 2013 at 8:31 AM, Jean-Daniel Cryans <
> jdcryans@apache.org<ma...@apache.org> <javascript:;>
> wrote:
>
> Hey devs,
>
> I was presenting at GOTO Amsterdam yesterday and I got a question
> about a scenario that I've never thought about before. I'm wondering
> what others think.
>
> How do you efficiently wipe out random data in HBase?
>
> For example, you have a website and a user asks you to close their
> account and get rid of the data.
>
> Would you say "sure can do, lemme just issue a couple of Deletes!"
> and
> call it a day? What if you really have to delete the data, not just
> mask it, because of contractual obligations or local laws?
>
> Major compacting is the obvious solution but it seems really
> inefficient. Let's say you've got some truly random data to delete
> and
> it happens so that you have at least one row per region to get rid
> of... then you need to basically rewrite the whole table?
>
> My answer was such, and I told the attendee that it's not an easy
> use
> case to manage in HBase.
>
> Thoughts?
>
> J-D
>
>
>
>
> --
> Kevin O'Dell
> Systems Engineer, Cloudera
>
>
>
>
> --
> Kevin O'Dell
> Systems Engineer, Cloudera
>
>
>
> --
> -------------------
> Jesse Yates
> @jesse_yates
> jyates.github.com<http://jyates.github.com>
>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>

Re: Efficiently wiping out random data?

Posted by Matt Corgan <mc...@hotpads.com>.
Would it be possible to zero-out all the value bytes for cells in existing
HFiles?  The keys would remain, but if you knew that ahead of time you
could design your keys so they don't contain important info.


On Wed, Jun 19, 2013 at 11:28 AM, Ian Varley <iv...@salesforce.com> wrote:

> At least in some cases, the answer to that question ("do you even have to
> destroy your tapes?") is a resounding "yes". For some extreme cases (think
> health care, privacy, etc), companies do all RDBMS backups to disk instead
> of tape for that reason. (Transaction logs are considered different, I
> guess because they're inherently transient? Who knows.)
>
> The "no time travel" fix doesn't work, because you could still change that
> code or ACL in the future and get back to the data. In these cases, one
> must provably destroy the data.
>
> That said, forcing full compactions (especially if they can be targeted
> via stripes or levels or something) is an OK way to handle it, maybe
> eventually with more ways to nice it down so it doesn't hose your cluster.
>
> Ian
>
> On Jun 19, 2013, at 11:27 AM, Todd Lipcon wrote:
>
> I'd also question what exactly the regulatory requirements for deletion
> are. For example, if you had tape backups of your Oracle DB, would you have
> to drive to your off-site storage facility, grab every tape you ever made,
> and zero out the user's data as well? I doubt it, considering tapes have
> basically the same storage characteristics as HDFS in terms of inability to
> random write.
>
> Another example: deletes work the same way in most databases -- eg in
> postgres, deletion of a record just consists of setting a record's "xmax"
> column to the current transaction ID. This is equivalent to a tombstone,
> and you have to wait for a VACUUM process to come along and actually delete
> the record entry. In Oracle, the record will persist in a rollback segment
> for a configurable amount of time, and you can use a Flashback query to
> time travel and see it again. In Vertica, you also set an "xmax" entry and
> wait until the next merge-out (like a major compaction).
>
> Even in a filesystem, deletion doesn't typically remove data, unless you
> use a tool like srm. It just unlinks the inode from the directory tree.
>
> So, if any of the above systems satisfy their use case, then HBase ought to
> as well. Perhaps there's an ACL we could add which would allow/disallow
> users from doing time travel more than N seconds in the past..  maybe that
> would help allay fears?
>
> -Todd
>
> On Wed, Jun 19, 2013 at 8:12 AM, Jesse Yates <jesse.k.yates@gmail.com
> <ma...@gmail.com>>wrote:
>
> Chances are that date isn't completely "random". For instance, with a user
> they are likely to have an id in their row key, so doing a filtering (with
> a custom scanner) major compaction would clean that up. With Sergey's
> compaction stuff coming in you could break that out even further and only
> have to compact a small set of files to get that removal.
>
> So it's hard, but as its not our direct use case, it's gonna be a few extra
> hoops.
>
> On Wednesday, June 19, 2013, Kevin O'dell wrote:
>
> Yeah, the immutable nature of HDFS is biting us here.
>
>
> On Wed, Jun 19, 2013 at 8:46 AM, Jean-Daniel Cryans <jdcryans@apache.org
> <ma...@apache.org>
> <javascript:;>
> wrote:
>
> That sounds like a very effective way for developers to kill clusters
> with compactions :)
>
> J-D
>
> On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'dell <
> kevin.odell@cloudera.com<javascript:;>
>
> wrote:
> JD,
>
>   What about adding a flag for the delete, something like -full or
> -true(it is early).  Once we issue the delete to the proper
> row/region
> we
> run a flush, then execute a single region major compaction.  That
> way,
> if
> it is a single record, or a subset of data the impact is minimal.  If
> the
> delete happens to hit every region we will compact every region(not
> ideal).
> Another thought would be an overwrite, but with versions this logic
> becomes more complicated.
>
>
> On Wed, Jun 19, 2013 at 8:31 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>
> Hey devs,
>
> I was presenting at GOTO Amsterdam yesterday and I got a question
> about a scenario that I've never thought about before. I'm wondering
> what others think.
>
> How do you efficiently wipe out random data in HBase?
>
> For example, you have a website and a user asks you to close their
> account and get rid of the data.
>
> Would you say "sure can do, lemme just issue a couple of Deletes!"
> and
> call it a day? What if you really have to delete the data, not just
> mask it, because of contractual obligations or local laws?
>
> Major compacting is the obvious solution but it seems really
> inefficient. Let's say you've got some truly random data to delete
> and
> it happens so that you have at least one row per region to get rid
> of... then you need to basically rewrite the whole table?
>
> My answer was such, and I told the attendee that it's not an easy
> use
> case to manage in HBase.
>
> Thoughts?
>
> J-D
>
>
>
>
> --
> Kevin O'Dell
> Systems Engineer, Cloudera
>
>
>
>
> --
> Kevin O'Dell
> Systems Engineer, Cloudera
>
>
>
> --
> -------------------
> Jesse Yates
> @jesse_yates
> jyates.github.com
>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>

Re: Efficiently wiping out random data?

Posted by Ian Varley <iv...@salesforce.com>.
At least in some cases, the answer to that question ("do you even have to destroy your tapes?") is a resounding "yes". For some extreme cases (think health care, privacy, etc), companies do all RDBMS backups to disk instead of tape for that reason. (Transaction logs are considered different, I guess because they're inherently transient? Who knows.)

The "no time travel" fix doesn't work, because you could still change that code or ACL in the future and get back to the data. In these cases, one must provably destroy the data.

That said, forcing full compactions (especially if they can be targeted via stripes or levels or something) is an OK way to handle it, maybe eventually with more ways to nice it down so it doesn't hose your cluster.

Ian

On Jun 19, 2013, at 11:27 AM, Todd Lipcon wrote:

I'd also question what exactly the regulatory requirements for deletion
are. For example, if you had tape backups of your Oracle DB, would you have
to drive to your off-site storage facility, grab every tape you ever made,
and zero out the user's data as well? I doubt it, considering tapes have
basically the same storage characteristics as HDFS in terms of inability to
random write.

Another example: deletes work the same way in most databases -- eg in
postgres, deletion of a record just consists of setting a record's "xmax"
column to the current transaction ID. This is equivalent to a tombstone,
and you have to wait for a VACUUM process to come along and actually delete
the record entry. In Oracle, the record will persist in a rollback segment
for a configurable amount of time, and you can use a Flashback query to
time travel and see it again. In Vertica, you also set an "xmax" entry and
wait until the next merge-out (like a major compaction).

Even in a filesystem, deletion doesn't typically remove data, unless you
use a tool like srm. It just unlinks the inode from the directory tree.

So, if any of the above systems satisfy their use case, then HBase ought to
as well. Perhaps there's an ACL we could add which would allow/disallow
users from doing time travel more than N seconds in the past..  maybe that
would help allay fears?

-Todd

On Wed, Jun 19, 2013 at 8:12 AM, Jesse Yates <je...@gmail.com> wrote:

Chances are that data isn't completely "random". For instance, with a user
they are likely to have an id in their row key, so doing a filtering (with
a custom scanner) major compaction would clean that up. With Sergey's
compaction stuff coming in you could break that out even further and only
have to compact a small set of files to get that removal.

So it's hard, but as it's not our direct use case, it's gonna be a few extra
hoops.

On Wednesday, June 19, 2013, Kevin O'dell wrote:

Yeah, the immutable nature of HDFS is biting us here.


On Wed, Jun 19, 2013 at 8:46 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:

That sounds like a very effective way for developers to kill clusters
with compactions :)

J-D

On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'dell <kevin.odell@cloudera.com> wrote:
JD,

  What about adding a flag for the delete, something like -full or
-true(it is early).  Once we issue the delete to the proper
row/region
we
run a flush, then execute a single region major compaction.  That
way,
if
it is a single record, or a subset of data the impact is minimal.  If
the
delete happens to hit every region we will compact every region(not
ideal).
Another thought would be an overwrite, but with versions this logic
becomes more complicated.


On Wed, Jun 19, 2013 at 8:31 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

Hey devs,

I was presenting at GOTO Amsterdam yesterday and I got a question
about a scenario that I've never thought about before. I'm wondering
what others think.

How do you efficiently wipe out random data in HBase?

For example, you have a website and a user asks you to close their
account and get rid of the data.

Would you say "sure can do, lemme just issue a couple of Deletes!"
and
call it a day? What if you really have to delete the data, not just
mask it, because of contractual obligations or local laws?

Major compacting is the obvious solution but it seems really
inefficient. Let's say you've got some truly random data to delete
and
it happens so that you have at least one row per region to get rid
of... then you need to basically rewrite the whole table?

My answer was such, and I told the attendee that it's not an easy
use
case to manage in HBase.

Thoughts?

J-D




--
Kevin O'Dell
Systems Engineer, Cloudera




--
Kevin O'Dell
Systems Engineer, Cloudera



--
-------------------
Jesse Yates
@jesse_yates
jyates.github.com




--
Todd Lipcon
Software Engineer, Cloudera


Re: Efficiently wiping out random data?

Posted by Todd Lipcon <to...@cloudera.com>.
I'd also question what exactly the regulatory requirements for deletion
are. For example, if you had tape backups of your Oracle DB, would you have
to drive to your off-site storage facility, grab every tape you ever made,
and zero out the user's data as well? I doubt it, considering tapes have
basically the same storage characteristics as HDFS in terms of inability to
random write.

Another example: deletes work the same way in most databases -- eg in
postgres, deletion of a record just consists of setting a record's "xmax"
column to the current transaction ID. This is equivalent to a tombstone,
and you have to wait for a VACUUM process to come along and actually delete
the record entry. In Oracle, the record will persist in a rollback segment
for a configurable amount of time, and you can use a Flashback query to
time travel and see it again. In Vertica, you also set an "xmax" entry and
wait until the next merge-out (like a major compaction).

Even in a filesystem, deletion doesn't typically remove data, unless you
use a tool like srm. It just unlinks the inode from the directory tree.

So, if any of the above systems satisfy their use case, then HBase ought to
as well. Perhaps there's an ACL we could add which would allow/disallow
users from doing time travel more than N seconds in the past... maybe that
would help allay fears?
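
For anyone who hasn't played with it, "time travel" here just means asking for
older cell versions by timestamp. A rough sketch of such a read (table, row,
family and qualifier names are made up, and the exact method names shift a bit
between client versions):

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadOldVersions {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection();
         Table table = conn.getTable(TableName.valueOf("users"))) {   // made-up table
      Get get = new Get(Bytes.toBytes("user12345"));                  // made-up row key
      get.readVersions(5);                          // setMaxVersions(5) on older clients
      get.setTimeRange(0, System.currentTimeMillis() - 86_400_000L);  // only cells older than a day
      Result result = table.get(get);
      for (Cell c : result.getColumnCells(Bytes.toBytes("info"), Bytes.toBytes("email"))) {
        System.out.println(c.getTimestamp() + " " + Bytes.toString(CellUtil.cloneValue(c)));
      }
    }
  }
}

An ACL that refuses time ranges or version counts reaching further back than N
seconds would block exactly this kind of read, though of course the old bytes
would still be sitting in the HFiles.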

-Todd

On Wed, Jun 19, 2013 at 8:12 AM, Jesse Yates <je...@gmail.com>wrote:

> Chances are that data isn't completely "random". For instance, with a user
> they are likely to have an id in their row key, so doing a filtering (with
> a custom scanner) major compaction would clean that up. With Sergey's
> compaction stuff coming in you could break that out even further and only
> have to compact a small set of files to get that removal.
>
> So it's hard, but as it's not our direct use case, it's gonna be a few extra
> hoops.
>
> On Wednesday, June 19, 2013, Kevin O'dell wrote:
>
> > Yeah, the immutable nature of HDFS is biting us here.
> >
> >
> > On Wed, Jun 19, 2013 at 8:46 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >
> > > That sounds like a very effective way for developers to kill clusters
> > > with compactions :)
> > >
> > > J-D
> > >
> > > On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'dell <kevin.odell@cloudera.com> wrote:
> > > > JD,
> > > >
> > > >    What about adding a flag for the delete, something like -full or
> > > > -true(it is early).  Once we issue the delete to the proper
> row/region
> > we
> > > > run a flush, then execute a single region major compaction.  That
> way,
> > if
> > > > it is a single record, or a subset of data the impact is minimal.  If
> > the
> > > > delete happens to hit every region we will compact every region(not
> > > ideal).
> > > >  Another thought would be an overwrite, but with versions this logic
> > > > becomes more complicated.
> > > >
> > > >
> > > > On Wed, Jun 19, 2013 at 8:31 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> > > >
> > > >> Hey devs,
> > > >>
> > > >> I was presenting at GOTO Amsterdam yesterday and I got a question
> > > >> about a scenario that I've never thought about before. I'm wondering
> > > >> what others think.
> > > >>
> > > >> How do you efficiently wipe out random data in HBase?
> > > >>
> > > >> For example, you have a website and a user asks you to close their
> > > >> account and get rid of the data.
> > > >>
> > > >> Would you say "sure can do, lemme just issue a couple of Deletes!"
> and
> > > >> call it a day? What if you really have to delete the data, not just
> > > >> mask it, because of contractual obligations or local laws?
> > > >>
> > > >> Major compacting is the obvious solution but it seems really
> > > >> inefficient. Let's say you've got some truly random data to delete
> and
> > > >> it happens so that you have at least one row per region to get rid
> > > >> of... then you need to basically rewrite the whole table?
> > > >>
> > > >> My answer was such, and I told the attendee that it's not an easy
> use
> > > >> case to manage in HBase.
> > > >>
> > > >> Thoughts?
> > > >>
> > > >> J-D
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Kevin O'Dell
> > > > Systems Engineer, Cloudera
> > >
> >
> >
> >
> > --
> > Kevin O'Dell
> > Systems Engineer, Cloudera
> >
>
>
> --
> -------------------
> Jesse Yates
> @jesse_yates
> jyates.github.com
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Efficiently wiping out random data?

Posted by Jesse Yates <je...@gmail.com>.
Chances are that data isn't completely "random". For instance, with a user
they are likely to have an id in their row key, so doing a filtering (with
a custom scanner) major compaction would clean that up. With Sergey's
compaction stuff coming in you could break that out even further and only
have to compact a small set of files to get that removal.

So it's hard, but as it's not our direct use case, it's gonna be a few extra
hoops.
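
(A true filtering compaction would hang off the RegionObserver compaction
hooks, whose signatures move around between HBase versions, so here is only a
hedged sketch of the row-key-prefix part of the idea: find the user's rows by
prefix and tombstone them in bulk, then major-compact just the regions covering
that prefix. The table name, prefix separator and row key layout below are all
made up.)

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TombstoneUserByPrefix {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection();
         Table table = conn.getTable(TableName.valueOf("events"))) {    // made-up table
      Scan scan = new Scan();
      scan.setRowPrefixFilter(Bytes.toBytes("user12345|"));  // assumes the user id leads the row key
      List<Delete> deletes = new ArrayList<>();
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          deletes.add(new Delete(r.getRow()));
        }
      }
      table.delete(deletes);   // tombstones only; the bytes stay until a major compaction
    }
  }
}

The rewrite cost then stays roughly proportional to the regions that actually
hold that user's data instead of the whole table.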

On Wednesday, June 19, 2013, Kevin O'dell wrote:

> Yeah, the immutable nature of HDFS is biting us here.
>
>
> On Wed, Jun 19, 2013 at 8:46 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>
> > That sounds like a very effective way for developers to kill clusters
> > with compactions :)
> >
> > J-D
> >
> > On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'dell <kevin.odell@cloudera.com> wrote:
> > > JD,
> > >
> > >    What about adding a flag for the delete, something like -full or
> > > -true(it is early).  Once we issue the delete to the proper row/region
> we
> > > run a flush, then execute a single region major compaction.  That way,
> if
> > > it is a single record, or a subset of data the impact is minimal.  If
> the
> > > delete happens to hit every region we will compact every region(not
> > ideal).
> > >  Another thought would be an overwrite, but with versions this logic
> > > becomes more complicated.
> > >
> > >
> > > On Wed, Jun 19, 2013 at 8:31 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> > >
> > >> Hey devs,
> > >>
> > >> I was presenting at GOTO Amsterdam yesterday and I got a question
> > >> about a scenario that I've never thought about before. I'm wondering
> > >> what others think.
> > >>
> > >> How do you efficiently wipe out random data in HBase?
> > >>
> > >> For example, you have a website and a user asks you to close their
> > >> account and get rid of the data.
> > >>
> > >> Would you say "sure can do, lemme just issue a couple of Deletes!" and
> > >> call it a day? What if you really have to delete the data, not just
> > >> mask it, because of contractual obligations or local laws?
> > >>
> > >> Major compacting is the obvious solution but it seems really
> > >> inefficient. Let's say you've got some truly random data to delete and
> > >> it happens so that you have at least one row per region to get rid
> > >> of... then you need to basically rewrite the whole table?
> > >>
> > >> My answer was such, and I told the attendee that it's not an easy use
> > >> case to manage in HBase.
> > >>
> > >> Thoughts?
> > >>
> > >> J-D
> > >>
> > >
> > >
> > >
> > > --
> > > Kevin O'Dell
> > > Systems Engineer, Cloudera
> >
>
>
>
> --
> Kevin O'Dell
> Systems Engineer, Cloudera
>


-- 
-------------------
Jesse Yates
@jesse_yates
jyates.github.com

Re: Efficiently wiping out random data?

Posted by Kevin O'dell <ke...@cloudera.com>.
Yeah, the immutable nature of HDFS is biting us here.


On Wed, Jun 19, 2013 at 8:46 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> That sounds like a very effective way for developers to kill clusters
> with compactions :)
>
> J-D
>
> On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'dell <ke...@cloudera.com>
> wrote:
> > JD,
> >
> >    What about adding a flag for the delete, something like -full or
> > -true(it is early).  Once we issue the delete to the proper row/region we
> > run a flush, then execute a single region major compaction.  That way, if
> > it is a single record, or a subset of data the impact is minimal.  If the
> > delete happens to hit every region we will compact every region(not
> ideal).
> >  Another thought would be an overwrite, but with versions this logic
> > becomes more complicated.
> >
> >
> > On Wed, Jun 19, 2013 at 8:31 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >
> >> Hey devs,
> >>
> >> I was presenting at GOTO Amsterdam yesterday and I got a question
> >> about a scenario that I've never thought about before. I'm wondering
> >> what others think.
> >>
> >> How do you efficiently wipe out random data in HBase?
> >>
> >> For example, you have a website and a user asks you to close their
> >> account and get rid of the data.
> >>
> >> Would you say "sure can do, lemme just issue a couple of Deletes!" and
> >> call it a day? What if you really have to delete the data, not just
> >> mask it, because of contractual obligations or local laws?
> >>
> >> Major compacting is the obvious solution but it seems really
> >> inefficient. Let's say you've got some truly random data to delete and
> >> it happens so that you have at least one row per region to get rid
> >> of... then you need to basically rewrite the whole table?
> >>
> >> My answer was such, and I told the attendee that it's not an easy use
> >> case to manage in HBase.
> >>
> >> Thoughts?
> >>
> >> J-D
> >>
> >
> >
> >
> > --
> > Kevin O'Dell
> > Systems Engineer, Cloudera
>



-- 
Kevin O'Dell
Systems Engineer, Cloudera

Re: Efficiently wiping out random data?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
That sounds like a very effective way for developers to kill clusters
with compactions :)

J-D

On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'dell <ke...@cloudera.com> wrote:
> JD,
>
>    What about adding a flag for the delete, something like -full or
> -true(it is early).  Once we issue the delete to the proper row/region we
> run a flush, then execute a single region major compaction.  That way, if
> it is a single record, or a subset of data the impact is minimal.  If the
> delete happens to hit every region we will compact every region(not ideal).
>  Another thought would be an overwrite, but with versions this logic
> becomes more complicated.
>
>
> On Wed, Jun 19, 2013 at 8:31 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>
>> Hey devs,
>>
>> I was presenting at GOTO Amsterdam yesterday and I got a question
>> about a scenario that I've never thought about before. I'm wondering
>> what others think.
>>
>> How do you efficiently wipe out random data in HBase?
>>
>> For example, you have a website and a user asks you to close their
>> account and get rid of the data.
>>
>> Would you say "sure can do, lemme just issue a couple of Deletes!" and
>> call it a day? What if you really have to delete the data, not just
>> mask it, because of contractual obligations or local laws?
>>
>> Major compacting is the obvious solution but it seems really
>> inefficient. Let's say you've got some truly random data to delete and
>> it happens so that you have at least one row per region to get rid
>> of... then you need to basically rewrite the whole table?
>>
>> My answer was such, and I told the attendee that it's not an easy use
>> case to manage in HBase.
>>
>> Thoughts?
>>
>> J-D
>>
>
>
>
> --
> Kevin O'Dell
> Systems Engineer, Cloudera

Re: Efficiently wiping out random data?

Posted by Kevin O'dell <ke...@cloudera.com>.
JD,

   What about adding a flag for the delete, something like -full or
-true (it is early). Once we issue the delete to the proper row/region we
run a flush, then execute a single-region major compaction. That way, if
it is a single record or a subset of data, the impact is minimal. If the
delete happens to hit every region we will compact every region (not
ideal). Another thought would be an overwrite, but with versions this
logic becomes more complicated.
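
Roughly what that flag would do under the hood, sketched with the plain client
API (table and row names are made up; flush/majorCompactRegion are the standard
Admin calls, though exact method names vary a little by version):

import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteThenCompactRegion {
  public static void main(String[] args) throws Exception {
    TableName tn = TableName.valueOf("users");    // made-up table
    byte[] row = Bytes.toBytes("user12345");      // made-up row key
    try (Connection conn = ConnectionFactory.createConnection();
         Table table = conn.getTable(tn);
         Admin admin = conn.getAdmin();
         RegionLocator locator = conn.getRegionLocator(tn)) {

      table.delete(new Delete(row));   // 1. tombstone only; the bytes stay in the HFiles

      admin.flush(tn);                 // 2. flush so the tombstone itself is persisted

      // 3. major-compact just the region that owns the row, which rewrites its
      //    HFiles without the deleted cells (getRegion() is getRegionInfo() on
      //    older client versions)
      HRegionLocation loc = locator.getRegionLocation(row);
      admin.majorCompactRegion(loc.getRegion().getRegionName());
    }
  }
}

Keeping the compaction to the one affected region is what makes this tolerable;
the degenerate case where the deletes touch every region still amounts to
rewriting the whole table.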


On Wed, Jun 19, 2013 at 8:31 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> Hey devs,
>
> I was presenting at GOTO Amsterdam yesterday and I got a question
> about a scenario that I've never thought about before. I'm wondering
> what others think.
>
> How do you efficiently wipe out random data in HBase?
>
> For example, you have a website and a user asks you to close their
> account and get rid of the data.
>
> Would you say "sure can do, lemme just issue a couple of Deletes!" and
> call it a day? What if you really have to delete the data, not just
> mask it, because of contractual obligations or local laws?
>
> Major compacting is the obvious solution but it seems really
> inefficient. Let's say you've got some truly random data to delete and
> it happens so that you have at least one row per region to get rid
> of... then you need to basically rewrite the whole table?
>
> My answer was such, and I told the attendee that it's not an easy use
> case to manage in HBase.
>
> Thoughts?
>
> J-D
>



-- 
Kevin O'Dell
Systems Engineer, Cloudera