Posted to dev@phoenix.apache.org by Gabriel Reid <ga...@gmail.com> on 2014/07/23 16:09:53 UTC

Default use of HColumnDescriptor#setKeepDeletedCells(true)

Hi,

I noticed that HColumnDescriptor.KEEP_DELETED_CELLS is enabled by
default on new Phoenix tables. This seems like a bit of an unexpected
default, as it means (at least as far as I understand it) that data
deleted with delete statements will never actually be cleared, even
after a major compaction.

Can anyone let me know what the reasoning is behind this? Any
functional requirement within Phoenix that makes use of this default
property (i.e. if I disable it in my DDL, is there anything that we
know won't work then)? And then going further, is this something we
definitely want to keep as a default?

Thanks,

Gabriel

Re: Default use of HColumnDescriptor#setKeepDeletedCells(true)

Posted by Gabriel Reid <ga...@gmail.com>.
Thanks for the info, James. I've created PHOENIX-1108 to look into this further.

I've also just done a bit more experimentation to understand the
implications of KEEP_DELETED_CELLS a bit better, and have noticed that
this is primarily an issue for me at the moment because I'm setting
the VERSIONS attribute to Integer.MAX_VALUE. Combined with
KEEP_DELETED_CELLS=true, this basically means that I can never
actually fully delete data in the current situation.
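
The interaction between VERSIONS and KEEP_DELETED_CELLS described above
can be sketched with a toy model. This is not HBase code: the class
RetentionSketch and the method survivors are made up for illustration,
and the retention rule is an assumption based on the HBase documentation's
statement that deleted cells remain subject to the maximum number of
versions (TTL is ignored here).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy retention model; NOT HBase code. All names are hypothetical.
class RetentionSketch {

    // Returns the put timestamps that survive a major compaction in this toy model,
    // given a single delete marker at deleteTs.
    static List<Long> survivors(List<Long> putTimestamps, long deleteTs,
                                boolean keepDeletedCells, int maxVersions) {
        List<Long> newestFirst = new ArrayList<>(putTimestamps);
        newestFirst.sort(Comparator.reverseOrder());
        List<Long> kept = new ArrayList<>();
        for (long ts : newestFirst) {
            boolean masked = ts <= deleteTs;
            // Assumption: without the option, masked puts are dropped at compaction;
            // with it, only the version limit removes them.
            if (masked && !keepDeletedCells) {
                continue;
            }
            if (kept.size() < maxVersions) {
                kept.add(ts);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Long> puts = Arrays.asList(1L, 2L, 3L);
        long deleteTs = 3L; // delete marker covering every put
        System.out.println(survivors(puts, deleteTs, false, Integer.MAX_VALUE));
        System.out.println(survivors(puts, deleteTs, true, 1));
        System.out.println(survivors(puts, deleteTs, true, Integer.MAX_VALUE));
    }
}
```

In this model the last case (option on, unlimited versions) retains every
deleted put, which matches the "can never fully delete" situation above.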

- Gabriel


On Wed, Jul 23, 2014 at 8:42 PM, James Taylor <ja...@apache.org> wrote:
> Good question, Gabriel. I believe that the deleted cells are cleaned
> up after a second major compaction with the KEEP_DELETED_CELLS option
> enabled. Lars H. implemented this option, so he can comment more, but
> AFAIK he couldn't figure out how to get them to be collected on the
> first major compaction. IMHO, this seems like a bug (but what do I
> know, I'm not an HBase committer :-) ).
>
> KEEP_DELETED_CELLS is required for flashback or
> point-in-time queries. IMHO, without this option, HBase doesn't really
> work correctly. Though you might argue "we never do that" and turn it
> off, under the covers Phoenix is doing point-in-time queries. If you
> have a query that starts at t1 and runs until t5, it won't see data
> inserted after t1. Say a delete was done on a row at t2. Without
> KEEP_DELETED_CELLS set to true, your query could potentially see the
> effect of this delete.
>
> Perhaps the MVCC used by HBase should (does?) take care of this
> automatically without us setting a max on the scan time range, but I'm
> not sure. If it does, then we could likely not have this be the
> default. We'd need to test this with the new ChunkedResultIterator as
> well.
>
> Maybe file a JIRA for further investigation?
>
> Thanks,
> James

Re: Default use of HColumnDescriptor#setKeepDeletedCells(true)

Posted by James Taylor <ja...@apache.org>.
Good question, Gabriel. I believe that the deleted cells are cleaned
up after a second major compaction with the KEEP_DELETED_CELLS option
enabled. Lars H. implemented this option, so he can comment more, but
AFAIK he couldn't figure out how to get them to be collected on the
first major compaction. IMHO, this seems like a bug (but what do I
know, I'm not an HBase committer :-) ).

KEEP_DELETED_CELLS is required for flashback or
point-in-time queries. IMHO, without this option, HBase doesn't really
work correctly. Though you might argue "we never do that" and turn it
off, under the covers Phoenix is doing point-in-time queries. If you
have a query that starts at t1 and runs until t5, it won't see data
inserted after t1. Say a delete was done on a row at t2. Without
KEEP_DELETED_CELLS set to true, your query could potentially see the
effect of this delete.
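
The t1/t2 scenario above can be sketched with a small toy model. This is
not HBase or Phoenix code: PointInTimeSketch, ToyCell and readAsOf are
hypothetical names, and the masking rule is an assumption (with the option
on, delete markers newer than the read time are ignored; with it off, any
delete marker hides older puts).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one cell's history; NOT HBase code. All names are hypothetical.
class PointInTimeSketch {

    // A put (isDelete == false) or a delete marker (isDelete == true) at a timestamp.
    static final class Entry {
        final long ts;
        final String value;
        final boolean isDelete;

        Entry(long ts, String value, boolean isDelete) {
            this.ts = ts;
            this.value = value;
            this.isDelete = isDelete;
        }
    }

    static final class ToyCell {
        private final boolean keepDeletedCells;
        private final List<Entry> entries = new ArrayList<>();

        ToyCell(boolean keepDeletedCells) {
            this.keepDeletedCells = keepDeletedCells;
        }

        void put(long ts, String value) {
            entries.add(new Entry(ts, value, false));
        }

        void delete(long ts) {
            entries.add(new Entry(ts, null, true));
        }

        // Returns the newest value written at or before maxTs, or null if none is visible.
        String readAsOf(long maxTs) {
            Entry newestPut = null;
            long newestDelete = Long.MIN_VALUE;
            for (Entry e : entries) {
                if (e.isDelete) {
                    // Assumption: with the option on, markers newer than the read time
                    // are ignored; with it off, every marker masks older puts.
                    if (!keepDeletedCells || e.ts <= maxTs) {
                        newestDelete = Math.max(newestDelete, e.ts);
                    }
                } else if (e.ts <= maxTs && (newestPut == null || e.ts > newestPut.ts)) {
                    newestPut = e;
                }
            }
            if (newestPut == null || newestPut.ts <= newestDelete) {
                return null;
            }
            return newestPut.value;
        }
    }

    public static void main(String[] args) {
        for (boolean keep : new boolean[] {true, false}) {
            ToyCell cell = new ToyCell(keep);
            cell.put(0, "v0"); // written before the query starts
            cell.delete(2);    // delete at t2, while a query pinned to t1 is running
            System.out.println("keepDeletedCells=" + keep
                    + " readAsOf(t1)=" + cell.readAsOf(1));
        }
    }
}
```

With the option on, the read pinned to t1 still returns the value written
before t1; with it off, the later delete hides that value from the read.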

Perhaps the MVCC used by HBase should (does?) take care of this
automatically without us setting a max on the scan time range, but I'm
not sure. If it does, then we could likely not have this be the
default. We'd need to test this with the new ChunkedResultIterator as
well.

Maybe file a JIRA for further investigation?

Thanks,
James
