Posted to notifications@accumulo.apache.org by "Adam Fuchs (JIRA)" <ji...@apache.org> on 2014/01/23 21:15:39 UTC

[jira] [Comment Edited] (ACCUMULO-2232) Combiners can cause deleted data to come back

    [ https://issues.apache.org/jira/browse/ACCUMULO-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879367#comment-13879367 ] 

Adam Fuchs edited comment on ACCUMULO-2232 at 1/23/14 8:13 PM:
---------------------------------------------------------------

I agree with Josh (and everyone else) on both counts: the performance
implications will be huge and this is enough rope for people to hang
themselves with.  However, I think a lot of people use combiners on tables
that are append-only and never delete (at least not a record at a time).
The *warning unsafe doom will ensue* bypass is pretty important to support
those uses, but I also think it is best to default to accuracy while we
implement a better fix.

It seems like the right way to fix this in the long run is to keep track of
timestamp ranges of files and calculate two properties on the set of files
being compacted:
1. Is the time range contiguous, or are there other files not being
compacted that overlap the range?
2. Are there any files with an older timestamp?

This way we can run combiners on any compactions that satisfy property #1,
and preserve the most recent deletes in any compaction that satisfies
property #2. This generally makes minor compactions safe for running
combiners (assuming Accumulo sets the timestamps and there is no bulk
loading), although the most recent delete needs to be preserved. If I were
to speculate about general major compactions, I would say that when splits
are rare most other compactions also have property #1.
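
A rough sketch of how those two checks could be computed from per-file
timestamp ranges follows. FileTimeRange and the method names here are made
up for illustration, not existing Accumulo APIs; it assumes each file's
minimum and maximum timestamp are tracked somewhere:

import java.util.List;

// Hypothetical holder of a file's min/max timestamp.
class FileTimeRange {
  final long minTs, maxTs;
  FileTimeRange(long minTs, long maxTs) { this.minTs = minTs; this.maxTs = maxTs; }
}

class CompactionTimeChecks {

  // Property #1: no file excluded from the compaction overlaps the time
  // range covered by the files being compacted.
  static boolean timeRangeContiguous(List<FileTimeRange> compacting,
                                     List<FileTimeRange> notCompacting) {
    long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
    for (FileTimeRange f : compacting) {
      min = Math.min(min, f.minTs);
      max = Math.max(max, f.maxTs);
    }
    for (FileTimeRange f : notCompacting) {
      if (f.maxTs >= min && f.minTs <= max)
        return false; // an excluded file overlaps the compacted range
    }
    return true;
  }

  // Property #2 (one reading of it): some file outside the compaction holds
  // older data, so the most recent deletes must be preserved in the output.
  static boolean olderDataElsewhere(List<FileTimeRange> compacting,
                                    List<FileTimeRange> notCompacting) {
    long min = Long.MAX_VALUE;
    for (FileTimeRange f : compacting)
      min = Math.min(min, f.minTs);
    for (FileTimeRange f : notCompacting)
      if (f.minTs < min)
        return true;
    return false;
  }
}

A combiner would then run only when timeRangeContiguous(...) is true, and
the compaction would keep the newest delete entries whenever
olderDataElsewhere(...) is true.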

I think we could expose these properties in the iterator environment. We
could even come up with a compaction strategy that biases compactions
towards contiguous time ranges if we were ambitious.






was (Author: afuchs):
Ugh, jira is down, but let me get my thoughts out while I'm having them:

I agree with Josh (and everyone else) on both counts: the performance
implications will be huge and this is enough rope for people to hang
themselves with.  However, I think a lot of people use combiners on tables
that are append-only and never delete (at least not a record at a time).
The *warning unsafe doom will ensue* bypass is pretty important to support
those uses, but I also think it is best to default to accuracy while we
implement a better fix.

It seems like the right way to fix this in the long run is to keep track of
timestamp ranges of files and calculate two properties on the set of files
being compacted:
1. Is the time range contiguous, or are there other files not being
compacted that overlap the range?
2. Are there any files with an older timestamp?

This way we can run combiners on any compactions that satisfy property #1,
and preserve the most recent deletes in any compaction that satisfies
property #2. This generally makes minor compactions safe for running
combiners (assuming Accumulo sets the timestamps and there is no bulk
loading), although the most recent delete needs to be preserved. If I were
to speculate about general major compactions, I would say that when splits
are rare most other compactions also have property #1.

I think we could expose these properties in the iterator environment. We
could even come up with a compaction strategy that biases compactions
towards contiguous time ranges if we were ambitious.

Adam






> Combiners can cause deleted data to come back
> ---------------------------------------------
>
>                 Key: ACCUMULO-2232
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2232
>             Project: Accumulo
>          Issue Type: Bug
>          Components: client, tserver
>            Reporter: John Vines
>
> The case-
> 3 files with-
> * 1 with a key, k, with timestamp 0, value 3
> * 1 with a delete of k with timestamp 1
> * 1 with k with timestamp 2, value 2
> The column of k has a summing combiner set on it. The issue here is that depending on how the major compactions play out, differing values will result. If all 3 files compact, the correct value of 2 will result. However, if 1 & 3 compact first, they will aggregate to 5. The delete will then fall after the combined value, causing the value 5 to persist.
> First and foremost, this should be documented. I think to remedy this, combiners should only be used on full MajC, not non-full ones. This may necessitate a special flag or a new combiner that implements the proper semantics.
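
A minimal, self-contained sketch (plain Java, not Accumulo code) of the two
orderings described above; it only mimics how a summing combiner keeps the
newest timestamp of the entries it merges and how a delete suppresses older
versions:

import java.util.ArrayList;
import java.util.List;

public class CombinerDeleteExample {

  static final class Entry {
    final long ts;
    final Integer value; // null stands in for a delete marker
    Entry(long ts, Integer value) { this.ts = ts; this.value = value; }
  }

  // "Summing combiner": merges all non-delete entries into one entry that
  // keeps the newest timestamp seen.
  static Entry sum(List<Entry> entries) {
    long newest = Long.MIN_VALUE;
    int total = 0;
    for (Entry e : entries) {
      if (e.value == null) continue;
      newest = Math.max(newest, e.ts);
      total += e.value;
    }
    return new Entry(newest, total);
  }

  // Apply deletes: drop every value whose timestamp is at or below the
  // newest delete, then drop the delete markers themselves.
  static List<Entry> applyDeletes(List<Entry> entries) {
    long newestDelete = Long.MIN_VALUE;
    for (Entry e : entries)
      if (e.value == null) newestDelete = Math.max(newestDelete, e.ts);
    List<Entry> survivors = new ArrayList<>();
    for (Entry e : entries)
      if (e.value != null && e.ts > newestDelete) survivors.add(e);
    return survivors;
  }

  public static void main(String[] args) {
    Entry v0 = new Entry(0, 3);      // file 1: k @ ts 0, value 3
    Entry del = new Entry(1, null);  // file 2: delete of k @ ts 1
    Entry v2 = new Entry(2, 2);      // file 3: k @ ts 2, value 2

    // Full compaction: the delete suppresses ts 0, the combiner sees only ts 2.
    System.out.println(sum(applyDeletes(List.of(v0, del, v2))).value);   // 2 (correct)

    // Files 1 and 3 compact first; the delete is applied in a later compaction.
    Entry combined = sum(List.of(v0, v2));                               // (ts 2, value 5)
    System.out.println(sum(applyDeletes(List.of(combined, del))).value); // 5 (wrong)
  }
}

Compacting all three files together yields 2; combining files 1 and 3 first
yields 5, because the combined entry carries timestamp 2 and sorts above the
delete at timestamp 1.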



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)