Posted to issues@hive.apache.org by "Peter Vary (Jira)" <ji...@apache.org> on 2020/01/08 10:41:00 UTC
[jira] [Updated] (HIVE-20901) running compactor when there is nothing to do produces duplicate data
[ https://issues.apache.org/jira/browse/HIVE-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Vary updated HIVE-20901:
------------------------------
Fix Version/s: 4.0.0
Resolution: Duplicate
Status: Resolved (was: Patch Available)
[~asomani]: If you do not mind, I will close this Jira, as it was fixed by HIVE-9995. Sorry for the confusion, I only found this Jira just now :(
> running compactor when there is nothing to do produces duplicate data
> ---------------------------------------------------------------------
>
> Key: HIVE-20901
> URL: https://issues.apache.org/jira/browse/HIVE-20901
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 4.0.0
> Reporter: Eugene Koifman
> Assignee: Abhishek Somani
> Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20901.1.patch, HIVE-20901.2.patch
>
>
> Suppose we run minor compaction twice via ALTER TABLE.
> The second compaction request should have nothing to do, but I don't think there is a check for that. This is visible in the context of HIVE-20823, where each compactor run produces a delta with a new visibility suffix, so we end up with something like
> {noformat}
> target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/
> ├── delete_delta_0000001_0000002_v0000019
> │ ├── _orc_acid_version
> │ └── bucket_00000
> ├── delete_delta_0000001_0000002_v0000021
> │ ├── _orc_acid_version
> │ └── bucket_00000
> ├── delta_0000001_0000001_0000
> │ ├── _orc_acid_version
> │ └── bucket_00000
> ├── delta_0000001_0000002_v0000019
> │ ├── _orc_acid_version
> │ └── bucket_00000
> ├── delta_0000001_0000002_v0000021
> │ ├── _orc_acid_version
> │ └── bucket_00000
> └── delta_0000002_0000002_0000
>   ├── _orc_acid_version
>   └── bucket_00000
> {noformat}
> i.e. 2 deltas with the same write ID range.
> This is bad. It probably happens today as well, but there the new run produces a delta with the same name and clobbers the previous one, which may interfere with writers.
>
> need to investigate
>
> -The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both deltas as if they were distinct and it effectively duplicates data.- There is no data duplication - {{getAcidState()}} will not use 2 deltas with the same {{writeid}} range
>
>
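The condition described in the quoted issue (two compacted deltas covering the same write ID range, differing only in the visibility suffix) can be illustrated with a small standalone check. This is a minimal sketch and not Hive's `AcidUtils` logic: the class `DeltaRangeCheck`, the method `findDuplicateRanges`, and the regular expression are hypothetical, and the directory-name layout is assumed from the listing above (`delta_<min>_<max>_v<id>` for compactor output, `delta_<min>_<max>_<stmt>` for ordinary writes).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Hive.
class DeltaRangeCheck {
    // Assumed naming, taken from the listing in the issue:
    //   delta_0000001_0000002_v0000019 or delete_delta_0000001_0000002_v0000021
    // Group 1: directory type, group 2: min write ID, group 3: max write ID.
    private static final Pattern COMPACTED_DELTA =
        Pattern.compile("(delta|delete_delta)_(\\d+)_(\\d+)_v\\d+");

    // Returns every (type, write ID range) that appears in two or more
    // compacted delta directories, mapped to the directory names involved.
    static Map<String, List<String>> findDuplicateRanges(List<String> dirNames) {
        Map<String, List<String>> byRange = new LinkedHashMap<>();
        for (String name : dirNames) {
            Matcher m = COMPACTED_DELTA.matcher(name);
            if (!m.matches()) {
                continue; // e.g. delta_0000002_0000002_0000 has no visibility suffix
            }
            String key = m.group(1) + "_" + m.group(2) + "_" + m.group(3);
            byRange.computeIfAbsent(key, k -> new ArrayList<>()).add(name);
        }
        // Keep only the ranges that occur more than once.
        byRange.values().removeIf(dirs -> dirs.size() < 2);
        return byRange;
    }

    public static void main(String[] args) {
        // Directory names copied from the listing in the issue description.
        List<String> listing = List.of(
            "delete_delta_0000001_0000002_v0000019",
            "delete_delta_0000001_0000002_v0000021",
            "delta_0000001_0000001_0000",
            "delta_0000001_0000002_v0000019",
            "delta_0000001_0000002_v0000021",
            "delta_0000002_0000002_0000");
        findDuplicateRanges(listing)
            .forEach((range, dirs) -> System.out.println(range + " -> " + dirs));
    }
}
```

For the listing in the issue, the check flags the `delta` and the `delete_delta` directories for write IDs 1 to 2, each of which was produced twice (visibility suffixes `v0000019` and `v0000021`).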