You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/06/17 08:00:00 UTC

[jira] [Work logged] (HIVE-23703) Major QB compaction with multiple FileSinkOperators results in data loss and one original file

     [ https://issues.apache.org/jira/browse/HIVE-23703?focusedWorklogId=447118&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-447118 ]

ASF GitHub Bot logged work on HIVE-23703:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 17/Jun/20 07:59
            Start Date: 17/Jun/20 07:59
    Worklog Time Spent: 10m 
      Work Description: klcopp opened a new pull request #1134:
URL: https://github.com/apache/hive/pull/1134


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 447118)
    Remaining Estimate: 0h
            Time Spent: 10m

> Major QB compaction with multiple FileSinkOperators results in data loss and one original file
> ----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-23703
>                 URL: https://issues.apache.org/jira/browse/HIVE-23703
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Karen Coppage
>            Assignee: Karen Coppage
>            Priority: Critical
>              Labels: compaction
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> h4. Problems
> Example:
> {code:java}
> drop table if exists tbl2;
> create transactional table tbl2 (a int, b int) clustered by (a) into 4 buckets stored as ORC TBLPROPERTIES('transactional'='true','transactional_properties'='default');
> insert into tbl2 values(1,2),(1,3),(1,4),(2,2),(2,3),(2,4);
> insert into tbl2 values(3,2),(3,3),(3,4),(4,2),(4,3),(4,4);
> insert into tbl2 values(5,2),(5,3),(5,4),(6,2),(6,3),(6,4);{code}
> E.g. in the example above, bucketId=0 when a=2 and a=6.
> 1. Data lossĀ 
>  In non-acid tables, an operator's temp files are named with their task id. Because of this snippet, temp files in the FileSinkOperator for compaction tables are identified by their bucket_id.
> {code:java}
> if (conf.isCompactionTable()) {
>  fsp.initializeBucketPaths(filesIdx, AcidUtils.BUCKET_PREFIX + String.format(AcidUtils.BUCKET_DIGITS, bucketId),
>  isNativeTable(), isSkewedStoredAsSubDirectories);
>  } else {
>  fsp.initializeBucketPaths(filesIdx, taskId, isNativeTable(), isSkewedStoredAsSubDirectories);
>  }
> {code}
> So 2 temp files containing data with a=2 and a=6 will be named bucket_0 and not 000000_0 and 000000_1 as they would normally.
>  In FileSinkOperator.commit, when data with a=2, filename: bucket_0 is moved from _task_tmp.-ext-10002 to _tmp.-ext-10002, it overwrites the files already there with a=6 data, because it too is named bucket_0. You can see in the logs:
> {code:java}
>  WARN [LocalJobRunner Map Task Executor #0] exec.FileSinkOperator: Target path file:.../hive/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnNoBuckets-1591107230237/warehouse/testmajorcompaction/base_0000002_v0000013/.hive-staging_hive_2020-06-02_07-15-21_771_8551447285061957908-1/_tmp.-ext-10002/bucket_00000 with a size 610 exists. Trying to delete it.
> {code}
> 2. Results in one original file
>  OrcFileMergeOperator merges the results of the FSOp into 1 file named 000000_0.
> h4. Fix
> 1. FSOp will store data as: taskid/bucketId. e.g. 0_0/bucket_0
> 2. OrcMergeFileOp, instead of merging a bunch of files into 1 file named 000000_0, will merge all files named bucket_0 into one file named bucket_0, and so on.
> 3. MoveTask will get rid of the taskId directories if present and only move the bucket files in them, in case OrcMergeFileOp is not run.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)