Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/05/20 10:27:30 UTC

[GitHub] [hive] harishjp commented on a change in pull request #2120: HIVE-24936 - Fix file name parsing and copy file move.

harishjp commented on a change in pull request #2120:
URL: https://github.com/apache/hive/pull/2120#discussion_r635974959



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/AbstractFileMergeOperator.java
##########
@@ -275,39 +275,33 @@ public void closeOp(boolean abort) throws HiveException {
           throw new HiveException("Incompatible files should not happen in MM tables.");
         }
         Path destDir = finalPath.getParent();
-        Path destPath = destDir;
         // move any incompatible files to final path
         if (incompatFileSet != null && !incompatFileSet.isEmpty()) {
           for (Path incompatFile : incompatFileSet) {
-            // check if path conforms to Hive's file name convention. Hive expects filenames to be in specific format
+            // Hive expects filenames to be in specific format
             // like 000000_0, but "LOAD DATA" commands can let you add any files to any partitions/tables without
-            // renaming. This can cause MoveTask to remove files in some cases where MoveTask assumes the files are
-            // are generated by speculatively executed tasks.
+            // renaming.
+            // This can cause a few issues:
+            // MoveTask will remove files in some cases where MoveTask assumes the files are are generated by
+            // speculatively executed tasks.
             // Example: MoveTask thinks the following files are same
             // part-m-00000_1417075294718
             // part-m-00001_1417075294718
             // Assumes 1417075294718 as taskId and retains only large file supposedly generated by speculative execution.
-            // This can result in data loss in case of CONCATENATE/merging. Filter out files that does not match Hive's
-            // filename convention.
-            if (!Utilities.isHiveManagedFile(incompatFile)) {
-              // rename un-managed files to conform to Hive's naming standard
-              // Example:
-              // /warehouse/table/part-m-00000_1417075294718 will get renamed to /warehouse/table/.hive-staging/000000_0
-              // If staging directory already contains the file, taskId_copy_N naming will be used.
-              final String taskId = Utilities.getTaskId(jc);
-              Path destFilePath = new Path(destDir, new Path(taskId));
-              for (int counter = 1; fs.exists(destFilePath); counter++) {
-                destFilePath = new Path(destDir, taskId + (Utilities.COPY_KEYWORD + counter));
-              }
-              LOG.warn("Path doesn't conform to Hive's expectation. Renaming {} to {}", incompatFile, destFilePath);
-              destPath = destFilePath;
-            }
+            // This can result in data loss in case of CONCATENATE/merging.
 
+            // If filename is consistent with XXXXXX_N and another task with same task-id runs after this move, then
+            // the same file name is used in the other task which will result in task failure and retry of task and
+            // subsequent removal of this file as duplicate.
+            // Example: if the file name is 000001_0 and another task runs with taskid 000001_0, it will fail to create
+            // the file and next attempt will create 000001_1, both the files will be considered as output of same task
+            // and only 000001_1 will be picked resulting it loss of existing file 000001_0.
+            final String destFileName = Utilities.getTaskId(jc) + Utilities.COPY_KEYWORD + 1;

Review comment:
       moveFile does not just move the file: if the destination already exists, it keeps incrementing the copy index until it finds a name that does not exist. So there is no data loss.
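
       To make the behaviour described above concrete, here is a minimal sketch of the "increment the copy index until the name is free" idea. It is not Hive's actual moveFile: it uses java.nio.file on a local filesystem instead of the Hadoop FileSystem API, and the class name, method name, and COPY_KEYWORD value are assumptions for illustration (the suffix follows the taskId_copy_N naming mentioned in the diff comments).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; Hive's real logic lives in its own utility code.
final class CopyIndexMoveSketch {
    // Assumed value, based on the taskId_copy_N naming described in the diff comments.
    static final String COPY_KEYWORD = "_copy_";

    private CopyIndexMoveSketch() {
    }

    /**
     * Moves src into destDir as taskId_copy_N, starting at N = 1 and
     * incrementing N until a name is found that does not exist yet.
     * Returns the path the file was moved to.
     */
    static Path moveWithCopyIndex(Path src, Path destDir, String taskId) throws IOException {
        int counter = 1;
        Path candidate = destDir.resolve(taskId + COPY_KEYWORD + counter);
        // The check-then-move below is not atomic; a real implementation
        // would have to handle another writer creating the same name concurrently.
        while (Files.exists(candidate)) {
            counter++;
            candidate = destDir.resolve(taskId + COPY_KEYWORD + counter);
        }
        return Files.move(src, candidate);
    }
}
```

       Under these assumptions, if 000001_0_copy_1 already exists in the destination directory, the incoming file lands at 000001_0_copy_2 and the existing file is left untouched.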



