Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/05/20 08:47:51 UTC
[GitHub] [hive] kishendas commented on a change in pull request #2120: HIVE-24936 - Fix file name parsing and copy file move.

kishendas commented on a change in pull request #2120:
URL: https://github.com/apache/hive/pull/2120#discussion_r635874792



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/AbstractFileMergeOperator.java
##########
@@ -275,39 +275,33 @@ public void closeOp(boolean abort) throws HiveException {
           throw new HiveException("Incompatible files should not happen in MM tables.");
         }
         Path destDir = finalPath.getParent();
-        Path destPath = destDir;
         // move any incompatible files to final path
         if (incompatFileSet != null && !incompatFileSet.isEmpty()) {
           for (Path incompatFile : incompatFileSet) {
-            // check if path conforms to Hive's file name convention. Hive expects filenames to be in specific format
+            // Hive expects filenames to be in specific format
             // like 000000_0, but "LOAD DATA" commands can let you add any files to any partitions/tables without
-            // renaming. This can cause MoveTask to remove files in some cases where MoveTask assumes the files are
-            // are generated by speculatively executed tasks.
+            // renaming.
+            // This can cause a few issues:
+            // MoveTask will remove files in some cases where MoveTask assumes the files are are generated by

Review comment:
       Typo: "are are" should be "are", i.e. "the files are generated by".

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/AbstractFileMergeOperator.java
##########
@@ -275,39 +275,33 @@ public void closeOp(boolean abort) throws HiveException {
           throw new HiveException("Incompatible files should not happen in MM tables.");
         }
         Path destDir = finalPath.getParent();
-        Path destPath = destDir;
         // move any incompatible files to final path
         if (incompatFileSet != null && !incompatFileSet.isEmpty()) {
           for (Path incompatFile : incompatFileSet) {
-            // check if path conforms to Hive's file name convention. Hive expects filenames to be in specific format
+            // Hive expects filenames to be in specific format
             // like 000000_0, but "LOAD DATA" commands can let you add any files to any partitions/tables without
-            // renaming. This can cause MoveTask to remove files in some cases where MoveTask assumes the files are
-            // are generated by speculatively executed tasks.
+            // renaming.
+            // This can cause a few issues:
+            // MoveTask will remove files in some cases where MoveTask assumes the files are are generated by
+            // speculatively executed tasks.
             // Example: MoveTask thinks the following files are same
             // part-m-00000_1417075294718
             // part-m-00001_1417075294718
             // Assumes 1417075294718 as taskId and retains only large file supposedly generated by speculative execution.
-            // This can result in data loss in case of CONCATENATE/merging. Filter out files that does not match Hive's
-            // filename convention.
-            if (!Utilities.isHiveManagedFile(incompatFile)) {
-              // rename un-managed files to conform to Hive's naming standard
-              // Example:
-              // /warehouse/table/part-m-00000_1417075294718 will get renamed to /warehouse/table/.hive-staging/000000_0
-              // If staging directory already contains the file, taskId_copy_N naming will be used.
-              final String taskId = Utilities.getTaskId(jc);
-              Path destFilePath = new Path(destDir, new Path(taskId));
-              for (int counter = 1; fs.exists(destFilePath); counter++) {
-                destFilePath = new Path(destDir, taskId + (Utilities.COPY_KEYWORD + counter));
-              }
-              LOG.warn("Path doesn't conform to Hive's expectation. Renaming {} to {}", incompatFile, destFilePath);
-              destPath = destFilePath;
-            }
+            // This can result in data loss in case of CONCATENATE/merging.
 
+            // If filename is consistent with XXXXXX_N and another task with same task-id runs after this move, then
+            // the same file name is used in the other task which will result in task failure and retry of task and
+            // subsequent removal of this file as duplicate.
+            // Example: if the file name is 000001_0 and another task runs with taskid 000001_0, it will fail to create
+            // the file and next attempt will create 000001_1, both the files will be considered as output of same task
+            // and only 000001_1 will be picked resulting it loss of existing file 000001_0.

Review comment:
       Typo: "resulting it loss of" should be "resulting in loss of".

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/AbstractFileMergeOperator.java
##########
@@ -275,39 +275,33 @@ public void closeOp(boolean abort) throws HiveException {
           throw new HiveException("Incompatible files should not happen in MM tables.");
         }
         Path destDir = finalPath.getParent();
-        Path destPath = destDir;
         // move any incompatible files to final path
         if (incompatFileSet != null && !incompatFileSet.isEmpty()) {
           for (Path incompatFile : incompatFileSet) {
-            // check if path conforms to Hive's file name convention. Hive expects filenames to be in specific format
+            // Hive expects filenames to be in specific format
             // like 000000_0, but "LOAD DATA" commands can let you add any files to any partitions/tables without
-            // renaming. This can cause MoveTask to remove files in some cases where MoveTask assumes the files are
-            // are generated by speculatively executed tasks.
+            // renaming.
+            // This can cause a few issues:
+            // MoveTask will remove files in some cases where MoveTask assumes the files are are generated by
+            // speculatively executed tasks.
             // Example: MoveTask thinks the following files are same
             // part-m-00000_1417075294718
             // part-m-00001_1417075294718
             // Assumes 1417075294718 as taskId and retains only large file supposedly generated by speculative execution.
-            // This can result in data loss in case of CONCATENATE/merging. Filter out files that does not match Hive's
-            // filename convention.
-            if (!Utilities.isHiveManagedFile(incompatFile)) {
-              // rename un-managed files to conform to Hive's naming standard
-              // Example:
-              // /warehouse/table/part-m-00000_1417075294718 will get renamed to /warehouse/table/.hive-staging/000000_0
-              // If staging directory already contains the file, taskId_copy_N naming will be used.
-              final String taskId = Utilities.getTaskId(jc);
-              Path destFilePath = new Path(destDir, new Path(taskId));
-              for (int counter = 1; fs.exists(destFilePath); counter++) {
-                destFilePath = new Path(destDir, taskId + (Utilities.COPY_KEYWORD + counter));
-              }
-              LOG.warn("Path doesn't conform to Hive's expectation. Renaming {} to {}", incompatFile, destFilePath);
-              destPath = destFilePath;
-            }
+            // This can result in data loss in case of CONCATENATE/merging.
 
+            // If filename is consistent with XXXXXX_N and another task with same task-id runs after this move, then
+            // the same file name is used in the other task which will result in task failure and retry of task and
+            // subsequent removal of this file as duplicate.
+            // Example: if the file name is 000001_0 and another task runs with taskid 000001_0, it will fail to create
+            // the file and next attempt will create 000001_1, both the files will be considered as output of same task
+            // and only 000001_1 will be picked resulting it loss of existing file 000001_0.
+            final String destFileName = Utilities.getTaskId(jc) + Utilities.COPY_KEYWORD + 1;

Review comment:
       Will appending "+ 1" work in all cases?
   What happens if concat partially succeeds and, after the previous file has already been moved, one more file with the same name shows up?
   If we run concat again, will that not result in data loss?
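
   To make the concern concrete, here is a minimal sketch of probing for the first unused copy index instead of always using 1, which is roughly what the removed loop over fs.exists() did. It is not the PR's code: CopyNameSketch and firstFreeCopyName are hypothetical names, the Predicate stands in for FileSystem.exists(), and the "_copy_" value for COPY_KEYWORD is an assumption based on the 00001_02_copy_1 example in this PR.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

class CopyNameSketch {
  // Assumed value; the diff only references Utilities.COPY_KEYWORD.
  static final String COPY_KEYWORD = "_copy_";

  /**
   * Returns taskId + COPY_KEYWORD + N for the smallest N >= 1 whose name is
   * not reported as existing. The predicate is a stand-in for a file system
   * existence check (hypothetical, for illustration only).
   */
  static String firstFreeCopyName(String taskId, Predicate<String> exists) {
    int counter = 1;
    String candidate = taskId + COPY_KEYWORD + counter;
    while (exists.test(candidate)) {
      counter++;
      candidate = taskId + COPY_KEYWORD + counter;
    }
    return candidate;
  }

  public static void main(String[] args) {
    // Simulates a destination directory left behind by a partially successful concat.
    Set<String> destDir = new HashSet<>(Arrays.asList("000001_0_copy_1", "000001_0_copy_2"));
    System.out.println(firstFreeCopyName("000001_0", destDir::contains));
  }
}
```

   Note that a check followed by a rename is not atomic, so two tasks probing at the same time could still pick the same name; the sketch only addresses the rerun-after-partial-success case.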

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
##########
@@ -1093,40 +1093,69 @@ public static void rename(FileSystem fs, Path src, Path dst) throws IOException,
     }
   }
 
-  private static void moveFile(FileSystem fs, FileStatus file, Path dst) throws IOException,
+  private static void moveFileOrDir(FileSystem fs, FileStatus file, Path dst) throws IOException,

Review comment:
       Can we add tests to cover this method as well?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ParsedOutputFileName.java
##########
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec;
+
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+
+/**
+ * Helper class to match hive filenames and extract taskId, taskAttemptId, copyIndex.
+ *
+ * Matches following:
+ * 00001_02
+ * 00001_02.gz
+ * 00001_02.zlib.gz
+ * 00001_02_copy_1
+ * 00001_02_copy_1.gz
+ * <p>
+ * All the components are here:
+ * tmp_(taskPrefix)00001_02_copy_1.zlib.gz
+ */
+public class ParsedOutputFileName {
+  private static final Pattern COPY_FILE_NAME_TO_TASK_ID_REGEX = Pattern.compile(
+      "^(.*?)?" + // any prefix
+      "(\\(.*\\))?" + // taskId prefix
+      "([0-9]+)" + // taskId
+      "(?:_([0-9]{1,6}))?" + // _<attemptId> (limited to 6 digits)

Review comment:
       Can attemptId be more than 6 digits in the future? Is it possible to refer to the regex for attemptId from the place where it's generated?
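
   One possible shape for this, as a sketch only: keep the digit limit in a single shared constant and build the pattern from it, so the code that generates attempt ids and the parser cannot drift apart. AttemptIdRegexSketch, MAX_ATTEMPT_ID_DIGITS, ATTEMPT_ID_REGEX and parse are hypothetical names, and the pattern below is a simplified stand-in (no tmp_ or task prefix handling), not the regex from this PR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttemptIdRegexSketch {
  // Hypothetical shared constant: the only place the width limit is written down.
  static final int MAX_ATTEMPT_ID_DIGITS = 6;
  static final String ATTEMPT_ID_REGEX = "[0-9]{1," + MAX_ATTEMPT_ID_DIGITS + "}";

  // Simplified pattern for illustration: taskId, optional _attemptId,
  // optional _copy_N, optional extension such as .gz or .zlib.gz.
  static final Pattern FILE_NAME = Pattern.compile(
      "^([0-9]+)"
      + "(?:_(" + ATTEMPT_ID_REGEX + "))?"
      + "(?:_copy_([0-9]+))?"
      + "(\\..*)?$");

  /**
   * Returns {taskId, attemptId, copyIndex, extension}; absent parts are null.
   * Returns null when the name does not match at all.
   */
  static String[] parse(String name) {
    Matcher m = FILE_NAME.matcher(name);
    if (!m.matches()) {
      return null;
    }
    return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
  }
}
```

   With this simplified pattern a 7-digit attempt id such as 00001_1234567 does not match at all, which is the failure mode the question is about.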



