Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/05/19 03:09:41 UTC

[GitHub] [hive] harishjp commented on a change in pull request #2285: HIVE-25130: handle spark inserted files during alter table concat

harishjp commented on a change in pull request #2285:
URL: https://github.com/apache/hive/pull/2285#discussion_r634882155



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
##########
@@ -1252,32 +1253,45 @@ public static String getTaskIdFromFilename(String filename) {
    * @param filename
    *          filename to extract taskid from
    */
-  private static String getPrefixedTaskIdFromFilename(String filename) {
+  static String getPrefixedTaskIdFromFilename(String filename) {
     return getTaskIdFromFilename(filename, FILE_NAME_PREFIXED_TASK_ID_REGEX);
   }
 
   private static String getTaskIdFromFilename(String filename, Pattern pattern) {
-    return getIdFromFilename(filename, pattern, 1);
+    return getIdFromFilename(filename, pattern, 1, false);
   }
 
-  private static int getAttemptIdFromFilename(String filename) {
-    String attemptStr = getIdFromFilename(filename, FILE_NAME_PREFIXED_TASK_ID_REGEX, 3);
+  static int getAttemptIdFromFilename(String filename) {
+    String attemptStr = getIdFromFilename(filename, FILE_NAME_PREFIXED_TASK_ID_REGEX, 3, true);
     return Integer.parseInt(attemptStr.substring(1));
   }
 
-  private static String getIdFromFilename(String filename, Pattern pattern, int group) {
+  private static String getIdFromFilename(String filename, Pattern pattern, int group, boolean extractAttemptId) {
     String taskId = filename;
     int dirEnd = filename.lastIndexOf(Path.SEPARATOR);
-    if (dirEnd != -1) {
+    if (dirEnd!=-1) {
       taskId = filename.substring(dirEnd + 1);
     }
 
-    Matcher m = pattern.matcher(taskId);
-    if (!m.matches()) {
-      LOG.warn("Unable to get task id from file name: {}. Using last component {}"
-          + " as task id.", filename, taskId);
+    // Spark emitted files have the format part-[number-string]-uuid.<suffix>.<optional extension>
+    // Examples: part-00026-23003837-facb-49ec-b1c4-eeda902cacf3.c000.zlib.orc, 00026-23003837 is the taskId
+    // and part-00004-c6acfdee-0c32-492e-b209-c2f1cf477770.c000, 00004-c6acfdee is the taskId
+    String strings[] = taskId.split("-");

Review comment:
       This looks a bit fragile and can match a lot of files. We should use slightly stricter pattern matching here to ensure we do not accidentally match other files. Something along the lines of "part-(\d+-[a-f0-9]+)- "...
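
       A minimal sketch of what stricter matching could look like, under stated assumptions. The class name, method name, and the full pattern below are hypothetical and not part of the PR. Only the leading "part-(\d+-[a-f0-9]+)-" fragment comes from the suggestion above. The rest assumes the remainder of the name is the tail of a lowercase 8-4-4-4-12 UUID followed by a ".c<digits>" suffix and an optional extension, which is what the two examples in the code comment show.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SparkFileNameSketch {

  // Hypothetical pattern, inferred only from the two example names in the diff:
  //   part-<digits>-<uuid>.c<digits>[.<extension>]
  // Group 1 captures "<digits>-<first 8 hex chars of the uuid>" as the task id.
  private static final Pattern SPARK_FILE_PATTERN = Pattern.compile(
      "^part-(\\d+-[a-f0-9]{8})-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}"
          + "\\.c\\d+(\\..+)?$");

  // Returns the task id if the last path component looks like a Spark-emitted
  // file, or an empty Optional so the caller can fall back to existing handling.
  static Optional<String> sparkTaskId(String filename) {
    int dirEnd = filename.lastIndexOf('/');
    String name = dirEnd == -1 ? filename : filename.substring(dirEnd + 1);
    Matcher m = SPARK_FILE_PATTERN.matcher(name);
    return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
  }

  public static void main(String[] args) {
    // The two examples from the code comment in the diff.
    System.out.println(
        sparkTaskId("part-00026-23003837-facb-49ec-b1c4-eeda902cacf3.c000.zlib.orc"));
    System.out.println(
        sparkTaskId("part-00004-c6acfdee-0c32-492e-b209-c2f1cf477770.c000"));
    // A hyphenated name that split("-") would accept but the pattern rejects.
    System.out.println(sparkTaskId("some-other-file.orc"));
  }
}
```

       Unlike splitting on "-", this rejects any hyphenated name that does not follow the whole layout, so unrelated files are not mistaken for Spark output. Whether Spark always emits names in exactly this shape would need to be confirmed before relying on it.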



