Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2023/01/13 04:15:07 UTC

[GitHub] [hudi] nsivabalan commented on a diff in pull request #7612: [HUDI-5336] Fixing log file pattern match to ignore extraneous files

nsivabalan commented on code in PR #7612:
URL: https://github.com/apache/hudi/pull/7612#discussion_r1068908725


##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -358,23 +364,27 @@ public static String createNewFileId(String idPfx, int id) {
    * Get the file extension from the log file.
    */
   public static String getFileExtensionFromLog(Path logPath) {
-    Matcher matcher = LOG_FILE_PATTERN.matcher(logPath.getName());
+    boolean isArchivedLog = logPath.getName().contains(ARCHIVED_LOG_PREFIX);
+    Matcher matcher =  isArchivedLog ? ARCHIVED_LOG_FILE_PATTERN.matcher(logPath.getName()) :
+        LOG_FILE_PATTERN.matcher(logPath.getName());
     if (!matcher.find()) {
       throw new InvalidHoodiePathException(logPath, "LogFile");
     }
-    return matcher.group(3);
+    return isArchivedLog ? ARCHIVE_STR : matcher.group(3);
   }
 
   /**
    * Get the first part of the file name in the log file. That will be the fileId. Log file do not have instantTime in
    * the file name.
    */
   public static String getFileIdFromLogPath(Path path) {
-    Matcher matcher = LOG_FILE_PATTERN.matcher(path.getName());
+    boolean isArchivedLog = path.getName().contains(ARCHIVED_LOG_PREFIX);
+    Matcher matcher =  isArchivedLog ? ARCHIVED_LOG_FILE_PATTERN.matcher(path.getName())
+        : LOG_FILE_PATTERN.matcher(path.getName());
     if (!matcher.find()) {
       throw new InvalidHoodiePathException(path, "LogFile");
     }
-    return matcher.group(1);
+    return isArchivedLog ? COMMITS_STR : matcher.group(1);

Review Comment:
   Archived commits have a static prefix (".commits.archived"), so I felt it was better to use a separate regex than to make a single regex work for both cases, which could get complex. I prefer this approach because it makes the expected format explicit in each case.
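
   To illustrate the trade-off, here is a minimal sketch of the two-pattern approach. It assumes simplified, hypothetical patterns and file names (`ARCHIVED_PREFIX`, `DATA_LOG`, `ARCHIVED_LOG`, and the class name are made up for this example and are not the constants defined in FSUtils). Unlike the diff above, it uses anchored patterns with `matches()` and `startsWith()`, which is a choice of this sketch and not what the PR does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and patterns are illustrative, not Hudi's actual ones.
class LogNameSketch {
  // Assumed static prefix for archived files in this example.
  static final String ARCHIVED_PREFIX = ".commits_.archive";

  // Simplified data log name: .<fileId>_<baseInstant>.log.<version>
  static final Pattern DATA_LOG =
      Pattern.compile("^\\.(.+)_(.*)\\.(log)\\.(\\d+)$");

  // Simplified archived log name: .commits_.archive.<version>
  static final Pattern ARCHIVED_LOG =
      Pattern.compile("^\\.commits_\\.archive\\.(\\d+)$");

  // Picks the pattern from the static prefix, then validates the whole name.
  static String fileId(String name) {
    boolean archived = name.startsWith(ARCHIVED_PREFIX);
    Matcher m = (archived ? ARCHIVED_LOG : DATA_LOG).matcher(name);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a log file: " + name);
    }
    // Archived names carry no file id, so a fixed value is returned instead.
    return archived ? "commits" : m.group(1);
  }
}
```

   Each pattern documents exactly one expected format, so an extraneous name such as `.fileid1_100.log.1.crc` fails to match either one and is rejected instead of being parsed by accident.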
   


