Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/24 03:46:11 UTC

[GitHub] [hudi] YannByron opened a new pull request, #7042: [MINOR] optimize the cdc log file name

YannByron opened a new pull request, #7042:
URL: https://github.com/apache/hudi/pull/7042

   ### Change Logs
   
   Optimize the CDC log file name.
   
   ### Impact
   
   none
   
   ### Risk level (write none, low, medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
     ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make
     changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] YannByron commented on a diff in pull request #7042: [HUDI-5082] Improve the cdc log file name format

Posted by GitBox <gi...@apache.org>.
YannByron commented on code in PR #7042:
URL: https://github.com/apache/hudi/pull/7042#discussion_r1007877496


##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -79,7 +80,7 @@ public class FSUtils {
   private static final Logger LOG = LogManager.getLogger(FSUtils.class);
   // Log files are of this pattern - .b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.1
   private static final Pattern LOG_FILE_PATTERN =
-      Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)(-cdc)?))?");
+      Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");

Review Comment:
   Only replacing `(-cdc)` with `(.cdc)` doesn't work; `LOG_FILE_PATTERN` would need further changes.
   This approach is easier to understand and makes the effects clear.
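
The approach described in this comment can be sketched roughly as follows: strip the CDC suffix first, then match the remainder against the simplified pattern from the diff above. This is a minimal sketch, not the actual `FSUtils` code. The class name `CdcSuffixSketch`, the helper `logVersion`, the literal value `".cdc"` for `CDC_LOGFILE_SUFFIX` (inferred from the file names discussed in this thread), and the use of `Matcher.find()` are all assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CdcSuffixSketch {
  // Assumed value; the real constant lives in HoodieCDCUtils.
  static final String CDC_LOGFILE_SUFFIX = ".cdc";

  // Simplified pattern from the diff: no CDC-specific group.
  static final Pattern LOG_FILE_PATTERN = Pattern.compile(
      "\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");

  // Drops a trailing CDC suffix so CDC and regular log files share one pattern.
  static String extractCommonLogFileName(String fileName) {
    if (fileName.endsWith(CDC_LOGFILE_SUFFIX)) {
      return fileName.substring(0, fileName.length() - CDC_LOGFILE_SUFFIX.length());
    }
    return fileName;
  }

  // Hypothetical helper: returns the log version (group 4),
  // or -1 if the name does not match or the version is empty.
  static int logVersion(String fileName) {
    Matcher m = LOG_FILE_PATTERN.matcher(extractCommonLogFileName(fileName));
    if (!m.find() || m.group(4).isEmpty()) {
      return -1;
    }
    return Integer.parseInt(m.group(4));
  }

  public static void main(String[] args) {
    String cdcName =
        ".7a7aa1f6-370b-4871-a077-0848f1472e87-0_20221024114225320.log.1_0-125-118.cdc";
    System.out.println(logVersion(cdcName));
  }
}
```

With the suffix removed up front, the pattern never sees the extra dot, so the existing group numbering stays unchanged for both kinds of file.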





[GitHub] [hudi] YannByron commented on a diff in pull request #7042: [HUDI-5082] Improve the cdc log file name format

Posted by GitBox <gi...@apache.org>.
YannByron commented on code in PR #7042:
URL: https://github.com/apache/hudi/pull/7042#discussion_r1014582854


##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -355,11 +356,19 @@ public static String createNewFileId(String idPfx, int id) {
     return String.format("%s-%d", idPfx, id);
   }
 
+  private static String extractCommonLogFileName(String fileName) {
+    if (fileName.endsWith(HoodieCDCUtils.CDC_LOGFILE_SUFFIX)) {
+      return fileName.substring(0, fileName.length() - HoodieCDCUtils.CDC_LOGFILE_SUFFIX.length());
+    }
+    return fileName;

Review Comment:
   Then we would have to modify the original pattern, as in https://github.com/apache/hudi/pull/7128.





[GitHub] [hudi] xushiyan commented on pull request #7042: [MINOR] Improve the cdc log file name format

Posted by GitBox <gi...@apache.org>.
xushiyan commented on PR #7042:
URL: https://github.com/apache/hudi/pull/7042#issuecomment-1288385370

   @YannByron can you file a JIRA please?




[GitHub] [hudi] xushiyan commented on a diff in pull request #7042: [HUDI-5082] Improve the cdc log file name format

Posted by GitBox <gi...@apache.org>.
xushiyan commented on code in PR #7042:
URL: https://github.com/apache/hudi/pull/7042#discussion_r1013933825


##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -355,11 +356,19 @@ public static String createNewFileId(String idPfx, int id) {
     return String.format("%s-%d", idPfx, id);
   }
 
+  private static String extractCommonLogFileName(String fileName) {
+    if (fileName.endsWith(HoodieCDCUtils.CDC_LOGFILE_SUFFIX)) {
+      return fileName.substring(0, fileName.length() - HoodieCDCUtils.CDC_LOGFILE_SUFFIX.length());
+    }
+    return fileName;

Review Comment:
   Not a big fan of special handling for CDC file names. Can we achieve this by fixing the regex alone?
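
One way a regex-only fix could look is sketched below. This is a hypothetical pattern written for illustration, not the one from this PR or from any other Hudi change: it restricts the base-commit-time and extension groups to `[^.]*` so they cannot absorb a dot, and requires a non-empty version with `[0-9]+`. The class name `RegexOnlySketch`, the helper `parse`, and the use of `Matcher.find()` are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexOnlySketch {
  // Hypothetical pattern: "[^.]*" keeps groups 2 and 3 from crossing a dot,
  // and "[0-9]+" requires a non-empty version. Group 10 is the optional ".cdc".
  static final Pattern CANDIDATE = Pattern.compile(
      "\\.(.*)_([^.]*)\\.([^.]*)\\.([0-9]+)(_(([0-9]*)-([0-9]*)-([0-9]*)(\\.cdc)?))?");

  // Returns {version, "cdc" or "plain"}, or null if the name does not match.
  static String[] parse(String fileName) {
    Matcher m = CANDIDATE.matcher(fileName);
    if (!m.find()) {
      return null;
    }
    return new String[] {m.group(4), m.group(10) != null ? "cdc" : "plain"};
  }

  public static void main(String[] args) {
    String[] names = {
        ".b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.1",
        ".7a7aa1f6-370b-4871-a077-0848f1472e87-0_20221024114225320.log.1_0-125-118",
        ".7a7aa1f6-370b-4871-a077-0848f1472e87-0_20221024114225320.log.1_0-125-118.cdc"
    };
    for (String name : names) {
      String[] parsed = parse(name);
      System.out.println(parsed[0] + " " + parsed[1]);
    }
  }
}
```

The trade-off is that a tighter pattern changes what the existing groups accept, so it would need checking against every log file name the codebase already produces.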





[GitHub] [hudi] hudi-bot commented on pull request #7042: [HUDI-5082] Improve the cdc log file name format

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7042:
URL: https://github.com/apache/hudi/pull/7042#issuecomment-1288867160

   ## CI report:
   
   * 6ad211f90e9d94467ca6888e11bc28903b79ad15 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12509) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #7042: [HUDI-5082] Improve the cdc log file name format

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7042:
URL: https://github.com/apache/hudi/pull/7042#issuecomment-1306594170

   ## CI report:
   
   * 6ad211f90e9d94467ca6888e11bc28903b79ad15 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12509) 
   * 37903336e3258d43a2b934ac0299118c04f95f22 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12865) 
   




[GitHub] [hudi] hudi-bot commented on pull request #7042: [HUDI-5082] Improve the cdc log file name format

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7042:
URL: https://github.com/apache/hudi/pull/7042#issuecomment-1306591308

   ## CI report:
   
   * 6ad211f90e9d94467ca6888e11bc28903b79ad15 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12509) 
   * 37903336e3258d43a2b934ac0299118c04f95f22 UNKNOWN
   




[GitHub] [hudi] danny0405 commented on a diff in pull request #7042: [HUDI-5082] Improve the cdc log file name format

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #7042:
URL: https://github.com/apache/hudi/pull/7042#discussion_r1012529808


##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -79,7 +80,7 @@ public class FSUtils {
   private static final Logger LOG = LogManager.getLogger(FSUtils.class);
   // Log files are of this pattern - .b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.1
   private static final Pattern LOG_FILE_PATTERN =
-      Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)(-cdc)?))?");
+      Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");

Review Comment:
   Changing to this pattern should work well:
   ```java
   Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)(\\.cdc)?))?");
   ```
   
   I have tested it locally and it works correctly.





[GitHub] [hudi] xushiyan commented on pull request #7042: [HUDI-5082] Improve the cdc log file name format

Posted by GitBox <gi...@apache.org>.
xushiyan commented on PR #7042:
URL: https://github.com/apache/hudi/pull/7042#issuecomment-1306820042

   ![Screen Shot 2022-11-08 at 4 27 19 PM](https://user-images.githubusercontent.com/2701446/200513106-f3a334c5-3dbc-49d8-8951-93d2975ae3e4.png)
   




[GitHub] [hudi] YannByron commented on a diff in pull request #7042: [HUDI-5082] Improve the cdc log file name format

Posted by GitBox <gi...@apache.org>.
YannByron commented on code in PR #7042:
URL: https://github.com/apache/hudi/pull/7042#discussion_r1012550114


##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -79,7 +80,7 @@ public class FSUtils {
   private static final Logger LOG = LogManager.getLogger(FSUtils.class);
   // Log files are of this pattern - .b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.1
   private static final Pattern LOG_FILE_PATTERN =
-      Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)(-cdc)?))?");
+      Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");

Review Comment:
   The pattern you provided doesn't work for file names ending with `.cdc`, such as `.7a7aa1f6-370b-4871-a077-0848f1472e87-0_20221024114225320.log.1_0-125-118.cdc`.
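
A small check makes the failure mode concrete. The sketch below applies the proposed pattern to the file name from this comment, once with `Matcher.find()` and once with `Matcher.matches()`. That the calling code uses `find()` is an assumption here, and the class name `CdcPatternCheck` and helper `parse` are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CdcPatternCheck {
  // Pattern proposed in the review, with an optional "\\.cdc" group.
  static final Pattern PROPOSED = Pattern.compile(
      "\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)(\\.cdc)?))?");

  static final String CDC_NAME =
      ".7a7aa1f6-370b-4871-a077-0848f1472e87-0_20221024114225320.log.1_0-125-118.cdc";

  // Returns groups 2-4 (base commit time, extension, version),
  // or an empty array if there is no match.
  static String[] parse(boolean wholeString) {
    Matcher m = PROPOSED.matcher(CDC_NAME);
    boolean ok = wholeString ? m.matches() : m.find();
    return ok ? new String[] {m.group(2), m.group(3), m.group(4)} : new String[0];
  }

  public static void main(String[] args) {
    System.out.println("find():    " + Arrays.toString(parse(false)));
    System.out.println("matches(): " + Arrays.toString(parse(true)));
  }
}
```

Under `find()`, the greedy groups should stop at the first split that satisfies the pattern, which puts `20221024114225320.log` in the base-commit-time group and leaves the version group empty. Under `matches()`, the whole string must be consumed, so the groups should come out as `20221024114225320`, `log`, and `1`. That difference may explain why the pattern can appear to work in one test and fail in another.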





[GitHub] [hudi] xushiyan merged pull request #7042: [HUDI-5082] Improve the cdc log file name format

Posted by GitBox <gi...@apache.org>.
xushiyan merged PR #7042:
URL: https://github.com/apache/hudi/pull/7042




[GitHub] [hudi] hudi-bot commented on pull request #7042: [MINOR] Improve the cdc log file name format

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7042:
URL: https://github.com/apache/hudi/pull/7042#issuecomment-1288417562

   ## CI report:
   
   * 6ad211f90e9d94467ca6888e11bc28903b79ad15 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12509) 
   




[GitHub] [hudi] hudi-bot commented on pull request #7042: [MINOR] Improve the cdc log file name format

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7042:
URL: https://github.com/apache/hudi/pull/7042#issuecomment-1288414199

   ## CI report:
   
   * 6ad211f90e9d94467ca6888e11bc28903b79ad15 UNKNOWN
   




[GitHub] [hudi] danny0405 commented on a diff in pull request #7042: [HUDI-5082] Improve the cdc log file name format

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #7042:
URL: https://github.com/apache/hudi/pull/7042#discussion_r1007744473


##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -79,7 +80,7 @@ public class FSUtils {
   private static final Logger LOG = LogManager.getLogger(FSUtils.class);
   // Log files are of this pattern - .b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.1
   private static final Pattern LOG_FILE_PATTERN =
-      Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)(-cdc)?))?");
+      Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");

Review Comment:
   How about we add a `(.cdc)` group instead of `(-cdc)` here?


