You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@gobblin.apache.org by "meethngala (via GitHub)" <gi...@apache.org> on 2023/04/24 18:05:02 UTC

[GitHub] [gobblin] meethngala opened a new pull request, #3687: [GOBBLIN-1825]Hive retention job should fail if deleting underlying files fail

meethngala opened a new pull request, #3687:
URL: https://github.com/apache/gobblin/pull/3687

   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
   
   
   ### JIRA
   - [ ] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
       - https://issues.apache.org/jira/browse/GOBBLIN-1825
   
   
   ### Description
   Hive retention would perform two tasks: drop the partition first and then delete the underlying files. Now, if for some reason the partition was dropped, but we couldn't delete the underlying files, the job would still succeed. Thus, if we try to re-run the job, it wouldn't fine any work and the hdfs files would never be discovered.
   
   In order to avoid running into this situation, I have made the below changes:
   - Delete the underlying files first and then try dropping the hive partition
   - throw the exceptions and fail the job rather than marking the job status as successful
   
   
   ### Tests
   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   
   
   ### Commits
   - [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
       1. Subject is separated from body by a blank line
       2. Subject is limited to 50 characters
       3. Subject does not end with a period
       4. Subject uses the imperative mood ("add", not "adding")
       5. Body wraps at 72 characters
       6. Body explains "what" and "why", not "how"
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] meethngala commented on a diff in pull request #3687: [GOBBLIN-1825]Hive retention job should fail if deleting underlying files fail

Posted by "meethngala (via GitHub)" <gi...@apache.org>.
meethngala commented on code in PR #3687:
URL: https://github.com/apache/gobblin/pull/3687#discussion_r1204611662


##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/retention/version/HiveDatasetVersionCleaner.java:
##########
@@ -88,20 +88,24 @@ public void clean() throws IOException {
       Partition partition = hiveDatasetVersion.getPartition();
       try {
         if (!cleanableHiveDataset.isSimulate()) {
+          if (cleanableHiveDataset.isShouldDeleteData()) {
+            cleanableHiveDataset.getFsCleanableHelper().clean(hiveDatasetVersion, possiblyEmptyDirectories);
+          }
           client.get().dropPartition(partition.getTable().getDbName(), partition.getTable().getTableName(), partition.getValues(), false);

Review Comment:
   yup.. makes sense! I have added the comments outlining why the order is critical in my latest commit!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] meethngala commented on a diff in pull request #3687: [GOBBLIN-1825]Hive retention job should fail if deleting underlying files fail

Posted by "meethngala (via GitHub)" <gi...@apache.org>.
meethngala commented on code in PR #3687:
URL: https://github.com/apache/gobblin/pull/3687#discussion_r1204616456


##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/retention/version/HiveDatasetVersionCleaner.java:
##########
@@ -88,20 +88,24 @@ public void clean() throws IOException {
       Partition partition = hiveDatasetVersion.getPartition();
       try {
         if (!cleanableHiveDataset.isSimulate()) {
+          if (cleanableHiveDataset.isShouldDeleteData()) {
+            cleanableHiveDataset.getFsCleanableHelper().clean(hiveDatasetVersion, possiblyEmptyDirectories);

Review Comment:
   I see the concern here. It always fetches the paths to be deleted from the `FileSystemDatasetVersion`. This can be confirmed with this line: `Set<Path> pathsToDelete = versionToDelete.getPaths();`. Thus, the set should be empty if we try to fetch paths that don't exist. However, upon re-run, we can also set the flag `isShouldDeleteData` to `false` which will not try to delete it again. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] meethngala commented on a diff in pull request #3687: [GOBBLIN-1825]Hive retention job should fail if deleting underlying files fail

Posted by "meethngala (via GitHub)" <gi...@apache.org>.
meethngala commented on code in PR #3687:
URL: https://github.com/apache/gobblin/pull/3687#discussion_r1175846256


##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/retention/version/HiveDatasetVersionCleaner.java:
##########
@@ -87,21 +87,25 @@ public void clean() throws IOException {
     try (AutoReturnableObject<IMetaStoreClient> client = cleanableHiveDataset.getClientPool().getClient()) {
       Partition partition = hiveDatasetVersion.getPartition();
       try {
+        if (cleanableHiveDataset.isShouldDeleteData()) {
+          cleanableHiveDataset.getFsCleanableHelper().clean(hiveDatasetVersion, possiblyEmptyDirectories);
+        }

Review Comment:
   Yeah, I thought that was the expected flow since we used to allow deleting of the underlying files in simulation mode. Good that we identified that bug too :)! I have pushed my changes in the latest PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] codecov-commenter commented on pull request #3687: [GOBBLIN-1825]Hive retention job should fail if deleting underlying files fail

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #3687:
URL: https://github.com/apache/gobblin/pull/3687#issuecomment-1520633907

   ## [Codecov](https://codecov.io/gh/apache/gobblin/pull/3687?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#3687](https://codecov.io/gh/apache/gobblin/pull/3687?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (2342f39) into [master](https://codecov.io/gh/apache/gobblin/commit/22cfc5ceb2d89aeadf6626763c3e67a0b7d816f8?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (22cfc5c) will **increase** coverage by `2.45%`.
   > The diff coverage is `33.33%`.
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #3687      +/-   ##
   ============================================
   + Coverage     46.87%   49.32%   +2.45%     
   + Complexity    10756     9273    -1483     
   ============================================
     Files          2138     1756     -382     
     Lines         84018    68375   -15643     
     Branches       9337     7791    -1546     
   ============================================
   - Hits          39380    33725    -5655     
   + Misses        41042    31512    -9530     
   + Partials       3596     3138     -458     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/gobblin/pull/3687?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/retention/version/HiveDatasetVersionCleaner.java](https://codecov.io/gh/apache/gobblin/pull/3687?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1kYXRhLW1hbmFnZW1lbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2dvYmJsaW4vZGF0YS9tYW5hZ2VtZW50L3JldGVudGlvbi92ZXJzaW9uL0hpdmVEYXRhc2V0VmVyc2lvbkNsZWFuZXIuamF2YQ==) | `76.19% <33.33%> (-3.48%)` | :arrow_down: |
   
   ... and [393 files with indirect coverage changes](https://codecov.io/gh/apache/gobblin/pull/3687/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] Will-Lo commented on a diff in pull request #3687: [GOBBLIN-1825]Hive retention job should fail if deleting underlying files fail

Posted by "Will-Lo (via GitHub)" <gi...@apache.org>.
Will-Lo commented on code in PR #3687:
URL: https://github.com/apache/gobblin/pull/3687#discussion_r1175633516


##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/retention/version/HiveDatasetVersionCleaner.java:
##########
@@ -87,21 +87,25 @@ public void clean() throws IOException {
     try (AutoReturnableObject<IMetaStoreClient> client = cleanableHiveDataset.getClientPool().getClient()) {
       Partition partition = hiveDatasetVersion.getPartition();
       try {
+        if (cleanableHiveDataset.isShouldDeleteData()) {
+          cleanableHiveDataset.getFsCleanableHelper().clean(hiveDatasetVersion, possiblyEmptyDirectories);
+        }

Review Comment:
   If we are doing this first before dropping the partitions, then we should also check if the flow is a simulate flow. Otherwise we will delete the underlying data before it simulates dropping the partition. Also I think this was a bug in the existing code as well, before it would also delete the underlying data if it was simulate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] ZihanLi58 merged pull request #3687: [GOBBLIN-1825]Hive retention job should fail if deleting underlying files fail

Posted by "ZihanLi58 (via GitHub)" <gi...@apache.org>.
ZihanLi58 merged PR #3687:
URL: https://github.com/apache/gobblin/pull/3687


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] phet commented on a diff in pull request #3687: [GOBBLIN-1825]Hive retention job should fail if deleting underlying files fail

Posted by "phet (via GitHub)" <gi...@apache.org>.
phet commented on code in PR #3687:
URL: https://github.com/apache/gobblin/pull/3687#discussion_r1178754124


##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/retention/version/HiveDatasetVersionCleaner.java:
##########
@@ -88,20 +88,24 @@ public void clean() throws IOException {
       Partition partition = hiveDatasetVersion.getPartition();
       try {
         if (!cleanableHiveDataset.isSimulate()) {
+          if (cleanableHiveDataset.isShouldDeleteData()) {
+            cleanableHiveDataset.getFsCleanableHelper().clean(hiveDatasetVersion, possiblyEmptyDirectories);
+          }
           client.get().dropPartition(partition.getTable().getDbName(), partition.getTable().getTableName(), partition.getValues(), false);
           log.info("Successfully dropped partition " + partition.getCompleteName());
         } else {
           log.info("Simulating drop partition " + partition.getCompleteName());
         }
-        if (cleanableHiveDataset.isShouldDeleteData()) {
-          cleanableHiveDataset.getFsCleanableHelper().clean(hiveDatasetVersion, possiblyEmptyDirectories);
-        }
       } catch (TException | IOException e) {
         log.warn(String.format("Failed to completely delete partition %s.", partition.getCompleteName()), e);
         throw new IOException(e);
       }
+    } try {

Review Comment:
   a `try` should begin on its own line



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/retention/version/HiveDatasetVersionCleaner.java:
##########
@@ -88,20 +88,24 @@ public void clean() throws IOException {
       Partition partition = hiveDatasetVersion.getPartition();
       try {
         if (!cleanableHiveDataset.isSimulate()) {
+          if (cleanableHiveDataset.isShouldDeleteData()) {
+            cleanableHiveDataset.getFsCleanableHelper().clean(hiveDatasetVersion, possiblyEmptyDirectories);
+          }
           client.get().dropPartition(partition.getTable().getDbName(), partition.getTable().getTableName(), partition.getValues(), false);

Review Comment:
   since this is tricky enough - to the extent that we're here fixing a logical bug - please document w/ a comment that order really is critical (and why that's so)



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/retention/version/HiveDatasetVersionCleaner.java:
##########
@@ -88,20 +88,24 @@ public void clean() throws IOException {
       Partition partition = hiveDatasetVersion.getPartition();
       try {
         if (!cleanableHiveDataset.isSimulate()) {
+          if (cleanableHiveDataset.isShouldDeleteData()) {
+            cleanableHiveDataset.getFsCleanableHelper().clean(hiveDatasetVersion, possiblyEmptyDirectories);

Review Comment:
   are we certain this is an idempotent operation that won't throw an exception the second and more time?
   
   scenario:
   1. the first time through it was deleted before dropping of the partition failed
   2. upon retry, before again trying to drop the partition, we invoke this once more to delete a FS location that was already removed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org