You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@gobblin.apache.org by GitBox <gi...@apache.org> on 2021/04/02 22:29:03 UTC

[GitHub] [gobblin] ZihanLi58 opened a new pull request #3255: [GOBBLIN-1419]Error handling for compaction pipeline on GMCE emitted error

ZihanLi58 opened a new pull request #3255:
URL: https://github.com/apache/gobblin/pull/3255


   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
   
   
   ### JIRA
   - [ ] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
       - https://issues.apache.org/jira/browse/GOBBLIN-1419
   
   
   ### Description
   - [ ] Here are some details about my PR, including screenshots (if applicable):
   No compaction publish file and emit GMCE are happening separately. If file rename succeed but emit GMCE fails, we will end up with not emit GMCE for the compaction. So need  a strategy to do error handling for compaction that when emit GMCE fails, next job will treat the previous compaction as fail and re-compact the data.
   
   ### Tests
   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   unit test
   
   ### Commits
   - [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
       1. Subject is separated from body by a blank line
       2. Subject is limited to 50 characters
       3. Subject does not end with a period
       4. Subject uses the imperative mood ("add", not "adding")
       5. Body wraps at 72 characters
       6. Body explains "what" and "why", not "how"
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [gobblin] ZihanLi58 commented on pull request #3255: [GOBBLIN-1419]Error handling for compaction pipeline on GMCE emitted error

Posted by GitBox <gi...@apache.org>.
ZihanLi58 commented on pull request #3255:
URL: https://github.com/apache/gobblin/pull/3255#issuecomment-818004268


   @autumnust @sv2000 can you help take a look at this change? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [gobblin] ZihanLi58 commented on a change in pull request #3255: [GOBBLIN-1419]Error handling for compaction pipeline on GMCE emitted error

Posted by GitBox <gi...@apache.org>.
ZihanLi58 commented on a change in pull request #3255:
URL: https://github.com/apache/gobblin/pull/3255#discussion_r613437586



##########
File path: gobblin-compaction/src/main/java/org/apache/gobblin/compaction/verify/CompactionThresholdVerifier.java
##########
@@ -60,7 +62,8 @@ public CompactionThresholdVerifier(State state) {
    * dataset. To avoid scalability issue, we choose a stateless approach where each dataset tracks
    * record count by themselves and persist it in the file system)
    *
-   * @return true iff the difference exceeds the threshold or this is the first time compaction
+   * @return true if the difference exceeds the threshold or this is the first time compaction or

Review comment:
       So the logic of verifier is if any of the verifier fail the dataset, the compaction will not run. In this case, if gmce verifier say it needs to re compact but threshold verifier say it does not need to be compacted, then the dataset will be skipped. That's the reason I embedded the logic here. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [gobblin] ZihanLi58 commented on a change in pull request #3255: [GOBBLIN-1419]Error handling for compaction pipeline on GMCE emitted error

Posted by GitBox <gi...@apache.org>.
ZihanLi58 commented on a change in pull request #3255:
URL: https://github.com/apache/gobblin/pull/3255#discussion_r613580204



##########
File path: gobblin-compaction/src/main/java/org/apache/gobblin/compaction/action/CompactionGMCEPublishingAction.java
##########
@@ -57,9 +58,11 @@
 
   public static final String ICEBERG_ID_ATTRIBUTE = "iceberg.id";
   public static final String ICEBERG_REQUIRED_ATTRIBUTE = "iceberg.required";
+  public final static String GMCE_EMITTED_KEY = "GMCE.emitted";
   private final State state;
   private final CompactionJobConfigurator configurator;
   private final Configuration conf;
+  private InputRecordCountHelper helper;

Review comment:
       +1, but this will be a backward incompatible change, might be better when we release a new version?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [gobblin] autumnust commented on a change in pull request #3255: [GOBBLIN-1419]Error handling for compaction pipeline on GMCE emitted error

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #3255:
URL: https://github.com/apache/gobblin/pull/3255#discussion_r613395422



##########
File path: gobblin-compaction/src/main/java/org/apache/gobblin/compaction/verify/CompactionThresholdVerifier.java
##########
@@ -81,10 +84,15 @@ public Result verify(FileSystemDataset dataset) {
         newRecords = helper.calculateRecordCount(Lists.newArrayList(new Path(dataset.datasetURN())));
       }
       double oldRecords = helper.readRecordCount(new Path(result.getDstAbsoluteDir()));
-
+      State datasetState = helper.loadState(new Path(result.getDstAbsoluteDir()));
       if (oldRecords == 0) {
         return new Result(true, "");
       }
+      if(state.getPropAsBoolean(ConfigurationKeys.GOBBLIN_METADATA_CHANGE_EVENT_ENABLED, false)) {

Review comment:
       these two branches could be merged.

##########
File path: gobblin-compaction/src/main/java/org/apache/gobblin/compaction/verify/CompactionThresholdVerifier.java
##########
@@ -60,7 +62,8 @@ public CompactionThresholdVerifier(State state) {
    * dataset. To avoid scalability issue, we choose a stateless approach where each dataset tracks
    * record count by themselves and persist it in the file system)
    *
-   * @return true iff the difference exceeds the threshold or this is the first time compaction
+   * @return true if the difference exceeds the threshold or this is the first time compaction or

Review comment:
       I am not sure if it is a good idea to package the logic of checking gmce-emission inside `thresholder` verifier.  What's the cost of having a separate implementation of verifier ? 

##########
File path: gobblin-compaction/src/main/java/org/apache/gobblin/compaction/action/CompactionGMCEPublishingAction.java
##########
@@ -57,9 +58,11 @@
 
   public static final String ICEBERG_ID_ATTRIBUTE = "iceberg.id";
   public static final String ICEBERG_REQUIRED_ATTRIBUTE = "iceberg.required";
+  public final static String GMCE_EMITTED_KEY = "GMCE.emitted";
   private final State state;
   private final CompactionJobConfigurator configurator;
   private final Configuration conf;
+  private InputRecordCountHelper helper;

Review comment:
       This class may deserve a better name given the fact that it is not only concerning for recordCount anymore. 

##########
File path: gobblin-compaction/src/main/java/org/apache/gobblin/compaction/action/CompactionGMCEPublishingAction.java
##########
@@ -57,9 +58,11 @@
 
   public static final String ICEBERG_ID_ATTRIBUTE = "iceberg.id";
   public static final String ICEBERG_REQUIRED_ATTRIBUTE = "iceberg.required";
+  public final static String GMCE_EMITTED_KEY = "GMCE.emitted";
   private final State state;
   private final CompactionJobConfigurator configurator;
   private final Configuration conf;
+  private InputRecordCountHelper helper;

Review comment:
       Second comment for this: Considering the wide usage of `InputRecordCountHelper` among different implementation of CompactionAction, does it make sense to share an instance of that instead of instantiating the same object in different implementations? 

##########
File path: gobblin-compaction/src/main/java/org/apache/gobblin/compaction/action/CompactionCompleteFileOperationAction.java
##########
@@ -167,6 +167,10 @@ public void onCompactionJobComplete(FileSystemDataset dataset) throws IOExceptio
         compactionState.setProp(DUPLICATE_COUNT_TOTAL + Long.toString(executionCount),
             compactionState.getProp(DUPLICATE_COUNT_TOTAL, "null"));
       }
+      if(state.getPropAsBoolean(ConfigurationKeys.GOBBLIN_METADATA_CHANGE_EVENT_ENABLED, false)) {

Review comment:
       please consider using auto-formatting to clean the spaces between preserved keywords




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [gobblin] asfgit closed pull request #3255: [GOBBLIN-1419]Error handling for compaction pipeline on GMCE emitted error

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #3255:
URL: https://github.com/apache/gobblin/pull/3255


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [gobblin] autumnust commented on a change in pull request #3255: [GOBBLIN-1419]Error handling for compaction pipeline on GMCE emitted error

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #3255:
URL: https://github.com/apache/gobblin/pull/3255#discussion_r619405644



##########
File path: gobblin-compaction/src/main/java/org/apache/gobblin/compaction/action/CompactionGMCEPublishingAction.java
##########
@@ -57,9 +58,11 @@
 
   public static final String ICEBERG_ID_ATTRIBUTE = "iceberg.id";
   public static final String ICEBERG_REQUIRED_ATTRIBUTE = "iceberg.required";
+  public final static String GMCE_EMITTED_KEY = "GMCE.emitted";
   private final State state;
   private final CompactionJobConfigurator configurator;
   private final Configuration conf;
+  private InputRecordCountHelper helper;

Review comment:
       That's fair. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [gobblin] ZihanLi58 closed pull request #3255: [GOBBLIN-1419]Error handling for compaction pipeline on GMCE emitted error

Posted by GitBox <gi...@apache.org>.
ZihanLi58 closed pull request #3255:
URL: https://github.com/apache/gobblin/pull/3255


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org