Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/24 02:08:07 UTC

[GitHub] [hudi] sreeram26 opened a new pull request #2014: [HUDI-1153] Spark DataSource and Streaming Write must fail when operation type is misconfigured

sreeram26 opened a new pull request #2014:
URL: https://github.com/apache/hudi/pull/2014


   
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   Currently, for Spark DataSource and Streaming writes, the operation type is compared as a raw string at most usage sites, and an illegal operation type is silently swallowed by defaulting to upsert. This PR addresses both issues.
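The failure mode can be sketched as follows. This is a minimal, self-contained illustration, not the actual Hudi code; the enum and method names are illustrative:

```java
// Sketch of the failure mode this PR fixes: comparing the operation as a
// raw string and silently defaulting to upsert on anything unrecognized.
public class SilentDefaultSketch {

    enum WriteOperationType { BULK_INSERT, INSERT, UPSERT }

    // Old behavior: the operation arrives as a raw string, and anything
    // unrecognized (e.g. the typo "bulkinsert") silently becomes UPSERT.
    static WriteOperationType resolveOld(String operation) {
        if (operation.equals("bulk_insert")) {
            return WriteOperationType.BULK_INSERT;
        } else if (operation.equals("insert")) {
            return WriteOperationType.INSERT;
        } else {
            return WriteOperationType.UPSERT; // silent default
        }
    }

    // New behavior: parse into the enum once, up front, and fail fast on
    // anything that is not a known operation type.
    static WriteOperationType resolveNew(String operation) {
        for (WriteOperationType type : WriteOperationType.values()) {
            if (type.name().equalsIgnoreCase(operation)) {
                return type;
            }
        }
        throw new IllegalArgumentException("Invalid operation type: " + operation);
    }

    public static void main(String[] args) {
        System.out.println(resolveOld("bulkinsert"));  // UPSERT: silently wrong
        System.out.println(resolveNew("bulk_insert")); // BULK_INSERT
        try {
            resolveNew("bulkinsert");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```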
   
   ## Brief change log
   
   - [HUDI-1153] Spark DataSource and Streaming Write must fail when operation type is misconfigured
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests in TestDataSourceUtils:
   * testDoWriteOperationWithoutUserDefinedBulkInsertPartitioner
   * testDoWriteOperationWithNonExistUserDefinedBulkInsertPartitioner
   * testDoWriteOperationWithUserDefinedBulkInsertPartitioner
   
   If all existing tests pass, this should be good to review.
   
    - [ ] Existing tests pass 
   
   ## Committer checklist
   
    - [x] Has a corresponding JIRA in PR title & commit
    
    - [x] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [x] Necessary doc changes done or have another open PR - None
          
 - [x] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA - Not a large task


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on pull request #2014: [HUDI-1153] Spark DataSource and Streaming Write must fail when operation type is misconfigured

Posted by GitBox <gi...@apache.org>.
bvaradar commented on pull request #2014:
URL: https://github.com/apache/hudi/pull/2014#issuecomment-687242382


   Looks good @sreeram26. Thanks for your contribution.


----------------------------------------------------------------



[GitHub] [hudi] sreeram26 commented on pull request #2014: [HUDI-1153] Spark DataSource and Streaming Write must fail when operation type is misconfigured

Posted by GitBox <gi...@apache.org>.
sreeram26 commented on pull request #2014:
URL: https://github.com/apache/hudi/pull/2014#issuecomment-678878193


   @bvaradar 


----------------------------------------------------------------



[GitHub] [hudi] sreeram26 commented on a change in pull request #2014: [HUDI-1153] Spark DataSource and Streaming Write must fail when operation type is misconfigured

Posted by GitBox <gi...@apache.org>.
sreeram26 commented on a change in pull request #2014:
URL: https://github.com/apache/hudi/pull/2014#discussion_r475306551



##########
File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
##########
@@ -248,15 +249,18 @@ public static HoodieWriteClient createHoodieClient(JavaSparkContext jssc, String
   }
 
   public static JavaRDD<WriteStatus> doWriteOperation(HoodieWriteClient client, JavaRDD<HoodieRecord> hoodieRecords,
-      String instantTime, String operation) throws HoodieException {
-    if (operation.equals(DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL())) {
+      String instantTime, WriteOperationType operation) throws HoodieException {
+    if (operation == WriteOperationType.BULK_INSERT) {
       Option<BulkInsertPartitioner> userDefinedBulkInsertPartitioner =
           createUserDefinedBulkInsertPartitioner(client.getConfig());
       return client.bulkInsert(hoodieRecords, instantTime, userDefinedBulkInsertPartitioner);
-    } else if (operation.equals(DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL())) {
+    } else if (operation == WriteOperationType.INSERT) {
       return client.insert(hoodieRecords, instantTime);
     } else {
       // default is upsert
+      if (operation != WriteOperationType.UPSERT) {

Review comment:
       I am not throwing an explicit error here; based on the enum, the only other value it could potentially have is BOOTSTRAP, and the issue that exposed this bug would have thrown an exception in WriteOperationType.fromValue itself.
   
       Can change to throw a HoodieException, if the reviewer feels that is necessary.




----------------------------------------------------------------



[GitHub] [hudi] bvaradar merged pull request #2014: [HUDI-1153] Spark DataSource and Streaming Write must fail when operation type is misconfigured

Posted by GitBox <gi...@apache.org>.
bvaradar merged pull request #2014:
URL: https://github.com/apache/hudi/pull/2014


   


----------------------------------------------------------------



[GitHub] [hudi] bvaradar commented on a change in pull request #2014: [HUDI-1153] Spark DataSource and Streaming Write must fail when operation type is misconfigured

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #2014:
URL: https://github.com/apache/hudi/pull/2014#discussion_r477439176



##########
File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
##########
@@ -248,15 +249,18 @@ public static HoodieWriteClient createHoodieClient(JavaSparkContext jssc, String
   }
 
   public static JavaRDD<WriteStatus> doWriteOperation(HoodieWriteClient client, JavaRDD<HoodieRecord> hoodieRecords,
-      String instantTime, String operation) throws HoodieException {
-    if (operation.equals(DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL())) {
+      String instantTime, WriteOperationType operation) throws HoodieException {
+    if (operation == WriteOperationType.BULK_INSERT) {

Review comment:
       Suggestion: instead of if-else, using switch-case lets us take advantage of checkstyle to ensure we are not missing handling of any operation types.

##########
File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
##########
@@ -248,15 +249,18 @@ public static HoodieWriteClient createHoodieClient(JavaSparkContext jssc, String
   }
 
   public static JavaRDD<WriteStatus> doWriteOperation(HoodieWriteClient client, JavaRDD<HoodieRecord> hoodieRecords,
-      String instantTime, String operation) throws HoodieException {
-    if (operation.equals(DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL())) {
+      String instantTime, WriteOperationType operation) throws HoodieException {

Review comment:
       Also, since we have changed this to an enum, can you update DataSourceWriteOptions to reference the enum as well? The enum values and those defined in DataSourceWriteOptions are already consistent, so we can do this without any backwards-compatibility issues.
   
   ```
     val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
     val BULK_INSERT_OPERATION_OPT_VAL = WriteOperationType.BULK_INSERT.name
     val INSERT_OPERATION_OPT_VAL = WriteOperationType.INSERT.name
     val UPSERT_OPERATION_OPT_VAL = WriteOperationType.UPSERT.name
     val DELETE_OPERATION_OPT_VAL = WriteOperationType.DELETE.name
     val BOOTSTRAP_OPERATION_OPT_VAL = WriteOperationType.BOOTSTRAP.name
     val DEFAULT_OPERATION_OPT_VAL = UPSERT_OPERATION_OPT_VAL
   ```
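The same idea expressed in Java, with hypothetical class and constant names: deriving the public string constants from the enum means a rename or removal in the enum surfaces immediately at the constant definitions, so the config strings and the enum can never drift apart.

```java
// Sketch mirroring the Scala suggestion above; names are illustrative.
public class OperationOptionsSketch {

    enum WriteOperationType { BULK_INSERT, INSERT, UPSERT, DELETE, BOOTSTRAP }

    // Option values derived from the enum instead of repeated string literals.
    static final String OPERATION_OPT_KEY = "hoodie.datasource.write.operation";
    static final String BULK_INSERT_OPERATION_OPT_VAL = WriteOperationType.BULK_INSERT.name();
    static final String INSERT_OPERATION_OPT_VAL = WriteOperationType.INSERT.name();
    static final String UPSERT_OPERATION_OPT_VAL = WriteOperationType.UPSERT.name();
    static final String DEFAULT_OPERATION_OPT_VAL = UPSERT_OPERATION_OPT_VAL;

    public static void main(String[] args) {
        System.out.println(OPERATION_OPT_KEY + " -> " + DEFAULT_OPERATION_OPT_VAL);
    }
}
```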

##########
File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
##########
@@ -248,15 +249,18 @@ public static HoodieWriteClient createHoodieClient(JavaSparkContext jssc, String
   }
 
   public static JavaRDD<WriteStatus> doWriteOperation(HoodieWriteClient client, JavaRDD<HoodieRecord> hoodieRecords,
-      String instantTime, String operation) throws HoodieException {
-    if (operation.equals(DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL())) {
+      String instantTime, WriteOperationType operation) throws HoodieException {
+    if (operation == WriteOperationType.BULK_INSERT) {
       Option<BulkInsertPartitioner> userDefinedBulkInsertPartitioner =
           createUserDefinedBulkInsertPartitioner(client.getConfig());
       return client.bulkInsert(hoodieRecords, instantTime, userDefinedBulkInsertPartitioner);
-    } else if (operation.equals(DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL())) {
+    } else if (operation == WriteOperationType.INSERT) {
       return client.insert(hoodieRecords, instantTime);
     } else {
       // default is upsert
+      if (operation != WriteOperationType.UPSERT) {

Review comment:
       @sreeram26 : I think it is better to throw an exception in this case. We have seen cases where users intended to use bulk insert but had a typo in the operation type.




----------------------------------------------------------------