Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/10/22 02:30:31 UTC

[GitHub] [hudi] lw309637554 opened a new pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

lw309637554 opened a new pull request #2196:
URL: https://github.com/apache/hudi/pull/2196


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   Spark SQL: support overwrite (SaveMode.Overwrite) using the replace action.
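
   With this change, `SaveMode.Overwrite` produces a replace action ("replacecommit") on the timeline instead of deleting and recreating the table directory. A minimal sketch of the intended usage, assuming the datasource options used in the tests below:

   ```scala
   // Overwrite now maps to the replace-based operation; the write call itself is unchanged.
   inputDF.write.format("org.apache.hudi")
     .options(commonOpts)        // table name, record key, partition path, etc.
     .mode(SaveMode.Overwrite)
     .save(basePath)
   ```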
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-722967018


   > > @satishkotha
   > > A. Thanks so much. This PR needs to solve the issue with a better approach.
   > > Now I am clearer about the overwrite semantics of table.overwrite versus Spark SQL overwrite for Hudi.
   > > B. Spark SQL overwrite for Hudi should also have the same capabilities as Spark SQL, Hive, and Delta Lake.
   > > These engines have three partition-overwrite modes:
   > > 
   > > 1. Dynamic partition: delete all existing partition data, then insert the new data.
   > > 2. Static partition: overwrite only the partitions the user specifies.
   > > 3. Mixed partition: a combination of 1 and 2.
   > >    More detail in:
   > >    https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-dml-insert-overwrite-table.html
   > >    https://www.programmersought.com/article/47155360487/
   > > 
   > > C. Our plan
   > > 
   > > 1. Spark SQL overwrite for Hudi currently works in dynamic-partition mode. I will resolve it in this issue, HUDI-1349: first support deleting all partitions in [HUDI-1350](https://issues.apache.org/jira/browse/HUDI-1350), then land this issue (per @satishkotha's suggestion).
   > > 2. Spark SQL for Hudi does not yet support "static partition" mode; that will land in [HUDI-1374](https://issues.apache.org/jira/browse/HUDI-1374).
   > > 3. "Mixed partition" mode will be supported in the future.
   > > 
   > > cc @n3nash @vinothchandar please help review whether this plan for Spark SQL overwrite in Hudi is suitable.
   > > It is also possible that my understanding is incorrect.
   > 
   > Just FYI, in the [RFC](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+18+Insert+Overwrite+API#RFC18InsertOverwriteAPI-API) we discussed having an 'insert_overwrite_table' operation to support dynamic partitioning; static partitioning is supported by 'insert_overwrite'.
   > 
   > But I agree with getting approval from Nishith/Vinoth before starting implementation.
   
   @satishkotha thanks so much. I will read more about RFC-18 and reconcile my plan with its implementation.


----------------------------------------------------------------



[GitHub] [hudi] satishkotha commented on a change in pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#discussion_r513672664



##########
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -93,6 +93,11 @@ private[hudi] object HoodieSparkSqlWriter {
       operation = WriteOperationType.INSERT
     }
 
+    // If the mode is Overwrite, should use INSERT_OVERWRITE operation

Review comment:
       @lw309637554 can you add the test below:
   
   step 1: Write N records to the hoodie table for partition1.
   step 2: Write N more records using SaveMode.Overwrite for partition2.
   step 3: Query all the rows from the hoodie table. We should see only N records, all for partition2.
   
   If I understand your code correctly, with this change we will see 2N records.




----------------------------------------------------------------



[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-729001085


   @satishkotha I have implemented insert_overwrite_table to support Spark SQL overwrite. Please review again, thanks.


----------------------------------------------------------------



[GitHub] [hudi] satishkotha commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
satishkotha commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-718127824


   > @satishkotha @n3nash can you help review again? Thanks
   
   Can you please add additional tests with multiple steps writing to disjoint partitions?


----------------------------------------------------------------



[GitHub] [hudi] n3nash merged pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
n3nash merged pull request #2196:
URL: https://github.com/apache/hudi/pull/2196


   


----------------------------------------------------------------



[GitHub] [hudi] satishkotha commented on a change in pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#discussion_r517675835



##########
File path: hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
##########
@@ -156,6 +162,76 @@ class TestCOWDataSource extends HoodieClientTestBase {
     assertEquals(100, timeTravelDF.count()) // 100 initial inserts must be pulled
   }
 
+  @Test def testOverWriteModeUseReplaceAction(): Unit = {
+    val records1 = recordsToStrings(dataGen.generateInserts("001", 5)).toList
+    val inputDF1 = spark.read.json(spark.sparkContext.parallelize(records1, 2))
+    inputDF1.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+      .mode(SaveMode.Append)
+      .save(basePath)
+
+    val records2 = recordsToStrings(dataGen.generateInserts("002", 5)).toList
+    val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2))
+    inputDF2.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+      .mode(SaveMode.Overwrite)
+      .save(basePath)
+
+    val metaClient = new HoodieTableMetaClient(spark.sparkContext.hadoopConfiguration, basePath, true)
+    val commits =  metaClient.getActiveTimeline.filterCompletedInstants().getInstants.toArray
+      .map(instant => (instant.asInstanceOf[HoodieInstant]).getAction)
+    assertEquals(2, commits.size)
+    assertEquals("commit", commits(0))
+    assertEquals("replacecommit", commits(1))
+  }
+
+  @Test def testOverWriteModeUseReplaceActionOnDisJointPartitions(): Unit = {
+    // step1: Write 5 records to hoodie table for partition1 DEFAULT_FIRST_PARTITION_PATH
+    val records1 = recordsToStrings(dataGen.generateInsertsForPartition("001", 5, HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH)).toList
+    val inputDF1 = spark.read.json(spark.sparkContext.parallelize(records1, 2))
+    inputDF1.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+      .mode(SaveMode.Append)
+      .save(basePath)
+
+    // step2: Write 7 more records using SaveMode.Overwrite for partition2 DEFAULT_SECOND_PARTITION_PATH
+    val records2 = recordsToStrings(dataGen.generateInsertsForPartition("002", 7, HoodieTestDataGenerator.DEFAULT_SECOND_PARTITION_PATH)).toList
+    val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2))
+    inputDF2.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+      .mode(SaveMode.Overwrite)
+      .save(basePath)
+    inputDF2.registerTempTable("tmpTable")
+
+    // step3: Query the rows count from hoodie table  for partition1 DEFAULT_FIRST_PARTITION_PATH
+    val recordCountForParititon1 =  spark.sql(String.format("select count(*) from tmpTable where partition = '%s'", HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH)).collect()

Review comment:
       tmpTable is registered only on inputDF2, so you will not get data for partition1 even though we did SaveMode.Append in line 206. Don't you need to read back all the data from the table? Can you please fix the test setup?
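
   A sketch of the suggested fix, with the glob depth assumed from the test generator's yyyy/mm/dd partition layout:

   ```scala
   // Read the whole table back (not the input DataFrame) before counting per partition.
   val snapshotDF = spark.read.format("org.apache.hudi").load(basePath + "/*/*/*/*")
   snapshotDF.createOrReplaceTempView("hudiTable")
   val countForPartition1 = spark.sql(String.format(
     "select count(*) from hudiTable where partition = '%s'",
     HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH)).collect()
   ```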

##########
File path: hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
##########
@@ -156,6 +162,76 @@ class TestCOWDataSource extends HoodieClientTestBase {
     assertEquals(100, timeTravelDF.count()) // 100 initial inserts must be pulled
   }
 
+  @Test def testOverWriteModeUseReplaceAction(): Unit = {

Review comment:
       I ran a similar test using the quick start setup guide:
   1. Ingest 10 records for partition "2020/03/11":
   `scala> val inserts = convertToStringList(dataGen.generateInserts(10))
   inserts: java.util.List[String] = [{"ts": 0, "uuid": "299d5202-1ea0-4918-9d2f-2365bc1c2402", "rider": "rider-213", "driver": "driver-213", "begin_lat": 0.4726905879569653, "begin_lon": 0.46157858450465483, "end_lat": 0.754803407008858, "end_lon": 0.9671159942018241, "fare": 34.158284716382845, "partitionpath": "2020/03/11"}, {"ts": 0, "uuid": "0fc23a14-c815-4b09-bff1-c6193a6de5b7", "rider": "rider-213", "driver": "driver-213", "begin_lat": 0.6100070562136587, "begin_lon": 0.8779402295427752, "end_lat": 0.3407870505929602, "end_lon": 0.5030798142293655, "fare": 43.4923811219014, "partitionpath": "2020/03/11"}, {"ts": 0, "uuid": "7136e8f8-ed82-4fc4-b60d-f7367f7be791", "rider": "rider-213", "driver": "driver-213", "begin_lat": 0.5731835407930634, "begin_lon": 0.4923479652912024, "end_lat":...
   scala> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 1))
   warning: there was one deprecation warning; re-run with -deprecation for details
   df: org.apache.spark.sql.DataFrame = [begin_lat: double, begin_lon: double ... 8 more fields]
   
   scala> df.write.format("org.apache.hudi").
        |     options(getQuickstartWriteConfigs).
        |     option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |     option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |     option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |     option(TABLE_NAME, tableName).
        |     mode(Overwrite).
        |     save(basePath);`
   
   2. Query the data back and see that the records were written correctly:
   `scala> val tripsSnapshotDF = spark.
        |     read.
        |     format("org.apache.hudi").
        |     load(basePath + "/*/*/*/*")
   20/11/04 14:51:53 WARN DefaultSource: Loading Base File Only View.
   tripsSnapshotDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 13 more fields]
   
   scala> tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
   
   scala> 
   
   scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_trips_snapshot").show()
   +-------------------+--------------------+----------------------+---------+----------+------------------+
   |_hoodie_commit_time|  _hoodie_record_key|_hoodie_partition_path|    rider|    driver|              fare|
   +-------------------+--------------------+----------------------+---------+----------+------------------+
   |     20201104145141|299d5202-1ea0-491...|            2020/03/11|rider-213|driver-213|34.158284716382845|
   |     20201104145141|0fc23a14-c815-4b0...|            2020/03/11|rider-213|driver-213|  43.4923811219014|
   |     20201104145141|7136e8f8-ed82-4fc...|            2020/03/11|rider-213|driver-213| 64.27696295884016|
   |     20201104145141|5ffa488e-d75e-4ef...|            2020/03/11|rider-213|driver-213| 93.56018115236618|
   |     20201104145141|cf09166f-dc3f-45e...|            2020/03/11|rider-213|driver-213|17.851135255091155|
   |     20201104145141|6f522490-e29e-419...|            2020/03/11|rider-213|driver-213|19.179139106643607|
   |     20201104145141|db97e3ef-ad7a-4e8...|            2020/03/11|rider-213|driver-213| 33.92216483948643|
   |     20201104145141|a42d7c22-d0bf-4b9...|            2020/03/11|rider-213|driver-213| 66.62084366450246|
   |     20201104145141|94154d3e-c3da-436...|            2020/03/11|rider-213|driver-213| 41.06290929046368|
   |     20201104145141|618b3f38-bb71-402...|            2020/03/11|rider-213|driver-213| 27.79478688582596|
   +-------------------+--------------------+----------------------+---------+----------+------------------+
   `
   
   3. Use insert_overwrite and write to a new partition (this is with master, not using your change, so I still use the insert_overwrite operation with Append mode):
   `scala>  val dataGen = new DataGenerator(Array("2020/09/11"))
   dataGen: org.apache.hudi.QuickstartUtils.DataGenerator = org.apache.hudi.QuickstartUtils$DataGenerator@f00b18a
   
   scala> val inserts2 = convertToStringList(dataGen.generateInserts(1))
   scala> val df = spark.read.json(spark.sparkContext.parallelize(inserts2, 1))
   scala> df.write.format("org.apache.hudi").
        |     options(getQuickstartWriteConfigs).
        |     option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |     option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |     option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |     option(TABLE_NAME, tableName).
        |     mode(Append).
        | option(OPERATION_OPT_KEY, "insert_overwrite").
        |      save(basePath);`
   4. Query the data back:
   `scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_trips_snapshot").show()
   +-------------------+--------------------+----------------------+---------+----------+------------------+
   |_hoodie_commit_time|  _hoodie_record_key|_hoodie_partition_path|    rider|    driver|              fare|
   +-------------------+--------------------+----------------------+---------+----------+------------------+
   |     20201104145258|299d5202-1ea0-491...|            2020/03/11|rider-213|driver-213|34.158284716382845|
   |     20201104145258|0fc23a14-c815-4b0...|            2020/03/11|rider-213|driver-213|  43.4923811219014|
   |     20201104145258|7136e8f8-ed82-4fc...|            2020/03/11|rider-213|driver-213| 64.27696295884016|
   |     20201104145258|5ffa488e-d75e-4ef...|            2020/03/11|rider-213|driver-213| 93.56018115236618|
   |     20201104145258|cf09166f-dc3f-45e...|            2020/03/11|rider-213|driver-213|17.851135255091155|
   |     20201104145258|6f522490-e29e-419...|            2020/03/11|rider-213|driver-213|19.179139106643607|
   |     20201104145258|db97e3ef-ad7a-4e8...|            2020/03/11|rider-213|driver-213| 33.92216483948643|
   |     20201104145258|a42d7c22-d0bf-4b9...|            2020/03/11|rider-213|driver-213| 66.62084366450246|
   |     20201104145258|94154d3e-c3da-436...|            2020/03/11|rider-213|driver-213| 41.06290929046368|
   |     20201104145258|618b3f38-bb71-402...|            2020/03/11|rider-213|driver-213| 27.79478688582596|
   |     20201104145348|4dac8aa3-b8fa-410...|            2020/09/11|rider-284|driver-284|49.527694252432056|
   +-------------------+--------------------+----------------------+---------+----------+------------------+
   `
   
   As you can see in step 4, we see all 11 records. With 'SaveMode.Overwrite' we should see only 1 record. Hope this is clear.




----------------------------------------------------------------



[GitHub] [hudi] codecov-io edited a comment on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-728653013


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=h1) Report
   > Merging [#2196](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=desc) (305f320) into [master](https://codecov.io/gh/apache/hudi/commit/430d4b428e7c5b325c7414a187f9cda158c2758a?el=desc) (430d4b4) will **decrease** coverage by `0.07%`.
   > The diff coverage is `16.66%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2196/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2196      +/-   ##
   ============================================
   - Coverage     53.54%   53.47%   -0.08%     
   - Complexity     2770     2772       +2     
   ============================================
     Files           348      348              
     Lines         16109    16139      +30     
     Branches       1643     1643              
   ============================================
   + Hits           8626     8630       +4     
   - Misses         6785     6810      +25     
   - Partials        698      699       +1     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `38.37% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `55.17% <4.34%> (-0.16%)` | `0.00 <0.00> (ø)` | |
   | hudihadoopmr | `32.94% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudispark | `65.86% <66.66%> (+0.28%)` | `0.00 <2.00> (ø)` | |
   | huditimelineservice | `64.34% <14.28%> (-0.96%)` | `0.00 <0.00> (ø)` | |
   | hudiutilities | `70.09% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...common/table/view/AbstractTableFileSystemView.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvQWJzdHJhY3RUYWJsZUZpbGVTeXN0ZW1WaWV3LmphdmE=) | `82.73% <0.00%> (-3.06%)` | `137.00 <0.00> (ø)` | |
   | [...common/table/view/PriorityBasedFileSystemView.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvUHJpb3JpdHlCYXNlZEZpbGVTeXN0ZW1WaWV3LmphdmE=) | `95.71% <0.00%> (-1.39%)` | `33.00 <0.00> (ø)` | |
   | [...on/table/view/RemoteHoodieTableFileSystemView.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvUmVtb3RlSG9vZGllVGFibGVGaWxlU3lzdGVtVmlldy5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | [...i/hadoop/utils/HoodieRealtimeInputFormatUtils.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3V0aWxzL0hvb2RpZVJlYWx0aW1lSW5wdXRGb3JtYXRVdGlscy5qYXZh) | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | [...di/timeline/service/handlers/FileSliceHandler.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvaGFuZGxlcnMvRmlsZVNsaWNlSGFuZGxlci5qYXZh) | `87.09% <0.00%> (-6.01%)` | `15.00 <0.00> (ø)` | |
   | [...e/hudi/timeline/service/FileSystemViewHandler.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvRmlsZVN5c3RlbVZpZXdIYW5kbGVyLmphdmE=) | `75.21% <20.00%> (-1.23%)` | `27.00 <0.00> (ø)` | |
   | [...g/apache/hudi/common/model/WriteOperationType.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL1dyaXRlT3BlcmF0aW9uVHlwZS5qYXZh) | `55.17% <50.00%> (-0.39%)` | `2.00 <0.00> (ø)` | |
   | [...src/main/java/org/apache/hudi/DataSourceUtils.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9EYXRhU291cmNlVXRpbHMuamF2YQ==) | `42.45% <50.00%> (+1.50%)` | `21.00 <2.00> (+2.00)` | |
   | [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllU3BhcmtTcWxXcml0ZXIuc2NhbGE=) | `51.72% <50.00%> (+0.77%)` | `0.00 <0.00> (ø)` | |
   | [.../java/org/apache/hudi/HoodieDataSourceHelpers.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9Ib29kaWVEYXRhU291cmNlSGVscGVycy5qYXZh) | `84.61% <100.00%> (ø)` | `5.00 <0.00> (ø)` | |
   | ... and [2 more](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------



[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-722512937


   @satishkotha
   A. Thanks so much. This PR needs to solve the issue with a better approach.
   Now I am clearer about the overwrite semantics of table.overwrite versus Spark SQL overwrite for Hudi.
   
   B. Spark SQL overwrite for Hudi should also have the same capabilities as Spark SQL, Hive, and Delta Lake.
   These engines have three partition-overwrite modes (see the sketch after this comment):
   1. Dynamic partition: delete all existing partition data, then insert the new data.
   2. Static partition: overwrite only the partitions the user specifies.
   3. Mixed partition: a combination of 1 and 2.
   More detail in:
   https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-dml-insert-overwrite-table.html
   https://www.programmersought.com/article/47155360487/
   
   C. Our plan
   1. Spark SQL overwrite for Hudi currently works in dynamic-partition mode. I will resolve it in this issue, HUDI-1349: first support deleting all partitions in [HUDI-1350](https://issues.apache.org/jira/browse/HUDI-1350), then land this issue (per @satishkotha's suggestion).
   2. Spark SQL for Hudi does not yet support "static partition" mode; that will land in [HUDI-1374](https://issues.apache.org/jira/browse/HUDI-1374).
   3. "Mixed partition" mode will be supported in the future.
   
   cc @n3nash @vinothchandar please help review whether this plan for Spark SQL overwrite in Hudi is suitable.
   It is also possible that my understanding is incorrect.
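
   For illustration, a minimal sketch of the three modes as Spark SQL statements; the table `sales` (partitioned by `country` and `dt`) and its columns are hypothetical:

   ```scala
   // 1. Dynamic partition: no partition spec; affected partitions come from the query output.
   spark.sql("INSERT OVERWRITE TABLE sales SELECT id, amount, country, dt FROM staged_sales")
   // 2. Static partition: the partition to replace is named explicitly.
   spark.sql("INSERT OVERWRITE TABLE sales PARTITION (country = 'US', dt = '2020-10-22') " +
     "SELECT id, amount FROM staged_sales")
   // 3. Mixed partition: some partition columns fixed, the rest resolved dynamically.
   spark.sql("INSERT OVERWRITE TABLE sales PARTITION (country = 'US', dt) " +
     "SELECT id, amount, dt FROM staged_sales")
   ```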


----------------------------------------------------------------



[GitHub] [hudi] lw309637554 commented on a change in pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on a change in pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#discussion_r518175216



##########
File path: hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
##########
@@ -156,6 +162,76 @@ class TestCOWDataSource extends HoodieClientTestBase {
     assertEquals(100, timeTravelDF.count()) // 100 initial inserts must be pulled
   }
 
+  @Test def testOverWriteModeUseReplaceAction(): Unit = {
+    val records1 = recordsToStrings(dataGen.generateInserts("001", 5)).toList
+    val inputDF1 = spark.read.json(spark.sparkContext.parallelize(records1, 2))
+    inputDF1.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+      .mode(SaveMode.Append)
+      .save(basePath)
+
+    val records2 = recordsToStrings(dataGen.generateInserts("002", 5)).toList
+    val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2))
+    inputDF2.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+      .mode(SaveMode.Overwrite)
+      .save(basePath)
+
+    val metaClient = new HoodieTableMetaClient(spark.sparkContext.hadoopConfiguration, basePath, true)
+    val commits =  metaClient.getActiveTimeline.filterCompletedInstants().getInstants.toArray
+      .map(instant => (instant.asInstanceOf[HoodieInstant]).getAction)
+    assertEquals(2, commits.size)
+    assertEquals("commit", commits(0))
+    assertEquals("replacecommit", commits(1))
+  }
+
+  @Test def testOverWriteModeUseReplaceActionOnDisJointPartitions(): Unit = {
+    // step1: Write 5 records to hoodie table for partition1 DEFAULT_FIRST_PARTITION_PATH
+    val records1 = recordsToStrings(dataGen.generateInsertsForPartition("001", 5, HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH)).toList
+    val inputDF1 = spark.read.json(spark.sparkContext.parallelize(records1, 2))
+    inputDF1.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+      .mode(SaveMode.Append)
+      .save(basePath)
+
+    // step2: Write 7 more records using SaveMode.Overwrite for partition2 DEFAULT_SECOND_PARTITION_PATH
+    val records2 = recordsToStrings(dataGen.generateInsertsForPartition("002", 7, HoodieTestDataGenerator.DEFAULT_SECOND_PARTITION_PATH)).toList
+    val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2))
+    inputDF2.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+      .mode(SaveMode.Overwrite)
+      .save(basePath)
+    inputDF2.registerTempTable("tmpTable")
+
+    // step3: Query the rows count from hoodie table  for partition1 DEFAULT_FIRST_PARTITION_PATH
+    val recordCountForParititon1 =  spark.sql(String.format("select count(*) from tmpTable where partition = '%s'", HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH)).collect()

Review comment:
       Will fix it.




----------------------------------------------------------------



[GitHub] [hudi] lw309637554 commented on a change in pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on a change in pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#discussion_r510838979



##########
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -93,6 +93,11 @@ private[hudi] object HoodieSparkSqlWriter {
       operation = WriteOperationType.INSERT
     }
 
+    // If the mode is Overwrite, should use INSERT_OVERWRITE operation

Review comment:
       @satishkotha thanks,
   1. I think the method is OK, because in HoodieSparkSqlWriter.scala, if we set the operation to the replace action, it also goes through `client.insertOverwrite(hoodieRecords, instantTime)`, which behaves as "insert overwrite only updates partitions":
   
    public static HoodieWriteResult doWriteOperation(SparkRDDWriteClient client, JavaRDD<HoodieRecord> hoodieRecords,
                                                      String instantTime, WriteOperationType operation) throws HoodieException {
       switch (operation) {
         case BULK_INSERT:
           Option<BulkInsertPartitioner> userDefinedBulkInsertPartitioner =
                   createUserDefinedBulkInsertPartitioner(client.getConfig());
           return new HoodieWriteResult(client.bulkInsert(hoodieRecords, instantTime, userDefinedBulkInsertPartitioner));
         case INSERT:
           return new HoodieWriteResult(client.insert(hoodieRecords, instantTime));
         case UPSERT:
           return new HoodieWriteResult(client.upsert(hoodieRecords, instantTime));
         case INSERT_OVERWRITE:
           return client.insertOverwrite(hoodieRecords, instantTime);
         default:
   
   2. https://issues.apache.org/jira/browse/HUDI-1350 is a separate piece of work.




----------------------------------------------------------------



[GitHub] [hudi] satishkotha commented on a change in pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#discussion_r526430857



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteTableCommitActionExecutor.java
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.common.HoodieEngineContext;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class SparkInsertOverwriteTableCommitActionExecutor<T extends HoodieRecordPayload<T>>
+    extends SparkInsertOverwriteCommitActionExecutor<T> {
+
+  public SparkInsertOverwriteTableCommitActionExecutor(HoodieEngineContext context,
+                                                       HoodieWriteConfig config, HoodieTable table,
+                                                       String instantTime, JavaRDD<HoodieRecord<T>> inputRecordsRDD) {
+    super(context, config, table, instantTime, inputRecordsRDD, WriteOperationType.INSERT_OVERWRITE_TABLE);
+  }
+
+  @Override
+  protected Map<String, List<String>> getPartitionToReplacedFileIds(JavaRDD<WriteStatus> writeStatuses) {
+    Map<String, List<String>> result =  table.getSliceView().getLatestFileSlices().distinct()

Review comment:
       This can be time consuming for large tables. Is it possible to parallelize this?
   1) get all partitions, then
   2) engineContext.parallelize(#partitions)
   3) for each partition, get all the latest file ids
   
   I think we don't need to add new filesystem APIs with this approach.




----------------------------------------------------------------



[GitHub] [hudi] vinothchandar edited a comment on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
vinothchandar edited a comment on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-757498877


   @lw309637554 sounds good, but I think we should have made this configurable and kept the old behavior as the default. `spark.write...mode("overwrite")` is a very common operation, and I'd prefer not to have it depend on such a new feature. Do you have cycles to do this in the next few hours? Please let me know. @n3nash can you drive this otherwise?
   
   General rule of thumb: new features should be guarded by a flag, turned off by default, until deemed stable after a release. :) Let's please uphold this. Happy to discuss more if needed.
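
   A minimal sketch of the kind of guard being described; the option name and wiring are hypothetical, not the actual Hudi config:

   ```scala
   // Hypothetical flag, default false so the old Overwrite behavior is preserved.
   val OVERWRITE_USE_REPLACE_OPT_KEY = "hoodie.datasource.write.overwrite.use.replace.action"

   if (mode == SaveMode.Overwrite &&
       parameters.getOrElse(OVERWRITE_USE_REPLACE_OPT_KEY, "false").toBoolean) {
     operation = WriteOperationType.INSERT_OVERWRITE_TABLE  // new, opt-in replace path
   }
   // otherwise fall through to the old behavior (delete the base path, then insert)
   ```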
   
   


----------------------------------------------------------------



[GitHub] [hudi] n3nash commented on a change in pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#discussion_r533849917



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteTableCommitActionExecutor.java
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.common.HoodieEngineContext;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieCommitException;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.PairFunction;
+import scala.Tuple2;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class SparkInsertOverwriteTableCommitActionExecutor<T extends HoodieRecordPayload<T>>
+    extends SparkInsertOverwriteCommitActionExecutor<T> {
+
+  public SparkInsertOverwriteTableCommitActionExecutor(HoodieEngineContext context,
+                                                       HoodieWriteConfig config, HoodieTable table,
+                                                       String instantTime, JavaRDD<HoodieRecord<T>> inputRecordsRDD) {
+    super(context, config, table, instantTime, inputRecordsRDD, WriteOperationType.INSERT_OVERWRITE_TABLE);
+  }
+
+  protected List<String> getAllExistingFileIds(String partitionPath) {
+    return table.getSliceView().getLatestFileSlices(partitionPath)
+        .map(fg -> fg.getFileId()).distinct().collect(Collectors.toList());
+  }
+
+  @Override
+  protected Map<String, List<String>> getPartitionToReplacedFileIds(JavaRDD<WriteStatus> writeStatuses) {
+    Map<String, List<String>> partitionToExistingFileIds = new HashMap<>();
+    try {
+      List<String> partitionPaths = FSUtils.getAllPartitionPaths(table.getMetaClient().getFs(),
+          table.getMetaClient().getBasePath(), false);
+      JavaSparkContext jsc = HoodieSparkEngineContext.getSparkContext(context);
+      if (partitionPaths != null && partitionPaths.size() > 0) {
+        context.setJobStatus(this.getClass().getSimpleName(), "Getting ExistingFileIds of all partitions");
+        JavaRDD<String> partitionPathRdd = jsc.parallelize(partitionPaths, partitionPaths.size());
+        partitionToExistingFileIds = partitionPathRdd.mapToPair((PairFunction<String, String, List<String>>)

Review comment:
       Can we use function style here instead of needing to cast to PairFunction? Something like partitionPathRdd.mapToPair(() -> ...)?
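
   For reference, a sketch of the lambda form being suggested, reusing the executor's existing getAllExistingFileIds helper:

   ```java
   // Same pairing without the explicit PairFunction cast; depending on the Spark/Java
   // version, a type witness may still be needed, which is likely why the cast was used.
   Map<String, List<String>> partitionToExistingFileIds = partitionPathRdd
       .mapToPair(partitionPath -> new Tuple2<>(partitionPath, getAllExistingFileIds(partitionPath)))
       .collectAsMap();
   ```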




----------------------------------------------------------------



[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-737319592


   @satishkotha I have resolved n3nash's comments and squashed the commits. Can you help merge? cc @n3nash


----------------------------------------------------------------



[GitHub] [hudi] codecov-io edited a comment on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-728653013


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=h1) Report
   > Merging [#2196](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=desc) (305f320) into [master](https://codecov.io/gh/apache/hudi/commit/430d4b428e7c5b325c7414a187f9cda158c2758a?el=desc) (430d4b4) will **decrease** coverage by `43.14%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2196/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2196       +/-   ##
   =============================================
   - Coverage     53.54%   10.39%   -43.15%     
   + Complexity     2770       48     -2722     
   =============================================
     Files           348       50      -298     
     Lines         16109     1779    -14330     
     Branches       1643      211     -1432     
   =============================================
   - Hits           8626      185     -8441     
   + Misses         6785     1581     -5204     
   + Partials        698       13      -685     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudispark | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `10.39% <ø> (-59.70%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | ... and [314 more](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------



[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-718070816


   @satishkotha @n3nash can you help review again? Thanks


----------------------------------------------------------------



[GitHub] [hudi] satishkotha commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
satishkotha commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-722937587


   > @satishkotha
   > A. Thanks so much. This PR needs to solve the issue with a better approach.
   > Now I am clearer about the overwrite semantics of table.overwrite versus Spark SQL overwrite for Hudi.
   > 
   > B. Spark SQL overwrite for Hudi should also have the same capabilities as Spark SQL, Hive, and Delta Lake.
   > These engines have three partition-overwrite modes:
   > 
   > 1. Dynamic partition: delete all existing partition data, then insert the new data.
   > 2. Static partition: overwrite only the partitions the user specifies.
   > 3. Mixed partition: a combination of 1 and 2.
   >    More detail in:
   >    https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-dml-insert-overwrite-table.html
   >    https://www.programmersought.com/article/47155360487/
   > 
   > C. Our plan
   > 
   > 1. Spark SQL overwrite for Hudi currently works in dynamic-partition mode. I will resolve it in this issue, HUDI-1349: first support deleting all partitions in [HUDI-1350](https://issues.apache.org/jira/browse/HUDI-1350), then land this issue (per @satishkotha's suggestion).
   > 2. Spark SQL for Hudi does not yet support "static partition" mode; that will land in [HUDI-1374](https://issues.apache.org/jira/browse/HUDI-1374).
   > 3. "Mixed partition" mode will be supported in the future.
   > 
   > cc @n3nash @vinothchandar please help review whether this plan for Spark SQL overwrite in Hudi is suitable.
   > It is also possible that my understanding is incorrect.
   
   Just FYI, in the [RFC](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+18+Insert+Overwrite+API#RFC18InsertOverwriteAPI-API) we discussed having an 'insert_overwrite_table' operation to support dynamic partitioning; static partitioning is supported by 'insert_overwrite'.
   
   But I agree with getting approval from Nishith/Vinoth before starting implementation.
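
   For illustration, a sketch of how the two operations are selected through the datasource; the operation string values are assumed to match the WriteOperationType names, and the surrounding setup is assumed:

   ```scala
   // Static-partition semantics: replace only the partitions present in the incoming data.
   df.write.format("org.apache.hudi")
     .options(commonOpts)
     .option(DataSourceWriteOptions.OPERATION_OPT_KEY, "insert_overwrite")
     .mode(SaveMode.Append)
     .save(basePath)

   // Dynamic-partition semantics: replace every partition in the table.
   df.write.format("org.apache.hudi")
     .options(commonOpts)
     .option(DataSourceWriteOptions.OPERATION_OPT_KEY, "insert_overwrite_table")
     .mode(SaveMode.Append)
     .save(basePath)
   ```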


----------------------------------------------------------------



[GitHub] [hudi] codecov-io commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-728653013


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=h1) Report
   > Merging [#2196](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=desc) (305f320) into [master](https://codecov.io/gh/apache/hudi/commit/430d4b428e7c5b325c7414a187f9cda158c2758a?el=desc) (430d4b4) will **decrease** coverage by `43.14%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2196/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2196       +/-   ##
   =============================================
   - Coverage     53.54%   10.39%   -43.15%     
   + Complexity     2770       48     -2722     
   =============================================
     Files           348       50      -298     
     Lines         16109     1779    -14330     
     Branches       1643      211     -1432     
   =============================================
   - Hits           8626      185     -8441     
   + Misses         6785     1581     -5204     
   + Partials        698       13      -685     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudispark | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `10.39% <ø> (-59.70%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | ... and [314 more](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------



[GitHub] [hudi] lw309637554 edited a comment on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
lw309637554 edited a comment on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-757504672


   > @lw309637554 sounds good. But I think we should make this configurable and keep the old behavior as the default. `spark.write...mode("overwrite")` is a very common operation, and I'd prefer not to have this depend on such a new feature. Do you have cycles to do this in the next few hours? Please let me know. @n3nash can you drive this otherwise?
   > 
   > General rule of thumb: new features should be guarded by a flag, turned off by default, until deemed stable after a release. :) Let's please uphold this. Happy to discuss more if needed.
   
   @vinothchandar Okay, I can add the configuration now. Will land it in https://github.com/apache/hudi/pull/2428. cc @n3nash
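   
   For reference, a minimal sketch of how such a guard could look in HoodieSparkSqlWriter. The flag name here (`hoodie.datasource.overwrite.use.replace`) is a placeholder, not the option merged in #2428; the old behavior stays the default:
   
   ```scala
   import org.apache.spark.sql.SaveMode
   import org.apache.hudi.common.model.WriteOperationType
   
   // Placeholder flag name; off by default, so the legacy delete-then-insert
   // overwrite keeps running unless a user opts in to the replace-based path.
   def chooseOperation(mode: SaveMode, parameters: Map[String, String]): WriteOperationType = {
     val useReplace =
       parameters.getOrElse("hoodie.datasource.overwrite.use.replace", "false").toBoolean
     if (mode == SaveMode.Overwrite && useReplace) {
       WriteOperationType.INSERT_OVERWRITE_TABLE // mark old file groups as replaced
     } else {
       WriteOperationType.INSERT // legacy path deletes the base path up front
     }
   }
   ```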


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-731599644


   > LGTM. @n3nash Can you take a pass and merge this?
   
   @n3nash Can you help review? Thanks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-715447332


   @n3nash you and satish can take this home


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] lw309637554 commented on a change in pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on a change in pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#discussion_r534276769



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteTableCommitActionExecutor.java
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.common.HoodieEngineContext;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieCommitException;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.PairFunction;
+import scala.Tuple2;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class SparkInsertOverwriteTableCommitActionExecutor<T extends HoodieRecordPayload<T>>
+    extends SparkInsertOverwriteCommitActionExecutor<T> {
+
+  public SparkInsertOverwriteTableCommitActionExecutor(HoodieEngineContext context,
+                                                       HoodieWriteConfig config, HoodieTable table,
+                                                       String instantTime, JavaRDD<HoodieRecord<T>> inputRecordsRDD) {
+    super(context, config, table, instantTime, inputRecordsRDD, WriteOperationType.INSERT_OVERWRITE_TABLE);
+  }
+
+  protected List<String> getAllExistingFileIds(String partitionPath) {
+    return table.getSliceView().getLatestFileSlices(partitionPath)
+        .map(fg -> fg.getFileId()).distinct().collect(Collectors.toList());
+  }
+
+  @Override
+  protected Map<String, List<String>> getPartitionToReplacedFileIds(JavaRDD<WriteStatus> writeStatuses) {
+    Map<String, List<String>> partitionToExistingFileIds = new HashMap<>();
+    try {
+      List<String> partitionPaths = FSUtils.getAllPartitionPaths(table.getMetaClient().getFs(),
+          table.getMetaClient().getBasePath(), false);
+      JavaSparkContext jsc = HoodieSparkEngineContext.getSparkContext(context);
+      if (partitionPaths != null && partitionPaths.size() > 0) {
+        context.setJobStatus(this.getClass().getSimpleName(), "Getting ExistingFileIds of all partitions");
+        JavaRDD<String> partitionPathRdd = jsc.parallelize(partitionPaths, partitionPaths.size());
+        partitionToExistingFileIds = partitionPathRdd.mapToPair((PairFunction<String, String, List<String>>)

Review comment:
       ok
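   
   For context, a sketch of the step the excerpt above truncates at: each partition path is mapped, in parallel, to the file IDs of its latest file slices, so INSERT_OVERWRITE_TABLE can mark all of them as replaced. This is an illustrative Scala rendering, not the merged Java code; `getAllExistingFileIds` stands in for the executor method above:
   
   ```scala
   import org.apache.spark.SparkContext
   
   // Illustrative sketch: one task per partition collects that partition's
   // latest file IDs; the driver gathers the results into a single map.
   def partitionToReplacedFileIds(sc: SparkContext,
                                  partitionPaths: Seq[String],
                                  getAllExistingFileIds: String => List[String]): Map[String, List[String]] =
     if (partitionPaths.isEmpty) Map.empty
     else sc.parallelize(partitionPaths, partitionPaths.size)
       .map(partition => partition -> getAllExistingFileIds(partition))
       .collectAsMap()
       .toMap
   ```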




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-728653013


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=h1) Report
   > Merging [#2196](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=desc) (3f9e2ca) into [master](https://codecov.io/gh/apache/hudi/commit/c8d5ea2752f8289b9dec37a149e82c030f657924?el=desc) (c8d5ea2) will **decrease** coverage by `43.10%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2196/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2196       +/-   ##
   =============================================
   - Coverage     53.51%   10.41%   -43.11%     
   + Complexity     2769       48     -2721     
   =============================================
     Files           348       50      -298     
     Lines         16107     1777    -14330     
     Branches       1643      211     -1432     
   =============================================
   - Hits           8619      185     -8434     
   + Misses         6789     1579     -5210     
   + Partials        699       13      -686     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudispark | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `10.41% <ø> (-59.66%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | ... and [319 more](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-728653013


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=h1) Report
   > Merging [#2196](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=desc) (0da0ca7) into [master](https://codecov.io/gh/apache/hudi/commit/c8d5ea2752f8289b9dec37a149e82c030f657924?el=desc) (c8d5ea2) will **decrease** coverage by `43.10%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2196/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2196       +/-   ##
   =============================================
   - Coverage     53.51%   10.41%   -43.11%     
   + Complexity     2769       48     -2721     
   =============================================
     Files           348       50      -298     
     Lines         16107     1777    -14330     
     Branches       1643      211     -1432     
   =============================================
   - Hits           8619      185     -8434     
   + Misses         6789     1579     -5210     
   + Partials        699       13      -686     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudispark | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `10.41% <ø> (-59.66%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | ... and [319 more](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-757473127


   > @lw309637554 @n3nash @satishkotha
   > 
   > I have been trying to test the master branch, and this change, which makes `Overwrite` an INSERT_OVERWRITE_TABLE, adds complexity IMO, especially given the interplay with newer features like table metadata.
   > 
   > What's the rationale behind this? Just to make the operation faster than the previous fs.delete() based approach? I mean, ultimately someone needs to do these deletes, no?
   
   @vinothchandar Hello,
   the original intention of this feature is that users can recover data that was accidentally overwritten with INSERT_OVERWRITE_TABLE, as long as the replaced files have not yet been cleaned. With the old fs.delete() approach, recovery requires re-ingesting the data from the source, which is complex.
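   
   To make the recovery idea concrete, a sketch, assuming a SparkSession `spark` in scope and that the cleaner has not yet removed the replaced file groups (the base path and instant times are made up; the read option keys are the DataSourceReadOptions values of this era, please verify against your version):
   
   ```scala
   val basePath = "/tmp/hudi_table"           // hypothetical table location
   val preOverwriteInstant = "20201101000000" // instant captured before the overwrite
   
   // Incremental read of everything committed up to the pre-overwrite instant;
   // this only works while the replaced files are still on storage.
   val recovered = spark.read.format("hudi")
     .option("hoodie.datasource.query.type", "incremental")
     .option("hoodie.datasource.read.begin.instanttime", "000")
     .option("hoodie.datasource.read.end.instanttime", preOverwriteInstant)
     .load(basePath)
   ```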


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] satishkotha commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
satishkotha commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-730596478


   LGTM. @n3nash Can you take a pass and merge this?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-730723941


   > LGTM. @n3nash Can you take a pass and merge this?
   
   @satishkotha Thanks for your suggestion.  


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-727229138


   > @lw309637554 the plan looks good to me. @satishkotha when the PR is ready, please tag me so I can take a final pass and merge it
   
   Thanks, @satishkotha @n3nash. I think adding an insertOverwriteTable(JavaRDD<HoodieRecord<T>> records, final String commitTime) API, just as https://cwiki.apache.org/confluence/display/HUDI/RFC+-+18+Insert+Overwrite+API#RFC18InsertOverwriteAPI-API describes, will be fine. I will land it.
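   
   For clarity, a sketch of the proposed call, using the signature quoted above; the surrounding names (`writeClient`, `recordsRDD`, `instantTime`) are assumptions for illustration, not the final API:
   
   ```scala
   // Assumed: writeClient is the Spark write client, recordsRDD is a
   // JavaRDD[HoodieRecord[T]] prepared by the caller, and instantTime is an
   // instant already started on the timeline (a replacecommit for this action).
   val writeStatuses = writeClient.insertOverwriteTable(recordsRDD, instantTime)
   ```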


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-757498877


   @lw309637554 sounds good. But I think we should make this configurable and keep the old behavior as the default. `spark.write...mode("overwrite")` is a very common operation, and I'd prefer not to have this depend on such a new feature. Do you have cycles to do this? Please let me know. @n3nash can you drive this otherwise?
   
   General rule of thumb: new features should be guarded by a flag, turned off by default, until deemed stable after a release. :) Let's please uphold this. Happy to discuss more if needed.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-757504672


   > @lw309637554 sounds good. But I think we should make this configurable and keep the old behavior as the default. `spark.write...mode("overwrite")` is a very common operation, and I'd prefer not to have this depend on such a new feature. Do you have cycles to do this in the next few hours? Please let me know. @n3nash can you drive this otherwise?
   > 
   > General rule of thumb: new features should be guarded by a flag, turned off by default, until deemed stable after a release. :) Let's please uphold this. Happy to discuss more if needed.
   
   @vinothchandar Okay, I can add the configuration now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-718349922


   > > @satishkotha @n3nash can you help to review again, Thanks
   > 
   > Can you please add additional tests with multiple steps writing to disjoint partitions.
   
   okay


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-728653013


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=h1) Report
   > Merging [#2196](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=desc) (305f320) into [master](https://codecov.io/gh/apache/hudi/commit/430d4b428e7c5b325c7414a187f9cda158c2758a?el=desc) (430d4b4) will **decrease** coverage by `7.06%`.
   > The diff coverage is `17.24%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2196/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2196      +/-   ##
   ============================================
   - Coverage     53.54%   46.47%   -7.07%     
   + Complexity     2770     2437     -333     
   ============================================
     Files           348      342       -6     
     Lines         16109    15766     -343     
     Branches       1643     1627      -16     
   ============================================
   - Hits           8626     7328    -1298     
   - Misses         6785     7866    +1081     
   + Partials        698      572     -126     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `38.37% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `55.17% <4.34%> (-0.16%)` | `0.00 <0.00> (ø)` | |
   | hudihadoopmr | `32.94% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudispark | `65.86% <66.66%> (+0.28%)` | `0.00 <2.00> (ø)` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `10.39% <ø> (-59.70%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...common/table/view/AbstractTableFileSystemView.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvQWJzdHJhY3RUYWJsZUZpbGVTeXN0ZW1WaWV3LmphdmE=) | `82.73% <0.00%> (-3.06%)` | `137.00 <0.00> (ø)` | |
   | [...common/table/view/PriorityBasedFileSystemView.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvUHJpb3JpdHlCYXNlZEZpbGVTeXN0ZW1WaWV3LmphdmE=) | `95.71% <0.00%> (-1.39%)` | `33.00 <0.00> (ø)` | |
   | [...on/table/view/RemoteHoodieTableFileSystemView.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvUmVtb3RlSG9vZGllVGFibGVGaWxlU3lzdGVtVmlldy5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | [...i/hadoop/utils/HoodieRealtimeInputFormatUtils.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3V0aWxzL0hvb2RpZVJlYWx0aW1lSW5wdXRGb3JtYXRVdGlscy5qYXZh) | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | [...g/apache/hudi/common/model/WriteOperationType.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL1dyaXRlT3BlcmF0aW9uVHlwZS5qYXZh) | `55.17% <50.00%> (-0.39%)` | `2.00 <0.00> (ø)` | |
   | [...src/main/java/org/apache/hudi/DataSourceUtils.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9EYXRhU291cmNlVXRpbHMuamF2YQ==) | `42.45% <50.00%> (+1.50%)` | `21.00 <2.00> (+2.00)` | |
   | [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllU3BhcmtTcWxXcml0ZXIuc2NhbGE=) | `51.72% <50.00%> (+0.77%)` | `0.00 <0.00> (ø)` | |
   | [.../java/org/apache/hudi/HoodieDataSourceHelpers.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9Ib29kaWVEYXRhU291cmNlSGVscGVycy5qYXZh) | `84.61% <100.00%> (ø)` | `5.00 <0.00> (ø)` | |
   | [...main/scala/org/apache/hudi/DataSourceOptions.scala](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvRGF0YVNvdXJjZU9wdGlvbnMuc2NhbGE=) | `94.95% <100.00%> (+0.04%)` | `0.00 <0.00> (ø)` | |
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | ... and [41 more](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
n3nash commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-724184645


   @lw309637554 the plan looks good to me. @satishkotha when the PR is ready, please tag me so I can take a final pass and merge it


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-728653013


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=h1) Report
   > Merging [#2196](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=desc) (1b099ab) into [master](https://codecov.io/gh/apache/hudi/commit/78fd12259404cad8117862fdf0b948cbe115e453?el=desc) (78fd122) will **decrease** coverage by `43.04%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2196/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2196       +/-   ##
   =============================================
   - Coverage     53.45%   10.41%   -43.05%     
   + Complexity     2781       48     -2733     
   =============================================
     Files           354       50      -304     
     Lines         16158     1777    -14381     
     Branches       1648      211     -1437     
   =============================================
   - Hits           8637      185     -8452     
   + Misses         6821     1579     -5242     
   + Partials        700       13      -687     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudispark | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `10.41% <ø> (-59.66%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | ... and [325 more](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] satishkotha commented on a change in pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#discussion_r510350766



##########
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -93,6 +93,11 @@ private[hudi] object HoodieSparkSqlWriter {
       operation = WriteOperationType.INSERT
     }
 
+    // If the mode is Overwrite, should use INSERT_OVERWRITE operation

Review comment:
       I think this won't work. Insert overwrite only updates partitions that have records in the dataframe. Other partitions continue to have old data.
   
   But I like the idea. Maybe we can add additional configuration to insert_overwrite to mark old partitions as 'deleted'? This could be done in a way that supports https://issues.apache.org/jira/browse/HUDI-1350.
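   
   For illustration, the gap in semantics (the dataframe name, base path, and partition values are made up): an insert_overwrite carrying rows for only one partition replaces file groups in that partition alone, while SaveMode.Overwrite is expected to clear the whole table:
   
   ```scala
   import org.apache.spark.sql.SaveMode
   
   // dfForOct01 holds rows only for partition 2020/10/01 (hypothetical data).
   dfForOct01.write.format("hudi")
     .option("hoodie.datasource.write.operation", "insert_overwrite")
     .mode(SaveMode.Append) // the operation, not the save mode, drives the replace
     .save(basePath)
   // Afterwards 2020/10/01 is replaced, but 2020/10/02 still serves its old data.
   ```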




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-757419558


   @lw309637554 @n3nash @satishkotha 
   
   I have been trying to test the master branch, and this change, which makes `Overwrite` an INSERT_OVERWRITE_TABLE, adds complexity IMO, especially given the interplay with newer features like table metadata.
   
   What's the rationale behind this? Just to make the operation faster than the previous fs.delete() based approach? I mean, ultimately someone needs to do these deletes, no?
   
   
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org