You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/10/22 02:30:31 UTC
[GitHub] [hudi] lw309637554 opened a new pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
lw309637554 opened a new pull request #2196:
URL: https://github.com/apache/hudi/pull/2196
## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
## What is the purpose of the pull request
spark sql support overwrite use replace action
## Brief change log
*(for example:)*
- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
## Verify this pull request
*(Please pick either of the following options)*
This pull request is a trivial rework / code cleanup without any test coverage.
*(or)*
This pull request is already covered by existing tests, such as *(please describe tests)*.
(or)
This change added tests and can be verified as follows:
*(example:)*
- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*
## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-722967018
> > @satishkotha
> > A. Thanks so much. This pr need to solved the issue with better approach.
> > Now I am more clear about overwrite semantic between table.overwrite and spark sql overwrite for hudi.
> > B. Also spark sql for hudi overwrite should have the ability just like spark sql 、hive 、 delta lake.
> > these engine have three mode for overwrite about partition:
> >
> > 1. Dynamic Partition : delete all partition data ,and the insert the new data for different
> > 2. Static partition: just overwrite the partition which is user specified
> > 3. Mixed partition: mixed of 1 and 2
> > more detail in :
> > https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-dml-insert-overwrite-table.html
> > https://www.programmersought.com/article/47155360487/
> >
> > C. our plan
> >
> > 1. Now spark sql for hudi overwrite is Dynamic Partition. I will resolved it in this issue HUDI-1349, first support delete all partition in [HUDI-1350](https://issues.apache.org/jira/browse/HUDI-1350) , then land this issue. (Just like @satishkotha 's Suggestion)
> > 2. Now spark sql for hudi does not support "Static partition" mode, will then land it in [HUDI-1374](https://issues.apache.org/jira/browse/HUDI-1374)
> > 3. future support "Mixed partition" mode
> >
> > cc @n3nash @vinothchandar please help to review if the plan about spark sql for hudi overwirte is suitable.
> > It is also possible that my understanding is inappropriate
>
> Just fyi, in the [RFC](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+18+Insert+Overwrite+API#RFC18InsertOverwriteAPI-API) we discussed having 'insert_overwrite_table' operation to support dynamic partitioning. static partitioning is supported by 'insert_overwrite'.
>
> But, agree with getting approval from Nishith/Vinoth before starting implementation.
@satishkotha thanks so much, will read more about the RFC-18, and merge the implementation between the RFC-18 and my plan
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] satishkotha commented on a change in pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#discussion_r513672664
##########
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -93,6 +93,11 @@ private[hudi] object HoodieSparkSqlWriter {
operation = WriteOperationType.INSERT
}
+ // If the mode is Overwrite, should use INSERT_OVERWRITE operation
Review comment:
@lw309637554 can you add below test:
step1: Write N records to hoodie table for partition1
step2: Write N more records using SaveMode.Overwrite for partition2
step3: Query for all the rows from hoodie table. We should only see N records for partition2.
If i understand your code correctly, with this change, we will see 2N records
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-729001085
@satishkotha have implement insert_overwrite_table to support spark sql overwrite. please review again , thanks
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] satishkotha commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
satishkotha commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-718127824
> @satishkotha @n3nash can you help to review again, Thanks
Can you please add additional tests with multiple steps writing to disjoint partitions.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] n3nash merged pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
n3nash merged pull request #2196:
URL: https://github.com/apache/hudi/pull/2196
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] satishkotha commented on a change in pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#discussion_r517675835
##########
File path: hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
##########
@@ -156,6 +162,76 @@ class TestCOWDataSource extends HoodieClientTestBase {
assertEquals(100, timeTravelDF.count()) // 100 initial inserts must be pulled
}
+ @Test def testOverWriteModeUseReplaceAction(): Unit = {
+ val records1 = recordsToStrings(dataGen.generateInserts("001", 5)).toList
+ val inputDF1 = spark.read.json(spark.sparkContext.parallelize(records1, 2))
+ inputDF1.write.format("org.apache.hudi")
+ .options(commonOpts)
+ .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+ .mode(SaveMode.Append)
+ .save(basePath)
+
+ val records2 = recordsToStrings(dataGen.generateInserts("002", 5)).toList
+ val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2))
+ inputDF2.write.format("org.apache.hudi")
+ .options(commonOpts)
+ .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+ .mode(SaveMode.Overwrite)
+ .save(basePath)
+
+ val metaClient = new HoodieTableMetaClient(spark.sparkContext.hadoopConfiguration, basePath, true)
+ val commits = metaClient.getActiveTimeline.filterCompletedInstants().getInstants.toArray
+ .map(instant => (instant.asInstanceOf[HoodieInstant]).getAction)
+ assertEquals(2, commits.size)
+ assertEquals("commit", commits(0))
+ assertEquals("replacecommit", commits(1))
+ }
+
+ @Test def testOverWriteModeUseReplaceActionOnDisJointPartitions(): Unit = {
+ // step1: Write 5 records to hoodie table for partition1 DEFAULT_FIRST_PARTITION_PATH
+ val records1 = recordsToStrings(dataGen.generateInsertsForPartition("001", 5, HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH)).toList
+ val inputDF1 = spark.read.json(spark.sparkContext.parallelize(records1, 2))
+ inputDF1.write.format("org.apache.hudi")
+ .options(commonOpts)
+ .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+ .mode(SaveMode.Append)
+ .save(basePath)
+
+ // step2: Write 7 more records using SaveMode.Overwrite for partition2 DEFAULT_SECOND_PARTITION_PATH
+ val records2 = recordsToStrings(dataGen.generateInsertsForPartition("002", 7, HoodieTestDataGenerator.DEFAULT_SECOND_PARTITION_PATH)).toList
+ val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2))
+ inputDF2.write.format("org.apache.hudi")
+ .options(commonOpts)
+ .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+ .mode(SaveMode.Overwrite)
+ .save(basePath)
+ inputDF2.registerTempTable("tmpTable")
+
+ // step3: Query the rows count from hoodie table for partition1 DEFAULT_FIRST_PARTITION_PATH
+ val recordCountForParititon1 = spark.sql(String.format("select count(*) from tmpTable where partition = '%s'", HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH)).collect()
Review comment:
tmpTable only registered inputDF2, so you will not get data for partition1 even if we do SaveMode.Append in line 206? Don't you need to read back all data from table? Can you please fix test setup?
##########
File path: hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
##########
@@ -156,6 +162,76 @@ class TestCOWDataSource extends HoodieClientTestBase {
assertEquals(100, timeTravelDF.count()) // 100 initial inserts must be pulled
}
+ @Test def testOverWriteModeUseReplaceAction(): Unit = {
Review comment:
I ran a similar test using quick start setup guide:
1) ingest 10 records for partition "2020/03/11"
`scala> val inserts = convertToStringList(dataGen.generateInserts(10))
inserts: java.util.List[String] = [{"ts": 0, "uuid": "299d5202-1ea0-4918-9d2f-2365bc1c2402", "rider": "rider-213", "driver": "driver-213", "begin_lat": 0.4726905879569653, "begin_lon": 0.46157858450465483, "end_lat": 0.754803407008858, "end_lon": 0.9671159942018241, "fare": 34.158284716382845, "partitionpath": "2020/03/11"}, {"ts": 0, "uuid": "0fc23a14-c815-4b09-bff1-c6193a6de5b7", "rider": "rider-213", "driver": "driver-213", "begin_lat": 0.6100070562136587, "begin_lon": 0.8779402295427752, "end_lat": 0.3407870505929602, "end_lon": 0.5030798142293655, "fare": 43.4923811219014, "partitionpath": "2020/03/11"}, {"ts": 0, "uuid": "7136e8f8-ed82-4fc4-b60d-f7367f7be791", "rider": "rider-213", "driver": "driver-213", "begin_lat": 0.5731835407930634, "begin_lon": 0.4923479652912024, "end_lat":...
scala> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 1))
warning: there was one deprecation warning; re-run with -deprecation for details
df: org.apache.spark.sql.DataFrame = [begin_lat: double, begin_lon: double ... 8 more fields]
scala> df.write.format("org.apache.hudi").
| options(getQuickstartWriteConfigs).
| option(PRECOMBINE_FIELD_OPT_KEY, "ts").
| option(RECORDKEY_FIELD_OPT_KEY, "uuid").
| option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
| option(TABLE_NAME, tableName).
| mode(Overwrite).
| save(basePath);`
2. Query the data back and see that records are written correctly
`scala> val tripsSnapshotDF = spark.
| read.
| format("org.apache.hudi").
| load(basePath + "/*/*/*/*")
20/11/04 14:51:53 WARN DefaultSource: Loading Base File Only View.
tripsSnapshotDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 13 more fields]
scala> tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
scala>
scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time| _hoodie_record_key|_hoodie_partition_path| rider| driver| fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
| 20201104145141|299d5202-1ea0-491...| 2020/03/11|rider-213|driver-213|34.158284716382845|
| 20201104145141|0fc23a14-c815-4b0...| 2020/03/11|rider-213|driver-213| 43.4923811219014|
| 20201104145141|7136e8f8-ed82-4fc...| 2020/03/11|rider-213|driver-213| 64.27696295884016|
| 20201104145141|5ffa488e-d75e-4ef...| 2020/03/11|rider-213|driver-213| 93.56018115236618|
| 20201104145141|cf09166f-dc3f-45e...| 2020/03/11|rider-213|driver-213|17.851135255091155|
| 20201104145141|6f522490-e29e-419...| 2020/03/11|rider-213|driver-213|19.179139106643607|
| 20201104145141|db97e3ef-ad7a-4e8...| 2020/03/11|rider-213|driver-213| 33.92216483948643|
| 20201104145141|a42d7c22-d0bf-4b9...| 2020/03/11|rider-213|driver-213| 66.62084366450246|
| 20201104145141|94154d3e-c3da-436...| 2020/03/11|rider-213|driver-213| 41.06290929046368|
| 20201104145141|618b3f38-bb71-402...| 2020/03/11|rider-213|driver-213| 27.79478688582596|
+-------------------+--------------------+----------------------+---------+----------+------------------+
`
3. Use insert_overwrite and write to new partition (This is with master, not using your change. So I still have insert_overwrite operation with Append mode)
`scala> val dataGen = new DataGenerator(Array("2020/09/11"))
dataGen: org.apache.hudi.QuickstartUtils.DataGenerator = org.apache.hudi.QuickstartUtils$DataGenerator@f00b18a
scala> val inserts2 = convertToStringList(dataGen.generateInserts(1))
scala> val df = spark.read.json(spark.sparkContext.parallelize(inserts2, 1))
scala> df.write.format("org.apache.hudi").
| options(getQuickstartWriteConfigs).
| option(PRECOMBINE_FIELD_OPT_KEY, "ts").
| option(RECORDKEY_FIELD_OPT_KEY, "uuid").
| option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
| option(TABLE_NAME, tableName).
| mode(Append).
| option(OPERATION_OPT_KEY, "insert_overwrite").
| save(basePath);`
4. Query the data back
`scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time| _hoodie_record_key|_hoodie_partition_path| rider| driver| fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
| 20201104145258|299d5202-1ea0-491...| 2020/03/11|rider-213|driver-213|34.158284716382845|
| 20201104145258|0fc23a14-c815-4b0...| 2020/03/11|rider-213|driver-213| 43.4923811219014|
| 20201104145258|7136e8f8-ed82-4fc...| 2020/03/11|rider-213|driver-213| 64.27696295884016|
| 20201104145258|5ffa488e-d75e-4ef...| 2020/03/11|rider-213|driver-213| 93.56018115236618|
| 20201104145258|cf09166f-dc3f-45e...| 2020/03/11|rider-213|driver-213|17.851135255091155|
| 20201104145258|6f522490-e29e-419...| 2020/03/11|rider-213|driver-213|19.179139106643607|
| 20201104145258|db97e3ef-ad7a-4e8...| 2020/03/11|rider-213|driver-213| 33.92216483948643|
| 20201104145258|a42d7c22-d0bf-4b9...| 2020/03/11|rider-213|driver-213| 66.62084366450246|
| 20201104145258|94154d3e-c3da-436...| 2020/03/11|rider-213|driver-213| 41.06290929046368|
| 20201104145258|618b3f38-bb71-402...| 2020/03/11|rider-213|driver-213| 27.79478688582596|
| 20201104145348|4dac8aa3-b8fa-410...| 2020/09/11|rider-284|driver-284|49.527694252432056|
+-------------------+--------------------+----------------------+---------+----------+------------------+
`
As you can see in step4, we see all 11 records. With 'SaveMode.Overwrite' we should only see 1 record. Hope this is clear.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] codecov-io edited a comment on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-728653013
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=h1) Report
> Merging [#2196](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=desc) (305f320) into [master](https://codecov.io/gh/apache/hudi/commit/430d4b428e7c5b325c7414a187f9cda158c2758a?el=desc) (430d4b4) will **decrease** coverage by `0.07%`.
> The diff coverage is `16.66%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2196/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2196 +/- ##
============================================
- Coverage 53.54% 53.47% -0.08%
- Complexity 2770 2772 +2
============================================
Files 348 348
Lines 16109 16139 +30
Branches 1643 1643
============================================
+ Hits 8626 8630 +4
- Misses 6785 6810 +25
- Partials 698 699 +1
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `38.37% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `55.17% <4.34%> (-0.16%)` | `0.00 <0.00> (ø)` | |
| hudihadoopmr | `32.94% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudispark | `65.86% <66.66%> (+0.28%)` | `0.00 <2.00> (ø)` | |
| huditimelineservice | `64.34% <14.28%> (-0.96%)` | `0.00 <0.00> (ø)` | |
| hudiutilities | `70.09% <ø> (ø)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...common/table/view/AbstractTableFileSystemView.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvQWJzdHJhY3RUYWJsZUZpbGVTeXN0ZW1WaWV3LmphdmE=) | `82.73% <0.00%> (-3.06%)` | `137.00 <0.00> (ø)` | |
| [...common/table/view/PriorityBasedFileSystemView.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvUHJpb3JpdHlCYXNlZEZpbGVTeXN0ZW1WaWV3LmphdmE=) | `95.71% <0.00%> (-1.39%)` | `33.00 <0.00> (ø)` | |
| [...on/table/view/RemoteHoodieTableFileSystemView.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvUmVtb3RlSG9vZGllVGFibGVGaWxlU3lzdGVtVmlldy5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
| [...i/hadoop/utils/HoodieRealtimeInputFormatUtils.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3V0aWxzL0hvb2RpZVJlYWx0aW1lSW5wdXRGb3JtYXRVdGlscy5qYXZh) | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
| [...di/timeline/service/handlers/FileSliceHandler.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvaGFuZGxlcnMvRmlsZVNsaWNlSGFuZGxlci5qYXZh) | `87.09% <0.00%> (-6.01%)` | `15.00 <0.00> (ø)` | |
| [...e/hudi/timeline/service/FileSystemViewHandler.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvRmlsZVN5c3RlbVZpZXdIYW5kbGVyLmphdmE=) | `75.21% <20.00%> (-1.23%)` | `27.00 <0.00> (ø)` | |
| [...g/apache/hudi/common/model/WriteOperationType.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL1dyaXRlT3BlcmF0aW9uVHlwZS5qYXZh) | `55.17% <50.00%> (-0.39%)` | `2.00 <0.00> (ø)` | |
| [...src/main/java/org/apache/hudi/DataSourceUtils.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9EYXRhU291cmNlVXRpbHMuamF2YQ==) | `42.45% <50.00%> (+1.50%)` | `21.00 <2.00> (+2.00)` | |
| [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllU3BhcmtTcWxXcml0ZXIuc2NhbGE=) | `51.72% <50.00%> (+0.77%)` | `0.00 <0.00> (ø)` | |
| [.../java/org/apache/hudi/HoodieDataSourceHelpers.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9Ib29kaWVEYXRhU291cmNlSGVscGVycy5qYXZh) | `84.61% <100.00%> (ø)` | `5.00 <0.00> (ø)` | |
| ... and [2 more](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree-more) | |
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-722512937
@satishkotha
A. Thanks so much. This pr need to solved the issue with better approach.
Now I am more clear about overwrite semantic between table.overwrite and spark sql overwrite for hudi.
B. Also spark sql for hudi overwrite should have the ability just like spark sql 、hive 、 delta lake.
these engine have three mode for overwrite about partition:
1. Dynamic Partition : delete all partition data ,and the insert the new data for different
2. Static partition: just overwrite the partition which is user specified
3. Mixed partition: mixed of 1 and 2
more detail in :
https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-dml-insert-overwrite-table.html
https://www.programmersought.com/article/47155360487/
C. our plan
1. Now spark sql for hudi overwrite is Dynamic Partition. I will resolved it in this issue HUDI-1349, first support delete all partition in [HUDI-1350](https://issues.apache.org/jira/browse/HUDI-1350) , then land this issue. (Just like @satishkotha 's Suggestion)
2. Now spark sql for hudi does not support "Static partition" mode, will then land it in [HUDI-1374](https://issues.apache.org/jira/browse/HUDI-1374)
3. future support "Mixed partition" mode
cc @n3nash @vinothchandar please help to review if the plan about spark sql for hudi overwirte is suitable.
It is also possible that my understanding is inappropriate
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 commented on a change in pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on a change in pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#discussion_r518175216
##########
File path: hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
##########
@@ -156,6 +162,76 @@ class TestCOWDataSource extends HoodieClientTestBase {
assertEquals(100, timeTravelDF.count()) // 100 initial inserts must be pulled
}
+ @Test def testOverWriteModeUseReplaceAction(): Unit = {
+ val records1 = recordsToStrings(dataGen.generateInserts("001", 5)).toList
+ val inputDF1 = spark.read.json(spark.sparkContext.parallelize(records1, 2))
+ inputDF1.write.format("org.apache.hudi")
+ .options(commonOpts)
+ .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+ .mode(SaveMode.Append)
+ .save(basePath)
+
+ val records2 = recordsToStrings(dataGen.generateInserts("002", 5)).toList
+ val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2))
+ inputDF2.write.format("org.apache.hudi")
+ .options(commonOpts)
+ .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+ .mode(SaveMode.Overwrite)
+ .save(basePath)
+
+ val metaClient = new HoodieTableMetaClient(spark.sparkContext.hadoopConfiguration, basePath, true)
+ val commits = metaClient.getActiveTimeline.filterCompletedInstants().getInstants.toArray
+ .map(instant => (instant.asInstanceOf[HoodieInstant]).getAction)
+ assertEquals(2, commits.size)
+ assertEquals("commit", commits(0))
+ assertEquals("replacecommit", commits(1))
+ }
+
+ @Test def testOverWriteModeUseReplaceActionOnDisJointPartitions(): Unit = {
+ // step1: Write 5 records to hoodie table for partition1 DEFAULT_FIRST_PARTITION_PATH
+ val records1 = recordsToStrings(dataGen.generateInsertsForPartition("001", 5, HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH)).toList
+ val inputDF1 = spark.read.json(spark.sparkContext.parallelize(records1, 2))
+ inputDF1.write.format("org.apache.hudi")
+ .options(commonOpts)
+ .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+ .mode(SaveMode.Append)
+ .save(basePath)
+
+ // step2: Write 7 more records using SaveMode.Overwrite for partition2 DEFAULT_SECOND_PARTITION_PATH
+ val records2 = recordsToStrings(dataGen.generateInsertsForPartition("002", 7, HoodieTestDataGenerator.DEFAULT_SECOND_PARTITION_PATH)).toList
+ val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2))
+ inputDF2.write.format("org.apache.hudi")
+ .options(commonOpts)
+ .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+ .mode(SaveMode.Overwrite)
+ .save(basePath)
+ inputDF2.registerTempTable("tmpTable")
+
+ // step3: Query the rows count from hoodie table for partition1 DEFAULT_FIRST_PARTITION_PATH
+ val recordCountForParititon1 = spark.sql(String.format("select count(*) from tmpTable where partition = '%s'", HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH)).collect()
Review comment:
will fix it
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 commented on a change in pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on a change in pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#discussion_r510838979
##########
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -93,6 +93,11 @@ private[hudi] object HoodieSparkSqlWriter {
operation = WriteOperationType.INSERT
}
+ // If the mode is Overwrite, should use INSERT_OVERWRITE operation
Review comment:
@satishkotha thanks,
1. i think the method is ok . Because in HoodieSparkSqlWriter.scala , if we set operation as replace action, then it also use "client.insertOverwrite(hoodieRecords, instantTime); " here will do as " Insert overwrite only updates partitions".
public static HoodieWriteResult doWriteOperation(SparkRDDWriteClient client, JavaRDD<HoodieRecord> hoodieRecords,
String instantTime, WriteOperationType operation) throws HoodieException {
switch (operation) {
case BULK_INSERT:
Option<BulkInsertPartitioner> userDefinedBulkInsertPartitioner =
createUserDefinedBulkInsertPartitioner(client.getConfig());
return new HoodieWriteResult(client.bulkInsert(hoodieRecords, instantTime, userDefinedBulkInsertPartitioner));
case INSERT:
return new HoodieWriteResult(client.insert(hoodieRecords, instantTime));
case UPSERT:
return new HoodieWriteResult(client.upsert(hoodieRecords, instantTime));
case INSERT_OVERWRITE:
return client.insertOverwrite(hoodieRecords, instantTime);
default:
2. https://issues.apache.org/jira/browse/HUDI-1350. is another job.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] satishkotha commented on a change in pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#discussion_r526430857
##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteTableCommitActionExecutor.java
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.common.HoodieEngineContext;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class SparkInsertOverwriteTableCommitActionExecutor<T extends HoodieRecordPayload<T>>
+ extends SparkInsertOverwriteCommitActionExecutor<T> {
+
+ public SparkInsertOverwriteTableCommitActionExecutor(HoodieEngineContext context,
+ HoodieWriteConfig config, HoodieTable table,
+ String instantTime, JavaRDD<HoodieRecord<T>> inputRecordsRDD) {
+ super(context, config, table, instantTime, inputRecordsRDD, WriteOperationType.INSERT_OVERWRITE_TABLE);
+ }
+
+ @Override
+ protected Map<String, List<String>> getPartitionToReplacedFileIds(JavaRDD<WriteStatus> writeStatuses) {
+ Map<String, List<String>> result = table.getSliceView().getLatestFileSlices().distinct()
Review comment:
This can be time consuming for large tables. Is it possible to parallelize this?
1) get all partitions and then
2) enginecontext.parallelize(#paritions)
3) for each partition, get all latest file ids
I think we dont need to add new filesystem apis with this approach
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] vinothchandar edited a comment on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
vinothchandar edited a comment on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-757498877
@lw309637554 sounds good. but I think we should have made this configurable and make the old behavior the default. `spark.write...mode("overwrite")` is a very common operation, and I'd prefer to not have this dependent on such new feature. Do you have cycles to do this in the next few hours? Please let me know. @n3nash can you drive this otherwise?
general rule of thumb: new features should be guarded by a flag, turned off by default, until deemed stable after release. :) lets please uphold this. Happy to discuss more if needed.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] n3nash commented on a change in pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#discussion_r533849917
##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteTableCommitActionExecutor.java
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.common.HoodieEngineContext;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieCommitException;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.PairFunction;
+import scala.Tuple2;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class SparkInsertOverwriteTableCommitActionExecutor<T extends HoodieRecordPayload<T>>
+ extends SparkInsertOverwriteCommitActionExecutor<T> {
+
+ public SparkInsertOverwriteTableCommitActionExecutor(HoodieEngineContext context,
+ HoodieWriteConfig config, HoodieTable table,
+ String instantTime, JavaRDD<HoodieRecord<T>> inputRecordsRDD) {
+ super(context, config, table, instantTime, inputRecordsRDD, WriteOperationType.INSERT_OVERWRITE_TABLE);
+ }
+
+ protected List<String> getAllExistingFileIds(String partitionPath) {
+ return table.getSliceView().getLatestFileSlices(partitionPath)
+ .map(fg -> fg.getFileId()).distinct().collect(Collectors.toList());
+ }
+
+ @Override
+ protected Map<String, List<String>> getPartitionToReplacedFileIds(JavaRDD<WriteStatus> writeStatuses) {
+ Map<String, List<String>> partitionToExistingFileIds = new HashMap<>();
+ try {
+ List<String> partitionPaths = FSUtils.getAllPartitionPaths(table.getMetaClient().getFs(),
+ table.getMetaClient().getBasePath(), false);
+ JavaSparkContext jsc = HoodieSparkEngineContext.getSparkContext(context);
+ if (partitionPaths != null && partitionPaths.size() > 0) {
+ context.setJobStatus(this.getClass().getSimpleName(), "Getting ExistingFileIds of all partitions");
+ JavaRDD<String> partitionPathRdd = jsc.parallelize(partitionPaths, partitionPaths.size());
+ partitionToExistingFileIds = partitionPathRdd.mapToPair((PairFunction<String, String, List<String>>)
Review comment:
Can we use function style here instead of needing to case to PairFunction ? Something like partitionPathRdd.mapToPair( () -> ...) ?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-737319592
@satishkotha have resolved comment of n3nash, and squash the commits. can you help to merged ? cc @n3nash
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] codecov-io edited a comment on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-728653013
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=h1) Report
> Merging [#2196](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=desc) (305f320) into [master](https://codecov.io/gh/apache/hudi/commit/430d4b428e7c5b325c7414a187f9cda158c2758a?el=desc) (430d4b4) will **decrease** coverage by `43.14%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2196/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2196 +/- ##
=============================================
- Coverage 53.54% 10.39% -43.15%
+ Complexity 2770 48 -2722
=============================================
Files 348 50 -298
Lines 16109 1779 -14330
Branches 1643 211 -1432
=============================================
- Hits 8626 185 -8441
+ Misses 6785 1581 -5204
+ Partials 698 13 -685
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudispark | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `10.39% <ø> (-59.70%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| ... and [314 more](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree-more) | |
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-718070816
@satishkotha @n3nash can you help to review again, Thanks
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] satishkotha commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
satishkotha commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-722937587
> @satishkotha
> A. Thanks so much. This pr need to solved the issue with better approach.
> Now I am more clear about overwrite semantic between table.overwrite and spark sql overwrite for hudi.
>
> B. Also spark sql for hudi overwrite should have the ability just like spark sql 、hive 、 delta lake.
> these engine have three mode for overwrite about partition:
>
> 1. Dynamic Partition : delete all partition data ,and the insert the new data for different
> 2. Static partition: just overwrite the partition which is user specified
> 3. Mixed partition: mixed of 1 and 2
> more detail in :
> https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-dml-insert-overwrite-table.html
> https://www.programmersought.com/article/47155360487/
>
> C. our plan
>
> 1. Now spark sql for hudi overwrite is Dynamic Partition. I will resolved it in this issue HUDI-1349, first support delete all partition in [HUDI-1350](https://issues.apache.org/jira/browse/HUDI-1350) , then land this issue. (Just like @satishkotha 's Suggestion)
> 2. Now spark sql for hudi does not support "Static partition" mode, will then land it in [HUDI-1374](https://issues.apache.org/jira/browse/HUDI-1374)
> 3. future support "Mixed partition" mode
>
> cc @n3nash @vinothchandar please help to review if the plan about spark sql for hudi overwirte is suitable.
> It is also possible that my understanding is inappropriate
Just fyi, in the [RFC](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+18+Insert+Overwrite+API#RFC18InsertOverwriteAPI-API) we discussed having 'insert_overwrite_table' operation to support dynamic partitioning. static partitioning is supported by 'insert_overwrite'.
But, agree with getting approval from Nishith/Vinoth before starting implementation.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] codecov-io commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-728653013
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=h1) Report
> Merging [#2196](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=desc) (305f320) into [master](https://codecov.io/gh/apache/hudi/commit/430d4b428e7c5b325c7414a187f9cda158c2758a?el=desc) (430d4b4) will **decrease** coverage by `43.14%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2196/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2196 +/- ##
=============================================
- Coverage 53.54% 10.39% -43.15%
+ Complexity 2770 48 -2722
=============================================
Files 348 50 -298
Lines 16109 1779 -14330
Branches 1643 211 -1432
=============================================
- Hits 8626 185 -8441
+ Misses 6785 1581 -5204
+ Partials 698 13 -685
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudispark | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `10.39% <ø> (-59.70%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| ... and [314 more](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree-more) | |
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 edited a comment on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
lw309637554 edited a comment on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-757504672
> @lw309637554 sounds good. but I think we should have made this configurable and make the old behavior the default. `spark.write...mode("overwrite")` is a very common operation, and I'd prefer to not have this dependent on such new feature. Do you have cycles to do this in the next few hours? Please let me know. @n3nash can you drive this otherwise?
>
> general rule of thumb: new features should be guarded by a flag, turned off by default, until deemed stable after release. :) lets please uphold this. Happy to discuss more if needed.
@vinothchandar okay, i can add the configure now. Will land it in https://github.com/apache/hudi/pull/2428 cc @n3nash
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-731599644
> LGTM. @n3nash Can you take a pass and merge this?
@n3nash Can you help to review? thanks
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] vinothchandar commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-715447332
@n3nash you and satish can take this home
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 commented on a change in pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on a change in pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#discussion_r534276769
##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteTableCommitActionExecutor.java
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.common.HoodieEngineContext;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieCommitException;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.PairFunction;
+import scala.Tuple2;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class SparkInsertOverwriteTableCommitActionExecutor<T extends HoodieRecordPayload<T>>
+ extends SparkInsertOverwriteCommitActionExecutor<T> {
+
+ public SparkInsertOverwriteTableCommitActionExecutor(HoodieEngineContext context,
+ HoodieWriteConfig config, HoodieTable table,
+ String instantTime, JavaRDD<HoodieRecord<T>> inputRecordsRDD) {
+ super(context, config, table, instantTime, inputRecordsRDD, WriteOperationType.INSERT_OVERWRITE_TABLE);
+ }
+
+ protected List<String> getAllExistingFileIds(String partitionPath) {
+ return table.getSliceView().getLatestFileSlices(partitionPath)
+ .map(fg -> fg.getFileId()).distinct().collect(Collectors.toList());
+ }
+
+ @Override
+ protected Map<String, List<String>> getPartitionToReplacedFileIds(JavaRDD<WriteStatus> writeStatuses) {
+ Map<String, List<String>> partitionToExistingFileIds = new HashMap<>();
+ try {
+ List<String> partitionPaths = FSUtils.getAllPartitionPaths(table.getMetaClient().getFs(),
+ table.getMetaClient().getBasePath(), false);
+ JavaSparkContext jsc = HoodieSparkEngineContext.getSparkContext(context);
+ if (partitionPaths != null && partitionPaths.size() > 0) {
+ context.setJobStatus(this.getClass().getSimpleName(), "Getting ExistingFileIds of all partitions");
+ JavaRDD<String> partitionPathRdd = jsc.parallelize(partitionPaths, partitionPaths.size());
+ partitionToExistingFileIds = partitionPathRdd.mapToPair((PairFunction<String, String, List<String>>)
Review comment:
ok
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] codecov-io edited a comment on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-728653013
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=h1) Report
> Merging [#2196](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=desc) (3f9e2ca) into [master](https://codecov.io/gh/apache/hudi/commit/c8d5ea2752f8289b9dec37a149e82c030f657924?el=desc) (c8d5ea2) will **decrease** coverage by `43.10%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2196/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2196 +/- ##
=============================================
- Coverage 53.51% 10.41% -43.11%
+ Complexity 2769 48 -2721
=============================================
Files 348 50 -298
Lines 16107 1777 -14330
Branches 1643 211 -1432
=============================================
- Hits 8619 185 -8434
+ Misses 6789 1579 -5210
+ Partials 699 13 -686
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudispark | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `10.41% <ø> (-59.66%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| ... and [319 more](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree-more) | |
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] codecov-io edited a comment on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-728653013
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=h1) Report
> Merging [#2196](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=desc) (0da0ca7) into [master](https://codecov.io/gh/apache/hudi/commit/c8d5ea2752f8289b9dec37a149e82c030f657924?el=desc) (c8d5ea2) will **decrease** coverage by `43.10%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2196/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2196 +/- ##
=============================================
- Coverage 53.51% 10.41% -43.11%
+ Complexity 2769 48 -2721
=============================================
Files 348 50 -298
Lines 16107 1777 -14330
Branches 1643 211 -1432
=============================================
- Hits 8619 185 -8434
+ Misses 6789 1579 -5210
+ Partials 699 13 -686
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudispark | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `10.41% <ø> (-59.66%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| ... and [319 more](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree-more) | |
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-757473127
> @lw309637554 @n3nash @satishkotha
>
> I have been trying to test master branch and the following change to make `Overwrite` be a INSERT_OVERWRITE_TABLE, adds complexity IMO - esp given interplay with newer features like table metadata.
>
> What's the rationale behind this? Just to make the operation faster over the previous fs.delete() based approach? I mean, ultimately someone needs to do these deletes no.
@vinothchandar Hello,
the original intention of this function is that, users can be able to recover accidentally deleted when using INSERT_OVERWRITE_TABLE, if the replace file have not been cleaned. The old fs.delete needs to retrieve data from the data source , it is complex.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] satishkotha commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
satishkotha commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-730596478
LGTM. @n3nash Can you take a pass and merge this?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-730723941
> LGTM. @n3nash Can you take a pass and merge this?
@satishkotha Thanks for your suggestion.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-727229138
> @lw309637554 the plan looks good to me. @satishkotha when the PR is ready, please tag me so I can take a final pass and merge it
thanks , @satishkotha @n3nash i thinks add a insertOverwriteTable(JavaRDD<HoodieRecord<T>> records, final String commitTime) just as https://cwiki.apache.org/confluence/display/HUDI/RFC+-+18+Insert+Overwrite+API#RFC18InsertOverwriteAPI-API will be ok . I will land it
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] vinothchandar commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-757498877
@lw309637554 sounds good. but I think we should have made this configurable and make the old behavior the default. `spark.write...mode("overwrite")` is a very common operation, and I'd prefer to not have this dependent on such new feature. Do you have cycles to do this? Please let me know. @n3nash can you drive this otherwise?
general rule of thumb: new features should be guarded by a flag, turned off by default, until deemed stable after release. :) lets please uphold this. Happy to discuss more if needed.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-757504672
> @lw309637554 sounds good. but I think we should have made this configurable and make the old behavior the default. `spark.write...mode("overwrite")` is a very common operation, and I'd prefer to not have this dependent on such new feature. Do you have cycles to do this in the next few hours? Please let me know. @n3nash can you drive this otherwise?
>
> general rule of thumb: new features should be guarded by a flag, turned off by default, until deemed stable after release. :) lets please uphold this. Happy to discuss more if needed.
@vinothchandar okay, i can add the configure now
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] lw309637554 commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-718349922
> > @satishkotha @n3nash can you help to review again, Thanks
>
> Can you please add additional tests with multiple steps writing to disjoint partitions.
okay
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] codecov-io edited a comment on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-728653013
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=h1) Report
> Merging [#2196](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=desc) (305f320) into [master](https://codecov.io/gh/apache/hudi/commit/430d4b428e7c5b325c7414a187f9cda158c2758a?el=desc) (430d4b4) will **decrease** coverage by `7.06%`.
> The diff coverage is `17.24%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2196/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2196 +/- ##
============================================
- Coverage 53.54% 46.47% -7.07%
+ Complexity 2770 2437 -333
============================================
Files 348 342 -6
Lines 16109 15766 -343
Branches 1643 1627 -16
============================================
- Hits 8626 7328 -1298
- Misses 6785 7866 +1081
+ Partials 698 572 -126
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `38.37% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `55.17% <4.34%> (-0.16%)` | `0.00 <0.00> (ø)` | |
| hudihadoopmr | `32.94% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudispark | `65.86% <66.66%> (+0.28%)` | `0.00 <2.00> (ø)` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `10.39% <ø> (-59.70%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...common/table/view/AbstractTableFileSystemView.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvQWJzdHJhY3RUYWJsZUZpbGVTeXN0ZW1WaWV3LmphdmE=) | `82.73% <0.00%> (-3.06%)` | `137.00 <0.00> (ø)` | |
| [...common/table/view/PriorityBasedFileSystemView.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvUHJpb3JpdHlCYXNlZEZpbGVTeXN0ZW1WaWV3LmphdmE=) | `95.71% <0.00%> (-1.39%)` | `33.00 <0.00> (ø)` | |
| [...on/table/view/RemoteHoodieTableFileSystemView.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvUmVtb3RlSG9vZGllVGFibGVGaWxlU3lzdGVtVmlldy5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
| [...i/hadoop/utils/HoodieRealtimeInputFormatUtils.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3V0aWxzL0hvb2RpZVJlYWx0aW1lSW5wdXRGb3JtYXRVdGlscy5qYXZh) | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
| [...g/apache/hudi/common/model/WriteOperationType.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL1dyaXRlT3BlcmF0aW9uVHlwZS5qYXZh) | `55.17% <50.00%> (-0.39%)` | `2.00 <0.00> (ø)` | |
| [...src/main/java/org/apache/hudi/DataSourceUtils.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9EYXRhU291cmNlVXRpbHMuamF2YQ==) | `42.45% <50.00%> (+1.50%)` | `21.00 <2.00> (+2.00)` | |
| [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllU3BhcmtTcWxXcml0ZXIuc2NhbGE=) | `51.72% <50.00%> (+0.77%)` | `0.00 <0.00> (ø)` | |
| [.../java/org/apache/hudi/HoodieDataSourceHelpers.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9Ib29kaWVEYXRhU291cmNlSGVscGVycy5qYXZh) | `84.61% <100.00%> (ø)` | `5.00 <0.00> (ø)` | |
| [...main/scala/org/apache/hudi/DataSourceOptions.scala](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvRGF0YVNvdXJjZU9wdGlvbnMuc2NhbGE=) | `94.95% <100.00%> (+0.04%)` | `0.00 <0.00> (ø)` | |
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| ... and [41 more](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree-more) | |
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] n3nash commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
n3nash commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-724184645
@lw309637554 the plan looks good to me. @satishkotha when the PR is ready, please tag me so I can take a final pass and merge it
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] codecov-io edited a comment on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-728653013
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=h1) Report
> Merging [#2196](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=desc) (1b099ab) into [master](https://codecov.io/gh/apache/hudi/commit/78fd12259404cad8117862fdf0b948cbe115e453?el=desc) (78fd122) will **decrease** coverage by `43.04%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2196/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2196 +/- ##
=============================================
- Coverage 53.45% 10.41% -43.05%
+ Complexity 2781 48 -2733
=============================================
Files 354 50 -304
Lines 16158 1777 -14381
Branches 1648 211 -1437
=============================================
- Hits 8637 185 -8452
+ Misses 6821 1579 -5242
+ Partials 700 13 -687
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudispark | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `10.41% <ø> (-59.66%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2196?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| ... and [325 more](https://codecov.io/gh/apache/hudi/pull/2196/diff?src=pr&el=tree-more) | |
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] satishkotha commented on a change in pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#discussion_r510350766
##########
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -93,6 +93,11 @@ private[hudi] object HoodieSparkSqlWriter {
operation = WriteOperationType.INSERT
}
+ // If the mode is Overwrite, should use INSERT_OVERWRITE operation
Review comment:
I think this won't work. Insert overwrite only updates partitions that have records in the dataframe. Other partitions continue to have old data.
But, I like the idea. Maybe we can add additional configuration to insert_overwrite to mark old partitions as 'deleted'? This can be done in a way to support https://issues.apache.org/jira/browse/HUDI-1350.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] vinothchandar commented on pull request #2196: [HUDI-1349]spark sql support overwrite use replace action
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2196:
URL: https://github.com/apache/hudi/pull/2196#issuecomment-757419558
@lw309637554 @n3nash @satishkotha
I have been trying to test master branch and the following change to make `Overwrite` be a INSERT_OVERWRITE_TABLE, adds complexity IMO - esp given interplay with newer features like table metadata.
What's the rationale behind this? Just to make the operation faster over the previous fs.delete() based approach? I mean, ultimately someone needs to do these deletes no.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org