Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/01/22 15:18:53 UTC
[GitHub] [hudi] teeyog opened a new pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
teeyog opened a new pull request #2475:
URL: https://github.com/apache/hudi/pull/2475
## What is the purpose of the pull request
To read a Hudi table today, the user must specify a path, and that path is not just the table's tablePath: it depends on the partition directory structure. Different keyGenerators produce different partition directory structures. A table with one level of partition directories needs path=```.../table/*/*```, and a table with two levels needs path=```.../table/*/*/*```. Making the user work out the data path is cumbersome, so with this change the user only needs to specify the tablePath: ```.../table```
Reading the Hudi table with path=```.../table``` also makes it more convenient to query the table with Spark SQL. You only need to add one entry to the tblproperties in the Hive table metadata, ```spark.sql.sources.provider=hudi```, and the Hive table is automatically converted to a Hudi table.
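As a rough illustration of the inference described above, the sketch below derives the glob from a table path and one partition path by counting partition directory levels. `DataPathGlob` and `inferDataGlob` are hypothetical names, not part of Hudi, and the plain-string path handling is an assumption made to keep the example self-contained.

```java
final class DataPathGlob {
    private DataPathGlob() {}

    // Hypothetical helper: appends one wildcard level per partition directory,
    // plus a final wildcard level that matches the files inside a partition.
    static String inferDataGlob(String tablePath, String onePartitionPath) {
        String base = stripTrailingSlashes(tablePath);
        String partition = stripTrailingSlashes(onePartitionPath);
        if (!partition.equals(base) && !partition.startsWith(base + "/")) {
            throw new IllegalArgumentException(
                "partition path is not under the table path: " + onePartitionPath);
        }
        // Relative part such as "/2020/01"; empty for a non-partitioned table.
        String relative = partition.substring(base.length());
        int depth = 0;
        for (String segment : relative.split("/")) {
            if (!segment.isEmpty()) {
                depth++;
            }
        }
        StringBuilder glob = new StringBuilder(base);
        for (int i = 0; i <= depth; i++) {
            glob.append("/*");
        }
        return glob.toString();
    }

    private static String stripTrailingSlashes(String path) {
        int end = path.length();
        while (end > 1 && path.charAt(end - 1) == '/') {
            end--;
        }
        return path.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(inferDataGlob("/data/table", "/data/table/2020"));
        System.out.println(inferDataGlob("/data/table", "/data/table/2020/01"));
    }
}
```

A non-partitioned table, where the partition path equals the table path, gets a single wildcard level so that the glob still matches the data files directly under the table directory.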
## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
> Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (9c38d02) into [master](https://codecov.io/gh/apache/hudi/commit/e302c6bc12c7eb764781898fdee8ee302ef4ec10?el=desc) (e302c6b) will **increase** coverage by `19.24%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2475 +/- ##
=============================================
+ Coverage 50.18% 69.43% +19.24%
+ Complexity 3050 357 -2693
=============================================
Files 419 53 -366
Lines 18931 1930 -17001
Branches 1948 230 -1718
=============================================
- Hits 9500 1340 -8160
+ Misses 8656 456 -8200
+ Partials 775 134 -641
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `69.43% <ø> (ø)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...e/hudi/common/table/timeline/dto/FileGroupDTO.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL2R0by9GaWxlR3JvdXBEVE8uamF2YQ==) | | | |
| [...ava/org/apache/hudi/common/model/HoodieRecord.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVJlY29yZC5qYXZh) | | | |
| [...metadata/HoodieMetadataMergedLogRecordScanner.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvSG9vZGllTWV0YWRhdGFNZXJnZWRMb2dSZWNvcmRTY2FubmVyLmphdmE=) | | | |
| [...g/apache/hudi/exception/HoodieRemoteException.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZVJlbW90ZUV4Y2VwdGlvbi5qYXZh) | | | |
| [.../apache/hudi/common/table/log/HoodieLogFormat.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXQuamF2YQ==) | | | |
| [...n/java/org/apache/hudi/cli/HoodieSplashScreen.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL0hvb2RpZVNwbGFzaFNjcmVlbi5qYXZh) | | | |
| [.../hudi/hadoop/realtime/HoodieRealtimeFileSplit.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL0hvb2RpZVJlYWx0aW1lRmlsZVNwbGl0LmphdmE=) | | | |
| [...org/apache/hudi/hadoop/realtime/RealtimeSplit.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL1JlYWx0aW1lU3BsaXQuamF2YQ==) | | | |
| [...n/java/org/apache/hudi/common/HoodieCleanStat.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL0hvb2RpZUNsZWFuU3RhdC5qYXZh) | | | |
| [...udi/common/table/log/block/HoodieCorruptBlock.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9ibG9jay9Ib29kaWVDb3JydXB0QmxvY2suamF2YQ==) | | | |
| ... and [355 more](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree-more) | |
[GitHub] [hudi] teeyog commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
teeyog commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774639582
> FSUtils.getAllPartitionPaths()
The PR has been modified to obtain the partition path via ```FSUtils.getAllPartitionPaths()```.
[GitHub] [hudi] vinothchandar commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774538085
@teeyog could you either call `FSUtils.getAllPartitionPaths()`, or add a new method `getNPartitionPaths()` to the class `FileSystemBackedTableMetadata` that uses your traversal and returns the first N partition paths?
[GitHub] [hudi] teeyog commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
teeyog commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774571361
> > You only need to add one entry to the tblproperties in the Hive table metadata, spark.sql.sources.provider=hudi, and the Hive table is automatically converted to a Hudi table.
>
> @teeyog can you please expand on this? Is this related to this PR or a general comment?
If the Hive table's tblproperties contain ```spark.sql.sources.provider=hudi```, Spark SQL resolves a read of the Hive table as follows:
First step
[https://github.com/apache/spark/blob/62be2483d7d78e61fd2f77929cf41c76eff17869/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L302](https://github.com/apache/spark/blob/62be2483d7d78e61fd2f77929cf41c76eff17869/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L302)
Second step
[https://github.com/apache/spark/blob/62be2483d7d78e61fd2f77929cf41c76eff17869/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L261](https://github.com/apache/spark/blob/62be2483d7d78e61fd2f77929cf41c76eff17869/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L261)
The resolveRelation call in the second step goes directly to Hudi's DefaultSource, so a read of the Hive table is automatically converted into a read of the Hudi table.
[GitHub] [hudi] vinothchandar commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774542977
>Of course it can, but it is specified by the parameter basePath=../2020/*/*
I want to clarify this a bit. Do you mean `val READ_PATHS_OPT_KEY = "hoodie.datasource.read.paths"`?
If I do the following, I see that we reset the `path` in options to `basePath + "/*/*/*/*"`. How does the Spark parquet source know to look only at 2020 and 2021, for example?
```
val snapshotDF1 = spark.read.format("org.apache.hudi")
.load(basePath + "/202*/*/*/*")
```
[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
> Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (62cff1e) into [master](https://codecov.io/gh/apache/hudi/commit/43a0776c7c88a5f7beac6c8853db7e341810635a?el=desc) (43a0776) will **increase** coverage by `0.03%`.
> The diff coverage is `93.33%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2475 +/- ##
============================================
+ Coverage 51.14% 51.17% +0.03%
- Complexity 3215 3219 +4
============================================
Files 438 438
Lines 20041 20055 +14
Branches 2064 2067 +3
============================================
+ Hits 10250 10264 +14
Misses 8946 8946
Partials 845 845
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `36.87% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `51.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiflink | `45.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisparkdatasource | `69.96% <93.33%> (+0.20%)` | `0.00 <0.00> (ø)` | |
| hudisync | `48.61% <ø> (ø)` | `0.00 <ø> (ø)` | |
| huditimelineservice | `66.49% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiutilities | `69.51% <ø> (+0.05%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...src/main/scala/org/apache/hudi/DefaultSource.scala](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0RlZmF1bHRTb3VyY2Uuc2NhbGE=) | `85.41% <93.33%> (+1.27%)` | `20.00 <0.00> (+3.00)` | |
| [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.35% <0.00%> (+0.35%)` | `51.00% <0.00%> (+1.00%)` | |
[GitHub] [hudi] vinothchandar commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774539141
> You only need to add one entry to the tblproperties in the Hive table metadata, spark.sql.sources.provider=hudi, and the Hive table is automatically converted to a Hudi table.
@teeyog can you please expand on this? Is this related to this PR or a general comment?
[GitHub] [hudi] vinothchandar closed pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
vinothchandar closed pull request #2475:
URL: https://github.com/apache/hudi/pull/2475
[GitHub] [hudi] zhedoubushishi commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
zhedoubushishi commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r569775940
##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##########
@@ -84,6 +86,39 @@ public static String getTablePath(FileSystem fs, Path[] userProvidedPaths) throw
throw new TableNotFoundException("Unable to find a hudi table for the user provided paths.");
}
+ public static Option<String> getOnePartitionPath(FileSystem fs, Path tablePath) throws IOException {
+ // When the table is not partitioned
+ if (HoodiePartitionMetadata.hasPartitionMetadata(fs, tablePath)) {
+ return Option.of(tablePath.toString());
+ }
+ FileStatus[] statuses = fs.listStatus(tablePath);
+ for (FileStatus status : statuses) {
+ if (status.isDirectory()) {
+ if (HoodiePartitionMetadata.hasPartitionMetadata(fs, status.getPath())) {
+ return Option.of(status.getPath().toString());
+ } else {
+ Option<String> partitionPath = getOnePartitionPath(fs, status.getPath());
+ if (partitionPath.isPresent()) {
+ return partitionPath;
Review comment:
Yes, I agree it would be better to use ```HoodieTableMetadata``` to avoid ```fs.listStatus```. But what about tables that do not have the metadata feature enabled? Will it take a very long time for a table with many partitions?
Also, ```hoodie_partition_metadata``` saves a parameter called ```partitionDepth```. Could we take advantage of this?
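To make the `partitionDepth` idea concrete, here is a minimal sketch that reads the depth from the text of a partition metadata file and turns it into a glob. It assumes the file is in `java.util.Properties` format with a `partitionDepth` key; `PartitionDepthGlob` and its methods are hypothetical names for illustration, not Hudi APIs, and the sample `commitTime` value is made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class PartitionDepthGlob {
    private PartitionDepthGlob() {}

    // Assumption: the partition metadata file is in java.util.Properties
    // format and stores the depth under the key "partitionDepth".
    static int readPartitionDepth(String metadataText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(metadataText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String raw = props.getProperty("partitionDepth");
        if (raw == null) {
            throw new IllegalArgumentException("partitionDepth is missing");
        }
        return Integer.parseInt(raw.trim());
    }

    // One wildcard level per partition directory, plus one for the data files.
    static String globForDepth(String tablePath, int partitionDepth) {
        StringBuilder glob = new StringBuilder(tablePath);
        for (int i = 0; i <= partitionDepth; i++) {
            glob.append("/*");
        }
        return glob.toString();
    }

    public static void main(String[] args) {
        // Illustrative file content only; the commitTime value is invented.
        String metadata = "commitTime=20210122151853\npartitionDepth=3\n";
        System.out.println(globForDepth("/data/table", readPartitionDepth(metadata)));
    }
}
```

With this approach only one metadata file has to be located and read, so the cost would not grow with the number of partitions, although finding that first file still requires some directory traversal.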
[GitHub] [hudi] teeyog commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
teeyog commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774578774
> > Of course it can, but it is specified by the parameter basePath=../2020/*/*
>
> I want to clarify this a bit. Do you mean `val READ_PATHS_OPT_KEY = "hoodie.datasource.read.paths"`?
>
> If I do the following, I see that we reset the `path` in options to `basePath + "/*/*/*/*"`. How does the Spark parquet source know to look only at 2020 and 2021, for example?
>
> ```
> val snapshotDF1 = spark.read.format("org.apache.hudi")
> .load(basePath + "/202*/*/*/*")
> ```
I understand what you mean. The situation you describe is indeed not supported, because the automatically inferred data path overrides the path configured by the user. If you only need the 2020 and 2021 data, you can filter the DataFrame after loading. Alternatively, I could check whether the path specified by the user contains `*` and, if it does, skip the automatic inference. What do you think?
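The second option could look roughly like the sketch below, which infers the data path only when the user passes the bare table path and leaves any globbed path untouched. `PathInference` and its methods are hypothetical names, and checking `?`, `[` and `{` in addition to `*` is an assumption that goes beyond what the comment proposes.

```java
final class PathInference {
    private PathInference() {}

    // Assumption: besides "*", the characters "?", "[" and "{" also mark a
    // glob pattern. The comment above only mentions "*".
    static boolean containsGlob(String path) {
        for (char c : path.toCharArray()) {
            if (c == '*' || c == '?' || c == '[' || c == '{') {
                return true;
            }
        }
        return false;
    }

    // Infer the data path only when the user supplied exactly the table path.
    static boolean shouldInferDataPath(String userPath, String tablePath) {
        return !containsGlob(userPath)
            && stripTrailingSlashes(userPath).equals(stripTrailingSlashes(tablePath));
    }

    private static String stripTrailingSlashes(String path) {
        int end = path.length();
        while (end > 1 && path.charAt(end - 1) == '/') {
            end--;
        }
        return path.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(shouldInferDataPath("/data/table/", "/data/table"));
        System.out.println(shouldInferDataPath("/data/table/202*/*/*/*", "/data/table"));
    }
}
```

Under this rule a load of `basePath + "/202*/*/*/*"` would keep the user's glob, so only the matching partitions are read.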
[GitHub] [hudi] vinothchandar commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r577755265
##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##########
@@ -74,6 +78,19 @@ class DefaultSource extends RelationProvider
val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
log.info("Obtained hudi table path: " + tablePath)
+ val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+ val fsBackedTableMetadata =
+ new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+ val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
+ val onePartitionPath = if(!partitionPaths.isEmpty && !StringUtils.isEmpty(partitionPaths.get(0))) {
+ tablePath + "/" + partitionPaths.get(0)
+ } else {
+ tablePath
+ }
+ val dataPath = DataSourceUtils.getDataPath(tablePath, onePartitionPath)
+ log.info("Obtained hudi data path: " + dataPath)
+ parameters += "path" -> dataPath
Review comment:
@teeyog Sorry, it's still not clear to me. I supplied a globbed path `2015/*/*/*`, and even that gets overridden with `path -> tablePath/*/*/*/*`.
Won't this end up reading all partitions under the tablePath, as opposed to only 2015's?
![image](https://user-images.githubusercontent.com/1179324/108234541-bde4fb00-70f9-11eb-8611-58579636b51b.png)
[GitHub] [hudi] lw309637554 commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
lw309637554 commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r584307500
##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##########
@@ -84,6 +88,26 @@ class DefaultSource extends RelationProvider
val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
log.info("Obtained hudi table path: " + tablePath)
+ if (path.nonEmpty) {
+ val _path = path.get.stripSuffix("/")
+ val pathTmp = new Path(_path).makeQualified(fs.getUri, fs.getWorkingDirectory)
+ // If the user specifies the table path, the data path is automatically inferred
+ if (pathTmp.toString.equals(tablePath)) {
+ val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+ val fsBackedTableMetadata =
+ new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+ val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
Review comment:
@teeyog Hello. The PR now infers the partition layout by getting all partition paths from the metadata table.
The partition layout is set by `hoodie.datasource.write.partitionpath.field` when the Hudi table is written. Could we persist `hoodie.datasource.write.partitionpath.field` to the metadata table? Then a read would only need to fetch that property instead of getting all the partition paths. cc @vinothchandar
[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r569889546
##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##########
@@ -84,6 +86,39 @@ public static String getTablePath(FileSystem fs, Path[] userProvidedPaths) throw
throw new TableNotFoundException("Unable to find a hudi table for the user provided paths.");
}
+ public static Option<String> getOnePartitionPath(FileSystem fs, Path tablePath) throws IOException {
+ // When the table is not partitioned
+ if (HoodiePartitionMetadata.hasPartitionMetadata(fs, tablePath)) {
+ return Option.of(tablePath.toString());
+ }
+ FileStatus[] statuses = fs.listStatus(tablePath);
+ for (FileStatus status : statuses) {
+ if (status.isDirectory()) {
+ if (HoodiePartitionMetadata.hasPartitionMetadata(fs, status.getPath())) {
+ return Option.of(status.getPath().toString());
+ } else {
+ Option<String> partitionPath = getOnePartitionPath(fs, status.getPath());
+ if (partitionPath.isPresent()) {
+ return partitionPath;
Review comment:
Thank you for your review. This method of finding a partition is very fast: it returns as soon as one partition path is found. `FSUtils.getAllPartitionPaths` obtains all partition paths, which is very time-consuming.
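The early-return traversal can be sketched on a local directory tree as follows. This is a stand-in that uses `java.nio.file` instead of the Hadoop `FileSystem` API from the diff, and the marker file name `.hoodie_partition_metadata` is an assumption based on the file mentioned earlier in the thread; `FirstPartitionFinder` is a hypothetical name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

final class FirstPartitionFinder {
    private FirstPartitionFinder() {}

    // Assumed marker file name that identifies a partition directory.
    static final String MARKER = ".hoodie_partition_metadata";

    // Depth-first search that stops at the first directory holding the marker.
    static Optional<Path> findOnePartition(Path dir) {
        if (Files.exists(dir.resolve(MARKER))) {
            return Optional.of(dir);
        }
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                if (Files.isDirectory(child)) {
                    Optional<Path> found = findOnePartition(child);
                    if (found.isPresent()) {
                        return found; // early return: no further listing
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(findOnePartition(root));
    }
}
```

The search is cheap when a partition sits on the first branch explored. In the worst case, for example when the first branches contain many directories without the marker, it can still list a large part of the tree.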
[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
> Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (9c38d02) into [master](https://codecov.io/gh/apache/hudi/commit/e302c6bc12c7eb764781898fdee8ee302ef4ec10?el=desc) (e302c6b) will **decrease** coverage by `40.49%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2475 +/- ##
============================================
- Coverage 50.18% 9.68% -40.50%
+ Complexity 3050 48 -3002
============================================
Files 419 53 -366
Lines 18931 1930 -17001
Branches 1948 230 -1718
============================================
- Hits 9500 187 -9313
+ Misses 8656 1730 -6926
+ Partials 775 13 -762
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.68% <ø> (-59.75%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| ... and [395 more](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree-more) | |
[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r584401119
##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##########
@@ -84,6 +88,26 @@ class DefaultSource extends RelationProvider
val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
log.info("Obtained hudi table path: " + tablePath)
+ if (path.nonEmpty) {
+ val _path = path.get.stripSuffix("/")
+ val pathTmp = new Path(_path).makeQualified(fs.getUri, fs.getWorkingDirectory)
+ // If the user specifies the table path, the data path is automatically inferred
+ if (pathTmp.toString.equals(tablePath)) {
+ val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+ val fsBackedTableMetadata =
+ new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+ val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
Review comment:
@lw309637554 Thank you for your review. The Hudi table path obtained earlier can also be supplied through configuration instead of being inferred.
----------------------------------------------------------------
[GitHub] [hudi] hudi-bot commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-874357657
<!--
Meta data
{
"version" : 1,
"metaDataEntries" : [ {
"hash" : "62cff1e1b984c800d44e8f33df23f9ccb9fa4c97",
"status" : "UNKNOWN",
"url" : "TBD",
"triggerID" : "62cff1e1b984c800d44e8f33df23f9ccb9fa4c97",
"triggerType" : "PUSH"
} ]
}-->
## CI report:
* 62cff1e1b984c800d44e8f33df23f9ccb9fa4c97 UNKNOWN
<details>
<summary>Bot commands</summary>
@hudi-bot supports the following commands:
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] codecov-io commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
> Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (8d073b6) into [master](https://codecov.io/gh/apache/hudi/commit/048633da1a913a05252b1b5dea0b3d40d75c81b4?el=desc) (048633d) will **increase** coverage by `19.25%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2475 +/- ##
=============================================
+ Coverage 50.17% 69.43% +19.25%
+ Complexity 3050 357 -2693
=============================================
Files 419 53 -366
Lines 18931 1930 -17001
Branches 1948 230 -1718
=============================================
- Hits 9498 1340 -8158
+ Misses 8657 456 -8201
+ Partials 776 134 -642
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `69.43% <ø> (ø)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/metadata/BaseTableMetadata.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvQmFzZVRhYmxlTWV0YWRhdGEuamF2YQ==) | | | |
| [.../apache/hudi/common/model/HoodieRecordPayload.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVJlY29yZFBheWxvYWQuamF2YQ==) | | | |
| [...metadata/HoodieMetadataMergedLogRecordScanner.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvSG9vZGllTWV0YWRhdGFNZXJnZWRMb2dSZWNvcmRTY2FubmVyLmphdmE=) | | | |
| [...ain/java/org/apache/hudi/cli/utils/CommitUtil.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL3V0aWxzL0NvbW1pdFV0aWwuamF2YQ==) | | | |
| [...che/hudi/common/table/timeline/HoodieTimeline.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL0hvb2RpZVRpbWVsaW5lLmphdmE=) | | | |
| [...i/common/table/log/block/HoodieHFileDataBlock.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9ibG9jay9Ib29kaWVIRmlsZURhdGFCbG9jay5qYXZh) | | | |
| [...e/hudi/common/engine/LocalTaskContextSupplier.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2VuZ2luZS9Mb2NhbFRhc2tDb250ZXh0U3VwcGxpZXIuamF2YQ==) | | | |
| [...ava/org/apache/hudi/payload/AWSDmsAvroPayload.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvcGF5bG9hZC9BV1NEbXNBdnJvUGF5bG9hZC5qYXZh) | | | |
| [...rg/apache/hudi/cli/commands/SavepointsCommand.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1NhdmVwb2ludHNDb21tYW5kLmphdmE=) | | | |
| [...ache/hudi/common/table/timeline/TimelineUtils.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL1RpbWVsaW5lVXRpbHMuamF2YQ==) | | | |
| ... and [355 more](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree-more) | |
----------------------------------------------------------------
[GitHub] [hudi] vinothchandar commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r571483611
##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##########
@@ -84,6 +86,39 @@ public static String getTablePath(FileSystem fs, Path[] userProvidedPaths) throw
throw new TableNotFoundException("Unable to find a hudi table for the user provided paths.");
}
+ public static Option<String> getOnePartitionPath(FileSystem fs, Path tablePath) throws IOException {
+ // When the table is not partitioned
+ if (HoodiePartitionMetadata.hasPartitionMetadata(fs, tablePath)) {
+ return Option.of(tablePath.toString());
+ }
+ FileStatus[] statuses = fs.listStatus(tablePath);
+ for (FileStatus status : statuses) {
+ if (status.isDirectory()) {
+ if (HoodiePartitionMetadata.hasPartitionMetadata(fs, status.getPath())) {
+ return Option.of(status.getPath().toString());
+ } else {
+ Option<String> partitionPath = getOnePartitionPath(fs, status.getPath());
+ if (partitionPath.isPresent()) {
+ return partitionPath;
Review comment:
@teeyog We could even add a new overload or new methods for this under the `HoodieTableMetadata` interface; it would be really good to keep all of this under that interface. With the metadata table, it's actually okay to call `getAllPartitionPaths()`, since it's pretty fast.
----------------------------------------------------------------
[GitHub] [hudi] vinothchandar commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-888755262
Closing in favor of #3353
--
[GitHub] [hudi] teeyog commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
teeyog commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774640491
> @teeyog if you could call FSUtils.getAllPartitionPaths() or add a new method `getNPartitionPaths()` and return the first N such partition paths using your traversal in the class `FileSystemBackedTableMetadata`
The code has been changed to obtain the partition paths via ```FSUtils.getAllPartitionPaths()```.
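The alternative mentioned in the quote, returning only the first N partition paths instead of all of them, could look roughly like the sketch below. It is not Hudi code: it walks a local directory tree with `java.nio` rather than a Hadoop `FileSystem`, the class name `PartitionPathSampler` and the method `firstNPartitionPaths` are hypothetical, and it assumes a partition is any directory containing a `.hoodie_partition_metadata` marker file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only; not part of Hudi.
final class PartitionPathSampler {
    // Assumed marker file that identifies a partition directory.
    static final String MARKER = ".hoodie_partition_metadata";

    /** Returns up to n partition paths, relative to tableRoot, in sorted order. */
    static List<String> firstNPartitionPaths(Path tableRoot, int n) throws IOException {
        List<String> found = new ArrayList<>();
        collect(tableRoot, tableRoot, n, found);
        return found;
    }

    private static void collect(Path tableRoot, Path dir, int n, List<String> found)
            throws IOException {
        if (found.size() >= n) {
            return; // enough partitions collected, stop descending
        }
        if (Files.exists(dir.resolve(MARKER))) {
            // An unpartitioned table yields the empty string here.
            found.add(tableRoot.relativize(dir).toString());
            return; // assumption: partitions are not nested inside partitions
        }
        List<Path> children;
        try (Stream<Path> listing = Files.list(dir)) {
            children = listing.filter(Files::isDirectory).sorted().collect(Collectors.toList());
        }
        for (Path child : children) {
            if (found.size() >= n) {
                return; // short-circuit the remaining siblings
            }
            collect(tableRoot, child, n, found);
        }
    }
}
```

With `n = 1` this degenerates to the single-partition lookup discussed in the review; larger values trade a little extra listing for a more representative sample.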
----------------------------------------------------------------
[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
> Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (d0ee06e) into [master](https://codecov.io/gh/apache/hudi/commit/e302c6bc12c7eb764781898fdee8ee302ef4ec10?el=desc) (e302c6b) will **increase** coverage by `0.02%`.
> The diff coverage is `100.00%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2475 +/- ##
============================================
+ Coverage 50.18% 50.20% +0.02%
- Complexity 3050 3051 +1
============================================
Files 419 419
Lines 18931 18935 +4
Branches 1948 1948
============================================
+ Hits 9500 9506 +6
+ Misses 8656 8654 -2
Partials 775 775
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `37.21% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `51.47% <ø> (-0.03%)` | `0.00 <ø> (ø)` | |
| hudiflink | `0.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisparkdatasource | `66.16% <100.00%> (+0.31%)` | `0.00 <0.00> (ø)` | |
| hudisync | `48.61% <ø> (ø)` | `0.00 <ø> (ø)` | |
| huditimelineservice | `66.49% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiutilities | `69.48% <ø> (+0.05%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...src/main/scala/org/apache/hudi/DefaultSource.scala](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0RlZmF1bHRTb3VyY2Uuc2NhbGE=) | `89.39% <100.00%> (+0.68%)` | `15.00 <0.00> (ø)` | |
| [...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==) | `78.12% <0.00%> (-1.57%)` | `26.00% <0.00%> (ø%)` | |
| [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.86% <0.00%> (+0.35%)` | `51.00% <0.00%> (+1.00%)` | |
| [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrU3FsV3JpdGVyLnNjYWxh) | `49.64% <0.00%> (+1.06%)` | `0.00% <0.00%> (ø%)` | |
----------------------------------------------------------------
[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
> Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (0e4f1ee) into [master](https://codecov.io/gh/apache/hudi/commit/43a0776c7c88a5f7beac6c8853db7e341810635a?el=desc) (43a0776) will **decrease** coverage by `41.45%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2475 +/- ##
============================================
- Coverage 51.14% 9.69% -41.46%
+ Complexity 3215 48 -3167
============================================
Files 438 53 -385
Lines 20041 1929 -18112
Branches 2064 230 -1834
============================================
- Hits 10250 187 -10063
+ Misses 8946 1729 -7217
+ Partials 845 13 -832
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.69% <ø> (-59.78%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| ... and [414 more](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree-more) | |
----------------------------------------------------------------
[GitHub] [hudi] vinothchandar commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r569761896
##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##########
@@ -84,6 +86,39 @@ public static String getTablePath(FileSystem fs, Path[] userProvidedPaths) throw
throw new TableNotFoundException("Unable to find a hudi table for the user provided paths.");
}
+ public static Option<String> getOnePartitionPath(FileSystem fs, Path tablePath) throws IOException {
+ // When the table is not partitioned
+ if (HoodiePartitionMetadata.hasPartitionMetadata(fs, tablePath)) {
+ return Option.of(tablePath.toString());
+ }
+ FileStatus[] statuses = fs.listStatus(tablePath);
+ for (FileStatus status : statuses) {
+ if (status.isDirectory()) {
+ if (HoodiePartitionMetadata.hasPartitionMetadata(fs, status.getPath())) {
+ return Option.of(status.getPath().toString());
+ } else {
+ Option<String> partitionPath = getOnePartitionPath(fs, status.getPath());
+ if (partitionPath.isPresent()) {
+ return partitionPath;
Review comment:
So, I am wondering if we can use the `HoodieTableMetadata` abstraction to read a partition path instead of listing the file system directly. We are trying to avoid introducing any single-point listings. There is already a method to get all partition paths, `FSUtils.getAllPartitionPaths()`; let's just use that for now? I do think it will be a bit of overkill to list all partition paths without the metadata table.
##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##########
@@ -84,6 +86,39 @@ public static String getTablePath(FileSystem fs, Path[] userProvidedPaths) throw
throw new TableNotFoundException("Unable to find a hudi table for the user provided paths.");
}
+ public static Option<String> getOnePartitionPath(FileSystem fs, Path tablePath) throws IOException {
+ // When the table is not partitioned
+ if (HoodiePartitionMetadata.hasPartitionMetadata(fs, tablePath)) {
+ return Option.of(tablePath.toString());
+ }
+ FileStatus[] statuses = fs.listStatus(tablePath);
+ for (FileStatus status : statuses) {
+ if (status.isDirectory()) {
+ if (HoodiePartitionMetadata.hasPartitionMetadata(fs, status.getPath())) {
+ return Option.of(status.getPath().toString());
+ } else {
+ Option<String> partitionPath = getOnePartitionPath(fs, status.getPath());
+ if (partitionPath.isPresent()) {
+ return partitionPath;
Review comment:
This short-circuits the recursion once we find one partition path, I guess.
----------------------------------------------------------------
[GitHub] [hudi] teeyog removed a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
teeyog removed a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774639582
> FSUtils.getAllPartitionPaths()
The code has been changed to obtain the partition paths via ```FSUtils.getAllPartitionPaths()```.
----------------------------------------------------------------
[GitHub] [hudi] vinothchandar commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-769517183
@zhedoubushishi @umehrot2 could you please take a first pass
----------------------------------------------------------------
[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259
----------------------------------------------------------------
[GitHub] [hudi] teeyog commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
teeyog commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-772964324
> Thanks for this! Seems very useful.
>
> One thing I wanted to understand: can a user still pass `basePath/2020/*/*` and have only the parquet files for 2020 read, for example?
Yes, that still works, but the glob is then specified through the parameter ```basePath=../2020/*/*```
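To illustrate why such a glob restricts the read to one year, here is a small sketch using the JDK's own glob matcher, in which `*` does not cross a directory boundary. It only approximates the Hadoop glob handling that Spark and Hudi use; the class `GlobSelection` and the `yyyy/mm/dd` directory layout are assumptions made for the example.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only; uses java.nio globbing, not Hadoop's.
final class GlobSelection {
    /** Keeps only the directories whose full path matches the glob pattern. */
    static List<String> select(String glob, List<String> partitionDirs) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return partitionDirs.stream()
                .filter(dir -> matcher.matches(Paths.get(dir)))
                .collect(Collectors.toList());
    }
}
```

Given the directories `/base/2020/01/01`, `/base/2020/02/15` and `/base/2021/01/01`, the pattern `/base/2020/*/*` keeps the first two and drops the 2021 one.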
----------------------------------------------------------------
[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r580728819
##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##########
@@ -74,6 +78,19 @@ class DefaultSource extends RelationProvider
val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
log.info("Obtained hudi table path: " + tablePath)
+ val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+ val fsBackedTableMetadata =
+ new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+ val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
+ val onePartitionPath = if(!partitionPaths.isEmpty && !StringUtils.isEmpty(partitionPaths.get(0))) {
+ tablePath + "/" + partitionPaths.get(0)
+ } else {
+ tablePath
+ }
+ val dataPath = DataSourceUtils.getDataPath(tablePath, onePartitionPath)
+ log.info("Obtained hudi data path: " + dataPath)
+ parameters += "path" -> dataPath
Review comment:
@vinothchandar This is now supported. If the path specified by the user is the table path, the data path is inferred automatically; otherwise no inference is performed.
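The rule described here can be sketched as follows. This is not the PR's `DataSourceUtils.getDataPath`, whose body is not shown in this thread; `DataPathInference` and `inferDataPath` are hypothetical names. The glob depth follows the PR description (one partition level gives `table/*/*`, two levels give `table/*/*/*`), and treating an unpartitioned table as `table/*` is an assumption.

```java
// Hypothetical helper, for illustration only; not the PR's implementation.
final class DataPathInference {
    /**
     * If userPath is the table path itself, returns a glob covering all data files,
     * derived from the depth of one known partition path (relative to the table).
     * Any other user path, such as an explicit glob, is returned unchanged.
     */
    static String inferDataPath(String userPath, String tablePath, String onePartition) {
        String normalized = userPath.endsWith("/")
                ? userPath.substring(0, userPath.length() - 1)
                : userPath;
        if (!normalized.equals(tablePath)) {
            return userPath; // user chose a specific path or glob: no inference
        }
        StringBuilder glob = new StringBuilder(tablePath);
        if (onePartition != null && !onePartition.isEmpty()) {
            for (String ignored : onePartition.split("/")) {
                glob.append("/*"); // one wildcard per partition level
            }
        }
        glob.append("/*"); // final wildcard for the data files themselves
        return glob.toString();
    }
}
```

For example, `inferDataPath("/data/table/", "/data/table", "2020/01")` yields `/data/table/*/*/*`, while a user-supplied `/data/table/2020/*/*` is passed through untouched.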
----------------------------------------------------------------
[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
> Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (62cff1e) into [master](https://codecov.io/gh/apache/hudi/commit/43a0776c7c88a5f7beac6c8853db7e341810635a?el=desc) (43a0776) will **increase** coverage by `0.03%`.
> The diff coverage is `93.33%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #2475 +/- ##
============================================
+ Coverage 51.14% 51.17% +0.03%
- Complexity 3215 3219 +4
============================================
Files 438 438
Lines 20041 20055 +14
Branches 2064 2067 +3
============================================
+ Hits 10250 10264 +14
Misses 8946 8946
Partials 845 845
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `36.87% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `51.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiflink | `45.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisparkdatasource | `69.96% <93.33%> (+0.20%)` | `0.00 <0.00> (ø)` | |
| hudisync | `48.61% <ø> (ø)` | `0.00 <ø> (ø)` | |
| huditimelineservice | `66.49% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiutilities | `69.51% <ø> (+0.05%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...src/main/scala/org/apache/hudi/DefaultSource.scala](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0RlZmF1bHRTb3VyY2Uuc2NhbGE=) | `85.41% <93.33%> (+1.27%)` | `20.00 <0.00> (+3.00)` | |
| [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.35% <0.00%> (+0.35%)` | `51.00% <0.00%> (+1.00%)` | |
----------------------------------------------------------------
[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
> Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (d0ee06e) into [master](https://codecov.io/gh/apache/hudi/commit/e302c6bc12c7eb764781898fdee8ee302ef4ec10?el=desc) (e302c6b) will **increase** coverage by `19.29%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
```diff
@@              Coverage Diff              @@
##             master    #2475       +/-   ##
=============================================
+ Coverage     50.18%   69.48%   +19.29%
+ Complexity     3050      358     -2692
=============================================
  Files           419       53      -366
  Lines         18931     1930    -17001
  Branches       1948      230     -1718
=============================================
- Hits           9500     1341     -8159
+ Misses         8656      456     -8200
+ Partials        775      133      -642
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `69.48% <ø> (+0.05%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...pache/hudi/hadoop/config/HoodieRealtimeConfig.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL2NvbmZpZy9Ib29kaWVSZWFsdGltZUNvbmZpZy5qYXZh) | | | |
| [.../apache/hudi/hive/MultiPartKeysValueExtractor.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvTXVsdGlQYXJ0S2V5c1ZhbHVlRXh0cmFjdG9yLmphdmE=) | | | |
| [...odie/hadoop/hive/HoodieCombineHiveInputFormat.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9jb20vdWJlci9ob29kaWUvaGFkb29wL2hpdmUvSG9vZGllQ29tYmluZUhpdmVJbnB1dEZvcm1hdC5qYXZh) | | | |
| [...e/hudi/common/table/log/HoodieLogFormatReader.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRSZWFkZXIuamF2YQ==) | | | |
| [...pache/hudi/cli/commands/FileSystemViewCommand.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL0ZpbGVTeXN0ZW1WaWV3Q29tbWFuZC5qYXZh) | | | |
| [...in/java/org/apache/hudi/common/model/BaseFile.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0Jhc2VGaWxlLmphdmE=) | | | |
| [...apache/hudi/common/model/HoodieRecordLocation.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVJlY29yZExvY2F0aW9uLmphdmE=) | | | |
| [...i/src/main/java/org/apache/hudi/cli/HoodieCLI.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL0hvb2RpZUNMSS5qYXZh) | | | |
| [...he/hudi/common/fs/SizeAwareFSDataOutputStream.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL1NpemVBd2FyZUZTRGF0YU91dHB1dFN0cmVhbS5qYXZh) | | | |
| [...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==) | | | |
| ... and [356 more](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree-more) | |
----------------------------------------------------------------
[GitHub] [hudi] rubenssoto commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
rubenssoto commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765712586
This is a great and important feature that makes Hudi easier for non-heavy users.
----------------------------------------------------------------
[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259
# [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
> Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (bca1656) into [master](https://codecov.io/gh/apache/hudi/commit/4c5b6923ccfaaa6616a934a3f690b1a795a42d41?el=desc) (4c5b692) will **increase** coverage by `10.41%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
```diff
@@              Coverage Diff              @@
##             master    #2475       +/-   ##
=============================================
+ Coverage     50.91%   61.32%   +10.41%
+ Complexity     3168      317     -2851
=============================================
  Files           433       53      -380
  Lines         19806     1929    -17877
  Branches       2032      229     -1803
=============================================
- Hits          10084     1183     -8901
+ Misses         8904      623     -8281
+ Partials        818      123      -695
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `61.32% <ø> (-8.20%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...ies/exception/HoodieSnapshotExporterException.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2V4Y2VwdGlvbi9Ib29kaWVTbmFwc2hvdEV4cG9ydGVyRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==) | `5.17% <0.00%> (-83.63%)` | `0.00% <0.00%> (-28.00%)` | |
| [...hudi/utilities/schema/JdbcbasedSchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9KZGJjYmFzZWRTY2hlbWFQcm92aWRlci5qYXZh) | `0.00% <0.00%> (-72.23%)` | `0.00% <0.00%> (-2.00%)` | |
| [...he/hudi/utilities/transform/AWSDmsTransformer.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3RyYW5zZm9ybS9BV1NEbXNUcmFuc2Zvcm1lci5qYXZh) | `0.00% <0.00%> (-66.67%)` | `0.00% <0.00%> (-2.00%)` | |
| [...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=) | `40.69% <0.00%> (-23.84%)` | `26.00% <0.00%> (-6.00%)` | |
| [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.50% <0.00%> (-0.36%)` | `50.00% <0.00%> (-1.00%)` | |
| [...mmon/table/log/HoodieUnMergedLogRecordScanner.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVVbk1lcmdlZExvZ1JlY29yZFNjYW5uZXIuamF2YQ==) | | | |
| [...pache/hudi/hadoop/HoodieColumnProjectionUtils.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0hvb2RpZUNvbHVtblByb2plY3Rpb25VdGlscy5qYXZh) | | | |
| [.../apache/hudi/hive/MultiPartKeysValueExtractor.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvTXVsdGlQYXJ0S2V5c1ZhbHVlRXh0cmFjdG9yLmphdmE=) | | | |
| [...meline/versioning/clean/CleanMetadataMigrator.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL3ZlcnNpb25pbmcvY2xlYW4vQ2xlYW5NZXRhZGF0YU1pZ3JhdG9yLmphdmE=) | | | |
| ... and [375 more](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree-more) | |
----------------------------------------------------------------
[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r579910252
##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##########
@@ -74,6 +78,19 @@ class DefaultSource extends RelationProvider
val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
log.info("Obtained hudi table path: " + tablePath)
+ val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+ val fsBackedTableMetadata =
+ new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+ val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
+ val onePartitionPath = if(!partitionPaths.isEmpty && !StringUtils.isEmpty(partitionPaths.get(0))) {
+ tablePath + "/" + partitionPaths.get(0)
+ } else {
+ tablePath
+ }
+ val dataPath = DataSourceUtils.getDataPath(tablePath, onePartitionPath)
+ log.info("Obtained hudi data path: " + dataPath)
+ parameters += "path" -> dataPath
Review comment:
I will try to find a way to infer this automatically while still meeting your needs.
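As a rough illustration of the inference under discussion, the Java sketch below derives a glob-style data path from a table path and one partition path by counting partition levels. The names `DataPathSketch` and `inferDataPath` are hypothetical and used only for illustration; this is not the implementation of `DataSourceUtils.getDataPath` in this PR, and mapping a non-partitioned table to `table/*` is an assumption.

```java
// Hypothetical sketch, not Hudi code: infer a glob path such as
// ".../table/*/*" from the table path and one known partition path.
final class DataPathSketch {

    // Returns tablePath followed by one "/*" per partition level,
    // plus one more "/*" to match the data files themselves.
    static String inferDataPath(String tablePath, String onePartitionPath) {
        String base = stripTrailingSlashes(tablePath);
        String partition = stripTrailingSlashes(onePartitionPath);
        if (!partition.equals(base) && !partition.startsWith(base + "/")) {
            throw new IllegalArgumentException(
                "Partition path is not under the table path: " + onePartitionPath);
        }

        // Count the non-empty segments below the table path.
        String relative = partition.substring(base.length());
        int depth = 0;
        for (String segment : relative.split("/")) {
            if (!segment.isEmpty()) {
                depth++;
            }
        }

        // depth = 0 (assumed non-partitioned) yields "table/*".
        StringBuilder glob = new StringBuilder(base);
        for (int i = 0; i <= depth; i++) {
            glob.append("/*");
        }
        return glob.toString();
    }

    private static String stripTrailingSlashes(String path) {
        int end = path.length();
        while (end > 1 && path.charAt(end - 1) == '/') {
            end--;
        }
        return path.substring(0, end);
    }
}
```

Under these assumptions, `inferDataPath("/data/table", "/data/table/2021/01/22")` yields `/data/table/*/*/*/*`, which matches the pattern in the PR description where a one-level partition layout uses `.../table/*/*`.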
----------------------------------------------------------------
[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r579907953
##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##########
@@ -74,6 +78,19 @@ class DefaultSource extends RelationProvider
val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
log.info("Obtained hudi table path: " + tablePath)
+ val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+ val fsBackedTableMetadata =
+ new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+ val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
+ val onePartitionPath = if(!partitionPaths.isEmpty && !StringUtils.isEmpty(partitionPaths.get(0))) {
+ tablePath + "/" + partitionPaths.get(0)
+ } else {
+ tablePath
+ }
+ val dataPath = DataSourceUtils.getDataPath(tablePath, onePartitionPath)
+ log.info("Obtained hudi data path: " + dataPath)
+ parameters += "path" -> dataPath
Review comment:
The path specified by the user will be overwritten by the automatically inferred data directory, so your needs cannot be met.
----------------------------------------------------------------
[GitHub] [hudi] vinothchandar commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774543006
Other than these two, I am good with this.
----------------------------------------------------------------
[GitHub] [hudi] wangxianghu commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
Posted by GitBox <gi...@apache.org>.
wangxianghu commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r564472557
##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##########
@@ -84,6 +86,39 @@ public static String getTablePath(FileSystem fs, Path[] userProvidedPaths) throw
throw new TableNotFoundException("Unable to find a hudi table for the user provided paths.");
}
+ public static Option<String> getOnePartitionPath(FileSystem fs, Path tablePath) throws IOException {
Review comment:
@teeyog Maybe we can check whether the table is partitioned via the `hoodie.datasource.write.keygenerator.class` param.
WDYT?
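A minimal sketch of what such a check might look like, in Java. The class `KeyGeneratorCheckSketch` and the method `looksPartitioned` are hypothetical names, and treating a missing option as "partitioned" is an assumption. Whether this write-side option is available on the read path is not established in this thread, and a custom key generator could still produce empty partition paths, so this is a heuristic at best.

```java
import java.util.Map;

// Hypothetical sketch, not Hudi code: guess whether a table is partitioned
// from the configured key generator class name.
final class KeyGeneratorCheckSketch {

    // Option key taken from the review comment above.
    static final String KEYGEN_CLASS_OPT = "hoodie.datasource.write.keygenerator.class";

    // Key generator that Hudi provides for non-partitioned tables.
    static final String NON_PARTITIONED_KEYGEN =
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator";

    static boolean looksPartitioned(Map<String, String> options) {
        String keyGen = options.get(KEYGEN_CLASS_OPT);
        if (keyGen == null || keyGen.trim().isEmpty()) {
            // Assumption: with no explicit key generator, treat the table as partitioned.
            return true;
        }
        return !NON_PARTITIONED_KEYGEN.equals(keyGen.trim());
    }
}
```

A caller could use the result to skip partition discovery when the table appears non-partitioned, and fall back to listing partitions otherwise.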
----------------------------------------------------------------