Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/01/22 15:18:53 UTC

[GitHub] [hudi] teeyog opened a new pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

teeyog opened a new pull request #2475:
URL: https://github.com/apache/hudi/pull/2475


   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   To read a Hudi table you must specify a path, but that path is not simply the tablePath of the table; it has to be derived from the partition directory structure, and different keyGenerators produce different partition directory structures. A table with one level of partition directories uses path=```.../table/*/*```, and a table with two levels uses path=```.../table/*/*/*```. Making the user work out the data path is troublesome, so with this change the user only needs to specify the tablePath: ```.../table```
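
A minimal sketch of the inference described above, assuming the data path is the table path followed by one `/*` per partition level plus one `/*` for the data files. `DataPathSketch` and `inferDataPath` are hypothetical names used only for illustration; they are not Hudi's API, and the handling of a non-partitioned table is an assumption.

```java
// Illustrative only: hypothetical helper, not Hudi's DataSourceUtils.getDataPath.
final class DataPathSketch {

    /** Builds a glob for the data files from the table path and any one partition path. */
    static String inferDataPath(String tablePath, String onePartitionPath) {
        String table = stripTrailingSlashes(tablePath);
        String partition = stripTrailingSlashes(onePartitionPath);
        if (!partition.equals(table) && !partition.startsWith(table + "/")) {
            throw new IllegalArgumentException("Partition path is not under the table path");
        }
        // Count the directory levels between the table path and the partition path.
        int depth = 0;
        for (String segment : partition.substring(table.length()).split("/")) {
            if (!segment.isEmpty()) {
                depth++;
            }
        }
        // One "/*" per partition level, plus one "/*" for the files themselves.
        StringBuilder glob = new StringBuilder(table);
        for (int i = 0; i <= depth; i++) {
            glob.append("/*");
        }
        return glob.toString();
    }

    private static String stripTrailingSlashes(String path) {
        int end = path.length();
        while (end > 1 && path.charAt(end - 1) == '/') {
            end--;
        }
        return path.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(inferDataPath("/data/table", "/data/table/2020"));    // /data/table/*/*
        System.out.println(inferDataPath("/data/table", "/data/table/2020/01")); // /data/table/*/*/*
        // Assumption: a non-partitioned table is passed as its own "partition".
        System.out.println(inferDataPath("/data/table", "/data/table"));         // /data/table/*
    }
}
```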
   
   In addition, once a Hudi table can be read by configuring path=```.../table```, it becomes more convenient to query it with Spark SQL. You only need to add one table property (tblproperties) to the Hive table metadata, ```spark.sql.sources.provider=hudi```, and reads of the Hive table are automatically converted to reads of the Hudi table.
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
   > Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (9c38d02) into [master](https://codecov.io/gh/apache/hudi/commit/e302c6bc12c7eb764781898fdee8ee302ef4ec10?el=desc) (e302c6b) will **increase** coverage by `19.24%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2475       +/-   ##
   =============================================
   + Coverage     50.18%   69.43%   +19.24%     
   + Complexity     3050      357     -2693     
   =============================================
     Files           419       53      -366     
     Lines         18931     1930    -17001     
     Branches       1948      230     -1718     
   =============================================
   - Hits           9500     1340     -8160     
   + Misses         8656      456     -8200     
   + Partials        775      134      -641     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.43% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...e/hudi/common/table/timeline/dto/FileGroupDTO.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL2R0by9GaWxlR3JvdXBEVE8uamF2YQ==) | | | |
   | [...ava/org/apache/hudi/common/model/HoodieRecord.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVJlY29yZC5qYXZh) | | | |
   | [...metadata/HoodieMetadataMergedLogRecordScanner.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvSG9vZGllTWV0YWRhdGFNZXJnZWRMb2dSZWNvcmRTY2FubmVyLmphdmE=) | | | |
   | [...g/apache/hudi/exception/HoodieRemoteException.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZVJlbW90ZUV4Y2VwdGlvbi5qYXZh) | | | |
   | [.../apache/hudi/common/table/log/HoodieLogFormat.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXQuamF2YQ==) | | | |
   | [...n/java/org/apache/hudi/cli/HoodieSplashScreen.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL0hvb2RpZVNwbGFzaFNjcmVlbi5qYXZh) | | | |
   | [.../hudi/hadoop/realtime/HoodieRealtimeFileSplit.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL0hvb2RpZVJlYWx0aW1lRmlsZVNwbGl0LmphdmE=) | | | |
   | [...org/apache/hudi/hadoop/realtime/RealtimeSplit.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL1JlYWx0aW1lU3BsaXQuamF2YQ==) | | | |
   | [...n/java/org/apache/hudi/common/HoodieCleanStat.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL0hvb2RpZUNsZWFuU3RhdC5qYXZh) | | | |
   | [...udi/common/table/log/block/HoodieCorruptBlock.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9ibG9jay9Ib29kaWVDb3JydXB0QmxvY2suamF2YQ==) | | | |
   | ... and [355 more](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree-more) | |
   








[GitHub] [hudi] teeyog commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
teeyog commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774639582


   > FSUtils.getAllPartitionPaths()
   
   The code has been changed to obtain the partition path via ```FSUtils.getAllPartitionPaths()```.





[GitHub] [hudi] vinothchandar commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774538085


   @teeyog could you call `FSUtils.getAllPartitionPaths()`, or add a new method `getNPartitionPaths()` to the class `FileSystemBackedTableMetadata` that uses your traversal and returns the first N partition paths?
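
One way to read the suggestion above is a traversal that stops as soon as N partitions have been found. The sketch below walks a local directory tree with `java.nio.file` instead of Hadoop's `FileSystem`, so it illustrates the idea rather than Hudi code. `FirstNPartitionsSketch` and `firstNPartitionPaths` are hypothetical names, and the marker file name `.hoodie_partition_metadata` and the skipped `.hoodie` folder are assumptions about the table layout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative only: a local-filesystem stand-in for a "first N partitions" lookup.
final class FirstNPartitionsSketch {
    // Assumed name of the marker file written into every partition directory.
    static final String MARKER = ".hoodie_partition_metadata";

    /** Returns at most n partition directories under tablePath, stopping early. */
    static List<Path> firstNPartitionPaths(Path tablePath, int n) throws IOException {
        List<Path> found = new ArrayList<>();
        collect(tablePath, n, found);
        return found;
    }

    private static void collect(Path dir, int n, List<Path> found) throws IOException {
        if (found.size() >= n) {
            return; // enough partitions found, stop listing
        }
        if (Files.exists(dir.resolve(MARKER))) {
            found.add(dir); // a directory holding the marker is a partition
            return;
        }
        List<Path> children;
        try (Stream<Path> listing = Files.list(dir)) {
            children = listing.filter(Files::isDirectory).sorted().collect(Collectors.toList());
        }
        for (Path child : children) {
            if (found.size() >= n) {
                return;
            }
            if (child.getFileName().toString().equals(".hoodie")) {
                continue; // assumed metadata folder, holds no partitions
            }
            collect(child, n, found);
        }
    }
}
```

With N set to 1 this behaves like the one-partition lookup in the PR: directories are listed only until the first partition turns up.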





[GitHub] [hudi] teeyog commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
teeyog commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774571361


   > > You only need to add tabproperties to the hive table metadata: spark.sql.sources.provider= hudi, you can automatically convert the hive table to the hudi table.
   > 
   > @teeyog can you please expand on this. is this related to this PR or a general comment?
   
   If the Hive table's tblproperties contain ```spark.sql.sources.provider=hudi```, Spark SQL resolves a read of the Hive table as follows:
   First step
   [https://github.com/apache/spark/blob/62be2483d7d78e61fd2f77929cf41c76eff17869/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L302](https://github.com/apache/spark/blob/62be2483d7d78e61fd2f77929cf41c76eff17869/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L302)
   
   Second step
   [https://github.com/apache/spark/blob/62be2483d7d78e61fd2f77929cf41c76eff17869/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L261](https://github.com/apache/spark/blob/62be2483d7d78e61fd2f77929cf41c76eff17869/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L261)
   
   The resolveRelation call in the second step goes directly to Hudi's DefaultSource, so reading the Hive table is automatically converted to reading the Hudi table.





[GitHub] [hudi] vinothchandar commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774542977


   >Of course it can, but it is specified by the parameter basePath=../2020/*/*
   
   I want to clarify this a bit. Do you mean `val READ_PATHS_OPT_KEY = "hoodie.datasource.read.paths"`?
   
   If I do the following, I see that we reset the `path` in options to `basePath + "/*/*/*/*"`. How does the Spark parquet source know to look only at 2020 and 2021, for example?
   
   ```
   val snapshotDF1 = spark.read.format("org.apache.hudi")
         .load(basePath + "/202*/*/*/*")
   ```
   





[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
   > Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (62cff1e) into [master](https://codecov.io/gh/apache/hudi/commit/43a0776c7c88a5f7beac6c8853db7e341810635a?el=desc) (43a0776) will **increase** coverage by `0.03%`.
   > The diff coverage is `93.33%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2475      +/-   ##
   ============================================
   + Coverage     51.14%   51.17%   +0.03%     
   - Complexity     3215     3219       +4     
   ============================================
     Files           438      438              
     Lines         20041    20055      +14     
     Branches       2064     2067       +3     
   ============================================
   + Hits          10250    10264      +14     
     Misses         8946     8946              
     Partials        845      845              
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `36.87% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `51.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiflink | `45.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `69.96% <93.33%> (+0.20%)` | `0.00 <0.00> (ø)` | |
   | hudisync | `48.61% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `66.49% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.51% <ø> (+0.05%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...src/main/scala/org/apache/hudi/DefaultSource.scala](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0RlZmF1bHRTb3VyY2Uuc2NhbGE=) | `85.41% <93.33%> (+1.27%)` | `20.00 <0.00> (+3.00)` | |
   | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.35% <0.00%> (+0.35%)` | `51.00% <0.00%> (+1.00%)` | |
   





[GitHub] [hudi] vinothchandar commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774539141


   > You only need to add tabproperties to the hive table metadata: spark.sql.sources.provider= hudi, you can automatically convert the hive table to the hudi table.
   
   @teeyog can you please expand on this? Is this related to this PR, or is it a general comment?





[GitHub] [hudi] vinothchandar closed pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
vinothchandar closed pull request #2475:
URL: https://github.com/apache/hudi/pull/2475


   





[GitHub] [hudi] zhedoubushishi commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
zhedoubushishi commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r569775940



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##########
@@ -84,6 +86,39 @@ public static String getTablePath(FileSystem fs, Path[] userProvidedPaths) throw
     throw new TableNotFoundException("Unable to find a hudi table for the user provided paths.");
   }
 
+  public static Option<String> getOnePartitionPath(FileSystem fs, Path tablePath) throws IOException {
+    // When the table is not partitioned
+    if (HoodiePartitionMetadata.hasPartitionMetadata(fs, tablePath)) {
+      return Option.of(tablePath.toString());
+    }
+    FileStatus[] statuses = fs.listStatus(tablePath);
+    for (FileStatus status : statuses) {
+      if (status.isDirectory()) {
+        if (HoodiePartitionMetadata.hasPartitionMetadata(fs, status.getPath())) {
+          return Option.of(status.getPath().toString());
+        } else {
+          Option<String> partitionPath = getOnePartitionPath(fs, status.getPath());
+          if (partitionPath.isPresent()) {
+            return partitionPath;

Review comment:
       Yes, I agree it would be better to use ```HoodieTableMetadata``` to avoid ```fs.listStatus```. But what about tables that do not have the metadata feature enabled? Would it take a very long time for a table with many partitions?
   
   Also, ```hoodie_partition_metadata``` stores a parameter called ```partitionDepth```; could we take advantage of it?
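
A rough sketch of how `partitionDepth` could feed the inference, assuming the partition metadata file is a Java properties file with a `partitionDepth` entry (the `commitTime` line in the sample is also an assumption about its contents). `PartitionDepthSketch` and both of its methods are hypothetical and are not part of Hudi.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Illustrative only: derives the read glob from an assumed partitionDepth property.
final class PartitionDepthSketch {

    /** Parses the assumed properties-style contents of a partition metadata file. */
    static int readPartitionDepth(String metadataFileContents) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(metadataFileContents));
        String depth = props.getProperty("partitionDepth");
        if (depth == null) {
            throw new IllegalArgumentException("partitionDepth is missing");
        }
        return Integer.parseInt(depth.trim());
    }

    /** One "/*" per partition level, plus one "/*" for the data files. */
    static String globForDepth(String tablePath, int partitionDepth) {
        StringBuilder glob = new StringBuilder(tablePath);
        for (int i = 0; i <= partitionDepth; i++) {
            glob.append("/*");
        }
        return glob.toString();
    }

    public static void main(String[] args) throws IOException {
        String contents = "commitTime=20210122151853\npartitionDepth=3\n";
        int depth = readPartitionDepth(contents);
        System.out.println(globForDepth("/data/table", depth)); // /data/table/*/*/*/*
    }
}
```

The metadata file lives inside a partition directory, so one partition still has to be located before its depth can be read.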







[GitHub] [hudi] teeyog commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
teeyog commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774578774


   > > Of course it can, but it is specified by the parameter basePath=../2020/*/*
   > 
   > I want to clarify this a bit. do you mean `val READ_PATHS_OPT_KEY = "hoodie.datasource.read.paths"` ?
   > 
   > if I do the following, I see that we reset the `path` in options to `basePath + "/*/*/*/*`. How does Spark parquet source know to only look for 2020 and 2021 for e,g?
   > 
   > ```
   > val snapshotDF1 = spark.read.format("org.apache.hudi")
   >       .load(basePath + "/202*/*/*/*")
   > ```
   
   I understand what you mean. The case you describe is indeed not supported, because the automatically inferred data path overrides the path configured by the user. If you only need the 2020 and 2021 data, you can apply a filter on the DataFrame afterwards. Alternatively, I could check whether the path specified by the user contains `*` and, if it does, skip the automatic inference. What do you think?
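
The second option could look roughly like the sketch below. `GlobCheckSketch`, `containsGlob` and `resolveReadPath` are hypothetical names, and treating `?`, `[` and `{` as glob characters alongside `*` is an assumption that goes beyond what the comment proposes.

```java
// Illustrative only: keep a user-supplied glob, infer the data path otherwise.
final class GlobCheckSketch {

    /** True if the path contains a character that glob matching treats as special. */
    static boolean containsGlob(String path) {
        for (char c : path.toCharArray()) {
            // Assumption: '?', '[' and '{' are checked in addition to '*'.
            if (c == '*' || c == '?' || c == '[' || c == '{') {
                return true;
            }
        }
        return false;
    }

    /** Returns the user's path when it is already a glob, else the inferred data path. */
    static String resolveReadPath(String userPath, String inferredDataPath) {
        return containsGlob(userPath) ? userPath : inferredDataPath;
    }

    public static void main(String[] args) {
        String inferred = "/data/table/*/*/*/*";
        System.out.println(resolveReadPath("/data/table/202*/*/*/*", inferred)); // /data/table/202*/*/*/*
        System.out.println(resolveReadPath("/data/table", inferred));            // /data/table/*/*/*/*
    }
}
```

Under this rule a read of `basePath + "/202*/*/*/*"` keeps its narrower glob, and only a bare table path triggers the inference.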





[GitHub] [hudi] vinothchandar commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r577755265



##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##########
@@ -74,6 +78,19 @@ class DefaultSource extends RelationProvider
     val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
     log.info("Obtained hudi table path: " + tablePath)
 
+    val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+    val fsBackedTableMetadata =
+      new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+    val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
+    val onePartitionPath = if(!partitionPaths.isEmpty && !StringUtils.isEmpty(partitionPaths.get(0))) {
+        tablePath + "/" + partitionPaths.get(0)
+      } else {
+        tablePath
+      }
+    val dataPath = DataSourceUtils.getDataPath(tablePath, onePartitionPath)
+    log.info("Obtained hudi data path: " + dataPath)
+    parameters += "path" -> dataPath

Review comment:
       @teeyog Sorry, it's still not clear to me. I supplied a globbed path `2015/*/*/*`, and even that gets overridden with `path -> tablePath/*/*/*/*`.
   
   Won't this incur reading all partitions under the tablePath, as opposed to only 2015's?
   
   ![image](https://user-images.githubusercontent.com/1179324/108234541-bde4fb00-70f9-11eb-8611-58579636b51b.png)
   







[GitHub] [hudi] lw309637554 commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r584307500



##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##########
@@ -84,6 +88,26 @@ class DefaultSource extends RelationProvider
     val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
     log.info("Obtained hudi table path: " + tablePath)
 
+    if (path.nonEmpty) {
+      val _path = path.get.stripSuffix("/")
+      val pathTmp = new Path(_path).makeQualified(fs.getUri, fs.getWorkingDirectory)
+      // If the user specifies the table path, the data path is automatically inferred
+      if (pathTmp.toString.equals(tablePath)) {
+        val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+        val fsBackedTableMetadata =
+          new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+        val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths

Review comment:
       @teeyog Hello. The partition layout is now inferred by getting all the partition paths from the metadata table.
   The partitioning is set through hoodie.datasource.write.partitionpath.field when the Hudi table is written. Could we persist hoodie.datasource.write.partitionpath.field in the table metadata, so that a read only has to fetch that property instead of getting all the partition paths? cc @vinothchandar







[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r569889546



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##########
@@ -84,6 +86,39 @@ public static String getTablePath(FileSystem fs, Path[] userProvidedPaths) throw
     throw new TableNotFoundException("Unable to find a hudi table for the user provided paths.");
   }
 
+  public static Option<String> getOnePartitionPath(FileSystem fs, Path tablePath) throws IOException {
+    // When the table is not partitioned
+    if (HoodiePartitionMetadata.hasPartitionMetadata(fs, tablePath)) {
+      return Option.of(tablePath.toString());
+    }
+    FileStatus[] statuses = fs.listStatus(tablePath);
+    for (FileStatus status : statuses) {
+      if (status.isDirectory()) {
+        if (HoodiePartitionMetadata.hasPartitionMetadata(fs, status.getPath())) {
+          return Option.of(status.getPath().toString());
+        } else {
+          Option<String> partitionPath = getOnePartitionPath(fs, status.getPath());
+          if (partitionPath.isPresent()) {
+            return partitionPath;

Review comment:
       Thank you for your review. This method of obtaining a partition is very fast: it returns as soon as one partition path has been found. FSUtils.getAllPartitionPaths obtains all partition paths, which is very time-consuming.







[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
   > Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (9c38d02) into [master](https://codecov.io/gh/apache/hudi/commit/e302c6bc12c7eb764781898fdee8ee302ef4ec10?el=desc) (e302c6b) will **decrease** coverage by `40.49%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master   #2475       +/-   ##
   ============================================
   - Coverage     50.18%   9.68%   -40.50%     
   + Complexity     3050      48     -3002     
   ============================================
     Files           419      53      -366     
     Lines         18931    1930    -17001     
     Branches       1948     230     -1718     
   ============================================
   - Hits           9500     187     -9313     
   + Misses         8656    1730     -6926     
   + Partials        775      13      -762     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <ø> (-59.75%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | ... and [395 more](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r584401119



##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##########
@@ -84,6 +88,26 @@ class DefaultSource extends RelationProvider
     val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
     log.info("Obtained hudi table path: " + tablePath)
 
+    if (path.nonEmpty) {
+      val _path = path.get.stripSuffix("/")
+      val pathTmp = new Path(_path).makeQualified(fs.getUri, fs.getWorkingDirectory)
+      // If the user specifies the table path, the data path is automatically inferred
+      if (pathTmp.toString.equals(tablePath)) {
+        val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+        val fsBackedTableMetadata =
+          new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+        val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths

Review comment:
       @lw309637554 Thank you for your review. The previous way of specifying the path to the hudi table still works: the path can be set through configuration instead of being inferred.







[GitHub] [hudi] hudi-bot commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-874357657


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "62cff1e1b984c800d44e8f33df23f9ccb9fa4c97",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "62cff1e1b984c800d44e8f33df23f9ccb9fa4c97",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 62cff1e1b984c800d44e8f33df23f9ccb9fa4c97 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
   > Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (8d073b6) into [master](https://codecov.io/gh/apache/hudi/commit/048633da1a913a05252b1b5dea0b3d40d75c81b4?el=desc) (048633d) will **increase** coverage by `19.25%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2475       +/-   ##
   =============================================
   + Coverage     50.17%   69.43%   +19.25%     
   + Complexity     3050      357     -2693     
   =============================================
     Files           419       53      -366     
     Lines         18931     1930    -17001     
     Branches       1948      230     -1718     
   =============================================
   - Hits           9498     1340     -8158     
   + Misses         8657      456     -8201     
   + Partials        776      134      -642     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.43% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...va/org/apache/hudi/metadata/BaseTableMetadata.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvQmFzZVRhYmxlTWV0YWRhdGEuamF2YQ==) | | | |
   | [.../apache/hudi/common/model/HoodieRecordPayload.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVJlY29yZFBheWxvYWQuamF2YQ==) | | | |
   | [...metadata/HoodieMetadataMergedLogRecordScanner.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvSG9vZGllTWV0YWRhdGFNZXJnZWRMb2dSZWNvcmRTY2FubmVyLmphdmE=) | | | |
   | [...ain/java/org/apache/hudi/cli/utils/CommitUtil.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL3V0aWxzL0NvbW1pdFV0aWwuamF2YQ==) | | | |
   | [...che/hudi/common/table/timeline/HoodieTimeline.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL0hvb2RpZVRpbWVsaW5lLmphdmE=) | | | |
   | [...i/common/table/log/block/HoodieHFileDataBlock.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9ibG9jay9Ib29kaWVIRmlsZURhdGFCbG9jay5qYXZh) | | | |
   | [...e/hudi/common/engine/LocalTaskContextSupplier.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2VuZ2luZS9Mb2NhbFRhc2tDb250ZXh0U3VwcGxpZXIuamF2YQ==) | | | |
   | [...ava/org/apache/hudi/payload/AWSDmsAvroPayload.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvcGF5bG9hZC9BV1NEbXNBdnJvUGF5bG9hZC5qYXZh) | | | |
   | [...rg/apache/hudi/cli/commands/SavepointsCommand.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1NhdmVwb2ludHNDb21tYW5kLmphdmE=) | | | |
   | [...ache/hudi/common/table/timeline/TimelineUtils.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL1RpbWVsaW5lVXRpbHMuamF2YQ==) | | | |
   | ... and [355 more](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree-more) | |
   





[GitHub] [hudi] vinothchandar commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r571483611



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##########
@@ -84,6 +86,39 @@ public static String getTablePath(FileSystem fs, Path[] userProvidedPaths) throw
     throw new TableNotFoundException("Unable to find a hudi table for the user provided paths.");
   }
 
+  public static Option<String> getOnePartitionPath(FileSystem fs, Path tablePath) throws IOException {
+    // When the table is not partitioned
+    if (HoodiePartitionMetadata.hasPartitionMetadata(fs, tablePath)) {
+      return Option.of(tablePath.toString());
+    }
+    FileStatus[] statuses = fs.listStatus(tablePath);
+    for (FileStatus status : statuses) {
+      if (status.isDirectory()) {
+        if (HoodiePartitionMetadata.hasPartitionMetadata(fs, status.getPath())) {
+          return Option.of(status.getPath().toString());
+        } else {
+          Option<String> partitionPath = getOnePartitionPath(fs, status.getPath());
+          if (partitionPath.isPresent()) {
+            return partitionPath;

Review comment:
       @teeyog we could even add a new overload/method for this under the `HoodieTableMetadata` interface, but it would be really good to keep all of this under the interface. With the metadata table, it's actually okay to call `getAllPartitionPaths()`; it's pretty fast.







[GitHub] [hudi] vinothchandar commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-888755262


   Closing in favor of #3353 






[GitHub] [hudi] teeyog commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
teeyog commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774640491


   > @teeyog if you could call FSUtils.getAllPartitionPaths() or add a new method `getNPartitionPaths()` and return the first N such partition paths using your traversal in the class `FileSystemBackedTableMetadata`
   
   The code now obtains the partition path via ```FSUtils.getAllPartitionPaths()```.





[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
   > Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (d0ee06e) into [master](https://codecov.io/gh/apache/hudi/commit/e302c6bc12c7eb764781898fdee8ee302ef4ec10?el=desc) (e302c6b) will **increase** coverage by `0.02%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2475      +/-   ##
   ============================================
   + Coverage     50.18%   50.20%   +0.02%     
   - Complexity     3050     3051       +1     
   ============================================
     Files           419      419              
     Lines         18931    18935       +4     
     Branches       1948     1948              
   ============================================
   + Hits           9500     9506       +6     
   + Misses         8656     8654       -2     
     Partials        775      775              
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.21% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `51.47% <ø> (-0.03%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `0.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `66.16% <100.00%> (+0.31%)` | `0.00 <0.00> (ø)` | |
   | hudisync | `48.61% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `66.49% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.48% <ø> (+0.05%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...src/main/scala/org/apache/hudi/DefaultSource.scala](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0RlZmF1bHRTb3VyY2Uuc2NhbGE=) | `89.39% <100.00%> (+0.68%)` | `15.00 <0.00> (ø)` | |
   | [...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==) | `78.12% <0.00%> (-1.57%)` | `26.00% <0.00%> (ø%)` | |
   | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.86% <0.00%> (+0.35%)` | `51.00% <0.00%> (+1.00%)` | |
   | [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrU3FsV3JpdGVyLnNjYWxh) | `49.64% <0.00%> (+1.06%)` | `0.00% <0.00%> (ø%)` | |
   





[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
   > Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (0e4f1ee) into [master](https://codecov.io/gh/apache/hudi/commit/43a0776c7c88a5f7beac6c8853db7e341810635a?el=desc) (43a0776) will **decrease** coverage by `41.45%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master   #2475       +/-   ##
   ============================================
   - Coverage     51.14%   9.69%   -41.46%     
   + Complexity     3215      48     -3167     
   ============================================
     Files           438      53      -385     
     Lines         20041    1929    -18112     
     Branches       2064     230     -1834     
   ============================================
   - Hits          10250     187    -10063     
   + Misses         8946    1729     -7217     
   + Partials        845      13      -832     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.69% <ø> (-59.78%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | ... and [414 more](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree-more) | |
   





[GitHub] [hudi] vinothchandar commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r569761896



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##########
@@ -84,6 +86,39 @@ public static String getTablePath(FileSystem fs, Path[] userProvidedPaths) throw
     throw new TableNotFoundException("Unable to find a hudi table for the user provided paths.");
   }
 
+  public static Option<String> getOnePartitionPath(FileSystem fs, Path tablePath) throws IOException {
+    // When the table is not partitioned
+    if (HoodiePartitionMetadata.hasPartitionMetadata(fs, tablePath)) {
+      return Option.of(tablePath.toString());
+    }
+    FileStatus[] statuses = fs.listStatus(tablePath);
+    for (FileStatus status : statuses) {
+      if (status.isDirectory()) {
+        if (HoodiePartitionMetadata.hasPartitionMetadata(fs, status.getPath())) {
+          return Option.of(status.getPath().toString());
+        } else {
+          Option<String> partitionPath = getOnePartitionPath(fs, status.getPath());
+          if (partitionPath.isPresent()) {
+            return partitionPath;

Review comment:
       So, I am wondering if we can use the `HoodieTableMetadata` abstraction to read a partition path, instead of relying on listing alone. We are trying to avoid introducing any single-point listings. There is already a method to get all partition paths, `FSUtils.getAllPartitionPaths()`; let's just use that for now? I am thinking that it will be a little bit of an overkill to list all partition paths without the metadata table.

##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##########
@@ -84,6 +86,39 @@ public static String getTablePath(FileSystem fs, Path[] userProvidedPaths) throw
     throw new TableNotFoundException("Unable to find a hudi table for the user provided paths.");
   }
 
+  public static Option<String> getOnePartitionPath(FileSystem fs, Path tablePath) throws IOException {
+    // When the table is not partitioned
+    if (HoodiePartitionMetadata.hasPartitionMetadata(fs, tablePath)) {
+      return Option.of(tablePath.toString());
+    }
+    FileStatus[] statuses = fs.listStatus(tablePath);
+    for (FileStatus status : statuses) {
+      if (status.isDirectory()) {
+        if (HoodiePartitionMetadata.hasPartitionMetadata(fs, status.getPath())) {
+          return Option.of(status.getPath().toString());
+        } else {
+          Option<String> partitionPath = getOnePartitionPath(fs, status.getPath());
+          if (partitionPath.isPresent()) {
+            return partitionPath;

Review comment:
       This short-circuits the recursive stack once we get one partition path, I guess.
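
The short-circuit behaviour discussed above can be illustrated outside Hudi. The following is a minimal sketch, not the PR's code: it uses `java.nio.file` instead of the Hadoop `FileSystem` API, the class and method names (`OnePartitionPathSketch`, `findOnePartitionPath`) are hypothetical, and it assumes a partition directory is recognised by a `.hoodie_partition_metadata` marker file.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical stand-alone sketch; the PR itself works against Hadoop's FileSystem.
final class OnePartitionPathSketch {

    // Assumed name of the marker file that identifies a partition directory.
    static final String PARTITION_MARKER = ".hoodie_partition_metadata";

    // Depth-first search that returns the first directory containing the marker.
    static Optional<Path> findOnePartitionPath(Path dir) throws IOException {
        // A marker directly under dir covers the non-partitioned table case.
        if (Files.exists(dir.resolve(PARTITION_MARKER))) {
            return Optional.of(dir);
        }
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                if (Files.isDirectory(child)) {
                    Optional<Path> found = findOnePartitionPath(child);
                    if (found.isPresent()) {
                        // Short-circuit: stop listing siblings once one partition is found.
                        return found;
                    }
                }
            }
        }
        return Optional.empty();
    }
}
```

Because every recursive call returns as soon as a match is present, the remaining siblings at each level are never listed. The worst case (no marker anywhere) still walks the whole tree, which is the cost the reviewer is concerned about.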







[GitHub] [hudi] teeyog removed a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
teeyog removed a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774639582


   > FSUtils.getAllPartitionPaths()
   
   The code now obtains the partition path via ```FSUtils.getAllPartitionPaths()```.





[GitHub] [hudi] vinothchandar commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-769517183


   @zhedoubushishi @umehrot2 could you please take a first pass





[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259









[GitHub] [hudi] teeyog commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
teeyog commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-772964324


   > Thanks for this! Seems very useful.
   > 
   > One thing I wanted to understand was whether a user can still do `basePath/2020/*/*` and have only the parquet files for 2020 read out, for example?
   
   Yes, that still works, but it has to be specified through the parameter ```basePath=../2020/*/*```
   





[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r580728819



##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##########
@@ -74,6 +78,19 @@ class DefaultSource extends RelationProvider
     val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
     log.info("Obtained hudi table path: " + tablePath)
 
+    val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+    val fsBackedTableMetadata =
+      new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+    val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
+    val onePartitionPath = if(!partitionPaths.isEmpty && !StringUtils.isEmpty(partitionPaths.get(0))) {
+        tablePath + "/" + partitionPaths.get(0)
+      } else {
+        tablePath
+      }
+    val dataPath = DataSourceUtils.getDataPath(tablePath, onePartitionPath)
+    log.info("Obtained hudi data path: " + dataPath)
+    parameters += "path" -> dataPath

Review comment:
       @vinothchandar This now supports your use case. If the path specified by the user is the table path, the data path is inferred automatically; otherwise no inference is performed.
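
The rule described in this comment can be sketched in plain Java. This is an illustration under stated assumptions, not the body of `DataSourceUtils.getDataPath`, which is not shown in this thread. The names `DataPathSketch`, `isTablePath` and `inferDataPath` are hypothetical, and the glob shape (one `/*` per partition level plus one for the files) is taken from the PR description, where a first-level partition uses `.../table/*/*` and a second-level partition uses `.../table/*/*/*`.

```java
// Hypothetical sketch of the inference rule; all names here are illustrative only.
final class DataPathSketch {

    // Drops trailing slashes so "/data/table/" and "/data/table" compare equal.
    static String stripTrailingSlashes(String path) {
        String result = path;
        while (result.length() > 1 && result.endsWith("/")) {
            result = result.substring(0, result.length() - 1);
        }
        return result;
    }

    // Inference only applies when the user-supplied path is the table path itself.
    static boolean isTablePath(String userPath, String tablePath) {
        return stripTrailingSlashes(userPath).equals(stripTrailingSlashes(tablePath));
    }

    // Builds a glob with one "/*" per partition level plus one for the data files.
    static String inferDataPath(String tablePath, String onePartitionPath) {
        String table = stripTrailingSlashes(tablePath);
        String partition = stripTrailingSlashes(onePartitionPath);
        if (!partition.equals(table) && !partition.startsWith(table + "/")) {
            throw new IllegalArgumentException("partition path is not under the table path");
        }
        int depth = 0;
        for (String segment : partition.substring(table.length()).split("/")) {
            if (!segment.isEmpty()) {
                depth++;
            }
        }
        StringBuilder glob = new StringBuilder(table);
        for (int i = 0; i <= depth; i++) {
            glob.append("/*");
        }
        return glob.toString();
    }
}
```

With these assumptions, a user path that already contains a glob, such as `.../table/2020/*/*`, fails the `isTablePath` check and is passed through unchanged, which matches the behaviour described in the comment.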







[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
   > Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (62cff1e) into [master](https://codecov.io/gh/apache/hudi/commit/43a0776c7c88a5f7beac6c8853db7e341810635a?el=desc) (43a0776) will **increase** coverage by `0.03%`.
   > The diff coverage is `93.33%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2475      +/-   ##
   ============================================
   + Coverage     51.14%   51.17%   +0.03%     
   - Complexity     3215     3219       +4     
   ============================================
     Files           438      438              
     Lines         20041    20055      +14     
     Branches       2064     2067       +3     
   ============================================
   + Hits          10250    10264      +14     
     Misses         8946     8946              
     Partials        845      845              
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `36.87% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `51.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiflink | `45.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `69.96% <93.33%> (+0.20%)` | `0.00 <0.00> (ø)` | |
   | hudisync | `48.61% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `66.49% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.51% <ø> (+0.05%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...src/main/scala/org/apache/hudi/DefaultSource.scala](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0RlZmF1bHRTb3VyY2Uuc2NhbGE=) | `85.41% <93.33%> (+1.27%)` | `20.00 <0.00> (+3.00)` | |
   | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.35% <0.00%> (+0.35%)` | `51.00% <0.00%> (+1.00%)` | |
   





[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
   > Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (d0ee06e) into [master](https://codecov.io/gh/apache/hudi/commit/e302c6bc12c7eb764781898fdee8ee302ef4ec10?el=desc) (e302c6b) will **increase** coverage by `19.29%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2475       +/-   ##
   =============================================
   + Coverage     50.18%   69.48%   +19.29%     
   + Complexity     3050      358     -2692     
   =============================================
     Files           419       53      -366     
     Lines         18931     1930    -17001     
     Branches       1948      230     -1718     
   =============================================
   - Hits           9500     1341     -8159     
   + Misses         8656      456     -8200     
   + Partials        775      133      -642     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.48% <ø> (+0.05%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...pache/hudi/hadoop/config/HoodieRealtimeConfig.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL2NvbmZpZy9Ib29kaWVSZWFsdGltZUNvbmZpZy5qYXZh) | | | |
   | [.../apache/hudi/hive/MultiPartKeysValueExtractor.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvTXVsdGlQYXJ0S2V5c1ZhbHVlRXh0cmFjdG9yLmphdmE=) | | | |
   | [...odie/hadoop/hive/HoodieCombineHiveInputFormat.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9jb20vdWJlci9ob29kaWUvaGFkb29wL2hpdmUvSG9vZGllQ29tYmluZUhpdmVJbnB1dEZvcm1hdC5qYXZh) | | | |
   | [...e/hudi/common/table/log/HoodieLogFormatReader.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRSZWFkZXIuamF2YQ==) | | | |
   | [...pache/hudi/cli/commands/FileSystemViewCommand.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL0ZpbGVTeXN0ZW1WaWV3Q29tbWFuZC5qYXZh) | | | |
   | [...in/java/org/apache/hudi/common/model/BaseFile.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0Jhc2VGaWxlLmphdmE=) | | | |
   | [...apache/hudi/common/model/HoodieRecordLocation.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVJlY29yZExvY2F0aW9uLmphdmE=) | | | |
   | [...i/src/main/java/org/apache/hudi/cli/HoodieCLI.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL0hvb2RpZUNMSS5qYXZh) | | | |
   | [...he/hudi/common/fs/SizeAwareFSDataOutputStream.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL1NpemVBd2FyZUZTRGF0YU91dHB1dFN0cmVhbS5qYXZh) | | | |
   | [...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==) | | | |
   | ... and [356 more](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree-more) | |
   





[GitHub] [hudi] rubenssoto commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765712586


   This is a great and important feature that makes Hudi easier to use for non-heavy users.





[GitHub] [hudi] codecov-io edited a comment on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-765495259


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=h1) Report
   > Merging [#2475](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=desc) (bca1656) into [master](https://codecov.io/gh/apache/hudi/commit/4c5b6923ccfaaa6616a934a3f690b1a795a42d41?el=desc) (4c5b692) will **increase** coverage by `10.41%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2475/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2475       +/-   ##
   =============================================
   + Coverage     50.91%   61.32%   +10.41%     
   + Complexity     3168      317     -2851     
   =============================================
     Files           433       53      -380     
     Lines         19806     1929    -17877     
     Branches       2032      229     -1803     
   =============================================
   - Hits          10084     1183     -8901     
   + Misses         8904      623     -8281     
   + Partials        818      123      -695     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `61.32% <ø> (-8.20%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2475?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...ies/exception/HoodieSnapshotExporterException.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2V4Y2VwdGlvbi9Ib29kaWVTbmFwc2hvdEV4cG9ydGVyRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==) | `5.17% <0.00%> (-83.63%)` | `0.00% <0.00%> (-28.00%)` | |
   | [...hudi/utilities/schema/JdbcbasedSchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9KZGJjYmFzZWRTY2hlbWFQcm92aWRlci5qYXZh) | `0.00% <0.00%> (-72.23%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...he/hudi/utilities/transform/AWSDmsTransformer.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3RyYW5zZm9ybS9BV1NEbXNUcmFuc2Zvcm1lci5qYXZh) | `0.00% <0.00%> (-66.67%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=) | `40.69% <0.00%> (-23.84%)` | `26.00% <0.00%> (-6.00%)` | |
   | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.50% <0.00%> (-0.36%)` | `50.00% <0.00%> (-1.00%)` | |
   | [...mmon/table/log/HoodieUnMergedLogRecordScanner.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVVbk1lcmdlZExvZ1JlY29yZFNjYW5uZXIuamF2YQ==) | | | |
   | [...pache/hudi/hadoop/HoodieColumnProjectionUtils.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0hvb2RpZUNvbHVtblByb2plY3Rpb25VdGlscy5qYXZh) | | | |
   | [.../apache/hudi/hive/MultiPartKeysValueExtractor.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvTXVsdGlQYXJ0S2V5c1ZhbHVlRXh0cmFjdG9yLmphdmE=) | | | |
   | [...meline/versioning/clean/CleanMetadataMigrator.java](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL3ZlcnNpb25pbmcvY2xlYW4vQ2xlYW5NZXRhZGF0YU1pZ3JhdG9yLmphdmE=) | | | |
   | ... and [375 more](https://codecov.io/gh/apache/hudi/pull/2475/diff?src=pr&el=tree-more) | |
   





[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r579910252



##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##########
@@ -74,6 +78,19 @@ class DefaultSource extends RelationProvider
     val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
     log.info("Obtained hudi table path: " + tablePath)
 
+    val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+    val fsBackedTableMetadata =
+      new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+    val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
+    val onePartitionPath = if(!partitionPaths.isEmpty && !StringUtils.isEmpty(partitionPaths.get(0))) {
+        tablePath + "/" + partitionPaths.get(0)
+      } else {
+        tablePath
+      }
+    val dataPath = DataSourceUtils.getDataPath(tablePath, onePartitionPath)
+    log.info("Obtained hudi data path: " + dataPath)
+    parameters += "path" -> dataPath

Review comment:
       I will try to find a way to infer this automatically while still meeting your needs.







[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r579907953



##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##########
@@ -74,6 +78,19 @@ class DefaultSource extends RelationProvider
     val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
     log.info("Obtained hudi table path: " + tablePath)
 
+    val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+    val fsBackedTableMetadata =
+      new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+    val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
+    val onePartitionPath = if(!partitionPaths.isEmpty && !StringUtils.isEmpty(partitionPaths.get(0))) {
+        tablePath + "/" + partitionPaths.get(0)
+      } else {
+        tablePath
+      }
+    val dataPath = DataSourceUtils.getDataPath(tablePath, onePartitionPath)
+    log.info("Obtained hudi data path: " + dataPath)
+    parameters += "path" -> dataPath

Review comment:
       As written, the path specified by the user is overwritten by the automatically inferred data directory, so your use case cannot be met.







[GitHub] [hudi] vinothchandar commented on pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#issuecomment-774543006


   Other than these two, I am good with this.





[GitHub] [hudi] wangxianghu commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

Posted by GitBox <gi...@apache.org>.
wangxianghu commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r564472557



##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##########
@@ -84,6 +86,39 @@ public static String getTablePath(FileSystem fs, Path[] userProvidedPaths) throw
     throw new TableNotFoundException("Unable to find a hudi table for the user provided paths.");
   }
 
+  public static Option<String> getOnePartitionPath(FileSystem fs, Path tablePath) throws IOException {

Review comment:
       @teeyog Maybe we can check whether the table is partitioned through the `hoodie.datasource.write.keygenerator.class` param.
   WDYT?
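
A rough sketch of what such a check might look like, in plain Java with no Hudi dependencies. The class `PartitionCheck` and the method `isPartitioned` are hypothetical. The sketch assumes the key generator class name is available in a properties map at read time (this thread does not establish that) and that `org.apache.hudi.keygen.NonpartitionedKeyGenerator` is the only key generator that marks a table as non-partitioned.

```java
import java.util.Map;
import java.util.Optional;

final class PartitionCheck {
  // Config key quoted in the review comment.
  static final String KEYGEN_CLASS_KEY = "hoodie.datasource.write.keygenerator.class";
  // Assumption: this is the only key generator that produces a non-partitioned table.
  static final String NON_PARTITIONED_KEYGEN = "org.apache.hudi.keygen.NonpartitionedKeyGenerator";

  /**
   * Returns Optional.empty() when the key generator is not configured, so the
   * caller can fall back to inspecting the directory layout instead of guessing.
   */
  static Optional<Boolean> isPartitioned(Map<String, String> props) {
    String keyGen = props.get(KEYGEN_CLASS_KEY);
    if (keyGen == null || keyGen.trim().isEmpty()) {
      return Optional.empty();
    }
    return Optional.of(!NON_PARTITIONED_KEYGEN.equals(keyGen.trim()));
  }
}
```

Returning an empty `Optional` for a missing key keeps the "unknown" case separate from "not partitioned", which matters if the option was only set by the writer and is not visible to the reader.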
   



