You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/01/07 22:48:30 UTC

[GitHub] [hudi] umehrot2 opened a new pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

umehrot2 opened a new pull request #2417:
URL: https://github.com/apache/hudi/pull/2417


   ## What is the purpose of the pull request
   
   In HUDI-1510 we moved `HoodieEngineContext` so it can be used to parallelize fetching of partition paths from `FileSystemBackedMetadata` implementation. This PR actually uses it to do the parallelization.
   
   In addition, we fix `HoodieTableMetadata` to create `HoodieBackedTableMetadata` or `FileSystemBackedTableMetadata` depending on whether metadata is enabled or not.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] umehrot2 commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554270231



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieMREngineContext.java
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.engine;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.function.SerializableConsumer;
+import org.apache.hudi.common.function.SerializableFunction;
+import org.apache.hudi.common.function.SerializablePairFunction;
+import org.apache.hudi.common.util.Option;
+
+import org.apache.hudi.common.util.collection.Pair;
+
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static java.util.stream.Collectors.toList;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingFlatMapWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingForeachWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapToPairWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapWrapper;
+
+/**
+ * A java based engine context that can be used from map-reduce tasks executing in query engines like
+ * spark, hive and presto.
+ */
+public final class HoodieMREngineContext extends HoodieEngineContext {

Review comment:
       Yes but not all commands need spark context. It is created depending on the commands which truly need spark context.
   
   Besides this, as discussed offline there is another issue which I encountered when I started to fetch `hadoopConf` from `engineContext` based on your other comment. It results in `NPE` at places when metadata table is created inside spark executor tasks because `engineContext` is not serializable and ends up null at executors. For such scenarios in `HoodieTable` where we create metadata table I will have to add a check to create a `HoodieMREngineContext` in case engine context is `null`. Thus `hudi-client` needs access to this as well. So based on what we agreed to, we need something like a `HoodieLocalEngineContext` that can be used at all these places and can sit in `hudi-common`.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();
+    pathsToList.add(new Path(datasetBasePath));
+    List<String> partitionPaths = new ArrayList<>();
+
+    // TODO: Get the parallelism from HoodieWriteConfig
+    final int fileListingParallelism = 1500;

Review comment:
       Will do.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -64,6 +111,6 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public boolean isInSync() {
-    throw new UnsupportedOperationException();
+    return false;

Review comment:
       Will do.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] umehrot2 commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554242901



##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();

Review comment:
       Yeah I adapted the same code with some changes which optimize for this particular case of fetching only partition paths.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] umehrot2 commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554278832



##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();

Review comment:
       As I had mentioned, that this is already tested through code paths like cleaner, and I did verify that `TestCleaner` for example fails if this would return incorrect partitions.
   
   Adding specific tests for `FileSystemBackedMetadata` is tricky because `spark context` is not available in `hudi-common`. Perhaps, this can be done in a separate PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] prashantwason commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
prashantwason commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554237450



##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();
+    pathsToList.add(new Path(datasetBasePath));
+    List<String> partitionPaths = new ArrayList<>();
+
+    // TODO: Get the parallelism from HoodieWriteConfig
+    final int fileListingParallelism = 1500;

Review comment:
       This is the max. There is a Math.min below.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] umehrot2 commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554245704



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
##########
@@ -159,12 +160,12 @@ private static HoodieTableFileSystemView createInMemoryFileSystemView(Serializab
     return new HoodieTableFileSystemView(metaClient, timeline, viewConf.isIncrementalTimelineSyncEnabled());
   }
 
-  public static HoodieTableFileSystemView createInMemoryFileSystemView(HoodieTableMetaClient metaClient,
-                                                                       boolean useFileListingFromMetadata,
-                                                                       boolean verifyListings) {
+  public static HoodieTableFileSystemView createInMemoryFileSystemViewForReaders(HoodieTableMetaClient metaClient,
+      boolean useFileListingFromMetadata, boolean verifyListings) {
     LOG.info("Creating InMemory based view for basePath " + metaClient.getBasePath());
     if (useFileListingFromMetadata) {
-      return new HoodieMetadataFileSystemView(metaClient,
+      HoodieMREngineContext engineContext = new HoodieMREngineContext(metaClient.getHadoopConf());

Review comment:
       Hudi has added support for other engines like Flink. Also HoodieSparkEngineContext cannot be used here this API is called inside individual map reduce tasks from engines like spark and hive. In case of spark, the individual tasks do not have access to spark context. The spark context is not serializable.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=h1) Report
   > Merging [#2417](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=desc) (2519902) into [master](https://codecov.io/gh/apache/hudi/commit/368c1a8f5c36d06ed49706b4afde4a83073a9011?el=desc) (368c1a8) will **decrease** coverage by `40.84%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2417/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master   #2417       +/-   ##
   ============================================
   - Coverage     50.53%   9.68%   -40.85%     
   + Complexity     3032      48     -2984     
   ============================================
     Files           417      53      -364     
     Lines         18727    1930    -16797     
     Branches       1917     230     -1687     
   ============================================
   - Hits           9463     187     -9276     
   + Misses         8489    1730     -6759     
   + Partials        775      13      -762     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <100.00%> (-59.73%)` | `0.00 <1.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==) | `83.62% <100.00%> (-5.08%)` | `28.00 <1.00> (ø)` | |
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | ... and [383 more](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
n3nash commented on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-757432014


   @vinothchandar @umehrot2 Added a test class, can you ptal ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar merged pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
vinothchandar merged pull request #2417:
URL: https://github.com/apache/hudi/pull/2417


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] prashantwason commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
prashantwason commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554237450



##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();
+    pathsToList.add(new Path(datasetBasePath));
+    List<String> partitionPaths = new ArrayList<>();
+
+    // TODO: Get the parallelism from HoodieWriteConfig
+    final int fileListingParallelism = 1500;

Review comment:
       This is the max. There is a Math.min below.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] lw309637554 commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554041498



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
##########
@@ -159,12 +160,12 @@ private static HoodieTableFileSystemView createInMemoryFileSystemView(Serializab
     return new HoodieTableFileSystemView(metaClient, timeline, viewConf.isIncrementalTimelineSyncEnabled());
   }
 
-  public static HoodieTableFileSystemView createInMemoryFileSystemView(HoodieTableMetaClient metaClient,
-                                                                       boolean useFileListingFromMetadata,
-                                                                       boolean verifyListings) {
+  public static HoodieTableFileSystemView createInMemoryFileSystemViewForReaders(HoodieTableMetaClient metaClient,
+      boolean useFileListingFromMetadata, boolean verifyListings) {
     LOG.info("Creating InMemory based view for basePath " + metaClient.getBasePath());
     if (useFileListingFromMetadata) {
-      return new HoodieMetadataFileSystemView(metaClient,
+      HoodieMREngineContext engineContext = new HoodieMREngineContext(metaClient.getHadoopConf());

Review comment:
       @umehrot2 hello, why not use spark EngineContext? Because hudi can run just use spark, many users environment just have spark , but not have mr.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554735165



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
##########
@@ -660,8 +663,16 @@ public boolean requireSortedRecords() {
 
   public HoodieTableMetadata metadata() {
     if (metadata == null) {
-      metadata = HoodieTableMetadata.create(hadoopConfiguration.get(), config.getBasePath(), config.getSpillableMapBasePath(),
-          config.useFileListingMetadata(), config.getFileListingMetadataVerify(), config.isMetricsOn(), config.shouldAssumeDatePartitioning());
+      HoodieEngineContext engineContext = context;

Review comment:
       a more elegant way to do this would be through a `getEngineContext()` method that reinits lazily




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=h1) Report
   > Merging [#2417](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=desc) (00939cb) into [master](https://codecov.io/gh/apache/hudi/commit/368c1a8f5c36d06ed49706b4afde4a83073a9011?el=desc) (368c1a8) will **decrease** coverage by `40.84%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2417/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master   #2417       +/-   ##
   ============================================
   - Coverage     50.53%   9.68%   -40.85%     
   + Complexity     3032      48     -2984     
   ============================================
     Files           417      53      -364     
     Lines         18727    1930    -16797     
     Branches       1917     230     -1687     
   ============================================
   - Hits           9463     187     -9276     
   + Misses         8489    1730     -6759     
   + Partials        775      13      -762     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <100.00%> (-59.73%)` | `0.00 <1.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==) | `83.62% <100.00%> (-5.08%)` | `28.00 <1.00> (ø)` | |
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | ... and [383 more](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554260296



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieMREngineContext.java
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.engine;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.function.SerializableConsumer;
+import org.apache.hudi.common.function.SerializableFunction;
+import org.apache.hudi.common.function.SerializablePairFunction;
+import org.apache.hudi.common.util.Option;
+
+import org.apache.hudi.common.util.collection.Pair;
+
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static java.util.stream.Collectors.toList;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingFlatMapWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingForeachWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapToPairWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapWrapper;
+
+/**
+ * A java based engine context that can be used from map-reduce tasks executing in query engines like
+ * spark, hive and presto.
+ */
+public final class HoodieMREngineContext extends HoodieEngineContext {

Review comment:
       > But we should not get rid of 2 because we are treating HoodieEngineContext as the default engine context to instantiate
   
   But that only depends on the interface right. `hudi-cli` has access to sparkContext right? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r553710593



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java
##########
@@ -39,6 +39,7 @@
   // Validate contents of Metadata Table on each access against the actual filesystem
   public static final String METADATA_VALIDATE_PROP = METADATA_PREFIX + ".validate";
   public static final boolean DEFAULT_METADATA_VALIDATE = false;
+  public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = false;

Review comment:
       can we move this closer to the actual property for which this is the default property?

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -221,8 +221,9 @@ public HoodieBackedTableMetadata metadata() {
   protected abstract void initialize(HoodieEngineContext engineContext, HoodieTableMetaClient datasetMetaClient);
 
   private void initTableMetadata() {
-    this.metadata = new HoodieBackedTableMetadata(hadoopConf.get(), datasetWriteConfig.getBasePath(), datasetWriteConfig.getSpillableMapBasePath(),
-        datasetWriteConfig.useFileListingMetadata(), datasetWriteConfig.getFileListingMetadataVerify(), false,
+    this.metadata = new HoodieBackedTableMetadata(engineContext, hadoopConf.get(), datasetWriteConfig.getBasePath(),

Review comment:
       the `engineContext` has the `hadoopConf` as well right? use that?

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieMREngineContext.java
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.engine;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.function.SerializableConsumer;
+import org.apache.hudi.common.function.SerializableFunction;
+import org.apache.hudi.common.function.SerializablePairFunction;
+import org.apache.hudi.common.util.Option;
+
+import org.apache.hudi.common.util.collection.Pair;
+
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static java.util.stream.Collectors.toList;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingFlatMapWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingForeachWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapToPairWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapWrapper;
+
+/**
+ * A java based engine context that can be used from map-reduce tasks executing in query engines like
+ * spark, hive and presto.
+ */
+public final class HoodieMREngineContext extends HoodieEngineContext {

Review comment:
       any reason why this cannot be in `hudi-hadoop-mr`?

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();
+    pathsToList.add(new Path(datasetBasePath));
+    List<String> partitionPaths = new ArrayList<>();
+
+    // TODO: Get the parallelism from HoodieWriteConfig
+    final int fileListingParallelism = 1500;

Review comment:
       +1 there can we pass this in as an arg to `FileSystemBackedTableMetadata` class? This default is too much for query side. 

##########
File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java
##########
@@ -82,7 +82,7 @@
       HoodieTableMetaClient metaClient = partitionsToMetaClient.get(partitionPath);
       if (!fsCache.containsKey(metaClient)) {
 
-        HoodieTableFileSystemView fsView = FileSystemViewManager.createInMemoryFileSystemView(metaClient,
+        HoodieTableFileSystemView fsView = FileSystemViewManager.createInMemoryFileSystemViewForReaders(metaClient,

Review comment:
       Can we avoid special naming methods based on caller. We should keep stuff like `FileSystemViewManager` totally agnostic of reader/writer. If you want special behavior, can you please add a new method arg and control it like that. 

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -64,6 +111,6 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public boolean isInSync() {
-    throw new UnsupportedOperationException();
+    return false;

Review comment:
       this should be `true`? FileSystem and Data timeline cannot be divergent right

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();

Review comment:
       is this block verbatim the code @prashantwason had for parallelizing bootstrapping.. Prashant, can you also please help review this PR

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
##########
@@ -159,12 +160,12 @@ private static HoodieTableFileSystemView createInMemoryFileSystemView(Serializab
     return new HoodieTableFileSystemView(metaClient, timeline, viewConf.isIncrementalTimelineSyncEnabled());
   }
 
-  public static HoodieTableFileSystemView createInMemoryFileSystemView(HoodieTableMetaClient metaClient,
-                                                                       boolean useFileListingFromMetadata,
-                                                                       boolean verifyListings) {
+  public static HoodieTableFileSystemView createInMemoryFileSystemViewForReaders(HoodieTableMetaClient metaClient,
+      boolean useFileListingFromMetadata, boolean verifyListings) {

Review comment:
       lets pass the `HoodieEngineContext` as a parameter? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] umehrot2 commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554242390



##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();
+    pathsToList.add(new Path(datasetBasePath));
+    List<String> partitionPaths = new ArrayList<>();
+
+    // TODO: Get the parallelism from HoodieWriteConfig
+    final int fileListingParallelism = 1500;

Review comment:
       Yeah 1500 is just the maximum. The only downside with this current code will be that one cannot increase parallelism beyond 1500. These classes already have too many parameters and adding a new one here again results in changing all the consumers (and their consumers) which again affects more and more files.
   
   I think it would be better that in a separate PR, we should get rid of these individual parameters and have the consumers pass in HoodieMetadataConfig (which has all these params).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554537115



##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java
##########
@@ -444,7 +445,8 @@ public void testSync(HoodieTableType tableType) throws Exception {
       assertTrue(metadata(client).isInSync());
     }
 
-    // Various table operations without metadata table enabled
+    // Various table operations without metadata table enabled. When metadata is disabled, file system

Review comment:
       The second arg is useFilelistingMetadata = false




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=h1) Report
   > Merging [#2417](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=desc) (03336c2) into [master](https://codecov.io/gh/apache/hudi/commit/368c1a8f5c36d06ed49706b4afde4a83073a9011?el=desc) (368c1a8) will **decrease** coverage by `40.84%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2417/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master   #2417       +/-   ##
   ============================================
   - Coverage     50.53%   9.68%   -40.85%     
   + Complexity     3032      48     -2984     
   ============================================
     Files           417      53      -364     
     Lines         18727    1930    -16797     
     Branches       1917     230     -1687     
   ============================================
   - Hits           9463     187     -9276     
   + Misses         8489    1730     -6759     
   + Partials        775      13      -762     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <100.00%> (-59.73%)` | `0.00 <1.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==) | `83.62% <100.00%> (-5.08%)` | `28.00 <1.00> (ø)` | |
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | ... and [383 more](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554259024



##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();
+    pathsToList.add(new Path(datasetBasePath));
+    List<String> partitionPaths = new ArrayList<>();
+
+    // TODO: Get the parallelism from HoodieWriteConfig
+    final int fileListingParallelism = 1500;

Review comment:
       Okay my bad. can we pull this into a static final variable for now. 

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -64,6 +111,6 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public boolean isInSync() {
-    throw new UnsupportedOperationException();
+    return false;

Review comment:
       the stats are important, so let's keep it there. I think this happens because some parts of the tests run with metadata enabled and this is returned as the fallback. If its not too many, can we fix the tests. it does not make sense to have this return `false` semantically.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();

Review comment:
       Can we add more tests to this piece of code? (previously it was just in the bootstrap part), not this is prime time. 

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
##########
@@ -159,12 +160,12 @@ private static HoodieTableFileSystemView createInMemoryFileSystemView(Serializab
     return new HoodieTableFileSystemView(metaClient, timeline, viewConf.isIncrementalTimelineSyncEnabled());
   }
 
-  public static HoodieTableFileSystemView createInMemoryFileSystemView(HoodieTableMetaClient metaClient,
-                                                                       boolean useFileListingFromMetadata,
-                                                                       boolean verifyListings) {
+  public static HoodieTableFileSystemView createInMemoryFileSystemViewForReaders(HoodieTableMetaClient metaClient,
+      boolean useFileListingFromMetadata, boolean verifyListings) {
     LOG.info("Creating InMemory based view for basePath " + metaClient.getBasePath());
     if (useFileListingFromMetadata) {
-      return new HoodieMetadataFileSystemView(metaClient,
+      HoodieMREngineContext engineContext = new HoodieMREngineContext(metaClient.getHadoopConf());

Review comment:
       On a side note, we should move it to the hudi-hadoop-mr module or rename this to `HoodieLocalEngine` or something.  Lets discuss the relocation in the other comment. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-757542553


   > moved HoodieJavaEngineContext to hudi-common and remove HoodieLocalEngineContext
   Let's not do this. Please revert back. We discussed this before and chose to deliberately keep `HoodieJavaEngineContext` untouched. Every other engine context lives with its client. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554525498



##########
File path: hudi-common/src/test/java/org/apache/hudi/metadata/TestFileSystemBackedTableMetadata.java
##########
@@ -0,0 +1,181 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metadata;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.testutils.FileCreateUtils;
+import org.apache.hudi.common.testutils.HoodieCommonTestHarness;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Paths;
+import java.util.Arrays;
+import java.util.List;
+import java.util.UUID;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+public class TestFileSystemBackedTableMetadata extends HoodieCommonTestHarness {
+
+  private static final int NUM_FILE_IDS_PER_PARTITION = 10;
+
+  private static String TEST_WRITE_TOKEN = "1-0-1";
+
+  private final List<String> DATE_PARTITIONS = Arrays.asList("2019/01/01", "2020/01/02", "2021/03/01");
+  private final List<String> ONE_LEVEL_PARTITIONS = Arrays.asList("2019", "2020", "2021");
+  private final List<String> MULTI_LEVEL_PARTITIONS = Arrays.asList("2019/01", "2020/01", "2021/01");
+  private final List<String> fileIdsPerPartition =
+      IntStream.range(0, NUM_FILE_IDS_PER_PARTITION).mapToObj(x -> UUID.randomUUID().toString()).collect(Collectors.toList());
+
+  @BeforeEach
+  public void setUp() throws IOException {
+    initMetaClient();
+  }
+
+  @AfterEach
+  public void tearDown() throws IOException {
+    metaClient.getFs().delete(new Path(metaClient.getBasePath()), true);
+  }
+
+  /**
+   * Test non partition hoodie table
+   * @throws IOException
+   */
+  @Test
+  public void testNonPartitionedTable() throws IOException {
+    // Generate 10 files under basepath
+    String instant = "100";
+    createDataFiles(basePath, instant);
+    FileCreateUtils.createCommit(basePath, instant);
+    HoodieLocalEngineContext localEngineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
+    FileSystemBackedTableMetadata fileSystemBackedTableMetadata =
+        new FileSystemBackedTableMetadata(localEngineContext, new SerializableConfiguration(metaClient.getHadoopConf()), basePath, false);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllPartitionPaths().size() == 0);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllFilesInPartition(new Path(basePath)).length == 10);
+  }
+
+  /**
+   * Test listing of partitions result for date based partitions
+   * @throws IOException
+   */
+  @Test
+  public void testDatePartitionedTable() throws IOException {
+    String instant = "100";
+    // Generate 10 files under each partition
+    DATE_PARTITIONS.stream().map(p -> createPartition(p, false))

Review comment:
       @vinothchandar Does assumeDatePartitoning=true assume that all partitions are hoodie partitions (skips check for .hoodie_partition_metadata) ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] umehrot2 commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554237558



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -221,8 +221,9 @@ public HoodieBackedTableMetadata metadata() {
   protected abstract void initialize(HoodieEngineContext engineContext, HoodieTableMetaClient datasetMetaClient);
 
   private void initTableMetadata() {
-    this.metadata = new HoodieBackedTableMetadata(hadoopConf.get(), datasetWriteConfig.getBasePath(), datasetWriteConfig.getSpillableMapBasePath(),
-        datasetWriteConfig.useFileListingMetadata(), datasetWriteConfig.getFileListingMetadataVerify(), false,
+    this.metadata = new HoodieBackedTableMetadata(engineContext, hadoopConf.get(), datasetWriteConfig.getBasePath(),

Review comment:
       Will do.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=h1) Report
   > Merging [#2417](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=desc) (145372c) into [master](https://codecov.io/gh/apache/hudi/commit/17df517b812c9a37dd64014f0d5c35a3cfac0c4e?el=desc) (17df517) will **decrease** coverage by `40.10%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2417/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2417       +/-   ##
   =============================================
   - Coverage     50.19%   10.09%   -40.11%     
   + Complexity     2990       48     -2942     
   =============================================
     Files           415       52      -363     
     Lines         18439     1853    -16586     
     Branches       1885      223     -1662     
   =============================================
   - Hits           9255      187     -9068     
   + Misses         8427     1653     -6774     
   + Partials        757       13      -744     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `10.09% <100.00%> (-59.57%)` | `0.00 <1.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==) | `83.62% <100.00%> (-5.08%)` | `28.00 <1.00> (ø)` | |
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | ... and [381 more](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] umehrot2 commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554237778



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java
##########
@@ -39,6 +39,7 @@
   // Validate contents of Metadata Table on each access against the actual filesystem
   public static final String METADATA_VALIDATE_PROP = METADATA_PREFIX + ".validate";
   public static final boolean DEFAULT_METADATA_VALIDATE = false;
+  public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = false;

Review comment:
       This is already next to `METADATA_VALIDATE_PROP` for which this is the default (for readers).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] umehrot2 commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554241357



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieMREngineContext.java
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.engine;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.function.SerializableConsumer;
+import org.apache.hudi.common.function.SerializableFunction;
+import org.apache.hudi.common.function.SerializablePairFunction;
+import org.apache.hudi.common.util.Option;
+
+import org.apache.hudi.common.util.collection.Pair;
+
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static java.util.stream.Collectors.toList;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingFlatMapWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingForeachWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapToPairWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapWrapper;
+
+/**
+ * A java based engine context that can be used from map-reduce tasks executing in query engines like
+ * spark, hive and presto.
+ */
+public final class HoodieMREngineContext extends HoodieEngineContext {

Review comment:
       Well it is being directly instantiated at a couple of places in `hudi-common`:
   1. `FileSystemViewManager`: https://github.com/apache/hudi/pull/2417/files#diff-1fa21ae97e976db303c47571e7cb9c34e4f94719d482e537134e231b2e944388R167
   2. `HoodieBackedTableMetadata`: https://github.com/apache/hudi/pull/2417/files#diff-7c43aea81a02b4f135452b50eaa36d5868081e72b37d43101ca9de1f9ebb5195R77
   
   `hudi-common` does not depend on `hudi-hadoop-mr`.  As per your other comment we can get rid of `1` if we pass the engine context from the caller side. But we should not get rid of `2` because we are treating `HoodieEngineContext` as the default engine context to instantiate. This constructor is being used from `hud-cli` for ex (does not depend on `hudi-hadoop-mr`:
   -  https://github.com/apache/hudi/pull/2417/files#diff-b9829aa53bc2a7b0a655993df2c3c75d53f6a8e51f317753e51f27c63fc20341R165
   - https://github.com/apache/hudi/pull/2417/files#diff-b9829aa53bc2a7b0a655993df2c3c75d53f6a8e51f317753e51f27c63fc20341R197
   
   So, if we want to treat it like a default engine context to be used would be better to keep in `hudi-common`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=h1) Report
   > Merging [#2417](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=desc) (35aca53) into [master](https://codecov.io/gh/apache/hudi/commit/368c1a8f5c36d06ed49706b4afde4a83073a9011?el=desc) (368c1a8) will **decrease** coverage by `40.84%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2417/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master   #2417       +/-   ##
   ============================================
   - Coverage     50.53%   9.68%   -40.85%     
   + Complexity     3032      48     -2984     
   ============================================
     Files           417      53      -364     
     Lines         18727    1930    -16797     
     Branches       1917     230     -1687     
   ============================================
   - Hits           9463     187     -9276     
   + Misses         8489    1730     -6759     
   + Partials        775      13      -762     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <100.00%> (-59.73%)` | `0.00 <1.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==) | `83.62% <100.00%> (-5.08%)` | `28.00 <1.00> (ø)` | |
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | ... and [383 more](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
vinothchandar edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-757542553


   > moved HoodieJavaEngineContext to hudi-common and remove HoodieLocalEngineContext
   
   Let's not do this. Please revert back. We discussed this before and chose to deliberately keep `HoodieJavaEngineContext` untouched. Every other engine context lives with its client. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] umehrot2 commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554243274



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
##########
@@ -159,12 +160,12 @@ private static HoodieTableFileSystemView createInMemoryFileSystemView(Serializab
     return new HoodieTableFileSystemView(metaClient, timeline, viewConf.isIncrementalTimelineSyncEnabled());
   }
 
-  public static HoodieTableFileSystemView createInMemoryFileSystemView(HoodieTableMetaClient metaClient,
-                                                                       boolean useFileListingFromMetadata,
-                                                                       boolean verifyListings) {
+  public static HoodieTableFileSystemView createInMemoryFileSystemViewForReaders(HoodieTableMetaClient metaClient,
+      boolean useFileListingFromMetadata, boolean verifyListings) {

Review comment:
       Will do.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] lw309637554 commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
lw309637554 commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554041498



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
##########
@@ -159,12 +160,12 @@ private static HoodieTableFileSystemView createInMemoryFileSystemView(Serializab
     return new HoodieTableFileSystemView(metaClient, timeline, viewConf.isIncrementalTimelineSyncEnabled());
   }
 
-  public static HoodieTableFileSystemView createInMemoryFileSystemView(HoodieTableMetaClient metaClient,
-                                                                       boolean useFileListingFromMetadata,
-                                                                       boolean verifyListings) {
+  public static HoodieTableFileSystemView createInMemoryFileSystemViewForReaders(HoodieTableMetaClient metaClient,
+      boolean useFileListingFromMetadata, boolean verifyListings) {
     LOG.info("Creating InMemory based view for basePath " + metaClient.getBasePath());
     if (useFileListingFromMetadata) {
-      return new HoodieMetadataFileSystemView(metaClient,
+      HoodieMREngineContext engineContext = new HoodieMREngineContext(metaClient.getHadoopConf());

Review comment:
       @umehrot2 hello, why not use spark EngineContext? Because hudi can run just use spark, many users environment just have spark , but not have mr.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=h1) Report
   > Merging [#2417](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=desc) (d2e275b) into [master](https://codecov.io/gh/apache/hudi/commit/368c1a8f5c36d06ed49706b4afde4a83073a9011?el=desc) (368c1a8) will **decrease** coverage by `40.84%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2417/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master   #2417       +/-   ##
   ============================================
   - Coverage     50.53%   9.68%   -40.85%     
   + Complexity     3032      48     -2984     
   ============================================
     Files           417      53      -364     
     Lines         18727    1930    -16797     
     Branches       1917     230     -1687     
   ============================================
   - Hits           9463     187     -9276     
   + Misses         8489    1730     -6759     
   + Partials        775      13      -762     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <100.00%> (-59.73%)` | `0.00 <1.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==) | `83.62% <100.00%> (-5.08%)` | `28.00 <1.00> (ø)` | |
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | ... and [383 more](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] umehrot2 commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554243165



##########
File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java
##########
@@ -82,7 +82,7 @@
       HoodieTableMetaClient metaClient = partitionsToMetaClient.get(partitionPath);
       if (!fsCache.containsKey(metaClient)) {
 
-        HoodieTableFileSystemView fsView = FileSystemViewManager.createInMemoryFileSystemView(metaClient,
+        HoodieTableFileSystemView fsView = FileSystemViewManager.createInMemoryFileSystemViewForReaders(metaClient,

Review comment:
       Will do.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] umehrot2 commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554237558



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -221,8 +221,9 @@ public HoodieBackedTableMetadata metadata() {
   protected abstract void initialize(HoodieEngineContext engineContext, HoodieTableMetaClient datasetMetaClient);
 
   private void initTableMetadata() {
-    this.metadata = new HoodieBackedTableMetadata(hadoopConf.get(), datasetWriteConfig.getBasePath(), datasetWriteConfig.getSpillableMapBasePath(),
-        datasetWriteConfig.useFileListingMetadata(), datasetWriteConfig.getFileListingMetadataVerify(), false,
+    this.metadata = new HoodieBackedTableMetadata(engineContext, hadoopConf.get(), datasetWriteConfig.getBasePath(),

Review comment:
       Will do.

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java
##########
@@ -39,6 +39,7 @@
   // Validate contents of Metadata Table on each access against the actual filesystem
   public static final String METADATA_VALIDATE_PROP = METADATA_PREFIX + ".validate";
   public static final boolean DEFAULT_METADATA_VALIDATE = false;
+  public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = false;

Review comment:
       This is already next to `METADATA_VALIDATE_PROP` for which this is the default (for readers).

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieMREngineContext.java
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.engine;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.function.SerializableConsumer;
+import org.apache.hudi.common.function.SerializableFunction;
+import org.apache.hudi.common.function.SerializablePairFunction;
+import org.apache.hudi.common.util.Option;
+
+import org.apache.hudi.common.util.collection.Pair;
+
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static java.util.stream.Collectors.toList;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingFlatMapWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingForeachWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapToPairWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapWrapper;
+
+/**
+ * A java based engine context that can be used from map-reduce tasks executing in query engines like
+ * spark, hive and presto.
+ */
+public final class HoodieMREngineContext extends HoodieEngineContext {

Review comment:
       Well it is being directly instantiated at a couple of places in `hudi-common`:
   1. `FileSystemViewManager`: https://github.com/apache/hudi/pull/2417/files#diff-1fa21ae97e976db303c47571e7cb9c34e4f94719d482e537134e231b2e944388R167
   2. `HoodieBackedTableMetadata`: https://github.com/apache/hudi/pull/2417/files#diff-7c43aea81a02b4f135452b50eaa36d5868081e72b37d43101ca9de1f9ebb5195R77
   
   `hudi-common` does not depend on `hudi-hadoop-mr`.  As per your other comment we can get rid of `1` if we pass the engine context from the caller side. But we should not get rid of `2` because we are treating `HoodieEngineContext` as the default engine context to instantiate. This constructor is being used from `hud-cli` for ex (does not depend on `hudi-hadoop-mr`:
   -  https://github.com/apache/hudi/pull/2417/files#diff-b9829aa53bc2a7b0a655993df2c3c75d53f6a8e51f317753e51f27c63fc20341R165
   - https://github.com/apache/hudi/pull/2417/files#diff-b9829aa53bc2a7b0a655993df2c3c75d53f6a8e51f317753e51f27c63fc20341R197
   
   So, if we want to treat it like a default engine context to be used would be better to keep in `hudi-common`.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();
+    pathsToList.add(new Path(datasetBasePath));
+    List<String> partitionPaths = new ArrayList<>();
+
+    // TODO: Get the parallelism from HoodieWriteConfig
+    final int fileListingParallelism = 1500;

Review comment:
       Yeah 1500 is just the maximum. The only downside with this current code will be that one cannot increase parallelism beyond 1500. These classes already have too many parameters and adding a new one here again results in changing all the consumers (and their consumers) which again affects more and more files.
   
   I think it would be better that in a separate PR, we should get rid of these individual parameters and have the consumers pass in HoodieMetadataConfig (which has all these params).

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();

Review comment:
       Yeah I adapted the same code with some changes which optimize for this particular case of fetching only partition paths.

##########
File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java
##########
@@ -82,7 +82,7 @@
       HoodieTableMetaClient metaClient = partitionsToMetaClient.get(partitionPath);
       if (!fsCache.containsKey(metaClient)) {
 
-        HoodieTableFileSystemView fsView = FileSystemViewManager.createInMemoryFileSystemView(metaClient,
+        HoodieTableFileSystemView fsView = FileSystemViewManager.createInMemoryFileSystemViewForReaders(metaClient,

Review comment:
       Will do.

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
##########
@@ -159,12 +160,12 @@ private static HoodieTableFileSystemView createInMemoryFileSystemView(Serializab
     return new HoodieTableFileSystemView(metaClient, timeline, viewConf.isIncrementalTimelineSyncEnabled());
   }
 
-  public static HoodieTableFileSystemView createInMemoryFileSystemView(HoodieTableMetaClient metaClient,
-                                                                       boolean useFileListingFromMetadata,
-                                                                       boolean verifyListings) {
+  public static HoodieTableFileSystemView createInMemoryFileSystemViewForReaders(HoodieTableMetaClient metaClient,
+      boolean useFileListingFromMetadata, boolean verifyListings) {

Review comment:
       Will do.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -64,6 +111,6 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public boolean isInSync() {
-    throw new UnsupportedOperationException();
+    return false;

Review comment:
       I guess. But setting it to true breaks https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java#L456 and others. I guess will look at the tests again to see if those need to be fixed instead. I don't really see `isInSync()` being used anywhere (in source code, apart from tests) except for being stored in stats somewhere. Wonder if its really needed.

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
##########
@@ -159,12 +160,12 @@ private static HoodieTableFileSystemView createInMemoryFileSystemView(Serializab
     return new HoodieTableFileSystemView(metaClient, timeline, viewConf.isIncrementalTimelineSyncEnabled());
   }
 
-  public static HoodieTableFileSystemView createInMemoryFileSystemView(HoodieTableMetaClient metaClient,
-                                                                       boolean useFileListingFromMetadata,
-                                                                       boolean verifyListings) {
+  public static HoodieTableFileSystemView createInMemoryFileSystemViewForReaders(HoodieTableMetaClient metaClient,
+      boolean useFileListingFromMetadata, boolean verifyListings) {
     LOG.info("Creating InMemory based view for basePath " + metaClient.getBasePath());
     if (useFileListingFromMetadata) {
-      return new HoodieMetadataFileSystemView(metaClient,
+      HoodieMREngineContext engineContext = new HoodieMREngineContext(metaClient.getHadoopConf());

Review comment:
       Hudi has added support for other engines like Flink. Also HoodieSparkEngineContext cannot be used here this API is called inside individual map reduce tasks from engines like spark and hive. In case of spark, the individual tasks do not have access to spark context. The spark context is not serializable.

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieMREngineContext.java
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.engine;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.function.SerializableConsumer;
+import org.apache.hudi.common.function.SerializableFunction;
+import org.apache.hudi.common.function.SerializablePairFunction;
+import org.apache.hudi.common.util.Option;
+
+import org.apache.hudi.common.util.collection.Pair;
+
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static java.util.stream.Collectors.toList;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingFlatMapWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingForeachWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapToPairWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapWrapper;
+
+/**
+ * A java based engine context that can be used from map-reduce tasks executing in query engines like
+ * spark, hive and presto.
+ */
+public final class HoodieMREngineContext extends HoodieEngineContext {

Review comment:
       Yes but not all commands need spark context. It is created depending on the commands which truly need spark context.
   
   Besides this, as discussed offline there is another issue which I encountered when I started to fetch `hadoopConf` from `engineContext` based on your other comment. It results in `NPE` at places when metadata table is created inside spark executor tasks because `engineContext` is not serializable and ends up null at executors. For such scenarios in `HoodieTable` where we create metadata table I will have to add a check to create a `HoodieMREngineContext` in case engine context is `null`. Thus `hudi-client` needs access to this as well. So based on what we agreed to, we need something like a `HoodieLocalEngineContext` that can be used at all these places and can sit in `hudi-common`.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();
+    pathsToList.add(new Path(datasetBasePath));
+    List<String> partitionPaths = new ArrayList<>();
+
+    // TODO: Get the parallelism from HoodieWriteConfig
+    final int fileListingParallelism = 1500;

Review comment:
       Will do.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -64,6 +111,6 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public boolean isInSync() {
-    throw new UnsupportedOperationException();
+    return false;

Review comment:
       Will do.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();

Review comment:
       As I had mentioned, that this is already tested through code paths like cleaner, and I did verify that `TestCleaner` for example fails if this would return incorrect partitions.
   
   Adding specific tests for `FileSystemBackedMetadata` is tricky because `spark context` is not available in `hudi-common`. Perhaps, this can be done in a separate PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-757573470


   @n3nash please drop me a ping once the CI is happy


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=h1) Report
   > Merging [#2417](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=desc) (7bef2da) into [master](https://codecov.io/gh/apache/hudi/commit/368c1a8f5c36d06ed49706b4afde4a83073a9011?el=desc) (368c1a8) will **decrease** coverage by `40.84%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2417/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master   #2417       +/-   ##
   ============================================
   - Coverage     50.53%   9.68%   -40.85%     
   + Complexity     3032      48     -2984     
   ============================================
     Files           417      53      -364     
     Lines         18727    1930    -16797     
     Branches       1917     230     -1687     
   ============================================
   - Hits           9463     187     -9276     
   + Misses         8489    1730     -6759     
   + Partials        775      13      -762     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <100.00%> (-59.73%)` | `0.00 <1.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==) | `83.62% <100.00%> (-5.08%)` | `28.00 <1.00> (ø)` | |
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | ... and [383 more](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io commented on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=h1) Report
   > Merging [#2417](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=desc) (145372c) into [master](https://codecov.io/gh/apache/hudi/commit/17df517b812c9a37dd64014f0d5c35a3cfac0c4e?el=desc) (17df517) will **decrease** coverage by `40.10%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2417/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2417       +/-   ##
   =============================================
   - Coverage     50.19%   10.09%   -40.11%     
   + Complexity     2990       48     -2942     
   =============================================
     Files           415       52      -363     
     Lines         18439     1853    -16586     
     Branches       1885      223     -1662     
   =============================================
   - Hits           9255      187     -9068     
   + Misses         8427     1653     -6774     
   + Partials        757       13      -744     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `10.09% <100.00%> (-59.57%)` | `0.00 <1.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==) | `83.62% <100.00%> (-5.08%)` | `28.00 <1.00> (ø)` | |
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | ... and [381 more](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554529441



##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java
##########
@@ -444,7 +445,8 @@ public void testSync(HoodieTableType tableType) throws Exception {
       assertTrue(metadata(client).isInSync());
     }
 
-    // Various table operations without metadata table enabled
+    // Various table operations without metadata table enabled. When metadata is disabled, file system

Review comment:
       but these calls seem like they enable metadata? first arg of `getWriteConfig(..)` 

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieLocalEngineContext.java
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.engine;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.function.SerializableConsumer;
+import org.apache.hudi.common.function.SerializableFunction;
+import org.apache.hudi.common.function.SerializablePairFunction;
+import org.apache.hudi.common.util.Option;
+
+import org.apache.hudi.common.util.collection.Pair;
+
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static java.util.stream.Collectors.toList;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingFlatMapWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingForeachWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapToPairWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapWrapper;
+
+/**
+ * A java based engine context that can be used from map-reduce tasks executing in query engines like

Review comment:
       remove reference to map-reduce task in doc. This is just generic enough to be used from a single jvm. thats what it could say. Talking about MR is kind of misleading

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java
##########
@@ -66,10 +67,11 @@
   protected final String spillableMapDirectory;
   private transient HoodieMetadataMergedInstantRecordScanner timelineRecordScanner;
 
-  protected BaseTableMetadata(Configuration hadoopConf, String datasetBasePath, String spillableMapDirectory,
+  protected BaseTableMetadata(HoodieEngineContext engineContext, String datasetBasePath, String spillableMapDirectory,

Review comment:
       avoid these reformatting?

##########
File path: hudi-common/src/test/java/org/apache/hudi/metadata/TestFileSystemBackedTableMetadata.java
##########
@@ -0,0 +1,181 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metadata;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.testutils.FileCreateUtils;
+import org.apache.hudi.common.testutils.HoodieCommonTestHarness;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Paths;
+import java.util.Arrays;
+import java.util.List;
+import java.util.UUID;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+public class TestFileSystemBackedTableMetadata extends HoodieCommonTestHarness {
+
+  private static final int NUM_FILE_IDS_PER_PARTITION = 10;
+
+  private static String TEST_WRITE_TOKEN = "1-0-1";
+
+  private final List<String> DATE_PARTITIONS = Arrays.asList("2019/01/01", "2020/01/02", "2021/03/01");
+  private final List<String> ONE_LEVEL_PARTITIONS = Arrays.asList("2019", "2020", "2021");
+  private final List<String> MULTI_LEVEL_PARTITIONS = Arrays.asList("2019/01", "2020/01", "2021/01");
+  private final List<String> fileIdsPerPartition =
+      IntStream.range(0, NUM_FILE_IDS_PER_PARTITION).mapToObj(x -> UUID.randomUUID().toString()).collect(Collectors.toList());
+
+  @BeforeEach
+  public void setUp() throws IOException {
+    initMetaClient();
+  }
+
+  @AfterEach
+  public void tearDown() throws IOException {
+    metaClient.getFs().delete(new Path(metaClient.getBasePath()), true);
+  }
+
+  /**
+   * Test non partition hoodie table
+   * @throws IOException
+   */
+  @Test
+  public void testNonPartitionedTable() throws IOException {
+    // Generate 10 files under basepath
+    String instant = "100";
+    createDataFiles(basePath, instant);
+    FileCreateUtils.createCommit(basePath, instant);
+    HoodieLocalEngineContext localEngineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
+    FileSystemBackedTableMetadata fileSystemBackedTableMetadata =
+        new FileSystemBackedTableMetadata(localEngineContext, new SerializableConfiguration(metaClient.getHadoopConf()), basePath, false);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllPartitionPaths().size() == 0);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllFilesInPartition(new Path(basePath)).length == 10);
+  }
+
+  /**
+   * Test listing of partitions result for date based partitions
+   * @throws IOException
+   */
+  @Test
+  public void testDatePartitionedTable() throws IOException {
+    String instant = "100";
+    // Generate 10 files under each partition
+    DATE_PARTITIONS.stream().map(p -> createPartition(p, false))
+        .forEach(p -> createDataFiles(basePath + "/" + p, instant));
+    FileCreateUtils.createCommit(basePath, instant);
+    HoodieLocalEngineContext localEngineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
+    FileSystemBackedTableMetadata fileSystemBackedTableMetadata =
+        new FileSystemBackedTableMetadata(localEngineContext, new SerializableConfiguration(metaClient.getHadoopConf()), basePath, true);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllPartitionPaths().size() == 3);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllFilesInPartition(new Path(basePath + "/" + DATE_PARTITIONS.get(0))).length == 10);
+  }
+
+  /**
+   * Test listing of partitions result for date based partitions with assumeDataPartitioning = false
+   * @throws IOException
+   */
+  @Test
+  public void testDatePartitionedTableWithAssumeDateIsFalse() throws IOException {
+    String instant = "100";
+    // Generate 10 files under each partition
+    DATE_PARTITIONS.stream().map(p -> createPartition(p, false))
+        .forEach(p -> createDataFiles(basePath + "/" + p, instant));
+    FileCreateUtils.createCommit(basePath, instant);
+    HoodieLocalEngineContext localEngineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
+    FileSystemBackedTableMetadata fileSystemBackedTableMetadata =
+        new FileSystemBackedTableMetadata(localEngineContext, new SerializableConfiguration(metaClient.getHadoopConf()), basePath, false);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllPartitionPaths().size() == 0);
+  }
+
+  @Test
+  public void testOneLevelPartitionedTableWithoutHoodiePartitionMetaFile() throws IOException {
+    String instant = "100";
+    // Generate 10 files under each partition
+    ONE_LEVEL_PARTITIONS.stream().map(p -> createPartition(p, true))
+        .forEach(p -> createDataFiles(basePath + "/" + p, instant));
+    FileCreateUtils.createCommit(basePath, instant);
+    HoodieLocalEngineContext localEngineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
+    FileSystemBackedTableMetadata fileSystemBackedTableMetadata =
+        new FileSystemBackedTableMetadata(localEngineContext, new SerializableConfiguration(metaClient.getHadoopConf()), basePath, false);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllPartitionPaths().size() == 3);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllFilesInPartition(new Path(basePath + "/" + ONE_LEVEL_PARTITIONS.get(0))).length == 10);
+  }
+
+  @Test
+  public void testMultiLevelPartitionedTable() throws IOException {
+    String instant = "100";
+    // Generate 10 files under each partition
+    MULTI_LEVEL_PARTITIONS.stream().map(p -> createPartition(p, true))
+        .forEach(p -> createDataFiles(basePath + "/" + p, instant));
+    FileCreateUtils.createCommit(basePath, instant);
+    HoodieLocalEngineContext localEngineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
+    FileSystemBackedTableMetadata fileSystemBackedTableMetadata =
+        new FileSystemBackedTableMetadata(localEngineContext, new SerializableConfiguration(metaClient.getHadoopConf()), basePath, false);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllPartitionPaths().size() == 3);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllFilesInPartition(new Path(basePath + "/" + MULTI_LEVEL_PARTITIONS.get(0))).length == 10);
+  }
+
+  @Test
+  public void testMultiLevelEmptyPartitionTable() throws IOException {
+    String instant = "100";
+    // Generate 10 files under each partition
+    MULTI_LEVEL_PARTITIONS.stream().forEach(p -> createPartition(p, true));
+    FileCreateUtils.createCommit(basePath, instant);
+    HoodieLocalEngineContext localEngineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
+    FileSystemBackedTableMetadata fileSystemBackedTableMetadata =
+        new FileSystemBackedTableMetadata(localEngineContext, new SerializableConfiguration(metaClient.getHadoopConf()), basePath, false);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllPartitionPaths().size() == 3);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllFilesInPartition(new Path(basePath + "/" + MULTI_LEVEL_PARTITIONS.get(0))).length == 0);
+  }
+
+  private void createDataFiles(String fullPath, String instant) {
+    fileIdsPerPartition.stream().forEach(fId -> {
+      try {
+        new File(fullPath + "/" + FSUtils.makeDataFileName(instant, TEST_WRITE_TOKEN, fId)).createNewFile();
+      } catch (IOException e) {
+        throw new RuntimeException(e);
+      }
+    });
+  }
+
+  private String createPartition(String p, Boolean hoodiePartition) {

Review comment:
       same comment

##########
File path: hudi-common/src/test/java/org/apache/hudi/metadata/TestFileSystemBackedTableMetadata.java
##########
@@ -0,0 +1,181 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metadata;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.testutils.FileCreateUtils;
+import org.apache.hudi.common.testutils.HoodieCommonTestHarness;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Paths;
+import java.util.Arrays;
+import java.util.List;
+import java.util.UUID;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+public class TestFileSystemBackedTableMetadata extends HoodieCommonTestHarness {
+
+  private static final int NUM_FILE_IDS_PER_PARTITION = 10;
+
+  private static String TEST_WRITE_TOKEN = "1-0-1";
+
+  private final List<String> DATE_PARTITIONS = Arrays.asList("2019/01/01", "2020/01/02", "2021/03/01");
+  private final List<String> ONE_LEVEL_PARTITIONS = Arrays.asList("2019", "2020", "2021");
+  private final List<String> MULTI_LEVEL_PARTITIONS = Arrays.asList("2019/01", "2020/01", "2021/01");
+  private final List<String> fileIdsPerPartition =
+      IntStream.range(0, NUM_FILE_IDS_PER_PARTITION).mapToObj(x -> UUID.randomUUID().toString()).collect(Collectors.toList());
+
+  @BeforeEach
+  public void setUp() throws IOException {
+    initMetaClient();
+  }
+
+  @AfterEach
+  public void tearDown() throws IOException {
+    metaClient.getFs().delete(new Path(metaClient.getBasePath()), true);
+  }
+
+  /**
+   * Test non partition hoodie table
+   * @throws IOException
+   */
+  @Test
+  public void testNonPartitionedTable() throws IOException {
+    // Generate 10 files under basepath
+    String instant = "100";
+    createDataFiles(basePath, instant);
+    FileCreateUtils.createCommit(basePath, instant);
+    HoodieLocalEngineContext localEngineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
+    FileSystemBackedTableMetadata fileSystemBackedTableMetadata =
+        new FileSystemBackedTableMetadata(localEngineContext, new SerializableConfiguration(metaClient.getHadoopConf()), basePath, false);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllPartitionPaths().size() == 0);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllFilesInPartition(new Path(basePath)).length == 10);
+  }
+
+  /**
+   * Test listing of partitions result for date based partitions
+   * @throws IOException
+   */
+  @Test
+  public void testDatePartitionedTable() throws IOException {
+    String instant = "100";
+    // Generate 10 files under each partition
+    DATE_PARTITIONS.stream().map(p -> createPartition(p, false))
+        .forEach(p -> createDataFiles(basePath + "/" + p, instant));
+    FileCreateUtils.createCommit(basePath, instant);
+    HoodieLocalEngineContext localEngineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
+    FileSystemBackedTableMetadata fileSystemBackedTableMetadata =
+        new FileSystemBackedTableMetadata(localEngineContext, new SerializableConfiguration(metaClient.getHadoopConf()), basePath, true);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllPartitionPaths().size() == 3);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllFilesInPartition(new Path(basePath + "/" + DATE_PARTITIONS.get(0))).length == 10);
+  }
+
+  /**
+   * Test listing of partitions result for date based partitions with assumeDataPartitioning = false
+   * @throws IOException
+   */
+  @Test
+  public void testDatePartitionedTableWithAssumeDateIsFalse() throws IOException {
+    String instant = "100";
+    // Generate 10 files under each partition
+    DATE_PARTITIONS.stream().map(p -> createPartition(p, false))
+        .forEach(p -> createDataFiles(basePath + "/" + p, instant));
+    FileCreateUtils.createCommit(basePath, instant);
+    HoodieLocalEngineContext localEngineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
+    FileSystemBackedTableMetadata fileSystemBackedTableMetadata =
+        new FileSystemBackedTableMetadata(localEngineContext, new SerializableConfiguration(metaClient.getHadoopConf()), basePath, false);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllPartitionPaths().size() == 0);
+  }
+
+  @Test
+  public void testOneLevelPartitionedTableWithoutHoodiePartitionMetaFile() throws IOException {
+    String instant = "100";
+    // Generate 10 files under each partition
+    ONE_LEVEL_PARTITIONS.stream().map(p -> createPartition(p, true))
+        .forEach(p -> createDataFiles(basePath + "/" + p, instant));
+    FileCreateUtils.createCommit(basePath, instant);
+    HoodieLocalEngineContext localEngineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
+    FileSystemBackedTableMetadata fileSystemBackedTableMetadata =
+        new FileSystemBackedTableMetadata(localEngineContext, new SerializableConfiguration(metaClient.getHadoopConf()), basePath, false);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllPartitionPaths().size() == 3);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllFilesInPartition(new Path(basePath + "/" + ONE_LEVEL_PARTITIONS.get(0))).length == 10);
+  }
+
+  @Test
+  public void testMultiLevelPartitionedTable() throws IOException {
+    String instant = "100";
+    // Generate 10 files under each partition
+    MULTI_LEVEL_PARTITIONS.stream().map(p -> createPartition(p, true))
+        .forEach(p -> createDataFiles(basePath + "/" + p, instant));
+    FileCreateUtils.createCommit(basePath, instant);
+    HoodieLocalEngineContext localEngineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
+    FileSystemBackedTableMetadata fileSystemBackedTableMetadata =
+        new FileSystemBackedTableMetadata(localEngineContext, new SerializableConfiguration(metaClient.getHadoopConf()), basePath, false);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllPartitionPaths().size() == 3);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllFilesInPartition(new Path(basePath + "/" + MULTI_LEVEL_PARTITIONS.get(0))).length == 10);
+  }
+
+  @Test
+  public void testMultiLevelEmptyPartitionTable() throws IOException {
+    String instant = "100";
+    // Generate 10 files under each partition
+    MULTI_LEVEL_PARTITIONS.stream().forEach(p -> createPartition(p, true));
+    FileCreateUtils.createCommit(basePath, instant);
+    HoodieLocalEngineContext localEngineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
+    FileSystemBackedTableMetadata fileSystemBackedTableMetadata =
+        new FileSystemBackedTableMetadata(localEngineContext, new SerializableConfiguration(metaClient.getHadoopConf()), basePath, false);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllPartitionPaths().size() == 3);
+    Assertions.assertTrue(fileSystemBackedTableMetadata.getAllFilesInPartition(new Path(basePath + "/" + MULTI_LEVEL_PARTITIONS.get(0))).length == 0);
+  }
+
+  private void createDataFiles(String fullPath, String instant) {

Review comment:
       can we please use the `HoodieTestUtils` methods? We spend a lot of time chasing after these one-offs. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554539152



##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java
##########
@@ -66,10 +67,11 @@
   protected final String spillableMapDirectory;
   private transient HoodieMetadataMergedInstantRecordScanner timelineRecordScanner;
 
-  protected BaseTableMetadata(Configuration hadoopConf, String datasetBasePath, String spillableMapDirectory,
+  protected BaseTableMetadata(HoodieEngineContext engineContext, String datasetBasePath, String spillableMapDirectory,

Review comment:
       It's a new variable..




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] umehrot2 commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554244342



##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -64,6 +111,6 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public boolean isInSync() {
-    throw new UnsupportedOperationException();
+    return false;

Review comment:
       I guess. But setting it to true breaks https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java#L456 and others. I guess will look at the tests again to see if those need to be fixed instead. I don't really see `isInSync()` being used anywhere (in source code, apart from tests) except for being stored in stats somewhere. Wonder if its really needed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
n3nash commented on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-757542288


   @vinothchandar Addressed comments, moved `HoodieJavaEngineContext` to hudi-common and remove `HoodieLocalEngineContext`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554259024



##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();
+    pathsToList.add(new Path(datasetBasePath));
+    List<String> partitionPaths = new ArrayList<>();
+
+    // TODO: Get the parallelism from HoodieWriteConfig
+    final int fileListingParallelism = 1500;

Review comment:
       Okay my bad. can we pull this into a static final variable for now. 

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -64,6 +111,6 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public boolean isInSync() {
-    throw new UnsupportedOperationException();
+    return false;

Review comment:
       the stats are important, so let's keep it there. I think this happens because some parts of the tests run with metadata enabled and this is returned as the fallback. If its not too many, can we fix the tests. it does not make sense to have this return `false` semantically.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
 
   @Override
   public List<String> getAllPartitionPaths() throws IOException {
-    FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
     if (assumeDatePartitioning) {
+      FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
       return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
-    } else {
-      return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
     }
+
+    List<Path> pathsToList = new LinkedList<>();

Review comment:
       Can we add more tests to this piece of code? (previously it was just in the bootstrap part), not this is prime time. 

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
##########
@@ -159,12 +160,12 @@ private static HoodieTableFileSystemView createInMemoryFileSystemView(Serializab
     return new HoodieTableFileSystemView(metaClient, timeline, viewConf.isIncrementalTimelineSyncEnabled());
   }
 
-  public static HoodieTableFileSystemView createInMemoryFileSystemView(HoodieTableMetaClient metaClient,
-                                                                       boolean useFileListingFromMetadata,
-                                                                       boolean verifyListings) {
+  public static HoodieTableFileSystemView createInMemoryFileSystemViewForReaders(HoodieTableMetaClient metaClient,
+      boolean useFileListingFromMetadata, boolean verifyListings) {
     LOG.info("Creating InMemory based view for basePath " + metaClient.getBasePath());
     if (useFileListingFromMetadata) {
-      return new HoodieMetadataFileSystemView(metaClient,
+      HoodieMREngineContext engineContext = new HoodieMREngineContext(metaClient.getHadoopConf());

Review comment:
       On a side note, we should move it to the hudi-hadoop-mr module or rename this to `HoodieLocalEngine` or something.  Lets discuss the relocation in the other comment. 

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieMREngineContext.java
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.engine;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.function.SerializableConsumer;
+import org.apache.hudi.common.function.SerializableFunction;
+import org.apache.hudi.common.function.SerializablePairFunction;
+import org.apache.hudi.common.util.Option;
+
+import org.apache.hudi.common.util.collection.Pair;
+
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static java.util.stream.Collectors.toList;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingFlatMapWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingForeachWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapToPairWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapWrapper;
+
+/**
+ * A java based engine context that can be used from map-reduce tasks executing in query engines like
+ * spark, hive and presto.
+ */
+public final class HoodieMREngineContext extends HoodieEngineContext {

Review comment:
       > But we should not get rid of 2 because we are treating HoodieEngineContext as the default engine context to instantiate
   
   But that only depends on the interface right. `hudi-cli` has access to sparkContext right? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554538803



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieLocalEngineContext.java
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.engine;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.function.SerializableConsumer;
+import org.apache.hudi.common.function.SerializableFunction;
+import org.apache.hudi.common.function.SerializablePairFunction;
+import org.apache.hudi.common.util.Option;
+
+import org.apache.hudi.common.util.collection.Pair;
+
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static java.util.stream.Collectors.toList;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingFlatMapWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingForeachWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapToPairWrapper;
+import static org.apache.hudi.common.function.FunctionWrapper.throwingMapWrapper;
+
+/**
+ * A java based engine context that can be used from map-reduce tasks executing in query engines like

Review comment:
       On looking deeper, I refactored HoodieJavaEngineContext to hudi-common and remove HoodieLocalEngineContext




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2417: [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#issuecomment-756532045


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=h1) Report
   > Merging [#2417](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=desc) (ad815be) into [master](https://codecov.io/gh/apache/hudi/commit/368c1a8f5c36d06ed49706b4afde4a83073a9011?el=desc) (368c1a8) will **decrease** coverage by `40.84%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2417/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master   #2417       +/-   ##
   ============================================
   - Coverage     50.53%   9.68%   -40.85%     
   + Complexity     3032      48     -2984     
   ============================================
     Files           417      53      -364     
     Lines         18727    1930    -16797     
     Branches       1917     230     -1687     
   ============================================
   - Hits           9463     187     -9276     
   + Misses         8489    1730     -6759     
   + Partials        775      13      -762     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <100.00%> (-59.73%)` | `0.00 <1.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2417?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==) | `83.62% <100.00%> (-5.08%)` | `28.00 <1.00> (ø)` | |
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | ... and [383 more](https://codecov.io/gh/apache/hudi/pull/2417/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org