You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/05/10 02:46:10 UTC

[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2893: [HUDI-1371] Support metadata based listing for Spark DataSource and Spark SQL

pengzhiwei2018 commented on a change in pull request #2893:
URL: https://github.com/apache/hudi/pull/2893#discussion_r629013054



##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java
##########
@@ -141,6 +143,31 @@ protected BaseTableMetadata(HoodieEngineContext engineContext, HoodieMetadataCon
         .getAllFilesInPartition(partitionPath);
   }
 
+  @Override
+  public Map<String, FileStatus[]> getAllFilesInPartitions(List<String> partitionPaths)
+      throws IOException {
+    if (enabled) {
+      Map<String, FileStatus[]> partitionsFilesMap = new HashMap<>();
+
+      try {
+        for (String partitionPath : partitionPaths) {
+          partitionsFilesMap.put(partitionPath, fetchAllFilesInPartition(new Path(partitionPath)));
+        }
+      } catch (Exception e) {
+        if (metadataConfig.enableFallback()) {
+          LOG.error("Failed to retrieve files in partitions from metadata", e);

Review comment:
       If enable the fallback here, an empty `partitionsFilesMap` will return if there is an Exception happened, is it right?

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -105,6 +106,20 @@ public FileSystemBackedTableMetadata(HoodieEngineContext engineContext, Serializ
     return partitionPaths;
   }
 
+  @Override
+  public Map<String, FileStatus[]> getAllFilesInPartitions(List<String> partitionPaths)
+      throws IOException {
+    int parallelism = Math.min(DEFAULT_LISTING_PARALLELISM, partitionPaths.size());

Review comment:
       If the `partitionPaths` is empty, the `parallelism` will be 0, there may be an Exception ("Positive number of partitions required")  throw out for the `sparkContext.parallelize(seq, parallelism)`,

##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -227,7 +227,17 @@ case class HoodieFileIndex(
    */
   private def loadPartitionPathFiles(): Map[PartitionRowPath, Array[FileStatus]] = {
     val sparkEngine = new HoodieSparkEngineContext(new JavaSparkContext(spark.sparkContext))
+    val serializableConf = new SerializableConfiguration(spark.sessionState.newHadoopConf())
+
     val properties = new Properties()
+    // To support metadata listing via Spark SQL we allow users to pass the config via Hadoop Conf. Spark SQL does not

Review comment:
       Should we get these `configurations` from the `spark.sessionState.conf` for spark? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org