You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/12 17:53:21 UTC

[GitHub] [iceberg] aokolnychyi opened a new pull request, #4758: Spark 3.2: Avoid reflection to load metadata tables in SparkTableUtil

aokolnychyi opened a new pull request, #4758:
URL: https://github.com/apache/iceberg/pull/4758

   This logic removes the unnecessary reflection to load metadata tables in `SparkTableUtil` since we have separate modules for Spark versions. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a diff in pull request #4758: Spark 3.2: Avoid reflection to load metadata tables in SparkTableUtil

Posted by GitBox <gi...@apache.org>.
flyrain commented on code in PR #4758:
URL: https://github.com/apache/iceberg/pull/4758#discussion_r872651322


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -599,46 +602,15 @@ private static void deleteManifests(FileIO io, List<ManifestFile> manifests) {
         .run(item -> io.deleteFile(item.path()));
   }
 
-  // Attempt to use Spark3 Catalog resolution if available on the path
-  private static final DynMethods.UnboundMethod LOAD_METADATA_TABLE = DynMethods.builder("loadMetadataTable")
-      .hiddenImpl("org.apache.iceberg.spark.Spark3Util", SparkSession.class, Table.class, MetadataTableType.class)
-      .orNoop()
-      .build();
-
-  public static Dataset<Row> loadCatalogMetadataTable(SparkSession spark, Table table, MetadataTableType type) {

Review Comment:
   Yeah, this is a grey area, in which these methods actually are APIs, but not considered as APIs.  Per our community sync, only the public things in the api module are considered as APIs. We need something(e.g. annotation) to mark these methods as APIs. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a diff in pull request #4758: Spark 3.2: Avoid reflection to load metadata tables in SparkTableUtil

Posted by GitBox <gi...@apache.org>.
flyrain commented on code in PR #4758:
URL: https://github.com/apache/iceberg/pull/4758#discussion_r872651322


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -599,46 +602,15 @@ private static void deleteManifests(FileIO io, List<ManifestFile> manifests) {
         .run(item -> io.deleteFile(item.path()));
   }
 
-  // Attempt to use Spark3 Catalog resolution if available on the path
-  private static final DynMethods.UnboundMethod LOAD_METADATA_TABLE = DynMethods.builder("loadMetadataTable")
-      .hiddenImpl("org.apache.iceberg.spark.Spark3Util", SparkSession.class, Table.class, MetadataTableType.class)
-      .orNoop()
-      .build();
-
-  public static Dataset<Row> loadCatalogMetadataTable(SparkSession spark, Table table, MetadataTableType type) {

Review Comment:
   Yeah, this is a grey area, in which these methods actually are APIs, but not considered as APIs.  Per our community sync, only the public things in the api module are considered as APIs. We need something(e.g. annotation) to mark these methods as APIs. Otherwise, the discussion of which public method should be deprecated and which shouldn't will keep going. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #4758: Spark 3.2: Avoid reflection to load metadata tables in SparkTableUtil

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on PR #4758:
URL: https://github.com/apache/iceberg/pull/4758#issuecomment-1125272415

   cc @RussellSpitzer @flyrain @karuppayya @szehon-ho @rdblue


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #4758: Spark 3.2: Avoid reflection to load metadata tables in SparkTableUtil

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #4758:
URL: https://github.com/apache/iceberg/pull/4758#discussion_r871863655


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -599,46 +602,15 @@ private static void deleteManifests(FileIO io, List<ManifestFile> manifests) {
         .run(item -> io.deleteFile(item.path()));
   }
 
-  // Attempt to use Spark3 Catalog resolution if available on the path
-  private static final DynMethods.UnboundMethod LOAD_METADATA_TABLE = DynMethods.builder("loadMetadataTable")
-      .hiddenImpl("org.apache.iceberg.spark.Spark3Util", SparkSession.class, Table.class, MetadataTableType.class)
-      .orNoop()
-      .build();
-
-  public static Dataset<Row> loadCatalogMetadataTable(SparkSession spark, Table table, MetadataTableType type) {

Review Comment:
   Just flagging that this will remove public method, in case anyone may use this method (though I feel there should not be).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #4758: Spark 3.2: Avoid reflection to load metadata tables in SparkTableUtil

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #4758:
URL: https://github.com/apache/iceberg/pull/4758#discussion_r872702169


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -599,46 +602,15 @@ private static void deleteManifests(FileIO io, List<ManifestFile> manifests) {
         .run(item -> io.deleteFile(item.path()));
   }
 
-  // Attempt to use Spark3 Catalog resolution if available on the path
-  private static final DynMethods.UnboundMethod LOAD_METADATA_TABLE = DynMethods.builder("loadMetadataTable")
-      .hiddenImpl("org.apache.iceberg.spark.Spark3Util", SparkSession.class, Table.class, MetadataTableType.class)
-      .orNoop()
-      .build();
-
-  public static Dataset<Row> loadCatalogMetadataTable(SparkSession spark, Table table, MetadataTableType type) {

Review Comment:
   Sounds good to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #4758: Spark 3.2: Avoid reflection to load metadata tables in SparkTableUtil

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #4758:
URL: https://github.com/apache/iceberg/pull/4758#discussion_r871665547


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -599,46 +602,15 @@ private static void deleteManifests(FileIO io, List<ManifestFile> manifests) {
         .run(item -> io.deleteFile(item.path()));
   }
 
-  // Attempt to use Spark3 Catalog resolution if available on the path
-  private static final DynMethods.UnboundMethod LOAD_METADATA_TABLE = DynMethods.builder("loadMetadataTable")
-      .hiddenImpl("org.apache.iceberg.spark.Spark3Util", SparkSession.class, Table.class, MetadataTableType.class)
-      .orNoop()
-      .build();
-
-  public static Dataset<Row> loadCatalogMetadataTable(SparkSession spark, Table table, MetadataTableType type) {
-    Preconditions.checkArgument(!LOAD_METADATA_TABLE.isNoop(), "Cannot find Spark3Util class but Spark3 is in use");
-    return LOAD_METADATA_TABLE.asStatic().invoke(spark, table, type);
-  }
-
   public static Dataset<Row> loadMetadataTable(SparkSession spark, Table table, MetadataTableType type) {
-    if (spark.version().startsWith("3")) {
-      // construct the metadata table instance directly
-      Dataset<Row> catalogMetadataTable = loadCatalogMetadataTable(spark, table, type);
-      if (catalogMetadataTable != null) {
-        return catalogMetadataTable;
-      }
-    }
-
-    String tableName = table.name();
-    String tableLocation = table.location();
-
-    DataFrameReader dataFrameReader = spark.read().format("iceberg");
-    if (tableName.contains("/")) {
-      // Hadoop Table or Metadata location passed, load without a catalog
-      return dataFrameReader.load(tableName + "#" + type);
-    }
+    return loadMetadataTable(spark, table, type, ImmutableMap.of());
+  }
 
-    // Catalog based resolution failed, our catalog may be a non-DatasourceV2 Catalog
-    if (tableName.startsWith("hadoop.")) {
-      // Try loading by location as Hadoop table without Catalog
-      return dataFrameReader.load(tableLocation + "#" + type);
-    } else if (tableName.startsWith("hive")) {
-      // Try loading by name as a Hive table without Catalog
-      return dataFrameReader.load(tableName.replaceFirst("hive\\.", "") + "." + type);
-    } else {
-      throw new IllegalArgumentException(String.format(
-          "Cannot find the metadata table for %s of type %s", tableName, type));
-    }
+  public static Dataset<Row> loadMetadataTable(SparkSession spark, Table table, MetadataTableType type,
+                                               Map<String, String> extraOptions) {
+    SparkTable metadataTable = new SparkTable(MetadataTableUtils.createMetadataTableInstance(table, type), false);

Review Comment:
   We had this code in that private method in `Spark3Util`. I just moved it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #4758: Spark 3.2: Avoid reflection to load metadata tables in SparkTableUtil

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #4758:
URL: https://github.com/apache/iceberg/pull/4758#discussion_r871664696


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -599,46 +602,15 @@ private static void deleteManifests(FileIO io, List<ManifestFile> manifests) {
         .run(item -> io.deleteFile(item.path()));
   }
 
-  // Attempt to use Spark3 Catalog resolution if available on the path
-  private static final DynMethods.UnboundMethod LOAD_METADATA_TABLE = DynMethods.builder("loadMetadataTable")
-      .hiddenImpl("org.apache.iceberg.spark.Spark3Util", SparkSession.class, Table.class, MetadataTableType.class)
-      .orNoop()
-      .build();
-
-  public static Dataset<Row> loadCatalogMetadataTable(SparkSession spark, Table table, MetadataTableType type) {
-    Preconditions.checkArgument(!LOAD_METADATA_TABLE.isNoop(), "Cannot find Spark3Util class but Spark3 is in use");
-    return LOAD_METADATA_TABLE.asStatic().invoke(spark, table, type);
-  }
-
   public static Dataset<Row> loadMetadataTable(SparkSession spark, Table table, MetadataTableType type) {
-    if (spark.version().startsWith("3")) {
-      // construct the metadata table instance directly
-      Dataset<Row> catalogMetadataTable = loadCatalogMetadataTable(spark, table, type);
-      if (catalogMetadataTable != null) {
-        return catalogMetadataTable;
-      }
-    }
-
-    String tableName = table.name();
-    String tableLocation = table.location();
-
-    DataFrameReader dataFrameReader = spark.read().format("iceberg");
-    if (tableName.contains("/")) {
-      // Hadoop Table or Metadata location passed, load without a catalog
-      return dataFrameReader.load(tableName + "#" + type);
-    }
+    return loadMetadataTable(spark, table, type, ImmutableMap.of());
+  }
 
-    // Catalog based resolution failed, our catalog may be a non-DatasourceV2 Catalog
-    if (tableName.startsWith("hadoop.")) {
-      // Try loading by location as Hadoop table without Catalog
-      return dataFrameReader.load(tableLocation + "#" + type);
-    } else if (tableName.startsWith("hive")) {
-      // Try loading by name as a Hive table without Catalog
-      return dataFrameReader.load(tableName.replaceFirst("hive\\.", "") + "." + type);
-    } else {
-      throw new IllegalArgumentException(String.format(
-          "Cannot find the metadata table for %s of type %s", tableName, type));
-    }
+  public static Dataset<Row> loadMetadataTable(SparkSession spark, Table table, MetadataTableType type,
+                                               Map<String, String> extraOptions) {
+    SparkTable metadataTable = new SparkTable(MetadataTableUtils.createMetadataTableInstance(table, type), false);

Review Comment:
   Looks like constructing `DataSourceV2Relation` directly is the easiest and safest option.
   Let me know if I missed a use case where we still need the old code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #4758: Spark 3.2: Avoid reflection to load metadata tables in SparkTableUtil

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on code in PR #4758:
URL: https://github.com/apache/iceberg/pull/4758#discussion_r872607468


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -599,46 +602,15 @@ private static void deleteManifests(FileIO io, List<ManifestFile> manifests) {
         .run(item -> io.deleteFile(item.path()));
   }
 
-  // Attempt to use Spark3 Catalog resolution if available on the path
-  private static final DynMethods.UnboundMethod LOAD_METADATA_TABLE = DynMethods.builder("loadMetadataTable")
-      .hiddenImpl("org.apache.iceberg.spark.Spark3Util", SparkSession.class, Table.class, MetadataTableType.class)
-      .orNoop()
-      .build();
-
-  public static Dataset<Row> loadCatalogMetadataTable(SparkSession spark, Table table, MetadataTableType type) {

Review Comment:
   I think we should probably deprecate it for at least one release before dropping it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi merged pull request #4758: Spark 3.2: Avoid reflection to load metadata tables in SparkTableUtil

Posted by GitBox <gi...@apache.org>.
aokolnychyi merged PR #4758:
URL: https://github.com/apache/iceberg/pull/4758


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #4758: Spark 3.2: Avoid reflection to load metadata tables in SparkTableUtil

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #4758:
URL: https://github.com/apache/iceberg/pull/4758#discussion_r871896672


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -599,46 +602,15 @@ private static void deleteManifests(FileIO io, List<ManifestFile> manifests) {
         .run(item -> io.deleteFile(item.path()));
   }
 
-  // Attempt to use Spark3 Catalog resolution if available on the path
-  private static final DynMethods.UnboundMethod LOAD_METADATA_TABLE = DynMethods.builder("loadMetadataTable")
-      .hiddenImpl("org.apache.iceberg.spark.Spark3Util", SparkSession.class, Table.class, MetadataTableType.class)
-      .orNoop()
-      .build();
-
-  public static Dataset<Row> loadCatalogMetadataTable(SparkSession spark, Table table, MetadataTableType type) {

Review Comment:
   Good point, I missed it. I wonder why we made it public, though. I'd probably remove the method but I don't mind deprecating and delegating to the one below if folks think this may impact anyone. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org