Posted to issues@iceberg.apache.org by "szehon-ho (via GitHub)" <gi...@apache.org> on 2023/02/17 02:06:03 UTC

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

szehon-ho commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1109199713


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java:
##########
@@ -836,9 +836,30 @@ public static String quotedFullIdentifier(String catalogName, Identifier identif
    * @param format format of the file
    * @param partitionFilter partitionFilter of the file
    * @return all table's partitions
+   * @deprecated use {@link Spark3Util#getPartitions(SparkSession, Path, String, Map, Option)}
    */
+  @Deprecated
   public static List<SparkPartition> getPartitions(
       SparkSession spark, Path rootPath, String format, Map<String, String> partitionFilter) {
+    return getPartitions(spark, rootPath, format, partitionFilter, Option.empty());
+  }
+
+  /**
+   * Use Spark to list all partitions in the table.
+   *
+   * @param spark a Spark session
+   * @param rootPath a table identifier
+   * @param format format of the file
+   * @param partitionFilter partitionFilter of the file
+   * @param partitionSpec partitionSpec of the table
+   * @return all table's partitions
+   */
+  public static List<SparkPartition> getPartitions(
+      SparkSession spark,
+      Path rootPath,
+      String format,
+      Map<String, String> partitionFilter,
+      Option<PartitionSpec> partitionSpec) {

Review Comment:
   We shouldn't take a Scala Option as an argument to a Java method. I would vote to just take a regular PartitionSpec and check for null in the code. The Java equivalent is java.util.Optional, and its javadoc suggests using it primarily for return types.
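The overload pattern suggested here can be sketched in plain Java. This is a hypothetical mini-model, not the actual Spark3Util code: the class and method names below are made up for illustration. The deprecated signature delegates with null, and java.util.Optional is used only internally, never as a parameter type.

```java
import java.util.Optional;

// Hypothetical sketch of the suggested pattern: a nullable parameter instead
// of a scala.Option (or java.util.Optional) argument on the public method.
public class OverloadSketch {
  /** @deprecated use {@link #listPartitions(String, String)} */
  @Deprecated
  public static String listPartitions(String rootPath) {
    // Old signature delegates to the new one with a null spec.
    return listPartitions(rootPath, null);
  }

  /** New overload: spec may be null; Optional is confined to the method body. */
  public static String listPartitions(String rootPath, String spec) {
    return rootPath + "/" + Optional.ofNullable(spec).orElse("inferred");
  }

  public static void main(String[] args) {
    System.out.println(listPartitions("/data"));           // prints /data/inferred
    System.out.println(listPartitions("/data", "by-day")); // prints /data/by-day
  }
}
```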



##########
spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -911,6 +935,14 @@ public void testPartitionedImportFromEmptyPartitionDoesNotThrow() {
     new StructField("ts", DataTypes.DateType, true, Metadata.empty())
   };
 
+  private static final StructField[] dateHourStruct = {
+    new StructField("id", DataTypes.IntegerType, true, Metadata.empty()),
+    new StructField("name", DataTypes.StringType, true, Metadata.empty()),
+    new StructField("dept", DataTypes.StringType, true, Metadata.empty()),
+    new StructField("ts", DataTypes.DateType, true, Metadata.empty()),
+    new StructField("hour", DataTypes.StringType, true, Metadata.empty())

Review Comment:
   Question (maybe I am not understanding the original problem entirely): can we re-use the existing test struct, i.e., using "dept=01" as the partition field?



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java:
##########
@@ -354,4 +354,15 @@ public static Map<Integer, String> indexQuotedNameById(Schema schema) {
     Function<String, String> quotingFunc = name -> String.format("`%s`", name.replace("`", "``"));
     return TypeUtil.indexQuotedNameById(schema.asStruct(), quotingFunc);
   }
+
+  /**
+   * Convert a {@link PartitionSpec} to a {@link DataType Spark type}.
+   *
+   * @param spec iceberg PartitionSpec
+   * @return {@link StructType}
+   * @throws IllegalArgumentException if the type cannot be converted
+   */
+  public static StructType convert(PartitionSpec spec) {
+    return convert(new Schema(spec.partitionType().asNestedType().asStructType().fields()));

Review Comment:
   I think the general project direction has been to avoid adding a public method unless we need it in many places. This one converts a partition spec type to a Spark type, which should be a fairly limited use case, so I would suggest inlining it.
   
   I personally think that, if anything, adding a public convert(StructType) method that converts to a Spark StructType would be more useful, instead of forcing construction of a Schema in the existing convert method, and it would cover this case here. I'd like to see what @aokolnychyi @RussellSpitzer think as well.
   
   Also, is this more convoluted than needed? Could it be:
   ```
   convert(new Schema(spec.partitionType().fields()))
   ```
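The design point above can be illustrated with a self-contained sketch. This is a hypothetical mini-model, not the real Iceberg/Spark API: the `Schema`, `convert`, and `partitionTypeToSpark` names below stand in for `org.apache.iceberg.Schema`, `SparkSchemaUtil.convert(Schema)`, and the single call site in the PR. The idea is that the conversion can be inlined where it is needed rather than adding a new public method.

```java
import java.util.List;

// Hypothetical mini-model of the review suggestion: keep the existing public
// convert(Schema) entry point and inline the wrapping at the one call site.
public class ConvertSketch {
  static class Schema {
    final List<String> fields;
    Schema(List<String> fields) { this.fields = fields; }
  }

  // Stand-in for the existing public entry point, SparkSchemaUtil.convert(Schema).
  static String convert(Schema schema) {
    return "struct<" + String.join(",", schema.fields) + ">";
  }

  // The conversion inlined at the single call site, instead of a new public method.
  static String partitionTypeToSpark(List<String> partitionFields) {
    return convert(new Schema(partitionFields));
  }

  public static void main(String[] args) {
    System.out.println(partitionTypeToSpark(List.of("dept:string", "hour:string")));
    // prints struct<dept:string,hour:string>
  }
}
```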



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org