Posted to issues@iceberg.apache.org by "abmo-x (via GitHub)" <gi...@apache.org> on 2023/02/08 23:48:24 UTC

[GitHub] [iceberg] abmo-x opened a new pull request, #6779: use table partition schema in add_files for getPartitions to avoid data corruption

abmo-x opened a new pull request, #6779:
URL: https://github.com/apache/iceberg/pull/6779

   Issue:
   A partition of string type whose value is an integer with a leading zero, such as "01", gets stored incorrectly without the zero, as "1". As a result, both the partition and the column value are stored and returned incorrectly for that partition.
   
   Fix:
   Because Spark's automatic type inference is used in the add_files procedure, we now pass the table's partition spec as a user-defined partition schema, which bypasses Spark's automatic type inference.
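   
   For illustration, here is a minimal standalone sketch (not from the PR; the class name and paths are hypothetical) of the Spark behavior the fix relies on: supplying a user-defined schema disables Spark's partition-column type inference, so a leading-zero string partition value survives.
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   import org.apache.spark.sql.types.DataTypes;
   import org.apache.spark.sql.types.Metadata;
   import org.apache.spark.sql.types.StructField;
   import org.apache.spark.sql.types.StructType;
   
   public class PartitionInferenceDemo {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder().master("local").getOrCreate();
   
       // /tmp/src contains a partition directory dept=01/ with parquet files.
       // Inferred: dept becomes IntegerType, so "01" is read back as 1.
       Dataset<Row> inferred = spark.read().parquet("/tmp/src");
       inferred.printSchema();
   
       // Declared: dept stays StringType, so "01" keeps its leading zero.
       StructType schema =
           new StructType(
               new StructField[] {
                 new StructField("id", DataTypes.IntegerType, true, Metadata.empty()),
                 new StructField("dept", DataTypes.StringType, true, Metadata.empty())
               });
       Dataset<Row> declared = spark.read().schema(schema).parquet("/tmp/src");
       declared.printSchema();
     }
   }
   ```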




[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1110474802


##########
spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -926,6 +955,17 @@ private static java.sql.Date toDate(String value) {
               new StructType(dateStruct))
           .repartition(2);
 
+  private static final Dataset<Row> dateHourDF =

Review Comment:
   I think dateDF is specifically there to address the date partition test, and dateHourDF doesn't indicate that it is testing a case where the hour is modeled as a string. Maybe something more descriptive like testPartitionTypeDF.





[GitHub] [iceberg] abmo-x commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1112258293


##########
spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -911,6 +935,14 @@ public void testPartitionedImportFromEmptyPartitionDoesNotThrow() {
     new StructField("ts", DataTypes.DateType, true, Metadata.empty())
   };
 
+  private static final StructField[] dateHourStruct = {
+    new StructField("id", DataTypes.IntegerType, true, Metadata.empty()),
+    new StructField("name", DataTypes.StringType, true, Metadata.empty()),
+    new StructField("dept", DataTypes.StringType, true, Metadata.empty()),
+    new StructField("ts", DataTypes.DateType, true, Metadata.empty()),
+    new StructField("hour", DataTypes.StringType, true, Metadata.empty())

Review Comment:
   Done





[GitHub] [iceberg] abmo-x commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1112258161


##########
spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -926,6 +955,17 @@ private static java.sql.Date toDate(String value) {
               new StructType(dateStruct))
           .repartition(2);
 
+  private static final Dataset<Row> dateHourDF =

Review Comment:
   Done. I had to rearrange the date column order, as the Spark insert was complaining about a string-to-date conversion for a non-date column.





[GitHub] [iceberg] szehon-ho commented on pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#issuecomment-1437743359

   I'll wait a bit for other comments; if there are none, I will commit.




[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1109199713


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java:
##########
@@ -836,9 +836,30 @@ public static String quotedFullIdentifier(String catalogName, Identifier identif
    * @param format format of the file
    * @param partitionFilter partitionFilter of the file
    * @return all table's partitions
+   * @deprecated use {@link Spark3Util#getPartitions(SparkSession, Path, String, Map, Option)}
    */
+  @Deprecated
   public static List<SparkPartition> getPartitions(
       SparkSession spark, Path rootPath, String format, Map<String, String> partitionFilter) {
+    return getPartitions(spark, rootPath, format, partitionFilter, Option.empty());
+  }
+
+  /**
+   * Use Spark to list all partitions in the table.
+   *
+   * @param spark a Spark session
+   * @param rootPath a table identifier
+   * @param format format of the file
+   * @param partitionFilter partitionFilter of the file
+   * @param partitionSpec partitionSpec of the table
+   * @return all table's partitions
+   */
+  public static List<SparkPartition> getPartitions(
+      SparkSession spark,
+      Path rootPath,
+      String format,
+      Map<String, String> partitionFilter,
+      Option<PartitionSpec> partitionSpec) {

Review Comment:
   We shouldn't take a Scala Option as an argument to a Java method. I would vote to just take a regular PartitionSpec and check for null in the code. The Java equivalent is java.util.Optional, and its Javadoc suggests using it primarily for return types.



##########
spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -911,6 +935,14 @@ public void testPartitionedImportFromEmptyPartitionDoesNotThrow() {
     new StructField("ts", DataTypes.DateType, true, Metadata.empty())
   };
 
+  private static final StructField[] dateHourStruct = {
+    new StructField("id", DataTypes.IntegerType, true, Metadata.empty()),
+    new StructField("name", DataTypes.StringType, true, Metadata.empty()),
+    new StructField("dept", DataTypes.StringType, true, Metadata.empty()),
+    new StructField("ts", DataTypes.DateType, true, Metadata.empty()),
+    new StructField("hour", DataTypes.StringType, true, Metadata.empty())

Review Comment:
   Question (maybe I am not understanding the original problem entirely): can we re-use the existing test struct, i.e., using "dept=01" as the partition field?



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java:
##########
@@ -354,4 +354,15 @@ public static Map<Integer, String> indexQuotedNameById(Schema schema) {
     Function<String, String> quotingFunc = name -> String.format("`%s`", name.replace("`", "``"));
     return TypeUtil.indexQuotedNameById(schema.asStruct(), quotingFunc);
   }
+
+  /**
+   * Convert a {@link PartitionSpec} to a {@link DataType Spark type}.
+   *
+   * @param spec iceberg PartitionSpec
+   * @return {@link StructType}
+   * @throws IllegalArgumentException if the type cannot be converted
+   */
+  public static StructType convert(PartitionSpec spec) {
+    return convert(new Schema(spec.partitionType().asNestedType().asStructType().fields()));

Review Comment:
   I think the general project direction has been to avoid adding a public method unless we need it in many places. This converts from a partition spec type to a Spark type, which should be fairly limited in use, so I would suggest inlining it.
   
   I personally think that, if anything, adding a public method convert(StructType) returning a Spark StructType would be more useful, instead of forcing construction of a Schema in the existing convert method, and it would cover this case here. I'd like to see what @aokolnychyi @RussellSpitzer think as well.
   
   Also, is it more convoluted than needed? Could it be?
   ```
   convert(new Schema(partType.partitionType().fields()))
   ```
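   
   For reference, a hedged sketch of that inline form (assuming `spec` is the table's PartitionSpec, as in the diff):
   ```java
   // Build an Iceberg Schema from the spec's partition struct fields,
   // then convert it to the Spark StructType used for partition listing.
   Schema partitionSchema = new Schema(spec.partitionType().fields());
   StructType sparkPartitionType = SparkSchemaUtil.convert(partitionSchema);
   ```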





[GitHub] [iceberg] abmo-x commented on pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#issuecomment-1437509528

   Thanks @szehon-ho 




[GitHub] [iceberg] abmo-x commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1110308969


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java:
##########
@@ -354,4 +354,15 @@ public static Map<Integer, String> indexQuotedNameById(Schema schema) {
     Function<String, String> quotingFunc = name -> String.format("`%s`", name.replace("`", "``"));
     return TypeUtil.indexQuotedNameById(schema.asStruct(), quotingFunc);
   }
+
+  /**
+   * Convert a {@link PartitionSpec} to a {@link DataType Spark type}.
+   *
+   * @param spec iceberg PartitionSpec
+   * @return {@link StructType}
+   * @throws IllegalArgumentException if the type cannot be converted
+   */
+  public static StructType convert(PartitionSpec spec) {
+    return convert(new Schema(spec.partitionType().asNestedType().asStructType().fields()));

Review Comment:
   removed the public method and used the recommended code snippet. Thanks





[GitHub] [iceberg] abmo-x commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1110309278


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java:
##########
@@ -836,9 +836,30 @@ public static String quotedFullIdentifier(String catalogName, Identifier identif
    * @param format format of the file
    * @param partitionFilter partitionFilter of the file
    * @return all table's partitions
+   * @deprecated use {@link Spark3Util#getPartitions(SparkSession, Path, String, Map, Option)}
    */
+  @Deprecated
   public static List<SparkPartition> getPartitions(
       SparkSession spark, Path rootPath, String format, Map<String, String> partitionFilter) {
+    return getPartitions(spark, rootPath, format, partitionFilter, Option.empty());
+  }
+
+  /**
+   * Use Spark to list all partitions in the table.
+   *
+   * @param spark a Spark session
+   * @param rootPath a table identifier
+   * @param format format of the file
+   * @param partitionFilter partitionFilter of the file
+   * @param partitionSpec partitionSpec of the table
+   * @return all table's partitions
+   */
+  public static List<SparkPartition> getPartitions(
+      SparkSession spark,
+      Path rootPath,
+      String format,
+      Map<String, String> partitionFilter,
+      Option<PartitionSpec> partitionSpec) {

Review Comment:
   Fixed





[GitHub] [iceberg] abmo-x commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1112245408


##########
spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -418,6 +418,27 @@ public void addDataPartitionedByDateToPartitioned() {
         sql("SELECT id, name, dept, date FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void addDataPartitionedByDateHourToPartitioned() {

Review Comment:
   updated name



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java:
##########
@@ -815,9 +816,30 @@ public static String quotedFullIdentifier(String catalogName, Identifier identif
    * @param format format of the file
    * @param partitionFilter partitionFilter of the file
    * @return all table's partitions
+   * @deprecated use {@link Spark3Util#getPartitions(SparkSession, Path, String, Map, Option)}
    */
+  @Deprecated
   public static List<SparkPartition> getPartitions(
       SparkSession spark, Path rootPath, String format, Map<String, String> partitionFilter) {
+    return getPartitions(spark, rootPath, format, partitionFilter, Optional.empty());
+  }
+
+  /**
+   * Use Spark to list all partitions in the table.
+   *
+   * @param spark a Spark session
+   * @param rootPath a table identifier
+   * @param format format of the file
+   * @param partitionFilter partitionFilter of the file
+   * @param partitionSpec partitionSpec of the table
+   * @return all table's partitions
+   */
+  public static List<SparkPartition> getPartitions(
+      SparkSession spark,
+      Path rootPath,
+      String format,
+      Map<String, String> partitionFilter,
+      Optional<PartitionSpec> partitionSpec) {

Review Comment:
   ah, got it. Updated it.





[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1109194175


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java:
##########
@@ -354,4 +354,15 @@ public static Map<Integer, String> indexQuotedNameById(Schema schema) {
     Function<String, String> quotingFunc = name -> String.format("`%s`", name.replace("`", "``"));
     return TypeUtil.indexQuotedNameById(schema.asStruct(), quotingFunc);
   }
+
+  /**
+   * Convert a {@link PartitionSpec} to a {@link DataType Spark type}.
+   *
+   * @param spec iceberg PartitionSpec
+   * @return {@link StructType}
+   * @throws IllegalArgumentException if the type cannot be converted
+   */
+  public static StructType convert(PartitionSpec spec) {
+    return convert(new Schema(spec.partitionType().asNestedType().asStructType().fields()));

Review Comment:
   I think the general project direction has been to avoid adding a public method unless we need it in many places. This converts from a partition spec type to a Spark type, which should be fairly limited in use, so I would suggest inlining it.
   
   I personally think that, if anything, adding a public method convert(StructType) returning a Spark StructType would be more useful, and it would cover this case here.
   
   Would like to see what @aokolnychyi @RussellSpitzer think as well.
   
   Also, is it more convoluted than needed? Could it be?
   ```
   convert(new Schema(partType.partitionType().fields()))
   ```



##########
spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -911,6 +935,14 @@ public void testPartitionedImportFromEmptyPartitionDoesNotThrow() {
     new StructField("ts", DataTypes.DateType, true, Metadata.empty())
   };
 
+  private static final StructField[] dateHourStruct = {
+    new StructField("id", DataTypes.IntegerType, true, Metadata.empty()),
+    new StructField("name", DataTypes.StringType, true, Metadata.empty()),
+    new StructField("dept", DataTypes.StringType, true, Metadata.empty()),
+    new StructField("ts", DataTypes.DateType, true, Metadata.empty()),
+    new StructField("hour", DataTypes.StringType, true, Metadata.empty())

Review Comment:
   Question (maybe I am not understanding the original problem entirely): can we re-use the existing test struct, i.e., using "dept=01" as the partition field? I think we can remove a lot of test code, if so.





[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1110474171


##########
spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -911,6 +935,14 @@ public void testPartitionedImportFromEmptyPartitionDoesNotThrow() {
     new StructField("ts", DataTypes.DateType, true, Metadata.empty())
   };
 
+  private static final StructField[] dateHourStruct = {
+    new StructField("id", DataTypes.IntegerType, true, Metadata.empty()),
+    new StructField("name", DataTypes.StringType, true, Metadata.empty()),
+    new StructField("dept", DataTypes.StringType, true, Metadata.empty()),
+    new StructField("ts", DataTypes.DateType, true, Metadata.empty()),
+    new StructField("hour", DataTypes.StringType, true, Metadata.empty())

Review Comment:
   OK, in that case I'd prefer not to add extra structs that aren't strictly necessary, to keep the changes smaller. I don't see a string hour with a value like "01" being much more readable than a dept named "01", enough to justify a new struct (most hours are modeled as ints).
   
   I think we can still make a separate DF if we need to.





[GitHub] [iceberg] dramaticlly commented on a diff in pull request #6779: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1100813099


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java:
##########
@@ -392,4 +392,14 @@ public static Map<Integer, String> indexQuotedNameById(Schema schema) {
     Function<String, String> quotingFunc = name -> String.format("`%s`", name.replace("`", "``"));
     return TypeUtil.indexQuotedNameById(schema.asStruct(), quotingFunc);
   }
+
+  /**
+   * convert partition spec to Spark type
+   *
+   * @param spec
+   * @return
+   */
+  public static StructType convert(PartitionSpec spec) {

Review Comment:
   nit: 
   
   ```java
   
     /**
      * Convert a {@link PartitionSpec} to a {@link DataType Spark type}.
      *
      * @param spec an Iceberg PartitionSpec
      * @return the equivalent Spark type
      */
   ```



##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java:
##########
@@ -836,9 +836,29 @@ public static String quotedFullIdentifier(String catalogName, Identifier identif
    * @param format format of the file
    * @param partitionFilter partitionFilter of the file
    * @return all table's partitions
+   * @deprecated use {@link Spark3Util#getPartitions(SparkSession, Path, String, Map, Option)}
    */
+  @Deprecated
   public static List<SparkPartition> getPartitions(
       SparkSession spark, Path rootPath, String format, Map<String, String> partitionFilter) {
+    return getPartitions(spark, rootPath, format, partitionFilter, Option.empty());
+  }
+
+  /**
+   * Use Spark to list all partitions in the table.

Review Comment:
   Do you want to add a comment to the Javadoc to differentiate this from the deprecated method above?
   
   For example: an optional Iceberg partition spec parameter that influences how Spark lists partitions.
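   
   A sketch of what that Javadoc note could look like (wording illustrative only):
   ```java
   /**
    * Use Spark to list all partitions in the table.
    *
    * <p>Unlike the overload above, this variant accepts an optional Iceberg partition
    * spec that fixes the partition column types instead of letting Spark infer them
    * from the file paths.
    */
   ```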





[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1112267916


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java:
##########
@@ -836,19 +836,47 @@ public static String quotedFullIdentifier(String catalogName, Identifier identif
    * @param format format of the file
    * @param partitionFilter partitionFilter of the file
    * @return all table's partitions
+   * @deprecated use {@link Spark3Util#getPartitions(SparkSession, Path, String, Map,
+   *     PartitionSpec)}
    */
+  @Deprecated

Review Comment:
   Nit: I think it's still valid to have this API, with no need to deprecate it, in case you want to infer the partition spec from the path. wdyt?





[GitHub] [iceberg] szehon-ho commented on pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#issuecomment-1441151139

   Merged, thanks @abmo-x, and @dramaticlly for the extra review




[GitHub] [iceberg] abmo-x commented on a diff in pull request #6779: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1100862211


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java:
##########
@@ -836,9 +836,29 @@ public static String quotedFullIdentifier(String catalogName, Identifier identif
    * @param format format of the file
    * @param partitionFilter partitionFilter of the file
    * @return all table's partitions
+   * @deprecated use {@link Spark3Util#getPartitions(SparkSession, Path, String, Map, Option)}
    */
+  @Deprecated
   public static List<SparkPartition> getPartitions(
       SparkSession spark, Path rootPath, String format, Map<String, String> partitionFilter) {
+    return getPartitions(spark, rootPath, format, partitionFilter, Option.empty());
+  }
+
+  /**
+   * Use Spark to list all partitions in the table.

Review Comment:
   Updated the Javadoc for the method.



##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java:
##########
@@ -392,4 +392,14 @@ public static Map<Integer, String> indexQuotedNameById(Schema schema) {
     Function<String, String> quotingFunc = name -> String.format("`%s`", name.replace("`", "``"));
     return TypeUtil.indexQuotedNameById(schema.asStruct(), quotingFunc);
   }
+
+  /**
+   * convert partition spec to Spark type
+   *
+   * @param spec
+   * @return
+   */
+  public static StructType convert(PartitionSpec spec) {

Review Comment:
   updated, thanks
   





[GitHub] [iceberg] abmo-x commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1110344812


##########
spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -911,6 +935,14 @@ public void testPartitionedImportFromEmptyPartitionDoesNotThrow() {
     new StructField("ts", DataTypes.DateType, true, Metadata.empty())
   };
 
+  private static final StructField[] dateHourStruct = {
+    new StructField("id", DataTypes.IntegerType, true, Metadata.empty()),
+    new StructField("name", DataTypes.StringType, true, Metadata.empty()),
+    new StructField("dept", DataTypes.StringType, true, Metadata.empty()),
+    new StructField("ts", DataTypes.DateType, true, Metadata.empty()),
+    new StructField("hour", DataTypes.StringType, true, Metadata.empty())

Review Comment:
   I wanted to reproduce the failure with date/hour; even if I use dept, I would still have to keep some of the code for creating a table with a different partition layout.
   
   I will keep it as is, for readability and to have a reproducible test that maps to the date/hour case.





[GitHub] [iceberg] szehon-ho merged pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho merged PR #6779:
URL: https://github.com/apache/iceberg/pull/6779





[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1110473312


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java:
##########
@@ -815,9 +816,30 @@ public static String quotedFullIdentifier(String catalogName, Identifier identif
    * @param format format of the file
    * @param partitionFilter partitionFilter of the file
    * @return all table's partitions
+   * @deprecated use {@link Spark3Util#getPartitions(SparkSession, Path, String, Map, Option)}
    */
+  @Deprecated
   public static List<SparkPartition> getPartitions(
       SparkSession spark, Path rootPath, String format, Map<String, String> partitionFilter) {
+    return getPartitions(spark, rootPath, format, partitionFilter, Optional.empty());
+  }
+
+  /**
+   * Use Spark to list all partitions in the table.
+   *
+   * @param spark a Spark session
+   * @param rootPath a table identifier
+   * @param format format of the file
+   * @param partitionFilter partitionFilter of the file
+   * @param partitionSpec partitionSpec of the table
+   * @return all table's partitions
+   */
+  public static List<SparkPartition> getPartitions(
+      SparkSession spark,
+      Path rootPath,
+      String format,
+      Map<String, String> partitionFilter,
+      Optional<PartitionSpec> partitionSpec) {

Review Comment:
   As I was saying in the original comment, the Optional Javadoc mentions that it is usually for return values; I think in Java it is not so frequently used for arguments, and I don't think we use it much in Iceberg. So I would say, let's just make it a PartitionSpec that can be null. We have the other version that takes four arguments for users.
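   
   A minimal sketch of the null-check pattern being suggested (the helper name toPartitionType is hypothetical, not from the PR):
   ```java
   // Returns null when no spec is given, letting Spark infer the partition
   // column types; otherwise converts the spec's partition struct into the
   // Spark StructType used as the user-specified schema.
   static StructType toPartitionType(PartitionSpec partitionSpec) {
     if (partitionSpec == null) {
       return null;
     }
     return SparkSchemaUtil.convert(new Schema(partitionSpec.partitionType().fields()));
   }
   ```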





[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1110474171





##########
spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -418,6 +418,27 @@ public void addDataPartitionedByDateToPartitioned() {
         sql("SELECT id, name, dept, date FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void addDataPartitionedByDateHourToPartitioned() {

Review Comment:
   I think the test name doesn't capture the problem it's solving; it should be something like 'testPartitionType'.





[GitHub] [iceberg] abmo-x commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1112293111


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java:
##########
@@ -836,19 +836,47 @@ public static String quotedFullIdentifier(String catalogName, Identifier identif
    * @param format format of the file
    * @param partitionFilter partitionFilter of the file
    * @return all table's partitions
+   * @deprecated use {@link Spark3Util#getPartitions(SparkSession, Path, String, Map,
+   *     PartitionSpec)}
    */
+  @Deprecated

Review Comment:
   Now that the spec in the other method is not optional, I think we can remove the deprecated flag from this method.
   However, since we know Spark's interpretation doesn't work for all cases and users will always have a partition spec, it might be better to deprecate this anyway: `add_files` doesn't validate the schema and does not fail fast. If a partition type is not interpreted correctly, add_files doesn't fail, and users end up with bad data in their table.
   


