You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by "wypoon (via GitHub)" <gi...@apache.org> on 2023/05/30 23:08:48 UTC

[GitHub] [iceberg] wypoon opened a new pull request, #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

wypoon opened a new pull request, #7744:
URL: https://github.com/apache/iceberg/pull/7744

   This is https://github.com/apache/iceberg/issues/7612 reported by @gaborkaszab.
   
   In this change, we fix the Spark migrate procedure to be able to handle a partition where a partition value contains '/'.
   We do this by fixing `TableMigrationUtil.listPartition`. Instead of constructing a partition key (a String like `a=1/b=2`) from the `Map<String, String>` of partition column to partition value that is passed in, and then calling `DataFiles.filllFromPath` (via `DataFiles.Builder#withPartitionPath`) with the partition key String, where it is then split along '/' and the value recovered for each partition column (by further splitting along '='), we simply construct a List of the partition values and hand that off.
   
   (We retain `DataFiles.filllFromPath` and `DataFiles.Builder#withPartitionPath` for other existing callers.)
   
   We add a unit test that shows the problem; without the fix, the test fails with 
   ```
   java.lang.IllegalArgumentException: Invalid partition data, too many fields (expecting 1): data=2023/05/30
   	at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:326)
   	at org.apache.iceberg.DataFiles.fillFromPath(DataFiles.java:67)
   	at org.apache.iceberg.DataFiles$Builder.withPartitionPath(DataFiles.java:246)
   	at org.apache.iceberg.data.TableMigrationUtil.buildDataFile(TableMigrationUtil.java:190)
   	at org.apache.iceberg.data.TableMigrationUtil.lambda$listPartition$3(TableMigrationUtil.java:131)
   	at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
   	at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
   	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
   	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196)
   	at org.apache.iceberg.data.TableMigrationUtil.listPartition(TableMigrationUtil.java:126)
   	at org.apache.iceberg.data.TableMigrationUtil.listPartition(TableMigrationUtil.java:83)
   	at org.apache.iceberg.spark.SparkTableUtil.listPartition(SparkTableUtil.java:278)
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1239067224


##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -248,6 +268,16 @@ public Builder withPartitionPath(String newPartitionPath) {
       return this;
     }
 
+    public Builder withPartitionValues(List<String> partitionValues) {
+      Preconditions.checkArgument(
+          isPartitioned || partitionValues.isEmpty(),
+          "Cannot add partition data for an unpartitioned table");
+      if (!partitionValues.isEmpty()) {

Review Comment:
   Do you mean `partitionValues.isEmpty()` can never be true logically? Or it is never true because `withPartitionValues` is never called with an empty `partitionValues`?
   If the latter, that is because I'm the only one who has written code calling `withPartitionValues` but there is nothing to stop future callers from calling it with empty `partitionValues`.
   I copied the code from the existing `withPartitionPath`. Perhaps the `if` in both methods is overkill and protects against calls that don't exist (but in principle could occur).
   In principle, `isPartitioned` could be true and `withPartitionValues` be called with empty `partitionValues`. Then the `Preconditions.checkArgument` would pass and the condition in the `if` would be false.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1237445435


##########
data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
##########
@@ -83,27 +83,47 @@ public static List<DataFile> listPartition(
     return listPartition(partition, uri, format, spec, conf, metricsConfig, mapping, 1);
   }
 
+  /**
+   * Returns the data files in a partition by listing the partition location using some number of
+   * threads.
+   *
+   * <p>For Parquet and ORC partitions, this will read metrics from the file footer. For Avro
+   * partitions, metrics are set to null.
+   *
+   * <p>Note: certain metrics, like NaN counts, that are only supported by Iceberg file writers but
+   * not file footers, will not be populated.
+   *
+   * @param partition map of partition columns to column values
+   * @param uri partition location URI
+   * @param format partition format, avro, parquet or orc
+   * @param spec a partition spec
+   * @param conf a Hadoop conf
+   * @param metricsConfig a metrics conf
+   * @param mapping a name mapping
+   * @param parallelism number of threads to use
+   * @return a List of DataFile
+   */
   public static List<DataFile> listPartition(
-      Map<String, String> partitionPath,
-      String partitionUri,
+      Map<String, String> partition,

Review Comment:
   We are modifying an existing public method. There is no change to the signature of the method, although the names of a few parameters were changed to make them consistent with the other existing `listPartitions` method (which was actually the first `listPartitions` method). The difference between the first `listPartitions` and this second one is that this has an additional `parallelism` parameter. The javadoc for the first method is copied over to this second method with relevant addition (another reason to make the parameter names consistent).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1239128963


##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -248,6 +268,16 @@ public Builder withPartitionPath(String newPartitionPath) {
       return this;
     }
 
+    public Builder withPartitionValues(List<String> partitionValues) {
+      Preconditions.checkArgument(
+          isPartitioned || partitionValues.isEmpty(),
+          "Cannot add partition data for an unpartitioned table");
+      if (!partitionValues.isEmpty()) {

Review Comment:
   Actually, I see now that we should leave things as they are.
   `SparkTableUtil.importUnpartitionedSparkTable` actually does call `TableMigrationUtil.listPartition`, with an empty `Map` for `partition`, and so that will call `DataFiles.Builder#withPartitionValues` with an empty `List` for `partitionValues`. That should be allowed.
   
   (I tried the change I just proposed above, and some tests in `TestMigrateTableProcedure` failed.)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1239117799


##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -248,6 +268,16 @@ public Builder withPartitionPath(String newPartitionPath) {
       return this;
     }
 
+    public Builder withPartitionValues(List<String> partitionValues) {
+      Preconditions.checkArgument(
+          isPartitioned || partitionValues.isEmpty(),
+          "Cannot add partition data for an unpartitioned table");
+      if (!partitionValues.isEmpty()) {

Review Comment:
   I can change the `Preconditions.checkArgument` in `withPartitionValues` to
   ```
         Preconditions.checkArgument(
             isPartitioned && !partitionValues.isEmpty(),
             "Table must be partitioned and partition values must not be empty");
   ```
   Then 1 and 4 will throw an error.
   Only 3 will be allowed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] gaborkaszab commented on pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "gaborkaszab (via GitHub)" <gi...@apache.org>.

gaborkaszab commented on PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#issuecomment-1569485492

   Hey @wypoon , I have also uploaded a PR to fix this issues. Forgot to put it into the github issue, though. https://github.com/apache/iceberg/pull/7738
   Anyway I believe DataFiles.Builder already has an interface to avoid using the path for providing a partition.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.

RussellSpitzer commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1237406627


##########
data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
##########
@@ -162,40 +182,44 @@ private static Metrics getAvroMetrics(Path path, Configuration conf) {
   }
 
   private static Metrics getParquetMetrics(
-      Path path, Configuration conf, MetricsConfig metricsSpec, NameMapping mapping) {
+      Path path, Configuration conf, MetricsConfig metricsConfig, NameMapping mapping) {
     try {
       InputFile file = HadoopInputFile.fromPath(path, conf);
-      return ParquetUtil.fileMetrics(file, metricsSpec, mapping);
+      return ParquetUtil.fileMetrics(file, metricsConfig, mapping);
     } catch (UncheckedIOException e) {
       throw new RuntimeException("Unable to read the metrics of the Parquet file: " + path, e);
     }
   }
 
   private static Metrics getOrcMetrics(
-      Path path, Configuration conf, MetricsConfig metricsSpec, NameMapping mapping) {
+      Path path, Configuration conf, MetricsConfig metricsConfig, NameMapping mapping) {
     try {
-      return OrcMetrics.fromInputFile(HadoopInputFile.fromPath(path, conf), metricsSpec, mapping);
+      return OrcMetrics.fromInputFile(HadoopInputFile.fromPath(path, conf), metricsConfig, mapping);
     } catch (UncheckedIOException e) {
       throw new RuntimeException("Unable to read the metrics of the Orc file: " + path, e);
     }
   }
 
   private static DataFile buildDataFile(
-      FileStatus stat, String partitionKey, PartitionSpec spec, Metrics metrics, String format) {
+      FileStatus stat,
+      List<String> partitionValues,
+      PartitionSpec spec,
+      Metrics metrics,
+      String format) {
     return DataFiles.builder(spec)
         .withPath(stat.getPath().toString())
         .withFormat(format)
         .withFileSizeInBytes(stat.getLen())
         .withMetrics(metrics)
-        .withPartitionPath(partitionKey)
+        .withPartitionValues(partitionValues)
         .build();
   }
 
-  private static ExecutorService migrationService(int concurrentDeletes) {
+  private static ExecutorService migrationService(int numThreads) {

Review Comment:
   why is this renamed?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#issuecomment-1601410308

   > This PR is a bit challenging to review because it combines several cosmetic refactors with the actual code making the change. It would probably be much easier to read if we could separate out all the refactors that do not contribute to the core task of "handling partitions with a special character." Further changes to improve readability and so forth should go in a separate PR.
   
   I hear you. I shall remove the cosmetic refactors (except for renaming variables in `listPartitions`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1213631347


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMigrateTableProcedure.java:
##########
@@ -184,4 +184,21 @@ public void testInvalidMigrateCases() {
         "Cannot handle an empty identifier",
         () -> sql("CALL %s.system.migrate('')", catalogName));
   }
+
+  @Test
+  public void testMigratePartitionWithSpecialCharacter() throws IOException {
+    Assume.assumeTrue(catalogName.equals("spark_catalog"));
+    String location = temp.newFolder().toString();
+    sql(
+        "CREATE TABLE %s (id bigint NOT NULL, data string) USING parquet "
+            + "PARTITIONED BY (data) LOCATION '%s'",

Review Comment:
   Ok, no problem.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer merged pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.

RussellSpitzer merged PR #7744:
URL: https://github.com/apache/iceberg/pull/7744


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] gaborkaszab commented on pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "gaborkaszab (via GitHub)" <gi...@apache.org>.

gaborkaszab commented on PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#issuecomment-1590614290

   Also comparing to the other solution, indeed this in-place approach seems better because it won't require any changes on the client side. One last thing I'd like to try is to make a build from this, load it under my local Impala and run my tests just to be on the safe side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.

RussellSpitzer commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1238942292


##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -248,6 +268,16 @@ public Builder withPartitionPath(String newPartitionPath) {
       return this;
     }
 
+    public Builder withPartitionValues(List<String> partitionValues) {
+      Preconditions.checkArgument(
+          isPartitioned || partitionValues.isEmpty(),
+          "Cannot add partition data for an unpartitioned table");
+      if (!partitionValues.isEmpty()) {

Review Comment:
   This branch is never false



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.

RussellSpitzer commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1239078192


##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -248,6 +268,16 @@ public Builder withPartitionPath(String newPartitionPath) {
       return this;
     }
 
+    public Builder withPartitionValues(List<String> partitionValues) {
+      Preconditions.checkArgument(
+          isPartitioned || partitionValues.isEmpty(),
+          "Cannot add partition data for an unpartitioned table");
+      if (!partitionValues.isEmpty()) {

Review Comment:
   This is actually a little more confusing still, "fillFromValues" throws an error if you have the wrong number of values so currently the logic looks like
   
   1. isPartitioned + partitionValues.isEmpty - doesNothing
   2. isPartitioned + parititonValues.size != partitonSpec.size && isNotEmpty - Error
   3. isPartitioned + partitionValues.size == partitionSpec.size - does something
   4. not partitioned + partitionvalues.isEmpty - doesNothing
   5. not partitioned + parittionValues.nonEmpty - Error
   
   In my opinion here everything but 3 should be illegal. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] gaborkaszab commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "gaborkaszab (via GitHub)" <gi...@apache.org>.

gaborkaszab commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1211530056


##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -248,6 +273,16 @@ public Builder withPartitionPath(String newPartitionPath) {
       return this;
     }
 
+    public Builder withPartitionValues(List<String> partitionValues) {

Review Comment:
   There is an existing function to provide the partition to the builder. DataFiles.Builder.withPartition(). I think we shouldn't extend the Builder with a new function for his purpose.



##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMigrateTableProcedure.java:
##########
@@ -184,4 +184,21 @@ public void testInvalidMigrateCases() {
         "Cannot handle an empty identifier",
         () -> sql("CALL %s.system.migrate('')", catalogName));
   }
+
+  @Test
+  public void testMigratePartitionWithSpecialCharacter() throws IOException {
+    Assume.assumeTrue(catalogName.equals("spark_catalog"));
+    String location = temp.newFolder().toString();
+    sql(
+        "CREATE TABLE %s (id bigint NOT NULL, data string) USING parquet "
+            + "PARTITIONED BY (data) LOCATION '%s'",

Review Comment:
   You get an exception if you partition by a non-string column. The reason is that you pass the partition values as a Strin-String map and these string are not converted into the actual partition field type by default. My https://github.com/apache/iceberg/pull/7738 has a solution for this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] gaborkaszab commented on pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "gaborkaszab (via GitHub)" <gi...@apache.org>.

gaborkaszab commented on PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#issuecomment-1590611619

   I made another round of review and I think this PR is fine.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.

RussellSpitzer commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1211962729


##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -89,6 +89,31 @@ static PartitionData fillFromPath(PartitionSpec spec, String partitionPath, Part
     return data;
   }
 
+  static PartitionData fillFromValues(
+      PartitionSpec spec, List<String> partitionValues, PartitionData reuse) {
+    PartitionData data = reuse;
+    if (data == null) {
+      data = newPartitionData(spec);
+    }
+
+    Preconditions.checkArgument(

Review Comment:
   Could just do one check here size != size, "Invalid partition data, found %d fields, expected %d"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#issuecomment-1601684824

   @RussellSpitzer thanks for your second review. I have improved the javadoc and removed unrelated refactors in `TableMigrationUtil`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1237388661


##########
data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
##########
@@ -83,27 +83,47 @@ public static List<DataFile> listPartition(
     return listPartition(partition, uri, format, spec, conf, metricsConfig, mapping, 1);
   }
 
+  /**
+   * Returns the data files in a partition by listing the partition location using some number of
+   * threads.
+   *
+   * <p>For Parquet and ORC partitions, this will read metrics from the file footer. For Avro
+   * partitions, metrics are set to null.
+   *
+   * <p>Note: certain metrics, like NaN counts, that are only supported by Iceberg file writers but
+   * not file footers, will not be populated.
+   *
+   * @param partition map of partition columns to column values
+   * @param uri partition location URI
+   * @param format partition format, avro, parquet or orc
+   * @param spec a partition spec
+   * @param conf a Hadoop conf
+   * @param metricsConfig a metrics conf
+   * @param mapping a name mapping
+   * @param parallelism number of threads to use
+   * @return a List of DataFile
+   */
   public static List<DataFile> listPartition(
-      Map<String, String> partitionPath,
-      String partitionUri,
+      Map<String, String> partition,

Review Comment:
   Here I'm just using the same parameter names as in the first `listPartition` method above, which is actually the original `listPartition` method. This one here was added subsequently to introduce `parallelism`. I changed the parameter names here to make them consistent across both methods.
   Admittedly, `partition` is not the best name for the parameter, but it does pertain to a single partition, the one we want to list the data files for. So `partitions` would not be correct. I didn't call it `partitionValues` here because I use that later for a `List<String>`.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#issuecomment-1615094464

   Can this be merged now? It's been a week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1211945426


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMigrateTableProcedure.java:
##########
@@ -184,4 +184,21 @@ public void testInvalidMigrateCases() {
         "Cannot handle an empty identifier",
         () -> sql("CALL %s.system.migrate('')", catalogName));
   }
+
+  @Test
+  public void testMigratePartitionWithSpecialCharacter() throws IOException {
+    Assume.assumeTrue(catalogName.equals("spark_catalog"));
+    String location = temp.newFolder().toString();
+    sql(
+        "CREATE TABLE %s (id bigint NOT NULL, data string) USING parquet "
+            + "PARTITIONED BY (data) LOCATION '%s'",

Review Comment:
   I think Spark uses a `Map<String, String>` to encode the partition values to pass them to `TableMigrationUtil` because it doesn't have better information than that. https://github.com/apache/iceberg/blob/8858f1cfd3d1b0bd9b2d4230d1975a48d8bea1a8/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L173-L192
   Spark's SessionCatalog returns a `org.apache.spark.sql.catalyst.catalog.CatalogTablePartition` which contains a `Map[String, String]`. So this is what `SparkTableUtil` has to work with.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1212365154


##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -89,6 +89,31 @@ static PartitionData fillFromPath(PartitionSpec spec, String partitionPath, Part
     return data;
   }
 
+  static PartitionData fillFromValues(
+      PartitionSpec spec, List<String> partitionValues, PartitionData reuse) {
+    PartitionData data = reuse;
+    if (data == null) {
+      data = newPartitionData(spec);
+    }
+
+    Preconditions.checkArgument(

Review Comment:
   Sure, I'll just check for `partitionValues.size() == spec.fields().size()`. The code here was copied from `fillFromPath`, where there are two checks.
   Btw, I think we need to use "%s" in the string template, as the guava code only looks for "%s", to replace with the varargs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.

RussellSpitzer commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1211962729


##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -89,6 +89,31 @@ static PartitionData fillFromPath(PartitionSpec spec, String partitionPath, Part
     return data;
   }
 
+  static PartitionData fillFromValues(
+      PartitionSpec spec, List<String> partitionValues, PartitionData reuse) {
+    PartitionData data = reuse;
+    if (data == null) {
+      data = newPartitionData(spec);
+    }
+
+    Preconditions.checkArgument(

Review Comment:
   Could just do one check here size != size, "Invalid partition data, found %d fields, expected %d"
   
   Also %d instead of %s?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] gaborkaszab commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "gaborkaszab (via GitHub)" <gi...@apache.org>.

gaborkaszab commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1212937700


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMigrateTableProcedure.java:
##########
@@ -184,4 +184,21 @@ public void testInvalidMigrateCases() {
         "Cannot handle an empty identifier",
         () -> sql("CALL %s.system.migrate('')", catalogName));
   }
+
+  @Test
+  public void testMigratePartitionWithSpecialCharacter() throws IOException {
+    Assume.assumeTrue(catalogName.equals("spark_catalog"));
+    String location = temp.newFolder().toString();
+    sql(
+        "CREATE TABLE %s (id bigint NOT NULL, data string) USING parquet "
+            + "PARTITIONED BY (data) LOCATION '%s'",

Review Comment:
   Oh, my bad, I modified this test and it started failing due to incompatible types between bigint and string, but apparently the issue was on my side. But to be on the safe side, would it be possible to add test coverage on more partition column types other than string?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1211945426


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMigrateTableProcedure.java:
##########
@@ -184,4 +184,21 @@ public void testInvalidMigrateCases() {
         "Cannot handle an empty identifier",
         () -> sql("CALL %s.system.migrate('')", catalogName));
   }
+
+  @Test
+  public void testMigratePartitionWithSpecialCharacter() throws IOException {
+    Assume.assumeTrue(catalogName.equals("spark_catalog"));
+    String location = temp.newFolder().toString();
+    sql(
+        "CREATE TABLE %s (id bigint NOT NULL, data string) USING parquet "
+            + "PARTITIONED BY (data) LOCATION '%s'",

Review Comment:
   I think Spark uses a `Map<String, String>` to encode the partition values to pass them to `TableMigrationUtil` because it doesn't have better information than that. https://github.com/apache/iceberg/blob/8858f1cfd3d1b0bd9b2d4230d1975a48d8bea1a8/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L173-L192
   Spark's SessionCatalog returns a `Seq` of `org.apache.spark.sql.catalyst.catalog.CatalogTablePartition` which contains a `Map[String, String]`. So this is what `SparkTableUtil` has to work with.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] zinking commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "zinking (via GitHub)" <gi...@apache.org>.

zinking commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1236644398


##########
data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
##########
@@ -83,27 +83,47 @@ public static List<DataFile> listPartition(
     return listPartition(partition, uri, format, spec, conf, metricsConfig, mapping, 1);
   }
 
+  /**
+   * Returns the data files in a partition by listing the partition location using some number of
+   * threads.
+   *
+   * <p>For Parquet and ORC partitions, this will read metrics from the file footer. For Avro
+   * partitions, metrics are set to null.
+   *
+   * <p>Note: certain metrics, like NaN counts, that are only supported by Iceberg file writers but
+   * not file footers, will not be populated.
+   *
+   * @param partition map of partition columns to column values
+   * @param uri partition location URI
+   * @param format partition format, avro, parquet or orc
+   * @param spec a partition spec
+   * @param conf a Hadoop conf
+   * @param metricsConfig a metrics conf
+   * @param mapping a name mapping
+   * @param parallelism number of threads to use
+   * @return a List of DataFile
+   */
   public static List<DataFile> listPartition(
-      Map<String, String> partitionPath,
-      String partitionUri,
+      Map<String, String> partition,

Review Comment:
   partitions ? minor



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.

RussellSpitzer commented on PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#issuecomment-1615153312

   Merged, Thanks for your patience @wypoon I've been traveling this week so it's been very busy . Thanks also to @gaborkaszab and @JonasJ-ap for your review feedback.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.

RussellSpitzer commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1239288589


##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -248,6 +268,16 @@ public Builder withPartitionPath(String newPartitionPath) {
       return this;
     }
 
+    public Builder withPartitionValues(List<String> partitionValues) {
+      Preconditions.checkArgument(
+          isPartitioned || partitionValues.isEmpty(),
+          "Cannot add partition data for an unpartitioned table");
+      if (!partitionValues.isEmpty()) {

Review Comment:
   So if it's unpartitioned then it must be empty. So either it is partitioned and values is as long as the spec. Or it is unpartitioned and empty?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#issuecomment-1570461740

   @gaborkaszab thanks for reviewing and pointing out your PR, https://github.com/apache/iceberg/pull/7738. I haven't had a chance to review it yet, but I will do so.
   @szehon-ho @RussellSpitzer would you mind reviewing this please? If https://github.com/apache/iceberg/pull/7738 is a better solution, I am happy to contribute the tests here to it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1211945426


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMigrateTableProcedure.java:
##########
@@ -184,4 +184,21 @@ public void testInvalidMigrateCases() {
         "Cannot handle an empty identifier",
         () -> sql("CALL %s.system.migrate('')", catalogName));
   }
+
+  @Test
+  public void testMigratePartitionWithSpecialCharacter() throws IOException {
+    Assume.assumeTrue(catalogName.equals("spark_catalog"));
+    String location = temp.newFolder().toString();
+    sql(
+        "CREATE TABLE %s (id bigint NOT NULL, data string) USING parquet "
+            + "PARTITIONED BY (data) LOCATION '%s'",

Review Comment:
   I think Spark uses a `Map<String, String>` to encode the partition values to pass them to `TableMigrationUtil` because it doesn't have better information than that. https://github.com/apache/iceberg/blob/8858f1cfd3d1b0bd9b2d4230d1975a48d8bea1a8/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L173-L192
   Spark's SessionCatalog returns a `Seq` of `org.apache.spark.sql.catalyst.catalog.CatalogTablePartition`, each of which contains a `Map[String, String]`. So this is what `SparkTableUtil` has to work with.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#issuecomment-1607778822

   > One minor nit on the un-needed branch, I think this is good to go otherwise.
   
   @RussellSpitzer I have addressed the minor nit as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1239297558


##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -248,6 +268,16 @@ public Builder withPartitionPath(String newPartitionPath) {
       return this;
     }
 
+    public Builder withPartitionValues(List<String> partitionValues) {
+      Preconditions.checkArgument(
+          isPartitioned || partitionValues.isEmpty(),
+          "Cannot add partition data for an unpartitioned table");
+      if (!partitionValues.isEmpty()) {

Review Comment:
   Yes, that is correct.
   Right now we allow 1 and 4. We should allow 4, but to be strict, we should not allow 1. So, if we want to be strict, we can use
   ```
         Preconditions.checkArgument(
             (isPartitioned && !partitionValues.isEmpty())
                 || (!isPartitioned && partitionValues.isEmpty()),
             "Table must be partitioned or partition values must be empty");
   ```
   We still need the `if (!partitionValues.isEmpty())` though.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1239306101


##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -248,6 +268,16 @@ public Builder withPartitionPath(String newPartitionPath) {
       return this;
     }
 
+    public Builder withPartitionValues(List<String> partitionValues) {
+      Preconditions.checkArgument(
+          isPartitioned || partitionValues.isEmpty(),
+          "Cannot add partition data for an unpartitioned table");
+      if (!partitionValues.isEmpty()) {

Review Comment:
   Or more succinctly,
   ```
         Preconditions.checkArgument(
             isPartitioned ^ partitionValues.isEmpty(),
             "Table must be partitioned or partition values must be empty");
   ```
   I didn't realize that `^` could be used as a logical xor as well as bitwise xor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#issuecomment-1595240527

   @gaborkaszab thank you for reviewing and testing the fix with Impala!
   
   @szehon-ho @flyrain can one of you help review this? Perhaps @RussellSpitzer is on vacation?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] gaborkaszab commented on pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "gaborkaszab (via GitHub)" <gi...@apache.org>.

gaborkaszab commented on PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#issuecomment-1592937019

   I pulled this fix into my env and ran my Impala tests on it. Everything works as expected! I think this patch is good to go. @RussellSpitzer would you mind taking a look? Impala/Spark/Hive would need this fix for table migration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1237616093


##########
data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
##########
@@ -83,27 +83,47 @@ public static List<DataFile> listPartition(
     return listPartition(partition, uri, format, spec, conf, metricsConfig, mapping, 1);
   }
 
+  /**
+   * Returns the data files in a partition by listing the partition location using some number of

Review Comment:
   You're right.



##########
data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
##########
@@ -83,27 +83,47 @@ public static List<DataFile> listPartition(
     return listPartition(partition, uri, format, spec, conf, metricsConfig, mapping, 1);
   }
 
+  /**
+   * Returns the data files in a partition by listing the partition location using some number of
+   * threads.
+   *
+   * <p>For Parquet and ORC partitions, this will read metrics from the file footer. For Avro
+   * partitions, metrics are set to null.
+   *
+   * <p>Note: certain metrics, like NaN counts, that are only supported by Iceberg file writers but
+   * not file footers, will not be populated.
+   *
+   * @param partition map of partition columns to column values
+   * @param uri partition location URI
+   * @param format partition format, avro, parquet or orc
+   * @param spec a partition spec
+   * @param conf a Hadoop conf
+   * @param metricsConfig a metrics conf
+   * @param mapping a name mapping
+   * @param parallelism number of threads to use

Review Comment:
   Actually no. I have improved the javadoc in the latest revision to explain the parallelism.
   There is only one partition `listPartition` works on (as the name implies). The parallelism is in the reading of the data files (after the list of files is obtained) to get the metrics. The listing itself is done single-threaded.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.

RussellSpitzer commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1237411278


##########
data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
##########
@@ -83,27 +83,47 @@ public static List<DataFile> listPartition(
     return listPartition(partition, uri, format, spec, conf, metricsConfig, mapping, 1);
   }
 
+  /**
+   * Returns the data files in a partition by listing the partition location using some number of

Review Comment:
   Could we improve this description? It's not quite clear what this method is actually doing



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1239067224


##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -248,6 +268,16 @@ public Builder withPartitionPath(String newPartitionPath) {
       return this;
     }
 
+    public Builder withPartitionValues(List<String> partitionValues) {
+      Preconditions.checkArgument(
+          isPartitioned || partitionValues.isEmpty(),
+          "Cannot add partition data for an unpartitioned table");
+      if (!partitionValues.isEmpty()) {

Review Comment:
   Do you mean `partitionValues.isEmpty()` can never be true logically? Or it is never true because `withPartitionValues` is never called with an empty `partitionValues`?
   If the latter, that is because I'm the only one who has written code calling `withPartitionValues` but there is nothing to stop future callers from calling it with empty `partitionValues`.
   I copied the code from the existing `withPartitionPath`. Perhaps the `if` in both methods is overkill and protects against calls that don't exist (but in principle could occur).
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#issuecomment-1573949065

   @RussellSpitzer can you please review this again? I have addressed the feedback.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1237446942


##########
data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
##########
@@ -162,40 +182,44 @@ private static Metrics getAvroMetrics(Path path, Configuration conf) {
   }
 
   private static Metrics getParquetMetrics(
-      Path path, Configuration conf, MetricsConfig metricsSpec, NameMapping mapping) {
+      Path path, Configuration conf, MetricsConfig metricsConfig, NameMapping mapping) {
     try {
       InputFile file = HadoopInputFile.fromPath(path, conf);
-      return ParquetUtil.fileMetrics(file, metricsSpec, mapping);
+      return ParquetUtil.fileMetrics(file, metricsConfig, mapping);
     } catch (UncheckedIOException e) {
       throw new RuntimeException("Unable to read the metrics of the Parquet file: " + path, e);
     }
   }
 
   private static Metrics getOrcMetrics(
-      Path path, Configuration conf, MetricsConfig metricsSpec, NameMapping mapping) {
+      Path path, Configuration conf, MetricsConfig metricsConfig, NameMapping mapping) {
     try {
-      return OrcMetrics.fromInputFile(HadoopInputFile.fromPath(path, conf), metricsSpec, mapping);
+      return OrcMetrics.fromInputFile(HadoopInputFile.fromPath(path, conf), metricsConfig, mapping);
     } catch (UncheckedIOException e) {
       throw new RuntimeException("Unable to read the metrics of the Orc file: " + path, e);
     }
   }
 
   private static DataFile buildDataFile(
-      FileStatus stat, String partitionKey, PartitionSpec spec, Metrics metrics, String format) {
+      FileStatus stat,
+      List<String> partitionValues,
+      PartitionSpec spec,
+      Metrics metrics,
+      String format) {
     return DataFiles.builder(spec)
         .withPath(stat.getPath().toString())
         .withFormat(format)
         .withFileSizeInBytes(stat.getLen())
         .withMetrics(metrics)
-        .withPartitionPath(partitionKey)
+        .withPartitionValues(partitionValues)
         .build();
   }
 
-  private static ExecutorService migrationService(int concurrentDeletes) {
+  private static ExecutorService migrationService(int numThreads) {

Review Comment:
   Because the name is misleading. There are no deletes. This was obviously copied from somewhere else.
   I shall remove this renaming.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.

RussellSpitzer commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1237409435


##########
data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
##########
@@ -83,27 +83,47 @@ public static List<DataFile> listPartition(
     return listPartition(partition, uri, format, spec, conf, metricsConfig, mapping, 1);
   }
 
+  /**
+   * Returns the data files in a partition by listing the partition location using some number of
+   * threads.
+   *
+   * <p>For Parquet and ORC partitions, this will read metrics from the file footer. For Avro
+   * partitions, metrics are set to null.
+   *
+   * <p>Note: certain metrics, like NaN counts, that are only supported by Iceberg file writers but
+   * not file footers, will not be populated.
+   *
+   * @param partition map of partition columns to column values
+   * @param uri partition location URI
+   * @param format partition format, avro, parquet or orc
+   * @param spec a partition spec
+   * @param conf a Hadoop conf
+   * @param metricsConfig a metrics conf
+   * @param mapping a name mapping
+   * @param parallelism number of threads to use
+   * @return a List of DataFile
+   */
   public static List<DataFile> listPartition(
-      Map<String, String> partitionPath,
-      String partitionUri,
+      Map<String, String> partition,

Review Comment:
   I'm confused about what happened here, are we modifying the exiting public method or adding a new one? We should attempt not to break compatibility if we can for public methods in utility classes. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.

RussellSpitzer commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1237412430


##########
data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
##########
@@ -83,27 +83,47 @@ public static List<DataFile> listPartition(
     return listPartition(partition, uri, format, spec, conf, metricsConfig, mapping, 1);
   }
 
+  /**
+   * Returns the data files in a partition by listing the partition location using some number of
+   * threads.
+   *
+   * <p>For Parquet and ORC partitions, this will read metrics from the file footer. For Avro
+   * partitions, metrics are set to null.
+   *
+   * <p>Note: certain metrics, like NaN counts, that are only supported by Iceberg file writers but
+   * not file footers, will not be populated.
+   *
+   * @param partition map of partition columns to column values
+   * @param uri partition location URI
+   * @param format partition format, avro, parquet or orc
+   * @param spec a partition spec
+   * @param conf a Hadoop conf
+   * @param metricsConfig a metrics conf
+   * @param mapping a name mapping
+   * @param parallelism number of threads to use

Review Comment:
   number of partitions to search concurrently?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1239128963


##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -248,6 +268,16 @@ public Builder withPartitionPath(String newPartitionPath) {
       return this;
     }
 
+    public Builder withPartitionValues(List<String> partitionValues) {
+      Preconditions.checkArgument(
+          isPartitioned || partitionValues.isEmpty(),
+          "Cannot add partition data for an unpartitioned table");
+      if (!partitionValues.isEmpty()) {

Review Comment:
   Actually, I see now that we should leave things as they are.
   `SparkTableUtil.importUnpartitionedSparkTable` actually does call `TableMigrationUtil.listPartition`, with an empty `Map` for `partition`, and so that will call `DataFiles.Builder#withPartitionValues` with an empty `List` for `partitionValues`. That should be allowed. *And it should indeed do nothing.*
   
   (I tried the change I just proposed above, and some tests in `TestMigrateTableProcedure` failed.)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wypoon commented on a diff in pull request #7744: Core, Spark: Fix migrate table in case of partitioned table with partition containing a special character

Posted by "wypoon (via GitHub)" <gi...@apache.org>.

wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1212379437


##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -248,6 +273,16 @@ public Builder withPartitionPath(String newPartitionPath) {
       return this;
     }
 
+    public Builder withPartitionValues(List<String> partitionValues) {

Review Comment:
   I don't see why not. `DataFiles.Builder` is essentially a helper class. If there is use for another helper method, I don't see why we shouldn't add it.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org