You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/01/25 03:01:47 UTC

[GitHub] [iceberg] kingeasternsun opened a new pull request #3973: Add "parallelism" parameter to "add_files" syscall and the related functions.

kingeasternsun opened a new pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973


   Based on    `Speed up TableMigration By collect the DafaFile In parallel`   #3876 ,  
   - add parallelism parameter to `add_files` and the related functions.
   - for compatibility，I Keep the original method  and let original method call the newer method with default parallelism = 1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r793708213



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/AddFilesProcedure.java
##########
@@ -60,7 +60,8 @@
       ProcedureParameter.required("table", DataTypes.StringType),
       ProcedureParameter.required("source_table", DataTypes.StringType),
       ProcedureParameter.optional("partition_filter", STRING_MAP),
-      ProcedureParameter.optional("check_duplicate_files", DataTypes.BooleanType)
+      ProcedureParameter.optional("check_duplicate_files", DataTypes.BooleanType),
+      ProcedureParameter.optional("list_partition_parallelism", DataTypes.IntegerType)

Review comment:
       I would consider "max_concurrent_read_datafiles" before list partition, since list partition doesn't really mean anything. I think reads may be fine but read_datafiles is probably also ok.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r793706931



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/AddFilesProcedure.java
##########
@@ -60,7 +60,8 @@
       ProcedureParameter.required("table", DataTypes.StringType),
       ProcedureParameter.required("source_table", DataTypes.StringType),
       ProcedureParameter.optional("partition_filter", STRING_MAP),
-      ProcedureParameter.optional("check_duplicate_files", DataTypes.BooleanType)
+      ProcedureParameter.optional("check_duplicate_files", DataTypes.BooleanType),
+      ProcedureParameter.optional("list_partition_parallelism", DataTypes.IntegerType)

Review comment:
       https://github.com/apache/iceberg/blob/7fc8f9a6c8fe9dacb2c19fdfa76dd8c592c47c24/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/MigrateTableProcedure.java#L37




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kingeasternsun commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

kingeasternsun commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r793261491



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/AddFilesProcedure.java
##########
@@ -60,7 +60,8 @@
       ProcedureParameter.required("table", DataTypes.StringType),
       ProcedureParameter.required("source_table", DataTypes.StringType),
       ProcedureParameter.optional("partition_filter", STRING_MAP),
-      ProcedureParameter.optional("check_duplicate_files", DataTypes.BooleanType)
+      ProcedureParameter.optional("check_duplicate_files", DataTypes.BooleanType),
+      ProcedureParameter.optional("list_partition_parallelism", DataTypes.IntegerType)

Review comment:
       > on ExpireSnapshots and RemoveOrphans we have the parallelism parameter named
   > 
   > max_concurrent_xxxxxx
   > 
   > So maybe we just keep that convention? "max_concurrent_reads"?
   
   ok , how about  `max_concurrent_list_partition`  or `max_concurrent_read_datafiles`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r793717990



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -399,20 +400,41 @@ public static void importSparkTable(SparkSession spark, TableIdentifier sourceTa
       PartitionSpec spec = SparkSchemaUtil.specForTable(spark, sourceTableIdentWithDB.unquotedString());
 
       if (Objects.equal(spec, PartitionSpec.unpartitioned())) {
-        importUnpartitionedSparkTable(spark, sourceTableIdentWithDB, targetTable, checkDuplicateFiles);
+        importUnpartitionedSparkTable(spark, sourceTableIdentWithDB, targetTable, checkDuplicateFiles, parallelism);
       } else {
         List<SparkPartition> sourceTablePartitions = getPartitions(spark, sourceTableIdent,
             partitionFilter);
         Preconditions.checkArgument(!sourceTablePartitions.isEmpty(),
             "Cannot find any partitions in table %s", sourceTableIdent);
-        importSparkPartitions(spark, sourceTablePartitions, targetTable, spec, stagingDir, checkDuplicateFiles);
+        importSparkPartitions(spark, sourceTablePartitions, targetTable, spec, stagingDir, checkDuplicateFiles,
+                parallelism);
       }
     } catch (AnalysisException e) {
       throw SparkExceptionUtil.toUncheckedException(
           e, "Unable to get partition spec for table: %s", sourceTableIdentWithDB);
     }
   }
 
+  /**
+   * Import files from an existing Spark table to an Iceberg table.
+   *
+   * The import uses the Spark session to get table metadata. It assumes no
+   * operation is going on the original and target table and thus is not
+   * thread-safe.
+   *
+   * @param spark a Spark session
+   * @param sourceTableIdent an identifier of the source Spark table
+   * @param targetTable an Iceberg table where to import the data
+   * @param stagingDir a staging directory to store temporary manifest files
+   * @param partitionFilter only import partitions whose values match those in the map, can be partially defined
+   * @param checkDuplicateFiles if true, throw exception if import results in a duplicate data file
+   */
+  public static void importSparkTable(SparkSession spark, TableIdentifier sourceTableIdent, Table targetTable,

Review comment:
       This one we have to keep since it's a public API




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and MigrateTable, SnapshotTable.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r810400302



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -517,7 +542,9 @@ public static void importSparkPartitions(SparkSession spark, List<SparkPartition
 
     Dataset<DataFile> filesToImport = partitionDS
         .flatMap((FlatMapFunction<SparkPartition, DataFile>) sparkPartition ->
-                listPartition(sparkPartition, spec, serializableConf, metricsConfig, nameMapping).iterator(),
+                TableMigrationUtil.listPartition(sparkPartition.values, sparkPartition.uri,
+                        sparkPartition.format, spec, serializableConf.get(), metricsConfig, nameMapping,

Review comment:
       nit: should be 4 instead of 8




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kingeasternsun commented on pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and MigrateTable, SnapshotTable.

Posted by GitBox <gi...@apache.org>.

kingeasternsun commented on pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#issuecomment-1049416478


   > I think most of my main concerns are wrapped up, just some formatting fixes left. I think as a follow up we should deprecate all the public methods not using "parallelism", not important to do it right int his PR though.
   
   Ok ，Thanks。


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kingeasternsun edited a comment on pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and MigrateTable, SnapshotTable.

Posted by GitBox <gi...@apache.org>.

kingeasternsun edited a comment on pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#issuecomment-1057570924


   > I think most of my main concerns are wrapped up, just some formatting fixes left. I think as a follow up we should deprecate all the public methods not using "parallelism", not important to do it right int his PR though.
   
   
   I thinks the sentence ` int hist PR` should be `in this PR` ?  @RussellSpitzer  .  And did you mean  `deprecate all the public methods not using "parallelism" ` in another PR ?  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kingeasternsun commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

kingeasternsun commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r794158990



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/AddFilesProcedure.java
##########
@@ -127,6 +135,11 @@ private boolean isFileIdentifier(Identifier ident) {
 
   private long importToIceberg(Identifier destIdent, Identifier sourceIdent, Map<String, String> partitionFilter,

Review comment:
       It seems that not been used anymore,  I will remove it




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r793843901



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -380,10 +380,11 @@ public boolean isDefinedAt(Expression attr) {
    * @param stagingDir a staging directory to store temporary manifest files
    * @param partitionFilter only import partitions whose values match those in the map, can be partially defined
    * @param checkDuplicateFiles if true, throw exception if import results in a duplicate data file
+   * @param parallelism the parallelism to list the datafile of source spark table

Review comment:
       Like my below comment I think we should have this a little more clear. "Controls max concurrency of file reads per partition?"
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r793715489



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -488,6 +511,11 @@ private static void importUnpartitionedSparkTable(SparkSession spark, TableIdent
     }
   }
 
+  private static void importUnpartitionedSparkTable(SparkSession spark, TableIdentifier sourceTableIdent,

Review comment:
       Unused?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and MigrateTable, SnapshotTable.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r810401269



##########
File path: api/src/main/java/org/apache/iceberg/actions/MigrateTable.java
##########
@@ -44,6 +44,12 @@
    */
   MigrateTable tableProperty(String name, String value);
 
+  /**
+   * @param numReaders the number of concurrent file read operations to use per partition
+   * @return this for method chaining
+   **/
+  MigrateTable withParallelReads(int numReaders);

Review comment:
       Same as with SnapshotTable, default to unsupported so we can review just 3.2. Then merge the other modules in a followup.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r793713961



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -497,9 +525,11 @@ private static void importUnpartitionedSparkTable(SparkSession spark, TableIdent
    * @param spec a partition spec
    * @param stagingDir a staging directory to store temporary manifest files
    * @param checkDuplicateFiles if true, throw exception if import results in a duplicate data file
+   * @param listPartitionParallelism the parallelism to list the datafile of source spark table

Review comment:
       Perhaps we should label this, "Max number of concurrent files to read per partition while indexing table"?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kingeasternsun commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

kingeasternsun commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r794159400



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/AddFilesProcedure.java
##########
@@ -158,6 +171,11 @@ private static void ensureNameMappingPresent(Table table) {
 
   private void importFileTable(Table table, Path tableLocation, String format, Map<String, String> partitionFilter,

Review comment:
       > I don't think this is used anymore?
   
   Yes,  I will remove it .




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kingeasternsun commented on pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and MigrateTable, SnapshotTable.

Posted by GitBox <gi...@apache.org>.

kingeasternsun commented on pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#issuecomment-1057570924


   > I think most of my main concerns are wrapped up, just some formatting fixes left. I think as a follow up we should deprecate all the public methods not using "parallelism", not important to do it right int his PR though.
   
   
   I thinks the sentence ` int hist PR` should be `in this PR` ?  @RussellSpitzer  .  And did you mean  `deprecate all the public methods not using "parallelism" ` in another PR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and MigrateTable, SnapshotTable.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#issuecomment-1058515570


   > > I think most of my main concerns are wrapped up, just some formatting fixes left. I think as a follow up we should deprecate all the public methods not using "parallelism", not important to do it right int his PR though.
   > 
   > I thinks the sentence ` int hist PR` should be `in this PR` ? @RussellSpitzer . And did you mean `deprecate all the public methods not using "parallelism" ` in another PR ?
   
   There are a bunch of public utility methods now that have a version which takes parallelism and a version which does not. I think with this PR we have finished changing all internal usages to always specify parallelism so we are probably good with removing the non-parallel arg versions in a future release.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r808191404



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -40,6 +40,13 @@ default MigrateTable migrateTable(String tableIdent) {
     throw new UnsupportedOperationException(this.getClass().getName() + " does not implement migrateTable");
   }
 
+  /**
+   * Instantiates an action to migrate an existing table to Iceberg.
+   */
+  default MigrateTable migrateTable(String tableIdent, int parallelism) {

Review comment:
       We would also need the same for `SnapshotTable`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and MigrateTable, SnapshotTable.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r810396577



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/AddFilesProcedure.java
##########
@@ -169,27 +177,27 @@ private void importFileTable(Table table, Path tableLocation, String format, Map
 
       // Build a Global Partition for the source
       SparkPartition partition = new SparkPartition(Collections.emptyMap(), tableLocation.toString(), format);
-      importPartitions(table, ImmutableList.of(partition), checkDuplicateFiles);
+      importPartitions(table, ImmutableList.of(partition), checkDuplicateFiles, parallelism);
     } else {
       Preconditions.checkArgument(!partitions.isEmpty(),
           "Cannot find any matching partitions in table %s", partitions);
-      importPartitions(table, partitions, checkDuplicateFiles);
+      importPartitions(table, partitions, checkDuplicateFiles, parallelism);
     }
   }
 
   private void importCatalogTable(Table table, Identifier sourceIdent, Map<String, String> partitionFilter,
-                                  boolean checkDuplicateFiles) {
+                                  boolean checkDuplicateFiles, int parallelism) {
     String stagingLocation = getMetadataLocation(table);
     TableIdentifier sourceTableIdentifier = Spark3Util.toV1TableIdentifier(sourceIdent);
     SparkTableUtil.importSparkTable(spark(), sourceTableIdentifier, table, stagingLocation, partitionFilter,
-        checkDuplicateFiles);
+        checkDuplicateFiles, parallelism);
   }
 
   private void importPartitions(Table table, List<SparkTableUtil.SparkPartition> partitions,
-                                boolean checkDuplicateFiles) {
+                                boolean checkDuplicateFiles, int parallelism) {
     String stagingLocation = getMetadataLocation(table);
     SparkTableUtil.importSparkPartitions(spark(), partitions, table, table.spec(), stagingLocation,
-        checkDuplicateFiles);
+            checkDuplicateFiles, parallelism);

Review comment:
       nit format




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and MigrateTable, SnapshotTable.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r810399667



##########
File path: spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java
##########
@@ -343,6 +343,50 @@ public void addDataPartitionedHive() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void addDataPartitionedHiveInParallel() {

Review comment:
       I think the tests here are good but I believe the formatting is off, check the tests directly underneath for the proper spacing.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and MigrateTable, SnapshotTable.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r810399887



##########
File path: spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMigrateTableProcedure.java
##########
@@ -100,6 +100,34 @@ public void testMigrateWithOptions() throws IOException {
     sql("DROP TABLE %s", tableName + "_BACKUP_");
   }
 
+  @Test
+  public void testMigrateWithOptionsAndParallelism() throws IOException {

Review comment:
       I'm not sure why this tests the Option and the parallelism? Shouldn't it just test the parallelism?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r793710749



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/AddFilesProcedure.java
##########
@@ -158,6 +171,11 @@ private static void ensureNameMappingPresent(Table table) {
 
   private void importFileTable(Table table, Path tableLocation, String format, Map<String, String> partitionFilter,

Review comment:
       I don't think this is used anymore?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kingeasternsun commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

kingeasternsun commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r793262659



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/AddFilesProcedure.java
##########
@@ -60,7 +60,8 @@
       ProcedureParameter.required("table", DataTypes.StringType),
       ProcedureParameter.required("source_table", DataTypes.StringType),
       ProcedureParameter.optional("partition_filter", STRING_MAP),
-      ProcedureParameter.optional("check_duplicate_files", DataTypes.BooleanType)
+      ProcedureParameter.optional("check_duplicate_files", DataTypes.BooleanType),
+      ProcedureParameter.optional("list_partition_parallelism", DataTypes.IntegerType)

Review comment:
       > Should we be adding this to the Migrate Procedure as well?
   
   Thanks for your advice, can you help where the  ` Migrate Procedure` code is?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kingeasternsun commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

kingeasternsun commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r794150006



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/AddFilesProcedure.java
##########
@@ -60,7 +60,8 @@
       ProcedureParameter.required("table", DataTypes.StringType),
       ProcedureParameter.required("source_table", DataTypes.StringType),
       ProcedureParameter.optional("partition_filter", STRING_MAP),
-      ProcedureParameter.optional("check_duplicate_files", DataTypes.BooleanType)
+      ProcedureParameter.optional("check_duplicate_files", DataTypes.BooleanType),
+      ProcedureParameter.optional("list_partition_parallelism", DataTypes.IntegerType)

Review comment:
       Ok, Thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r793711557



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -517,7 +547,9 @@ public static void importSparkPartitions(SparkSession spark, List<SparkPartition
 
     Dataset<DataFile> filesToImport = partitionDS
         .flatMap((FlatMapFunction<SparkPartition, DataFile>) sparkPartition ->
-                listPartition(sparkPartition, spec, serializableConf, metricsConfig, nameMapping).iterator(),

Review comment:
       nit: whitespace/ formatting




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and MigrateTable, SnapshotTable.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r810401119



##########
File path: api/src/main/java/org/apache/iceberg/actions/SnapshotTable.java
##########
@@ -60,6 +60,12 @@
    */
   SnapshotTable tableProperty(String key, String value);
 
+  /**
+   * @param numReaders the number of concurrent file read operations to use per partition
+   * @return this for method chaining
+   **/
+  SnapshotTable withParallelReads(int numReaders);

Review comment:
       Not a big deal, but I would consider defaulting this to unsupported so that we can review just one set of changes in the 3.2 module. I would recommend that and then doing the other modules as a follow up.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kingeasternsun commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

kingeasternsun commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r794150006



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/AddFilesProcedure.java
##########
@@ -60,7 +60,8 @@
       ProcedureParameter.required("table", DataTypes.StringType),
       ProcedureParameter.required("source_table", DataTypes.StringType),
       ProcedureParameter.optional("partition_filter", STRING_MAP),
-      ProcedureParameter.optional("check_duplicate_files", DataTypes.BooleanType)
+      ProcedureParameter.optional("check_duplicate_files", DataTypes.BooleanType),
+      ProcedureParameter.optional("list_partition_parallelism", DataTypes.IntegerType)

Review comment:
       Ok, Thanks.  I change it to   `max_concurrent_read_datafiles`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r792751801



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/AddFilesProcedure.java
##########
@@ -60,7 +60,8 @@
       ProcedureParameter.required("table", DataTypes.StringType),
       ProcedureParameter.required("source_table", DataTypes.StringType),
       ProcedureParameter.optional("partition_filter", STRING_MAP),
-      ProcedureParameter.optional("check_duplicate_files", DataTypes.BooleanType)
+      ProcedureParameter.optional("check_duplicate_files", DataTypes.BooleanType),
+      ProcedureParameter.optional("list_partition_parallelism", DataTypes.IntegerType)

Review comment:
       on ExpireSnapshots and RemoveOrphans we have the parallelism parameter named
   
   max_concurrent_xxxxxx
   
   So maybe we just keep that convention? "max_concurrent_reads"?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r808189861



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -40,6 +40,13 @@ default MigrateTable migrateTable(String tableIdent) {
     throw new UnsupportedOperationException(this.getClass().getName() + " does not implement migrateTable");
   }
 
+  /**
+   * Instantiates an action to migrate an existing table to Iceberg.
+   */
+  default MigrateTable migrateTable(String tableIdent, int parallelism) {

Review comment:
       Instead of extending the base call here I think it makes sense to extend MigrateTable with another functional method like
   ```java
   /**
   * @numReaders the number of concurrent file read operations to use per partition
   * @return this for method chaining
   **/
   MigrateTable withParallelReads(int numReaders) {
     // Set some internal property here that is passed through to the list utility method
     return this
   }
   
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and MigrateTable, SnapshotTable.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r810400053



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -399,20 +400,41 @@ public static void importSparkTable(SparkSession spark, TableIdentifier sourceTa
       PartitionSpec spec = SparkSchemaUtil.specForTable(spark, sourceTableIdentWithDB.unquotedString());
 
       if (Objects.equal(spec, PartitionSpec.unpartitioned())) {
-        importUnpartitionedSparkTable(spark, sourceTableIdentWithDB, targetTable, checkDuplicateFiles);
+        importUnpartitionedSparkTable(spark, sourceTableIdentWithDB, targetTable, checkDuplicateFiles, parallelism);
       } else {
         List<SparkPartition> sourceTablePartitions = getPartitions(spark, sourceTableIdent,
             partitionFilter);
         Preconditions.checkArgument(!sourceTablePartitions.isEmpty(),
             "Cannot find any partitions in table %s", sourceTableIdent);
-        importSparkPartitions(spark, sourceTablePartitions, targetTable, spec, stagingDir, checkDuplicateFiles);
+        importSparkPartitions(spark, sourceTablePartitions, targetTable, spec, stagingDir, checkDuplicateFiles,
+                parallelism);

Review comment:
       nit: formatting (should be 4 not 8)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kingeasternsun edited a comment on pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and MigrateTable, SnapshotTable.

Posted by GitBox <gi...@apache.org>.

kingeasternsun edited a comment on pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#issuecomment-1049416478


   > I think most of my main concerns are wrapped up, just some formatting fixes left. I think as a follow up we should deprecate all the public methods not using "parallelism", not important to do it right int his PR though.
   
   Ok ，Thanks  @RussellSpitzer 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and MigrateTable, SnapshotTable.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r810400053



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -399,20 +400,41 @@ public static void importSparkTable(SparkSession spark, TableIdentifier sourceTa
       PartitionSpec spec = SparkSchemaUtil.specForTable(spark, sourceTableIdentWithDB.unquotedString());
 
       if (Objects.equal(spec, PartitionSpec.unpartitioned())) {
-        importUnpartitionedSparkTable(spark, sourceTableIdentWithDB, targetTable, checkDuplicateFiles);
+        importUnpartitionedSparkTable(spark, sourceTableIdentWithDB, targetTable, checkDuplicateFiles, parallelism);
       } else {
         List<SparkPartition> sourceTablePartitions = getPartitions(spark, sourceTableIdent,
             partitionFilter);
         Preconditions.checkArgument(!sourceTablePartitions.isEmpty(),
             "Cannot find any partitions in table %s", sourceTableIdent);
-        importSparkPartitions(spark, sourceTablePartitions, targetTable, spec, stagingDir, checkDuplicateFiles);
+        importSparkPartitions(spark, sourceTablePartitions, targetTable, spec, stagingDir, checkDuplicateFiles,
+                parallelism);

Review comment:
       nit: formatting




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r793717202



##########
File path: spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java
##########
@@ -343,6 +343,50 @@ public void addDataPartitionedHive() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void addDataPartitionedHiveInParallel() {
+    createPartitionedHiveTable();
+
+    String createIceberg =
+            "CREATE TABLE %s (id Integer, name String, dept String, subdept String) USING iceberg PARTITIONED BY (id)";
+
+    sql(createIceberg, tableName);
+
+    Object result = scalarSql("CALL %s.system.add_files(" +
+                    "table => '%s', " +
+                    "source_table => '%s', " +
+                    "list_partition_parallelism => 3)",

Review comment:
       would be nice if we had a way to actually test that this was working properly and not just not throwing an error ... Can't think of an easy way at the moment so this is fine




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r793710232



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/AddFilesProcedure.java
##########
@@ -127,6 +135,11 @@ private boolean isFileIdentifier(Identifier ident) {
 
   private long importToIceberg(Identifier destIdent, Identifier sourceIdent, Map<String, String> partitionFilter,

Review comment:
       Is this still used now?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and the related functions.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r792752276



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/AddFilesProcedure.java
##########
@@ -60,7 +60,8 @@
       ProcedureParameter.required("table", DataTypes.StringType),
       ProcedureParameter.required("source_table", DataTypes.StringType),
       ProcedureParameter.optional("partition_filter", STRING_MAP),
-      ProcedureParameter.optional("check_duplicate_files", DataTypes.BooleanType)
+      ProcedureParameter.optional("check_duplicate_files", DataTypes.BooleanType),
+      ProcedureParameter.optional("list_partition_parallelism", DataTypes.IntegerType)

Review comment:
       Should we be adding this to the Migrate Procedure as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and MigrateTable, SnapshotTable.

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#discussion_r819080430



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -564,6 +592,21 @@ public static void importSparkPartitions(SparkSession spark, List<SparkPartition
     }
   }
 
+  /**
+   * Import files from given partitions to an Iceberg table.
+   *
+   * @param spark a Spark session
+   * @param partitions partitions to import
+   * @param targetTable an Iceberg table where to import the data
+   * @param spec a partition spec
+   * @param stagingDir a staging directory to store temporary manifest files
+   * @param checkDuplicateFiles if true, throw exception if import results in a duplicate data file
+   */
+  public static void importSparkPartitions(SparkSession spark, List<SparkPartition> partitions, Table targetTable,

Review comment:
       This is an example of one the functions I think we can deprecate in the next release, but we don't have to add that tag in this PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kingeasternsun commented on pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and MigrateTable, SnapshotTable.

Posted by GitBox <gi...@apache.org>.

kingeasternsun commented on pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#issuecomment-1058746869


   > > > I think most of my main concerns are wrapped up, just some formatting fixes left. I think as a follow up we should deprecate all the public methods not using "parallelism", not important to do it right int his PR though.
   > > 
   > > 
   > > I thinks the sentence ` int hist PR` should be `in this PR` ? @RussellSpitzer . And did you mean `deprecate all the public methods not using "parallelism" ` in another PR ?
   > 
   > There are a bunch of public utility methods now that have a version which takes parallelism and a version which does not. I think with this PR we have finished changing all internal usages to always specify parallelism so we are probably good with removing the non-parallel arg versions in a future release.
   
   Thanks, I got It.  So I just need to  keep the code now as it is  , And waiting for it to be merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kingeasternsun commented on pull request #3973: :art: Add "parallelism" parameter to "add_files" syscall and MigrateTable, SnapshotTable.

Posted by GitBox <gi...@apache.org>.

kingeasternsun commented on pull request #3973:
URL: https://github.com/apache/iceberg/pull/3973#issuecomment-1057574865


   could you please help review this PR  too? @rdblue 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org