
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/10 07:04:17 UTC

[GitHub] [iceberg] sunchao commented on a change in pull request #2276: Core: Add option to combine tasks by partition

sunchao commented on a change in pull request #2276:
URL: https://github.com/apache/iceberg/pull/2276#discussion_r591131213



##########
File path: api/src/main/java/org/apache/iceberg/TableScan.java
##########
@@ -181,6 +181,34 @@ default TableScan select(String... columns) {
    */
   CloseableIterable<CombinedScanTask> planTasks();
 
+  /**
+   * Create a new {@link TableScan} which indicates that, when planning tasks via
+   * {@link #planTasks()}, the scan should preserve the partition boundaries specified by the
+   * provided partition column names. In other words, the scan will not attempt to combine tasks
+   * whose input files have different partition data w.r.t. `columns`.
+   *
+   * @param columns the partition column names whose boundaries are preserved when planning tasks
+   * @return a table scan that preserves partition boundaries when planning tasks
+   * @throws IllegalArgumentException if any of the input columns is not a partition column, or
+   *         if the table is unpartitioned, or `columns` is empty.
+   */
+  TableScan preservePartitions(Collection<String> columns);

Review comment:
       > One question: in what case would we want to group tasks by a subset of the partition columns? In my mind, we usually group tasks by the full set of partition columns.
   
   One such use case is described [here](https://github.com/apache/iceberg/pull/2276#discussion_r588806632). Engines such as Spark can use this to push down only the partition columns that actually appear in join or aggregate operators, or skip the feature entirely when it is not needed. Preserving boundaries on fewer columns leaves fewer groups that must stay separate, so the task combining logic can be more effective.
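
   To illustrate the point, here is a minimal, self-contained sketch of how the number of combined tasks can depend on which partition columns are preserved. It does not use Iceberg classes and is not the implementation in this PR: `FileTask`, `combine`, `sampleFiles`, the `date`/`hour` columns, and the simple greedy packing are all hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartitionAwareCombine {

  // Hypothetical stand-in for a file scan task: partition values plus a size in bytes.
  static final class FileTask {
    final Map<String, String> partition;
    final long sizeBytes;

    FileTask(Map<String, String> partition, long sizeBytes) {
      this.partition = partition;
      this.sizeBytes = sizeBytes;
    }
  }

  // Builds a grouping key from only the preserved partition columns.
  static List<String> key(FileTask task, Collection<String> preserved) {
    List<String> key = new ArrayList<>();
    for (String column : preserved) {
      if (!task.partition.containsKey(column)) {
        throw new IllegalArgumentException("Not a partition column: " + column);
      }
      key.add(task.partition.get(column));
    }
    return key;
  }

  // Groups tasks by key, then greedily packs each group into bins of at most targetBytes.
  // Files with different keys never share a bin, so the preserved boundaries hold.
  static List<List<FileTask>> combine(
      List<FileTask> tasks, Collection<String> preserved, long targetBytes) {
    Map<List<String>, List<FileTask>> groups = new LinkedHashMap<>();
    for (FileTask task : tasks) {
      groups.computeIfAbsent(key(task, preserved), k -> new ArrayList<>()).add(task);
    }

    List<List<FileTask>> combined = new ArrayList<>();
    for (List<FileTask> group : groups.values()) {
      List<FileTask> bin = new ArrayList<>();
      long binBytes = 0;
      for (FileTask task : group) {
        if (!bin.isEmpty() && binBytes + task.sizeBytes > targetBytes) {
          combined.add(bin);
          bin = new ArrayList<>();
          binBytes = 0;
        }
        bin.add(task);
        binBytes += task.sizeBytes;
      }
      if (!bin.isEmpty()) {
        combined.add(bin);
      }
    }
    return combined;
  }

  static FileTask file(String date, String hour, long sizeBytes) {
    Map<String, String> partition = new LinkedHashMap<>();
    partition.put("date", date);
    partition.put("hour", hour);
    return new FileTask(partition, sizeBytes);
  }

  // Four small files spread over two dates and two hours.
  static List<FileTask> sampleFiles() {
    return Arrays.asList(
        file("2021-03-09", "00", 10),
        file("2021-03-09", "01", 10),
        file("2021-03-10", "00", 10),
        file("2021-03-10", "01", 10));
  }

  public static void main(String[] args) {
    List<FileTask> files = sampleFiles();
    // Preserving (date, hour): every file is its own group.
    System.out.println(combine(files, Arrays.asList("date", "hour"), 100).size()); // prints 4
    // Preserving only date: files within a date can be combined.
    System.out.println(combine(files, Arrays.asList("date"), 100).size()); // prints 2
  }
}
```

   In this toy setup, a query that joins or aggregates only on `date` ends up with two combined tasks instead of four, because `hour` boundaries no longer need to be kept.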



