Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/03/31 11:31:20 UTC

[GitHub] [hive] pvary opened a new pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

pvary opened a new pull request #2137:
URL: https://github.com/apache/hive/pull/2137


   ### What changes were proposed in this pull request?
   There are three main parts to the changes here:
   - Pruner changes to provide a filter instead of changing the file lists
   - HiveInputFormat changes to combine the Pruning filter with the original filter
   - Iceberg handler changes:
      - New method in SerDe to return the current partitioning columns
      - `HiveStoragePredicateHandler` implements the `addDynamicSplitPruningEdge` method, which returns true if there is a filter on any of the partitioning columns
   - Added new test for partition pruning
   
   ### Why are the changes needed?
   To achieve comparable performance for Iceberg tables
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Multiple queries
   
   I plan to add more tests later. It would be good to have them as a `qtest`, but that should be done in a different PR as it is a sizable task in itself.




[GitHub] [hive] marton-bod commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r609494843



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -153,6 +168,39 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public boolean addDynamicSplitPruningEdge(org.apache.hadoop.hive.ql.metadata.Table table,
+      ExprNodeDesc syntheticFilterPredicate) {
+    try {
+      Collection<String> partitionColumns = ((HiveIcebergSerDe) table.getDeserializer()).partitionColumns();
+      if (partitionColumns.size() > 0) {
+        // Collect the column names from the predicate
+        Collection<String> filterColumns = Lists.newArrayList();
+        columns(syntheticFilterPredicate, filterColumns);
+
+        if (filterColumns.removeAll(partitionColumns)) {

Review comment:
       IIUC the intention here is to enter this block if the filter columns contain any partition columns. Instead of using `removeAll`, would it be more descriptive to do something like this?
   ```
   boolean filterContainsPartitionCols = filterColumns.stream().anyMatch(...);
   if (filterContainsPartitionCols) {
   
   }
   ```
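   A minimal, self-contained sketch of what the suggested check could look like, assuming the intent is "does the filter reference any partition column?"; the class and method names here are illustrative, not part of the PR:
   ```java
   import java.util.Collection;

   final class PruningChecks {
     // Sketch only: true if any column referenced by the filter is also a partition column.
     static boolean filterContainsPartitionCols(Collection<String> filterColumns,
                                                Collection<String> partitionColumns) {
       return filterColumns.stream().anyMatch(partitionColumns::contains);
     }
   }
   ```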






[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r617642837



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -275,4 +339,74 @@ static void overlayTableProperties(Configuration configuration, TableDesc tableD
     // this is an exception to the interface documentation, but it's a safe operation to add this property
     props.put(InputFormatConfig.TABLE_SCHEMA, schemaJson);
   }
+
+  /**
+   * Recursively collects the column names from the predicate.
+   * @param node The node we are traversing
+   * @param columns The already collected column names
+   */
+  private void columns(ExprNodeDesc node, Set<String> columns) {
+    if (node instanceof ExprNodeColumnDesc) {
+      columns.add(((ExprNodeColumnDesc) node).getColumn());
+    } else {
+      List<ExprNodeDesc> children = node.getChildren();
+      if (children != null && !children.isEmpty()) {
+        children.forEach(child -> columns(child, columns));
+      }
+    }
+  }
+
+  /**
+   * Recursively replaces the ExprNodeDynamicListDesc nodes by a dummy ExprNodeConstantDesc so we can test if we can
+   * convert the predicate to an Iceberg predicate when pruning the partitions later. Please make sure that it is ok
+   * to change the input node (clone if needed)
+   * @param node The node we are traversing
+   */
+  private void replaceWithDummyValues(ExprNodeDesc node) {
+    List<ExprNodeDesc> children = node.getChildren();
+    if (children != null && !children.isEmpty()) {
+      ListIterator<ExprNodeDesc> iterator = node.getChildren().listIterator();

Review comment:
       We do an in-place replacement for the `Dynamic` nodes with `iterator.set(new ExprNodeConstantDesc(child.getTypeInfo(), dummy));`. I do not think I can do that with a for loop.
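   For readers following the thread, a hedged sketch of the in-place replacement pattern being discussed; the `DUMMY` placeholder and the exact method shape are illustrative, and the types (`ExprNodeDesc`, `ExprNodeDynamicListDesc`, `ExprNodeConstantDesc`, `java.util.List`, `java.util.ListIterator`) are assumed from the diff above:
   ```java
   // Sketch: replace every ExprNodeDynamicListDesc child with a constant of the same type.
   // ListIterator.set rewrites the element at the cursor position in place, which an
   // enhanced for-each loop does not support.
   private void replaceWithDummyValues(ExprNodeDesc node) {
     List<ExprNodeDesc> children = node.getChildren();
     if (children == null) {
       return;
     }
     ListIterator<ExprNodeDesc> iterator = children.listIterator();
     while (iterator.hasNext()) {
       ExprNodeDesc child = iterator.next();
       if (child instanceof ExprNodeDynamicListDesc) {
         iterator.set(new ExprNodeConstantDesc(child.getTypeInfo(), DUMMY));
       } else {
         replaceWithDummyValues(child);
       }
     }
   }
   ```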






[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r617643436



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -50,9 +61,14 @@
 import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
 import org.apache.iceberg.relocated.com.google.common.base.Splitter;
 import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
 import org.apache.iceberg.util.SerializationUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 public class HiveIcebergStorageHandler implements HiveStoragePredicateHandler, HiveStorageHandler {
+  private static final Logger LOG = LoggerFactory.getLogger(HiveIcebergStorageHandler.class);
+
   private static final Splitter TABLE_NAME_SPLITTER = Splitter.on("..");

Review comment:
       That's a good point. Will do it in a separate jira as it should be the same on Iceberg too.






[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r611535475



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -250,4 +305,74 @@ static void overlayTableProperties(Configuration configuration, TableDesc tableD
     // this is an exception to the interface documentation, but it's a safe operation to add this property
     props.put(InputFormatConfig.TABLE_SCHEMA, schemaJson);
   }
+
+  /**
+   * Recursively collects the column names from the predicate.
+   * @param node The node we are traversing
+   * @param columns The already collected column names
+   */
+  private void columns(ExprNodeDesc node, Collection<String> columns) {

Review comment:
       Based on my tests, `addDynamicSplitPruningEdge` will be called for every column and the `ExprNodeDesc` only refers to a single column, so in theory we can handle multiple columns, but we do not have to optimize for the multi-column case. Using a `Set` would work as well, but it is more strict and I wanted to keep things as general as possible.






[GitHub] [hive] kgyrtkirk commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r611516875



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java
##########
@@ -171,7 +171,7 @@ private void prepare(InputInitializerContext initializerContext) throws IOExcept
       // perform dynamic partition pruning
       if (pruner != null) {
         pruner.initialize(getContext(), work, jobConf);
-        pruner.prune();
+        pruner.prune(jobConf);

Review comment:
       note: `jobConf` was already passed to the `pruner` in the `initialize` call

##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -250,4 +305,74 @@ static void overlayTableProperties(Configuration configuration, TableDesc tableD
     // this is an exception to the interface documentation, but it's a safe operation to add this property
     props.put(InputFormatConfig.TABLE_SCHEMA, schemaJson);
   }
+
+  /**
+   * Recursively collects the column names from the predicate.
+   * @param node The node we are traversing
+   * @param columns The already collected column names
+   */
+  private void columns(ExprNodeDesc node, Collection<String> columns) {

Review comment:
       why is this a `Collection` - do we need to collect the same column multiple times?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DynamicPartitionPruner.java
##########
@@ -514,4 +569,29 @@ private boolean checkForSourceCompletion(String name) {
     }
     return false;
   }
+
+  /**
+   * Recursively replaces the ExprNodeDynamicListDesc to the list of the actual values. As a result of this call the
+   * original expression is modified so it can be used for pushing down to the TableScan for filtering the data at the
+   * source.
+   * <p>
+   * Please make sure to clone the predicate if needed since the original node will be modified.
+   * @param node The node we are traversing
+   * @param dynArgs The constant values we are substituting
+   */
+  private void replaceDynamicLists(ExprNodeDesc node, Collection<ExprNodeConstantDesc> dynArgs) {
+    List<ExprNodeDesc> children = node.getChildren();
+    if (children != null && !children.isEmpty()) {
+      ListIterator<ExprNodeDesc> iterator = node.getChildren().listIterator();
+      while (iterator.hasNext()) {
+        ExprNodeDesc child = iterator.next();
+        if (child instanceof ExprNodeDynamicListDesc) {
+          iterator.remove();
+          dynArgs.forEach(iterator::add);

Review comment:
       I strongly suspect that this method is problematic; what will happen if you have filters for 2 different columns or 2 different values?
   ```
   a IN L1 and b IN L2
   ```
   

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java
##########
@@ -890,6 +894,24 @@ public static void pushFilters(JobConf jobConf, TableScanOperator tableScan,
     Utilities.setColumnTypeList(jobConf, tableScan);
     // push down filters
     ExprNodeGenericFuncDesc filterExpr = scanDesc.getFilterExpr();
+    String pruningFilter = jobConf.get(TableScanDesc.PARTITION_PRUNING_FILTER);
+    // If we have a pruning filter then combine it with the original
+    if (pruningFilter != null) {
+      ExprNodeGenericFuncDesc pruningExpr = SerializationUtilities.deserializeExpression(pruningFilter);
+      if (filterExpr != null) {
+        // Combine the 2 filters with AND
+        filterExpr = new ExprNodeGenericFuncDesc(TypeInfoFactory.booleanTypeInfo, new GenericUDFOPAnd(), "and",

Review comment:
       note: you could probably use `ExprNodeDescUtils#conjunction` (or move this method there...)
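   The constructor call in the diff above is cut off; as a hedged sketch of the AND combination (the child-list argument is assumed from the visible prefix, and `ExprNodeDescUtils#conjunction` would be the helper-based alternative the reviewer mentions):
   ```java
   // Sketch only: combine the original filter with the deserialized pruning filter using AND.
   List<ExprNodeDesc> andChildren = Arrays.asList(filterExpr, pruningExpr);
   filterExpr = new ExprNodeGenericFuncDesc(TypeInfoFactory.booleanTypeInfo,
       new GenericUDFOPAnd(), "and", andChildren);
   ```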






[GitHub] [hive] lcspinter commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r617511852



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -126,7 +127,7 @@
     // skip HiveCatalog tests as they are added before
     for (TestTables.TestTableType testTableType : TestTables.ALL_TABLE_TYPES) {
       if (!TestTables.TestTableType.HIVE_CATALOG.equals(testTableType)) {
-        testParams.add(new Object[]{FileFormat.PARQUET, "mr", testTableType});
+        testParams.add(new Object[]{FileFormat.PARQUET, "tez", testTableType});

Review comment:
       The comment above the loop is not in sync with the logic. 

##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -126,7 +127,7 @@
     // skip HiveCatalog tests as they are added before
     for (TestTables.TestTableType testTableType : TestTables.ALL_TABLE_TYPES) {
       if (!TestTables.TestTableType.HIVE_CATALOG.equals(testTableType)) {
-        testParams.add(new Object[]{FileFormat.PARQUET, "mr", testTableType});
+        testParams.add(new Object[]{FileFormat.PARQUET, "tez", testTableType});

Review comment:
       Anyway, this change is already present on master, so you might revert it. 

##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -194,6 +210,54 @@ public boolean canProvideBasicStatistics() {
     return stats;
   }
 
+  public boolean addDynamicSplitPruningEdge(org.apache.hadoop.hive.ql.metadata.Table table,
+      ExprNodeDesc syntheticFilterPredicate) {
+    try {
+      Collection<String> partitionColumns = ((HiveIcebergSerDe) table.getDeserializer()).partitionColumns();
+      if (partitionColumns.size() > 0) {
+        // Collect the column names from the predicate
+        Set<String> filterColumns = Sets.newHashSet();
+        columns(syntheticFilterPredicate, filterColumns);
+
+        // While Iceberg could handle multiple columns, the current pruning is only able to handle filters for a
+        // single column. We keep the logic below to handle multiple columns, so if pruning becomes available on the
+        // executor side we can easily adapt to it as well.
+        if (filterColumns.size() > 1) {

Review comment:
       If we know that we don't support multi-column filtering, wouldn't it be possible to get a rough estimate of the filter size (is it >= 1) before collecting every column name?

##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -50,9 +61,14 @@
 import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
 import org.apache.iceberg.relocated.com.google.common.base.Splitter;
 import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
 import org.apache.iceberg.util.SerializationUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 public class HiveIcebergStorageHandler implements HiveStoragePredicateHandler, HiveStorageHandler {
+  private static final Logger LOG = LoggerFactory.getLogger(HiveIcebergStorageHandler.class);
+
   private static final Splitter TABLE_NAME_SPLITTER = Splitter.on("..");

Review comment:
       nit: Splitter.on(TABLE_NAME_SEPARATOR)
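   A tiny sketch of the nit; the `TABLE_NAME_SEPARATOR` constant does not exist yet and is the reviewer's proposed addition:
   ```java
   // Sketch: name the separator once and reuse it for the Splitter.
   private static final String TABLE_NAME_SEPARATOR = "..";
   private static final Splitter TABLE_NAME_SPLITTER = Splitter.on(TABLE_NAME_SEPARATOR);
   ```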

##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -275,4 +339,74 @@ static void overlayTableProperties(Configuration configuration, TableDesc tableD
     // this is an exception to the interface documentation, but it's a safe operation to add this property
     props.put(InputFormatConfig.TABLE_SCHEMA, schemaJson);
   }
+
+  /**
+   * Recursively collects the column names from the predicate.
+   * @param node The node we are traversing
+   * @param columns The already collected column names
+   */
+  private void columns(ExprNodeDesc node, Set<String> columns) {
+    if (node instanceof ExprNodeColumnDesc) {
+      columns.add(((ExprNodeColumnDesc) node).getColumn());
+    } else {
+      List<ExprNodeDesc> children = node.getChildren();
+      if (children != null && !children.isEmpty()) {
+        children.forEach(child -> columns(child, columns));
+      }
+    }
+  }
+
+  /**
+   * Recursively replaces the ExprNodeDynamicListDesc nodes by a dummy ExprNodeConstantDesc so we can test if we can
+   * convert the predicate to an Iceberg predicate when pruning the partitions later. Please make sure that it is ok
+   * to change the input node (clone if needed)
+   * @param node The node we are traversing
+   */
+  private void replaceWithDummyValues(ExprNodeDesc node) {
+    List<ExprNodeDesc> children = node.getChildren();
+    if (children != null && !children.isEmpty()) {
+      ListIterator<ExprNodeDesc> iterator = node.getChildren().listIterator();

Review comment:
       nit: No need for the iterator, since we already have a reference to the children list. A simple for loop could do the trick

##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -275,4 +339,74 @@ static void overlayTableProperties(Configuration configuration, TableDesc tableD
     // this is an exception to the interface documentation, but it's a safe operation to add this property
     props.put(InputFormatConfig.TABLE_SCHEMA, schemaJson);
   }
+
+  /**
+   * Recursively collects the column names from the predicate.
+   * @param node The node we are traversing
+   * @param columns The already collected column names
+   */
+  private void columns(ExprNodeDesc node, Set<String> columns) {
+    if (node instanceof ExprNodeColumnDesc) {
+      columns.add(((ExprNodeColumnDesc) node).getColumn());
+    } else {
+      List<ExprNodeDesc> children = node.getChildren();
+      if (children != null && !children.isEmpty()) {

Review comment:
       nit: unnecessary isEmpty check






[GitHub] [hive] marton-bod commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r609497741



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -153,6 +168,39 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public boolean addDynamicSplitPruningEdge(org.apache.hadoop.hive.ql.metadata.Table table,
+      ExprNodeDesc syntheticFilterPredicate) {
+    try {
+      Collection<String> partitionColumns = ((HiveIcebergSerDe) table.getDeserializer()).partitionColumns();
+      if (partitionColumns.size() > 0) {
+        // Collect the column names from the predicate
+        Collection<String> filterColumns = Lists.newArrayList();
+        columns(syntheticFilterPredicate, filterColumns);
+
+        if (filterColumns.removeAll(partitionColumns)) {
+          // Replace dynamic values to dummy constants
+          ExprNodeDesc clone = syntheticFilterPredicate.clone();
+          replace(clone);
+
+          // Check if we can convert the expression to a valid Iceberg filter
+          SearchArgument sarg = ConvertAstToSearchArg.create(conf, (ExprNodeGenericFuncDesc) clone);
+          HiveIcebergFilterFactory.generateFilterExpression(sarg);
+          LOG.debug("Found Iceberg partition column to prune with predicate {}", filterColumns,
+              syntheticFilterPredicate);
+          return true;
+        }
+      }
+    } catch (UnsupportedOperationException uoe) {
+      // If we can not convert the filter, we do not prune
+      LOG.debug("Unsupported predicate {}", syntheticFilterPredicate);

Review comment:
       Would it be useful to log the `uoe` exception too, in case the pruning failure is unexpected and we need to debug?
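   A one-line sketch of the suggestion; with SLF4J, a trailing `Throwable` argument is logged with its stack trace alongside the formatted message:
   ```java
   // Sketch: pass the exception as the last argument so the stack trace is logged too.
   LOG.debug("Unsupported predicate {}", syntheticFilterPredicate, uoe);
   ```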






[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r616490821



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -250,4 +305,74 @@ static void overlayTableProperties(Configuration configuration, TableDesc tableD
     // this is an exception to the interface documentation, but it's a safe operation to add this property
     props.put(InputFormatConfig.TABLE_SCHEMA, schemaJson);
   }
+
+  /**
+   * Recursively collects the column names from the predicate.
+   * @param node The node we are traversing
+   * @param columns The already collected column names
+   */
+  private void columns(ExprNodeDesc node, Collection<String> columns) {

Review comment:
       In the end I moved to `Set`, since it was needed to make sure that only a single column is present in the pruning expression






[GitHub] [hive] marton-bod commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r609503889



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java
##########
@@ -77,15 +85,26 @@ public void initialize(@Nullable Configuration configuration, Properties serDePr
 
     if (serDeProperties.get(InputFormatConfig.TABLE_SCHEMA) != null) {
       this.tableSchema = SchemaParser.fromJson((String) serDeProperties.get(InputFormatConfig.TABLE_SCHEMA));
+      if (serDeProperties.get(InputFormatConfig.PARTITION_SPEC) != null) {
+        PartitionSpec spec =
+            PartitionSpecParser.fromJson(tableSchema, (String) serDeProperties.get(InputFormatConfig.PARTITION_SPEC));
+        this.partitionColumns = spec.fields().stream().map(PartitionField::name).collect(Collectors.toList());
+      } else {
+        this.partitionColumns = ImmutableList.of();
+      }
     } else {
       try {
         // always prefer the original table schema if there is one
-        this.tableSchema = Catalogs.loadTable(configuration, serDeProperties).schema();
+        Table table = Catalogs.loadTable(configuration, serDeProperties);
+        this.tableSchema = table.schema();
+        this.partitionColumns = table.spec().fields().stream().map(PartitionField::name).collect(Collectors.toList());
         LOG.info("Using schema from existing table {}", SchemaParser.toJson(tableSchema));
       } catch (Exception e) {
         boolean autoConversion = configuration.getBoolean(InputFormatConfig.SCHEMA_AUTO_CONVERSION, false);
         // If we can not load the table try the provided hive schema
         this.tableSchema = hiveSchemaOrThrow(serDeProperties, e, autoConversion);
+        // This is only for table creation, it is ok to have an empty partition column list

Review comment:
       This comment reminds me: can we make it clear earlier, at the beginning of the catch block, that we expect to get there only in table creation scenarios? I think that'd make it clearer why we're not rethrowing the exception and are processing things in an alternative way






[GitHub] [hive] pvary merged pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary merged pull request #2137:
URL: https://github.com/apache/hive/pull/2137


   




[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r609744304



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java
##########
@@ -77,15 +85,26 @@ public void initialize(@Nullable Configuration configuration, Properties serDePr
 
     if (serDeProperties.get(InputFormatConfig.TABLE_SCHEMA) != null) {
       this.tableSchema = SchemaParser.fromJson((String) serDeProperties.get(InputFormatConfig.TABLE_SCHEMA));
+      if (serDeProperties.get(InputFormatConfig.PARTITION_SPEC) != null) {
+        PartitionSpec spec =
+            PartitionSpecParser.fromJson(tableSchema, (String) serDeProperties.get(InputFormatConfig.PARTITION_SPEC));
+        this.partitionColumns = spec.fields().stream().map(PartitionField::name).collect(Collectors.toList());
+      } else {
+        this.partitionColumns = ImmutableList.of();
+      }
     } else {
       try {
         // always prefer the original table schema if there is one
-        this.tableSchema = Catalogs.loadTable(configuration, serDeProperties).schema();
+        Table table = Catalogs.loadTable(configuration, serDeProperties);
+        this.tableSchema = table.schema();
+        this.partitionColumns = table.spec().fields().stream().map(PartitionField::name).collect(Collectors.toList());
         LOG.info("Using schema from existing table {}", SchemaParser.toJson(tableSchema));
       } catch (Exception e) {
         boolean autoConversion = configuration.getBoolean(InputFormatConfig.SCHEMA_AUTO_CONVERSION, false);
         // If we can not load the table try the provided hive schema
         this.tableSchema = hiveSchemaOrThrow(serDeProperties, e, autoConversion);
+        // This is only for table creation, it is ok to have an empty partition column list

Review comment:
       Done






[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r617644403



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -275,4 +339,74 @@ static void overlayTableProperties(Configuration configuration, TableDesc tableD
     // this is an exception to the interface documentation, but it's a safe operation to add this property
     props.put(InputFormatConfig.TABLE_SCHEMA, schemaJson);
   }
+
+  /**
+   * Recursively collects the column names from the predicate.
+   * @param node The node we are traversing
+   * @param columns The already collected column names
+   */
+  private void columns(ExprNodeDesc node, Set<String> columns) {
+    if (node instanceof ExprNodeColumnDesc) {
+      columns.add(((ExprNodeColumnDesc) node).getColumn());
+    } else {
+      List<ExprNodeDesc> children = node.getChildren();
+      if (children != null && !children.isEmpty()) {

Review comment:
       Good point!
   Thanks for pointing it out.






[GitHub] [hive] lcspinter commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r618093086



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -194,6 +210,54 @@ public boolean canProvideBasicStatistics() {
     return stats;
   }
 
+  public boolean addDynamicSplitPruningEdge(org.apache.hadoop.hive.ql.metadata.Table table,
+      ExprNodeDesc syntheticFilterPredicate) {
+    try {
+      Collection<String> partitionColumns = ((HiveIcebergSerDe) table.getDeserializer()).partitionColumns();
+      if (partitionColumns.size() > 0) {
+        // Collect the column names from the predicate
+        Set<String> filterColumns = Sets.newHashSet();
+        columns(syntheticFilterPredicate, filterColumns);
+
+        // While Iceberg could handle multiple columns, the current pruning is only able to handle filters for a
+        // single column. We keep the logic below to handle multiple columns, so if pruning becomes available on the
+        // executor side we can easily adapt to it as well.
+        if (filterColumns.size() > 1) {

Review comment:
       We collect every column name in the filterColumns set through the columns() method. That method traverses every node recursively, so it might be time-consuming. After that, the size of the set is validated, and if it's greater than 1, we return false.
   Can we introduce some logic to fail fast, without needing to traverse every node? I'm just thinking aloud; I don't know whether it is feasible or not.
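   One possible shape of such a fail-fast check, purely as a hedged sketch (the method name and structure are hypothetical, not part of the PR):
   ```java
   // Sketch: abort the traversal as soon as a second distinct column is seen,
   // instead of collecting every column name first.
   private boolean referencesAtMostOneColumn(ExprNodeDesc node, Set<String> seen) {
     if (node instanceof ExprNodeColumnDesc) {
       seen.add(((ExprNodeColumnDesc) node).getColumn());
       return seen.size() <= 1;
     }
     List<ExprNodeDesc> children = node.getChildren();
     if (children != null) {
       for (ExprNodeDesc child : children) {
         if (!referencesAtMostOneColumn(child, seen)) {
           return false; // short-circuit: more than one column referenced
         }
       }
     }
     return true;
   }
   ```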






[GitHub] [hive] marton-bod commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r609496322



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -153,6 +168,39 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public boolean addDynamicSplitPruningEdge(org.apache.hadoop.hive.ql.metadata.Table table,
+      ExprNodeDesc syntheticFilterPredicate) {
+    try {
+      Collection<String> partitionColumns = ((HiveIcebergSerDe) table.getDeserializer()).partitionColumns();
+      if (partitionColumns.size() > 0) {
+        // Collect the column names from the predicate
+        Collection<String> filterColumns = Lists.newArrayList();
+        columns(syntheticFilterPredicate, filterColumns);
+
+        if (filterColumns.removeAll(partitionColumns)) {
+          // Replace dynamic values to dummy constants
+          ExprNodeDesc clone = syntheticFilterPredicate.clone();
+          replace(clone);

Review comment:
       Can we make this method a bit more descriptive? I like the method's javadoc, just wondering whether we can streamline the naming as well, e.g. `replaceWithDummyValues(clone)` instead of using the comment






[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r609838245



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -250,4 +298,74 @@ static void overlayTableProperties(Configuration configuration, TableDesc tableD
     // this is an exception to the interface documentation, but it's a safe operation to add this property
     props.put(InputFormatConfig.TABLE_SCHEMA, schemaJson);
   }
+
+  /**
+   * Recursively collects the column names from the predicate.
+   * @param node The node we are traversing
+   * @param columns The already collected column names
+   */
+  private void columns(ExprNodeDesc node, Collection<String> columns) {
+    if (node instanceof ExprNodeColumnDesc) {
+      columns.add(((ExprNodeColumnDesc) node).getColumn());
+    } else {
+      List<ExprNodeDesc> children = node.getChildren();
+      if (children != null && !children.isEmpty()) {
+        children.forEach(child -> columns(child, columns));
+      }
+    }
+  }
+
+  /**
+   * Recursively replaces the ExprNodeDynamicListDesc nodes by a dummy ExprNodeConstantDesc so we can test if we can
+   * convert the predicate to an Iceberg predicate when pruning the partitions later. Please make sure that it is ok

Review comment:
       Done






[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r617643973



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -194,6 +210,54 @@ public boolean canProvideBasicStatistics() {
     return stats;
   }
 
+  public boolean addDynamicSplitPruningEdge(org.apache.hadoop.hive.ql.metadata.Table table,
+      ExprNodeDesc syntheticFilterPredicate) {
+    try {
+      Collection<String> partitionColumns = ((HiveIcebergSerDe) table.getDeserializer()).partitionColumns();
+      if (partitionColumns.size() > 0) {
+        // Collect the column names from the predicate
+        Set<String> filterColumns = Sets.newHashSet();
+        columns(syntheticFilterPredicate, filterColumns);
+
+        // While Iceberg could handle multiple columns, the current pruning is only able to handle filters for a
+        // single column. We keep the logic below to handle multiple columns, so if pruning becomes available on the
+        // executor side we can easily adapt to it as well.
+        if (filterColumns.size() > 1) {

Review comment:
       I do not understand this. Could you please elaborate?
   
   Thanks,
   Peter






[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r609594376



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java
##########
@@ -77,15 +85,26 @@ public void initialize(@Nullable Configuration configuration, Properties serDePr
 
     if (serDeProperties.get(InputFormatConfig.TABLE_SCHEMA) != null) {
       this.tableSchema = SchemaParser.fromJson((String) serDeProperties.get(InputFormatConfig.TABLE_SCHEMA));
+      if (serDeProperties.get(InputFormatConfig.PARTITION_SPEC) != null) {
+        PartitionSpec spec =
+            PartitionSpecParser.fromJson(tableSchema, (String) serDeProperties.get(InputFormatConfig.PARTITION_SPEC));

Review comment:
       Done






[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r611564631



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java
##########
@@ -890,6 +894,24 @@ public static void pushFilters(JobConf jobConf, TableScanOperator tableScan,
     Utilities.setColumnTypeList(jobConf, tableScan);
     // push down filters
     ExprNodeGenericFuncDesc filterExpr = scanDesc.getFilterExpr();
+    String pruningFilter = jobConf.get(TableScanDesc.PARTITION_PRUNING_FILTER);
+    // If we have a pruning filter then combine it with the original
+    if (pruningFilter != null) {
+      ExprNodeGenericFuncDesc pruningExpr = SerializationUtilities.deserializeExpression(pruningFilter);
+      if (filterExpr != null) {
+        // Combine the 2 filters with AND
+        filterExpr = new ExprNodeGenericFuncDesc(TypeInfoFactory.booleanTypeInfo, new GenericUDFOPAnd(), "and",

Review comment:
       Thanks for pointing this out!
   Used `and`






[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r609590974



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -153,6 +168,39 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public boolean addDynamicSplitPruningEdge(org.apache.hadoop.hive.ql.metadata.Table table,
+      ExprNodeDesc syntheticFilterPredicate) {
+    try {
+      Collection<String> partitionColumns = ((HiveIcebergSerDe) table.getDeserializer()).partitionColumns();
+      if (partitionColumns.size() > 0) {
+        // Collect the column names from the predicate
+        Collection<String> filterColumns = Lists.newArrayList();
+        columns(syntheticFilterPredicate, filterColumns);
+
+        if (filterColumns.removeAll(partitionColumns)) {
+          // Replace dynamic values to dummy constants
+          ExprNodeDesc clone = syntheticFilterPredicate.clone();
+          replace(clone);

Review comment:
       Done






[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r609590032



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -153,6 +168,39 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public boolean addDynamicSplitPruningEdge(org.apache.hadoop.hive.ql.metadata.Table table,
+      ExprNodeDesc syntheticFilterPredicate) {
+    try {
+      Collection<String> partitionColumns = ((HiveIcebergSerDe) table.getDeserializer()).partitionColumns();
+      if (partitionColumns.size() > 0) {
+        // Collect the column names from the predicate
+        Collection<String> filterColumns = Lists.newArrayList();
+        columns(syntheticFilterPredicate, filterColumns);
+
+        if (filterColumns.removeAll(partitionColumns)) {

Review comment:
       Yeah, much better.






[GitHub] [hive] marton-bod commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r609501581



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java
##########
@@ -77,15 +85,26 @@ public void initialize(@Nullable Configuration configuration, Properties serDePr
 
     if (serDeProperties.get(InputFormatConfig.TABLE_SCHEMA) != null) {
       this.tableSchema = SchemaParser.fromJson((String) serDeProperties.get(InputFormatConfig.TABLE_SCHEMA));
+      if (serDeProperties.get(InputFormatConfig.PARTITION_SPEC) != null) {
+        PartitionSpec spec =
+            PartitionSpecParser.fromJson(tableSchema, (String) serDeProperties.get(InputFormatConfig.PARTITION_SPEC));

Review comment:
       nit: can use `serDeProperties.getProperty(InputFormatConfig.PARTITION_SPEC)` instead of casting
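   A sketch of the suggested variant; `Properties#getProperty` returns a `String`, so the cast disappears (equivalent here, since the value is stored as a `String`):
   ```java
   // Sketch only: same lookup as in the diff above, without the (String) cast.
   PartitionSpec spec = PartitionSpecParser.fromJson(tableSchema,
       serDeProperties.getProperty(InputFormatConfig.PARTITION_SPEC));
   ```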






[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r609585235



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -153,6 +168,39 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public boolean addDynamicSplitPruningEdge(org.apache.hadoop.hive.ql.metadata.Table table,
+      ExprNodeDesc syntheticFilterPredicate) {
+    try {
+      Collection<String> partitionColumns = ((HiveIcebergSerDe) table.getDeserializer()).partitionColumns();
+      if (partitionColumns.size() > 0) {
+        // Collect the column names from the predicate
+        Collection<String> filterColumns = Lists.newArrayList();
+        columns(syntheticFilterPredicate, filterColumns);
+
+        if (filterColumns.removeAll(partitionColumns)) {
+          // Replace dynamic values to dummy constants
+          ExprNodeDesc clone = syntheticFilterPredicate.clone();
+          replace(clone);
+
+          // Check if we can convert the expression to a valid Iceberg filter
+          SearchArgument sarg = ConvertAstToSearchArg.create(conf, (ExprNodeGenericFuncDesc) clone);
+          HiveIcebergFilterFactory.generateFilterExpression(sarg);
+          LOG.debug("Found Iceberg partition column to prune with predicate {}", filterColumns,

Review comment:
       Good catch. The `syntheticFilterPredicate` contains the columns as well. Reworded the log, but forgot to remove the parameter






[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r616491881



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DynamicPartitionPruner.java
##########
@@ -514,4 +569,29 @@ private boolean checkForSourceCompletion(String name) {
     }
     return false;
   }
+
+  /**
+   * Recursively replaces the ExprNodeDynamicListDesc to the list of the actual values. As a result of this call the
+   * original expression is modified so it can be used for pushing down to the TableScan for filtering the data at the
+   * source.
+   * <p>
+   * Please make sure to clone the predicate if needed since the original node will be modified.
+   * @param node The node we are traversing
+   * @param dynArgs The constant values we are substituting
+   */
+  private void replaceDynamicLists(ExprNodeDesc node, Collection<ExprNodeConstantDesc> dynArgs) {
+    List<ExprNodeDesc> children = node.getChildren();
+    if (children != null && !children.isEmpty()) {
+      ListIterator<ExprNodeDesc> iterator = node.getChildren().listIterator();
+      while (iterator.hasNext()) {
+        ExprNodeDesc child = iterator.next();
+        if (child instanceof ExprNodeDynamicListDesc) {
+          iterator.remove();
+          dynArgs.forEach(iterator::add);

Review comment:
       Added a check in `addDynamicSplitPruningEdge` which returns `false` if there are multiple columns in the filter expression






[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r609593782



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -153,6 +168,39 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public boolean addDynamicSplitPruningEdge(org.apache.hadoop.hive.ql.metadata.Table table,
+      ExprNodeDesc syntheticFilterPredicate) {
+    try {
+      Collection<String> partitionColumns = ((HiveIcebergSerDe) table.getDeserializer()).partitionColumns();
+      if (partitionColumns.size() > 0) {
+        // Collect the column names from the predicate
+        Collection<String> filterColumns = Lists.newArrayList();
+        columns(syntheticFilterPredicate, filterColumns);
+
+        if (filterColumns.removeAll(partitionColumns)) {
+          // Replace dynamic values to dummy constants
+          ExprNodeDesc clone = syntheticFilterPredicate.clone();
+          replace(clone);
+
+          // Check if we can convert the expression to a valid Iceberg filter
+          SearchArgument sarg = ConvertAstToSearchArg.create(conf, (ExprNodeGenericFuncDesc) clone);
+          HiveIcebergFilterFactory.generateFilterExpression(sarg);
+          LOG.debug("Found Iceberg partition column to prune with predicate {}", filterColumns,
+              syntheticFilterPredicate);
+          return true;
+        }
+      }
+    } catch (UnsupportedOperationException uoe) {
+      // If we can not convert the filter, we do not prune
+      LOG.debug("Unsupported predicate {}", syntheticFilterPredicate);

Review comment:
       Done






[GitHub] [hive] marton-bod commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r609496322



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -153,6 +168,39 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public boolean addDynamicSplitPruningEdge(org.apache.hadoop.hive.ql.metadata.Table table,
+      ExprNodeDesc syntheticFilterPredicate) {
+    try {
+      Collection<String> partitionColumns = ((HiveIcebergSerDe) table.getDeserializer()).partitionColumns();
+      if (partitionColumns.size() > 0) {
+        // Collect the column names from the predicate
+        Collection<String> filterColumns = Lists.newArrayList();
+        columns(syntheticFilterPredicate, filterColumns);
+
+        if (filterColumns.removeAll(partitionColumns)) {
+          // Replace dynamic values to dummy constants
+          ExprNodeDesc clone = syntheticFilterPredicate.clone();
+          replace(clone);

Review comment:
       Can we make this method name a bit more descriptive? I like the method's javadoc; just wondering whether we can streamline the naming as well, e.g. `replaceWithDummyValues(clone)`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r611550383



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DynamicPartitionPruner.java
##########
@@ -514,4 +569,29 @@ private boolean checkForSourceCompletion(String name) {
     }
     return false;
   }
+
+  /**
+   * Recursively replaces the ExprNodeDynamicListDesc to the list of the actual values. As a result of this call the
+   * original expression is modified so it can be used for pushing down to the TableScan for filtering the data at the
+   * source.
+   * <p>
+   * Please make sure to clone the predicate if needed since the original node will be modified.
+   * @param node The node we are traversing
+   * @param dynArgs The constant values we are substituting
+   */
+  private void replaceDynamicLists(ExprNodeDesc node, Collection<ExprNodeConstantDesc> dynArgs) {
+    List<ExprNodeDesc> children = node.getChildren();
+    if (children != null && !children.isEmpty()) {
+      ListIterator<ExprNodeDesc> iterator = node.getChildren().listIterator();
+      while (iterator.hasNext()) {
+        ExprNodeDesc child = iterator.next();
+        if (child instanceof ExprNodeDynamicListDesc) {
+          iterator.remove();
+          dynArgs.forEach(iterator::add);

Review comment:
       This is where my knowledge might be insufficient, but we have these `SourceInfo` objects we try to evaluate.
   
   The `SourceInfo` has the following attributes:
   ```
       public final ExprNodeDesc partKey;
       public final ExprNodeDesc predicate;  // <- don't mind this as this is created by me
       public final Deserializer deserializer;
       public final StructObjectInspector soi;
       public final StructField field;
       public final ObjectInspector fieldInspector;
       /* List of partitions that are required - populated from processing each event */
       public Set<Object> values = new HashSet<Object>();
       /* Whether to skipPruning - depends on the payload from an event which may signal skip - if the event payload is too large */
       public AtomicBoolean skipPruning = new AtomicBoolean();
       public final String columnName;
       public final String columnType;
       private boolean mustKeepOnePartition;
   ```
   
   This suggests to me that every SourceInfo is only related to a single column. I added the `predicate` at the same place where the other fields are added, so I think the `predicate` should only contain a single column too.
   
   Also, I have run tests with `a=L1 and b=L2` conditions and found that the condition is split into multiple expressions and `SourceInfo` objects - but that was only a single test, so....
   
   Maybe we should return false in `addDynamicSplitPruningEdge` if we have multiple columns so we can bail out early?
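   
   To make the data flow a bit more concrete, a rough usage sketch inside the pruner (the variable `si` and the helper `toConstants` are assumed names for illustration, not the actual patch):
   
   ```
   // 'si' is one SourceInfo from above; clone first, since replaceDynamicLists mutates the node.
   ExprNodeDesc pushable = si.predicate.clone();
   // Turn the values collected from the pruning events into constants (assumed helper).
   Collection<ExprNodeConstantDesc> dynArgs = toConstants(si.values, si.columnType);
   // Substitute the ExprNodeDynamicListDesc placeholder with the actual constants.
   replaceDynamicLists(pushable, dynArgs);
   // 'pushable' can now be pushed down to the TableScan to filter the data at the source.
   ```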
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #2137:
URL: https://github.com/apache/hive/pull/2137#issuecomment-815153195


   @jcamachor, @kgyrtkirk: The latest test failure seems flaky. Since I am not very familiar with this part of the code, I would appreciate it if someone with more experience around compilation/execution could review.
   
   Thanks,
   Peter


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r609515251



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -250,4 +298,74 @@ static void overlayTableProperties(Configuration configuration, TableDesc tableD
     // this is an exception to the interface documentation, but it's a safe operation to add this property
     props.put(InputFormatConfig.TABLE_SCHEMA, schemaJson);
   }
+
+  /**
+   * Recursively collects the column names from the predicate.
+   * @param node The node we are traversing
+   * @param columns The already collected column names
+   */
+  private void columns(ExprNodeDesc node, Collection<String> columns) {
+    if (node instanceof ExprNodeColumnDesc) {
+      columns.add(((ExprNodeColumnDesc) node).getColumn());
+    } else {
+      List<ExprNodeDesc> children = node.getChildren();
+      if (children != null && !children.isEmpty()) {
+        children.forEach(child -> columns(child, columns));
+      }
+    }
+  }
+
+  /**
+   * Recursively replaces the ExprNodeDynamicListDesc nodes by a dummy ExprNodeConstantDesc so we can test if we can
+   * convert the predicate to an Iceberg predicate when pruning the partitions later. Please make sure that it is ok

Review comment:
       I'm not entirely sure I understand why this dummy replacement is needed to convert the predicate to an Iceberg predicate. Can you elaborate a little bit (maybe in the javadoc too)? Apologies if it's obvious; I'm just not that knowledgeable on partition pruning & co., and I'm assuming some of the future maintainers of this code might not be either.
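   
   The javadoc above suggests the intent: the dynamic list has no concrete values at compile time, so a stand-in constant lets the expression be run through ConvertAstToSearchArg and HiveIcebergFilterFactory just to verify that it is convertible. A minimal sketch of such a replacement (the method name and the placeholder value are assumptions, not the actual patch):
   
   ```
   import java.util.List;
   import java.util.ListIterator;
   import org.apache.hadoop.hive.ql.plan.ExprNodeConstantDesc;
   import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
   import org.apache.hadoop.hive.ql.plan.ExprNodeDynamicListDesc;
   
   final class DummyReplacer {
     private DummyReplacer() {
     }
   
     // Recursively swaps every dynamic list node for a constant; the concrete value does not
     // matter, only that the predicate becomes a fully constant expression that can be checked.
     static void replaceWithDummyValues(ExprNodeDesc node) {
       List<ExprNodeDesc> children = node.getChildren();
       if (children == null) {
         return;
       }
       ListIterator<ExprNodeDesc> iterator = children.listIterator();
       while (iterator.hasNext()) {
         ExprNodeDesc child = iterator.next();
         if (child instanceof ExprNodeDynamicListDesc) {
           // Placeholder value; the choice of 1 here is an assumption for the sketch.
           iterator.set(new ExprNodeConstantDesc(child.getTypeInfo(), 1));
         } else {
           replaceWithDummyValues(child);
         }
       }
     }
   }
   ```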




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r609491113



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -153,6 +168,39 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public boolean addDynamicSplitPruningEdge(org.apache.hadoop.hive.ql.metadata.Table table,
+      ExprNodeDesc syntheticFilterPredicate) {
+    try {
+      Collection<String> partitionColumns = ((HiveIcebergSerDe) table.getDeserializer()).partitionColumns();
+      if (partitionColumns.size() > 0) {
+        // Collect the column names from the predicate
+        Collection<String> filterColumns = Lists.newArrayList();
+        columns(syntheticFilterPredicate, filterColumns);
+
+        if (filterColumns.removeAll(partitionColumns)) {
+          // Replace dynamic values to dummy constants
+          ExprNodeDesc clone = syntheticFilterPredicate.clone();
+          replace(clone);
+
+          // Check if we can convert the expression to a valid Iceberg filter
+          SearchArgument sarg = ConvertAstToSearchArg.create(conf, (ExprNodeGenericFuncDesc) clone);
+          HiveIcebergFilterFactory.generateFilterExpression(sarg);
+          LOG.debug("Found Iceberg partition column to prune with predicate {}", filterColumns,

Review comment:
       There is an extra `filterColumns` argument passed to the debug log - the format string only contains a single `{}` placeholder, so one of the two arguments will not be rendered
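   
   A possible fix (sketch, using the same variables as in the hunk above) is to add a second placeholder so both arguments are rendered:
   
   ```
   LOG.debug("Found Iceberg partition columns {} to prune with predicate {}",
       filterColumns, syntheticFilterPredicate);
   ```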




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2137: HIVE-24962: Implement partition pruning for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2137:
URL: https://github.com/apache/hive/pull/2137#discussion_r616491306



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java
##########
@@ -171,7 +171,7 @@ private void prepare(InputInitializerContext initializerContext) throws IOExcept
       // perform dynamic partition pruning
       if (pruner != null) {
         pruner.initialize(getContext(), work, jobConf);
-        pruner.prune();
+        pruner.prune(jobConf);

Review comment:
       Removed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org