You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/05 21:46:42 UTC

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3726: Fix output schema generated by CommonSubExprEliminate

alamb commented on code in PR #3726:
URL: https://github.com/apache/arrow-datafusion/pull/3726#discussion_r988379685


##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -448,7 +474,21 @@ impl ExpressionVisitor for ExprIdentifierVisitor<'_> {
 
         self.id_array[idx] = (self.series_number, desc.clone());
         self.visit_stack.push(VisitRecord::ExprItem(desc.clone()));
-        let data_type = self.data_type.clone();
+
+        let data_type = if let Ok(data_type) = expr.get_type(&self.input_schema) {
+            data_type
+        } else {
+            // expression type could not be resolved in schema, fall back to all schemas
+            let merged_schema =
+                self.all_schemas
+                    .iter()
+                    .fold(DFSchema::empty(), |mut lhs, rhs| {
+                        lhs.merge(rhs);
+                        lhs
+                    });

Review Comment:
   I wonder if we could use `reduce` here (I realize you just refactored this code)
   ```suggestion
               let merged_schema =
                   self.all_schemas
                       .iter()
                       .reduce(|mut lhs, rhs| {
                           lhs.merge(rhs);
                           lhs
                       });
   ```



##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -370,7 +391,12 @@ struct ExprIdentifierVisitor<'a> {
     expr_set: &'a mut ExprSet,
     /// series number (usize) and identifier.
     id_array: &'a mut Vec<(usize, Identifier)>,
-    data_type: DataType,
+    /// input schema for the node that we're optimizing, so we can determine the correct datatype
+    /// for each subexpression
+    input_schema: DFSchemaRef,
+    /// all schemas in the logical plan, as a fall back if we cannot resolve an expression type
+    /// from the input schema alone
+    all_schemas: Vec<DFSchemaRef>,

Review Comment:
   I don't understand in what cases we wouldn't be able to resolve an expr type from the input schema alone.
   
   The only case I can think of is when the plan node has more than one input (e.g. a Join or a Union) -- but thus I would expect that we always resolve the type of the expressions using the input schema



##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -448,7 +474,21 @@ impl ExpressionVisitor for ExprIdentifierVisitor<'_> {
 
         self.id_array[idx] = (self.series_number, desc.clone());
         self.visit_stack.push(VisitRecord::ExprItem(desc.clone()));
-        let data_type = self.data_type.clone();
+
+        let data_type = if let Ok(data_type) = expr.get_type(&self.input_schema) {
+            data_type
+        } else {
+            // expression type could not be resolved in schema, fall back to all schemas
+            let merged_schema =
+                self.all_schemas
+                    .iter()
+                    .fold(DFSchema::empty(), |mut lhs, rhs| {
+                        lhs.merge(rhs);
+                        lhs
+                    });
+            expr.get_type(&merged_schema)?
+        };

Review Comment:
   I think all exprs should be resolvable with the unified schemas, as explained above -- but maybe it is a performance optimization 🤔.
   
   Perhaps you could leave a comment explaining that we are not sure if it is necessary 



##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -597,22 +640,36 @@ mod test {
     fn id_array_visitor() -> Result<()> {
         let expr = binary_expr(
             binary_expr(
-                sum(binary_expr(col("a"), Operator::Plus, lit("1"))),
+                sum(binary_expr(col("a"), Operator::Plus, lit(1))),

Review Comment:
   I agree this test doesn't make sense as coercion should have happened before this pass



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org