Posted to github@arrow.apache.org by "Jefffrey (via GitHub)" <gi...@apache.org> on 2023/03/07 19:48:31 UTC

[GitHub] [arrow-datafusion] Jefffrey commented on a diff in pull request #5509: Enforce ambiguity check whilst normalizing columns

Jefffrey commented on code in PR #5509:
URL: https://github.com/apache/arrow-datafusion/pull/5509#discussion_r1128477085


##########
datafusion/common/src/column.rs:
##########
@@ -154,6 +158,105 @@ impl Column {
                 .collect(),
         }))
     }
+
+    /// Qualify the column if it is not already qualified.
+    ///
+    /// If this column already has a [relation](Self::relation), it will be returned as is and the given parameters are
+    /// ignored. Otherwise this will search through the given schemas to find the column.
+    ///
+    /// Will check for ambiguity at each level of `schemas`.
+    ///
+    /// A schema matches if there is a single column that -- when unqualified -- matches this column. There is an
+    /// exception for `USING` statements, see below.
+    ///
+    /// # Using columns
+    /// Take the following SQL statement:
+    ///
+    /// ```sql
+    /// SELECT id FROM t1 JOIN t2 USING(id)
+    /// ```
+    ///
+    /// In this case, both `t1.id` and `t2.id` will match unqualified column `id`. To express this possibility, use
+    /// `using_columns`. Each entry in this array is a set of columns that are bound together via a `USING` clause. So
+    /// in this example this would be `[{t1.id, t2.id}]`.
+    ///
+    /// Regarding ambiguity check, `schemas` is structured to allow levels of schemas to be passed in.
+    /// For example:
+    ///
+    /// ```text
+    /// schemas = &[
+    ///    &[schema1, schema2], // first level
+    ///    &[schema3, schema4], // second level
+    /// ]
+    /// ```
+    ///
+    /// The search checks all schemas in the first level for a matching field. If no match is found according to the
+    /// conditions above, the next level is checked. If more than one matching column (that is not a USING column) is
+    /// found across the schemas of a single level, an ambiguous-column error is returned.
+    ///
+    /// If no field is found after checking all levels, a field-not-found error is returned.
+    pub fn normalize_with_schemas_and_ambiguity_check(
+        self,
+        schemas: &[&[&DFSchema]],

Review Comment:
   Kinda ugly syntax here; it makes the logic work, but maybe there's something better?
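To make the leveled lookup concrete, here is a hedged sketch of the algorithm the doc comment describes. `Schema` is a stand-in (a list of `(qualifier, column)` pairs), not DataFusion's `DFSchema`, and the USING-column exception is omitted for brevity:

```rust
// A schema is modeled as a list of (qualifier, column name) pairs.
// This is a simplification of DFSchema for illustration only.
type Schema = Vec<(String, String)>;

fn normalize(name: &str, levels: &[Vec<Schema>]) -> Result<String, String> {
    for level in levels {
        // Collect every field in this level whose unqualified name matches.
        let matches: Vec<&(String, String)> = level
            .iter()
            .flat_map(|schema| schema.iter())
            .filter(|(_, col)| col.as_str() == name)
            .collect();
        match matches.len() {
            0 => continue, // nothing at this level; fall back to the next one
            1 => return Ok(format!("{}.{}", matches[0].0, matches[0].1)),
            _ => return Err(format!("ambiguous column: {name}")),
        }
    }
    Err(format!("field not found: {name}"))
}
```

With `levels = [[t1.a], [t2.b, t3.b]]`, looking up `a` resolves in the first level, while looking up `b` is ambiguous within the second level.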



##########
datafusion/common/src/column.rs:
##########
@@ -102,6 +102,10 @@ impl Column {
     /// In this case, both `t1.id` and `t2.id` will match unqualified column `id`. To express this possibility, use
     /// `using_columns`. Each entry in this array is a set of columns that are bound together via a `USING` clause. So
     /// in this example this would be `[{t1.id, t2.id}]`.
+    #[deprecated(
+        since = "20.0.0",
+        note = "use normalize_with_schemas_and_ambiguity_check instead"
+    )]

Review Comment:
   I've marked the old methods as deprecated since I'm unsure whether they should be removed (technically a breaking API change?)



##########
datafusion/substrait/src/logical_plan/consumer.rs:
##########
@@ -505,10 +505,14 @@ pub async fn from_substrait_rex(
                         "Direct reference StructField with child is not supported"
                             .to_string(),
                     )),
-                    None => Ok(Arc::new(Expr::Column(Column {
-                        relation: None,
-                        name: input_schema.field(x.field as usize).name().to_string(),
-                    }))),
+                    None => {
+                        let column =
+                            input_schema.field(x.field as usize).qualified_column();
+                        Ok(Arc::new(Expr::Column(Column {
+                            relation: column.relation,
+                            name: column.name,
+                        })))
+                    }

Review Comment:
   This is a separate Substrait fix: previously the column being built wasn't qualified even when the underlying field itself was qualified.
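A minimal sketch of the bug being fixed, using simplified stand-ins for DataFusion's `Column` and field types (these structs only mimic the real ones for this example). The old code built `Column { relation: None, .. }` and so dropped the field's qualifier; going through a `qualified_column()`-style helper carries it through:

```rust
// Simplified stand-ins, not DataFusion's actual structs.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    relation: Option<String>,
    name: String,
}

struct Field {
    qualifier: Option<String>,
    name: String,
}

impl Field {
    // Analogous in spirit to DFField::qualified_column():
    // keep the qualifier instead of discarding it.
    fn qualified_column(&self) -> Column {
        Column {
            relation: self.qualifier.clone(),
            name: self.name.clone(),
        }
    }
}
```

For a field qualified as `t1.id`, the old construction yielded an unqualified `id` column, while `qualified_column()` preserves `relation: Some("t1")`.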



##########
datafusion/expr/src/logical_plan/plan.rs:
##########
@@ -174,7 +174,26 @@ impl LogicalPlan {
         }
     }
 
+    /// Used for normalizing columns, as the fallback schemas to the main schema
+    /// of the plan.
+    pub fn fallback_normalize_schemas(&self) -> Vec<&DFSchema> {
+        match self {
+            LogicalPlan::Window(_)
+            | LogicalPlan::Projection(_)
+            | LogicalPlan::Aggregate(_)
+            | LogicalPlan::Unnest(_)
+            | LogicalPlan::Join(_)
+            | LogicalPlan::CrossJoin(_) => self
+                .inputs()
+                .iter()
+                .map(|input| input.schema().as_ref())
+                .collect(),
+            _ => vec![],
+        }
+    }

Review Comment:
   Since the previous column normalization depended on all_schemas (which I still don't fully understand), this method extracts the part of all_schemas that returns both self and the child schemas, but returns only the child schemas.
   
   It's meant to be used as a fallback for the new normalization method.
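A hedged sketch of the intended call pattern: the plan's own schema forms the first level, and the fallback (child) schemas returned by `fallback_normalize_schemas()` form the second, matching the leveled `schemas` shape the new normalization method takes. Schemas are plain names here; the real code passes `&DFSchema` references:

```rust
// Build the leveled schema list: [own schema] first, fallback schemas second.
// Strings stand in for &DFSchema purely for illustration.
fn normalization_levels<'a>(own: &'a str, fallback: &[&'a str]) -> Vec<Vec<&'a str>> {
    let mut levels = vec![vec![own]];
    if !fallback.is_empty() {
        // Only add a second level when the plan actually has fallback schemas.
        levels.push(fallback.to_vec());
    }
    levels
}
```

So a projection would search its own schema first, and only consult its input's schema when no match (or ambiguity) is found at the first level.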



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org