You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "Jefffrey (via GitHub)" <gi...@apache.org> on 2023/05/30 12:14:29 UTC

[GitHub] [arrow-datafusion] Jefffrey commented on issue #6012: Cross-schema/cross-catalog qualified column doesn't do ambiguity check

Jefffrey commented on issue #6012:
URL: https://github.com/apache/arrow-datafusion/issues/6012#issuecomment-1568329558

   So I took an initial stab at this: https://github.com/Jefffrey/arrow-datafusion/commit/b5548e047d45ec8f286d2f84773feb87a21a3939
   
   I found one issue which originated from how the SQL planner was generating `Expr`s from the SQL AST, specifically how it searches the schema for a matching column:
   
   https://github.com/apache/arrow-datafusion/blob/0d9c542c84f68dad42eaa0d26a55810cdd5cff2b/datafusion/sql/src/expr/identifier.rs#L282-L290
   
   It calls `field_with_name(...)` which eventually flows to `index_of_column_by_name(...)`:
   
   https://github.com/apache/arrow-datafusion/blob/0d9c542c84f68dad42eaa0d26a55810cdd5cff2b/datafusion/common/src/dfschema.rs#L186-L220
   
   Where can see it finds the first match only (if exists), ignoring the case where there are multiple matches. Specifically here:
   
   https://github.com/apache/arrow-datafusion/blob/0d9c542c84f68dad42eaa0d26a55810cdd5cff2b/datafusion/common/src/dfschema.rs#L200
   
   e.g. given the original issue, where in same schema can have fields `s1.t.b` and `s2.t.b`, where `b` is the column name and `s1.t` and `s2.t` are the qualifiers, when searching with qualifier `t` and column name `b` it'll match for both of them.
   
   I tried to preserve the original behavior of allowing multiple matches, and introduced a scoring system to try pick the best match, but it still didn't solve the issue, as I believe somewhere else down the line in the planning, it still resolves the column without doing a proper ambiguity check.
   
   I wonder if it's better to try centralize ambiguity checks somewhere, instead of trying to hunt down the different places that can resolve columns and implementing the checks there. Like a new analysis rule to resolve column references, though would require changes to planner (large impact?).
   
   Thoughts @alamb ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org