You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "Jefffrey (via GitHub)" <gi...@apache.org> on 2023/04/14 11:42:55 UTC

[GitHub] [arrow-datafusion] Jefffrey opened a new issue, #6012: Cross-schema qualified column doesn't do ambiguity check

Jefffrey opened a new issue, #6012:
URL: https://github.com/apache/arrow-datafusion/issues/6012

   ### Describe the bug
   
   If joining two identical tables from different schemas, and selecting a column using a table qualifier as part of the identifier, it should do ambiguity check and fail if referring to an ambiguous column.
   
   ### To Reproduce
   
   Using the [catalog.rs](https://github.com/apache/arrow-datafusion/blob/ebb839068b9d99d3a3fea0a50a1e4baf4f1a5fca/datafusion-examples/examples/catalog.rs) example as a base, but replace main function with:
   
   ```rs
   #[tokio::main]
   async fn main() -> Result<()> {
       let repo_dir =
           std::fs::canonicalize(PathBuf::from(env!("CARGO_MANIFEST_DIR")).join(".."))?;
       let mut ctx = SessionContext::new();
       let state = ctx.state();
       let catlist = Arc::new(CustomCatalogList::new());
       ctx.register_catalog_list(catlist.clone());
   
       let catalog = DirCatalog::new();
       let csv_schema = DirSchema::create(
           &state,
           DirSchemaOpts {
               format: Arc::new(CsvFormat::default()),
               dir: &repo_dir.join("testing").join("data").join("csv"),
               ext: "csv",
           },
       )
       .await?;
   
       catalog.register_schema("csv1", csv_schema.clone())?;
       catalog.register_schema("csv2", csv_schema.clone())?;
   
       ctx.register_catalog("dircat", Arc::new(catalog));
   
       let sql = r#"
       select "aggregate_test_100.csv".c2
       from dircat.csv1."aggregate_test_100.csv"
         join dircat.csv2."aggregate_test_100.csv" using (c1)
       "#;
       let df = ctx.sql(sql).await?.limit(0, Some(1))?;
       df.clone().explain(false, false)?.show().await?;
       df.show().await?;
   
       Ok(())
   }
   ```
   
   Can see there is identical table `aggregate_test_100.csv` in both schemas `csv1` and `csv2`, and selecting `"aggregate_test_100.csv".c2` column (see it's qualified with table) should do ambiguity check as could be in either table. But output is:
   
   ```bash
   effrey:~/Code/arrow-datafusion$ cargo run --example catalog
      Compiling datafusion-examples v22.0.0 (/home/jeffrey/Code/arrow-datafusion/datafusion-examples)
       Finished dev [unoptimized + debuginfo] target(s) in 10.57s
        Running `/media/jeffrey/1tb_860evo_ssd/.cargo_target_cache/debug/examples/catalog`
   +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                                                                                       |
   +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: dircat.csv1.aggregate_test_100.csv.c2                                                                                                                          |
   |               |   Limit: skip=0, fetch=1                                                                                                                                                   |
   |               |     Inner Join: Using dircat.csv1.aggregate_test_100.csv.c1 = dircat.csv2.aggregate_test_100.csv.c1                                                                        |
   |               |       TableScan: dircat.csv1.aggregate_test_100.csv projection=[c1, c2]                                                                                                    |
   |               |       TableScan: dircat.csv2.aggregate_test_100.csv projection=[c1]                                                                                                        |
   | physical_plan | ProjectionExec: expr=[c2@1 as c2]                                                                                                                                          |
   |               |   GlobalLimitExec: skip=0, fetch=1                                                                                                                                         |
   |               |     CoalescePartitionsExec                                                                                                                                                 |
   |               |       CoalesceBatchesExec: target_batch_size=8192                                                                                                                          |
   |               |         HashJoinExec: mode=Partitioned, join_type=Inner, on=[(Column { name: "c1", index: 0 }, Column { name: "c1", index: 0 })]                                           |
   |               |           CoalesceBatchesExec: target_batch_size=8192                                                                                                                      |
   |               |             RepartitionExec: partitioning=Hash([Column { name: "c1", index: 0 }], 12), input_partitions=12                                                                 |
   |               |               RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1                                                                                        |
   |               |                 CsvExec: files={1 group: [[home/jeffrey/Code/arrow-datafusion/testing/data/csv/aggregate_test_100.csv]]}, has_header=true, limit=None, projection=[c1, c2] |
   |               |           CoalesceBatchesExec: target_batch_size=8192                                                                                                                      |
   |               |             RepartitionExec: partitioning=Hash([Column { name: "c1", index: 0 }], 12), input_partitions=12                                                                 |
   |               |               RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1                                                                                        |
   |               |                 CsvExec: files={1 group: [[home/jeffrey/Code/arrow-datafusion/testing/data/csv/aggregate_test_100.csv]]}, has_header=true, limit=None, projection=[c1]     |
   |               |                                                                                                                                                                            |
   +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   +----+
   | c2 |
   +----+
   | 3  |
   +----+
   ```
   
   To note, if column is not qualified at all and left as `c2` then ambiguity check will occur.
   
   ### Expected behavior
   
   Should return error about ambiguous column
   
   ### Additional context
   
   Ambiguity check was fixed in https://github.com/apache/arrow-datafusion/pull/5509 but seems this only accounted for unqualified columns, not qualified ones as well.
   
   Also this seems likely to occur if the tables are in different catalogs too, not just different schemas.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jefffrey commented on issue #6012: Cross-schema/cross-catalog qualified column doesn't do ambiguity check

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.

Jefffrey commented on issue #6012:
URL: https://github.com/apache/arrow-datafusion/issues/6012#issuecomment-1568329558

   So I took an initial stab at this: https://github.com/Jefffrey/arrow-datafusion/commit/b5548e047d45ec8f286d2f84773feb87a21a3939
   
   I found one issue which originated from how the SQL planner was generating `Expr`s from the SQL AST, specifically how it searches the schema for a matching column:
   
   https://github.com/apache/arrow-datafusion/blob/0d9c542c84f68dad42eaa0d26a55810cdd5cff2b/datafusion/sql/src/expr/identifier.rs#L282-L290
   
   It calls `field_with_name(...)` which eventually flows to `index_of_column_by_name(...)`:
   
   https://github.com/apache/arrow-datafusion/blob/0d9c542c84f68dad42eaa0d26a55810cdd5cff2b/datafusion/common/src/dfschema.rs#L186-L220
   
   Where can see it finds the first match only (if exists), ignoring the case where there are multiple matches. Specifically here:
   
   https://github.com/apache/arrow-datafusion/blob/0d9c542c84f68dad42eaa0d26a55810cdd5cff2b/datafusion/common/src/dfschema.rs#L200
   
   e.g. given the original issue, where in same schema can have fields `s1.t.b` and `s2.t.b`, where `b` is the column name and `s1.t` and `s2.t` are the qualifiers, when searching with qualifier `t` and column name `b` it'll match for both of them.
   
   I tried to preserve the original behavior of allowing multiple matches, and introduced a scoring system to try pick the best match, but it still didn't solve the issue, as I believe somewhere else down the line in the planning, it still resolves the column without doing a proper ambiguity check.
   
   I wonder if it's better to try centralize ambiguity checks somewhere, instead of trying to hunt down the different places that can resolve columns and implementing the checks there. Like a new analysis rule to resolve column references, though would require changes to planner (large impact?).
   
   Thoughts @alamb ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #6012: Cross-schema/cross-catalog qualified column doesn't do ambiguity check

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #6012:
URL: https://github.com/apache/arrow-datafusion/issues/6012#issuecomment-1572558146

   > I wonder if it's better to try centralize ambiguity checks somewhere, instead of trying to hunt down the different places that can resolve columns and implementing the checks there. Like a new analysis rule to resolve column references, though would require changes to planner (large impact?).
   
   Yes, I think consolidating the ambiguity checks and resolving column references (where they can be more easily documented and unit tested) would be very valuable and help DataFusion along


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org