You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "Jefffrey (via GitHub)" <gi...@apache.org> on 2023/02/07 11:23:07 UTC

[GitHub] [arrow-datafusion] Jefffrey opened a new issue, #5211: Dataframe filter() doesn't do ambiguity check

Jefffrey opened a new issue, #5211:
URL: https://github.com/apache/arrow-datafusion/issues/5211

   **Describe the bug**
   When using Dataframe filter() it doesn't check for ambiguous columns
   
   **To Reproduce**
   
   ```rust
       let schema = Arc::new(Schema::new(vec![
           Field::new("a", DataType::Utf8, false),
           Field::new("b", DataType::Int32, false),
       ]));
       let batch = RecordBatch::try_new(
           schema,
           vec![
               Arc::new(StringArray::from_slice(["a", "b", "c", "d"])),
               Arc::new(Int32Array::from_slice([1, 10, 10, 100])),
           ],
       )?;
       let ctx = SessionContext::new();
       ctx.register_batch("t1", batch.clone())?;
       ctx.register_batch("t2", batch)?;
       ctx.table("t1")
           .await?
           .join(
               ctx.table("t2").await?,
               JoinType::Inner,
               &["a"],
               &["a"],
               None,
           )?
           .filter(col("b").eq(lit(1)))? // ambiguous b
           .explain(false, false)?
           .show()
           .await?;
   ```
   
   Will execute fine:
   
   ```sql
   +---------------+--------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                                     |
   +---------------+--------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Inner Join: t1.a = t2.a                                                                                                  |
   |               |   Filter: t1.b = Int32(1)                                                                                                |
   |               |     TableScan: t1 projection=[a, b]                                                                                      |
   |               |   TableScan: t2 projection=[a, b]                                                                                        |
   | physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                                              |
   |               |   HashJoinExec: mode=Partitioned, join_type=Inner, on=[(Column { name: "a", index: 0 }, Column { name: "a", index: 0 })] |
   |               |     CoalesceBatchesExec: target_batch_size=8192                                                                          |
   |               |       RepartitionExec: partitioning=Hash([Column { name: "a", index: 0 }], 12), input_partitions=12                      |
   |               |         CoalesceBatchesExec: target_batch_size=8192                                                                      |
   |               |           FilterExec: b@1 = 1                                                                                            |
   |               |             RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1                                        |
   |               |               MemoryExec: partitions=1, partition_sizes=[1]                                                              |
   |               |     CoalesceBatchesExec: target_batch_size=8192                                                                          |
   |               |       RepartitionExec: partitioning=Hash([Column { name: "a", index: 0 }], 12), input_partitions=12                      |
   |               |         RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1                                            |
   |               |           MemoryExec: partitions=1, partition_sizes=[1]                                                                  |
   |               |                                                                                                                          |
   +---------------+--------------------------------------------------------------------------------------------------------------------------+
   ```
   
   **Expected behavior**
   
   Should raise error
   
   **Additional context**
   
   Similar has been addressed for SQL queries: https://github.com/apache/arrow-datafusion/issues/4196
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jefffrey commented on issue #5211: Dataframe filter() doesn't do ambiguity check

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.

Jefffrey commented on issue #5211:
URL: https://github.com/apache/arrow-datafusion/issues/5211#issuecomment-1423188620

   Seems not limited to only filter, but sort, select and aggregate don't do these ambiguity checks either.
   
   I plan to take this on


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jefffrey commented on issue #5211: Dataframe methods don't do ambiguity checks

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.

Jefffrey commented on issue #5211:
URL: https://github.com/apache/arrow-datafusion/issues/5211#issuecomment-1426922495

   Although could fix by adding the relevant ambiguity check in the Dataframe API methods (before it calls the LogicalPlanBuilder), I wonder if it would be better to try push that ambiguity check into the LogicalPlanBuilder methods themselves, otherwise always there would be duplication of the ambiguity check, once in the SQL planner and once in the Dataframe API, though they both use the same underlying API which is LogicalPlanBuilder.
   
   Part of this could be to try introduce the check into the `normalize_col` function itself (actually in the underlying function it calls):
   
   https://github.com/apache/arrow-datafusion/blob/f75d25fec2c1a5581eeb8ce73a890e5792df02c7/datafusion/expr/src/expr_rewriter.rs#L344-L348
   
   Which calls underlying `normalize_with_schemas`:
   
   https://github.com/apache/arrow-datafusion/blob/f75d25fec2c1a5581eeb8ce73a890e5792df02c7/datafusion/common/src/column.rs#L86-L109
   
   One thing thats tripping me is the usage of `all_schemas()` in `normalize_col`, since that is what allows `normalize_with_schemas` to find a schema with only one of the field, since usually:
   
   - Check output schema first, find 2 fields because of ambiguity
   - Don't return (since not a using column), check next schema in iteration (e.g. left input)
   - That schema has only one of the ambiguous fields, hence will return that for the normalized column
   
   Could instead enforce in the first iteration that if ambiguity is detected, will fail there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb closed issue #5211: Dataframe methods don't do ambiguity checks

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb closed issue #5211: Dataframe methods don't do ambiguity checks
URL: https://github.com/apache/arrow-datafusion/issues/5211


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org