Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/08 18:20:19 UTC
[GitHub] [arrow-datafusion] alamb opened a new issue, #2712: Common Subexpression Eliminiation pass errors if run twice on some plans: Schema contains duplicate unqualified field name 'IsNull-Column-sys.host'
alamb opened a new issue, #2712:
URL: https://github.com/apache/arrow-datafusion/issues/2712
**Describe the bug**
We found this issue while working on IOx. IOx was (accidentally) optimizing a `LogicalPlan` more than once, and the second round of optimization failed with an error like `Schema error: Schema contains duplicate unqualified field name 'IsNull-Column-sys.host'`
**To Reproduce**
Try to optimize this plan twice:
```
Projection: #cpu_load_short.host, #cpu_load_short.region, #cpu_load_short.value AS value, #cpu_load_short.time
  Sort: #cpu_load_short.host ASC NULLS FIRST, #cpu_load_short.region ASC NULLS FIRST, #cpu_load_short.time ASC NULLS FIRST
    Filter: #cpu_load_short.host IS NULL AND #cpu_load_short.host IS NULL OR #cpu_load_short.host = Utf8("") OR NOT #cpu_load_short.host IS NULL AND #cpu_load_short.host = Utf8("server01") OR #cpu_load_short.host IS NULL OR #cpu_load_short.host = Utf8("")
      TableScan: cpu_load_short projection=None
```
After the first call to optimize, the plan looks like this:
```
Projection: #cpu_load_short.host, #cpu_load_short.region, #cpu_load_short.value AS value, #cpu_load_short.time
  Sort: #cpu_load_short.host ASC NULLS FIRST, #cpu_load_short.region ASC NULLS FIRST, #cpu_load_short.time ASC NULLS FIRST
    Projection: #cpu_load_short.host IS NULL OR #cpu_load_short.host = Utf8("") AS BinaryExpr-ORBinaryExpr-=LiteralColumn-cpu_load_short.hostIsNull-Column-cpu_load_short.host, #cpu_load_short.host IS NULL AS IsNull-Column-cpu_load_short.host, #cpu_load_short.host, #cpu_load_short.region, #cpu_load_short.time, #cpu_load_short.value
      Filter: #cpu_load_short.host IS NULL AS cpu_load_short.host IS NULL AND #cpu_load_short.host IS NULL OR #cpu_load_short.host = Utf8("") AS cpu_load_short.host IS NULL OR cpu_load_short.host = Utf8("") OR NOT #cpu_load_short.host IS NULL AS cpu_load_short.host IS NULL AND #cpu_load_short.host = Utf8("server01") OR #cpu_load_short.host IS NULL OR #cpu_load_short.host = Utf8("") AS cpu_load_short.host IS NULL OR cpu_load_short.host = Utf8("")
        TableScan: cpu_load_short projection=Some([host, region, time, value]), partial_filters=[#cpu_load_short.host IS NULL AS cpu_load_short.host IS NULL AND #cpu_load_short.host IS NULL OR #cpu_load_short.host = Utf8("") AS cpu_load_short.host IS NULL OR cpu_load_short.host = Utf8("") OR NOT #cpu_load_short.host IS NULL AS cpu_load_short.host IS NULL AND #cpu_load_short.host = Utf8("server01") OR #cpu_load_short.host IS NULL OR #cpu_load_short.host = Utf8("") AS cpu_load_short.host IS NULL OR cpu_load_short.host = Utf8("")]
```
The next call to `optimize()` then errors with:
```
Schema error: Schema contains duplicate unqualified field name 'IsNull-Column-sys.host'
```
I am working on a self-contained reproducer.
**Expected behavior**
The second call to optimize should not error and should work correctly.
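The expectation above is essentially that optimization is idempotent with respect to schema validity. As a rough illustration only, here is a minimal, self-contained toy model. It is not DataFusion's code and not a diagnosis of this bug. `ToySchema`, `naive_pass`, and `idempotent_pass` are hypothetical names invented for this sketch. It shows how a rewrite pass that always adds a column under a generated alias can trip a duplicate-name check when run a second time, and how a pass that recognizes its own earlier output avoids that.

```rust
use std::collections::HashSet;

/// Toy stand-in for a plan schema: just a list of unqualified field names.
/// (Hypothetical type for illustration, not a DataFusion type.)
#[derive(Debug, Clone, PartialEq)]
struct ToySchema {
    fields: Vec<String>,
}

impl ToySchema {
    /// Reject schemas in which two fields share the same unqualified name.
    fn try_new(fields: Vec<String>) -> Result<Self, String> {
        {
            let mut seen = HashSet::new();
            for name in &fields {
                // `insert` returns false when the name was already present.
                if !seen.insert(name.as_str()) {
                    return Err(format!(
                        "Schema contains duplicate unqualified field name '{}'",
                        name
                    ));
                }
            }
        }
        Ok(ToySchema { fields })
    }
}

/// Hypothetical rewrite pass that always prepends a column under a generated alias.
fn naive_pass(input: &ToySchema, alias: &str) -> Result<ToySchema, String> {
    let mut fields = vec![alias.to_string()];
    fields.extend(input.fields.iter().cloned());
    ToySchema::try_new(fields)
}

/// Variant that leaves the schema alone when the alias is already present,
/// so applying it twice gives the same result as applying it once.
fn idempotent_pass(input: &ToySchema, alias: &str) -> Result<ToySchema, String> {
    if input.fields.iter().any(|f| f == alias) {
        return Ok(input.clone());
    }
    naive_pass(input, alias)
}
```

In this toy model, calling `naive_pass` on its own output with the same alias returns the duplicate-name error, while `idempotent_pass` can be applied any number of times. Whether the real fix looks anything like this is an open question at this point in the thread.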
**Additional context**
IOx ticket with original issue: https://github.com/influxdata/influxdb_iox/issues/4800
PR to stop optimizing twice: https://github.com/influxdata/influxdb_iox/pull/4809
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on issue #2712: Common Subexpression Eliminiation pass errors if run twice on some plans: Schema contains duplicate unqualified field name 'IsNull-Column-sys.host'
Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2712:
URL: https://github.com/apache/arrow-datafusion/issues/2712#issuecomment-1151400179
Thanks @waynexia
[GitHub] [arrow-datafusion] andygrove closed issue #2712: Common Subexpression Eliminiation pass errors if run twice on some plans: Schema contains duplicate unqualified field name 'IsNull-Column-sys.host'
Posted by GitBox <gi...@apache.org>.
andygrove closed issue #2712: Common Subexpression Eliminiation pass errors if run twice on some plans: Schema contains duplicate unqualified field name 'IsNull-Column-sys.host'
URL: https://github.com/apache/arrow-datafusion/issues/2712
[GitHub] [arrow-datafusion] alamb commented on issue #2712: Common Subexpression Eliminiation pass errors if run twice on some plans: Schema contains duplicate unqualified field name 'IsNull-Column-sys.host'
Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2712:
URL: https://github.com/apache/arrow-datafusion/issues/2712#issuecomment-1150297976
Here is a self-contained reproducer:
```rust
use std::sync::Arc;

use datafusion::prelude::*;
use datafusion::arrow::array::Int32Array;
use datafusion::datasource::MemTable;
use datafusion::execution::context::TaskContext;
use datafusion::logical_plan::{LogicalPlanBuilder, provider_as_source};
use datafusion::physical_plan::collect;
use datafusion::error::Result;
use datafusion::arrow::{self, record_batch::RecordBatch};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    let a: Int32Array = vec![Some(1)].into_iter().collect();
    let batch = RecordBatch::try_from_iter(vec![
        ("a", Arc::new(a) as _),
    ]).unwrap();

    let t = MemTable::try_new(batch.schema(), vec![vec![batch]]).unwrap();

    let projection = None;
    let builder = LogicalPlanBuilder::scan(
        "cpu_load_short",
        provider_as_source(Arc::new(t)),
        projection
    ).unwrap()
        .filter(col("a").is_null()
                .or(col("a").eq(lit(2)))
                .or(col("a").is_null().and(col("a").eq(lit(5))))
                .or(col("a").is_null().or(col("a").eq(lit(2))))
        )
        .unwrap();

    let logical_plan = builder.build().unwrap();

    // manually optimize the plan
    let state = ctx.state.read().clone();
    let logical_plan = state.optimize(&logical_plan).unwrap();

    // THIS IS THE KEY: optimize it a second time
    let logical_plan = state.optimize(&logical_plan).unwrap();

    let physical_plan = state.query_planner.create_physical_plan(&logical_plan, &state).await.unwrap();
    let task_ctx = Arc::new(TaskContext::from(&state));
    let results: Vec<RecordBatch> = collect(physical_plan, task_ctx).await.unwrap();

    // format the results
    println!("Results:\n\n{}", arrow::util::pretty::pretty_format_batches(&results).unwrap());

    Ok(())
}
```
Cargo.toml:
```toml
[package]
name = "rust_arrow_playground"
version = "0.1.0"
edition = "2018"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
ahash = "0.7"
tokio = "1.8.2"
tokio-stream = "0.1"
async-trait = "0.1"
futures-util = { version = "0.3.1" }
datafusion = { path = "/Users/alamb/Software/arrow-datafusion/datafusion/core", default-features = false }
once_cell = "1.8.0"
rand = "0.8"
```
When run, it errors like this:
```
cd /Users/alamb/Software/rust_datafusion_playground && RUST_BACKTRACE=1 CARGO_TARGET_DIR=/Users/alamb/Software/df-target cargo run
Compiling rust_arrow_playground v0.1.0 (/Users/alamb/Software/rust_datafusion_playground)
Finished dev [unoptimized + debuginfo] target(s) in 3.77s
Running `/Users/alamb/Software/df-target/debug/rust_arrow_playground`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: SchemaError(DuplicateUnqualifiedField { name: "IsNull-Column-cpu_load_short.a" })', src/main.rs:46:54
stack backtrace:
0: rust_begin_unwind
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/panicking.rs:143:14
2: core::result::unwrap_failed
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/result.rs:1785:5
3: core::result::Result<T,E>::unwrap
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/result.rs:1078:23
4: rust_arrow_playground::main::{{closure}}
at ./src/main.rs:46:24
5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/future/mod.rs:91:19
6: tokio::park::thread::CachedParkThread::block_on::{{closure}}
at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/park/thread.rs:263:54
7: tokio::coop::with_budget::{{closure}}
at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/coop.rs:102:9
8: std::thread::local::LocalKey<T>::try_with
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/thread/local.rs:442:16
9: std::thread::local::LocalKey<T>::with
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/thread/local.rs:418:9
10: tokio::coop::with_budget
at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/coop.rs:95:5
11: tokio::coop::budget
at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/coop.rs:72:5
12: tokio::park::thread::CachedParkThread::block_on
at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/park/thread.rs:263:31
13: tokio::runtime::enter::Enter::block_on
at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/enter.rs:151:13
14: tokio::runtime::thread_pool::ThreadPool::block_on
at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/thread_pool/mod.rs:90:9
15: tokio::runtime::Runtime::block_on
at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/mod.rs:482:43
16: rust_arrow_playground::main
at ./src/main.rs:55:5
17: core::ops::function::FnOnce::call_once
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
[GitHub] [arrow-datafusion] alamb commented on issue #2712: Common Subexpression Eliminiation pass errors if run twice on some plans: Schema contains duplicate unqualified field name 'IsNull-Column-sys.host'
Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2712:
URL: https://github.com/apache/arrow-datafusion/issues/2712#issuecomment-1150299033
FYI @waynexia as I think you implemented this code initially in https://github.com/apache/arrow-datafusion/pull/792 -- perhaps you have some ideas of what might be going wrong 🤔
[GitHub] [arrow-datafusion] waynexia commented on issue #2712: Common Subexpression Eliminiation pass errors if run twice on some plans: Schema contains duplicate unqualified field name 'IsNull-Column-sys.host'
Posted by GitBox <gi...@apache.org>.
waynexia commented on issue #2712:
URL: https://github.com/apache/arrow-datafusion/issues/2712#issuecomment-1151340221
Very clear reproducer @alamb, thanks! :+1:
Update: I found that this only happens when both `FilterPushDown` and `CommonSubexprEliminate` are enabled. Still digging... You can assign this to me.
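For readers who want to narrow down a similar interaction, one generic approach is to try every subset of rules and record the subsets for which optimizing twice fails even though optimizing once succeeds. The sketch below is a self-contained toy, not DataFusion's optimizer API. `Plan`, `Rule`, `toy_pushdown`, `toy_cse`, `optimize`, and `failing_subsets` are all hypothetical names, and the two toy rules are deliberately contrived so that only their combination fails. They do not model what `FilterPushDown` or `CommonSubexprEliminate` actually do.

```rust
/// Toy plan: only the output column names matter for this sketch.
type Plan = Vec<String>;
/// A rule either rewrites the plan or fails with a message.
type Rule = fn(&Plan) -> Result<Plan, String>;

/// Fail if any column name appears more than once.
fn check_unique(plan: Plan) -> Result<Plan, String> {
    for (i, name) in plan.iter().enumerate() {
        if plan[..i].contains(name) {
            return Err(format!("duplicate field name '{}'", name));
        }
    }
    Ok(plan)
}

/// Contrived rule: replaces the marker column "expr" with a column named "alias".
fn toy_cse(plan: &Plan) -> Result<Plan, String> {
    if !plan.iter().any(|c| c.as_str() == "expr") {
        return Ok(plan.clone());
    }
    let mut out: Plan = plan.iter().filter(|c| c.as_str() != "expr").cloned().collect();
    out.push("alias".to_string());
    check_unique(out)
}

/// Contrived rule: re-introduces "expr" once "alias" exists and "expr" is gone.
fn toy_pushdown(plan: &Plan) -> Result<Plan, String> {
    let has = |n: &str| plan.iter().any(|c| c.as_str() == n);
    let mut out = plan.clone();
    if has("alias") && !has("expr") {
        out.push("expr".to_string());
    }
    check_unique(out)
}

/// Run every rule in order, once.
fn optimize(rules: &[(&'static str, Rule)], plan: &Plan) -> Result<Plan, String> {
    let mut current = plan.clone();
    for (_, rule) in rules {
        current = rule(&current)?;
    }
    Ok(current)
}

/// Names of every rule subset for which optimizing twice fails although
/// optimizing once succeeds. Intended for a small number of rules (< 32).
fn failing_subsets(rules: &[(&'static str, Rule)], plan: &Plan) -> Vec<Vec<&'static str>> {
    let mut failing = Vec::new();
    for mask in 1u32..(1u32 << rules.len()) {
        // Keep the rules whose bit is set in `mask`, preserving their order.
        let subset: Vec<(&'static str, Rule)> = rules
            .iter()
            .enumerate()
            .filter(|(i, _)| mask & (1u32 << *i) != 0)
            .map(|(_, r)| *r)
            .collect();
        if let Ok(once) = optimize(&subset, plan) {
            if optimize(&subset, &once).is_err() {
                failing.push(subset.iter().map(|(name, _)| *name).collect());
            }
        }
    }
    failing
}
```

With the rule list `[("toy_pushdown", toy_pushdown), ("toy_cse", toy_cse)]` and the starting plan `["expr", "a"]`, only the subset containing both toy rules is reported, which mirrors the shape of the observation above. The exhaustive search is exponential in the number of rules, so it is only practical for short rule lists.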