You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/29 09:27:41 UTC

[GitHub] [arrow-datafusion] dbr opened a new issue, #2374: Arrow2: Identifiers are made lower-case in SQL query

dbr opened a new issue, #2374:
URL: https://github.com/apache/arrow-datafusion/issues/2374

   **Describe the bug**
   With (I think) [the most up-to-date version](https://github.com/apache/arrow-datafusion/commit/744b2626081db95a254fc882820fc7812f95aa51) of the arrow2 branch, the idenfitiers in query seem to be made lower-case
   
   For example if I modify the `memtable` example by renaming the `bank_acount` field to `Bank_Account` then the query will fail,
   
   ```shell
   $ gd
   diff --git a/datafusion-examples/examples/memtable.rs b/datafusion-examples/examples/memtable.rs
   index 4c635205..16bd247d 100644
   --- a/datafusion-examples/examples/memtable.rs
   +++ b/datafusion-examples/examples/memtable.rs
   @@ -37,7 +37,7 @@ async fn main() -> Result<()> {
        // Register the in-memory table containing the data
        ctx.register_table("users", Arc::new(mem_table))?;
    
   -    let dataframe = ctx.sql("SELECT * FROM users;").await?;
   +    let dataframe = ctx.sql("SELECT Bank_Account FROM users;").await?;
    
        timeout(Duration::from_secs(10), async move {
            let result = dataframe.collect().await.unwrap();
   @@ -72,6 +72,6 @@ fn create_record_batch() -> Result<RecordBatch> {
    fn get_schema() -> SchemaRef {
        SchemaRef::new(Schema::new(vec![
            Field::new("id", DataType::UInt8, false),
   -        Field::new("bank_account", DataType::UInt64, true),
   +        Field::new("Bank_Account", DataType::UInt64, true),
        ]))
    }
   $ cargo run --example memtable
       Finished dev [unoptimized + debuginfo] target(s) in 0.35s
        Running `./target/debug/examples/memtable`
   Error: Plan("Invalid identifier '#bank_account' for schema fields:[users.id, users.Bank_Account], metadata:{}")
   ```
   
   Meaning, if the source table has any non-lowercase characters, the queries cannot work
   
   **Additional context**
   I'm using the arrow2 branch at this revision:
   
   https://github.com/apache/arrow-datafusion/commit/744b2626081db95a254fc882820fc7812f95aa51


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] dbr commented on issue #2374: Arrow2: Identifiers are made lower-case in SQL query

Posted by GitBox <gi...@apache.org>.

dbr commented on issue #2374:
URL: https://github.com/apache/arrow-datafusion/issues/2374#issuecomment-1113118289

   Oh this is acting same in latest `master` branch! 7a9b86526e2bd12deccb9ff30cead2ba6323409a
   
   ```sh
   $ git rev-parse HEAD
   7a9b86526e2bd12deccb9ff30cead2ba6323409a
   $ git diff
   diff --git a/datafusion-examples/examples/memtable.rs b/datafusion-examples/examples/memtable.rs
   index 11793a83..80cb65fc 100644
   --- a/datafusion-examples/examples/memtable.rs
   +++ b/datafusion-examples/examples/memtable.rs
   @@ -36,7 +36,7 @@ async fn main() -> Result<()> {
        // Register the in-memory table containing the data
        ctx.register_table("users", Arc::new(mem_table))?;
    
   -    let dataframe = ctx.sql("SELECT * FROM users;").await?;
   +    let dataframe = ctx.sql("SELECT Bank_Account FROM users;").await?;
    
        timeout(Duration::from_secs(10), async move {
            let result = dataframe.collect().await.unwrap();
   @@ -71,6 +71,6 @@ fn create_record_batch() -> Result<RecordBatch> {
    fn get_schema() -> SchemaRef {
        SchemaRef::new(Schema::new(vec![
            Field::new("id", DataType::UInt8, false),
   -        Field::new("bank_account", DataType::UInt64, true),
   +        Field::new("Bank_Account", DataType::UInt64, true),
        ]))
    }
   $ cargo run --example memtable
       Finished dev [unoptimized + debuginfo] target(s) in 0.36s
        Running `./target/debug/examples/memtable`
   Error: Plan("Invalid identifier '#bank_account' for schema fields:[users.id, users.Bank_Account], metadata:{}")
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] andygrove commented on issue #2374: Identifiers are made lower-case in SQL query

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #2374:
URL: https://github.com/apache/arrow-datafusion/issues/2374#issuecomment-1114493881

   @dbr My view is that this is not a bug. If a non-lowercase table or column name is created via the low-level Schema/DataFrame APIs then I would expect to have to use double-quotes in SQL to reference it. However, we are probably lacking in documentation around this so lets at least use this issue to drive that. Also would like to hear what others think, including @alamb and @yjshen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #2374: Identifiers are made lower-case in SQL query

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #2374:
URL: https://github.com/apache/arrow-datafusion/issues/2374#issuecomment-1115159955

   > Given the field named Bank_Account, I would expect SELECT BaNk_AcCoUnT FROM users to work (but currently this only looks for a field named bank_account so the query errors, and the error contains a lower-cased version of what is present in the query)
   
   I would not expect this to work (as the field is named `Bank_Account`)
   
   Here is an example in case that is helpful: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1029927443


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb closed issue #2374: Clarify in docs that Identifiers are made lower-case in SQL query

Posted by GitBox <gi...@apache.org>.

alamb closed issue #2374: Clarify in docs that Identifiers are made lower-case in SQL query
URL: https://github.com/apache/arrow-datafusion/issues/2374


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] dbr commented on issue #2374: Identifiers are made lower-case in SQL query

Posted by GitBox <gi...@apache.org>.

dbr commented on issue #2374:
URL: https://github.com/apache/arrow-datafusion/issues/2374#issuecomment-1115923739

   Ahh I see - I wasn't aware Postgres had these rules around case-insensitivity, interesting!
   
   I was thinking maybe the error message could be clearer, but even postgres appears to do the same thing (e.g error message containing the lower-cased field name, not the one the user supplied). So just noting this in the docs as mentioned seems like a good solution


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] dbr commented on issue #2374: Identifiers are made lower-case in SQL query

Posted by GitBox <gi...@apache.org>.

dbr commented on issue #2374:
URL: https://github.com/apache/arrow-datafusion/issues/2374#issuecomment-1113905938

   Ah interesting! Quoting the identifiers works, thanks
   
   Just to be clear, is it still a bug that the original example fails? I would expect:
   
   - Given the field named `bank_account` the `SELECT BaNk_AcCoUnT FROM users` does work as expected (case is ignored)
   - Given the field named `Bank_Account`, I would expect `SELECT BaNk_AcCoUnT FROM users` to work (but currently this only looks for a field named `bank_account` so the query errors, and the error contains a lower-cased version of what is present in the query)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] andygrove commented on issue #2374: Identifiers are made lower-case in SQL query

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #2374:
URL: https://github.com/apache/arrow-datafusion/issues/2374#issuecomment-1113738686

   @dbr Unquoted SQL identifiers are case-insensitive by design (to match ANSI and Postgres). SQL identifiers in double quotes are case-sensitive. If you put double quotes around the identifier in the SQL query then it works as expected.
   
   ```
   diff --git a/datafusion-examples/examples/memtable.rs b/datafusion-examples/examples/memtable.rs
   index 11793a837f..b04f5424a1 100644
   --- a/datafusion-examples/examples/memtable.rs
   +++ b/datafusion-examples/examples/memtable.rs
   @@ -36,7 +36,7 @@ async fn main() -> Result<()> {
        // Register the in-memory table containing the data
        ctx.register_table("users", Arc::new(mem_table))?;
    
   -    let dataframe = ctx.sql("SELECT * FROM users;").await?;
   +    let dataframe = ctx.sql("SELECT \"Bank_Account\" FROM users;").await?;
    
        timeout(Duration::from_secs(10), async move {
            let result = dataframe.collect().await.unwrap();
   @@ -71,6 +71,6 @@ fn create_record_batch() -> Result<RecordBatch> {
    fn get_schema() -> SchemaRef {
        SchemaRef::new(Schema::new(vec![
            Field::new("id", DataType::UInt8, false),
   -        Field::new("bank_account", DataType::UInt64, true),
   +        Field::new("Bank_Account", DataType::UInt64, true),
        ]))
    }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org