You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/30 19:01:10 UTC

[GitHub] [arrow-datafusion] OscarTHZhang opened a new issue #1710: Column names for SQL queries on CSV files should not be case sensitive

OscarTHZhang opened a new issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710


   **Describe the bug**
   If the column names in a CSV file are uppercase, then typing lowercase column names in SQL queries will result in an Error: `Invalid identifier for schema`.
   
   **To Reproduce**
   Here is a simple example program that runs a query on 2 tables. 
   ```rust
   #[tokio::main]
   async fn main() -> datafusion::error::Result<()> {
       let mut ctx = ExecutionContext::new();
       ctx.register_csv("lineorder", "data/lineorder.csv", CsvReadOptions::new()).await?;
       ctx.register_csv("date", "data/date.csv", CsvReadOptions::new()).await?;
   
       let df = ctx.sql("
           select sum(lo_extendedprice * lo_discount) as revenue
           from lineorder,
               date
           where lo_orderdate = d_datekey
           and d_year = 1993
           and (lo_discount between 1 and 3)
           and lo_quantity < 25;
         ").await?;
   
       df.show().await?;
       Ok(())
   }
   ```
   This will result in an error:
   ```shell
   Error: Plan("Invalid identifier '#lo_orderdate' for schema lineorder.LO_ORDERKEY, lineorder.LO_LINENUMBER, lineorder.LO_CUSTKEY, lineorder.LO_PARTKEY, lineorder.LO_SUPPKEY, lineorder.LO_ORDERDATE, lineorder.LO_ORDERPRIORITY, lineorder.LO_SHIPPRIORITY, lineorder.LO_QUANTITY, lineorder.LO_EXTENDEDPRICE, lineorder.LO_ORDTOTALPRICE, lineorder.LO_DISCOUNT, lineorder.LO_REVENUE, lineorder.LO_SUPPLYCOST, lineorder.LO_TAX, lineorder.LO_COMMITDATE, lineorder.LO_SHIPMODE, date.D_DATEKEY, date.D_DATE, date.D_DAYOFWEEK, date.D_MONTH, date.D_YEAR, date.D_YEARMONTHNUM, date.D_YEARMONTH, date.D_DAYNUMINWEEK, date.D_DAYNUMINMONTH, date.D_DAYNUMINYEAR, date.D_MONTHNUMINYEAR, date.D_WEEKNUMINYEAR, date.D_SELLINGSEASON, date.D_LASTDAYINWEEKFL, date.D_LASTDAYINMONTHFL, date.D_HOLIDAYFL, date.D_WEEKDAYFL")
   ```
   
   **Expected behavior**
   The column names should not be case-sensitive. The query should execute normally to produce the query result.
   
   **Additional context**
   Add any other context about the problem here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] OscarTHZhang commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

OscarTHZhang commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1037621421


   Looks like I can close this one


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] mkmik edited a comment on issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

mkmik edited a comment on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1030078296


   I factored out the identifier quoting issue in https://github.com/apache/arrow-datafusion/issues/1746.
   
   I think this issue should be a feature request for datafusion to parse a CSV file while converting its column names to lowercase.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1029942224


   So my interpretation is "SQL should lowercase all identifiers unless they are double quoted"
   
   So if the arrow schema has `foo`, `Foo` fields, a query for `select Foo` should return the `foo` (lowercase) field. A query for `select "Foo"` should return `Foo` (the upper case one)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1029928353


   Apparently datafusion already does this conversion for  function  names
   
   https://github.com/apache/arrow-datafusion/blob/940d4eb60e76a3d4062489e872bf241dbfe0031a/datafusion/src/sql/planner.rs#L1636-L1649


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1025985382






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] mkmik commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

mkmik commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1030010752


   This issue however is about a CSV file that has a header that contains non-lowercased column names, and the expectation that datafusion will make them referenceable using lowercase identifiers.
   
   Perhaps we could add an option `CsvReadOptions` to lowercase all columns?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] mkmik commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

mkmik commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1030078296


   I factored out the identifier quoting issue in https://github.com/apache/arrow-datafusion/issues/1746.
   
   I'm turning this issue into a feature request for datafusion to parse a CSV file while converting its column names to lowercase.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] mkmik commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

mkmik commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1029897700


   IIUC arrow schemas are case sensitive. This is true for CSV and for other sources (e.g. parquet). 
   
   Not sure what's the right approach here. SQL engines are traditionally case insensitive, but here we have a SQL engine that builds on top of an existing schema model that doesn't follow that tradition.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1029927443


   Yes, Arrow schema's are case sensitive and SQL itself has stranger semantics for case insensitive
   
   Here is an example from postgres showing how SQL interprets mixed case identifiers.  Basically if the identifier is not double quoted, it is is simply lower cased prior to processing
   
   
   
   ```sql
   create table bar ("foo" int, "Foo" int, "FoO" int);
   CREATE TABLE
   
   alamb=# insert into bar values (1,2,3);
   INSERT 0 1
   
   alamb=# select * from bar;
    foo | Foo | FoO 
   -----+-----+-----
      1 |   2 |   3
   (1 row)
   
   alamb=# select foo from bar;
    foo 
   -----
      1
   (1 row)
   
   -- Note result has foo, not Foo
   alamb=# select Foo from bar;
    foo 
   -----
      1
   (1 row)
   
   alamb=# select "Foo" from bar;
    Foo 
   -----
      2
   (1 row)
   ```
   
   
    `foo` always matches the lower case column name `"foo"` even when it is not the first column
   ```sql
   alamb=# create table baz("Foo" int, foo int);
   CREATE TABLE
   alamb=# insert into baz values (100,200);
   INSERT 0 1
   alamb=# select foo from baz;
    foo 
   -----
    200
   (1 row)
   
   ```
   
   
   If you create a table, the identifier is also lower cased at create time
   
   ```
   alamb=# create table blarg(Foo int);
   CREATE TABLE
   
   alamb=# select * from blarg;
    foo 
   -----
   (0 rows)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] mkmik commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

mkmik commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1030007322


   > If you create a table, the identifier is also lower cased at create time
   
   the identifier is lowercased only if not quoted.
   
   ```
   tmp1=> create table blarg(Foo int);
   CREATE TABLE
   tmp1=> \d blarg
                  Table "public.blarg"
    Column |  Type   | Collation | Nullable | Default
   --------+---------+-----------+----------+---------
    foo    | integer |           |          |
   
   tmp1=> create table blarg2("Foo" int);
   CREATE TABLE
   tmp1=> \d blarg2
                  Table "public.blarg2"
    Column |  Type   | Collation | Nullable | Default
   --------+---------+-----------+----------+---------
    Foo    | integer |           |          |
   ```
   
   To summarize, postgres schemas are case sensitive, but in the SQL syntax the identifiers are lowercased when unquoted.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] OscarTHZhang closed issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

OscarTHZhang closed issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1039201175


   Thanks @OscarTHZhang . Note that as @mkmik  points out, we haven't changed how identifiers are named from csv files. So if your CSV file has a column named `Foo` to query it you will have to use `"Foo"`. `foo` will not work. 
   
   > This issue however is about a CSV file that has a header that contains non-lowercased column names, and the expectation that datafusion will make them referenceable using lowercase identifiers.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb removed a comment on issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

alamb removed a comment on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1025985470


   Thank you for the report @OscarTHZhang  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] OscarTHZhang closed issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

OscarTHZhang closed issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] OscarTHZhang commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

OscarTHZhang commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1037621421


   Looks like I can close this one


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1030048324


   > Perhaps we could add an option CsvReadOptions to lowercase all columns?
   
   Indeed -- both changes (lowercase identifiers when unquoted) and an option to `CsvReadOption` if not already available sound good to me


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] mkmik commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive

Posted by GitBox <gi...@apache.org>.

mkmik commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1030003765


   ```
   tmp1=> create table bar ("Foo" int, "FoO" int);
   CREATE TABLE
   tmp1=> insert into bar values (2,3);
   INSERT 0 1
   tmp1=> select * from bar;
    Foo | FoO
   -----+-----
      2 |   3
   (1 row)
   
   tmp1=> select foo from bar;
   ERROR:  column "foo" does not exist
   LINE 1: select foo from bar;
                  ^
   HINT:  Perhaps you meant to reference the column "bar.Foo".
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org