You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/30 19:01:10 UTC
[GitHub] [arrow-datafusion] OscarTHZhang opened a new issue #1710: Column names for SQL queries on CSV files should not be case sensitive
OscarTHZhang opened a new issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710
**Describe the bug**
If the column names in a CSV file are uppercase, then typing lowercase column names in SQL queries will result in an Error: `Invalid identifier for schema`.
**To Reproduce**
Here is a simple example program that runs a query on 2 tables.
```rust
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
let mut ctx = ExecutionContext::new();
ctx.register_csv("lineorder", "data/lineorder.csv", CsvReadOptions::new()).await?;
ctx.register_csv("date", "data/date.csv", CsvReadOptions::new()).await?;
let df = ctx.sql("
select sum(lo_extendedprice * lo_discount) as revenue
from lineorder,
date
where lo_orderdate = d_datekey
and d_year = 1993
and (lo_discount between 1 and 3)
and lo_quantity < 25;
").await?;
df.show().await?;
Ok(())
}
```
This will result in an error:
```shell
Error: Plan("Invalid identifier '#lo_orderdate' for schema lineorder.LO_ORDERKEY, lineorder.LO_LINENUMBER, lineorder.LO_CUSTKEY, lineorder.LO_PARTKEY, lineorder.LO_SUPPKEY, lineorder.LO_ORDERDATE, lineorder.LO_ORDERPRIORITY, lineorder.LO_SHIPPRIORITY, lineorder.LO_QUANTITY, lineorder.LO_EXTENDEDPRICE, lineorder.LO_ORDTOTALPRICE, lineorder.LO_DISCOUNT, lineorder.LO_REVENUE, lineorder.LO_SUPPLYCOST, lineorder.LO_TAX, lineorder.LO_COMMITDATE, lineorder.LO_SHIPMODE, date.D_DATEKEY, date.D_DATE, date.D_DAYOFWEEK, date.D_MONTH, date.D_YEAR, date.D_YEARMONTHNUM, date.D_YEARMONTH, date.D_DAYNUMINWEEK, date.D_DAYNUMINMONTH, date.D_DAYNUMINYEAR, date.D_MONTHNUMINYEAR, date.D_WEEKNUMINYEAR, date.D_SELLINGSEASON, date.D_LASTDAYINWEEKFL, date.D_LASTDAYINMONTHFL, date.D_HOLIDAYFL, date.D_WEEKDAYFL")
```
**Expected behavior**
The column names should not be case-sensitive. The query should execute normally to produce the query result.
**Additional context**
Add any other context about the problem here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] OscarTHZhang commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
OscarTHZhang commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1037621421
Looks like I can close this one
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] mkmik edited a comment on issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
mkmik edited a comment on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1030078296
I factored out the identifier quoting issue in https://github.com/apache/arrow-datafusion/issues/1746.
I think this issue should be a feature request for datafusion to parse a CSV file while converting its column names to lowercase.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1029942224
So my interpretation is "SQL should lowercase all identifiers unless they are double quoted"
So if the arrow schema has `foo`, `Foo` fields, a query for `select Foo` should return the `foo` (lowercase) field. A query for `select "Foo"` should return `Foo` (the upper case one)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1029928353
Apparently datafusion already does this conversion for function names
https://github.com/apache/arrow-datafusion/blob/940d4eb60e76a3d4062489e872bf241dbfe0031a/datafusion/src/sql/planner.rs#L1636-L1649
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1025985382
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] mkmik commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
mkmik commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1030010752
This issue however is about a CSV file that has a header that contains non-lowercased column names, and the expectation that datafusion will make them referenceable using lowercase identifiers.
Perhaps we could add an option `CsvReadOptions` to lowercase all columns?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] mkmik commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
mkmik commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1030078296
I factored out the identifier quoting issue in https://github.com/apache/arrow-datafusion/issues/1746.
I'm turning this issue into a feature request for datafusion to parse a CSV file while converting its column names to lowercase.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] mkmik commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
mkmik commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1029897700
IIUC arrow schemas are case sensitive. This is true for CSV and for other sources (e.g. parquet).
Not sure what's the right approach here. SQL engines are traditionally case insensitive, but here we have a SQL engine that builds on top of an existing schema model that doesn't follow that tradition.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1029927443
Yes, Arrow schema's are case sensitive and SQL itself has stranger semantics for case insensitive
Here is an example from postgres showing how SQL interprets mixed case identifiers. Basically if the identifier is not double quoted, it is is simply lower cased prior to processing
```sql
create table bar ("foo" int, "Foo" int, "FoO" int);
CREATE TABLE
alamb=# insert into bar values (1,2,3);
INSERT 0 1
alamb=# select * from bar;
foo | Foo | FoO
-----+-----+-----
1 | 2 | 3
(1 row)
alamb=# select foo from bar;
foo
-----
1
(1 row)
-- Note result has foo, not Foo
alamb=# select Foo from bar;
foo
-----
1
(1 row)
alamb=# select "Foo" from bar;
Foo
-----
2
(1 row)
```
`foo` always matches the lower case column name `"foo"` even when it is not the first column
```sql
alamb=# create table baz("Foo" int, foo int);
CREATE TABLE
alamb=# insert into baz values (100,200);
INSERT 0 1
alamb=# select foo from baz;
foo
-----
200
(1 row)
```
If you create a table, the identifier is also lower cased at create time
```
alamb=# create table blarg(Foo int);
CREATE TABLE
alamb=# select * from blarg;
foo
-----
(0 rows)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] mkmik commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
mkmik commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1030007322
> If you create a table, the identifier is also lower cased at create time
the identifier is lowercased only if not quoted.
```
tmp1=> create table blarg(Foo int);
CREATE TABLE
tmp1=> \d blarg
Table "public.blarg"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
foo | integer | | |
tmp1=> create table blarg2("Foo" int);
CREATE TABLE
tmp1=> \d blarg2
Table "public.blarg2"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
Foo | integer | | |
```
To summarize, postgres schemas are case sensitive, but in the SQL syntax the identifiers are lowercased when unquoted.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] OscarTHZhang closed issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
OscarTHZhang closed issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1039201175
Thanks @OscarTHZhang . Note that as @mkmik points out, we haven't changed how identifiers are named from csv files. So if your CSV file has a column named `Foo` to query it you will have to use `"Foo"`. `foo` will not work.
> This issue however is about a CSV file that has a header that contains non-lowercased column names, and the expectation that datafusion will make them referenceable using lowercase identifiers.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb removed a comment on issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
alamb removed a comment on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1025985470
Thank you for the report @OscarTHZhang
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] OscarTHZhang closed issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
OscarTHZhang closed issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] OscarTHZhang commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
OscarTHZhang commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1037621421
Looks like I can close this one
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1030048324
> Perhaps we could add an option CsvReadOptions to lowercase all columns?
Indeed -- both changes (lowercase identifiers when unquoted) and an option to `CsvReadOption` if not already available sound good to me
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] mkmik commented on issue #1710: Column names for SQL queries on CSV files should not be case sensitive
Posted by GitBox <gi...@apache.org>.
mkmik commented on issue #1710:
URL: https://github.com/apache/arrow-datafusion/issues/1710#issuecomment-1030003765
```
tmp1=> create table bar ("Foo" int, "FoO" int);
CREATE TABLE
tmp1=> insert into bar values (2,3);
INSERT 0 1
tmp1=> select * from bar;
Foo | FoO
-----+-----
2 | 3
(1 row)
tmp1=> select foo from bar;
ERROR: column "foo" does not exist
LINE 1: select foo from bar;
^
HINT: Perhaps you meant to reference the column "bar.Foo".
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org