You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "sergiimk (via GitHub)" <gi...@apache.org> on 2023/09/01 02:43:25 UTC

[GitHub] [arrow-datafusion] sergiimk opened a new issue, #7460: Column name case sensitivity when reading from CSV / JSON

sergiimk opened a new issue, #7460:
URL: https://github.com/apache/arrow-datafusion/issues/7460

   ### Describe the bug
   
   SQL is case-insensitive language, but the way case-insensitivity is implemented for column identifiers in DataFusion often leads to non-intuitive behavior:
   
   ### To Reproduce
   
   Create `test.json`:
   ```json
   {"A": "a_upper", "B": "b_upper", "b": "b_lower"}
   ```
   Run `datafusion-cli`:
   ```sh
   ❯ create external table test stored as json location 'test.json';
   
   ❯ select * from test;
   +---------+---------+---------+
   | A       | B       | b       |
   +---------+---------+---------+
   | a_upper | b_upper | b_lower |
   +---------+---------+---------+
   
   > select A from test;
   Schema error: No field named a. Valid fields are test."A", test."B", test.b.
   
   ❯ select b from test;
   +---------+
   | b       |
   +---------+
   | b_lower |
   +---------+
   
   ❯ select "B" from test;
   +---------+
   | B       |
   +---------+
   | b_upper |
   +---------+
   ```
   
   ### Expected behavior
   
   In Spark:
   - When I load json like `{"A": "a_upper"}`  both `select a from test` and `select A from test` will return the "A" column.
   - When I save to Parquet - the schema will preserve the column's original case
   - Loading `{"A": "a_upper", "B": "b_upper", "b": "b_lower"}` however fails with duplicate column error - points to DF!
   
   What I would expect as a user:
   - `a` and `A` return "A" column
   - `b` and `B` fail as ambiguous
   - `"B"` and `"b"` work, returning upper and lower case columns respectively
   
   i.e. case sensitivity should matter only when there is an ambiguity, and can be resolved with quoted identifiers.
   
   ### Additional context
   
   Currently after switching our ingest from Spark to Datafusion our preprocessing code is full of `"X" as x, "Y" as y` just to restore case-insensitivity for downstream queries.
   
   I don't mind adding an automatic step that lower-cases all columns where there is no ambiguty, but was wondering if core behavior should be adjusted instead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] parkma99 commented on issue #7460: Column name case sensitivity when reading from CSV / JSON

Posted by "parkma99 (via GitHub)" <gi...@apache.org>.
parkma99 commented on issue #7460:
URL: https://github.com/apache/arrow-datafusion/issues/7460#issuecomment-1703150509

   Hello here is my  solution.
   
   function `field_with_unqualified_name` add one parameter named `ignore_case` with bool type. 
   in function body 
   change:
   
   `.filter(|field| field.name() == name)` => `.filter(|field| if ignore_case{ field.name().to_ascii_lowercase() == name}else{ field.name() == name}) `
   
   https://github.com/apache/arrow-datafusion/blob/58fc80eddaf93d4c6399e4ac4fc1234649b26e43/datafusion/sql/src/expr/identifier.rs#L49-L56
   
   we can find get `igore_case` value by using `id.quote_style.is_none();` 
   
   change :
   
   `Ok(Expr::Column(Column { relation: None, name: normalize_ident, }))` => `Ok(Expr::Column(field.qualified_column()))`
   
   fixed all compiler errors, I found it not working very well.
   
   > 1. a and A return "A" column
   > 2. b and B fail as ambiguous
   > 3. "B" and "b" work, returning upper and lower case columns respectively
   > 4. The schema always preserves the original case of identifier
   
   1 3 4 done,  but 2 failed.
   because function  `Column::normalize_with_schemas_and_ambiguity_check` does not know  `using_columns`'s names are `ignore_case` or not. I set default is `false`.
   
   One solution is `Column` adding a member named `ignore_case`.  I think it's a ugly way. Do you have some suggesstion? @sergiimk 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] parkma99 commented on issue #7460: Column name case sensitivity when reading from CSV / JSON

Posted by "parkma99 (via GitHub)" <gi...@apache.org>.
parkma99 commented on issue #7460:
URL: https://github.com/apache/arrow-datafusion/issues/7460#issuecomment-1702964319

   https://github.com/apache/arrow-datafusion/blob/58fc80eddaf93d4c6399e4ac4fc1234649b26e43/datafusion/sql/src/expr/identifier.rs#L49-L56
   in here the  `Column names` `.to_lower()` if ident without quote.
   
   https://github.com/apache/arrow-datafusion/blob/58fc80eddaf93d4c6399e4ac4fc1234649b26e43/datafusion/common/src/dfschema.rs#L274-L279
   in here finding table field if it's name is given name.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] sergiimk commented on issue #7460: Column name case sensitivity when reading from CSV / JSON

Posted by "sergiimk (via GitHub)" <gi...@apache.org>.
sergiimk commented on issue #7460:
URL: https://github.com/apache/arrow-datafusion/issues/7460#issuecomment-1728194433

   Thank you @alamb. Don't know why it did not occur to me to test Postgres behavior, but can confirm it's consistent with DataFusion's.
   
   Will close this issue and see how experience looks like with `enable_ident_normalization`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] sergiimk commented on issue #7460: Column name case sensitivity when reading from CSV / JSON

Posted by "sergiimk (via GitHub)" <gi...@apache.org>.
sergiimk commented on issue #7460:
URL: https://github.com/apache/arrow-datafusion/issues/7460#issuecomment-1704442387

   Thanks a lot for the initial investigation @parkma99!
   
   I think before making any further steps I'd like to wait for project maintainers to weigh in on whether the behavior and rules that I proposed even make sense.
   
   I based them on my experience with Spark and several databases, but I definitely don't have a full picture of how case-sensitivity is designed in DF.
   
   For example I just discovered `enable_ident_normalization` config option, which will likely interfere with what I proposed. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] sergiimk closed issue #7460: Column name case sensitivity when reading from CSV / JSON

Posted by "sergiimk (via GitHub)" <gi...@apache.org>.
sergiimk closed issue #7460: Column name case sensitivity when reading from CSV / JSON
URL: https://github.com/apache/arrow-datafusion/issues/7460


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #7460: Column name case sensitivity when reading from CSV / JSON

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #7460:
URL: https://github.com/apache/arrow-datafusion/issues/7460#issuecomment-1722214483

   I think case sensitivity was designed for better or worse to follow postgres. 
   
   I think the ideal solution is not to invent new normalization behavior but follow an existing model. Adding an option that makes DataFusion treat identifiers the same way as Spark would make sense to me. However, coming up with entirely new rules would not
   
   I agree that enabling `enable_ident_normalization` might make datafusion work better for your usecase.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org