Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/20 02:16:51 UTC

[GitHub] [arrow-datafusion] jychen7 edited a comment on issue #1507: Python bindings create duplicated qualified fields after joining

jychen7 edited a comment on issue #1507:
URL: https://github.com/apache/arrow-datafusion/issues/1507#issuecomment-1017052134


   Actually the problem is in neither `datafusion` nor `pydatafusion`; the two just have different expectations.
   
   `datafusion` allows duplicate column names in `ans`, but `pydatafusion` raises an error in `create_dataframe` when the input columns are duplicated.
   E.g. `select x.c2, y.c2 from x join y using (c1) limit 1` shows
   ```
   +----+----+
   | c2 | c2 |
   +----+----+
   | 1  | 1  |
   +----+----+
   ```
   
   1. I tested in MySQL 5.6 and PostgreSQL 9.6; they also show duplicate column names in the output, e.g. http://sqlfiddle.com/#!9/a6c585/237251 and http://sqlfiddle.com/#!17/bf2fd/25993. This also aligns with "All bare column field names MUST not contain relation/table qualifier." in https://arrow.apache.org/datafusion/specification/output-field-name-semantic.html
   2. The dataframe behavior is correct too, since it is not expected to receive duplicate input names. I believe the internal dataframe used in `ctx.sql` has distinct column names like `x.c2` and `y.c2`, but we truncate the qualifier on output.
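
To make the two points above concrete, here is a minimal sketch that reproduces the same query shape outside DataFusion. It uses Python's built-in `sqlite3` module as a stand-in engine (an assumption for illustration; it is not MySQL, PostgreSQL, or DataFusion), and `output_names` and `duplicate_names` are hypothetical helpers, not part of any DataFusion API. With a default SQLite configuration the bare query is expected to report `c2` twice, while explicit aliases give a consumer that requires unique names something it can accept.

```python
import sqlite3
from collections import Counter

# Stand-in engine: an in-memory SQLite database (assumption, not DataFusion).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE x (c1 INTEGER, c2 INTEGER);
    CREATE TABLE y (c1 INTEGER, c2 INTEGER);
    INSERT INTO x VALUES (1, 1);
    INSERT INTO y VALUES (1, 1);
    """
)


def output_names(sql):
    """Hypothetical helper: run a query and return its output column names."""
    cur = conn.execute(sql)
    return [col[0] for col in cur.description]


def duplicate_names(names):
    """Hypothetical helper: names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in dict.fromkeys(names) if counts[name] > 1]


# Bare column references: the table qualifier is dropped from the output names.
bare = output_names("SELECT x.c2, y.c2 FROM x JOIN y USING (c1) LIMIT 1")

# Explicit aliases keep the output names unique.
aliased = output_names(
    "SELECT x.c2 AS x_c2, y.c2 AS y_c2 FROM x JOIN y USING (c1) LIMIT 1"
)

print(bare, duplicate_names(bare))
print(aliased, duplicate_names(aliased))
```

A check like `duplicate_names` is only one way a dataframe constructor might detect the conflict; aliasing in the query is a possible way to avoid it without changing either library.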

