Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/22 15:31:21 UTC

[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8727: ARROW-10666: [Rust][DataFusion] Support nested SELECT statements.

jorgecarleitao edited a comment on pull request #8727:
URL: https://github.com/apache/arrow/pull/8727#issuecomment-731764988


   AFAIK pyspark does not disambiguate:
   
   ```python
   import pyspark
   
   with pyspark.SparkContext() as sc:
       spark = pyspark.sql.SQLContext(sc)
   
       df = spark.createDataFrame([
           [1, 2],
           [2, 3],
       ], schema=["id", "id1"])
   
       df1 = spark.createDataFrame([
           [1, 2],
           [1, 3],
       ], schema=["id", "id1"])
   
       df.join(df1, on="id").show()
   ```
   
   yields 
   
   ```
   +---+---+---+                                                                   
   | id|id1|id1|
   +---+---+---+
   |  1|  2|  2|
   |  1|  2|  3|
   +---+---+---+
   ```
   
   on `pyspark==2.4.6`
   
   In pyspark, writing `df.join(df1, on="id").select("id1")` raises an error because the select cannot tell which `id1` column is meant.
   
   I am generally against disambiguation because it changes the schema only when columns collide (or do we always add some `left_` prefix?). In general, colliding columns require the user to disambiguate them explicitly, either before the statement (via an alias) or after the statement (via `?.column_name`). Raising an error is IMO the best possible outcome, as it requires the user to be explicit about what they want.
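
   The error-on-ambiguity behavior argued for above could be sketched in plain Python. This is a hypothetical `resolve` helper for illustration only, not DataFusion's actual name-resolution code:

    ```python
    def resolve(schema, name):
        """Return the index of `name` in `schema`, raising when the
        name is missing or appears more than once (ambiguous)."""
        matches = [i for i, col in enumerate(schema) if col == name]
        if not matches:
            raise KeyError(f"column {name!r} not found")
        if len(matches) > 1:
            raise ValueError(
                f"column {name!r} is ambiguous; qualify it explicitly"
            )
        return matches[0]

    # Schema of df.join(df1, on="id") from the pyspark example above:
    joined_schema = ["id", "id1", "id1"]

    resolve(joined_schema, "id")  # unambiguous, resolves to index 0

    try:
        resolve(joined_schema, "id1")  # duplicate name, so this raises
    except ValueError as e:
        print(e)
    ```

   The point is that a bare `id1` never silently picks one side of the join; the user must qualify the column, matching the error pyspark raises on `select("id1")`.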


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org