Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/20 20:01:41 UTC

[GitHub] [arrow] drusso opened a new pull request #8727: ARROW-10666: [Rust][DataFusion] Support nested SELECT statements.

drusso opened a new pull request #8727:
URL: https://github.com/apache/arrow/pull/8727


   This PR enables nested `SELECT` statements. Note that table aliases remain unsupported, and no optimizations are made during the planning stages. 
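
For readers unfamiliar with the term, here is a minimal sketch of what a nested `SELECT` looks like. It uses Python's built-in `sqlite3` module purely as a stand-in SQL engine, not DataFusion, and the table `t` and its rows are made up for illustration. The subquery carries no table alias, matching the limitation noted above.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (string_col TEXT, int_col INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 1), ("b", 2), ("c", 3)],
)

# The inner SELECT produces an intermediate relation; the outer SELECT
# projects a single column from it. The subquery has no table alias.
nested_rows = conn.execute(
    "SELECT string_col "
    "FROM (SELECT string_col, int_col FROM t WHERE int_col > 1) "
    "ORDER BY string_col"
).fetchall()

print(nested_rows)
```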


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorgecarleitao commented on pull request #8727: ARROW-10666: [Rust][DataFusion] Support nested SELECT statements.

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on pull request #8727:
URL: https://github.com/apache/arrow/pull/8727#issuecomment-732338089


   @drusso, could you rebase this? We had some CI issues that have since been addressed, so this should now run cleanly on CI.



[GitHub] [arrow] drusso edited a comment on pull request #8727: ARROW-10666: [Rust][DataFusion] Support nested SELECT statements.

Posted by GitBox <gi...@apache.org>.

drusso edited a comment on pull request #8727:
URL: https://github.com/apache/arrow/pull/8727#issuecomment-731762369


   On the topic of table aliasing:
   
   For example:
   
   ```rust
   let df_source = ctx.read_parquet(&parquet_source())?;
   let df_in1 = df_source.select_columns(vec!["string_col", "int_col"])?;
   let df_in2 = df_source.select_columns(vec!["string_col", "int_col"])?;
   let df_join = df_in1.join(df_in2, JoinType::Inner, &["string_col"], &["string_col"])?;
   let results = df_join.collect().await?;
   ```
   
   Will yield:
   
   ```
   Error: Plan("The left schema and the right schema have the following columns with the same name without being on the ON statement: {\"int_col\"}. Consider aliasing them.")
   ```
   
   Of course, the workaround is to alias the columns. Are there any plans to handle disambiguation? In PySpark, for example, the equivalent version of the example above would be valid, and columns can be disambiguated with `df_in1.int_col` and `df_in2.int_col`.
   
   The reason I ask about plans to handle this in the DataFrame API is because the solution there might influence the implementation in the SQL layer. 
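
To make the collision and the alias workaround concrete, here is a small pure-Python model of the check. It is only a sketch: `join_schema` and `rename` are hypothetical helpers, not DataFusion functions, and the output column order (left columns followed by the right side's non-key columns) is an assumption.

```python
def join_schema(left, right, left_on, right_on):
    """Return the joined column names, or raise if non-key columns collide.

    Hypothetical model of the planner check; not DataFusion code.
    """
    left_rest = [c for c in left if c not in left_on]
    right_rest = [c for c in right if c not in right_on]
    collisions = sorted(set(left_rest) & set(right_rest))
    if collisions:
        raise ValueError(
            f"columns with the same name outside ON: {collisions}. "
            "Consider aliasing them."
        )
    # Assumed output layout: all left columns, then right non-key columns.
    return list(left) + right_rest


def rename(columns, mapping):
    """Alias columns by name; names missing from the mapping are kept."""
    return [mapping.get(c, c) for c in columns]


left = ["string_col", "int_col"]
right = ["string_col", "int_col"]

# Without aliasing, the non-key column `int_col` collides.
try:
    join_schema(left, right, ["string_col"], ["string_col"])
    collided = False
except ValueError:
    collided = True

# Aliasing one side before the join removes the collision.
right_aliased = rename(right, {"int_col": "int_col_right"})
joined = join_schema(left, right_aliased, ["string_col"], ["string_col"])
```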
   
   



[GitHub] [arrow] drusso commented on pull request #8727: ARROW-10666: [Rust][DataFusion] Support nested SELECT statements.

Posted by GitBox <gi...@apache.org>.

drusso commented on pull request #8727:
URL: https://github.com/apache/arrow/pull/8727#issuecomment-733275624


   @jorgecarleitao Sure thing, I've rebased the changes. 




[GitHub] [arrow] jorgecarleitao commented on pull request #8727: ARROW-10666: [Rust][DataFusion] Support nested SELECT statements.

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on pull request #8727:
URL: https://github.com/apache/arrow/pull/8727#issuecomment-731765349


   Note that this does not impact SQL, as in SQL all tables are named and columns are referred to via a qualified name (e.g. `t1.name`).
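
As an illustration of qualified names, here is a short sketch using Python's built-in `sqlite3` module as a stand-in SQL engine (not DataFusion). The table `t`, its rows, and the aliases `t1` and `t2` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (string_col TEXT, int_col INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [("a", 1), ("b", 2)])

# Qualified names say which side of the self-join each column comes from.
qualified_rows = conn.execute(
    "SELECT t1.string_col, t1.int_col, t2.int_col "
    "FROM t AS t1 JOIN t AS t2 ON t1.string_col = t2.string_col "
    "ORDER BY t1.string_col"
).fetchall()

# The unqualified name is rejected because both sides expose `int_col`.
try:
    conn.execute(
        "SELECT int_col "
        "FROM t AS t1 JOIN t AS t2 ON t1.string_col = t2.string_col"
    )
    ambiguous = False
except sqlite3.OperationalError:
    ambiguous = True
```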



[GitHub] [arrow] drusso commented on pull request #8727: ARROW-10666: [Rust][DataFusion] Support nested SELECT statements.

Posted by GitBox <gi...@apache.org>.

drusso commented on pull request #8727:
URL: https://github.com/apache/arrow/pull/8727#issuecomment-731757266


   @jorgecarleitao I was pleasantly surprised by how few changes were required to get this working! I've updated the README.
   
   @andygrove I haven't looked into adding support for table aliasing, which I think is most useful in the context of joins. Since the feature is now in master, it's probably a good time to add support. 
   
   




[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8727: ARROW-10666: [Rust][DataFusion] Support nested SELECT statements.

Posted by GitBox <gi...@apache.org>.

jorgecarleitao edited a comment on pull request #8727:
URL: https://github.com/apache/arrow/pull/8727#issuecomment-731764988


   AFAIK pyspark does not disambiguate:
   
   ```python
   import pyspark
   
   with pyspark.SparkContext() as sc:
       spark = pyspark.sql.SQLContext(sc)
   
       df = spark.createDataFrame([
           [1, 2],
           [2, 3],
       ], schema=["id", "id1"])
   
       df1 = spark.createDataFrame([
           [1, 2],
           [1, 3],
       ], schema=["id", "id1"])
   
       df.join(df1, on="id").show()
   ```
   
   yields 
   
   ```
   +---+---+---+                                                                   
   | id|id1|id1|
   +---+---+---+
   |  1|  2|  2|
   |  1|  2|  3|
   +---+---+---+
   ```
   
   on `pyspark==2.4.6`
   
   In pyspark, writing `df.join(df1, on="id").select("id1")` errors because the select can't tell which column to select. This IMO is poor judgment: the join itself does not crash, but operating on the resulting table crashes.
   
   I am generally against disambiguation because doing so changes the schema only when columns collide (or do we always add some `left_` prefix?). In general, colliding columns require the user to disambiguate them, either before the statement (via alias) or after the statement (via `?.column_name`). Raising an error IMO is the best possible outcome as it requires the user to be explicit about what they want.



[GitHub] [arrow] drusso commented on pull request #8727: ARROW-10666: [Rust][DataFusion] Support nested SELECT statements.

Posted by GitBox <gi...@apache.org>.

drusso commented on pull request #8727:
URL: https://github.com/apache/arrow/pull/8727#issuecomment-731850353


   Sounds good. 
   
   In case it might be of interest, dplyr's [inner_join()](https://dplyr.tidyverse.org/reference/mutate-joins.html) adds a suffix to any non-joined columns that collide. The suffixes can be passed explicitly as part of the function arguments.
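
A rough Python sketch of that suffixing behaviour follows. `suffix_collisions` is a hypothetical helper, not dplyr's implementation; the default suffixes `.x` and `.y` mirror dplyr's defaults. Note that the output schema changes only when names collide, which is the concern raised earlier in this thread.

```python
def suffix_collisions(left, right, on, suffix=(".x", ".y")):
    """Return joined column names, suffixing colliding non-key columns.

    Hypothetical model of suffix-based disambiguation; not dplyr code.
    """
    collisions = (set(left) & set(right)) - set(on)
    left_out = [c + suffix[0] if c in collisions else c for c in left]
    right_out = [
        c + suffix[1] if c in collisions else c
        for c in right
        if c not in on  # join keys appear once, taken from the left side
    ]
    return left_out + right_out


# Colliding non-key column `id1` gets a suffix on each side.
with_collision = suffix_collisions(["id", "id1"], ["id", "id1"], on=["id"])

# With no collision, the names pass through unchanged.
without_collision = suffix_collisions(["id", "a"], ["id", "b"], on=["id"])
```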



[GitHub] [arrow] jorgecarleitao closed pull request #8727: ARROW-10666: [Rust][DataFusion] Support nested SELECT statements.

Posted by GitBox <gi...@apache.org>.

jorgecarleitao closed pull request #8727:
URL: https://github.com/apache/arrow/pull/8727


   




[GitHub] [arrow] github-actions[bot] commented on pull request #8727: ARROW-10666: [Rust][DataFusion] Support nested SELECT statements.

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #8727:
URL: https://github.com/apache/arrow/pull/8727#issuecomment-731385060


   https://issues.apache.org/jira/browse/ARROW-10666


