Posted to issues@spark.apache.org by "Arnaud Nauwynck (Jira)" <ji...@apache.org> on 2020/10/15 20:17:00 UTC

[jira] [Updated] (SPARK-33164) SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to DataSet.dropColumn(someColumn)

     [ https://issues.apache.org/jira/browse/SPARK-33164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arnaud Nauwynck updated SPARK-33164:
------------------------------------
    Description: 
*Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon.

I would like to have the extended SQL syntax "SELECT * EXCEPT someColumn FROM ..", 
so that I can select all columns except some in a SELECT clause.

It would be similar to the SQL syntax of some databases, such as Google BigQuery, and it is a frequently requested feature in others, such as PostgreSQL.
https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax

Google the phrase "select * EXCEPT one column" and you will see that many developers run into the same problem.

Example posts: 
https://blog.jooq.org/2018/05/14/selecting-all-columns-except-one-in-postgresql/
https://www.thetopsites.net/article/53001825.shtml

There are several typical examples where it is very helpful:

use-case 1:
 you add a "count(*) countCol" column, then filter on it, for example with "having countCol = 1",
 and then you want to select all columns EXCEPT this dummy column, whose value is always 1.


{noformat}
  SELECT * (EXCEPT countCol)
  FROM (
     SELECT count(*) countCol, *
     FROM MyTable
     WHERE ...
     GROUP BY ... HAVING countCol = 1
  )
{noformat}
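For comparison, this is roughly what the same query costs today through the Dataset API. A sketch only: "key" and "MyTable" are illustrative names, and a window-function count is used so that all columns survive the aggregation:

{noformat}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit}

// Count rows per key with a window function so that every column survives,
// filter on the helper column, then drop it again by hand.
val w = Window.partitionBy("key")                // stands in for "group by ..."
val result = spark.table("MyTable")
  .withColumn("countCol", count(lit(1)).over(w))
  .filter(col("countCol") === 1)                 // "having countCol = 1"
  .drop("countCol")                              // what "* (EXCEPT countCol)" expresses in SQL
{noformat}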
	 

use-case 2:
 the same with an analytical (window) function: "rank() over (partition by ...) rankCol ... where rankCol = 1"

 For example, to get the latest row before a given time in a time-series table.
 These are the "time-travel" queries addressed by frameworks like Delta Lake.

{noformat}
 CREATE TABLE t_updates (update_time timestamp, id string, col1 type1, col2 type2, ... col42 type42)

 pastTime=..

 SELECT * (EXCEPT rankCol)
 FROM (
   SELECT *,
      RANK() OVER (PARTITION BY id ORDER BY update_time DESC) rankCol
   FROM t_updates
   WHERE update_time < pastTime
 ) WHERE rankCol = 1
{noformat}
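Today the same query has to go through the Dataset API to stay generic. A hedged sketch, where the pastTime value and the schema are illustrative:

{noformat}
import java.sql.Timestamp
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// Latest row per id strictly before pastTime: rank rows newest-first,
// keep rank 1, then drop the helper column by hand.
val pastTime = Timestamp.valueOf("2020-10-15 00:00:00")   // illustrative value
val w = Window.partitionBy("id").orderBy(col("update_time").desc)
val latest = spark.table("t_updates")
  .filter(col("update_time") < pastTime)
  .withColumn("rankCol", rank().over(w))
  .filter(col("rankCol") === 1)
  .drop("rankCol")              // what "* (EXCEPT rankCol)" expresses in SQL
{noformat}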
 

use-case 3:
 copy some data from table "t" to a corresponding snapshot table "t_snapshot", and back to "t"

{noformat}
   CREATE TABLE t (col1 type1, col2 type2, col3 type3, ... col42 type42) ...

   /* create the corresponding table: (snap_id string, col1 type1, col2 type2, col3 type3, ... col42 type42) */
   CREATE TABLE t_snapshot
   AS SELECT '' AS snap_id, * FROM t WHERE 1=2

   /* insert data from t into some snapshot */
   INSERT INTO t_snapshot
   SELECT 'snap1' AS snap_id, * FROM t

   /* select some data from the snapshot table (without the snap_id column) */
   SELECT * (EXCEPT snap_id) FROM t_snapshot WHERE snap_id='snap1'
{noformat}
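With the Dataset API today, both directions of the copy are already generic, but only because .drop() exists there. A sketch, reusing the table names from the example above:

{noformat}
import org.apache.spark.sql.functions.{col, lit}

// t -> t_snapshot: tag every row with the snapshot id
// (snap_id is selected first to match the t_snapshot column order).
spark.table("t")
  .select(lit("snap1").as("snap_id"), col("*"))
  .write.insertInto("t_snapshot")

// t_snapshot -> t: read one snapshot back, dropping the tag column;
// exactly what "SELECT * (EXCEPT snap_id)" would express in SQL.
spark.table("t_snapshot")
  .filter(col("snap_id") === "snap1")
  .drop("snap_id")
  .write.insertInto("t")
{noformat}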
   
   

*Q2.* What problem is this proposal NOT designed to solve?

It is only SQL syntactic sugar. 
It does not change the SQL execution plan or anything complex.



*Q3.* How is it done today, and what are the limits of current practice?
 
Today, you can either use the Dataset API, with .drop(someColumn),
or you need to manually HARD-CODE all columns in your SQL. Therefore your code is NOT generic (unless you use a SQL meta-code generator).
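For a wide table, the contrast today looks roughly like this (a sketch; "t", "someColumn" and the column names are illustrative):

{noformat}
// Generic today, but only through the Dataset API:
val withoutCol = spark.table("t").drop("someColumn")

// The pure-SQL alternative must enumerate every remaining column by hand,
// and silently goes stale whenever the schema of t changes:
val hardCoded = spark.sql("SELECT col1, col2, col3 /* ... col42 */ FROM t")
{noformat}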

*Q4.* What is new in your approach and why do you think it will be successful?

It is NOT new... it is already a proven solution: Dataset.drop() in Spark, the "SELECT * EXCEPT" syntax in BigQuery, and a frequently requested feature among PostgreSQL users.
 
*Q5.* Who cares? If you are successful, what difference will it make?

It simplifies the lives of developers, DBAs, data analysts, and end users.
It simplifies the development of SQL code, making it more generic for many tasks.


*Q6.* What are the risks?

There is VERY limited risk to Spark SQL, because the equivalent feature already exists in the Dataset API.

It is an extension of the SQL syntax, so the main risk is annoying some IDE SQL editors with a new SQL syntax. 

*Q7.* How long will it take?

No idea. I guess someone experienced in the Spark SQL internals could do it relatively "quickly".
It is a kind of syntactic sugar: an ANTLR grammar rule to add, then a translation into the equivalent Dataset operation.
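A hedged sketch of the core translation idea (this is NOT the actual grammar or Catalyst code; it only shows the column-pruning step the new rule would have to perform):

{noformat}
import org.apache.spark.sql.DataFrame

// After parsing, "* (EXCEPT a, b)" boils down to selecting every
// resolved column whose name is not in the excluded list.
def selectStarExcept(df: DataFrame, excluded: Seq[String]): DataFrame = {
  val kept = df.columns.filterNot(excluded.contains)
  df.select(kept.map(df.col): _*)
}

// e.g. selectStarExcept(spark.table("t_snapshot"), Seq("snap_id"))
{noformat}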

*Q8.* What are the mid-term and final “exams” to check for success?

The 3 standard use-cases given in question Q1.



> SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to DataSet.dropColumn(someColumn)
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33164
>                 URL: https://issues.apache.org/jira/browse/SPARK-33164
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>            Reporter: Arnaud Nauwynck
>            Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h


