Posted to issues@spark.apache.org by "Arnaud Nauwynck (Jira)" <ji...@apache.org> on 2020/10/15 21:25:00 UTC

[jira] [Comment Edited] (SPARK-33164) SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to DataSet.dropColumn(someColumn)

    [ https://issues.apache.org/jira/browse/SPARK-33164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215016#comment-17215016 ] 

Arnaud Nauwynck edited comment on SPARK-33164 at 10/15/20, 9:24 PM:
--------------------------------------------------------------------

Some links of interest in the source code:

+Link 1:+ SQL parser:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L534

{noformat}
selectClause
    : SELECT (hints+=hint)* setQuantifier? namedExpressionSeq
    ;
{noformat}
namedExpressionSeq is also used elsewhere (for example in the PIVOT expression), so it should probably not be modified directly.
=> it might be necessary to wrap it in a new ANTLR rule, say namedExpressionSeqExcept:

{noformat}
selectClause
    : SELECT (hints+=hint)* setQuantifier? namedExpressionSeqExcept
    ;

namedExpressionSeqExcept
    : namedExpressionSeq ('(' EXCEPT exceptNamedSeq ')')?
    ;

exceptNamedSeq : ....
{noformat}
 

+Link 2:+ class Project in the logical plan tree:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L62

{noformat}
case class Project(projectList: Seq[NamedExpression], child: LogicalPlan)
    extends OrderPreservingUnaryNode {

{noformat}

=> it might be necessary to add a field (or to wrap class Project in a new ProjectExclude node?):

{noformat}
case class Project(projectList: Seq[NamedExpression], excludeList: Seq[String], child: LogicalPlan)
    extends OrderPreservingUnaryNode {

{noformat}
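The wrapper alternative could look roughly like the sketch below. This is only a hedged illustration: ProjectExclude and its output derivation are assumptions, not existing Spark classes, and the analyzer would still have to rewrite it into a plain Project once the child is resolved.

```scala
// HYPOTHETICAL sketch (not existing Spark code): a thin wrapper node that
// leaves Project untouched and carries the exclude list separately.
case class ProjectExclude(
    projectList: Seq[NamedExpression],
    excludeList: Seq[String],
    child: LogicalPlan) extends OrderPreservingUnaryNode {

  // Output = projected attributes minus the excluded column names.
  // Simple name equality here; real code would honor the session's
  // case-sensitivity / column-resolution rules.
  override def output: Seq[Attribute] =
    projectList.map(_.toAttribute).filterNot(a => excludeList.contains(a.name))
}
```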

+Link 3:+ Analyzer rule resolving Project:

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1380


{noformat}
    def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
      case p: LogicalPlan if !p.childrenResolved => p

      // If the projection list contains Stars, expand it.
      case p: Project if containsStar(p.projectList) =>
        p.copy(projectList = buildExpandedProjectList(p.projectList, p.child))

{noformat}
=> a post-processing rule should also be added to handle the exclude list if present (whether the projection contains a star or not).

Maybe something like:
{noformat}
      case p: Project => {
        // If the projection list contains Stars, expand it.
        val expanded =
          if (containsStar(p.projectList)) buildExpandedProjectList(p.projectList, p.child)
          else p.projectList

        // If the projection has excluded columns, remove them.
        val filtered =
          if (containsExclude(p.excludeList)) filterProjectListAfterExcluded(expanded, p.excludeList, p.child)
          else expanded

        p.copy(projectList = filtered)
      }
{noformat}
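The exclusion helper itself could reduce to a simple name-based filter. Below is a minimal self-contained sketch over plain strings (ExcludeFilterSketch, filterExcluded, and all names here are assumptions for illustration; the real helper would operate on NamedExpression and honor Spark's case-sensitivity setting):

```scala
object ExcludeFilterSketch {
  // Remove every column whose name appears in the exclude list.
  // Case-insensitive comparison, mirroring Spark's default resolution.
  def filterExcluded(projectList: Seq[String], excludeList: Seq[String]): Seq[String] = {
    val excluded = excludeList.map(_.toLowerCase).toSet
    projectList.filterNot(col => excluded.contains(col.toLowerCase))
  }

  def main(args: Array[String]): Unit = {
    val cols = Seq("id", "countCol", "col1")
    println(filterExcluded(cols, Seq("countcol")).mkString(","))
  }
}
```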





> SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to DataSet.dropColumn(someColumn)
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33164
>                 URL: https://issues.apache.org/jira/browse/SPARK-33164
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>            Reporter: Arnaud Nauwynck
>            Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon.
> I would like to have the extended SQL syntax "SELECT * EXCEPT someColumn FROM .." 
> to be able to select all columns except some in a SELECT clause.
> It would be similar to the SQL syntax of some databases, like Google BigQuery or PostgreSQL.
> https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax
> Google the question "select * EXCEPT one column", and you will see many developers have the same problem.
> example posts: 
> https://blog.jooq.org/2018/05/14/selecting-all-columns-except-one-in-postgresql/
> https://www.thetopsites.net/article/53001825.shtml
> There are several typical examples where it is very helpful:
> use-case1:
>  you add a "count(*) countCol" column, then filter on it using for example "having countCol = 1"
>   ... and then you want to select all columns EXCEPT this dummy column whose value is always "1"
> {noformat}
>   select * (EXCEPT countCol)
>   from (
>      select count(*) countCol, *
>      from MyTable
>      where ...
>      group by ... having countCol = 1
>   )
> {noformat}
> use-case 2:
>  same with an analytic function "RANK() OVER (PARTITION BY ...) rankCol ... WHERE rankCol = 1".
>  For example, to get the latest row before a given time in a time-series table.
>  These are the "time-travel" queries addressed by frameworks like Delta Lake.
> {noformat}
>  CREATE table t_updates (update_time timestamp, id string, col1 type1, col2 type2, ... col42)
>  pastTime=..
>  SELECT * (except rankCol)
>  FROM (
>    SELECT *,
>       RANK() OVER (PARTITION BY id ORDER BY update_time DESC) rankCol
>    FROM t_updates
>    where update_time < pastTime
>  ) WHERE rankCol = 1
>  
> {noformat}
>  
> use-case 3:
>  copy some data from table "t" to corresponding table "t_snapshot", and back to "t"
> {noformat}
>    CREATE TABLE t (col1 type1, col2 type2, col3 type3, ... col42 type42) ...
>    
>    /* create corresponding table: (snap_id string, col1 type1, col2 type2, col3 type3, ... col42 type42) */
>    CREATE TABLE t_snapshot
>    AS SELECT '' as snap_id, * FROM t WHERE 1=2
>    /* insert data from t to some snapshot */
>    INSERT INTO t_snapshot
>    SELECT 'snap1' as snap_id, * from t 
>    
>    /* select some data from snapshot table (without snap_id column) .. */   
>    SELECT * (EXCEPT snap_id) FROM t_snapshot where snap_id='snap1' 
>    
> {noformat}
>    
>    
> *Q2.* What problem is this proposal NOT designed to solve?
> It is only SQL syntactic sugar.
> It does not change the SQL execution plan or anything complex.
> *Q3.* How is it done today, and what are the limits of current practice?
>  
> Today, you can either use the Dataset API, with .drop(someColumn),
> or you need to manually HARD-CODE all columns in your SQL. Therefore your code is NOT generic (unless you are using a SQL meta-code generator).
> *Q4.* What is new in your approach and why do you think it will be successful?
> It is NOT new... it is already a proven solution: Dataset.drop(), PostgreSQL, BigQuery.
>  
> *Q5.* Who cares? If you are successful, what difference will it make?
> It simplifies the life of developers, DBAs, data analysts, and end users.
> It simplifies the development of SQL code, in a more generic way, for many tasks.
> *Q6.* What are the risks?
> There is VERY limited risk for Spark SQL, because the equivalent already exists in the Dataset API.
> It is an extension of the SQL syntax, so the main risk is confusing some SQL IDE editors with the new syntax.
> *Q7.* How long will it take?
> No idea. I guess someone experienced with the Spark SQL internals might do it relatively "quickly".
> It is a kind of syntactic sugar: add an ANTLR grammar rule, then transform it via the Dataset API.
> *Q8.* What are the mid-term and final “exams” to check for success?
> The 3 standard use-cases given in question Q1.
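To illustrate Q3's current workaround, here is a small hedged Scala sketch using the existing Dataset API (Dataset.drop is a real Spark method; the session setup and column names are invented for the example):

```scala
import org.apache.spark.sql.SparkSession

object DropColumnExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("drop-example").getOrCreate()
    import spark.implicits._

    // A frame with the helper "countCol" column from use-case 1.
    val withCount = Seq(("a", 1L), ("b", 1L)).toDF("col1", "countCol")

    // Generic removal of one column, without enumerating the others:
    val result = withCount.drop("countCol")
    result.printSchema()

    spark.stop()
  }
}
```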



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org