Posted to issues@spark.apache.org by "Arnaud Nauwynck (Jira)" <ji...@apache.org> on 2020/10/15 21:25:00 UTC
[jira] [Comment Edited] (SPARK-33164) SPIP: add SQL support to
"SELECT * (EXCEPT someColumn) FROM .." equivalent to
DataSet.dropColumn(someColumn)
[ https://issues.apache.org/jira/browse/SPARK-33164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215016#comment-17215016 ]
Arnaud Nauwynck edited comment on SPARK-33164 at 10/15/20, 9:24 PM:
--------------------------------------------------------------------
Some links of interest in the source code:
+Link 1:+ SQL parser:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L534
{noformat}
selectClause
    : SELECT (hints+=hint)* setQuantifier? namedExpressionSeq
    ;
{noformat}
namedExpressionSeq is also used elsewhere (e.g. in the PIVOT expression), so it probably should not be modified directly.
=> it might be necessary to wrap it in a new ANTLR rule, say namedExpressionSeqExcept:
{noformat}
selectClause
    : SELECT (hints+=hint)* setQuantifier? namedExpressionSeqExcept
    ;

namedExpressionSeqExcept
    : namedExpressionSeq ( '(' EXCEPT exceptNamedSeq ')' )?
    ;

exceptNamedSeq : ....
{noformat}
+Link 2:+ class Project in the logical plan tree:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L62
{noformat}
case class Project(projectList: Seq[NamedExpression], child: LogicalPlan)
  extends OrderPreservingUnaryNode {
{noformat}
=> might need to add a field (or wrap class Project in a new ProjectExclude class?)
{noformat}
case class Project(projectList: Seq[NamedExpression], excludeList: Seq[String], child: LogicalPlan)
  extends OrderPreservingUnaryNode {
{noformat}
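As a rough illustration of the shape of that change, here is a plain-Python stand-in (not Spark code; the types are simplified and the names merely mirror the Scala sketch above):

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Plain-Python stand-in for the proposed Project node: the only change
# versus today's Project is the extra excludeList field.
@dataclass
class Project:
    projectList: List[str]                                # stands in for Seq[NamedExpression]
    excludeList: List[str] = field(default_factory=list)  # the proposed new field
    child: Optional[object] = None                        # stands in for the child LogicalPlan

# SELECT * (EXCEPT countCol) would parse into something like:
p = Project(projectList=["*"], excludeList=["countCol"])
```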
+Link 3:+ Analyzer rule resolving Project:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1380
{noformat}
def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
  case p: LogicalPlan if !p.childrenResolved => p
  // If the projection list contains Stars, expand it.
  case p: Project if containsStar(p.projectList) =>
    p.copy(projectList = buildExpandedProjectList(p.projectList, p.child))
{noformat}
=> a post-processing rule should also be added to handle the exclude list if present (whether the projection contains a star or not),
maybe something like:
{noformat}
case p: Project =>
  // If the projection list contains Stars, expand it.
  val expanded =
    if (containsStar(p.projectList)) buildExpandedProjectList(p.projectList, p.child)
    else p.projectList
  // If the projection has an exclude list, remove the excluded columns.
  if (p.excludeList.nonEmpty)
    p.copy(projectList = filterProjectListAfterExcluded(expanded, p.excludeList, p.child))
  else
    p.copy(projectList = expanded)
{noformat}
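The intent of that resolution step can be sketched stand-alone (plain Python, not Spark code; this simply mimics what star expansion followed by exclude filtering would produce, with made-up column names):

```python
# Stand-alone sketch of the proposed resolution step: first expand '*'
# against the child's output columns, then drop the excluded names.
def resolve_project(project_list, exclude_list, child_columns):
    # Star expansion: replace '*' by all of the child's columns.
    expanded = []
    for e in project_list:
        if e == "*":
            expanded.extend(child_columns)
        else:
            expanded.append(e)
    # EXCEPT handling: remove the excluded columns.
    excluded = set(exclude_list)
    return [c for c in expanded if c not in excluded]

# SELECT * (EXCEPT countCol) over a child producing (countCol, id, name):
resolve_project(["*"], ["countCol"], ["countCol", "id", "name"])
# -> ['id', 'name']
```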
> SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to DataSet.dropColumn(someColumn)
> ------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-33164
> URL: https://issues.apache.org/jira/browse/SPARK-33164
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
> Reporter: Arnaud Nauwynck
> Priority: Minor
> Original Estimate: 120h
> Remaining Estimate: 120h
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon.
> I would like the extended SQL syntax "SELECT * EXCEPT someColumn FROM .."
> to be able to select all columns except some in a SELECT clause.
> It would be similar to the SQL syntax of some databases, like Google BigQuery or PostgreSQL.
> https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax
> Google the question "select * EXCEPT one column", and you will see that many developers have the same problem.
> example posts:
> https://blog.jooq.org/2018/05/14/selecting-all-columns-except-one-in-postgresql/
> https://www.thetopsites.net/article/53001825.shtml
> There are several typical examples where it is very helpful:
> use-case 1:
> you add a "count(*) countCol" column, then filter on it (for example "having countCol = 1"),
> and then you want to select all columns EXCEPT this dummy column, which is always "1":
> {noformat}
> select * (EXCEPT countCol)
> from (
>     select count(*) countCol, *
>     from MyTable
>     where ...
>     group by ...
>     having countCol = 1
> )
> {noformat}
>
> use-case 2:
> the same with an analytical function: "RANK() OVER (PARTITION BY ...) rankCol ... WHERE rankCol = 1".
> For example, to get the latest row before a given time in a time-series table.
> These are the "time-travel" queries addressed by frameworks like Delta Lake:
> {noformat}
> CREATE TABLE t_updates (update_time timestamp, id string, col1 type1, col2 type2, ... col42)
>
> pastTime=..
>
> SELECT * (EXCEPT rankCol)
> FROM (
>     SELECT *,
>            RANK() OVER (PARTITION BY id ORDER BY update_time DESC) rankCol
>     FROM t_updates
>     WHERE update_time < pastTime
> ) WHERE rankCol = 1
> {noformat}
>
> use-case 3:
> copy some data from table "t" to a corresponding table "t_snapshot", and back to "t":
> {noformat}
> CREATE TABLE t (col1 type1, col2 type2, col3 type3, ... col42 type42) ...
>
> /* create the corresponding table: (snap_id string, col1 type1, col2 type2, col3 type3, ... col42 type42) */
> CREATE TABLE t_snapshot
> AS SELECT '' as snap_id, * FROM t WHERE 1=2
>
> /* insert data from t into some snapshot */
> INSERT INTO t_snapshot
> SELECT 'snap1' as snap_id, * FROM t
>
> /* select some data from the snapshot table (without the snap_id column) */
> SELECT * (EXCEPT snap_id) FROM t_snapshot WHERE snap_id='snap1'
> {noformat}
>
>
> *Q2.* What problem is this proposal NOT designed to solve?
> It is only SQL syntactic sugar.
> It does not change the SQL execution plan or anything complex.
> *Q3.* How is it done today, and what are the limits of current practice?
>
> Today, you can either use the DataSet API, with .dropColumn(someColumn),
> or you need to manually HARD-CODE all columns in your SQL. Therefore your code is NOT generic (unless you use a SQL meta-code generator).
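The "SQL meta-code generator" workaround mentioned above can be sketched minimally (plain Python string-building, not Spark code; the table and column names are made up for illustration):

```python
# Build a SELECT statement listing every column except the excluded ones,
# which is exactly what "SELECT * (EXCEPT ...)" would let the engine do for us.
def select_except(table, all_columns, excluded):
    kept = [c for c in all_columns if c not in set(excluded)]
    return "SELECT " + ", ".join(kept) + " FROM " + table

sql = select_except("t_snapshot", ["snap_id", "col1", "col2"], ["snap_id"])
# -> "SELECT col1, col2 FROM t_snapshot"
```

This works, but it requires knowing the full column list up front, which is precisely what makes the hand-written SQL non-generic.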
> *Q4.* What is new in your approach and why do you think it will be successful?
> It is NOT new... it is already a proven solution: DataSet.dropColumn(), PostgreSQL, BigQuery.
>
> *Q5.* Who cares? If you are successful, what difference will it make?
> It simplifies the life of developers, DBAs, data analysts, and end users.
> It simplifies the development of SQL code, in a more generic way, for many tasks.
> *Q6.* What are the risks?
> There is VERY limited risk on Spark SQL, because the feature already exists in the DataSet API.
> It is an extension of the SQL syntax, so the only risk is that some IDE SQL editors are confused by the new syntax.
> *Q7.* How long will it take?
> No idea. I guess someone experienced in Spark SQL internals could do it relatively "quickly".
> It is a kind of syntactic sugar to add as an ANTLR grammar rule, then transform into the DataSet API.
> *Q8.* What are the mid-term and final “exams” to check for success?
> The 3 standard use-cases given in question Q1.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)