Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/25 12:01:52 UTC

[GitHub] [spark] beliefer opened a new pull request #35975: [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression

beliefer opened a new pull request #35975:
URL: https://github.com/apache/spark/pull/35975


   ### What changes were proposed in this pull request?
   This is an ANSI SQL feature; its feature ID is `F861`:
   ```
   <query expression> ::=
   [ <with clause> ] <query expression body>
   [ <order by clause> ] [ <result offset clause> ] [ <fetch first clause> ]
   
   <result offset clause> ::=
   OFFSET <offset row count> { ROW | ROWS }
   ```
   For example:
   ```
   SELECT customer_name, customer_gender FROM customer_dimension 
      WHERE occupation='Dancer' AND customer_city = 'San Francisco' ORDER BY customer_name;
       customer_name     | customer_gender
   ----------------------+-----------------
    Amy X. Lang          | Female
    Anna H. Li           | Female
    Brian O. Weaver      | Male
    Craig O. Pavlov      | Male
    Doug Z. Goldberg     | Male
    Harold S. Jones      | Male
    Jack E. Perkins      | Male
    Joseph W. Overstreet | Male
    Kevin . Campbell     | Male
    Raja Y. Wilson       | Male
    Samantha O. Brown    | Female
    Steve H. Gauthier    | Male
    William . Nielson    | Male
    William Z. Roy       | Male
   (14 rows)
   
   SELECT customer_name, customer_gender FROM customer_dimension 
      WHERE occupation='Dancer' AND customer_city = 'San Francisco' ORDER BY customer_name OFFSET 8;
      customer_name   | customer_gender
   -------------------+-----------------
    Kevin . Campbell  | Male
    Raja Y. Wilson    | Male
    Samantha O. Brown | Female
    Steve H. Gauthier | Male
    William . Nielson | Male
    William Z. Roy    | Male
   (6 rows)
   ```
   Several mainstream databases support this syntax:
   
   **Druid**
   https://druid.apache.org/docs/latest/querying/sql.html#offset
   
   **Kylin**
   http://kylin.apache.org/docs/tutorial/sql_reference.html#QUERYSYNTAX
   
   **Exasol**
   https://docs.exasol.com/sql/select.htm
   
   **Greenplum**
   http://docs.greenplum.org/6-8/ref_guide/sql_commands/SELECT.html
   
   **MySQL**
   https://dev.mysql.com/doc/refman/5.6/en/select.html
   
   **Monetdb**
   https://www.monetdb.org/Documentation/SQLreference/SQLSyntaxOverview#SELECT
   
   **PostgreSQL**
   https://www.postgresql.org/docs/11/queries-limit.html
   
   **Sqlite**
   https://www.sqlite.org/lang_select.html
   
   **Vertica**
   https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/SELECT/OFFSETClause.htm?zoom_highlight=offset
   
   Design description:
   **1**. Consider `OFFSET` as a special case of `LIMIT`. For example:
   `SELECT * FROM a limit 10;` is similar to `SELECT * FROM a limit 10 offset 0;`
   `SELECT * FROM a offset 10;` is similar to `SELECT * FROM a limit -1 offset 10;`
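   The equivalence can be sketched with plain Scala collections; this is an illustration only, not code from this PR, and `-1` stands for "no limit" as in the rewrite above:
   ```
   // Illustration only (not Spark code): limit = -1 means "no limit".
   def limitOffset[T](rows: Seq[T], limit: Int, offset: Int): Seq[T] = {
     val skipped = rows.drop(offset)                  // apply OFFSET first
     if (limit < 0) skipped else skipped.take(limit)  // then apply LIMIT, if any
   }

   val a = (1 to 20).toSeq
   limitOffset(a, 10, 0)    // SELECT * FROM a limit 10 offset 0  -> rows 1..10
   limitOffset(a, -1, 10)   // SELECT * FROM a limit -1 offset 10 -> rows 11..20
   ```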
   **2**. The current implementation of `LIMIT` has good performance. For example,
   `SELECT * FROM a limit 10;` is parsed into the logical plan below:
   ```
   GlobalLimit (limit = 10)
   |--LocalLimit (limit = 10)
   ```
   and then into the physical plan below:
   ```
   GlobalLimitExec (limit = 10) // Take the first 10 rows globally
   |--LocalLimitExec (limit = 10) // Take the first 10 rows locally
   ```
   These operators avoid shuffling massive amounts of data and perform well.
   Sometimes the logical plan is transformed into the physical plan:
   ```
   CollectLimitExec (limit = 10) // Take the first 10 rows globally
   ```
   If the SQL contains an ORDER BY, such as `SELECT * FROM a order by c limit 10;`,
   it is transformed into the physical plan below:
   ```
   TakeOrderedAndProjectExec (limit = 10) // Take the first 10 rows after sort globally
   ```
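   For reference, the existing plans can be inspected with `explain()`; a minimal sketch (the table `a` and column `c` are made up for this example):
   ```
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().master("local[*]").appName("limit-plans").getOrCreate()
   spark.range(100).toDF("c").createOrReplaceTempView("a")

   // Typically planned as CollectLimitExec (or GlobalLimitExec over LocalLimitExec).
   spark.sql("SELECT * FROM a LIMIT 10").explain()

   // With ORDER BY, typically planned as TakeOrderedAndProjectExec.
   spark.sql("SELECT * FROM a ORDER BY c LIMIT 10").explain()
   ```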
   
   Based on this, this PR introduces the following operators. For example,
   `SELECT * FROM a limit 10 offset 10;` is parsed into the logical plan below:
   ```
   GlobalLimit (limit = 10)
   |--LocalLimit (limit = 10)
      |--Offset (offset = 10)
   ```
   After optimization, the above logical plan is transformed into:
   ```
   GlobalLimitAndOffset (limit = 10, offset = 10) // Limit clause accompanied by offset clause
   |--LocalLimit (limit = 20)   // limit + offset = 10 + 10 = 20
   ```
   
   and then into the physical plan below:
   ```
   GlobalLimitAndOffsetExec (limit = 10, offset = 10) // Skip the first 10 rows and take the next 10 rows globally
   |--LocalLimitExec (limit = 20) // Take the first 20 (limit + offset) rows locally
   ```
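   The intended semantics can be sketched with plain Scala collections, where each inner `Seq` models one partition; this only illustrates the arithmetic, it is not the executor code:
   ```
   // Illustration only: three "partitions" of table a.
   val partitions = Seq((1 to 7).toSeq, (8 to 15).toSeq, (16 to 30).toSeq)
   val (limit, offset) = (10, 10)

   // LocalLimitExec(limit + offset): each partition needs at most limit + offset = 20 rows,
   // because the global result never reads past the first limit + offset rows.
   val locallyLimited = partitions.map(_.take(limit + offset))

   // GlobalLimitAndOffsetExec(limit, offset): on the single gathered partition,
   // skip the first `offset` rows, then take the next `limit` rows.
   val result = locallyLimited.flatten.drop(offset).take(limit)
   // result: the 11th through 20th rows of the concatenated partitions
   ```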
   Sometimes the logical plan is transformed into the physical plan:
   ```
   CollectLimitExec (limit = 10, offset = 10) // Skip the first 10 rows and take the next 10 rows globally
   ```
   If the SQL contains an ORDER BY, such as `SELECT * FROM a order by c limit 10 offset 10;`,
   it is transformed into the physical plan below:
   ```
   TakeOrderedAndProjectExec (limit = 10, offset = 10) // Skip the first 10 rows and take the next 10 rows after sorting globally
   ```
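   The ordered case is, in the same spirit, "sort, then slice"; a plain-Scala sketch of the intended semantics (not the operator's implementation):
   ```
   // Illustration only: ORDER BY c LIMIT 3 OFFSET 4 over an unsorted collection.
   val data = Seq(42, 7, 19, 3, 88, 25, 61, 14, 9, 77, 50, 33)
   val page = data.sorted.slice(4, 4 + 3)  // sort, skip 4 rows, take the next 3
   // page == Seq(19, 25, 33)
   ```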
   **3**. In addition to the above, there is a special case with only an offset and no limit. For example,
   `SELECT * FROM a offset 10;` is parsed into the logical plan below:
   ```
   Offset (offset = 10) // Only offset clause
   ```
   If the offset is very large, this will generate a lot of overhead. So this PR refuses an offset clause without a limit clause, although we can parse, transform, and execute it.
   
   A balanced idea is to add a configuration item `spark.sql.forceUsingOffsetWithoutLimit` to force running the query when the user knows the offset is small enough. The default value of `spark.sql.forceUsingOffsetWithoutLimit` would be false. This PR only proposes the idea, so that it can be implemented at a better time in the future; a hypothetical usage sketch is shown below.
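   If such a configuration were added, usage might look like this; note that `spark.sql.forceUsingOffsetWithoutLimit` is only a proposal here and does not exist yet (assumes a `spark` session, e.g. in `spark-shell`):
   ```
   // Hypothetical: this config key is only proposed in this PR, not an existing Spark config.
   spark.conf.set("spark.sql.forceUsingOffsetWithoutLimit", "true")
   spark.sql("SELECT * FROM a OFFSET 10").show()  // would run instead of being rejected
   ```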
   
   Note: the original PR for this feature is https://github.com/apache/spark/pull/25416.
   Because that PR is too old, it has massive conflicts that are hard to resolve, so I opened this new PR to support the feature.
   
   ### Why are the changes needed?
   This is a new feature: support for the ANSI SQL result offset clause.
   
   ### Does this PR introduce any user-facing change?
   'No'
   
   
   ### How was this patch tested?
   Existing and new unit tests.



[GitHub] [spark] beliefer commented on pull request #35975: [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression

Posted by GitBox <gi...@apache.org>.
beliefer commented on pull request #35975:
URL: https://github.com/apache/spark/pull/35975#issuecomment-1079639707


   ping @cloud-fan 

