Posted to issues@spark.apache.org by "Cheng Su (Jira)" <ji...@apache.org> on 2021/11/18 02:15:00 UTC

[jira] [Commented] (SPARK-37361) Introduce Z-order for efficient data skipping

    [ https://issues.apache.org/jira/browse/SPARK-37361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445584#comment-17445584 ] 

Cheng Su commented on SPARK-37361:
----------------------------------

Just FYI, I am working on each sub-task now. Thanks.

> Introduce Z-order for efficient data skipping
> ---------------------------------------------
>
>                 Key: SPARK-37361
>                 URL: https://issues.apache.org/jira/browse/SPARK-37361
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Cheng Su
>            Priority: Minor
>
> This is the umbrella Jira to track the progress of introducing Z-order in Spark. Z-order sorts tuples in a way that allows efficient data skipping for columnar file formats (Parquet and ORC).
> For a query with a filter on a combination of multiple columns, for example:
> {code:sql}
> SELECT *
> FROM table
> WHERE x = 0 OR y = 0
> {code}
> Parquet/ORC cannot skip files/row groups efficiently when reading, even if the table is sorted (locally or globally) on any of the columns. However, when the table is Z-order sorted on multiple columns, Parquet/ORC can skip files/row groups efficiently when reading.
> We should add this feature to Spark so that OSS Spark users can benefit when running these queries.
>  
> Reference:
>  * Databricks Delta Lake added similar support for Z-order ([https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html], [https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html] )
>  * Presto added similar support for Z-order ([https://github.com/prestodb/presto/blob/master/presto-hive-common/src/main/java/com/facebook/presto/hive/zorder/ZOrder.java] )
>  
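
As an illustration of the data-skipping claim in the description quoted above, here is a minimal Python sketch. It is not Spark's, Delta Lake's, or Presto's implementation. The helper names (z_value, split_into_files, files_read), the 16 x 16 grid of integer (x, y) rows, and the 16-rows-per-file layout are all assumptions made for the example. It models min/max statistics per file the way a Parquet/ORC reader might use them, and counts how many files cannot be ruled out for the filter x = 0 OR y = 0.

```python
from itertools import product


def z_value(x, y, bits=4):
    """Interleave the low `bits` bits of x and y (x in even positions, y in odd)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def split_into_files(rows, rows_per_file):
    """Cut an already-sorted row list into fixed-size chunks ("files")."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_read(files):
    """Count files whose min/max stats cannot rule out `x = 0 OR y = 0`."""
    count = 0
    for f in files:
        xs = [x for x, _ in f]
        ys = [y for _, y in f]
        may_match_x = min(xs) <= 0 <= max(xs)
        may_match_y = min(ys) <= 0 <= max(ys)
        # With OR, a file can be skipped only if neither side can match.
        if may_match_x or may_match_y:
            count += 1
    return count


# Hypothetical table: every (x, y) pair with 0 <= x, y < 16.
rows = list(product(range(16), repeat=2))

# Layout 1: plain lexicographic sort on (x, y).
linear_files = split_into_files(sorted(rows), 16)
# Layout 2: sort by the interleaved Z-value.
zorder_files = split_into_files(sorted(rows, key=lambda r: z_value(*r)), 16)

print("linear sort :", files_read(linear_files), "of", len(linear_files), "files read")
print("Z-order sort:", files_read(zorder_files), "of", len(zorder_files), "files read")
```

Under these assumptions, the linear sort leaves every file spanning the full range of y, so all 16 files must be read. With the Z-order layout each file covers a 4 x 4 block of (x, y) values, so only the 7 files touching x = 0 or y = 0 are read. Real tables have skewed data, non-integer types, and different file sizes, so the actual gain will vary.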



--
This message was sent by Atlassian Jira
(v8.20.1#820001)