Posted to issues@spark.apache.org by "Cheng Su (Jira)" <ji...@apache.org> on 2021/11/18 02:07:00 UTC

[jira] [Created] (SPARK-37361) Introduce Z-order for efficient data skipping

Cheng Su created SPARK-37361:
--------------------------------

             Summary: Introduce Z-order for efficient data skipping
                 Key: SPARK-37361
                 URL: https://issues.apache.org/jira/browse/SPARK-37361
             Project: Spark
          Issue Type: Umbrella
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Cheng Su


This is the umbrella Jira to track the progress of introducing Z-order in Spark. Z-order sorts tuples in a way that allows efficient data skipping for columnar file formats (Parquet and ORC).

For a query with a filter on a combination of multiple columns, for example:

 
{code:sql}
SELECT *
FROM table
WHERE x = 0 OR y = 0
{code}
Parquet/ORC cannot skip files/row groups efficiently when reading, even if the table is sorted (locally or globally) on any of the columns. However, when the table is Z-order sorted on multiple columns, Parquet/ORC can skip files/row groups efficiently when reading.

We should add this feature to Spark so that OSS Spark users can benefit when running these queries.
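
To illustrate the claim above, here is a minimal Python sketch of the idea. It is not Spark's implementation, and it is not taken from this Jira. The helper names (interleave_bits, row_groups, groups_to_read), the 16 x 16 grid of toy rows, and the row-group size of 16 are all assumptions made for the example. It compares a linear sort on (x, y) with a sort on a bit-interleaved Z-value, then counts how many row groups min/max statistics alone cannot rule out for the filter x = 0 OR y = 0.

```python
def interleave_bits(x: int, y: int, bits: int = 4) -> int:
    """Hypothetical helper: build a Z-value by interleaving the low bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bit i of x goes to even position 2i
        z |= ((y >> i) & 1) << (2 * i + 1)   # bit i of y goes to odd position 2i+1
    return z


def row_groups(rows, size):
    """Split already-sorted rows into fixed-size chunks standing in for row groups."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def groups_to_read(groups):
    """Count row groups whose min/max stats cannot rule out x = 0 OR y = 0."""
    count = 0
    for group in groups:
        xs = [x for x, _ in group]
        ys = [y for _, y in group]
        may_match_x = min(xs) <= 0 <= max(xs)
        may_match_y = min(ys) <= 0 <= max(ys)
        if may_match_x or may_match_y:
            count += 1
    return count


# Toy data (assumption): every (x, y) pair on a 16 x 16 grid, 16 rows per group.
rows = [(x, y) for x in range(16) for y in range(16)]

linear = row_groups(sorted(rows), 16)
zorder = row_groups(sorted(rows, key=lambda r: interleave_bits(*r)), 16)

linear_reads = groups_to_read(linear)
zorder_reads = groups_to_read(zorder)
print(f"linear sort: read {linear_reads} of {len(linear)} row groups")
print(f"z-order sort: read {zorder_reads} of {len(zorder)} row groups")
```

On this toy data, the linear sort puts one x value per row group with the full range of y, so the y = 0 branch forces every group to be read. The Z-order sort makes each group a 4 x 4 block with narrow ranges on both columns, so most groups can be skipped. Real tables will behave differently depending on data distribution and file layout.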

 

Reference:

Databricks Delta Lake added similar support with Z-order ([https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html], [https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html])

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
