Posted to issues@spark.apache.org by "Cheng Su (Jira)" <ji...@apache.org> on 2021/11/18 02:07:00 UTC
[jira] [Updated] (SPARK-37361) Introduce Z-order for efficient data skipping

     [ https://issues.apache.org/jira/browse/SPARK-37361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Su updated SPARK-37361:
-----------------------------
    Description: 
This is the umbrella Jira to track the progress of introducing Z-order in Spark. Z-order sorts tuples in a way that allows efficient data skipping for columnar file formats (Parquet and ORC).

For a query with a filter on a combination of multiple columns, for example:
{code:sql}
SELECT *
FROM table
WHERE x = 0 OR y = 0
{code}
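Parquet/ORC cannot skip files or row groups efficiently when reading, even if the table is sorted (locally or globally) on those columns, because a lexicographic sort keeps the min/max ranges tight mainly for the leading sort column. When the table is Z-order sorted on multiple columns, however, Parquet/ORC can skip files and row groups efficiently for filters on any of those columns.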
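
To make the skipping claim above concrete, here is a minimal sketch in plain Python. It is not Spark's implementation and uses no Spark API; the helper names (interleave_bits, files_to_read) are hypothetical, "files" are modeled as fixed-size chunks of sorted rows, and only per-file minimum statistics are checked, which is enough because all values are assumed to be non-negative integers.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) value of two non-negative integers."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return z


def files_to_read(rows, rows_per_file):
    """Count chunks whose minimum statistics cannot rule out x = 0 OR y = 0."""
    count = 0
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        min_x = min(x for x, _ in chunk)
        min_y = min(y for _, y in chunk)
        # Values are non-negative, so a chunk can contain 0 only if its minimum is 0.
        if min_x == 0 or min_y == 0:
            count += 1
    return count


# A 16 x 16 grid of (x, y) rows, written as 16 "files" of 16 rows each.
points = [(x, y) for x in range(16) for y in range(16)]

linear = sorted(points)  # ordinary sort on (x, y)
zorder = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))

linear_files = files_to_read(linear, 16)
zorder_files = files_to_read(zorder, 16)
print(linear_files, zorder_files)
```

In this toy setup the linear sort puts every y value into every file, so no file can be skipped for the y = 0 branch of the filter, while the Z-order sort groups rows into 4 x 4 blocks of the (x, y) space and only the blocks touching x = 0 or y = 0 need to be read.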

We should add this feature to Spark so that OSS Spark users can benefit when running such queries.

 

Reference:

Databricks Delta Lake added similar support with Z-order ([https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html], [https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html] )

 


> Introduce Z-order for efficient data skipping
> ---------------------------------------------
>
>                 Key: SPARK-37361
>                 URL: https://issues.apache.org/jira/browse/SPARK-37361
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Cheng Su
>            Priority: Minor
>


