Posted to issues@spark.apache.org by "Cheng Su (Jira)" <ji...@apache.org> on 2021/11/18 02:07:00 UTC
[jira] [Updated] (SPARK-37361) Introduce Z-order for efficient data skipping
[ https://issues.apache.org/jira/browse/SPARK-37361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cheng Su updated SPARK-37361:
-----------------------------
Description:
This is the umbrella Jira to track the progress of introducing Z-order in Spark. Z-order sorts tuples in a way that allows efficient data skipping for columnar file formats (Parquet and ORC).
For a query with a filter on a combination of multiple columns, for example:
{code:sql}
SELECT *
FROM table
WHERE x = 0 OR y = 0
{code}
Parquet/ORC cannot skip files/row groups efficiently when reading, even if the table is sorted (locally or globally) on those columns. However, when the table is Z-order sorted on multiple columns, Parquet/ORC can skip files/row groups efficiently when reading.
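The difference can be illustrated with a small simulation. This is a minimal sketch, not Spark's implementation: it assumes Z-order means bit interleaving (a Morton code) of two non-negative integer columns, and it models each file as a fixed-size run of sorted rows that carries only per-column min/max statistics. The names interleave_bits and files_to_read, the 16 x 16 grid, and the 16 rows per file are hypothetical choices made for illustration.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Z-value (Morton code): bit i of x goes to bit 2i, bit i of y to bit 2i+1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def files_to_read(rows, rows_per_file):
    """Count files whose min/max stats cannot rule out `x = 0 OR y = 0`."""
    count = 0
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        xs = [x for x, _ in chunk]
        ys = [y for _, y in chunk]
        x_may_match = min(xs) <= 0 <= max(xs)
        y_may_match = min(ys) <= 0 <= max(ys)
        if x_may_match or y_may_match:
            count += 1
    return count


# Hypothetical table: every (x, y) pair on a 16 x 16 grid, 16 rows per file.
points = [(x, y) for x in range(16) for y in range(16)]

linear = sorted(points)                                     # ORDER BY x, y
zorder = sorted(points, key=lambda p: interleave_bits(*p))  # Z-order on (x, y)

linear_files = files_to_read(linear, 16)
zorder_files = files_to_read(zorder, 16)
print(f"linear sort: read {linear_files} of 16 files")
print(f"z-order sort: read {zorder_files} of 16 files")
```

In this toy setup the linear sort has to read all 16 files, because every file spans the full range of y, while the Z-order sort reads 7, because each file covers a compact 4 x 4 block of the grid. Real savings depend on data distribution, file sizes, and how the reader evaluates pushed-down filters.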
We should add this feature to Spark so that OSS Spark users can benefit when running these queries.
Reference:
Databricks Delta Lake added similar support with Z-order ([https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html], [https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html] )
> Introduce Z-order for efficient data skipping
> ---------------------------------------------
>
> Key: SPARK-37361
> URL: https://issues.apache.org/jira/browse/SPARK-37361
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Cheng Su
> Priority: Minor
>
> This is the umbrella Jira to track the progress of introducing Z-order in Spark. Z-order sorts tuples in a way that allows efficient data skipping for columnar file formats (Parquet and ORC).
> For a query with a filter on a combination of multiple columns, for example:
> {code:sql}
> SELECT *
> FROM table
> WHERE x = 0 OR y = 0
> {code}
> Parquet/ORC cannot skip files/row groups efficiently when reading, even if the table is sorted (locally or globally) on those columns. However, when the table is Z-order sorted on multiple columns, Parquet/ORC can skip files/row groups efficiently when reading.
> We should add this feature to Spark so that OSS Spark users can benefit when running these queries.
>
> Reference:
> Databricks Delta Lake added similar support with Z-order ([https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html], [https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html] )
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)