You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "ShuMing Li (Jira)" <ji...@apache.org> on 2019/09/16 03:33:00 UTC

[jira] [Commented] (SPARK-29031) Materialized column to accelerate queries

    [ https://issues.apache.org/jira/browse/SPARK-29031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930211#comment-16930211 ] 

ShuMing Li commented on SPARK-29031:
------------------------------------

This can be useful in cases when user's queries are regular and specific or original(old) table schema is not very designed very well.

Just curious, does  `Materialized column` exist in other db engines ?

 

> Materialized column to accelerate queries
> -----------------------------------------
>
>                 Key: SPARK-29031
>                 URL: https://issues.apache.org/jira/browse/SPARK-29031
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Jason Guo
>            Priority: Major
>              Labels: SPIP
>
> Goals
>  * Add a new SQL grammar of Materialized column
>  * Implicitly rewrite SQL queries on the complex type of columns if there is a materialized columns for it
>  * If the data type of the materialized columns is atomic type, even though the origin column type is in complex type, enable vectorized read and filter pushdown to improve performance
> Example
> Create a normal table
> {quote}CREATE TABLE x (
>     name STRING,
>     age INT,
>     params STRING,
>     event MAP<STRING, STRING>
> ) USING parquet;
> {quote}
>  
> Add materialized columns to an existing table
> {quote}ALTER TABLE x ADD COLUMNS (
>     new_age INT MATERIALIZED age + 1,
>     city STRING MATERIALIZED get_json_object(params, '$.city'),
>     label STRING MATERIALIZED event['label']
> );
> {quote}
>  
> When issue a query as below
> {quote}SELECT name, age+1, get_json_object(params, '$.city'), event['label']
> FROM x
> WHER event['label'] = 'newuser';
> {quote}
> It's equivalent to
> {quote}SELECT name, new_age, city, label
> FROM x
> WHERE label = 'newuser';
> {quote}
>  
> The query performance improved dramatically because
>  # The new query (after rewritten) will read the new column city (in string type) instead of read the whole map of params(in map string). Much lesser data are need to read
>  # Vectorized read can be utilized in the new query and can not be used in the old one. Because vectorized read can only be enabled when all required columns are in atomic type
>  # Filter can be pushdown. Only filters on atomic column can be pushdown. The original filter  event['label'] = 'newuser' is on complex column, so it can not be pushdown.
>  # The new query do not need to parse JSON any more. JSON parse is a CPU intensive operation which will impact performance dramatically
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org