You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hawq.apache.org by "Ruilong Huo (JIRA)" <ji...@apache.org> on 2016/01/15 14:13:39 UTC
[jira] [Updated] (HAWQ-348) Optimizer (ORCA/Planner) should not preprocess (table) functions at planning phase

     [ https://issues.apache.org/jira/browse/HAWQ-348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruilong Huo updated HAWQ-348:
-----------------------------
    Description: 
Optimizer (ORCA/Planner) currently preprocesses (table) functions (either in target list or from clause) at planning phase. This introduces:

1. Much lower performance since the result of the (table) function is motioned to QD from QEs after the preprocessing, and it is further processed there at QD, especially the result is large. In this case, QD does heavy workload and becomes the bottleneck. It shows about 20x performance difference in below example.

2. Much more memory overhead at QD since it needs to hold the result of the (table) function. This is risky since the result might be unpredictably large.

Here are the steps to reproduce this issue, as well as some initial analysis:
Step 1: Prepare schema and data
{noformat}
CREATE TABLE t (id INT);
CREATE TABLE


INSERT INTO t SELECT generate_series(1, 100000000);
INSERT 0 100000000


CREATE OR REPLACE FUNCTION get_t()
RETURNS SETOF t
LANGUAGE SQL AS
'SELECT * FROM t'
STABLE;
CREATE FUNCTION
{noformat}

2. With optimizer = OFF (Planner)
{noformat}
SET optimizer='OFF';
SET


select sum(id) from t;
       sum
------------------
 5000000050000000
(1 row)
Time: 8801.577 ms


select sum(id) from get_t();
       sum
------------------
 5000000050000000
(1 row)
Time: 189992.273 ms


EXPLAIN SELECT sum(id) FROM get_t();
                             QUERY PLAN
--------------------------------------------------------------------
 Aggregate  (cost=32.50..32.51 rows=1 width=8)
   ->  Function Scan on get_t  (cost=0.00..12.50 rows=8000 width=4)
 Settings:  default_segment_num=8; optimizer=off
 Optimizer status: legacy query optimizer
(4 rows)
{noformat}

3. With optimizer = ON (ORCA)
{noformat}
SET optimizer='ON';
SET


select sum(id) from t;
       sum
------------------
 5000000050000000
(1 row)

Time: 10103.436 ms


select sum(id) from get_t();
       sum
------------------
 5000000050000000
(1 row)

Time: 195551.740 ms


EXPLAIN SELECT sum(id) FROM get_t();
                             QUERY PLAN
--------------------------------------------------------------------
 Aggregate  (cost=32.50..32.51 rows=1 width=8)
   ->  Function Scan on get_t  (cost=0.00..12.50 rows=8000 width=4)
 Settings:  default_segment_num=8
 Optimizer status: legacy query optimizer
(4 rows)
{noformat}

  was:
Optimizer (ORCA/Planner) currently preprocesses (table) functions (either in target list or from clause) at planning phase. This introduces:

1. Much lower performance since the result of the (table) function is motioned to QD from QEs after the preprocessing, and it is further processed there at QD, especially the result is large. In this case, QD does heavy workload and becomes the bottleneck. It shows about 20x performance difference in below example.

2. Much memory overhead at QD since it needs to hold the result of the (table) function. This is risky since the result might be unpredictably large.

Here are the steps to reproduce this issue, as well as some initial analysis:
Step 1: Prepare schema and data
{noformat}
CREATE TABLE t (id INT);
CREATE TABLE


INSERT INTO t SELECT generate_series(1, 100000000);
INSERT 0 100000000


CREATE OR REPLACE FUNCTION get_t()
RETURNS SETOF t
LANGUAGE SQL AS
'SELECT * FROM t'
STABLE;
CREATE FUNCTION
{noformat}

2. With optimizer = OFF (Planner)
{noformat}
SET optimizer='OFF';
SET


select sum(id) from t;
       sum
------------------
 5000000050000000
(1 row)
Time: 8801.577 ms


select sum(id) from get_t();
       sum
------------------
 5000000050000000
(1 row)
Time: 189992.273 ms


EXPLAIN SELECT sum(id) FROM get_t();
                             QUERY PLAN
--------------------------------------------------------------------
 Aggregate  (cost=32.50..32.51 rows=1 width=8)
   ->  Function Scan on get_t  (cost=0.00..12.50 rows=8000 width=4)
 Settings:  default_segment_num=8; optimizer=off
 Optimizer status: legacy query optimizer
(4 rows)
{noformat}

3. With optimizer = ON (ORCA)
{noformat}
SET optimizer='ON';
SET


select sum(id) from t;
       sum
------------------
 5000000050000000
(1 row)

Time: 10103.436 ms


select sum(id) from get_t();
       sum
------------------
 5000000050000000
(1 row)

Time: 195551.740 ms


EXPLAIN SELECT sum(id) FROM get_t();
                             QUERY PLAN
--------------------------------------------------------------------
 Aggregate  (cost=32.50..32.51 rows=1 width=8)
   ->  Function Scan on get_t  (cost=0.00..12.50 rows=8000 width=4)
 Settings:  default_segment_num=8
 Optimizer status: legacy query optimizer
(4 rows)
{noformat}


> Optimizer (ORCA/Planner) should not preprocess (table) functions at planning phase
> ----------------------------------------------------------------------------------
>
>                 Key: HAWQ-348
>                 URL: https://issues.apache.org/jira/browse/HAWQ-348
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Optimizer
>            Reporter: Ruilong Huo
>            Assignee: Amr El-Helw
>
> Optimizer (ORCA/Planner) currently preprocesses (table) functions (either in target list or from clause) at planning phase. This introduces:
> 1. Much lower performance since the result of the (table) function is motioned to QD from QEs after the preprocessing, and it is further processed there at QD, especially the result is large. In this case, QD does heavy workload and becomes the bottleneck. It shows about 20x performance difference in below example.
> 2. Much more memory overhead at QD since it needs to hold the result of the (table) function. This is risky since the result might be unpredictably large.
> Here are the steps to reproduce this issue, as well as some initial analysis:
> Step 1: Prepare schema and data
> {noformat}
> CREATE TABLE t (id INT);
> CREATE TABLE
> INSERT INTO t SELECT generate_series(1, 100000000);
> INSERT 0 100000000
> CREATE OR REPLACE FUNCTION get_t()
> RETURNS SETOF t
> LANGUAGE SQL AS
> 'SELECT * FROM t'
> STABLE;
> CREATE FUNCTION
> {noformat}
> 2. With optimizer = OFF (Planner)
> {noformat}
> SET optimizer='OFF';
> SET
> select sum(id) from t;
>        sum
> ------------------
>  5000000050000000
> (1 row)
> Time: 8801.577 ms
> select sum(id) from get_t();
>        sum
> ------------------
>  5000000050000000
> (1 row)
> Time: 189992.273 ms
> EXPLAIN SELECT sum(id) FROM get_t();
>                              QUERY PLAN
> --------------------------------------------------------------------
>  Aggregate  (cost=32.50..32.51 rows=1 width=8)
>    ->  Function Scan on get_t  (cost=0.00..12.50 rows=8000 width=4)
>  Settings:  default_segment_num=8; optimizer=off
>  Optimizer status: legacy query optimizer
> (4 rows)
> {noformat}
> 3. With optimizer = ON (ORCA)
> {noformat}
> SET optimizer='ON';
> SET
> select sum(id) from t;
>        sum
> ------------------
>  5000000050000000
> (1 row)
> Time: 10103.436 ms
> select sum(id) from get_t();
>        sum
> ------------------
>  5000000050000000
> (1 row)
> Time: 195551.740 ms
> EXPLAIN SELECT sum(id) FROM get_t();
>                              QUERY PLAN
> --------------------------------------------------------------------
>  Aggregate  (cost=32.50..32.51 rows=1 width=8)
>    ->  Function Scan on get_t  (cost=0.00..12.50 rows=8000 width=4)
>  Settings:  default_segment_num=8
>  Optimizer status: legacy query optimizer
> (4 rows)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)