You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tajo.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/06/13 19:57:01 UTC

[jira] [Commented] (TAJO-1352) Improve the join order algorithm to consider missed cases of associative join operators

    [ https://issues.apache.org/jira/browse/TAJO-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584744#comment-14584744 ] 

ASF GitHub Bot commented on TAJO-1352:
--------------------------------------

Github user jihoonson commented on the pull request:

    https://github.com/apache/tajo/pull/593#issuecomment-111736855
  
    Here is the simple evaluation result.
    
    ### Query
    ```
    select 
      count(*) 
    from 
      lineitem left outer join orders 
        on l_orderkey = o_orderkey 
      left outer join partsupp 
        on ps_suppkey = o_custkey 
      left outer join customer 
        on ps_suppkey = c_custkey 
      left outer join part 
        on p_partkey = c_nationkey
    ```
    
    ### Data
    * TPC-H of scale factor 100
    
    ### Cluster
    * One master and 3 workers
    * Each worker equips 4 cores, 8 GB memory, and 2 HDDs.
    
    ### Performance comparison
    * Elapsed time
     * Before patch: 25 mins, 23 sec
     * After patch: 19 mins, 59 sec
    * Performance improvement ratio: about 20%
    
    ### Query plan
    The query execution time is reduced due to the improved query plan as follows.
    #### Before patch
    ```
    -----------------------------
    Query Block Graph
    -----------------------------
    |-#ROOT
    -----------------------------
    Optimization Log:
    [LogicalPlan]
    	> ProjectionNode is eliminated.
    [#ROOT]
    	> Non-optimized join order: ((((tpch100.lineitem ⟕ tpch100.orders) ⟕ tpch100.partsupp) ⟕ tpch100.customer) ⟕ tpch100.part) (cost: 1.0447965953264456E46)
    	> Optimized join order    : ((((tpch100.lineitem ⟕ tpch100.orders) ⟕ tpch100.partsupp) ⟕ tpch100.customer) ⟕ tpch100.part) (cost: 1.0447965859707236E46)
    -----------------------------
    
    GROUP_BY(10)()
      => exprs: (count())
      => target list: ?count (INT8)
      => out schema:{(1) ?count (INT8)}
      => in schema:{(0) }
       JOIN(15)(LEFT_OUTER)
         => Join Cond: tpch100.part.p_partkey (INT8) = tpch100.customer.c_nationkey (INT8)
         => target list: 
         => out schema: {(0) }
         => in schema: {(2) tpch100.customer.c_nationkey (INT8), tpch100.part.p_partkey (INT8)}
          SCAN(7) on tpch100.part
            => target list: tpch100.part.p_partkey (INT8)
            => out schema: {(1) tpch100.part.p_partkey (INT8)}
            => in schema: {(9) tpch100.part.p_partkey (INT8), tpch100.part.p_name (TEXT), tpch100.part.p_mfgr (TEXT), tpch100.part.p_brand (TEXT), tpch100.part.p_type (TEXT), tpch100.part.p_size (INT4), tpch100.part.p_container (TEXT), tpch100.part.p_retailprice (FLOAT8), tpch100.part.p_comment (TEXT)}
          JOIN(14)(LEFT_OUTER)
            => Join Cond: tpch100.partsupp.ps_suppkey (INT8) = tpch100.customer.c_custkey (INT8)
            => target list: tpch100.customer.c_nationkey (INT8)
            => out schema: {(1) tpch100.customer.c_nationkey (INT8)}
            => in schema: {(3) tpch100.partsupp.ps_suppkey (INT8), tpch100.customer.c_nationkey (INT8), tpch100.customer.c_custkey (INT8)}
             SCAN(5) on tpch100.customer
               => target list: tpch100.customer.c_nationkey (INT8), tpch100.customer.c_custkey (INT8)
               => out schema: {(2) tpch100.customer.c_nationkey (INT8), tpch100.customer.c_custkey (INT8)}
               => in schema: {(8) tpch100.customer.c_custkey (INT8), tpch100.customer.c_name (TEXT), tpch100.customer.c_address (TEXT), tpch100.customer.c_nationkey (INT8), tpch100.customer.c_phone (TEXT), tpch100.customer.c_acctbal (FLOAT8), tpch100.customer.c_mktsegment (TEXT), tpch100.customer.c_comment (TEXT)}
             JOIN(13)(LEFT_OUTER)
               => Join Cond: tpch100.partsupp.ps_suppkey (INT8) = tpch100.orders.o_custkey (INT8)
               => target list: tpch100.partsupp.ps_suppkey (INT8)
               => out schema: {(1) tpch100.partsupp.ps_suppkey (INT8)}
               => in schema: {(2) tpch100.orders.o_custkey (INT8), tpch100.partsupp.ps_suppkey (INT8)}
                SCAN(3) on tpch100.partsupp
                  => target list: tpch100.partsupp.ps_suppkey (INT8)
                  => out schema: {(1) tpch100.partsupp.ps_suppkey (INT8)}
                  => in schema: {(5) tpch100.partsupp.ps_partkey (INT8), tpch100.partsupp.ps_suppkey (INT8), tpch100.partsupp.ps_availqty (INT4), tpch100.partsupp.ps_supplycost (FLOAT8), tpch100.partsupp.ps_comment (TEXT)}
                JOIN(12)(LEFT_OUTER)
                  => Join Cond: tpch100.lineitem.l_orderkey (INT8) = tpch100.orders.o_orderkey (INT8)
                  => target list: tpch100.orders.o_custkey (INT8)
                  => out schema: {(1) tpch100.orders.o_custkey (INT8)}
                  => in schema: {(3) tpch100.lineitem.l_orderkey (INT8), tpch100.orders.o_custkey (INT8), tpch100.orders.o_orderkey (INT8)}
                   SCAN(1) on tpch100.orders
                     => target list: tpch100.orders.o_custkey (INT8), tpch100.orders.o_orderkey (INT8)
                     => out schema: {(2) tpch100.orders.o_custkey (INT8), tpch100.orders.o_orderkey (INT8)}
                     => in schema: {(9) tpch100.orders.o_orderkey (INT8), tpch100.orders.o_custkey (INT8), tpch100.orders.o_orderstatus (TEXT), tpch100.orders.o_totalprice (FLOAT8), tpch100.orders.o_orderdate (DATE), tpch100.orders.o_orderpriority (TEXT), tpch100.orders.o_clerk (TEXT), tpch100.orders.o_shippriority (INT4), tpch100.orders.o_comment (TEXT)}
                   SCAN(0) on tpch100.lineitem
                     => target list: tpch100.lineitem.l_orderkey (INT8)
                     => out schema: {(1) tpch100.lineitem.l_orderkey (INT8)}
                     => in schema: {(16) tpch100.lineitem.l_orderkey (INT8), tpch100.lineitem.l_partkey (INT8), tpch100.lineitem.l_suppkey (INT8), tpch100.lineitem.l_linenumber (INT8), tpch100.lineitem.l_quantity (FLOAT8), tpch100.lineitem.l_extendedprice (FLOAT8), tpch100.lineitem.l_discount (FLOAT8), tpch100.lineitem.l_tax (FLOAT8), tpch100.lineitem.l_returnflag (TEXT), tpch100.lineitem.l_linestatus (TEXT), tpch100.lineitem.l_shipdate (DATE), tpch100.lineitem.l_commitdate (DATE), tpch100.lineitem.l_receiptdate (DATE), tpch100.lineitem.l_shipinstruct (TEXT), tpch100.lineitem.l_shipmode (TEXT), tpch100.lineitem.l_comment (TEXT)}
    ```
    
    #### After patch
    ```
    -----------------------------
    Query Block Graph
    -----------------------------
    |-#ROOT
    -----------------------------
    Optimization Log:
    [LogicalPlan]
    	> ProjectionNode is eliminated.
    [#ROOT]
    	> Non-optimized join order: ((((tpch100.lineitem ⟕ tpch100.orders) ⟕ tpch100.partsupp) ⟕ tpch100.customer) ⟕ tpch100.part) (cost: 7.933924122078524E46)
    	> Optimized join order    : ((tpch100.lineitem ⟕ (tpch100.orders ⟕ tpch100.partsupp)) ⟕ (tpch100.customer ⟕ tpch100.part)) (cost: 4.016549062824562E47)
    -----------------------------
    
    GROUP_BY(10)()
      => exprs: (count())
      => target list: ?count (INT8)
      => out schema:{(1) ?count (INT8)}
      => in schema:{(0) }
       JOIN(15)(LEFT_OUTER)
         => Join Cond: tpch100.partsupp.ps_suppkey (INT8) = tpch100.customer.c_custkey (INT8)
         => target list: 
         => out schema: {(0) }
         => in schema: {(2) tpch100.partsupp.ps_suppkey (INT8), tpch100.customer.c_custkey (INT8)}
          JOIN(14)(LEFT_OUTER)
            => Join Cond: tpch100.part.p_partkey (INT8) = tpch100.customer.c_nationkey (INT8)
            => target list: tpch100.customer.c_custkey (INT8)
            => out schema: {(1) tpch100.customer.c_custkey (INT8)}
            => in schema: {(3) tpch100.customer.c_custkey (INT8), tpch100.customer.c_nationkey (INT8), tpch100.part.p_partkey (INT8)}
             SCAN(7) on tpch100.part
               => target list: tpch100.part.p_partkey (INT8)
               => out schema: {(1) tpch100.part.p_partkey (INT8)}
               => in schema: {(9) tpch100.part.p_partkey (INT8), tpch100.part.p_name (TEXT), tpch100.part.p_mfgr (TEXT), tpch100.part.p_brand (TEXT), tpch100.part.p_type (TEXT), tpch100.part.p_size (INT4), tpch100.part.p_container (TEXT), tpch100.part.p_retailprice (FLOAT8), tpch100.part.p_comment (TEXT)}
             SCAN(5) on tpch100.customer
               => target list: tpch100.customer.c_custkey (INT8), tpch100.customer.c_nationkey (INT8)
               => out schema: {(2) tpch100.customer.c_custkey (INT8), tpch100.customer.c_nationkey (INT8)}
               => in schema: {(8) tpch100.customer.c_custkey (INT8), tpch100.customer.c_name (TEXT), tpch100.customer.c_address (TEXT), tpch100.customer.c_nationkey (INT8), tpch100.customer.c_phone (TEXT), tpch100.customer.c_acctbal (FLOAT8), tpch100.customer.c_mktsegment (TEXT), tpch100.customer.c_comment (TEXT)}
          JOIN(13)(LEFT_OUTER)
            => Join Cond: tpch100.lineitem.l_orderkey (INT8) = tpch100.orders.o_orderkey (INT8)
            => target list: tpch100.partsupp.ps_suppkey (INT8)
            => out schema: {(1) tpch100.partsupp.ps_suppkey (INT8)}
            => in schema: {(3) tpch100.lineitem.l_orderkey (INT8), tpch100.partsupp.ps_suppkey (INT8), tpch100.orders.o_orderkey (INT8)}
             JOIN(12)(LEFT_OUTER)
               => Join Cond: tpch100.partsupp.ps_suppkey (INT8) = tpch100.orders.o_custkey (INT8)
               => target list: tpch100.partsupp.ps_suppkey (INT8), tpch100.orders.o_orderkey (INT8)
               => out schema: {(2) tpch100.partsupp.ps_suppkey (INT8), tpch100.orders.o_orderkey (INT8)}
               => in schema: {(3) tpch100.orders.o_orderkey (INT8), tpch100.orders.o_custkey (INT8), tpch100.partsupp.ps_suppkey (INT8)}
                SCAN(3) on tpch100.partsupp
                  => target list: tpch100.partsupp.ps_suppkey (INT8)
                  => out schema: {(1) tpch100.partsupp.ps_suppkey (INT8)}
                  => in schema: {(5) tpch100.partsupp.ps_partkey (INT8), tpch100.partsupp.ps_suppkey (INT8), tpch100.partsupp.ps_availqty (INT4), tpch100.partsupp.ps_supplycost (FLOAT8), tpch100.partsupp.ps_comment (TEXT)}
                SCAN(1) on tpch100.orders
                  => target list: tpch100.orders.o_orderkey (INT8), tpch100.orders.o_custkey (INT8)
                  => out schema: {(2) tpch100.orders.o_orderkey (INT8), tpch100.orders.o_custkey (INT8)}
                  => in schema: {(9) tpch100.orders.o_orderkey (INT8), tpch100.orders.o_custkey (INT8), tpch100.orders.o_orderstatus (TEXT), tpch100.orders.o_totalprice (FLOAT8), tpch100.orders.o_orderdate (DATE), tpch100.orders.o_orderpriority (TEXT), tpch100.orders.o_clerk (TEXT), tpch100.orders.o_shippriority (INT4), tpch100.orders.o_comment (TEXT)}
             SCAN(0) on tpch100.lineitem
               => target list: tpch100.lineitem.l_orderkey (INT8)
               => out schema: {(1) tpch100.lineitem.l_orderkey (INT8)}
               => in schema: {(16) tpch100.lineitem.l_orderkey (INT8), tpch100.lineitem.l_partkey (INT8), tpch100.lineitem.l_suppkey (INT8), tpch100.lineitem.l_linenumber (INT8), tpch100.lineitem.l_quantity (FLOAT8), tpch100.lineitem.l_extendedprice (FLOAT8), tpch100.lineitem.l_discount (FLOAT8), tpch100.lineitem.l_tax (FLOAT8), tpch100.lineitem.l_returnflag (TEXT), tpch100.lineitem.l_linestatus (TEXT), tpch100.lineitem.l_shipdate (DATE), tpch100.lineitem.l_commitdate (DATE), tpch100.lineitem.l_receiptdate (DATE), tpch100.lineitem.l_shipinstruct (TEXT), tpch100.lineitem.l_shipmode (TEXT), tpch100.lineitem.l_comment (TEXT)}
    ```


> Improve the join order algorithm to consider missed cases of associative join operators
> ---------------------------------------------------------------------------------------
>
>                 Key: TAJO-1352
>                 URL: https://issues.apache.org/jira/browse/TAJO-1352
>             Project: Tajo
>          Issue Type: Improvement
>          Components: planner/optimizer
>            Reporter: Jihoon Son
>            Assignee: Jihoon Son
>         Attachments: JoinGraph.png
>
>
> TAJO-1277 fixes a bug related to the associativity of join operators, but there are still some missed cases that join operators are associative. This work should include the cases described in the following links:
> * http://stackoverflow.com/questions/20022196/are-left-outer-joins-associative
> * https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)