You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/25 20:13:27 UTC

[GitHub] [arrow-datafusion] Dandandan opened a new issue #77: Address performance/execution plan of TPCH query 9

Dandandan opened a new issue #77:
URL: https://github.com/apache/arrow-datafusion/issues/77


   **Describe the bug**
   TPC-H Query 9 consumes a lot of memory and takes ~8s in memory on a 8 core machine on SF=1 in memory (with 100%cpu usage across cores) to execute. That likely has to do with the plan misses a equi-join and turns it into a cross join.
   
   **To Reproduce**
   Run query 9, check plan and execution time.
   
   **Expected behavior**
   The query is expected to finish faster (<8s) and shouldn't need a cross join.
   
   **Additional context**
   N/A
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] pjmore commented on issue #77: Address performance/execution plan of TPCH query 9

Posted by GitBox <gi...@apache.org>.
pjmore commented on issue #77:
URL: https://github.com/apache/arrow-datafusion/issues/77#issuecomment-840742157


   So I took a look at this and I have two solutions one which I believe always finds all possible inner joins but runs in N^2 and one that works for this case and should work for most others that runs in N LogN where N is the number of plans in the select statement. Is the N^2 runtime and the added petgraph dependency alright? I don't want datafusion to show up on [Accidentally quadratic](https://accidentallyquadratic.tumblr.com/) because of me haha. I've attached a brief description of each method incase there are any obvious improvements that can be made. 
   
   
   #### Thorough version
   [Code](https://github.com/pjmore/arrow-datafusion/blob/tpch-q9-planner/datafusion/src/sql/planner.rs#L471)
   1. Create undirected graph of plans where each edge represents a join connection.
   2. Find the connected components of the graph.
   3. For each connected component find the plan with the schema that has the fewest number of columns.
   4. Cross join all of the previously selected plans into a new node, removing the selected nodes from the graph.
   5. Starting from an arbitrary node, traverse the graph performing inner joins on all remaining nodes.
   
   #### Less thorough version
   [Code](https://github.com/pjmore/arrow-datafusion/blob/tpch-query-9-simple/datafusion/src/sql/planner.rs#L468)
   1. Iterate over plans counting the number of join key hits in each schema.
   2. Sort plans so that plans that have no key hits are at the beginning and then the vector is sorted in descending order. E.g. the array (0,0,1,2,3,4) would be sorted to (0,0, 4,3,2,1). The rationale being that cross joins should prefer to run early to limit the amount of data that needs to be duplicated and that there are more likely to be joins to plans with many join keys.
   3. Build the plan as was done previously.
   
   
   The less thorough version was able to find the inner join that was missed for query 9 but there is still the possibility to miss inner joins. For example if the tables have the join relations as seen below, the order of the plans after sorting would be (B,F,D,A,C,E,G). Which means that the plan would generate a cross join between B and F as they share no join key pairs, missing the inner join through D.
   ```
   ┌─┐                ┌─┐
   │A│                │E│
   └┬┘                └┬┘
    │                  │
   ┌┴┐      ┌─┐       ┌┴┐
   │B├──────┤D├───────┤F│
   └┬┘      └─┘       └┬┘
    │                  │
   ┌┴┐                ┌┴┐
   │C│                │G│
   └─┘                └─┘
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org