Posted to dev@phoenix.apache.org by "James Taylor (JIRA)" <ji...@apache.org> on 2018/02/09 19:36:02 UTC

[jira] [Commented] (PHOENIX-1556) Base hash versus sort merge join decision on cost

    [ https://issues.apache.org/jira/browse/PHOENIX-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16358851#comment-16358851 ] 

James Taylor commented on PHOENIX-1556:
---------------------------------------

Wow, this is really awesome, [~maryannxue]. I love the tests. A couple of questions:
- Should UNION_DISTINCT_FACTOR be 1.0 since we only support UNION ALL currently?
{code}
+        if (!all) {
+            rows *= UNION_DISTINCT_FACTOR;
+        }
{code}
- What's the reasoning behind stripSkipScanFilter? Is that removed because its effect is already incorporated into the bytes scanned estimate?
- Should RowCountVisitor have a method for distinct? In particular, there's an optimization we have when doing a distinct on the leading PK columns which impacts cost. This optimization is not identified until runtime, so we might need to tweak the code so we know about it at compile time. This could be done in a separate patch.
- Somewhat orthogonal to your pull (but maybe building on top of it), do you think it'd be possible to prevent a query from running that's "too expensive" (assuming "too expensive" would be identified by a config property)? Something to keep in mind - I can file a separate JIRA for this.
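
To make the last question concrete, here is a rough sketch of what such a guard might look like. It is not taken from the patch: the class name QueryCostGuard, the property key example.query.max.cost, and the use of a single double as the cost are all hypothetical, chosen only to illustrate rejecting a plan whose estimated cost exceeds a configured limit.

```java
import java.util.Properties;

// Hypothetical sketch: rejects a query whose estimated cost exceeds a configured limit.
final class QueryCostGuard {
    // Hypothetical config key; this is NOT an existing Phoenix property.
    static final String MAX_COST_KEY = "example.query.max.cost";

    private final double maxCost;

    QueryCostGuard(Properties props) {
        // When the property is absent, treat the limit as unbounded.
        String value = props.getProperty(MAX_COST_KEY);
        this.maxCost = (value == null) ? Double.POSITIVE_INFINITY : Double.parseDouble(value);
    }

    // Throws if the compile-time cost estimate is above the configured limit.
    void check(double estimatedCost) {
        if (estimatedCost > maxCost) {
            throw new IllegalStateException(
                "Estimated cost " + estimatedCost + " exceeds limit " + maxCost);
        }
    }
}
```

The check would presumably run once the plan and its cost estimate are available at compile time, before any scan is issued.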

> Base hash versus sort merge join decision on cost
> -------------------------------------------------
>
>                 Key: PHOENIX-1556
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1556
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: Maryann Xue
>            Priority: Major
>              Labels: CostBasedOptimization
>         Attachments: PHOENIX-1556.patch
>
>
> At compile time, we know how many guideposts (i.e. how many bytes) will be scanned for the RHS table. We should, by default, base the decision of using the hash-join versus many-to-many join on this information.
> Another criterion (as we've seen in PHOENIX-4508) is whether or not the tables being joined are already ordered by the join key. In that case, it's better to always use the sort merge join.
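
The two criteria in the description above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the logic in the attached patch: JoinStrategyChooser, its parameters, and the idea of comparing the RHS byte estimate against a single hash-cache threshold are hypothetical simplifications of a real cost comparison.

```java
// Hypothetical sketch of the join-strategy decision described in the issue.
final class JoinStrategyChooser {
    enum JoinStrategy { HASH, SORT_MERGE }

    // rhsEstimatedBytes: bytes expected to be scanned for the RHS table (from guideposts).
    // maxHashCacheBytes: assumed upper bound on what a hash join may hold for the RHS.
    // bothSidesOrderedByJoinKey: true when both inputs are already ordered by the join key.
    static JoinStrategy choose(long rhsEstimatedBytes,
                               long maxHashCacheBytes,
                               boolean bothSidesOrderedByJoinKey) {
        // Inputs already ordered by the join key: sort merge needs no extra sort.
        if (bothSidesOrderedByJoinKey) {
            return JoinStrategy.SORT_MERGE;
        }
        // Otherwise prefer the hash join only when the RHS is small enough.
        return rhsEstimatedBytes <= maxHashCacheBytes
                ? JoinStrategy.HASH
                : JoinStrategy.SORT_MERGE;
    }
}
```

A cost-based version would replace the threshold test with a comparison of the estimated costs of the two candidate plans.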



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)