Posted to dev@phoenix.apache.org by "James Taylor (JIRA)" <ji...@apache.org> on 2016/04/14 21:23:25 UTC

[jira] [Commented] (PHOENIX-258) Use skip scan when SELECT DISTINCT on leading row key column(s)

    [ https://issues.apache.org/jira/browse/PHOENIX-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241760#comment-15241760 ] 

James Taylor commented on PHOENIX-258:
--------------------------------------

[~lhofhansl] - you ok if we assign this to you (as you've been dabbling with it recently)? 

FYI, to get the perf gain of the skip scan, I believe you'll need to pass a new boolean to the SkipScanFilter that indicates it's being used for DISTINCT. Otherwise, rather than skipping all duplicate rows, it's going to include them. So a tweak like this in SkipScanFilter.navigate():
{code}
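        // First check to see if we're in-range until we reach our end key.
        // isDistinct is the proposed new flag: when set, bypass this include
        // shortcut so that duplicate rows are skipped rather than included.
        if (endKeyLength > 0) {
            if (!this.isDistinct && Bytes.compareTo(currentKey, offset, length, endKey, 0, endKeyLength) < 0) {
                return getIncludeReturnCode();
            }
            // ... (rest of navigate() unchanged)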
{code}
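
To illustrate the idea outside of Phoenix, here is a minimal standalone sketch of "seek past duplicates" for DISTINCT on a leading key column. It does not use any Phoenix or HBase API: the class name, the `SEP` separator, and the helpers `key` and `distinctLeading` are all made up for the example, and a sorted `TreeSet` stands in for the row keys of a table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical illustration only; not Phoenix's SkipScanFilter.
final class DistinctSkipScanSketch {
    // Assumed separator between the leading column and the rest of the key.
    static final char SEP = '\u0000';

    // Builds a composite row key from the leading column and the date column.
    static String key(String a, String date) {
        return a + SEP + date;
    }

    // Returns the distinct leading-column values. After emitting a value, it
    // seeks directly to the first key beyond every key sharing that value,
    // instead of visiting each duplicate. seeks[0] counts the seeks issued.
    static List<String> distinctLeading(NavigableSet<String> rowKeys, int[] seeks) {
        List<String> result = new ArrayList<>();
        String current = rowKeys.isEmpty() ? null : rowKeys.first();
        while (current != null) {
            String prefix = current.substring(0, current.indexOf(SEP));
            result.add(prefix);
            // Smallest possible key that sorts after all keys "prefix + SEP + ...".
            String nextStart = prefix + (char) (SEP + 1);
            current = rowKeys.ceiling(nextStart);
            seeks[0]++;
        }
        return result;
    }

    public static void main(String[] args) {
        NavigableSet<String> rowKeys = new TreeSet<>();
        rowKeys.add(key("a", "2016-01-01"));
        rowKeys.add(key("a", "2016-01-02"));
        rowKeys.add(key("a", "2016-01-03"));
        rowKeys.add(key("b", "2016-01-01"));
        rowKeys.add(key("b", "2016-01-02"));
        int[] seeks = new int[1];
        System.out.println(distinctLeading(rowKeys, seeks) + " using " + seeks[0] + " seeks");
    }
}
```

The number of seeks here grows with the number of distinct leading values rather than with the total number of rows, which is where the perf gain would come from.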

Also, if there's a filter from the WHERE clause (i.e. any remaining filtering that was left over after computing the start/stop row of the scan), I suspect you won't be able to perform this optimization. In that case, you'd still need to traverse all the rows before aggregating, since you wouldn't know which of the duplicate rows match the filter without looking at each of them.
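
A small standalone sketch of why a leftover filter gets in the way, again with no Phoenix API involved. The `Row` class, the column `v`, and both method names are hypothetical. It compares a full traversal against a shortcut that evaluates the filter only on the first row of each leading-column value (which is what skipping the duplicates would amount to).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical illustration only; not Phoenix code.
final class ResidualFilterSketch {
    // A row with a leading key column "a" and some other column "v".
    static final class Row {
        final String a;
        final int v;

        Row(String a, int v) {
            this.a = a;
            this.v = v;
        }
    }

    // Correct: evaluate the filter on every row, then collect distinct values.
    static Set<String> distinctFullTraversal(List<Row> sortedRows, Predicate<Row> where) {
        Set<String> out = new LinkedHashSet<>();
        for (Row r : sortedRows) {
            if (where.test(r)) {
                out.add(r.a);
            }
        }
        return out;
    }

    // Shortcut: look only at the first row of each "a" and skip the duplicates.
    static Set<String> distinctFirstRowOnly(List<Row> sortedRows, Predicate<Row> where) {
        Set<String> out = new LinkedHashSet<>();
        String last = null;
        for (Row r : sortedRows) {
            if (r.a.equals(last)) {
                continue; // stands in for seeking past the duplicate rows
            }
            last = r.a;
            if (where.test(r)) {
                out.add(r.a);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("a", 1));
        rows.add(new Row("a", 20));
        rows.add(new Row("b", 30));
        rows.add(new Row("c", 2));
        Predicate<Row> where = r -> r.v > 10;
        System.out.println("full traversal: " + distinctFullTraversal(rows, where));
        System.out.println("first row only: " + distinctFirstRowOnly(rows, where));
    }
}
```

With this data the shortcut loses the value "a", because the first "a" row fails the filter while a later duplicate passes it.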

> Use skip scan when SELECT DISTINCT on leading row key column(s)
> ---------------------------------------------------------------
>
>                 Key: PHOENIX-258
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-258
>             Project: Phoenix
>          Issue Type: Task
>            Reporter: ryang-sfdc
>              Labels: gsoc2016
>             Fix For: 4.8.0
>
>
> create table(a varchar(32) not null, date date not null constraint pk primary key(a,date))
> [["PLAN"],["CLIENT PARALLEL 94-WAY FULL SCAN OVER foo"],["    SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY [a]"],["CLIENT MERGE SORT"]]             
> We should skip scan.
