You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@asterixdb.apache.org by Taewoo Kim <wa...@gmail.com> on 2016/01/08 00:01:14 UTC

Fwd: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Hello dev,

Regarding this issue (ASTERIXDB-1249), I would like to make an index-join
hint clarification. Let's start with an example query other than a
self-join query in this issue.

for $c in dataset('Customers')

for $o in dataset('Orders')

where $c.cid /*+ indexnl */ = $o.cid

order by $c.cid, $o.oid

return {"cid":$c.cid, "oid": $o.oid}

Right now, in the master branch, the first dataset (e.g., Customers)
becomes the outer branch and the second dataset (e.g., Orders) becomes the
inner branch. And, when the optimizer tries to honor the given indexnl hint
(transforming a join into an index-nested-loop join), if there are
applicable indexes from the inner branch (e.g., Orders), then it is going
to use one of those indexes. If there are no applicable indexes from the
inner branch, it tries to use indexes from the outer branch (e.g.,
Customers). We are going to change the last part; we will not use indexes
from the outer branch. So, the following are refined rules for transforming
a join into an index-nested-loop join.

1. The first dataset in a join (the first parameter of the given join)
becomes the outer branch.
2. The second dataset in a join (the second parameter of the given join)
becomes the inner branch.
3. We only try to use applicable indexes from the inner branch. So, if
there are no applicable indexes from the inner branch, we abort
transforming a join into an index-nested-loop join.
4. Variable order in the given join predicate is not important. It can be
either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA.

So, for the left-outer join and inner join altogether, the left subtree is
the probing side and the right subtree is the index side. So, this can be
applied to the self-join case, too just like the following query in this
issue. In the following query, $t1 becomes the outer and $t2 becomes the
inner.

for $t1 in dataset('TweetMessages')

for $t2 in dataset('TweetMessages')

let $c := $t1.countA + 20

where $c /* +indexnl */= $t2.countB

order by $t2.tweetid

return {"tweetid2": $t2.tweetid, "count2":$t2.countB};

Thank you. Any opinion would be appreciated before I finalize this fix.

Best,
Taewoo

---------- Forwarded message ----------
From: Yingyi Bu (JIRA) <ji...@apache.org>
Date: Wed, Jan 6, 2016 at 10:23 AM
Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong
probe/index branch
To: notifications@asterixdb.incubator.apache.org



    [
https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
]

Yingyi Bu commented on ASTERIXDB-1249:
--------------------------------------

That's right. Basically in AcessMethod implementations, e.g.:
https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124

Choosing probe/index branch is only based on dataset names, instead of
being based on the join condition.

> Self index join chooses wrong probe/index branch
> ------------------------------------------------
>
>                 Key: ASTERIXDB-1249
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1249
>             Project: Apache AsterixDB
>          Issue Type: Bug
>          Components: Optimizer
>            Reporter: Yingyi Bu
>            Assignee: Taewoo Kim
>
> DDLs:
> {noformat}
> drop dataverse test if exists;
> create dataverse test;
> use dataverse test;
> create type TwitterUserType as closed {
>     screen-name: string,
>     lang: string,
>     friends-count: int64,
>     statuses-count: int64,
>     name: string,
>     followers-count: int64
> }
> create type TweetMessageType as closed {
>     tweetid: int64,
>         user: TwitterUserType,
>         sender-location: point,
>     send-time: datetime,
>         referred-topics: {{ string }},
>     message-text: string,
>     countA: int64,
>     countB: int64
> }
> create dataset TweetMessages(TweetMessageType)
> primary key tweetid;
> create index twmSndLocIx on TweetMessages(sender-location) type rtree;
> create index msgCountAIx on TweetMessages(countA) type btree;
> create index msgCountBIx on TweetMessages(countB) type btree;
> create index msgTextIx on TweetMessages(message-text) type keyword;
> {noformat}
> Query:
> {noformat}
> for $t1 in dataset('TweetMessages')
> for $t2 in dataset('TweetMessages')
> let $n :=  create-circle($t1.sender-location, 0.5)
> where spatial-intersect($t2.sender-location, $n)
> order by $t2.tweetid
> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
> {noformat}
> Optimized plan:
> {noformat}
> distribute result [%0->$$10]
> -- DISTRIBUTE_RESULT  |PARTITIONED|
>   exchange
>   -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>     project ([$$10])
>     -- STREAM_PROJECT  |PARTITIONED|
>       assign [$$10] <- [function-call: asterix:closed-record-constructor,
Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
>       -- ASSIGN  |PARTITIONED|
>         exchange
>         -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
>           order (ASC, %0->$$15)
>           -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
>             exchange
>             -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>               project ([$$13, $$15])
>               -- STREAM_PROJECT  |PARTITIONED|
>                 select (function-call: asterix:spatial-intersect,
Args:[%0->$$13, function-call: asterix:create-circle, Args:[function-call:
asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}], ADouble:
{0.5}]])
>                 -- STREAM_SELECT  |PARTITIONED|
>                   project ([$$0, $$13, $$15])
>                   -- STREAM_PROJECT  |PARTITIONED|
>                     exchange
>                     -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>                       unnest-map [$$14, $$0] <- function-call:
asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0}, AString:
{test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean: {false},
ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27, TRUE,
TRUE, TRUE]
>                       -- BTREE_SEARCH  |PARTITIONED|
>                         exchange
>                         -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>                           order (ASC, %0->$$27)
>                           -- STABLE_SORT [$$27(ASC)]  |PARTITIONED|
>                             exchange
>                             -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>                               project ([$$27, $$13, $$15])
>                               -- STREAM_PROJECT  |PARTITIONED|
>                                 exchange
>                                 -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>                                   unnest-map [$$23, $$24, $$25, $$26,
$$27] <- function-call: asterix:index-search, Args:[AString: {twmSndLocIx},
AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean: {true},
ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19, %0->$$20,
%0->$$21, %0->$$22]
>                                   -- RTREE_SEARCH  |PARTITIONED|
>                                     exchange
>                                     -- BROADCAST_EXCHANGE  |PARTITIONED|
>                                       assign [$$19, $$20, $$21, $$22] <-
[function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2}, AInt32:
{0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32:
{2}, AInt32: {2}], function-call: asterix:create-mbr, Args:[%0->$$13,
AInt32: {2}, AInt32: {3}]]
>                                       -- ASSIGN  |PARTITIONED|
>                                         project ([$$13, $$15])
>                                         -- STREAM_PROJECT  |PARTITIONED|
>                                           assign [$$13] <-
[function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32: {2}]]
>                                           -- ASSIGN  |PARTITIONED|
>                                             exchange
>                                             -- ONE_TO_ONE_EXCHANGE
|PARTITIONED|
>                                               data-scan []<-[$$15, $$1]
<- test:TweetMessages
>                                               -- DATASOURCE_SCAN
|PARTITIONED|
>                                                 exchange
>                                                 -- ONE_TO_ONE_EXCHANGE
|PARTITIONED|
>                                                   empty-tuple-source
>                                                   -- EMPTY_TUPLE_SOURCE
|PARTITIONED|
> {noformat}
> The optimized plan is incorrect --- the index search doesn't use the
right join condition and hence the result is different from expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Posted by Taewoo Kim <wa...@gmail.com>.
Yes. As Yingyi said in a previous e-mail, during the logical/physical plan,
probe (left branch of join) / build (right branch of join) is
systematically decided and it will be kept if "role reversal" will not
happen.

Best,
Taewoo

On Sun, Jan 10, 2016 at 9:16 PM, Yingyi Bu <bu...@gmail.com> wrote:

> >> the role reversal happens locally in later iterations of the hash join,
> right, based on the relative sizes of on-disk hash partitions?
> That's right.
>
> >>I think the key thing that we need to be consistent about is which is the
> "basic" outer and inner
> >> - which in the case of a parallel hash join would be which is the
> initial build dataset and which is the initial probe dataset.
> Got it!
> Thanks!
>
> Best,
> Yingyi
>
>
> On Sun, Jan 10, 2016 at 3:59 PM, Mike Carey <dt...@gmail.com> wrote:
>
> > Just to be (or get) clear - the role reversal happens locally in later
> > iterations of the hash join, right, based on the relative sizes of
> on-disk
> > hash partitions?  I think the key thing that we need to be consistent
> about
> > is which is the "basic" outer and inner - which in the case of a parallel
> > hash join would be which is the initial build dataset and which is the
> > initial probe dataset.
> >
> >
> > On 1/8/16 9:08 AM, Yingyi Bu wrote:
> >
> >> In left-outer hash join, if the the probe branch is locally clustered
> (or
> >> sorted) by a column superset of the join key,
> >> the output will still be locally clustered.
> >> Inner hash join couldn't maintain that because of the "role reversal"
> >> optimization in the runtime.
> >>
> >> Best,
> >> Yingyi
> >>
> >> On Thu, Jan 7, 2016 at 10:07 PM, Taewoo Kim <wa...@gmail.com> wrote:
> >>
> >> Interesting. Can you be more specific?
> >>>
> >>> Best,
> >>> Taewoo
> >>>
> >>> On Thu, Jan 7, 2016 at 6:36 PM, Yingyi Bu <bu...@gmail.com> wrote:
> >>>
> >>> Sorry, to be more precise:
> >>>> Left-outer hash join cannot preserve all local data properties for its
> >>>> probe branch (because spilling can happen) but can preserve (or
> >>>>
> >>> downgrade)
> >>>
> >>>> some when certain conditions meet.
> >>>>
> >>>> On Thu, Jan 7, 2016 at 6:28 PM, Yingyi Bu <bu...@gmail.com> wrote:
> >>>>
> >>>> In the logical/physical query plan, I think it is statically
> >>>>>
> >>>> determined.
> >>>
> >>>> However, that doesn't mean the execution is faithful to that
> >>>>>
> >>>> probe/build
> >>>
> >>>> decision because we have the "role reversal" optimization for inner
> >>>>>
> >>>> hash
> >>>
> >>>> joins:-)
> >>>>> (That's also why our inner hash join cannot maintain any local data
> >>>>> property from its probe branch, but left-outer hash can preserve
> that.)
> >>>>>
> >>>>> Best,
> >>>>> Yingyi
> >>>>>
> >>>>>
> >>>>> On Thu, Jan 7, 2016 at 5:34 PM, Mike Carey <dt...@gmail.com>
> wrote:
> >>>>>
> >>>>> Also, do we (separately want to make sure that our hash join behavior
> >>>>>>
> >>>>> is
> >>>
> >>>> comparable - i.e., that the initial build/probe decision is statically
> >>>>>> determined from the query?  (I think we do want that, and I think it
> >>>>>>
> >>>>> is
> >>>
> >>>> in
> >>>>
> >>>>> fact that way - but I'm not 100% sure, as its been awhile since that
> >>>>>>
> >>>>> was
> >>>
> >>>> discussed, and it's not in-cache for me. :-))
> >>>>>>
> >>>>>>
> >>>>>> On 1/7/16 3:22 PM, Taewoo Kim wrote:
> >>>>>>
> >>>>>> Thanks Yingyi. Yes. If there is an equality condition and if we
> can't
> >>>>>>> transform a join into an index-nested loop join, then a hybrid hash
> >>>>>>>
> >>>>>> join
> >>>>
> >>>>> will be used.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Taewoo
> >>>>>>>
> >>>>>>> On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <bu...@gmail.com>
> >>>>>>>
> >>>>>> wrote:
> >>>
> >>>> +1!
> >>>>>>>
> >>>>>>>> 3. We only try to use applicable indexes from the inner branch.
> So,
> >>>>>>>>>
> >>>>>>>> if
> >>>>
> >>>>> there are no applicable indexes from the inner branch, we abort
> >>>>>>>>>> transforming a join into an index-nested-loop join.
> >>>>>>>>>>
> >>>>>>>>>> "index-nested-loop join" -> "hybrid hash join"?
> >>>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>> Yingyi
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <wa...@gmail.com>
> >>>>>>>>
> >>>>>>> wrote:
> >>>>
> >>>>> Hello dev,
> >>>>>>>>
> >>>>>>>>> Regarding this issue (ASTERIXDB-1249), I would like to make an
> >>>>>>>>> index-join
> >>>>>>>>> hint clarification. Let's start with an example query other than
> a
> >>>>>>>>> self-join query in this issue.
> >>>>>>>>>
> >>>>>>>>> for $c in dataset('Customers')
> >>>>>>>>>
> >>>>>>>>> for $o in dataset('Orders')
> >>>>>>>>>
> >>>>>>>>> where $c.cid /*+ indexnl */ = $o.cid
> >>>>>>>>>
> >>>>>>>>> order by $c.cid, $o.oid
> >>>>>>>>>
> >>>>>>>>> return {"cid":$c.cid, "oid": $o.oid}
> >>>>>>>>>
> >>>>>>>>> Right now, in the master branch, the first dataset (e.g.,
> >>>>>>>>>
> >>>>>>>> Customers)
> >>>
> >>>> becomes the outer branch and the second dataset (e.g., Orders)
> >>>>>>>>>
> >>>>>>>> becomes
> >>>>
> >>>>> the
> >>>>>>>>
> >>>>>>>> inner branch. And, when the optimizer tries to honor the given
> >>>>>>>>>
> >>>>>>>> indexnl
> >>>>
> >>>>> hint
> >>>>>>>>
> >>>>>>>> (transforming a join into an index-nested-loop join), if there are
> >>>>>>>>> applicable indexes from the inner branch (e.g., Orders), then it
> is
> >>>>>>>>> going
> >>>>>>>>> to use one of those indexes. If there are no applicable indexes
> >>>>>>>>>
> >>>>>>>> from
> >>>
> >>>> the
> >>>>>>>>> inner branch, it tries to use indexes from the outer branch
> (e.g.,
> >>>>>>>>> Customers). We are going to change the last part; we will not use
> >>>>>>>>> indexes
> >>>>>>>>> from the outer branch. So, the following are refined rules for
> >>>>>>>>>
> >>>>>>>>> transforming
> >>>>>>>>
> >>>>>>>> a join into an index-nested-loop join.
> >>>>>>>>>
> >>>>>>>>> 1. The first dataset in a join (the first parameter of the given
> >>>>>>>>>
> >>>>>>>> join)
> >>>>
> >>>>> becomes the outer branch.
> >>>>>>>>> 2. The second dataset in a join (the second parameter of the
> given
> >>>>>>>>> join)
> >>>>>>>>> becomes the inner branch.
> >>>>>>>>> 3. We only try to use applicable indexes from the inner branch.
> So,
> >>>>>>>>>
> >>>>>>>> if
> >>>>
> >>>>> there are no applicable indexes from the inner branch, we abort
> >>>>>>>>> transforming a join into an index-nested-loop join.
> >>>>>>>>> 4. Variable order in the given join predicate is not important.
> It
> >>>>>>>>>
> >>>>>>>> can
> >>>>
> >>>>> be
> >>>>>>>>> either outer.fieldA = inner.fieldB or inner.fieldB =
> outer.fieldA.
> >>>>>>>>>
> >>>>>>>>> So, for the left-outer join and inner join altogether, the left
> >>>>>>>>>
> >>>>>>>> subtree
> >>>>
> >>>>> is
> >>>>>>>>
> >>>>>>>> the probing side and the right subtree is the index side. So, this
> >>>>>>>>>
> >>>>>>>> can
> >>>>
> >>>>> be
> >>>>>>>>> applied to the self-join case, too just like the following query
> in
> >>>>>>>>> this
> >>>>>>>>> issue. In the following query, $t1 becomes the outer and $t2
> >>>>>>>>>
> >>>>>>>> becomes
> >>>
> >>>> the
> >>>>>>>>> inner.
> >>>>>>>>>
> >>>>>>>>> for $t1 in dataset('TweetMessages')
> >>>>>>>>>
> >>>>>>>>> for $t2 in dataset('TweetMessages')
> >>>>>>>>>
> >>>>>>>>> let $c := $t1.countA + 20
> >>>>>>>>>
> >>>>>>>>> where $c /* +indexnl */= $t2.countB
> >>>>>>>>>
> >>>>>>>>> order by $t2.tweetid
> >>>>>>>>>
> >>>>>>>>> return {"tweetid2": $t2.tweetid, "count2":$t2.countB};
> >>>>>>>>>
> >>>>>>>>> Thank you. Any opinion would be appreciated before I finalize
> this
> >>>>>>>>>
> >>>>>>>> fix.
> >>>>
> >>>>> Best,
> >>>>>>>>> Taewoo
> >>>>>>>>>
> >>>>>>>>> ---------- Forwarded message ----------
> >>>>>>>>> From: Yingyi Bu (JIRA) <ji...@apache.org>
> >>>>>>>>> Date: Wed, Jan 6, 2016 at 10:23 AM
> >>>>>>>>> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join
> >>>>>>>>>
> >>>>>>>> chooses
> >>>
> >>>> wrong
> >>>>>>>>
> >>>>>>>> probe/index branch
> >>>>>>>>> To: notifications@asterixdb.incubator.apache.org
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>       [
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>
> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
> >>>
> >>>> ]
> >>>>>>>>>
> >>>>>>>>> Yingyi Bu commented on ASTERIXDB-1249:
> >>>>>>>>> --------------------------------------
> >>>>>>>>>
> >>>>>>>>> That's right. Basically in AcessMethod implementations, e.g.:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>
> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124
> >>>
> >>>> Choosing probe/index branch is only based on dataset names, instead
> >>>>>>>>>
> >>>>>>>> of
> >>>>
> >>>>> being based on the join condition.
> >>>>>>>>>
> >>>>>>>>> Self index join chooses wrong probe/index branch
> >>>>>>>>>
> >>>>>>>>>> ------------------------------------------------
> >>>>>>>>>>
> >>>>>>>>>>                   Key: ASTERIXDB-1249
> >>>>>>>>>>                   URL:
> >>>>>>>>>>
> >>>>>>>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249
> >>>>>>>>>
> >>>>>>>>>               Project: Apache AsterixDB
> >>>>>>>>>>            Issue Type: Bug
> >>>>>>>>>>            Components: Optimizer
> >>>>>>>>>>              Reporter: Yingyi Bu
> >>>>>>>>>>              Assignee: Taewoo Kim
> >>>>>>>>>>
> >>>>>>>>>> DDLs:
> >>>>>>>>>> {noformat}
> >>>>>>>>>> drop dataverse test if exists;
> >>>>>>>>>> create dataverse test;
> >>>>>>>>>> use dataverse test;
> >>>>>>>>>> create type TwitterUserType as closed {
> >>>>>>>>>>       screen-name: string,
> >>>>>>>>>>       lang: string,
> >>>>>>>>>>       friends-count: int64,
> >>>>>>>>>>       statuses-count: int64,
> >>>>>>>>>>       name: string,
> >>>>>>>>>>       followers-count: int64
> >>>>>>>>>> }
> >>>>>>>>>> create type TweetMessageType as closed {
> >>>>>>>>>>       tweetid: int64,
> >>>>>>>>>>           user: TwitterUserType,
> >>>>>>>>>>           sender-location: point,
> >>>>>>>>>>       send-time: datetime,
> >>>>>>>>>>           referred-topics: {{ string }},
> >>>>>>>>>>       message-text: string,
> >>>>>>>>>>       countA: int64,
> >>>>>>>>>>       countB: int64
> >>>>>>>>>> }
> >>>>>>>>>> create dataset TweetMessages(TweetMessageType)
> >>>>>>>>>> primary key tweetid;
> >>>>>>>>>> create index twmSndLocIx on TweetMessages(sender-location) type
> >>>>>>>>>>
> >>>>>>>>> rtree;
> >>>>
> >>>>> create index msgCountAIx on TweetMessages(countA) type btree;
> >>>>>>>>>> create index msgCountBIx on TweetMessages(countB) type btree;
> >>>>>>>>>> create index msgTextIx on TweetMessages(message-text) type
> >>>>>>>>>>
> >>>>>>>>> keyword;
> >>>
> >>>> {noformat}
> >>>>>>>>>> Query:
> >>>>>>>>>> {noformat}
> >>>>>>>>>> for $t1 in dataset('TweetMessages')
> >>>>>>>>>> for $t2 in dataset('TweetMessages')
> >>>>>>>>>> let $n :=  create-circle($t1.sender-location, 0.5)
> >>>>>>>>>> where spatial-intersect($t2.sender-location, $n)
> >>>>>>>>>> order by $t2.tweetid
> >>>>>>>>>> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
> >>>>>>>>>> {noformat}
> >>>>>>>>>> Optimized plan:
> >>>>>>>>>> {noformat}
> >>>>>>>>>> distribute result [%0->$$10]
> >>>>>>>>>> -- DISTRIBUTE_RESULT  |PARTITIONED|
> >>>>>>>>>>     exchange
> >>>>>>>>>>     -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>>>>>       project ([$$10])
> >>>>>>>>>>       -- STREAM_PROJECT  |PARTITIONED|
> >>>>>>>>>>         assign [$$10] <- [function-call:
> >>>>>>>>>>
> >>>>>>>>>> asterix:closed-record-constructor,
> >>>>>>>>> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
> >>>>>>>>>
> >>>>>>>>>         -- ASSIGN  |PARTITIONED|
> >>>>>>>>>>           exchange
> >>>>>>>>>>           -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
> >>>>>>>>>>             order (ASC, %0->$$15)
> >>>>>>>>>>             -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
> >>>>>>>>>>               exchange
> >>>>>>>>>>               -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>>>>>                 project ([$$13, $$15])
> >>>>>>>>>>                 -- STREAM_PROJECT  |PARTITIONED|
> >>>>>>>>>>                   select (function-call:
> >>>>>>>>>> asterix:spatial-intersect,
> >>>>>>>>>>
> >>>>>>>>>> Args:[%0->$$13, function-call: asterix:create-circle,
> >>>>>>>>>
> >>>>>>>>> Args:[function-call:
> >>>>>>>>
> >>>>>>>> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}],
> >>>>>>>>>
> >>>>>>>> ADouble:
> >>>
> >>>> {0.5}]])
> >>>>>>>>>
> >>>>>>>>>                   -- STREAM_SELECT  |PARTITIONED|
> >>>>>>>>>>                     project ([$$0, $$13, $$15])
> >>>>>>>>>>                     -- STREAM_PROJECT  |PARTITIONED|
> >>>>>>>>>>                       exchange
> >>>>>>>>>>                       -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>>>>>                         unnest-map [$$14, $$0] <- function-call:
> >>>>>>>>>>
> >>>>>>>>>> asterix:index-search, Args:[AString: {TweetMessages}, AInt32:
> {0},
> >>>>>>>>>
> >>>>>>>>> AString:
> >>>>>>>>
> >>>>>>>> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean:
> >>>>>>>>>
> >>>>>>>> {false},
> >>>>
> >>>>> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27,
> >>>>>>>>>
> >>>>>>>> TRUE,
> >>>>
> >>>>> TRUE, TRUE]
> >>>>>>>>>
> >>>>>>>>>                         -- BTREE_SEARCH  |PARTITIONED|
> >>>>>>>>>>                           exchange
> >>>>>>>>>>                           -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>>>>>                             order (ASC, %0->$$27)
> >>>>>>>>>>                             -- STABLE_SORT [$$27(ASC)]
> >>>>>>>>>>
> >>>>>>>>> |PARTITIONED|
> >>>
> >>>>                               exchange
> >>>>>>>>>>                               -- ONE_TO_ONE_EXCHANGE
> >>>>>>>>>> |PARTITIONED|
> >>>>>>>>>>                                 project ([$$27, $$13, $$15])
> >>>>>>>>>>                                 -- STREAM_PROJECT  |PARTITIONED|
> >>>>>>>>>>                                   exchange
> >>>>>>>>>>                                   -- ONE_TO_ONE_EXCHANGE
> >>>>>>>>>>
> >>>>>>>>> |PARTITIONED|
> >>>>
> >>>>>                                     unnest-map [$$23, $$24, $$25,
> >>>>>>>>>>
> >>>>>>>>> $$26,
> >>>>
> >>>>> $$27] <- function-call: asterix:index-search, Args:[AString:
> >>>>>>>>>
> >>>>>>>>> {twmSndLocIx},
> >>>>>>>>
> >>>>>>>> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean:
> >>>>>>>>> {true},
> >>>>>>>>> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19,
> >>>>>>>>>
> >>>>>>>> %0->$$20,
> >>>
> >>>> %0->$$21, %0->$$22]
> >>>>>>>>>
> >>>>>>>>>                                     -- RTREE_SEARCH
> |PARTITIONED|
> >>>>>>>>>>                                       exchange
> >>>>>>>>>>                                       -- BROADCAST_EXCHANGE
> >>>>>>>>>>
> >>>>>>>>>> |PARTITIONED|
> >>>>>>>>>                                         assign [$$19, $$20, $$21,
> >>>>>>>>>
> >>>>>>>> $$22]
> >>>>
> >>>>> <-
> >>>>>>>>> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
> >>>>>>>>> AInt32:
> >>>>>>>>> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32:
> >>>>>>>>>
> >>>>>>>> {2},
> >>>
> >>>> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13,
> >>>>>>>>> AInt32:
> >>>>>>>>> {2}, AInt32: {2}], function-call: asterix:create-mbr,
> >>>>>>>>>
> >>>>>>>> Args:[%0->$$13,
> >>>
> >>>> AInt32: {2}, AInt32: {3}]]
> >>>>>>>>>
> >>>>>>>>>                                         -- ASSIGN  |PARTITIONED|
> >>>>>>>>>>                                           project ([$$13, $$15])
> >>>>>>>>>>                                           -- STREAM_PROJECT
> >>>>>>>>>>
> >>>>>>>>>> |PARTITIONED|
> >>>>>>>>>                                             assign [$$13] <-
> >>>>>>>>> [function-call: asterix:field-access-by-index, Args:[%0->$$1,
> >>>>>>>>>
> >>>>>>>> AInt32:
> >>>
> >>>> {2}]]
> >>>>>>>>
> >>>>>>>>                                             -- ASSIGN
> |PARTITIONED|
> >>>>>>>>>
> >>>>>>>>>>                                               exchange
> >>>>>>>>>>                                               --
> >>>>>>>>>>
> >>>>>>>>> ONE_TO_ONE_EXCHANGE
> >>>
> >>>> |PARTITIONED|
> >>>>>>>>>
> >>>>>>>>>                                                 data-scan
> >>>>>>>>>>
> >>>>>>>>> []<-[$$15,
> >>>
> >>>> $$1]
> >>>>>>>>>>
> >>>>>>>>>> <- test:TweetMessages
> >>>>>>>>>
> >>>>>>>>>                                                 --
> DATASOURCE_SCAN
> >>>>>>>>>>
> >>>>>>>>>> |PARTITIONED|
> >>>>>>>>>
> >>>>>>>>>                                                   exchange
> >>>>>>>>>>                                                   --
> >>>>>>>>>> ONE_TO_ONE_EXCHANGE
> >>>>>>>>>>
> >>>>>>>>>> |PARTITIONED|
> >>>>>>>>>
> >>>>>>>>> empty-tuple-source
> >>>>
> >>>>>                                                     --
> >>>>>>>>>> EMPTY_TUPLE_SOURCE
> >>>>>>>>>>
> >>>>>>>>>> |PARTITIONED|
> >>>>>>>>>
> >>>>>>>>> {noformat}
> >>>>>>>>>> The optimized plan is incorrect --- the index search doesn't use
> >>>>>>>>>>
> >>>>>>>>> the
> >>>
> >>>> right join condition and hence the result is different from
> >>>>>>>>>
> >>>>>>>> expected.
> >>>
> >>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> This message was sent by Atlassian JIRA
> >>>>>>>>> (v6.3.4#6332)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >
>

Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Posted by Yingyi Bu <bu...@gmail.com>.
>> the role reversal happens locally in later iterations of the hash join,
right, based on the relative sizes of on-disk hash partitions?
That's right.

>>I think the key thing that we need to be consistent about is which is the
"basic" outer and inner
>> - which in the case of a parallel hash join would be which is the
initial build dataset and which is the initial probe dataset.
Got it!
Thanks!

Best,
Yingyi


On Sun, Jan 10, 2016 at 3:59 PM, Mike Carey <dt...@gmail.com> wrote:

> Just to be (or get) clear - the role reversal happens locally in later
> iterations of the hash join, right, based on the relative sizes of on-disk
> hash partitions?  I think the key thing that we need to be consistent about
> is which is the "basic" outer and inner - which in the case of a parallel
> hash join would be which is the initial build dataset and which is the
> initial probe dataset.
>
>
> On 1/8/16 9:08 AM, Yingyi Bu wrote:
>
>> In left-outer hash join, if the the probe branch is locally clustered (or
>> sorted) by a column superset of the join key,
>> the output will still be locally clustered.
>> Inner hash join couldn't maintain that because of the "role reversal"
>> optimization in the runtime.
>>
>> Best,
>> Yingyi
>>
>> On Thu, Jan 7, 2016 at 10:07 PM, Taewoo Kim <wa...@gmail.com> wrote:
>>
>> Interesting. Can you be more specific?
>>>
>>> Best,
>>> Taewoo
>>>
>>> On Thu, Jan 7, 2016 at 6:36 PM, Yingyi Bu <bu...@gmail.com> wrote:
>>>
>>> Sorry, to be more precise:
>>>> Left-outer hash join cannot preserve all local data properties for its
>>>> probe branch (because spilling can happen) but can preserve (or
>>>>
>>> downgrade)
>>>
>>>> some when certain conditions meet.
>>>>
>>>> On Thu, Jan 7, 2016 at 6:28 PM, Yingyi Bu <bu...@gmail.com> wrote:
>>>>
>>>> In the logical/physical query plan, I think it is statically
>>>>>
>>>> determined.
>>>
>>>> However, that doesn't mean the execution is faithful to that
>>>>>
>>>> probe/build
>>>
>>>> decision because we have the "role reversal" optimization for inner
>>>>>
>>>> hash
>>>
>>>> joins:-)
>>>>> (That's also why our inner hash join cannot maintain any local data
>>>>> property from its probe branch, but left-outer hash can preserve that.)
>>>>>
>>>>> Best,
>>>>> Yingyi
>>>>>
>>>>>
>>>>> On Thu, Jan 7, 2016 at 5:34 PM, Mike Carey <dt...@gmail.com> wrote:
>>>>>
>>>>> Also, do we (separately want to make sure that our hash join behavior
>>>>>>
>>>>> is
>>>
>>>> comparable - i.e., that the initial build/probe decision is statically
>>>>>> determined from the query?  (I think we do want that, and I think it
>>>>>>
>>>>> is
>>>
>>>> in
>>>>
>>>>> fact that way - but I'm not 100% sure, as its been awhile since that
>>>>>>
>>>>> was
>>>
>>>> discussed, and it's not in-cache for me. :-))
>>>>>>
>>>>>>
>>>>>> On 1/7/16 3:22 PM, Taewoo Kim wrote:
>>>>>>
>>>>>> Thanks Yingyi. Yes. If there is an equality condition and if we can't
>>>>>>> transform a join into an index-nested loop join, then a hybrid hash
>>>>>>>
>>>>>> join
>>>>
>>>>> will be used.
>>>>>>>
>>>>>>> Best,
>>>>>>> Taewoo
>>>>>>>
>>>>>>> On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <bu...@gmail.com>
>>>>>>>
>>>>>> wrote:
>>>
>>>> +1!
>>>>>>>
>>>>>>>> 3. We only try to use applicable indexes from the inner branch. So,
>>>>>>>>>
>>>>>>>> if
>>>>
>>>>> there are no applicable indexes from the inner branch, we abort
>>>>>>>>>> transforming a join into an index-nested-loop join.
>>>>>>>>>>
>>>>>>>>>> "index-nested-loop join" -> "hybrid hash join"?
>>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Yingyi
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <wa...@gmail.com>
>>>>>>>>
>>>>>>> wrote:
>>>>
>>>>> Hello dev,
>>>>>>>>
>>>>>>>>> Regarding this issue (ASTERIXDB-1249), I would like to make an
>>>>>>>>> index-join
>>>>>>>>> hint clarification. Let's start with an example query other than a
>>>>>>>>> self-join query in this issue.
>>>>>>>>>
>>>>>>>>> for $c in dataset('Customers')
>>>>>>>>>
>>>>>>>>> for $o in dataset('Orders')
>>>>>>>>>
>>>>>>>>> where $c.cid /*+ indexnl */ = $o.cid
>>>>>>>>>
>>>>>>>>> order by $c.cid, $o.oid
>>>>>>>>>
>>>>>>>>> return {"cid":$c.cid, "oid": $o.oid}
>>>>>>>>>
>>>>>>>>> Right now, in the master branch, the first dataset (e.g.,
>>>>>>>>>
>>>>>>>> Customers)
>>>
>>>> becomes the outer branch and the second dataset (e.g., Orders)
>>>>>>>>>
>>>>>>>> becomes
>>>>
>>>>> the
>>>>>>>>
>>>>>>>> inner branch. And, when the optimizer tries to honor the given
>>>>>>>>>
>>>>>>>> indexnl
>>>>
>>>>> hint
>>>>>>>>
>>>>>>>> (transforming a join into an index-nested-loop join), if there are
>>>>>>>>> applicable indexes from the inner branch (e.g., Orders), then it is
>>>>>>>>> going
>>>>>>>>> to use one of those indexes. If there are no applicable indexes
>>>>>>>>>
>>>>>>>> from
>>>
>>>> the
>>>>>>>>> inner branch, it tries to use indexes from the outer branch (e.g.,
>>>>>>>>> Customers). We are going to change the last part; we will not use
>>>>>>>>> indexes
>>>>>>>>> from the outer branch. So, the following are refined rules for
>>>>>>>>>
>>>>>>>>> transforming
>>>>>>>>
>>>>>>>> a join into an index-nested-loop join.
>>>>>>>>>
>>>>>>>>> 1. The first dataset in a join (the first parameter of the given
>>>>>>>>>
>>>>>>>> join)
>>>>
>>>>> becomes the outer branch.
>>>>>>>>> 2. The second dataset in a join (the second parameter of the given
>>>>>>>>> join)
>>>>>>>>> becomes the inner branch.
>>>>>>>>> 3. We only try to use applicable indexes from the inner branch. So,
>>>>>>>>>
>>>>>>>> if
>>>>
>>>>> there are no applicable indexes from the inner branch, we abort
>>>>>>>>> transforming a join into an index-nested-loop join.
>>>>>>>>> 4. Variable order in the given join predicate is not important. It
>>>>>>>>>
>>>>>>>> can
>>>>
>>>>> be
>>>>>>>>> either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA.
>>>>>>>>>
>>>>>>>>> So, for the left-outer join and inner join altogether, the left
>>>>>>>>>
>>>>>>>> subtree
>>>>
>>>>> is
>>>>>>>>
>>>>>>>> the probing side and the right subtree is the index side. So, this
>>>>>>>>>
>>>>>>>> can
>>>>
>>>>> be
>>>>>>>>> applied to the self-join case, too just like the following query in
>>>>>>>>> this
>>>>>>>>> issue. In the following query, $t1 becomes the outer and $t2
>>>>>>>>>
>>>>>>>> becomes
>>>
>>>> the
>>>>>>>>> inner.
>>>>>>>>>
>>>>>>>>> for $t1 in dataset('TweetMessages')
>>>>>>>>>
>>>>>>>>> for $t2 in dataset('TweetMessages')
>>>>>>>>>
>>>>>>>>> let $c := $t1.countA + 20
>>>>>>>>>
>>>>>>>>> where $c /* +indexnl */= $t2.countB
>>>>>>>>>
>>>>>>>>> order by $t2.tweetid
>>>>>>>>>
>>>>>>>>> return {"tweetid2": $t2.tweetid, "count2":$t2.countB};
>>>>>>>>>
>>>>>>>>> Thank you. Any opinion would be appreciated before I finalize this
>>>>>>>>>
>>>>>>>> fix.
>>>>
>>>>> Best,
>>>>>>>>> Taewoo
>>>>>>>>>
>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>> From: Yingyi Bu (JIRA) <ji...@apache.org>
>>>>>>>>> Date: Wed, Jan 6, 2016 at 10:23 AM
>>>>>>>>> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join
>>>>>>>>>
>>>>>>>> chooses
>>>
>>>> wrong
>>>>>>>>
>>>>>>>> probe/index branch
>>>>>>>>> To: notifications@asterixdb.incubator.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>       [
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
>>>
>>>> ]
>>>>>>>>>
>>>>>>>>> Yingyi Bu commented on ASTERIXDB-1249:
>>>>>>>>> --------------------------------------
>>>>>>>>>
>>>>>>>>> That's right. Basically in AcessMethod implementations, e.g.:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124
>>>
>>>> Choosing probe/index branch is only based on dataset names, instead
>>>>>>>>>
>>>>>>>> of
>>>>
>>>>> being based on the join condition.
>>>>>>>>>
>>>>>>>>> Self index join chooses wrong probe/index branch
>>>>>>>>>
>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>
>>>>>>>>>>                   Key: ASTERIXDB-1249
>>>>>>>>>>                   URL:
>>>>>>>>>>
>>>>>>>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249
>>>>>>>>>
>>>>>>>>>               Project: Apache AsterixDB
>>>>>>>>>>            Issue Type: Bug
>>>>>>>>>>            Components: Optimizer
>>>>>>>>>>              Reporter: Yingyi Bu
>>>>>>>>>>              Assignee: Taewoo Kim
>>>>>>>>>>
>>>>>>>>>> DDLs:
>>>>>>>>>> {noformat}
>>>>>>>>>> drop dataverse test if exists;
>>>>>>>>>> create dataverse test;
>>>>>>>>>> use dataverse test;
>>>>>>>>>> create type TwitterUserType as closed {
>>>>>>>>>>       screen-name: string,
>>>>>>>>>>       lang: string,
>>>>>>>>>>       friends-count: int64,
>>>>>>>>>>       statuses-count: int64,
>>>>>>>>>>       name: string,
>>>>>>>>>>       followers-count: int64
>>>>>>>>>> }
>>>>>>>>>> create type TweetMessageType as closed {
>>>>>>>>>>       tweetid: int64,
>>>>>>>>>>           user: TwitterUserType,
>>>>>>>>>>           sender-location: point,
>>>>>>>>>>       send-time: datetime,
>>>>>>>>>>           referred-topics: {{ string }},
>>>>>>>>>>       message-text: string,
>>>>>>>>>>       countA: int64,
>>>>>>>>>>       countB: int64
>>>>>>>>>> }
>>>>>>>>>> create dataset TweetMessages(TweetMessageType)
>>>>>>>>>> primary key tweetid;
>>>>>>>>>> create index twmSndLocIx on TweetMessages(sender-location) type
>>>>>>>>>>
>>>>>>>>> rtree;
>>>>
>>>>> create index msgCountAIx on TweetMessages(countA) type btree;
>>>>>>>>>> create index msgCountBIx on TweetMessages(countB) type btree;
>>>>>>>>>> create index msgTextIx on TweetMessages(message-text) type
>>>>>>>>>>
>>>>>>>>> keyword;
>>>
>>>> {noformat}
>>>>>>>>>> Query:
>>>>>>>>>> {noformat}
>>>>>>>>>> for $t1 in dataset('TweetMessages')
>>>>>>>>>> for $t2 in dataset('TweetMessages')
>>>>>>>>>> let $n :=  create-circle($t1.sender-location, 0.5)
>>>>>>>>>> where spatial-intersect($t2.sender-location, $n)
>>>>>>>>>> order by $t2.tweetid
>>>>>>>>>> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
>>>>>>>>>> {noformat}
>>>>>>>>>> Optimized plan:
>>>>>>>>>> {noformat}
>>>>>>>>>> distribute result [%0->$$10]
>>>>>>>>>> -- DISTRIBUTE_RESULT  |PARTITIONED|
>>>>>>>>>>     exchange
>>>>>>>>>>     -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>>>>>>       project ([$$10])
>>>>>>>>>>       -- STREAM_PROJECT  |PARTITIONED|
>>>>>>>>>>         assign [$$10] <- [function-call:
>>>>>>>>>>
>>>>>>>>>> asterix:closed-record-constructor,
>>>>>>>>> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
>>>>>>>>>
>>>>>>>>>         -- ASSIGN  |PARTITIONED|
>>>>>>>>>>           exchange
>>>>>>>>>>           -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
>>>>>>>>>>             order (ASC, %0->$$15)
>>>>>>>>>>             -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
>>>>>>>>>>               exchange
>>>>>>>>>>               -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>>>>>>                 project ([$$13, $$15])
>>>>>>>>>>                 -- STREAM_PROJECT  |PARTITIONED|
>>>>>>>>>>                   select (function-call:
>>>>>>>>>> asterix:spatial-intersect,
>>>>>>>>>>
>>>>>>>>>> Args:[%0->$$13, function-call: asterix:create-circle,
>>>>>>>>>
>>>>>>>>> Args:[function-call:
>>>>>>>>
>>>>>>>> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}],
>>>>>>>>>
>>>>>>>> ADouble:
>>>
>>>> {0.5}]])
>>>>>>>>>
>>>>>>>>>                   -- STREAM_SELECT  |PARTITIONED|
>>>>>>>>>>                     project ([$$0, $$13, $$15])
>>>>>>>>>>                     -- STREAM_PROJECT  |PARTITIONED|
>>>>>>>>>>                       exchange
>>>>>>>>>>                       -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>>>>>>                         unnest-map [$$14, $$0] <- function-call:
>>>>>>>>>>
>>>>>>>>>> asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0},
>>>>>>>>>
>>>>>>>>> AString:
>>>>>>>>
>>>>>>>> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean:
>>>>>>>>>
>>>>>>>> {false},
>>>>
>>>>> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27,
>>>>>>>>>
>>>>>>>> TRUE,
>>>>
>>>>> TRUE, TRUE]
>>>>>>>>>
>>>>>>>>>                         -- BTREE_SEARCH  |PARTITIONED|
>>>>>>>>>>                           exchange
>>>>>>>>>>                           -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>>>>>>                             order (ASC, %0->$$27)
>>>>>>>>>>                             -- STABLE_SORT [$$27(ASC)]
>>>>>>>>>>
>>>>>>>>> |PARTITIONED|
>>>
>>>>                               exchange
>>>>>>>>>>                               -- ONE_TO_ONE_EXCHANGE
>>>>>>>>>> |PARTITIONED|
>>>>>>>>>>                                 project ([$$27, $$13, $$15])
>>>>>>>>>>                                 -- STREAM_PROJECT  |PARTITIONED|
>>>>>>>>>>                                   exchange
>>>>>>>>>>                                   -- ONE_TO_ONE_EXCHANGE
>>>>>>>>>>
>>>>>>>>> |PARTITIONED|
>>>>
>>>>>                                     unnest-map [$$23, $$24, $$25,
>>>>>>>>>>
>>>>>>>>> $$26,
>>>>
>>>>> $$27] <- function-call: asterix:index-search, Args:[AString:
>>>>>>>>>
>>>>>>>>> {twmSndLocIx},
>>>>>>>>
>>>>>>>> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean:
>>>>>>>>> {true},
>>>>>>>>> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19,
>>>>>>>>>
>>>>>>>> %0->$$20,
>>>
>>>> %0->$$21, %0->$$22]
>>>>>>>>>
>>>>>>>>>                                     -- RTREE_SEARCH  |PARTITIONED|
>>>>>>>>>>                                       exchange
>>>>>>>>>>                                       -- BROADCAST_EXCHANGE
>>>>>>>>>>
>>>>>>>>>> |PARTITIONED|
>>>>>>>>>                                         assign [$$19, $$20, $$21,
>>>>>>>>>
>>>>>>>> $$22]
>>>>
>>>>> <-
>>>>>>>>> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
>>>>>>>>> AInt32:
>>>>>>>>> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32:
>>>>>>>>>
>>>>>>>> {2},
>>>
>>>> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13,
>>>>>>>>> AInt32:
>>>>>>>>> {2}, AInt32: {2}], function-call: asterix:create-mbr,
>>>>>>>>>
>>>>>>>> Args:[%0->$$13,
>>>
>>>> AInt32: {2}, AInt32: {3}]]
>>>>>>>>>
>>>>>>>>>                                         -- ASSIGN  |PARTITIONED|
>>>>>>>>>>                                           project ([$$13, $$15])
>>>>>>>>>>                                           -- STREAM_PROJECT
>>>>>>>>>>
>>>>>>>>>> |PARTITIONED|
>>>>>>>>>                                             assign [$$13] <-
>>>>>>>>> [function-call: asterix:field-access-by-index, Args:[%0->$$1,
>>>>>>>>>
>>>>>>>> AInt32:
>>>
>>>> {2}]]
>>>>>>>>
>>>>>>>>                                             -- ASSIGN  |PARTITIONED|
>>>>>>>>>
>>>>>>>>>>                                               exchange
>>>>>>>>>>                                               --
>>>>>>>>>>
>>>>>>>>> ONE_TO_ONE_EXCHANGE
>>>
>>>> |PARTITIONED|
>>>>>>>>>
>>>>>>>>>                                                 data-scan
>>>>>>>>>>
>>>>>>>>> []<-[$$15,
>>>
>>>> $$1]
>>>>>>>>>>
>>>>>>>>>> <- test:TweetMessages
>>>>>>>>>
>>>>>>>>>                                                 -- DATASOURCE_SCAN
>>>>>>>>>>
>>>>>>>>>> |PARTITIONED|
>>>>>>>>>
>>>>>>>>>                                                   exchange
>>>>>>>>>>                                                   --
>>>>>>>>>> ONE_TO_ONE_EXCHANGE
>>>>>>>>>>
>>>>>>>>>> |PARTITIONED|
>>>>>>>>>
>>>>>>>>> empty-tuple-source
>>>>
>>>>>                                                     --
>>>>>>>>>> EMPTY_TUPLE_SOURCE
>>>>>>>>>>
>>>>>>>>>> |PARTITIONED|
>>>>>>>>>
>>>>>>>>> {noformat}
>>>>>>>>>> The optimized plan is incorrect --- the index search doesn't use
>>>>>>>>>>
>>>>>>>>> the
>>>
>>>> right join condition and hence the result is different from
>>>>>>>>>
>>>>>>>> expected.
>>>
>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> This message was sent by Atlassian JIRA
>>>>>>>>> (v6.3.4#6332)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>

Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Posted by Mike Carey <dt...@gmail.com>.
Just to be (or get) clear - the role reversal happens locally in later 
iterations of the hash join, right, based on the relative sizes of 
on-disk hash partitions?  I think the key thing that we need to be 
consistent about is which is the "basic" outer and inner - which in the 
case of a parallel hash join would be which is the initial build dataset 
and which is the initial probe dataset.

On 1/8/16 9:08 AM, Yingyi Bu wrote:
> In left-outer hash join, if the the probe branch is locally clustered (or
> sorted) by a column superset of the join key,
> the output will still be locally clustered.
> Inner hash join couldn't maintain that because of the "role reversal"
> optimization in the runtime.
>
> Best,
> Yingyi
>
> On Thu, Jan 7, 2016 at 10:07 PM, Taewoo Kim <wa...@gmail.com> wrote:
>
>> Interesting. Can you be more specific?
>>
>> Best,
>> Taewoo
>>
>> On Thu, Jan 7, 2016 at 6:36 PM, Yingyi Bu <bu...@gmail.com> wrote:
>>
>>> Sorry, to be more precise:
>>> Left-outer hash join cannot preserve all local data properties for its
>>> probe branch (because spilling can happen) but can preserve (or
>> downgrade)
>>> some when certain conditions meet.
>>>
>>> On Thu, Jan 7, 2016 at 6:28 PM, Yingyi Bu <bu...@gmail.com> wrote:
>>>
>>>> In the logical/physical query plan, I think it is statically
>> determined.
>>>> However, that doesn't mean the execution is faithful to that
>> probe/build
>>>> decision because we have the "role reversal" optimization for inner
>> hash
>>>> joins:-)
>>>> (That's also why our inner hash join cannot maintain any local data
>>>> property from its probe branch, but left-outer hash can preserve that.)
>>>>
>>>> Best,
>>>> Yingyi
>>>>
>>>>
>>>> On Thu, Jan 7, 2016 at 5:34 PM, Mike Carey <dt...@gmail.com> wrote:
>>>>
>>>>> Also, do we (separately want to make sure that our hash join behavior
>> is
>>>>> comparable - i.e., that the initial build/probe decision is statically
>>>>> determined from the query?  (I think we do want that, and I think it
>> is
>>> in
>>>>> fact that way - but I'm not 100% sure, as its been awhile since that
>> was
>>>>> discussed, and it's not in-cache for me. :-))
>>>>>
>>>>>
>>>>> On 1/7/16 3:22 PM, Taewoo Kim wrote:
>>>>>
>>>>>> Thanks Yingyi. Yes. If there is an equality condition and if we can't
>>>>>> transform a join into an index-nested loop join, then a hybrid hash
>>> join
>>>>>> will be used.
>>>>>>
>>>>>> Best,
>>>>>> Taewoo
>>>>>>
>>>>>> On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <bu...@gmail.com>
>> wrote:
>>>>>> +1!
>>>>>>>> 3. We only try to use applicable indexes from the inner branch. So,
>>> if
>>>>>>>>> there are no applicable indexes from the inner branch, we abort
>>>>>>>>> transforming a join into an index-nested-loop join.
>>>>>>>>>
>>>>>>>> "index-nested-loop join" -> "hybrid hash join"?
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Yingyi
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <wa...@gmail.com>
>>> wrote:
>>>>>>> Hello dev,
>>>>>>>> Regarding this issue (ASTERIXDB-1249), I would like to make an
>>>>>>>> index-join
>>>>>>>> hint clarification. Let's start with an example query other than a
>>>>>>>> self-join query in this issue.
>>>>>>>>
>>>>>>>> for $c in dataset('Customers')
>>>>>>>>
>>>>>>>> for $o in dataset('Orders')
>>>>>>>>
>>>>>>>> where $c.cid /*+ indexnl */ = $o.cid
>>>>>>>>
>>>>>>>> order by $c.cid, $o.oid
>>>>>>>>
>>>>>>>> return {"cid":$c.cid, "oid": $o.oid}
>>>>>>>>
>>>>>>>> Right now, in the master branch, the first dataset (e.g.,
>> Customers)
>>>>>>>> becomes the outer branch and the second dataset (e.g., Orders)
>>> becomes
>>>>>>> the
>>>>>>>
>>>>>>>> inner branch. And, when the optimizer tries to honor the given
>>> indexnl
>>>>>>> hint
>>>>>>>
>>>>>>>> (transforming a join into an index-nested-loop join), if there are
>>>>>>>> applicable indexes from the inner branch (e.g., Orders), then it is
>>>>>>>> going
>>>>>>>> to use one of those indexes. If there are no applicable indexes
>> from
>>>>>>>> the
>>>>>>>> inner branch, it tries to use indexes from the outer branch (e.g.,
>>>>>>>> Customers). We are going to change the last part; we will not use
>>>>>>>> indexes
>>>>>>>> from the outer branch. So, the following are refined rules for
>>>>>>>>
>>>>>>> transforming
>>>>>>>
>>>>>>>> a join into an index-nested-loop join.
>>>>>>>>
>>>>>>>> 1. The first dataset in a join (the first parameter of the given
>>> join)
>>>>>>>> becomes the outer branch.
>>>>>>>> 2. The second dataset in a join (the second parameter of the given
>>>>>>>> join)
>>>>>>>> becomes the inner branch.
>>>>>>>> 3. We only try to use applicable indexes from the inner branch. So,
>>> if
>>>>>>>> there are no applicable indexes from the inner branch, we abort
>>>>>>>> transforming a join into an index-nested-loop join.
>>>>>>>> 4. Variable order in the given join predicate is not important. It
>>> can
>>>>>>>> be
>>>>>>>> either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA.
>>>>>>>>
>>>>>>>> So, for the left-outer join and inner join altogether, the left
>>> subtree
>>>>>>> is
>>>>>>>
>>>>>>>> the probing side and the right subtree is the index side. So, this
>>> can
>>>>>>>> be
>>>>>>>> applied to the self-join case, too just like the following query in
>>>>>>>> this
>>>>>>>> issue. In the following query, $t1 becomes the outer and $t2
>> becomes
>>>>>>>> the
>>>>>>>> inner.
>>>>>>>>
>>>>>>>> for $t1 in dataset('TweetMessages')
>>>>>>>>
>>>>>>>> for $t2 in dataset('TweetMessages')
>>>>>>>>
>>>>>>>> let $c := $t1.countA + 20
>>>>>>>>
>>>>>>>> where $c /* +indexnl */= $t2.countB
>>>>>>>>
>>>>>>>> order by $t2.tweetid
>>>>>>>>
>>>>>>>> return {"tweetid2": $t2.tweetid, "count2":$t2.countB};
>>>>>>>>
>>>>>>>> Thank you. Any opinion would be appreciated before I finalize this
>>> fix.
>>>>>>>> Best,
>>>>>>>> Taewoo
>>>>>>>>
>>>>>>>> ---------- Forwarded message ----------
>>>>>>>> From: Yingyi Bu (JIRA) <ji...@apache.org>
>>>>>>>> Date: Wed, Jan 6, 2016 at 10:23 AM
>>>>>>>> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join
>> chooses
>>>>>>> wrong
>>>>>>>
>>>>>>>> probe/index branch
>>>>>>>> To: notifications@asterixdb.incubator.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>       [
>>>>>>>>
>>>>>>>>
>>>>>>>>
>> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
>>>>>>>> ]
>>>>>>>>
>>>>>>>> Yingyi Bu commented on ASTERIXDB-1249:
>>>>>>>> --------------------------------------
>>>>>>>>
>>>>>>>> That's right. Basically in AcessMethod implementations, e.g.:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124
>>>>>>>> Choosing probe/index branch is only based on dataset names, instead
>>> of
>>>>>>>> being based on the join condition.
>>>>>>>>
>>>>>>>> Self index join chooses wrong probe/index branch
>>>>>>>>> ------------------------------------------------
>>>>>>>>>
>>>>>>>>>                   Key: ASTERIXDB-1249
>>>>>>>>>                   URL:
>>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249
>>>>>>>>
>>>>>>>>>               Project: Apache AsterixDB
>>>>>>>>>            Issue Type: Bug
>>>>>>>>>            Components: Optimizer
>>>>>>>>>              Reporter: Yingyi Bu
>>>>>>>>>              Assignee: Taewoo Kim
>>>>>>>>>
>>>>>>>>> DDLs:
>>>>>>>>> {noformat}
>>>>>>>>> drop dataverse test if exists;
>>>>>>>>> create dataverse test;
>>>>>>>>> use dataverse test;
>>>>>>>>> create type TwitterUserType as closed {
>>>>>>>>>       screen-name: string,
>>>>>>>>>       lang: string,
>>>>>>>>>       friends-count: int64,
>>>>>>>>>       statuses-count: int64,
>>>>>>>>>       name: string,
>>>>>>>>>       followers-count: int64
>>>>>>>>> }
>>>>>>>>> create type TweetMessageType as closed {
>>>>>>>>>       tweetid: int64,
>>>>>>>>>           user: TwitterUserType,
>>>>>>>>>           sender-location: point,
>>>>>>>>>       send-time: datetime,
>>>>>>>>>           referred-topics: {{ string }},
>>>>>>>>>       message-text: string,
>>>>>>>>>       countA: int64,
>>>>>>>>>       countB: int64
>>>>>>>>> }
>>>>>>>>> create dataset TweetMessages(TweetMessageType)
>>>>>>>>> primary key tweetid;
>>>>>>>>> create index twmSndLocIx on TweetMessages(sender-location) type
>>> rtree;
>>>>>>>>> create index msgCountAIx on TweetMessages(countA) type btree;
>>>>>>>>> create index msgCountBIx on TweetMessages(countB) type btree;
>>>>>>>>> create index msgTextIx on TweetMessages(message-text) type
>> keyword;
>>>>>>>>> {noformat}
>>>>>>>>> Query:
>>>>>>>>> {noformat}
>>>>>>>>> for $t1 in dataset('TweetMessages')
>>>>>>>>> for $t2 in dataset('TweetMessages')
>>>>>>>>> let $n :=  create-circle($t1.sender-location, 0.5)
>>>>>>>>> where spatial-intersect($t2.sender-location, $n)
>>>>>>>>> order by $t2.tweetid
>>>>>>>>> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
>>>>>>>>> {noformat}
>>>>>>>>> Optimized plan:
>>>>>>>>> {noformat}
>>>>>>>>> distribute result [%0->$$10]
>>>>>>>>> -- DISTRIBUTE_RESULT  |PARTITIONED|
>>>>>>>>>     exchange
>>>>>>>>>     -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>>>>>       project ([$$10])
>>>>>>>>>       -- STREAM_PROJECT  |PARTITIONED|
>>>>>>>>>         assign [$$10] <- [function-call:
>>>>>>>>>
>>>>>>>> asterix:closed-record-constructor,
>>>>>>>> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
>>>>>>>>
>>>>>>>>>         -- ASSIGN  |PARTITIONED|
>>>>>>>>>           exchange
>>>>>>>>>           -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
>>>>>>>>>             order (ASC, %0->$$15)
>>>>>>>>>             -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
>>>>>>>>>               exchange
>>>>>>>>>               -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>>>>>                 project ([$$13, $$15])
>>>>>>>>>                 -- STREAM_PROJECT  |PARTITIONED|
>>>>>>>>>                   select (function-call: asterix:spatial-intersect,
>>>>>>>>>
>>>>>>>> Args:[%0->$$13, function-call: asterix:create-circle,
>>>>>>>>
>>>>>>> Args:[function-call:
>>>>>>>
>>>>>>>> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}],
>> ADouble:
>>>>>>>> {0.5}]])
>>>>>>>>
>>>>>>>>>                   -- STREAM_SELECT  |PARTITIONED|
>>>>>>>>>                     project ([$$0, $$13, $$15])
>>>>>>>>>                     -- STREAM_PROJECT  |PARTITIONED|
>>>>>>>>>                       exchange
>>>>>>>>>                       -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>>>>>                         unnest-map [$$14, $$0] <- function-call:
>>>>>>>>>
>>>>>>>> asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0},
>>>>>>>>
>>>>>>> AString:
>>>>>>>
>>>>>>>> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean:
>>> {false},
>>>>>>>> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27,
>>> TRUE,
>>>>>>>> TRUE, TRUE]
>>>>>>>>
>>>>>>>>>                         -- BTREE_SEARCH  |PARTITIONED|
>>>>>>>>>                           exchange
>>>>>>>>>                           -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>>>>>                             order (ASC, %0->$$27)
>>>>>>>>>                             -- STABLE_SORT [$$27(ASC)]
>> |PARTITIONED|
>>>>>>>>>                               exchange
>>>>>>>>>                               -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>>>>>                                 project ([$$27, $$13, $$15])
>>>>>>>>>                                 -- STREAM_PROJECT  |PARTITIONED|
>>>>>>>>>                                   exchange
>>>>>>>>>                                   -- ONE_TO_ONE_EXCHANGE
>>> |PARTITIONED|
>>>>>>>>>                                     unnest-map [$$23, $$24, $$25,
>>> $$26,
>>>>>>>> $$27] <- function-call: asterix:index-search, Args:[AString:
>>>>>>>>
>>>>>>> {twmSndLocIx},
>>>>>>>
>>>>>>>> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean:
>>>>>>>> {true},
>>>>>>>> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19,
>> %0->$$20,
>>>>>>>> %0->$$21, %0->$$22]
>>>>>>>>
>>>>>>>>>                                     -- RTREE_SEARCH  |PARTITIONED|
>>>>>>>>>                                       exchange
>>>>>>>>>                                       -- BROADCAST_EXCHANGE
>>>>>>>>>
>>>>>>>> |PARTITIONED|
>>>>>>>>                                         assign [$$19, $$20, $$21,
>>> $$22]
>>>>>>>> <-
>>>>>>>> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
>>>>>>>> AInt32:
>>>>>>>> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32:
>> {2},
>>>>>>>> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13,
>>>>>>>> AInt32:
>>>>>>>> {2}, AInt32: {2}], function-call: asterix:create-mbr,
>> Args:[%0->$$13,
>>>>>>>> AInt32: {2}, AInt32: {3}]]
>>>>>>>>
>>>>>>>>>                                         -- ASSIGN  |PARTITIONED|
>>>>>>>>>                                           project ([$$13, $$15])
>>>>>>>>>                                           -- STREAM_PROJECT
>>>>>>>>>
>>>>>>>> |PARTITIONED|
>>>>>>>>                                             assign [$$13] <-
>>>>>>>> [function-call: asterix:field-access-by-index, Args:[%0->$$1,
>> AInt32:
>>>>>>> {2}]]
>>>>>>>
>>>>>>>>                                             -- ASSIGN  |PARTITIONED|
>>>>>>>>>                                               exchange
>>>>>>>>>                                               --
>> ONE_TO_ONE_EXCHANGE
>>>>>>>> |PARTITIONED|
>>>>>>>>
>>>>>>>>>                                                 data-scan
>> []<-[$$15,
>>>>>>>>> $$1]
>>>>>>>>>
>>>>>>>> <- test:TweetMessages
>>>>>>>>
>>>>>>>>>                                                 -- DATASOURCE_SCAN
>>>>>>>>>
>>>>>>>> |PARTITIONED|
>>>>>>>>
>>>>>>>>>                                                   exchange
>>>>>>>>>                                                   --
>>>>>>>>> ONE_TO_ONE_EXCHANGE
>>>>>>>>>
>>>>>>>> |PARTITIONED|
>>>>>>>>
>>> empty-tuple-source
>>>>>>>>>                                                     --
>>>>>>>>> EMPTY_TUPLE_SOURCE
>>>>>>>>>
>>>>>>>> |PARTITIONED|
>>>>>>>>
>>>>>>>>> {noformat}
>>>>>>>>> The optimized plan is incorrect --- the index search doesn't use
>> the
>>>>>>>> right join condition and hence the result is different from
>> expected.
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> This message was sent by Atlassian JIRA
>>>>>>>> (v6.3.4#6332)
>>>>>>>>
>>>>>>>>


Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Posted by Yingyi Bu <bu...@gmail.com>.
In left-outer hash join, if the the probe branch is locally clustered (or
sorted) by a column superset of the join key,
the output will still be locally clustered.
Inner hash join couldn't maintain that because of the "role reversal"
optimization in the runtime.

Best,
Yingyi

On Thu, Jan 7, 2016 at 10:07 PM, Taewoo Kim <wa...@gmail.com> wrote:

> Interesting. Can you be more specific?
>
> Best,
> Taewoo
>
> On Thu, Jan 7, 2016 at 6:36 PM, Yingyi Bu <bu...@gmail.com> wrote:
>
> > Sorry, to be more precise:
> > Left-outer hash join cannot preserve all local data properties for its
> > probe branch (because spilling can happen) but can preserve (or
> downgrade)
> > some when certain conditions meet.
> >
> > On Thu, Jan 7, 2016 at 6:28 PM, Yingyi Bu <bu...@gmail.com> wrote:
> >
> > > In the logical/physical query plan, I think it is statically
> determined.
> > > However, that doesn't mean the execution is faithful to that
> probe/build
> > > decision because we have the "role reversal" optimization for inner
> hash
> > > joins:-)
> > > (That's also why our inner hash join cannot maintain any local data
> > > property from its probe branch, but left-outer hash can preserve that.)
> > >
> > > Best,
> > > Yingyi
> > >
> > >
> > > On Thu, Jan 7, 2016 at 5:34 PM, Mike Carey <dt...@gmail.com> wrote:
> > >
> > >> Also, do we (separately want to make sure that our hash join behavior
> is
> > >> comparable - i.e., that the initial build/probe decision is statically
> > >> determined from the query?  (I think we do want that, and I think it
> is
> > in
> > >> fact that way - but I'm not 100% sure, as its been awhile since that
> was
> > >> discussed, and it's not in-cache for me. :-))
> > >>
> > >>
> > >> On 1/7/16 3:22 PM, Taewoo Kim wrote:
> > >>
> > >>> Thanks Yingyi. Yes. If there is an equality condition and if we can't
> > >>> transform a join into an index-nested loop join, then a hybrid hash
> > join
> > >>> will be used.
> > >>>
> > >>> Best,
> > >>> Taewoo
> > >>>
> > >>> On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <bu...@gmail.com>
> wrote:
> > >>>
> > >>> +1!
> > >>>>
> > >>>>> 3. We only try to use applicable indexes from the inner branch. So,
> > if
> > >>>>>> there are no applicable indexes from the inner branch, we abort
> > >>>>>> transforming a join into an index-nested-loop join.
> > >>>>>>
> > >>>>> "index-nested-loop join" -> "hybrid hash join"?
> > >>>>
> > >>>> Thanks!
> > >>>>
> > >>>> Yingyi
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <wa...@gmail.com>
> > wrote:
> > >>>>
> > >>>> Hello dev,
> > >>>>>
> > >>>>> Regarding this issue (ASTERIXDB-1249), I would like to make an
> > >>>>> index-join
> > >>>>> hint clarification. Let's start with an example query other than a
> > >>>>> self-join query in this issue.
> > >>>>>
> > >>>>> for $c in dataset('Customers')
> > >>>>>
> > >>>>> for $o in dataset('Orders')
> > >>>>>
> > >>>>> where $c.cid /*+ indexnl */ = $o.cid
> > >>>>>
> > >>>>> order by $c.cid, $o.oid
> > >>>>>
> > >>>>> return {"cid":$c.cid, "oid": $o.oid}
> > >>>>>
> > >>>>> Right now, in the master branch, the first dataset (e.g.,
> Customers)
> > >>>>> becomes the outer branch and the second dataset (e.g., Orders)
> > becomes
> > >>>>>
> > >>>> the
> > >>>>
> > >>>>> inner branch. And, when the optimizer tries to honor the given
> > indexnl
> > >>>>>
> > >>>> hint
> > >>>>
> > >>>>> (transforming a join into an index-nested-loop join), if there are
> > >>>>> applicable indexes from the inner branch (e.g., Orders), then it is
> > >>>>> going
> > >>>>> to use one of those indexes. If there are no applicable indexes
> from
> > >>>>> the
> > >>>>> inner branch, it tries to use indexes from the outer branch (e.g.,
> > >>>>> Customers). We are going to change the last part; we will not use
> > >>>>> indexes
> > >>>>> from the outer branch. So, the following are refined rules for
> > >>>>>
> > >>>> transforming
> > >>>>
> > >>>>> a join into an index-nested-loop join.
> > >>>>>
> > >>>>> 1. The first dataset in a join (the first parameter of the given
> > join)
> > >>>>> becomes the outer branch.
> > >>>>> 2. The second dataset in a join (the second parameter of the given
> > >>>>> join)
> > >>>>> becomes the inner branch.
> > >>>>> 3. We only try to use applicable indexes from the inner branch. So,
> > if
> > >>>>> there are no applicable indexes from the inner branch, we abort
> > >>>>> transforming a join into an index-nested-loop join.
> > >>>>> 4. Variable order in the given join predicate is not important. It
> > can
> > >>>>> be
> > >>>>> either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA.
> > >>>>>
> > >>>>> So, for the left-outer join and inner join altogether, the left
> > subtree
> > >>>>>
> > >>>> is
> > >>>>
> > >>>>> the probing side and the right subtree is the index side. So, this
> > can
> > >>>>> be
> > >>>>> applied to the self-join case, too just like the following query in
> > >>>>> this
> > >>>>> issue. In the following query, $t1 becomes the outer and $t2
> becomes
> > >>>>> the
> > >>>>> inner.
> > >>>>>
> > >>>>> for $t1 in dataset('TweetMessages')
> > >>>>>
> > >>>>> for $t2 in dataset('TweetMessages')
> > >>>>>
> > >>>>> let $c := $t1.countA + 20
> > >>>>>
> > >>>>> where $c /* +indexnl */= $t2.countB
> > >>>>>
> > >>>>> order by $t2.tweetid
> > >>>>>
> > >>>>> return {"tweetid2": $t2.tweetid, "count2":$t2.countB};
> > >>>>>
> > >>>>> Thank you. Any opinion would be appreciated before I finalize this
> > fix.
> > >>>>>
> > >>>>> Best,
> > >>>>> Taewoo
> > >>>>>
> > >>>>> ---------- Forwarded message ----------
> > >>>>> From: Yingyi Bu (JIRA) <ji...@apache.org>
> > >>>>> Date: Wed, Jan 6, 2016 at 10:23 AM
> > >>>>> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join
> chooses
> > >>>>>
> > >>>> wrong
> > >>>>
> > >>>>> probe/index branch
> > >>>>> To: notifications@asterixdb.incubator.apache.org
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>      [
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>
> >
> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
> > >>>>
> > >>>>> ]
> > >>>>>
> > >>>>> Yingyi Bu commented on ASTERIXDB-1249:
> > >>>>> --------------------------------------
> > >>>>>
> > >>>>> That's right. Basically in AcessMethod implementations, e.g.:
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>
> >
> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124
> > >>>>
> > >>>>> Choosing probe/index branch is only based on dataset names, instead
> > of
> > >>>>> being based on the join condition.
> > >>>>>
> > >>>>> Self index join chooses wrong probe/index branch
> > >>>>>> ------------------------------------------------
> > >>>>>>
> > >>>>>>                  Key: ASTERIXDB-1249
> > >>>>>>                  URL:
> > >>>>>>
> > >>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249
> > >>>>>
> > >>>>>>              Project: Apache AsterixDB
> > >>>>>>           Issue Type: Bug
> > >>>>>>           Components: Optimizer
> > >>>>>>             Reporter: Yingyi Bu
> > >>>>>>             Assignee: Taewoo Kim
> > >>>>>>
> > >>>>>> DDLs:
> > >>>>>> {noformat}
> > >>>>>> drop dataverse test if exists;
> > >>>>>> create dataverse test;
> > >>>>>> use dataverse test;
> > >>>>>> create type TwitterUserType as closed {
> > >>>>>>      screen-name: string,
> > >>>>>>      lang: string,
> > >>>>>>      friends-count: int64,
> > >>>>>>      statuses-count: int64,
> > >>>>>>      name: string,
> > >>>>>>      followers-count: int64
> > >>>>>> }
> > >>>>>> create type TweetMessageType as closed {
> > >>>>>>      tweetid: int64,
> > >>>>>>          user: TwitterUserType,
> > >>>>>>          sender-location: point,
> > >>>>>>      send-time: datetime,
> > >>>>>>          referred-topics: {{ string }},
> > >>>>>>      message-text: string,
> > >>>>>>      countA: int64,
> > >>>>>>      countB: int64
> > >>>>>> }
> > >>>>>> create dataset TweetMessages(TweetMessageType)
> > >>>>>> primary key tweetid;
> > >>>>>> create index twmSndLocIx on TweetMessages(sender-location) type
> > rtree;
> > >>>>>> create index msgCountAIx on TweetMessages(countA) type btree;
> > >>>>>> create index msgCountBIx on TweetMessages(countB) type btree;
> > >>>>>> create index msgTextIx on TweetMessages(message-text) type
> keyword;
> > >>>>>> {noformat}
> > >>>>>> Query:
> > >>>>>> {noformat}
> > >>>>>> for $t1 in dataset('TweetMessages')
> > >>>>>> for $t2 in dataset('TweetMessages')
> > >>>>>> let $n :=  create-circle($t1.sender-location, 0.5)
> > >>>>>> where spatial-intersect($t2.sender-location, $n)
> > >>>>>> order by $t2.tweetid
> > >>>>>> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
> > >>>>>> {noformat}
> > >>>>>> Optimized plan:
> > >>>>>> {noformat}
> > >>>>>> distribute result [%0->$$10]
> > >>>>>> -- DISTRIBUTE_RESULT  |PARTITIONED|
> > >>>>>>    exchange
> > >>>>>>    -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> > >>>>>>      project ([$$10])
> > >>>>>>      -- STREAM_PROJECT  |PARTITIONED|
> > >>>>>>        assign [$$10] <- [function-call:
> > >>>>>>
> > >>>>> asterix:closed-record-constructor,
> > >>>>
> > >>>>> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
> > >>>>>
> > >>>>>>        -- ASSIGN  |PARTITIONED|
> > >>>>>>          exchange
> > >>>>>>          -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
> > >>>>>>            order (ASC, %0->$$15)
> > >>>>>>            -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
> > >>>>>>              exchange
> > >>>>>>              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> > >>>>>>                project ([$$13, $$15])
> > >>>>>>                -- STREAM_PROJECT  |PARTITIONED|
> > >>>>>>                  select (function-call: asterix:spatial-intersect,
> > >>>>>>
> > >>>>> Args:[%0->$$13, function-call: asterix:create-circle,
> > >>>>>
> > >>>> Args:[function-call:
> > >>>>
> > >>>>> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}],
> ADouble:
> > >>>>> {0.5}]])
> > >>>>>
> > >>>>>>                  -- STREAM_SELECT  |PARTITIONED|
> > >>>>>>                    project ([$$0, $$13, $$15])
> > >>>>>>                    -- STREAM_PROJECT  |PARTITIONED|
> > >>>>>>                      exchange
> > >>>>>>                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> > >>>>>>                        unnest-map [$$14, $$0] <- function-call:
> > >>>>>>
> > >>>>> asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0},
> > >>>>>
> > >>>> AString:
> > >>>>
> > >>>>> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean:
> > {false},
> > >>>>> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27,
> > TRUE,
> > >>>>> TRUE, TRUE]
> > >>>>>
> > >>>>>>                        -- BTREE_SEARCH  |PARTITIONED|
> > >>>>>>                          exchange
> > >>>>>>                          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> > >>>>>>                            order (ASC, %0->$$27)
> > >>>>>>                            -- STABLE_SORT [$$27(ASC)]
> |PARTITIONED|
> > >>>>>>                              exchange
> > >>>>>>                              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> > >>>>>>                                project ([$$27, $$13, $$15])
> > >>>>>>                                -- STREAM_PROJECT  |PARTITIONED|
> > >>>>>>                                  exchange
> > >>>>>>                                  -- ONE_TO_ONE_EXCHANGE
> > |PARTITIONED|
> > >>>>>>                                    unnest-map [$$23, $$24, $$25,
> > $$26,
> > >>>>>>
> > >>>>> $$27] <- function-call: asterix:index-search, Args:[AString:
> > >>>>>
> > >>>> {twmSndLocIx},
> > >>>>
> > >>>>> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean:
> > >>>>> {true},
> > >>>>> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19,
> %0->$$20,
> > >>>>> %0->$$21, %0->$$22]
> > >>>>>
> > >>>>>>                                    -- RTREE_SEARCH  |PARTITIONED|
> > >>>>>>                                      exchange
> > >>>>>>                                      -- BROADCAST_EXCHANGE
> > >>>>>>
> > >>>>> |PARTITIONED|
> > >>>>
> > >>>>>                                        assign [$$19, $$20, $$21,
> > $$22]
> > >>>>>>
> > >>>>> <-
> > >>>>
> > >>>>> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
> > >>>>> AInt32:
> > >>>>> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32:
> {2},
> > >>>>> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13,
> > >>>>> AInt32:
> > >>>>> {2}, AInt32: {2}], function-call: asterix:create-mbr,
> Args:[%0->$$13,
> > >>>>> AInt32: {2}, AInt32: {3}]]
> > >>>>>
> > >>>>>>                                        -- ASSIGN  |PARTITIONED|
> > >>>>>>                                          project ([$$13, $$15])
> > >>>>>>                                          -- STREAM_PROJECT
> > >>>>>>
> > >>>>> |PARTITIONED|
> > >>>>
> > >>>>>                                            assign [$$13] <-
> > >>>>>>
> > >>>>> [function-call: asterix:field-access-by-index, Args:[%0->$$1,
> AInt32:
> > >>>>>
> > >>>> {2}]]
> > >>>>
> > >>>>>                                            -- ASSIGN  |PARTITIONED|
> > >>>>>>                                              exchange
> > >>>>>>                                              --
> ONE_TO_ONE_EXCHANGE
> > >>>>>>
> > >>>>> |PARTITIONED|
> > >>>>>
> > >>>>>>                                                data-scan
> []<-[$$15,
> > >>>>>> $$1]
> > >>>>>>
> > >>>>> <- test:TweetMessages
> > >>>>>
> > >>>>>>                                                -- DATASOURCE_SCAN
> > >>>>>>
> > >>>>> |PARTITIONED|
> > >>>>>
> > >>>>>>                                                  exchange
> > >>>>>>                                                  --
> > >>>>>> ONE_TO_ONE_EXCHANGE
> > >>>>>>
> > >>>>> |PARTITIONED|
> > >>>>>
> > >>>>>>
> > empty-tuple-source
> > >>>>>>                                                    --
> > >>>>>> EMPTY_TUPLE_SOURCE
> > >>>>>>
> > >>>>> |PARTITIONED|
> > >>>>>
> > >>>>>> {noformat}
> > >>>>>> The optimized plan is incorrect --- the index search doesn't use
> the
> > >>>>>>
> > >>>>> right join condition and hence the result is different from
> expected.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> This message was sent by Atlassian JIRA
> > >>>>> (v6.3.4#6332)
> > >>>>>
> > >>>>>
> > >>
> > >
> >
>

Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Posted by Taewoo Kim <wa...@gmail.com>.
Interesting. Can you be more specific?

Best,
Taewoo

On Thu, Jan 7, 2016 at 6:36 PM, Yingyi Bu <bu...@gmail.com> wrote:

> Sorry, to be more precise:
> Left-outer hash join cannot preserve all local data properties for its
> probe branch (because spilling can happen) but can preserve (or downgrade)
> some when certain conditions meet.
>
> On Thu, Jan 7, 2016 at 6:28 PM, Yingyi Bu <bu...@gmail.com> wrote:
>
> > In the logical/physical query plan, I think it is statically determined.
> > However, that doesn't mean the execution is faithful to that probe/build
> > decision because we have the "role reversal" optimization for inner hash
> > joins:-)
> > (That's also why our inner hash join cannot maintain any local data
> > property from its probe branch, but left-outer hash can preserve that.)
> >
> > Best,
> > Yingyi
> >
> >
> > On Thu, Jan 7, 2016 at 5:34 PM, Mike Carey <dt...@gmail.com> wrote:
> >
> >> Also, do we (separately want to make sure that our hash join behavior is
> >> comparable - i.e., that the initial build/probe decision is statically
> >> determined from the query?  (I think we do want that, and I think it is
> in
> >> fact that way - but I'm not 100% sure, as its been awhile since that was
> >> discussed, and it's not in-cache for me. :-))
> >>
> >>
> >> On 1/7/16 3:22 PM, Taewoo Kim wrote:
> >>
> >>> Thanks Yingyi. Yes. If there is an equality condition and if we can't
> >>> transform a join into an index-nested loop join, then a hybrid hash
> join
> >>> will be used.
> >>>
> >>> Best,
> >>> Taewoo
> >>>
> >>> On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <bu...@gmail.com> wrote:
> >>>
> >>> +1!
> >>>>
> >>>>> 3. We only try to use applicable indexes from the inner branch. So,
> if
> >>>>>> there are no applicable indexes from the inner branch, we abort
> >>>>>> transforming a join into an index-nested-loop join.
> >>>>>>
> >>>>> "index-nested-loop join" -> "hybrid hash join"?
> >>>>
> >>>> Thanks!
> >>>>
> >>>> Yingyi
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <wa...@gmail.com>
> wrote:
> >>>>
> >>>> Hello dev,
> >>>>>
> >>>>> Regarding this issue (ASTERIXDB-1249), I would like to make an
> >>>>> index-join
> >>>>> hint clarification. Let's start with an example query other than a
> >>>>> self-join query in this issue.
> >>>>>
> >>>>> for $c in dataset('Customers')
> >>>>>
> >>>>> for $o in dataset('Orders')
> >>>>>
> >>>>> where $c.cid /*+ indexnl */ = $o.cid
> >>>>>
> >>>>> order by $c.cid, $o.oid
> >>>>>
> >>>>> return {"cid":$c.cid, "oid": $o.oid}
> >>>>>
> >>>>> Right now, in the master branch, the first dataset (e.g., Customers)
> >>>>> becomes the outer branch and the second dataset (e.g., Orders)
> becomes
> >>>>>
> >>>> the
> >>>>
> >>>>> inner branch. And, when the optimizer tries to honor the given
> indexnl
> >>>>>
> >>>> hint
> >>>>
> >>>>> (transforming a join into an index-nested-loop join), if there are
> >>>>> applicable indexes from the inner branch (e.g., Orders), then it is
> >>>>> going
> >>>>> to use one of those indexes. If there are no applicable indexes from
> >>>>> the
> >>>>> inner branch, it tries to use indexes from the outer branch (e.g.,
> >>>>> Customers). We are going to change the last part; we will not use
> >>>>> indexes
> >>>>> from the outer branch. So, the following are refined rules for
> >>>>>
> >>>> transforming
> >>>>
> >>>>> a join into an index-nested-loop join.
> >>>>>
> >>>>> 1. The first dataset in a join (the first parameter of the given
> join)
> >>>>> becomes the outer branch.
> >>>>> 2. The second dataset in a join (the second parameter of the given
> >>>>> join)
> >>>>> becomes the inner branch.
> >>>>> 3. We only try to use applicable indexes from the inner branch. So,
> if
> >>>>> there are no applicable indexes from the inner branch, we abort
> >>>>> transforming a join into an index-nested-loop join.
> >>>>> 4. Variable order in the given join predicate is not important. It
> can
> >>>>> be
> >>>>> either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA.
> >>>>>
> >>>>> So, for the left-outer join and inner join altogether, the left
> subtree
> >>>>>
> >>>> is
> >>>>
> >>>>> the probing side and the right subtree is the index side. So, this
> can
> >>>>> be
> >>>>> applied to the self-join case, too just like the following query in
> >>>>> this
> >>>>> issue. In the following query, $t1 becomes the outer and $t2 becomes
> >>>>> the
> >>>>> inner.
> >>>>>
> >>>>> for $t1 in dataset('TweetMessages')
> >>>>>
> >>>>> for $t2 in dataset('TweetMessages')
> >>>>>
> >>>>> let $c := $t1.countA + 20
> >>>>>
> >>>>> where $c /* +indexnl */= $t2.countB
> >>>>>
> >>>>> order by $t2.tweetid
> >>>>>
> >>>>> return {"tweetid2": $t2.tweetid, "count2":$t2.countB};
> >>>>>
> >>>>> Thank you. Any opinion would be appreciated before I finalize this
> fix.
> >>>>>
> >>>>> Best,
> >>>>> Taewoo
> >>>>>
> >>>>> ---------- Forwarded message ----------
> >>>>> From: Yingyi Bu (JIRA) <ji...@apache.org>
> >>>>> Date: Wed, Jan 6, 2016 at 10:23 AM
> >>>>> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses
> >>>>>
> >>>> wrong
> >>>>
> >>>>> probe/index branch
> >>>>> To: notifications@asterixdb.incubator.apache.org
> >>>>>
> >>>>>
> >>>>>
> >>>>>      [
> >>>>>
> >>>>>
> >>>>>
> >>>>
> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
> >>>>
> >>>>> ]
> >>>>>
> >>>>> Yingyi Bu commented on ASTERIXDB-1249:
> >>>>> --------------------------------------
> >>>>>
> >>>>> That's right. Basically in AcessMethod implementations, e.g.:
> >>>>>
> >>>>>
> >>>>>
> >>>>
> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124
> >>>>
> >>>>> Choosing probe/index branch is only based on dataset names, instead
> of
> >>>>> being based on the join condition.
> >>>>>
> >>>>> Self index join chooses wrong probe/index branch
> >>>>>> ------------------------------------------------
> >>>>>>
> >>>>>>                  Key: ASTERIXDB-1249
> >>>>>>                  URL:
> >>>>>>
> >>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249
> >>>>>
> >>>>>>              Project: Apache AsterixDB
> >>>>>>           Issue Type: Bug
> >>>>>>           Components: Optimizer
> >>>>>>             Reporter: Yingyi Bu
> >>>>>>             Assignee: Taewoo Kim
> >>>>>>
> >>>>>> DDLs:
> >>>>>> {noformat}
> >>>>>> drop dataverse test if exists;
> >>>>>> create dataverse test;
> >>>>>> use dataverse test;
> >>>>>> create type TwitterUserType as closed {
> >>>>>>      screen-name: string,
> >>>>>>      lang: string,
> >>>>>>      friends-count: int64,
> >>>>>>      statuses-count: int64,
> >>>>>>      name: string,
> >>>>>>      followers-count: int64
> >>>>>> }
> >>>>>> create type TweetMessageType as closed {
> >>>>>>      tweetid: int64,
> >>>>>>          user: TwitterUserType,
> >>>>>>          sender-location: point,
> >>>>>>      send-time: datetime,
> >>>>>>          referred-topics: {{ string }},
> >>>>>>      message-text: string,
> >>>>>>      countA: int64,
> >>>>>>      countB: int64
> >>>>>> }
> >>>>>> create dataset TweetMessages(TweetMessageType)
> >>>>>> primary key tweetid;
> >>>>>> create index twmSndLocIx on TweetMessages(sender-location) type
> rtree;
> >>>>>> create index msgCountAIx on TweetMessages(countA) type btree;
> >>>>>> create index msgCountBIx on TweetMessages(countB) type btree;
> >>>>>> create index msgTextIx on TweetMessages(message-text) type keyword;
> >>>>>> {noformat}
> >>>>>> Query:
> >>>>>> {noformat}
> >>>>>> for $t1 in dataset('TweetMessages')
> >>>>>> for $t2 in dataset('TweetMessages')
> >>>>>> let $n :=  create-circle($t1.sender-location, 0.5)
> >>>>>> where spatial-intersect($t2.sender-location, $n)
> >>>>>> order by $t2.tweetid
> >>>>>> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
> >>>>>> {noformat}
> >>>>>> Optimized plan:
> >>>>>> {noformat}
> >>>>>> distribute result [%0->$$10]
> >>>>>> -- DISTRIBUTE_RESULT  |PARTITIONED|
> >>>>>>    exchange
> >>>>>>    -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>      project ([$$10])
> >>>>>>      -- STREAM_PROJECT  |PARTITIONED|
> >>>>>>        assign [$$10] <- [function-call:
> >>>>>>
> >>>>> asterix:closed-record-constructor,
> >>>>
> >>>>> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
> >>>>>
> >>>>>>        -- ASSIGN  |PARTITIONED|
> >>>>>>          exchange
> >>>>>>          -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
> >>>>>>            order (ASC, %0->$$15)
> >>>>>>            -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
> >>>>>>              exchange
> >>>>>>              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>                project ([$$13, $$15])
> >>>>>>                -- STREAM_PROJECT  |PARTITIONED|
> >>>>>>                  select (function-call: asterix:spatial-intersect,
> >>>>>>
> >>>>> Args:[%0->$$13, function-call: asterix:create-circle,
> >>>>>
> >>>> Args:[function-call:
> >>>>
> >>>>> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}], ADouble:
> >>>>> {0.5}]])
> >>>>>
> >>>>>>                  -- STREAM_SELECT  |PARTITIONED|
> >>>>>>                    project ([$$0, $$13, $$15])
> >>>>>>                    -- STREAM_PROJECT  |PARTITIONED|
> >>>>>>                      exchange
> >>>>>>                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>                        unnest-map [$$14, $$0] <- function-call:
> >>>>>>
> >>>>> asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0},
> >>>>>
> >>>> AString:
> >>>>
> >>>>> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean:
> {false},
> >>>>> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27,
> TRUE,
> >>>>> TRUE, TRUE]
> >>>>>
> >>>>>>                        -- BTREE_SEARCH  |PARTITIONED|
> >>>>>>                          exchange
> >>>>>>                          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>                            order (ASC, %0->$$27)
> >>>>>>                            -- STABLE_SORT [$$27(ASC)]  |PARTITIONED|
> >>>>>>                              exchange
> >>>>>>                              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>                                project ([$$27, $$13, $$15])
> >>>>>>                                -- STREAM_PROJECT  |PARTITIONED|
> >>>>>>                                  exchange
> >>>>>>                                  -- ONE_TO_ONE_EXCHANGE
> |PARTITIONED|
> >>>>>>                                    unnest-map [$$23, $$24, $$25,
> $$26,
> >>>>>>
> >>>>> $$27] <- function-call: asterix:index-search, Args:[AString:
> >>>>>
> >>>> {twmSndLocIx},
> >>>>
> >>>>> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean:
> >>>>> {true},
> >>>>> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19, %0->$$20,
> >>>>> %0->$$21, %0->$$22]
> >>>>>
> >>>>>>                                    -- RTREE_SEARCH  |PARTITIONED|
> >>>>>>                                      exchange
> >>>>>>                                      -- BROADCAST_EXCHANGE
> >>>>>>
> >>>>> |PARTITIONED|
> >>>>
> >>>>>                                        assign [$$19, $$20, $$21,
> $$22]
> >>>>>>
> >>>>> <-
> >>>>
> >>>>> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
> >>>>> AInt32:
> >>>>> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
> >>>>> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13,
> >>>>> AInt32:
> >>>>> {2}, AInt32: {2}], function-call: asterix:create-mbr, Args:[%0->$$13,
> >>>>> AInt32: {2}, AInt32: {3}]]
> >>>>>
> >>>>>>                                        -- ASSIGN  |PARTITIONED|
> >>>>>>                                          project ([$$13, $$15])
> >>>>>>                                          -- STREAM_PROJECT
> >>>>>>
> >>>>> |PARTITIONED|
> >>>>
> >>>>>                                            assign [$$13] <-
> >>>>>>
> >>>>> [function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32:
> >>>>>
> >>>> {2}]]
> >>>>
> >>>>>                                            -- ASSIGN  |PARTITIONED|
> >>>>>>                                              exchange
> >>>>>>                                              -- ONE_TO_ONE_EXCHANGE
> >>>>>>
> >>>>> |PARTITIONED|
> >>>>>
> >>>>>>                                                data-scan []<-[$$15,
> >>>>>> $$1]
> >>>>>>
> >>>>> <- test:TweetMessages
> >>>>>
> >>>>>>                                                -- DATASOURCE_SCAN
> >>>>>>
> >>>>> |PARTITIONED|
> >>>>>
> >>>>>>                                                  exchange
> >>>>>>                                                  --
> >>>>>> ONE_TO_ONE_EXCHANGE
> >>>>>>
> >>>>> |PARTITIONED|
> >>>>>
> >>>>>>
> empty-tuple-source
> >>>>>>                                                    --
> >>>>>> EMPTY_TUPLE_SOURCE
> >>>>>>
> >>>>> |PARTITIONED|
> >>>>>
> >>>>>> {noformat}
> >>>>>> The optimized plan is incorrect --- the index search doesn't use the
> >>>>>>
> >>>>> right join condition and hence the result is different from expected.
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> This message was sent by Atlassian JIRA
> >>>>> (v6.3.4#6332)
> >>>>>
> >>>>>
> >>
> >
>

Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Posted by Yingyi Bu <bu...@gmail.com>.
Sorry, to be more precise:
Left-outer hash join cannot preserve all local data properties for its
probe branch (because spilling can happen) but can preserve (or downgrade)
some when certain conditions meet.

On Thu, Jan 7, 2016 at 6:28 PM, Yingyi Bu <bu...@gmail.com> wrote:

> In the logical/physical query plan, I think it is statically determined.
> However, that doesn't mean the execution is faithful to that probe/build
> decision because we have the "role reversal" optimization for inner hash
> joins:-)
> (That's also why our inner hash join cannot maintain any local data
> property from its probe branch, but left-outer hash can preserve that.)
>
> Best,
> Yingyi
>
>
> On Thu, Jan 7, 2016 at 5:34 PM, Mike Carey <dt...@gmail.com> wrote:
>
>> Also, do we (separately want to make sure that our hash join behavior is
>> comparable - i.e., that the initial build/probe decision is statically
>> determined from the query?  (I think we do want that, and I think it is in
>> fact that way - but I'm not 100% sure, as its been awhile since that was
>> discussed, and it's not in-cache for me. :-))
>>
>>
>> On 1/7/16 3:22 PM, Taewoo Kim wrote:
>>
>>> Thanks Yingyi. Yes. If there is an equality condition and if we can't
>>> transform a join into an index-nested loop join, then a hybrid hash join
>>> will be used.
>>>
>>> Best,
>>> Taewoo
>>>
>>> On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <bu...@gmail.com> wrote:
>>>
>>> +1!
>>>>
>>>>> 3. We only try to use applicable indexes from the inner branch. So, if
>>>>>> there are no applicable indexes from the inner branch, we abort
>>>>>> transforming a join into an index-nested-loop join.
>>>>>>
>>>>> "index-nested-loop join" -> "hybrid hash join"?
>>>>
>>>> Thanks!
>>>>
>>>> Yingyi
>>>>
>>>>
>>>>
>>>> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <wa...@gmail.com> wrote:
>>>>
>>>> Hello dev,
>>>>>
>>>>> Regarding this issue (ASTERIXDB-1249), I would like to make an
>>>>> index-join
>>>>> hint clarification. Let's start with an example query other than a
>>>>> self-join query in this issue.
>>>>>
>>>>> for $c in dataset('Customers')
>>>>>
>>>>> for $o in dataset('Orders')
>>>>>
>>>>> where $c.cid /*+ indexnl */ = $o.cid
>>>>>
>>>>> order by $c.cid, $o.oid
>>>>>
>>>>> return {"cid":$c.cid, "oid": $o.oid}
>>>>>
>>>>> Right now, in the master branch, the first dataset (e.g., Customers)
>>>>> becomes the outer branch and the second dataset (e.g., Orders) becomes
>>>>>
>>>> the
>>>>
>>>>> inner branch. And, when the optimizer tries to honor the given indexnl
>>>>>
>>>> hint
>>>>
>>>>> (transforming a join into an index-nested-loop join), if there are
>>>>> applicable indexes from the inner branch (e.g., Orders), then it is
>>>>> going
>>>>> to use one of those indexes. If there are no applicable indexes from
>>>>> the
>>>>> inner branch, it tries to use indexes from the outer branch (e.g.,
>>>>> Customers). We are going to change the last part; we will not use
>>>>> indexes
>>>>> from the outer branch. So, the following are refined rules for
>>>>>
>>>> transforming
>>>>
>>>>> a join into an index-nested-loop join.
>>>>>
>>>>> 1. The first dataset in a join (the first parameter of the given join)
>>>>> becomes the outer branch.
>>>>> 2. The second dataset in a join (the second parameter of the given
>>>>> join)
>>>>> becomes the inner branch.
>>>>> 3. We only try to use applicable indexes from the inner branch. So, if
>>>>> there are no applicable indexes from the inner branch, we abort
>>>>> transforming a join into an index-nested-loop join.
>>>>> 4. Variable order in the given join predicate is not important. It can
>>>>> be
>>>>> either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA.
>>>>>
>>>>> So, for the left-outer join and inner join altogether, the left subtree
>>>>>
>>>> is
>>>>
>>>>> the probing side and the right subtree is the index side. So, this can
>>>>> be
>>>>> applied to the self-join case, too just like the following query in
>>>>> this
>>>>> issue. In the following query, $t1 becomes the outer and $t2 becomes
>>>>> the
>>>>> inner.
>>>>>
>>>>> for $t1 in dataset('TweetMessages')
>>>>>
>>>>> for $t2 in dataset('TweetMessages')
>>>>>
>>>>> let $c := $t1.countA + 20
>>>>>
>>>>> where $c /* +indexnl */= $t2.countB
>>>>>
>>>>> order by $t2.tweetid
>>>>>
>>>>> return {"tweetid2": $t2.tweetid, "count2":$t2.countB};
>>>>>
>>>>> Thank you. Any opinion would be appreciated before I finalize this fix.
>>>>>
>>>>> Best,
>>>>> Taewoo
>>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: Yingyi Bu (JIRA) <ji...@apache.org>
>>>>> Date: Wed, Jan 6, 2016 at 10:23 AM
>>>>> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses
>>>>>
>>>> wrong
>>>>
>>>>> probe/index branch
>>>>> To: notifications@asterixdb.incubator.apache.org
>>>>>
>>>>>
>>>>>
>>>>>      [
>>>>>
>>>>>
>>>>>
>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
>>>>
>>>>> ]
>>>>>
>>>>> Yingyi Bu commented on ASTERIXDB-1249:
>>>>> --------------------------------------
>>>>>
>>>>> That's right. Basically in AcessMethod implementations, e.g.:
>>>>>
>>>>>
>>>>>
>>>> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124
>>>>
>>>>> Choosing probe/index branch is only based on dataset names, instead of
>>>>> being based on the join condition.
>>>>>
>>>>> Self index join chooses wrong probe/index branch
>>>>>> ------------------------------------------------
>>>>>>
>>>>>>                  Key: ASTERIXDB-1249
>>>>>>                  URL:
>>>>>>
>>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249
>>>>>
>>>>>>              Project: Apache AsterixDB
>>>>>>           Issue Type: Bug
>>>>>>           Components: Optimizer
>>>>>>             Reporter: Yingyi Bu
>>>>>>             Assignee: Taewoo Kim
>>>>>>
>>>>>> DDLs:
>>>>>> {noformat}
>>>>>> drop dataverse test if exists;
>>>>>> create dataverse test;
>>>>>> use dataverse test;
>>>>>> create type TwitterUserType as closed {
>>>>>>      screen-name: string,
>>>>>>      lang: string,
>>>>>>      friends-count: int64,
>>>>>>      statuses-count: int64,
>>>>>>      name: string,
>>>>>>      followers-count: int64
>>>>>> }
>>>>>> create type TweetMessageType as closed {
>>>>>>      tweetid: int64,
>>>>>>          user: TwitterUserType,
>>>>>>          sender-location: point,
>>>>>>      send-time: datetime,
>>>>>>          referred-topics: {{ string }},
>>>>>>      message-text: string,
>>>>>>      countA: int64,
>>>>>>      countB: int64
>>>>>> }
>>>>>> create dataset TweetMessages(TweetMessageType)
>>>>>> primary key tweetid;
>>>>>> create index twmSndLocIx on TweetMessages(sender-location) type rtree;
>>>>>> create index msgCountAIx on TweetMessages(countA) type btree;
>>>>>> create index msgCountBIx on TweetMessages(countB) type btree;
>>>>>> create index msgTextIx on TweetMessages(message-text) type keyword;
>>>>>> {noformat}
>>>>>> Query:
>>>>>> {noformat}
>>>>>> for $t1 in dataset('TweetMessages')
>>>>>> for $t2 in dataset('TweetMessages')
>>>>>> let $n :=  create-circle($t1.sender-location, 0.5)
>>>>>> where spatial-intersect($t2.sender-location, $n)
>>>>>> order by $t2.tweetid
>>>>>> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
>>>>>> {noformat}
>>>>>> Optimized plan:
>>>>>> {noformat}
>>>>>> distribute result [%0->$$10]
>>>>>> -- DISTRIBUTE_RESULT  |PARTITIONED|
>>>>>>    exchange
>>>>>>    -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>>      project ([$$10])
>>>>>>      -- STREAM_PROJECT  |PARTITIONED|
>>>>>>        assign [$$10] <- [function-call:
>>>>>>
>>>>> asterix:closed-record-constructor,
>>>>
>>>>> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
>>>>>
>>>>>>        -- ASSIGN  |PARTITIONED|
>>>>>>          exchange
>>>>>>          -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
>>>>>>            order (ASC, %0->$$15)
>>>>>>            -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
>>>>>>              exchange
>>>>>>              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>>                project ([$$13, $$15])
>>>>>>                -- STREAM_PROJECT  |PARTITIONED|
>>>>>>                  select (function-call: asterix:spatial-intersect,
>>>>>>
>>>>> Args:[%0->$$13, function-call: asterix:create-circle,
>>>>>
>>>> Args:[function-call:
>>>>
>>>>> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}], ADouble:
>>>>> {0.5}]])
>>>>>
>>>>>>                  -- STREAM_SELECT  |PARTITIONED|
>>>>>>                    project ([$$0, $$13, $$15])
>>>>>>                    -- STREAM_PROJECT  |PARTITIONED|
>>>>>>                      exchange
>>>>>>                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>>                        unnest-map [$$14, $$0] <- function-call:
>>>>>>
>>>>> asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0},
>>>>>
>>>> AString:
>>>>
>>>>> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean: {false},
>>>>> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27, TRUE,
>>>>> TRUE, TRUE]
>>>>>
>>>>>>                        -- BTREE_SEARCH  |PARTITIONED|
>>>>>>                          exchange
>>>>>>                          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>>                            order (ASC, %0->$$27)
>>>>>>                            -- STABLE_SORT [$$27(ASC)]  |PARTITIONED|
>>>>>>                              exchange
>>>>>>                              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>>                                project ([$$27, $$13, $$15])
>>>>>>                                -- STREAM_PROJECT  |PARTITIONED|
>>>>>>                                  exchange
>>>>>>                                  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>>                                    unnest-map [$$23, $$24, $$25, $$26,
>>>>>>
>>>>> $$27] <- function-call: asterix:index-search, Args:[AString:
>>>>>
>>>> {twmSndLocIx},
>>>>
>>>>> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean:
>>>>> {true},
>>>>> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19, %0->$$20,
>>>>> %0->$$21, %0->$$22]
>>>>>
>>>>>>                                    -- RTREE_SEARCH  |PARTITIONED|
>>>>>>                                      exchange
>>>>>>                                      -- BROADCAST_EXCHANGE
>>>>>>
>>>>> |PARTITIONED|
>>>>
>>>>>                                        assign [$$19, $$20, $$21, $$22]
>>>>>>
>>>>> <-
>>>>
>>>>> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
>>>>> AInt32:
>>>>> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
>>>>> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13,
>>>>> AInt32:
>>>>> {2}, AInt32: {2}], function-call: asterix:create-mbr, Args:[%0->$$13,
>>>>> AInt32: {2}, AInt32: {3}]]
>>>>>
>>>>>>                                        -- ASSIGN  |PARTITIONED|
>>>>>>                                          project ([$$13, $$15])
>>>>>>                                          -- STREAM_PROJECT
>>>>>>
>>>>> |PARTITIONED|
>>>>
>>>>>                                            assign [$$13] <-
>>>>>>
>>>>> [function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32:
>>>>>
>>>> {2}]]
>>>>
>>>>>                                            -- ASSIGN  |PARTITIONED|
>>>>>>                                              exchange
>>>>>>                                              -- ONE_TO_ONE_EXCHANGE
>>>>>>
>>>>> |PARTITIONED|
>>>>>
>>>>>>                                                data-scan []<-[$$15,
>>>>>> $$1]
>>>>>>
>>>>> <- test:TweetMessages
>>>>>
>>>>>>                                                -- DATASOURCE_SCAN
>>>>>>
>>>>> |PARTITIONED|
>>>>>
>>>>>>                                                  exchange
>>>>>>                                                  --
>>>>>> ONE_TO_ONE_EXCHANGE
>>>>>>
>>>>> |PARTITIONED|
>>>>>
>>>>>>                                                    empty-tuple-source
>>>>>>                                                    --
>>>>>> EMPTY_TUPLE_SOURCE
>>>>>>
>>>>> |PARTITIONED|
>>>>>
>>>>>> {noformat}
>>>>>> The optimized plan is incorrect --- the index search doesn't use the
>>>>>>
>>>>> right join condition and hence the result is different from expected.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> This message was sent by Atlassian JIRA
>>>>> (v6.3.4#6332)
>>>>>
>>>>>
>>
>

Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Posted by Yingyi Bu <bu...@gmail.com>.
In the logical/physical query plan, I think it is statically determined.
However, that doesn't mean the execution is faithful to that probe/build
decision because we have the "role reversal" optimization for inner hash
joins:-)
(That's also why our inner hash join cannot maintain any local data
property from its probe branch, but left-outer hash can preserve that.)

Best,
Yingyi


On Thu, Jan 7, 2016 at 5:34 PM, Mike Carey <dt...@gmail.com> wrote:

> Also, do we (separately want to make sure that our hash join behavior is
> comparable - i.e., that the initial build/probe decision is statically
> determined from the query?  (I think we do want that, and I think it is in
> fact that way - but I'm not 100% sure, as its been awhile since that was
> discussed, and it's not in-cache for me. :-))
>
>
> On 1/7/16 3:22 PM, Taewoo Kim wrote:
>
>> Thanks Yingyi. Yes. If there is an equality condition and if we can't
>> transform a join into an index-nested loop join, then a hybrid hash join
>> will be used.
>>
>> Best,
>> Taewoo
>>
>> On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <bu...@gmail.com> wrote:
>>
>> +1!
>>>
>>>> 3. We only try to use applicable indexes from the inner branch. So, if
>>>>> there are no applicable indexes from the inner branch, we abort
>>>>> transforming a join into an index-nested-loop join.
>>>>>
>>>> "index-nested-loop join" -> "hybrid hash join"?
>>>
>>> Thanks!
>>>
>>> Yingyi
>>>
>>>
>>>
>>> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <wa...@gmail.com> wrote:
>>>
>>> Hello dev,
>>>>
>>>> Regarding this issue (ASTERIXDB-1249), I would like to make an
>>>> index-join
>>>> hint clarification. Let's start with an example query other than a
>>>> self-join query in this issue.
>>>>
>>>> for $c in dataset('Customers')
>>>>
>>>> for $o in dataset('Orders')
>>>>
>>>> where $c.cid /*+ indexnl */ = $o.cid
>>>>
>>>> order by $c.cid, $o.oid
>>>>
>>>> return {"cid":$c.cid, "oid": $o.oid}
>>>>
>>>> Right now, in the master branch, the first dataset (e.g., Customers)
>>>> becomes the outer branch and the second dataset (e.g., Orders) becomes
>>>>
>>> the
>>>
>>>> inner branch. And, when the optimizer tries to honor the given indexnl
>>>>
>>> hint
>>>
>>>> (transforming a join into an index-nested-loop join), if there are
>>>> applicable indexes from the inner branch (e.g., Orders), then it is
>>>> going
>>>> to use one of those indexes. If there are no applicable indexes from the
>>>> inner branch, it tries to use indexes from the outer branch (e.g.,
>>>> Customers). We are going to change the last part; we will not use
>>>> indexes
>>>> from the outer branch. So, the following are refined rules for
>>>>
>>> transforming
>>>
>>>> a join into an index-nested-loop join.
>>>>
>>>> 1. The first dataset in a join (the first parameter of the given join)
>>>> becomes the outer branch.
>>>> 2. The second dataset in a join (the second parameter of the given join)
>>>> becomes the inner branch.
>>>> 3. We only try to use applicable indexes from the inner branch. So, if
>>>> there are no applicable indexes from the inner branch, we abort
>>>> transforming a join into an index-nested-loop join.
>>>> 4. Variable order in the given join predicate is not important. It can
>>>> be
>>>> either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA.
>>>>
>>>> So, for the left-outer join and inner join altogether, the left subtree
>>>>
>>> is
>>>
>>>> the probing side and the right subtree is the index side. So, this can
>>>> be
>>>> applied to the self-join case, too just like the following query in this
>>>> issue. In the following query, $t1 becomes the outer and $t2 becomes the
>>>> inner.
>>>>
>>>> for $t1 in dataset('TweetMessages')
>>>>
>>>> for $t2 in dataset('TweetMessages')
>>>>
>>>> let $c := $t1.countA + 20
>>>>
>>>> where $c /* +indexnl */= $t2.countB
>>>>
>>>> order by $t2.tweetid
>>>>
>>>> return {"tweetid2": $t2.tweetid, "count2":$t2.countB};
>>>>
>>>> Thank you. Any opinion would be appreciated before I finalize this fix.
>>>>
>>>> Best,
>>>> Taewoo
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Yingyi Bu (JIRA) <ji...@apache.org>
>>>> Date: Wed, Jan 6, 2016 at 10:23 AM
>>>> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses
>>>>
>>> wrong
>>>
>>>> probe/index branch
>>>> To: notifications@asterixdb.incubator.apache.org
>>>>
>>>>
>>>>
>>>>      [
>>>>
>>>>
>>>>
>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
>>>
>>>> ]
>>>>
>>>> Yingyi Bu commented on ASTERIXDB-1249:
>>>> --------------------------------------
>>>>
>>>> That's right. Basically in AcessMethod implementations, e.g.:
>>>>
>>>>
>>>>
>>> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124
>>>
>>>> Choosing probe/index branch is only based on dataset names, instead of
>>>> being based on the join condition.
>>>>
>>>> Self index join chooses wrong probe/index branch
>>>>> ------------------------------------------------
>>>>>
>>>>>                  Key: ASTERIXDB-1249
>>>>>                  URL:
>>>>>
>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249
>>>>
>>>>>              Project: Apache AsterixDB
>>>>>           Issue Type: Bug
>>>>>           Components: Optimizer
>>>>>             Reporter: Yingyi Bu
>>>>>             Assignee: Taewoo Kim
>>>>>
>>>>> DDLs:
>>>>> {noformat}
>>>>> drop dataverse test if exists;
>>>>> create dataverse test;
>>>>> use dataverse test;
>>>>> create type TwitterUserType as closed {
>>>>>      screen-name: string,
>>>>>      lang: string,
>>>>>      friends-count: int64,
>>>>>      statuses-count: int64,
>>>>>      name: string,
>>>>>      followers-count: int64
>>>>> }
>>>>> create type TweetMessageType as closed {
>>>>>      tweetid: int64,
>>>>>          user: TwitterUserType,
>>>>>          sender-location: point,
>>>>>      send-time: datetime,
>>>>>          referred-topics: {{ string }},
>>>>>      message-text: string,
>>>>>      countA: int64,
>>>>>      countB: int64
>>>>> }
>>>>> create dataset TweetMessages(TweetMessageType)
>>>>> primary key tweetid;
>>>>> create index twmSndLocIx on TweetMessages(sender-location) type rtree;
>>>>> create index msgCountAIx on TweetMessages(countA) type btree;
>>>>> create index msgCountBIx on TweetMessages(countB) type btree;
>>>>> create index msgTextIx on TweetMessages(message-text) type keyword;
>>>>> {noformat}
>>>>> Query:
>>>>> {noformat}
>>>>> for $t1 in dataset('TweetMessages')
>>>>> for $t2 in dataset('TweetMessages')
>>>>> let $n :=  create-circle($t1.sender-location, 0.5)
>>>>> where spatial-intersect($t2.sender-location, $n)
>>>>> order by $t2.tweetid
>>>>> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
>>>>> {noformat}
>>>>> Optimized plan:
>>>>> {noformat}
>>>>> distribute result [%0->$$10]
>>>>> -- DISTRIBUTE_RESULT  |PARTITIONED|
>>>>>    exchange
>>>>>    -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>      project ([$$10])
>>>>>      -- STREAM_PROJECT  |PARTITIONED|
>>>>>        assign [$$10] <- [function-call:
>>>>>
>>>> asterix:closed-record-constructor,
>>>
>>>> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
>>>>
>>>>>        -- ASSIGN  |PARTITIONED|
>>>>>          exchange
>>>>>          -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
>>>>>            order (ASC, %0->$$15)
>>>>>            -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
>>>>>              exchange
>>>>>              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>                project ([$$13, $$15])
>>>>>                -- STREAM_PROJECT  |PARTITIONED|
>>>>>                  select (function-call: asterix:spatial-intersect,
>>>>>
>>>> Args:[%0->$$13, function-call: asterix:create-circle,
>>>>
>>> Args:[function-call:
>>>
>>>> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}], ADouble:
>>>> {0.5}]])
>>>>
>>>>>                  -- STREAM_SELECT  |PARTITIONED|
>>>>>                    project ([$$0, $$13, $$15])
>>>>>                    -- STREAM_PROJECT  |PARTITIONED|
>>>>>                      exchange
>>>>>                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>                        unnest-map [$$14, $$0] <- function-call:
>>>>>
>>>> asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0},
>>>>
>>> AString:
>>>
>>>> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean: {false},
>>>> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27, TRUE,
>>>> TRUE, TRUE]
>>>>
>>>>>                        -- BTREE_SEARCH  |PARTITIONED|
>>>>>                          exchange
>>>>>                          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>                            order (ASC, %0->$$27)
>>>>>                            -- STABLE_SORT [$$27(ASC)]  |PARTITIONED|
>>>>>                              exchange
>>>>>                              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>                                project ([$$27, $$13, $$15])
>>>>>                                -- STREAM_PROJECT  |PARTITIONED|
>>>>>                                  exchange
>>>>>                                  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>                                    unnest-map [$$23, $$24, $$25, $$26,
>>>>>
>>>> $$27] <- function-call: asterix:index-search, Args:[AString:
>>>>
>>> {twmSndLocIx},
>>>
>>>> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean:
>>>> {true},
>>>> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19, %0->$$20,
>>>> %0->$$21, %0->$$22]
>>>>
>>>>>                                    -- RTREE_SEARCH  |PARTITIONED|
>>>>>                                      exchange
>>>>>                                      -- BROADCAST_EXCHANGE
>>>>>
>>>> |PARTITIONED|
>>>
>>>>                                        assign [$$19, $$20, $$21, $$22]
>>>>>
>>>> <-
>>>
>>>> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2}, AInt32:
>>>> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
>>>> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32:
>>>> {2}, AInt32: {2}], function-call: asterix:create-mbr, Args:[%0->$$13,
>>>> AInt32: {2}, AInt32: {3}]]
>>>>
>>>>>                                        -- ASSIGN  |PARTITIONED|
>>>>>                                          project ([$$13, $$15])
>>>>>                                          -- STREAM_PROJECT
>>>>>
>>>> |PARTITIONED|
>>>
>>>>                                            assign [$$13] <-
>>>>>
>>>> [function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32:
>>>>
>>> {2}]]
>>>
>>>>                                            -- ASSIGN  |PARTITIONED|
>>>>>                                              exchange
>>>>>                                              -- ONE_TO_ONE_EXCHANGE
>>>>>
>>>> |PARTITIONED|
>>>>
>>>>>                                                data-scan []<-[$$15,
>>>>> $$1]
>>>>>
>>>> <- test:TweetMessages
>>>>
>>>>>                                                -- DATASOURCE_SCAN
>>>>>
>>>> |PARTITIONED|
>>>>
>>>>>                                                  exchange
>>>>>                                                  -- ONE_TO_ONE_EXCHANGE
>>>>>
>>>> |PARTITIONED|
>>>>
>>>>>                                                    empty-tuple-source
>>>>>                                                    --
>>>>> EMPTY_TUPLE_SOURCE
>>>>>
>>>> |PARTITIONED|
>>>>
>>>>> {noformat}
>>>>> The optimized plan is incorrect --- the index search doesn't use the
>>>>>
>>>> right join condition and hence the result is different from expected.
>>>>
>>>>
>>>>
>>>> --
>>>> This message was sent by Atlassian JIRA
>>>> (v6.3.4#6332)
>>>>
>>>>
>

Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Posted by Taewoo Kim <wa...@gmail.com>.
Tried to check it. I think so. There
is JoinsUtil.setJoinAlgorithmAndExchangeAlgo and isHashJoinCondition()
method checks the condition. @Pouria: please correct me if I'm wrong here.

Best,
Taewoo

On Thu, Jan 7, 2016 at 5:34 PM, Mike Carey <dt...@gmail.com> wrote:

> Also, do we (separately want to make sure that our hash join behavior is
> comparable - i.e., that the initial build/probe decision is statically
> determined from the query?  (I think we do want that, and I think it is in
> fact that way - but I'm not 100% sure, as its been awhile since that was
> discussed, and it's not in-cache for me. :-))
>
>
> On 1/7/16 3:22 PM, Taewoo Kim wrote:
>
>> Thanks Yingyi. Yes. If there is an equality condition and if we can't
>> transform a join into an index-nested loop join, then a hybrid hash join
>> will be used.
>>
>> Best,
>> Taewoo
>>
>> On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <bu...@gmail.com> wrote:
>>
>> +1!
>>>
>>>> 3. We only try to use applicable indexes from the inner branch. So, if
>>>>> there are no applicable indexes from the inner branch, we abort
>>>>> transforming a join into an index-nested-loop join.
>>>>>
>>>> "index-nested-loop join" -> "hybrid hash join"?
>>>
>>> Thanks!
>>>
>>> Yingyi
>>>
>>>
>>>
>>> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <wa...@gmail.com> wrote:
>>>
>>> Hello dev,
>>>>
>>>> Regarding this issue (ASTERIXDB-1249), I would like to make an
>>>> index-join
>>>> hint clarification. Let's start with an example query other than a
>>>> self-join query in this issue.
>>>>
>>>> for $c in dataset('Customers')
>>>>
>>>> for $o in dataset('Orders')
>>>>
>>>> where $c.cid /*+ indexnl */ = $o.cid
>>>>
>>>> order by $c.cid, $o.oid
>>>>
>>>> return {"cid":$c.cid, "oid": $o.oid}
>>>>
>>>> Right now, in the master branch, the first dataset (e.g., Customers)
>>>> becomes the outer branch and the second dataset (e.g., Orders) becomes
>>>>
>>> the
>>>
>>>> inner branch. And, when the optimizer tries to honor the given indexnl
>>>>
>>> hint
>>>
>>>> (transforming a join into an index-nested-loop join), if there are
>>>> applicable indexes from the inner branch (e.g., Orders), then it is
>>>> going
>>>> to use one of those indexes. If there are no applicable indexes from the
>>>> inner branch, it tries to use indexes from the outer branch (e.g.,
>>>> Customers). We are going to change the last part; we will not use
>>>> indexes
>>>> from the outer branch. So, the following are refined rules for
>>>>
>>> transforming
>>>
>>>> a join into an index-nested-loop join.
>>>>
>>>> 1. The first dataset in a join (the first parameter of the given join)
>>>> becomes the outer branch.
>>>> 2. The second dataset in a join (the second parameter of the given join)
>>>> becomes the inner branch.
>>>> 3. We only try to use applicable indexes from the inner branch. So, if
>>>> there are no applicable indexes from the inner branch, we abort
>>>> transforming a join into an index-nested-loop join.
>>>> 4. Variable order in the given join predicate is not important. It can
>>>> be
>>>> either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA.
>>>>
>>>> So, for the left-outer join and inner join altogether, the left subtree
>>>>
>>> is
>>>
>>>> the probing side and the right subtree is the index side. So, this can
>>>> be
>>>> applied to the self-join case, too just like the following query in this
>>>> issue. In the following query, $t1 becomes the outer and $t2 becomes the
>>>> inner.
>>>>
>>>> for $t1 in dataset('TweetMessages')
>>>>
>>>> for $t2 in dataset('TweetMessages')
>>>>
>>>> let $c := $t1.countA + 20
>>>>
>>>> where $c /* +indexnl */= $t2.countB
>>>>
>>>> order by $t2.tweetid
>>>>
>>>> return {"tweetid2": $t2.tweetid, "count2":$t2.countB};
>>>>
>>>> Thank you. Any opinion would be appreciated before I finalize this fix.
>>>>
>>>> Best,
>>>> Taewoo
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Yingyi Bu (JIRA) <ji...@apache.org>
>>>> Date: Wed, Jan 6, 2016 at 10:23 AM
>>>> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses
>>>>
>>> wrong
>>>
>>>> probe/index branch
>>>> To: notifications@asterixdb.incubator.apache.org
>>>>
>>>>
>>>>
>>>>      [
>>>>
>>>>
>>>>
>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
>>>
>>>> ]
>>>>
>>>> Yingyi Bu commented on ASTERIXDB-1249:
>>>> --------------------------------------
>>>>
>>>> That's right. Basically in AcessMethod implementations, e.g.:
>>>>
>>>>
>>>>
>>> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124
>>>
>>>> Choosing probe/index branch is only based on dataset names, instead of
>>>> being based on the join condition.
>>>>
>>>> Self index join chooses wrong probe/index branch
>>>>> ------------------------------------------------
>>>>>
>>>>>                  Key: ASTERIXDB-1249
>>>>>                  URL:
>>>>>
>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249
>>>>
>>>>>              Project: Apache AsterixDB
>>>>>           Issue Type: Bug
>>>>>           Components: Optimizer
>>>>>             Reporter: Yingyi Bu
>>>>>             Assignee: Taewoo Kim
>>>>>
>>>>> DDLs:
>>>>> {noformat}
>>>>> drop dataverse test if exists;
>>>>> create dataverse test;
>>>>> use dataverse test;
>>>>> create type TwitterUserType as closed {
>>>>>      screen-name: string,
>>>>>      lang: string,
>>>>>      friends-count: int64,
>>>>>      statuses-count: int64,
>>>>>      name: string,
>>>>>      followers-count: int64
>>>>> }
>>>>> create type TweetMessageType as closed {
>>>>>      tweetid: int64,
>>>>>          user: TwitterUserType,
>>>>>          sender-location: point,
>>>>>      send-time: datetime,
>>>>>          referred-topics: {{ string }},
>>>>>      message-text: string,
>>>>>      countA: int64,
>>>>>      countB: int64
>>>>> }
>>>>> create dataset TweetMessages(TweetMessageType)
>>>>> primary key tweetid;
>>>>> create index twmSndLocIx on TweetMessages(sender-location) type rtree;
>>>>> create index msgCountAIx on TweetMessages(countA) type btree;
>>>>> create index msgCountBIx on TweetMessages(countB) type btree;
>>>>> create index msgTextIx on TweetMessages(message-text) type keyword;
>>>>> {noformat}
>>>>> Query:
>>>>> {noformat}
>>>>> for $t1 in dataset('TweetMessages')
>>>>> for $t2 in dataset('TweetMessages')
>>>>> let $n :=  create-circle($t1.sender-location, 0.5)
>>>>> where spatial-intersect($t2.sender-location, $n)
>>>>> order by $t2.tweetid
>>>>> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
>>>>> {noformat}
>>>>> Optimized plan:
>>>>> {noformat}
>>>>> distribute result [%0->$$10]
>>>>> -- DISTRIBUTE_RESULT  |PARTITIONED|
>>>>>    exchange
>>>>>    -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>      project ([$$10])
>>>>>      -- STREAM_PROJECT  |PARTITIONED|
>>>>>        assign [$$10] <- [function-call:
>>>>>
>>>> asterix:closed-record-constructor,
>>>
>>>> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
>>>>
>>>>>        -- ASSIGN  |PARTITIONED|
>>>>>          exchange
>>>>>          -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
>>>>>            order (ASC, %0->$$15)
>>>>>            -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
>>>>>              exchange
>>>>>              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>                project ([$$13, $$15])
>>>>>                -- STREAM_PROJECT  |PARTITIONED|
>>>>>                  select (function-call: asterix:spatial-intersect,
>>>>>
>>>> Args:[%0->$$13, function-call: asterix:create-circle,
>>>>
>>> Args:[function-call:
>>>
>>>> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}], ADouble:
>>>> {0.5}]])
>>>>
>>>>>                  -- STREAM_SELECT  |PARTITIONED|
>>>>>                    project ([$$0, $$13, $$15])
>>>>>                    -- STREAM_PROJECT  |PARTITIONED|
>>>>>                      exchange
>>>>>                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>                        unnest-map [$$14, $$0] <- function-call:
>>>>>
>>>> asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0},
>>>>
>>> AString:
>>>
>>>> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean: {false},
>>>> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27, TRUE,
>>>> TRUE, TRUE]
>>>>
>>>>>                        -- BTREE_SEARCH  |PARTITIONED|
>>>>>                          exchange
>>>>>                          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>                            order (ASC, %0->$$27)
>>>>>                            -- STABLE_SORT [$$27(ASC)]  |PARTITIONED|
>>>>>                              exchange
>>>>>                              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>                                project ([$$27, $$13, $$15])
>>>>>                                -- STREAM_PROJECT  |PARTITIONED|
>>>>>                                  exchange
>>>>>                                  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>                                    unnest-map [$$23, $$24, $$25, $$26,
>>>>>
>>>> $$27] <- function-call: asterix:index-search, Args:[AString:
>>>>
>>> {twmSndLocIx},
>>>
>>>> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean:
>>>> {true},
>>>> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19, %0->$$20,
>>>> %0->$$21, %0->$$22]
>>>>
>>>>>                                    -- RTREE_SEARCH  |PARTITIONED|
>>>>>                                      exchange
>>>>>                                      -- BROADCAST_EXCHANGE
>>>>>
>>>> |PARTITIONED|
>>>
>>>>                                        assign [$$19, $$20, $$21, $$22]
>>>>>
>>>> <-
>>>
>>>> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2}, AInt32:
>>>> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
>>>> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32:
>>>> {2}, AInt32: {2}], function-call: asterix:create-mbr, Args:[%0->$$13,
>>>> AInt32: {2}, AInt32: {3}]]
>>>>
>>>>>                                        -- ASSIGN  |PARTITIONED|
>>>>>                                          project ([$$13, $$15])
>>>>>                                          -- STREAM_PROJECT
>>>>>
>>>> |PARTITIONED|
>>>
>>>>                                            assign [$$13] <-
>>>>>
>>>> [function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32:
>>>>
>>> {2}]]
>>>
>>>>                                            -- ASSIGN  |PARTITIONED|
>>>>>                                              exchange
>>>>>                                              -- ONE_TO_ONE_EXCHANGE
>>>>>
>>>> |PARTITIONED|
>>>>
>>>>>                                                data-scan []<-[$$15,
>>>>> $$1]
>>>>>
>>>> <- test:TweetMessages
>>>>
>>>>>                                                -- DATASOURCE_SCAN
>>>>>
>>>> |PARTITIONED|
>>>>
>>>>>                                                  exchange
>>>>>                                                  -- ONE_TO_ONE_EXCHANGE
>>>>>
>>>> |PARTITIONED|
>>>>
>>>>>                                                    empty-tuple-source
>>>>>                                                    --
>>>>> EMPTY_TUPLE_SOURCE
>>>>>
>>>> |PARTITIONED|
>>>>
>>>>> {noformat}
>>>>> The optimized plan is incorrect --- the index search doesn't use the
>>>>>
>>>> right join condition and hence the result is different from expected.
>>>>
>>>>
>>>>
>>>> --
>>>> This message was sent by Atlassian JIRA
>>>> (v6.3.4#6332)
>>>>
>>>>
>

Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Posted by Mike Carey <dt...@gmail.com>.
Also, do we (separately want to make sure that our hash join behavior is 
comparable - i.e., that the initial build/probe decision is statically 
determined from the query?  (I think we do want that, and I think it is 
in fact that way - but I'm not 100% sure, as its been awhile since that 
was discussed, and it's not in-cache for me. :-))

On 1/7/16 3:22 PM, Taewoo Kim wrote:
> Thanks Yingyi. Yes. If there is an equality condition and if we can't
> transform a join into an index-nested loop join, then a hybrid hash join
> will be used.
>
> Best,
> Taewoo
>
> On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <bu...@gmail.com> wrote:
>
>> +1!
>>>> 3. We only try to use applicable indexes from the inner branch. So, if
>>>> there are no applicable indexes from the inner branch, we abort
>>>> transforming a join into an index-nested-loop join.
>> "index-nested-loop join" -> "hybrid hash join"?
>>
>> Thanks!
>>
>> Yingyi
>>
>>
>>
>> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <wa...@gmail.com> wrote:
>>
>>> Hello dev,
>>>
>>> Regarding this issue (ASTERIXDB-1249), I would like to make an index-join
>>> hint clarification. Let's start with an example query other than a
>>> self-join query in this issue.
>>>
>>> for $c in dataset('Customers')
>>>
>>> for $o in dataset('Orders')
>>>
>>> where $c.cid /*+ indexnl */ = $o.cid
>>>
>>> order by $c.cid, $o.oid
>>>
>>> return {"cid":$c.cid, "oid": $o.oid}
>>>
>>> Right now, in the master branch, the first dataset (e.g., Customers)
>>> becomes the outer branch and the second dataset (e.g., Orders) becomes
>> the
>>> inner branch. And, when the optimizer tries to honor the given indexnl
>> hint
>>> (transforming a join into an index-nested-loop join), if there are
>>> applicable indexes from the inner branch (e.g., Orders), then it is going
>>> to use one of those indexes. If there are no applicable indexes from the
>>> inner branch, it tries to use indexes from the outer branch (e.g.,
>>> Customers). We are going to change the last part; we will not use indexes
>>> from the outer branch. So, the following are refined rules for
>> transforming
>>> a join into an index-nested-loop join.
>>>
>>> 1. The first dataset in a join (the first parameter of the given join)
>>> becomes the outer branch.
>>> 2. The second dataset in a join (the second parameter of the given join)
>>> becomes the inner branch.
>>> 3. We only try to use applicable indexes from the inner branch. So, if
>>> there are no applicable indexes from the inner branch, we abort
>>> transforming a join into an index-nested-loop join.
>>> 4. Variable order in the given join predicate is not important. It can be
>>> either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA.
>>>
>>> So, for the left-outer join and inner join altogether, the left subtree
>> is
>>> the probing side and the right subtree is the index side. So, this can be
>>> applied to the self-join case, too just like the following query in this
>>> issue. In the following query, $t1 becomes the outer and $t2 becomes the
>>> inner.
>>>
>>> for $t1 in dataset('TweetMessages')
>>>
>>> for $t2 in dataset('TweetMessages')
>>>
>>> let $c := $t1.countA + 20
>>>
>>> where $c /* +indexnl */= $t2.countB
>>>
>>> order by $t2.tweetid
>>>
>>> return {"tweetid2": $t2.tweetid, "count2":$t2.countB};
>>>
>>> Thank you. Any opinion would be appreciated before I finalize this fix.
>>>
>>> Best,
>>> Taewoo
>>>
>>> ---------- Forwarded message ----------
>>> From: Yingyi Bu (JIRA) <ji...@apache.org>
>>> Date: Wed, Jan 6, 2016 at 10:23 AM
>>> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses
>> wrong
>>> probe/index branch
>>> To: notifications@asterixdb.incubator.apache.org
>>>
>>>
>>>
>>>      [
>>>
>>>
>> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
>>> ]
>>>
>>> Yingyi Bu commented on ASTERIXDB-1249:
>>> --------------------------------------
>>>
>>> That's right. Basically in AcessMethod implementations, e.g.:
>>>
>>>
>> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124
>>> Choosing probe/index branch is only based on dataset names, instead of
>>> being based on the join condition.
>>>
>>>> Self index join chooses wrong probe/index branch
>>>> ------------------------------------------------
>>>>
>>>>                  Key: ASTERIXDB-1249
>>>>                  URL:
>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249
>>>>              Project: Apache AsterixDB
>>>>           Issue Type: Bug
>>>>           Components: Optimizer
>>>>             Reporter: Yingyi Bu
>>>>             Assignee: Taewoo Kim
>>>>
>>>> DDLs:
>>>> {noformat}
>>>> drop dataverse test if exists;
>>>> create dataverse test;
>>>> use dataverse test;
>>>> create type TwitterUserType as closed {
>>>>      screen-name: string,
>>>>      lang: string,
>>>>      friends-count: int64,
>>>>      statuses-count: int64,
>>>>      name: string,
>>>>      followers-count: int64
>>>> }
>>>> create type TweetMessageType as closed {
>>>>      tweetid: int64,
>>>>          user: TwitterUserType,
>>>>          sender-location: point,
>>>>      send-time: datetime,
>>>>          referred-topics: {{ string }},
>>>>      message-text: string,
>>>>      countA: int64,
>>>>      countB: int64
>>>> }
>>>> create dataset TweetMessages(TweetMessageType)
>>>> primary key tweetid;
>>>> create index twmSndLocIx on TweetMessages(sender-location) type rtree;
>>>> create index msgCountAIx on TweetMessages(countA) type btree;
>>>> create index msgCountBIx on TweetMessages(countB) type btree;
>>>> create index msgTextIx on TweetMessages(message-text) type keyword;
>>>> {noformat}
>>>> Query:
>>>> {noformat}
>>>> for $t1 in dataset('TweetMessages')
>>>> for $t2 in dataset('TweetMessages')
>>>> let $n :=  create-circle($t1.sender-location, 0.5)
>>>> where spatial-intersect($t2.sender-location, $n)
>>>> order by $t2.tweetid
>>>> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
>>>> {noformat}
>>>> Optimized plan:
>>>> {noformat}
>>>> distribute result [%0->$$10]
>>>> -- DISTRIBUTE_RESULT  |PARTITIONED|
>>>>    exchange
>>>>    -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>      project ([$$10])
>>>>      -- STREAM_PROJECT  |PARTITIONED|
>>>>        assign [$$10] <- [function-call:
>> asterix:closed-record-constructor,
>>> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
>>>>        -- ASSIGN  |PARTITIONED|
>>>>          exchange
>>>>          -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
>>>>            order (ASC, %0->$$15)
>>>>            -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
>>>>              exchange
>>>>              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>                project ([$$13, $$15])
>>>>                -- STREAM_PROJECT  |PARTITIONED|
>>>>                  select (function-call: asterix:spatial-intersect,
>>> Args:[%0->$$13, function-call: asterix:create-circle,
>> Args:[function-call:
>>> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}], ADouble:
>>> {0.5}]])
>>>>                  -- STREAM_SELECT  |PARTITIONED|
>>>>                    project ([$$0, $$13, $$15])
>>>>                    -- STREAM_PROJECT  |PARTITIONED|
>>>>                      exchange
>>>>                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>                        unnest-map [$$14, $$0] <- function-call:
>>> asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0},
>> AString:
>>> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean: {false},
>>> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27, TRUE,
>>> TRUE, TRUE]
>>>>                        -- BTREE_SEARCH  |PARTITIONED|
>>>>                          exchange
>>>>                          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>                            order (ASC, %0->$$27)
>>>>                            -- STABLE_SORT [$$27(ASC)]  |PARTITIONED|
>>>>                              exchange
>>>>                              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>                                project ([$$27, $$13, $$15])
>>>>                                -- STREAM_PROJECT  |PARTITIONED|
>>>>                                  exchange
>>>>                                  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>                                    unnest-map [$$23, $$24, $$25, $$26,
>>> $$27] <- function-call: asterix:index-search, Args:[AString:
>> {twmSndLocIx},
>>> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean: {true},
>>> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19, %0->$$20,
>>> %0->$$21, %0->$$22]
>>>>                                    -- RTREE_SEARCH  |PARTITIONED|
>>>>                                      exchange
>>>>                                      -- BROADCAST_EXCHANGE
>> |PARTITIONED|
>>>>                                        assign [$$19, $$20, $$21, $$22]
>> <-
>>> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2}, AInt32:
>>> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
>>> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32:
>>> {2}, AInt32: {2}], function-call: asterix:create-mbr, Args:[%0->$$13,
>>> AInt32: {2}, AInt32: {3}]]
>>>>                                        -- ASSIGN  |PARTITIONED|
>>>>                                          project ([$$13, $$15])
>>>>                                          -- STREAM_PROJECT
>> |PARTITIONED|
>>>>                                            assign [$$13] <-
>>> [function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32:
>> {2}]]
>>>>                                            -- ASSIGN  |PARTITIONED|
>>>>                                              exchange
>>>>                                              -- ONE_TO_ONE_EXCHANGE
>>> |PARTITIONED|
>>>>                                                data-scan []<-[$$15, $$1]
>>> <- test:TweetMessages
>>>>                                                -- DATASOURCE_SCAN
>>> |PARTITIONED|
>>>>                                                  exchange
>>>>                                                  -- ONE_TO_ONE_EXCHANGE
>>> |PARTITIONED|
>>>>                                                    empty-tuple-source
>>>>                                                    -- EMPTY_TUPLE_SOURCE
>>> |PARTITIONED|
>>>> {noformat}
>>>> The optimized plan is incorrect --- the index search doesn't use the
>>> right join condition and hence the result is different from expected.
>>>
>>>
>>>
>>> --
>>> This message was sent by Atlassian JIRA
>>> (v6.3.4#6332)
>>>


Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Posted by Taewoo Kim <wa...@gmail.com>.
Thanks Yingyi. Yes. If there is an equality condition and if we can't
transform a join into an index-nested loop join, then a hybrid hash join
will be used.

Best,
Taewoo

On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <bu...@gmail.com> wrote:

> +1!
> >>3. We only try to use applicable indexes from the inner branch. So, if
> >>there are no applicable indexes from the inner branch, we abort
> >>transforming a join into an index-nested-loop join.
> "index-nested-loop join" -> "hybrid hash join"?
>
> Thanks!
>
> Yingyi
>
>
>
> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <wa...@gmail.com> wrote:
>
> > Hello dev,
> >
> > Regarding this issue (ASTERIXDB-1249), I would like to make an index-join
> > hint clarification. Let's start with an example query other than a
> > self-join query in this issue.
> >
> > for $c in dataset('Customers')
> >
> > for $o in dataset('Orders')
> >
> > where $c.cid /*+ indexnl */ = $o.cid
> >
> > order by $c.cid, $o.oid
> >
> > return {"cid":$c.cid, "oid": $o.oid}
> >
> > Right now, in the master branch, the first dataset (e.g., Customers)
> > becomes the outer branch and the second dataset (e.g., Orders) becomes
> the
> > inner branch. And, when the optimizer tries to honor the given indexnl
> hint
> > (transforming a join into an index-nested-loop join), if there are
> > applicable indexes from the inner branch (e.g., Orders), then it is going
> > to use one of those indexes. If there are no applicable indexes from the
> > inner branch, it tries to use indexes from the outer branch (e.g.,
> > Customers). We are going to change the last part; we will not use indexes
> > from the outer branch. So, the following are refined rules for
> transforming
> > a join into an index-nested-loop join.
> >
> > 1. The first dataset in a join (the first parameter of the given join)
> > becomes the outer branch.
> > 2. The second dataset in a join (the second parameter of the given join)
> > becomes the inner branch.
> > 3. We only try to use applicable indexes from the inner branch. So, if
> > there are no applicable indexes from the inner branch, we abort
> > transforming a join into an index-nested-loop join.
> > 4. Variable order in the given join predicate is not important. It can be
> > either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA.
> >
> > So, for the left-outer join and inner join altogether, the left subtree
> is
> > the probing side and the right subtree is the index side. So, this can be
> > applied to the self-join case, too just like the following query in this
> > issue. In the following query, $t1 becomes the outer and $t2 becomes the
> > inner.
> >
> > for $t1 in dataset('TweetMessages')
> >
> > for $t2 in dataset('TweetMessages')
> >
> > let $c := $t1.countA + 20
> >
> > where $c /* +indexnl */= $t2.countB
> >
> > order by $t2.tweetid
> >
> > return {"tweetid2": $t2.tweetid, "count2":$t2.countB};
> >
> > Thank you. Any opinion would be appreciated before I finalize this fix.
> >
> > Best,
> > Taewoo
> >
> > ---------- Forwarded message ----------
> > From: Yingyi Bu (JIRA) <ji...@apache.org>
> > Date: Wed, Jan 6, 2016 at 10:23 AM
> > Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses
> wrong
> > probe/index branch
> > To: notifications@asterixdb.incubator.apache.org
> >
> >
> >
> >     [
> >
> >
> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
> > ]
> >
> > Yingyi Bu commented on ASTERIXDB-1249:
> > --------------------------------------
> >
> > That's right. Basically in AcessMethod implementations, e.g.:
> >
> >
> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124
> >
> > Choosing probe/index branch is only based on dataset names, instead of
> > being based on the join condition.
> >
> > > Self index join chooses wrong probe/index branch
> > > ------------------------------------------------
> > >
> > >                 Key: ASTERIXDB-1249
> > >                 URL:
> > https://issues.apache.org/jira/browse/ASTERIXDB-1249
> > >             Project: Apache AsterixDB
> > >          Issue Type: Bug
> > >          Components: Optimizer
> > >            Reporter: Yingyi Bu
> > >            Assignee: Taewoo Kim
> > >
> > > DDLs:
> > > {noformat}
> > > drop dataverse test if exists;
> > > create dataverse test;
> > > use dataverse test;
> > > create type TwitterUserType as closed {
> > >     screen-name: string,
> > >     lang: string,
> > >     friends-count: int64,
> > >     statuses-count: int64,
> > >     name: string,
> > >     followers-count: int64
> > > }
> > > create type TweetMessageType as closed {
> > >     tweetid: int64,
> > >         user: TwitterUserType,
> > >         sender-location: point,
> > >     send-time: datetime,
> > >         referred-topics: {{ string }},
> > >     message-text: string,
> > >     countA: int64,
> > >     countB: int64
> > > }
> > > create dataset TweetMessages(TweetMessageType)
> > > primary key tweetid;
> > > create index twmSndLocIx on TweetMessages(sender-location) type rtree;
> > > create index msgCountAIx on TweetMessages(countA) type btree;
> > > create index msgCountBIx on TweetMessages(countB) type btree;
> > > create index msgTextIx on TweetMessages(message-text) type keyword;
> > > {noformat}
> > > Query:
> > > {noformat}
> > > for $t1 in dataset('TweetMessages')
> > > for $t2 in dataset('TweetMessages')
> > > let $n :=  create-circle($t1.sender-location, 0.5)
> > > where spatial-intersect($t2.sender-location, $n)
> > > order by $t2.tweetid
> > > return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
> > > {noformat}
> > > Optimized plan:
> > > {noformat}
> > > distribute result [%0->$$10]
> > > -- DISTRIBUTE_RESULT  |PARTITIONED|
> > >   exchange
> > >   -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> > >     project ([$$10])
> > >     -- STREAM_PROJECT  |PARTITIONED|
> > >       assign [$$10] <- [function-call:
> asterix:closed-record-constructor,
> > Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
> > >       -- ASSIGN  |PARTITIONED|
> > >         exchange
> > >         -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
> > >           order (ASC, %0->$$15)
> > >           -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
> > >             exchange
> > >             -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> > >               project ([$$13, $$15])
> > >               -- STREAM_PROJECT  |PARTITIONED|
> > >                 select (function-call: asterix:spatial-intersect,
> > Args:[%0->$$13, function-call: asterix:create-circle,
> Args:[function-call:
> > asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}], ADouble:
> > {0.5}]])
> > >                 -- STREAM_SELECT  |PARTITIONED|
> > >                   project ([$$0, $$13, $$15])
> > >                   -- STREAM_PROJECT  |PARTITIONED|
> > >                     exchange
> > >                     -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> > >                       unnest-map [$$14, $$0] <- function-call:
> > asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0},
> AString:
> > {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean: {false},
> > ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27, TRUE,
> > TRUE, TRUE]
> > >                       -- BTREE_SEARCH  |PARTITIONED|
> > >                         exchange
> > >                         -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> > >                           order (ASC, %0->$$27)
> > >                           -- STABLE_SORT [$$27(ASC)]  |PARTITIONED|
> > >                             exchange
> > >                             -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> > >                               project ([$$27, $$13, $$15])
> > >                               -- STREAM_PROJECT  |PARTITIONED|
> > >                                 exchange
> > >                                 -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> > >                                   unnest-map [$$23, $$24, $$25, $$26,
> > $$27] <- function-call: asterix:index-search, Args:[AString:
> {twmSndLocIx},
> > AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean: {true},
> > ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19, %0->$$20,
> > %0->$$21, %0->$$22]
> > >                                   -- RTREE_SEARCH  |PARTITIONED|
> > >                                     exchange
> > >                                     -- BROADCAST_EXCHANGE
> |PARTITIONED|
> > >                                       assign [$$19, $$20, $$21, $$22]
> <-
> > [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2}, AInt32:
> > {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
> > AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32:
> > {2}, AInt32: {2}], function-call: asterix:create-mbr, Args:[%0->$$13,
> > AInt32: {2}, AInt32: {3}]]
> > >                                       -- ASSIGN  |PARTITIONED|
> > >                                         project ([$$13, $$15])
> > >                                         -- STREAM_PROJECT
> |PARTITIONED|
> > >                                           assign [$$13] <-
> > [function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32:
> {2}]]
> > >                                           -- ASSIGN  |PARTITIONED|
> > >                                             exchange
> > >                                             -- ONE_TO_ONE_EXCHANGE
> > |PARTITIONED|
> > >                                               data-scan []<-[$$15, $$1]
> > <- test:TweetMessages
> > >                                               -- DATASOURCE_SCAN
> > |PARTITIONED|
> > >                                                 exchange
> > >                                                 -- ONE_TO_ONE_EXCHANGE
> > |PARTITIONED|
> > >                                                   empty-tuple-source
> > >                                                   -- EMPTY_TUPLE_SOURCE
> > |PARTITIONED|
> > > {noformat}
> > > The optimized plan is incorrect --- the index search doesn't use the
> > right join condition and hence the result is different from expected.
> >
> >
> >
> > --
> > This message was sent by Atlassian JIRA
> > (v6.3.4#6332)
> >
>

Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Posted by Yingyi Bu <bu...@gmail.com>.
+1!
>>3. We only try to use applicable indexes from the inner branch. So, if
>>there are no applicable indexes from the inner branch, we abort
>>transforming a join into an index-nested-loop join.
"index-nested-loop join" -> "hybrid hash join"?

Thanks!

Yingyi



On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <wa...@gmail.com> wrote:

> Hello dev,
>
> Regarding this issue (ASTERIXDB-1249), I would like to make an index-join
> hint clarification. Let's start with an example query other than a
> self-join query in this issue.
>
> for $c in dataset('Customers')
>
> for $o in dataset('Orders')
>
> where $c.cid /*+ indexnl */ = $o.cid
>
> order by $c.cid, $o.oid
>
> return {"cid":$c.cid, "oid": $o.oid}
>
> Right now, in the master branch, the first dataset (e.g., Customers)
> becomes the outer branch and the second dataset (e.g., Orders) becomes the
> inner branch. And, when the optimizer tries to honor the given indexnl hint
> (transforming a join into an index-nested-loop join), if there are
> applicable indexes from the inner branch (e.g., Orders), then it is going
> to use one of those indexes. If there are no applicable indexes from the
> inner branch, it tries to use indexes from the outer branch (e.g.,
> Customers). We are going to change the last part; we will not use indexes
> from the outer branch. So, the following are refined rules for transforming
> a join into an index-nested-loop join.
>
> 1. The first dataset in a join (the first parameter of the given join)
> becomes the outer branch.
> 2. The second dataset in a join (the second parameter of the given join)
> becomes the inner branch.
> 3. We only try to use applicable indexes from the inner branch. So, if
> there are no applicable indexes from the inner branch, we abort
> transforming a join into an index-nested-loop join.
> 4. Variable order in the given join predicate is not important. It can be
> either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA.
>
> So, for the left-outer join and inner join altogether, the left subtree is
> the probing side and the right subtree is the index side. So, this can be
> applied to the self-join case, too just like the following query in this
> issue. In the following query, $t1 becomes the outer and $t2 becomes the
> inner.
>
> for $t1 in dataset('TweetMessages')
>
> for $t2 in dataset('TweetMessages')
>
> let $c := $t1.countA + 20
>
> where $c /* +indexnl */= $t2.countB
>
> order by $t2.tweetid
>
> return {"tweetid2": $t2.tweetid, "count2":$t2.countB};
>
> Thank you. Any opinion would be appreciated before I finalize this fix.
>
> Best,
> Taewoo
>
> ---------- Forwarded message ----------
> From: Yingyi Bu (JIRA) <ji...@apache.org>
> Date: Wed, Jan 6, 2016 at 10:23 AM
> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong
> probe/index branch
> To: notifications@asterixdb.incubator.apache.org
>
>
>
>     [
>
> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
> ]
>
> Yingyi Bu commented on ASTERIXDB-1249:
> --------------------------------------
>
> That's right. Basically in AcessMethod implementations, e.g.:
>
> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124
>
> Choosing probe/index branch is only based on dataset names, instead of
> being based on the join condition.
>
> > Self index join chooses wrong probe/index branch
> > ------------------------------------------------
> >
> >                 Key: ASTERIXDB-1249
> >                 URL:
> https://issues.apache.org/jira/browse/ASTERIXDB-1249
> >             Project: Apache AsterixDB
> >          Issue Type: Bug
> >          Components: Optimizer
> >            Reporter: Yingyi Bu
> >            Assignee: Taewoo Kim
> >
> > DDLs:
> > {noformat}
> > drop dataverse test if exists;
> > create dataverse test;
> > use dataverse test;
> > create type TwitterUserType as closed {
> >     screen-name: string,
> >     lang: string,
> >     friends-count: int64,
> >     statuses-count: int64,
> >     name: string,
> >     followers-count: int64
> > }
> > create type TweetMessageType as closed {
> >     tweetid: int64,
> >         user: TwitterUserType,
> >         sender-location: point,
> >     send-time: datetime,
> >         referred-topics: {{ string }},
> >     message-text: string,
> >     countA: int64,
> >     countB: int64
> > }
> > create dataset TweetMessages(TweetMessageType)
> > primary key tweetid;
> > create index twmSndLocIx on TweetMessages(sender-location) type rtree;
> > create index msgCountAIx on TweetMessages(countA) type btree;
> > create index msgCountBIx on TweetMessages(countB) type btree;
> > create index msgTextIx on TweetMessages(message-text) type keyword;
> > {noformat}
> > Query:
> > {noformat}
> > for $t1 in dataset('TweetMessages')
> > for $t2 in dataset('TweetMessages')
> > let $n :=  create-circle($t1.sender-location, 0.5)
> > where spatial-intersect($t2.sender-location, $n)
> > order by $t2.tweetid
> > return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
> > {noformat}
> > Optimized plan:
> > {noformat}
> > distribute result [%0->$$10]
> > -- DISTRIBUTE_RESULT  |PARTITIONED|
> >   exchange
> >   -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >     project ([$$10])
> >     -- STREAM_PROJECT  |PARTITIONED|
> >       assign [$$10] <- [function-call: asterix:closed-record-constructor,
> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
> >       -- ASSIGN  |PARTITIONED|
> >         exchange
> >         -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
> >           order (ASC, %0->$$15)
> >           -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
> >             exchange
> >             -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >               project ([$$13, $$15])
> >               -- STREAM_PROJECT  |PARTITIONED|
> >                 select (function-call: asterix:spatial-intersect,
> Args:[%0->$$13, function-call: asterix:create-circle, Args:[function-call:
> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}], ADouble:
> {0.5}]])
> >                 -- STREAM_SELECT  |PARTITIONED|
> >                   project ([$$0, $$13, $$15])
> >                   -- STREAM_PROJECT  |PARTITIONED|
> >                     exchange
> >                     -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >                       unnest-map [$$14, $$0] <- function-call:
> asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0}, AString:
> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean: {false},
> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27, TRUE,
> TRUE, TRUE]
> >                       -- BTREE_SEARCH  |PARTITIONED|
> >                         exchange
> >                         -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >                           order (ASC, %0->$$27)
> >                           -- STABLE_SORT [$$27(ASC)]  |PARTITIONED|
> >                             exchange
> >                             -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >                               project ([$$27, $$13, $$15])
> >                               -- STREAM_PROJECT  |PARTITIONED|
> >                                 exchange
> >                                 -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >                                   unnest-map [$$23, $$24, $$25, $$26,
> $$27] <- function-call: asterix:index-search, Args:[AString: {twmSndLocIx},
> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean: {true},
> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19, %0->$$20,
> %0->$$21, %0->$$22]
> >                                   -- RTREE_SEARCH  |PARTITIONED|
> >                                     exchange
> >                                     -- BROADCAST_EXCHANGE  |PARTITIONED|
> >                                       assign [$$19, $$20, $$21, $$22] <-
> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2}, AInt32:
> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32:
> {2}, AInt32: {2}], function-call: asterix:create-mbr, Args:[%0->$$13,
> AInt32: {2}, AInt32: {3}]]
> >                                       -- ASSIGN  |PARTITIONED|
> >                                         project ([$$13, $$15])
> >                                         -- STREAM_PROJECT  |PARTITIONED|
> >                                           assign [$$13] <-
> [function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32: {2}]]
> >                                           -- ASSIGN  |PARTITIONED|
> >                                             exchange
> >                                             -- ONE_TO_ONE_EXCHANGE
> |PARTITIONED|
> >                                               data-scan []<-[$$15, $$1]
> <- test:TweetMessages
> >                                               -- DATASOURCE_SCAN
> |PARTITIONED|
> >                                                 exchange
> >                                                 -- ONE_TO_ONE_EXCHANGE
> |PARTITIONED|
> >                                                   empty-tuple-source
> >                                                   -- EMPTY_TUPLE_SOURCE
> |PARTITIONED|
> > {noformat}
> > The optimized plan is incorrect --- the index search doesn't use the
> right join condition and hence the result is different from expected.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>