You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by Muhammad Gelbana <m....@gmail.com> on 2017/05/06 14:34:53 UTC

Running cartesian joins on Drill

Is there a reason why Drill would intentionally reject cartesian join
queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?

Any ideas how could a query be rewritten to overcome this restriction ?

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

Re: Running cartesian joins on Drill

Posted by Muhammad Gelbana <m....@gmail.com>.
You are correct Aman. Here is the JIRA issue
<https://issues.apache.org/jira/browse/DRILL-5515>

This thread has been very helpful. Thank you all.

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Fri, May 12, 2017 at 6:50 AM, Aman Sinha <as...@mapr.com> wrote:

> Muhammad,
> The join condition  ‘a = b or (a is null && b is null)’ works.
> Internally, this is converted to  ‘a is not distinct from b’ which is
> processed by Drill.
> For some reason, if the second form is directly supplied in the user
> query, it is not working and ends up with the Cartesian join condition.
> Drill leverages Calcite for this (you can see CALCITE-1200 for some
> background).
> Can you file a JIRA for this ?
>
> -Aman
>
> From: "Aman Sinha (asinha@mapr.com)" <as...@mapr.com>
> Date: Thursday, May 11, 2017 at 4:29 PM
> To: dev <de...@drill.apache.org>, user <us...@drill.apache.org>
> Cc: Shadi Khalifa <kh...@cs.queensu.ca>
> Subject: Re: Running cartesian joins on Drill
>
>
> I think Muhammad may be trying to run his original query with IS NOT
> DISTINCT FROM.   That discussion got side-tracked into Cartesian joins
> because his query was not getting planned and the error was about Cartesian
> join.
>
> Muhammad,  can you try with the equivalent version below ?  You mentioned
> the rewrite but did you try the rewritten version ?
>
>
>
> SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
>
> LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
>
> `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
>
> ​​
>
> `t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
> IS NULL) )
>
>
>
> On 5/11/17, 3:23 PM, "Zelaine Fong" <zf...@mapr.com> wrote:
>
>
>
>     I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s
> my output:
>
>
>
>     0: jdbc:drill:zk=local> alter session set `planner.enable_nljoin_for_scalar_only`
> = false;
>
>     +-------+-------------------------------------------------+
>
>     |  ok   |                     summary                     |
>
>     +-------+-------------------------------------------------+
>
>     | true  | planner.enable_nljoin_for_scalar_only updated.  |
>
>     +-------+-------------------------------------------------+
>
>     1 row selected (0.137 seconds)
>
>     0: jdbc:drill:zk=local> explain plan for select * from
> dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;
>
>     +------+------+
>
>     | text | json |
>
>     +------+------+
>
>     | 00-00    Screen
>
>     00-01      ProjectAllowDup(*=[$0], *0=[$1])
>
>     00-02        NestedLoopJoin(condition=[true], joinType=[inner])
>
>     00-04          Project(T2¦¦*=[$0])
>
>     00-06            Scan(groupscan=[EasyGroupScan
> [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`],
> files=[file:/Users/zfong/foo.csv]]])
>
>     00-03          Project(T3¦¦*=[$0])
>
>     00-05            Scan(groupscan=[EasyGroupScan
> [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`],
> files=[file:/Users/zfong/foo.csv]]])
>
>
>
>     -- Zelaine
>
>
>
>     On 5/11/17, 3:17 PM, "Muhammad Gelbana" <m....@gmail.com> wrote:
>
>
>
>         ​But the query I provided failed to be planned because it's a
> cartesian
>
>         join, although I've set the option you mentioned to false. Is
> there a
>
>         reason why wouldn't Drill rules physically implement the logical
> join in my
>
>         query to a nested loop join ?
>
>
>
>         *---------------------*
>
>         *Muhammad Gelbana*
>
>         http://www.linkedin.com/in/mgelbana
>
>
>
>         On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zf...@mapr.com>
> wrote:
>
>
>
>         > Provided `planner.enable_nljoin_for_scalar_only` is set to
> false, even
>
>         > without an explicit join condition, the query should use the
> Cartesian
>
>         > join/nested loop join.
>
>         >
>
>         > -- Zelaine
>
>         >
>
>         > On 5/11/17, 4:20 AM, "Anup Tiwari" <an...@games24x7.com>
> wrote:
>
>         >
>
>         >     Hi,
>
>         >
>
>         >     I have one question here.. so if we have to use Cartesian
> join in Drill
>
>         >     then do we have to follow some workaround like Shadi mention
> : adding a
>
>         >     dummy column on the fly that has the value 1 in both tables
> and then
>
>         > join
>
>         >     on that column leading to having a match of every row of the
> first
>
>         > table
>
>         >     with every row of the second table, hence do a Cartesian
> product?
>
>         >     OR
>
>         >     If we just don't specify join condition like :
>
>         >     select a.*, b.* from tt1 as a, tt2 b; then will it
> internally treat
>
>         > this
>
>         >     query as Cartesian join.
>
>         >
>
>         >     Regards,
>
>         >     *Anup Tiwari*
>
>         >
>
>         >     On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <
> zfong@mapr.com> wrote:
>
>         >
>
>         >     > Cartesian joins in Drill are implemented as nested loop
> joins, and I
>
>         > think
>
>         >     > you should see that reflected in the resultant query plan
> when you
>
>         > run
>
>         >     > explain plan on the query.
>
>         >     >
>
>         >     > Yes, Cartesian joins/nested loop joins are expensive
> because you’re
>
>         >     > effectively doing an MxN read of your tables.  There are
> more
>
>         > efficient
>
>         >     > ways of processing a nested loop join, e.g., by creating
> an index on
>
>         > the
>
>         >     > larger table in the join and then using that index to do
> lookups
>
>         > into that
>
>         >     > table.  That way, the nested loop join cost is the cost of
> creating
>
>         > the
>
>         >     > index + M, where M is the number of rows in the smaller
> table and
>
>         > assuming
>
>         >     > the lookup cost into the index does minimize the amount of
> data read
>
>         > of the
>
>         >     > second table.  Drill currently doesn’t do this.
>
>         >     >
>
>         >     > -- Zelaine
>
>         >     >
>
>         >     > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <
> m.gelbana@gmail.com> wrote:
>
>         >     >
>
>         >     >     ​I believe ​clhubert is referring to this discussion
>
>         >     >     <http://drill-user.incubator.
> apache.narkive.com/TIXWiTY4/
>
>         >     > cartesian-product-in-apache-drill#post1>
>
>         >     >     .
>
>         >     >
>
>         >     >     So why Drill doesn't transform this query into a
> nested join
>
>         > query ?
>
>         >     > Simply
>
>         >     >     because there is no Calcite rule to transform it into
> a nested
>
>         > loop
>
>         >     > join ?
>
>         >     >     Is it not technically possible to write such Rule or
> is it
>
>         > feasible so
>
>         >     > I
>
>         >     >     may take on this challenge ?
>
>         >     >
>
>         >     >     Also pardon me for repeating my question but I fail to
> find an
>
>         > answer
>
>         >     > in
>
>         >     >     your replies, why doesn't Drill just run a cartesian
> join ?
>
>         > Because
>
>         >     > it's
>
>         >     >     expensive regarding resources (i.e. CPU\Network\RAM) ?
>
>         >     >
>
>         >     >     Thanks a lot Shadi for the query, it works for me.
>
>         >     >
>
>         >     >     *---------------------*
>
>         >     >     *Muhammad Gelbana*
>
>         >     >     http://www.linkedin.com/in/mgelbana
>
>         >     >
>
>         >     >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <
>
>         > khalifa@cs.queensu.ca>
>
>         >     > wrote:
>
>         >     >
>
>         >     >     > Hi Muhammad,
>
>         >     >     >
>
>         >     >     > I did the following as a workaround to have
> Cartesian product.
>
>         > The
>
>         >     > basic
>
>         >     >     > idea is to create a dummy column on the fly that has
> the value
>
>         > 1 in
>
>         >     > both
>
>         >     >     > tables and then join on that column leading to
> having a match
>
>         > of
>
>         >     > every row
>
>         >     >     > of the first table with every row of the second
> table, hence
>
>         > do a
>
>         >     > Cartesian
>
>         >     >     > product. This might not be the most efficient way
> but it will
>
>         > do the
>
>         >     > job.
>
>         >     >     >
>
>         >     >     > *Original Query:*
>
>         >     >     > SELECT * FROM
>
>         >     >     > ( SELECT 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc
>
>         > LIMIT
>
>         >     >     > 2147483647) `t0`
>
>         >     >     > INNER JOIN
>
>         >     >     > ( SELECT 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc
>
>         > LIMIT
>
>         >     >     > 2147483647) `t1`
>
>         >     >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
>
>         >     >     > LIMIT 2147483647
>
>         >     >     >
>
>         >     >     > *Workaround (add columns **d1a381f3g73 and
> **d1a381f3g74 to
>
>         > tables
>
>         >     > one
>
>         >     >     > and two, respectively. Names don't really matter,
> just need to
>
>         > be
>
>         >     > unique):*
>
>         >     >     > SELECT * FROM
>
>         >     >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
>
>         >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647)
> `t0`
>
>         >     >     > INNER JOIN
>
>         >     >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
>
>         >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647)
> `t1`
>
>         >     >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
>
>         >     >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM
> `t1`.`UserID`
>
>         >     >     > LIMIT 2147483647
>
>         >     >     >
>
>         >     >     > Regards
>
>         >     >     >
>
>         >     >     >
>
>         >     >     > *Shadi Khalifa, PhD*
>
>         >     >     > Postdoctoral Fellow
>
>         >     >     > Cognitive Analytics Development Hub
>
>         >     >     > Centre for Advanced Computing
>
>         >     >     > Queen’s University
>
>         >     >     > (613) 533-6000 x78347
>
>         >     >     > http://cac.queensu.ca
>
>         >     >     >
>
>         >     >     > I'm just a neuron in the society collective brain
>
>         >     >     >
>
>         >     >     > *Join us for HPCS in June 2017! Register at:*  *
>
>         > http://2017.hpcs.ca/
>
>         >     >     > <http://2017.hpcs.ca/>*
>
>         >     >     >
>
>         >     >     > P Please consider your environmental responsibility
> before
>
>         > printing
>
>         >     > this
>
>         >     >     > e-mail
>
>         >     >     >
>
>         >     >     > *01001001 00100000 01101100 01101111 01110110
> 01100101 00100000
>
>         >     > 01000101
>
>         >     >     > 01100111 01111001 01110000 01110100 *
>
>         >     >     >
>
>         >     >     > *The information transmitted is intended only for
> the person or
>
>         >     > entity to
>
>         >     >     > which it is addressed and may contain confidential
> material.
>
>         > Any
>
>         >     > review or
>
>         >     >     > dissemination of this information by persons other
> than the
>
>         > intended
>
>         >     >     > recipient is prohibited. If you received this in
> error, please
>
>         >     > contact the
>
>         >     >     > sender and delete the material from any computer.
> Thank you.*
>
>         >     >     >
>
>         >     >     >
>
>         >     >     >
>
>         >     >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
>
>         >     > m.gelbana@gmail.com>
>
>         >     >     > wrote:
>
>         >     >     >
>
>         >     >     >
>
>         >     >     > ​​
>
>         >     >     > Here it is:
>
>         >     >     >
>
>         >     >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM
>
>         >     > `dfs`.`path_to_parquet_file` tc
>
>         >     >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC'
> `UserID` FROM
>
>         >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647)
> `t1` ON (
>
>         >     >     > ​​
>
>         >     >     > `t0`.`UserID` IS NOT DISTINCT FROM
>
>         >     >     > ​​
>
>         >     >     > `t1`.`UserID`) LIMIT 2147483647
>
>         >     >     >
>
>         >     >     > I debugged Drill code and found it decomposes *IS
> NOT DISTINCT
>
>         > FROM*
>
>         >     > into
>
>         >     >     > ​
>
>         >     >     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS
> NULL &&
>
>         >     > `t1`.`UserID`
>
>         >     >     > IS NULL**)* while checking if the query is a
> cartesian join,
>
>         > and
>
>         >     > when the
>
>         >     >     > check returns true, it throws an excetion saying:
> *This query
>
>         > cannot
>
>         >     > be
>
>         >     >     > planned possibly due to either a cartesian join or an
>
>         > inequality
>
>         >     > join*
>
>         >     >     >
>
>         >     >     >
>
>         >     >     > *---------------------*
>
>         >     >     > *Muhammad Gelbana*
>
>         >     >     > http://www.linkedin.com/in/mgelbana
>
>         >     >     >
>
>         >     >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <
> gparai@mapr.com>
>
>         >     > wrote:
>
>         >     >     >
>
>         >     >     > > Can you please specify the query you are trying to
> execute?
>
>         >     >     > >
>
>         >     >     > >
>
>         >     >     > > Gautam
>
>         >     >     > >
>
>         >     >     > > ________________________________
>
>         >     >     > > From: Muhammad Gelbana <m....@gmail.com>
>
>         >     >     > > Sent: Saturday, May 6, 2017 7:34:53 AM
>
>         >     >     > > To: user@drill.apache.org; dev@drill.apache.org
>
>         >     >     > > Subject: Running cartesian joins on Drill
>
>         >     >     > >
>
>         >     >     > > Is there a reason why Drill would intentionally
> reject
>
>         > cartesian
>
>         >     > join
>
>         >     >     > > queries even if *planner.enable_nljoin_for_scalar_only*
> is
>
>         >     > disabled ?
>
>         >     >     > >
>
>         >     >     > > Any ideas how could a query be rewritten to
> overcome this
>
>         >     > restriction ?
>
>         >     >     > >
>
>         >     >     > > *---------------------*
>
>         >     >     > > *Muhammad Gelbana*
>
>         >     >     > > http://www.linkedin.com/in/mgelbana
>
>         >     >     > >
>
>         >     >     >
>
>         >     >     >
>
>         >     >     >
>
>         >     >
>
>         >     >
>
>         >     >
>
>         >
>
>         >
>
>         >
>
>
>
>
>
>
>

Re: Running cartesian joins on Drill

Posted by Muhammad Gelbana <m....@gmail.com>.
You are correct Aman. Here is the JIRA issue
<https://issues.apache.org/jira/browse/DRILL-5515>

This thread has been very helpful. Thank you all.

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Fri, May 12, 2017 at 6:50 AM, Aman Sinha <as...@mapr.com> wrote:

> Muhammad,
> The join condition  ‘a = b or (a is null && b is null)’ works.
> Internally, this is converted to  ‘a is not distinct from b’ which is
> processed by Drill.
> For some reason, if the second form is directly supplied in the user
> query, it is not working and ends up with the Cartesian join condition.
> Drill leverages Calcite for this (you can see CALCITE-1200 for some
> background).
> Can you file a JIRA for this ?
>
> -Aman
>
> From: "Aman Sinha (asinha@mapr.com)" <as...@mapr.com>
> Date: Thursday, May 11, 2017 at 4:29 PM
> To: dev <de...@drill.apache.org>, user <us...@drill.apache.org>
> Cc: Shadi Khalifa <kh...@cs.queensu.ca>
> Subject: Re: Running cartesian joins on Drill
>
>
> I think Muhammad may be trying to run his original query with IS NOT
> DISTINCT FROM.   That discussion got side-tracked into Cartesian joins
> because his query was not getting planned and the error was about Cartesian
> join.
>
> Muhammad,  can you try with the equivalent version below ?  You mentioned
> the rewrite but did you try the rewritten version ?
>
>
>
> SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
>
> LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
>
> `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
>
> ​​
>
> `t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
> IS NULL) )
>
>
>
> On 5/11/17, 3:23 PM, "Zelaine Fong" <zf...@mapr.com> wrote:
>
>
>
>     I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s
> my output:
>
>
>
>     0: jdbc:drill:zk=local> alter session set `planner.enable_nljoin_for_scalar_only`
> = false;
>
>     +-------+-------------------------------------------------+
>
>     |  ok   |                     summary                     |
>
>     +-------+-------------------------------------------------+
>
>     | true  | planner.enable_nljoin_for_scalar_only updated.  |
>
>     +-------+-------------------------------------------------+
>
>     1 row selected (0.137 seconds)
>
>     0: jdbc:drill:zk=local> explain plan for select * from
> dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;
>
>     +------+------+
>
>     | text | json |
>
>     +------+------+
>
>     | 00-00    Screen
>
>     00-01      ProjectAllowDup(*=[$0], *0=[$1])
>
>     00-02        NestedLoopJoin(condition=[true], joinType=[inner])
>
>     00-04          Project(T2¦¦*=[$0])
>
>     00-06            Scan(groupscan=[EasyGroupScan
> [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`],
> files=[file:/Users/zfong/foo.csv]]])
>
>     00-03          Project(T3¦¦*=[$0])
>
>     00-05            Scan(groupscan=[EasyGroupScan
> [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`],
> files=[file:/Users/zfong/foo.csv]]])
>
>
>
>     -- Zelaine
>
>
>
>     On 5/11/17, 3:17 PM, "Muhammad Gelbana" <m....@gmail.com> wrote:
>
>
>
>         ​But the query I provided failed to be planned because it's a
> cartesian
>
>         join, although I've set the option you mentioned to false. Is
> there a
>
>         reason why wouldn't Drill rules physically implement the logical
> join in my
>
>         query to a nested loop join ?
>
>
>
>         *---------------------*
>
>         *Muhammad Gelbana*
>
>         http://www.linkedin.com/in/mgelbana
>
>
>
>         On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zf...@mapr.com>
> wrote:
>
>
>
>         > Provided `planner.enable_nljoin_for_scalar_only` is set to
> false, even
>
>         > without an explicit join condition, the query should use the
> Cartesian
>
>         > join/nested loop join.
>
>         >
>
>         > -- Zelaine
>
>         >
>
>         > On 5/11/17, 4:20 AM, "Anup Tiwari" <an...@games24x7.com>
> wrote:
>
>         >
>
>         >     Hi,
>
>         >
>
>         >     I have one question here.. so if we have to use Cartesian
> join in Drill
>
>         >     then do we have to follow some workaround like Shadi mention
> : adding a
>
>         >     dummy column on the fly that has the value 1 in both tables
> and then
>
>         > join
>
>         >     on that column leading to having a match of every row of the
> first
>
>         > table
>
>         >     with every row of the second table, hence do a Cartesian
> product?
>
>         >     OR
>
>         >     If we just don't specify join condition like :
>
>         >     select a.*, b.* from tt1 as a, tt2 b; then will it
> internally treat
>
>         > this
>
>         >     query as Cartesian join.
>
>         >
>
>         >     Regards,
>
>         >     *Anup Tiwari*
>
>         >
>
>         >     On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <
> zfong@mapr.com> wrote:
>
>         >
>
>         >     > Cartesian joins in Drill are implemented as nested loop
> joins, and I
>
>         > think
>
>         >     > you should see that reflected in the resultant query plan
> when you
>
>         > run
>
>         >     > explain plan on the query.
>
>         >     >
>
>         >     > Yes, Cartesian joins/nested loop joins are expensive
> because you’re
>
>         >     > effectively doing an MxN read of your tables.  There are
> more
>
>         > efficient
>
>         >     > ways of processing a nested loop join, e.g., by creating
> an index on
>
>         > the
>
>         >     > larger table in the join and then using that index to do
> lookups
>
>         > into that
>
>         >     > table.  That way, the nested loop join cost is the cost of
> creating
>
>         > the
>
>         >     > index + M, where M is the number of rows in the smaller
> table and
>
>         > assuming
>
>         >     > the lookup cost into the index does minimize the amount of
> data read
>
>         > of the
>
>         >     > second table.  Drill currently doesn’t do this.
>
>         >     >
>
>         >     > -- Zelaine
>
>         >     >
>
>         >     > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <
> m.gelbana@gmail.com> wrote:
>
>         >     >
>
>         >     >     ​I believe ​clhubert is referring to this discussion
>
>         >     >     <http://drill-user.incubator.
> apache.narkive.com/TIXWiTY4/
>
>         >     > cartesian-product-in-apache-drill#post1>
>
>         >     >     .
>
>         >     >
>
>         >     >     So why Drill doesn't transform this query into a
> nested join
>
>         > query ?
>
>         >     > Simply
>
>         >     >     because there is no Calcite rule to transform it into
> a nested
>
>         > loop
>
>         >     > join ?
>
>         >     >     Is it not technically possible to write such Rule or
> is it
>
>         > feasible so
>
>         >     > I
>
>         >     >     may take on this challenge ?
>
>         >     >
>
>         >     >     Also pardon me for repeating my question but I fail to
> find an
>
>         > answer
>
>         >     > in
>
>         >     >     your replies, why doesn't Drill just run a cartesian
> join ?
>
>         > Because
>
>         >     > it's
>
>         >     >     expensive regarding resources (i.e. CPU\Network\RAM) ?
>
>         >     >
>
>         >     >     Thanks a lot Shadi for the query, it works for me.
>
>         >     >
>
>         >     >     *---------------------*
>
>         >     >     *Muhammad Gelbana*
>
>         >     >     http://www.linkedin.com/in/mgelbana
>
>         >     >
>
>         >     >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <
>
>         > khalifa@cs.queensu.ca>
>
>         >     > wrote:
>
>         >     >
>
>         >     >     > Hi Muhammad,
>
>         >     >     >
>
>         >     >     > I did the following as a workaround to have
> Cartesian product.
>
>         > The
>
>         >     > basic
>
>         >     >     > idea is to create a dummy column on the fly that has
> the value
>
>         > 1 in
>
>         >     > both
>
>         >     >     > tables and then join on that column leading to
> having a match
>
>         > of
>
>         >     > every row
>
>         >     >     > of the first table with every row of the second
> table, hence
>
>         > do a
>
>         >     > Cartesian
>
>         >     >     > product. This might not be the most efficient way
> but it will
>
>         > do the
>
>         >     > job.
>
>         >     >     >
>
>         >     >     > *Original Query:*
>
>         >     >     > SELECT * FROM
>
>         >     >     > ( SELECT 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc
>
>         > LIMIT
>
>         >     >     > 2147483647) `t0`
>
>         >     >     > INNER JOIN
>
>         >     >     > ( SELECT 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc
>
>         > LIMIT
>
>         >     >     > 2147483647) `t1`
>
>         >     >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
>
>         >     >     > LIMIT 2147483647
>
>         >     >     >
>
>         >     >     > *Workaround (add columns **d1a381f3g73 and
> **d1a381f3g74 to
>
>         > tables
>
>         >     > one
>
>         >     >     > and two, respectively. Names don't really matter,
> just need to
>
>         > be
>
>         >     > unique):*
>
>         >     >     > SELECT * FROM
>
>         >     >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
>
>         >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647)
> `t0`
>
>         >     >     > INNER JOIN
>
>         >     >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
>
>         >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647)
> `t1`
>
>         >     >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
>
>         >     >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM
> `t1`.`UserID`
>
>         >     >     > LIMIT 2147483647
>
>         >     >     >
>
>         >     >     > Regards
>
>         >     >     >
>
>         >     >     >
>
>         >     >     > *Shadi Khalifa, PhD*
>
>         >     >     > Postdoctoral Fellow
>
>         >     >     > Cognitive Analytics Development Hub
>
>         >     >     > Centre for Advanced Computing
>
>         >     >     > Queen’s University
>
>         >     >     > (613) 533-6000 x78347
>
>         >     >     > http://cac.queensu.ca
>
>         >     >     >
>
>         >     >     > I'm just a neuron in the society collective brain
>
>         >     >     >
>
>         >     >     > *Join us for HPCS in June 2017! Register at:*  *
>
>         > http://2017.hpcs.ca/
>
>         >     >     > <http://2017.hpcs.ca/>*
>
>         >     >     >
>
>         >     >     > P Please consider your environmental responsibility
> before
>
>         > printing
>
>         >     > this
>
>         >     >     > e-mail
>
>         >     >     >
>
>         >     >     > *01001001 00100000 01101100 01101111 01110110
> 01100101 00100000
>
>         >     > 01000101
>
>         >     >     > 01100111 01111001 01110000 01110100 *
>
>         >     >     >
>
>         >     >     > *The information transmitted is intended only for
> the person or
>
>         >     > entity to
>
>         >     >     > which it is addressed and may contain confidential
> material.
>
>         > Any
>
>         >     > review or
>
>         >     >     > dissemination of this information by persons other
> than the
>
>         > intended
>
>         >     >     > recipient is prohibited. If you received this in
> error, please
>
>         >     > contact the
>
>         >     >     > sender and delete the material from any computer.
> Thank you.*
>
>         >     >     >
>
>         >     >     >
>
>         >     >     >
>
>         >     >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
>
>         >     > m.gelbana@gmail.com>
>
>         >     >     > wrote:
>
>         >     >     >
>
>         >     >     >
>
>         >     >     > ​​
>
>         >     >     > Here it is:
>
>         >     >     >
>
>         >     >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM
>
>         >     > `dfs`.`path_to_parquet_file` tc
>
>         >     >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC'
> `UserID` FROM
>
>         >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647)
> `t1` ON (
>
>         >     >     > ​​
>
>         >     >     > `t0`.`UserID` IS NOT DISTINCT FROM
>
>         >     >     > ​​
>
>         >     >     > `t1`.`UserID`) LIMIT 2147483647
>
>         >     >     >
>
>         >     >     > I debugged Drill code and found it decomposes *IS
> NOT DISTINCT
>
>         > FROM*
>
>         >     > into
>
>         >     >     > ​
>
>         >     >     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS
> NULL &&
>
>         >     > `t1`.`UserID`
>
>         >     >     > IS NULL**)* while checking if the query is a
> cartesian join,
>
>         > and
>
>         >     > when the
>
>         >     >     > check returns true, it throws an excetion saying:
> *This query
>
>         > cannot
>
>         >     > be
>
>         >     >     > planned possibly due to either a cartesian join or an
>
>         > inequality
>
>         >     > join*
>
>         >     >     >
>
>         >     >     >
>
>         >     >     > *---------------------*
>
>         >     >     > *Muhammad Gelbana*
>
>         >     >     > http://www.linkedin.com/in/mgelbana
>
>         >     >     >
>
>         >     >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <
> gparai@mapr.com>
>
>         >     > wrote:
>
>         >     >     >
>
>         >     >     > > Can you please specify the query you are trying to
> execute?
>
>         >     >     > >
>
>         >     >     > >
>
>         >     >     > > Gautam
>
>         >     >     > >
>
>         >     >     > > ________________________________
>
>         >     >     > > From: Muhammad Gelbana <m....@gmail.com>
>
>         >     >     > > Sent: Saturday, May 6, 2017 7:34:53 AM
>
>         >     >     > > To: user@drill.apache.org; dev@drill.apache.org
>
>         >     >     > > Subject: Running cartesian joins on Drill
>
>         >     >     > >
>
>         >     >     > > Is there a reason why Drill would intentionally
> reject
>
>         > cartesian
>
>         >     > join
>
>         >     >     > > queries even if *planner.enable_nljoin_for_scalar_only*
> is
>
>         >     > disabled ?
>
>         >     >     > >
>
>         >     >     > > Any ideas how could a query be rewritten to
> overcome this
>
>         >     > restriction ?
>
>         >     >     > >
>
>         >     >     > > *---------------------*
>
>         >     >     > > *Muhammad Gelbana*
>
>         >     >     > > http://www.linkedin.com/in/mgelbana
>
>         >     >     > >
>
>         >     >     >
>
>         >     >     >
>
>         >     >     >
>
>         >     >
>
>         >     >
>
>         >     >
>
>         >
>
>         >
>
>         >
>
>
>
>
>
>
>

Re: Running cartesian joins on Drill

Posted by Aman Sinha <as...@mapr.com>.
Muhammad,
The join condition  ‘a = b or (a is null && b is null)’ works.  Internally, this is converted to  ‘a is not distinct from b’ which is processed by Drill.
For some reason, if the second form is directly supplied in the user query, it is not working and ends up with the Cartesian join condition.  Drill leverages Calcite for this (you can see CALCITE-1200 for some background).
Can you file a JIRA for this ?

-Aman

From: "Aman Sinha (asinha@mapr.com)" <as...@mapr.com>
Date: Thursday, May 11, 2017 at 4:29 PM
To: dev <de...@drill.apache.org>, user <us...@drill.apache.org>
Cc: Shadi Khalifa <kh...@cs.queensu.ca>
Subject: Re: Running cartesian joins on Drill


I think Muhammad may be trying to run his original query with IS NOT DISTINCT FROM.   That discussion got side-tracked into Cartesian joins because his query was not getting planned and the error was about Cartesian join.

Muhammad,  can you try with the equivalent version below ?  You mentioned the rewrite but did you try the rewritten version ?



SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (

​​

`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID` IS NULL) )



On 5/11/17, 3:23 PM, "Zelaine Fong" <zf...@mapr.com> wrote:



    I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s my output:



    0: jdbc:drill:zk=local> alter session set `planner.enable_nljoin_for_scalar_only` = false;

    +-------+-------------------------------------------------+

    |  ok   |                     summary                     |

    +-------+-------------------------------------------------+

    | true  | planner.enable_nljoin_for_scalar_only updated.  |

    +-------+-------------------------------------------------+

    1 row selected (0.137 seconds)

    0: jdbc:drill:zk=local> explain plan for select * from dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;

    +------+------+

    | text | json |

    +------+------+

    | 00-00    Screen

    00-01      ProjectAllowDup(*=[$0], *0=[$1])

    00-02        NestedLoopJoin(condition=[true], joinType=[inner])

    00-04          Project(T2¦¦*=[$0])

    00-06            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])

    00-03          Project(T3¦¦*=[$0])

    00-05            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])



    -- Zelaine



    On 5/11/17, 3:17 PM, "Muhammad Gelbana" <m....@gmail.com> wrote:



        ​But the query I provided failed to be planned because it's a cartesian

        join, although I've set the option you mentioned to false. Is there a

        reason why wouldn't Drill rules physically implement the logical join in my

        query to a nested loop join ?



        *---------------------*

        *Muhammad Gelbana*

        http://www.linkedin.com/in/mgelbana



        On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zf...@mapr.com> wrote:



        > Provided `planner.enable_nljoin_for_scalar_only` is set to false, even

        > without an explicit join condition, the query should use the Cartesian

        > join/nested loop join.

        >

        > -- Zelaine

        >

        > On 5/11/17, 4:20 AM, "Anup Tiwari" <an...@games24x7.com> wrote:

        >

        >     Hi,

        >

        >     I have one question here.. so if we have to use Cartesian join in Drill

        >     then do we have to follow some workaround like Shadi mention : adding a

        >     dummy column on the fly that has the value 1 in both tables and then

        > join

        >     on that column leading to having a match of every row of the first

        > table

        >     with every row of the second table, hence do a Cartesian product?

        >     OR

        >     If we just don't specify join condition like :

        >     select a.*, b.* from tt1 as a, tt2 b; then will it internally treat

        > this

        >     query as Cartesian join.

        >

        >     Regards,

        >     *Anup Tiwari*

        >

        >     On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> wrote:

        >

        >     > Cartesian joins in Drill are implemented as nested loop joins, and I

        > think

        >     > you should see that reflected in the resultant query plan when you

        > run

        >     > explain plan on the query.

        >     >

        >     > Yes, Cartesian joins/nested loop joins are expensive because you’re

        >     > effectively doing an MxN read of your tables.  There are more

        > efficient

        >     > ways of processing a nested loop join, e.g., by creating an index on

        > the

        >     > larger table in the join and then using that index to do lookups

        > into that

        >     > table.  That way, the nested loop join cost is the cost of creating

        > the

        >     > index + M, where M is the number of rows in the smaller table and

        > assuming

        >     > the lookup cost into the index does minimize the amount of data read

        > of the

        >     > second table.  Drill currently doesn’t do this.

        >     >

        >     > -- Zelaine

        >     >

        >     > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m....@gmail.com> wrote:

        >     >

        >     >     ​I believe ​clhubert is referring to this discussion

        >     >     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/

        >     > cartesian-product-in-apache-drill#post1>

        >     >     .

        >     >

        >     >     So why Drill doesn't transform this query into a nested join

        > query ?

        >     > Simply

        >     >     because there is no Calcite rule to transform it into a nested

        > loop

        >     > join ?

        >     >     Is it not technically possible to write such Rule or is it

        > feasible so

        >     > I

        >     >     may take on this challenge ?

        >     >

        >     >     Also pardon me for repeating my question but I fail to find an

        > answer

        >     > in

        >     >     your replies, why doesn't Drill just run a cartesian join ?

        > Because

        >     > it's

        >     >     expensive regarding resources (i.e. CPU\Network\RAM) ?

        >     >

        >     >     Thanks a lot Shadi for the query, it works for me.

        >     >

        >     >     *---------------------*

        >     >     *Muhammad Gelbana*

        >     >     http://www.linkedin.com/in/mgelbana

        >     >

        >     >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <

        > khalifa@cs.queensu.ca>

        >     > wrote:

        >     >

        >     >     > Hi Muhammad,

        >     >     >

        >     >     > I did the following as a workaround to have Cartesian product.

        > The

        >     > basic

        >     >     > idea is to create a dummy column on the fly that has the value

        > 1 in

        >     > both

        >     >     > tables and then join on that column leading to having a match

        > of

        >     > every row

        >     >     > of the first table with every row of the second table, hence

        > do a

        >     > Cartesian

        >     >     > product. This might not be the most efficient way but it will

        > do the

        >     > job.

        >     >     >

        >     >     > *Original Query:*

        >     >     > SELECT * FROM

        >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

        > LIMIT

        >     >     > 2147483647) `t0`

        >     >     > INNER JOIN

        >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

        > LIMIT

        >     >     > 2147483647) `t1`

        >     >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)

        >     >     > LIMIT 2147483647

        >     >     >

        >     >     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to

        > tables

        >     > one

        >     >     > and two, respectively. Names don't really matter, just need to

        > be

        >     > unique):*

        >     >     > SELECT * FROM

        >     >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`

        >     >     > INNER JOIN

        >     >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`

        >     >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)

        >     >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`

        >     >     > LIMIT 2147483647

        >     >     >

        >     >     > Regards

        >     >     >

        >     >     >

        >     >     > *Shadi Khalifa, PhD*

        >     >     > Postdoctoral Fellow

        >     >     > Cognitive Analytics Development Hub

        >     >     > Centre for Advanced Computing

        >     >     > Queen’s University

        >     >     > (613) 533-6000 x78347

        >     >     > http://cac.queensu.ca

        >     >     >

        >     >     > I'm just a neuron in the society collective brain

        >     >     >

        >     >     > *Join us for HPCS in June 2017! Register at:*  *

        > http://2017.hpcs.ca/

        >     >     > <http://2017.hpcs.ca/>*

        >     >     >

        >     >     > P Please consider your environmental responsibility before

        > printing

        >     > this

        >     >     > e-mail

        >     >     >

        >     >     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000

        >     > 01000101

        >     >     > 01100111 01111001 01110000 01110100 *

        >     >     >

        >     >     > *The information transmitted is intended only for the person or

        >     > entity to

        >     >     > which it is addressed and may contain confidential material.

        > Any

        >     > review or

        >     >     > dissemination of this information by persons other than the

        > intended

        >     >     > recipient is prohibited. If you received this in error, please

        >     > contact the

        >     >     > sender and delete the material from any computer. Thank you.*

        >     >     >

        >     >     >

        >     >     >

        >     >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <

        >     > m.gelbana@gmail.com>

        >     >     > wrote:

        >     >     >

        >     >     >

        >     >     > ​​

        >     >     > Here it is:

        >     >     >

        >     >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM

        >     > `dfs`.`path_to_parquet_file` tc

        >     >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (

        >     >     > ​​

        >     >     > `t0`.`UserID` IS NOT DISTINCT FROM

        >     >     > ​​

        >     >     > `t1`.`UserID`) LIMIT 2147483647

        >     >     >

        >     >     > I debugged Drill code and found it decomposes *IS NOT DISTINCT

        > FROM*

        >     > into

        >     >     > ​

        >     >     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&

        >     > `t1`.`UserID`

        >     >     > IS NULL**)* while checking if the query is a cartesian join,

        > and

        >     > when the

        >     >     > check returns true, it throws an excetion saying: *This query

        > cannot

        >     > be

        >     >     > planned possibly due to either a cartesian join or an

        > inequality

        >     > join*

        >     >     >

        >     >     >

        >     >     > *---------------------*

        >     >     > *Muhammad Gelbana*

        >     >     > http://www.linkedin.com/in/mgelbana

        >     >     >

        >     >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com>

        >     > wrote:

        >     >     >

        >     >     > > Can you please specify the query you are trying to execute?

        >     >     > >

        >     >     > >

        >     >     > > Gautam

        >     >     > >

        >     >     > > ________________________________

        >     >     > > From: Muhammad Gelbana <m....@gmail.com>

        >     >     > > Sent: Saturday, May 6, 2017 7:34:53 AM

        >     >     > > To: user@drill.apache.org; dev@drill.apache.org

        >     >     > > Subject: Running cartesian joins on Drill

        >     >     > >

        >     >     > > Is there a reason why Drill would intentionally reject

        > cartesian

        >     > join

        >     >     > > queries even if *planner.enable_nljoin_for_scalar_only* is

        >     > disabled ?

        >     >     > >

        >     >     > > Any ideas how could a query be rewritten to overcome this

        >     > restriction ?

        >     >     > >

        >     >     > > *---------------------*

        >     >     > > *Muhammad Gelbana*

        >     >     > > http://www.linkedin.com/in/mgelbana

        >     >     > >

        >     >     >

        >     >     >

        >     >     >

        >     >

        >     >

        >     >

        >

        >

        >







Re: Running cartesian joins on Drill

Posted by Aman Sinha <as...@mapr.com>.
Muhammad,
The join condition  ‘a = b or (a is null && b is null)’ works.  Internally, this is converted to  ‘a is not distinct from b’ which is processed by Drill.
For some reason, if the second form is directly supplied in the user query, it is not working and ends up with the Cartesian join condition.  Drill leverages Calcite for this (you can see CALCITE-1200 for some background).
Can you file a JIRA for this ?

-Aman

From: "Aman Sinha (asinha@mapr.com)" <as...@mapr.com>
Date: Thursday, May 11, 2017 at 4:29 PM
To: dev <de...@drill.apache.org>, user <us...@drill.apache.org>
Cc: Shadi Khalifa <kh...@cs.queensu.ca>
Subject: Re: Running cartesian joins on Drill


I think Muhammad may be trying to run his original query with IS NOT DISTINCT FROM.   That discussion got side-tracked into Cartesian joins because his query was not getting planned and the error was about Cartesian join.

Muhammad,  can you try with the equivalent version below ?  You mentioned the rewrite but did you try the rewritten version ?



SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (

​​

`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID` IS NULL) )



On 5/11/17, 3:23 PM, "Zelaine Fong" <zf...@mapr.com> wrote:



    I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s my output:



    0: jdbc:drill:zk=local> alter session set `planner.enable_nljoin_for_scalar_only` = false;

    +-------+-------------------------------------------------+

    |  ok   |                     summary                     |

    +-------+-------------------------------------------------+

    | true  | planner.enable_nljoin_for_scalar_only updated.  |

    +-------+-------------------------------------------------+

    1 row selected (0.137 seconds)

    0: jdbc:drill:zk=local> explain plan for select * from dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;

    +------+------+

    | text | json |

    +------+------+

    | 00-00    Screen

    00-01      ProjectAllowDup(*=[$0], *0=[$1])

    00-02        NestedLoopJoin(condition=[true], joinType=[inner])

    00-04          Project(T2¦¦*=[$0])

    00-06            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])

    00-03          Project(T3¦¦*=[$0])

    00-05            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])



    -- Zelaine



    On 5/11/17, 3:17 PM, "Muhammad Gelbana" <m....@gmail.com> wrote:



        ​But the query I provided failed to be planned because it's a cartesian

        join, although I've set the option you mentioned to false. Is there a

        reason why wouldn't Drill rules physically implement the logical join in my

        query to a nested loop join ?



        *---------------------*

        *Muhammad Gelbana*

        http://www.linkedin.com/in/mgelbana



        On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zf...@mapr.com> wrote:



        > Provided `planner.enable_nljoin_for_scalar_only` is set to false, even

        > without an explicit join condition, the query should use the Cartesian

        > join/nested loop join.

        >

        > -- Zelaine

        >

        > On 5/11/17, 4:20 AM, "Anup Tiwari" <an...@games24x7.com> wrote:

        >

        >     Hi,

        >

        >     I have one question here.. so if we have to use Cartesian join in Drill

        >     then do we have to follow some workaround like Shadi mention : adding a

        >     dummy column on the fly that has the value 1 in both tables and then

        > join

        >     on that column leading to having a match of every row of the first

        > table

        >     with every row of the second table, hence do a Cartesian product?

        >     OR

        >     If we just don't specify join condition like :

        >     select a.*, b.* from tt1 as a, tt2 b; then will it internally treat

        > this

        >     query as Cartesian join.

        >

        >     Regards,

        >     *Anup Tiwari*

        >

        >     On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> wrote:

        >

        >     > Cartesian joins in Drill are implemented as nested loop joins, and I

        > think

        >     > you should see that reflected in the resultant query plan when you

        > run

        >     > explain plan on the query.

        >     >

        >     > Yes, Cartesian joins/nested loop joins are expensive because you’re

        >     > effectively doing an MxN read of your tables.  There are more

        > efficient

        >     > ways of processing a nested loop join, e.g., by creating an index on

        > the

        >     > larger table in the join and then using that index to do lookups

        > into that

        >     > table.  That way, the nested loop join cost is the cost of creating

        > the

        >     > index + M, where M is the number of rows in the smaller table and

        > assuming

        >     > the lookup cost into the index does minimize the amount of data read

        > of the

        >     > second table.  Drill currently doesn’t do this.

        >     >

        >     > -- Zelaine

        >     >

        >     > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m....@gmail.com> wrote:

        >     >

        >     >     ​I believe ​clhubert is referring to this discussion

        >     >     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/

        >     > cartesian-product-in-apache-drill#post1>

        >     >     .

        >     >

        >     >     So why Drill doesn't transform this query into a nested join

        > query ?

        >     > Simply

        >     >     because there is no Calcite rule to transform it into a nested

        > loop

        >     > join ?

        >     >     Is it not technically possible to write such Rule or is it

        > feasible so

        >     > I

        >     >     may take on this challenge ?

        >     >

        >     >     Also pardon me for repeating my question but I fail to find an

        > answer

        >     > in

        >     >     your replies, why doesn't Drill just run a cartesian join ?

        > Because

        >     > it's

        >     >     expensive regarding resources (i.e. CPU\Network\RAM) ?

        >     >

        >     >     Thanks a lot Shadi for the query, it works for me.

        >     >

        >     >     *---------------------*

        >     >     *Muhammad Gelbana*

        >     >     http://www.linkedin.com/in/mgelbana

        >     >

        >     >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <

        > khalifa@cs.queensu.ca>

        >     > wrote:

        >     >

        >     >     > Hi Muhammad,

        >     >     >

        >     >     > I did the following as a workaround to have Cartesian product.

        > The

        >     > basic

        >     >     > idea is to create a dummy column on the fly that has the value

        > 1 in

        >     > both

        >     >     > tables and then join on that column leading to having a match

        > of

        >     > every row

        >     >     > of the first table with every row of the second table, hence

        > do a

        >     > Cartesian

        >     >     > product. This might not be the most efficient way but it will

        > do the

        >     > job.

        >     >     >

        >     >     > *Original Query:*

        >     >     > SELECT * FROM

        >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

        > LIMIT

        >     >     > 2147483647) `t0`

        >     >     > INNER JOIN

        >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

        > LIMIT

        >     >     > 2147483647) `t1`

        >     >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)

        >     >     > LIMIT 2147483647

        >     >     >

        >     >     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to

        > tables

        >     > one

        >     >     > and two, respectively. Names don't really matter, just need to

        > be

        >     > unique):*

        >     >     > SELECT * FROM

        >     >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`

        >     >     > INNER JOIN

        >     >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`

        >     >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)

        >     >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`

        >     >     > LIMIT 2147483647

        >     >     >

        >     >     > Regards

        >     >     >

        >     >     >

        >     >     > *Shadi Khalifa, PhD*

        >     >     > Postdoctoral Fellow

        >     >     > Cognitive Analytics Development Hub

        >     >     > Centre for Advanced Computing

        >     >     > Queen’s University

        >     >     > (613) 533-6000 x78347

        >     >     > http://cac.queensu.ca

        >     >     >

        >     >     > I'm just a neuron in the society collective brain

        >     >     >

        >     >     > *Join us for HPCS in June 2017! Register at:*  *

        > http://2017.hpcs.ca/

        >     >     > <http://2017.hpcs.ca/>*

        >     >     >

        >     >     > P Please consider your environmental responsibility before

        > printing

        >     > this

        >     >     > e-mail

        >     >     >

        >     >     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000

        >     > 01000101

        >     >     > 01100111 01111001 01110000 01110100 *

        >     >     >

        >     >     > *The information transmitted is intended only for the person or

        >     > entity to

        >     >     > which it is addressed and may contain confidential material.

        > Any

        >     > review or

        >     >     > dissemination of this information by persons other than the

        > intended

        >     >     > recipient is prohibited. If you received this in error, please

        >     > contact the

        >     >     > sender and delete the material from any computer. Thank you.*

        >     >     >

        >     >     >

        >     >     >

        >     >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <

        >     > m.gelbana@gmail.com>

        >     >     > wrote:

        >     >     >

        >     >     >

        >     >     > ​​

        >     >     > Here it is:

        >     >     >

        >     >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM

        >     > `dfs`.`path_to_parquet_file` tc

        >     >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (

        >     >     > ​​

        >     >     > `t0`.`UserID` IS NOT DISTINCT FROM

        >     >     > ​​

        >     >     > `t1`.`UserID`) LIMIT 2147483647

        >     >     >

        >     >     > I debugged Drill code and found it decomposes *IS NOT DISTINCT

        > FROM*

        >     > into

        >     >     > ​

        >     >     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&

        >     > `t1`.`UserID`

        >     >     > IS NULL**)* while checking if the query is a cartesian join,

        > and

        >     > when the

        >     >     > check returns true, it throws an excetion saying: *This query

        > cannot

        >     > be

        >     >     > planned possibly due to either a cartesian join or an

        > inequality

        >     > join*

        >     >     >

        >     >     >

        >     >     > *---------------------*

        >     >     > *Muhammad Gelbana*

        >     >     > http://www.linkedin.com/in/mgelbana

        >     >     >

        >     >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com>

        >     > wrote:

        >     >     >

        >     >     > > Can you please specify the query you are trying to execute?

        >     >     > >

        >     >     > >

        >     >     > > Gautam

        >     >     > >

        >     >     > > ________________________________

        >     >     > > From: Muhammad Gelbana <m....@gmail.com>

        >     >     > > Sent: Saturday, May 6, 2017 7:34:53 AM

        >     >     > > To: user@drill.apache.org; dev@drill.apache.org

        >     >     > > Subject: Running cartesian joins on Drill

        >     >     > >

        >     >     > > Is there a reason why Drill would intentionally reject

        > cartesian

        >     > join

        >     >     > > queries even if *planner.enable_nljoin_for_scalar_only* is

        >     > disabled ?

        >     >     > >

        >     >     > > Any ideas how could a query be rewritten to overcome this

        >     > restriction ?

        >     >     > >

        >     >     > > *---------------------*

        >     >     > > *Muhammad Gelbana*

        >     >     > > http://www.linkedin.com/in/mgelbana

        >     >     > >

        >     >     >

        >     >     >

        >     >     >

        >     >

        >     >

        >     >

        >

        >

        >







Re: Running cartesian joins on Drill

Posted by Aman Sinha <as...@mapr.com>.
I think Muhammad may be trying to run his original query with IS NOT DISTINCT FROM.   That discussion got side-tracked into Cartesian joins because his query was not getting planned and the error was about Cartesian join.

Muhammad,  can you try with the equivalent version below ?  You mentioned the rewrite but did you try the rewritten version ?



SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (

​​

`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID` IS NULL) )



On 5/11/17, 3:23 PM, "Zelaine Fong" <zf...@mapr.com> wrote:



    I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s my output:



    0: jdbc:drill:zk=local> alter session set `planner.enable_nljoin_for_scalar_only` = false;

    +-------+-------------------------------------------------+

    |  ok   |                     summary                     |

    +-------+-------------------------------------------------+

    | true  | planner.enable_nljoin_for_scalar_only updated.  |

    +-------+-------------------------------------------------+

    1 row selected (0.137 seconds)

    0: jdbc:drill:zk=local> explain plan for select * from dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;

    +------+------+

    | text | json |

    +------+------+

    | 00-00    Screen

    00-01      ProjectAllowDup(*=[$0], *0=[$1])

    00-02        NestedLoopJoin(condition=[true], joinType=[inner])

    00-04          Project(T2¦¦*=[$0])

    00-06            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])

    00-03          Project(T3¦¦*=[$0])

    00-05            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])



    -- Zelaine



    On 5/11/17, 3:17 PM, "Muhammad Gelbana" <m....@gmail.com> wrote:



        ​But the query I provided failed to be planned because it's a cartesian

        join, although I've set the option you mentioned to false. Is there a

        reason why wouldn't Drill rules physically implement the logical join in my

        query to a nested loop join ?



        *---------------------*

        *Muhammad Gelbana*

        http://www.linkedin.com/in/mgelbana



        On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zf...@mapr.com> wrote:



        > Provided `planner.enable_nljoin_for_scalar_only` is set to false, even

        > without an explicit join condition, the query should use the Cartesian

        > join/nested loop join.

        >

        > -- Zelaine

        >

        > On 5/11/17, 4:20 AM, "Anup Tiwari" <an...@games24x7.com> wrote:

        >

        >     Hi,

        >

        >     I have one question here.. so if we have to use Cartesian join in Drill

        >     then do we have to follow some workaround like Shadi mention : adding a

        >     dummy column on the fly that has the value 1 in both tables and then

        > join

        >     on that column leading to having a match of every row of the first

        > table

        >     with every row of the second table, hence do a Cartesian product?

        >     OR

        >     If we just don't specify join condition like :

        >     select a.*, b.* from tt1 as a, tt2 b; then will it internally treat

        > this

        >     query as Cartesian join.

        >

        >     Regards,

        >     *Anup Tiwari*

        >

        >     On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> wrote:

        >

        >     > Cartesian joins in Drill are implemented as nested loop joins, and I

        > think

        >     > you should see that reflected in the resultant query plan when you

        > run

        >     > explain plan on the query.

        >     >

        >     > Yes, Cartesian joins/nested loop joins are expensive because you’re

        >     > effectively doing an MxN read of your tables.  There are more

        > efficient

        >     > ways of processing a nested loop join, e.g., by creating an index on

        > the

        >     > larger table in the join and then using that index to do lookups

        > into that

        >     > table.  That way, the nested loop join cost is the cost of creating

        > the

        >     > index + M, where M is the number of rows in the smaller table and

        > assuming

        >     > the lookup cost into the index does minimize the amount of data read

        > of the

        >     > second table.  Drill currently doesn’t do this.

        >     >

        >     > -- Zelaine

        >     >

        >     > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m....@gmail.com> wrote:

        >     >

        >     >     ​I believe ​clhubert is referring to this discussion

        >     >     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/

        >     > cartesian-product-in-apache-drill#post1>

        >     >     .

        >     >

        >     >     So why Drill doesn't transform this query into a nested join

        > query ?

        >     > Simply

        >     >     because there is no Calcite rule to transform it into a nested

        > loop

        >     > join ?

        >     >     Is it not technically possible to write such Rule or is it

        > feasible so

        >     > I

        >     >     may take on this challenge ?

        >     >

        >     >     Also pardon me for repeating my question but I fail to find an

        > answer

        >     > in

        >     >     your replies, why doesn't Drill just run a cartesian join ?

        > Because

        >     > it's

        >     >     expensive regarding resources (i.e. CPU\Network\RAM) ?

        >     >

        >     >     Thanks a lot Shadi for the query, it works for me.

        >     >

        >     >     *---------------------*

        >     >     *Muhammad Gelbana*

        >     >     http://www.linkedin.com/in/mgelbana

        >     >

        >     >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <

        > khalifa@cs.queensu.ca>

        >     > wrote:

        >     >

        >     >     > Hi Muhammad,

        >     >     >

        >     >     > I did the following as a workaround to have Cartesian product.

        > The

        >     > basic

        >     >     > idea is to create a dummy column on the fly that has the value

        > 1 in

        >     > both

        >     >     > tables and then join on that column leading to having a match

        > of

        >     > every row

        >     >     > of the first table with every row of the second table, hence

        > do a

        >     > Cartesian

        >     >     > product. This might not be the most efficient way but it will

        > do the

        >     > job.

        >     >     >

        >     >     > *Original Query:*

        >     >     > SELECT * FROM

        >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

        > LIMIT

        >     >     > 2147483647) `t0`

        >     >     > INNER JOIN

        >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

        > LIMIT

        >     >     > 2147483647) `t1`

        >     >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)

        >     >     > LIMIT 2147483647

        >     >     >

        >     >     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to

        > tables

        >     > one

        >     >     > and two, respectively. Names don't really matter, just need to

        > be

        >     > unique):*

        >     >     > SELECT * FROM

        >     >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`

        >     >     > INNER JOIN

        >     >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`

        >     >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)

        >     >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`

        >     >     > LIMIT 2147483647

        >     >     >

        >     >     > Regards

        >     >     >

        >     >     >

        >     >     > *Shadi Khalifa, PhD*

        >     >     > Postdoctoral Fellow

        >     >     > Cognitive Analytics Development Hub

        >     >     > Centre for Advanced Computing

        >     >     > Queen’s University

        >     >     > (613) 533-6000 x78347

        >     >     > http://cac.queensu.ca

        >     >     >

        >     >     > I'm just a neuron in the society collective brain

        >     >     >

        >     >     > *Join us for HPCS in June 2017! Register at:*  *

        > http://2017.hpcs.ca/

        >     >     > <http://2017.hpcs.ca/>*

        >     >     >

        >     >     > P Please consider your environmental responsibility before

        > printing

        >     > this

        >     >     > e-mail

        >     >     >

        >     >     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000

        >     > 01000101

        >     >     > 01100111 01111001 01110000 01110100 *

        >     >     >

        >     >     > *The information transmitted is intended only for the person or

        >     > entity to

        >     >     > which it is addressed and may contain confidential material.

        > Any

        >     > review or

        >     >     > dissemination of this information by persons other than the

        > intended

        >     >     > recipient is prohibited. If you received this in error, please

        >     > contact the

        >     >     > sender and delete the material from any computer. Thank you.*

        >     >     >

        >     >     >

        >     >     >

        >     >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <

        >     > m.gelbana@gmail.com>

        >     >     > wrote:

        >     >     >

        >     >     >

        >     >     > ​​

        >     >     > Here it is:

        >     >     >

        >     >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM

        >     > `dfs`.`path_to_parquet_file` tc

        >     >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (

        >     >     > ​​

        >     >     > `t0`.`UserID` IS NOT DISTINCT FROM

        >     >     > ​​

        >     >     > `t1`.`UserID`) LIMIT 2147483647

        >     >     >

        >     >     > I debugged Drill code and found it decomposes *IS NOT DISTINCT

        > FROM*

        >     > into

        >     >     > ​

        >     >     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&

        >     > `t1`.`UserID`

        >     >     > IS NULL**)* while checking if the query is a cartesian join,

        > and

        >     > when the

        >     >     > check returns true, it throws an excetion saying: *This query

        > cannot

        >     > be

        >     >     > planned possibly due to either a cartesian join or an

        > inequality

        >     > join*

        >     >     >

        >     >     >

        >     >     > *---------------------*

        >     >     > *Muhammad Gelbana*

        >     >     > http://www.linkedin.com/in/mgelbana

        >     >     >

        >     >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com>

        >     > wrote:

        >     >     >

        >     >     > > Can you please specify the query you are trying to execute?

        >     >     > >

        >     >     > >

        >     >     > > Gautam

        >     >     > >

        >     >     > > ________________________________

        >     >     > > From: Muhammad Gelbana <m....@gmail.com>

        >     >     > > Sent: Saturday, May 6, 2017 7:34:53 AM

        >     >     > > To: user@drill.apache.org; dev@drill.apache.org

        >     >     > > Subject: Running cartesian joins on Drill

        >     >     > >

        >     >     > > Is there a reason why Drill would intentionally reject

        > cartesian

        >     > join

        >     >     > > queries even if *planner.enable_nljoin_for_scalar_only* is

        >     > disabled ?

        >     >     > >

        >     >     > > Any ideas how could a query be rewritten to overcome this

        >     > restriction ?

        >     >     > >

        >     >     > > *---------------------*

        >     >     > > *Muhammad Gelbana*

        >     >     > > http://www.linkedin.com/in/mgelbana

        >     >     > >

        >     >     >

        >     >     >

        >     >     >

        >     >

        >     >

        >     >

        >

        >

        >







Re: Running cartesian joins on Drill

Posted by Aman Sinha <as...@mapr.com>.
I think Muhammad may be trying to run his original query with IS NOT DISTINCT FROM.   That discussion got side-tracked into Cartesian joins because his query was not getting planned and the error was about Cartesian join.

Muhammad,  can you try with the equivalent version below ?  You mentioned the rewrite but did you try the rewritten version ?



SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (

​​

`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID` IS NULL) )



On 5/11/17, 3:23 PM, "Zelaine Fong" <zf...@mapr.com> wrote:



    I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s my output:



    0: jdbc:drill:zk=local> alter session set `planner.enable_nljoin_for_scalar_only` = false;

    +-------+-------------------------------------------------+

    |  ok   |                     summary                     |

    +-------+-------------------------------------------------+

    | true  | planner.enable_nljoin_for_scalar_only updated.  |

    +-------+-------------------------------------------------+

    1 row selected (0.137 seconds)

    0: jdbc:drill:zk=local> explain plan for select * from dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;

    +------+------+

    | text | json |

    +------+------+

    | 00-00    Screen

    00-01      ProjectAllowDup(*=[$0], *0=[$1])

    00-02        NestedLoopJoin(condition=[true], joinType=[inner])

    00-04          Project(T2¦¦*=[$0])

    00-06            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])

    00-03          Project(T3¦¦*=[$0])

    00-05            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])



    -- Zelaine



    On 5/11/17, 3:17 PM, "Muhammad Gelbana" <m....@gmail.com> wrote:



        ​But the query I provided failed to be planned because it's a cartesian

        join, although I've set the option you mentioned to false. Is there a

        reason why wouldn't Drill rules physically implement the logical join in my

        query to a nested loop join ?



        *---------------------*

        *Muhammad Gelbana*

        http://www.linkedin.com/in/mgelbana



        On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zf...@mapr.com> wrote:



        > Provided `planner.enable_nljoin_for_scalar_only` is set to false, even

        > without an explicit join condition, the query should use the Cartesian

        > join/nested loop join.

        >

        > -- Zelaine

        >

        > On 5/11/17, 4:20 AM, "Anup Tiwari" <an...@games24x7.com> wrote:

        >

        >     Hi,

        >

        >     I have one question here.. so if we have to use Cartesian join in Drill

        >     then do we have to follow some workaround like Shadi mention : adding a

        >     dummy column on the fly that has the value 1 in both tables and then

        > join

        >     on that column leading to having a match of every row of the first

        > table

        >     with every row of the second table, hence do a Cartesian product?

        >     OR

        >     If we just don't specify join condition like :

        >     select a.*, b.* from tt1 as a, tt2 b; then will it internally treat

        > this

        >     query as Cartesian join.

        >

        >     Regards,

        >     *Anup Tiwari*

        >

        >     On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> wrote:

        >

        >     > Cartesian joins in Drill are implemented as nested loop joins, and I

        > think

        >     > you should see that reflected in the resultant query plan when you

        > run

        >     > explain plan on the query.

        >     >

        >     > Yes, Cartesian joins/nested loop joins are expensive because you’re

        >     > effectively doing an MxN read of your tables.  There are more

        > efficient

        >     > ways of processing a nested loop join, e.g., by creating an index on

        > the

        >     > larger table in the join and then using that index to do lookups

        > into that

        >     > table.  That way, the nested loop join cost is the cost of creating

        > the

        >     > index + M, where M is the number of rows in the smaller table and

        > assuming

        >     > the lookup cost into the index does minimize the amount of data read

        > of the

        >     > second table.  Drill currently doesn’t do this.

        >     >

        >     > -- Zelaine

        >     >

        >     > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m....@gmail.com> wrote:

        >     >

        >     >     ​I believe ​clhubert is referring to this discussion

        >     >     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/

        >     > cartesian-product-in-apache-drill#post1>

        >     >     .

        >     >

        >     >     So why Drill doesn't transform this query into a nested join

        > query ?

        >     > Simply

        >     >     because there is no Calcite rule to transform it into a nested

        > loop

        >     > join ?

        >     >     Is it not technically possible to write such Rule or is it

        > feasible so

        >     > I

        >     >     may take on this challenge ?

        >     >

        >     >     Also pardon me for repeating my question but I fail to find an

        > answer

        >     > in

        >     >     your replies, why doesn't Drill just run a cartesian join ?

        > Because

        >     > it's

        >     >     expensive regarding resources (i.e. CPU\Network\RAM) ?

        >     >

        >     >     Thanks a lot Shadi for the query, it works for me.

        >     >

        >     >     *---------------------*

        >     >     *Muhammad Gelbana*

        >     >     http://www.linkedin.com/in/mgelbana

        >     >

        >     >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <

        > khalifa@cs.queensu.ca>

        >     > wrote:

        >     >

        >     >     > Hi Muhammad,

        >     >     >

        >     >     > I did the following as a workaround to have Cartesian product.

        > The

        >     > basic

        >     >     > idea is to create a dummy column on the fly that has the value

        > 1 in

        >     > both

        >     >     > tables and then join on that column leading to having a match

        > of

        >     > every row

        >     >     > of the first table with every row of the second table, hence

        > do a

        >     > Cartesian

        >     >     > product. This might not be the most efficient way but it will

        > do the

        >     > job.

        >     >     >

        >     >     > *Original Query:*

        >     >     > SELECT * FROM

        >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

        > LIMIT

        >     >     > 2147483647) `t0`

        >     >     > INNER JOIN

        >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

        > LIMIT

        >     >     > 2147483647) `t1`

        >     >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)

        >     >     > LIMIT 2147483647

        >     >     >

        >     >     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to

        > tables

        >     > one

        >     >     > and two, respectively. Names don't really matter, just need to

        > be

        >     > unique):*

        >     >     > SELECT * FROM

        >     >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`

        >     >     > INNER JOIN

        >     >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`

        >     >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)

        >     >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`

        >     >     > LIMIT 2147483647

        >     >     >

        >     >     > Regards

        >     >     >

        >     >     >

        >     >     > *Shadi Khalifa, PhD*

        >     >     > Postdoctoral Fellow

        >     >     > Cognitive Analytics Development Hub

        >     >     > Centre for Advanced Computing

        >     >     > Queen’s University

        >     >     > (613) 533-6000 x78347

        >     >     > http://cac.queensu.ca

        >     >     >

        >     >     > I'm just a neuron in the society collective brain

        >     >     >

        >     >     > *Join us for HPCS in June 2017! Register at:*  *

        > http://2017.hpcs.ca/

        >     >     > <http://2017.hpcs.ca/>*

        >     >     >

        >     >     > P Please consider your environmental responsibility before

        > printing

        >     > this

        >     >     > e-mail

        >     >     >

        >     >     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000

        >     > 01000101

        >     >     > 01100111 01111001 01110000 01110100 *

        >     >     >

        >     >     > *The information transmitted is intended only for the person or

        >     > entity to

        >     >     > which it is addressed and may contain confidential material.

        > Any

        >     > review or

        >     >     > dissemination of this information by persons other than the

        > intended

        >     >     > recipient is prohibited. If you received this in error, please

        >     > contact the

        >     >     > sender and delete the material from any computer. Thank you.*

        >     >     >

        >     >     >

        >     >     >

        >     >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <

        >     > m.gelbana@gmail.com>

        >     >     > wrote:

        >     >     >

        >     >     >

        >     >     > ​​

        >     >     > Here it is:

        >     >     >

        >     >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM

        >     > `dfs`.`path_to_parquet_file` tc

        >     >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (

        >     >     > ​​

        >     >     > `t0`.`UserID` IS NOT DISTINCT FROM

        >     >     > ​​

        >     >     > `t1`.`UserID`) LIMIT 2147483647

        >     >     >

        >     >     > I debugged Drill code and found it decomposes *IS NOT DISTINCT

        > FROM*

        >     > into

        >     >     > ​

        >     >     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&

        >     > `t1`.`UserID`

        >     >     > IS NULL**)* while checking if the query is a cartesian join,

        > and

        >     > when the

        >     >     > check returns true, it throws an excetion saying: *This query

        > cannot

        >     > be

        >     >     > planned possibly due to either a cartesian join or an

        > inequality

        >     > join*

        >     >     >

        >     >     >

        >     >     > *---------------------*

        >     >     > *Muhammad Gelbana*

        >     >     > http://www.linkedin.com/in/mgelbana

        >     >     >

        >     >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com>

        >     > wrote:

        >     >     >

        >     >     > > Can you please specify the query you are trying to execute?

        >     >     > >

        >     >     > >

        >     >     > > Gautam

        >     >     > >

        >     >     > > ________________________________

        >     >     > > From: Muhammad Gelbana <m....@gmail.com>

        >     >     > > Sent: Saturday, May 6, 2017 7:34:53 AM

        >     >     > > To: user@drill.apache.org; dev@drill.apache.org

        >     >     > > Subject: Running cartesian joins on Drill

        >     >     > >

        >     >     > > Is there a reason why Drill would intentionally reject

        > cartesian

        >     > join

        >     >     > > queries even if *planner.enable_nljoin_for_scalar_only* is

        >     > disabled ?

        >     >     > >

        >     >     > > Any ideas how could a query be rewritten to overcome this

        >     > restriction ?

        >     >     > >

        >     >     > > *---------------------*

        >     >     > > *Muhammad Gelbana*

        >     >     > > http://www.linkedin.com/in/mgelbana

        >     >     > >

        >     >     >

        >     >     >

        >     >     >

        >     >

        >     >

        >     >

        >

        >

        >







Re: Running cartesian joins on Drill

Posted by Zelaine Fong <zf...@mapr.com>.
I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s my output:

0: jdbc:drill:zk=local> alter session set `planner.enable_nljoin_for_scalar_only` = false;
+-------+-------------------------------------------------+
|  ok   |                     summary                     |
+-------+-------------------------------------------------+
| true  | planner.enable_nljoin_for_scalar_only updated.  |
+-------+-------------------------------------------------+
1 row selected (0.137 seconds)
0: jdbc:drill:zk=local> explain plan for select * from dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      ProjectAllowDup(*=[$0], *0=[$1])
00-02        NestedLoopJoin(condition=[true], joinType=[inner])
00-04          Project(T2¦¦*=[$0])
00-06            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])
00-03          Project(T3¦¦*=[$0])
00-05            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])

-- Zelaine

On 5/11/17, 3:17 PM, "Muhammad Gelbana" <m....@gmail.com> wrote:

    ​But the query I provided failed to be planned because it's a cartesian
    join, although I've set the option you mentioned to false. Is there a
    reason why wouldn't Drill rules physically implement the logical join in my
    query to a nested loop join ?
    
    *---------------------*
    *Muhammad Gelbana*
    http://www.linkedin.com/in/mgelbana
    
    On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zf...@mapr.com> wrote:
    
    > Provided `planner.enable_nljoin_for_scalar_only` is set to false, even
    > without an explicit join condition, the query should use the Cartesian
    > join/nested loop join.
    >
    > -- Zelaine
    >
    > On 5/11/17, 4:20 AM, "Anup Tiwari" <an...@games24x7.com> wrote:
    >
    >     Hi,
    >
    >     I have one question here.. so if we have to use Cartesian join in Drill
    >     then do we have to follow some workaround like Shadi mention : adding a
    >     dummy column on the fly that has the value 1 in both tables and then
    > join
    >     on that column leading to having a match of every row of the first
    > table
    >     with every row of the second table, hence do a Cartesian product?
    >     OR
    >     If we just don't specify join condition like :
    >     select a.*, b.* from tt1 as a, tt2 b; then will it internally treat
    > this
    >     query as Cartesian join.
    >
    >     Regards,
    >     *Anup Tiwari*
    >
    >     On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> wrote:
    >
    >     > Cartesian joins in Drill are implemented as nested loop joins, and I
    > think
    >     > you should see that reflected in the resultant query plan when you
    > run
    >     > explain plan on the query.
    >     >
    >     > Yes, Cartesian joins/nested loop joins are expensive because you’re
    >     > effectively doing an MxN read of your tables.  There are more
    > efficient
    >     > ways of processing a nested loop join, e.g., by creating an index on
    > the
    >     > larger table in the join and then using that index to do lookups
    > into that
    >     > table.  That way, the nested loop join cost is the cost of creating
    > the
    >     > index + M, where M is the number of rows in the smaller table and
    > assuming
    >     > the lookup cost into the index does minimize the amount of data read
    > of the
    >     > second table.  Drill currently doesn’t do this.
    >     >
    >     > -- Zelaine
    >     >
    >     > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m....@gmail.com> wrote:
    >     >
    >     >     ​I believe ​clhubert is referring to this discussion
    >     >     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
    >     > cartesian-product-in-apache-drill#post1>
    >     >     .
    >     >
    >     >     So why Drill doesn't transform this query into a nested join
    > query ?
    >     > Simply
    >     >     because there is no Calcite rule to transform it into a nested
    > loop
    >     > join ?
    >     >     Is it not technically possible to write such Rule or is it
    > feasible so
    >     > I
    >     >     may take on this challenge ?
    >     >
    >     >     Also pardon me for repeating my question but I fail to find an
    > answer
    >     > in
    >     >     your replies, why doesn't Drill just run a cartesian join ?
    > Because
    >     > it's
    >     >     expensive regarding resources (i.e. CPU\Network\RAM) ?
    >     >
    >     >     Thanks a lot Shadi for the query, it works for me.
    >     >
    >     >     *---------------------*
    >     >     *Muhammad Gelbana*
    >     >     http://www.linkedin.com/in/mgelbana
    >     >
    >     >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <
    > khalifa@cs.queensu.ca>
    >     > wrote:
    >     >
    >     >     > Hi Muhammad,
    >     >     >
    >     >     > I did the following as a workaround to have Cartesian product.
    > The
    >     > basic
    >     >     > idea is to create a dummy column on the fly that has the value
    > 1 in
    >     > both
    >     >     > tables and then join on that column leading to having a match
    > of
    >     > every row
    >     >     > of the first table with every row of the second table, hence
    > do a
    >     > Cartesian
    >     >     > product. This might not be the most efficient way but it will
    > do the
    >     > job.
    >     >     >
    >     >     > *Original Query:*
    >     >     > SELECT * FROM
    >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
    > LIMIT
    >     >     > 2147483647) `t0`
    >     >     > INNER JOIN
    >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
    > LIMIT
    >     >     > 2147483647) `t1`
    >     >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
    >     >     > LIMIT 2147483647
    >     >     >
    >     >     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to
    > tables
    >     > one
    >     >     > and two, respectively. Names don't really matter, just need to
    > be
    >     > unique):*
    >     >     > SELECT * FROM
    >     >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
    >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
    >     >     > INNER JOIN
    >     >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
    >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
    >     >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
    >     >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
    >     >     > LIMIT 2147483647
    >     >     >
    >     >     > Regards
    >     >     >
    >     >     >
    >     >     > *Shadi Khalifa, PhD*
    >     >     > Postdoctoral Fellow
    >     >     > Cognitive Analytics Development Hub
    >     >     > Centre for Advanced Computing
    >     >     > Queen’s University
    >     >     > (613) 533-6000 x78347
    >     >     > http://cac.queensu.ca
    >     >     >
    >     >     > I'm just a neuron in the society collective brain
    >     >     >
    >     >     > *Join us for HPCS in June 2017! Register at:*  *
    > http://2017.hpcs.ca/
    >     >     > <http://2017.hpcs.ca/>*
    >     >     >
    >     >     > P Please consider your environmental responsibility before
    > printing
    >     > this
    >     >     > e-mail
    >     >     >
    >     >     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000
    >     > 01000101
    >     >     > 01100111 01111001 01110000 01110100 *
    >     >     >
    >     >     > *The information transmitted is intended only for the person or
    >     > entity to
    >     >     > which it is addressed and may contain confidential material.
    > Any
    >     > review or
    >     >     > dissemination of this information by persons other than the
    > intended
    >     >     > recipient is prohibited. If you received this in error, please
    >     > contact the
    >     >     > sender and delete the material from any computer. Thank you.*
    >     >     >
    >     >     >
    >     >     >
    >     >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
    >     > m.gelbana@gmail.com>
    >     >     > wrote:
    >     >     >
    >     >     >
    >     >     > ​​
    >     >     > Here it is:
    >     >     >
    >     >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM
    >     > `dfs`.`path_to_parquet_file` tc
    >     >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
    >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
    >     >     > ​​
    >     >     > `t0`.`UserID` IS NOT DISTINCT FROM
    >     >     > ​​
    >     >     > `t1`.`UserID`) LIMIT 2147483647
    >     >     >
    >     >     > I debugged Drill code and found it decomposes *IS NOT DISTINCT
    > FROM*
    >     > into
    >     >     > ​
    >     >     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
    >     > `t1`.`UserID`
    >     >     > IS NULL**)* while checking if the query is a cartesian join,
    > and
    >     > when the
    >     >     > check returns true, it throws an excetion saying: *This query
    > cannot
    >     > be
    >     >     > planned possibly due to either a cartesian join or an
    > inequality
    >     > join*
    >     >     >
    >     >     >
    >     >     > *---------------------*
    >     >     > *Muhammad Gelbana*
    >     >     > http://www.linkedin.com/in/mgelbana
    >     >     >
    >     >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com>
    >     > wrote:
    >     >     >
    >     >     > > Can you please specify the query you are trying to execute?
    >     >     > >
    >     >     > >
    >     >     > > Gautam
    >     >     > >
    >     >     > > ________________________________
    >     >     > > From: Muhammad Gelbana <m....@gmail.com>
    >     >     > > Sent: Saturday, May 6, 2017 7:34:53 AM
    >     >     > > To: user@drill.apache.org; dev@drill.apache.org
    >     >     > > Subject: Running cartesian joins on Drill
    >     >     > >
    >     >     > > Is there a reason why Drill would intentionally reject
    > cartesian
    >     > join
    >     >     > > queries even if *planner.enable_nljoin_for_scalar_only* is
    >     > disabled ?
    >     >     > >
    >     >     > > Any ideas how could a query be rewritten to overcome this
    >     > restriction ?
    >     >     > >
    >     >     > > *---------------------*
    >     >     > > *Muhammad Gelbana*
    >     >     > > http://www.linkedin.com/in/mgelbana
    >     >     > >
    >     >     >
    >     >     >
    >     >     >
    >     >
    >     >
    >     >
    >
    >
    >
    


Re: Running cartesian joins on Drill

Posted by Zelaine Fong <zf...@mapr.com>.
I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s my output:

0: jdbc:drill:zk=local> alter session set `planner.enable_nljoin_for_scalar_only` = false;
+-------+-------------------------------------------------+
|  ok   |                     summary                     |
+-------+-------------------------------------------------+
| true  | planner.enable_nljoin_for_scalar_only updated.  |
+-------+-------------------------------------------------+
1 row selected (0.137 seconds)
0: jdbc:drill:zk=local> explain plan for select * from dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      ProjectAllowDup(*=[$0], *0=[$1])
00-02        NestedLoopJoin(condition=[true], joinType=[inner])
00-04          Project(T2¦¦*=[$0])
00-06            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])
00-03          Project(T3¦¦*=[$0])
00-05            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])

-- Zelaine

On 5/11/17, 3:17 PM, "Muhammad Gelbana" <m....@gmail.com> wrote:

    ​But the query I provided failed to be planned because it's a cartesian
    join, although I've set the option you mentioned to false. Is there a
    reason why wouldn't Drill rules physically implement the logical join in my
    query to a nested loop join ?
    
    *---------------------*
    *Muhammad Gelbana*
    http://www.linkedin.com/in/mgelbana
    
    On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zf...@mapr.com> wrote:
    
    > Provided `planner.enable_nljoin_for_scalar_only` is set to false, even
    > without an explicit join condition, the query should use the Cartesian
    > join/nested loop join.
    >
    > -- Zelaine
    >
    > On 5/11/17, 4:20 AM, "Anup Tiwari" <an...@games24x7.com> wrote:
    >
    >     Hi,
    >
    >     I have one question here.. so if we have to use Cartesian join in Drill
    >     then do we have to follow some workaround like Shadi mention : adding a
    >     dummy column on the fly that has the value 1 in both tables and then
    > join
    >     on that column leading to having a match of every row of the first
    > table
    >     with every row of the second table, hence do a Cartesian product?
    >     OR
    >     If we just don't specify join condition like :
    >     select a.*, b.* from tt1 as a, tt2 b; then will it internally treat
    > this
    >     query as Cartesian join.
    >
    >     Regards,
    >     *Anup Tiwari*
    >
    >     On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> wrote:
    >
    >     > Cartesian joins in Drill are implemented as nested loop joins, and I
    > think
    >     > you should see that reflected in the resultant query plan when you
    > run
    >     > explain plan on the query.
    >     >
    >     > Yes, Cartesian joins/nested loop joins are expensive because you’re
    >     > effectively doing an MxN read of your tables.  There are more
    > efficient
    >     > ways of processing a nested loop join, e.g., by creating an index on
    > the
    >     > larger table in the join and then using that index to do lookups
    > into that
    >     > table.  That way, the nested loop join cost is the cost of creating
    > the
    >     > index + M, where M is the number of rows in the smaller table and
    > assuming
    >     > the lookup cost into the index does minimize the amount of data read
    > of the
    >     > second table.  Drill currently doesn’t do this.
    >     >
    >     > -- Zelaine
    >     >
    >     > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m....@gmail.com> wrote:
    >     >
    >     >     ​I believe ​clhubert is referring to this discussion
    >     >     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
    >     > cartesian-product-in-apache-drill#post1>
    >     >     .
    >     >
    >     >     So why Drill doesn't transform this query into a nested join
    > query ?
    >     > Simply
    >     >     because there is no Calcite rule to transform it into a nested
    > loop
    >     > join ?
    >     >     Is it not technically possible to write such Rule or is it
    > feasible so
    >     > I
    >     >     may take on this challenge ?
    >     >
    >     >     Also pardon me for repeating my question but I fail to find an
    > answer
    >     > in
    >     >     your replies, why doesn't Drill just run a cartesian join ?
    > Because
    >     > it's
    >     >     expensive regarding resources (i.e. CPU\Network\RAM) ?
    >     >
    >     >     Thanks a lot Shadi for the query, it works for me.
    >     >
    >     >     *---------------------*
    >     >     *Muhammad Gelbana*
    >     >     http://www.linkedin.com/in/mgelbana
    >     >
    >     >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <
    > khalifa@cs.queensu.ca>
    >     > wrote:
    >     >
    >     >     > Hi Muhammad,
    >     >     >
    >     >     > I did the following as a workaround to have Cartesian product.
    > The
    >     > basic
    >     >     > idea is to create a dummy column on the fly that has the value
    > 1 in
    >     > both
    >     >     > tables and then join on that column leading to having a match
    > of
    >     > every row
    >     >     > of the first table with every row of the second table, hence
    > do a
    >     > Cartesian
    >     >     > product. This might not be the most efficient way but it will
    > do the
    >     > job.
    >     >     >
    >     >     > *Original Query:*
    >     >     > SELECT * FROM
    >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
    > LIMIT
    >     >     > 2147483647) `t0`
    >     >     > INNER JOIN
    >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
    > LIMIT
    >     >     > 2147483647) `t1`
    >     >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
    >     >     > LIMIT 2147483647
    >     >     >
    >     >     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to
    > tables
    >     > one
    >     >     > and two, respectively. Names don't really matter, just need to
    > be
    >     > unique):*
    >     >     > SELECT * FROM
    >     >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
    >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
    >     >     > INNER JOIN
    >     >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
    >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
    >     >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
    >     >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
    >     >     > LIMIT 2147483647
    >     >     >
    >     >     > Regards
    >     >     >
    >     >     >
    >     >     > *Shadi Khalifa, PhD*
    >     >     > Postdoctoral Fellow
    >     >     > Cognitive Analytics Development Hub
    >     >     > Centre for Advanced Computing
    >     >     > Queen’s University
    >     >     > (613) 533-6000 x78347
    >     >     > http://cac.queensu.ca
    >     >     >
    >     >     > I'm just a neuron in the society collective brain
    >     >     >
    >     >     > *Join us for HPCS in June 2017! Register at:*  *
    > http://2017.hpcs.ca/
    >     >     > <http://2017.hpcs.ca/>*
    >     >     >
    >     >     > P Please consider your environmental responsibility before
    > printing
    >     > this
    >     >     > e-mail
    >     >     >
    >     >     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000
    >     > 01000101
    >     >     > 01100111 01111001 01110000 01110100 *
    >     >     >
    >     >     > *The information transmitted is intended only for the person or
    >     > entity to
    >     >     > which it is addressed and may contain confidential material.
    > Any
    >     > review or
    >     >     > dissemination of this information by persons other than the
    > intended
    >     >     > recipient is prohibited. If you received this in error, please
    >     > contact the
    >     >     > sender and delete the material from any computer. Thank you.*
    >     >     >
    >     >     >
    >     >     >
    >     >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
    >     > m.gelbana@gmail.com>
    >     >     > wrote:
    >     >     >
    >     >     >
    >     >     > ​​
    >     >     > Here it is:
    >     >     >
    >     >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM
    >     > `dfs`.`path_to_parquet_file` tc
    >     >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
    >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
    >     >     > ​​
    >     >     > `t0`.`UserID` IS NOT DISTINCT FROM
    >     >     > ​​
    >     >     > `t1`.`UserID`) LIMIT 2147483647
    >     >     >
    >     >     > I debugged Drill code and found it decomposes *IS NOT DISTINCT
    > FROM*
    >     > into
    >     >     > ​
    >     >     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
    >     > `t1`.`UserID`
    >     >     > IS NULL**)* while checking if the query is a cartesian join,
    > and
    >     > when the
    >     >     > check returns true, it throws an excetion saying: *This query
    > cannot
    >     > be
    >     >     > planned possibly due to either a cartesian join or an
    > inequality
    >     > join*
    >     >     >
    >     >     >
    >     >     > *---------------------*
    >     >     > *Muhammad Gelbana*
    >     >     > http://www.linkedin.com/in/mgelbana
    >     >     >
    >     >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com>
    >     > wrote:
    >     >     >
    >     >     > > Can you please specify the query you are trying to execute?
    >     >     > >
    >     >     > >
    >     >     > > Gautam
    >     >     > >
    >     >     > > ________________________________
    >     >     > > From: Muhammad Gelbana <m....@gmail.com>
    >     >     > > Sent: Saturday, May 6, 2017 7:34:53 AM
    >     >     > > To: user@drill.apache.org; dev@drill.apache.org
    >     >     > > Subject: Running cartesian joins on Drill
    >     >     > >
    >     >     > > Is there a reason why Drill would intentionally reject
    > cartesian
    >     > join
    >     >     > > queries even if *planner.enable_nljoin_for_scalar_only* is
    >     > disabled ?
    >     >     > >
    >     >     > > Any ideas how could a query be rewritten to overcome this
    >     > restriction ?
    >     >     > >
    >     >     > > *---------------------*
    >     >     > > *Muhammad Gelbana*
    >     >     > > http://www.linkedin.com/in/mgelbana
    >     >     > >
    >     >     >
    >     >     >
    >     >     >
    >     >
    >     >
    >     >
    >
    >
    >
    


Re: Running cartesian joins on Drill

Posted by Muhammad Gelbana <m....@gmail.com>.
​But the query I provided failed to be planned because it's a cartesian
join, although I've set the option you mentioned to false. Is there a
reason why wouldn't Drill rules physically implement the logical join in my
query to a nested loop join ?

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zf...@mapr.com> wrote:

> Provided `planner.enable_nljoin_for_scalar_only` is set to false, even
> without an explicit join condition, the query should use the Cartesian
> join/nested loop join.
>
> -- Zelaine
>
> On 5/11/17, 4:20 AM, "Anup Tiwari" <an...@games24x7.com> wrote:
>
>     Hi,
>
>     I have one question here.. so if we have to use Cartesian join in Drill
>     then do we have to follow some workaround like Shadi mention : adding a
>     dummy column on the fly that has the value 1 in both tables and then
> join
>     on that column leading to having a match of every row of the first
> table
>     with every row of the second table, hence do a Cartesian product?
>     OR
>     If we just don't specify join condition like :
>     select a.*, b.* from tt1 as a, tt2 b; then will it internally treat
> this
>     query as Cartesian join.
>
>     Regards,
>     *Anup Tiwari*
>
>     On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> wrote:
>
>     > Cartesian joins in Drill are implemented as nested loop joins, and I
> think
>     > you should see that reflected in the resultant query plan when you
> run
>     > explain plan on the query.
>     >
>     > Yes, Cartesian joins/nested loop joins are expensive because you’re
>     > effectively doing an MxN read of your tables.  There are more
> efficient
>     > ways of processing a nested loop join, e.g., by creating an index on
> the
>     > larger table in the join and then using that index to do lookups
> into that
>     > table.  That way, the nested loop join cost is the cost of creating
> the
>     > index + M, where M is the number of rows in the smaller table and
> assuming
>     > the lookup cost into the index does minimize the amount of data read
> of the
>     > second table.  Drill currently doesn’t do this.
>     >
>     > -- Zelaine
>     >
>     > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m....@gmail.com> wrote:
>     >
>     >     ​I believe ​clhubert is referring to this discussion
>     >     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
>     > cartesian-product-in-apache-drill#post1>
>     >     .
>     >
>     >     So why Drill doesn't transform this query into a nested join
> query ?
>     > Simply
>     >     because there is no Calcite rule to transform it into a nested
> loop
>     > join ?
>     >     Is it not technically possible to write such Rule or is it
> feasible so
>     > I
>     >     may take on this challenge ?
>     >
>     >     Also pardon me for repeating my question but I fail to find an
> answer
>     > in
>     >     your replies, why doesn't Drill just run a cartesian join ?
> Because
>     > it's
>     >     expensive regarding resources (i.e. CPU\Network\RAM) ?
>     >
>     >     Thanks a lot Shadi for the query, it works for me.
>     >
>     >     *---------------------*
>     >     *Muhammad Gelbana*
>     >     http://www.linkedin.com/in/mgelbana
>     >
>     >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <
> khalifa@cs.queensu.ca>
>     > wrote:
>     >
>     >     > Hi Muhammad,
>     >     >
>     >     > I did the following as a workaround to have Cartesian product.
> The
>     > basic
>     >     > idea is to create a dummy column on the fly that has the value
> 1 in
>     > both
>     >     > tables and then join on that column leading to having a match
> of
>     > every row
>     >     > of the first table with every row of the second table, hence
> do a
>     > Cartesian
>     >     > product. This might not be the most efficient way but it will
> do the
>     > job.
>     >     >
>     >     > *Original Query:*
>     >     > SELECT * FROM
>     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
> LIMIT
>     >     > 2147483647) `t0`
>     >     > INNER JOIN
>     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
> LIMIT
>     >     > 2147483647) `t1`
>     >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
>     >     > LIMIT 2147483647
>     >     >
>     >     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to
> tables
>     > one
>     >     > and two, respectively. Names don't really matter, just need to
> be
>     > unique):*
>     >     > SELECT * FROM
>     >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
>     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
>     >     > INNER JOIN
>     >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
>     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
>     >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
>     >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
>     >     > LIMIT 2147483647
>     >     >
>     >     > Regards
>     >     >
>     >     >
>     >     > *Shadi Khalifa, PhD*
>     >     > Postdoctoral Fellow
>     >     > Cognitive Analytics Development Hub
>     >     > Centre for Advanced Computing
>     >     > Queen’s University
>     >     > (613) 533-6000 x78347
>     >     > http://cac.queensu.ca
>     >     >
>     >     > I'm just a neuron in the society collective brain
>     >     >
>     >     > *Join us for HPCS in June 2017! Register at:*  *
> http://2017.hpcs.ca/
>     >     > <http://2017.hpcs.ca/>*
>     >     >
>     >     > P Please consider your environmental responsibility before
> printing
>     > this
>     >     > e-mail
>     >     >
>     >     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000
>     > 01000101
>     >     > 01100111 01111001 01110000 01110100 *
>     >     >
>     >     > *The information transmitted is intended only for the person or
>     > entity to
>     >     > which it is addressed and may contain confidential material.
> Any
>     > review or
>     >     > dissemination of this information by persons other than the
> intended
>     >     > recipient is prohibited. If you received this in error, please
>     > contact the
>     >     > sender and delete the material from any computer. Thank you.*
>     >     >
>     >     >
>     >     >
>     >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
>     > m.gelbana@gmail.com>
>     >     > wrote:
>     >     >
>     >     >
>     >     > ​​
>     >     > Here it is:
>     >     >
>     >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM
>     > `dfs`.`path_to_parquet_file` tc
>     >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
>     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
>     >     > ​​
>     >     > `t0`.`UserID` IS NOT DISTINCT FROM
>     >     > ​​
>     >     > `t1`.`UserID`) LIMIT 2147483647
>     >     >
>     >     > I debugged Drill code and found it decomposes *IS NOT DISTINCT
> FROM*
>     > into
>     >     > ​
>     >     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
>     > `t1`.`UserID`
>     >     > IS NULL**)* while checking if the query is a cartesian join,
> and
>     > when the
>     >     > check returns true, it throws an excetion saying: *This query
> cannot
>     > be
>     >     > planned possibly due to either a cartesian join or an
> inequality
>     > join*
>     >     >
>     >     >
>     >     > *---------------------*
>     >     > *Muhammad Gelbana*
>     >     > http://www.linkedin.com/in/mgelbana
>     >     >
>     >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com>
>     > wrote:
>     >     >
>     >     > > Can you please specify the query you are trying to execute?
>     >     > >
>     >     > >
>     >     > > Gautam
>     >     > >
>     >     > > ________________________________
>     >     > > From: Muhammad Gelbana <m....@gmail.com>
>     >     > > Sent: Saturday, May 6, 2017 7:34:53 AM
>     >     > > To: user@drill.apache.org; dev@drill.apache.org
>     >     > > Subject: Running cartesian joins on Drill
>     >     > >
>     >     > > Is there a reason why Drill would intentionally reject
> cartesian
>     > join
>     >     > > queries even if *planner.enable_nljoin_for_scalar_only* is
>     > disabled ?
>     >     > >
>     >     > > Any ideas how could a query be rewritten to overcome this
>     > restriction ?
>     >     > >
>     >     > > *---------------------*
>     >     > > *Muhammad Gelbana*
>     >     > > http://www.linkedin.com/in/mgelbana
>     >     > >
>     >     >
>     >     >
>     >     >
>     >
>     >
>     >
>
>
>

Re: Running cartesian joins on Drill

Posted by Muhammad Gelbana <m....@gmail.com>.
​But the query I provided failed to be planned because it's a cartesian
join, although I've set the option you mentioned to false. Is there a
reason why wouldn't Drill rules physically implement the logical join in my
query to a nested loop join ?

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zf...@mapr.com> wrote:

> Provided `planner.enable_nljoin_for_scalar_only` is set to false, even
> without an explicit join condition, the query should use the Cartesian
> join/nested loop join.
>
> -- Zelaine
>
> On 5/11/17, 4:20 AM, "Anup Tiwari" <an...@games24x7.com> wrote:
>
>     Hi,
>
>     I have one question here.. so if we have to use Cartesian join in Drill
>     then do we have to follow some workaround like Shadi mention : adding a
>     dummy column on the fly that has the value 1 in both tables and then
> join
>     on that column leading to having a match of every row of the first
> table
>     with every row of the second table, hence do a Cartesian product?
>     OR
>     If we just don't specify join condition like :
>     select a.*, b.* from tt1 as a, tt2 b; then will it internally treat
> this
>     query as Cartesian join.
>
>     Regards,
>     *Anup Tiwari*
>
>     On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> wrote:
>
>     > Cartesian joins in Drill are implemented as nested loop joins, and I
> think
>     > you should see that reflected in the resultant query plan when you
> run
>     > explain plan on the query.
>     >
>     > Yes, Cartesian joins/nested loop joins are expensive because you’re
>     > effectively doing an MxN read of your tables.  There are more
> efficient
>     > ways of processing a nested loop join, e.g., by creating an index on
> the
>     > larger table in the join and then using that index to do lookups
> into that
>     > table.  That way, the nested loop join cost is the cost of creating
> the
>     > index + M, where M is the number of rows in the smaller table and
> assuming
>     > the lookup cost into the index does minimize the amount of data read
> of the
>     > second table.  Drill currently doesn’t do this.
>     >
>     > -- Zelaine
>     >
>     > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m....@gmail.com> wrote:
>     >
>     >     ​I believe ​clhubert is referring to this discussion
>     >     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
>     > cartesian-product-in-apache-drill#post1>
>     >     .
>     >
>     >     So why Drill doesn't transform this query into a nested join
> query ?
>     > Simply
>     >     because there is no Calcite rule to transform it into a nested
> loop
>     > join ?
>     >     Is it not technically possible to write such Rule or is it
> feasible so
>     > I
>     >     may take on this challenge ?
>     >
>     >     Also pardon me for repeating my question but I fail to find an
> answer
>     > in
>     >     your replies, why doesn't Drill just run a cartesian join ?
> Because
>     > it's
>     >     expensive regarding resources (i.e. CPU\Network\RAM) ?
>     >
>     >     Thanks a lot Shadi for the query, it works for me.
>     >
>     >     *---------------------*
>     >     *Muhammad Gelbana*
>     >     http://www.linkedin.com/in/mgelbana
>     >
>     >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <
> khalifa@cs.queensu.ca>
>     > wrote:
>     >
>     >     > Hi Muhammad,
>     >     >
>     >     > I did the following as a workaround to have Cartesian product.
> The
>     > basic
>     >     > idea is to create a dummy column on the fly that has the value
> 1 in
>     > both
>     >     > tables and then join on that column leading to having a match
> of
>     > every row
>     >     > of the first table with every row of the second table, hence
> do a
>     > Cartesian
>     >     > product. This might not be the most efficient way but it will
> do the
>     > job.
>     >     >
>     >     > *Original Query:*
>     >     > SELECT * FROM
>     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
> LIMIT
>     >     > 2147483647) `t0`
>     >     > INNER JOIN
>     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
> LIMIT
>     >     > 2147483647) `t1`
>     >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
>     >     > LIMIT 2147483647
>     >     >
>     >     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to
> tables
>     > one
>     >     > and two, respectively. Names don't really matter, just need to
> be
>     > unique):*
>     >     > SELECT * FROM
>     >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
>     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
>     >     > INNER JOIN
>     >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
>     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
>     >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
>     >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
>     >     > LIMIT 2147483647
>     >     >
>     >     > Regards
>     >     >
>     >     >
>     >     > *Shadi Khalifa, PhD*
>     >     > Postdoctoral Fellow
>     >     > Cognitive Analytics Development Hub
>     >     > Centre for Advanced Computing
>     >     > Queen’s University
>     >     > (613) 533-6000 x78347
>     >     > http://cac.queensu.ca
>     >     >
>     >     > I'm just a neuron in the society collective brain
>     >     >
>     >     > *Join us for HPCS in June 2017! Register at:*  *
> http://2017.hpcs.ca/
>     >     > <http://2017.hpcs.ca/>*
>     >     >
>     >     > P Please consider your environmental responsibility before
> printing
>     > this
>     >     > e-mail
>     >     >
>     >     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000
>     > 01000101
>     >     > 01100111 01111001 01110000 01110100 *
>     >     >
>     >     > *The information transmitted is intended only for the person or
>     > entity to
>     >     > which it is addressed and may contain confidential material.
> Any
>     > review or
>     >     > dissemination of this information by persons other than the
> intended
>     >     > recipient is prohibited. If you received this in error, please
>     > contact the
>     >     > sender and delete the material from any computer. Thank you.*
>     >     >
>     >     >
>     >     >
>     >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
>     > m.gelbana@gmail.com>
>     >     > wrote:
>     >     >
>     >     >
>     >     > ​​
>     >     > Here it is:
>     >     >
>     >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM
>     > `dfs`.`path_to_parquet_file` tc
>     >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
>     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
>     >     > ​​
>     >     > `t0`.`UserID` IS NOT DISTINCT FROM
>     >     > ​​
>     >     > `t1`.`UserID`) LIMIT 2147483647
>     >     >
>     >     > I debugged Drill code and found it decomposes *IS NOT DISTINCT
> FROM*
>     > into
>     >     > ​
>     >     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
>     > `t1`.`UserID`
>     >     > IS NULL**)* while checking if the query is a cartesian join,
> and
>     > when the
>     >     > check returns true, it throws an excetion saying: *This query
> cannot
>     > be
>     >     > planned possibly due to either a cartesian join or an
> inequality
>     > join*
>     >     >
>     >     >
>     >     > *---------------------*
>     >     > *Muhammad Gelbana*
>     >     > http://www.linkedin.com/in/mgelbana
>     >     >
>     >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com>
>     > wrote:
>     >     >
>     >     > > Can you please specify the query you are trying to execute?
>     >     > >
>     >     > >
>     >     > > Gautam
>     >     > >
>     >     > > ________________________________
>     >     > > From: Muhammad Gelbana <m....@gmail.com>
>     >     > > Sent: Saturday, May 6, 2017 7:34:53 AM
>     >     > > To: user@drill.apache.org; dev@drill.apache.org
>     >     > > Subject: Running cartesian joins on Drill
>     >     > >
>     >     > > Is there a reason why Drill would intentionally reject
> cartesian
>     > join
>     >     > > queries even if *planner.enable_nljoin_for_scalar_only* is
>     > disabled ?
>     >     > >
>     >     > > Any ideas how could a query be rewritten to overcome this
>     > restriction ?
>     >     > >
>     >     > > *---------------------*
>     >     > > *Muhammad Gelbana*
>     >     > > http://www.linkedin.com/in/mgelbana
>     >     > >
>     >     >
>     >     >
>     >     >
>     >
>     >
>     >
>
>
>

Re: Running cartesian joins on Drill

Posted by Zelaine Fong <zf...@mapr.com>.
Provided `planner.enable_nljoin_for_scalar_only` is set to false, even without an explicit join condition, the query should use the Cartesian join/nested loop join.

-- Zelaine

On 5/11/17, 4:20 AM, "Anup Tiwari" <an...@games24x7.com> wrote:

    Hi,
    
    I have one question here.. so if we have to use Cartesian join in Drill
    then do we have to follow some workaround like Shadi mention : adding a
    dummy column on the fly that has the value 1 in both tables and then join
    on that column leading to having a match of every row of the first table
    with every row of the second table, hence do a Cartesian product?
    OR
    If we just don't specify join condition like :
    select a.*, b.* from tt1 as a, tt2 b; then will it internally treat this
    query as Cartesian join.
    
    Regards,
    *Anup Tiwari*
    
    On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> wrote:
    
    > Cartesian joins in Drill are implemented as nested loop joins, and I think
    > you should see that reflected in the resultant query plan when you run
    > explain plan on the query.
    >
    > Yes, Cartesian joins/nested loop joins are expensive because you’re
    > effectively doing an MxN read of your tables.  There are more efficient
    > ways of processing a nested loop join, e.g., by creating an index on the
    > larger table in the join and then using that index to do lookups into that
    > table.  That way, the nested loop join cost is the cost of creating the
    > index + M, where M is the number of rows in the smaller table and assuming
    > the lookup cost into the index does minimize the amount of data read of the
    > second table.  Drill currently doesn’t do this.
    >
    > -- Zelaine
    >
    > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m....@gmail.com> wrote:
    >
    >     ​I believe ​clhubert is referring to this discussion
    >     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
    > cartesian-product-in-apache-drill#post1>
    >     .
    >
    >     So why Drill doesn't transform this query into a nested join query ?
    > Simply
    >     because there is no Calcite rule to transform it into a nested loop
    > join ?
    >     Is it not technically possible to write such Rule or is it feasible so
    > I
    >     may take on this challenge ?
    >
    >     Also pardon me for repeating my question but I fail to find an answer
    > in
    >     your replies, why doesn't Drill just run a cartesian join ? Because
    > it's
    >     expensive regarding resources (i.e. CPU\Network\RAM) ?
    >
    >     Thanks a lot Shadi for the query, it works for me.
    >
    >     *---------------------*
    >     *Muhammad Gelbana*
    >     http://www.linkedin.com/in/mgelbana
    >
    >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <kh...@cs.queensu.ca>
    > wrote:
    >
    >     > Hi Muhammad,
    >     >
    >     > I did the following as a workaround to have Cartesian product. The
    > basic
    >     > idea is to create a dummy column on the fly that has the value 1 in
    > both
    >     > tables and then join on that column leading to having a match of
    > every row
    >     > of the first table with every row of the second table, hence do a
    > Cartesian
    >     > product. This might not be the most efficient way but it will do the
    > job.
    >     >
    >     > *Original Query:*
    >     > SELECT * FROM
    >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
    >     > 2147483647) `t0`
    >     > INNER JOIN
    >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
    >     > 2147483647) `t1`
    >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
    >     > LIMIT 2147483647
    >     >
    >     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables
    > one
    >     > and two, respectively. Names don't really matter, just need to be
    > unique):*
    >     > SELECT * FROM
    >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
    >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
    >     > INNER JOIN
    >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
    >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
    >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
    >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
    >     > LIMIT 2147483647
    >     >
    >     > Regards
    >     >
    >     >
    >     > *Shadi Khalifa, PhD*
    >     > Postdoctoral Fellow
    >     > Cognitive Analytics Development Hub
    >     > Centre for Advanced Computing
    >     > Queen’s University
    >     > (613) 533-6000 x78347
    >     > http://cac.queensu.ca
    >     >
    >     > I'm just a neuron in the society collective brain
    >     >
    >     > *Join us for HPCS in June 2017! Register at:*  *http://2017.hpcs.ca/
    >     > <http://2017.hpcs.ca/>*
    >     >
    >     > P Please consider your environmental responsibility before printing
    > this
    >     > e-mail
    >     >
    >     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000
    > 01000101
    >     > 01100111 01111001 01110000 01110100 *
    >     >
    >     > *The information transmitted is intended only for the person or
    > entity to
    >     > which it is addressed and may contain confidential material. Any
    > review or
    >     > dissemination of this information by persons other than the intended
    >     > recipient is prohibited. If you received this in error, please
    > contact the
    >     > sender and delete the material from any computer. Thank you.*
    >     >
    >     >
    >     >
    >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
    > m.gelbana@gmail.com>
    >     > wrote:
    >     >
    >     >
    >     > ​​
    >     > Here it is:
    >     >
    >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM
    > `dfs`.`path_to_parquet_file` tc
    >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
    >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
    >     > ​​
    >     > `t0`.`UserID` IS NOT DISTINCT FROM
    >     > ​​
    >     > `t1`.`UserID`) LIMIT 2147483647
    >     >
    >     > I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM*
    > into
    >     > ​
    >     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
    > `t1`.`UserID`
    >     > IS NULL**)* while checking if the query is a cartesian join, and
    > when the
    >     > check returns true, it throws an excetion saying: *This query cannot
    > be
    >     > planned possibly due to either a cartesian join or an inequality
    > join*
    >     >
    >     >
    >     > *---------------------*
    >     > *Muhammad Gelbana*
    >     > http://www.linkedin.com/in/mgelbana
    >     >
    >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com>
    > wrote:
    >     >
    >     > > Can you please specify the query you are trying to execute?
    >     > >
    >     > >
    >     > > Gautam
    >     > >
    >     > > ________________________________
    >     > > From: Muhammad Gelbana <m....@gmail.com>
    >     > > Sent: Saturday, May 6, 2017 7:34:53 AM
    >     > > To: user@drill.apache.org; dev@drill.apache.org
    >     > > Subject: Running cartesian joins on Drill
    >     > >
    >     > > Is there a reason why Drill would intentionally reject cartesian
    > join
    >     > > queries even if *planner.enable_nljoin_for_scalar_only* is
    > disabled ?
    >     > >
    >     > > Any ideas how could a query be rewritten to overcome this
    > restriction ?
    >     > >
    >     > > *---------------------*
    >     > > *Muhammad Gelbana*
    >     > > http://www.linkedin.com/in/mgelbana
    >     > >
    >     >
    >     >
    >     >
    >
    >
    >
    


Re: Running cartesian joins on Drill

Posted by Zelaine Fong <zf...@mapr.com>.
Provided `planner.enable_nljoin_for_scalar_only` is set to false, even without an explicit join condition, the query should use the Cartesian join/nested loop join.

-- Zelaine

On 5/11/17, 4:20 AM, "Anup Tiwari" <an...@games24x7.com> wrote:

    Hi,
    
    I have one question here.. so if we have to use Cartesian join in Drill
    then do we have to follow some workaround like Shadi mention : adding a
    dummy column on the fly that has the value 1 in both tables and then join
    on that column leading to having a match of every row of the first table
    with every row of the second table, hence do a Cartesian product?
    OR
    If we just don't specify join condition like :
    select a.*, b.* from tt1 as a, tt2 b; then will it internally treat this
    query as Cartesian join.
    
    Regards,
    *Anup Tiwari*
    
    On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> wrote:
    
    > Cartesian joins in Drill are implemented as nested loop joins, and I think
    > you should see that reflected in the resultant query plan when you run
    > explain plan on the query.
    >
    > Yes, Cartesian joins/nested loop joins are expensive because you’re
    > effectively doing an MxN read of your tables.  There are more efficient
    > ways of processing a nested loop join, e.g., by creating an index on the
    > larger table in the join and then using that index to do lookups into that
    > table.  That way, the nested loop join cost is the cost of creating the
    > index + M, where M is the number of rows in the smaller table and assuming
    > the lookup cost into the index does minimize the amount of data read of the
    > second table.  Drill currently doesn’t do this.
    >
    > -- Zelaine
    >
    > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m....@gmail.com> wrote:
    >
    >     ​I believe ​clhubert is referring to this discussion
    >     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
    > cartesian-product-in-apache-drill#post1>
    >     .
    >
    >     So why Drill doesn't transform this query into a nested join query ?
    > Simply
    >     because there is no Calcite rule to transform it into a nested loop
    > join ?
    >     Is it not technically possible to write such Rule or is it feasible so
    > I
    >     may take on this challenge ?
    >
    >     Also pardon me for repeating my question but I fail to find an answer
    > in
    >     your replies, why doesn't Drill just run a cartesian join ? Because
    > it's
    >     expensive regarding resources (i.e. CPU\Network\RAM) ?
    >
    >     Thanks a lot Shadi for the query, it works for me.
    >
    >     *---------------------*
    >     *Muhammad Gelbana*
    >     http://www.linkedin.com/in/mgelbana
    >
    >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <kh...@cs.queensu.ca>
    > wrote:
    >
    >     > Hi Muhammad,
    >     >
    >     > I did the following as a workaround to have Cartesian product. The
    > basic
    >     > idea is to create a dummy column on the fly that has the value 1 in
    > both
    >     > tables and then join on that column leading to having a match of
    > every row
    >     > of the first table with every row of the second table, hence do a
    > Cartesian
    >     > product. This might not be the most efficient way but it will do the
    > job.
    >     >
    >     > *Original Query:*
    >     > SELECT * FROM
    >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
    >     > 2147483647) `t0`
    >     > INNER JOIN
    >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
    >     > 2147483647) `t1`
    >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
    >     > LIMIT 2147483647
    >     >
    >     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables
    > one
    >     > and two, respectively. Names don't really matter, just need to be
    > unique):*
    >     > SELECT * FROM
    >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
    >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
    >     > INNER JOIN
    >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
    >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
    >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
    >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
    >     > LIMIT 2147483647
    >     >
    >     > Regards
    >     >
    >     >
    >     > *Shadi Khalifa, PhD*
    >     > Postdoctoral Fellow
    >     > Cognitive Analytics Development Hub
    >     > Centre for Advanced Computing
    >     > Queen’s University
    >     > (613) 533-6000 x78347
    >     > http://cac.queensu.ca
    >     >
    >     > I'm just a neuron in the society collective brain
    >     >
    >     > *Join us for HPCS in June 2017! Register at:*  *http://2017.hpcs.ca/
    >     > <http://2017.hpcs.ca/>*
    >     >
    >     > P Please consider your environmental responsibility before printing
    > this
    >     > e-mail
    >     >
    >     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000
    > 01000101
    >     > 01100111 01111001 01110000 01110100 *
    >     >
    >     > *The information transmitted is intended only for the person or
    > entity to
    >     > which it is addressed and may contain confidential material. Any
    > review or
    >     > dissemination of this information by persons other than the intended
    >     > recipient is prohibited. If you received this in error, please
    > contact the
    >     > sender and delete the material from any computer. Thank you.*
    >     >
    >     >
    >     >
    >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
    > m.gelbana@gmail.com>
    >     > wrote:
    >     >
    >     >
    >     > ​​
    >     > Here it is:
    >     >
    >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM
    > `dfs`.`path_to_parquet_file` tc
    >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
    >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
    >     > ​​
    >     > `t0`.`UserID` IS NOT DISTINCT FROM
    >     > ​​
    >     > `t1`.`UserID`) LIMIT 2147483647
    >     >
    >     > I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM*
    > into
    >     > ​
    >     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
    > `t1`.`UserID`
    >     > IS NULL**)* while checking if the query is a cartesian join, and
    > when the
    >     > check returns true, it throws an excetion saying: *This query cannot
    > be
    >     > planned possibly due to either a cartesian join or an inequality
    > join*
    >     >
    >     >
    >     > *---------------------*
    >     > *Muhammad Gelbana*
    >     > http://www.linkedin.com/in/mgelbana
    >     >
    >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com>
    > wrote:
    >     >
    >     > > Can you please specify the query you are trying to execute?
    >     > >
    >     > >
    >     > > Gautam
    >     > >
    >     > > ________________________________
    >     > > From: Muhammad Gelbana <m....@gmail.com>
    >     > > Sent: Saturday, May 6, 2017 7:34:53 AM
    >     > > To: user@drill.apache.org; dev@drill.apache.org
    >     > > Subject: Running cartesian joins on Drill
    >     > >
    >     > > Is there a reason why Drill would intentionally reject cartesian
    > join
    >     > > queries even if *planner.enable_nljoin_for_scalar_only* is
    > disabled ?
    >     > >
    >     > > Any ideas how could a query be rewritten to overcome this
    > restriction ?
    >     > >
    >     > > *---------------------*
    >     > > *Muhammad Gelbana*
    >     > > http://www.linkedin.com/in/mgelbana
    >     > >
    >     >
    >     >
    >     >
    >
    >
    >
    


Re: Running cartesian joins on Drill

Posted by Anup Tiwari <an...@games24x7.com>.
Hi,

I have one question here.. so if we have to use Cartesian join in Drill
then do we have to follow some workaround like Shadi mention : adding a
dummy column on the fly that has the value 1 in both tables and then join
on that column leading to having a match of every row of the first table
with every row of the second table, hence do a Cartesian product?
OR
If we just don't specify join condition like :
select a.*, b.* from tt1 as a, tt2 b; then will it internally treat this
query as Cartesian join.

Regards,
*Anup Tiwari*

On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> wrote:

> Cartesian joins in Drill are implemented as nested loop joins, and I think
> you should see that reflected in the resultant query plan when you run
> explain plan on the query.
>
> Yes, Cartesian joins/nested loop joins are expensive because you’re
> effectively doing an MxN read of your tables.  There are more efficient
> ways of processing a nested loop join, e.g., by creating an index on the
> larger table in the join and then using that index to do lookups into that
> table.  That way, the nested loop join cost is the cost of creating the
> index + M, where M is the number of rows in the smaller table and assuming
> the lookup cost into the index does minimize the amount of data read of the
> second table.  Drill currently doesn’t do this.
>
> -- Zelaine
>
> On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m....@gmail.com> wrote:
>
>     ​I believe ​clhubert is referring to this discussion
>     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
> cartesian-product-in-apache-drill#post1>
>     .
>
>     So why Drill doesn't transform this query into a nested join query ?
> Simply
>     because there is no Calcite rule to transform it into a nested loop
> join ?
>     Is it not technically possible to write such Rule or is it feasible so
> I
>     may take on this challenge ?
>
>     Also pardon me for repeating my question but I fail to find an answer
> in
>     your replies, why doesn't Drill just run a cartesian join ? Because
> it's
>     expensive regarding resources (i.e. CPU\Network\RAM) ?
>
>     Thanks a lot Shadi for the query, it works for me.
>
>     *---------------------*
>     *Muhammad Gelbana*
>     http://www.linkedin.com/in/mgelbana
>
>     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <kh...@cs.queensu.ca>
> wrote:
>
>     > Hi Muhammad,
>     >
>     > I did the following as a workaround to have Cartesian product. The
> basic
>     > idea is to create a dummy column on the fly that has the value 1 in
> both
>     > tables and then join on that column leading to having a match of
> every row
>     > of the first table with every row of the second table, hence do a
> Cartesian
>     > product. This might not be the most efficient way but it will do the
> job.
>     >
>     > *Original Query:*
>     > SELECT * FROM
>     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
>     > 2147483647) `t0`
>     > INNER JOIN
>     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
>     > 2147483647) `t1`
>     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
>     > LIMIT 2147483647
>     >
>     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables
> one
>     > and two, respectively. Names don't really matter, just need to be
> unique):*
>     > SELECT * FROM
>     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
>     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
>     > INNER JOIN
>     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
>     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
>     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
>     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
>     > LIMIT 2147483647
>     >
>     > Regards
>     >
>     >
>     > *Shadi Khalifa, PhD*
>     > Postdoctoral Fellow
>     > Cognitive Analytics Development Hub
>     > Centre for Advanced Computing
>     > Queen’s University
>     > (613) 533-6000 x78347
>     > http://cac.queensu.ca
>     >
>     > I'm just a neuron in the society collective brain
>     >
>     > *Join us for HPCS in June 2017! Register at:*  *http://2017.hpcs.ca/
>     > <http://2017.hpcs.ca/>*
>     >
>     > P Please consider your environmental responsibility before printing
> this
>     > e-mail
>     >
>     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000
> 01000101
>     > 01100111 01111001 01110000 01110100 *
>     >
>     > *The information transmitted is intended only for the person or
> entity to
>     > which it is addressed and may contain confidential material. Any
> review or
>     > dissemination of this information by persons other than the intended
>     > recipient is prohibited. If you received this in error, please
> contact the
>     > sender and delete the material from any computer. Thank you.*
>     >
>     >
>     >
>     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
> m.gelbana@gmail.com>
>     > wrote:
>     >
>     >
>     > ​​
>     > Here it is:
>     >
>     > SELECT * FROM (SELECT 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc
>     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
>     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
>     > ​​
>     > `t0`.`UserID` IS NOT DISTINCT FROM
>     > ​​
>     > `t1`.`UserID`) LIMIT 2147483647
>     >
>     > I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM*
> into
>     > ​
>     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
> `t1`.`UserID`
>     > IS NULL**)* while checking if the query is a cartesian join, and
> when the
>     > check returns true, it throws an excetion saying: *This query cannot
> be
>     > planned possibly due to either a cartesian join or an inequality
> join*
>     >
>     >
>     > *---------------------*
>     > *Muhammad Gelbana*
>     > http://www.linkedin.com/in/mgelbana
>     >
>     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com>
> wrote:
>     >
>     > > Can you please specify the query you are trying to execute?
>     > >
>     > >
>     > > Gautam
>     > >
>     > > ________________________________
>     > > From: Muhammad Gelbana <m....@gmail.com>
>     > > Sent: Saturday, May 6, 2017 7:34:53 AM
>     > > To: user@drill.apache.org; dev@drill.apache.org
>     > > Subject: Running cartesian joins on Drill
>     > >
>     > > Is there a reason why Drill would intentionally reject cartesian
> join
>     > > queries even if *planner.enable_nljoin_for_scalar_only* is
> disabled ?
>     > >
>     > > Any ideas how could a query be rewritten to overcome this
> restriction ?
>     > >
>     > > *---------------------*
>     > > *Muhammad Gelbana*
>     > > http://www.linkedin.com/in/mgelbana
>     > >
>     >
>     >
>     >
>
>
>

Re: Running cartesian joins on Drill

Posted by Anup Tiwari <an...@games24x7.com>.
Hi,

I have one question here.. so if we have to use Cartesian join in Drill
then do we have to follow some workaround like Shadi mention : adding a
dummy column on the fly that has the value 1 in both tables and then join
on that column leading to having a match of every row of the first table
with every row of the second table, hence do a Cartesian product?
OR
If we just don't specify join condition like :
select a.*, b.* from tt1 as a, tt2 b; then will it internally treat this
query as Cartesian join.

Regards,
*Anup Tiwari*

On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> wrote:

> Cartesian joins in Drill are implemented as nested loop joins, and I think
> you should see that reflected in the resultant query plan when you run
> explain plan on the query.
>
> Yes, Cartesian joins/nested loop joins are expensive because you’re
> effectively doing an MxN read of your tables.  There are more efficient
> ways of processing a nested loop join, e.g., by creating an index on the
> larger table in the join and then using that index to do lookups into that
> table.  That way, the nested loop join cost is the cost of creating the
> index + M, where M is the number of rows in the smaller table and assuming
> the lookup cost into the index does minimize the amount of data read of the
> second table.  Drill currently doesn’t do this.
>
> -- Zelaine
>
> On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m....@gmail.com> wrote:
>
>     ​I believe ​clhubert is referring to this discussion
>     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
> cartesian-product-in-apache-drill#post1>
>     .
>
>     So why Drill doesn't transform this query into a nested join query ?
> Simply
>     because there is no Calcite rule to transform it into a nested loop
> join ?
>     Is it not technically possible to write such Rule or is it feasible so
> I
>     may take on this challenge ?
>
>     Also pardon me for repeating my question but I fail to find an answer
> in
>     your replies, why doesn't Drill just run a cartesian join ? Because
> it's
>     expensive regarding resources (i.e. CPU\Network\RAM) ?
>
>     Thanks a lot Shadi for the query, it works for me.
>
>     *---------------------*
>     *Muhammad Gelbana*
>     http://www.linkedin.com/in/mgelbana
>
>     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <kh...@cs.queensu.ca>
> wrote:
>
>     > Hi Muhammad,
>     >
>     > I did the following as a workaround to have Cartesian product. The
> basic
>     > idea is to create a dummy column on the fly that has the value 1 in
> both
>     > tables and then join on that column leading to having a match of
> every row
>     > of the first table with every row of the second table, hence do a
> Cartesian
>     > product. This might not be the most efficient way but it will do the
> job.
>     >
>     > *Original Query:*
>     > SELECT * FROM
>     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
>     > 2147483647) `t0`
>     > INNER JOIN
>     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
>     > 2147483647) `t1`
>     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
>     > LIMIT 2147483647
>     >
>     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables
> one
>     > and two, respectively. Names don't really matter, just need to be
> unique):*
>     > SELECT * FROM
>     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
>     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
>     > INNER JOIN
>     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
>     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
>     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
>     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
>     > LIMIT 2147483647
>     >
>     > Regards
>     >
>     >
>     > *Shadi Khalifa, PhD*
>     > Postdoctoral Fellow
>     > Cognitive Analytics Development Hub
>     > Centre for Advanced Computing
>     > Queen’s University
>     > (613) 533-6000 x78347
>     > http://cac.queensu.ca
>     >
>     > I'm just a neuron in the society collective brain
>     >
>     > *Join us for HPCS in June 2017! Register at:*  *http://2017.hpcs.ca/
>     > <http://2017.hpcs.ca/>*
>     >
>     > P Please consider your environmental responsibility before printing
> this
>     > e-mail
>     >
>     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000
> 01000101
>     > 01100111 01111001 01110000 01110100 *
>     >
>     > *The information transmitted is intended only for the person or
> entity to
>     > which it is addressed and may contain confidential material. Any
> review or
>     > dissemination of this information by persons other than the intended
>     > recipient is prohibited. If you received this in error, please
> contact the
>     > sender and delete the material from any computer. Thank you.*
>     >
>     >
>     >
>     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
> m.gelbana@gmail.com>
>     > wrote:
>     >
>     >
>     > ​​
>     > Here it is:
>     >
>     > SELECT * FROM (SELECT 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc
>     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
>     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
>     > ​​
>     > `t0`.`UserID` IS NOT DISTINCT FROM
>     > ​​
>     > `t1`.`UserID`) LIMIT 2147483647
>     >
>     > I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM*
> into
>     > ​
>     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
> `t1`.`UserID`
>     > IS NULL**)* while checking if the query is a cartesian join, and
> when the
>     > check returns true, it throws an excetion saying: *This query cannot
> be
>     > planned possibly due to either a cartesian join or an inequality
> join*
>     >
>     >
>     > *---------------------*
>     > *Muhammad Gelbana*
>     > http://www.linkedin.com/in/mgelbana
>     >
>     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com>
> wrote:
>     >
>     > > Can you please specify the query you are trying to execute?
>     > >
>     > >
>     > > Gautam
>     > >
>     > > ________________________________
>     > > From: Muhammad Gelbana <m....@gmail.com>
>     > > Sent: Saturday, May 6, 2017 7:34:53 AM
>     > > To: user@drill.apache.org; dev@drill.apache.org
>     > > Subject: Running cartesian joins on Drill
>     > >
>     > > Is there a reason why Drill would intentionally reject cartesian
> join
>     > > queries even if *planner.enable_nljoin_for_scalar_only* is
> disabled ?
>     > >
>     > > Any ideas how could a query be rewritten to overcome this
> restriction ?
>     > >
>     > > *---------------------*
>     > > *Muhammad Gelbana*
>     > > http://www.linkedin.com/in/mgelbana
>     > >
>     >
>     >
>     >
>
>
>

Re: Running cartesian joins on Drill

Posted by Zelaine Fong <zf...@mapr.com>.
Cartesian joins in Drill are implemented as nested loop joins, and I think you should see that reflected in the resultant query plan when you run explain plan on the query.

Yes, Cartesian joins/nested loop joins are expensive because you’re effectively doing an MxN read of your tables.  There are more efficient ways of processing a nested loop join, e.g., by creating an index on the larger table in the join and then using that index to do lookups into that table.  That way, the nested loop join cost is the cost of creating the index + M, where M is the number of rows in the smaller table and assuming the lookup cost into the index does minimize the amount of data read of the second table.  Drill currently doesn’t do this.

-- Zelaine

On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m....@gmail.com> wrote:

    ​I believe ​clhubert is referring to this discussion
    <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/cartesian-product-in-apache-drill#post1>
    .
    
    So why Drill doesn't transform this query into a nested join query ? Simply
    because there is no Calcite rule to transform it into a nested loop join ?
    Is it not technically possible to write such Rule or is it feasible so I
    may take on this challenge ?
    
    Also pardon me for repeating my question but I fail to find an answer in
    your replies, why doesn't Drill just run a cartesian join ? Because it's
    expensive regarding resources (i.e. CPU\Network\RAM) ?
    
    Thanks a lot Shadi for the query, it works for me.
    
    *---------------------*
    *Muhammad Gelbana*
    http://www.linkedin.com/in/mgelbana
    
    On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <kh...@cs.queensu.ca> wrote:
    
    > Hi Muhammad,
    >
    > I did the following as a workaround to have Cartesian product. The basic
    > idea is to create a dummy column on the fly that has the value 1 in both
    > tables and then join on that column leading to having a match of every row
    > of the first table with every row of the second table, hence do a Cartesian
    > product. This might not be the most efficient way but it will do the job.
    >
    > *Original Query:*
    > SELECT * FROM
    > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
    > 2147483647) `t0`
    > INNER JOIN
    > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
    > 2147483647) `t1`
    > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
    > LIMIT 2147483647
    >
    > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables one
    > and two, respectively. Names don't really matter, just need to be unique):*
    > SELECT * FROM
    > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
    > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
    > INNER JOIN
    > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
    > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
    > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
    > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
    > LIMIT 2147483647
    >
    > Regards
    >
    >
    > *Shadi Khalifa, PhD*
    > Postdoctoral Fellow
    > Cognitive Analytics Development Hub
    > Centre for Advanced Computing
    > Queen’s University
    > (613) 533-6000 x78347
    > http://cac.queensu.ca
    >
    > I'm just a neuron in the society collective brain
    >
    > *Join us for HPCS in June 2017! Register at:*  *http://2017.hpcs.ca/
    > <http://2017.hpcs.ca/>*
    >
    > P Please consider your environmental responsibility before printing this
    > e-mail
    >
    > *01001001 00100000 01101100 01101111 01110110 01100101 00100000 01000101
    > 01100111 01111001 01110000 01110100 *
    >
    > *The information transmitted is intended only for the person or entity to
    > which it is addressed and may contain confidential material. Any review or
    > dissemination of this information by persons other than the intended
    > recipient is prohibited. If you received this in error, please contact the
    > sender and delete the material from any computer. Thank you.*
    >
    >
    >
    > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <m....@gmail.com>
    > wrote:
    >
    >
    > ​​
    > Here it is:
    >
    > SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
    > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
    > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
    > ​​
    > `t0`.`UserID` IS NOT DISTINCT FROM
    > ​​
    > `t1`.`UserID`) LIMIT 2147483647
    >
    > I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* into
    > ​
    > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
    > IS NULL**)* while checking if the query is a cartesian join, and when the
    > check returns true, it throws an excetion saying: *This query cannot be
    > planned possibly due to either a cartesian join or an inequality join*
    >
    >
    > *---------------------*
    > *Muhammad Gelbana*
    > http://www.linkedin.com/in/mgelbana
    >
    > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com> wrote:
    >
    > > Can you please specify the query you are trying to execute?
    > >
    > >
    > > Gautam
    > >
    > > ________________________________
    > > From: Muhammad Gelbana <m....@gmail.com>
    > > Sent: Saturday, May 6, 2017 7:34:53 AM
    > > To: user@drill.apache.org; dev@drill.apache.org
    > > Subject: Running cartesian joins on Drill
    > >
    > > Is there a reason why Drill would intentionally reject cartesian join
    > > queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?
    > >
    > > Any ideas how could a query be rewritten to overcome this restriction ?
    > >
    > > *---------------------*
    > > *Muhammad Gelbana*
    > > http://www.linkedin.com/in/mgelbana
    > >
    >
    >
    >
    


Re: Running cartesian joins on Drill

Posted by Zelaine Fong <zf...@mapr.com>.
Cartesian joins in Drill are implemented as nested loop joins, and I think you should see that reflected in the resultant query plan when you run explain plan on the query.

Yes, Cartesian joins/nested loop joins are expensive because you’re effectively doing an MxN read of your tables.  There are more efficient ways of processing a nested loop join, e.g., by creating an index on the larger table in the join and then using that index to do lookups into that table.  That way, the nested loop join cost is the cost of creating the index + M, where M is the number of rows in the smaller table and assuming the lookup cost into the index does minimize the amount of data read of the second table.  Drill currently doesn’t do this.

-- Zelaine

On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m....@gmail.com> wrote:

    ​I believe ​clhubert is referring to this discussion
    <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/cartesian-product-in-apache-drill#post1>
    .
    
    So why Drill doesn't transform this query into a nested join query ? Simply
    because there is no Calcite rule to transform it into a nested loop join ?
    Is it not technically possible to write such Rule or is it feasible so I
    may take on this challenge ?
    
    Also pardon me for repeating my question but I fail to find an answer in
    your replies, why doesn't Drill just run a cartesian join ? Because it's
    expensive regarding resources (i.e. CPU\Network\RAM) ?
    
    Thanks a lot Shadi for the query, it works for me.
    
    *---------------------*
    *Muhammad Gelbana*
    http://www.linkedin.com/in/mgelbana
    
    On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <kh...@cs.queensu.ca> wrote:
    
    > Hi Muhammad,
    >
    > I did the following as a workaround to have Cartesian product. The basic
    > idea is to create a dummy column on the fly that has the value 1 in both
    > tables and then join on that column leading to having a match of every row
    > of the first table with every row of the second table, hence do a Cartesian
    > product. This might not be the most efficient way but it will do the job.
    >
    > *Original Query:*
    > SELECT * FROM
    > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
    > 2147483647) `t0`
    > INNER JOIN
    > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
    > 2147483647) `t1`
    > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
    > LIMIT 2147483647
    >
    > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables one
    > and two, respectively. Names don't really matter, just need to be unique):*
    > SELECT * FROM
    > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
    > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
    > INNER JOIN
    > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
    > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
    > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
    > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
    > LIMIT 2147483647
    >
    > Regards
    >
    >
    > *Shadi Khalifa, PhD*
    > Postdoctoral Fellow
    > Cognitive Analytics Development Hub
    > Centre for Advanced Computing
    > Queen’s University
    > (613) 533-6000 x78347
    > http://cac.queensu.ca
    >
    > I'm just a neuron in the society collective brain
    >
    > *Join us for HPCS in June 2017! Register at:*  *http://2017.hpcs.ca/
    > <http://2017.hpcs.ca/>*
    >
    > P Please consider your environmental responsibility before printing this
    > e-mail
    >
    > *01001001 00100000 01101100 01101111 01110110 01100101 00100000 01000101
    > 01100111 01111001 01110000 01110100 *
    >
    > *The information transmitted is intended only for the person or entity to
    > which it is addressed and may contain confidential material. Any review or
    > dissemination of this information by persons other than the intended
    > recipient is prohibited. If you received this in error, please contact the
    > sender and delete the material from any computer. Thank you.*
    >
    >
    >
    > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <m....@gmail.com>
    > wrote:
    >
    >
    > ​​
    > Here it is:
    >
    > SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
    > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
    > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
    > ​​
    > `t0`.`UserID` IS NOT DISTINCT FROM
    > ​​
    > `t1`.`UserID`) LIMIT 2147483647
    >
    > I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* into
    > ​
    > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
    > IS NULL**)* while checking if the query is a cartesian join, and when the
    > check returns true, it throws an excetion saying: *This query cannot be
    > planned possibly due to either a cartesian join or an inequality join*
    >
    >
    > *---------------------*
    > *Muhammad Gelbana*
    > http://www.linkedin.com/in/mgelbana
    >
    > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com> wrote:
    >
    > > Can you please specify the query you are trying to execute?
    > >
    > >
    > > Gautam
    > >
    > > ________________________________
    > > From: Muhammad Gelbana <m....@gmail.com>
    > > Sent: Saturday, May 6, 2017 7:34:53 AM
    > > To: user@drill.apache.org; dev@drill.apache.org
    > > Subject: Running cartesian joins on Drill
    > >
    > > Is there a reason why Drill would intentionally reject cartesian join
    > > queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?
    > >
    > > Any ideas how could a query be rewritten to overcome this restriction ?
    > >
    > > *---------------------*
    > > *Muhammad Gelbana*
    > > http://www.linkedin.com/in/mgelbana
    > >
    >
    >
    >
    


Re: Running cartesian joins on Drill

Posted by Muhammad Gelbana <m....@gmail.com>.
​I believe ​clhubert is referring to this discussion
<http://drill-user.incubator.apache.narkive.com/TIXWiTY4/cartesian-product-in-apache-drill#post1>
.

So why Drill doesn't transform this query into a nested join query ? Simply
because there is no Calcite rule to transform it into a nested loop join ?
Is it not technically possible to write such Rule or is it feasible so I
may take on this challenge ?

Also pardon me for repeating my question but I fail to find an answer in
your replies, why doesn't Drill just run a cartesian join ? Because it's
expensive regarding resources (i.e. CPU\Network\RAM) ?

Thanks a lot Shadi for the query, it works for me.

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <kh...@cs.queensu.ca> wrote:

> Hi Muhammad,
>
> I did the following as a workaround to have Cartesian product. The basic
> idea is to create a dummy column on the fly that has the value 1 in both
> tables and then join on that column leading to having a match of every row
> of the first table with every row of the second table, hence do a Cartesian
> product. This might not be the most efficient way but it will do the job.
>
> *Original Query:*
> SELECT * FROM
> ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
> 2147483647) `t0`
> INNER JOIN
> ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
> 2147483647) `t1`
> ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
> LIMIT 2147483647
>
> *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables one
> and two, respectively. Names don't really matter, just need to be unique):*
> SELECT * FROM
> ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
> INNER JOIN
> ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
> ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
> WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
> LIMIT 2147483647
>
> Regards
>
>
> *Shadi Khalifa, PhD*
> Postdoctoral Fellow
> Cognitive Analytics Development Hub
> Centre for Advanced Computing
> Queen’s University
> (613) 533-6000 x78347
> http://cac.queensu.ca
>
> I'm just a neuron in the society collective brain
>
> *Join us for HPCS in June 2017! Register at:*  *http://2017.hpcs.ca/
> <http://2017.hpcs.ca/>*
>
> P Please consider your environmental responsibility before printing this
> e-mail
>
> *01001001 00100000 01101100 01101111 01110110 01100101 00100000 01000101
> 01100111 01111001 01110000 01110100 *
>
> *The information transmitted is intended only for the person or entity to
> which it is addressed and may contain confidential material. Any review or
> dissemination of this information by persons other than the intended
> recipient is prohibited. If you received this in error, please contact the
> sender and delete the material from any computer. Thank you.*
>
>
>
> On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <m....@gmail.com>
> wrote:
>
>
> ​​
> Here it is:
>
> SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
> LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
> ​​
> `t0`.`UserID` IS NOT DISTINCT FROM
> ​​
> `t1`.`UserID`) LIMIT 2147483647
>
> I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* into
> ​
> *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
> IS NULL**)* while checking if the query is a cartesian join, and when the
> check returns true, it throws an excetion saying: *This query cannot be
> planned possibly due to either a cartesian join or an inequality join*
>
>
> *---------------------*
> *Muhammad Gelbana*
> http://www.linkedin.com/in/mgelbana
>
> On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com> wrote:
>
> > Can you please specify the query you are trying to execute?
> >
> >
> > Gautam
> >
> > ________________________________
> > From: Muhammad Gelbana <m....@gmail.com>
> > Sent: Saturday, May 6, 2017 7:34:53 AM
> > To: user@drill.apache.org; dev@drill.apache.org
> > Subject: Running cartesian joins on Drill
> >
> > Is there a reason why Drill would intentionally reject cartesian join
> > queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?
> >
> > Any ideas how could a query be rewritten to overcome this restriction ?
> >
> > *---------------------*
> > *Muhammad Gelbana*
> > http://www.linkedin.com/in/mgelbana
> >
>
>
>

Re: Running cartesian joins on Drill

Posted by Muhammad Gelbana <m....@gmail.com>.
​I believe ​clhubert is referring to this discussion
<http://drill-user.incubator.apache.narkive.com/TIXWiTY4/cartesian-product-in-apache-drill#post1>
.

So why Drill doesn't transform this query into a nested join query ? Simply
because there is no Calcite rule to transform it into a nested loop join ?
Is it not technically possible to write such Rule or is it feasible so I
may take on this challenge ?

Also pardon me for repeating my question but I fail to find an answer in
your replies, why doesn't Drill just run a cartesian join ? Because it's
expensive regarding resources (i.e. CPU\Network\RAM) ?

Thanks a lot Shadi for the query, it works for me.

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <kh...@cs.queensu.ca> wrote:

> Hi Muhammad,
>
> I did the following as a workaround to have Cartesian product. The basic
> idea is to create a dummy column on the fly that has the value 1 in both
> tables and then join on that column leading to having a match of every row
> of the first table with every row of the second table, hence do a Cartesian
> product. This might not be the most efficient way but it will do the job.
>
> *Original Query:*
> SELECT * FROM
> ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
> 2147483647) `t0`
> INNER JOIN
> ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
> 2147483647) `t1`
> ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
> LIMIT 2147483647
>
> *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables one
> and two, respectively. Names don't really matter, just need to be unique):*
> SELECT * FROM
> ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
> INNER JOIN
> ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
> ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
> WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
> LIMIT 2147483647
>
> Regards
>
>
> *Shadi Khalifa, PhD*
> Postdoctoral Fellow
> Cognitive Analytics Development Hub
> Centre for Advanced Computing
> Queen’s University
> (613) 533-6000 x78347
> http://cac.queensu.ca
>
> I'm just a neuron in the society collective brain
>
> *Join us for HPCS in June 2017! Register at:*  *http://2017.hpcs.ca/
> <http://2017.hpcs.ca/>*
>
> P Please consider your environmental responsibility before printing this
> e-mail
>
> *01001001 00100000 01101100 01101111 01110110 01100101 00100000 01000101
> 01100111 01111001 01110000 01110100 *
>
> *The information transmitted is intended only for the person or entity to
> which it is addressed and may contain confidential material. Any review or
> dissemination of this information by persons other than the intended
> recipient is prohibited. If you received this in error, please contact the
> sender and delete the material from any computer. Thank you.*
>
>
>
> On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <m....@gmail.com>
> wrote:
>
>
> ​​
> Here it is:
>
> SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
> LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
> ​​
> `t0`.`UserID` IS NOT DISTINCT FROM
> ​​
> `t1`.`UserID`) LIMIT 2147483647
>
> I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* into
> ​
> *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
> IS NULL**)* while checking if the query is a cartesian join, and when the
> check returns true, it throws an excetion saying: *This query cannot be
> planned possibly due to either a cartesian join or an inequality join*
>
>
> *---------------------*
> *Muhammad Gelbana*
> http://www.linkedin.com/in/mgelbana
>
> On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com> wrote:
>
> > Can you please specify the query you are trying to execute?
> >
> >
> > Gautam
> >
> > ________________________________
> > From: Muhammad Gelbana <m....@gmail.com>
> > Sent: Saturday, May 6, 2017 7:34:53 AM
> > To: user@drill.apache.org; dev@drill.apache.org
> > Subject: Running cartesian joins on Drill
> >
> > Is there a reason why Drill would intentionally reject cartesian join
> > queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?
> >
> > Any ideas how could a query be rewritten to overcome this restriction ?
> >
> > *---------------------*
> > *Muhammad Gelbana*
> > http://www.linkedin.com/in/mgelbana
> >
>
>
>

Re: Running cartesian joins on Drill

Posted by Shadi Khalifa <kh...@cs.queensu.ca>.
Hi Muhammad,
I did the following as a workaround to have Cartesian product. The basic idea is to create a dummy column on the fly that has the value 1 in both tables and then join on that column leading to having a match of every row of the first table with every row of the second table, hence do a Cartesian product. This might not be the most efficient way but it will do the job.
Original Query:SELECT * FROM ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0` INNER JOIN ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`) LIMIT 2147483647
Workaround (add columns d1a381f3g73 and d1a381f3g74 to tables one and two, respectively. Names don't really matter, just need to be unique):SELECT * FROM ( SELECT 1 as d1a381f3g73, 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0` INNER JOIN ( SELECT 1 as d1a381f3g74, 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (`t0`.d1a381f3g73 = `t1`.d1a381f3g74)WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`LIMIT 2147483647 Regards 

 Shadi Khalifa, PhD Postdoctoral Fellow Cognitive Analytics Development Hub Centre for Advanced Computing Queen’s University (613) 533-6000 x78347 http://cac.queensu.ca  I'm just a neuron in thesociety collective brain 
Join us for HPCS in June 2017! Register at:  http://2017.hpcs.ca/
 P Please consider yourenvironmental responsibility before printing this e-mail    01001001 00100000 01101100 01101111 0111011001100101 00100000 01000101 01100111 01111001 01110000 01110100 
  The information transmitted is intended only forthe person or entity to which it is addressed and may contain confidential material.Any review or dissemination of this information by persons other than theintended recipient is prohibited. If you received this in error, please contactthe sender and delete the material from any computer. Thank you.
 

    On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <m....@gmail.com> wrote:
 

 ​​
Here it is:

SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
​​
`t0`.`UserID` IS NOT DISTINCT FROM
​​
`t1`.`UserID`) LIMIT 2147483647

I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* into
​
*`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
IS NULL**)* while checking if the query is a cartesian join, and when the
check returns true, it throws an excetion saying: *This query cannot be
planned possibly due to either a cartesian join or an inequality join*


*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com> wrote:

> Can you please specify the query you are trying to execute?
>
>
> Gautam
>
> ________________________________
> From: Muhammad Gelbana <m....@gmail.com>
> Sent: Saturday, May 6, 2017 7:34:53 AM
> To: user@drill.apache.org; dev@drill.apache.org
> Subject: Running cartesian joins on Drill
>
> Is there a reason why Drill would intentionally reject cartesian join
> queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?
>
> Any ideas how could a query be rewritten to overcome this restriction ?
>
> *---------------------*
> *Muhammad Gelbana*
> http://www.linkedin.com/in/mgelbana
>

   

Re: Running cartesian joins on Drill

Posted by Shadi Khalifa <kh...@cs.queensu.ca>.
Hi Muhammad,
I did the following as a workaround to have Cartesian product. The basic idea is to create a dummy column on the fly that has the value 1 in both tables and then join on that column leading to having a match of every row of the first table with every row of the second table, hence do a Cartesian product. This might not be the most efficient way but it will do the job.
Original Query:SELECT * FROM ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0` INNER JOIN ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`) LIMIT 2147483647
Workaround (add columns d1a381f3g73 and d1a381f3g74 to tables one and two, respectively. Names don't really matter, just need to be unique):SELECT * FROM ( SELECT 1 as d1a381f3g73, 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0` INNER JOIN ( SELECT 1 as d1a381f3g74, 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (`t0`.d1a381f3g73 = `t1`.d1a381f3g74)WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`LIMIT 2147483647 Regards 

 Shadi Khalifa, PhD Postdoctoral Fellow Cognitive Analytics Development Hub Centre for Advanced Computing Queen’s University (613) 533-6000 x78347 http://cac.queensu.ca  I'm just a neuron in thesociety collective brain 
Join us for HPCS in June 2017! Register at:  http://2017.hpcs.ca/
 P Please consider yourenvironmental responsibility before printing this e-mail    01001001 00100000 01101100 01101111 0111011001100101 00100000 01000101 01100111 01111001 01110000 01110100 
  The information transmitted is intended only forthe person or entity to which it is addressed and may contain confidential material.Any review or dissemination of this information by persons other than theintended recipient is prohibited. If you received this in error, please contactthe sender and delete the material from any computer. Thank you.
 

    On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <m....@gmail.com> wrote:
 

 ​​
Here it is:

SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
​​
`t0`.`UserID` IS NOT DISTINCT FROM
​​
`t1`.`UserID`) LIMIT 2147483647

I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* into
​
*`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
IS NULL**)* while checking if the query is a cartesian join, and when the
check returns true, it throws an excetion saying: *This query cannot be
planned possibly due to either a cartesian join or an inequality join*


*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com> wrote:

> Can you please specify the query you are trying to execute?
>
>
> Gautam
>
> ________________________________
> From: Muhammad Gelbana <m....@gmail.com>
> Sent: Saturday, May 6, 2017 7:34:53 AM
> To: user@drill.apache.org; dev@drill.apache.org
> Subject: Running cartesian joins on Drill
>
> Is there a reason why Drill would intentionally reject cartesian join
> queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?
>
> Any ideas how could a query be rewritten to overcome this restriction ?
>
> *---------------------*
> *Muhammad Gelbana*
> http://www.linkedin.com/in/mgelbana
>

   

Re: Running cartesian joins on Drill

Posted by Zelaine Fong <zf...@mapr.com>.
@chulbert is correct.  You have to disable planner.enable_nljoin_for_scalar_only in order for Drill to use nested loop joins.  Otherwise, it only does so when one of the join inputs is a scalar subquery, i.e., a single return row.

Drill requires disabling the parameter because nested loop joins are currently implemented as inefficient Cartesian joins, which can be very expensive.  The end user needs to make an explicit choice to use them.

-- Zelaine

On 5/6/17, 9:17 PM, "clhubert@gmail.com" <cl...@gmail.com> wrote:

    As you've observed, we have to manually enable nested loops (which
    cartesian joins require).
    
    alter session set `planner.enable_nljoin_for_scalar_only` = false;
    
    I am guessing this approach was taken because
    
    1.  Although ANSI compliant SQL.  These type of joins are not typical in
    most applications.
    2.  To avoid the possibility of the cost based optimizer selecting a nested
    loop algorithm (expensive join algorithm) by mistake.  If this happens a
    lot, it would decrease overall performance.
    3.  Simple enough workaround (manually enabling/disabling) until the logic
    is developed to automatically enable nested loops when needed.
    
    If anyone knows for "sure", please chime in.
    
    OLAP Engines like Modrian have already been reported to generate queries
    that need Cartesian joins.
    https://issues.apache.org/jira/browse/DRILL-2915
    
    Hopefully this helps...
    
    For my use case, it's no big deal to enable it manually.    If performance
    ever becomes an issue, I can toggle it off and on as needed.
    
    I asked a similar question..   If you check back in the archives for this
    list
    (12/2016)  you will see the details.
    
    
    
    
    On Sat, May 6, 2017 at 3:05 PM, Muhammad Gelbana <m....@gmail.com>
    wrote:
    
    > ​​
    > Here it is:
    >
    > SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
    > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
    > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
    > ​​
    > `t0`.`UserID` IS NOT DISTINCT FROM
    > ​​
    > `t1`.`UserID`) LIMIT 2147483647
    >
    > I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* into
    > ​
    > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
    > IS NULL**)* while checking if the query is a cartesian join, and when the
    > check returns true, it throws an excetion saying: *This query cannot be
    > planned possibly due to either a cartesian join or an inequality join*
    >
    >
    > *---------------------*
    > *Muhammad Gelbana*
    > http://www.linkedin.com/in/mgelbana
    >
    > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com> wrote:
    >
    > > Can you please specify the query you are trying to execute?
    > >
    > >
    > > Gautam
    > >
    > > ________________________________
    > > From: Muhammad Gelbana <m....@gmail.com>
    > > Sent: Saturday, May 6, 2017 7:34:53 AM
    > > To: user@drill.apache.org; dev@drill.apache.org
    > > Subject: Running cartesian joins on Drill
    > >
    > > Is there a reason why Drill would intentionally reject cartesian join
    > > queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?
    > >
    > > Any ideas how could a query be rewritten to overcome this restriction ?
    > >
    > > *---------------------*
    > > *Muhammad Gelbana*
    > > http://www.linkedin.com/in/mgelbana
    > >
    >
    


Re: Running cartesian joins on Drill

Posted by "clhubert@gmail.com" <cl...@gmail.com>.
As you've observed, we have to manually enable nested loops (which
cartesian joins require).

alter session set `planner.enable_nljoin_for_scalar_only` = false;

I am guessing this approach was taken because

1.  Although ANSI compliant SQL.  These type of joins are not typical in
most applications.
2.  To avoid the possibility of the cost based optimizer selecting a nested
loop algorithm (expensive join algorithm) by mistake.  If this happens a
lot, it would decrease overall performance.
3.  Simple enough workaround (manually enabling/disabling) until the logic
is developed to automatically enable nested loops when needed.

If anyone knows for "sure", please chime in.

OLAP Engines like Modrian have already been reported to generate queries
that need Cartesian joins.
https://issues.apache.org/jira/browse/DRILL-2915

Hopefully this helps...

For my use case, it's no big deal to enable it manually.    If performance
ever becomes an issue, I can toggle it off and on as needed.

I asked a similar question..   If you check back in the archives for this
list
(12/2016)  you will see the details.




On Sat, May 6, 2017 at 3:05 PM, Muhammad Gelbana <m....@gmail.com>
wrote:

> ​​
> Here it is:
>
> SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
> LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
> ​​
> `t0`.`UserID` IS NOT DISTINCT FROM
> ​​
> `t1`.`UserID`) LIMIT 2147483647
>
> I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* into
> ​
> *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
> IS NULL**)* while checking if the query is a cartesian join, and when the
> check returns true, it throws an excetion saying: *This query cannot be
> planned possibly due to either a cartesian join or an inequality join*
>
>
> *---------------------*
> *Muhammad Gelbana*
> http://www.linkedin.com/in/mgelbana
>
> On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com> wrote:
>
> > Can you please specify the query you are trying to execute?
> >
> >
> > Gautam
> >
> > ________________________________
> > From: Muhammad Gelbana <m....@gmail.com>
> > Sent: Saturday, May 6, 2017 7:34:53 AM
> > To: user@drill.apache.org; dev@drill.apache.org
> > Subject: Running cartesian joins on Drill
> >
> > Is there a reason why Drill would intentionally reject cartesian join
> > queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?
> >
> > Any ideas how could a query be rewritten to overcome this restriction ?
> >
> > *---------------------*
> > *Muhammad Gelbana*
> > http://www.linkedin.com/in/mgelbana
> >
>

Re: Running cartesian joins on Drill

Posted by Muhammad Gelbana <m....@gmail.com>.
​​
Here it is:

SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
​​
`t0`.`UserID` IS NOT DISTINCT FROM
​​
`t1`.`UserID`) LIMIT 2147483647

I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* into
​
*`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
IS NULL**)* while checking if the query is a cartesian join, and when the
check returns true, it throws an excetion saying: *This query cannot be
planned possibly due to either a cartesian join or an inequality join*


*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com> wrote:

> Can you please specify the query you are trying to execute?
>
>
> Gautam
>
> ________________________________
> From: Muhammad Gelbana <m....@gmail.com>
> Sent: Saturday, May 6, 2017 7:34:53 AM
> To: user@drill.apache.org; dev@drill.apache.org
> Subject: Running cartesian joins on Drill
>
> Is there a reason why Drill would intentionally reject cartesian join
> queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?
>
> Any ideas how could a query be rewritten to overcome this restriction ?
>
> *---------------------*
> *Muhammad Gelbana*
> http://www.linkedin.com/in/mgelbana
>

Re: Running cartesian joins on Drill

Posted by Muhammad Gelbana <m....@gmail.com>.
​​
Here it is:

SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
​​
`t0`.`UserID` IS NOT DISTINCT FROM
​​
`t1`.`UserID`) LIMIT 2147483647

I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* into
​
*`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
IS NULL**)* while checking if the query is a cartesian join, and when the
check returns true, it throws an excetion saying: *This query cannot be
planned possibly due to either a cartesian join or an inequality join*


*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gp...@mapr.com> wrote:

> Can you please specify the query you are trying to execute?
>
>
> Gautam
>
> ________________________________
> From: Muhammad Gelbana <m....@gmail.com>
> Sent: Saturday, May 6, 2017 7:34:53 AM
> To: user@drill.apache.org; dev@drill.apache.org
> Subject: Running cartesian joins on Drill
>
> Is there a reason why Drill would intentionally reject cartesian join
> queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?
>
> Any ideas how could a query be rewritten to overcome this restriction ?
>
> *---------------------*
> *Muhammad Gelbana*
> http://www.linkedin.com/in/mgelbana
>

Re: Running cartesian joins on Drill

Posted by Gautam Parai <gp...@mapr.com>.
Can you please specify the query you are trying to execute?


Gautam

________________________________
From: Muhammad Gelbana <m....@gmail.com>
Sent: Saturday, May 6, 2017 7:34:53 AM
To: user@drill.apache.org; dev@drill.apache.org
Subject: Running cartesian joins on Drill

Is there a reason why Drill would intentionally reject cartesian join
queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?

Any ideas how could a query be rewritten to overcome this restriction ?

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

Re: Running cartesian joins on Drill

Posted by Gautam Parai <gp...@mapr.com>.
Can you please specify the query you are trying to execute?


Gautam

________________________________
From: Muhammad Gelbana <m....@gmail.com>
Sent: Saturday, May 6, 2017 7:34:53 AM
To: user@drill.apache.org; dev@drill.apache.org
Subject: Running cartesian joins on Drill

Is there a reason why Drill would intentionally reject cartesian join
queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?

Any ideas how could a query be rewritten to overcome this restriction ?

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana