You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by "clhubert@gmail.com" <cl...@gmail.com> on 2016/12/25 22:04:33 UTC

Cartesian Product in Apache Drill

I am trying to do a cross join to get a cartesian products.

Per the error message (attached) and the JIRA ticket I see it isn't
supported.
https://issues.apache.org/jira/browse/DRILL-3807

I wrote the query against using dfs on csv file types.

Can I execute a cross join in Apache Drill just by moving my data to a
different file type or Storage Plugin. Such as Parquet,JSON, or RDBMS
Plugin.

Regards,
CLN

Re: Cartesian Product in Apache Drill

Posted by Zelaine Fong <zf...@maprtech.com>.
I think you meant to ask how likely it is that a nested loop join plan will
be chosen if planner.enable_nljoin_for_*scalar_only *is set to FASLE, and
the query does not contain a cartesian join?

If so, I believe it's unlikely due to how the optimizer computes the cost
of nested loop join plans.  But as I noted in my original reply, I don't
know if this has been extensively tested by other Drill users.

Your suggestion of a new option *planner.enable_nljoin_for_crossjoin *should
effectively be the same as setting planner.enable_nljoin_for_*scalar_only*
to false because the way the optimizer estimates costs, it should never
consider cross join orderings unless that's the only available join
ordering option.  For those cartesian joins, the only option would be a
nested loop join.

-- Zelaine

On Tue, Dec 27, 2016 at 10:26 AM, clhubert@gmail.com <cl...@gmail.com>
wrote:

> Zelaine,
>
> Thanks for the clarity.  I can understand the performance concerns behind
> the nested loop.
>
> Can a nested loop be selected for "other" query plans when "
> planner.enable_nljoin_for_*scalar_only" *is set to true?
>
> I am trying to determine if I should turn it system wide, or ONLY turn it
> on when I need a Cartesian product result.
>
> Any ideas?  Suggestions?
>
> Outside of the Performance Concern..   There are two valid nested loop
> usecase.
> 1) Scalar subqueries
> 2) ANSI SQL Cross Joins
>
> Do you know if there are plans to incorporate this as a default setting?
>
>
> *planner.enable_nljoin_for_scalar_only *should probably be
> planner.enable_nljoin_for_*scalar *
>
> Similarly, there should be *planner.enable_nljoin_for_crossjoin* and the
> default setting should be "true"
>
> Do you agree?   Am I off base someplace?
>
> If there are no plans, and you agree with my assessment.   I will probably
> file this as a feature request
>
>
>
> Regards,
> CLN
>
>
>
>
>
>
>
>
>
>
>
> On Mon, Dec 26, 2016 at 7:58 PM, Zelaine Fong <zf...@maprtech.com> wrote:
>
>> I'm not sure how widely nested loop joins outside of scalar subqueries
>> have been exercised by Drill users, since that setting is not the default.
>> Note that nested loop joins can only be processed using broadcast joins
>> [1].  So you will incur a lot of network transfer overhead unless the
>> smaller of the tables you're joining is kept to a minimum.
>>
>> [1] https://drill.apache.org/docs/join-planning-guidelines/
>>
>> -- Zelaine
>>
>> On Mon, Dec 26, 2016 at 7:05 AM, clhubert@gmail.com <cl...@gmail.com>
>> wrote:
>>
>>> Zelaine,
>>>
>>> I appreciate it...   That worked.
>>>
>>> I am thinking of turning on this feature system wide.
>>>
>>> Is there any foreseeable issue with using nested joins outside of scalar
>>> subqueries?  Performance or otherwise?
>>>
>>> Regards,
>>> CLN
>>>
>>>
>>> On Sun, Dec 25, 2016 at 7:22 PM, Zelaine Fong <zf...@maprtech.com>
>>> wrote:
>>>
>>>> Alternatively, you can set the following configuration to false:
>>>>
>>>> alter session set `planner.enable_nljoin_for_scalar_only` = false;
>>>>
>>>> Cartesian joins need to be processed as a nested loop join, and by
>>>> default, Drill only considers nested joins in the case where at least one
>>>> side of the join is a scalar subquery.
>>>>
>>>> -- Zelaine
>>>>
>>>> On Sun, Dec 25, 2016 at 2:46 PM, Ted Dunning <te...@gmail.com>
>>>> wrote:
>>>>
>>>>> You can fake the limitation by adding a constant column to both
>>>>> tables, I
>>>>> think, and then joining on the constant.
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Dec 25, 2016 at 2:04 PM, clhubert@gmail.com <
>>>>> clhubert@gmail.com>
>>>>> wrote:
>>>>>
>>>>> >
>>>>> > I am trying to do a cross join to get a cartesian products.
>>>>> >
>>>>> > Per the error message (attached) and the JIRA ticket I see it isn't
>>>>> > supported.
>>>>> > https://issues.apache.org/jira/browse/DRILL-3807
>>>>> >
>>>>> > I wrote the query against using dfs on csv file types.
>>>>> >
>>>>> > Can I execute a cross join in Apache Drill just by moving my data to
>>>>> a
>>>>> > different file type or Storage Plugin. Such as Parquet,JSON, or RDBMS
>>>>> > Plugin.
>>>>> >
>>>>> > Regards,
>>>>> > CLN
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Cartesian Product in Apache Drill

Posted by "clhubert@gmail.com" <cl...@gmail.com>.
Zelaine,

Thanks for the clarity.  I can understand the performance concerns behind
the nested loop.

Can a nested loop be selected for "other" query plans when "
planner.enable_nljoin_for_*scalar_only" *is set to true?

I am trying to determine if I should turn it system wide, or ONLY turn it
on when I need a Cartesian product result.

Any ideas?  Suggestions?

Outside of the Performance Concern..   There are two valid nested loop
usecase.
1) Scalar subqueries
2) ANSI SQL Cross Joins

Do you know if there are plans to incorporate this as a default setting?

*planner.enable_nljoin_for_scalar_only *should probably be
planner.enable_nljoin_for_*scalar *

Similarly, there should be *planner.enable_nljoin_for_crossjoin* and the
default setting should be "true"

Do you agree?   Am I off base someplace?

If there are no plans, and you agree with my assessment.   I will probably
file this as a feature request



Regards,
CLN











On Mon, Dec 26, 2016 at 7:58 PM, Zelaine Fong <zf...@maprtech.com> wrote:

> I'm not sure how widely nested loop joins outside of scalar subqueries
> have been exercised by Drill users, since that setting is not the default.
> Note that nested loop joins can only be processed using broadcast joins
> [1].  So you will incur a lot of network transfer overhead unless the
> smaller of the tables you're joining is kept to a minimum.
>
> [1] https://drill.apache.org/docs/join-planning-guidelines/
>
> -- Zelaine
>
> On Mon, Dec 26, 2016 at 7:05 AM, clhubert@gmail.com <cl...@gmail.com>
> wrote:
>
>> Zelaine,
>>
>> I appreciate it...   That worked.
>>
>> I am thinking of turning on this feature system wide.
>>
>> Is there any foreseeable issue with using nested joins outside of scalar
>> subqueries?  Performance or otherwise?
>>
>> Regards,
>> CLN
>>
>>
>> On Sun, Dec 25, 2016 at 7:22 PM, Zelaine Fong <zf...@maprtech.com> wrote:
>>
>>> Alternatively, you can set the following configuration to false:
>>>
>>> alter session set `planner.enable_nljoin_for_scalar_only` = false;
>>>
>>> Cartesian joins need to be processed as a nested loop join, and by
>>> default, Drill only considers nested joins in the case where at least one
>>> side of the join is a scalar subquery.
>>>
>>> -- Zelaine
>>>
>>> On Sun, Dec 25, 2016 at 2:46 PM, Ted Dunning <te...@gmail.com>
>>> wrote:
>>>
>>>> You can fake the limitation by adding a constant column to both tables,
>>>> I
>>>> think, and then joining on the constant.
>>>>
>>>>
>>>>
>>>> On Sun, Dec 25, 2016 at 2:04 PM, clhubert@gmail.com <clhubert@gmail.com
>>>> >
>>>> wrote:
>>>>
>>>> >
>>>> > I am trying to do a cross join to get a cartesian products.
>>>> >
>>>> > Per the error message (attached) and the JIRA ticket I see it isn't
>>>> > supported.
>>>> > https://issues.apache.org/jira/browse/DRILL-3807
>>>> >
>>>> > I wrote the query against using dfs on csv file types.
>>>> >
>>>> > Can I execute a cross join in Apache Drill just by moving my data to a
>>>> > different file type or Storage Plugin. Such as Parquet,JSON, or RDBMS
>>>> > Plugin.
>>>> >
>>>> > Regards,
>>>> > CLN
>>>> >
>>>> >
>>>>
>>>
>>>
>>
>

Re: Cartesian Product in Apache Drill

Posted by Zelaine Fong <zf...@maprtech.com>.
I'm not sure how widely nested loop joins outside of scalar subqueries have
been exercised by Drill users, since that setting is not the default.  Note
that nested loop joins can only be processed using broadcast joins [1].  So
you will incur a lot of network transfer overhead unless the smaller of the
tables you're joining is kept to a minimum.

[1] https://drill.apache.org/docs/join-planning-guidelines/

-- Zelaine

On Mon, Dec 26, 2016 at 7:05 AM, clhubert@gmail.com <cl...@gmail.com>
wrote:

> Zelaine,
>
> I appreciate it...   That worked.
>
> I am thinking of turning on this feature system wide.
>
> Is there any foreseeable issue with using nested joins outside of scalar
> subqueries?  Performance or otherwise?
>
> Regards,
> CLN
>
>
> On Sun, Dec 25, 2016 at 7:22 PM, Zelaine Fong <zf...@maprtech.com> wrote:
>
>> Alternatively, you can set the following configuration to false:
>>
>> alter session set `planner.enable_nljoin_for_scalar_only` = false;
>>
>> Cartesian joins need to be processed as a nested loop join, and by
>> default, Drill only considers nested joins in the case where at least one
>> side of the join is a scalar subquery.
>>
>> -- Zelaine
>>
>> On Sun, Dec 25, 2016 at 2:46 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>>> You can fake the limitation by adding a constant column to both tables, I
>>> think, and then joining on the constant.
>>>
>>>
>>>
>>> On Sun, Dec 25, 2016 at 2:04 PM, clhubert@gmail.com <cl...@gmail.com>
>>> wrote:
>>>
>>> >
>>> > I am trying to do a cross join to get a cartesian products.
>>> >
>>> > Per the error message (attached) and the JIRA ticket I see it isn't
>>> > supported.
>>> > https://issues.apache.org/jira/browse/DRILL-3807
>>> >
>>> > I wrote the query against using dfs on csv file types.
>>> >
>>> > Can I execute a cross join in Apache Drill just by moving my data to a
>>> > different file type or Storage Plugin. Such as Parquet,JSON, or RDBMS
>>> > Plugin.
>>> >
>>> > Regards,
>>> > CLN
>>> >
>>> >
>>>
>>
>>
>

Re: Cartesian Product in Apache Drill

Posted by "clhubert@gmail.com" <cl...@gmail.com>.
Zelaine,

I appreciate it...   That worked.

I am thinking of turning on this feature system wide.

Is there any foreseeable issue with using nested joins outside of scalar
subqueries?  Performance or otherwise?

Regards,
CLN


On Sun, Dec 25, 2016 at 7:22 PM, Zelaine Fong <zf...@maprtech.com> wrote:

> Alternatively, you can set the following configuration to false:
>
> alter session set `planner.enable_nljoin_for_scalar_only` = false;
>
> Cartesian joins need to be processed as a nested loop join, and by
> default, Drill only considers nested joins in the case where at least one
> side of the join is a scalar subquery.
>
> -- Zelaine
>
> On Sun, Dec 25, 2016 at 2:46 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
>> You can fake the limitation by adding a constant column to both tables, I
>> think, and then joining on the constant.
>>
>>
>>
>> On Sun, Dec 25, 2016 at 2:04 PM, clhubert@gmail.com <cl...@gmail.com>
>> wrote:
>>
>> >
>> > I am trying to do a cross join to get a cartesian products.
>> >
>> > Per the error message (attached) and the JIRA ticket I see it isn't
>> > supported.
>> > https://issues.apache.org/jira/browse/DRILL-3807
>> >
>> > I wrote the query against using dfs on csv file types.
>> >
>> > Can I execute a cross join in Apache Drill just by moving my data to a
>> > different file type or Storage Plugin. Such as Parquet,JSON, or RDBMS
>> > Plugin.
>> >
>> > Regards,
>> > CLN
>> >
>> >
>>
>
>

Re: Cartesian Product in Apache Drill

Posted by Zelaine Fong <zf...@maprtech.com>.
Alternatively, you can set the following configuration to false:

alter session set `planner.enable_nljoin_for_scalar_only` = false;

Cartesian joins need to be processed as a nested loop join, and by default,
Drill only considers nested joins in the case where at least one side of
the join is a scalar subquery.

-- Zelaine

On Sun, Dec 25, 2016 at 2:46 PM, Ted Dunning <te...@gmail.com> wrote:

> You can fake the limitation by adding a constant column to both tables, I
> think, and then joining on the constant.
>
>
>
> On Sun, Dec 25, 2016 at 2:04 PM, clhubert@gmail.com <cl...@gmail.com>
> wrote:
>
> >
> > I am trying to do a cross join to get a cartesian products.
> >
> > Per the error message (attached) and the JIRA ticket I see it isn't
> > supported.
> > https://issues.apache.org/jira/browse/DRILL-3807
> >
> > I wrote the query against using dfs on csv file types.
> >
> > Can I execute a cross join in Apache Drill just by moving my data to a
> > different file type or Storage Plugin. Such as Parquet,JSON, or RDBMS
> > Plugin.
> >
> > Regards,
> > CLN
> >
> >
>

Re: Cartesian Product in Apache Drill

Posted by Ted Dunning <te...@gmail.com>.
You can fake the limitation by adding a constant column to both tables, I
think, and then joining on the constant.



On Sun, Dec 25, 2016 at 2:04 PM, clhubert@gmail.com <cl...@gmail.com>
wrote:

>
> I am trying to do a cross join to get a cartesian products.
>
> Per the error message (attached) and the JIRA ticket I see it isn't
> supported.
> https://issues.apache.org/jira/browse/DRILL-3807
>
> I wrote the query against using dfs on csv file types.
>
> Can I execute a cross join in Apache Drill just by moving my data to a
> different file type or Storage Plugin. Such as Parquet,JSON, or RDBMS
> Plugin.
>
> Regards,
> CLN
>
>