Posted to dev@flink.apache.org by Alexander Shoshin <Al...@epam.com> on 2016/11/10 10:26:39 UTC

[FLINK-4541] Support for SQL NOT IN operator

Hi,

I am working on the FLINK-4541 issue and these are my current changes: https://github.com/apache/flink/compare/master...AlexanderShoshin:FLINK-4541.

I found that NOT IN does not work with nested queries because of a missing DataSet planner rule for a cross join. After adding a DataSetCrossJoinRule, several tests from org.apache.flink.api.scala.batch.ExplainTest (testJoinWithExtended and testJoinWithoutExtended) that check execution plans started to fail, because the VolcanoPlanner began to build new execution plans for them using DataSetCrossJoin. That's why I increased the DataSetCrossJoin cost (in the computeSelfCost(...) function) to avoid its usage where possible. But this does not seem to be a good idea.
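
To make the case concrete, here is roughly the kind of query I mean, as a minimal sketch against the Table API packages on current master (org.apache.flink.api.table / org.apache.flink.api.scala.table); the table and field names are made up:

    import org.apache.flink.api.scala._
    import org.apache.flink.api.scala.table._
    import org.apache.flink.api.table.TableEnvironment

    object NotInSketch {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment
        val tEnv = TableEnvironment.getTableEnvironment(env)

        // Two small illustrative tables.
        tEnv.registerDataSet("orders",
          env.fromElements((1, 10), (2, 20), (3, 30)), 'id, 'amount)
        tEnv.registerDataSet("blocked",
          env.fromElements((2, "x"), (3, "y")), 'blockedId, 'note)

        // NOT IN with a nested query: this is the case that currently fails,
        // because the plan Calcite produces contains a cross join for which
        // no DataSet translation rule exists.
        val result = tEnv.sql(
          "SELECT id, amount FROM orders " +
          "WHERE id NOT IN (SELECT blockedId FROM blocked)")

        // The ExplainTest-style checks look at this plan string rather than
        // at the query result.
        println(tEnv.explain(result))
      }
    }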

Do you have any ideas on how to solve the problem with the testJoinWithExtended and testJoinWithoutExtended tests in another way? Is it correct that these tests check an execution plan instead of a query result?

Regards,
Alexander


RE: [FLINK-4541] Support for SQL NOT IN operator

Posted by Alexander Shoshin <Al...@epam.com>.
Yes, I would like to continue my work on this issue. It is already assigned to me. I will follow the approach that you suggested.

Thanks,
Alexander

Re: [FLINK-4541] Support for SQL NOT IN operator

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Alexander,

I had a look at the logical plan generated by Calcite for NOT IN again and
noticed that the CROSS JOIN is only used to join a single record (the
result of a global aggregation) to all records of the other table.
IMO, this is a special case that we can easily support and efficiently
implement.

For that, I would implement a translation rule that converts a Cross Join
(join(true)) into a DataSetSingleRowCross only if one input of the join is
a global aggregation, i.e., a LogicalAggregate without a groupSet / groupSets.
The DataSetSingleRowCross would be implemented as a MapFunction with a
BroadcastSet input (for the single-record input) that internally calls a
code-generated JoinFunction.
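
Roughly, the runtime pattern I have in mind looks like the following
hand-written sketch with the plain DataSet API (the real operator would call
a code-generated join function instead of the hard-coded mapping, and the
names are of course made up):

    import org.apache.flink.api.common.functions.RichMapFunction
    import org.apache.flink.api.scala._
    import org.apache.flink.configuration.Configuration

    object SingleRowCrossSketch {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // The "large" input of the cross join.
        val left = env.fromElements((1, "a"), (2, "b"), (3, "c"))
        // Result of a global aggregation: exactly one record.
        val singleRow = left.map(_._1).reduce(_ max _)

        // The single-row side is broadcast to every task; each record of the
        // large side is combined with it in a MapFunction.
        val crossed = left
          .map(new RichMapFunction[(Int, String), (Int, String, Int)] {
            private var broadcastRow: Int = _

            override def open(parameters: Configuration): Unit = {
              broadcastRow = getRuntimeContext
                .getBroadcastVariable[Int]("singleRow")
                .get(0)
            }

            override def map(value: (Int, String)): (Int, String, Int) =
              (value._1, value._2, broadcastRow)
          })
          .withBroadcastSet(singleRow, "singleRow")

        crossed.print()
      }
    }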

If you want to follow that approach, please add a summary of our discussion
to the JIRA issue and assign it to yourself (if that doesn't work, please
let me know and I'll give you contributor permissions).

Thanks,
Fabian

RE: [FLINK-4541] Support for SQL NOT IN operator

Posted by Alexander Shoshin <Al...@epam.com>.
Hi Fabian,

Should we close this issue then? Or I could just leave a comment explaining why we can't support NOT IN at the moment, so that no one else does the same research again. Perhaps the Calcite team will change the logical plan for NOT IN and we can come back to this issue.

Regards,
Alexander

Re: [FLINK-4541] Support for SQL NOT IN operator

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Alexander,

Thanks for looking into this issue!

We did not support CROSS JOIN on purpose because the general case is very
expensive to compute. Also, as you noticed, we would have to make sure that
inner joins are preferred over cross joins (if possible). Cost-based
optimizers (such as Calcite's Volcano Planner) use cost estimates for that.
So in principle, the approach of assigning high costs was correct. In the
case of Flink, we have to be careful though, because we do not have good
estimates for the size of our inputs (actually we do not have any at the
moment...). So even tweaked cost functions might bite us.
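
Just to make the mechanics concrete, here is a toy illustration in plain
Scala of the comparison the planner makes (this is not Calcite's actual
RelOptCost / computeSelfCost(...) interface, only the idea): the cross join
loses reliably only if its cost grows with the product of the estimated
input sizes, and with bad estimates the comparison can flip.

    // Toy cost model, only to illustrate the plan comparison.
    final case class PlanCost(rows: Double, cpu: Double) {
      def lessThan(other: PlanCost): Boolean =
        rows < other.rows || (rows == other.rows && cpu < other.cpu)
    }

    object CostSketch {
      // Cross join: pays for the full Cartesian product.
      def crossJoinCost(leftRows: Double, rightRows: Double): PlanCost =
        PlanCost(leftRows * rightRows, leftRows * rightRows)

      // Equi-join (e.g. a hash join): roughly linear in the input sizes.
      def equiJoinCost(leftRows: Double, rightRows: Double): PlanCost =
        PlanCost(leftRows + rightRows, leftRows + rightRows)

      def main(args: Array[String]): Unit = {
        // With reasonable estimates the equi-join wins clearly ...
        println(equiJoinCost(1e6, 1e6).lessThan(crossJoinCost(1e6, 1e6)))  // true
        // ... but if the estimates degenerate (say, to a single row per input)
        // the cross join even looks cheaper, so high costs alone are fragile.
        println(crossJoinCost(1.0, 1.0).lessThan(equiJoinCost(1.0, 1.0)))  // true
      }
    }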

I would actually prefer not to add support for NOT IN (except for the
simple case of a list of a few literals) at this point.
Adding a CrossJoin translation rule would also add support for arbitrary
cross joins, which I would like to avoid.
With support for cross joins it is possible to accidentally write queries
that run for days and produce vast amounts of data, filling up all disks.

I also think there are other issues that are more important to address
and would have a bigger impact than support for NOT IN.

Best, Fabian
