You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Kai Jiang <ji...@gmail.com> on 2018/05/15 14:40:41 UTC

[SQL] Cross Join Operation

Hi everyone,

To prove the idea of GSoC project, I was working on some simple TPC-DS
queries running with given generated data on direct runner. query example
<https://gist.github.com/vectorijk/7c54f90aeebfd6fd9e9d2ee224bfed50>

The example is executed with TPC-DS query 9
<https://gist.github.com/vectorijk/7c54f90aeebfd6fd9e9d2ee224bfed50#file-tpcdssql-java-L176-L222>.
Briefly, Query 9 uses case when clauses to select 5 counting numbers
from store_sales
(table 1). In order to show those result numbers, case when clause inside
one select clause. In short, it looks like:
SELECT

CASE WHEN ( SELECT count(*)  FROM  table 1 WHERE..... )
THEN condition 1
ELSE condition 2,
.....
CASE WHEN .....

FROM table 2

IIUC, this query doesn't need join operation on table 1 and table 2 since
outside select clause doesn't need to interfere with table 1.
But, the program shows it does and throws errors message said
"java.lang.UnsupportedOperationException: CROSS JOIN is not supported". (error
message detail
<https://gist.github.com/vectorijk/5619a20485edc01113a56e348f87b0c3>)

To make the query work, I am wondering where I can start with:
1. see logic plan?
Will logic plan explain why the query need CROSS JOIN?

2. cross join support?
I checked all queries in TPC-DS benchmark. Almost every query uses cross
join. It is an important feature needs to implement. Unlike other join, it
consumes a lot of computing resource. But, I think we need cross join in
the future. and support both in join-library? I noticed James has open
BEAM-2194 <https://issues.apache.org/jira/browse/BEAM-2194> for supporting
cross join.

Looking forward to comments!

Best,
Kai

ᐧ

Re: [SQL] Cross Join Operation

Posted by Kenneth Knowles <kl...@google.com>.
That seems very useful. For anything that is an equi-join and we can encode
the things being compared, we should do an inner join that compiles to
CoGroupByKey. We will probably need some costs to make this work right.
There might be some needed changes to the join library to make this happen
too, or to implement it directly in the Beam SQL codebase.

Kenn

On Mon, May 21, 2018 at 9:02 AM Kai Jiang <ji...@gmail.com> wrote:

> Andrew, Kenn, thanks for your comments!
>
> This query is potential using COMMA (Cross) join. Since most of TPC-DS
> queries use comma join, I took a look at their logic plan. Calcite parses
> them to inner join and join condition is true. Outside inner join is around
> with filter condition which is parsing from WHERE clause.
>
> I think a way we could do is how to push filter (WHERE clause's condition)
> into join condition. I also took a look at Flink sql code. After parsing
> query, Flink optimizes its logic plan which is get rid of filter and move
> its condition to join condition. It seems Flink uses calcite optimize rule.
>
> Should we move forward to see how it apply to Beam SQL?
>
> Best,
> Kai
>
> ᐧ
>
> On Wed, May 16, 2018 at 2:46 PM Kenneth Knowles <kl...@google.com> wrote:
>
>> It sounds like a good way to proceed would be to replace the template
>> predicates with something that induces a real join.
>>
>> Kenn
>>
>> On Tue, May 15, 2018 at 9:33 AM Andrew Pilloud <ap...@google.com>
>> wrote:
>>
>>> Calcite does not have the concept of a "CROSS JOIN". It shows up in the
>>> plan as a LogicalJoin with condition=[true]. We could try rejecting the
>>> cross join at the planning stage by returning null for them
>>> in BeamJoinRule.convert(), which might result in a different plan. But
>>> looking at your query, you have a cross join unless the where clause on the
>>> inner select contains a row from the outer select.
>>>
>>> Andrew
>>>
>>> On Tue, May 15, 2018 at 9:15 AM Kenneth Knowles <kl...@google.com> wrote:
>>>
>>>> The logical plan should show you where the cross join is needed. Here
>>>> is where it is logged:
>>>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamQueryPlanner.java#L150
>>>>
>>>> (It should probably be put to DEBUG level)
>>>>
>>>> If I look at the original template, like
>>>> https://github.com/gregrahn/tpcds-kit/blob/master/query_templates/query9.tpl
>>>> I see conditions "[RC.1]". Are those templates expected to be filled with
>>>> references to the `reason` table, perhaps? How does that change things?
>>>>
>>>> I still think it would be good to support CROSS JOIN if we can - the
>>>> problem of course is huge data size, but when one side is small it would be
>>>> good for it to work simply.
>>>>
>>>> Kenn
>>>>
>>>> On Tue, May 15, 2018 at 7:41 AM Kai Jiang <ji...@gmail.com> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> To prove the idea of GSoC project, I was working on some simple TPC-DS
>>>>> queries running with given generated data on direct runner. query
>>>>> example
>>>>> <https://gist.github.com/vectorijk/7c54f90aeebfd6fd9e9d2ee224bfed50>
>>>>>
>>>>> The example is executed with TPC-DS query 9
>>>>> <https://gist.github.com/vectorijk/7c54f90aeebfd6fd9e9d2ee224bfed50#file-tpcdssql-java-L176-L222>.
>>>>> Briefly, Query 9 uses case when clauses to select 5 counting numbers
>>>>> from store_sales (table 1). In order to show those result numbers, case
>>>>> when clause inside one select clause. In short, it looks like:
>>>>> SELECT
>>>>>
>>>>> CASE WHEN ( SELECT count(*)  FROM  table 1 WHERE..... )
>>>>> THEN condition 1
>>>>> ELSE condition 2,
>>>>> .....
>>>>> CASE WHEN .....
>>>>>
>>>>> FROM table 2
>>>>>
>>>>> IIUC, this query doesn't need join operation on table 1 and table 2 since
>>>>> outside select clause doesn't need to interfere with table 1.
>>>>> But, the program shows it does and throws errors message said
>>>>> "java.lang.UnsupportedOperationException: CROSS JOIN is not supported". (error
>>>>> message detail
>>>>> <https://gist.github.com/vectorijk/5619a20485edc01113a56e348f87b0c3>)
>>>>>
>>>>> To make the query work, I am wondering where I can start with:
>>>>> 1. see logic plan?
>>>>> Will logic plan explain why the query need CROSS JOIN?
>>>>>
>>>>> 2. cross join support?
>>>>> I checked all queries in TPC-DS benchmark. Almost every query uses
>>>>> cross join. It is an important feature needs to implement. Unlike other
>>>>> join, it consumes a lot of computing resource. But, I think we need cross
>>>>> join in the future. and support both in join-library? I noticed
>>>>> James has open BEAM-2194
>>>>> <https://issues.apache.org/jira/browse/BEAM-2194> for supporting
>>>>> cross join.
>>>>>
>>>>> Looking forward to comments!
>>>>>
>>>>> Best,
>>>>> Kai
>>>>>
>>>>> ᐧ
>>>>>
>>>>

Re: [SQL] Cross Join Operation

Posted by Kai Jiang <ji...@gmail.com>.
Andrew, Kenn, thanks for your comments!

This query is potential using COMMA (Cross) join. Since most of TPC-DS
queries use comma join, I took a look at their logic plan. Calcite parses
them to inner join and join condition is true. Outside inner join is around
with filter condition which is parsing from WHERE clause.

I think a way we could do is how to push filter (WHERE clause's condition)
into join condition. I also took a look at Flink sql code. After parsing
query, Flink optimizes its logic plan which is get rid of filter and move
its condition to join condition. It seems Flink uses calcite optimize rule.

Should we move forward to see how it apply to Beam SQL?

Best,
Kai

ᐧ

On Wed, May 16, 2018 at 2:46 PM Kenneth Knowles <kl...@google.com> wrote:

> It sounds like a good way to proceed would be to replace the template
> predicates with something that induces a real join.
>
> Kenn
>
> On Tue, May 15, 2018 at 9:33 AM Andrew Pilloud <ap...@google.com>
> wrote:
>
>> Calcite does not have the concept of a "CROSS JOIN". It shows up in the
>> plan as a LogicalJoin with condition=[true]. We could try rejecting the
>> cross join at the planning stage by returning null for them
>> in BeamJoinRule.convert(), which might result in a different plan. But
>> looking at your query, you have a cross join unless the where clause on the
>> inner select contains a row from the outer select.
>>
>> Andrew
>>
>> On Tue, May 15, 2018 at 9:15 AM Kenneth Knowles <kl...@google.com> wrote:
>>
>>> The logical plan should show you where the cross join is needed. Here is
>>> where it is logged:
>>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamQueryPlanner.java#L150
>>>
>>> (It should probably be put to DEBUG level)
>>>
>>> If I look at the original template, like
>>> https://github.com/gregrahn/tpcds-kit/blob/master/query_templates/query9.tpl
>>> I see conditions "[RC.1]". Are those templates expected to be filled with
>>> references to the `reason` table, perhaps? How does that change things?
>>>
>>> I still think it would be good to support CROSS JOIN if we can - the
>>> problem of course is huge data size, but when one side is small it would be
>>> good for it to work simply.
>>>
>>> Kenn
>>>
>>> On Tue, May 15, 2018 at 7:41 AM Kai Jiang <ji...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> To prove the idea of GSoC project, I was working on some simple TPC-DS
>>>> queries running with given generated data on direct runner. query
>>>> example
>>>> <https://gist.github.com/vectorijk/7c54f90aeebfd6fd9e9d2ee224bfed50>
>>>>
>>>> The example is executed with TPC-DS query 9
>>>> <https://gist.github.com/vectorijk/7c54f90aeebfd6fd9e9d2ee224bfed50#file-tpcdssql-java-L176-L222>.
>>>> Briefly, Query 9 uses case when clauses to select 5 counting numbers
>>>> from store_sales (table 1). In order to show those result numbers, case
>>>> when clause inside one select clause. In short, it looks like:
>>>> SELECT
>>>>
>>>> CASE WHEN ( SELECT count(*)  FROM  table 1 WHERE..... )
>>>> THEN condition 1
>>>> ELSE condition 2,
>>>> .....
>>>> CASE WHEN .....
>>>>
>>>> FROM table 2
>>>>
>>>> IIUC, this query doesn't need join operation on table 1 and table 2 since
>>>> outside select clause doesn't need to interfere with table 1.
>>>> But, the program shows it does and throws errors message said
>>>> "java.lang.UnsupportedOperationException: CROSS JOIN is not supported". (error
>>>> message detail
>>>> <https://gist.github.com/vectorijk/5619a20485edc01113a56e348f87b0c3>)
>>>>
>>>> To make the query work, I am wondering where I can start with:
>>>> 1. see logic plan?
>>>> Will logic plan explain why the query need CROSS JOIN?
>>>>
>>>> 2. cross join support?
>>>> I checked all queries in TPC-DS benchmark. Almost every query uses
>>>> cross join. It is an important feature needs to implement. Unlike other
>>>> join, it consumes a lot of computing resource. But, I think we need cross
>>>> join in the future. and support both in join-library? I noticed
>>>> James has open BEAM-2194
>>>> <https://issues.apache.org/jira/browse/BEAM-2194> for supporting cross
>>>> join.
>>>>
>>>> Looking forward to comments!
>>>>
>>>> Best,
>>>> Kai
>>>>
>>>> ᐧ
>>>>
>>>

Re: [SQL] Cross Join Operation

Posted by Kenneth Knowles <kl...@google.com>.
It sounds like a good way to proceed would be to replace the template
predicates with something that induces a real join.

Kenn

On Tue, May 15, 2018 at 9:33 AM Andrew Pilloud <ap...@google.com> wrote:

> Calcite does not have the concept of a "CROSS JOIN". It shows up in the
> plan as a LogicalJoin with condition=[true]. We could try rejecting the
> cross join at the planning stage by returning null for them
> in BeamJoinRule.convert(), which might result in a different plan. But
> looking at your query, you have a cross join unless the where clause on the
> inner select contains a row from the outer select.
>
> Andrew
>
> On Tue, May 15, 2018 at 9:15 AM Kenneth Knowles <kl...@google.com> wrote:
>
>> The logical plan should show you where the cross join is needed. Here is
>> where it is logged:
>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamQueryPlanner.java#L150
>>
>> (It should probably be put to DEBUG level)
>>
>> If I look at the original template, like
>> https://github.com/gregrahn/tpcds-kit/blob/master/query_templates/query9.tpl
>> I see conditions "[RC.1]". Are those templates expected to be filled with
>> references to the `reason` table, perhaps? How does that change things?
>>
>> I still think it would be good to support CROSS JOIN if we can - the
>> problem of course is huge data size, but when one side is small it would be
>> good for it to work simply.
>>
>> Kenn
>>
>> On Tue, May 15, 2018 at 7:41 AM Kai Jiang <ji...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> To prove the idea of GSoC project, I was working on some simple TPC-DS
>>> queries running with given generated data on direct runner. query
>>> example
>>> <https://gist.github.com/vectorijk/7c54f90aeebfd6fd9e9d2ee224bfed50>
>>>
>>> The example is executed with TPC-DS query 9
>>> <https://gist.github.com/vectorijk/7c54f90aeebfd6fd9e9d2ee224bfed50#file-tpcdssql-java-L176-L222>.
>>> Briefly, Query 9 uses case when clauses to select 5 counting numbers
>>> from store_sales (table 1). In order to show those result numbers, case
>>> when clause inside one select clause. In short, it looks like:
>>> SELECT
>>>
>>> CASE WHEN ( SELECT count(*)  FROM  table 1 WHERE..... )
>>> THEN condition 1
>>> ELSE condition 2,
>>> .....
>>> CASE WHEN .....
>>>
>>> FROM table 2
>>>
>>> IIUC, this query doesn't need join operation on table 1 and table 2 since
>>> outside select clause doesn't need to interfere with table 1.
>>> But, the program shows it does and throws errors message said
>>> "java.lang.UnsupportedOperationException: CROSS JOIN is not supported". (error
>>> message detail
>>> <https://gist.github.com/vectorijk/5619a20485edc01113a56e348f87b0c3>)
>>>
>>> To make the query work, I am wondering where I can start with:
>>> 1. see logic plan?
>>> Will logic plan explain why the query need CROSS JOIN?
>>>
>>> 2. cross join support?
>>> I checked all queries in TPC-DS benchmark. Almost every query uses cross
>>> join. It is an important feature needs to implement. Unlike other join, it
>>> consumes a lot of computing resource. But, I think we need cross join in
>>> the future. and support both in join-library? I noticed James has open
>>> BEAM-2194 <https://issues.apache.org/jira/browse/BEAM-2194> for
>>> supporting cross join.
>>>
>>> Looking forward to comments!
>>>
>>> Best,
>>> Kai
>>>
>>> ᐧ
>>>
>>

Re: [SQL] Cross Join Operation

Posted by Andrew Pilloud <ap...@google.com>.
Calcite does not have the concept of a "CROSS JOIN". It shows up in the
plan as a LogicalJoin with condition=[true]. We could try rejecting the
cross join at the planning stage by returning null for them
in BeamJoinRule.convert(), which might result in a different plan. But
looking at your query, you have a cross join unless the where clause on the
inner select contains a row from the outer select.

Andrew

On Tue, May 15, 2018 at 9:15 AM Kenneth Knowles <kl...@google.com> wrote:

> The logical plan should show you where the cross join is needed. Here is
> where it is logged:
> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamQueryPlanner.java#L150
>
> (It should probably be put to DEBUG level)
>
> If I look at the original template, like
> https://github.com/gregrahn/tpcds-kit/blob/master/query_templates/query9.tpl
> I see conditions "[RC.1]". Are those templates expected to be filled with
> references to the `reason` table, perhaps? How does that change things?
>
> I still think it would be good to support CROSS JOIN if we can - the
> problem of course is huge data size, but when one side is small it would be
> good for it to work simply.
>
> Kenn
>
> On Tue, May 15, 2018 at 7:41 AM Kai Jiang <ji...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> To prove the idea of GSoC project, I was working on some simple TPC-DS
>> queries running with given generated data on direct runner. query example
>> <https://gist.github.com/vectorijk/7c54f90aeebfd6fd9e9d2ee224bfed50>
>>
>> The example is executed with TPC-DS query 9
>> <https://gist.github.com/vectorijk/7c54f90aeebfd6fd9e9d2ee224bfed50#file-tpcdssql-java-L176-L222>.
>> Briefly, Query 9 uses case when clauses to select 5 counting numbers
>> from store_sales (table 1). In order to show those result numbers, case
>> when clause inside one select clause. In short, it looks like:
>> SELECT
>>
>> CASE WHEN ( SELECT count(*)  FROM  table 1 WHERE..... )
>> THEN condition 1
>> ELSE condition 2,
>> .....
>> CASE WHEN .....
>>
>> FROM table 2
>>
>> IIUC, this query doesn't need join operation on table 1 and table 2 since
>> outside select clause doesn't need to interfere with table 1.
>> But, the program shows it does and throws errors message said
>> "java.lang.UnsupportedOperationException: CROSS JOIN is not supported". (error
>> message detail
>> <https://gist.github.com/vectorijk/5619a20485edc01113a56e348f87b0c3>)
>>
>> To make the query work, I am wondering where I can start with:
>> 1. see logic plan?
>> Will logic plan explain why the query need CROSS JOIN?
>>
>> 2. cross join support?
>> I checked all queries in TPC-DS benchmark. Almost every query uses cross
>> join. It is an important feature needs to implement. Unlike other join, it
>> consumes a lot of computing resource. But, I think we need cross join in
>> the future. and support both in join-library? I noticed James has open
>> BEAM-2194 <https://issues.apache.org/jira/browse/BEAM-2194> for
>> supporting cross join.
>>
>> Looking forward to comments!
>>
>> Best,
>> Kai
>>
>> ᐧ
>>
>

Re: [SQL] Cross Join Operation

Posted by Kenneth Knowles <kl...@google.com>.
The logical plan should show you where the cross join is needed. Here is
where it is logged:
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamQueryPlanner.java#L150

(It should probably be put to DEBUG level)

If I look at the original template, like
https://github.com/gregrahn/tpcds-kit/blob/master/query_templates/query9.tpl
I see conditions "[RC.1]". Are those templates expected to be filled with
references to the `reason` table, perhaps? How does that change things?

I still think it would be good to support CROSS JOIN if we can - the
problem of course is huge data size, but when one side is small it would be
good for it to work simply.

Kenn

On Tue, May 15, 2018 at 7:41 AM Kai Jiang <ji...@gmail.com> wrote:

> Hi everyone,
>
> To prove the idea of GSoC project, I was working on some simple TPC-DS
> queries running with given generated data on direct runner. query example
> <https://gist.github.com/vectorijk/7c54f90aeebfd6fd9e9d2ee224bfed50>
>
> The example is executed with TPC-DS query 9
> <https://gist.github.com/vectorijk/7c54f90aeebfd6fd9e9d2ee224bfed50#file-tpcdssql-java-L176-L222>.
> Briefly, Query 9 uses case when clauses to select 5 counting numbers from store_sales
> (table 1). In order to show those result numbers, case when clause inside
> one select clause. In short, it looks like:
> SELECT
>
> CASE WHEN ( SELECT count(*)  FROM  table 1 WHERE..... )
> THEN condition 1
> ELSE condition 2,
> .....
> CASE WHEN .....
>
> FROM table 2
>
> IIUC, this query doesn't need join operation on table 1 and table 2 since
> outside select clause doesn't need to interfere with table 1.
> But, the program shows it does and throws errors message said
> "java.lang.UnsupportedOperationException: CROSS JOIN is not supported". (error
> message detail
> <https://gist.github.com/vectorijk/5619a20485edc01113a56e348f87b0c3>)
>
> To make the query work, I am wondering where I can start with:
> 1. see logic plan?
> Will logic plan explain why the query need CROSS JOIN?
>
> 2. cross join support?
> I checked all queries in TPC-DS benchmark. Almost every query uses cross
> join. It is an important feature needs to implement. Unlike other join, it
> consumes a lot of computing resource. But, I think we need cross join in
> the future. and support both in join-library? I noticed James has open
> BEAM-2194 <https://issues.apache.org/jira/browse/BEAM-2194> for
> supporting cross join.
>
> Looking forward to comments!
>
> Best,
> Kai
>
> ᐧ
>