You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Kirill Kozlov <ki...@google.com> on 2019/11/26 20:19:10 UTC

Update on push-down for SQL IOs.

Hello everyone!

I have been working on a push-down feature and would like to give a brief
update on what is done and is still under works.

*Things that are done*:
General API for SQL IOs to provide information about what filters/projects
they support [1]:
- *Filter* can be unsupported, supported with field reordering, and
supported without field reordering.
- *Predicate* is broken down into a conjunctive normal form (CNF) and
passed to a validator class to check what parts are supported or
unsupported by an IO.

A Calcite rule [2] that checks for push-down support, constructs a new IO
source Rel [3] with pushed-down projects and filters when applicable, and
preserves unsupported filters/projects.

BigQuery should perform push-down when running queries in DIRECT_READ
method [4].

MongoDB project push-down support is in a PR [5] and predicate support will
be added soon.

*Things that are in progress:*
Documenting how developers can enable push-down for IOs that support it.

Documenting certain limitation for BigQuery push-down (ex: comparing values
of 2 columns is not supported at the moment, so it is being preserved in a
Calc).

Updating google-cloud-bigquerystorage to 0.117.0-beta. Earlier versions
have a gRPC message limit set to ~11MB, which may cause some pipelies to
break when reading from a table with rows larger than the limit.

Adding some sort of performance tests to run continuously to
measure speed-up and detect regressions.

Deciding how cost should be computed for the IO source Rel with push-down
[6]. Right now the following formula is used: cost of an IO without
push-down minus the normalized (between 0.0 and 1.0) benefit of a performed
push-down.
The challenge here is to make the change to the cost small enough to not
break join reordering, but large enough to make the optimizer favor
pushed-down IO.

If you have any suggestions/questions/concerns I would love to hear them.

[1]
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/BeamSqlTable.java#L36
[2]
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamIOPushDownRule.java
[3]
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamPushDownIOSourceRel.java
[4]
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java#L128
[5] https://github.com/apache/beam/pull/10095
[6] https://github.com/apache/beam/pull/10060

--
Kirill

Re: Update on push-down for SQL IOs.

Posted by Kirill Kozlov <ki...@google.com>.

>
> ParquetIO, CassandraIO/HBaseIO/BigTableIO (all should be about the same),
> JdbcIO, IcebergIO (doesn't exist yet, but is basically generalized
> schema-aware files as I understand it).

I think that adding Jiras with a tag "starter" for implementing push-down
for all of the IO interfaces listed above would be a good start. The design
doc does have an example for project push-down; predicate push-down example
is in the works.
Hopefully, that will make it straight forward for new contributors.

On Thu, Nov 28, 2019 at 4:32 AM David Morávek <da...@gmail.com>
wrote:

> Nice, this should bring a great performance improvement for SQL. Thanks
> for your work!
>
> On Thu, Nov 28, 2019 at 6:33 AM Kenneth Knowles <ke...@apache.org> wrote:
>
>> Nice! Thanks for the very thorough summary. I think this will be a really
>> good thing for Beam. Most of the IO sources are very highly optimized for
>> querying and will do it more efficiently than the Beam runner when the
>> structure of the query matches. I'm really excited to see the performance
>> measurements.
>>
>> A have a thought: your update did not mention a few extensions that we
>> might consider: ParquetIO, CassandraIO/HBaseIO/BigTableIO (all should be
>> about the same), JdbcIO, IcebergIO (doesn't exist yet, but is basically
>> generalized schema-aware files as I understand it). Are these things you
>> are thinking about doing, or would these be Jiras that could potentially be
>> tagged "starter"? They seem complex but maybe your framework will make it
>> feasible for someone with slightly less experience to implement new
>> versions of what you have already finished?
>>
>> Kenn
>>
>> On Tue, Nov 26, 2019 at 12:19 PM Kirill Kozlov <ki...@google.com>
>> wrote:
>>
>>> Hello everyone!
>>>
>>> I have been working on a push-down feature and would like to give a
>>> brief update on what is done and is still under works.
>>>
>>> *Things that are done*:
>>> General API for SQL IOs to provide information about what
>>> filters/projects they support [1]:
>>> - *Filter* can be unsupported, supported with field reordering, and
>>> supported without field reordering.
>>> - *Predicate* is broken down into a conjunctive normal form (CNF) and
>>> passed to a validator class to check what parts are supported or
>>> unsupported by an IO.
>>>
>>> A Calcite rule [2] that checks for push-down support, constructs a new
>>> IO source Rel [3] with pushed-down projects and filters when applicable,
>>> and preserves unsupported filters/projects.
>>>
>>> BigQuery should perform push-down when running queries in DIRECT_READ
>>> method [4].
>>>
>>> MongoDB project push-down support is in a PR [5] and predicate support
>>> will be added soon.
>>>
>>>
>>> *Things that are in progress:*
>>> Documenting how developers can enable push-down for IOs that support it.
>>>
>>> Documenting certain limitation for BigQuery push-down (ex: comparing
>>> values of 2 columns is not supported at the moment, so it is being
>>> preserved in a Calc).
>>>
>>> Updating google-cloud-bigquerystorage to 0.117.0-beta. Earlier versions
>>> have a gRPC message limit set to ~11MB, which may cause some pipelies to
>>> break when reading from a table with rows larger than the limit.
>>>
>>> Adding some sort of performance tests to run continuously to
>>> measure speed-up and detect regressions.
>>>
>>> Deciding how cost should be computed for the IO source Rel with
>>> push-down [6]. Right now the following formula is used: cost of an IO
>>> without push-down minus the normalized (between 0.0 and 1.0) benefit of a
>>> performed push-down.
>>> The challenge here is to make the change to the cost small enough to not
>>> break join reordering, but large enough to make the optimizer favor
>>> pushed-down IO.
>>>
>>>
>>> If you have any suggestions/questions/concerns I would love to hear them.
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/BeamSqlTable.java#L36
>>> [2]
>>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamIOPushDownRule.java
>>> [3]
>>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamPushDownIOSourceRel.java
>>> [4]
>>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java#L128
>>> [5] https://github.com/apache/beam/pull/10095
>>> [6] https://github.com/apache/beam/pull/10060
>>>
>>> --
>>> Kirill
>>>
>>

Re: Update on push-down for SQL IOs.

Posted by David Morávek <da...@gmail.com>.

Nice, this should bring a great performance improvement for SQL. Thanks for
your work!

On Thu, Nov 28, 2019 at 6:33 AM Kenneth Knowles <ke...@apache.org> wrote:

> Nice! Thanks for the very thorough summary. I think this will be a really
> good thing for Beam. Most of the IO sources are very highly optimized for
> querying and will do it more efficiently than the Beam runner when the
> structure of the query matches. I'm really excited to see the performance
> measurements.
>
> A have a thought: your update did not mention a few extensions that we
> might consider: ParquetIO, CassandraIO/HBaseIO/BigTableIO (all should be
> about the same), JdbcIO, IcebergIO (doesn't exist yet, but is basically
> generalized schema-aware files as I understand it). Are these things you
> are thinking about doing, or would these be Jiras that could potentially be
> tagged "starter"? They seem complex but maybe your framework will make it
> feasible for someone with slightly less experience to implement new
> versions of what you have already finished?
>
> Kenn
>
> On Tue, Nov 26, 2019 at 12:19 PM Kirill Kozlov <ki...@google.com>
> wrote:
>
>> Hello everyone!
>>
>> I have been working on a push-down feature and would like to give a brief
>> update on what is done and is still under works.
>>
>> *Things that are done*:
>> General API for SQL IOs to provide information about what
>> filters/projects they support [1]:
>> - *Filter* can be unsupported, supported with field reordering, and
>> supported without field reordering.
>> - *Predicate* is broken down into a conjunctive normal form (CNF) and
>> passed to a validator class to check what parts are supported or
>> unsupported by an IO.
>>
>> A Calcite rule [2] that checks for push-down support, constructs a new IO
>> source Rel [3] with pushed-down projects and filters when applicable, and
>> preserves unsupported filters/projects.
>>
>> BigQuery should perform push-down when running queries in DIRECT_READ
>> method [4].
>>
>> MongoDB project push-down support is in a PR [5] and predicate support
>> will be added soon.
>>
>>
>> *Things that are in progress:*
>> Documenting how developers can enable push-down for IOs that support it.
>>
>> Documenting certain limitation for BigQuery push-down (ex: comparing
>> values of 2 columns is not supported at the moment, so it is being
>> preserved in a Calc).
>>
>> Updating google-cloud-bigquerystorage to 0.117.0-beta. Earlier versions
>> have a gRPC message limit set to ~11MB, which may cause some pipelies to
>> break when reading from a table with rows larger than the limit.
>>
>> Adding some sort of performance tests to run continuously to
>> measure speed-up and detect regressions.
>>
>> Deciding how cost should be computed for the IO source Rel with push-down
>> [6]. Right now the following formula is used: cost of an IO without
>> push-down minus the normalized (between 0.0 and 1.0) benefit of a performed
>> push-down.
>> The challenge here is to make the change to the cost small enough to not
>> break join reordering, but large enough to make the optimizer favor
>> pushed-down IO.
>>
>>
>> If you have any suggestions/questions/concerns I would love to hear them.
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/BeamSqlTable.java#L36
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamIOPushDownRule.java
>> [3]
>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamPushDownIOSourceRel.java
>> [4]
>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java#L128
>> [5] https://github.com/apache/beam/pull/10095
>> [6] https://github.com/apache/beam/pull/10060
>>
>> --
>> Kirill
>>
>

Re: Update on push-down for SQL IOs.

Posted by Kenneth Knowles <ke...@apache.org>.

Nice! Thanks for the very thorough summary. I think this will be a really
good thing for Beam. Most of the IO sources are very highly optimized for
querying and will do it more efficiently than the Beam runner when the
structure of the query matches. I'm really excited to see the performance
measurements.

A have a thought: your update did not mention a few extensions that we
might consider: ParquetIO, CassandraIO/HBaseIO/BigTableIO (all should be
about the same), JdbcIO, IcebergIO (doesn't exist yet, but is basically
generalized schema-aware files as I understand it). Are these things you
are thinking about doing, or would these be Jiras that could potentially be
tagged "starter"? They seem complex but maybe your framework will make it
feasible for someone with slightly less experience to implement new
versions of what you have already finished?

Kenn

On Tue, Nov 26, 2019 at 12:19 PM Kirill Kozlov <ki...@google.com>
wrote:

> Hello everyone!
>
> I have been working on a push-down feature and would like to give a brief
> update on what is done and is still under works.
>
> *Things that are done*:
> General API for SQL IOs to provide information about what filters/projects
> they support [1]:
> - *Filter* can be unsupported, supported with field reordering, and
> supported without field reordering.
> - *Predicate* is broken down into a conjunctive normal form (CNF) and
> passed to a validator class to check what parts are supported or
> unsupported by an IO.
>
> A Calcite rule [2] that checks for push-down support, constructs a new IO
> source Rel [3] with pushed-down projects and filters when applicable, and
> preserves unsupported filters/projects.
>
> BigQuery should perform push-down when running queries in DIRECT_READ
> method [4].
>
> MongoDB project push-down support is in a PR [5] and predicate support
> will be added soon.
>
>
> *Things that are in progress:*
> Documenting how developers can enable push-down for IOs that support it.
>
> Documenting certain limitation for BigQuery push-down (ex: comparing
> values of 2 columns is not supported at the moment, so it is being
> preserved in a Calc).
>
> Updating google-cloud-bigquerystorage to 0.117.0-beta. Earlier versions
> have a gRPC message limit set to ~11MB, which may cause some pipelies to
> break when reading from a table with rows larger than the limit.
>
> Adding some sort of performance tests to run continuously to
> measure speed-up and detect regressions.
>
> Deciding how cost should be computed for the IO source Rel with push-down
> [6]. Right now the following formula is used: cost of an IO without
> push-down minus the normalized (between 0.0 and 1.0) benefit of a performed
> push-down.
> The challenge here is to make the change to the cost small enough to not
> break join reordering, but large enough to make the optimizer favor
> pushed-down IO.
>
>
> If you have any suggestions/questions/concerns I would love to hear them.
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/BeamSqlTable.java#L36
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamIOPushDownRule.java
> [3]
> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamPushDownIOSourceRel.java
> [4]
> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java#L128
> [5] https://github.com/apache/beam/pull/10095
> [6] https://github.com/apache/beam/pull/10060
>
> --
> Kirill
>