Posted to user@beam.apache.org by Steve Niemitz <sn...@apache.org> on 2022/01/13 18:29:24 UTC

Some questions about external tables in BeamSQL

I've been playing around with CREATE EXTERNAL TABLE (using a custom
TableProvider as well) w/ BeamSQL and really love it.  I have a few
questions, though, that I've accumulated as I've been using it and wanted
to ask.

- I'm a little confused about the need to define columns in the CREATE
EXTERNAL TABLE statement.  If I have a BeamSqlTable implementation that can
provide the schema on its own, it seems like the columns supplied to the
CREATE statement are ignored.  This is ideal anyway, since it's infeasible
for users to provide the entire schema up-front, especially for more
complicated sources.  Should the column list be optional here instead?

- It seems like predicate pushdown only works if the schema is "flat" (has
no nested rows).  I understand the complication in pushing down more
complicated nested predicates; however, assuming the table implementation
doesn't actually attempt to push them down, it seems like it would be fine
to allow?

- As a follow-up on the above, I'd like to expose a "virtual" field in my
schema that represents the partition the data came from.  For example,
BigQuery has a similar concept called _PARTITIONTIME.  This would be picked
up by the predicate pushdown and used to filter the partitions being read.
I can't really figure out how I'd construct something similar here, even if
pushdown worked in all cases.  For example, for this query:

SELECT * FROM table
WHERE _PARTITIONTIME BETWEEN X AND Y

I'd want that filter to be pushed down to my IO, but also the
_PARTITIONTIME column wouldn't be returned in the select list.  I was
hoping to use BigQueryIO as an example of how to do this, but it doesn't
seem like it exposes the virtual _PARTITIONTIME column either.
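
For reference, here's roughly how I'm running these queries today. This is
just a sketch: MyTableProvider stands in for my custom provider (sketched
further down in this thread), and the concrete timestamps are example
values in place of X and Y.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class PartitionQueryExample {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // "my_table" resolves through the custom provider. Ideally the BETWEEN
    // predicate would be pushed down into the IO (pruning partitions) and
    // _PARTITIONTIME would not appear in the output schema.
    PCollection<Row> rows =
        pipeline.apply(
            SqlTransform.query(
                    "SELECT * FROM my_table WHERE _PARTITIONTIME BETWEEN "
                        + "TIMESTAMP '2022-01-01 00:00:00' AND "
                        + "TIMESTAMP '2022-01-13 00:00:00'")
                .withTableProvider("my_provider", new MyTableProvider()));

    pipeline.run().waitUntilFinish();
  }
}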

Re: Some questions about external tables in BeamSQL

Posted by Steve Niemitz <sn...@apache.org>.
Thanks for the quick responses! Mine are inline as well.

On Thu, Jan 13, 2022 at 9:01 PM Brian Hulette <bh...@google.com> wrote:

> I added some responses inline. Also adding dev@ since this is getting
> into SQL internals.
>
> On Thu, Jan 13, 2022 at 10:29 AM Steve Niemitz <sn...@apache.org>
> wrote:
>
>> I've been playing around with CREATE EXTERNAL TABLE (using a custom
>> TableProvider as well) w/ BeamSQL and really love it.  I have a few
>> questions, though, that I've accumulated as I've been using it and wanted
>> to ask.
>>
>> - I'm a little confused about the need to define columns in the CREATE
>> EXTERNAL TABLE statement.  If I have a BeamSqlTable implementation that can
>> provide the schema on its own, it seems like the columns supplied to the
>> CREATE statement are ignored.  This is ideal anyway, since it's infeasible
>> for users to provide the entire schema up-front, especially for more
>> complicated sources.  Should the column list be optional here instead?
>>
>
> Our documentation would seem to indicate that defining columns is optional
> - looking at the example for BigQuery here [1], the schema is not provided.
> Those docs must be aspirational, though; I just checked, and the
> BigQueryTableProvider definitely expects the schema to be defined and uses
> it [2].
>
> I think it would make sense to make the column list optional - that way we
> can actually fulfill our BigQuery documentation.
>

Big +1 to that.


> Note if you're building your own custom TableProvider, you might not need
> to use CREATE EXTERNAL TABLE. You could add an implementation for
> TableProvider.getTable that retrieves the metadata for a given table name
> and returns a Table instance that can build the necessary IOs. This is only
> possible if you can retrieve all the metadata you need to construct the
> IOs though. If you want users to be able to configure it further (one
> example might be specifying the read mode for BigQuery), this won't work.
>
> [1]
> https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/#bigquery
> [2]
> https://github.com/apache/beam/blob/872455570ae7f3e2e35360bccf93b503ae9fdb5c/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java#L82
>

Maybe I'm looking at the wrong thing?  Both those examples show a column
list, and the BNF (or whatever) syntax implies at least one "table element"
must be present.

But yeah, this is basically what I'm doing right now.  I just return the
"real" schema in BeamSqlTable.getSchema and ignore whatever was passed in.
It seems to work correctly.  Ideally the column list would be optional
here, as you alluded to above.  It'll be clunky explaining to users
something like "just include any random column list, we'll ignore it".


> - It seems like predicate pushdown only works if the schema is "flat" (has
>> no nested rows).  I understand the complication in pushing down more
>> complicated nested predicates; however, assuming the table implementation
>> doesn't actually attempt to push them down, it seems like it would be fine
>> to allow?
>>
>
> Do we have this limitation? I think predicate pushdown will work
> with predicates on nested fields. The table is presented with a list of
> RexNodes representing separable predicates; an individual predicate could
> add a filter on a nested column IIUC.
>
> We may have the limitation that project pushdown won't work on nested rows
> though, since the API just takes a list of field names. It's possible we
> handle this by passing a joined name (e.g. foo.bar.baz), but I bet not. The
> design doc [3] does have a note saying "no nested tables for now".
>
> [3]
> https://docs.google.com/document/d/1-ysD7U7qF3MAmSfkbXZO_5PLJBevAL9bktlLCerd_jE/edit
>
>

BeamIOPushDownRule short-circuits on nested fields [1]; I can also verify
this just by the fact that my constructFilter method isn't called when my
schema contains a nested row.

[1]
https://github.com/apache/beam/blob/v2.35.0/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamIOPushDownRule.java#L89


>
>> - As a follow-up on the above, I'd like to expose a "virtual" field in my
>> schema that represents the partition the data came from.  For example,
>> BigQuery has a similar concept called _PARTITIONTIME.  This would be picked
>> up by the predicate pushdown and used to filter the partitions being read.
>> I can't really figure out how I'd construct something similar here, even if
>> pushdown worked in all cases.  For example, for this query:
>>
>> SELECT * FROM table
>> WHERE _PARTITIONTIME BETWEEN X AND Y
>>
>> I'd want that filter to be pushed down to my IO, but also the
>> _PARTITIONTIME column wouldn't be returned in the select list.  I was
>> hoping to use BigQueryIO as an example of how to do this, but it doesn't
>> seem like it exposes the virtual _PARTITIONTIME column either.
>>
>
> Yeah I think this will be hard to do with our current abstractions. You
> may be able to do it if you're ok with actually populating a _PARTITIONTIME
> column though.
>

I can play around with something like that; it'd be a little weird to have
it in the result schema, but not the end of the world.

>

Re: Some questions about external tables in BeamSQL

Posted by Brian Hulette <bh...@google.com>.
I added some responses inline. Also adding dev@ since this is getting into
SQL internals.

On Thu, Jan 13, 2022 at 10:29 AM Steve Niemitz <sn...@apache.org> wrote:

> I've been playing around with CREATE EXTERNAL TABLE (using a custom
> TableProvider as well) w/ BeamSQL and really love it.  I have a few
> questions, though, that I've accumulated as I've been using it and wanted
> to ask.
>
> - I'm a little confused about the need to define columns in the CREATE
> EXTERNAL TABLE statement.  If I have a BeamSqlTable implementation that can
> provide the schema on its own, it seems like the columns supplied to the
> CREATE statement are ignored.  This is ideal anyway, since it's infeasible
> for users to provide the entire schema up-front, especially for more
> complicated sources.  Should the column list be optional here instead?
>

Our documentation would seem to indicate that defining columns is optional
- looking at the example for BigQuery here [1], the schema is not provided.
Those docs must be aspirational, though; I just checked, and the
BigQueryTableProvider definitely expects the schema to be defined and uses
it [2].

I think it would make sense to make the column list optional - that way we
can actually fulfill our BigQuery documentation.

Note if you're building your own custom TableProvider, you might not need
to use CREATE EXTERNAL TABLE. You could add an implementation for
TableProvider.getTable that retrieves the metadata for a given table name
and returns a Table instance that can build the necessary IOs. This is only
possible if you can retrieve all the metadata you need to construct the
IOs though. If you want users to be able to configure it further (one
example might be specifying the read mode for BigQuery), this won't work.

[1]
https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/#bigquery
[2]
https://github.com/apache/beam/blob/872455570ae7f3e2e35360bccf93b503ae9fdb5c/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java#L82
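
A rough sketch of that shape, where lookupTableMetadata stands in for
whatever catalog or service holds your table metadata, and MyTable is the
kind of BeamSqlTable sketched earlier in the thread:

import java.util.Collections;
import java.util.Map;
import org.apache.beam.sdk.extensions.sql.meta.BeamSqlTable;
import org.apache.beam.sdk.extensions.sql.meta.Table;
import org.apache.beam.sdk.extensions.sql.meta.provider.TableProvider;

public class MetadataBackedTableProvider implements TableProvider {
  @Override
  public String getTableType() {
    return "my-source"; // placeholder
  }

  @Override
  public Table getTable(String tableName) {
    // No CREATE EXTERNAL TABLE needed: resolve the name against your own
    // metadata store and synthesize the Table on the fly.
    return lookupTableMetadata(tableName);
  }

  @Override
  public BeamSqlTable buildBeamSqlTable(Table table) {
    return new MyTable(table.getLocation());
  }

  // CREATE/DROP EXTERNAL TABLE aren't needed when names resolve dynamically.
  @Override
  public void createTable(Table table) {
    throw new UnsupportedOperationException();
  }

  @Override
  public void dropTable(String tableName) {
    throw new UnsupportedOperationException();
  }

  @Override
  public Map<String, Table> getTables() {
    return Collections.emptyMap();
  }

  private Table lookupTableMetadata(String tableName) {
    throw new UnsupportedOperationException("placeholder metadata lookup");
  }
}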


>
> - It seems like predicate pushdown only works if the schema is "flat" (has
> no nested rows).  I understand the complication in pushing down more
> complicated nested predicates; however, assuming the table implementation
> doesn't actually attempt to push them down, it seems like it would be fine
> to allow?
>

Do we have this limitation? I think predicate pushdown will work
with predicates on nested fields. The table is presented with a list of
RexNodes representing separable predicates; an individual predicate could
add a filter on a nested column IIUC.

We may have the limitation that project pushdown won't work on nested rows
though, since the API just takes a list of field names. It's possible we
handle this by passing a joined name (e.g. foo.bar.baz), but I bet not. The
design doc [3] does have a note saying "no nested tables for now".

[3]
https://docs.google.com/document/d/1-ysD7U7qF3MAmSfkbXZO_5PLJBevAL9bktlLCerd_jE/edit
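
For reference, the hook is BeamSqlTable.constructFilter: the rule hands the
table the separable conjuncts as RexNodes, and the returned
BeamSqlTableFilter declares which ones the table will handle. A rough
sketch (note the vendored Calcite package prefix varies between Beam
versions):

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.extensions.sql.meta.BeamSqlTableFilter;
import org.apache.beam.vendor.calcite.v1_28_0.org.apache.calcite.rex.RexCall;
import org.apache.beam.vendor.calcite.v1_28_0.org.apache.calcite.rex.RexNode;
import org.apache.beam.vendor.calcite.v1_28_0.org.apache.calcite.sql.SqlKind;

/** Keeps simple comparison conjuncts; everything else is left to Beam. */
class MyTableFilter implements BeamSqlTableFilter {
  private final List<RexNode> supported = new ArrayList<>();
  private final List<RexNode> unsupported = new ArrayList<>();

  MyTableFilter(List<RexNode> predicates) {
    // Each element is one separable conjunct of the WHERE clause.
    for (RexNode node : predicates) {
      if (node instanceof RexCall && node.isA(SqlKind.COMPARISON)) {
        supported.add(node);
      } else {
        unsupported.add(node);
      }
    }
  }

  @Override
  public List<RexNode> getNotSupported() {
    return unsupported; // Beam applies these as a post-read filter
  }

  @Override
  public int numSupported() {
    return supported.size();
  }

  List<RexNode> getSupported() {
    return supported; // consumed by buildIOReader to configure the source
  }
}

The table's constructFilter would then just return new
MyTableFilter(filter), and its buildIOReader overload can translate the
supported nodes into source-level filters.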


>
> - As a follow-up on the above, I'd like to expose a "virtual" field in my
> schema that represents the partition the data came from.  For example,
> BigQuery has a similar concept called _PARTITIONTIME.  This would be picked
> up by the predicate pushdown and used to filter the partitions being read.
> I can't really figure out how I'd construct something similar here, even if
> pushdown worked in all cases.  For example, for this query:
>
> SELECT * FROM table
> WHERE _PARTITIONTIME BETWEEN X AND Y
>
> I'd want that filter to be pushed down to my IO, but also the
> _PARTITIONTIME column wouldn't be returned in the select list.  I was
> hoping to use BigQueryIO as an example of how to do this, but it doesn't
> seem like it exposes the virtual _PARTITIONTIME column either.
>

Yeah I think this will be hard to do with our current abstractions. You may
be able to do it if you're ok with actually populating a _PARTITIONTIME
column though.
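
i.e. the partition field would be declared (and populated) like any other
column. A minimal sketch, with hypothetical payload fields:

import org.apache.beam.sdk.schemas.Schema;

public class PartitionedTableSchema {
  // _PARTITIONTIME is a real column here: it has to be populated on every
  // row the reader emits, but predicates on it can still be intercepted in
  // constructFilter and turned into partition pruning.
  static final Schema SCHEMA =
      Schema.builder()
          .addDateTimeField("_PARTITIONTIME")
          .addStringField("user_id") // hypothetical payload columns
          .addInt64Field("event_count")
          .build();
}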

Brian
