Posted to dev@beam.apache.org by Piotr Szuberski <pi...@polidea.com> on 2020/11/03 11:01:57 UTC

Bigtable for BeamSQL - question about the schema design

I'm going to write a Bigtable table for BeamSQL, and I have a question about the schema design: which one would be preferable?

Bigtable stores its data in a table with rows that contain a key and a 3-dimensional structure: the 1st dimension is families with names, the 2nd is columns with qualifiers, and the 3rd is cells containing a timestamp and a value.

Two design solutions come to mind:
1) Fix schema to be a generic Bigtable row:

Row(key, Array(Row(family, Array(Row(qualifier, Array(Row(value, timestamp)))))))

Then the table creation definition would always be in the form:

CREATE TABLE bigtableexample1()
TYPE 'bigtable'
LOCATION 'https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId'
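
For illustration, reading all the cells from such a generic schema would look something like this (just a sketch - the nested field names are made up here, and it assumes UNNEST works over ARRAY<ROW>):

SELECT t.key, f.family, q.qualifier, c.`value`, c.`timestamp`
FROM bigtableexample1 t,
  UNNEST(t.families) f,
  UNNEST(f.columns) q,
  UNNEST(q.cells) c;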

2) Let the user define their schema by providing the desired families and columns, in sth like:
CREATE TABLE bigtableexample2(
  key VARCHAR,
  family1 ROW<
    column1 ROW<
      cells ARRAY<ROW<
        value VARCHAR,
        timestamp BIGINT
      >>
    >,
    column2 ROW<
      cells ARRAY<ROW<
        value VARCHAR,
        timestamp BIGINT
      >>
    >
  >
)
TYPE 'bigtable'
LOCATION 'https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId'
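
A query against such a user-defined schema would then address columns directly, e.g. (a sketch; array indices here are 1-based):

SELECT t.key, t.family1.column1.cells[1].`value`
FROM bigtableexample2 t;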

For me the 1st approach is more user friendly (typing the schema from the 2nd would be troublesome) and more flexible, especially when the row's schema (families and columns) changes and a user wants to perform 'SELECT * FROM bigtableexampleX'.

WDYT? I'd welcome any feedback. Maybe there is some 3rd option that would be better?

Re: Bigtable for BeamSQL - question about the schema design

Posted by Kenneth Knowles <ke...@apache.org>.
If I recall correctly, we need to upgrade Calcite for this.

On Tue, Nov 10, 2020 at 10:24 AM Brian Hulette <bh...@google.com> wrote:

>
>
> On Tue, Nov 10, 2020 at 5:46 AM Piotr Szuberski <
> piotr.szuberski@polidea.com> wrote:
>
>> Unfortunately, according to the documentation, BeamSQL doesn't work well
>> with ARRAY<COMPLEX_TYPE>, like ARRAY<ROW<>>, which I confirmed empirically.
>>
>>
> Is there a jira for this issue?

Re: Bigtable for BeamSQL - question about the schema design

Posted by Piotr Szuberski <pi...@polidea.com>.
> > Is there a jira for this issue?

Sorry for the delay. Luckily Rui answered it better than I would.


> https://issues.apache.org/jira/browse/BEAM-10896 is the one that I am aware
> of. Though it says it aims to improve UNNEST, I think it could improve
> ARRAY<STRUCT> in general. Also, as Kenneth mentioned, it might depend on
> upgrading the vendored Calcite to at least 1.23.0.

It sounds like it's exactly what is needed!


Re: Bigtable for BeamSQL - question about the schema design

Posted by Rui Wang <ru...@google.com>.
On Tue, Nov 10, 2020 at 10:25 AM Brian Hulette <bh...@google.com> wrote:

>
>
> On Tue, Nov 10, 2020 at 5:46 AM Piotr Szuberski <
> piotr.szuberski@polidea.com> wrote:
>
>> Unfortunately, according to the documentation, BeamSQL doesn't work well
>> with ARRAY<COMPLEX_TYPE>, like ARRAY<ROW<>>, which I confirmed empirically.
>>
>>
> Is there a jira for this issue?
>

https://issues.apache.org/jira/browse/BEAM-10896 is the one that I am aware
of. Though it says it aims to improve UNNEST, I think it could improve
ARRAY<STRUCT> in general. Also, as Kenneth mentioned, it might depend on
upgrading the vendored Calcite to at least 1.23.0.


Re: Bigtable for BeamSQL - question about the schema design

Posted by Brian Hulette <bh...@google.com>.
On Tue, Nov 10, 2020 at 5:46 AM Piotr Szuberski <pi...@polidea.com>
wrote:

> Unfortunately, according to the documentation, BeamSQL doesn't work well
> with ARRAY<COMPLEX_TYPE>, like ARRAY<ROW<>>, which I confirmed empirically.
>
>
Is there a jira for this issue?


> The only way to retrieve an array's values was to get them by index, e.g.
> SELECT t.complex_array[1].row_field from some_table t;
>
> UNNEST and just selecting the whole array don't work with ARRAY<ROW>.
>
> I think that if a user wants a list of cells then it has to be truncated
> to "value" only, and just the most recent cell if it has to be a ROW with
> timestamp, labels etc.
>
> This limitation doesn't happen in the flattened rows, but another one
> takes its place - the value has to be of BINARY type and then parsed by
> the user.
>
> I'll try to make it as flexible as possible - the ARRAY<ROW> limitation
> makes it much less straightforward.
>
> Thanks for the references! As I understand it, Flink only keeps the most
> recent value of each column and ignores the timestamps and labels? I refer
> to this:
> family1 ROW<q1 INT>  -- family1.q1 is the most recent cell value of the q1
> column?
>

I can't find a reference to confirm, but that seems like the most likely
explanation. Similar to how BigQuery behaves with the onlyReadLatest option.


>
> On 2020/11/06 20:43:03, Ismaël Mejía <ie...@gmail.com> wrote:
> > Thanks for the references Rui. I think it is worth considering how
> > open-source systems do it.
> > The great thing about this is that we could 'easily' map Piotr's work
> > for Bigtable to HBase too once it is done.

>
> > On Fri, Nov 6, 2020 at 8:22 PM Rui Wang <ru...@google.com> wrote:
> > >
> > > Another two references are from how Flink and Spark use HBase via SQL:
>

Great point! I forgot that HBase is modeled after BigTable.

Re: Bigtable for BeamSQL - question about the schema design

Posted by Piotr Szuberski <pi...@polidea.com>.
Unfortunately, according to the documentation, BeamSQL doesn't work well with ARRAY<COMPLEX_TYPE>, like ARRAY<ROW<>>, which I confirmed empirically.

The only way to retrieve an array's values was to get them by index, e.g.
SELECT t.complex_array[1].row_field from some_table t;

UNNEST and just selecting the whole array don't work with ARRAY<ROW>.
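
For reference, this is roughly the query shape that fails (a sketch, reusing the bigtableexample2 schema from earlier in the thread):

SELECT t.key, c.`value`, c.`timestamp`
FROM bigtableexample2 t, UNNEST(t.family1.column1.cells) c;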

I think that if a user wants a list of cells then it has to be truncated to "value" only, and just the most recent cell if it has to be a ROW with timestamp, labels etc.

This limitation doesn't happen in the flattened rows, but another one takes its place - the value has to be of BINARY type and then parsed by the user.

I'll try to make it as flexible as possible - the ARRAY<ROW> limitation makes it much less straightforward.

Thanks for the references! As I understand it, Flink only keeps the most recent value of each column and ignores the timestamps and labels? I refer to this:
family1 ROW<q1 INT>  -- family1.q1 is the most recent cell value of the q1 column?
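
For comparison, a Flink HBase connector table is declared something like this (a sketch based on the Flink docs linked below; exact connector option names may differ):

CREATE TABLE hbase_table (
  rowkey STRING,
  family1 ROW<q1 INT>
) WITH (
  'connector' = 'hbase-1.4',
  'table-name' = 'mytable',
  'zookeeper.quorum' = 'localhost:2181'
);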

On 2020/11/06 20:43:03, Ismaël Mejía <ie...@gmail.com> wrote: 
> Thanks for the references Rui. I think it is worth considering how
> open-source systems do it.
> The great thing about this is that we could 'easily' map Piotr's work
> for Bigtable to HBase too once it is done.
> 
> On Fri, Nov 6, 2020 at 8:22 PM Rui Wang <ru...@google.com> wrote:
> >
> > Another two references are from how Flink and Spark use HBase via SQL:
> >
> > https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/hbase.html
> > https://stackoverflow.com/questions/39530938/sparksql-on-hbase-tables
> >
> > -Rui

Re: Bigtable for BeamSQL - question about the schema design

Posted by Ismaël Mejía <ie...@gmail.com>.
Thanks for the references Rui. I think it is worth considering how
open-source systems do it.
The great thing about this is that we could 'easily' map Piotr's work
for Bigtable to HBase too once it is done.

On Fri, Nov 6, 2020 at 8:22 PM Rui Wang <ru...@google.com> wrote:
>
> Another two references are from how Flink and Spark use HBase via SQL:
>
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/hbase.html
> https://stackoverflow.com/questions/39530938/sparksql-on-hbase-tables
>
> -Rui

Re: Bigtable for BeamSQL - question about the schema design

Posted by Rui Wang <ru...@google.com>.
Another two references are from how Flink and Spark use HBase via SQL:

https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/hbase.html
https://stackoverflow.com/questions/39530938/sparksql-on-hbase-tables

-Rui

On Thu, Nov 5, 2020 at 9:46 AM Piotr Szuberski <pi...@polidea.com>
wrote:

> Thanks for the resources! I'll try to follow the BQ approach. I'd also add
> something like a flattened schema so the user can use simple types only. It
> would be limited to BINARY values though. Something like:
> CREATE EXTERNAL TABLE(
>   key VARCHAR NOT NULL,
>   family VARCHAR NOT NULL,
>   column VARCHAR NOT NULL,
>   value BINARY NOT NULL,
>   timestampMicros BIGINT NOT NULL
> )
> The cells array would be flattened (denormalized) and easy to use. In the
> case of single-valued cells it would also be quite efficient.

Re: Bigtable for BeamSQL - question about the schema design

Posted by Piotr Szuberski <pi...@polidea.com>.
Thanks for the resources! I'll try to follow the BQ approach. I'd also add something like a flattened schema so the user can use simple types only. It would be limited to BINARY values though. Something like:
CREATE EXTERNAL TABLE(
  key VARCHAR NOT NULL,
  family VARCHAR NOT NULL,
  column VARCHAR NOT NULL,
  value BINARY NOT NULL,
  timestampMicros BIGINT NOT NULL
)
The cells array would be flattened (denormalized) and easy to use. In the case of single-valued cells it would also be quite efficient.
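
Reading one column's cells would then be a plain filter, e.g. (a sketch; the table name is made up, and quoting of reserved names like `value` may differ):

SELECT t.key, t.`value`, t.timestampMicros
FROM bigtableflat t
WHERE t.family = 'family1' AND t.`column` = 'q1';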

On 2020/11/05 00:22:44, Brian Hulette <bh...@google.com> wrote: 
> I think we should take a look at how BigTable is integrated with other SQL
> systems. For example we could get some inspiration from BigQuery's support
> for querying BigTable data [1]. It looks like by default it uses something
> like (1), but they recognize this is difficult to process with SQL, so they
> have an option you can set to elevate certain columns as sub-fields (more
> like (2)), and you can also indicate you only want to get the latest value
> for each column.
> 
> In any case this may be a good candidate for not requiring the user to
> actually specify a schema, and instead letting the table be fully
> determined by options.
> 
> [1] https://cloud.google.com/bigquery/external-data-bigtable

Re: Bigtable for BeamSQL - question about the schema design

Posted by Brian Hulette <bh...@google.com>.
I think we should take a look at how BigTable is integrated with other SQL
systems. For example we could get some inspiration from BigQuery's support
for querying BigTable data [1]. It looks like by default it uses something
like (1), but they recognize this is difficult to process with SQL, so they
have an option you can set to elevate certain columns as sub-fields (more
like (2)), and you can also indicate you only want to get the latest value
for each column.

In any case this may be a good candidate for not requiring the user to
actually specify a schema, and instead letting the table be fully
determined by options.

[1] https://cloud.google.com/bigquery/external-data-bigtable

On Tue, Nov 3, 2020 at 11:41 PM Piotr Szuberski <pi...@polidea.com>
wrote:

> I've dug into the topic a bit and I think the 2nd approach will fit better.
> The schema in Bigtable is not supposed to change that often, and specifying
> our own schema is more SQL-like and will cause less potential trouble.

Re: Bigtable for BeamSQL - question about the schema design

Posted by Piotr Szuberski <pi...@polidea.com>.
I've dug into the topic a bit and I think the 2nd approach will fit better. The schema in Bigtable is not supposed to change that often, and specifying our own schema is more SQL-like and will cause less potential trouble.

On 2020/11/03 11:01:57, Piotr Szuberski <pi...@polidea.com> wrote: 
> I'm going to write a Bigtable table for BeamSQL, and I have a question about the schema design: which one would be preferable?