You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Anton Kedin <ke...@google.com> on 2018/05/04 18:22:55 UTC

Complex Types Support for Beam SQL DDL

Hi,

I am working on adding support for non-primitive types in Beam SQL DDL.

*Goal*
Allow users to define tables with Rows, Arrays, Maps as field types in DDL.
This enables defining schemas for complex sources, e.g. describing JSON
sources or other sources which support complex field types (BQ, etc).

*Solution*
Extend the parser we have in Beam SQLto accept the following DDL statement:
"CREATE TABLE tableName (field_name <COMPLEX_FIELD_TYPE>)" where
"<COMPLEX_FIELD_TYPE>" can be any the following:

   - "primitiveType ARRAY", for example, "field_int_arr" INTEGER ARRAY".
   Thoughts:
   - this is how SQL standard defines ARRAY field declaration;
      - existing parser supports similar syntax for collections;
      - hard to read for nested collections;
      - similar syntax is supported in Postgres
      <https://www.postgresql.org/docs/9.1/static/arrays.html>;
   - "ARRAY<type>", for example "field_matrix ARRAY<ARRAY<INTEGER>>".
   Thoughts:
   - easy to read and support arbitrary nesting;
      - similar syntax is implemented in:
         - BigQuery
         <https://cloud.google.com/bigquery/docs/data-definition-language#column_name_and_column_schema>
         ;
         - Spanner
         <https://cloud.google.com/spanner/docs/data-definition-language#arrays>
         ;
         - KSQL
         <https://docs.confluent.io/current/ksql/docs/syntax-reference.html#create-table>
         ;
         - Spark/Hive
         <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable>
         ;
      - "MAP<primitiveType, type>", for example "MAP<VARCHAR, MAP<INTEGER,
   VARCHAR>>". Thoughts:
   - there doesn't seem to be a SQL standard support for maps;
      - looks similar to the "ARRAY<type>" definition;
      - similar syntax is implemented in:
         - KSQL
         <https://docs.confluent.io/current/ksql/docs/syntax-reference.html#create-table>
         ;
         - Spark/Hive
         <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable>
         ;
      - "ROW(fieldList)", for example "row_field ROW(f_int INTEGER, f_str
   VARCHAR)". Thoughts:
   - SQL standard defines the syntax this way;
      - don't know where similar syntax is implemented;
   - "ROW<fieldList>", for example "row_field ROW<f_int INTEGER, f_str
   VARCHAR>". Thoughts:
   - ROW is not supported in a lot of dialects, but STRUCT is similar and
      supported in few dialects;
      - similar syntax for STRUCT is implemented in:
         - BigQuery
         <https://cloud.google.com/bigquery/docs/data-definition-language>;
         - Spark/Hive
         <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable>
         ;

Questions/comments?
Pull Request <https://github.com/apache/beam/pull/5276>

Thank you,
Anton

Re: Complex Types Support for Beam SQL DDL

Posted by Kenneth Knowles <kl...@google.com>.
That is a really nice comparison. Thanks! It makes sense for Beam to take
more responsibility for the specifics of DDL parsing so we can implement
the support we need. I wonder if - once we have something working well - we
could push it back into Calcite or some subproject. I know their position
on DDL is to avoid the issue, but it seems there's some new interest in
looking at ways to support a collection of dialects.

Kenn

On Fri, May 4, 2018 at 11:23 AM Anton Kedin <ke...@google.com> wrote:

> Hi,
>
> I am working on adding support for non-primitive types in Beam SQL DDL.
>
> *Goal*
> Allow users to define tables with Rows, Arrays, Maps as field types in
> DDL. This enables defining schemas for complex sources, e.g. describing
> JSON sources or other sources which support complex field types (BQ, etc).
>
> *Solution*
> Extend the parser we have in Beam SQLto accept the following DDL statement:
> "CREATE TABLE tableName (field_name <COMPLEX_FIELD_TYPE>)" where
> "<COMPLEX_FIELD_TYPE>" can be any the following:
>
>    - "primitiveType ARRAY", for example, "field_int_arr" INTEGER ARRAY".
>    Thoughts:
>    - this is how SQL standard defines ARRAY field declaration;
>       - existing parser supports similar syntax for collections;
>       - hard to read for nested collections;
>       - similar syntax is supported in Postgres
>       <https://www.postgresql.org/docs/9.1/static/arrays.html>;
>    - "ARRAY<type>", for example "field_matrix ARRAY<ARRAY<INTEGER>>".
>    Thoughts:
>    - easy to read and support arbitrary nesting;
>       - similar syntax is implemented in:
>          - BigQuery
>          <https://cloud.google.com/bigquery/docs/data-definition-language#column_name_and_column_schema>
>          ;
>          - Spanner
>          <https://cloud.google.com/spanner/docs/data-definition-language#arrays>
>          ;
>          - KSQL
>          <https://docs.confluent.io/current/ksql/docs/syntax-reference.html#create-table>
>          ;
>          - Spark/Hive
>          <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable>
>          ;
>       - "MAP<primitiveType, type>", for example "MAP<VARCHAR,
>    MAP<INTEGER, VARCHAR>>". Thoughts:
>    - there doesn't seem to be a SQL standard support for maps;
>       - looks similar to the "ARRAY<type>" definition;
>       - similar syntax is implemented in:
>          - KSQL
>          <https://docs.confluent.io/current/ksql/docs/syntax-reference.html#create-table>
>          ;
>          - Spark/Hive
>          <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable>
>          ;
>       - "ROW(fieldList)", for example "row_field ROW(f_int INTEGER, f_str
>    VARCHAR)". Thoughts:
>    - SQL standard defines the syntax this way;
>       - don't know where similar syntax is implemented;
>    - "ROW<fieldList>", for example "row_field ROW<f_int INTEGER, f_str
>    VARCHAR>". Thoughts:
>    - ROW is not supported in a lot of dialects, but STRUCT is similar and
>       supported in few dialects;
>       - similar syntax for STRUCT is implemented in:
>          - BigQuery
>          <https://cloud.google.com/bigquery/docs/data-definition-language>
>          ;
>          - Spark/Hive
>          <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable>
>          ;
>
> Questions/comments?
> Pull Request <https://github.com/apache/beam/pull/5276>
>
> Thank you,
> Anton
>
>
>