Posted to user@beam.apache.org by "Thomas Fredriksen(External)" <th...@cognite.com> on 2021/03/17 13:06:56 UTC

JdbcIO SQL best practice

Hello everyone,

I was wondering what is considered best-practice when writing SQL
statements for the JdbcIO connector?

Hand-writing the statements and subsequent preparedStatementSetter causes a
lot of bloat and is not very manageable.
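
For illustration, this is roughly the pattern we end up repeating for every
table (a simplified sketch; the table, columns and bean here are made up):

```
PCollection<MyBean> beans = ...;
beans.apply("Write to database", JdbcIO.<MyBean>write()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "org.postgresql.Driver", "jdbc:postgresql://host:5432/mydb"))
        // The statement has to be kept in sync with the table by hand...
        .withStatement("INSERT INTO my_table (id, name, value) VALUES (?, ?, ?)")
        // ...and so does the setter, one line per column.
        .withPreparedStatementSetter((element, statement) -> {
            statement.setLong(1, element.getId());
            statement.setString(2, element.getName());
            statement.setDouble(3, element.getValue());
        }));
```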

Thank you.

Best Regards
Thomas Li Fredriksen

Re: JdbcIO SQL best practice

Posted by Alexey Romanenko <ar...@gmail.com>.
I don’t think so because this statement [1] is used in this case. 
 
[1] https://github.com/apache/beam/blob/97af0775cc19a4997a4b60c6a75d003f8e86cf1f/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcUtil.java#L56
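
(For context: the referenced code builds a plain parameterized INSERT, roughly of the shape sketched below, so there is no ON CONFLICT / upsert clause. This is a paraphrase, not the exact code in [1].)

```
// Rough paraphrase of the kind of statement generated for schema-based
// writes; the table and column names come from the configured table and
// the Row schema.
static String generateInsertStatement(String table, List<String> columns) {
    String placeholders =
            columns.stream().map(c -> "?").collect(Collectors.joining(", "));
    return String.format("INSERT INTO %s(%s) VALUES(%s)",
            table, String.join(", ", columns), placeholders);
}
// generateInsertStatement("my_table", Arrays.asList("id", "name", "value"))
//   -> "INSERT INTO my_table(id, name, value) VALUES(?, ?, ?)"
```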

> On 14 Apr 2021, at 14:44, Thomas Fredriksen(External) <th...@cognite.com> wrote:
> 
> This seems very promising,
> 
> Will the write from PCollection<Row> handle upserts?
> 
> On Wed, Mar 24, 2021 at 6:56 PM Alexey Romanenko <aromanenko.dev@gmail.com <ma...@gmail.com>> wrote:
> Thanks for the details.
> 
> If I’m not mistaken, JdbcIO already supports both your suggestions for read and write (at least, in some way) [1][2].
> 
> Some examples from tests:
> - write from PCollection<Row> [3], 
> - read to PCollection<Row> [4], 
> - write from PCollection<POJO> with JavaBeanSchema [5] 
> 
> Is this what you are looking for?
> 
> [1] https://issues.apache.org/jira/browse/BEAM-6674 <https://issues.apache.org/jira/browse/BEAM-6674>
> [2] https://github.com/apache/beam/pull/8725 <https://github.com/apache/beam/pull/8725>
> [3] https://github.com/apache/beam/blob/ab1dfa13a983d41669e70e83b11f58a83015004c/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOTest.java#L469 <https://github.com/apache/beam/blob/ab1dfa13a983d41669e70e83b11f58a83015004c/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOTest.java#L469>
> [4] https://github.com/apache/beam/blob/ab1dfa13a983d41669e70e83b11f58a83015004c/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOTest.java#L524 <https://github.com/apache/beam/blob/ab1dfa13a983d41669e70e83b11f58a83015004c/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOTest.java#L524>
> [5] https://github.com/apache/beam/blob/ab1dfa13a983d41669e70e83b11f58a83015004c/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOTest.java#L469 <https://github.com/apache/beam/blob/ab1dfa13a983d41669e70e83b11f58a83015004c/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOTest.java#L469>
> 
> 
>> On 23 Mar 2021, at 08:03, Thomas Fredriksen(External) <thomas.fredriksen@cognite.com <ma...@cognite.com>> wrote:
>> 
>> That is a very good question.
>> 
>> Personally, I would prefer that read and write were simplified. I guess there will always be a need for writing complex queries, but the vast majority of pipelines will only need to read or write data to or from a table. As such, having read/write functions that will take an input-class (BEAN or POJO for example) and simply generate the required write-statement would be sufficient. Upserts should also be a part of this.
>> 
>> For example:
>> 
>> ```
>> PCollection<MyBean> collection = ...;
>> collection.apply("Write to database", JdbcIO.writeTable(MyBean.class)
>>         .withDataSourceConfiguration(mySourceConfiguration)
>>         .withTableName(myTableName)
>>         .withUpsertOption(UpsertOption.create()
>>                 .withConflictTarget(keyColumn)
>>                 .withDoUpdate());
>> ```
>> This would of course assume that the columns of `myTableName` would match the members of `MyBean`.
>> 
>> There are of course technical challenges with this:
>> * How to handle situations where the column names do not match the input-type
>> * How to detect columns from the input-type.
>> 
>> As an alternative, schemas may be an option:
>> 
>> ```
>> PCollection<Row> collection = ...;
>> collection.apply("Write to database", JdbcIO.writeRows()
>>         .withSchema(mySchema)
>>         .withDataSourceConfiguration(mySourceConfiguration)
>>         .withTableName(myTableName)
>>         .withUpsertOption(UpsertOption.create()
>>                 .withConflictTarget(keyColumn)
>>                 .withDoUpdate());
>> ```
>> This would allow for greater flexibility, but we lose the strong typing of the first suggestion.
>> 
>> I hope this helps.
>> 
>> Best Regards
>> Thomas Li Fredriksen
>> 
>> On Fri, Mar 19, 2021 at 7:17 PM Alexey Romanenko <aromanenko.dev@gmail.com <ma...@gmail.com>> wrote:
>> Hmm, interesting question. Since we don’t have any answers yet, may I ask you a question - do you have an example of what these practices could look like, or how it could be simplified?
>> 
>> 
>> PS: Not sure whether it helps, but JdbcIO allows setting the query via a “ValueProvider” option, which can be helpful for parametrising your transform with values that are only available during pipeline execution and can be used for pipeline templates [1].
>> 
>> [1] https://cloud.google.com/dataflow/docs/guides/templates/creating-templates <https://cloud.google.com/dataflow/docs/guides/templates/creating-templates>
>> 
>> > On 17 Mar 2021, at 14:06, Thomas Fredriksen(External) <thomas.fredriksen@cognite.com <ma...@cognite.com>> wrote:
>> > 
>> > Hello everyone,
>> > 
>> > I was wondering what is considered best-practice when writing SQL statements for the JdbcIO connector?
>> > 
>> > Hand-writing the statements and subsequent preparedStatementSetter causes a lot of bloat and is not very manageable.
>> > 
>> > Thank you.
>> > 
>> > Best Regards
>> > Thomas Li Fredriksen
>> 
> 


Re: JdbcIO SQL best practice

Posted by "Thomas Fredriksen(External)" <th...@cognite.com>.
This seems very promising,

Will the write from PCollection<Row> handle upserts?

On Wed, Mar 24, 2021 at 6:56 PM Alexey Romanenko <ar...@gmail.com>
wrote:

> Thanks for the details.
>
> If I’m not mistaken, JdbcIO already supports both your suggestions for
> read and write (at least, in some way) [1][2].
>
> Some examples from tests:
> - write from PCollection<Row> [3],
> - read to PCollection<Row> [4],
> - write from PCollection<POJO> with JavaBeanSchema [5]
>
> Is this what you are looking for?
>
> [1] https://issues.apache.org/jira/browse/BEAM-6674
> [2] https://github.com/apache/beam/pull/8725
> [3]
> https://github.com/apache/beam/blob/ab1dfa13a983d41669e70e83b11f58a83015004c/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOTest.java#L469
> [4]
> https://github.com/apache/beam/blob/ab1dfa13a983d41669e70e83b11f58a83015004c/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOTest.java#L524
> [5]
> https://github.com/apache/beam/blob/ab1dfa13a983d41669e70e83b11f58a83015004c/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOTest.java#L469
>
>
> On 23 Mar 2021, at 08:03, Thomas Fredriksen(External) <
> thomas.fredriksen@cognite.com> wrote:
>
> That is a very good question.
>
> Personally, I would prefer that read and write were simplified. I guess
> there will always be a need for writing complex queries, but the vast
> majority of pipelines will only need to read or write data to or from a
> table. As such, having read/write functions that will take an input-class
> (BEAN or POJO for example) and simply generate the required write-statement
> would be sufficient. Upserts should also be a part of this.
>
> For example:
>
> ```
> PCollection<MyBean> collection = ...;
> collection.apply("Write to database", JdbcIO.writeTable(MyBean.class)
>         .withDataSourceConfiguration(mySourceConfiguration)
>         .withTableName(myTableName)
>         .withUpsertOption(UpsertOption.create()
>                 .withConflictTarget(keyColumn)
>                 .withDoUpdate());
> ```
> This would of course assume that the columns of `myTableName` would match
> the members of `MyBean`.
>
> There are of course technical challenges with this:
> * How to handle situations where the column names do not match the
> input-type
> * How to detect columns from the input-type.
>
> As an alternative, schemas may be an option:
>
> ```
> PCollection<Row> collection = ...;
> collection.apply("Write to database", JdbcIO.writeRows()
>         .withSchema(mySchema)
>         .withDataSourceConfiguration(mySourceConfiguration)
>         .withTableName(myTableName)
>         .withUpsertOption(UpsertOption.create()
>                 .withConflictTarget(keyColumn)
>                 .withDoUpdate());
> ```
> This would allow for greater flexibility, but we lose the strong typing
> of the first suggestion.
>
> I hope this helps.
>
> Best Regards
> Thomas Li Fredriksen
>
> On Fri, Mar 19, 2021 at 7:17 PM Alexey Romanenko <ar...@gmail.com>
> wrote:
>
>> Hmm, interesting question. Since we don’t have any answers yet, may I ask
>> you a question - do you have an example of what these practices could look
>> like, or how it could be simplified?
>>
>>
>> PS: Not sure whether it helps, but JdbcIO allows setting the query via a
>> “ValueProvider” option, which can be helpful for parametrising your
>> transform with values that are only available during pipeline execution
>> and can be used for pipeline templates [1].
>>
>> [1]
>> https://cloud.google.com/dataflow/docs/guides/templates/creating-templates
>>
>> > On 17 Mar 2021, at 14:06, Thomas Fredriksen(External) <
>> thomas.fredriksen@cognite.com> wrote:
>> >
>> > Hello everyone,
>> >
>> > I was wondering what is considered best-practice when writing SQL
>> statements for the JdbcIO connector?
>> >
>> > Hand-writing the statements and subsequent preparedStatementSetter
>> causes a lot of bloat and is not very manageable.
>> >
>> > Thank you.
>> >
>> > Best Regards
>> > Thomas Li Fredriksen
>>
>>
>

Re: JdbcIO SQL best practice

Posted by Alexey Romanenko <ar...@gmail.com>.
Thanks for the details.

If I’m not mistaken, JdbcIO already supports both your suggestions for read and write (at least, in some way) [1][2].

Some examples from tests:
- write from PCollection<Row> [3], 
- read to PCollection<Row> [4], 
- write from PCollection<POJO> with JavaBeanSchema [5] 

Is this what you are looking for?

[1] https://issues.apache.org/jira/browse/BEAM-6674
[2] https://github.com/apache/beam/pull/8725
[3] https://github.com/apache/beam/blob/ab1dfa13a983d41669e70e83b11f58a83015004c/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOTest.java#L469
[4] https://github.com/apache/beam/blob/ab1dfa13a983d41669e70e83b11f58a83015004c/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOTest.java#L524
[5] https://github.com/apache/beam/blob/ab1dfa13a983d41669e70e83b11f58a83015004c/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOTest.java#L469
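
For reference, usage along the lines of those tests would look roughly like this (a sketch only; the table name and data source configuration are made up, and the exact method names should be checked against the Beam version in use):

```
PCollection<Row> rows = ...;  // a PCollection with a Beam schema attached

// Write: no hand-written statement or PreparedStatementSetter; the INSERT
// is generated from the Row schema and the table name.
rows.apply("Write rows", JdbcIO.<Row>write()
        .withDataSourceConfiguration(dataSourceConfiguration)
        .withTable("my_table"));

// Read: rows come back as a PCollection<Row> with an inferred schema.
PCollection<Row> readBack = pipeline.apply("Read rows", JdbcIO.readRows()
        .withDataSourceConfiguration(dataSourceConfiguration)
        .withQuery("SELECT * FROM my_table"));
```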


> On 23 Mar 2021, at 08:03, Thomas Fredriksen(External) <th...@cognite.com> wrote:
> 
> That is a very good question.
> 
> Personally, I would prefer that read and write were simplified. I guess there will always be a need for writing complex queries, but the vast majority of pipelines will only need to read or write data to or from a table. As such, having read/write functions that will take an input-class (BEAN or POJO for example) and simply generate the required write-statement would be sufficient. Upserts should also be a part of this.
> 
> For example:
> 
> ```
> PCollection<MyBean> collection = ...;
> collection.apply("Write to database", JdbcIO.writeTable(MyBean.class)
>         .withDataSourceConfiguration(mySourceConfiguration)
>         .withTableName(myTableName)
>         .withUpsertOption(UpsertOption.create()
>                 .withConflictTarget(keyColumn)
>                 .withDoUpdate());
> ```
> This would of course assume that the columns of `myTableName` would match the members of `MyBean`.
> 
> There are of course technical challenges with this:
> * How to handle situations where the column names do not match the input-type
> * How to detect columns from the input-type.
> 
> As an alternative, schemas may be an option:
> 
> ```
> PCollection<Row> collection = ...;
> collection.apply("Write to database", JdbcIO.writeRows()
>         .withSchema(mySchema)
>         .withDataSourceConfiguration(mySourceConfiguration)
>         .withTableName(myTableName)
>         .withUpsertOption(UpsertOption.create()
>                 .withConflictTarget(keyColumn)
>                 .withDoUpdate());
> ```
> This would allow for greater flexibility, but we lose the strong typing of the first suggestion.
> 
> I hope this helps.
> 
> Best Regards
> Thomas Li Fredriksen
> 
> On Fri, Mar 19, 2021 at 7:17 PM Alexey Romanenko <aromanenko.dev@gmail.com <ma...@gmail.com>> wrote:
> Hmm, interesting question. Since we don’t have any answers yet, may I ask you a question - do you have an example of what these practices could look like, or how it could be simplified?
> 
> 
> PS: Not sure whether it helps, but JdbcIO allows setting the query via a “ValueProvider” option, which can be helpful for parametrising your transform with values that are only available during pipeline execution and can be used for pipeline templates [1].
> 
> [1] https://cloud.google.com/dataflow/docs/guides/templates/creating-templates <https://cloud.google.com/dataflow/docs/guides/templates/creating-templates>
> 
> > On 17 Mar 2021, at 14:06, Thomas Fredriksen(External) <thomas.fredriksen@cognite.com <ma...@cognite.com>> wrote:
> > 
> > Hello everyone,
> > 
> > I was wondering what is considered best-practice when writing SQL statements for the JdbcIO connector?
> > 
> > Hand-writing the statements and subsequent preparedStatementSetter causes a lot of bloat and is not very manageable.
> > 
> > Thank you.
> > 
> > Best Regards
> > Thomas Li Fredriksen
> 


Re: JdbcIO SQL best practice

Posted by Brian Hulette <bh...@google.com>.
FYI the schemas option has been pursued a little bit in
JdbcSchemaIOProvider [1], which naively generates SELECT and INSERT
statements for reads and writes. Practically, this code is only usable from
SQL and multi-language pipelines (e.g. it's accessible from the Python SDK
[2]). We could consider either:
- Moving this logic into JdbcIO and re-using it in JdbcSchemaIOProvider, or
- Adding a user-friendly interface to SchemaIOProvider implementations in
the Java SDK

Brian

[1]
https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcSchemaIOProvider.java
[2]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/jdbc.py

On Tue, Mar 23, 2021 at 12:03 AM Thomas Fredriksen(External) <
thomas.fredriksen@cognite.com> wrote:

> That is a very good question.
>
> Personally, I would prefer that read and write were simplified. I guess
> there will always be a need for writing complex queries, but the vast
> majority of pipelines will only need to read or write data to or from a
> table. As such, having read/write functions that will take an input-class
> (BEAN or POJO for example) and simply generate the required write-statement
> would be sufficient. Upserts should also be a part of this.
>
> For example:
>
> ```
> PCollection<MyBean> collection = ...;
> collection.apply("Write to database", JdbcIO.writeTable(MyBean.class)
>         .withDataSourceConfiguration(mySourceConfiguration)
>         .withTableName(myTableName)
>         .withUpsertOption(UpsertOption.create()
>                 .withConflictTarget(keyColumn)
>                 .withDoUpdate());
> ```
> This would of course assume that the columns of `myTableName` would match
> the members of `MyBean`.
>
> There are of course technical challenges with this:
> * How to handle situations where the column names do not match the
> input-type
> * How to detect columns from the input-type.
>
> As an alternative, schemas may be an option:
>
> ```
> PCollection<Row> collection = ...;
> collection.apply("Write to database", JdbcIO.writeRows()
>         .withSchema(mySchema)
>         .withDataSourceConfiguration(mySourceConfiguration)
>         .withTableName(myTableName)
>         .withUpsertOption(UpsertOption.create()
>                 .withConflictTarget(keyColumn)
>                 .withDoUpdate());
> ```
> This would allow for greater flexibility, but we lose the strong typing
> of the first suggestion.
>
> I hope this helps.
>
> Best Regards
> Thomas Li Fredriksen
>
> On Fri, Mar 19, 2021 at 7:17 PM Alexey Romanenko <ar...@gmail.com>
> wrote:
>
>> Hmm, interesting question. Since we don’t have any answers yet, may I ask
>> you a question - do you have an example of what these practices could look
>> like, or how it could be simplified?
>>
>>
>> PS: Not sure whether it helps, but JdbcIO allows setting the query via a
>> “ValueProvider” option, which can be helpful for parametrising your
>> transform with values that are only available during pipeline execution
>> and can be used for pipeline templates [1].
>>
>> [1]
>> https://cloud.google.com/dataflow/docs/guides/templates/creating-templates
>>
>> > On 17 Mar 2021, at 14:06, Thomas Fredriksen(External) <
>> thomas.fredriksen@cognite.com> wrote:
>> >
>> > Hello everyone,
>> >
>> > I was wondering what is considered best-practice when writing SQL
>> statements for the JdbcIO connector?
>> >
>> > Hand-writing the statements and subsequent preparedStatementSetter
>> causes a lot of bloat and is not very manageable.
>> >
>> > Thank you.
>> >
>> > Best Regards
>> > Thomas Li Fredriksen
>>
>>

Re: JdbcIO SQL best practice

Posted by "Thomas Fredriksen(External)" <th...@cognite.com>.
That is a very good question.

Personally, I would prefer that read and write were simplified. I guess
there will always be a need for writing complex queries, but the vast
majority of pipelines will only need to read or write data to or from a
table. As such, having read/write functions that will take an input-class
(BEAN or POJO for example) and simply generate the required write-statement
would be sufficient. Upserts should also be a part of this.

For example:

```
PCollection<MyBean> collection = ...;
collection.apply("Write to database", JdbcIO.writeTable(MyBean.class)
        .withDataSourceConfiguration(mySourceConfiguration)
        .withTableName(myTableName)
        .withUpsertOption(UpsertOption.create()
                .withConflictTarget(keyColumn)
                .withDoUpdate());
```
This would of course assume that the columns of `myTableName` would match
the members of `MyBean`.

There are of course technical challenges with this:
* How to handle situations where the column names do not match the
input-type
* How to detect columns from the input-type.

As an alternative, schemas may be an option:

```
PCollection<Row> collection = ...;
collection.apply("Write to database", JdbcIO.writeRows()
        .withSchema(mySchema)
        .withDataSourceConfiguration(mySourceConfiguration)
        .withTableName(myTableName)
        .withUpsertOption(UpsertOption.create()
                .withConflictTarget(keyColumn)
                .withDoUpdate());
```
This would allow for greater flexibility, but we lose the strong typing
of the first suggestion.

I hope this helps.

Best Regards
Thomas Li Fredriksen

On Fri, Mar 19, 2021 at 7:17 PM Alexey Romanenko <ar...@gmail.com>
wrote:

> Hmm, interesting question. Since we don’t have any answers yet, may I ask
> you a question - do you have an example of what these practices could look
> like, or how it could be simplified?
>
>
> PS: Not sure whether it helps, but JdbcIO allows setting the query via a
> “ValueProvider” option, which can be helpful for parametrising your
> transform with values that are only available during pipeline execution
> and can be used for pipeline templates [1].
>
> [1]
> https://cloud.google.com/dataflow/docs/guides/templates/creating-templates
>
> > On 17 Mar 2021, at 14:06, Thomas Fredriksen(External) <
> thomas.fredriksen@cognite.com> wrote:
> >
> > Hello everyone,
> >
> > I was wondering what is considered best-practice when writing SQL
> statements for the JdbcIO connector?
> >
> > Hand-writing the statements and subsequent preparedStatementSetter
> causes a lot of bloat and is not very manageable.
> >
> > Thank you.
> >
> > Best Regards
> > Thomas Li Fredriksen
>
>

Re: JdbcIO SQL best practice

Posted by Alexey Romanenko <ar...@gmail.com>.
Hmm, interesting question. Since we don’t have any answers yet, may I ask you a question - do you have an example of what these practices could look like, or how it could be simplified?


PS: Not sure whether it helps, but JdbcIO allows setting the query via a “ValueProvider” option, which can be helpful for parametrising your transform with values that are only available during pipeline execution and can be used for pipeline templates [1].

[1] https://cloud.google.com/dataflow/docs/guides/templates/creating-templates
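
For example, something along these lines (a sketch; the options interface, query, row type and the `options`/`pipeline`/`dataSourceConfiguration` variables are made up for illustration):

```
// A template-friendly options interface exposing the query as a ValueProvider.
public interface MyJdbcOptions extends PipelineOptions {
    ValueProvider<String> getQuery();
    void setQuery(ValueProvider<String> value);
}

// The actual query value can then be supplied at template execution time.
PCollection<KV<Integer, String>> result = pipeline.apply("Read",
        JdbcIO.<KV<Integer, String>>read()
                .withDataSourceConfiguration(dataSourceConfiguration)
                .withQuery(options.getQuery())  // ValueProvider<String> overload
                .withCoder(KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of()))
                .withRowMapper(resultSet ->
                        KV.of(resultSet.getInt(1), resultSet.getString(2))));
```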

> On 17 Mar 2021, at 14:06, Thomas Fredriksen(External) <th...@cognite.com> wrote:
> 
> Hello everyone,
> 
> I was wondering what is considered best-practice when writing SQL statements for the JdbcIO connector?
> 
> Hand-writing the statements and subsequent preparedStatementSetter causes a lot of bloat and is not very manageable.
> 
> Thank you.
> 
> Best Regards
> Thomas Li Fredriksen