Posted to dev@beam.apache.org by Katarzyna Kucharczyk <ka...@gmail.com> on 2020/03/23 14:23:15 UTC

[PROPOSAL] Snowflake Java Connector for Apache Beam

Hi all,

My colleagues and I have developed a new Java connector for Snowflake that
we would like to add to Beam.

Snowflake is an analytic data warehouse provided as Software-as-a-Service
(SaaS). It uses a new SQL database engine with a unique architecture
designed for the cloud. For more details, please see [1] and [2].

The proposed Snowflake IOs use the Snowflake JDBC library [3]. They provide
batch write and batch read, both built on the Snowflake COPY [4] operation.
In both cases the ParDo-based IOs stage files, and the data is then moved
between the staged files and the Snowflake table of choice using the COPY
API. The currently supported stage location is Google Cloud Storage [5].

(A diagram showing how the Snowflake Read IO works was included here; the
write operation works similarly, but in the opposite direction.)

Here is an Apache Beam fork [6] with the current work on the Snowflake IO.
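
To give a rough feel for the intended usage, here is a minimal sketch of a
batch read pipeline. This is a fragment only; the builder method names, the
bucket, and the table below are illustrative and may differ from the actual
API in the fork:

    // Underneath, the IO runs Snowflake COPY statements against the stage
    // and then reads the staged CSV files in parallel with ParDos.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    PCollection<List<String>> rows = p.apply("Read from Snowflake",
        SnowflakeIO.<List<String>>read()
            .withDataSourceConfiguration(
                SnowflakeIO.DataSourceConfiguration.create()
                    .withServerName("myaccount.snowflakecomputing.com")
                    .withDatabase("MY_DB")
                    .withSchema("PUBLIC"))
            .fromTable("MY_TABLE")
            // GCS location used as the Snowflake stage.
            .withStagingBucketName("gs://my-staging-bucket/snowflake-tmp/")
            .withCsvMapper(parts -> Arrays.asList(parts)) // CSV columns -> output element
            .withCoder(ListCoder.of(StringUtf8Coder.of())));

    p.run().waitUntilFinish();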

In the near future we would also like to add an IO for streaming writes,
which will use Snowpipe - Snowflake's mechanism for continuous loading [7].
We would also like to use cross-language transforms to provide Python
connectors.

We are open to all opinions and suggestions. If you have any questions or
comments, please do not hesitate to post them.

If there are no objections, I will create Jira tickets and share them in this
thread.

Cheers,
Kasia

[1] https://www.snowflake.com

[2] https://docs.snowflake.net/manuals/user-guide/intro-key-concepts.html

[3] https://docs.snowflake.net/manuals/user-guide/jdbc.html

[4] https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html

[5] https://cloud.google.com/storage

[6]
https://github.com/PolideaInternal/beam/tree/snowflake-io/sdks/java/io/snowflake

[7] https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe.html

Re: [PROPOSAL] Snowflake Java Connector for Apache Beam

Posted by Jean-Baptiste Onofre <jb...@nanthrax.net>.
Hi,

It’s very interesting. +1 to create a Jira and prepare a PR for review.

Thanks !
Regards
JB

> Le 23 mars 2020 à 15:23, Katarzyna Kucharczyk <ka...@gmail.com> a écrit :
> 
> Hi all,
> 
> Me and my colleagues have developed a new Java connector for Snowflake that we would like to add to Beam.
> 
> Snowflake is an analytic data warehouse provided as Software-as-a-Service (SaaS). It uses a new SQL database engine with a unique architecture designed for the cloud. To read more details please check [1] and [2].
> 
> Proposed Snowflake IOs use JDBC Snowflake library [3]. The IOs are batch write and batch read that use the Snowflake COPY [4] operation underneath. In both cases ParDo IOs load files on a stage and then they are inserted into the Snowflake table of choice using the COPY API. The currently supported stage is Google Cloud Storage[5].
> 
> The schema how Snowflake Read IO works (write operation works similarly but in opposite direction):
> 
> 
> 
> Here is an Apache Beam fork [6] with current work of the Snowflake IO.
> 
> In the near future we would like to also add IO for writing streams which will use SnowPipe - Snowflake mechanism for continuous loading[7]. Also, we would like to use cross language to provide Python connectors as well.
> 
> We are open for all opinions and suggestions. In case of any questions/comments please do not hesitate to post them.
> 
> In case of no objection I will create jira tickets and share them in this thread.
> 
> Cheers,
> Kasia
> 
> [1] https://www.snowflake.com <https://www.snowflake.com/> 
> [2] https://docs.snowflake.net/manuals/user-guide/intro-key-concepts.html <https://docs.snowflake.net/manuals/user-guide/intro-key-concepts.html> 
> [3] https://docs.snowflake.net/manuals/user-guide/jdbc.html <https://docs.snowflake.net/manuals/user-guide/jdbc.html> 
> [4] https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html <https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html> 
> [5] https://github.com/PolideaInternal/beam/tree/snowflake-io/sdks/java/io/snowflake <https://github.com/PolideaInternal/beam/tree/snowflake-io/sdks/java/io/snowflake> 
> [6] https://cloud.google.com/storage <https://cloud.google.com/storage> 
> [7] https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe.html <https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe.html> 
> 


Re: [PROPOSAL] Snowflake Java Connector for Apache Beam

Posted by Dariusz Aniszewski <da...@polidea.com>.
Hello

It's been a while since my last activity on the Beam dev list ;) Happy to be
back!

A few days ago Kasia created a JIRA issue for adding SnowflakeIO:
https://issues.apache.org/jira/browse/BEAM-9722

Today, I'm happy to share with you the first PR, with SnowflakeIO.Read:
https://github.com/apache/beam/pull/11360

Subsequent PRs (with Write and other parts) will come later, after this one
is approved and merged, as reviewing the whole thing at once would be very
hard.

We're looking forward to seeing all your reviews!

Cheers,
Dariusz





On Thu, Mar 26, 2020 at 4:58 PM Katarzyna Kucharczyk <
ka.kucharczyk@gmail.com> wrote:

> Hi,
> Thank you for your enthusiasm and for so many questions/comments :) I hope
> to address them all.
>
> Alexey, as far as I know, copy methods have better performance than
> inserts/selects. I think currently in Beam's JDBC loading and unloading is
> provided by selects and inserts as well. But I saw copy command in Postgres
> JDBC, maybe it's something worth investigating in the future?
> As it comes to other cloud storages, we thought GCP is a good starting
> point. It makes also sense in case of using Dataflow as a runner, so the
> user would have expenses only on one provider. But I think it would be
> great to add other storages in the future. As Ismaël mentioned it would be
> good to know if S3 works fine with FileIO as well.
> We didn't think about using Beam Schema in the IO, but it might be worth
> checking in case of creating table with specified schema.
>
> Cham, thanks for advice about SDF. I wonder how it might influence whole
> IO. I guess it can be helpful while staging files and splitting in
> pipeline. The COPY operation is called once for all staged files. It should
> be optimised on Snowflake side. I have to research it and check how it's
> done in other IOs.
>
> Ismaël, unfortunately there is no such thing as embedded Snowflake :( What
> we currently plan is to create fake Snowflake service for unit testing.
>
> Indeed, this is interesting that there are many tool with similar copy
> pattern. I am curious if it could be shared functionality in Beam.
>
> Thanks again for all comments and suggestions - those are extremely
> helpful,
> Kasia
>
> On Tue, Mar 24, 2020 at 10:28 AM Ismaël Mejía <ie...@gmail.com> wrote:
>
>> Forgot to mention that one particularly pesky issue we found in the work
>> on
>> Redshift is to be able to write unit tests on this.
>>
>> Is there an embedded version of SnowFlake to run those. I would like also
>> if
>> possible to get some ideas on how to test this use case.
>>
>> Also we should probably ensure that the FileIO part is generic enough so
>> we can
>> use S3 too because users can be using Snowflake in AWS too.
>>
>>
>> On Tue, Mar 24, 2020 at 10:10 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>
>>> Great !
>>> It seems this pattern (COPY + parallel file read) is becoming a standard
>>> for
>>> 'data warehouses' we are using something similar too in the AWS Redshift
>>> PR (WIP)
>>> for details: https://github.com/apache/beam/pull/10206
>>>
>>> Maybe worth for all of us to check and se eif we can converge the
>>> implementations as
>>> much as possible to provide users a consistent experience.
>>>
>>>
>>> On Tue, Mar 24, 2020 at 10:02 AM Elias Djurfeldt <
>>> elias.djurfeldt@mirado.com> wrote:
>>>
>>>> Awesome job! I'm very interested in the cross-language support.
>>>>
>>>> Cheers,
>>>>
>>>> On Tue, 24 Mar 2020 at 01:20, Chamikara Jayalath <ch...@google.com>
>>>> wrote:
>>>>
>>>>> Sounds great. Looks like operation of the Snowflake source will be
>>>>> similar to BigQuery source (export files to GCS and read files). This will
>>>>> allow you to better parallelize reading (current JDBC source is limited to
>>>>> one worker when reading).
>>>>>
>>>>> Seems like you already support initial splitting using files -
>>>>> https://github.com/PolideaInternal/beam/blob/snowflake-io/sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java#L374
>>>>> Prob. also consider supporting dynamic work rebalancing when runners
>>>>> support this through SDF.
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 23, 2020 at 9:49 AM Alexey Romanenko <
>>>>> aromanenko.dev@gmail.com> wrote:
>>>>>
>>>>>> Great! This is always welcomed to have more IOs in Beam. I’d be happy
>>>>>> to take look on your PR once it will be created.
>>>>>>
>>>>>> Just a couple of questions for now.
>>>>>>
>>>>>> 1) Afaik, you can connect to Snowflake using standard JDBC driver. Do
>>>>>> you plan to compare a performance between this SnowflakeIO and Beam JdbcIO?
>>>>>> 2) Are you going to support staging in other locations, like S3 and
>>>>>> Azure?
>>>>>> 3) Does “ withSchema()” allows to infer Snowflake schema to Beam
>>>>>> schema?
>>>>>>
>>>>>> On 23 Mar 2020, at 15:23, Katarzyna Kucharczyk <
>>>>>> ka.kucharczyk@gmail.com> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Me and my colleagues have developed a new Java connector for
>>>>>> Snowflake that we would like to add to Beam.
>>>>>>
>>>>>> Snowflake is an analytic data warehouse provided as
>>>>>> Software-as-a-Service (SaaS). It uses a new SQL database engine with a
>>>>>> unique architecture designed for the cloud. To read more details please
>>>>>> check [1] and [2].
>>>>>>
>>>>>> Proposed Snowflake IOs use JDBC Snowflake library [3]. The IOs are
>>>>>> batch write and batch read that use the Snowflake COPY [4] operation
>>>>>> underneath. In both cases ParDo IOs load files on a stage and then they are
>>>>>> inserted into the Snowflake table of choice using the COPY API. The
>>>>>> currently supported stage is Google Cloud Storage[5].
>>>>>>
>>>>>> The schema how Snowflake Read IO works (write operation works
>>>>>> similarly but in opposite direction):
>>>>>> Here is an Apache Beam fork [6] with current work of the Snowflake IO.
>>>>>>
>>>>>> In the near future we would like to also add IO for writing streams
>>>>>> which will use SnowPipe - Snowflake mechanism for continuous loading[7].
>>>>>> Also, we would like to use cross language to provide Python connectors as
>>>>>> well.
>>>>>>
>>>>>> We are open for all opinions and suggestions. In case of any
>>>>>> questions/comments please do not hesitate to post them.
>>>>>>
>>>>>> In case of no objection I will create jira tickets and share them in
>>>>>> this thread. Cheers, Kasia
>>>>>>
>>>>>> [1] https://www.snowflake.com
>>>>>> [2]
>>>>>> https://docs.snowflake.net/manuals/user-guide/intro-key-concepts.html
>>>>>>
>>>>>> [3] https://docs.snowflake.net/manuals/user-guide/jdbc.html
>>>>>> [4]
>>>>>> https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
>>>>>> [5]
>>>>>> https://github.com/PolideaInternal/beam/tree/snowflake-io/sdks/java/io/snowflake
>>>>>>
>>>>>> [6] https://cloud.google.com/storage
>>>>>> [7]
>>>>>> https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe.html
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Elias Djurfeldt
>>>> Mirado Consulting
>>>>
>>>

-- 

Dariusz Aniszewski
Polidea <https://www.polidea.com/> | Lead Software Engineer

M: +48 535 432 708
E: dariusz.aniszewski@polidea.com

Re: [PROPOSAL] Snowflake Java Connector for Apache Beam

Posted by Katarzyna Kucharczyk <ka...@gmail.com>.
Hi,
Thank you for your enthusiasm and for so many questions/comments :) I hope
to address them all.

Alexey, as far as I know, COPY-based methods have better performance than
inserts/selects. I think loading and unloading in Beam's JdbcIO is currently
done with selects and inserts as well. But I saw a COPY command in the
Postgres JDBC driver, so maybe that is something worth investigating in the
future (see the sketch after this paragraph)?
As for other cloud storages, we thought GCP was a good starting point. It
also makes sense when using Dataflow as the runner, so the user only has
expenses with one provider. But I think it would be great to add other
storages in the future. As Ismaël mentioned, it would be good to know whether
S3 works fine with FileIO as well.
We didn't think about using Beam Schema in the IO, but it might be worth
checking for creating a table with a specified schema.
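
For reference, the COPY support in the Postgres JDBC driver that I mentioned
looks roughly like this (a minimal standalone sketch, unrelated to the
Snowflake IO itself; connection details, table, and file names are made up):

    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import org.postgresql.copy.CopyManager;
    import org.postgresql.core.BaseConnection;

    public class PostgresCopyExample {
      public static void main(String[] args) throws Exception {
        // Bulk-load a local CSV file using the Postgres COPY protocol,
        // which is usually much faster than row-by-row INSERTs.
        try (Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
          CopyManager copyManager = new CopyManager((BaseConnection) conn);
          long rows = copyManager.copyIn(
              "COPY my_table FROM STDIN WITH (FORMAT csv)",
              new FileReader("/tmp/data.csv"));
          System.out.println("Copied " + rows + " rows");
        }
      }
    }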

Cham, thanks for the advice about SDF. I wonder how it might influence the
whole IO. I guess it can be helpful while staging files and splitting in the
pipeline. The COPY operation is called once for all staged files and should
be optimised on the Snowflake side. I have to research it and check how it's
done in other IOs.

Ismaël, unfortunately there is no such thing as an embedded Snowflake :( What
we currently plan is to create a fake Snowflake service for unit testing (a
rough sketch is below).
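
To make the testing idea a bit more concrete, here is a very rough sketch of
the kind of seam we have in mind; all names are hypothetical and not part of
the current code:

    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Production code would talk to Snowflake only through this interface,
    // so unit tests can swap in an in-memory fake instead of a real account.
    public interface SnowflakeService extends Serializable {
      // Unload a query result to staged files and return their locations.
      List<String> copyIntoStage(String query, String stagingBucket);

      // Load staged files into the given table.
      void copyToTable(List<String> files, String table);
    }

    // In-memory fake for unit tests (would live in a separate test file).
    class FakeSnowflakeService implements SnowflakeService {
      private final Map<String, List<String>> tables = new HashMap<>();

      @Override
      public List<String> copyIntoStage(String query, String stagingBucket) {
        // Pretend the query produced a single staged CSV file.
        return Collections.singletonList(stagingBucket + "/part-00000.csv");
      }

      @Override
      public void copyToTable(List<String> files, String table) {
        tables.computeIfAbsent(table, t -> new ArrayList<>()).addAll(files);
      }
    }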

Indeed, it is interesting that there are many tools with a similar COPY
pattern. I am curious whether it could become shared functionality in Beam.

Thanks again for all the comments and suggestions - they are extremely helpful,
Kasia

On Tue, Mar 24, 2020 at 10:28 AM Ismaël Mejía <ie...@gmail.com> wrote:

> Forgot to mention that one particularly pesky issue we found in the work on
> Redshift is to be able to write unit tests on this.
>
> Is there an embedded version of SnowFlake to run those. I would like also
> if
> possible to get some ideas on how to test this use case.
>
> Also we should probably ensure that the FileIO part is generic enough so
> we can
> use S3 too because users can be using Snowflake in AWS too.
>
>
> On Tue, Mar 24, 2020 at 10:10 AM Ismaël Mejía <ie...@gmail.com> wrote:
>
>> Great !
>> It seems this pattern (COPY + parallel file read) is becoming a standard
>> for
>> 'data warehouses' we are using something similar too in the AWS Redshift
>> PR (WIP)
>> for details: https://github.com/apache/beam/pull/10206
>>
>> Maybe worth for all of us to check and se eif we can converge the
>> implementations as
>> much as possible to provide users a consistent experience.
>>
>>
>> On Tue, Mar 24, 2020 at 10:02 AM Elias Djurfeldt <
>> elias.djurfeldt@mirado.com> wrote:
>>
>>> Awesome job! I'm very interested in the cross-language support.
>>>
>>> Cheers,
>>>
>>> On Tue, 24 Mar 2020 at 01:20, Chamikara Jayalath <ch...@google.com>
>>> wrote:
>>>
>>>> Sounds great. Looks like operation of the Snowflake source will be
>>>> similar to BigQuery source (export files to GCS and read files). This will
>>>> allow you to better parallelize reading (current JDBC source is limited to
>>>> one worker when reading).
>>>>
>>>> Seems like you already support initial splitting using files -
>>>> https://github.com/PolideaInternal/beam/blob/snowflake-io/sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java#L374
>>>> Prob. also consider supporting dynamic work rebalancing when runners
>>>> support this through SDF.
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Mar 23, 2020 at 9:49 AM Alexey Romanenko <
>>>> aromanenko.dev@gmail.com> wrote:
>>>>
>>>>> Great! This is always welcomed to have more IOs in Beam. I’d be happy
>>>>> to take look on your PR once it will be created.
>>>>>
>>>>> Just a couple of questions for now.
>>>>>
>>>>> 1) Afaik, you can connect to Snowflake using standard JDBC driver. Do
>>>>> you plan to compare a performance between this SnowflakeIO and Beam JdbcIO?
>>>>> 2) Are you going to support staging in other locations, like S3 and
>>>>> Azure?
>>>>> 3) Does “ withSchema()” allows to infer Snowflake schema to Beam
>>>>> schema?
>>>>>
>>>>> On 23 Mar 2020, at 15:23, Katarzyna Kucharczyk <
>>>>> ka.kucharczyk@gmail.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Me and my colleagues have developed a new Java connector for Snowflake
>>>>> that we would like to add to Beam.
>>>>>
>>>>> Snowflake is an analytic data warehouse provided as
>>>>> Software-as-a-Service (SaaS). It uses a new SQL database engine with a
>>>>> unique architecture designed for the cloud. To read more details please
>>>>> check [1] and [2].
>>>>>
>>>>> Proposed Snowflake IOs use JDBC Snowflake library [3]. The IOs are
>>>>> batch write and batch read that use the Snowflake COPY [4] operation
>>>>> underneath. In both cases ParDo IOs load files on a stage and then they are
>>>>> inserted into the Snowflake table of choice using the COPY API. The
>>>>> currently supported stage is Google Cloud Storage[5].
>>>>>
>>>>> The schema how Snowflake Read IO works (write operation works
>>>>> similarly but in opposite direction):
>>>>> Here is an Apache Beam fork [6] with current work of the Snowflake IO.
>>>>>
>>>>> In the near future we would like to also add IO for writing streams
>>>>> which will use SnowPipe - Snowflake mechanism for continuous loading[7].
>>>>> Also, we would like to use cross language to provide Python connectors as
>>>>> well.
>>>>>
>>>>> We are open for all opinions and suggestions. In case of any
>>>>> questions/comments please do not hesitate to post them.
>>>>>
>>>>> In case of no objection I will create jira tickets and share them in
>>>>> this thread. Cheers, Kasia
>>>>>
>>>>> [1] https://www.snowflake.com
>>>>> [2]
>>>>> https://docs.snowflake.net/manuals/user-guide/intro-key-concepts.html
>>>>> [3] https://docs.snowflake.net/manuals/user-guide/jdbc.html
>>>>> [4]
>>>>> https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
>>>>> [5]
>>>>> https://github.com/PolideaInternal/beam/tree/snowflake-io/sdks/java/io/snowflake
>>>>>
>>>>> [6] https://cloud.google.com/storage
>>>>> [7]
>>>>> https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe.html
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> Elias Djurfeldt
>>> Mirado Consulting
>>>
>>

Re: [PROPOSAL] Snowflake Java Connector for Apache Beam

Posted by Ismaël Mejía <ie...@gmail.com>.
Forgot to mention that one particularly pesky issue we found in the work on
Redshift is being able to write unit tests for this.

Is there an embedded version of Snowflake to run those? I would also like, if
possible, to get some ideas on how to test this use case.

Also, we should probably ensure that the FileIO part is generic enough that
we can use S3 too, because users may be running Snowflake on AWS as well.
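
For what it is worth, FileIO resolves the filesystem from the path scheme, so
matching the staged files should look the same for GCS and S3. A minimal
sketch (bucket names are made up; S3 additionally needs the
amazon-web-services filesystem module on the classpath and AWS options
configured):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    // Match and read the files produced by the COPY unload; only the scheme
    // changes between providers ("gs://..." vs "s3://...").
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    PCollection<String> lines = p
        .apply(FileIO.match().filepattern("s3://my-bucket/snowflake-stage/part-*.csv"))
        .apply(FileIO.readMatches())
        .apply(TextIO.readFiles()); // read each matched staged file line by line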


On Tue, Mar 24, 2020 at 10:10 AM Ismaël Mejía <ie...@gmail.com> wrote:

> Great !
> It seems this pattern (COPY + parallel file read) is becoming a standard
> for
> 'data warehouses' we are using something similar too in the AWS Redshift
> PR (WIP)
> for details: https://github.com/apache/beam/pull/10206
>
> Maybe worth for all of us to check and se eif we can converge the
> implementations as
> much as possible to provide users a consistent experience.
>
>
> On Tue, Mar 24, 2020 at 10:02 AM Elias Djurfeldt <
> elias.djurfeldt@mirado.com> wrote:
>
>> Awesome job! I'm very interested in the cross-language support.
>>
>> Cheers,
>>
>> On Tue, 24 Mar 2020 at 01:20, Chamikara Jayalath <ch...@google.com>
>> wrote:
>>
>>> Sounds great. Looks like operation of the Snowflake source will be
>>> similar to BigQuery source (export files to GCS and read files). This will
>>> allow you to better parallelize reading (current JDBC source is limited to
>>> one worker when reading).
>>>
>>> Seems like you already support initial splitting using files -
>>> https://github.com/PolideaInternal/beam/blob/snowflake-io/sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java#L374
>>> Prob. also consider supporting dynamic work rebalancing when runners
>>> support this through SDF.
>>>
>>> Thanks,
>>> Cham
>>>
>>>
>>>
>>>
>>> On Mon, Mar 23, 2020 at 9:49 AM Alexey Romanenko <
>>> aromanenko.dev@gmail.com> wrote:
>>>
>>>> Great! This is always welcomed to have more IOs in Beam. I’d be happy
>>>> to take look on your PR once it will be created.
>>>>
>>>> Just a couple of questions for now.
>>>>
>>>> 1) Afaik, you can connect to Snowflake using standard JDBC driver. Do
>>>> you plan to compare a performance between this SnowflakeIO and Beam JdbcIO?
>>>> 2) Are you going to support staging in other locations, like S3 and
>>>> Azure?
>>>> 3) Does “ withSchema()” allows to infer Snowflake schema to Beam schema?
>>>>
>>>> On 23 Mar 2020, at 15:23, Katarzyna Kucharczyk <ka...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Me and my colleagues have developed a new Java connector for Snowflake
>>>> that we would like to add to Beam.
>>>>
>>>> Snowflake is an analytic data warehouse provided as
>>>> Software-as-a-Service (SaaS). It uses a new SQL database engine with a
>>>> unique architecture designed for the cloud. To read more details please
>>>> check [1] and [2].
>>>>
>>>> Proposed Snowflake IOs use JDBC Snowflake library [3]. The IOs are
>>>> batch write and batch read that use the Snowflake COPY [4] operation
>>>> underneath. In both cases ParDo IOs load files on a stage and then they are
>>>> inserted into the Snowflake table of choice using the COPY API. The
>>>> currently supported stage is Google Cloud Storage[5].
>>>>
>>>> The schema how Snowflake Read IO works (write operation works similarly
>>>> but in opposite direction):
>>>> Here is an Apache Beam fork [6] with current work of the Snowflake IO.
>>>>
>>>> In the near future we would like to also add IO for writing streams
>>>> which will use SnowPipe - Snowflake mechanism for continuous loading[7].
>>>> Also, we would like to use cross language to provide Python connectors as
>>>> well.
>>>>
>>>> We are open for all opinions and suggestions. In case of any
>>>> questions/comments please do not hesitate to post them.
>>>>
>>>> In case of no objection I will create jira tickets and share them in
>>>> this thread. Cheers, Kasia
>>>>
>>>> [1] https://www.snowflake.com
>>>> [2]
>>>> https://docs.snowflake.net/manuals/user-guide/intro-key-concepts.html
>>>> [3] https://docs.snowflake.net/manuals/user-guide/jdbc.html
>>>> [4]
>>>> https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
>>>> [5]
>>>> https://github.com/PolideaInternal/beam/tree/snowflake-io/sdks/java/io/snowflake
>>>>
>>>> [6] https://cloud.google.com/storage
>>>> [7]
>>>> https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe.html
>>>>
>>>>
>>>>
>>
>> --
>> Elias Djurfeldt
>> Mirado Consulting
>>
>

Re: [PROPOSAL] Snowflake Java Connector for Apache Beam

Posted by Ismaël Mejía <ie...@gmail.com>.
Great!
It seems this pattern (COPY + parallel file read) is becoming a standard for
'data warehouses'. We are using something similar in the AWS Redshift PR
(WIP); for details: https://github.com/apache/beam/pull/10206

Maybe it is worth it for all of us to check and see if we can converge the
implementations as much as possible to provide users a consistent experience.


On Tue, Mar 24, 2020 at 10:02 AM Elias Djurfeldt <el...@mirado.com>
wrote:

> Awesome job! I'm very interested in the cross-language support.
>
> Cheers,
>
> On Tue, 24 Mar 2020 at 01:20, Chamikara Jayalath <ch...@google.com>
> wrote:
>
>> Sounds great. Looks like operation of the Snowflake source will be
>> similar to BigQuery source (export files to GCS and read files). This will
>> allow you to better parallelize reading (current JDBC source is limited to
>> one worker when reading).
>>
>> Seems like you already support initial splitting using files -
>> https://github.com/PolideaInternal/beam/blob/snowflake-io/sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java#L374
>> Prob. also consider supporting dynamic work rebalancing when runners
>> support this through SDF.
>>
>> Thanks,
>> Cham
>>
>>
>>
>>
>> On Mon, Mar 23, 2020 at 9:49 AM Alexey Romanenko <
>> aromanenko.dev@gmail.com> wrote:
>>
>>> Great! This is always welcomed to have more IOs in Beam. I’d be happy to
>>> take look on your PR once it will be created.
>>>
>>> Just a couple of questions for now.
>>>
>>> 1) Afaik, you can connect to Snowflake using standard JDBC driver. Do
>>> you plan to compare a performance between this SnowflakeIO and Beam JdbcIO?
>>> 2) Are you going to support staging in other locations, like S3 and
>>> Azure?
>>> 3) Does “ withSchema()” allows to infer Snowflake schema to Beam schema?
>>>
>>> On 23 Mar 2020, at 15:23, Katarzyna Kucharczyk <ka...@gmail.com>
>>> wrote:
>>>
>>> Hi all,
>>>
>>> Me and my colleagues have developed a new Java connector for Snowflake
>>> that we would like to add to Beam.
>>>
>>> Snowflake is an analytic data warehouse provided as
>>> Software-as-a-Service (SaaS). It uses a new SQL database engine with a
>>> unique architecture designed for the cloud. To read more details please
>>> check [1] and [2].
>>>
>>> Proposed Snowflake IOs use JDBC Snowflake library [3]. The IOs are batch
>>> write and batch read that use the Snowflake COPY [4] operation underneath.
>>> In both cases ParDo IOs load files on a stage and then they are inserted
>>> into the Snowflake table of choice using the COPY API. The currently
>>> supported stage is Google Cloud Storage[5].
>>>
>>> The schema how Snowflake Read IO works (write operation works similarly
>>> but in opposite direction):
>>> Here is an Apache Beam fork [6] with current work of the Snowflake IO.
>>>
>>> In the near future we would like to also add IO for writing streams
>>> which will use SnowPipe - Snowflake mechanism for continuous loading[7].
>>> Also, we would like to use cross language to provide Python connectors as
>>> well.
>>>
>>> We are open for all opinions and suggestions. In case of any
>>> questions/comments please do not hesitate to post them.
>>>
>>> In case of no objection I will create jira tickets and share them in
>>> this thread. Cheers, Kasia
>>>
>>> [1] https://www.snowflake.com
>>> [2]
>>> https://docs.snowflake.net/manuals/user-guide/intro-key-concepts.html
>>> [3] https://docs.snowflake.net/manuals/user-guide/jdbc.html
>>> [4] https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
>>>
>>> [5]
>>> https://github.com/PolideaInternal/beam/tree/snowflake-io/sdks/java/io/snowflake
>>>
>>> [6] https://cloud.google.com/storage
>>> [7]
>>> https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe.html
>>>
>>>
>>>
>
> --
> Elias Djurfeldt
> Mirado Consulting
>

Re: [PROPOSAL] Snowflake Java Connector for Apache Beam

Posted by Elias Djurfeldt <el...@mirado.com>.
Awesome job! I'm very interested in the cross-language support.

Cheers,

On Tue, 24 Mar 2020 at 01:20, Chamikara Jayalath <ch...@google.com>
wrote:

> Sounds great. Looks like operation of the Snowflake source will be similar
> to BigQuery source (export files to GCS and read files). This will allow
> you to better parallelize reading (current JDBC source is limited to one
> worker when reading).
>
> Seems like you already support initial splitting using files -
> https://github.com/PolideaInternal/beam/blob/snowflake-io/sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java#L374
> Prob. also consider supporting dynamic work rebalancing when runners
> support this through SDF.
>
> Thanks,
> Cham
>
>
>
>
> On Mon, Mar 23, 2020 at 9:49 AM Alexey Romanenko <ar...@gmail.com>
> wrote:
>
>> Great! This is always welcomed to have more IOs in Beam. I’d be happy to
>> take look on your PR once it will be created.
>>
>> Just a couple of questions for now.
>>
>> 1) Afaik, you can connect to Snowflake using standard JDBC driver. Do you
>> plan to compare a performance between this SnowflakeIO and Beam JdbcIO?
>> 2) Are you going to support staging in other locations, like S3 and Azure?
>> 3) Does “ withSchema()” allows to infer Snowflake schema to Beam schema?
>>
>> On 23 Mar 2020, at 15:23, Katarzyna Kucharczyk <ka...@gmail.com>
>> wrote:
>>
>> Hi all,
>>
>> Me and my colleagues have developed a new Java connector for Snowflake
>> that we would like to add to Beam.
>>
>> Snowflake is an analytic data warehouse provided as Software-as-a-Service
>> (SaaS). It uses a new SQL database engine with a unique architecture
>> designed for the cloud. To read more details please check [1] and [2].
>>
>> Proposed Snowflake IOs use JDBC Snowflake library [3]. The IOs are batch
>> write and batch read that use the Snowflake COPY [4] operation underneath.
>> In both cases ParDo IOs load files on a stage and then they are inserted
>> into the Snowflake table of choice using the COPY API. The currently
>> supported stage is Google Cloud Storage[5].
>>
>> The schema how Snowflake Read IO works (write operation works similarly
>> but in opposite direction):
>> Here is an Apache Beam fork [6] with current work of the Snowflake IO.
>>
>> In the near future we would like to also add IO for writing streams which
>> will use SnowPipe - Snowflake mechanism for continuous loading[7]. Also, we
>> would like to use cross language to provide Python connectors as well.
>>
>> We are open for all opinions and suggestions. In case of any
>> questions/comments please do not hesitate to post them.
>>
>> In case of no objection I will create jira tickets and share them in this
>> thread. Cheers, Kasia
>>
>> [1] https://www.snowflake.com
>> [2] https://docs.snowflake.net/manuals/user-guide/intro-key-concepts.html
>>
>> [3] https://docs.snowflake.net/manuals/user-guide/jdbc.html
>> [4] https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
>> [5]
>> https://github.com/PolideaInternal/beam/tree/snowflake-io/sdks/java/io/snowflake
>>
>> [6] https://cloud.google.com/storage
>> [7] https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe.html
>>
>>
>>
>>

-- 
Elias Djurfeldt
Mirado Consulting

Re: [PROPOSAL] Snowflake Java Connector for Apache Beam

Posted by Chamikara Jayalath <ch...@google.com>.
Sounds great. It looks like the operation of the Snowflake source will be
similar to the BigQuery source (export files to GCS and read the files). This
will allow you to better parallelize reading (the current JDBC source is
limited to one worker when reading).

It seems you already support initial splitting using files -
https://github.com/PolideaInternal/beam/blob/snowflake-io/sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java#L374
Probably also consider supporting dynamic work rebalancing when runners
support this through SDF.
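
For context, the static per-file split can be sketched roughly like this
(p, stagedFilePaths, and ReadStagedFileFn are placeholders; ReadStagedFileFn
would be a DoFn that parses one staged file). Dynamic work rebalancing
through SDF would go beyond this static split:

    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Reshuffle;
    import org.apache.beam.sdk.values.PCollection;

    // One element per file that COPY unloaded to the stage.
    PCollection<String> files = p.apply(Create.of(stagedFilePaths));
    PCollection<String> rows = files
        .apply(Reshuffle.viaRandomKey())           // break fusion so files spread across workers
        .apply(ParDo.of(new ReadStagedFileFn()));  // parse one staged file per element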

Thanks,
Cham




On Mon, Mar 23, 2020 at 9:49 AM Alexey Romanenko <ar...@gmail.com>
wrote:

> Great! This is always welcomed to have more IOs in Beam. I’d be happy to
> take look on your PR once it will be created.
>
> Just a couple of questions for now.
>
> 1) Afaik, you can connect to Snowflake using standard JDBC driver. Do you
> plan to compare a performance between this SnowflakeIO and Beam JdbcIO?
> 2) Are you going to support staging in other locations, like S3 and Azure?
> 3) Does “ withSchema()” allows to infer Snowflake schema to Beam schema?
>
> On 23 Mar 2020, at 15:23, Katarzyna Kucharczyk <ka...@gmail.com>
> wrote:
>
> Hi all,
>
> Me and my colleagues have developed a new Java connector for Snowflake
> that we would like to add to Beam.
>
> Snowflake is an analytic data warehouse provided as Software-as-a-Service
> (SaaS). It uses a new SQL database engine with a unique architecture
> designed for the cloud. To read more details please check [1] and [2].
>
> Proposed Snowflake IOs use JDBC Snowflake library [3]. The IOs are batch
> write and batch read that use the Snowflake COPY [4] operation underneath.
> In both cases ParDo IOs load files on a stage and then they are inserted
> into the Snowflake table of choice using the COPY API. The currently
> supported stage is Google Cloud Storage[5].
>
> The schema how Snowflake Read IO works (write operation works similarly
> but in opposite direction):
> Here is an Apache Beam fork [6] with current work of the Snowflake IO.
>
> In the near future we would like to also add IO for writing streams which
> will use SnowPipe - Snowflake mechanism for continuous loading[7]. Also, we
> would like to use cross language to provide Python connectors as well.
>
> We are open for all opinions and suggestions. In case of any
> questions/comments please do not hesitate to post them.
>
> In case of no objection I will create jira tickets and share them in this
> thread. Cheers, Kasia
>
> [1] https://www.snowflake.com
> [2] https://docs.snowflake.net/manuals/user-guide/intro-key-concepts.html
> [3] https://docs.snowflake.net/manuals/user-guide/jdbc.html
> [4] https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
> [5]
> https://github.com/PolideaInternal/beam/tree/snowflake-io/sdks/java/io/snowflake
>
> [6] https://cloud.google.com/storage
> [7] https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe.html
>
>
>

Re: [PROPOSAL] Snowflake Java Connector for Apache Beam

Posted by Alexey Romanenko <ar...@gmail.com>.
Great! It is always welcome to have more IOs in Beam. I’d be happy to take a look at your PR once it is created.

Just a couple of questions for now.

1) Afaik, you can connect to Snowflake using the standard JDBC driver. Do you plan to compare performance between this SnowflakeIO and Beam JdbcIO? (A baseline sketch follows below.)
2) Are you going to support staging in other locations, like S3 and Azure?
3) Does “withSchema()” allow inferring a Beam schema from the Snowflake schema?
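
For reference, the JdbcIO baseline I have in mind for the comparison in 1)
would look roughly like this (a fragment; the connection details, query, and
table are placeholders):

    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.coders.VarLongCoder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // Plain JDBC read through the Snowflake driver; the query runs on a single
    // worker, which is what the COPY-based SnowflakeIO is meant to avoid.
    PCollection<KV<Long, String>> rows = p.apply(JdbcIO.<KV<Long, String>>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "net.snowflake.client.jdbc.SnowflakeDriver",
                "jdbc:snowflake://myaccount.snowflakecomputing.com/?db=MY_DB&schema=PUBLIC")
            .withUsername("user")
            .withPassword("password"))
        .withQuery("SELECT id, name FROM my_table")
        .withRowMapper(rs -> KV.of(rs.getLong("ID"), rs.getString("NAME")))
        .withCoder(KvCoder.of(VarLongCoder.of(), StringUtf8Coder.of())));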

> On 23 Mar 2020, at 15:23, Katarzyna Kucharczyk <ka...@gmail.com> wrote:
> 
> Hi all,
> 
> Me and my colleagues have developed a new Java connector for Snowflake that we would like to add to Beam.
> 
> Snowflake is an analytic data warehouse provided as Software-as-a-Service (SaaS). It uses a new SQL database engine with a unique architecture designed for the cloud. To read more details please check [1] and [2].
> 
> Proposed Snowflake IOs use JDBC Snowflake library [3]. The IOs are batch write and batch read that use the Snowflake COPY [4] operation underneath. In both cases ParDo IOs load files on a stage and then they are inserted into the Snowflake table of choice using the COPY API. The currently supported stage is Google Cloud Storage[5].
> 
> The schema how Snowflake Read IO works (write operation works similarly but in opposite direction):
> 
> 
> 
> Here is an Apache Beam fork [6] with current work of the Snowflake IO.
> 
> In the near future we would like to also add IO for writing streams which will use SnowPipe - Snowflake mechanism for continuous loading[7]. Also, we would like to use cross language to provide Python connectors as well.
> 
> We are open for all opinions and suggestions. In case of any questions/comments please do not hesitate to post them.
> 
> In case of no objection I will create jira tickets and share them in this thread.
> 
> Cheers,
> Kasia
> 
> [1] https://www.snowflake.com <https://www.snowflake.com/> 
> [2] https://docs.snowflake.net/manuals/user-guide/intro-key-concepts.html <https://docs.snowflake.net/manuals/user-guide/intro-key-concepts.html> 
> [3] https://docs.snowflake.net/manuals/user-guide/jdbc.html <https://docs.snowflake.net/manuals/user-guide/jdbc.html> 
> [4] https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html <https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html> 
> [5] https://github.com/PolideaInternal/beam/tree/snowflake-io/sdks/java/io/snowflake <https://github.com/PolideaInternal/beam/tree/snowflake-io/sdks/java/io/snowflake> 
> [6] https://cloud.google.com/storage <https://cloud.google.com/storage> 
> [7] https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe.html <https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe.html> 
>