Posted to user@beam.apache.org by Dennis <de...@gmail.com> on 2020/07/17 16:21:29 UTC

Connecting To MSSQL Server With Apache Beam, Python SDK

Hello,



I'm writing to inquire about developing a pipeline (using the Python SDK)
with multiple PTransforms that can read from, write to, and alter data in an
MSSQL server.



I've been using beam-nuggets (<https://pypi.org/project/beam-nuggets/>), a
community I/O Connector for dealing with these kinds of PTransforms for a
MySQL server, and was looking to see if there's an option to do this for
MSSQL.



So far, I've been able to run a pipeline with DirectRunner that reads data
from MSSQL using pyodbc. While this is a good starting point, it's not running
with DataflowRunner (even after configuring Private IP), and it's not
parallelized.
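
For reference, the core of my current read is roughly the following (the
connection string, query, and table are placeholders):

import apache_beam as beam
import pyodbc

# Placeholder ODBC connection string for the MSSQL instance.
CONN_STR = ('DRIVER={ODBC Driver 17 for SQL Server};'
            'SERVER=my-server;DATABASE=my-db;UID=user;PWD=password')

class ReadFromMSSQL(beam.DoFn):
    def process(self, query):
        conn = pyodbc.connect(CONN_STR)
        try:
            cursor = conn.cursor()
            cursor.execute(query)
            columns = [col[0] for col in cursor.description]
            for row in cursor:
                yield dict(zip(columns, row))
        finally:
            conn.close()

with beam.Pipeline() as p:  # DirectRunner by default
    rows = (p
            | beam.Create(['SELECT * FROM my_table'])
            | beam.ParDo(ReadFromMSSQL()))

Since the whole query is carried by a single element, all of the reading
happens in one DoFn call, which is why it isn't parallelized.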



I tried to look into SQLAlchemy, but it seems that there isn't as much support
as there is for MySQL, especially for the insertion method. The default
insertion method is expected to be an upsert. For MySQL, this was implemented
using:



from sqlalchemy.dialects.mysql import insert as mysql_insert



There is no such package available for MSSQL...
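
For context, the MySQL-dialect upsert looks roughly like this, and the closest
equivalent I've found for MSSQL would be issuing a raw MERGE statement (the
table and column names below are only illustrative):

# MySQL: dialect-level upsert (the construct used for the MySQL write path)
from sqlalchemy.dialects.mysql import insert as mysql_insert

stmt = mysql_insert(my_table).values(id=1, name='a')  # my_table is an existing Table object
stmt = stmt.on_duplicate_key_update(name=stmt.inserted.name)

# MSSQL: no equivalent dialect helper, so an upsert would need raw SQL, e.g.:
from sqlalchemy import text

merge_stmt = text("""
    MERGE INTO my_table AS target
    USING (VALUES (:id, :name)) AS source (id, name)
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET name = source.name
    WHEN NOT MATCHED THEN INSERT (id, name) VALUES (source.id, source.name);
""")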



How would one go about doing this? I've looked at several Stack Overflow
posts, but none of the solutions there offered functionality similar to that
of beam-nuggets. Perhaps I missed a solution?



I realize that this is a loaded question, so I greatly appreciate any help in
advance.



Thanks,

Dennis



P.S. I had trouble adding my work email address, dzvigelsky@questrade.com, to
the mailing list (even though I went through the same steps to subscribe as
with this one). Could you please add it? Thanks.


Re: Connecting To MSSQL Server With Apache Beam, Python SDK

Posted by Dennis Zvigelsky <de...@gmail.com>.
Hi Luke,

Thank you for your help. If I have any further questions, I'll let you know.

All the best,
Dennis


Re: Connecting To MSSQL Server With Apache Beam, Python SDK

Posted by Luke Cwik <lc...@google.com>.
There is typically no trivial way to have SQL queries partitioned to
parallelize the reading automatically, since you typically can't have
multiple "readers" connect and use the same query results. There are a
couple of options that you could use:
1) Use a single query, use the database's ability to dump the contents to a
file, and then perform a parallel read over the file's contents
2) Partition your query over a primary key and submit N queries, where N is
the number of parallel reads you want (alternatively, dump to a temp table
and partition your query); see the sketch after this list
3) Write your own splittable DoFn that can partition a query arbitrarily
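
A rough sketch of option 2, assuming an integer primary key column named id
with known bounds (the connection string, table, and numbers are all
placeholders):

import apache_beam as beam
import pyodbc

CONN_STR = '...'  # placeholder ODBC connection string
NUM_PARTITIONS = 8
MIN_ID, MAX_ID = 0, 1000000  # known or pre-queried key bounds

def key_ranges():
    # Split [MIN_ID, MAX_ID] into NUM_PARTITIONS half-open ranges.
    step = (MAX_ID - MIN_ID) // NUM_PARTITIONS + 1
    for lo in range(MIN_ID, MAX_ID + 1, step):
        yield (lo, min(lo + step, MAX_ID + 1))

class ReadRange(beam.DoFn):
    def process(self, bounds):
        lo, hi = bounds
        conn = pyodbc.connect(CONN_STR)
        try:
            cursor = conn.cursor()
            cursor.execute(
                'SELECT * FROM my_table WHERE id >= ? AND id < ?', lo, hi)
            columns = [c[0] for c in cursor.description]
            for row in cursor:
                yield dict(zip(columns, row))
        finally:
            conn.close()

with beam.Pipeline() as p:
    rows = (p
            | beam.Create(list(key_ranges()))  # N range elements
            | beam.Reshuffle()                 # encourage the runner to spread ranges across workers
            | beam.ParDo(ReadRange()))         # the N reads can now run in parallel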

Do you see any meaningful errors in the logs when running your pipeline on
Dataflow? If it's an issue of pyodbc not being installed on the remote
worker, you should check out how to manage pipeline dependencies[1].
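
For example, if the missing piece is just the Python package, something like
the following ships pyodbc with the job (project, region, and bucket are
placeholders; note that pyodbc also needs the native ODBC driver on the
workers, which the dependencies page covers under non-Python dependencies):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                  # placeholder
    region='us-central1',                  # placeholder
    temp_location='gs://my-bucket/tmp',    # placeholder
    requirements_file='requirements.txt',  # a file listing pyodbc
)

with beam.Pipeline(options=options) as p:
    ...  # same transforms as on DirectRunner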

Option 3 has the best parallelism potential, followed by 1, while 2 is the
easiest to get working, followed by 1 (assuming that the dump file format is
already supported by Beam). On the writing side, you can use
GroupIntoBatches[2] to have meaningfully large transactions, or try to use
the JdbcIO write transform when it becomes available in a cross-language way.
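
A rough sketch of the batched write with GroupIntoBatches (it needs keyed
input; the key choice, batch size, table, and columns below are placeholders):

import apache_beam as beam
import pyodbc
from apache_beam.transforms.util import GroupIntoBatches

CONN_STR = '...'  # placeholder ODBC connection string

class WriteBatchToMSSQL(beam.DoFn):
    def process(self, keyed_batch):
        _, rows = keyed_batch  # GroupIntoBatches emits (key, batch-of-values)
        conn = pyodbc.connect(CONN_STR)
        try:
            cursor = conn.cursor()
            cursor.executemany(
                'INSERT INTO my_table (id, name) VALUES (?, ?)',
                [(r['id'], r['name']) for r in rows])
            conn.commit()  # one transaction per batch
        finally:
            conn.close()

# rows_to_write is a PCollection of dicts like {'id': ..., 'name': ...}
(rows_to_write
 | beam.Map(lambda row: (row['id'] % 10, row))  # arbitrary key; also bounds write parallelism
 | GroupIntoBatches(500)
 | beam.ParDo(WriteBatchToMSSQL()))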

There is ongoing work[3, 4, 5] to make the existing JdbcIO connector in Beam
Java available to Python as a cross-language transform.

I don't know of another way to get on the mailing list other than sending an
e-mail to user-subscribe@beam.apache.org.

1: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
2:
https://github.com/apache/beam/blob/62118fa66788ad45032a60abc30799cd6f0e4658/sdks/python/apache_beam/transforms/util.py#L744
3: https://github.com/apache/beam/pull/12145
4: https://issues.apache.org/jira/browse/BEAM-10135
5: https://issues.apache.org/jira/browse/BEAM-10136
