You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/11/15 16:04:37 UTC

[GitHub] [airflow] jplauri opened a new pull request #19596: MsSqlHook: implement _generate_insert_sql

jplauri opened a new pull request #19596:
URL: https://github.com/apache/airflow/pull/19596


   Related: https://github.com/apache/airflow/discussions/19583
   
   For prepared SQL statements, MSSQL uses `?` rather than `%s` for parameter markers. Currently, [MsSqlHook](https://airflow.apache.org/docs/apache-airflow-providers-microsoft-mssql/stable/_api/airflow/providers/microsoft/mssql/hooks/mssql/index.html) does not provide an implementation of `_generate_insert_sql` used by `insert_rows` making it impossible to use for instance [GenericTransfer](https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/generic_transfer/index.html) with the said hooks.
   
   As written, the PR implements `_generate_insert_sql` for MsSqlHook but in a limited way. That is, it will only allow us to do INSERTs (so when `replace=False`) but not UPSERTs (when `replace=True`). I would love to get some feedback and ideas for implementing the UPSERT even though it is not useful for my current use-case.
   
   Note that MSSQL does not support anything like "REPLACE INTO" for UPSERT. Instead, there seem to be two alternatives:
   
   - Try to do an update first and check `@@ROWCOUNT`: in case nothing was updated, do an insert. This is explained [here by Aaron Bertrand](https://sqlperformance.com/2020/09/locking/upsert-anti-pattern).
   
   - Use `MERGE`, many hits on StackOverflow such [this one](https://stackoverflow.com/q/2479488/551375) explaining the syntax.
   
   It seems that Aaron B. argues for the pattern in the first link because of several issues with `MERGE` (see [list of issues](https://www.mssqltips.com/sqlservertip/3074/use-caution-with-sql-servers-merge-statement/)).
   
   I spent a bit of time trying to implement either approach here, but got stuck after realizing it's not that straightforward to re-use parameter markers `?` from `values` from pyodbc `cursor.execute(sql, values)`. In any case, I was attempting to draw some inspiration from `_generate_insert_sql` of [PostgresHook](https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/_modules/airflow/providers/postgres/hooks/postgres.html#PostgresHook). Thus, this might need some work beyond just implementing `_generate_insert_sql` unless I'm missing something.
   
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/main/UPDATING.md).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jplauri edited a comment on pull request #19596: MsSqlHook: implement _generate_insert_sql

Posted by GitBox <gi...@apache.org>.
jplauri edited a comment on pull request #19596:
URL: https://github.com/apache/airflow/pull/19596#issuecomment-970480520


   > @jplauri regarding your question about merge vs insert, i think merge is the right call. That list of issues is pretty ancient now. Merge was implemented in 2008 and by now, I think most of the meaningful kinks have been ironed out. I would bet that the rowcount approach would be slow since it limits you to row-by-row (where as with merge you could, e.g. load into a temp table and merge into target with deduping logic too)
   > 
   > But I don't think you need to tackle that in this PR. One thing at a time. Can do this limited case first.
   
   Thanks! And yes, I would be fine with this limited solution too. I mean, GenericTransfer is meant to be used for small datasets anyway as it works in-mem, so I can imagine there are cases where doing say truncate & insert all is just fine instead of replace. Perhaps there are other use cases where replace is useful, but anyway.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jplauri commented on pull request #19596: MsSqlHook: implement _generate_insert_sql

Posted by GitBox <gi...@apache.org>.
jplauri commented on pull request #19596:
URL: https://github.com/apache/airflow/pull/19596#issuecomment-970480520


   > @jplauri regarding your question about merge vs insert, i think merge is the right call. That list of issues is pretty ancient now. Merge was implemented in 2008 and by now, I think most of the meaningful kinks have been ironed out. I would bet that the rowcount approach would be slow since it limits you to row-by-row (where as with merge you could, e.g. load into a temp table and merge into target with deduping logic too)
   > 
   > But I don't think you need to tackle that in this PR. One thing at a time. Can do this limited case first.
   
   Thanks! And yes, I would be fine with this limited solution too. I mean, GenericTransfer is meant to be used for small datasets anyway as it works in-mem, so I can imagine there are cases where doing say truncate & insert all is just fine instead of replace.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eladkal commented on a change in pull request #19596: MsSqlHook: implement _generate_insert_sql

Posted by GitBox <gi...@apache.org>.
eladkal commented on a change in pull request #19596:
URL: https://github.com/apache/airflow/pull/19596#discussion_r795188792



##########
File path: tests/operators/test_generic_transfer.py
##########
@@ -26,6 +26,7 @@
 from airflow.models.dag import DAG
 from airflow.operators.generic_transfer import GenericTransfer
 from airflow.providers.mysql.hooks.mysql import MySqlHook
+from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook

Review comment:
       ```suggestion
   from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook
   from airflow.providers.mysql.hooks.mysql import MySqlHook
   ```
   
   To fix static tests




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on pull request #19596: MsSqlHook: implement _generate_insert_sql

Posted by GitBox <gi...@apache.org>.
dstandish commented on pull request #19596:
URL: https://github.com/apache/airflow/pull/19596#issuecomment-970495398


   >  I mean, GenericTransfer is meant to be used for small datasets anyway as it works in-mem
   
   Yeah I wasn't aware of that, this is my first look at generic transfer. 
   
   > so I can imagine there are cases where doing say truncate & insert all is just fine instead of replace
   
   Yeah this is one of the challenges with trying to build an interface like this.  Replace can mean "replace rows (i.e. when collision by PK)" or "replace table".  And in the base implementation, it means by row.  But you are talking by table.  So there's ambiguity.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on pull request #19596: MsSqlHook: implement _generate_insert_sql

Posted by GitBox <gi...@apache.org>.
dstandish commented on pull request #19596:
URL: https://github.com/apache/airflow/pull/19596#issuecomment-970474540


   @jplauri regarding your question about merge vs insert, i think merge is the right call.  That list of issues  is pretty ancient now.  Merge was implemented in 2008 and by now, I think most of the meaningful kinks have been ironed out.  I would bet that the rowcount approach would be slow since it limits you to row-by-row.
   
   But I don't think you need to tackle that in this PR.  One thing at a time.  Can do this limited case first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jplauri commented on pull request #19596: MsSqlHook: implement _generate_insert_sql

Posted by GitBox <gi...@apache.org>.
jplauri commented on pull request #19596:
URL: https://github.com/apache/airflow/pull/19596#issuecomment-983930377


   @dstandish Not really sure if I got it right - I basically copied from TestPostgres with only superficial understanding. But intuitively it makes sense - running `test_mssql_to_mssql` will fail unless `_generate_insert_sql` is at least syntactically correct. What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on pull request #19596: MsSqlHook: implement _generate_insert_sql

Posted by GitBox <gi...@apache.org>.
dstandish commented on pull request #19596:
URL: https://github.com/apache/airflow/pull/19596#issuecomment-982677938


   > @dstandish Would you be able to advise me further here
   
   i'll review today


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish edited a comment on pull request #19596: MsSqlHook: implement _generate_insert_sql

Posted by GitBox <gi...@apache.org>.
dstandish edited a comment on pull request #19596:
URL: https://github.com/apache/airflow/pull/19596#issuecomment-970474540


   @jplauri regarding your question about merge vs insert, i think merge is the right call.  That list of issues  is pretty ancient now.  Merge was implemented in 2008 and by now, I think most of the meaningful kinks have been ironed out.  I would bet that the rowcount approach would be slow since it limits you to row-by-row (where as with merge you could, e.g. load into a temp table and merge into target with deduping logic too)
   
   But I don't think you need to tackle that in this PR.  One thing at a time.  Can do this limited case first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on pull request #19596: MsSqlHook: implement _generate_insert_sql

Posted by GitBox <gi...@apache.org>.
uranusjr commented on pull request #19596:
URL: https://github.com/apache/airflow/pull/19596#issuecomment-1012837301


   Let’s see if tests can run now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jplauri commented on pull request #19596: MsSqlHook: implement _generate_insert_sql

Posted by GitBox <gi...@apache.org>.
jplauri commented on pull request #19596:
URL: https://github.com/apache/airflow/pull/19596#issuecomment-979369199


   @dstandish Would you be able to advise me further here, i.e., is there something more the PR should have? Should we just wait for a review/merge, or what's the process? I'm in no rush, but I believe this addition would be helpful for some people.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on pull request #19596: MsSqlHook: implement _generate_insert_sql

Posted by GitBox <gi...@apache.org>.
uranusjr commented on pull request #19596:
URL: https://github.com/apache/airflow/pull/19596#issuecomment-1025506461


   I merged main for #21221.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on pull request #19596: MsSqlHook: implement _generate_insert_sql

Posted by GitBox <gi...@apache.org>.
dstandish commented on pull request #19596:
URL: https://github.com/apache/airflow/pull/19596#issuecomment-970500140


   Oh... _generate_insert_sql already _is_ row-by-row.  I assumed it is row-based. Yeah in that case def small datasets / limited use cases.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on pull request #19596: MsSqlHook: implement _generate_insert_sql

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on pull request #19596:
URL: https://github.com/apache/airflow/pull/19596#issuecomment-969059859


   Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
   Here are some useful points:
   - Pay attention to the quality of your code (flake8, mypy and type annotations). Our [pre-commits]( https://github.com/apache/airflow/blob/main/STATIC_CODE_CHECKS.rst#prerequisites-for-pre-commit-hooks) will help you with that.
   - In case of a new feature add useful documentation (in docstrings or in `docs/` directory). Adding a new operator? Check this short [guide](https://github.com/apache/airflow/blob/main/docs/apache-airflow/howto/custom-operator.rst) Consider adding an example DAG that shows how users should use it.
   - Consider using [Breeze environment](https://github.com/apache/airflow/blob/main/BREEZE.rst) for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
   - Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
   - Please follow [ASF Code of Conduct](https://www.apache.org/foundation/policies/conduct) for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
   - Be sure to read the [Airflow Coding style]( https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#coding-style-and-best-practices).
   Apache Airflow is a community-driven project and together we are making it better 🚀.
   In case of doubts contact the developers at:
   Mailing List: dev@airflow.apache.org
   Slack: https://s.apache.org/airflow-slack
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org