You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/04/20 08:56:50 UTC

[GitHub] [airflow] eladkal opened a new issue, #23112: Handle batch queries in `DbApiHook`?

eladkal opened a new issue, #23112:
URL: https://github.com/apache/airflow/issues/23112

   ### Body
   
   **TL;DR:**
   Multiple providers need to handle the logic of splitting SQL statements
   [Snowflake.run()](https://github.com/apache/airflow/blob/501a3c3fbefbcc0d6071a00eb101110fc4733e08/airflow/providers/snowflake/hooks/snowflake.py#L287-L347) , [RedShift.run()](https://github.com/apache/airflow/pull/22391/files) and it will be very similar should be eventually create `PrestoOperator` or `TrinoOperator`.
   
   I want to discuss the idea of extracting the split statement logic into `DbApiHook`.
   e.g create a `DbApiHook.run_many()` that will split the statements and use `DbApiHook.run()` under the hood.
   
   On one hand it will reduce duplicate code and allow to develop operators for these providers more easily
   On the other hand if we will make it in `dbapihook` it means to bump minimum supported Airflow version for these providers (once they will actually use the new function)?
   
   ----------------------------
   **The long story:**
   Some DBs like MySQL, PostgreSql, MsSQL are able to run a batch of SQL statements:
   `SELECT 1; SELECT2;`
   
   So you can do:
   `MySQLOperator(.., sql='SELECT 1; SELECT2;')`
   it will invoke :
   https://github.com/apache/airflow/blob/501a3c3fbefbcc0d6071a00eb101110fc4733e08/airflow/hooks/dbapi.py#L163
   
   which will execute the sql in a single query. (The run function will convert this sql to `[SELECT 1; SELECT2;]`  - Note this is a list with **1 cell**)
   
   
   But in other DBs like: Snowflake, Redshift, Trino, Presto
   
   You can not run `SELECT 1; SELECT2;` as is. You must split the statement.
   This makes the `run()` irrelevant for these DBs.
   
   
   For example in Snowflake when users pass `SELECT 1; SELECT2;` what we actually do is create:
   `['SELECT 1;', 'SELECT2;']` - Note this is a list with **2 cells**
   https://github.com/apache/airflow/blob/501a3c3fbefbcc0d6071a00eb101110fc4733e08/airflow/providers/snowflake/hooks/snowflake.py#L314-L316
   
   
   **The problem?**
   These providers (Snowflake, Redshift, Trino, Presto) **are forced to override dbapi.run()** but they all actually need the exact same functionality.
   
   If you will take a look at [Snowflake.run()](https://github.com/apache/airflow/blob/501a3c3fbefbcc0d6071a00eb101110fc4733e08/airflow/providers/snowflake/hooks/snowflake.py#L287-L347) you will see that it's almost identical to [RedShift.run()](https://github.com/apache/airflow/pull/22391/files) and it will be very similar should be eventually create `PrestoOperator` or `TrinoOperator`.
   
   
   This is something already brought up earlier in https://github.com/apache/airflow/pull/15533#discussion_r620481685
   
   
   **So why not just set the sql param to accept list only?**
   Because we want to support using .sql file so we must handle the split of the statements.
   ```
   SomeSqlOperator(
   ...,
   sql=my_queries.sql
   )
   ```
   We are reading the content of the file and loading it as a single statement.
   
   
   --------------------------
   
   **What do others think?**
   Should we create `DbApiHook.run_many()` or leave it as it is today where every provider needs to handle the logic on it's own?
   
   
   
   
   
   
   ### Committer
   
   - [X] I acknowledge that I am a maintainer/committer of the Apache Airflow project.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on issue #23112: Handle batch queries in `DbApiHook`?

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #23112:
URL: https://github.com/apache/airflow/issues/23112#issuecomment-1193256779

   closed by https://github.com/apache/airflow/pull/23971 (added the `split_statements` parameter to `run()` )


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] williesb commented on issue #23112: Handle batch queries in `DbApiHook`?

Posted by GitBox <gi...@apache.org>.
williesb commented on issue #23112:
URL: https://github.com/apache/airflow/issues/23112#issuecomment-1116450069

   What is the tentative ETA on this?  I am stuck at the moment not being able to run multiple statements in a single connection/session with the TrinoHook because the run implementation does not take a list like DBApiHook does.
   
   Can we please make sure that this new method uses a single connection/session to run all the statements like the current run methods does with statement lists?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] josh-fell commented on issue #23112: Handle batch queries in `DbApiHook`?

Posted by GitBox <gi...@apache.org>.
josh-fell commented on issue #23112:
URL: https://github.com/apache/airflow/issues/23112#issuecomment-1104030981

   +1 and completely agree with @potiuk. An underlying standard would be very useful and less repetitive.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal closed issue #23112: Handle batch queries in `DbApiHook`?

Posted by GitBox <gi...@apache.org>.
eladkal closed issue #23112: Handle batch queries in `DbApiHook`?
URL: https://github.com/apache/airflow/issues/23112


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #23112: Handle batch queries in `DbApiHook`?

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #23112:
URL: https://github.com/apache/airflow/issues/23112#issuecomment-1103927111

   It would be good to standardize it indeed. And temporary implementation that would use the function (if existing in Aifrlow) and falling back to internal, deprecated implementation is the way to go. We can even add pre-commit guarding the proper use of (run_many) - making sure that there is a conditional code in provider that falls back if run_many is used (new operators will use it automatically when present I believe).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on issue #23112: Handle batch queries in `DbApiHook`?

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #23112:
URL: https://github.com/apache/airflow/issues/23112#issuecomment-1179674481

   We are waiting for https://github.com/apache/airflow/pull/23971 to be finished first to see what actions needs to be taken here


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] williesb commented on issue #23112: Handle batch queries in `DbApiHook`?

Posted by GitBox <gi...@apache.org>.
williesb commented on issue #23112:
URL: https://github.com/apache/airflow/issues/23112#issuecomment-1118651113

   > @williesb currently no one is working on this. Would you like to start a PR?
   
   @eladkal It sounds like a tempting idea.  Your original post makes sense.  However, thinking more about it, if it was implemented as you suggest that the `DbApiHook.run_many()` would split the statements and use `DbApiHook.run()` under the hood.  That probably wouldn't fix my problem (I am not 100% not a Python expert) because the `TrinoHook.run()` methods does not work with lists.  I wish @uranusjr would have closed the issue I opened so quickly https://github.com/apache/airflow/issues/23431.  That I can fix quickly.  This one might have some nuances that are a bit complicated for me to resolve on my own.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] kevgeo commented on issue #23112: Handle batch queries in `DbApiHook`?

Posted by GitBox <gi...@apache.org>.
kevgeo commented on issue #23112:
URL: https://github.com/apache/airflow/issues/23112#issuecomment-1179585290

   Hi @eladkal, could I take up this issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on issue #23112: Handle batch queries in `DbApiHook`?

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #23112:
URL: https://github.com/apache/airflow/issues/23112#issuecomment-1116456639

   @williesb currently no one is working on this. Would you like to start a PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] pateash commented on issue #23112: Handle batch queries in `DbApiHook`?

Posted by GitBox <gi...@apache.org>.
pateash commented on issue #23112:
URL: https://github.com/apache/airflow/issues/23112#issuecomment-1157697282

   Hi @eladkal, I would like to check this out.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on issue #23112: Handle batch queries in `DbApiHook`?

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #23112:
URL: https://github.com/apache/airflow/issues/23112#issuecomment-1118663550

   > I wish @uranusjr wouldn't have closed the issue I opened so quickly #23431. That I can fix quickly. This one might have some nuances that are a bit complicated for me to resolve on my own.
   
   When we spot a problem we prefer to fix it from the root.
   Supporting temporary fixes means holding in with the actual fix and I see no reason to do so.
   
   In your case, you are not blocked. You can create custom hook and operator to reaolve your issue. I actually agree that we should not try to solve this problem per provider.
   
   You can start a draft PR to handle this issue. During review we can discuss the edge cases and see what we can do about them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org