You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/03/21 02:54:06 UTC

[GitHub] [airflow] pohek321 opened a new pull request #22391: Update the redshift sql operator to accept multiple sql statements

pohek321 opened a new pull request #22391:
URL: https://github.com/apache/airflow/pull/22391


   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of existing issue, reference it using one of the following:
   
   closes: #22390
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/main/UPDATING.md).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating edited a comment on pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
SamWheating edited a comment on pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#issuecomment-1074400949


   I think that the base `DbAPIHook` already supports a list of statements, and the type of `sql` can be either a string or an iterable.
   
   https://github.com/apache/airflow/blob/900bad1c67654252196bb095a2a150a23ae5fc9a/airflow/hooks/dbapi.py#L163-L196
   
   Can't we just supply a list of SQL statements to the existing operator?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#issuecomment-1079948207


   Static checks need fixing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] pohek321 commented on a change in pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
pohek321 commented on a change in pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#discussion_r831430895



##########
File path: airflow/providers/amazon/aws/operators/redshift_sql.py
##########
@@ -74,5 +75,12 @@ def get_hook(self) -> RedshiftSQLHook:
     def execute(self, context: 'Context') -> None:
         """Execute a statement against Amazon Redshift"""
         self.log.info(f"Executing statement: {self.sql}")
+        sql_stmts = sqlparse.split(self.sql)
         hook = self.get_hook()
-        hook.run(self.sql, autocommit=self.autocommit, parameters=self.parameters)
+
+        with hook.get_conn() as con:
+            con.autocommit = self.autocommit
+            with con.cursor() as cursor:
+                for stmt in sql_stmts:
+                    cursor.execute(stmt)
+                    if self.autocommit is False: con.commit()

Review comment:
       @eladkal I went ahead and moved lines 81-86 to the RedshiftSQLHook as a sub method and called it in the RedshiftSQLOperator. Is this the right approach that you were envisioning?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr edited a comment on pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
uranusjr edited a comment on pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#issuecomment-1073433313


   Pleas fix the static checks. Also, you need to add `sqlparse` to the requirements; search for `sqlparse` in the code base to find where you need to add it (it’s currently used by the Apache Drill provider).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] pohek321 commented on pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
pohek321 commented on pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#issuecomment-1075661955


   @eladkal I should've noticed your reference to the Snowflake one, that makes sense. I went ahead and committed something similar to the RedshiftSQLHook. Can you take a look and provide feedback?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eladkal edited a comment on pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
eladkal edited a comment on pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#issuecomment-1074404405


   > Can't we just supply a list of SQL statements to the existing operator?
   
   Yes but the goal is to support users who pass string composed of multiple query like:
   
   `SELECT 1; SELECT 2;`
   Most DBs knows how to execute it without any problem (MySQL, PostgresSQL etc..) others like Trino, Presto, Snowflake, Redshift can handle only 1 statement per query (most of the times? I guess there are exceptions with Do blocks and maybe others?)
   
   anyway we can always ask users to pass it as `['SELECT 1', 'SELECT 2']` but that is not that nice and also note that you can't use it with templating. So if you want to use `my_queries.sql ` you can't do that with passing list.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eladkal commented on a change in pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
eladkal commented on a change in pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#discussion_r831441457



##########
File path: airflow/providers/amazon/aws/operators/redshift_sql.py
##########
@@ -74,5 +75,12 @@ def get_hook(self) -> RedshiftSQLHook:
     def execute(self, context: 'Context') -> None:
         """Execute a statement against Amazon Redshift"""
         self.log.info(f"Executing statement: {self.sql}")
+        sql_stmts = sqlparse.split(self.sql)
         hook = self.get_hook()
-        hook.run(self.sql, autocommit=self.autocommit, parameters=self.parameters)
+
+        with hook.get_conn() as con:
+            con.autocommit = self.autocommit
+            with con.cursor() as cursor:
+                for stmt in sql_stmts:
+                    cursor.execute(stmt)
+                    if self.autocommit is False: con.commit()

Review comment:
       Not quite...
   The function `run` comes from:
   https://github.com/apache/airflow/blob/900bad1c67654252196bb095a2a150a23ae5fc9a/airflow/hooks/dbapi.py#L163
   your suggestion to not use it and set `run_sql` means that now the hook has two ways to execute sql. This is not ideal.
   
   You should make it work with `run`. You can override the `run` by having a new implementation in `RedshiftSQLHook` (Just like we do it for Snowflake in the PR I shared)
   
   
   To be specific:
   The operator calling `hook.run()` should stay exactly as it is. Let me also point that by removing `parameters=self.parameters` in the operator you are forcing breaking change which is something that we avoid unless absolutely necessary (and I don't think this is a case where it's necessary correct me if i'm wrong)
    So the hook needs to handle single query or multiple in the `run` function without changing how the `run` function is called from the operator.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eladkal edited a comment on pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
eladkal edited a comment on pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#issuecomment-1074404405


   > Can't we just supply a list of SQL statements to the existing operator?
   
   Yes but the goal is to support users who pass string composed of multiple query like:
   
   `SELECT 1; SELECT 2;`
   Most DBs knows how to execute it without any problem (MySQL, PostgresSQL etc..) others like Trino, Presto, Snowflake, Redshift can handle only 1 statement per query (most of the times? I guess there are exceptions with Do blocks and maybe others?)
   
   anyway we can always ask users to pass it as `['SELECT 1', 'SELECT 2']` but that is not that nice (Think of very very very large SQLs) and also note that you can't use the list solution with templating. Consider you want to do
   ```
   RedshiftSQLOperator(
   ...,
   sql=my_queries.sql
   )
   ```
   
   when `my_queries.sql` contains multiple statements.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eladkal commented on pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
eladkal commented on pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#issuecomment-1074404405


   > Can't we just supply a list of SQL statements to the existing operator?
   
   Yes but the goal is to support users who pass string composed of multiple query like:
   
   `SELECT 1; SELECT 2;`
   Most DBs knows how to execute it without any problem (MySQL, PostgresSQL etc..) others like Trino, Presto, Snowflake, Redshift can handle only 1 statement per query.
   
   we can always ask users to pass it as `['SELECT 1', 'SELECT 2']` but that is not that nice and also note that you can't use it with templating. So if you want to use `my_queries.sql ` you can't do that with passing list.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eladkal commented on a change in pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
eladkal commented on a change in pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#discussion_r831441457



##########
File path: airflow/providers/amazon/aws/operators/redshift_sql.py
##########
@@ -74,5 +75,12 @@ def get_hook(self) -> RedshiftSQLHook:
     def execute(self, context: 'Context') -> None:
         """Execute a statement against Amazon Redshift"""
         self.log.info(f"Executing statement: {self.sql}")
+        sql_stmts = sqlparse.split(self.sql)
         hook = self.get_hook()
-        hook.run(self.sql, autocommit=self.autocommit, parameters=self.parameters)
+
+        with hook.get_conn() as con:
+            con.autocommit = self.autocommit
+            with con.cursor() as cursor:
+                for stmt in sql_stmts:
+                    cursor.execute(stmt)
+                    if self.autocommit is False: con.commit()

Review comment:
       Not quite...
   The function `run` comes from:
   https://github.com/apache/airflow/blob/900bad1c67654252196bb095a2a150a23ae5fc9a/airflow/hooks/dbapi.py#L163
   your suggestion to not use it and set `run_sql` means that now the hook has two ways to execute sql. This is not ideal.
   
   You should make it work with `run`. You can override the `run` by having a new implementation in `RedshiftSQLHook` (Just like we do it for Snowflake in the PR I shared)
   
   
   To be specific:
   The operator calling `hook.run()` should stay exactly as it is. Let me also point that by removing `parameters=self.parameters` in the operator you are forcing breaking change which is something that we avoid unless absolutely necessary (and I don't think this is a case where it's necessary correct me if i'm wrong)
    So the hook needs to handle single query or multiple.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
uranusjr commented on pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#issuecomment-1073433313


   Pleas fix the static checks. Also, you need to add `sqlparse` to the requirements; search for `sqlparse` in the code base to find where you need to add it (it’s currently used by the Apache Drill provider.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eladkal commented on a change in pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
eladkal commented on a change in pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#discussion_r830917171



##########
File path: airflow/providers/amazon/aws/operators/redshift_sql.py
##########
@@ -74,5 +75,12 @@ def get_hook(self) -> RedshiftSQLHook:
     def execute(self, context: 'Context') -> None:
         """Execute a statement against Amazon Redshift"""
         self.log.info(f"Executing statement: {self.sql}")
+        sql_stmts = sqlparse.split(self.sql)
         hook = self.get_hook()
-        hook.run(self.sql, autocommit=self.autocommit, parameters=self.parameters)
+
+        with hook.get_conn() as con:
+            con.autocommit = self.autocommit
+            with con.cursor() as cursor:
+                for stmt in sql_stmts:
+                    cursor.execute(stmt)
+                    if self.autocommit is False: con.commit()

Review comment:
       I don't think this is the right approach.
   The functionality should exist in the hook and the operator should utilize it.
   The operator should not use cursor directly as you do it here.
   
   See https://github.com/apache/airflow/pull/15533 adding similar functionality for Snowflake




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eladkal edited a comment on pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
eladkal edited a comment on pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#issuecomment-1074404405


   > Can't we just supply a list of SQL statements to the existing operator?
   
   Yes but the goal is to support users who pass string composed of multiple query like:
   
   `SELECT 1; SELECT 2;`
   Most DBs knows how to execute it without any problem (MySQL, PostgresSQL etc..) others like Trino, Presto, Snowflake, Redshift can handle only 1 statement per query (most of the times? I guess there are exceptions with Do blocks and maybe others?)
   
   anyway we can always ask users to pass it as `['SELECT 1', 'SELECT 2']` but that is not that nice (Think of very very very large SQLs) and also note that you can't use it with templating. So if you want to use `my_queries.sql` you can't do that with passing list.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating commented on pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
SamWheating commented on pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#issuecomment-1074400949


   I think that the base `DbAPIHook` already supports a list of statements:
   
   https://github.com/apache/airflow/blob/900bad1c67654252196bb095a2a150a23ae5fc9a/airflow/hooks/dbapi.py#L163-L196
   
   Can't we just supply a list of SQL statements to the existing operator?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] pohek321 commented on pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
pohek321 commented on pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#issuecomment-1073441317


   @uranusjr can you elaborate on what you mean by static checks? I'm a newbie and not quite sure what that means. I've also added `sqlparse` to the `setup.py` file (which is where it was for Apache Drill)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating edited a comment on pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
SamWheating edited a comment on pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#issuecomment-1074420211


   Ah yeah, sorry about that I just read the linked issue and realized the difference between a list of statements and multiple statements in a single string 🤦 
   
   In this case, could we just `sqlparse.split` the queries into a list and then pass that list to the parent class's `run` method? 
   
   This would reduce duplication of logic and preserve some of the additional functionality such as the `handler` parameter which is implemented in the dbApiHook but not in the overriding `run_sql` method.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating commented on pull request #22391: Update the redshift sql operator to accept multiple sql statements

Posted by GitBox <gi...@apache.org>.
SamWheating commented on pull request #22391:
URL: https://github.com/apache/airflow/pull/22391#issuecomment-1074420211


   Ah yeah, sorry about that I just read the linked issue and realized the difference between a list of statements and multiple statements in a single string 🤦 
   
   In this case, could we just `sqlparse.split` the queries into a list and then pass that list to the parent class's `run` method? 
   
   This would reduce duplication of logic and preserve some of the additional functionality such as the `handler` parameter which is implemented in the dbApiHook but not in the overriding`run_sql` method.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org