You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/12/20 19:49:35 UTC

[GitHub] [airflow] potiuk opened a new pull request, #28505: Make pandas dependency optional for Amazon Provider

potiuk opened a new pull request, #28505:
URL: https://github.com/apache/airflow/pull/28505

   Pandas dependency is only used for a few features in Amazon providers amd it is a huge "drag" on a package in general - we use the existing Airflow 2.3+ feature of Optional Provider Feature to make pandas an optional dependency (with "[pandas]" extra).
   
   This is not a breaking change, since those who already installed previous provider version have pandas installed already. Only people who install provider manually will have to install "pandas" extra either for Airflow core package or for the provider.
   
   Fixes: #28468
   
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of an existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code changes, an Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in a newsfragment file, named `{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in [newsfragments](https://github.com/apache/airflow/tree/main/newsfragments).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on pull request #28505: Make pandas dependency optional for Amazon Provider

Posted by GitBox <gi...@apache.org>.
eladkal commented on PR #28505:
URL: https://github.com/apache/airflow/pull/28505#issuecomment-1360367756

   > Only people who install provider manually will have to install "pandas" extra either for Airflow core package or for the provider.
   
   This means it is a breaking change.
   Yes, on upgrade everything looks fine because as you said pandas lib is already there but if users are using install scripts that install airflow and providers seperatly then once workers restart and reinstall from scratch it will cause broken dags.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] Taragolis commented on a diff in pull request #28505: Make pandas dependency optional for Amazon Provider

Posted by GitBox <gi...@apache.org>.
Taragolis commented on code in PR #28505:
URL: https://github.com/apache/airflow/pull/28505#discussion_r1053740864


##########
airflow/providers/amazon/CHANGELOG.rst:
##########
@@ -24,6 +24,18 @@
 Changelog
 ---------
 
+6.3.0
+-----

Review Comment:
   I just realised that not yet after see this header



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #28505: Make pandas dependency optional for Amazon Provider

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #28505:
URL: https://github.com/apache/airflow/pull/28505#discussion_r1053742976


##########
airflow/providers/amazon/CHANGELOG.rst:
##########
@@ -24,6 +24,18 @@
 Changelog
 ---------
 
+6.3.0
+-----

Review Comment:
   We are .. just fixed it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #28505: Make pandas dependency optional for Amazon Provider

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #28505:
URL: https://github.com/apache/airflow/pull/28505#discussion_r1053734472


##########
airflow/providers/amazon/CHANGELOG.rst:
##########
@@ -24,6 +24,18 @@
 Changelog
 ---------
 
+6.3.0
+-----
+
+Features
+~~~~~~~~
+
+Pandas is now an optional dependency of the provider. The ``SqlToS3Operator`` and ``HiveToDynamoDBOperator``
+require Pandas to be installed (you can install it automatically by adding ``[pandas]`` extra when installing
+the provider. Those who installed previous version of the provider, should already have pandas installed, so
+this is not a breaking change.

Review Comment:
   That's a good point - those are local imports there and they   `raise Exception(` but now we can change them to the OptionalProviderFeature exception. I will do that as a follow-up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] Taragolis commented on a diff in pull request #28505: Make pandas dependency optional for Amazon Provider

Posted by GitBox <gi...@apache.org>.
Taragolis commented on code in PR #28505:
URL: https://github.com/apache/airflow/pull/28505#discussion_r1053711582


##########
airflow/providers/amazon/CHANGELOG.rst:
##########
@@ -24,6 +24,18 @@
 Changelog
 ---------
 
+6.3.0
+-----
+
+Features
+~~~~~~~~
+
+Pandas is now an optional dependency of the provider. The ``SqlToS3Operator`` and ``HiveToDynamoDBOperator``
+require Pandas to be installed (you can install it automatically by adding ``[pandas]`` extra when installing
+the provider. Those who installed previous version of the provider, should already have pandas installed, so
+this is not a breaking change.

Review Comment:
   Just wondering, should we also mentioned `RedshiftSQLHook` methods `get_pandas_df` and `get_pandas_df_by_chunks`? However this methods provided by `DbApiHook` so maybe not.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr merged pull request #28505: Make pandas dependency optional for Amazon Provider

Posted by GitBox <gi...@apache.org>.
uranusjr merged PR #28505:
URL: https://github.com/apache/airflow/pull/28505


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #28505: Make pandas dependency optional for Amazon Provider

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #28505:
URL: https://github.com/apache/airflow/pull/28505#discussion_r1053732628


##########
airflow/providers/amazon/CHANGELOG.rst:
##########
@@ -24,6 +24,18 @@
 Changelog
 ---------
 
+6.3.0
+-----

Review Comment:
   That's what reviews are for :)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] Taragolis commented on a diff in pull request #28505: Make pandas dependency optional for Amazon Provider

Posted by GitBox <gi...@apache.org>.
Taragolis commented on code in PR #28505:
URL: https://github.com/apache/airflow/pull/28505#discussion_r1053739925


##########
airflow/providers/amazon/CHANGELOG.rst:
##########
@@ -24,6 +24,18 @@
 Changelog
 ---------
 
+6.3.0
+-----

Review Comment:
   Actually I thought we already release 7.0.0 😆 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] Taragolis commented on a diff in pull request #28505: Make pandas dependency optional for Amazon Provider

Posted by GitBox <gi...@apache.org>.
Taragolis commented on code in PR #28505:
URL: https://github.com/apache/airflow/pull/28505#discussion_r1053725692


##########
airflow/providers/amazon/CHANGELOG.rst:
##########
@@ -24,6 +24,18 @@
 Changelog
 ---------
 
+6.3.0
+-----

Review Comment:
   I bet next version should be 7.0.0 due to https://github.com/apache/airflow/pull/27920



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #28505: Make pandas dependency optional for Amazon Provider

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #28505:
URL: https://github.com/apache/airflow/pull/28505#issuecomment-1360462705

   > Yes, on upgrade everything looks fine because as you said pandas lib is already there but if users are using install scripts that install airflow and providers seperatly then once workers restart and reinstall from scratch it will cause broken dags because now pandas is not present. This require manual change to add the pandas dependency thus this is not backward compatible.
   
   Not that it matters in this case (we are releasing breaking version anyway), but let me venture into academic discussion (just for the sake of discussion). And more for "food for thoughts".  I do not want to argue vehemotly on this, I am happy to move it now to breaking, I would like more to present a different view on that.
   
   It really depends what WE consider as breaking change. It's never 0/1 (breaking/not breaking). As I mentioned multiple time and as [Hyrum's law](https://www.hyrumslaw.com/) nicely states: "With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody".
   
   Every single change in our case we do is breaking someone's workflow. We should accept this.  And whether we classify something as breaking or not depends purely on our assesment of the situation on how breaking it is. We earlier agreed that just dependency change is not breaking on its own (and we released a number of providers with dependency changes without increasing major.  
   
   What you described is a possibility of course, but more likely than not pandas will already be installed as dependency anyway there for mutliple reasons. And even if not it is just the matter of adding "[pandas]" in their scripts. No DAG modificaitons needed, no changes in the code just "environment" change. We are just dicussing what is what "Airflow Public Interface" - https://github.com/apache/airflow/pull/28300 and I have not seen anyone asking to add "list of dependencies to install automatically" as part of the public API. Once we merge this change, this kind of change is not going to be considered as "breaking" on the list. 
   
   So - super easy way to recover and error message will be very clear. 
   
   Also there are other things we should consider - do you think the users who install airflow automatically in the way do not have a test harness to verify if their new install is working ? There might be many more problems than missing pandas if every time they rebuild it from scratch, so in your scenario - they for sure have a test harness in place. They have a staging system to test it.
   
   And even if they don't have test harness - do we realistically think they will limit their installation and not install 7.0.0 version of the provider gets released in the same automated way? Do you think they have a separate test harness that they will test against such breaking relase of every single provider they use? Will that make a difference for them whether that description is put in "breaking" or "feature" changes" (because that will be the only difference in this case) ? I am not sure, but I think this is quite unrealistic.
   
   And of course we can be wrong in our assesment. Happens. We can always produce a bugfix for that if many of our users will start complaining. And we should of course try to avoid this, but ultimately, we will break some workflows of some people  - this is unavoidable. And there is no way to 100% protect against it.
   
   And finally - yes, I think this is a border-line for me this is really the case where things aren't clearly 0/1. And we can either err on a side of caution or be a bit more bold 
   
   This is all purely academic discussion of course - I have 0 problem with putting it into breaking section - it does not really matter in this case of course.  Just wanted to explain that putting everything that potentially changes some particular workflow is not necessarily breaking. It's a very simplistic view IMHO and we should not only asses if the change "actually" breaks something but also how disruptive it really is and whether it is simple to recover for the user at early stage, because it is realy harmful.
   
   ![image](https://user-images.githubusercontent.com/595491/208781135-c6a9538a-f5e7-43ae-a77d-f9a32f3e8a2e.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #28505: Make pandas dependency optional for Amazon Provider

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #28505:
URL: https://github.com/apache/airflow/pull/28505#issuecomment-1360464746

   Moved the description up to breaking changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on pull request #28505: Make pandas dependency optional for Amazon Provider

Posted by GitBox <gi...@apache.org>.
eladkal commented on PR #28505:
URL: https://github.com/apache/airflow/pull/28505#issuecomment-1361010283

   @potiuk I actually agree with you on all fronts. I just think that whenever we are not 100% sure we should lean more to the safest option. If this also means no hassle for us then even more reason! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #28505: Make pandas dependency optional for Amazon Provider

Posted by GitBox <gi...@apache.org>.
potiuk commented on code in PR #28505:
URL: https://github.com/apache/airflow/pull/28505#discussion_r1053738869


##########
airflow/providers/amazon/CHANGELOG.rst:
##########
@@ -24,6 +24,18 @@
 Changelog
 ---------
 
+6.3.0
+-----

Review Comment:
   Fixed. 
   
   @dwreeves  - could you please help with adding a paragraph or two of description from https://github.com/apache/airflow/pull/27920. The context description was pretty long but we need to distill it to a single paragraph or so text (maybe with one example of code? )



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] Taragolis commented on a diff in pull request #28505: Make pandas dependency optional for Amazon Provider

Posted by GitBox <gi...@apache.org>.
Taragolis commented on code in PR #28505:
URL: https://github.com/apache/airflow/pull/28505#discussion_r1053726699


##########
airflow/providers/amazon/CHANGELOG.rst:
##########
@@ -24,6 +24,18 @@
 Changelog
 ---------
 
+6.3.0
+-----

Review Comment:
   But we forget add breaking changes into the CHANGELOG `¯\_(ツ)_/¯`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org