You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/12/20 23:27:49 UTC

[GitHub] [airflow] potiuk commented on pull request #28505: Make pandas dependency optional for Amazon Provider

potiuk commented on PR #28505:
URL: https://github.com/apache/airflow/pull/28505#issuecomment-1360462705

   > Yes, on upgrade everything looks fine because as you said pandas lib is already there but if users are using install scripts that install airflow and providers seperatly then once workers restart and reinstall from scratch it will cause broken dags because now pandas is not present. This require manual change to add the pandas dependency thus this is not backward compatible.
   
   Not that it matters in this case (we are releasing breaking version anyway), but let me venture into academic discussion (just for the sake of discussion). And more for "food for thoughts".  I do not want to argue vehemotly on this, I am happy to move it now to breaking, I would like more to present a different view on that.
   
   It really depends what WE consider as breaking change. It's never 0/1 (breaking/not breaking). As I mentioned multiple time and as [Hyrum's law](https://www.hyrumslaw.com/) nicely states: "With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody".
   
   Every single change in our case we do is breaking someone's workflow. We should accept this.  And whether we classify something as breaking or not depends purely on our assesment of the situation on how breaking it is. We earlier agreed that just dependency change is not breaking on its own (and we released a number of providers with dependency changes without increasing major.  
   
   What you described is a possibility of course, but more likely than not pandas will already be installed as dependency anyway there for mutliple reasons. And even if not it is just the matter of adding "[pandas]" in their scripts. No DAG modificaitons needed, no changes in the code just "environment" change. We are just dicussing what is what "Airflow Public Interface" - https://github.com/apache/airflow/pull/28300 and I have not seen anyone asking to add "list of dependencies to install automatically" as part of the public API. Once we merge this change, this kind of change is not going to be considered as "breaking" on the list. 
   
   So - super easy way to recover and error message will be very clear. 
   
   Also there are other things we should consider - do you think the users who install airflow automatically in the way do not have a test harness to verify if their new install is working ? There might be many more problems than missing pandas if every time they rebuild it from scratch, so in your scenario - they for sure have a test harness in place. They have a staging system to test it.
   
   And even if they don't have test harness - do we realistically think they will limit their installation and not install 7.0.0 version of the provider gets released in the same automated way? Do you think they have a separate test harness that they will test against such breaking relase of every single provider they use? Will that make a difference for them whether that description is put in "breaking" or "feature" changes" (because that will be the only difference in this case) ? I am not sure, but I think this is quite unrealistic.
   
   And of course we can be wrong in our assesment. Happens. We can always produce a bugfix for that if many of our users will start complaining. And we should of course try to avoid this, but ultimately, we will break some workflows of some people  - this is unavoidable. And there is no way to 100% protect against it.
   
   And finally - yes, I think this is a border-line for me this is really the case where things aren't clearly 0/1. And we can either err on a side of caution or be a bit more bold 
   
   This is all purely academic discussion of course - I have 0 problem with putting it into breaking section - it does not really matter in this case of course.  Just wanted to explain that putting everything that potentially changes some particular workflow is not necessarily breaking. It's a very simplistic view IMHO and we should not only asses if the change "actually" breaks something but also how disruptive it really is and whether it is simple to recover for the user at early stage, because it is realy harmful.
   
   ![image](https://user-images.githubusercontent.com/595491/208781135-c6a9538a-f5e7-43ae-a77d-f9a32f3e8a2e.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org