Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/01/17 17:36:53 UTC

[GitHub] [airflow] potiuk edited a comment on pull request #12505: Fix S3ToSnowflakeOperator to support uploading all files in the specified stage

potiuk edited a comment on pull request #12505:
URL: https://github.com/apache/airflow/pull/12505#issuecomment-761849270


   > @potiuk just a thing to keep in mind on this topic of transfer operators: there is no S3 hook involved here. This is a pure Snowflake operator; it's just executing a Snowflake command. So in that sense, it's not properly a transfer operator. Yes, Snowflake is accessing storage, but the operator knows nothing about this (Snowflake manages it), so there's no reason to artificially restrict its focus to S3 in the naming.
   > 
   > To be clear, I'm not saying this particular PR shouldn't go through. It's a good improvement. I'm just making a case for fixing the naming, if not now then later.
   
   I certainly understand that, but my line of thinking does not change here. It does not matter whether this is a 'proper' transfer operator or whether a hook is used. I simply argue that, for the user, having separate "S3ToSnowflake", "GCSToSnowflake" and "AzureBlobToSnowflake" operators is perfectly fine, and even intended.
   
   Even if they eventually reuse a lot of the same code, those can be implemented as three separate operators sharing common code via inheritance or composition. As a user, I'd very much prefer to have a separate S3ToSnowflake with S3-only parameters, a GCSToSnowflake with Google-only ones, and an AzureBlobToSnowflake with Azure-only ones.
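
   To illustrate, here is a minimal sketch of what that split could look like. The base class name, the parameter names and the SQL construction are all assumptions for the sake of the example, not the actual Airflow operator signatures; only the SnowflakeHook import reflects the real provider package:

   ```python
   # Hypothetical sketch: shared base + storage-specific subclasses.
   from airflow.models import BaseOperator
   from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


   class _StageToSnowflakeOperator(BaseOperator):
       """Assumed shared base: common Snowflake parameters and COPY INTO execution."""

       def __init__(self, *, table, stage, file_format,
                    snowflake_conn_id="snowflake_default", **kwargs):
           super().__init__(**kwargs)
           self.table = table
           self.stage = stage
           self.file_format = file_format
           self.snowflake_conn_id = snowflake_conn_id

       def _copy_sql(self):
           # Shared COPY INTO construction (elaborated a bit further below).
           return f"COPY INTO {self.table} FROM @{self.stage} FILE_FORMAT = ({self.file_format})"

       def execute(self, context):
           SnowflakeHook(snowflake_conn_id=self.snowflake_conn_id).run(self._copy_sql())


   class S3ToSnowflakeOperator(_StageToSnowflakeOperator):
       """Exposes only S3-flavoured parameters (names are illustrative)."""

       def __init__(self, *, s3_keys=None, **kwargs):
           super().__init__(**kwargs)
           # Could override _copy_sql() to add FILES/prefix handling for these keys.
           self.s3_keys = s3_keys


   class GCSToSnowflakeOperator(_StageToSnowflakeOperator):
       """Exposes only GCS-flavoured parameters (names are illustrative)."""

       def __init__(self, *, gcs_prefix=None, **kwargs):
           super().__init__(**kwargs)
           self.gcs_prefix = gcs_prefix
   ```

   The point of this shape is that each public operator only declares its own storage's parameters, while the Snowflake plumbing lives in one shared place.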
   
   Generally speaking, even if 90% of the code under the hood eventually turns out to be the same, it can be extracted as a next step, when the GCS/Azure operators are added, without breaking backwards compatibility of the current S3ToSnowflake operator. The fact that Snowflake can interact with all three storages makes it possible to share that code later. At this stage this approach also saves on system tests: the person implementing the S3 variant, who already has S3 credentials, does not have to create Azure and GCS accounts as well just to verify that everything works for those. The next person can rely on the S3 unit tests to check that the basic S3 assumptions are not broken, add GCS on top, and test it thoroughly then.
   
   I do not know exactly whether the GCS/S3 export differs in this Snowflake API, but I'd assume you can still control, for example, whether the BQ schema is created (a typical use case when exporting to GCS). And I think having a "thin" S3ToS..., GCSToS... layer over the native Snowflake capability is much closer to the philosophy of many of Airflow's current operators.
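
   To make the "thin layer" point concrete: such an operator essentially just templates a Snowflake `COPY INTO` statement and runs it through the Snowflake hook, while Snowflake itself reads from the external stage (S3, GCS or Azure Blob). A minimal sketch, with illustrative parameter names and simplified file-format handling:

   ```python
   # Hypothetical helper: what the shared "90%" could boil down to.
   from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


   def copy_stage_into_table(table, stage, file_format, prefix=None, files=None,
                             snowflake_conn_id="snowflake_default"):
       """Run COPY INTO <table> FROM @<stage>[/<prefix>] [FILES = (...)] via the hook."""
       location = f"@{stage}" + (f"/{prefix}" if prefix else "")
       sql = f"COPY INTO {table} FROM {location}"
       if files:
           # Restrict the load to specific files; omitting this loads the whole stage.
           quoted = ", ".join(f"'{name}'" for name in files)
           sql += f" FILES = ({quoted})"
       sql += f" FILE_FORMAT = ({file_format})"
       # The hook only executes the statement; Snowflake pulls the data from the stage itself.
       SnowflakeHook(snowflake_conn_id=snowflake_conn_id).run(sql)
   ```

   An S3, GCS or Azure flavour would differ only in which stage it points at and which storage-specific options it exposes.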
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org