You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/10/23 18:55:22 UTC

[GitHub] [airflow] maroshmka opened a new pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

maroshmka opened a new pull request #11793:
URL: https://github.com/apache/airflow/pull/11793


   Hello!
   
   Implemented feature for postgres_to_gcs operator that allows to use server side cursor. More in the issue https://github.com/apache/airflow/issues/8664
   
   I was thinking about multiple designs to generalize it a bit more, but in the end went with straight forward solution on doing it only for postgres db, while keeping it consistent with presto_to_gcs operator. If there would be more demand for server-side cursors implemented on other dbs as well (mssql, mysql), generalization can be done then.
   
   btw, had great experience with Breeze, absolutely love it! I was sceptic and was counting on "this will help you develop, but it will yield 100 errors anyway" experience, but it was really a breeze - install, read docs, run. nice!
   
   Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj merged pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
mik-laj merged pull request #11793:
URL: https://github.com/apache/airflow/pull/11793


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] maroshmka commented on pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
maroshmka commented on pull request #11793:
URL: https://github.com/apache/airflow/pull/11793#issuecomment-716438718


   sure, will add the system tests. Would try to do it in first half of the week.
   
   ad generic transfers - @potiuk great to hear that something like that it's happening. thank you for the info, i'll look into it :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #11793:
URL: https://github.com/apache/airflow/pull/11793#issuecomment-721931057


   The PR should be OK to be merged with just subset of tests as it does not modify Core of Airflow. The committers might merge it or can add a label 'full tests needed' and re-run it to run all tests if they see it is needed!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj edited a comment on pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
mik-laj edited a comment on pull request #11793:
URL: https://github.com/apache/airflow/pull/11793#issuecomment-716057427


   I like this change very much, but I would like to test it before merging. Can you add system tests to make this easier?  https://github.com/apache/airflow/blob/7df41f4f14a870ac3f62da5b639768b72ba9f583/tests/providers/google/cloud/transfers/test_postgres_to_gcs_system.py
   For that, you need to update the example DAG.
   https://github.com/apache/airflow/tree/7df41f4f14a870ac3f62da5b639768b72ba9f583/airflow/providers/google/cloud/example_dags
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
kaxil commented on pull request #11793:
URL: https://github.com/apache/airflow/pull/11793#issuecomment-721444914


   Can you please rebased your PR on latest Master since we have applied [Black](https://github.com/apache/airflow/commit/4e8f9cc8d02b29c325b8a5a76b4837671bdf5f68) and [PyUpgrade](https://github.com/apache/airflow/commit/8c42cf1b00c90f0d7f11b8a3a455381de8e003c5) on Master.
   
   It will help if your squash your commits into single commit first so that there are less conflicts.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #11793:
URL: https://github.com/apache/airflow/pull/11793#issuecomment-716116274


   As @mik-laj mentioned - adding system tests/update the howtos would be really great.
   
   And I agree that generalizations might be added later. We even have a discussion about it and design doc about it : Discussion here: https://lists.apache.org/thread.html/rc888a329f1c49622c0123c2ddbcfcc107eead020b774f8a8fab6d7f1%40%3Cdev.airflow.apache.org%3E  and the design doc : https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit so you might simply join the discussion and maybe participate later in developing it. Since we are busy with 2.0 release now, this feature ("generic transfer operators') does not have the highest priority, but sounds like a solid candidate for 2.1 some time early next year.
   
   And thanks for the kind words about Breeze :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #11793:
URL: https://github.com/apache/airflow/pull/11793#issuecomment-721930886


   ```
   
   tests/providers/google/cloud/transfers/test_postgres_to_gcs_system.py::PostgresToGCSSystemTest::test_run_example_dag PASSED                                                                          [100%]
   ```
   Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] maroshmka edited a comment on pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
maroshmka edited a comment on pull request #11793:
URL: https://github.com/apache/airflow/pull/11793#issuecomment-716864260


   hello, I amended the system tests and docs. Few notes:
   
   _1. It took quite a lot of time making it work locally._ 
   - I needed to export `GCP_PROJECT_ID`, which wasn't written in the doc
   - the GCS_BUCKET variable was used on 2 places, so I changed it to be unique, but only on 1 place, but didn't know whats the problem, so I have searched a lot. (also no env set, so I was getting 403 instead of 409)
   
   _2. I'm not really sure what are these tests for._ 
   I thought there will be some assert against the real GCS bucket. But as I saw (and found in doc), there is no logic in tests and all the other tests I checked contained just bunch of tasks e.g. copying a table to bq and thats it. So, did I miss something? Or are these tests just for ensuring that it can be run in actual dag ? and if yes then ok, no matter the logic (which is tested separately). These resemble e2e tests to me, but without validation. 
   
   Not sure if 1. should be solved as I'm not sure about 2. 😄  But if yes, I can fix both later - gcs_bucket name could be generated and made unique and docs could be updated to mention setting up env vars for system tests specifically.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #11793:
URL: https://github.com/apache/airflow/pull/11793#issuecomment-716057427


   I like this change very much, but I would like to test it before merging. Can you add system tests to make this easier?  https://github.com/apache/airflow/blob/7df41f4f14a870ac3f62da5b639768b72ba9f583/tests/providers/google/cloud/transfers/test_postgres_to_gcs_system.py


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #11793:
URL: https://github.com/apache/airflow/pull/11793#issuecomment-716116274


   As @mik-laj mentioned - adding system tests/update the howtos would be really great.
   
   And I agree that generalizations might be added later. We even have a discussion about it and design doc about it : Discussion here: https://lists.apache.org/thread.html/rc888a329f1c49622c0123c2ddbcfcc107eead020b774f8a8fab6d7f1%40%3Cdev.airflow.apache.org%3E  and the design doc : https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit so you might simply join the discussion and maybe participate later in developing it. Since we are busy with 2.0 release now, this feature ("generic transfer operators' does not have the higest priority, but sounds like a solid candidate for 2.1 some time early next year.
   
   And thanks for the kind words about Breeze :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] maroshmka edited a comment on pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
maroshmka edited a comment on pull request #11793:
URL: https://github.com/apache/airflow/pull/11793#issuecomment-716864260


   hello, I amended the system tests and docs. Few notes:
   
   _1. It took quite a lot of time making it work locally._ 
   - the GCS_BUCKET variable was used on 2 places, so I changed it to be unique only on 1 place, but didn't know whats the problem, so I have searched a lot. (also no env set, so I was getting 403 instead of 409)
   - Also, I needed to export `GCP_PROJECT_ID`, which wasn't written in the doc
   
   _2. I'm not really sure what are these tests for._ 
   I thought there will be some assert against the real GCS bucket. But as I saw (and found in doc), there is no logic in tests and all the other tests I checked contained just bunch of tasks e.g. copying a table to bq and thats it. So, did I miss something? Or are these tests just for ensuring that it can be run in actual dag ? and if yes then ok, no matter the logic (which is tested separately). These resemble e2e tests to me, but without validation. 
   
   Not sure if 1. should be solved as I'm not sure about 2. 😄  But if yes, I can fix both later - gcs_bucket name could be generated and made unique and docs could be updated to mention setting up env vars for system tests specifically.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #11793:
URL: https://github.com/apache/airflow/pull/11793#issuecomment-716940824


   > I needed to export GCP_PROJECT_ID, which wasn't written in the doc
   
   We are working to make all environmental variables public, but that will take some time.  Sorry for the inconvenience.
   
   > the GCS_BUCKET variable was used on 2 places, so I changed it to be unique, but only on 1 place, but didn't know whats the problem, so I tried to debug elsewhere, but didn't noticed this :/ (also no env set yet, so I was getting 403 instead of 409)
   
   It looks like you don't have access to the right buckets. Each DAG example should allow bucket to be changed using an environment variable. In this particular case, it did not happen, and I think it is worth adding this possibility. This will allow you to use this easy to change buckets in case of a conflicting name. 
   
   > 2. I'm not really sure what are these tests for.
   
   In most cases, we have a more complicated DAG example and the validation is embedded in the DAG logic, e.g. first task creates a resource, and the next task uses this resource. If the first task didn't create it, the next task wouldn't use it.  But even without complicated DAGs, such simple system tests without validation can detect problems with third libraries, because when a method with wrong parameters is called, a runtime exception will be thrown. This won't get detected in unit testing as we use unitest.mock very intensively.
   
   > Not sure if 1. should be solved as I'm not sure about 2. 😄 But if yes, I can fix both later - gcs_bucket name could be generated and made unique and docs could be updated to mention setting up env vars for system tests specifically.
   
   We don't want random names generated every time we run the test as this will make debugging bugs difficult. The bucket name should only be generated once on a given computer to make behavior more predictable. 
   
   If you want, you can create `files/airflow-breeze-config/init.sh` with some extra environment initialization code. In this directory we store the script that configures the environment variables and provides the service account keys. For now, we have 26 service account key and 128 env variables. 
   
   This quarter, we've already started working on making these system tests run regularly on the CI. Once we're done, everyone will have an easy way to create all the required environment keys and variables. Running system tests in an OSS project is quite complicated from a variety of financial, security and technical factors.
   
   Hope my explanations were helpful for you. Let me know if you want to make any more changes. If not, I will run the system tests (manually) and check if it works properly, then merge the change.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] maroshmka commented on pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
maroshmka commented on pull request #11793:
URL: https://github.com/apache/airflow/pull/11793#issuecomment-717088709


   thank you very much for such an thorough explanation, it clears a lot stuff in my head! 
   
   > Each DAG example should allow bucket to be changed using an environment variable. In this particular case, it did not happen, and I think it is worth adding this possibility. 
   
   amended with possibility to take it from env vars, this was missing and thats why I got stuck to most :/
   
   > We don't want random names generated every time we run the test as this will make debugging bugs difficult. The bucket name should only be generated once on a given computer to make behavior more predictable.
   
   agree. my idea was to append some random suffix when starting breeze and set the env vars, e.g. `bucket_test_{str(uuid.uuid4())[:6]}`. as you are already working on it, its cool I guess :)
   
   that'd be all from my side, so you can test it :) 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] maroshmka edited a comment on pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
maroshmka edited a comment on pull request #11793:
URL: https://github.com/apache/airflow/pull/11793#issuecomment-716864260


   hello, I amended the system tests and docs. Few notes:
   
   _1. It took quite a lot of time making it work locally._ 
   - the GCS_BUCKET variable was used on 2 places, so I changed it to be unique only on 1 place, but didn't know where's the problem
   - Also, I needed to export `GCP_PROJECT_ID`, which wasn't in the doc
   
   _2. I'm not really sure what are these tests for._ 
   I thought there will be some assert against the real GCS bucket. But as I saw (and found in doc), there is no logic in tests and all the other tests I checked contained just bunch of tasks e.g. copying a table to bq and thats it. So, did I miss something? Or are these tests just for ensuring that it can be run in actual dag ? and if yes then ok, no matter the logic (which is tested separately). These resemble e2e tests to me, but without validation. 
   
   Not sure if 1. should be solved as I'm not sure about 2. 😄  But if yes, I can fix both later - gcs_bucket name could be generated and made unique and docs could be updated to mention setting up env vars for system tests specifically.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] maroshmka commented on pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
maroshmka commented on pull request #11793:
URL: https://github.com/apache/airflow/pull/11793#issuecomment-721592898


   sure, done.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] maroshmka edited a comment on pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
maroshmka edited a comment on pull request #11793:
URL: https://github.com/apache/airflow/pull/11793#issuecomment-716864260


   hello, I amended the system tests and docs. Few notes:
   
   _1. It took quite a lot of time making it work locally._ 
   - I needed to export `GCP_PROJECT_ID`, which wasn't written in the doc
   - the GCS_BUCKET variable was used on 2 places, so I changed it to be unique, but only on 1 place, but didn't know whats the problem, so I tried to debug elsewhere, but didn't noticed this :/ (also no env set yet, so I was getting 403 instead of 409)
   
   _2. I'm not really sure what are these tests for._ 
   I thought there will be some assert against the real GCS bucket. But as I saw (and found in doc), there is no logic in tests and all the other tests I checked contained just bunch of tasks e.g. copying a table to bq and thats it. So, did I miss something? Or are these tests just for ensuring that it can be run in actual dag ? and if yes then ok, no matter the logic (which is tested separately). These resemble e2e tests to me, but without validation. 
   
   Not sure if 1. should be solved as I'm not sure about 2. 😄  But if yes, I can fix both later - gcs_bucket name could be generated and made unique and docs could be updated to mention setting up env vars for system tests specifically.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] maroshmka commented on pull request #11793: [AIRFLOW-8664] Postgres to GCS operator - use named cursor

Posted by GitBox <gi...@apache.org>.
maroshmka commented on pull request #11793:
URL: https://github.com/apache/airflow/pull/11793#issuecomment-716864260


   hello, I amended the system tests and docs. Few notes:
   
   _1. I took quite a lot of time making it work locally._ 
   - the GCS_BUCKET variable was used on 2 places, so I changed it to be unique only on 1 place, but didn't know where's the problem
   - Also, I needed to export `GCP_PROJECT_ID`, which wasn't in the doc
   
   _2. I'm not really sure what are these tests for._ 
   I thought there will be some assert against the real GCS bucket. But as I saw (and found in doc), there is no logic in tests and all the other tests I checked contained just bunch of tasks e.g. copying a table to bq and thats it. So, did I miss something? Or are these tests just for ensuring that it can be run in actual dag ? and if yes then ok, no matter the logic (which is tested separately). These resemble e2e tests to me, but without validation. 
   
   Not sure if 1. should be solved as I'm not sure about 2. 😄  But if yes, I can fix both later - gcs_bucket name could be generated and made unique and docs could be updated to mention setting up env vars for system tests specifically.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org