Posted to commits@airflow.apache.org by "chriscugliotta (via GitHub)" <gi...@apache.org> on 2023/03/07 16:07:39 UTC

[GitHub] [airflow] chriscugliotta opened a new issue, #29958: GCSToBigQueryOperator does not respect the destination project ID

chriscugliotta opened a new issue, #29958:
URL: https://github.com/apache/airflow/issues/29958

   ### Apache Airflow Provider(s)
   
   google
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-google==8.10.0
   
   ### Apache Airflow version
   
   2.3.4
   
   ### Operating System
   
   Ubuntu 18.04.6 LTS
   
   ### Deployment
   
   Google Cloud Composer
   
   ### Deployment details
   
   Google Cloud Composer 2.1.2
   
   ### What happened
   
   [`GCSToBigQueryOperator`](https://github.com/apache/airflow/blob/3374fdfcbddb630b4fc70ceedd5aed673e6c0a0d/airflow/providers/google/cloud/transfers/gcs_to_bigquery.py#L58) does not respect the BigQuery project ID specified in the [`destination_project_dataset_table`](https://github.com/apache/airflow/blob/3374fdfcbddb630b4fc70ceedd5aed673e6c0a0d/airflow/providers/google/cloud/transfers/gcs_to_bigquery.py#L74-L77) argument. Instead, it prioritizes the project ID defined in the [Airflow connection](https://i.imgur.com/1tTIlQF.png).
   
   ### What you think should happen instead
   
   The project ID specified via [`destination_project_dataset_table`](https://github.com/apache/airflow/blob/3374fdfcbddb630b4fc70ceedd5aed673e6c0a0d/airflow/providers/google/cloud/transfers/gcs_to_bigquery.py#L74-L77) should be respected.
   
   **Use case:**  Suppose our Composer environment and service account (SA) live in `project-A`, and we want to transfer data into foreign projects `B`, `C`, and `D`.  We don't have credentials (and thus don't have Airflow connections defined) for projects `B`, `C`, and `D`.  Instead, all transfers are executed by our singular SA in `project-A`.  (Assume this SA has cross-project IAM policies).  Thus, we want to use a _single_ SA and _single_ [Airflow connection](https://i.imgur.com/1tTIlQF.png) (i.e. `gcp_conn_id=google_cloud_default`) to send data into 3+ destination projects.  I imagine this is a fairly common setup for sending data across GCP projects.
   
   **Root cause:** I've been studying the source code, and I believe the bug is caused by [line 309](https://github.com/apache/airflow/blob/3374fdfcbddb630b4fc70ceedd5aed673e6c0a0d/airflow/providers/google/cloud/transfers/gcs_to_bigquery.py#L309). Experimentally, I have verified that `hook.project_id` traces back to the [Airflow connection's project ID](https://i.imgur.com/1tTIlQF.png). If no destination project ID is explicitly specified, then it makes sense to _fall back_ on the connection's project. However, if the destination project is explicitly provided, surely the operator should honor that. I think this bug can be fixed by amending line 309 as follows:
   
   ```python
   project=passed_in_project or hook.project_id
   ```
   
   This pattern is used successfully in many other areas of the repo:  [example](https://github.com/apache/airflow/blob/3374fdfcbddb630b4fc70ceedd5aed673e6c0a0d/airflow/providers/google/cloud/operators/gcs.py#L154).
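
To make the fallback behaviour concrete, here is a minimal standalone sketch of the `or` idiom. The function `choose_job_project` and its parameter names are hypothetical and are not part of the provider; it only illustrates the precedence that the proposed one-line change would give.

```python
from typing import Optional


def choose_job_project(
    passed_in_project: Optional[str],
    connection_project: Optional[str],
) -> Optional[str]:
    """Hypothetical helper: pick the project ID used when submitting the job."""
    # `or` returns its first truthy operand, so both None and an empty
    # string fall through to the connection's project ID.
    return passed_in_project or connection_project


# Explicit destination project wins over the connection's project.
print(choose_job_project("project-B", "project-A"))
# No explicit project: fall back to the connection's project.
print(choose_job_project(None, "project-A"))
```

One consequence of using `or` rather than an `is None` check is that an empty string is treated the same as a missing project, which seems to be the desired behaviour here.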
   
   ### How to reproduce
   
   Admittedly, this bug is difficult to reproduce, because it requires two GCP projects, i.e. a service account in `project-A`, and inbound GCS files and a destination BigQuery table in `project-B`.  Also, you need an Airflow server with a `google_cloud_default` connection that points to `project-A` like [this](https://i.imgur.com/1tTIlQF.png).  Assuming all that exists, the bug can be reproduced via the following Airflow DAG:
   
   ```python
   from airflow import DAG
   from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
   from datetime import datetime
   
   GCS_BUCKET='my_bucket'
   GCS_PREFIX='path/to/*.json'
   BQ_PROJECT='project-B'
   BQ_DATASET='my_dataset'
   BQ_TABLE='my_table'
   SERVICE_ACCOUNT='my_account@project-A.iam.gserviceaccount.com'
   
   
   with DAG(
           dag_id='my_dag',
           start_date=datetime(2023, 1, 1),
           schedule_interval=None,
       ) as dag:
   
       task = GCSToBigQueryOperator(
           task_id='gcs_to_bigquery',
           bucket=GCS_BUCKET,
           source_objects=GCS_PREFIX,
           source_format='NEWLINE_DELIMITED_JSON',
           destination_project_dataset_table='{}.{}.{}'.format(BQ_PROJECT, BQ_DATASET, BQ_TABLE),
           impersonation_chain=SERVICE_ACCOUNT,
       )
   ```
   
   Stack trace:
   
   ```
   Traceback (most recent call last):
     File "/opt/python3.8/lib/python3.8/site-packages/airflow/executors/debug_executor.py", line 79, in _run_task
       ti.run(job_id=ti.job_id, **params)
     File "/opt/python3.8/lib/python3.8/site-packages/airflow/utils/session.py", line 71, in wrapper
       return func(*args, session=session, **kwargs)
     File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1797, in run
       self._run_raw_task(
     File "/opt/python3.8/lib/python3.8/site-packages/airflow/utils/session.py", line 68, in wrapper
       return func(*args, **kwargs)
     File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1464, in _run_raw_task
       self._execute_task_with_callbacks(context, test_mode)
     File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1612, in _execute_task_with_callbacks
       result = self._execute_task(context, task_orig)
     File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1673, in _execute_task
       result = execute_callable(context=context)
     File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/transfers/gcs_to_bigquery.py", line 387, in execute
       job = self._submit_job(self.hook, job_id)
     File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/transfers/gcs_to_bigquery.py", line 307, in _submit_job
       return hook.insert_job(
     File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 468, in inner_wrapper
       return func(self, *args, **kwargs)
     File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 1549, in insert_job
       job._begin()
     File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/job/base.py", line 510, in _begin
       api_response = client._call_api(
     File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 782, in _call_api
       return call()
     File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 283, in retry_wrapped_func
       return retry_target(
     File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 190, in retry_target
       return target()
     File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
       raise exceptions.from_http_response(response)
   google.api_core.exceptions.Forbidden: 403 POST https://bigquery.googleapis.com/bigquery/v2/projects/{project-A}/jobs?prettyPrint=false: Access Denied: Project {project-A}: User does not have bigquery.jobs.create permission in project {project-A}.
   ```
   
   From the stack trace, notice that the operator is (incorrectly) attempting to create the BigQuery job in `project-A` rather than `project-B`.
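
When checking a failed run, the project the job was submitted to can be read from the `jobs` URL in the error message. The snippet below is a small sketch of that check; the helper name `project_from_jobs_url` is made up for illustration, and the sample message uses a literal `project-A` in place of the redacted `{project-A}` placeholder above.

```python
import re
from typing import Optional


def project_from_jobs_url(message: str) -> Optional[str]:
    """Hypothetical helper: extract the project ID from a BigQuery jobs URL."""
    # The project ID is the path segment between "projects/" and "/jobs".
    match = re.search(r"/bigquery/v2/projects/([^/]+)/jobs", message)
    return match.group(1) if match else None


sample = (
    "403 POST https://bigquery.googleapis.com/bigquery/v2/projects/"
    "project-A/jobs?prettyPrint=false: Access Denied"
)
print(project_from_jobs_url(sample))
```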
   
   ### Anything else
   
   Perhaps out-of-scope, but the inverse direction also suffers from this same problem, i.e. [BigQueryToGcsOperator](https://github.com/apache/airflow/blob/3374fdfcbddb630b4fc70ceedd5aed673e6c0a0d/airflow/providers/google/cloud/transfers/bigquery_to_gcs.py#L38) and [line 192](https://github.com/apache/airflow/blob/3374fdfcbddb630b4fc70ceedd5aed673e6c0a0d/airflow/providers/google/cloud/transfers/bigquery_to_gcs.py#L192).
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] Yaro1 commented on issue #29958: GCSToBigQueryOperator does not respect the destination project ID

Posted by "Yaro1 (via GitHub)" <gi...@apache.org>.
Yaro1 commented on issue #29958:
URL: https://github.com/apache/airflow/issues/29958#issuecomment-1464998095

   @chriscugliotta for [BigQueryToGcsOperator](https://github.com/apache/airflow/blob/3374fdfcbddb630b4fc70ceedd5aed673e6c0a0d/airflow/providers/google/cloud/transfers/bigquery_to_gcs.py#L38) it should work correctly, because it uses the `split_tablename` method of the `BigQueryHook` object.
   
   In the PR I will use the same method for consistency.
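
For readers unfamiliar with that approach, the sketch below shows the general idea of splitting a table reference into project, dataset, and table with a default project. It is a simplified stand-in written for this thread, not Airflow's implementation: the name `split_table_reference` is hypothetical, and it assumes only the two reference forms `<project>.<dataset>.<table>` and `<project>:<dataset>.<table>`, plus the two-part form without a project.

```python
from typing import Tuple


def split_table_reference(
    table_input: str, default_project_id: str
) -> Tuple[str, str, str]:
    """Simplified stand-in (not Airflow's code): split a table reference."""
    # Handle the "<project>:<dataset>.<table>" form first.
    if ":" in table_input:
        project_id, rest = table_input.split(":", 1)
    else:
        project_id, rest = "", table_input

    parts = rest.split(".")
    if len(parts) == 3 and not project_id:
        # "<project>.<dataset>.<table>"
        project_id, dataset_id, table_id = parts
    elif len(parts) == 2:
        # "<dataset>.<table>", with or without a "<project>:" prefix
        dataset_id, table_id = parts
    else:
        raise ValueError(f"Unexpected table reference: {table_input!r}")

    # An explicit project wins; otherwise use the default (connection) project.
    return project_id or default_project_id, dataset_id, table_id


print(split_table_reference("project-B.my_dataset.my_table", "project-A"))
print(split_table_reference("my_dataset.my_table", "project-A"))
```

Using the first element of the returned tuple as the job's project would give an explicitly named project precedence over the connection's project.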




[GitHub] [airflow] Yaro1 commented on issue #29958: GCSToBigQueryOperator does not respect the destination project ID

Posted by "Yaro1 (via GitHub)" <gi...@apache.org>.
Yaro1 commented on issue #29958:
URL: https://github.com/apache/airflow/issues/29958#issuecomment-1458712406

   Please assign to me




[GitHub] [airflow] potiuk commented on issue #29958: GCSToBigQueryOperator does not respect the destination project ID

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on issue #29958:
URL: https://github.com/apache/airflow/issues/29958#issuecomment-1459077682

   assigned.




[GitHub] [airflow] Yaro1 commented on issue #29958: GCSToBigQueryOperator does not respect the destination project ID

Posted by "Yaro1 (via GitHub)" <gi...@apache.org>.
Yaro1 commented on issue #29958:
URL: https://github.com/apache/airflow/issues/29958#issuecomment-1463339983

   okay, got it, thanks




[GitHub] [airflow] boring-cyborg[bot] commented on issue #29958: GCSToBigQueryOperator does not respect the destination project ID

Posted by "boring-cyborg[bot] (via GitHub)" <gi...@apache.org>.
boring-cyborg[bot] commented on issue #29958:
URL: https://github.com/apache/airflow/issues/29958#issuecomment-1458430574

   Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
   




[GitHub] [airflow] JessicaRudd commented on issue #29958: GCSToBigQueryOperator does not respect the destination project ID

Posted by "JessicaRudd (via GitHub)" <gi...@apache.org>.
JessicaRudd commented on issue #29958:
URL: https://github.com/apache/airflow/issues/29958#issuecomment-1458446878

   Thank you @chriscugliotta for documenting this very annoying bug. 




[GitHub] [airflow] Yaro1 commented on issue #29958: GCSToBigQueryOperator does not respect the destination project ID

Posted by "Yaro1 (via GitHub)" <gi...@apache.org>.
Yaro1 commented on issue #29958:
URL: https://github.com/apache/airflow/issues/29958#issuecomment-1462797613

   I am currently working on this. Should I write something here, or just open a PR once it is complete?




[GitHub] [airflow] chriscugliotta commented on issue #29958: GCSToBigQueryOperator does not respect the destination project ID

Posted by "chriscugliotta (via GitHub)" <gi...@apache.org>.
chriscugliotta commented on issue #29958:
URL: https://github.com/apache/airflow/issues/29958#issuecomment-1470416332

   Thank you, @Yaro1!




[GitHub] [airflow] Yaro1 commented on issue #29958: GCSToBigQueryOperator does not respect the destination project ID

Posted by "Yaro1 (via GitHub)" <gi...@apache.org>.
Yaro1 commented on issue #29958:
URL: https://github.com/apache/airflow/issues/29958#issuecomment-1470484838

   My pleasure :)




[GitHub] [airflow] Yaro1 commented on issue #29958: GCSToBigQueryOperator does not respect the destination project ID

Posted by "Yaro1 (via GitHub)" <gi...@apache.org>.
Yaro1 commented on issue #29958:
URL: https://github.com/apache/airflow/issues/29958#issuecomment-1458703928

   I want to take it




[GitHub] [airflow] potiuk closed issue #29958: GCSToBigQueryOperator does not respect the destination project ID

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk closed issue #29958: GCSToBigQueryOperator does not respect the destination project ID
URL: https://github.com/apache/airflow/issues/29958




[GitHub] [airflow] Bowrna commented on issue #29958: GCSToBigQueryOperator does not respect the destination project ID

Posted by "Bowrna (via GitHub)" <gi...@apache.org>.
Bowrna commented on issue #29958:
URL: https://github.com/apache/airflow/issues/29958#issuecomment-1463294194

   @Yaro1 you could raise a PR and put `related: #29958` or `closes: #29958` in it;
   that way the PR will be linked to this issue.

