Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/04/21 02:39:59 UTC

[GitHub] [airflow] bryan824 opened a new pull request, #23141: perf(BigQuery): pass table_id as str type

bryan824 opened a new pull request, #23141:
URL: https://github.com/apache/airflow/pull/23141

   While migrating from 1.10.14 to 2.2.3, I noticed an issue in `BigQueryDeleteTableOperator`. For context, there are two ways to specify a table in GCP BigQuery: with the project_id, like `my-project.mydataset.mytable`, or without it, like `mydataset.mytable`.
   
   In 1.10.14, I used the form without project_id, because `BigQueryHook` could resolve the table by using `bigquery_conn_id` to fetch the `project_id` from the connection configuration.
   
   The path to pass this info is: [gcp_api_base_hook#L131](https://github.com/apache/airflow/blob/c743b95a02ba1ec04013635a56ad042ce98823d2/airflow/contrib/hooks/gcp_api_base_hook.py#L131) ->  [gcp_api_base_hook#L200](https://github.com/apache/airflow/blob/c743b95a02ba1ec04013635a56ad042ce98823d2/airflow/contrib/hooks/gcp_api_base_hook.py#L200) -> [bigquery_hook#L71](https://github.com/apache/airflow/blob/c743b95a02ba1ec04013635a56ad042ce98823d2/airflow/contrib/hooks/bigquery_hook.py#L71) -> [bigquery_hook#L1498](https://github.com/apache/airflow/blob/c743b95a02ba1ec04013635a56ad042ce98823d2/airflow/contrib/hooks/bigquery_hook.py#L1498).
   
   But after upgrading to 2.2.3, a full `table_id` is required. This is unexpected: `bigquery_conn_id`/`gcp_conn_id` is still a valid parameter, so `BigQueryDeleteTableOperator` should still be able to get `project_id` automatically from the connection configuration. The **_root cause_** appears to be this line of code, [bigquery#L1195](https://github.com/apache/airflow/blob/eb26510d3a1ccfaa9e4f8e1e0c91b5c74ae7393e/airflow/providers/google/cloud/hooks/bigquery.py#L1195), which forces users to supply a full `table_id` in order to create a `Table` instance.
   
   The `delete_table` method accepts four types of table argument: `Table`, `TableReference`, `TableListItem` and `str`, as shown in [client#L1754](https://github.com/googleapis/python-bigquery/blob/c1d3e3089de1c267f8fb013283289b7d42172c76/google/cloud/bigquery/client.py#L1754). Then, in [client#L1784](https://github.com/googleapis/python-bigquery/blob/c1d3e3089de1c267f8fb013283289b7d42172c76/google/cloud/bigquery/client.py#L1784), it converts all four to a single type, `TableReference`, as shown in [table#L2689](https://github.com/googleapis/python-bigquery/blob/c1d3e3089de1c267f8fb013283289b7d42172c76/google/cloud/bigquery/table.py#L2689).
   
   As a possible improvement, I wonder whether migration would be smoother if, instead of using `Table.from_string` to build a `Table`, the hook passed the `str` parameter directly. This `str` can then be just `mydataset.mytable`, with `project_id` set by the `Client` as shown in [bigquery#L1194](https://github.com/apache/airflow/blob/8dedd2ac13a6cdc0c363446985f492e0f702f639/airflow/providers/google/cloud/hooks/bigquery.py#L1194). I believe that, because of the version support plan of [GCP](https://cloud.google.com/composer/docs/composer-2/composer-versioning-overview#version-support-for-composer-1), companies are gradually migrating to Airflow 2.0 for better support. This improvement would save them from adding the `project_id` to the `table_id` in hundreds of DAGs, since it is already included in the connection configuration.
   
   Below are two scenarios based on the two formats of specifying a BigQuery table:
   
   1. A `table_id` like `mydataset.mytable` is passed in [bigquery#L1797](https://github.com/apache/airflow/blob/8dedd2ac13a6cdc0c363446985f492e0f702f639/airflow/providers/google/cloud/operators/bigquery.py#L1797) and the corresponding `project_id` is configured on the connection. This works as expected. If no `project_id` is found, the error is raised in [_helpers#L825](https://github.com/googleapis/python-bigquery/blob/c1d3e3089de1c267f8fb013283289b7d42172c76/google/cloud/bigquery/_helpers.py#L825).
   2. A `table_id` like `my-project.mydataset.mytable` is passed. In this case, whether or not a `project_id` is configured on the connection, and whether or not it matches, the `project_id` embedded in the `table_id` is used, as shown in [_helpers#L836](https://github.com/googleapis/python-bigquery/blob/c1d3e3089de1c267f8fb013283289b7d42172c76/google/cloud/bigquery/_helpers.py#L836).
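
A minimal sketch of the two scenarios above, in plain Python. The helper `resolve_table_id` and its `default_project` parameter are hypothetical names used only for illustration; this is a simplified stand-in for the resolution behaviour described here, not the python-bigquery implementation, and it assumes project IDs contain no dots.

```python
from typing import Optional, Tuple


def resolve_table_id(
    table_id: str, default_project: Optional[str] = None
) -> Tuple[str, str, str]:
    """Split a table ID into (project, dataset, table).

    Hypothetical helper: mimics the behaviour described in the two
    scenarios, assuming no dots inside the project ID itself.
    """
    parts = table_id.split(".")
    if len(parts) == 3:
        # Scenario 2: the project embedded in table_id wins,
        # regardless of what the connection is configured with.
        project, dataset, table = parts
        return project, dataset, table
    if len(parts) == 2:
        # Scenario 1: fall back to the project configured on the connection.
        if not default_project:
            raise ValueError(
                "table_id must be fully qualified when no default project is set"
            )
        dataset, table = parts
        return default_project, dataset, table
    raise ValueError(f"unrecognised table_id format: {table_id!r}")


if __name__ == "__main__":
    print(resolve_table_id("mydataset.mytable", default_project="my-project"))
    print(resolve_table_id("my-project.mydataset.mytable", default_project="other"))
```

Under these assumptions, a short `table_id` only fails when no default project is available, which is the case that `Table.from_string` runs into because it has no connection-level project to fall back on.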
   
   This is my first attempt at submitting a PR to an open-source repo. Please let me know how I can improve. It is also fine if this change is not worth merging. I enjoyed looking into this.
   
   @kaxil @eladkal @potiuk 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] bryan824 commented on pull request #23141: perf(BigQuery): pass table_id as str type

Posted by GitBox <gi...@apache.org>.
bryan824 commented on PR #23141:
URL: https://github.com/apache/airflow/pull/23141#issuecomment-1166067120

   Thanks for checking. Yes, what you said is exactly why I submitted this PR. Adding `project_id` should be optional since it is part of the connection that is already configured.


[GitHub] [airflow] bhirsz commented on pull request #23141: perf(BigQuery): pass table_id as str type

Posted by GitBox <gi...@apache.org>.
bhirsz commented on PR #23141:
URL: https://github.com/apache/airflow/pull/23141#issuecomment-1165250354

   I apologize, it somehow slipped my notice - I'm taking a look.


[GitHub] [airflow] github-actions[bot] commented on pull request #23141: perf(BigQuery): pass table_id as str type

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #23141:
URL: https://github.com/apache/airflow/pull/23141#issuecomment-1163788922

   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.


[GitHub] [airflow] boring-cyborg[bot] commented on pull request #23141: perf(BigQuery): pass table_id as str type

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on PR #23141:
URL: https://github.com/apache/airflow/pull/23141#issuecomment-1104647467

   Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
   Here are some useful points:
   - Pay attention to the quality of your code (flake8, mypy and type annotations). Our [pre-commits]( https://github.com/apache/airflow/blob/main/STATIC_CODE_CHECKS.rst#prerequisites-for-pre-commit-hooks) will help you with that.
   - In case of a new feature, add useful documentation (in docstrings or in the `docs/` directory). Adding a new operator? Check this short [guide](https://github.com/apache/airflow/blob/main/docs/apache-airflow/howto/custom-operator.rst). Consider adding an example DAG that shows how users should use it.
   - Consider using [Breeze environment](https://github.com/apache/airflow/blob/main/BREEZE.rst) for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
   - Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
   - Please follow [ASF Code of Conduct](https://www.apache.org/foundation/policies/conduct) for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
   - Be sure to read the [Airflow Coding style]( https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#coding-style-and-best-practices).
   Apache Airflow is a community-driven project and together we are making it better 🚀.
   In case of doubts contact the developers at:
   Mailing List: dev@airflow.apache.org
   Slack: https://s.apache.org/airflow-slack
   


[GitHub] [airflow] boring-cyborg[bot] commented on pull request #23141: perf(BigQuery): pass table_id as str type

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on PR #23141:
URL: https://github.com/apache/airflow/pull/23141#issuecomment-1174315170

   Awesome work, congrats on your first merged pull request!
   


[GitHub] [airflow] potiuk merged pull request #23141: perf(BigQuery): pass table_id as str type

Posted by GitBox <gi...@apache.org>.
potiuk merged PR #23141:
URL: https://github.com/apache/airflow/pull/23141


[GitHub] [airflow] bhirsz commented on pull request #23141: perf(BigQuery): pass table_id as str type

Posted by GitBox <gi...@apache.org>.
bhirsz commented on PR #23141:
URL: https://github.com/apache/airflow/pull/23141#issuecomment-1165256372

   So if I got it correctly, the hook delete_table accepts 4 types of parameters, but we're feeding it only with ``Table.from_string()``, which forces us to include project_id in table_id. But since delete_table can accept table_id (with or without project_id - it will be resolved anyway), it's safe to pass table_id directly, which makes it more flexible for the users.
   
   I think the change is OK.


[GitHub] [airflow] potiuk commented on pull request #23141: perf(BigQuery): pass table_id as str type

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #23141:
URL: https://github.com/apache/airflow/pull/23141#issuecomment-1120489501

   Thanks for that - and sorry for the delay, it's been a rather busy period for us all (and it's going to last for a while). I am not sure if this one is good or not - I am not a bq expert, but maybe @turbaszek, @mik-laj, @TobKed, or maybe @lwyszomi or @bhirsz can chime in here. In any case, this does not seem like something that needs to be merged quickly.

