Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2023/01/09 01:41:57 UTC

[GitHub] [airflow] vchiapaikeo opened a new pull request, #28796: Fix BigQueryColumnCheckOperator runtime error

vchiapaikeo opened a new pull request, #28796:
URL: https://github.com/apache/airflow/pull/28796

   BigQueryColumnCheckOperator currently raises a TypeError at runtime, causing tasks to fail. This was initially uncovered while investigating another issue with this operator: https://github.com/apache/airflow/issues/28343#issuecomment-1374350497
   
   This PR fixes the operator by calling the list's extend() method instead of attempting to call the list itself, and adds a few tests.
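   
   For illustration, a minimal sketch of this failure mode (the `rows` and `records` names here are illustrative, not the operator's actual internals):
   
   ```py
   # Sketch of the bug class fixed in this PR; names are illustrative.
   rows = [{"col_name": "col1", "check_type": "min", "check_result": 2}]
   
   records = []
   for row in rows:
       # Buggy pattern: calling the list as if it were a function raises
       # "TypeError: 'list' object is not callable" at runtime.
       # records(row.values())
   
       # Fixed pattern: extend the list with the row's values instead.
       records.extend(row.values())
   
   print(records)  # ['col1', 'min', 2]
   ```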
   
   As an aside, I had to commit with SKIP=run-mypy because I ran into this unusual pre-commit failure, which doesn't seem related to this change:
   
   ```
   Run mypy for providers.................................................................Failed
   - hook id: run-mypy
   - exit code: 1
   
   airflow/providers/google/cloud/operators/bigquery.py:250: error:
   "BigQueryCheckOperator" has no attribute "_raise_exception"  [attr-defined]
                   self._raise_exception(f"Test failed.\nQuery:\n{self.sql}\n...
                   ^
   Found 1 error in 1 file (checked 1 source file)
   If you see strange stacktraces above, run `breeze ci-image build --python 3.7` and try again.
   
   ```
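   
   The skip itself just uses pre-commit's standard SKIP environment variable; the commit looked something like this (the commit message shown is illustrative):
   
   ```
   SKIP=run-mypy git commit -m "Fix BigQueryColumnCheckOperator runtime error"
   ```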
   
   ## Test DAG
   
   ```py
   from airflow import DAG
   
   from airflow.providers.google.cloud.operators.bigquery import BigQueryColumnCheckOperator
   
   DEFAULT_TASK_ARGS = {
       "owner": "gcp-data-platform",
       "retries": 1,
       "retry_delay": 10,
       "start_date": "2022-08-01",
   }
   
   with DAG(
       max_active_runs=1,
       concurrency=2,
       catchup=False,
       schedule_interval="@daily",
       dag_id="test_bigquery_column_check",
       default_args=DEFAULT_TASK_ARGS,
   ) as dag:
   
       basic_column_quality_checks = BigQueryColumnCheckOperator(
           task_id="check_columns",
           table="my-project.vchiapaikeo.test1",
           use_legacy_sql=False,
           column_mapping={
               "col1": {"min": {"greater_than": 0}},
           },
       )
   ```
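   
   For reference, a hypothetical sketch of the pass/fail comparison that the column_mapping above implies: `"col1": {"min": {"greater_than": 0}}` asserts that MIN(col1) is greater than 0. The operator itself builds and runs SQL; the function below only illustrates the tolerance semantics and is not the real implementation:
   
   ```py
   # Hypothetical helper mirroring the tolerance semantics of column_mapping.
   # Not the operator's actual code.
   def check_passes(check_result: float, tolerance: dict) -> bool:
       if "greater_than" in tolerance:
           return check_result > tolerance["greater_than"]
       if "less_than" in tolerance:
           return check_result < tolerance["less_than"]
       raise ValueError(f"Unsupported tolerance: {tolerance}")
   
   # Matches the task logs below: MIN(col1) = 2, and 2 > 0, so the check passes.
   assert check_passes(2, {"greater_than": 0})
   ```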
   
   Screenshot of the test DAG run: https://user-images.githubusercontent.com/9200263/211229519-96a9f439-ffe4-4ddc-bf07-84e5b73bb45d.png
   
   
   Task logs from the fixed run (attempt 3 succeeds; the min check returns 2, which satisfies `greater_than: 0`):
   
   ```
   686f5b14989d
   *** Reading local file: /root/airflow/logs/dag_id=test_bigquery_column_check/run_id=scheduled__2023-01-08T00:00:00+00:00/task_id=check_columns/attempt=3.log
   [2023-01-09, 01:40:19 UTC] {taskinstance.py:1093} INFO - Dependencies all met for <TaskInstance: test_bigquery_column_check.check_columns scheduled__2023-01-08T00:00:00+00:00 [queued]>
   [2023-01-09, 01:40:19 UTC] {taskinstance.py:1093} INFO - Dependencies all met for <TaskInstance: test_bigquery_column_check.check_columns scheduled__2023-01-08T00:00:00+00:00 [queued]>
   [2023-01-09, 01:40:19 UTC] {taskinstance.py:1295} INFO - 
   --------------------------------------------------------------------------------
   [2023-01-09, 01:40:19 UTC] {taskinstance.py:1296} INFO - Starting attempt 3 of 4
   [2023-01-09, 01:40:19 UTC] {taskinstance.py:1297} INFO - 
   --------------------------------------------------------------------------------
   [2023-01-09, 01:40:19 UTC] {taskinstance.py:1316} INFO - Executing <Task(BigQueryColumnCheckOperator): check_columns> on 2023-01-08 00:00:00+00:00
   [2023-01-09, 01:40:19 UTC] {standard_task_runner.py:55} INFO - Started process 481 to run task
   [2023-01-09, 01:40:20 UTC] {standard_task_runner.py:82} INFO - Running: ['***', 'tasks', 'run', 'test_bigquery_column_check', 'check_columns', 'scheduled__2023-01-08T00:00:00+00:00', '--job-id', '5', '--raw', '--subdir', 'DAGS_FOLDER/test_bigquery_column_check.py', '--cfg-path', '/tmp/tmpgeqtp2hz']
   [2023-01-09, 01:40:20 UTC] {standard_task_runner.py:83} INFO - Job 5: Subtask check_columns
   [2023-01-09, 01:40:21 UTC] {task_command.py:391} INFO - Running <TaskInstance: test_bigquery_column_check.check_columns scheduled__2023-01-08T00:00:00+00:00 [running]> on host 686f5b14989d
   [2023-01-09, 01:40:21 UTC] {taskinstance.py:1525} INFO - Exporting the following env vars:
   AIRFLOW_CTX_DAG_OWNER=gcp-data-platform
   AIRFLOW_CTX_DAG_ID=test_bigquery_column_check
   AIRFLOW_CTX_TASK_ID=check_columns
   AIRFLOW_CTX_EXECUTION_DATE=2023-01-08T00:00:00+00:00
   AIRFLOW_CTX_TRY_NUMBER=3
   AIRFLOW_CTX_DAG_RUN_ID=scheduled__2023-01-08T00:00:00+00:00
   [2023-01-09, 01:40:21 UTC] {base.py:73} INFO - Using connection ID 'google_cloud_default' for task execution.
   [2023-01-09, 01:40:21 UTC] {credentials_provider.py:323} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
   [2023-01-09, 01:40:21 UTC] {_default.py:649} WARNING - No project ID could be determined. Consider running `gcloud config set project` or setting the GOOGLE_CLOUD_PROJECT environment variable
   [2023-01-09, 01:40:21 UTC] {bigquery.py:1539} INFO - Inserting job ***_1673228421636668_2d2a9b688dcd63bef1c449cd8b764f86
   [2023-01-09, 01:40:23 UTC] {bigquery.py:601} INFO - Record:   col_name check_type  check_result
   0     col1        min             2
   [2023-01-09, 01:40:23 UTC] {bigquery.py:628} INFO - All tests have passed
   [2023-01-09, 01:40:23 UTC] {taskinstance.py:1339} INFO - Marking task as SUCCESS. dag_id=test_bigquery_column_check, task_id=check_columns, execution_date=20230108T000000, start_date=20230109T014019, end_date=20230109T014023
   [2023-01-09, 01:40:23 UTC] {local_task_job.py:211} INFO - Task exited with return code 0
   [2023-01-09, 01:40:23 UTC] {taskinstance.py:2613} INFO - 0 downstream tasks scheduled from follow-on schedule check
   ```
   
   cc: @eladkal , @VladaZakharova , @denimalpaca 
   

[GitHub] [airflow] eladkal merged pull request #28796: Fix BigQueryColumnCheckOperator runtime error

Posted by GitBox <gi...@apache.org>.
eladkal merged PR #28796:
URL: https://github.com/apache/airflow/pull/28796

