Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/08/24 00:25:22 UTC

[GitHub] [airflow] lawrencestfs opened a new issue #17800: BigQueryCreateExternalTableOperator from providers package fails to get schema from GCS object

lawrencestfs opened a new issue #17800:
URL: https://github.com/apache/airflow/issues/17800


   
   **Apache Airflow version**: 1.10.15
   
   **OS**: Linux 5.4.109+
   
   **Apache Airflow Provider versions**:
   apache-airflow-backport-providers-apache-beam==2021.3.13
   apache-airflow-backport-providers-cncf-kubernetes==2021.3.3
   apache-airflow-backport-providers-google==2021.3.3
   
   **Deployment**: Cloud Composer 1.16.6 (Google Cloud Managed Airflow Service)
   
   **What happened**:
   
   BigQueryCreateExternalTableOperator from the providers package ([airflow.providers.google.cloud.operators.bigquery](https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/operators/bigquery.py)) fails with correct _schema_object_ parameter.
   
   **What you expected to happen**:
   
   I expected the DAG to run successfully, as I've previously tested it with the deprecated operator from the contrib package ([airflow.contrib.operators.bigquery_operator](https://github.com/apache/airflow/blob/5786dcdc392f7a2649f398353a0beebef01c428e/airflow/contrib/operators/bigquery_operator.py#L476)), using the same parameters.
   
   Debugging the DAG execution log, I saw that the providers operator generated a wrong call to the Cloud Storage API: it mixed up the bucket and object parameters, as shown in the stack trace below.
   
   ```
   [2021-08-23 23:17:22,316] {taskinstance.py:1152} ERROR - 404 GET https://storage.googleapis.com/download/storage/v1/b/foo/bar/schema.json/o/mybucket?alt=media: Not Found: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
   Traceback (most recent call last):
     File "/opt/python3.6/lib/python3.6/site-packages/google/cloud/storage/client.py", line 728, in download_blob_to_file
       checksum=checksum,
     File "/opt/python3.6/lib/python3.6/site-packages/google/cloud/storage/blob.py", line 956, in _do_download
       response = download.consume(transport, timeout=timeout)
     File "/opt/python3.6/lib/python3.6/site-packages/google/resumable_media/requests/download.py", line 168, in consume
       self._process_response(result)
     File "/opt/python3.6/lib/python3.6/site-packages/google/resumable_media/_download.py", line 186, in _process_response
       response, _ACCEPTABLE_STATUS_CODES, self._get_status_code
     File "/opt/python3.6/lib/python3.6/site-packages/google/resumable_media/_helpers.py", line 104, in require_status_code
       *status_codes
   google.resumable_media.common.InvalidResponse: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)

   During handling of the above exception, another exception occurred:

   Traceback (most recent call last):
     File "/usr/local/lib/airflow/airflow/models/taskinstance.py", line 985, in _run_raw_task
       result = task_copy.execute(context=context)
     File "/usr/local/lib/airflow/airflow/providers/google/cloud/operators/bigquery.py", line 1178, in execute
       schema_fields = json.loads(gcs_hook.download(self.bucket, self.schema_object))
     File "/usr/local/lib/airflow/airflow/providers/google/cloud/hooks/gcs.py", line 301, in download
       return blob.download_as_string()
     File "/opt/python3.6/lib/python3.6/site-packages/google/cloud/storage/blob.py", line 1391, in download_as_string
       timeout=timeout,
     File "/opt/python3.6/lib/python3.6/site-packages/google/cloud/storage/blob.py", line 1302, in download_as_bytes
       checksum=checksum,
     File "/opt/python3.6/lib/python3.6/site-packages/google/cloud/storage/client.py", line 731, in download_blob_to_file
       _raise_from_invalid_response(exc)
     File "/opt/python3.6/lib/python3.6/site-packages/google/cloud/storage/blob.py", line 3936, in _raise_from_invalid_response
       raise exceptions.from_http_status(response.status_code, message, response=response)
   google.api_core.exceptions.NotFound: 404 GET https://storage.googleapis.com/download/storage/v1/b/foo/bar/schema.json/o/mybucket?alt=media: Not Found: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
   ```
   PS: the bucket name (_mybucket_) and object path (_foo/bar/schema.json_) were masked for security reasons. Note that they appear swapped in the request URL above.
   
   I believe the error appears on the [following](https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/operators/bigquery.py#L1183) line, although the bug itself is probably located in the [gcs_hook.download()](https://github.com/apache/airflow/blob/0264fea8c2024d7d3d64aa0ffa28a0cfa48839cd/airflow/providers/google/cloud/hooks/gcs.py#L266) method:
   
   `schema_fields = json.loads(gcs_hook.download(self.bucket, self.schema_object))`
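
   For illustration, here is a minimal, self-contained sketch of the suspected mix-up. The simplified signature below is hypothetical (the real hook takes more parameters), but it shows how a hook that expects the object name first, combined with an operator that passes the bucket first positionally, would produce exactly the swapped URL seen in the stack trace:

   ```
   # Hypothetical, simplified stand-in for gcs_hook.download(); it only
   # builds the request URL, to make the argument swap visible.
   def download(object_name, bucket_name=None):
       return "GET .../b/{}/o/{}?alt=media".format(bucket_name, object_name)

   bucket = 'mybucket'
   schema_object = 'foo/bar/schema.json'

   # Positional call, as in the operator: the bucket lands in object_name
   # and the schema object lands in bucket_name.
   print(download(bucket, schema_object))
   # GET .../b/foo/bar/schema.json/o/mybucket?alt=media

   # Keyword arguments would avoid the swap:
   print(download(object_name=schema_object, bucket_name=bucket))
   # GET .../b/mybucket/o/foo/bar/schema.json?alt=media
   ```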
   
   **How to reproduce it**:
   
   Create a DAG using both operators with the same parameters, as in the example below. The task using the contrib version of the operator should work, while the task using the providers version should fail.
   
   ```
   from datetime import datetime

   from airflow import DAG
   from airflow.contrib.operators.bigquery_operator import BigQueryCreateExternalTableOperator as BQExtTabOptContrib
   from airflow.providers.google.cloud.operators.bigquery import BigQueryCreateExternalTableOperator as BQExtTabOptProviders

   # Minimal DAG definition so the example runs as-is (any DAG would do).
   with DAG(
       dag_id='bq_create_external_table_repro',
       start_date=datetime(2021, 8, 1),
       schedule_interval=None,
   ) as dag:

       # contrib version: succeeds with these parameters.
       create_landing_external_table_contrib = BQExtTabOptContrib(
           task_id='create_landing_external_table_contrib',
           bucket='mybucket',
           source_objects=['foo/bar/*.csv'],
           destination_project_dataset_table='project.dataset.table',
           schema_object='foo/bar/schema_file.json',
       )

       # providers version: fails with the 404 shown above.
       create_landing_external_table_providers = BQExtTabOptProviders(
           task_id='create_landing_external_table_providers',
           bucket='mybucket',
           source_objects=['foo/bar/*.csv'],
           destination_project_dataset_table='project.dataset.table',
           schema_object='foo/bar/schema_file.json',
       )
   ```
   
   **Anything else we need to know**:
   
   The [*gcs_hook.download()*](https://github.com/apache/airflow/blob/0264fea8c2024d7d3d64aa0ffa28a0cfa48839cd/airflow/providers/google/cloud/hooks/gcs.py#L313) method uses _download_as_string()_, which is deprecated in the Cloud Storage Python client (https://googleapis.dev/python/storage/latest/blobs.html). It should be changed to _download_as_bytes()_.
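
   For reference, a minimal sketch of that substitution against the Cloud Storage client (the bucket and object names below reuse the masked placeholders from this issue):

   ```
   from google.cloud import storage  # google-cloud-storage >= 1.31 provides download_as_bytes()

   client = storage.Client()
   blob = client.bucket('mybucket').blob('foo/bar/schema.json')

   # Deprecated: download_as_string() emits a DeprecationWarning and
   # delegates to download_as_bytes() under the hood.
   # data = blob.download_as_string()

   # Preferred: returns the object's contents as bytes.
   data = blob.download_as_bytes()
   ```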
   
   Also, comparing the [providers version](https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/operators/bigquery.py#L1183) of the operator to the [contrib version](https://github.com/apache/airflow/blob/5786dcdc392f7a2649f398353a0beebef01c428e/airflow/contrib/operators/bigquery_operator.py#L621), I noticed that a decode step is also missing: `.decode("utf-8")`.
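
   As a small, self-contained illustration (the byte string below is a made-up example of a downloaded schema file):

   ```
   import json

   # Example payload, of the kind gcs_hook.download() returns (bytes):
   raw = b'[{"name": "id", "type": "INTEGER", "mode": "NULLABLE"}]'

   # The contrib operator decodes the bytes before parsing:
   schema_fields = json.loads(raw.decode("utf-8"))
   ```

   That said, `json.loads()` has accepted bytes directly since Python 3.6, so the missing decode is a consistency gap with the contrib version rather than the cause of the 404 above.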
   
   **Are you willing to submit a PR?**
   
   Yes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on issue #17800: BigQueryCreateExternalTableOperator from providers package fails to get schema from GCS object

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #17800:
URL: https://github.com/apache/airflow/issues/17800#issuecomment-904225776


   Thanks for opening your first issue here! Be sure to follow the issue template!
   





[GitHub] [airflow] uranusjr edited a comment on issue #17800: BigQueryCreateExternalTableOperator from providers package fails to get schema from GCS object

Posted by GitBox <gi...@apache.org>.
uranusjr edited a comment on issue #17800:
URL: https://github.com/apache/airflow/issues/17800#issuecomment-904336344


   It feels weird to me that the use of `download_as_string` results in a 404; the issue seems to be separate. A PR fixing `download_as_string` would be very much welcomed, but I suspect it won't fix your issue. But I could very well be wrong.





[GitHub] [airflow] potiuk closed issue #17800: BigQueryCreateExternalTableOperator from providers package fails to get schema from GCS object

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #17800:
URL: https://github.com/apache/airflow/issues/17800


   




