You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/12/10 19:21:21 UTC

[GitHub] [airflow] VladaZakharova opened a new pull request, #28284: Fix for issue with reading schema fields for JSON files

VladaZakharova opened a new pull request, #28284:
URL: https://github.com/apache/airflow/pull/28284

   This fix includes changes for issue with incorrect reading schema_fields in case of using JSON files.
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of an existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code changes, an Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in a newsfragment file, named `{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in [newsfragments](https://github.com/apache/airflow/tree/main/newsfragments).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] VladaZakharova closed pull request #28284: Fix for issue with reading schema fields for JSON files in GCSToBigQueryOperator

Posted by GitBox <gi...@apache.org>.
VladaZakharova closed pull request #28284: Fix for issue with reading schema fields for JSON files in GCSToBigQueryOperator
URL: https://github.com/apache/airflow/pull/28284


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] VladaZakharova commented on a diff in pull request #28284: Fix for issue with reading schema fields for JSON files in GCSToBigQueryOperator

Posted by GitBox <gi...@apache.org>.
VladaZakharova commented on code in PR #28284:
URL: https://github.com/apache/airflow/pull/28284#discussion_r1052979125


##########
tests/system/providers/google/cloud/gcs/example_gcs_to_bigquery_async.py:
##########
@@ -81,6 +81,20 @@
         max_id_key=MAX_ID_DATE,
         deferrable=True,
     )
+

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] IKholopov commented on a diff in pull request #28284: Fix for issue with reading schema fields for JSON files in GCSToBigQueryOperator

Posted by GitBox <gi...@apache.org>.
IKholopov commented on code in PR #28284:
URL: https://github.com/apache/airflow/pull/28284#discussion_r1048935913


##########
airflow/providers/google/cloud/transfers/gcs_to_bigquery.py:
##########
@@ -544,43 +494,272 @@ def _check_schema_fields(self, table_resource):
         :param table_resource: Configuration or table_resource dictionary
         :return: table_resource: Updated table_resource dict with schema_fields
         """
-        if not self.autodetect and not self.schema_fields:
-            raise RuntimeError(
-                "Table schema was not found. Set autodetect=True to "
-                "automatically set schema fields from source objects or pass "
-                "schema_fields explicitly"
-            )
-        elif not self.schema_fields:
+        if not self.schema_fields:

Review Comment:
   This is still broken in multiple places. 
   
   For example, it makes the assumption about the delimiter being "," instead of the value of `field_delimiter` parameter, same for `quote_character` parameter.
   
   Overall, I don't think that attempt to replicate the logic of determining schema instead of just letting "autodetect" and BigQuery do their thing is a wise idea - there are numerous edge cases that we would need to cover and we will continue getting issues because of implementations getting out of sync.
   
   I believe that the good approach would be to remove the explicit setting of `"schema"` in configuration if `autodetect==True` or remove `_check_schema_fields()` altogether.



##########
tests/system/providers/google/cloud/gcs/example_gcs_to_bigquery_async.py:
##########
@@ -81,6 +81,20 @@
         max_id_key=MAX_ID_DATE,
         deferrable=True,
     )
+

Review Comment:
   Can we add a test with non-default values of `field_delimiter`, `quote_character` parameters?



##########
airflow/providers/google/cloud/transfers/gcs_to_bigquery.py:
##########
@@ -322,101 +345,29 @@ def execute(self, context: Context):
             if self.schema_object and self.source_format != "DATASTORE_BACKUP":
                 schema_fields = json.loads(gcs_hook.download(self.bucket, self.schema_object).decode("utf-8"))
                 self.log.info("Autodetected fields from schema object: %s", schema_fields)

Review Comment:
   This logging statement is not correct. Nothing got *autodetected* here, we just loaded the explicitly specified schema.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] VladaZakharova closed pull request #28284: Fix for issue with reading schema fields for JSON files in GCSToBigQueryOperator

Posted by GitBox <gi...@apache.org>.
VladaZakharova closed pull request #28284: Fix for issue with reading schema fields for JSON files in GCSToBigQueryOperator
URL: https://github.com/apache/airflow/pull/28284


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] quentin-sommer commented on a diff in pull request #28284: Fix for issue with reading schema fields for JSON files in GCSToBigQueryOperator

Posted by GitBox <gi...@apache.org>.
quentin-sommer commented on code in PR #28284:
URL: https://github.com/apache/airflow/pull/28284#discussion_r1045141132


##########
airflow/providers/google/cloud/transfers/gcs_to_bigquery.py:
##########
@@ -300,6 +305,16 @@ def execute(self, context: Context):
             impersonation_chain=self.impersonation_chain,
         )
         self.hook = hook
+        self.source_format = self.source_format.upper()
+        allowed_formats = [

Review Comment:
   Why not declare this and validate `source_format` in the `__init__` to fail at dag parsing time instead of during the execution?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk merged pull request #28284: Fix for issue with reading schema fields for JSON files in GCSToBigQueryOperator

Posted by GitBox <gi...@apache.org>.
potiuk merged PR #28284:
URL: https://github.com/apache/airflow/pull/28284


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] IKholopov commented on pull request #28284: Fix for issue with reading schema fields for JSON files in GCSToBigQueryOperator

Posted by GitBox <gi...@apache.org>.
IKholopov commented on PR #28284:
URL: https://github.com/apache/airflow/pull/28284#issuecomment-1363535466

   Thanks for fixing it up, Vlada!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] VladaZakharova commented on pull request #28284: Fix for issue with reading schema fields for JSON files in GCSToBigQueryOperator

Posted by GitBox <gi...@apache.org>.
VladaZakharova commented on PR #28284:
URL: https://github.com/apache/airflow/pull/28284#issuecomment-1363692360

   Thank you for helping me to improve my code :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] VladaZakharova commented on a diff in pull request #28284: Fix for issue with reading schema fields for JSON files in GCSToBigQueryOperator

Posted by GitBox <gi...@apache.org>.
VladaZakharova commented on code in PR #28284:
URL: https://github.com/apache/airflow/pull/28284#discussion_r1051930431


##########
airflow/providers/google/cloud/transfers/gcs_to_bigquery.py:
##########
@@ -544,43 +494,272 @@ def _check_schema_fields(self, table_resource):
         :param table_resource: Configuration or table_resource dictionary
         :return: table_resource: Updated table_resource dict with schema_fields
         """
-        if not self.autodetect and not self.schema_fields:
-            raise RuntimeError(
-                "Table schema was not found. Set autodetect=True to "
-                "automatically set schema fields from source objects or pass "
-                "schema_fields explicitly"
-            )
-        elif not self.schema_fields:
+        if not self.schema_fields:

Review Comment:
   Yes, seems like that. I was trying to add this logic to simplify working with BigQuery schemas: as you know, BQ can only determine schema fields if the values for every column is of non-string format. (please, check the link: https://cloud.google.com/bigquery/docs/schema-detect#csv_header)
   But turned out that my implementation needs a lot of reworking :)
   I will remove this logic and let the schema fields be "string_field_1" as it was before.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] VladaZakharova commented on a diff in pull request #28284: Fix for issue with reading schema fields for JSON files in GCSToBigQueryOperator

Posted by GitBox <gi...@apache.org>.
VladaZakharova commented on code in PR #28284:
URL: https://github.com/apache/airflow/pull/28284#discussion_r1052979311


##########
airflow/providers/google/cloud/transfers/gcs_to_bigquery.py:
##########
@@ -322,101 +345,29 @@ def execute(self, context: Context):
             if self.schema_object and self.source_format != "DATASTORE_BACKUP":
                 schema_fields = json.loads(gcs_hook.download(self.bucket, self.schema_object).decode("utf-8"))
                 self.log.info("Autodetected fields from schema object: %s", schema_fields)

Review Comment:
   Sure, I have changed the output



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] VladaZakharova closed pull request #28284: Fix for issue with reading schema fields for JSON files in GCSToBigQueryOperator

Posted by GitBox <gi...@apache.org>.
VladaZakharova closed pull request #28284: Fix for issue with reading schema fields for JSON files in GCSToBigQueryOperator
URL: https://github.com/apache/airflow/pull/28284


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] VladaZakharova commented on a diff in pull request #28284: Fix for issue with reading schema fields for JSON files in GCSToBigQueryOperator

Posted by GitBox <gi...@apache.org>.
VladaZakharova commented on code in PR #28284:
URL: https://github.com/apache/airflow/pull/28284#discussion_r1045151208


##########
airflow/providers/google/cloud/transfers/gcs_to_bigquery.py:
##########
@@ -300,6 +305,16 @@ def execute(self, context: Context):
             impersonation_chain=self.impersonation_chain,
         )
         self.hook = hook
+        self.source_format = self.source_format.upper()
+        allowed_formats = [

Review Comment:
   Yes, looks like it will be better. I have added all appropriate changes.
   Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org