Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/05/30 14:11:03 UTC

[GitHub] [airflow] potiuk commented on a change in pull request #8868: Add BigQueryInsertJobOperator

potiuk commented on a change in pull request #8868:
URL: https://github.com/apache/airflow/pull/8868#discussion_r432844561



##########
File path: tests/providers/google/cloud/operators/test_bigquery.py
##########
@@ -788,3 +789,65 @@ def test_execute(self, mock_hook):
                 project_id=TEST_GCP_PROJECT_ID,
                 table_resource=TEST_TABLE_RESOURCES
             )
+
+
+class TestBigQueryInsertJobOperator:
+    @mock.patch('airflow.providers.google.cloud.operators.bigquery.BigQueryHook')
+    def test_execute(self, mock_hook):
+        job_id = "123456"
+        configuration = {
+            "query": {
+                "query": "SELECT * FROM any",
+                "useLegacySql": False,
+            }
+        }
+        mock_hook.return_value.insert_job.return_value = MagicMock(job_id=job_id)
+
+        op = BigQueryInsertJobOperator(
+            task_id="insert_query_job",
+            configuration=configuration,
+            location=TEST_DATASET_LOCATION,
+            job_id=job_id,
+            project_id=TEST_GCP_PROJECT_ID
+        )
+        result = op.execute({})
+
+        mock_hook.return_value.insert_job.assert_called_once_with(
+            configuration=configuration,
+            location=TEST_DATASET_LOCATION,
+            job_id=job_id,
+            project_id=TEST_GCP_PROJECT_ID,
+        )
+
+        assert result == job_id
+
+    @mock.patch('airflow.providers.google.cloud.operators.bigquery.BigQueryHook')

Review comment:
       Should we also add a test for retry behaviour?
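       A rough sketch of what such a test could look like (untested; it assumes
       the module-level imports and constants already used in this test file,
       that Conflict comes from google.api_core.exceptions, and that execute()
       returns the job id, as the test above expects):

           # at module level:
           from google.api_core.exceptions import Conflict

           # inside TestBigQueryInsertJobOperator:
           @mock.patch('airflow.providers.google.cloud.operators.bigquery.BigQueryHook')
           def test_execute_reattaches_on_conflict(self, mock_hook):
               job_id = "123456"
               configuration = {"query": {"query": "SELECT 1", "useLegacySql": False}}

               # The first submission collides with an already-existing job ...
               mock_hook.return_value.insert_job.side_effect = Conflict("already exists")
               # ... so the operator should fetch that job and wait for it instead.
               existing_job = MagicMock(job_id=job_id)
               existing_job.done.return_value = True
               mock_hook.return_value.get_job.return_value = existing_job

               op = BigQueryInsertJobOperator(
                   task_id="insert_query_job",
                   configuration=configuration,
                   location=TEST_DATASET_LOCATION,
                   job_id=job_id,
                   project_id=TEST_GCP_PROJECT_ID,
               )
               result = op.execute({})

               mock_hook.return_value.get_job.assert_called_once_with(
                   project_id=TEST_GCP_PROJECT_ID,
                   location=TEST_DATASET_LOCATION,
                   job_id=job_id,
               )
               assert result == job_id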

##########
File path: airflow/providers/google/cloud/operators/bigquery.py
##########
@@ -1570,3 +1576,76 @@ def execute(self, context):
             table_resource=self.table_resource,
             project_id=self.project_id,
         )
+
+
+class BigQueryInsertJobOperator(BaseOperator):
+    """
+    Executes a BigQuery job. Waits for the job to complete and returns the job id.
+    See here:
+
+        https://cloud.google.com/bigquery/docs/reference/v2/jobs
+
+    :param configuration: The configuration parameter maps directly to BigQuery's
+        configuration field in the job object. For more details see
+        https://cloud.google.com/bigquery/docs/reference/v2/jobs
+    :type configuration: Dict[str, Any]
+    :param job_id: The ID of the job. The ID must contain only letters (a-z, A-Z),
+        numbers (0-9), underscores (_), or dashes (-). The maximum length is 1,024
+        characters. If not provided then a uuid will be generated.
+    :type job_id: str
+    :param project_id: Google Cloud Project where the job is running
+    :type project_id: str
+    :param location: The location where the job is running.
+    :type location: str
+    :param gcp_conn_id: (Optional) The connection ID used to connect to Google Cloud Platform.
+    :type gcp_conn_id: str
+    """
+
+    template_fields = ("configuration", "job_id")
+    ui_color = BigQueryUIColors.QUERY.value
+
+    def __init__(
+        self,
+        configuration: Dict[str, Any],
+        project_id: Optional[str] = None,
+        location: Optional[str] = None,
+        job_id: Optional[str] = None,
+        gcp_conn_id: str = 'google_cloud_default',
+        delegate_to: Optional[str] = None,
+        *args,
+        **kwargs,
+    ) -> None:
+        super().__init__(*args, **kwargs)
+        self.configuration = configuration
+        self.location = location
+        self.job_id = job_id
+        self.project_id = project_id
+        self.gcp_conn_id = gcp_conn_id
+        self.delegate_to = delegate_to
+
+    def execute(self, context: Any):
+        hook = BigQueryHook(
+            gcp_conn_id=self.gcp_conn_id,
+            delegate_to=self.delegate_to,
+        )
+
+        try:
+            job = hook.insert_job(
+                configuration=self.configuration,
+                project_id=self.project_id,
+                location=self.location,
+                job_id=self.job_id,
+            )
+            # Start the job and wait for it to complete and get the result.
+            job.result()
+        except Conflict:
+            job = hook.get_job(
+                project_id=self.project_id,
+                location=self.location,
+                job_id=self.job_id,
+            )
+            # Get existing job and wait for it to be ready
+            while not job.done():
+                sleep(10)

Review comment:
       Should we make it a constant/configurable with a default value? I think 10 s might be fairly frequent for some types of jobs that run for hours. I understand it's only hit in case the task gets restarted, but I still think it's better to make it configurable.
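       One possible shape (a minimal sketch only; the names poll_interval and
       wait_for_done are hypothetical, not part of this PR) would be to accept
       the interval in __init__ and reuse it in the reattach loop:

           from time import sleep
           from typing import Any


           def wait_for_done(job: Any, poll_interval: float = 10.0) -> None:
               """Block until job.done() returns True, checking every
               poll_interval seconds instead of a hard-coded 10 s."""
               while not job.done():
                   sleep(poll_interval)

       The operator could then take poll_interval (defaulting to 10) in its
       constructor, store it on self, and call
       wait_for_done(job, self.poll_interval) in the Conflict branch, so jobs
       that run for hours can be polled less aggressively.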

##########
File path: docs/howto/operator/gcp/bigquery.rst
##########
@@ -255,32 +255,18 @@ Let's say you would like to execute the following query.
     :end-before: [END howto_operator_bigquery_query]
 
 To execute the SQL query in a specific BigQuery database you can use
-:class:`~airflow.providers.google.cloud.operators.bigquery.BigQueryExecuteQueryOperator`.
+:class:`~airflow.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperator` with
+proper query job configuration.
 
 .. exampleinclude:: ../../../../airflow/providers/google/cloud/example_dags/example_bigquery_queries.py
     :language: python
     :dedent: 4
-    :start-after: [START howto_operator_bigquery_execute_query]
-    :end-before: [END howto_operator_bigquery_execute_query]
-
-``sql`` argument can receive a str representing a sql statement, a list of str
-(sql statements), or reference to a template file. Template reference are recognized
-by str ending in '.sql'.
+    :start-after: [START howto_operator_bigquery_insert_job]
+    :end-before: [END howto_operator_bigquery_insert_job]

Review comment:
       I think it would be nice to explain the job_id behavior a bit here, and how to make a job an "idempotent" one. Just a sentence or two.
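       Something along these lines might work as an illustration in the docs (a
       hypothetical DAG snippet; the DAG id, project and location are made up).
       Because job_id is a template field and the operator reattaches to an
       existing job on Conflict, deriving the id from the logical date means a
       rerun of the same date picks up the already-submitted job instead of
       submitting a duplicate:

           from airflow import DAG
           from airflow.providers.google.cloud.operators.bigquery import (
               BigQueryInsertJobOperator,
           )
           from airflow.utils.dates import days_ago

           with DAG(
               dag_id="example_bigquery_idempotent_job",
               schedule_interval=None,
               start_date=days_ago(1),
           ) as dag:
               insert_query_job = BigQueryInsertJobOperator(
                   task_id="insert_query_job",
                   configuration={
                       "query": {"query": "SELECT 1", "useLegacySql": False},
                   },
                   # BigQuery job ids may only contain letters, numbers,
                   # underscores and dashes; templating in ds_nodash gives one
                   # job id per logical date, so a retried run reattaches
                   # rather than resubmits.
                   job_id="insert_query_job_{{ ds_nodash }}",
                   location="EU",
                   project_id="my-project",
               )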




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org