Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 22:30:49 UTC

[GitHub] [beam] damccorm opened a new issue, #21267: WriteToBigQuery submits a duplicate BQ load job if a 503 error code is returned from googleapi

damccorm opened a new issue, #21267:
URL: https://github.com/apache/beam/issues/21267

   When running a WriteToBigQuery Beam step, a 503 error code is returned from `https://www.googleapis.com/resumable/upload/storage/v1/b/<our_tmp_dataflow_location>`. The BQ load job is still submitted successfully, but the workitem reports "Finished processing workitem with errors". Dataflow then resubmits an identical workitem, which triggers a second identical load job and inserts duplicate data into our BigQuery tables.
   
   Problem you have encountered:
   
   1.) The WriteToBigQuery step starts and triggers a BQ load job.
   ```
   
   "Triggering job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_NAME_STEP_650_f2f7eb5ec442aa057357302eb9cb0263_9704d08e74d74e2b9cc743ef8a40c524"
   
   ```
   
   
   2.) An error occurs in the step, but apparently after the load job was already submitted.
   ```
   "Error in _start_upload while inserting file gs://<censored_bucket_location>.avro: Traceback (most recent call last):
     File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 644, in _start_upload
       self._client.objects.Insert(self._insert_request, upload=self._upload)
     File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/internal/clients/storage/storage_v1_client.py", line 1156, in Insert
       upload=upload, upload_config=upload_config)
     File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 731, in _RunMethod
       return self.ProcessHttpResponse(method_config, http_response, request)
     File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 737, in ProcessHttpResponse
       self.__ProcessHttpResponse(method_config, http_response, request))
     File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 604, in __ProcessHttpResponse
       http_response, method_config=method_config, request=request)
   apitools.base.py.exceptions.HttpError: HttpError accessing <https://www.googleapis.com/resumable/upload/storage/v1/b/bqflow_dataflow_tmp/o?alt=json&name=tmp%2F<censored_bucket_location>.avro&uploadType=resumable&upload_id=ADPycdtKO3HR5PjM_lE6lBin-QqIRuTBeiaCe3dPx9gUKAIPI5fzpfuTs4J5XEF9XiayNvMrhGsGe0XP1CJv90xsuBUrZy6mpw>:
   response: <{'content-type': 'text/plain; charset=utf-8', 'x-guploader-uploadid': 'ADPycdtKO3HR5PjM_lE6lBin-QqIRuTBeiaCe3dPx9gUKAIPI5fzpfuTs4J5XEF9XiayNvMrhGsGe0XP1CJv90xsuBUrZy6mpw', 'content-length': '0', 'date': 'Tue, 05 Oct 2021 18:01:51 GMT', 'server': 'UploadServer', 'status': '503'}>, content <>
   "
   ```
   
   
   3.) The workitem finishes with errors.
   
   ```
   
    Finished processing workitem X with errors. Reporting status to Dataflow service.
   
   ```
   
   
   4.) Beam re-runs the workitem, which spawns another identical BQ load job.
   
   ```
   
   Triggering job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_NAME_STEP_650_f2f7eb5ec442aa057357302eb9cb0263_1247e55bd00041d8b8bd4de491cd7063
   
   ```
   
   
   A single WriteToBigQuery Beam step therefore spawns two identical BQ load jobs, which writes duplicated data to our tables.
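
   For context, here is a minimal sketch of the kind of pipeline that hits this path; the project, dataset, table, schema, and bucket names below are hypothetical placeholders, not our actual values. In batch mode, WriteToBigQuery stages files under the temp GCS location and then issues BQ load jobs referencing them; the 503 above was returned by the staging upload.
   
   ```
   # Hypothetical repro sketch (all names are placeholders). A batch
   # WriteToBigQuery stages files in custom_gcs_temp_location and then
   # submits a BigQuery load job referencing those files.
   import apache_beam as beam
   from apache_beam.options.pipeline_options import PipelineOptions

   with beam.Pipeline(options=PipelineOptions()) as p:
       (p
        | 'Create' >> beam.Create([{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}])
        | 'Write' >> beam.io.WriteToBigQuery(
            'my-project:my_dataset.my_table',   # placeholder table
            schema='id:INTEGER,name:STRING',
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
            custom_gcs_temp_location='gs://my-tmp-bucket/tmp'))  # placeholder bucket
   ```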
   
   What you expected to happen:
   
   I would expect the HTTP call to be retried before an error is returned. Failing that, I would expect the same BQ load job not to be submitted twice without the first job being cancelled. A third option would be to implement something similar to `insert_retry_strategy`, but for batch file loads, which would let us avoid creating another BQ load job when a failure occurs.
   
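   For reference, `insert_retry_strategy` today only governs the streaming-inserts path. A sketch of its current use, with placeholder table and schema names, to show the kind of knob being asked for on the file-loads path:
   
   ```
   # insert_retry_strategy currently applies to STREAMING_INSERTS only; the
   # request above is for an analogous control on FILE_LOADS. All names
   # here are placeholders.
   import apache_beam as beam
   from apache_beam.io.gcp.bigquery_tools import RetryStrategy

   write = beam.io.WriteToBigQuery(
       'my-project:my_dataset.my_table',
       schema='id:INTEGER,name:STRING',
       method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
       # Retry transient errors only; rows that fail permanently are
       # emitted on the failed-rows output rather than retried forever.
       insert_retry_strategy=RetryStrategy.RETRY_ON_TRANSIENT_ERROR)
   ```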
   
   Imported from Jira [BEAM-13132](https://issues.apache.org/jira/browse/BEAM-13132). Original Jira may contain additional context.
   Reported by: jamesprillaman.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] johnjcasey commented on issue #21267: WriteToBigQuery submits a duplicate BQ load job if a 503 error code is returned from googleapi

Posted by GitBox <gi...@apache.org>.
johnjcasey commented on issue #21267:
URL: https://github.com/apache/beam/issues/21267#issuecomment-1246893285

   I think our code works correctly here. We explicitly mark `_start_upload` as no-retries, and the underlying transfer library understandably considers a 5xx error not to be retryable. I think this is a BQ bug that we handle correctly.
   
   Data duplication is a possibility with Beam.
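
   For illustration, a sketch of the marking pattern described above, assuming the decorators in `apache_beam.utils.retry`; this is not a copy of Beam's actual gcsio.py source:
   
   ```
   # Sketch only: how an operation is marked as a no-retry integration point
   # versus one that retries transparently on server errors.
   from apache_beam.utils import retry

   class Uploader(object):

     @retry.no_retries  # integration point: a failure here surfaces to the
                        # runner, which retries the whole workitem
     def _start_upload(self):
       pass  # issue the resumable upload; a 503 fails the workitem

     @retry.with_exponential_backoff(
         num_retries=10, retry_filter=retry.retry_on_server_errors_filter)
     def get_bucket(self):
       pass  # idempotent read: safe to retry transparently on 5xx
   ```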




[GitHub] [beam] kennknowles commented on issue #21267: WriteToBigQuery submits a duplicate BQ load job if a 503 error code is returned from googleapi

Posted by GitBox <gi...@apache.org>.
kennknowles commented on issue #21267:
URL: https://github.com/apache/beam/issues/21267#issuecomment-1246031063

   @johnjcasey @pabloem Is this a BQ bug, in that they return a 503 even though the upload was a success? Or is it our bug that a 503 on a non-critical status query causes a retry but we don't dedupe?
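
   For comparison, the standard dedup mechanism on the BigQuery side is that load-job submission is idempotent per job ID: resubmitting with the same deterministic job ID yields a 409 Conflict rather than a second load. A sketch with the plain google-cloud-bigquery client and placeholder names, not Beam's implementation:
   
   ```
   # Sketch of job-ID-based dedup; the job-ID scheme, bucket, and table
   # names are placeholders, not Beam's internals.
   from google.api_core.exceptions import Conflict
   from google.cloud import bigquery

   client = bigquery.Client()
   job_id = 'beam_bq_job_LOAD_mystep_attempt0'  # deterministic per step input

   try:
       job = client.load_table_from_uri(
           ['gs://my-tmp-bucket/tmp/shard-*.avro'],
           'my-project.my_dataset.my_table',
           job_id=job_id)
   except Conflict:
       # An earlier attempt already submitted this job: reuse it rather
       # than spawning a duplicate load.
       job = client.get_job(job_id)
   job.result()  # exactly one load runs to completion
   ```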




[GitHub] [beam] kennknowles closed issue #21267: WriteToBigQuery submits a duplicate BQ load job if a 503 error code is returned from googleapi

Posted by GitBox <gi...@apache.org>.
kennknowles closed issue #21267: WriteToBigQuery submits a duplicate BQ load job if a 503 error code is returned from googleapi
URL: https://github.com/apache/beam/issues/21267




[GitHub] [beam] kennknowles commented on issue #21267: WriteToBigQuery submits a duplicate BQ load job if a 503 error code is returned from googleapi

Posted by GitBox <gi...@apache.org>.
kennknowles commented on issue #21267:
URL: https://github.com/apache/beam/issues/21267#issuecomment-1248338021

   I suppose the thing to do, then, is to close this; if it was a BQ bug we can follow up internally, and if it was an outage then that is the end of it.

