Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/11/22 10:58:46 UTC

[GitHub] [airflow] lwyszomi opened a new pull request #19740: Added wait mechanism to the DataprocJobSensor to avoid 503 errors when the Job is not available

lwyszomi opened a new pull request #19740:
URL: https://github.com/apache/airflow/pull/19740


   Sometimes a Job in Dataproc is not available immediately after creation, and the Sensor then throws a 503 error. To avoid this issue I implemented a solution similar to the one we already have for Dataproc job creation: you can specify how many seconds to wait for the Job, and only after that time has elapsed is the exception raised.
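   A minimal sketch of the idea (not the exact code merged in this PR; names such as `wait_timeout` and `start_sensor_time` are illustrative assumptions, and exact hook/sensor parameter names may differ by provider version):
   
   ```python
   from airflow.exceptions import AirflowException
   from airflow.providers.google.cloud.hooks.dataproc import DataprocHook
   from airflow.sensors.base import BaseSensorOperator
   from airflow.utils import timezone
   from google.api_core.exceptions import ServerError
   from google.cloud.dataproc_v1 import JobStatus
   
   
   class PatientDataprocJobSensor(BaseSensorOperator):
       """Illustrative sensor that tolerates a missing Job for `wait_timeout` seconds."""
   
       def __init__(self, *, project_id, location, dataproc_job_id, wait_timeout=None,
                    gcp_conn_id="google_cloud_default", **kwargs):
           super().__init__(**kwargs)
           self.project_id = project_id
           self.location = location
           self.dataproc_job_id = dataproc_job_id
           self.wait_timeout = wait_timeout
           self.gcp_conn_id = gcp_conn_id
           self.start_sensor_time = None
   
       def execute(self, context):
           # Remember when the sensor started so we know how long we have waited.
           self.start_sensor_time = timezone.utcnow()
           super().execute(context)
   
       def poke(self, context):
           hook = DataprocHook(gcp_conn_id=self.gcp_conn_id)
           try:
               job = hook.get_job(job_id=self.dataproc_job_id,
                                  location=self.location, project_id=self.project_id)
           except ServerError as err:
               waited = (timezone.utcnow() - self.start_sensor_time).total_seconds()
               if self.wait_timeout and waited < self.wait_timeout:
                   self.log.info("Job not yet available, waited %.0f of %s seconds",
                                 waited, self.wait_timeout)
                   return False  # keep poking until the wait timeout is exhausted
               raise AirflowException(
                   f"Job {self.dataproc_job_id} still unavailable after {waited:.0f}s"
               ) from err
           # The Job exists; report success once it reaches the DONE state (simplified check).
           return job.status.state == JobStatus.State.DONE
   ```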
   
   





[GitHub] [airflow] potiuk commented on pull request #19740: Added wait mechanism to the DataprocJobSensor to avoid errors when the Job is not available

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #19740:
URL: https://github.com/apache/airflow/pull/19740#issuecomment-989872894


   I think both solutions are good in different scenarios - it really depends on how "transient" the error is and the context in which it occurs. Both have pros and cons, and having both options is good.
   
   If this is really an error that appears randomly for 1 in 1000 requests out of the blue, then doing retries in the API call makes sense: the call will be retried quickly and succeed, and the user will not even see it as an error (only the logs will contain the information).
   
   The "API retry" scenario is also good when you have a custom operator which performs several operations sequentially using different Hooks. Airflow's retry mechanism works in such a way that the whole custom operator is retried from the beginning, which can mean delays, reprocessing of data, bigger transfers, etc.
   
   However, if this is just a "standard" operator with a single Hook and a single operation, and especially if the failure is caused by a "period of unavailability" - for example the API starts returning errors continuously for 5 minutes - then using Airflow's standard "retry" mechanism makes sense, because it will not block the worker for those 5 minutes.
   
   So adding an option to retry at the API layer is a good idea in addition to Airflow's "retry" functionality.
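   To make the API-layer option concrete, here is an illustrative retry policy built with `google-api-core` (values are examples; where exactly such a policy can be plugged in depends on the hook/operator signature):
   
   ```python
   from google.api_core.exceptions import ServiceUnavailable
   from google.api_core.retry import Retry, if_exception_type
   
   # Retry only transient 503 ServiceUnavailable responses, with exponential backoff.
   transient_retry = Retry(
       predicate=if_exception_type(ServiceUnavailable),
       initial=10.0,    # first wait: 10 seconds
       multiplier=2.0,  # double the wait after each failure
       maximum=120.0,   # never wait more than 2 minutes between attempts
       deadline=600.0,  # give up after 10 minutes overall
   )
   ```
   
   The Google client calls underneath the hooks generally accept such a `retry` object (e.g. passed through to `get_job`), whereas the Airflow-level alternative is the task's own `retries`/`retry_delay` parameters discussed further down the thread.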
   





[GitHub] [airflow] abhishekshenoy commented on pull request #19740: Added wait mechanism to the DataprocJobSensor to avoid 503 errors when the Job is not available

Posted by GitBox <gi...@apache.org>.
abhishekshenoy commented on pull request #19740:
URL: https://github.com/apache/airflow/pull/19740#issuecomment-989716007


   @lwyszomi @potiuk This will solve the server error received at the start of the job. We are also experiencing issues where the Google APIs fail intermittently; on contacting Google Support they mentioned:
   
   `Users are expected to see this set of errors (503) every now and then. These could be due to expected issues like busy servers, network unavailability, etc`
   
   In those scenarios, because the response is not successfully retrieved, the Job Sensor fails even though the cluster is running.
   
   I have an approach similar to the one used here, but it uses something like a resettable counter for any 'API Response Error': only when the counter passes a threshold should the Job Sensor fail.
   





[GitHub] [airflow] abhishekshenoy edited a comment on pull request #19740: Added wait mechanism to the DataprocJobSensor to avoid errors when the Job is not available

Posted by GitBox <gi...@apache.org>.
abhishekshenoy edited a comment on pull request #19740:
URL: https://github.com/apache/airflow/pull/19740#issuecomment-989716007


   @lwyszomi @potiuk This will solve the server error received at the start of the job. We are also experiencing issues where the Google APIs fail intermittently with connectivity problems.
   
   On contacting Google Support they mentioned:
   
   `Users are expected to see this set of errors (503) every now and then. These could be due to expected issues like busy servers, network unavailability, etc`
   
   In those scenarios, because the response is not successfully retrieved, the Job Sensor fails even though the cluster is running.
   
   I have an approach similar to the one used here, but it uses something like a resettable counter for any 'API Response Error': only when the counter passes a threshold should the Job Sensor fail.
   
   ``` 
    File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/google/cloud/sensors/dataproc.py", line 63, in poke
    job = hook.get_job(job_id=self.dataproc_job_id, location=self.location, project_id=self.project_id)
       .
       .
       .
    google.api_core.exceptions.ServiceUnavailable: 503 The service is currently unavailable.
   ```
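   A rough sketch of that resettable-counter idea (hypothetical code, not part of this PR): consecutive API failures are counted in `poke` and the sensor only fails once the count passes a threshold; any successful call resets it. Note that keeping this state on the instance assumes the sensor runs in `poke` mode, where the same object keeps poking.
   
   ```python
   from airflow.exceptions import AirflowException
   from google.api_core.exceptions import ServerError
   
   
   class ResettableErrorCounterPoke:
       """Illustrative helper; in a real sensor this state would live on the sensor itself."""
   
       def __init__(self, hook, job_id, location, project_id, max_consecutive_errors=5):
           self.hook = hook
           self.job_id = job_id
           self.location = location
           self.project_id = project_id
           self.max_consecutive_errors = max_consecutive_errors
           self.consecutive_errors = 0
   
       def poke(self):
           try:
               job = self.hook.get_job(job_id=self.job_id, location=self.location,
                                       project_id=self.project_id)
           except ServerError as err:  # any 5xx, e.g. 503 ServiceUnavailable
               self.consecutive_errors += 1
               if self.consecutive_errors > self.max_consecutive_errors:
                   raise AirflowException("Too many consecutive API errors") from err
               return False  # transient failure: keep poking
           self.consecutive_errors = 0  # a successful response resets the counter
           return job.status.state.name == "DONE"
   ```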
   





[GitHub] [airflow] potiuk merged pull request #19740: Added wait mechanism to the DataprocJobSensor to avoid 503 errors when the Job is not available

Posted by GitBox <gi...@apache.org>.
potiuk merged pull request #19740:
URL: https://github.com/apache/airflow/pull/19740


   











[GitHub] [airflow] lwyszomi commented on pull request #19740: Added wait mechanism to the DataprocJobSensor to avoid errors when the Job is not available

Posted by GitBox <gi...@apache.org>.
lwyszomi commented on pull request #19740:
URL: https://github.com/apache/airflow/pull/19740#issuecomment-989732194


   @abhishekshenoy I think this use case can be resolved by the `retries` parameter, which can be set on each operator/sensor.
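   For example (illustrative values and IDs; exact constructor parameter names may differ by provider version), the sensor can simply be declared with task-level retries so Airflow reruns the whole task instead of retrying inside the API call:
   
   ```python
   from datetime import timedelta
   
   from airflow.providers.google.cloud.sensors.dataproc import DataprocJobSensor
   
   wait_for_dataproc_job = DataprocJobSensor(
       task_id="wait_for_dataproc_job",
       project_id="my-gcp-project",                          # hypothetical project
       location="europe-west1",                              # hypothetical region
       dataproc_job_id="{{ ti.xcom_pull('submit_job') }}",   # hypothetical upstream task
       retries=3,                         # rerun the whole sensor task up to 3 times
       retry_delay=timedelta(minutes=5),  # wait 5 minutes between attempts
   )
   ```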





[GitHub] [airflow] github-actions[bot] commented on pull request #19740: Added wait mechanism to the DataprocJobSensor to avoid 503 errors when the Job is not available

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #19740:
URL: https://github.com/apache/airflow/pull/19740#issuecomment-975411313


   The PR is likely OK to be merged with just a subset of tests for the default Python and Database versions, without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full test matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org