Posted to commits@airflow.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2019/12/09 22:19:00 UTC
[jira] [Commented] (AIRFLOW-5889) AWS Batch Operator - API request limits should not fail a task

    [ https://issues.apache.org/jira/browse/AIRFLOW-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991992#comment-16991992 ] 

ASF GitHub Bot commented on AIRFLOW-5889:
-----------------------------------------

darrenleeweber commented on pull request #6765: [AIRFLOW-5889] Fix polling for AWS Batch job status
URL: https://github.com/apache/airflow/pull/6765
 
 
   
   ### Jira
   
   - [ ] My PR addresses the following [AIRFLOW-5889]
     - https://issues.apache.org/jira/browse/AIRFLOW-5889
   
   ### Description
   
   - errors in polling for job status should not fail
     the Airflow task when the polling hits an API throttle
     limit; polling should detect those cases and retry a
     few times to get the job status
   - added typing for the BatchProtocol method return
     types, based on the botocore.client.Batch types
   - applied trivial format consistency using black,
     but keeping predominant use of single-quotes, i.e.
     $ black -S -t py36 -l 96 {file}
   
   ### Tests
   
   - [x] My PR passes existing tests
     - this is just a bug fix, it introduces no new functionality to be tested
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
   
   ### Documentation
   
   - [x] Not applicable to this bug fix
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> AWS Batch Operator - API request limits should not fail a task
> --------------------------------------------------------------
>
>                 Key: AIRFLOW-5889
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5889
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: aws, contrib
>    Affects Versions: 1.10.4
>            Reporter: Darren Weber
>            Assignee: Darren Weber
>            Priority: Major
>             Fix For: 2.0.0, 1.10.6
>
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. In the meantime, the operator falls back to an exponential backoff routine for the status checks on the batch job. Unfortunately, when the concurrency of Airflow jobs is very high (hundreds of tasks), this fallback polling hits the AWS Batch API too hard and the AWS API throttle raises an error, which fails the Airflow task simply because the status is polled too frequently. Airflow then issues a retry of the task while the original batch job is actually still running, which results in duplicate batch jobs. An exception raised for an AWS API throttle limit should not fail the task; it should only pause the polling for job status and then retry the job status poll.
> This is an example of an API throttle exception:
> {code:java}
> An error occurred (TooManyRequestsException) when calling the DescribeJobs operation
> (reached max retries: 4): Too Many Requests
> {code}
> This exception should be handled while waiting for a job to complete; it must not result in a job retry.
> Reduced polling rates help (https://issues.apache.org/jira/browse/AIRFLOW-5218), but additional exception handling in the polling function is required.  Within the exception handling code, a random pause on the polling routine could help to alleviate the API throttle limits.  Maybe the class could expose a parameter for the rate of polling (or a callable)?
> Another consideration is possible use of something like the sensor-poke approach, with rescheduling, so that the polling process does not occupy a worker for the full duration of a batch job, e.g.
> - [https://github.com/apache/airflow/blob/master/airflow/sensors/base_sensor_operator.py#L117]
> If a rescheduling approach is adopted, similar API throttle considerations apply.
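
The exception handling suggested in the issue description (pause on a throttle error, retry the status poll a few times, and let the caller supply the polling rate as a callable) could look roughly like the sketch below. It is not the code from the pull request. `ThrottleError`, `poll_job_status`, and the `delay` and `sleep` parameters are hypothetical names chosen for illustration, and the default backoff values are assumptions.

```python
import random
import time


class ThrottleError(Exception):
    """Hypothetical stand-in for an AWS API throttle error
    (e.g. TooManyRequestsException on DescribeJobs)."""


def poll_job_status(get_status, max_retries=4, delay=None, sleep=time.sleep):
    """Poll until the batch job reaches a terminal state.

    get_status: callable returning the job status string, e.g. 'RUNNING'.
    max_retries: consecutive throttle errors tolerated before giving up.
    delay: optional callable(attempt) -> seconds to pause before the next poll.
    sleep: injectable pause function, so the loop can be tested without waiting.
    """
    if delay is None:
        def delay(attempt):
            # Assumed default: exponential backoff capped at 60 seconds,
            # plus random jitter so concurrent tasks do not poll in lockstep.
            return min(2 ** attempt, 60) + random.uniform(0, 1)

    throttled = 0  # consecutive throttle errors seen so far
    attempt = 0    # number of successful, non-terminal polls
    while True:
        try:
            status = get_status()
            throttled = 0
        except ThrottleError:
            throttled += 1
            if throttled > max_retries:
                raise  # persistent throttling: surface the error
            sleep(delay(throttled))
            continue  # retry the poll instead of failing the task
        if status in ('SUCCEEDED', 'FAILED'):
            return status
        attempt += 1
        sleep(delay(attempt))
```

With boto3, the throttle would normally surface as a `botocore.exceptions.ClientError`, so a real implementation would presumably catch that exception and inspect its error code rather than use a custom exception class. A caller could pass, for example, `delay=lambda attempt: 30 + random.uniform(0, 10)` to get a fixed polling interval with jitter.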



--
This message was sent by Atlassian Jira
(v8.3.4#803005)