Posted to commits@airflow.apache.org by "t oo (Jira)" <ji...@apache.org> on 2019/12/11 10:43:00 UTC

[jira] [Updated] (AIRFLOW-6229) SparkSubmitOperator polls forever if status json can't find driver id

     [ https://issues.apache.org/jira/browse/AIRFLOW-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

t oo updated AIRFLOW-6229:
--------------------------
    Description: 
You click ‘release’ on a new Spark cluster while the prior Spark cluster is still processing Spark submits from Airflow. Airflow is then never able to finish the SparkSubmit task: it polls for status against the new Spark cluster build, which has no record of the submission made on the earlier cluster build, so the status loop runs forever.

 

[https://github.com/apache/airflow/blob/1.10.6/airflow/contrib/hooks/spark_submit_hook.py#L446]

[https://github.com/apache/airflow/blob/1.10.6/airflow/contrib/hooks/spark_submit_hook.py#L489]

It loops forever if it can’t find the driverState tag in the JSON response. Since the new build (pointed to by the released DNS name) doesn’t know about the driver submitted on the previously released build, the second response below does not contain the driverState tag.
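The status-tracking loop can be modelled roughly as below. This is a simplified sketch, not the hook's actual code: poll_driver_status, get_status, and the max_polls bound are hypothetical names added so the demo terminates; the real loop in _start_driver_status_tracking has no such bound, which is exactly why a response with no driverState line spins forever.

```python
import time

TERMINAL_STATES = ("FINISHED", "KILLED", "FAILED", "ERROR")

def poll_driver_status(get_status, max_polls=100, interval=0.0):
    """Simplified model of the hook's status-tracking loop.

    `get_status` returns the raw status-server response as text. If no line
    ever contains "driverState", driver_status never leaves "SUBMITTED" and
    the loop only stops because of the max_polls bound added for this demo.
    """
    driver_status = "SUBMITTED"
    for _ in range(max_polls):
        if driver_status in TERMINAL_STATES:
            break
        # Scan each response line for the driverState field, e.g.
        #   "driverState" : "RUNNING",
        for line in get_status().splitlines():
            if "driverState" in line:
                driver_status = line.split(' : ')[1].replace(',', '').replace('"', '')
        time.sleep(interval)
    return driver_status
```

With a healthy response the loop exits as soon as a terminal state appears; with the post-release response (no driverState field), the status is stuck at its initial value.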

 

 

#response before clicking release on new build

[ec2-user@reda ~]$

curl http://dns:6066/v1/submissions/status/driver-20191202142207-0000

{
  "action" : "SubmissionStatusResponse",
  "driverState" : "RUNNING",
  "serverSparkVersion" : "2.3.4",
  "submissionId" : "driver-20191202142207-0000",
  "success" : true,
  "workerHostPort" : "reda:31489",
  "workerId" : "worker-20191202133526-reda-31489"
}

 


#response after clicking release on new build

[ec2-user@reda ~]$

curl http://dns:6066/v1/submissions/status/driver-20191202142207-0000

{
  "action" : "SubmissionStatusResponse",
  "serverSparkVersion" : "2.3.4",
  "submissionId" : "driver-20191202142207-0000",
  "success" : false
}

 

This is a defect in the current code. It can be fixed by modifying the _process_spark_status_log function to set the driver status to UNKNOWN if driverState is not found in the response after iterating over all lines.
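A minimal sketch of the proposed fix, written as a standalone helper for illustration (the real method is SparkSubmitHook._process_spark_status_log, which stores the result in self._driver_status rather than returning it):

```python
def process_spark_status_log(lines):
    """Parse the driver-status response line by line and return the driver
    state, falling back to "UNKNOWN" when no driverState field is present."""
    driver_found = False
    driver_status = None
    for line in lines:
        line = line.strip()
        # Extract the status from a line such as:  "driverState" : "RUNNING",
        if "driverState" in line:
            driver_status = line.split(' : ')[1].replace(',', '').replace('"', '')
            driver_found = True
    if not driver_found:
        # No driverState in the response (the second curl example above):
        # report UNKNOWN instead of leaving the previous status in place.
        driver_status = "UNKNOWN"
    return driver_status
```

The polling loop can then treat UNKNOWN as a terminal or error state and fail the task instead of looping forever.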

 

  was:Right now each task instance can consume 1 slot inside a pool, but some tasks are bigger/smaller than others. For tasks that I know are 'big', I want to be able to consume, say, 4 slots from a pool.


> SparkSubmitOperator polls forever if status json can't find driver id
> ---------------------------------------------------------------------
>
>                 Key: AIRFLOW-6229
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6229
>             Project: Apache Airflow
>          Issue Type: New Feature
>          Components: scheduler
>    Affects Versions: 1.10.6
>            Reporter: t oo
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)