Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/01/12 17:18:31 UTC

[GitHub] [airflow] jfamestad opened a new issue #20832: Unable to specify Python version for AwsGlueJobOperator

jfamestad opened a new issue #20832:
URL: https://github.com/apache/airflow/issues/20832


   ### Apache Airflow Provider(s)
   
   amazon
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Apache Airflow version
   
   2.0.2
   
   ### Operating System
   
   Amazon Linux
   
   ### Deployment
   
   MWAA
   
   ### Deployment details
   
   _No response_
   
   ### What happened
   
   When a new Glue job is created using the AwsGlueJobOperator, it defaults to Python 2. Setting the version via `create_job_kwargs` fails with a `KeyError`.
   
   ### What you expected to happen
   
   Expected the Glue job to be created with a Python 3 runtime. `create_job_kwargs` is passed to the boto3 Glue client's `create_job` method, which accepts a "Command" parameter: a dictionary that includes the Python version.
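   For illustration, a hedged sketch of the payload shape `create_job` accepts (the job name, role, and bucket below are placeholders, not values taken from the hook):

```python
# Sketch of the payload that boto3's glue_client.create_job(**payload) accepts.
# "Command" is a top-level dictionary whose "PythonVersion" selects the runtime.
# All names and paths here are placeholders.
create_job_payload = {
    "Name": "abalone-preprocess",
    "Role": "MLOps",
    "Command": {
        "Name": "glueetl",  # the Glue API allows glueetl, pythonshell, or gluestreaming
        "ScriptLocation": "s3://example-bucket/code/preprocess.py",
        "PythonVersion": "3",  # omitting this can leave the job on Python 2
    },
}

# A real call would look like: boto3.client("glue").create_job(**create_job_payload)
print(create_job_payload["Command"]["PythonVersion"])
```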
   
   
   
   ### How to reproduce
   
   Create a DAG with an AwsGlueJobOperator and pass a "Command" parameter in the `create_job_kwargs` argument.
   
    ```
    from airflow.providers.amazon.aws.operators.glue import AwsGlueJobOperator

    create_glue_job_args = {
        "Command": {
            "Name": "abalone-preprocess",
            "ScriptLocation": f"s3://{output_bucket}/code/preprocess.py",
            "PythonVersion": "3"
        }
    }

    glue_etl = AwsGlueJobOperator(
        task_id="glue_etl",
        s3_bucket=output_bucket,
        script_args={
            '--S3_INPUT_BUCKET': data_bucket,
            '--S3_INPUT_KEY_PREFIX': 'input/raw',
            '--S3_UPLOADS_KEY_PREFIX': 'input/uploads',
            '--S3_OUTPUT_BUCKET': output_bucket,
            '--S3_OUTPUT_KEY_PREFIX': str(determine_dataset_id.output) + '/input/data'
        },
        iam_role_name="MLOps",
        retry_limit=2,
        concurrent_run_limit=3,
        create_job_kwargs=create_glue_job_args,
        dag=dag)
    ```
   
   ```
   [2022-01-04 16:43:42,053] {{logging_mixin.py:104}} INFO - [2022-01-04 16:43:42,053] {{glue.py:190}} ERROR - Failed to create aws glue job, error: 'Command'
   [2022-01-04 16:43:42,081] {{logging_mixin.py:104}} INFO - [2022-01-04 16:43:42,081] {{glue.py:112}} ERROR - Failed to run aws glue job, error: 'Command'
   [2022-01-04 16:43:42,101] {{taskinstance.py:1482}} ERROR - Task failed with exception
   Traceback (most recent call last):
     File "/usr/local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/glue.py", line 166, in get_or_create_glue_job
       get_job_response = glue_client.get_job(JobName=self.job_name)
     File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
       return self._make_api_call(operation_name, kwargs)
     File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 676, in _make_api_call
       raise error_class(parsed_response, operation_name)
   botocore.errorfactory.EntityNotFoundException: An error occurred (EntityNotFoundException) when calling the GetJob operation: Job with name: abalone-preprocess not found.
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1138, in _run_raw_task
       self._prepare_and_execute_task_with_callbacks(context, task)
     File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
       result = self._execute_task(context, task_copy)
     File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
       result = task_copy.execute(context=context)
     File "/usr/local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/glue.py", line 121, in execute
       glue_job_run = glue_job.initialize_job(self.script_args)
     File "/usr/local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/glue.py", line 108, in initialize_job
       job_name = self.get_or_create_glue_job()
     File "/usr/local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/glue.py", line 186, in get_or_create_glue_job
       **self.create_job_kwargs,
   KeyError: 'Command'
   ```
   
   ### Anything else
   
   The error occurs only when a new job is being created.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating edited a comment on issue #20832: Unable to specify Python version for AwsGlueJobOperator

Posted by GitBox <gi...@apache.org>.
SamWheating edited a comment on issue #20832:
URL: https://github.com/apache/airflow/issues/20832#issuecomment-1011511506


   Also, for what it's worth, I think the `Command` block in your DAG is invalid: the `Command.Name` you're using (`abalone-preprocess`) must be one of `glueetl`, `pythonshell`, or `gluestreaming`.
   
   https://docs.aws.amazon.com/glue/latest/webapi/API_JobCommand.html
   





[GitHub] [airflow] SamWheating edited a comment on issue #20832: Unable to specify Python version for AwsGlueJobOperator

Posted by GitBox <gi...@apache.org>.
SamWheating edited a comment on issue #20832:
URL: https://github.com/apache/airflow/issues/20832#issuecomment-1011502821


   I think this is because the GlueHook is quite opinionated and hardcodes the value of `Command` when calling `glue_client.create_job`:
   
   https://github.com/apache/airflow/blob/2ab2ae8849bf6d80a700b1b74cef37eb187161ad/airflow/providers/amazon/aws/hooks/glue.py#L181-L225
   
   So when you provide `Command` in `create_job_kwargs`, it ends up being supplied twice to that function (although I'd expect that to raise a `TypeError`, not a `KeyError` 🤔).
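   The duplicate-keyword failure can be sketched with a toy stand-in for `glue_client.create_job` (hypothetical code, not the hook's):

```python
def create_job(Name, Role, Command=None, **kwargs):
    # Toy stand-in for glue_client.create_job, just to show the collision.
    return {"Name": Name, "Role": Role, "Command": Command, **kwargs}

# The hook passes Command explicitly; if the user's create_job_kwargs also
# contains "Command", the keyword arrives twice.
user_kwargs = {"Command": {"Name": "pythonshell", "PythonVersion": "3"}}

try:
    create_job(Name="job", Role="MLOps", Command={"Name": "glueetl"}, **user_kwargs)
except TypeError as err:
    print(err)  # create_job() got multiple values for keyword argument 'Command'
```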
   
   Anyway, thoughts on making `Command.Name` and `Command.PythonVersion` configurable in the `GlueJobOperator`?
   
   If y'all think that this is a satisfactory fix, feel free to assign this issue to me and I can put up a quick PR. 





[GitHub] [airflow] eladkal commented on issue #20832: Unable to specify Python version for AwsGlueJobOperator

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #20832:
URL: https://github.com/apache/airflow/issues/20832#issuecomment-1013758340


   @SamWheating We can make them configurable.
   
   I'm just not 100% sure why we must create the command. Can't we just leave `create_job_kwargs` as the user passes it?
   
   
   https://github.com/apache/airflow/blob/5ca5693b81b129cd34367fe8788d48ed70054f95/airflow/providers/amazon/aws/hooks/glue.py#L206
   https://github.com/apache/airflow/blob/5ca5693b81b129cd34367fe8788d48ed70054f95/airflow/providers/amazon/aws/hooks/glue.py#L217
   
   There are other parameters that don't get this special treatment:
   https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.create_job
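   A minimal sketch of that idea (hypothetical helper, not the hook's actual code): build the default config first, then let whatever the user passes in `create_job_kwargs` override it wholesale:

```python
def build_job_config(job_name, role, script_location, create_job_kwargs=None):
    # Hypothetical: defaults first, then the user's kwargs win, so a
    # user-supplied "Command" replaces the default instead of colliding.
    config = {
        "Name": job_name,
        "Role": role,
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
    }
    config.update(create_job_kwargs or {})
    return config

merged = build_job_config(
    "abalone-preprocess",
    "MLOps",
    "s3://example-bucket/code/preprocess.py",
    create_job_kwargs={"Command": {"Name": "pythonshell", "PythonVersion": "3"}},
)
print(merged["Command"]["Name"])  # pythonshell
```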
   
   





[GitHub] [airflow] boring-cyborg[bot] commented on issue #20832: Unable to specify Python version for AwsGlueJobOperator

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #20832:
URL: https://github.com/apache/airflow/issues/20832#issuecomment-1011275920


   Thanks for opening your first issue here! Be sure to follow the issue template!
   




