Posted to commits@airflow.apache.org by "vgutkovsk (via GitHub)" <gi...@apache.org> on 2023/02/08 09:09:14 UTC

[GitHub] [airflow] vgutkovsk opened a new issue, #29423: GlueJobOperator throws error after migration to newest version of Airflow

vgutkovsk opened a new issue, #29423:
URL: https://github.com/apache/airflow/issues/29423

   ### Apache Airflow version
   
   2.5.1
   
   ### What happened
   
   We were using GlueJobOperator with Airflow 2.3.3 (official Docker image) and it was working well. We didn't specify the script file location, because it was inferred from the job name. After migrating to 2.5.1 (official Docker image), the operator fails if `s3_bucket` and `script_location` are not specified. This is the error I see:
   ```
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/operators/glue.py", line 146, in execute
       glue_job_run = glue_job.initialize_job(self.script_args, self.run_job_kwargs)
     File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/glue.py", line 155, in initialize_job
       job_name = self.create_or_update_glue_job()
     File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/glue.py", line 300, in create_or_update_glue_job
       config = self.create_glue_job_config()
     File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/glue.py", line 97, in create_glue_job_config
       raise ValueError("Could not initialize glue job, error: Specify Parameter `s3_bucket`")
   ValueError: Could not initialize glue job, error: Specify Parameter `s3_bucket`
   ```
   
   ### What you think should happen instead
   
   I was expecting that after migration the operator would work the same way.
   
   ### How to reproduce
   
   Create a DAG with a `GlueJobOperator` task and do not pass the `s3_bucket` or `script_location` arguments.
   
   ### Operating System
   
   Linux
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon==7.1.0
   
   ### Deployment
   
   Docker-Compose
   
   ### Deployment details
   
   `apache/airflow:2.5.1-python3.10` Docker image and official docker compose
   
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] romibuzi commented on issue #29423: GlueJobOperator throws error after migration to newest version of Airflow

Posted by "romibuzi (via GitHub)" <gi...@apache.org>.
romibuzi commented on issue #29423:
URL: https://github.com/apache/airflow/issues/29423#issuecomment-1423190237

   @gabFirmaway Which parameters did you set in `create_job_kwargs` of your task definition?
   
   The GlueJobOperator should only update the parameters that are set in the DAG task definition, and only when they differ from what is defined in Glue:
   https://github.com/apache/airflow/blob/fdac67b3a5350ab4af79fd98612592511ca5f3fc/airflow/providers/amazon/aws/hooks/glue.py#L307-L309
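
The intended behaviour described here can be sketched as a plain dictionary diff. This is an illustrative model only, not the provider's code: `diff_job_config` is a hypothetical helper, and the job values are made up.

```python
from typing import Any


def diff_job_config(current_job: dict[str, Any], desired: dict[str, Any]) -> dict[str, Any]:
    """Return only the desired keys whose values differ from the job as defined in Glue."""
    return {key: value for key, value in desired.items() if current_job.get(key) != value}


# Hypothetical values, for illustration only.
current = {"Role": "my-glue-role", "GlueVersion": "3.0", "MaxRetries": 0}
desired = {"Role": "my-glue-role", "GlueVersion": "4.0"}

changes = diff_job_config(current, desired)
print(changes)  # {'GlueVersion': '4.0'}
```

Keys that the task definition does not mention (here `MaxRetries`) never appear in the diff, and an empty diff would mean there is nothing to update.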



[GitHub] [airflow] gabFirmaway commented on issue #29423: GlueJobOperator throws error after migration to newest version of Airflow

Posted by "gabFirmaway (via GitHub)" <gi...@apache.org>.
gabFirmaway commented on issue #29423:
URL: https://github.com/apache/airflow/issues/29423#issuecomment-1422932899

   Oh! I was stuck on this error today too!
   I tested triggering existing Glue jobs on my development machine with an older version of Airflow and everything worked flawlessly.
   Today, when I created an EC2 instance, built a custom Airflow Docker container and triggered the job, it kept asking for `s3_bucket`, and if you provide one, it overrides the configuration that you created in the Glue editor (for example, for Python-only scripts, it changes the job to a Glue script with Spark 3 but no defined language).
   
   A temporary solution would be to add this to the task definition:
   
   create_job_kwargs={'GlueVersion':'3.0', "DefaultArguments": {"--job-language": "python"}}
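
For context, a sketch of how this workaround might sit in a task definition. The `create_job_kwargs` values come from the comment above; the task id, job name, and bucket name are hypothetical placeholders. The keyword arguments are built as a plain dictionary so the snippet runs without Airflow installed.

```python
# Job-level overrides taken from the workaround above.
create_job_kwargs = {
    "GlueVersion": "3.0",
    "DefaultArguments": {"--job-language": "python"},
}

# Keyword arguments for the task; task id, job name, and bucket are hypothetical.
glue_task_kwargs = {
    "task_id": "run_glue_job",
    "job_name": "my_existing_glue_job",
    "s3_bucket": "my-glue-bucket",
    "create_job_kwargs": create_job_kwargs,
}

# In a DAG file these would be passed to the operator, for example:
#   from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
#   GlueJobOperator(**glue_task_kwargs)
```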



[GitHub] [airflow] Taragolis commented on issue #29423: GlueJobOperator throws error after migration to newest version of Airflow

Posted by "Taragolis (via GitHub)" <gi...@apache.org>.
Taragolis commented on issue #29423:
URL: https://github.com/apache/airflow/issues/29423#issuecomment-1423066278

   I've never liked the design of GlueJobOperator (previously AwsGlueJobOperator), because a lot of different things are combined in one single operator: Upload Artifact to S3, Create/Update Job, Run Job.
   
   It seems like it should be decomposed into separate operators. When I used Glue as a service (hopefully I won't have to use it anymore), I created and updated jobs outside of Airflow as part of the CI process, so for me only Run Job was the useful part. Someone who also needs to create/update the job in Airflow could in that case use multiple chained operators.
   
   But for now it would be nice if someone created a fix for the current operator. Just let us know who wants to make a PR soon, to avoid a situation where multiple people work on the same issue 😉 



[GitHub] [airflow] romibuzi commented on issue #29423: GlueJobOperator throws error after migration to newest version of Airflow

Posted by "romibuzi (via GitHub)" <gi...@apache.org>.
romibuzi commented on issue #29423:
URL: https://github.com/apache/airflow/issues/29423#issuecomment-1422427027

   Hi @vgutkovsk!
   
   Oh damn, indeed, I realize I introduced a breaking change. Before, the check `if self.s3_bucket is None` was done only when the operator was creating the job. Now it is done at the start of the `create_glue_job_config()` method here: https://github.com/apache/airflow/blob/44024564cb3dd6835b0375d61e682efc1acd7d2c/airflow/providers/amazon/aws/hooks/glue.py#L103-L104
   
   And this method is called in all cases here: https://github.com/apache/airflow/blob/44024564cb3dd6835b0375d61e682efc1acd7d2c/airflow/providers/amazon/aws/hooks/glue.py#L328
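
To make the regression concrete, here is a minimal stand-alone sketch of the two control flows just described. It is a simplified model, not the provider's code: `build_config`, `old_flow`, `new_flow`, and the log path layout are hypothetical names chosen for illustration; only the error message is taken from the traceback in the issue.

```python
from typing import Optional


def build_config(job_name: str, s3_bucket: Optional[str]) -> dict:
    """Stand-in for the config builder: the bucket check sits at the very start."""
    if s3_bucket is None:
        raise ValueError("Could not initialize glue job, error: Specify Parameter `s3_bucket`")
    # Hypothetical log path layout, for illustration only.
    return {"Name": job_name, "LogUri": f"s3://{s3_bucket}/logs/{job_name}"}


def old_flow(job_name: str, s3_bucket: Optional[str], job_exists: bool) -> str:
    """Earlier behaviour: the config (and its check) is only needed to create a job."""
    if not job_exists:
        build_config(job_name, s3_bucket)
    return job_name


def new_flow(job_name: str, s3_bucket: Optional[str], job_exists: bool) -> str:
    """Current behaviour: the config is built on every run, existing job or not."""
    build_config(job_name, s3_bucket)
    return job_name


print(old_flow("existing_job", None, job_exists=True))  # existing_job
try:
    new_flow("existing_job", None, job_exists=True)
except ValueError as err:
    print(err)
```

With an existing job and no bucket, the first call succeeds while the second raises the same `ValueError` as in the report.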
   
   I realize `s3_bucket` is only used to determine `s3_log_path`: https://github.com/apache/airflow/blob/44024564cb3dd6835b0375d61e682efc1acd7d2c/airflow/providers/amazon/aws/hooks/glue.py#L112
   
   `script_location` on the other hand can be None and is not concatenated with `s3_bucket` at all. 
   
   Maybe the best way to handle the problem would be to remove this check on `s3_bucket` and, if it is None, omit the `"LogUri"` parameter, which makes use of `s3_log_path`, as it is not a mandatory parameter for a Glue job: https://github.com/apache/airflow/blob/44024564cb3dd6835b0375d61e682efc1acd7d2c/airflow/providers/amazon/aws/hooks/glue.py#L118
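
A rough sketch of that proposal, under stated assumptions: `build_job_config`, the `role_arn` argument, and the `logs/glue-logs/` path layout are hypothetical and only illustrate the idea of leaving `"LogUri"` out when no bucket is given. It is not the fix that was eventually merged.

```python
from typing import Any, Optional


def build_job_config(job_name: str, role_arn: str, s3_bucket: Optional[str] = None) -> dict[str, Any]:
    """Build a job config; "LogUri" is included only when a bucket is available."""
    config: dict[str, Any] = {"Name": job_name, "Role": role_arn}
    if s3_bucket is not None:
        # Hypothetical log path layout, for illustration only.
        config["LogUri"] = f"s3://{s3_bucket}/logs/glue-logs/{job_name}"
    return config


# Without a bucket the config is still valid, just without "LogUri".
print(build_job_config("my_job", "my-glue-role"))
# With a bucket the log location is added.
print(build_job_config("my_job", "my-glue-role", s3_bucket="my-bucket"))
```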
   
   cc @Taragolis 
   



[GitHub] [airflow] Taragolis commented on issue #29423: GlueJobOperator throws error after migration to newest version of Airflow

Posted by "Taragolis (via GitHub)" <gi...@apache.org>.
Taragolis commented on issue #29423:
URL: https://github.com/apache/airflow/issues/29423#issuecomment-1422632797

   > Oh damn, indeed, I realize I introduced a breaking change.
   
   `¯\_(ツ)_/¯` these things sometimes happen, but I guess we could fix it with your proposal. Would you like to work on this issue? 



[GitHub] [airflow] romibuzi commented on issue #29423: GlueJobOperator throws error after migration to newest version of Airflow

Posted by "romibuzi (via GitHub)" <gi...@apache.org>.
romibuzi commented on issue #29423:
URL: https://github.com/apache/airflow/issues/29423#issuecomment-1422775781

   @Taragolis Yeah, I can work on the fix. I also saw that @vgutkovsk is willing to submit a PR, so as you prefer :) 



[GitHub] [airflow] romibuzi commented on issue #29423: GlueJobOperator throws error after migration to newest version of Airflow

Posted by "romibuzi (via GitHub)" <gi...@apache.org>.
romibuzi commented on issue #29423:
URL: https://github.com/apache/airflow/issues/29423#issuecomment-1423143720

   @Taragolis I agree, I can definitely start working on the fix, as it was my previous contribution that led to the current issue.



[GitHub] [airflow] eladkal closed issue #29423: GlueJobOperator throws error after migration to newest version of Airflow

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal closed issue #29423: GlueJobOperator throws error after migration to newest version of Airflow
URL: https://github.com/apache/airflow/issues/29423

