You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "vksunilk (via GitHub)" <gi...@apache.org> on 2023/02/03 06:53:18 UTC

[GitHub] [airflow] vksunilk opened a new pull request, #29346: Check Absence of files or objects via GCSObjectExistenceSensor

vksunilk opened a new pull request, #29346:
URL: https://github.com/apache/airflow/pull/29346

   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of an existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code changes, an Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in a newsfragment file, named `{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in [newsfragments](https://github.com/apache/airflow/tree/main/newsfragments).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] vksunilk commented on pull request #29346: Check Absence of files or objects via GCSObjectExistenceSensor

Posted by "vksunilk (via GitHub)" <gi...@apache.org>.
vksunilk commented on PR #29346:
URL: https://github.com/apache/airflow/pull/29346#issuecomment-1416052237

   > > Yes. Incase, the user needs a case where he needs to wait for the file to be deleted by an external task. This can be useful. This is one such usecase.
   > 
   > do you want to wait for a file to get deleted? but this look just the opposite of what the name GCSObjectExistenceSensor suggests
   
   Maybe should we create a new operator for such usecase?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal closed pull request #29346: Check Absence of files or objects via GCSObjectExistenceSensor

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal closed pull request #29346: Check Absence of files or objects via GCSObjectExistenceSensor
URL: https://github.com/apache/airflow/pull/29346


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] vksunilk commented on pull request #29346: Check Absence of files or objects via GCSObjectExistenceSensor

Posted by "vksunilk (via GitHub)" <gi...@apache.org>.
vksunilk commented on PR #29346:
URL: https://github.com/apache/airflow/pull/29346#issuecomment-1420239346

   > I don't know how common this use case is?
   > 
   > This is the first time ever that I see such request for a sensor. It feels more like something you should have as a custom operator in your own Airflow instance. Unlike other operator/sensor improvements which may not be common but do no harm in this case our options are:
   > 
   > 1. Add  `check_for_absence` to GCSObjectExistenceSensor which is confusing because it would do the opposite of what this sensor is designed to do.
   > 2. Deprecate and rename the class to something more generic like @pankajastro suggested that can support both behaviors. This means that all users of this operators (and I expect there are many!) will have to change their code because of this probably not very common use case. I don't this use case justify it.
   > 
   > I tend not to accept the PR until we see significant user voice in favor of use case.
   > 
   > i wonder what others think?
   
   I had this specific usecase. So, thought incase someone else also has this usecase, they'll be able to benefit. And ofcourse, I've created a custom operator for my project


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] pankajastro commented on pull request #29346: Check Absence of files or objects via GCSObjectExistenceSensor

Posted by "pankajastro (via GitHub)" <gi...@apache.org>.
pankajastro commented on PR #29346:
URL: https://github.com/apache/airflow/pull/29346#issuecomment-1416028927

   > Yes. Incase, the user needs a case where he needs to wait for the file to be deleted by an external task. This can be useful. This is one such usecase.
   
   do you want to wait for a file to get deleted? but this look just the opposite of what the name GCSObjectExistenceSensor 
    suggests  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] josh-fell commented on pull request #29346: Check Absence of files or objects via GCSObjectExistenceSensor

Posted by "josh-fell (via GitHub)" <gi...@apache.org>.
josh-fell commented on PR #29346:
URL: https://github.com/apache/airflow/pull/29346#issuecomment-1433383415

   I agree that the suggested parameter name is confusing in the context of this operator. I also feel that this is a very niche use case and pattern in a data pipeline. I could envision pipelines designed in such a way where the task that deletes the file triggers a DAG which can only begin when said file is delete (either through TriggerDagRunOperator or Datasets). Perhaps this is more of a design exercise rather than something a pre-packaged operator should offer. But, by all means, a custom operator or using the `GcsHook` to do what you need is highly-encouraged for custom use cases; it's one of the beauties of Airflow: it's extensibility.
   
   The generic operator would indeed solve both problems, but does create user impact between versions and a maintenance burden for a corner case IMO.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] pankajastro commented on pull request #29346: Check Absence of files or objects via GCSObjectExistenceSensor

Posted by "pankajastro (via GitHub)" <gi...@apache.org>.
pankajastro commented on PR #29346:
URL: https://github.com/apache/airflow/pull/29346#issuecomment-1418949027

   > Maybe should we create a new operator for such usecase?
   
   Not sure, renaming `GCSObjectExistenceSensor`  to `GCSObjectSensor` deprecating `GCSObjectExistenceSensor` WDYT?
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on pull request #29346: Check Absence of files or objects via GCSObjectExistenceSensor

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on PR #29346:
URL: https://github.com/apache/airflow/pull/29346#issuecomment-1437448536

   Thanks josh!
   so at this point I think it's best to close this PR for the reasons listed in previous comments
   
   @vksunilk thank you for your contribution


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on pull request #29346: Check Absence of files or objects via GCSObjectExistenceSensor

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on PR #29346:
URL: https://github.com/apache/airflow/pull/29346#issuecomment-1418992002

   I don't know how common this use case is? 
   
   This is the first time ever that I see such request for a sensor. It feels more like something you should have as a custom operator in your own Airflow instance. Unlike other operator/sensor improvements which may not be common but do no harm in this case our options are:
   1. Add  `check_for_absence` to GCSObjectExistenceSensor which is confusing because it would do the opposite of what this sensor is designed to do.
   2. Deprecate and rename the class to something more generic like @pankajastro suggested that can support both behaviors. This means that all users of this operators (and I expect there are many!) will have to change their code because of this probably not very common use case. I don't this use case justify it.
   
   I tend not to accept this until we see significant user voice in favor of this.
   
   i wonder what others think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] vksunilk commented on pull request #29346: Check Absence of files or objects via GCSObjectExistenceSensor

Posted by "vksunilk (via GitHub)" <gi...@apache.org>.
vksunilk commented on PR #29346:
URL: https://github.com/apache/airflow/pull/29346#issuecomment-1416017293

   > I think the purpose of `GCSObjectExistenceSensor` is to wait for a file to present. Why do we need these changes? It does not fail for me if the file is not present it waits for the file to get created and then timeout based on my `poke_interval` and `timeout` param.
   > 
   > ```
   > [2023-02-03, 15:06:41 UTC] {taskinstance.py:1524} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='airflow' AIRFLOW_CTX_DAG_ID='bq_check_op' AIRFLOW_CTX_TASK_ID='gcs_object_exists_task' AIRFLOW_CTX_EXECUTION_DATE='2023-02-03T15:06:08.323807+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='manual__2023-02-03T15:06:08.323807+00:00'
   > [2023-02-03, 15:06:41 UTC] {gcs.py:94} INFO - Sensor checks existence of : test-gcs-bucket-providers, example_gcs.py
   > [2023-02-03, 15:06:41 UTC] {base.py:73} INFO - Using connection ID 'google_cloud_default' for task execution.
   > [2023-02-03, 15:06:48 UTC] {gcs.py:94} INFO - Sensor checks existence of : test-gcs-bucket-providers, example_gcs.py
   > [2023-02-03, 15:06:48 UTC] {base.py:73} INFO - Using connection ID 'google_cloud_default' for task execution.
   > [2023-02-03, 15:06:56 UTC] {gcs.py:94} INFO - Sensor checks existence of : test-gcs-bucket-providers, example_gcs.py
   > [2023-02-03, 15:06:56 UTC] {base.py:73} INFO - Using connection ID 'google_cloud_default' for task execution.
   > [2023-02-03, 15:07:03 UTC] {gcs.py:94} INFO - Sensor checks existence of : test-gcs-bucket-providers, example_gcs.py
   > [2023-02-03, 15:07:03 UTC] {base.py:73} INFO - Using connection ID 'google_cloud_default' for task execution.
   > [2023-02-03, 15:07:05 UTC] {taskinstance.py:1798} ERROR - Task failed with exception
   > Traceback (most recent call last):
   >   File "/opt/airflow/airflow/sensors/base.py", line 216, in execute
   >     raise AirflowSensorTimeout(message)
   > airflow.exceptions.AirflowSensorTimeout: Sensor has timed out; run duration of 24.36089755300054 seconds exceeds the specified timeout of 20.
   > [2023-02-03, 15:07:05 UTC] {taskinstance.py:1338} INFO - Immediate failure requested. Marking task as FAILED. dag_id=bq_check_op, task_id=gcs_object_exists_task, execution_date=20230203T150608, start_date=20230203T150640, end_date=20230203T150705
   > [2023-02-03, 15:07:05 UTC] {standard_task_runner.py:105} ERROR - Failed to execute job 147 for task gcs_object_exists_task (Sensor has timed out; run duration of 24.36089755300054 seconds exceeds the specified timeout of 20.; 5779)
   > [2023-02-03, 15:07:05 UTC] {local_task_job.py:215} INFO - Task exited with return code 1
   > [2023-02-03, 15:07:05 UTC] {taskinstance.py:2616} INFO - 0 downstream tasks scheduled from follow-on schedule check
   > ```
   
   Yes. Incase, the user needs a case where he needs to wait for the file to be deleted by an external task. This can be useful. This is one such usecase.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org