You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/03/21 16:16:44 UTC

[GitHub] [airflow] potiuk opened a new issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

potiuk opened a new issue #22404:
URL: https://github.com/apache/airflow/issues/22404


   ### Discussed in https://github.com/apache/airflow/discussions/22403
   
   <div type='discussions-op-text'>
   
   <sup>Originally posted by **m1racoli** March 18, 2022</sup>
   ### Apache Airflow version
   
   2.2.4 (latest released)
   
   ### What happened
   
   When creating a temporary directory with `tempfile.TemporaryDirectory()` and then failing a task, the corresponding directory does not get deleted.
   
   This happens in Airflow on Astronomer as well as locally in for `astro dev` setups for LocalExecutor and CeleryExecutor.
   
   ### What you think should happen instead
   
   As in normal Python environments, the directory should get cleaned up, even in the case of a raised exception.
   
   ### How to reproduce
   
   Running this DAG will leave a temporary directory in the corresponding location:
   
   ```python
   import os
   import tempfile
   
   from airflow.decorators import dag, task
   from airflow.utils.dates import days_ago
   
   
   class MyException(Exception):
       pass
   
   
   @task
   def run():
       tmpdir = tempfile.TemporaryDirectory()
       print(f"directory {tmpdir.name} created")
       assert os.path.exists(tmpdir.name)
   
       raise MyException("error!")
   
   
   @dag(start_date=days_ago(1))
   def tempfile_test():
       run()
   
   
   _ = tempfile_test()
   ```
   
   
   ### Operating System
   
   Debian (Astronomer Airflow Docker image)
   
   ### Versions of Apache Airflow Providers
   
   ```
   apache-airflow-providers-amazon==1!3.0.0
   apache-airflow-providers-cncf-kubernetes==1!3.0.2
   apache-airflow-providers-elasticsearch==1!2.2.0
   apache-airflow-providers-ftp==1!2.0.1
   apache-airflow-providers-google==1!6.4.0
   apache-airflow-providers-http==1!2.0.3
   apache-airflow-providers-imap==1!2.2.0
   apache-airflow-providers-microsoft-azure==1!3.6.0
   apache-airflow-providers-mysql==1!2.2.0
   apache-airflow-providers-postgres==1!3.0.0
   apache-airflow-providers-redis==1!2.0.1
   apache-airflow-providers-slack==1!4.2.0
   apache-airflow-providers-sqlite==1!2.1.0
   apache-airflow-providers-ssh==1!2.4.0
   ```
   
   ### Deployment
   
   Astronomer
   
   ### Deployment details
   
   GKE, vanilla `astro dev`, LocalExecutor and CeleryExecutor.
   
   ### Anything else
   
   Always
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   </div>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1075381813


   I did some testing and I think it comes from the way we run tasks via fork:
   
   ```python
   import os
   from time import sleep
   import tempfile
   
   
   def test():
       tmpdir = tempfile.TemporaryDirectory()
       print(f"directory {tmpdir.name} created")
       assert os.path.exists(tmpdir.name)
       raise Exception("exiting")
   
   
   pid = os.fork()
   if pid:
       sleep(2)
   else:
   
       try:
           test()
       finally:
           os._exit(0)
   ```
   
   This is exactly what we do to run the task and the tmp directory is not deleted as well. When I replace os._exit() with sys.exit(), the directory is deleted. But using sys.exit() for forked process is wrong and os._exit() is fine:
   
   Via: https://docs.python.org/3/library/os.html#os._exit
   
   > os._exit(n)
   > Exit the process with status n, without calling cleanup handlers, flushing stdio buffers, etc.
   > Note The standard way to exit is sys.exit(n). _exit() should normally only be used in the child process after a fork().
   
   We are (correctly) exiting forks via os._exit. We cannot do sys.exit() because we should not run finalizers that have been defined in the parent process before the fork happened (and running them prematurely might cause some memory corruption for shared memory from what I understand) . I think there is not much we can do other than warn users that they shoud not rely on finalizers in Airflow tasks. I think ther is no easy way to only run post-fork finalizations.
   
   @ashb - any comments on that?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1074142442


   According to docs (and I looked at the code) the directory SHOULD be deleted by finalizer in this case - but is not.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1074138997


   This happens when you raise the exception _outside_ the block? I can sort of understand if you do it inside, but the temporary directory should be deleted when the context manager exits. It is indeed strange if this is the case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1075406261


   Really nice one  - fix in #22475 :D  - tested it with:
   
   ```python
   import os
   from time import sleep
   import tempfile
   
   
   def test():
       tmpdir = tempfile.TemporaryDirectory()
       print(f"directory {tmpdir.name} created")
       assert os.path.exists(tmpdir.name)
       raise Exception("exiting")
   
   
   pid = os.fork()
   if pid:
       sleep(2)
   else:
   
       try:
           test()
       except BaseException:
           pass
       finally:
           pass
       os._exit(0)
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1075381813


   I did some testing and I think it comes from the way we run tasks via fork:
   
   ```python
   import os
   from time import sleep
   import tempfile
   
   
   def test():
       tmpdir = tempfile.TemporaryDirectory()
       print(f"directory {tmpdir.name} created")
       assert os.path.exists(tmpdir.name)
       raise Exception("exiting")
   
   
   pid = os.fork()
   if pid:
       sleep(2)
   else:
   
       try:
           test()
       finally:
           os._exit(0)
   ```
   
   This is exactly what we do to run the task and the tmp directory is not deleted as well. When I replace os._exit() with sys.exit(), the directory is deleted. But using sys.exit() for forked process is wrong and os._exit() is fine:
   
   Via: https://docs.python.org/3/library/os.html#os._exit
   
   > os._exit(n)
   > Exit the process with status n, without calling cleanup handlers, flushing stdio buffers, etc.
   > Note The standard way to exit is sys.exit(n). _exit() should normally only be used in the child process after a fork().
   
   We are (correctly) exiting forks via os._exit. We cannot do sys.exit() because we should not run finalizers that have been defined in the parent process before the fork happened (and running them prematurely might cause some memory corruption for shared memory from what I understand) . I think there is not much we can do other than warn users that they shoud not rely on finalizers in Airflow tasks. I think ther is no easy way to only run post-fork finalizations.
   
   @ashb - any comments on that?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1075297539


   > This is a wild guess and can be very far off, but my main point is that relying on scoping to clean things up is unreliable by design (of the Python langauge), and it shouldn’t be too surprising when it does not work, unfortunately.
   
   I think it is very probable. And I think secret masker might play a role there. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] m1racoli commented on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
m1racoli commented on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1075163309


   Yeah, I know that when using a context manager the problem doesn't occur, as the context manager explicitly cleans up the directory upon completion.
   
   And yes, running
   
   ```python
   import os
   import tempfile
   
   tmpdir = tempfile.TemporaryDirectory()
   print(f"directory {tmpdir.name} created")
   assert os.path.exists(tmpdir.name)
   
   raise Exception("error!")
   ```
   
   directly in Python cleans up the directory (done by the finalizer as @potiuk said).
   
   So now I wonder what's different in Airflow. My judgement is based on the assumption that each task is executed in it's own process and therefore also it's own Python process. Thus a failing task's Python process should clean up the directory too (unless Airflow messes with the finalizers). Maybe the child Python process knows it's a child of a parent process and doesn't invoke the finalizer (pure speculation). 🤔 
   
   I'm not an expert of the Python runtime and I did not verify (yet) if this also occurs when running in a sub-process. I initially was hoping that someone on Airflow's side knows what's going on. :)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1075245303


   Finalisers in Python are somewhat known to not always work, unfortunately, which is why context manager is a thing. Since Python objects can contain non-transparent reference cycles, it is not guaranteed when things are finalised, or even at all.
   
   In this particular situation, my guess is that the exception raised in the function is causing a reference cycle. An exception instance holds the frame object of the current function, which references the local variables. But Airflow likely catches the exception somewhere (for logging, for example), which prevents the exception from being freed, which in turn holds on the `TemporaryDirectory` object.
   
   This is a wild guess and can be very far off, but my main point is that relying on scoping to clean things up is unreliable by design (of the Python langauge), and it shouldn’t be too surprising when it does not work, unfortunately.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] m1racoli edited a comment on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
m1racoli edited a comment on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1075163309


   Yeah, I know that when using a context manager the problem doesn't occur, as the context manager explicitly cleans up the directory upon completion.
   
   And yes, running
   
   ```python
   import os
   import tempfile
   
   tmpdir = tempfile.TemporaryDirectory()
   print(f"directory {tmpdir.name} created")
   assert os.path.exists(tmpdir.name)
   
   raise Exception("error!")
   ```
   
   directly in Python cleans up the directory (done by the finalizer as @potiuk said).
   
   So now I wonder what's different in Airflow. My judgement is based on the assumption that each task is executed in it's own process and therefore also it's own Python process. Thus a failing task's Python process should clean up the directory too (unless Airflow messes with the finalizers). Maybe the child Python process knows it's a child of a parent process and doesn't invoke the finalizer (pure speculation). 🤔 
   
   > The original one does not (i.e. the folder remains after Exception is thrown and even after scheduler interpreter is stopped
   
   This ☝🏼 makes the issue even more interesting, excluding the possibility that it's a child process issue.
   
   I'm not an expert of the Python runtime and I did not verify (yet) if this also occurs when running in a sub-process. I initially was hoping that someone on Airflow's side knows what's going on. :)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1075406261


   Really nice one  - fix in #22475 :D  - tested it with:
   
   ```python
   import os
   from time import sleep
   import tempfile
   
   
   def test():
       tmpdir = tempfile.TemporaryDirectory()
       print(f"directory {tmpdir.name} created")
       assert os.path.exists(tmpdir.name)
       raise Exception("exiting")
   
   
   pid = os.fork()
   if pid:
       sleep(2)
   else:
   
       try:
           test()
       except BaseException:
           pass
       finally:
           pass
       os._exit(0)
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1074113877






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1075385745


   AH NO! I got it.
   
   This is our bug actually. The problem is that we call it in "finally" clause. But we should not.  When finally is being prrocessed, the original exception with the stack trace and the tmpdir are still kept so os._exit() have no chance of running finalizer. Fix is coming!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1074141785


   > This happens when you raise the exception _outside_ the block? I can sort of understand if you do it inside, but the temporary directory should be deleted when the context manager exits. It is indeed strange if this is the case.
   
   Just to clarify - the `context manager` case behaves as expected (directory deleted). The original one does not (i.e. the folder remains after Exception is thrown and even after scheduler interpreter is stopped:
   
   ```python
   
   import os
   import tempfile
   
   from airflow.decorators import dag, task
   from airflow.utils.dates import days_ago
   
   
   class MyException(Exception):
       pass
   
   
   @task
   def run():
       tmpdir = tempfile.TemporaryDirectory()
       print(f"directory {tmpdir.name} created")
       assert os.path.exists(tmpdir.name)
   
       raise MyException("error!")
   
   
   @dag(start_date=days_ago(1))
   def tempfile_test():
       run()
   
   
   _ = tempfile_test()
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk closed issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #22404:
URL: https://github.com/apache/airflow/issues/22404


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1074142442


   According to docs (and I looked at the code) the directory SHOULD be deleted by finalizer in this case - but is not (easily reproducible).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] m1racoli edited a comment on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
m1racoli edited a comment on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1075163309


   Yeah, I know that when using a context manager the problem doesn't occur, as the context manager explicitly cleans up the directory upon completion.
   
   And yes, running
   
   ```python
   import os
   import tempfile
   
   tmpdir = tempfile.TemporaryDirectory()
   print(f"directory {tmpdir.name} created")
   assert os.path.exists(tmpdir.name)
   
   raise Exception("error!")
   ```
   
   directly in Python cleans up the directory (done by the finalizer as @potiuk said).
   
   So now I wonder what's different in Airflow. My judgement is based on the assumption that each task is executed in it's own process and therefore also it's own Python process. Thus a failing task's Python process should clean up the directory too (unless Airflow messes with the finalizers). Maybe the child Python process knows it's a child of a parent process and doesn't invoke the finalizer (pure speculation). 🤔 
   
   > The original one does not (i.e. the folder remains after Exception is thrown and even after scheduler interpreter is stopped
   
   This ☝🏼 makes the issue even more interesting.
   
   I'm not an expert of the Python runtime and I did not verify (yet) if this also occurs when running in a sub-process. I initially was hoping that someone on Airflow's side knows what's going on. :)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1075385745


   AH NO! I got it.
   
   This is our bug actually. The problem is that we call `os._exit()` in "finally" clause. But we should not.  When finally is being prrocessed, the original exception with the stack trace and the tmpdir are still kept so os._exit() have no chance of running finalizer. Fix is coming!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] m1racoli commented on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
m1racoli commented on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1075642574


   Awesome! Great work @potiuk!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #22404: tempfile.TemporaryDirectory does not get deleted after task failure

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #22404:
URL: https://github.com/apache/airflow/issues/22404#issuecomment-1075297539


   > This is a wild guess and can be very far off, but my main point is that relying on scoping to clean things up is unreliable by design (of the Python langauge), and it shouldn’t be too surprising when it does not work, unfortunately.
   
   I think it is very probable. And I think secret masker might play a role there. (intuition)  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org