Posted to commits@airflow.apache.org by "potiuk (via GitHub)" <gi...@apache.org> on 2023/07/31 16:35:19 UTC

[GitHub] [airflow] potiuk opened a new issue, #32973: Hanging test_send_tasks_to_celery_hang in quarantine

potiuk opened a new issue, #32973:
URL: https://github.com/apache/airflow/issues/32973

   ### Body
   
   It looks like `test_send_tasks_to_celery_hang` is actually hanging occasionally in a way that is not caught by our test timeouts.
   
   Test timeouts work by registering a handler for SIGALRM. However, it seems that this test, especially in the Amazon MWAA team environment, does not time out and hangs forever. This is likely because of the signal handling implemented by the Celery libraries used in this test.
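   
   To illustrate the suspected mechanism, here is a minimal sketch of a SIGALRM-based timeout and of how it can be silently defeated. This is not the Airflow test harness or Celery code: `run_with_alarm`, `well_behaved`, and `clobbers_handler` are hypothetical names, and `clobbers_handler` is only a stand-in for library code that changes signal dispositions while the test is running. It assumes a Unix platform and that it runs in the main thread.

```python
import signal
import time


class TestTimeout(Exception):
    """Raised by the alarm handler when the time budget is exceeded."""


def _raise_timeout(signum, frame):
    raise TestTimeout("timed out")


def run_with_alarm(func, seconds):
    """Run func() under a SIGALRM-based timeout; return 'ok' or 'timeout'."""
    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        func()
        return "ok"
    except TestTimeout:
        return "timeout"
    finally:
        # Cancel the pending alarm and restore whatever handler was installed.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


def well_behaved():
    # Sleeps past the budget; the alarm interrupts the sleep.
    time.sleep(0.5)


def clobbers_handler():
    # Hypothetical stand-in for code that resets signal handling.
    # Once SIGALRM is ignored, the timeout can no longer fire.
    signal.signal(signal.SIGALRM, signal.SIG_IGN)
    time.sleep(0.5)
```

   With a 0.1 second budget, `run_with_alarm(well_behaved, 0.1)` reports a timeout, while `run_with_alarm(clobbers_handler, 0.1)` runs to completion as if no timeout existed. If the second function blocked forever instead of sleeping, the test would hang indefinitely.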
   
   This test was added to cover a nasty race condition with multiprocessing (https://github.com/apache/airflow/pull/15989), but there is likely another related race condition that triggers the hang occasionally.
   
   After the move of the Celery executor (and its tests) to the provider, this test was executed as part of the "Providers" tests, and this move caused the test to hang pretty deterministically in the AWS MWAA team environment. For now we are quarantining the test to allow the tests to pass for the MWAA team. This test is also a strong suspect for generating mysterious "test failed" conditions in Airflow CI: if such a test hangs, GitHub Actions is known to lose logs for jobs that are stuck in a way that requires them to be terminated forcefully. There is a big chance this test is causing it.
   
   However, it is extremely difficult to reproduce reliably in Airflow CI.
   
   We should investigate the test before 2.7.0 gets released. The good news is that if we have a hypothesis that we would like to test, we can simply unquarantine the test with the fix and ask the MWAA team (@vincbeck @ferruzzi @o-nikolas were involved) to run the PR in their environment, which should tell us fairly quickly whether the problem has been fixed.
   
   
   
   
   ### Committer
   
   - [X] I acknowledge that I am a maintainer/committer of the Apache Airflow project.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #32973: Hanging test_send_tasks_to_celery_hang in quarantine

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on issue #32973:
URL: https://github.com/apache/airflow/issues/32973#issuecomment-1660541963

   > #33001 succeeds in AWS team test environment
   
   Wonderful. Thank you!




[GitHub] [airflow] potiuk commented on issue #32973: Hanging test_send_tasks_to_celery_hang in quarantine

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on issue #32973:
URL: https://github.com/apache/airflow/issues/32973#issuecomment-1660266792

   These are the two experimental PRs that will hopefully help to validate the hypothesis that moving the tests is the trigger:
   
   https://github.com/apache/airflow/pull/33001:  [EXPERIMENT] Tests for Celery executor moved back to core
   https://github.com/apache/airflow/pull/33002:  [EXPERIMENT] Bring back celery to core - leave tests in provider




[GitHub] [airflow] potiuk commented on issue #32973: Hanging test_send_tasks_to_celery_hang in quarantine

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on issue #32973:
URL: https://github.com/apache/airflow/issues/32973#issuecomment-1658757292

   Hey @ashb @yuqian90 ? @uranusjr @ephraimbuddy @jedcunningham @RNHTTR - tagging those who I know were somewhat involved with Celery, in case you have more experience or insight.
   
   I think this problem needs a somewhat deeper understanding of the signal/multiprocessing part of Celery, and of why the test might hang indefinitely without reacting to timeouts.
   
   It looks awfully like some nasty race condition. I am quite sure it started to fail not because we moved the Celery executor to providers (that was merely moving imports) but because the tests now run in the same session as other provider tests and some leftovers/side effects kick in. This might be a false negative, only occurring in some specific test environment (we do reset signals as part of the test, which does not normally happen), but it also might mean that this is a "real" issue that might occur in production.
   
   I want to run it for at least a few days and see whether we get visibly fewer "this job failed" results without any logs (which I suspected might be this test hanging). In the meantime, as part of 2.7.0 preparation, maybe some guesswork can be done by those more familiar with the subject: thanks to the MWAA team's environment we seem to have a relatively easy way to test any hypothesis on why this might happen. I will also take a look at the quarantined test runs; maybe we will be able to see it hanging there occasionally as well.
   
   If anyone has any ideas, maybe we can brainstorm here a bit on what the problem could be?
   
   




[GitHub] [airflow] vincbeck commented on issue #32973: Hanging test_send_tasks_to_celery_hang in quarantine

Posted by "vincbeck (via GitHub)" <gi...@apache.org>.
vincbeck commented on issue #32973:
URL: https://github.com/apache/airflow/issues/32973#issuecomment-1660540544

   #33001 succeeds in AWS team test environment

