Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/07/11 21:41:22 UTC

[GitHub] [airflow] potiuk opened a new pull request, #24981: Patch getfqdn with more resilient version

potiuk opened a new pull request, #24981:
URL: https://github.com/apache/airflow/pull/24981

   We keep getting repeated issue reports about non-matching
   hostnames of workers. This seems to be traceable to the getfqdn
   method of socket, which in Kubernetes in some circumstances (a race
   condition with networking setup at startup) can return different
   hostnames at different times.
   
   There seems to be a related issue in Python that has not been
   resolved in more than 13 years (!)
   
   https://github.com/python/cpython/issues/49254
   
   The error seems to be related to the way the canonical name is
   derived by getfqdn: it uses gethostbyaddr, which sometimes
   provides a different name than the canonical one (it returns the
   first resolved DNS name that contains ".").
   
   We are fixing it in two ways:
   
   * instead of using gethostbyaddr we are using getaddrinfo with
     the AI_CANONNAME flag which (according to the docs):
   
     https://man7.org/linux/man-pages/man3/getaddrinfo.3.html
   
       If hints.ai_flags includes the AI_CANONNAME flag, then the
       ai_canonname field of the first of the addrinfo structures in the
       returned list is set to point to the official name of the host.
   
   * we are caching the name returned by the first retrieval in each
     interpreter. This way, at least inside the same interpreter, the
     name of the host should not change.
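
The two changes described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `resilient_getfqdn`, the fallback to the unresolved name, and the use of `functools.lru_cache` for the per-interpreter cache are assumptions made for the example.

```python
import functools
import socket


@functools.lru_cache(maxsize=None)
def resilient_getfqdn(name: str = "") -> str:
    """Return a canonical host name, computed once per interpreter per argument.

    Hypothetical helper: the name and the fallback behaviour are assumptions.
    """
    name = name.strip()
    if not name or name == "0.0.0.0":
        name = socket.gethostname()
    try:
        # AI_CANONNAME asks the resolver to fill in the official host name.
        addrinfos = socket.getaddrinfo(
            name, None, 0, socket.SOCK_DGRAM, 0, socket.AI_CANONNAME
        )
    except OSError:
        # Resolution failed: fall back to the name we started with.
        return name
    for _family, _type, _proto, canonname, _sockaddr in addrinfos:
        if canonname:
            return canonname
    return name


if __name__ == "__main__":
    first = resilient_getfqdn()
    second = resilient_getfqdn()  # served from the cache, no new lookup
    print(first, first == second)
```

Because the result is cached, a later change in what the resolver answers cannot change the value seen inside the same interpreter, whether or not the first answer was the "right" one.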
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #24981: Patch getfqdn with more resilient version

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #24981:
URL: https://github.com/apache/airflow/pull/24981#issuecomment-1181168840

   > i am curious.... if the issue is just a changing representation, would it suffice to _only_ cache the result? or do we also need the other logic? or if the other logic is good, do we still need to cache?
   
   This is a defensive/protective approach to protect against both a wrong FQDN and one that changes. I believe both changes should help with cases like https://github.com/apache/airflow/discussions/20269#discussioncomment-3095616
   
   But I have no hard proof, unfortunately. This is quite a leap of faith: an attempt to get a more stable solution without a way to easily test that it actually solves the problem. But it seems no one else was interested in discussing it when I raised the question several times in our development Slack, so I decided to do some more searching and make a PR.
   
   > like if we had issues with the value changing... what was the problem? does the "old" value become non-functional or something? or do they both tell the truth?
   
   Context:
   
   I think (but I am not 100% sure) they MIGHT or MIGHT NOT work, depending on the state of networking and DNS stability. From what I read in the description of https://github.com/python/cpython/issues/49254, socket.getfqdn() on hosts that have a DNS-only resolver (which is the case for K8S, I believe) might sometimes get a wrong, non-canonical name. It depends on many factors:
   
   * whether the host supports IPv4 or IPv6
   * how many network interfaces there are
   * whether the host is registered under some of the addresses with a name that contains "." (so it does not have to be a "fully qualified" name; it can be a shorter name, but it must contain ".")
   * in what sequence those addresses are returned
   * whether the internal DNS of the K8S cluster is not too busy and can respond quickly enough for each of those addresses
   
   `getfqdn` takes a shortcut: it returns the first name containing "." that gethostbyaddr() derives from the IP addresses. This is the gist of the issue: that name is not **always** the canonical name. Mostly yes. But not always. And I think the host would normally be reachable by both names, but we are using the name in a few places as "verification" and a "consistency check" (and we raise exceptions if the expected hostname does not match the observed one).
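
To make the shortcut concrete, here is a simplified paraphrase of the selection step only, with no DNS calls. The helper name `pick_first_dotted` and the sample host names are hypothetical; the point is just that the answer depends on which names the resolver returns and in what order.

```python
from typing import Sequence


def pick_first_dotted(hostname: str, aliases: Sequence[str]) -> str:
    """Mimic the selection step: first candidate containing '.', else hostname."""
    for candidate in [hostname, *aliases]:
        if "." in candidate:
            return candidate
    return hostname


if __name__ == "__main__":
    # Hypothetical resolver answers for the same pod at two different moments.
    early = pick_first_dotted(
        "worker-0", ["worker-0.workers", "worker-0.workers.ns.svc.cluster.local"]
    )
    later = pick_first_dotted("worker-0.workers.ns.svc.cluster.local", ["worker-0"])
    print(early)
    print(later)
```

In this made-up example the same host is reported under two different names, and a strict equality check between them would fail even though both may be resolvable.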
   
   Kubernetes networking is complex and the result might change depending on the DNS responses/registration of the Pods/Containers on the various network interfaces. Usually each Pod/Container has multiple network interfaces, and it also depends on your K8S configuration, including some security rules, Istio, VPN, security zones, ingress, what network virtualization is used, etc. Virtualization of networking in K8S is the centerpiece of how it works and it is a pretty complex subject. Just look at https://kubernetes.io/docs/concepts/cluster-administration/networking/ to see how many network virtualization options are possible. The list goes on and on.
   
   My take:
   
   The getfqdn change (which is borrowed from https://github.com/borgbackup/borg/issues/3471) changes the retrieval mechanism to use getaddrinfo with a hint that only the canonical name should be returned. HOPEFULLY this will work better. But I do not know. This is based on evidence from some people from borgbackup and my understanding of how it works.
   
   From: https://man7.org/linux/man-pages/man3/getaddrinfo.3.html
   
   >  If hints.ai_flags includes the AI_CANONNAME flag, then the ai_canonname field of the first of the addrinfo structures in the returned list is set to point to the official name of the host.
   
   My hope is that (as in the case of borgbackup) it will give more stable results. But I am not sure, and I am not able to test that it will always be 100% accurate. The issue has been open for 13 years, and I believe the main reason is the "volatility" of the behaviour. So I am not sure if we will ever be able to have "hard" data on it. We need to act based on intuition and "educated guesses" here, I am afraid. And likely we will never know if it worked, because it happens intermittently and is not reproducible. I've seen quite a few questions raised in Slack/discussions (usually there were no hard reproduction steps, mostly anecdotal evidence, but I saw it frequently enough to believe it is happening).
   
   So I also implemented caching. This should help in that one interpreter should only retrieve the name once. I think some of the issues might come from the fact that the same interpreter retrieves a different FQDN at different times. Caching should help there. And it should help to at least keep consistency even if we have multiple interpreters but the workers are spawned with "forks", because if the cache is populated before forking, it will stay there.
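
The fork behaviour can be illustrated with a small POSIX-only sketch. Everything here is a stand-in: `cached_name` simulates an unstable lookup with a counter instead of doing DNS, and `name_seen_by_forked_child` is a hypothetical helper that reports what a forked child reads from the inherited cache.

```python
import functools
import itertools
import os

_counter = itertools.count(1)


@functools.lru_cache(maxsize=None)
def cached_name() -> str:
    # Stand-in for an unstable lookup: every uncached call yields a new value.
    return f"host-{next(_counter)}"


def name_seen_by_forked_child() -> str:
    """Fork and return the name the child obtains from cached_name()."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        try:
            os.close(read_fd)
            os.write(write_fd, cached_name().encode())
        finally:
            os._exit(0)  # leave immediately, skipping parent cleanup code
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()
    os.waitpid(pid, 0)
    return data.decode()


if __name__ == "__main__":
    parent_name = cached_name()  # populate the cache before forking
    print(parent_name, name_seen_by_forked_child())
```

If the parent has called `cached_name()` before forking, the child reads the same value from its copy of the cache instead of performing a fresh lookup. Processes started without fork (a separate interpreter, for example) would not share it.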
   
   Implementing both is purely defensive. I currently do not have enough data to determine which of those changes is really needed and which one will reduce the problem more. So I chose to implement both.




[GitHub] [airflow] potiuk merged pull request #24981: Patch getfqdn with more resilient version

Posted by GitBox <gi...@apache.org>.
potiuk merged PR #24981:
URL: https://github.com/apache/airflow/pull/24981




[GitHub] [airflow] potiuk commented on pull request #24981: Patch getfqdn with more resilient version

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #24981:
URL: https://github.com/apache/airflow/pull/24981#issuecomment-1181436265

   I've also asked the user from https://github.com/apache/airflow/discussions/20269 to try it out, so maybe we will have a chance to see if it helps :)




[GitHub] [airflow] potiuk commented on pull request #24981: Patch getfqdn with more resilient version

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #24981:
URL: https://github.com/apache/airflow/pull/24981#issuecomment-1181375833

   Yeah. Also the nice thing is that `socket.getfqdn` can still be used by the users if they find the new approach troublesome :)




[GitHub] [airflow] potiuk commented on pull request #24981: Patch getfqdn with more resilient version

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #24981:
URL: https://github.com/apache/airflow/pull/24981#issuecomment-1180901686

   I am not 100% sure about this one, but I saw far too many issues which seem to result from a "changing" value of `getfqdn` - and it seems that this might be caused by an issue that has been known in Python for 13 years now ....




[GitHub] [airflow] dstandish commented on pull request #24981: Patch getfqdn with more resilient version

Posted by GitBox <gi...@apache.org>.
dstandish commented on PR #24981:
URL: https://github.com/apache/airflow/pull/24981#issuecomment-1180933031

   i am curious.... if the issue is just a changing representation, would it suffice to _only_ cache the result?  or do we also need the other logic?  or if the other logic is good, do we still need to cache?

