Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/01/29 10:32:17 UTC

[GitHub] [airflow] potiuk edited a comment on pull request #21145: enter the shell breeze2 environment

potiuk edited a comment on pull request #21145:
URL: https://github.com/apache/airflow/pull/21145#issuecomment-1024884489


   > @potiuk Could you share your views about getting host user id and group id? Do we have to find its equivalent in windows to make this work? Also, can you explain to me more about why we need to find the host user id and group id and how it's used? There are a few comments about it in the code, but I couldn't fully understand it.
   
   This is only really needed on Linux. On Windows/MacOS those ids are not needed and can be left empty (I believe - to be checked).
   
   The reason is that on Linux, files mounted from the host into the container are mounted using the native filesystem. This basically means that any file created inside the container keeps, on the host, the same user id/group id it was given inside the container.
   
   For example, if we have user id 50001 and group id 501 in the container, any file we create in the container will keep the same user id and group id on the host. But those user/group ids might not exist on the host - if we create a user 50001 in the container, files it creates keep that id on the host after we exit from the container. This is very problematic on Linux because when we map the "logs" directory and some logs (and directories) are created there, they might be owned by a non-existing user after we exit. And we want to be able to see the logs outside of the container, because that's where we usually have our IDE and that's where we keep reading and analysing them.
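   A small illustration of why this matters (a sketch, assuming Linux and GNU coreutils): the kernel stores ownership as numeric ids, and those numbers are what survive a bind mount - user names are only resolved later against the local `/etc/passwd`.

   ```shell
   # Sketch (assumes Linux + GNU coreutils): file ownership is numeric.
   demo_file="$(mktemp)"

   # ls -ln prints the numeric uid/gid exactly as stored for the inode:
   ls -ln "${demo_file}"

   # stat extracts the same numbers; for a file we just created, they are ours:
   stat -c 'uid=%u gid=%g' "${demo_file}"

   rm -f "${demo_file}"
   ```

   A file created by uid 50001 inside a container therefore shows up as uid 50001 on the host, whether or not that uid exists in the host's `/etc/passwd`.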
   
   Then, the problem is that if you want to delete such folders and files, you need to use `sudo` on the host, because your regular user has no access to them. This is a big problem especially if files are created inside your source directory (which is also mounted into the container) - for example, git will not be able to remove some files and will refuse to switch branches.
   
   There is also a "reverse" problem - if you create files on the host without "all" permissions and mount them inside the container, and the container runs as a "different" user, the user in the container cannot access those files (unless you run as root inside the container - root inside the container is equivalent to root on the host and can access and update all files).
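   The "reverse" problem can be sketched like this (hypothetical file, assumes Linux and GNU coreutils):

   ```shell
   # Sketch of the "reverse" problem (hypothetical file, GNU coreutils).
   private_file="$(mktemp)"
   echo "secret" > "${private_file}"

   # Owner-only permissions: group and "other" users get nothing.
   chmod 600 "${private_file}"
   stat -c 'mode=%a uid=%u' "${private_file}"

   # If this file is bind-mounted into a container whose user has a different
   # uid, any read fails with "Permission denied" - unless that user is root,
   # because uid 0 bypasses the regular permission checks.
   rm -f "${private_file}"
   ```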
   
   This can be mitigated by "user remapping" - https://docs.docker.com/engine/security/userns-remap/ - but this can only be configured at the "docker daemon" level, and this is something we should not require of an average user. Another problem with user remapping is that it is a "global" setting: it will remap your user for all containers, and in many cases this is not what you really want.
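   For reference, user remapping is enabled in `/etc/docker/daemon.json` (per the Docker docs linked above), which is why it is daemon-wide rather than per-container:

   ```
   {
     "userns-remap": "default"
   }
   ```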
   
   So in order to avoid that, we do a few things:
   
   a) We use the `root` user in the container - all files are created and all processes run as the root user. This is not recommended for production, but it is great for CI - you can freely create and read any mounted files (no matter which user owns them), you can run pip/apt etc. without sudo, and it is generally much more convenient for many development tasks. The side effect is that all files created in the container have the root user/group set.
   
   b) We pass HOST_USER_ID and HOST_GROUP_ID to the container, so that we know who the user on the host is. Depending on the Linux distro, and even on your configuration (how many users you have created and in which sequence), the UID can be different.
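   Roughly, the ids are captured on the host with `id` and handed to the container as environment variables (a sketch - the actual Breeze invocation may differ; only the variable names come from the text above):

   ```shell
   # Sketch: capture the host's numeric ids to pass into the container.
   HOST_USER_ID="$(id -u)"
   HOST_GROUP_ID="$(id -g)"
   echo "HOST_USER_ID=${HOST_USER_ID} HOST_GROUP_ID=${HOST_GROUP_ID}"

   # The docker invocation would then look roughly like:
   # docker run --rm \
   #     -e HOST_USER_ID="${HOST_USER_ID}" \
   #     -e HOST_GROUP_ID="${HOST_GROUP_ID}" \
   #     <the-breeze-image> ...
   ```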
   
   c) When the user enters the container, we set a "trap": `add_trap "in_container_fix_ownership" EXIT HUP INT TERM` - this trap runs the "fix_ownership" script that looks for all created files in the directories where we expect files to be created:
   
   ```
               "/files"
               "/root/.aws"
               "/root/.azure"
               "/root/.config/gcloud"
               "/root/.docker"
               "/opt/airflow/logs"
               "/opt/airflow/docs"
               "/opt/airflow/dags"
               "${AIRFLOW_SOURCES}"
   ```
   
   Whenever we exit or terminate the container, this script is executed: it finds all files owned by "root" in those directories and changes their ownership to HOST_USER/HOST_GROUP. This way, when you exit the container on Linux, the files are owned by the host user and can be easily deleted - either manually or when you change branches.
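   In shell terms, the trap and the ownership fix boil down to roughly this (a sketch, not the actual Breeze script; the directory list is abbreviated from the one above, and the function name is illustrative):

   ```shell
   # Sketch of the exit trap (not the actual Breeze script; abbreviated dirs).
   fix_ownership() {
       local dir
       for dir in "/files" "/opt/airflow/logs" "/opt/airflow/dags"; do
           [ -d "${dir}" ] || continue
           # Hand every root-owned file back to the host user's numeric ids:
           find "${dir}" -user root \
               -exec chown "${HOST_USER_ID}:${HOST_GROUP_ID}" {} + \
               2>/dev/null || true
       done
   }

   # Run the fix whenever the container shell exits or is terminated:
   trap fix_ownership EXIT HUP INT TERM
   ```

   Registering the function for HUP/INT/TERM as well as EXIT means the ownership gets fixed even when the container session is interrupted rather than exited cleanly.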
   
   On MacOS and Windows this is not needed. Both MacOS and Windows use "user-space" filesystems to mount files. Those filesystems are far slower than the native filesystem (many times slower, actually) - which impacts the speed of running Airflow in a docker container on MacOS and Windows. However, they automatically remap the user - all files created inside the containers are automatically remapped to the "host" user's ownership, so there is no need to fix ownership in those cases.
   
   
   I hope it is clearer now. I will create an ADR out of that comment :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org