Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/08/03 13:32:48 UTC

[GitHub] [airflow] rodrigoechaide commented on issue #11379: Temporary failure in name resolution while running tasks using KubernetesExecutor

rodrigoechaide commented on issue #11379:
URL: https://github.com/apache/airflow/issues/11379#issuecomment-891850328


   Hi @Siddharthk, are you still facing this issue? I am hitting the same one while running stress tests on Airflow with a DAG that has 500 parallel tasks. Each task takes an iterator parameter, and by changing it I can control the task's duration. The duration does not seem to matter: I have seen the failure with tasks lasting anywhere from a few seconds to more than 20 minutes. I am using the KubernetesExecutor, and when listing the pods with kubectl I get this:
   
   ```
   k get pods -n airflow | grep Error
   performancetest500tasksinparallel20taskperformancetest500tasksd.04c772dbda6c47b79b017c90b73055af   0/1     Error       0          8m27s
   performancetest500tasksinparallel20taskperformancetest500tasksd.05269b668c8043c7b7ac32c0e06ce2bc   0/1     Error       0          6m31s
   performancetest500tasksinparallel20taskperformancetest500tasksd.0819b03e3fda475abfd3893dc7598ffb   0/1     Error       0          8m56s
   performancetest500tasksinparallel20taskperformancetest500tasksd.09f9fa1367194deabead2b7d6de72c83   0/1     Error       0          8m2s
   performancetest500tasksinparallel20taskperformancetest500tasksd.0c61b81e7dfc4d17846c89d78eefac0c   0/1     Error       0          5m59s
   performancetest500tasksinparallel20taskperformancetest500tasksd.0d0b39ea912a48c898d13b5392c0ee7e   0/1     Error       0          8m41s
   performancetest500tasksinparallel20taskperformancetest500tasksd.0d1e17539b934616a0f72a05b530d88e   0/1     Error       0          8m33s
   performancetest500tasksinparallel20taskperformancetest500tasksd.12e3fd2a030340589e251c987652c61e   0/1     Error       0          9m16s
   performancetest500tasksinparallel20taskperformancetest500tasksd.1312a64638e34ee488d5f8839a29c0e6   0/1     Error       0          7m25s
   performancetest500tasksinparallel20taskperformancetest500tasksd.1508cf02371d4dff8c925a3855a60911   0/1     Error       0          7m31s
   performancetest500tasksinparallel20taskperformancetest500tasksd.1d3c9140a24e42c29fe5def938832759   0/1     Error       0          7m17s
   performancetest500tasksinparallel20taskperformancetest500tasksd.1e5cee28a93b4f62bc1c06d1bb6ed785   0/1     Error       0          8m30s
   performancetest500tasksinparallel20taskperformancetest500tasksd.214e5df400c24764b9104e5e324dc314   0/1     Error       0          8m55s
   performancetest500tasksinparallel20taskperformancetest500tasksd.272b9e6502ce49078c68741731aa8144   0/1     Error       0          7m39s
   performancetest500tasksinparallel20taskperformancetest500tasksd.2840867f20a34a4fae6ad71ff1ef2803   0/1     Error       0          6m3s
   performancetest500tasksinparallel20taskperformancetest500tasksd.2aca869d190d4a17a60653788d73e090   0/1     Error       0          7m22s
   performancetest500tasksinparallel20taskperformancetest500tasksd.2d6f588cba464f2c9aec0f75eff105a5   0/1     Error       0          6m32s
   performancetest500tasksinparallel20taskperformancetest500tasksd.31513adf9a4d4faa910b8eeedf53b960   0/1     Error       0          8m48s
   performancetest500tasksinparallel20taskperformancetest500tasksd.3600857bd1784617b4322ec304924870   0/1     Error       0          8m58s
   performancetest500tasksinparallel20taskperformancetest500tasksd.3659ef7cbcb345e99ba557e6ca6b881d   0/1     Error       0          9m1s
   ```
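   For context, the shape of the task body behind this stress test can be sketched as follows. This is a hypothetical reconstruction, not my exact DAG code: the function name `stress_task` and the way the iterator parameter maps to duration are assumptions.
   
   ```python
   import time
   
   def stress_task(iterations: int, step_seconds: float = 1.0) -> float:
       """Hypothetical task body: total runtime scales with `iterations`,
       standing in for the iterator parameter that controls task duration."""
       start = time.monotonic()
       for _ in range(iterations):
           time.sleep(step_seconds)  # stand-in for real work
       return time.monotonic() - start
   ```
   
   In the real DAG, each of the 500 parallel tasks would wrap a callable like this (e.g. in a `PythonOperator`), with the iterator value varied per test run.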
   
   And when checking the logs of one of the errored pods, I see this:
   
   ```
   k logs performancetest500tasksinparallel20taskperformancetest500tasksd.6426f08f727c4f15b2c041ce98f163d5 -n airflow
   [2021-08-03 12:44:59,468] {cli_action_loggers.py:105} WARNING - Failed to log action with (psycopg2.OperationalError) could not translate host name "qa-airflow.carnijjbfa3r.eu-west-1.rds.amazonaws.com" to address: Temporary failure in name resolution
   
   (Background on this error at: http://sqlalche.me/e/13/e3q8)
   [2021-08-03 12:44:59,469] {dagbag.py:496} INFO - Filling up the DagBag from /opt/airflow/dags/git/performance_test_500_tasks_in_parallel_2_0.py
   Traceback (most recent call last):
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
       return fn()
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 364, in connect
       return _ConnectionFairy._checkout(self)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
       fairy = _ConnectionRecord.checkout(pool)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
       rec = pool._do_get()
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 241, in _do_get
       return self._create_connection()
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
       return _ConnectionRecord(self)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
       self.__connect(first_connect_check=True)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
       pool.logger.debug("Error on connect(): %s", e)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
       compat.raise_(
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
       raise exception
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
       connection = pool._invoke_creator(self)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
       return dialect.connect(*cargs, **cparams)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 508, in connect
       return self.dbapi.connect(*cargs, **cparams)
     File "/usr/local/lib/python3.9/site-packages/psycopg2/__init__.py", line 122, in connect
       conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
   psycopg2.OperationalError: could not translate host name "qa-airflow.carnijjbfa3r.eu-west-1.rds.amazonaws.com" to address: Temporary failure in name resolution
   
   
   The above exception was the direct cause of the following exception:
   
   Traceback (most recent call last):
     File "/usr/local/bin/airflow", line 8, in <module>
       sys.exit(main())
     File "/usr/local/lib/python3.9/site-packages/airflow/__main__.py", line 40, in main
       args.func(args)
     File "/usr/local/lib/python3.9/site-packages/airflow/cli/cli_parser.py", line 48, in command
       return func(*args, **kwargs)
     File "/usr/local/lib/python3.9/site-packages/airflow/utils/cli.py", line 91, in wrapper
       return f(*args, **kwargs)
     File "/usr/local/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 227, in task_run
       ti.refresh_from_db()
     File "/usr/local/lib/python3.9/site-packages/airflow/utils/session.py", line 70, in wrapper
       return func(*args, session=session, **kwargs)
     File "/usr/local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 625, in refresh_from_db
       ti = qry.first()
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3429, in first
       ret = list(self[0:1])
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3203, in __getitem__
       return list(res)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3535, in __iter__
       return self._execute_and_instances(context)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3556, in _execute_and_instances
       conn = self._get_bind_args(
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3571, in _get_bind_args
       return fn(
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3550, in _connection_from_session
       conn = self.session.connection(**kw)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1142, in connection
       return self._connection_for_bind(
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1150, in _connection_for_bind
       return self.transaction._connection_for_bind(
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 433, in _connection_for_bind
       conn = bind._contextual_connect()
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2302, in _contextual_connect
       self._wrap_pool_connect(self.pool.connect, None),
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2339, in _wrap_pool_connect
       Connection._handle_dbapi_exception_noconnection(
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1583, in _handle_dbapi_exception_noconnection
       util.raise_(
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
       raise exception
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
       return fn()
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 364, in connect
       return _ConnectionFairy._checkout(self)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
       fairy = _ConnectionRecord.checkout(pool)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
       rec = pool._do_get()
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 241, in _do_get
       return self._create_connection()
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
       return _ConnectionRecord(self)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
       self.__connect(first_connect_check=True)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
       pool.logger.debug("Error on connect(): %s", e)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
       compat.raise_(
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
       raise exception
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
       connection = pool._invoke_creator(self)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
       return dialect.connect(*cargs, **cparams)
     File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 508, in connect
       return self.dbapi.connect(*cargs, **cparams)
     File "/usr/local/lib/python3.9/site-packages/psycopg2/__init__.py", line 122, in connect
       conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
   sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name "qa-airflow.carnijjbfa3r.eu-west-1.rds.amazonaws.com" to address: Temporary failure in name resolution
   ```
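   The root failure here is DNS resolution of the metadata DB host, not the database itself: psycopg2 surfaces "Temporary failure in name resolution" when `getaddrinfo` fails inside the worker pod. A quick way to check whether resolution flakes under load is to probe it from a pod with a small standard-library script. This is a diagnostic sketch, not anything Airflow ships; the function name and retry parameters are mine.
   
   ```python
   import socket
   import time
   
   def resolve_with_retry(host: str, attempts: int = 3, delay: float = 2.0) -> bool:
       """Return True if `host` resolves to an address within `attempts` tries.
   
       Mirrors the failure mode in the traceback above: the same getaddrinfo
       lookup psycopg2 performs before it can open a connection.
       """
       for attempt in range(1, attempts + 1):
           try:
               socket.getaddrinfo(host, None)
               return True
           except socket.gaierror as exc:
               print(f"attempt {attempt}/{attempts} failed: {exc}")
               if attempt < attempts:
                   time.sleep(delay)
       return False
   ```
   
   Running `resolve_with_retry("qa-airflow.carnijjbfa3r.eu-west-1.rds.amazonaws.com")` from inside a worker pod during the 500-task burst would show whether cluster DNS (e.g. CoreDNS) is being overwhelmed by the simultaneous lookups.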
   
   These are some of the configuration variables of my Airflow cluster:
   
   ```
     AIRFLOW_HOME: "/opt/airflow"
     AIRFLOW__CORE__DAGS_FOLDER: "/opt/airflow/dags/git"
     AIRFLOW__LOGGING__BASE_LOG_FOLDER: "/opt/airflow/logs"
     AIRFLOW__LOGGING__LOGGING_LEVEL: "INFO" # DEBUG, INFO, WARNING, ERROR or CRITICAL.
     AIRFLOW__LOGGING__FAB_LOGGING_LEVEL: "WARNING"
     AIRFLOW__LOGGING__LOG_FILENAME_TEMPLATE: "{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log"
     AIRFLOW__LOGGING__LOG_FORMAT: "%(message)s"
     AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: "60"
     AIRFLOW__CORE__DAG_CONCURRENCY: "500"
     AIRFLOW__CORE__PARALLELISM: "500"
     AIRFLOW__CORE__SQL_ALCHEMY_POOL_SIZE: "0"
     AIRFLOW__CORE__EXECUTOR: "KubernetesExecutor"
     AIRFLOW__API__AUTH_BACKEND: "airflow.api.auth.backend.default"
     AIRFLOW__CORE__LOAD_EXAMPLES: "False"
     AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG: "1.1"
     AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: "True"
     AIRFLOW__KUBERNETES__NAMESPACE: "airflow"
     AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE: "1" 
     AIRFLOW__KUBERNETES__POD_TEMPLATE_FILE: "/opt/airflow/template.yaml"
     AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: "False"
   ```
   
   Besides that config, I have set the `default_pool` size to 500 slots so that 500 tasks can actually run in parallel.
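   
   The pool had to be widened because a task instance only runs when every cap allows it: core `parallelism`, the DAG-level `dag_concurrency`, and free slots in the task's pool. A toy helper (the function name is mine, not an Airflow API) makes the interaction explicit:
   
   ```python
   def effective_cap(parallelism: int, dag_concurrency: int, pool_slots: int) -> int:
       """Maximum tasks that can run at once: the tightest of the three limits."""
       return min(parallelism, dag_concurrency, pool_slots)
   
   # With the config above and default_pool widened to 500 slots,
   # all three limits line up:
   print(effective_cap(500, 500, 500))  # 500
   
   # With the stock 128-slot default_pool, the pool would have been
   # the bottleneck regardless of parallelism settings:
   print(effective_cap(500, 500, 128))  # 128
   ```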


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org