Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/10/28 16:33:54 UTC

[GitHub] [airflow] JorgenG commented on issue #27345: Duplicate log lines in CloudWatch after upgrade to 2.4.2

JorgenG commented on issue #27345:
URL: https://github.com/apache/airflow/issues/27345#issuecomment-1295204855

   Hey @potiuk, thanks a lot for responding. I realise my vocabulary for Airflow logs might be a bit inaccurate. 
   
   I'll try to be more specific about what I observe. First of all, I am not 100% sure what you mean about parsing of top-level code in the DAG. 
   
   Essentially what we observe is that log lines _inside a single task execution_ appear to be duplicated.
   
   Below is an excerpt from a task that spins up an ECS container. 
   
   ```
   *** Reading remote log from Cloudwatch log_group: /stacks/intranet/airflow-tasks log_stream: dag_id=dbt-events-incremental/run_id=scheduled__2022-10-28T15_15_00+00_00/task_id=dbt_events_incremental/attempt=1.log.
   [2022-10-28, 15:21:36 UTC] Dependencies all met for <TaskInstance: dbt-events-incremental.dbt_events_incremental scheduled__2022-10-28T15:15:00+00:00 [queued]>
   [2022-10-28, 15:21:36 UTC] Dependencies all met for <TaskInstance: dbt-events-incremental.dbt_events_incremental scheduled__2022-10-28T15:15:00+00:00 [queued]>
   [2022-10-28, 15:21:36 UTC] Dependencies all met for <TaskInstance: dbt-events-incremental.dbt_events_incremental scheduled__2022-10-28T15:15:00+00:00 [queued]>
   [2022-10-28, 15:21:36 UTC] Dependencies all met for <TaskInstance: dbt-events-incremental.dbt_events_incremental scheduled__2022-10-28T15:15:00+00:00 [queued]>
   [2022-10-28, 15:21:36 UTC] 
   --------------------------------------------------------------------------------
   [2022-10-28, 15:21:36 UTC] 
   --------------------------------------------------------------------------------
   [2022-10-28, 15:21:36 UTC] Starting attempt 1 of 1
   [2022-10-28, 15:21:36 UTC] Starting attempt 1 of 1
   [2022-10-28, 15:21:36 UTC] 
   --------------------------------------------------------------------------------
   [2022-10-28, 15:21:36 UTC] 
   --------------------------------------------------------------------------------
   [2022-10-28, 15:21:36 UTC] Executing <Task(DbtOperatorV2): dbt_events_incremental> on 2022-10-28 15:15:00+00:00
   [2022-10-28, 15:21:36 UTC] Executing <Task(DbtOperatorV2): dbt_events_incremental> on 2022-10-28 15:15:00+00:00
   ...
   [2022-10-28, 15:21:37 UTC] Starting ECS Task Log Fetcher
   [2022-10-28, 15:21:37 UTC] Starting ECS Task Log Fetcher
   [2022-10-28, 15:21:37 UTC] /home/airflow/.local/lib/python3.9/site-packages/airflow/providers/amazon/aws/operators/ecs.py:437: DeprecationWarning: This class is deprecated. Please use `airflow.providers.amazon.aws.hooks.ecs.EcsTaskLogFetcher`.
     self.task_log_fetcher = self._get_task_log_fetcher()
   
   [2022-10-28, 15:22:07 UTC] Using connection ID 'aws_default' for task execution.
   [2022-10-28, 15:22:38 UTC] [2022-10-28, 15:22:15 UTC] + set -o pipefail
   [2022-10-28, 15:22:38 UTC] [2022-10-28, 15:22:15 UTC] + set -o pipefail
   ...
   [2022-10-28, 15:54:54 UTC] ECS Task has been successfully executed
   [2022-10-28, 15:54:54 UTC] ECS Task has been successfully executed
   [2022-10-28, 15:54:54 UTC] Marking task as SUCCESS. dag_id=dbt-events-incremental, task_id=dbt_events_incremental, execution_date=20221028T151500, start_date=20221028T152136, end_date=20221028T155454
   [2022-10-28, 15:54:54 UTC] Marking task as SUCCESS. dag_id=dbt-events-incremental, task_id=dbt_events_incremental, execution_date=20221028T151500, start_date=20221028T152136, end_date=20221028T155454
   [2022-10-28, 15:54:54 UTC] Task exited with return code 0
   [2022-10-28, 15:54:54 UTC] 0 downstream tasks scheduled from follow-on schedule check
   ```
   
   Looking through this in detail, I actually find, as you say, that there are lines that are not duplicated. 
   
   For example:
   
   ```
   [2022-10-28, 15:21:37 UTC] /home/airflow/.local/lib/python3.9/site-packages/airflow/providers/amazon/aws/operators/ecs.py:437: DeprecationWarning: This class is deprecated. Please use `airflow.providers.amazon.aws.hooks.ecs.EcsTaskLogFetcher`.
     self.task_log_fetcher = self._get_task_log_fetcher()
   [2022-10-28, 15:22:07 UTC] Using connection ID 'aws_default' for task execution.
   ...
   [2022-10-28, 15:54:54 UTC] Task exited with return code 0
   [2022-10-28, 15:54:54 UTC] 0 downstream tasks scheduled from follow-on schedule check
   ```
   
   So based on your comment, I assume these are probably lines from DAG parsing then? 
   
   I realise now I was wrong in stating that all log lines were duplicated; I didn't pay enough attention when looking at it. (Thanks for the hint 🙂 )
   
   To elaborate further, I don't see duplication of logs in e.g. the webserver or scheduler CloudWatch outputs. 
   
   This is in no way critical for us, but I would love to help out further to get it fixed. Is there any way we can override some configuration to verify that the change you referenced might be the cause? E.g. override propagation or similar? 
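   In case it helps with testing, here is a minimal stdlib-only sketch of the mechanism I suspect (this is an assumption on my part, not the actual Airflow code path): a child logger whose records are written by its own handler *and* propagate up to a root handler writing to the same destination, so every line appears twice. Setting `propagate = False` on the child is the kind of override I had in mind.

   ```python
   import io
   import logging

   # Shared destination, standing in for the CloudWatch log stream.
   buf = io.StringIO()

   # Root logger gets a handler writing to the shared destination.
   root = logging.getLogger()
   root.addHandler(logging.StreamHandler(buf))

   # A child logger (name chosen for illustration) with its OWN handler
   # writing to the same destination. With propagate=True (the default),
   # each record is emitted by both handlers.
   child = logging.getLogger("airflow.task.demo")
   child.addHandler(logging.StreamHandler(buf))

   child.warning("hello")   # appears twice in buf

   # The override being asked about: stop records from bubbling up
   # to the root logger's handlers.
   child.propagate = False
   child.warning("again")   # appears once in buf
   ```

   If the real duplication works like this, flipping `propagate` on the task logger in a custom logging config should make the doubled lines disappear.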
   
   Best, Jørgen


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org