Posted to commits@airflow.apache.org by "Ash Berlin-Taylor (JIRA)" <ji...@apache.org> on 2017/12/13 11:43:00 UTC

[jira] [Commented] (AIRFLOW-1916) S3 Task logs end up duplicated in the file.

    [ https://issues.apache.org/jira/browse/AIRFLOW-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16289152#comment-16289152 ] 

Ash Berlin-Taylor commented on AIRFLOW-1916:
--------------------------------------------

I've added some debugging that shows where each close() was called from, and what the process args (sys.argv) were. On Python 3 this is what I see.

The shutdown in the {{--raw}} process happens automatically from the logging system:

{noformat}
Closing log from pid 445 / ['/usr/local/bin/airflow', 'run', 'tests', 'test-logging', '2017-12-13T11:37:55.558651', '--job_id', '18', '--raw', '-sd', '/usr/local/airflow/dags/example/csv_to_parquet.py']
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1919, in shutdown
    h.close()
  File "/usr/local/lib/python3.6/site-packages/airflow/utils/log/s3_task_handler.py", line 72, in close
    traceback.print_stack(file=fh)
{noformat}

The shutdown in the non-raw ({{--local}}) process is called explicitly by us, from the CLI:

{noformat}
Closing log from pid 440 / ['/usr/local/bin/airflow', 'run', 'tests', 'test-logging', '2017-12-13T11:37:55.558651', '--local', '-sd', '/usr/local/airflow/dags/example/csv_to_parquet.py']
  File "/usr/local/bin/airflow", line 27, in <module>
    args.func(args)
  File "/usr/local/lib/python3.6/site-packages/airflow/bin/cli.py", line 438, in run
    handler.close()
  File "/usr/local/lib/python3.6/site-packages/airflow/utils/log/s3_task_handler.py", line 72, in close
    traceback.print_stack(file=fh)
{noformat}
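For reference, instrumentation that produces traces like the two above can be sketched as below. Only the pid / {{sys.argv}} / {{traceback.print_stack}} combination is taken from the output; the helper name {{debug_close_message}} is hypothetical and is not the code shipped in {{s3_task_handler.py}}:

```python
import io
import os
import sys
import traceback


def debug_close_message():
    """Build the kind of message shown in the traces above: which process
    is closing the handler, with what argv, and from which call stack."""
    buf = io.StringIO()
    traceback.print_stack(file=buf)  # same call as seen at s3_task_handler.py:72
    return "Closing log from pid %s / %r\n%s" % (os.getpid(), sys.argv, buf.getvalue())


# A handler's close() could emit this before flushing/uploading, e.g.:
print(debug_close_message())
```

Calling this once from {{close()}} makes it easy to tell the automatic {{logging.shutdown()}} path apart from the explicit {{handler.close()}} call in the CLI, since the last stack frames differ.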

> S3 Task logs end up duplicated in the file.
> -------------------------------------------
>
>                 Key: AIRFLOW-1916
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1916
>             Project: Apache Airflow
>          Issue Type: Bug
>    Affects Versions: 1.9.0
>            Reporter: Ash Berlin-Taylor
>             Fix For: 1.9.1
>
>
> If using the S3TaskHandler logger, the contents of the log file in the S3 bucket end up duplicated: once from when `airflow run --raw` finalizes the task, and again from when `airflow run --local` finalizes its logger.
> Log from the UI included below. The file on disk does not have the repetition.
> There is a comment in `run()` in airflow.bin.cli implying that `--raw` is not meant to upload, but something is uploading anyway.
> {noformat}
> *** Reading remote log from s3://xxx/ash-test/tests/test-logging/2017-12-13T10:45:42.552705/1.log.
> [2017-12-13 10:45:49,764] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:49,895] {models.py:1197} INFO - Dependencies all met for <TaskInstance: tests.test-logging 2017-12-13 10:45:42.552705 [queued]>
> [2017-12-13 10:45:49,905] {models.py:1197} INFO - Dependencies all met for <TaskInstance: tests.test-logging 2017-12-13 10:45:42.552705 [queued]>
> [2017-12-13 10:45:49,906] {models.py:1407} INFO - 
> --------------------------------------------------------------------------------
> Starting attempt 1 of 1
> --------------------------------------------------------------------------------
> [2017-12-13 10:45:49,923] {models.py:1428} INFO - Executing <Task(PythonOperator): test-logging> on 2017-12-13 10:45:42.552705
> [2017-12-13 10:45:49,924] {base_task_runner.py:115} INFO - Running: ['bash', '-c', 'airflow run tests test-logging 2017-12-13T10:45:42.552705 --job_id 5 --raw -sd /usr/local/airflow/dags/example/csv_to_parquet.py']
> [2017-12-13 10:45:53,622] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 10:45:53,622] {configuration.py:206} WARNING - section/key [celery/celery_ssl_active] not found in config
> [2017-12-13 10:45:53,625] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 10:45:53,622] {default_celery.py:41} WARNING - Celery Executor will run without SSL
> [2017-12-13 10:45:53,626] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 10:45:53,624] {__init__.py:45} INFO - Using executor CeleryExecutor
> [2017-12-13 10:45:53,755] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 10:45:53,754] {models.py:189} INFO - Filling up the DagBag from /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,859] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:53,901] {logging_mixin.py:84} INFO - Hi from /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,902] {logging_mixin.py:84} INFO - Hi 2 from /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,903] {csv_to_parquet.py:27} ERROR - Hello
> [2017-12-13 10:45:53,905] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 10:45:53,904] {python_operator.py:90} INFO - Done. Returned value was: None
> [2017-12-13 10:45:49,764] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:49,895] {models.py:1197} INFO - Dependencies all met for <TaskInstance: tests.test-logging 2017-12-13 10:45:42.552705 [queued]>
> [2017-12-13 10:45:49,905] {models.py:1197} INFO - Dependencies all met for <TaskInstance: tests.test-logging 2017-12-13 10:45:42.552705 [queued]>
> [2017-12-13 10:45:49,906] {models.py:1407} INFO - 
> --------------------------------------------------------------------------------
> Starting attempt 1 of 1
> --------------------------------------------------------------------------------
> [2017-12-13 10:45:49,923] {models.py:1428} INFO - Executing <Task(PythonOperator): test-logging> on 2017-12-13 10:45:42.552705
> [2017-12-13 10:45:49,924] {base_task_runner.py:115} INFO - Running: ['bash', '-c', 'airflow run tests test-logging 2017-12-13T10:45:42.552705 --job_id 5 --raw -sd /usr/local/airflow/dags/example/csv_to_parquet.py']
> [2017-12-13 10:45:53,622] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 10:45:53,622] {configuration.py:206} WARNING - section/key [celery/celery_ssl_active] not found in config
> [2017-12-13 10:45:53,625] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 10:45:53,622] {default_celery.py:41} WARNING - Celery Executor will run without SSL
> [2017-12-13 10:45:53,626] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 10:45:53,624] {__init__.py:45} INFO - Using executor CeleryExecutor
> [2017-12-13 10:45:53,755] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 10:45:53,754] {models.py:189} INFO - Filling up the DagBag from /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,859] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:53,901] {logging_mixin.py:84} INFO - Hi from /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,902] {logging_mixin.py:84} INFO - Hi 2 from /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,903] {csv_to_parquet.py:27} ERROR - Hello
> [2017-12-13 10:45:53,905] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 10:45:53,904] {python_operator.py:90} INFO - Done. Returned value was: None
> [2017-12-13 10:45:54,923] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 10:45:54,922] {connectionpool.py:735} INFO - Starting new HTTPS connection (1): sts.amazonaws.com
> [2017-12-13 10:45:55,498] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 10:45:55,498] {connectionpool.py:735} INFO - Starting new HTTPS connection (1): s3.eu-west-1.amazonaws.com
> [2017-12-13 10:45:55,784] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 10:45:55,784] {connectionpool.py:735} INFO - Starting new HTTPS connection (1): s3.eu-west-1.amazonaws.com
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)