Posted to dev@airflow.apache.org by Allison Wang <al...@gmail.com> on 2017/07/06 21:23:17 UTC

Re: Airflow Log Handler Abstractions

cc dev@airflow.incubator.apache.org

Hi Bolke,

Thanks for the clarification. I was trying to separate the task-run logging
logic out of cli.py using custom configurations, but I discovered that it
needs further abstraction and design work. After a task runs, we perform
post-run logging operations to upload the local logs to S3/GCS. This logic
is hard to abstract if we only return the logger. One way we can approach
this problem is to add customized hook abstractions for logging (sketched
below), or to trigger the S3/GCS upload logic somewhere else.
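
For illustration, here is a rough sketch of what a handler-based upload
abstraction could look like. The class name, bucket, and the boto3 call are
my assumptions, not anything that exists in Airflow today:

import logging

class S3UploadHandler(logging.FileHandler):
    # hypothetical: write task logs locally, push them to S3 on close
    def __init__(self, filename, s3_key):
        logging.FileHandler.__init__(self, filename)
        self.s3_key = s3_key  # assumed remote destination for this attempt

    def close(self):
        # flush and close the local file first
        logging.FileHandler.close(self)
        # post task run: upload the completed local log to remote storage
        import boto3  # assumption: boto3 is available on the worker
        boto3.client('s3').upload_file(
            self.baseFilename, 'my-log-bucket', self.s3_key)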

Dan and I discussed some of the possibilities today, and we think this
change is definitely desirable but requires much more time and effort than
the schedule of my internship allows. Another solution is to provide the
ability to customize task logging behavior, as in AIRFLOW-1385
(PR: https://github.com/apache/incubator-airflow/pull/2422). This is
certainly not ideal, but it won't make future logging refactoring work any
harder. What do you think about this approach?
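
To make "customizable" concrete, something along these lines would let a
deployment supply its own handler factory through a dotted path in
airflow.cfg. The setting name and factory signature are made up here for
illustration:

import importlib

def resolve_task_log_handler(dotted_path, ti):
    # hypothetical: dotted_path would come from airflow.cfg, e.g.
    # 'mycompany.logging.s3_task_handler' (a made-up example)
    module_name, func_name = dotted_path.rsplit('.', 1)
    factory = getattr(importlib.import_module(module_name), func_name)
    # the factory returns a logging.Handler configured for this task instance
    return factory(ti)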

Thanks,
Allison

On Thu, Jul 6, 2017 at 1:39 PM Bolke de Bruin <bd...@gmail.com> wrote:

> Hi Allison,
>
> Nice to meet you, and it's great that you put so much work into Airflow.
> The logging is definitely in need of some love, so I am very glad that you
> are taking up this task. Thanks for reaching out. I’m not sure whether we
> are entirely on the same page yet, but let’s see if we can get there.
>
> So the basic issue is that, in certain circumstances, we would like to be
> able to differentiate where logging should go by originating source. In
> your case this is the “try_number” or “attempt” of a task instance. In
> other words, the logger should be context aware, i.e. it should look at the
> “try_number” and some other data to determine what it needs to do.
>
> From the perspective of the TaskInstance I would expect something like (as
> explained in the big PR):
>
> from airflow import logging
>
> log = logging.getLogger(__name__)
> log.info("My Message", task_id=XX, dag_id=XX, execution_date=XXX)
>
> or
>
> log = logging.getLogger(__name__, task_id=XX, dag_id=XX,
> execution_date=XXX)
> log.info("My Message")
>
>
> Or you could do a LoggingMixin that picks up task_id, dag_id, and
> execution_date from the object, and just allows log.info("xxx") as well.
>
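> For what it's worth, the stdlib LoggerAdapter can carry that context; here
> is a minimal sketch, with the attribute names assumed from the examples
> above:
>
> import logging
>
> class LoggingMixin(object):
>     # sketch: derive a context-aware logger from the object's own fields
>     @property
>     def log(self):
>         extra = {'task_id': self.task_id, 'dag_id': self.dag_id,
>                  'execution_date': self.execution_date}
>         # "extra" ends up as attributes on each LogRecord, so handlers and
>         # formatters can route or render based on them
>         return logging.LoggerAdapter(
>             logging.getLogger(self.__class__.__module__), extra)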
>
> airflow.logging should then use handlers that can be made to understand
> these differences if required. That could be a FileHandler with a template
> as per your Django link, or a DatabaseHandler, a SplunkHandler, etc.
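>
> A templated file handler could be as small as this; the template keys are
> an assumption, mirroring the context above:
>
> import logging
>
> class TemplatedFileHandler(logging.FileHandler):
>     # sketch: resolve the log path from the task context when the handler
>     # is created, e.g. template = '/logs/{dag_id}/{task_id}/{try_number}.log'
>     def __init__(self, template, **context):
>         logging.FileHandler.__init__(self, template.format(**context))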
>
> Does this make sense?
>
> Bolke
>
> N.B. It might be smart to cc the dev list. Apache doesn’t really like
> personal conversations.
>
>
> On 6 Jul 2017, at 09:58, Allison Wang <al...@gmail.com> wrote:
>
> Hi Bolke,
>
> My name is Allison, and I am an intern on Airbnb's Data Platform team.
> Sorry for not introducing myself earlier. As part of my internship project,
> I will be improving Airflow logging on both the frontend (webserver) and
> the backend. Your suggestion of abstracting the log handler for task
> instances is very insightful, and I am working on it now. Here is a general
> idea of how I plan to implement it; I just want to make sure we are on the
> same page before the actual coding part.
>
> The idea is very similar to logging.config, where a file logging.yml
> defines the basic configuration for setting up the log. An example
> configuration file looks like this:
>
> version: 1
> formatters:
>   log:
>     format: '[%(asctime)s] {%(filename)s:%(lineno)d} %(levelname)s - %(message)s'
> handlers:
>   file:
>     class: logging.FileHandler
>     formatter: log
>     level: INFO
>     filename: worker.log
>   console:
>     class : logging.StreamHandler
>     formatter: log
>     level: INFO
>     stream: ext://sys.stdout
> loggers:
>   worker:
>     handlers: [file]
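>
> Applying it at startup would then be a few lines (assuming PyYAML is
> available for parsing the file):
>
> import logging.config
> import yaml
>
> with open('logging.yml') as f:
>     logging.config.dictConfig(yaml.safe_load(f))
>
> log = logging.getLogger('worker')  # picks up handlers: [file] from above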
>
> However, the problem with using a static logging.config is that we cannot
> dynamically create a FileHandler with a specific filename at runtime, nor
> do anything else that requires a runtime value when obtaining the logger.
> We want the ability to instantiate log handlers with additional args, such
> as the task_instance object.
>
> There are two solutions discussed here:
> http://codeinthehole.com/tips/a-deferred-logging-file-handler-for-django/
> They require either loading the config dynamically or changing settings
> dynamically. I guess in this case the first solution is better, and we can
> call logging.config.dictConfig every time we call the run method in cli.py,
> with the config rendered via format(ti).
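>
> Concretely, the first solution could look roughly like this (the base
> config dict and the filename template here are assumptions):
>
> import copy
> import logging.config
>
> def configure_task_logging(ti, base_config):
>     # sketch: re-render the static config with runtime values and re-apply
>     # it just before the task attempt runs
>     config = copy.deepcopy(base_config)
>     config['handlers']['file']['filename'] = (
>         '/logs/{0.dag_id}/{0.task_id}/{0.try_number}.log'.format(ti))
>     logging.config.dictConfig(config)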
>
> What do you think about this approach? Or is there any other way we can
> customize those loggers?
>
> Thanks for taking the time to do code reviews and make Airflow awesome!
> Feel free to comment and point out things I can do better.
>
> Thanks,
> Allison
>
>
>