Posted to dev@airflow.apache.org by Edgardo Vega <ed...@gmail.com> on 2017/04/04 00:30:31 UTC

Cleanup

I have been playing with airflow for a few days and it's not obvious what
will happen down the road when we have lots of dags over a long period of
time. I set a fake dag to run once a minute for a few days and everything
seems okay except the graph view dropdown, which works but takes a few
seconds to show up.

Is there a way to roll older data out of the system in order to clean things
up visually and keep the database at a smallish size?

-- 
Cheers,

Edgardo

Re: Cleanup

Posted by Boris Tyukin <bo...@boristyukin.com>.
Another related thing is cleanup of logs, which was discussed a few days
ago. Airflow generates an enormous amount of logs, which I like because it
makes troubleshooting very easy, but one dag with 5 tasks that I have been
running a few times a day for a few weeks generated 2 GB of logs! I can
probably switch the logging mode to something less detailed, but what I
really want is an automatic archiving capability. For now I can just use
another Airflow dag to do this cleanup, but it would be nice to have this
feature.
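
A minimal sketch of the kind of cleanup such a dag could run, assuming the
logs are plain files under one directory and that file modification time is
a good enough proxy for age. The function name `remove_old_logs` and its
parameters are hypothetical, and it deletes files rather than archiving
them:

```python
import os
import time


def remove_old_logs(log_dir, max_age_days, now=None):
    """Delete files under log_dir last modified more than max_age_days ago.

    Returns the list of removed paths. `now` (seconds since the epoch) is
    only a hook for testing; it defaults to the current time.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    removed = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return removed
```

Something like this could be wrapped in a scheduled task; moving or
compressing the files instead of calling `os.remove` would turn it into an
archiving step.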

On Wed, Apr 5, 2017 at 11:23 PM, Vijay Krishna Ramesh <
vijay.krishna.ramesh@gmail.com> wrote:

> To add to Siddharth's pretty extensive list (in particular, the "delete a
> DAG from the code that makes up the dag bag folder, but now it shows up
> with a ! icon and you have to manually set it to is_active = f" issue that
> I didn't see in 1.8.0-RC4 but started seeing in 1.8.0-RC5 that became
> 1.8.0) -
>
> how does XCOM data get cleaned up? would be nice to either let tasks
> consume the data (and then it goes away from the backing db, after an ack
> or something) - or at the very least, TTL after a set interval.

Re: Cleanup

Posted by Vijay Krishna Ramesh <vi...@gmail.com>.
To add to Siddharth's pretty extensive list (in particular, the "delete a
DAG from the code that makes up the dag bag folder, but now it shows up
with a ! icon and you have to manually set it to is_active = f" issue that
I didn't see in 1.8.0-RC4 but started seeing in 1.8.0-RC5 that became
1.8.0) -

How does XCom data get cleaned up? It would be nice to either let tasks
consume the data (so that it goes away from the backing db after an ack
or something) or, at the very least, expire it with a TTL after a set
interval.
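
A minimal sketch of the TTL idea, using an in-process SQLite database as a
stand-in for the metadata db. It assumes an `xcom` table with a text
`timestamp` column in "YYYY-MM-DD HH:MM:SS" form; the function name
`expire_xcom` is hypothetical and this is not an existing Airflow feature:

```python
import sqlite3
from datetime import datetime, timedelta


def expire_xcom(conn, ttl, now):
    """Delete xcom rows older than `now - ttl`; return how many were removed.

    Timestamps are compared as strings, which only works because the
    assumed format sorts lexicographically in chronological order.
    """
    cutoff = (now - ttl).isoformat(sep=" ")
    cur = conn.execute("DELETE FROM xcom WHERE timestamp < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Called once a day with, say, `ttl=timedelta(days=7)`, this would keep only
the last week of XCom rows.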



On Wed, Apr 5, 2017 at 7:46 PM, siddharth anand <sa...@apache.org> wrote:

> Edgardo,
> This is a great question and something that requires functionality to
> address. As Airflow starts getting used for bigger workloads, we need a way
> to clean up defunct resources.
>
>    - How do we delete a dag and its related resources?
>       - Until the recent release, the way that I stopped having a defunct
>       (retired) dag show up in the UI was to move the DAG file out of the
>       dag_folder or just delete it from Git. Our dag folders are just
>       symlinks to tagged Git repos.
>       - This no longer works -- the UI will display the dag list based on
>       entries in the dag table in the airflow metadata db -- but will no
>       longer have code to back that dag table entry. I currently manually
>       delete a row from the dag table, but that is surely not the right
>       thing to do.
>       - How do we retire entries from the *task_instance, job, log, xcom,
>       sla_miss, dag_stats,* and *dag_run* tables for dags that are deleted?
>       (I can surely clean these up manually as well, but we need a UI
>       control).
>          - The *task_instance, job, log,* & *dag_run* tables grow faster
>          than the others.
>       - How does one track if variables, connections, or pools are no
>       longer referenced because all of the DAGs that use them are gone?
>          - It would be nice here to have reference counts & links to DAGs
>          that reference a Pool, Connection, or Variable. The reference
>          counts can be broken down into paused & unpaused.
>
> It's time we added some functionality to the API/CLI/UI to address these
> functionality gaps.
>
> -s

Re: Cleanup

Posted by siddharth anand <sa...@apache.org>.
Edgardo,
This is a great question and something that requires functionality to
address. As Airflow starts getting used for bigger workloads, we need a way
to clean up defunct resources.

   - How do we delete a dag and its related resources?
      - Until the recent release, the way that I stopped having a defunct
      (retired) dag show up in the UI was to move the DAG file out of the
      dag_folder or just delete it from Git. Our dag folders are just
      symlinks to tagged Git repos.
      - This no longer works -- the UI will display the dag list based on
      entries in the dag table in the airflow metadata db -- but will no
      longer have code to back that dag table entry. I currently manually
      delete a row from the dag table, but that is surely not the right
      thing to do.
      - How do we retire entries from the *task_instance, job, log, xcom,
      sla_miss, dag_stats,* and *dag_run* tables for dags that are deleted?
      (I can surely clean these up manually as well, but we need a UI
      control).
         - The *task_instance, job, log,* & *dag_run* tables grow faster
         than the others.
      - How does one track if variables, connections, or pools are no
      longer referenced because all of the DAGs that use them are gone?
         - It would be nice here to have reference counts & links to DAGs
         that reference a Pool, Connection, or Variable. The reference
         counts can be broken down into paused & unpaused.
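
The manual cleanup described in the list above could be sketched roughly as
follows. This is only an illustration against a SQLite connection: it
assumes every table named in the list carries a `dag_id` column, and the
helper name `purge_dag` is hypothetical, not an existing Airflow API or CLI
command:

```python
import sqlite3

# Tables named in the list above; each is assumed to have a dag_id column.
DAG_TABLES = (
    "task_instance", "job", "log", "xcom", "sla_miss", "dag_stats", "dag_run",
)


def purge_dag(conn, dag_id, tables=DAG_TABLES):
    """Delete all rows for dag_id from the related tables, then from dag.

    Returns a dict mapping table name to the number of rows deleted.
    Table names come from the fixed tuple above, never from user input,
    so formatting them into the SQL string is safe here.
    """
    deleted = {}
    with conn:  # single transaction: commit on success, roll back on error
        for table in tuple(tables) + ("dag",):
            cur = conn.execute(
                "DELETE FROM {} WHERE dag_id = ?".format(table), (dag_id,)
            )
            deleted[table] = cur.rowcount
    return deleted
```

Running everything in one transaction means a failure partway through
leaves no half-deleted dag behind.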

It's time we added some functionality to the API/CLI/UI to address these
functionality gaps.

-s

On Tue, Apr 4, 2017 at 10:25 AM, Edgardo Vega <ed...@gmail.com>
wrote:

> Max,
>
> Thanks for the reply, it is much appreciated.  I am currently running ~10k
> tasks a day in our test environment.
>
> It is good to know where the archive point is and that I shouldn't have any
> issues for a long time.
>
> I was just thinking ahead as we get airflow into a production environment.
> Maybe in this case way too far ahead.
>
>
> Cheers,
>
> Edgardo

Re: Cleanup

Posted by Edgardo Vega <ed...@gmail.com>.
Max,

Thanks for the reply, it is much appreciated.  I am currently running ~10k
tasks a day in our test environment.

It is good to know where the archive point is and that I shouldn't have any
issues for a long time.

I was just thinking ahead as we get airflow into a production environment.
Maybe in this case way too far ahead.


Cheers,

Edgardo

On Tue, Apr 4, 2017 at 11:58 AM, Maxime Beauchemin <
maximebeauchemin@gmail.com> wrote:

> We run ~50k tasks a day at Airbnb. How many tasks/day are you planning on
> running?
>
> You can archive the `task_instance` and `job` tables down the line,
> but that shouldn't be a concern until you hit tens of millions of entries.
> Then you can set up a daily Airflow job that archives some of these entries.
> I believe we do it based on `start_date` and move rows to some other table
> in the same db.
>
> Max

Re: Cleanup

Posted by Maxime Beauchemin <ma...@gmail.com>.
We run ~50k tasks a day at Airbnb. How many tasks/day are you planning on
running?

You can archive the `task_instance` and `job` tables down the line,
but that shouldn't be a concern until you hit tens of millions of entries.
Then you can set up a daily Airflow job that archives some of these entries.
I believe we do it based on `start_date` and move rows to some other table
in the same db.
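
A rough sketch of what such an archiving step could look like, shown
against SQLite for brevity. The archive table name `task_instance_archive`
and the function name are hypothetical, the archive table is assumed to
have the same columns as `task_instance`, and `start_date` is assumed to be
stored as sortable text; this is not the actual Airbnb job:

```python
import sqlite3


def archive_task_instances(conn, cutoff):
    """Move task_instance rows with start_date before cutoff to the archive.

    Copy and delete run in one transaction so a row is never lost or
    duplicated if the job fails midway. Returns the number of rows moved.
    """
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT INTO task_instance_archive "
            "SELECT * FROM task_instance WHERE start_date < ?",
            (cutoff,),
        )
        cur = conn.execute(
            "DELETE FROM task_instance WHERE start_date < ?", (cutoff,)
        )
    return cur.rowcount
```

The same pattern would apply to the `job` table with its own archive table.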

Max

On Mon, Apr 3, 2017 at 5:30 PM, Edgardo Vega <ed...@gmail.com> wrote:

> I have been playing with airflow for a few days and it's not obvious what
> will happen down the road when we have lots of dags over a long period of
> time. I set a fake dag to run once a minute for a few days and everything
> seems okay except the graph view dropdown, which works but takes a few
> seconds to show up.
>
> Is there a way to roll older data out of the system in order to clean things
> up visually and keep the database at a smallish size?
>
> --
> Cheers,
>
> Edgardo
>