Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/11/25 01:26:51 UTC

[GitHub] [airflow] dimberman opened a new issue #7911: Add data retention policy to Airflow

dimberman opened a new issue #7911:
URL: https://github.com/apache/airflow/issues/7911


   **Description**
   
   Airflow's DB currently holds the entire history of all executions for all time. This is problematic as the DB grows. The UI starts to get slower, and the DB's disk usage grows. There is no bound to how large the DB will grow.
   
   It would be useful to add a feature in Airflow to do two things:
   
   1. Delete old data from the DB
   2. Mark some lower watermark, past which DAG executions are ignored
   
   For example, (2) would allow you to tell the scheduler "ignore all data prior to a year ago". And (1) would allow Airflow to delete all data prior to January 1, 2015.
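
As a rough illustration of the two behaviours described above, here is a minimal sketch using Python's built-in `sqlite3`. The `dag_run` table is a simplified, hypothetical stand-in rather than Airflow's actual schema, and the function names `delete_older_than` and `runs_since` are invented for this example.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical stand-in schema; Airflow's real metadata tables have more columns.
SCHEMA = "CREATE TABLE dag_run (run_id TEXT PRIMARY KEY, execution_date TEXT)"


def delete_older_than(conn, cutoff):
    """(1) Hard-delete runs that executed before `cutoff`; return rows removed."""
    cur = conn.execute(
        "DELETE FROM dag_run WHERE execution_date < ?", (cutoff.isoformat(),)
    )
    conn.commit()
    return cur.rowcount


def runs_since(conn, watermark):
    """(2) Return run ids at or after `watermark`; older rows stay but are ignored."""
    rows = conn.execute(
        "SELECT run_id FROM dag_run WHERE execution_date >= ? "
        "ORDER BY execution_date",
        (watermark.isoformat(),),
    )
    return [row[0] for row in rows]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    now = datetime(2020, 11, 25)
    # Three runs: one from 2014, one roughly 13 months old, one recent.
    for run_id, age_days in [("old", 2200), ("mid", 400), ("new", 10)]:
        stamp = (now - timedelta(days=age_days)).isoformat()
        conn.execute("INSERT INTO dag_run VALUES (?, ?)", (run_id, stamp))
    deleted = delete_older_than(conn, datetime(2015, 1, 1))
    visible = runs_since(conn, now - timedelta(days=365))
    remaining = conn.execute("SELECT COUNT(*) FROM dag_run").fetchone()[0]
    return deleted, visible, remaining
```

In this sketch the run from 2014 is physically removed, while the 13-month-old run stays in the table but falls below the one-year watermark and is skipped by the query.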
   
   
   **Use case / motivation**
   
   
   **Related Issues**
   
   Copied from https://issues.apache.org/jira/browse/AIRFLOW-108
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimberman commented on issue #7911: Add data retention policy to Airflow

Posted by GitBox <gi...@apache.org>.
dimberman commented on issue #7911:
URL: https://github.com/apache/airflow/issues/7911#issuecomment-776061850


   @potiuk this is something I auto-ported from JIRA. I can leave it to @ashb @vikramkoka if we want to address it but my guess is this might be lower priority vs. other features. That said if anyone in the community wants to take this on I'd be glad to help them do it!





[GitHub] [airflow] potiuk commented on issue #7911: Add data retention policy to Airflow

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #7911:
URL: https://github.com/apache/airflow/issues/7911#issuecomment-751825402


   I think maintenance DAGs are more of a "trick". It would be better to have "proper data retention" built into Airflow rather than DAGs you can schedule. This is at least my opinion; maybe it is something we want to discuss as a possible feature for 2.1.





[GitHub] [airflow] mik-laj commented on issue #7911: Add data retention policy to Airflow

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #7911:
URL: https://github.com/apache/airflow/issues/7911#issuecomment-751876938


   Giving full administrator privileges is an extreme case, but giving read-only access to one or two DAGs is something that makes sense to support. I looked at the code quickly and I don't think it would be difficult to add support for this case.
   
   I think if we improve this one SQL query and use get_user_roles instead of our own SQL query it should work.





[GitHub] [airflow] mik-laj removed a comment on issue #7911: Add data retention policy to Airflow

Posted by GitBox <gi...@apache.org>.
mik-laj removed a comment on issue #7911:
URL: https://github.com/apache/airflow/issues/7911#issuecomment-751876938







[GitHub] [airflow] potiuk commented on issue #7911: Add data retention policy to Airflow

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #7911:
URL: https://github.com/apache/airflow/issues/7911#issuecomment-776090202


   Sure. No strong feelings about it, and - as everything in the community - it will get done if there is someone leading it :)





[GitHub] [airflow] potiuk commented on issue #7911: Add data retention policy to Airflow

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #7911:
URL: https://github.com/apache/airflow/issues/7911#issuecomment-898901416


   > If I get some guidance on how/where to start I could try to do it
   
   I think a good start is to take a look at the quite popular maintenance DAGs here: https://github.com/teamclairvoyant/airflow-maintenance-dags - this is a set of 3rd-party maintenance DAGs that people are using for some kinds of maintenance (`db-cleanup`). We do not know how "correct" it is or how well it copes with newer Airflow versions, but it can give an idea of how users deal with it.
   
   I think it might be a good idea to start from that and work out an approach (other than DAGs) for implementing something like that in Airflow as a periodic job - especially since the long-term plan is to not allow tasks to talk to the DB directly, so the DAG approach would not work in that case.
   
   Personally, I think this should start with at least a discussion on the devlist or (maybe even better) a new AIP (Airflow Improvement Proposal - https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals ) as a result of this discussion.
   
   I think there are many ways it can be done, but it needs a proposal and quite an extensive discussion (on performance consequences, where such a cleanup should run, whether it should be a separate process or run within the scheduler, how to deal with multiple schedulers if we choose a scheduler-embedded solution, etc.). It's actually quite an extensive one.
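
One way to picture the performance side of such a periodic cleanup is deleting in small batches, so that no single statement holds locks for long. This is only a sketch of one possible approach, not something proposed in this thread: it uses `sqlite3` with a hypothetical, simplified `task_instance` table, and `purge_in_batches` is an invented name.

```python
import sqlite3


def purge_in_batches(conn, cutoff_iso, batch_size=1000):
    """Delete rows older than `cutoff_iso` in chunks; return (rows, batches)."""
    total = 0
    batches = 0
    while True:
        # Each iteration removes at most `batch_size` rows, then commits.
        cur = conn.execute(
            "DELETE FROM task_instance WHERE rowid IN ("
            "SELECT rowid FROM task_instance WHERE execution_date < ? LIMIT ?)",
            (cutoff_iso, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
        batches += 1
    return total, batches


def demo():
    conn = sqlite3.connect(":memory:")
    # Simplified stand-in table, not Airflow's real schema.
    conn.execute("CREATE TABLE task_instance (task_id TEXT, execution_date TEXT)")
    old_rows = [(f"old_{i}", "2018-06-01") for i in range(25)]
    new_rows = [(f"new_{i}", "2021-06-01") for i in range(5)]
    conn.executemany("INSERT INTO task_instance VALUES (?, ?)", old_rows + new_rows)
    total, batches = purge_in_batches(conn, "2020-01-01", batch_size=10)
    remaining = conn.execute("SELECT COUNT(*) FROM task_instance").fetchone()[0]
    return total, batches, remaining
```

Whether such a loop would live in a separate process or inside the scheduler, and how several schedulers would coordinate it, is exactly the open question raised above; the sketch does not address it.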
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on issue #7911: Add data retention policy to Airflow

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #7911:
URL: https://github.com/apache/airflow/issues/7911#issuecomment-644352357


   Some users use this to maintain high performance for a long time.
   https://github.com/teamclairvoyant/airflow-maintenance-dags





[GitHub] [airflow] joshowen commented on issue #7911: Add data retention policy to Airflow

Posted by GitBox <gi...@apache.org>.
joshowen commented on issue #7911:
URL: https://github.com/apache/airflow/issues/7911#issuecomment-620744141


   I'm particularly interested in having a low watermark to speed up the webserver's queries, but we'd want to persist all of the dagrun metadata for occasional offline reporting.
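
A minimal sketch of how old run metadata could be kept for offline reporting while still shrinking the live table: copy rows into an archive table and delete them in the same transaction. Both tables and the `archive_then_delete` helper are hypothetical and simplified; they are not part of Airflow.

```python
import sqlite3


def archive_then_delete(conn, cutoff_iso):
    """Move runs older than `cutoff_iso` into the archive table atomically."""
    # `with conn` commits on success and rolls back if either statement fails.
    with conn:
        conn.execute(
            "INSERT INTO dag_run_archive "
            "SELECT * FROM dag_run WHERE execution_date < ?",
            (cutoff_iso,),
        )
        cur = conn.execute(
            "DELETE FROM dag_run WHERE execution_date < ?", (cutoff_iso,)
        )
    return cur.rowcount


def demo():
    conn = sqlite3.connect(":memory:")
    # Simplified stand-in tables, not Airflow's real schema.
    conn.execute("CREATE TABLE dag_run (run_id TEXT, execution_date TEXT)")
    conn.execute("CREATE TABLE dag_run_archive (run_id TEXT, execution_date TEXT)")
    conn.executemany(
        "INSERT INTO dag_run VALUES (?, ?)",
        [("a", "2018-01-01"), ("b", "2019-06-01"), ("c", "2020-04-01")],
    )
    moved = archive_then_delete(conn, "2020-01-01")
    live = [r[0] for r in conn.execute("SELECT run_id FROM dag_run ORDER BY run_id")]
    archived = [
        r[0] for r in conn.execute("SELECT run_id FROM dag_run_archive ORDER BY run_id")
    ]
    return moved, live, archived
```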





[GitHub] [airflow] joshowen commented on issue #7911: Add data retention policy to Airflow

Posted by GitBox <gi...@apache.org>.
joshowen commented on issue #7911:
URL: https://github.com/apache/airflow/issues/7911#issuecomment-634214872


   One other thought on this: it would be very useful if the low watermark were configurable in the UI (e.g., show me the last 30d/365d/all DAGRuns)
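
A small sketch of how such a UI choice could map to a watermark. The option labels follow the comment above; the `WINDOWS` mapping and `watermark_for` function are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# UI choice -> how far back to look; None means no watermark at all.
WINDOWS = {
    "30d": timedelta(days=30),
    "365d": timedelta(days=365),
    "all": None,
}


def watermark_for(choice: str, now: datetime) -> Optional[datetime]:
    """Return the earliest execution date to show, or None to show everything."""
    if choice not in WINDOWS:
        raise ValueError(f"unknown window: {choice!r}")
    delta = WINDOWS[choice]
    return None if delta is None else now - delta
```

A query layer could then add a `WHERE execution_date >= :watermark` filter only when the returned value is not `None`.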





[GitHub] [airflow] JavierLopezT commented on issue #7911: Add data retention policy to Airflow

Posted by GitBox <gi...@apache.org>.
JavierLopezT commented on issue #7911:
URL: https://github.com/apache/airflow/issues/7911#issuecomment-896687760


   If I get some guidance on how/where to start I could try to do it





[GitHub] [airflow] mik-laj commented on issue #7911: Add data retention policy to Airflow

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #7911:
URL: https://github.com/apache/airflow/issues/7911#issuecomment-751873903


   @potiuk @hsnprsd Here is the mailing list discussion: https://lists.apache.org/thread.html/f622e719d5f7e804bf75c05697f405264caae1d3a72aa1bd991b78e9%40%3Cdev.airflow.apache.org%3E
   Can you add your comment there? 





[GitHub] [airflow] dimberman closed issue #7911: Add data retention policy to Airflow

Posted by GitBox <gi...@apache.org>.
dimberman closed issue #7911:
URL: https://github.com/apache/airflow/issues/7911


   





[GitHub] [airflow] potiuk commented on issue #7911: Add data retention policy to Airflow

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #7911:
URL: https://github.com/apache/airflow/issues/7911#issuecomment-751875668


   > @potiuk @hsnprsd Here is mailing lists discussion: https://lists.apache.org/thread.html/f622e719d5f7e804bf75c05697f405264caae1d3a72aa1bd991b78e9%40%3Cdev.airflow.apache.org%3E
   > Can you add your comment there?
   
   The discussion is from March/April 2019. I do not see any value in reviving it there. I will leave it to whoever is interested in moving it forward - possibly @dimberman, who opened the issue (or anyone else who would like to lead it and carry it on).





[GitHub] [airflow] potiuk edited a comment on issue #7911: Add data retention policy to Airflow

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #7911:
URL: https://github.com/apache/airflow/issues/7911#issuecomment-898901416


   > If I get some guidance on how/where to start I could try to do it
   
   I think a good start is to take a look at the quite popular maintenance DAGs here: https://github.com/teamclairvoyant/airflow-maintenance-dags - this is a set of 3rd-party maintenance DAGs that people are using for some kinds of maintenance (`db-cleanup`). We do not know how "correct" it is or how well it copes with newer Airflow versions, but it can give an idea of how users deal with it.
   
   I think it might be a good idea to start from that and work out an approach (other than DAGs) for implementing something like that in Airflow as a periodic job - especially since the long-term plan is to not allow tasks to talk to the DB directly, so the DAG approach would not work in that case.
   
   Personally, I think this should start with at least a discussion on the devlist or (maybe even better) a new AIP (Airflow Improvement Proposal - https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals ) as a result of this discussion.
   
   I think there are many ways it can be done, but it needs a proposal and quite an extensive discussion (on performance consequences, where such a cleanup should run, whether it should be a separate process or run within the scheduler, how to deal with multiple schedulers if we choose a scheduler-embedded solution, etc.). It's actually quite an extensive one.
   

