Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2023/01/09 22:58:44 UTC

[GitHub] [airflow] ferruzzi opened a new pull request, #28819: AWS system tests selective log purge

ferruzzi opened a new pull request, #28819:
URL: https://github.com/apache/airflow/pull/28819

   If the test fails, add a retention policy to the relevant logs instead of deleting them, so the logs remain available for troubleshooting.
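   
   A minimal sketch (not from the PR itself) of how a system test DAG might wire this in; the DAG id, the stand-in workload task, and the log group name below are hypothetical, and `prune_logs` is the new task added by this PR:
   
   ```python
   from datetime import datetime
   
   from airflow import DAG
   from airflow.decorators import task
   
   from tests.system.providers.amazon.aws.utils import prune_logs
   
   with DAG(
       dag_id="example_aws_system_test",  # hypothetical
       schedule="@once",
       start_date=datetime(2023, 1, 1),
       catchup=False,
   ):
   
       @task
       def run_workload():
           """Stand-in for the system test's real tasks."""
   
       # prune_logs runs last no matter what happened upstream (ALL_DONE
       # trigger rule): it deletes the log group when every task passed,
       # otherwise it applies the retention policy so the logs survive
       # for troubleshooting.
       run_workload() >> prune_logs([("/aws/lambda/example-function", None)])
   ```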


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] vandonr-amz commented on a diff in pull request #28819: AWS system tests selective log purge

Posted by GitBox <gi...@apache.org>.
vandonr-amz commented on code in PR #28819:
URL: https://github.com/apache/airflow/pull/28819#discussion_r1069927088


##########
tests/system/providers/amazon/aws/utils/__init__.py:
##########
@@ -249,7 +251,43 @@ def set_env_id() -> str:
     return env_id
 
 
-def purge_logs(
+def all_tasks_passed(ti) -> bool:
+    task_runs = ti.get_dagrun().get_task_instances()
+    return all(_task.state != State.FAILED for _task in task_runs)
+
+
+@task(trigger_rule=TriggerRule.ALL_DONE)
+def prune_logs(
+    logs: list[tuple[str, str | None]],
+    force_delete: bool = False,
+    retry: bool = False,
+    retry_times: int = 3,
+    ti=None,
+):
+    """
+    If all tasks in this dagrun have succeeded, then delete the associated logs.
+    Otherwise, apply a retention policy to the logs.  This keeps them available
+    for troubleshooting while ensuring they won't build up indefinitely.
+
+    :param logs: A list of log_group/stream_prefix tuples to delete.
+    :param force_delete: If True, remove the log group and all of its log
+        streams without checking the log streams within the group first.
+    :param retry: Whether to retry if the log group/stream was not found. In
+        some cases, the log group/stream is created a few seconds after the
+        main resource; by default it retries 3 times with a 5s waiting period.
+    :param retry_times: Number of retries.
+    :param ti: Used to check the status of the tasks. This gets pulled from the
+        DAG's context and does not need to be passed manually.
+    """
+    if all_tasks_passed(ti):
+        _purge_logs(logs, force_delete, retry, retry_times)
+    else:
+        client: BaseClient = boto3.client("logs")
+        for group, _ in logs:
+            client.put_retention_policy(logGroupName=group, retentionInDays=30)
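
For context, `_purge_logs` is the pre-existing deletion helper (this diff renames the public `purge_logs` to `_purge_logs`). A rough sketch of what such a helper might look like, inferred from the docstring above; the retry condition and the stream check are assumptions, not the actual Airflow code:

```python
from __future__ import annotations

import time

import boto3
from botocore.exceptions import ClientError


def _purge_logs_sketch(
    test_logs: list[tuple[str, str | None]],
    force_delete: bool = False,
    retry: bool = False,
    retry_times: int = 3,
) -> None:
    """Hypothetical stand-in for _purge_logs, based on its docstring."""
    client = boto3.client("logs")
    for group, stream_prefix in test_logs:
        for _ in range(retry_times if retry else 1):
            try:
                if not force_delete and stream_prefix:
                    # Look at the streams under the group before removing it,
                    # as the force_delete docstring describes.
                    client.describe_log_streams(
                        logGroupName=group, logStreamNamePrefix=stream_prefix
                    )
                client.delete_log_group(logGroupName=group)
                break
            except ClientError as e:
                if retry and e.response["Error"]["Code"] == "ResourceNotFoundException":
                    time.sleep(5)  # the 5s waiting period the docstring mentions
                    continue
                raise
```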

Review Comment:
   Just to get a sense of the magnitude of the costs we are talking about here: CloudWatch's price for log storage is 3 cents per GB.
   And a system test would have to be SUPER BEEFY to produce even a single gigabyte of logs.
   Also, the free tier covers the first 5GB of logs.
   So, all in all, I'm not even sure changing retention from 30 to 14 days would change costs by a full cent.
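
   To put rough numbers on that (a quick sketch; $0.03 per GB-month is the storage price quoted above, since CloudWatch storage pricing is monthly, and 0.5 GB is a deliberately generous guess for one failed run's logs):

   ```python
   GB_MONTH_PRICE = 0.03  # USD per GB-month, CloudWatch Logs storage (figure above)
   log_gb = 0.5           # generous guess for a single failed system test run

   cost_30d = log_gb * GB_MONTH_PRICE            # $0.0150
   cost_14d = log_gb * GB_MONTH_PRICE * 14 / 30  # $0.0070
   print(f"savings from 30 -> 14 days: ${cost_30d - cost_14d:.4f}")  # ~$0.008
   ```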





[GitHub] [airflow] ashb commented on a diff in pull request #28819: AWS system tests selective log purge

Posted by GitBox <gi...@apache.org>.
ashb commented on code in PR #28819:
URL: https://github.com/apache/airflow/pull/28819#discussion_r1069787616


Review Comment:
   We could probably drop this to 1 day if CloudWatch allows it?





[GitHub] [airflow] o-nikolas merged pull request #28819: AWS system tests selective log purge

Posted by GitBox <gi...@apache.org>.
o-nikolas merged PR #28819:
URL: https://github.com/apache/airflow/pull/28819




[GitHub] [airflow] vandonr-amz commented on a diff in pull request #28819: AWS system tests selective log purge

Posted by GitBox <gi...@apache.org>.
vandonr-amz commented on code in PR #28819:
URL: https://github.com/apache/airflow/pull/28819#discussion_r1069916460


Review Comment:
   I'd go for 14 at minimum; 7 days is still short enough that the logs could disappear while we're investigating, which would be super annoying compared to the few cents saved on the bill.





[GitHub] [airflow] ferruzzi commented on pull request #28819: AWS system tests selective log purge

Posted by GitBox <gi...@apache.org>.
ferruzzi commented on PR #28819:
URL: https://github.com/apache/airflow/pull/28819#issuecomment-1380771725

   Rebased to deal with merge conflicts




[GitHub] [airflow] ferruzzi commented on a diff in pull request #28819: AWS system tests selective log purge

Posted by GitBox <gi...@apache.org>.
ferruzzi commented on code in PR #28819:
URL: https://github.com/apache/airflow/pull/28819#discussion_r1069816973


Review Comment:
   30 days was fairly arbitrary, but I don't think less than a week would be a good idea.  If the test fails on a Friday before a long weekend, then it takes a day to get to it, we're already at 4 days, add a day or three to possibly dive in and figure out what went wrong... I wouldn't go less than that.  Having the logs disappear while troubleshooting would be frustrating.
   
   For what it's worth, according to the docs the possible values are: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 2192, 2557, 2922, 3288, and 3653.
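
   A tiny helper (hypothetical, not part of this PR) that picks the smallest allowed value covering a requested retention, using the list above:

   ```python
   # Valid CloudWatch Logs retentionInDays values, per the AWS docs quoted above.
   VALID_RETENTION_DAYS = (
       1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365,
       400, 545, 731, 1827, 2192, 2557, 2922, 3288, 3653,
   )


   def retention_at_least(days: int) -> int:
       """Smallest allowed retention that keeps logs for at least `days` days."""
       return min((v for v in VALID_RETENTION_DAYS if v >= days), default=3653)


   assert retention_at_least(10) == 14
   assert retention_at_least(30) == 30
   ```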





[GitHub] [airflow] ferruzzi commented on a diff in pull request #28819: AWS system tests selective log purge

Posted by GitBox <gi...@apache.org>.
ferruzzi commented on code in PR #28819:
URL: https://github.com/apache/airflow/pull/28819#discussion_r1069866665


Review Comment:
   Alright, CI is green again.   If we're going to make a change, I propose either 7 or 14.  Does anyone have any strong feelings in any particular direction?  I don't mind 30 personally, but happy to adjust if we feel it's worth changing.
   
   cc @o-nikolas 





[GitHub] [airflow] Taragolis commented on a diff in pull request #28819: AWS system tests selective log purge

Posted by GitBox <gi...@apache.org>.
Taragolis commented on code in PR #28819:
URL: https://github.com/apache/airflow/pull/28819#discussion_r1069862030


Review Comment:
   Storing data in CloudWatch Logs is much cheaper (by about 15x) than ingesting the logs into CloudWatch in the first place.
   So I don't think there would be a significant cost difference between storing for 7 days and for 30 days.
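
   For rough scale (a sketch; the ~$0.50/GB ingestion price is an assumption for us-east-1 standard tier at the time, set against the $0.03/GB-month storage figure mentioned elsewhere in this thread):

   ```python
   ingest_per_gb = 0.50       # USD per GB ingested (assumed figure)
   store_per_gb_month = 0.03  # USD per GB-month stored (figure from this thread)

   print(ingest_per_gb / store_per_gb_month)  # ~16.7x, in line with "about x15"
   ```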





[GitHub] [airflow] ferruzzi commented on a diff in pull request #28819: AWS system tests selective log purge

Posted by GitBox <gi...@apache.org>.
ferruzzi commented on code in PR #28819:
URL: https://github.com/apache/airflow/pull/28819#discussion_r1071571584


Review Comment:
   This is a pretty small blast-radius and a very easy thing to change later.  Let's merge it with 30 and if Ash (or anyone else, for that matter) has any strong feelings on it, it's a super easy change to make.





[GitHub] [airflow] ferruzzi commented on a diff in pull request #28819: AWS system tests selective log purge

Posted by GitBox <gi...@apache.org>.
ferruzzi commented on code in PR #28819:
URL: https://github.com/apache/airflow/pull/28819#discussion_r1069930251


Review Comment:
   Alright, given those numbers, I'm inclined to leave it as it is.  Old logs can also be useful in troubleshooting: if you can see the logs from the last time it failed, that helps in case it's a related issue that hasn't been (fully) resolved.
   
   @ashb  Are you cool with keeping it at 30?  





[GitHub] [airflow] vandonr-amz commented on a diff in pull request #28819: AWS system tests selective log purge

Posted by GitBox <gi...@apache.org>.
vandonr-amz commented on code in PR #28819:
URL: https://github.com/apache/airflow/pull/28819#discussion_r1069796378


Review Comment:
   1 day is quite short if, say, we want to know why a system test failed on Friday evening when coming into the office on Tuesday after a bank holiday on Monday ;)
   Why do you think keeping the logs for a month after a system test failure is too long?





[GitHub] [airflow] ferruzzi commented on pull request #28819: AWS system tests selective log purge

Posted by GitBox <gi...@apache.org>.
ferruzzi commented on PR #28819:
URL: https://github.com/apache/airflow/pull/28819#issuecomment-1382198076

   @ashb - I missed a line in the merge, sorry about that.  This should pass now.




[GitHub] [airflow] ferruzzi commented on pull request #28819: AWS system tests selective log purge

Posted by GitBox <gi...@apache.org>.
ferruzzi commented on PR #28819:
URL: https://github.com/apache/airflow/pull/28819#issuecomment-1382187060

   Resolved merge conflict




[GitHub] [airflow] o-nikolas commented on a diff in pull request #28819: AWS system tests selective log purge

Posted by GitBox <gi...@apache.org>.
o-nikolas commented on code in PR #28819:
URL: https://github.com/apache/airflow/pull/28819#discussion_r1069926938


Review Comment:
   Agreed with Andrey: at $0.03 per GB for storage, I don't think cost is an issue for longer retention. If anything, the clutter caused by keeping logs around for a long period would be more of a worry.
   
   I agree with the general sentiment that we shouldn't go much lower than ~14 days.


