Posted to issues@maven.apache.org by "Michael Osipov (Jira)" <ji...@apache.org> on 2022/01/10 20:27:00 UTC

[jira] [Comment Edited] (MNG-7389) Incremental .m2 cache cleanup for CI

    [ https://issues.apache.org/jira/browse/MNG-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472279#comment-17472279 ] 

Michael Osipov edited comment on MNG-7389 at 1/10/22, 8:26 PM:
---------------------------------------------------------------

Using atime is a poor choice because it carries a very severe performance penalty, which is why most setups use {{atime=off}}.

The easiest way I see is to extend Maven Resolver: intercept artifact access after its physical resolution on disk and write a properties file, e.g., {{\{gav}.cachestatus}}, which will contain all the information you need. Walk the repo cache tree, read those files, delete stale artifacts. Done. Since the repo cache might be accessed during your cleanup, you will require multiprocess (MP) locking, which makes it even harder. Thus, you will need to write Java code which uses Resolver's {{SyncContextFactory}}.
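
The walk-read-delete part of this idea can be sketched in plain Java. This is only a minimal sketch, not Resolver code: the class name {{CacheStatusSketch}}, the {{lastAccess}} property, the choice to name the sidecar file after the artifact file rather than the GAV, and the maximum-age cutoff are all assumptions made for illustration. The multiprocess locking via {{SyncContextFactory}} is deliberately left out, so the sketch is only safe while no build is using the repository.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper; not part of Maven or Maven Resolver. */
final class CacheStatusSketch {

    static final String SUFFIX = ".cachestatus";
    static final String LAST_ACCESS = "lastAccess"; // assumed property name

    /** Writes (or refreshes) the sidecar properties file next to a resolved artifact. */
    public static void recordAccess(Path artifact, Instant when) throws IOException {
        Properties props = new Properties();
        props.setProperty(LAST_ACCESS, Long.toString(when.toEpochMilli()));
        Path status = artifact.resolveSibling(artifact.getFileName() + SUFFIX);
        try (OutputStream out = Files.newOutputStream(status)) {
            props.store(out, "written on artifact access");
        }
    }

    /** Deletes artifacts last accessed more than maxAge before now; returns the deleted artifact paths. */
    public static List<Path> deleteStale(Path repoRoot, Instant now, Duration maxAge) throws IOException {
        // Collect the sidecar files first so nothing is deleted while the walk is still open.
        List<Path> statusFiles;
        try (Stream<Path> walk = Files.walk(repoRoot)) {
            statusFiles = walk.filter(Files::isRegularFile)
                              .filter(p -> p.getFileName().toString().endsWith(SUFFIX))
                              .collect(Collectors.toList());
        }
        List<Path> deleted = new ArrayList<>();
        for (Path status : statusFiles) {
            Properties props = new Properties();
            try (InputStream in = Files.newInputStream(status)) {
                props.load(in);
            }
            long last = Long.parseLong(props.getProperty(LAST_ACCESS, "0"));
            if (Duration.between(Instant.ofEpochMilli(last), now).compareTo(maxAge) > 0) {
                String name = status.getFileName().toString();
                Path artifact = status.resolveSibling(name.substring(0, name.length() - SUFFIX.length()));
                Files.deleteIfExists(artifact);
                Files.delete(status);
                deleted.add(artifact);
            }
        }
        return deleted;
    }
}
```

In a real implementation {{recordAccess}} would have to be called from inside Resolver, and both methods would need to run under the same lock that builds use.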

Don't expect any of us to provide this anytime soon, especially because disk space is cheap.

[~cstamas]


> Incremental .m2 cache cleanup for CI
> ------------------------------------
>
>                 Key: MNG-7389
>                 URL: https://issues.apache.org/jira/browse/MNG-7389
>             Project: Maven
>          Issue Type: New Feature
>          Components: Dependencies
>            Reporter: Thomas Skjølberg
>            Priority: Minor
>
> One or more popular continuous integration (CI) services are unable to properly manage the .m2 repository cache, resulting in wasted resources in the form of increased CI runtime and bandwidth consumption.
> *CircleCI cache behaviour:*
>  - immutable cache entries
>  - default behaviour is to wipe the cache each time a pom file is modified (i.e. using pom hash as a cache key)
>  - cache entries TTL > weeks
> So CircleCI always has a cache containing only the necessary artifacts, but has to download all dependencies every time the pom file changes.
> *GitHub Actions cache behaviour:*
>  - (effectively) mutable cache entries
>  - incremental cache (if it gets too big, it is wiped).
>  - cache entries TTL 1 week
> So GitHub Actions works well if the cache entries expire from time to time; otherwise the cache keeps growing.
> *Summary*
> Perhaps this does not look so bad at first glance, but for a project under active development, with a lot of artifacts, the pom file changes often. For example, we have apps with 100 dependencies and automatic dependency bumping via Renovate, in addition to a hierarchy of libraries.
> Key takeaway: time is wasted on
>  - saving caches in CI
>  - loading cache in CI
>  - loading artifacts from external artifact store
> This happens quite a lot. From the artifact store perspective, this probably multiplies the load by a factor of 10.
> Possible solution: A way to define a "transaction" for artifact use, i.e.
> 1. run command to mark start of transaction 
> 2. run one or more maven commands
> 3. run command to mark end of transaction, deleting artifacts not in use.
> For reference, Gradle has the same problem.
> Proof of concept:
>  * CircleCI : [https://github.com/entur/maven-orb]
>  * GitHub Actions: [https://github.com/skjolber/tidy-cache-github-action]
> The implementation uses instrumentation to record artifact access, then deletes the artifacts not recorded.
> *Alternatives:*
> I did try the last-accessed file timestamp first, but it turns out most CI filesystems are mounted without that option. However, it should also be possible to update the modified timestamp and/or add read access to some existing metadata file.
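
The three-step "transaction" proposed in the quoted description (mark start, run builds while recording access, delete what was not used) can be illustrated with a small Java sketch. Everything here is hypothetical: {{CacheTransactionSketch}} and its methods are names invented for illustration, and neither Maven nor the linked proofs of concept are known to expose such a class. How access gets recorded (instrumentation, a Resolver hook, or something else) is left open; the sketch simply assumes some caller reports each file that was read.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical illustration of the start / record / end-and-delete flow. */
final class CacheTransactionSketch {

    private final Path repoRoot;
    private final Set<Path> accessed = new HashSet<>();

    /** Step 1: mark the start of the transaction for the given repository root. */
    CacheTransactionSketch(Path repoRoot) {
        this.repoRoot = repoRoot.toAbsolutePath().normalize();
    }

    /** Step 2: report a file that a build read while the transaction was open. */
    void recordAccess(Path file) {
        accessed.add(file.toAbsolutePath().normalize());
    }

    /** Step 3: end the transaction, deleting every regular file that was never recorded. */
    List<Path> end() throws IOException {
        List<Path> unused;
        try (Stream<Path> walk = Files.walk(repoRoot)) {
            unused = walk.filter(Files::isRegularFile)
                         .filter(p -> !accessed.contains(p))
                         .collect(Collectors.toList());
        }
        for (Path p : unused) {
            Files.delete(p);
        }
        return unused;
    }
}
```

The sketch leaves empty directories behind and assumes no other process touches the repository between the start and the end of the transaction.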


