Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/29 12:29:30 UTC

[GitHub] [arrow] assignUser commented on a diff in pull request #13453: ARROW-16901: [R][CI] Prune R nightly builds

assignUser commented on code in PR #13453:
URL: https://github.com/apache/arrow/pull/13453#discussion_r909484781


##########
.github/workflows/r_nightly.yml:
##########
@@ -100,18 +149,37 @@ jobs:
           # strip superfluous nested dirs
           new_paths <- sub(art_path, ".", new_paths)
           dirs <- dirname(new_paths)
-          dir_result <- sapply(dirs, dir.create, recursive = TRUE)
+          sapply(dirs, dir.create, recursive = TRUE, showWarnings = FALSE)
 
-          if (!all(dir_result)) {
-            stop("There was an issue while creating the folders!")
-          }
-
-          copy_result <- file.copy(current_path, new_paths)
+          # overwrite allows us to "force push" a new version with the same name
+          copy_result <- file.copy(current_path, new_paths, overwrite = TRUE)
 
           if (!all(copy_result)) {
             stop("There was an issue while copying the files!")
           }
+      - name: Prune Repository
+        shell: bash
+        env:
+          KEEP: ${{ github.event.inputs.keep || 14 }}
+        run: |   
+          prune() {
+            # list files  | retain $KEEP newest files | delete everything else
+            ls -t $1/arrow* | tail -n +$((KEEP + 1)) | xargs --no-run-if-empty rm 
+          }
+
+          # find leaf sub dirs
+          repo_dirs=$(find repo -type d -links 2)
 
+          # We want to retain $keep (14) versions of each pkg/lib so we call
+          # prune on each leaf dir and not on repo/.
+          for dir in ${repo_dirs[@]}; do
+            prune $dir
+          done

Review Comment:
   I'll give it a try :)



##########
.github/workflows/r_nightly.yml:
##########
@@ -59,6 +65,12 @@ jobs:
       - name: Install Archery
         shell: bash
         run: pip install -e arrow/dev/archery[all]
+      - uses: actions/cache@v3
+        with:
+          path: repo
+          key: r-nightly-${{ github.run_id }}
+          restore-keys: r-nightly-
+      

Review Comment:
   🤦‍♂️ you're right, it is duplicated



##########
.github/workflows/r_nightly.yml:
##########
@@ -77,10 +89,47 @@ jobs:
             echo "No files found. Stopping upload."
             exit 1
           fi
+      - name: Cache Repo
+        uses: actions/cache@v3
+        with:
+          path: repo
+          key: r-nightly-${{ github.run_id }}
+          restore-keys: r-nightly-
+      - name: Sync from Remote

Review Comment:
   GHA caches are evicted regularly, especially due to the workaround I described above, which creates a large number of redundant cache objects (the same workaround is also in use for the R C++ builds). So we cannot rely on always hitting the cache.
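   
   Hence the separate `Sync from Remote` step: it pulls the current repository state from the nightlies host before the job modifies it, so a cache miss only costs download time. A minimal sketch of such a step, assuming it reuses the rsync secrets from the upload step (the real step body is not shown in this diff, and ssh key handling is omitted for brevity):
   
   ```yaml
         - name: Sync from Remote
           shell: bash
           run: |
             # Hypothetical pull counterpart to the upload step: on a cache
             # miss this rebuilds repo/ from the authoritative remote copy
             # instead of starting from an empty directory.
             rsync -avzr \
               -e "ssh -p ${{ secrets.NIGHTLIES_RSYNC_PORT }}" \
               "${{ secrets.NIGHTLIES_RSYNC_USER }}@${{ secrets.NIGHTLIES_RSYNC_HOST }}:${{ secrets.NIGHTLIES_RSYNC_PATH }}/arrow/r/" \
               repo/
   ```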



##########
.github/workflows/r_nightly.yml:
##########
@@ -59,6 +65,12 @@ jobs:
       - name: Install Archery
         shell: bash
         run: pip install -e arrow/dev/archery[all]
+      - uses: actions/cache@v3
+        with:
+          path: repo
+          key: r-nightly-${{ github.run_id }}
+          restore-keys: r-nightly-
+      

Review Comment:
   No, this is intentional. `actions/cache` is broken: it will not update the cache for an existing primary key (there are several issues about this, e.g. https://github.com/actions/cache/discussions/641). This is a (wasteful) workaround: by generating a unique primary key and matching on the `restore-keys` prefix, we force the cache to be "updated" (a new cache object is created on every run). That leads the whole concept of a cache ad absurdum, but there is no way around it for now :(
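   
   Annotated, the pattern from the diff above works like this:
   
   ```yaml
         - uses: actions/cache@v3
           with:
             path: repo
             # github.run_id is unique per run, so this key never matches an
             # existing entry and a brand-new cache object is saved post-job.
             key: r-nightly-${{ github.run_id }}
             # On restore, the prefix falls back to the most recently created
             # cache starting with "r-nightly-", i.e. the previous run's state.
             restore-keys: r-nightly-
   ```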



##########
.github/workflows/r_nightly.yml:
##########
@@ -124,14 +192,32 @@ jobs:
             tools::write_PACKAGES(dir, type = ifelse(on_win, "win.binary", "mac.binary"), verbose = TRUE )
           }
       - name: Show repo contents
-        run: ls -R repo
-      - name: Upload Files
-        uses: burnett01/rsync-deployments@5.2
-        with:
-          switches: -avzr
-          path: repo/*
-          remote_path: ${{ secrets.NIGHTLIES_RSYNC_PATH }}/arrow/r
-          remote_host: ${{ secrets.NIGHTLIES_RSYNC_HOST }}
-          remote_port: ${{ secrets.NIGHTLIES_RSYNC_PORT }}
-          remote_user: ${{ secrets.NIGHTLIES_RSYNC_USER }}
-          remote_key: ${{ secrets.NIGHTLIES_RSYNC_KEY }}
+        run: tree repo
+      - name: Sync to Remote

Review Comment:
   For functional reasons:
   - the action does not support downloading from the remote (only uploading)
     - this could be achieved without rsync, but would likely be slower (no auth)
   - the functionality can be implemented in ~10 LOC without the need for docker (see the sketch below), which in itself is a valid reason not to use the action imho
   - but also for security reasons:
     - this workflow runs with full access to `secrets` and a ~~r/w `github.token` available to actions~~ (I removed this) -> :warning: see [here](https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status#GitHubActionsstatus-Security)
     - the action does not validate host keys -> potential for MITM
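   
   A sketch of what those ~10 LOC could look like, with host key validation via a pinned key (`KNOWN_HOSTS` is a hypothetical secret holding the server's public host key; the PR's actual step may differ):
   
   ```yaml
         - name: Sync to Remote
           shell: bash
           run: |
             # Write the deploy key and the pinned host key; validating the
             # host key closes the MITM hole the action leaves open.
             echo "${{ secrets.NIGHTLIES_RSYNC_KEY }}" > rsync_key && chmod 600 rsync_key
             echo "${{ secrets.KNOWN_HOSTS }}" > known_hosts
             rsync -avzr \
               -e "ssh -i rsync_key -p ${{ secrets.NIGHTLIES_RSYNC_PORT }} -o UserKnownHostsFile=known_hosts" \
               repo/ \
               "${{ secrets.NIGHTLIES_RSYNC_USER }}@${{ secrets.NIGHTLIES_RSYNC_HOST }}:${{ secrets.NIGHTLIES_RSYNC_PATH }}/arrow/r"
   ```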
   



##########
.github/workflows/r_nightly.yml:
##########
@@ -100,18 +149,37 @@ jobs:
           # strip superfluous nested dirs
           new_paths <- sub(art_path, ".", new_paths)
           dirs <- dirname(new_paths)
-          dir_result <- sapply(dirs, dir.create, recursive = TRUE)
+          sapply(dirs, dir.create, recursive = TRUE, showWarnings = FALSE)
 
-          if (!all(dir_result)) {
-            stop("There was an issue while creating the folders!")
-          }
-
-          copy_result <- file.copy(current_path, new_paths)
+          # overwrite allows us to "force push" a new version with the same name
+          copy_result <- file.copy(current_path, new_paths, overwrite = TRUE)
 
           if (!all(copy_result)) {
             stop("There was an issue while copying the files!")
           }
+      - name: Prune Repository
+        shell: bash
+        env:
+          KEEP: ${{ github.event.inputs.keep || 14 }}
+        run: |   
+          prune() {
+            # list files  | retain $KEEP newest files | delete everything else
+            ls -t $1/arrow* | tail -n +$((KEEP + 1)) | xargs --no-run-if-empty rm 
+          }
+
+          # find leaf sub dirs
+          repo_dirs=$(find repo -type d -links 2)
 
+          # We want to retain $keep (14) versions of each pkg/lib so we call
+          # prune on each leaf dir and not on repo/.
+          for dir in ${repo_dirs[@]}; do
+            prune $dir
+          done

Review Comment:
   No, we can't, because that deletes by age with no regard to the total number of files. If the nightly builds are not uploaded regularly for some reason (e.g. because the build is failing), it would eventually delete all files; keeping the $KEEP newest files instead always leaves the last 14 versions in place.
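   
   To illustrate the difference (the age-based variant is the hypothetical alternative being rejected, not code from the PR):
   
   ```yaml
         - name: Prune Repository (illustration)
           shell: bash
           env:
             KEEP: 14
           run: |
             dir=repo/src/contrib  # hypothetical leaf dir
             # Count-based (what the PR does): keep the $KEEP newest files,
             # however old they are; stalled uploads leave the repo intact.
             ls -t $dir/arrow* | tail -n +$((KEEP + 1)) | xargs --no-run-if-empty rm
             # Age-based (rejected): after two weeks of failing builds this
             # would have emptied the directory entirely.
             # find $dir -name 'arrow*' -mtime +$KEEP -delete
   ```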



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org