Posted to dev@sling.apache.org by "stefan-egli (via GitHub)" <gi...@apache.org> on 2023/03/07 15:48:22 UTC

[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli opened a new pull request, #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

stefan-egli opened a new pull request, #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13

   … under /var/discovery/oak automatically
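
For orientation: the Javadoc in the diff quoted later in this thread documents a default batch size of 50 deletions and a default interval of 10min between batches. The following is a minimal sketch of what those two defaults imply for cleanup throughput. The class `CleanupThroughput` and both of its methods are hypothetical names used only for illustration and are not part of the PR. The sketch also ignores the time a cleanup run itself takes.

```java
// Hypothetical illustration, not part of the PR.
class CleanupThroughput {

    /** Upper bound on deletions per day, ignoring the time a run itself takes. */
    static long maxDeletionsPerDay(int batchSize, long intervalMillis) {
        final long millisPerDay = 24L * 60 * 60 * 1000;
        return batchSize * (millisPerDay / intervalMillis);
    }

    /** Number of runs needed to remove garbageCount entries (rounded up). */
    static long runsNeeded(long garbageCount, int batchSize) {
        return (garbageCount + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        final int batchSize = 50;           // DEFAULT_CLEANUP_BATCH_SIZE in the diff
        final long intervalMillis = 600000; // DEFAULT_CLEANUP_INTERVAL in the diff (10min)

        System.out.println(maxDeletionsPerDay(batchSize, intervalMillis)); // prints 7200
        System.out.println(runsNeeded(1000, batchSize));                   // prints 20
    }
}
```

Under these assumptions, a legacy backlog of 1000 garbage slingIds would need 20 runs, so with the default 10min initial delay and 10min interval it would be gone roughly 200 minutes after the topology event.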


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sling.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1458502912

   SonarCloud Quality Gate failed.
   Dashboard: https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13

   0 Bugs (rating A)
   0 Vulnerabilities (rating A)
   0 Security Hotspots (rating A)
   17 Code Smells (rating A)

   76.2% Coverage
   0.0% Duplication



[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1128270693


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology
+     * and was started more than 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min : together with the batchSize, this
+     * limits the load that cleanup puts on the repository
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledTime = cal.getTime();
+        logger.debug("recreateSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(scheduledTime);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default it is called 10min after a topology change and then at most every
+     * 10min until cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);
+
+            final Resource clusterInstances = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getClusterInstancesPath());
+            final Resource idMap = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getIdMapPath());
+            final Resource syncTokens = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getSyncTokenPath());
+            resolver.revert();
+            resolver.refresh();
+
+            final ValueMap idMapMap = idMap.adaptTo(ValueMap.class);
+            final ModifiableValueMap syncTokenMap = syncTokens
+                    .adaptTo(ModifiableValueMap.class);
+            final Calendar now = Calendar.getInstance();
+            int removed = 0;
+            boolean mightHaveMore = false;
+            int localBatchSize = batchSize;
+            long localMinCreationAgeMillis = minCreationAgeMillis;
+            for (Resource resource : clusterInstances.getChildren()) {
+                if (!hasTopology || currentView != localCurrentView
+                        || !localCurrentView.isCurrent()) {
+                    // we got interrupted during cleanup
+                    // let's not commit at all then
+                    logger.info(
+                            "cleanup : topology changing during cleanup - not committing this time - stopping for now.");
+                    return true;
+                }
+                final String slingId = resource.getName();
+                logger.info("cleanup : handling slingId = {}", slingId);
+                Object clusterNodeId = idMapMap.get(slingId);
+                if (clusterNodeId == null) {
+                    logger.info("cleanup : slingId not recently in use : {}",
+                            slingId);
+                } else {
+                    logger.info("cleanup : slingId WAS recently in use : {}",
+                            clusterNodeId);
+                    continue;
+                }
+                if (activeSlingIds.contains(slingId)) {
+                    logger.info("cleanup : slingId is currently active : {}", slingId);

Review Comment:
   same here, did that already
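
For readers following the diff above: the class Javadoc names three criteria for a garbage slingId (not in the current topology, not in the idmap, leaderElectionId created longer ago than the minimum creation age). These can be sketched as one pure predicate. This is a minimal sketch and not code from the PR. The class `SlingIdGarbageCheck`, the method `isGarbage` and all of its parameters are hypothetical names, and the real task reads these values from the repository rather than from in-memory sets.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration, not part of the PR.
class SlingIdGarbageCheck {

    /**
     * Returns true only if all three criteria from the class Javadoc hold:
     * not in the current topology, not in the idmap, and created longer ago
     * than minCreationAge.
     */
    static boolean isGarbage(String slingId, Set<String> activeSlingIds,
            Set<String> idMapSlingIds, Instant createdAt, Instant now,
            Duration minCreationAge) {
        if (activeSlingIds.contains(slingId)) {
            return false; // still part of the current topology
        }
        if (idMapSlingIds.contains(slingId)) {
            return false; // still listed in the idmap
        }
        // garbage only if the age strictly exceeds the configured minimum
        return Duration.between(createdAt, now).compareTo(minCreationAge) > 0;
    }

    public static void main(String[] args) {
        final Instant now = Instant.parse("2023-03-07T15:48:22Z");
        final Duration oneWeek = Duration.ofDays(7); // default from the diff
        final Set<String> active = new HashSet<>(Arrays.asList("a"));
        final Set<String> idMap = new HashSet<>(Arrays.asList("a", "b"));
        final Instant old = now.minus(Duration.ofDays(30));
        final Instant recent = now.minus(Duration.ofDays(1));

        System.out.println(isGarbage("a", active, idMap, old, now, oneWeek));    // false: active
        System.out.println(isGarbage("b", active, idMap, old, now, oneWeek));    // false: in idmap
        System.out.println(isGarbage("c", active, idMap, recent, now, oneWeek)); // false: too young
        System.out.println(isGarbage("d", active, idMap, old, now, oneWeek));    // true
    }
}
```

Keeping the decision free of repository access, as in this sketch, would make each criterion easy to unit test in isolation. Whether the PR should be structured that way is a design choice left to the author.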




[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1467684850

   Kudos, SonarCloud Quality Gate passed!
   Dashboard: https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13

   0 Bugs (rating A)
   0 Vulnerabilities (rating A)
   0 Security Hotspots (rating A)
   30 Code Smells (rating A)

   86.4% Coverage
   0.0% Duplication



[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1458654183

   Kudos, SonarCloud Quality Gate passed!
   Dashboard: https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13

   0 Bugs (rating A)
   0 Vulnerabilities (rating A)
   0 Security Hotspots (rating A)
   22 Code Smells (rating A)

   85.0% Coverage
   0.0% Duplication



[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1128270317


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledTime = cal.getTime();
+        logger.debug("recreateSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(scheduledTime);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler triggered from recreateSchedule(). By
+     * default should get called at most every 10 minutes until cleanup is done, or
+     * 10min after a topology change.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);
+
+            final Resource clusterInstances = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getClusterInstancesPath());
+            final Resource idMap = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getIdMapPath());
+            final Resource syncTokens = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getSyncTokenPath());
+            resolver.revert();
+            resolver.refresh();
+
+            final ValueMap idMapMap = idMap.adaptTo(ValueMap.class);
+            final ModifiableValueMap syncTokenMap = syncTokens
+                    .adaptTo(ModifiableValueMap.class);
+            final Calendar now = Calendar.getInstance();
+            int removed = 0;
+            boolean mightHaveMore = false;
+            int localBatchSize = batchSize;
+            long localMinCreationAgeMillis = minCreationAgeMillis;
+            for (Resource resource : clusterInstances.getChildren()) {
+                if (!hasTopology || currentView != localCurrentView
+                        || !localCurrentView.isCurrent()) {
+                    // we got interrupted during cleanup
+                    // let's not commit at all then
+                    logger.info(
+                            "cleanup : topology changing during cleanup - not committing this time - stopping for now.");
+                    return true;
+                }
+                final String slingId = resource.getName();
+                logger.info("cleanup : handling slingId = {}", slingId);
+                Object clusterNodeId = idMapMap.get(slingId);
+                if (clusterNodeId == null) {
+                    logger.info("cleanup : slingId not recently in use : {}",
+                            slingId);
+                } else {
+                    logger.info("cleanup : slingId WAS recently in use : {}",

Review Comment:
   did that already




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1130749457


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,571 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active

Review Comment:
   The worst case should be that the existing, active instances enter a TOPOLOGY_CHANGING state and the restarted, racing instance fails to join the topology even though it is running, so oak leases are still exchanged. Killing that problematic instance would then be required to get back to a proper topology.
   
   One thing I'll look into is whether each instance should keep a list of all slingIds that were ever active (while it was alive). That list shouldn't be large, and holding just those slingIds shouldn't cause any memory issue. Having that list, and never cleaning up any slingId on it, would further reduce the likelihood of the described race condition. Actually, I think it would solve it completely.
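
A minimal sketch of that idea, to make the proposal concrete. The class name `SeenSlingIds` and its methods are hypothetical and not taken from the PR; it only assumes that the slingIds of each new topology view are handed to `record(...)` and that the cleanup asks `isCleanupCandidate(...)` before deleting anything.

```java
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (not part of the PR): remembers every slingId this
 * instance has seen in any topology view while it was alive.
 */
final class SeenSlingIds {

    // Concurrent set, on the assumption that topology events and the
    // scheduled cleanup run on different threads.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Records the slingIds contained in a newly announced topology view. */
    void record(Collection<String> slingIdsInView) {
        seen.addAll(slingIdsInView);
    }

    /**
     * A slingId may only be considered for cleanup if this instance never
     * saw it active. The other criteria (idmap, creation age) still apply.
     */
    boolean isCleanupCandidate(String slingId) {
        return !seen.contains(slingId);
    }

    /** Number of distinct slingIds remembered so far. */
    int size() {
        return seen.size();
    }
}
```

The set only grows by one entry per instance that ever joined, so it should stay small; a slingId that left the topology remains protected for as long as the remembering instance is alive.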




[GitHub] [sling-org-apache-sling-discovery-oak] rishabhdaim commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "rishabhdaim (via GitHub)" <gi...@apache.org>.

rishabhdaim commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1133438597


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,601 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it was never seen in previous topologies by the current leader instance</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ * <p>
+ * Additionally, the cleanup is skipped for 13 hours after a successful cleanup.
+ * This is to avoid unnecessary load on the repository. The number of 13
+ * incorporates some heuristics such as : about 2 cleanup rounds per day maximum
+ * makes sense, if a leader is very long living, then the 1 additional hour
+ * makes it spread somewhat throughout the day. This is to further minimize any
+ * load side-effects.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /** default minimal cleanup delay at 13h, to intraday load balance */
+    final static long MIN_CLEANUP_DELAY_MILLIS = 46800000;
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    private long lastSuccessfulRun = -1;
+
+    /**
+     * Minimal delay after a successful cleanup round, in millis
+     */
+    private long minCleanupDelayMillis = MIN_CLEANUP_DELAY_MILLIS;
+
+    /**
+     * contains all slingIds ever seen by this instance - should not be a long list
+     * so not a memory issue
+     */
+    private Set<String> seenInstances = new HashSet<>();
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis, long minCleanupDelayMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.minCleanupDelayMillis = minCleanupDelayMillis;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: enabled = {}, initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                isEnabled(), initialDelayMillis, intervalMillis, batchSize,
+                minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        if (!isEnabled()) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+            logger.debug("handleTopologyEvent: slingId cleanup is disabled");
+            return;
+        }
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            seenInstances.addAll(getActiveSlingIdsFrom(newView));
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * Reads the system property that enables or disables this task
+     */
+    private static boolean isEnabled() {
+        final String systemPropertyValue = System
+                .getProperty(SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME);
+        return standardConverter().convert(systemPropertyValue).defaultValue(false)
+                .to(Boolean.class);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledDate = cal.getTime();
+        logger.debug(
+                "recreateSchedule: scheduling a cleanup in {} milliseconds from now, which is: {}",
+                delayMillis, scheduledDate);
+        ScheduleOptions options = localScheduler.AT(scheduledDate);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default it should get called at most every 10 minutes until cleanup is done,
+     * or 10min after a topology change.
+     */
+    @Override
+    public void run() {
+        if (lastSuccessfulRun > 0 && System.currentTimeMillis()
+                - lastSuccessfulRun < minCleanupDelayMillis) {
+            logger.debug(
+                    "run: last cleanup was {} millis ago, which is less than {} millis, therefore not cleaning up yet.",
+                    System.currentTimeMillis() - lastSuccessfulRun,
+                    minCleanupDelayMillis);
+            recreateSchedule();
+            return;
+        }
+        runCount.incrementAndGet();
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info(
+                "run: slingId cleanup done, run counter = {}, delete counter = {}, completion counter = {}",
+                getRunCount(), getDeleteCount(), getCompletionCount());
+        lastSuccessfulRun = System.currentTimeMillis();
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    private boolean cleanup() {
+        logger.debug("cleanup: start");
+        if (!isEnabled()) {
+            // bit of overkill probably, as this shouldn't happen.
+            // but adds to a good night's sleep.
+            logger.debug("cleanup: not enabled, stopping.");
+            return false;
+        }
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.debug("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = getActiveSlingIdsFrom(localCurrentView);
+        try (ResourceResolver resolver = localFactory.getServiceResourceResolver(null)) {
+            final Resource clusterInstances = resolver
+                    .getResource(localConfig.getClusterInstancesPath());
+            final Resource idMap = resolver.getResource(localConfig.getIdMapPath());
+            final Resource syncTokens = resolver
+                    .getResource(localConfig.getSyncTokenPath());
+            if (clusterInstances == null || idMap == null || syncTokens == null) {
+                logger.warn("cleanup: no resource found at {}, {} or {}, stopping.",
+                        localConfig.getClusterInstancesPath(), localConfig.getIdMapPath(),
+                        localConfig.getSyncTokenPath());
+                return false;
+            }
+            resolver.refresh();
+            final ValueMap idMapMap = idMap.adaptTo(ValueMap.class);
+            final ModifiableValueMap syncTokenMap = syncTokens
+                    .adaptTo(ModifiableValueMap.class);
+            final Calendar now = Calendar.getInstance();
+            int removed = 0;
+            boolean mightHaveMore = false;
+            final int localBatchSize = batchSize;
+            final long localMinCreationAgeMillis = minCreationAgeMillis;
+            for (Resource resource : clusterInstances.getChildren()) {
+                if (!topologyUnchanged(localCurrentView)) {
+                    return true;
+                }
+                final String slingId = resource.getName();
+                if (deleteIfOldSlingId(resource, slingId, syncTokenMap, idMapMap,
+                        activeSlingIds, now, localMinCreationAgeMillis)) {
+                    if (++removed >= localBatchSize) {
+                        // we need to stop
+                        mightHaveMore = true;
+                        break;
+                    }
+                }
+            }
+            // if we're not already at the batch limit, check syncTokens too
+            if (!mightHaveMore) {
+                for (String slingId : syncTokenMap.keySet()) {
+                    try {
+                        UUID.fromString(slingId);
+                    } catch (Exception e) {
+                        // not a uuid
+                        continue;
+                    }
+                    if (!topologyUnchanged(localCurrentView)) {
+                        return true;
+                    }
+                    Resource resourceOrNull = clusterInstances.getChild(slingId);
+                    if (deleteIfOldSlingId(resourceOrNull, slingId, syncTokenMap,
+                            idMapMap, activeSlingIds, now, localMinCreationAgeMillis)) {
+                        if (++removed >= localBatchSize) {
+                            // we need to stop
+                            mightHaveMore = true;
+                            break;
+                        }
+                    }
+                }
+            }
+            if (!topologyUnchanged(localCurrentView)) {
+                return true;
+            }
+            if (removed > 0) {
+                // only if we removed something we commit
+                resolver.commit();
+                logger.info(
+                        "cleanup : removed {} old slingIds (batch size : {}), potentially has more: {}",
+                        removed, localBatchSize, mightHaveMore);
+                deleteCount.addAndGet(removed);
+            }
+            firstRun = false;
+            completionCount.incrementAndGet();
+            return mightHaveMore;
+        } catch (LoginException e) {
+            logger.error("cleanup: could not log in administratively: " + e, e);
+            throw new RuntimeException("Could not log in to repository (" + e + ")", e);
+        } catch (PersistenceException e) {
+            logger.error("cleanup: got a PersistenceException: " + e, e);
+            throw new RuntimeException(
+                    "Exception while talking to repository (" + e + ")", e);
+        } finally {
+            logger.debug("cleanup: done.");
+        }
+    }
+
+    private Set<String> getActiveSlingIdsFrom(final TopologyView localCurrentView) {

Review Comment:
   The method name `getActiveSlingIdsFrom` seems incomplete.
   
   I would change this to either `getActiveSlingIds` or `getActiveSlingIdsFromCurrentView`.
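
To make the naming question concrete, here is a minimal sketch of such a helper under the shorter name `getActiveSlingIds`. The method body is not part of the diff context quoted above, so this is an assumption about its shape and not the PR's actual code. `View` and `Instance` are hypothetical stand-ins for `TopologyView` and `InstanceDescription`, used only so the snippet compiles without the Sling discovery API on the classpath.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ActiveSlingIdsSketch {

    /** Hypothetical stand-in for org.apache.sling.discovery.InstanceDescription. */
    interface Instance {
        String getSlingId();
    }

    /** Hypothetical stand-in for org.apache.sling.discovery.TopologyView. */
    interface View {
        Set<Instance> getInstances();
    }

    /** Collects the slingIds of all instances contained in the given view. */
    static Set<String> getActiveSlingIds(View view) {
        final Set<String> ids = new HashSet<>();
        for (Instance instance : view.getInstances()) {
            ids.add(instance.getSlingId());
        }
        return ids;
    }

    public static void main(String[] args) {
        // two made-up instances forming a made-up view
        Instance a = () -> "id-a";
        Instance b = () -> "id-b";
        View view = () -> new HashSet<>(Arrays.asList(a, b));
        System.out.println(getActiveSlingIds(view).size()); // prints 2
    }
}
```

Since the view is the only parameter, the trailing `From` arguably adds little at the call site, which may be the point of the suggestion.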




[GitHub] [sling-org-apache-sling-discovery-oak] rishabhdaim commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "rishabhdaim (via GitHub)" <gi...@apache.org>.

rishabhdaim commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1133435308


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,601 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it was never seen in previous topologies by the current leader instance</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ * <p>
+ * Additionally, the cleanup is skipped for 13 hours after a successful cleanup.
+ * This is to avoid unnecessary load on the repository. The number 13 is a
+ * heuristic : at most about 2 cleanup rounds per day makes sense, and for a
+ * very long-lived leader the 1 additional hour (beyond 12) spreads the rounds
+ * somewhat throughout the day. This further minimizes any load side-effects.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /** default minimal cleanup delay of 13h, to spread the load across the day */
+    final static long MIN_CLEANUP_DELAY_MILLIS = 46800000;
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - together with the batchSize, this
+     * lowers repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to avoid overloading the repository with writes, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    private long lastSuccessfulRun = -1;
+
+    /**
+     * Minimal delay after a successful cleanup round, in millis
+     */
+    private long minCleanupDelayMillis = MIN_CLEANUP_DELAY_MILLIS;
+
+    /**
+     * contains all slingIds ever seen by this instance - should not be a long list
+     * so not a memory issue
+     */
+    private Set<String> seenInstances = new HashSet<>();
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;

Review Comment:
   Aren't these method names too long?
   
   The `org_apache_sling_discovery_oak_` prefix can be removed.
   
   For reference, https://github.com/apache/sling-org-apache-sling-discovery-oak/blob/master/src/main/java/org/apache/sling/discovery/oak/DiscoveryServiceCentralConfig.java#L29 doesn't use them.
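
One thing to keep in mind when shortening: under the OSGi Declarative Services naming rules, the annotation method name determines the configuration property key (a single `_` becomes `.`, and `__` becomes `_`). The sketch below is a simplified illustration of that mapping. It covers only the underscore rules and deliberately leaves out the `$` rules, and `ConfigKeyMapping` and `toPropertyKey` are hypothetical names made up for this example, not part of any OSGi API.

```java
/**
 * Simplified sketch of how a component property type method name maps to a
 * configuration property key. Only the underscore rules are handled here.
 */
class ConfigKeyMapping {

    static String toPropertyKey(String methodName) {
        final StringBuilder key = new StringBuilder();
        for (int i = 0; i < methodName.length(); i++) {
            final char c = methodName.charAt(i);
            if (c != '_') {
                key.append(c);
            } else if (i + 1 < methodName.length() && methodName.charAt(i + 1) == '_') {
                key.append('_'); // "__" is an escaped underscore
                i++;             // skip the second underscore
            } else {
                key.append('.'); // a single "_" becomes "."
            }
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // prints org.apache.sling.discovery.oak.slingid.cleanup.initial.delay
        System.out.println(toPropertyKey(
                "org_apache_sling_discovery_oak_slingid_cleanup_initial_delay"));
        // prints slingid.cleanup.initial.delay
        System.out.println(toPropertyKey("slingid_cleanup_initial_delay"));
    }
}
```

Dropping the prefix would therefore also change the configuration key (for example from `org.apache.sling.discovery.oak.slingid.cleanup.initial.delay` to `slingid.cleanup.initial.delay`), so any existing configuration would presumably need to follow.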




[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1458610540

   SonarCloud Quality Gate failed: https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13
   
   Bugs: 0 (rating A)
   Vulnerabilities: 0 (rating A)
   Security Hotspots: 0 (rating A)
   Code Smells: 19 (rating A)
   
   Coverage: 76.5%
   Duplication: 0.0%
   
   



[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1458606054

   SonarCloud Quality Gate failed: https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13
   
   Bugs: 0 (rating A)
   Vulnerabilities: 0 (rating A)
   Security Hotspots: 0 (rating A)
   Code Smells: 19 (rating A)
   
   Coverage: 76.6%
   Duplication: 0.0%
   
   



[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1128288686


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - together with the batchSize, this
+     * limits the load on the repository
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to avoid overloading the repository with writes, the deletion is
+     * done in batches of 50 - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This backup task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledTime = cal.getTime();
+        logger.debug("recreateSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(scheduledTime);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default this gets called 10min after a topology change and then at most
+     * every 10min until the cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);
+
+            final Resource clusterInstances = ResourceHelper.getOrCreateResource(resolver,

Review Comment:
   Agree, fixed at https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/3ac0c77aa5a25bdfa7183f69b68cdd554a1a5829
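
For readers following the quoted `run()` / `cleanup()` pair, the batching contract (delete at most `batchSize` entries, report whether more might remain, reschedule if so) can be sketched in isolation. This is a minimal illustration only: `BatchedCleanupSketch`, `cleanupBatch` and `runsUntilDone` are hypothetical names, the queue stands in for the repository, and the loop stands in for rescheduling via the scheduler.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for the batching behaviour; not the code from the PR. */
final class BatchedCleanupSketch {

    /**
     * Removes at most batchSize entries from the queue of garbage ids.
     *
     * @return true if garbage is left for a later run, false if done
     */
    static boolean cleanupBatch(Deque<String> garbage, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        int deleted = 0;
        while (deleted < batchSize && !garbage.isEmpty()) {
            garbage.poll(); // stands in for deleting one repository node
            deleted++;
        }
        return !garbage.isEmpty();
    }

    /** Counts how many runs are needed until cleanupBatch reports completion. */
    static int runsUntilDone(int garbageCount, int batchSize) {
        final Deque<String> garbage = new ArrayDeque<>();
        for (int i = 0; i < garbageCount; i++) {
            garbage.add("slingId-" + i);
        }
        int runs = 1;
        while (cleanupBatch(garbage, batchSize)) {
            runs++; // the real task would reschedule itself here instead of looping
        }
        return runs;
    }
}
```

With the default batch size of 50, a backlog of 120 entries would need three runs under these assumptions.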




[GitHub] [sling-org-apache-sling-discovery-oak] joerghoh commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "joerghoh (via GitHub)" <gi...@apache.org>.

joerghoh commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1129777878


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,571 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - together with the batchSize, this
+     * limits the load on the repository
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to avoid overloading the repository with writes, the deletion is
+     * done in batches of 50 - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This backup task is in charge of cleaning up old SlingIds from the repository.")

Review Comment:
   ```suggestion
       @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This task is in charge of cleaning up old SlingIds from the repository.")
   ```
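
As a side note on the defaults quoted in the diff above (batch size 50, interval 10min), the resulting throughput can be checked with a short calculation. The sketch below is an illustration under simplifying assumptions: it ignores the initial delay and the time a batch itself takes, and `CleanupThroughputSketch` with its methods is a hypothetical helper, not part of the PR.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for estimating cleanup throughput; not part of the PR. */
final class CleanupThroughputSketch {

    /** Upper bound of deletions per day for a given batch size and interval. */
    static long deletionsPerDay(int batchSize, long intervalMillis) {
        final long batchesPerDay = TimeUnit.DAYS.toMillis(1) / intervalMillis;
        return batchesPerDay * batchSize;
    }

    /** Number of days (rounded up) needed to work through a backlog. */
    static long daysForBacklog(long backlog, int batchSize, long intervalMillis) {
        final long perDay = deletionsPerDay(batchSize, intervalMillis);
        return (backlog + perDay - 1) / perDay; // ceiling division
    }
}
```

With the defaults, `deletionsPerDay(50, 600000)` gives 144 batches of 50, i.e. 7200 deletions per day, so a legacy backlog of 20000 slingIds would take about three days.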



##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,571 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active

Review Comment:
   only "very unlikely"???? 
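
For context on the age threshold discussed here, the three garbage criteria from the class Javadoc (not in the current topology, not in the idmap, leaderElectionId older than the minimum creation age) can be read as one predicate. The following is a minimal sketch assuming the relevant ids and the creation time have already been read from the repository; `GarbageCheckSketch`, `isGarbage` and all parameter names are hypothetical and do not reflect the PR's actual method signatures.

```java
import java.util.Set;

/** Hypothetical predicate mirroring the Javadoc criteria; not the PR's code. */
final class GarbageCheckSketch {

    static boolean isGarbage(String slingId, Set<String> activeSlingIds,
            Set<String> idMapSlingIds, long createdAtMillis, long nowMillis,
            long minCreationAgeMillis) {
        if (activeSlingIds.contains(slingId)) {
            return false; // still part of the current topology
        }
        if (idMapSlingIds.contains(slingId)) {
            return false; // still referenced by the idmap
        }
        // only old entries qualify: the age check is the safety margin for
        // instances that exist but are temporarily not in the topology
        return nowMillis - createdAtMillis > minCreationAgeMillis;
    }
}
```

All three conditions must hold, so an instance that merely dropped out of the topology recently is not treated as garbage under this reading.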




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1128259111


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This backup task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but lets stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("resetSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date inFiveMinutes = cal.getTime();
+        logger.debug("resetSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(inFiveMinutes);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default it is first called 10min after a topology change and then at most
+     * every 10min until cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);

Review Comment:
   I guess the reason for this habit is that the exception text then ends up on the same line as the message, whereas, IIUC, the stack trace starts on the next line.
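   To illustrate where the exception text ends up, here is a minimal plain-Java sketch. It does not use SLF4J: `LogLineSketch`, `render` and `firstLine` are hypothetical names, and `render` merely imitates a layout that prints the message line followed by the stack trace. The real output depends on the configured logging backend and its pattern.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical stand-in for a log layout; not SLF4J and not code from the PR.
final class LogLineSketch {

    /** Writes the message on its own line, then the stack trace (if any) below it. */
    static String render(String message, Throwable t) {
        StringWriter out = new StringWriter();
        PrintWriter pw = new PrintWriter(out);
        pw.println(message);
        if (t != null) {
            t.printStackTrace(pw); // starts on the line after the message
        }
        pw.flush();
        return out.toString();
    }

    /** Returns only the first line of the rendered output. */
    static String firstLine(String rendered) {
        int end = rendered.indexOf(System.lineSeparator());
        return end < 0 ? rendered : rendered.substring(0, end);
    }
}
```

   With this stand-in, `render("... : " + e, e)` carries the exception's `toString()` on the first line, which helps when grepping logs line by line, while `render("...", e)` shows it only in the trace below.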




[GitHub] [sling-org-apache-sling-discovery-oak] rishabhdaim commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "rishabhdaim (via GitHub)" <gi...@apache.org>.

rishabhdaim commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1133442945


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,601 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it was never seen in previous topologies by the current leader instance</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ * <p>
+ * Additionally, the cleanup is skipped for 13 hours after a successful cleanup.
+ * This is to avoid unnecessary load on the repository. The number of 13
+ * incorporates some heuristics such as : about 2 cleanup rounds per day maximum
+ * makes sense, if a leader is very long living, then the 1 additional hour
+ * makes it spread somewhat throughout the day. This is to further minimize any
+ * load side-effects.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /** default minimal cleanup delay at 13h, to intraday load balance */
+    final static long MIN_CLEANUP_DELAY_MILLIS = 46800000;
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    private long lastSuccessfulRun = -1;
+
+    /**
+     * Minimal delay after a successful cleanup round, in millis
+     */
+    private long minCleanupDelayMillis = MIN_CLEANUP_DELAY_MILLIS;
+
+    /**
+     * contains all slingIds ever seen by this instance - should not be a long list
+     * so not a memory issue
+     */
+    private Set<String> seenInstances = new HashSet<>();
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis, long minCleanupDelayMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.minCleanupDelayMillis = minCleanupDelayMillis;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: enabled = {}, initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                isEnabled(), initialDelayMillis, intervalMillis, batchSize,
+                minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        if (!isEnabled()) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+            logger.debug("handleTopologyEvent: slingId cleanup is disabled");
+            return;
+        }
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            seenInstances.addAll(getActiveSlingIdsFrom(newView));
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but lets stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * Reads the system property that enables or disables this task
+     */
+    private static boolean isEnabled() {
+        final String systemPropertyValue = System
+                .getProperty(SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME);
+        return standardConverter().convert(systemPropertyValue).defaultValue(false)
+                .to(Boolean.class);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledDate = cal.getTime();
+        logger.debug(
+                "recreateSchedule: scheduling a cleanup in {} milliseconds from now, which is: {}",
+                delayMillis, scheduledDate);
+        ScheduleOptions options = localScheduler.AT(scheduledDate);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default it is first called 10min after a topology change and then at most
+     * every 10min until cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (lastSuccessfulRun > 0 && System.currentTimeMillis()
+                - lastSuccessfulRun < minCleanupDelayMillis) {
+            logger.debug(
+                    "run: last cleanup was {} millis ago, which is less than {} millis, therefore not cleaning up yet.",
+                    System.currentTimeMillis() - lastSuccessfulRun,
+                    minCleanupDelayMillis);
+            recreateSchedule();
+            return;
+        }
+        runCount.incrementAndGet();
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info(
+                "run: slingId cleanup done, run counter = {}, delete counter = {}, completion counter = {}",
+                getRunCount(), getDeleteCount(), getCompletionCount());
+        lastSuccessfulRun = System.currentTimeMillis();
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    private boolean cleanup() {
+        logger.debug("cleanup: start");
+        if (!isEnabled()) {
+            // bit of overkill probably, as this shouldn't happen.
+            // but adds to a good night's sleep.
+            logger.debug("cleanup: not enabled, stopping.");
+            return false;
+        }
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.debug("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = getActiveSlingIdsFrom(localCurrentView);
+        try (ResourceResolver resolver = localFactory.getServiceResourceResolver(null)) {
+            final Resource clusterInstances = resolver
+                    .getResource(localConfig.getClusterInstancesPath());
+            final Resource idMap = resolver.getResource(localConfig.getIdMapPath());
+            final Resource syncTokens = resolver
+                    .getResource(localConfig.getSyncTokenPath());
+            if (clusterInstances == null || idMap == null || syncTokens == null) {
+                logger.warn("cleanup: no resource found at {}, {} or {}, stopping.",
+                        localConfig.getClusterInstancesPath(), localConfig.getIdMapPath(),
+                        localConfig.getSyncTokenPath());
+                return false;
+            }
+            resolver.refresh();
+            final ValueMap idMapMap = idMap.adaptTo(ValueMap.class);
+            final ModifiableValueMap syncTokenMap = syncTokens
+                    .adaptTo(ModifiableValueMap.class);
+            final Calendar now = Calendar.getInstance();
+            int removed = 0;
+            boolean mightHaveMore = false;
+            final int localBatchSize = batchSize;
+            final long localMinCreationAgeMillis = minCreationAgeMillis;
+            for (Resource resource : clusterInstances.getChildren()) {
+                if (!topologyUnchanged(localCurrentView)) {
+                    return true;
+                }
+                final String slingId = resource.getName();
+                if (deleteIfOldSlingId(resource, slingId, syncTokenMap, idMapMap,
+                        activeSlingIds, now, localMinCreationAgeMillis)) {
+                    if (++removed >= localBatchSize) {
+                        // we need to stop
+                        mightHaveMore = true;
+                        break;
+                    }
+                }
+            }
+            // if we're not already at the batch limit, check syncTokens too
+            if (!mightHaveMore) {
+                for (String slingId : syncTokenMap.keySet()) {
+                    try {
+                        UUID.fromString(slingId);
+                    } catch (Exception e) {
+                        // not a uuid
+                        continue;
+                    }
+                    if (!topologyUnchanged(localCurrentView)) {
+                        return true;
+                    }
+                    Resource resourceOrNull = clusterInstances.getChild(slingId);
+                    if (deleteIfOldSlingId(resourceOrNull, slingId, syncTokenMap,
+                            idMapMap, activeSlingIds, now, localMinCreationAgeMillis)) {
+                        if (++removed >= localBatchSize) {
+                            // we need to stop
+                            mightHaveMore = true;
+                            break;
+                        }
+                    }
+                }
+            }
+            if (!topologyUnchanged(localCurrentView)) {
+                return true;
+            }
+            if (removed > 0) {
+                // only if we removed something we commit
+                resolver.commit();
+                logger.info(
+                        "cleanup : removed {} old slingIds (batch size : {}), potentially has more: {}",
+                        removed, localBatchSize, mightHaveMore);
+                deleteCount.addAndGet(removed);
+            }
+            firstRun = false;
+            completionCount.incrementAndGet();
+            return mightHaveMore;
+        } catch (LoginException e) {
+            logger.error("cleanup: could not log in administratively: " + e, e);
+            throw new RuntimeException("Could not log in to repository (" + e + ")", e);
+        } catch (PersistenceException e) {
+            logger.error("cleanup: got a PersistenceException: " + e, e);
+            throw new RuntimeException(
+                    "Exception while talking to repository (" + e + ")", e);
+        } finally {
+            logger.debug("cleanup: done.");
+        }
+    }
+
+    private Set<String> getActiveSlingIdsFrom(final TopologyView localCurrentView) {
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+        return activeSlingIds;

Review Comment:
   ```suggestion
           localCurrentView.getLocalInstance().getClusterView()
                   .getInstances().stream().map(InstanceDescription::getSlingId).collect(Collectors.toSet());
   ```
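
For context, a minimal self-contained sketch of what the stream-based variant could look like as a complete method. The nested `InstanceDescription` interface and the `activeSlingIds` helper are hypothetical stand-ins, not the Sling API; the sketch also assumes the caller does not rely on getting a `HashSet` back, since `Collectors.toSet()` makes no promise about the concrete set type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ActiveSlingIds {

    /** Hypothetical stand-in for org.apache.sling.discovery.InstanceDescription. */
    public interface InstanceDescription {
        String getSlingId();
    }

    /** Collects the slingIds of all given instances into a set (duplicates collapse). */
    public static Set<String> activeSlingIds(List<InstanceDescription> instances) {
        return instances.stream()
                .map(InstanceDescription::getSlingId)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        InstanceDescription a = () -> "id-a";
        InstanceDescription b = () -> "id-b";
        InstanceDescription duplicate = () -> "id-a";
        // two distinct slingIds remain after collecting into a set
        System.out.println(activeSlingIds(Arrays.asList(a, b, duplicate)).size());
    }
}
```

In the task itself the list would come from `localCurrentView.getLocalInstance().getClusterView().getInstances()`, and the expression needs a leading `return` to replace the loop-based method body.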




[GitHub] [sling-org-apache-sling-discovery-oak] rishabhdaim commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "rishabhdaim (via GitHub)" <gi...@apache.org>.

rishabhdaim commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1133454700


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,601 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it was never seen in previous topologies by the now leader instance</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ * <p>
+ * Additionally, the cleanup is skipped for 13 hours after a successful cleanup.
+ * This is to avoid unnecessary load on the repository. The number of 13
+ * incorporates some heuristics such as : about 2 cleanup rounds per day maximum
+ * makes sense, if a leader is very long living, then the 1 additional hour
+ * makes it spread somewhat throughout the day. This is to further minimize any
+ * load side-effects.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /** default minimal cleanup delay at 13h, to intraday load balance */
+    final static long MIN_CLEANUP_DELAY_MILLIS = 46800000;
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 deletion every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    private long lastSuccessfulRun = -1;
+
+    /**
+     * Minimal delay after a successful cleanup round, in millis
+     */
+    private long minCleanupDelayMillis = MIN_CLEANUP_DELAY_MILLIS;
+
+    /**
+     * contains all slingIds ever seen by this instance - should not be a long list
+     * so not a memory issue
+     */
+    private Set<String> seenInstances = new HashSet<>();
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis, long minCleanupDelayMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.minCleanupDelayMillis = minCleanupDelayMillis;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: enabled = {}, initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                isEnabled(), initialDelayMillis, intervalMillis, batchSize,
+                minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        if (!isEnabled()) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+            logger.debug("handleTopologyEvent: slingId cleanup is disabled");
+            return;
+        }
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            seenInstances.addAll(getActiveSlingIdsFrom(newView));
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but lets stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * Reads the system property that enables or disables this task
+     */
+    private static boolean isEnabled() {
+        final String systemPropertyValue = System
+                .getProperty(SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME);
+        return standardConverter().convert(systemPropertyValue).defaultValue(false)
+                .to(Boolean.class);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledDate = cal.getTime();
+        logger.debug(
+                "recreateSchedule: scheduling a cleanup in {} milliseconds from now, which is: {}",
+                delayMillis, scheduledDate);
+        ScheduleOptions options = localScheduler.AT(scheduledDate);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler triggered from recreateSchedule(). By
+     * default should get called at max every 10 minutes until cleanup is done or
+     * 10min after a topology change.
+     */
+    @Override
+    public void run() {
+        if (lastSuccessfulRun > 0 && System.currentTimeMillis()
+                - lastSuccessfulRun < minCleanupDelayMillis) {
+            logger.debug(
+                    "run: last cleanup was {} millis ago, which is less than {} millis, therefore not cleaning up yet.",
+                    System.currentTimeMillis() - lastSuccessfulRun,
+                    minCleanupDelayMillis);
+            recreateSchedule();
+            return;
+        }
+        runCount.incrementAndGet();
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info(
+                "run: slingId cleanup done, run counter = {}, delete counter = {}, completion counter = {}",
+                getRunCount(), getDeleteCount(), getCompletionCount());
+        lastSuccessfulRun = System.currentTimeMillis();
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    private boolean cleanup() {
+        logger.debug("cleanup: start");
+        if (!isEnabled()) {
+            // bit of overkill probably, as this shouldn't happen.
+            // but adds to a good night's sleep.
+            logger.debug("cleanup: not enabled, stopping.");
+            return false;
+        }
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.debug("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = getActiveSlingIdsFrom(localCurrentView);
+        try (ResourceResolver resolver = localFactory.getServiceResourceResolver(null)) {
+            final Resource clusterInstances = resolver
+                    .getResource(localConfig.getClusterInstancesPath());
+            final Resource idMap = resolver.getResource(localConfig.getIdMapPath());
+            final Resource syncTokens = resolver
+                    .getResource(localConfig.getSyncTokenPath());
+            if (clusterInstances == null || idMap == null || syncTokens == null) {
+                logger.warn("cleanup: no resource found at {}, {} or {}, stopping.",
+                        localConfig.getClusterInstancesPath(), localConfig.getIdMapPath(),
+                        localConfig.getSyncTokenPath());
+                return false;
+            }
+            resolver.refresh();
+            final ValueMap idMapMap = idMap.adaptTo(ValueMap.class);
+            final ModifiableValueMap syncTokenMap = syncTokens
+                    .adaptTo(ModifiableValueMap.class);
+            final Calendar now = Calendar.getInstance();
+            int removed = 0;
+            boolean mightHaveMore = false;
+            final int localBatchSize = batchSize;
+            final long localMinCreationAgeMillis = minCreationAgeMillis;
+            for (Resource resource : clusterInstances.getChildren()) {
+                if (!topologyUnchanged(localCurrentView)) {
+                    return true;
+                }
+                final String slingId = resource.getName();
+                if (deleteIfOldSlingId(resource, slingId, syncTokenMap, idMapMap,
+                        activeSlingIds, now, localMinCreationAgeMillis)) {
+                    if (++removed >= localBatchSize) {
+                        // we need to stop
+                        mightHaveMore = true;
+                        break;
+                    }
+                }
+            }
+            // if we're not already at the batch limit, check syncTokens too
+            if (!mightHaveMore) {
+                for (String slingId : syncTokenMap.keySet()) {
+                    try {
+                        UUID.fromString(slingId);
+                    } catch (Exception e) {
+                        // not a uuid
+                        continue;
+                    }
+                    if (!topologyUnchanged(localCurrentView)) {
+                        return true;
+                    }
+                    Resource resourceOrNull = clusterInstances.getChild(slingId);
+                    if (deleteIfOldSlingId(resourceOrNull, slingId, syncTokenMap,
+                            idMapMap, activeSlingIds, now, localMinCreationAgeMillis)) {
+                        if (++removed >= localBatchSize) {
+                            // we need to stop
+                            mightHaveMore = true;
+                            break;
+                        }
+                    }
+                }
+            }
+            if (!topologyUnchanged(localCurrentView)) {
+                return true;
+            }
+            if (removed > 0) {
+                // only if we removed something we commit
+                resolver.commit();
+                logger.info(
+                        "cleanup : removed {} old slingIds (batch size : {}), potentially has more: {}",
+                        removed, localBatchSize, mightHaveMore);
+                deleteCount.addAndGet(removed);
+            }
+            firstRun = false;
+            completionCount.incrementAndGet();
+            return mightHaveMore;
+        } catch (LoginException e) {
+            logger.error("cleanup: could not log in administratively: " + e, e);
+            throw new RuntimeException("Could not log in to repository (" + e + ")", e);
+        } catch (PersistenceException e) {
+            logger.error("cleanup: got a PersistenceException: " + e, e);
+            throw new RuntimeException(
+                    "Exception while talking to repository (" + e + ")", e);
+        } finally {
+            logger.debug("cleanup: done.");
+        }
+    }
+
+    private Set<String> getActiveSlingIdsFrom(final TopologyView localCurrentView) {
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+        return activeSlingIds;
+    }
+
+    private boolean topologyUnchanged(TopologyView localCurrentView) {

Review Comment:
   Wherever we are using this method, we are negating its result.
   
   I would suggest renaming and inverting this method from `topologyUnchanged` to `hasTopologyChanged`.
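
A rough sketch of how the inverted method might read. The body of `topologyUnchanged` is not visible in the quoted diff, so the check below is an assumption based on the `hasTopology` and `currentView` fields; `View`, `SimpleView` and `onNewView` are hypothetical stand-ins for `TopologyView` and the topology listener.

```java
public class TopologyCheck {

    /** Hypothetical stand-in for org.apache.sling.discovery.TopologyView. */
    public interface View {
        boolean isCurrent();
    }

    /** Trivial view implementation used only for this illustration. */
    public static class SimpleView implements View {
        @Override
        public boolean isCurrent() {
            return true;
        }
    }

    private volatile boolean hasTopology;
    private volatile View currentView;

    /** Simplified version of what the topology listener does on a new view. */
    public void onNewView(View view) {
        currentView = view;
        hasTopology = view != null;
    }

    /**
     * Assumed logic: the topology counts as changed when there is no topology,
     * the snapshot is no longer the tracked view, or the snapshot is not current.
     */
    public boolean hasTopologyChanged(View snapshot) {
        return !hasTopology || snapshot == null || currentView != snapshot
                || !snapshot.isCurrent();
    }

    public static void main(String[] args) {
        TopologyCheck check = new TopologyCheck();
        View first = new SimpleView();
        check.onNewView(first);
        System.out.println(check.hasTopologyChanged(first)); // false
        check.onNewView(new SimpleView());
        System.out.println(check.hasTopologyChanged(first)); // true
    }
}
```

With this shape the call sites in `cleanup()` would read `if (hasTopologyChanged(localCurrentView)) { return true; }` without the negation.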
   




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1460542347

   Marked this PR as ready-for-review, as the main features that were intended are now done.



[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1130916377


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,571 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active

Review Comment:
   Marking this conversation as closed; please reopen if you disagree, thanks.
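
The three garbage criteria listed in the class Javadoc quoted above (not in the current topology, not in the current idmap, leaderElectionId older than the minimum creation age) can be read as a single predicate. The following is only a minimal sketch under stated assumptions: the class `GarbageCheck`, the method `isGarbage` and all of its parameters are hypothetical names chosen for illustration and are not taken from the PR, which reads these values from the repository and the `TopologyView`.

```java
import java.util.Set;

public class GarbageCheck {

    /** 1 week in milliseconds, matching the 604800000 default quoted in the diff. */
    static final long ONE_WEEK_MILLIS = 7L * 24 * 60 * 60 * 1000;

    /**
     * Hypothetical helper: returns true only if all three criteria from the
     * Javadoc hold for the given slingId.
     */
    static boolean isGarbage(String slingId, Set<String> topologySlingIds,
            Set<String> idMapSlingIds, long leaderElectionCreatedAtMillis,
            long nowMillis, long minCreationAgeMillis) {
        if (topologySlingIds.contains(slingId)) {
            return false; // still part of the current topology
        }
        if (idMapSlingIds.contains(slingId)) {
            return false; // still referenced by the current idmap
        }
        // old enough: the leaderElectionId was created more than the minimum age ago
        return nowMillis - leaderElectionCreatedAtMillis > minCreationAgeMillis;
    }
}
```

With the default of one week, an id that is absent from both sets only qualifies once its leaderElectionId is more than 604800000 ms old.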




[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1462081335

   Kudos, SonarCloud Quality Gate passed!
   https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13

   0 Bugs (rating A)
   0 Vulnerabilities (rating A)
   0 Security Hotspots (rating A)
   30 Code Smells (rating A)

   86.6% Coverage
   0.0% Duplication
   
   



[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1129797751


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,571 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 8640 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This backup task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: enabled = {}, initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                isEnabled(), initialDelayMillis, intervalMillis, batchSize,
+                minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        if (!isEnabled()) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+            logger.info("handleTopologyEvent: slingId cleanup is disabled");

Review Comment:
   +1
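
The throughput figures in the `DEFAULT_CLEANUP_BATCH_SIZE` Javadoc quoted above can be checked with a few lines of arithmetic. This is a rough sketch, assuming every run deletes a full batch and ignoring the initial delay and the run time itself; the class and method names are hypothetical. With the defaults (50 deletions every 600000 ms) it gives one deletion per 12 seconds on average, i.e. 5 per minute and 7200 per day, which is lower than the 8640 per day mentioned in the Javadoc (that figure would correspond to one deletion every 10 seconds).

```java
public class CleanupThroughput {

    /** Deletions per day, counting only whole runs and assuming each run deletes a full batch. */
    static long deletionsPerDay(int batchSize, long intervalMillis) {
        final long millisPerDay = 24L * 60 * 60 * 1000;
        return batchSize * (millisPerDay / intervalMillis);
    }

    /** Average number of seconds between two deletions under the same assumption. */
    static double secondsPerDeletion(int batchSize, long intervalMillis) {
        return (intervalMillis / 1000.0) / batchSize;
    }

    public static void main(String[] args) {
        // defaults quoted in the diff: batch size 50, interval 10min
        System.out.println(secondsPerDeletion(50, 600000)); // 12.0
        System.out.println(deletionsPerDay(50, 600000));    // 7200
    }
}
```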




[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1461579412

   Kudos, SonarCloud Quality Gate passed!
   https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13

   0 Bugs (rating A)
   0 Vulnerabilities (rating A)
   0 Security Hotspots (rating A)
   33 Code Smells (rating A)

   85.3% Coverage
   0.0% Duplication
   
   



[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1130743528


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,585 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    private final static long MIN_CLEANUP_DELAY_MILLIS = 46800000; // 13 hours, to intraday load balance
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 8640 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    private long lastSuccessfulRun = -1;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: enabled = {}, initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                isEnabled(), initialDelayMillis, intervalMillis, batchSize,
+                minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        if (!isEnabled()) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+            logger.info("handleTopologyEvent: slingId cleanup is disabled");
+            return;
+        }
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * Reads the system property that enables or disables this task
+     */
+    private static boolean isEnabled() {
+        final String systemPropertyValue = System
+                .getProperty(SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME);
+        return standardConverter().convert(systemPropertyValue).defaultValue(false)
+                .to(Boolean.class);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledDate = cal.getTime();
+        logger.debug(
+                "recreateSchedule: scheduling a cleanup in {} milliseconds from now, which is: {}",
+                delayMillis, scheduledDate);
+        ScheduleOptions options = localScheduler.AT(scheduledDate);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default should get called at most every 10 minutes until cleanup is done, or
+     * 10min after a topology change.
+     */
+    @Override
+    public void run() {
+        if (lastSuccessfulRun > 0 && System.currentTimeMillis()
+                - lastSuccessfulRun < MIN_CLEANUP_DELAY_MILLIS) {
+            logger.debug(
+                    "run: last cleanup was {} millis ago, which is less than {} millis, therefore not cleaning up yet.",
+                    System.currentTimeMillis() - lastSuccessfulRun,
+                    MIN_CLEANUP_DELAY_MILLIS);
+            recreateSchedule();
+            return;
+        }
+        runCount.incrementAndGet();
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info(
+                "run: slingId cleanup done, run counter = {}, delete counter = {}, completion counter = {}",
+                getRunCount(), getDeleteCount(), getCompletionCount());
+        lastSuccessfulRun = System.currentTimeMillis();
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    private boolean cleanup() {

Review Comment:
   Yes, I agree. I was actually trying to shorten it but didn't find any further low-hanging fruit. But let me revisit it.
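
For readers following the hunk above, the throttle at the top of `run()` can be looked at in isolation. The following is a minimal sketch and not code from the PR: the class `CleanupThrottle` and the method `shouldSkip` are hypothetical names introduced for illustration. Only the 13h value (46800000 ms) and the convention that a non-positive `lastSuccessfulRun` means "no successful round yet" are taken from the diff.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of the PR): isolates the throttle check done in run(). */
public class CleanupThrottle {

    /** 13 hours in milliseconds, the same value as MIN_CLEANUP_DELAY_MILLIS (46800000). */
    public static final long MIN_CLEANUP_DELAY_MILLIS = TimeUnit.HOURS.toMillis(13);

    /**
     * Returns true if a cleanup round should be skipped because the last
     * successful round finished less than minDelayMillis ago. A non-positive
     * lastSuccessfulRun means no round has completed yet, so nothing is skipped.
     */
    public static boolean shouldSkip(long lastSuccessfulRun, long nowMillis,
            long minDelayMillis) {
        return lastSuccessfulRun > 0 && nowMillis - lastSuccessfulRun < minDelayMillis;
    }

    public static void main(String[] args) {
        final long now = System.currentTimeMillis();
        // never ran successfully: do not skip
        System.out.println(shouldSkip(-1, now, MIN_CLEANUP_DELAY_MILLIS));
        // last success 1h ago: skip, the caller would only reschedule
        System.out.println(shouldSkip(now - TimeUnit.HOURS.toMillis(1), now,
                MIN_CLEANUP_DELAY_MILLIS));
        // last success 14h ago: do not skip
        System.out.println(shouldSkip(now - TimeUnit.HOURS.toMillis(14), now,
                MIN_CLEANUP_DELAY_MILLIS));
    }
}
```

Passing the current time in as a parameter, rather than calling `System.currentTimeMillis()` inside the check, is a choice of this sketch; it keeps the check deterministic in tests.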




[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1461932940

   Kudos, SonarCloud Quality Gate passed! [Quality Gate passed](https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13)
   
   [0 Bugs](https://sonarcloud.io/project/issues?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13&resolved=false&types=BUG)  
   [0 Vulnerabilities](https://sonarcloud.io/project/issues?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13&resolved=false&types=VULNERABILITY)  
   [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13&resolved=false&types=SECURITY_HOTSPOT)  
   [30 Code Smells](https://sonarcloud.io/project/issues?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13&resolved=false&types=CODE_SMELL)
   
   [86.6% Coverage](https://sonarcloud.io/component_measures?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13&metric=new_coverage&view=list)  
   [0.0% Duplication](https://sonarcloud.io/component_measures?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13&metric=new_duplicated_lines_density&view=list)
   
   



[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1130864559


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,571 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active

Review Comment:
   Added code to address this in https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/b57b2e1d919f345a59359db179fe7fa8e265b2c2
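
As a reading aid for the class Javadoc in the hunk above, the three garbage criteria it lists can be sketched as a single predicate. This is a hedged illustration only and not the PR's implementation: the class `GarbageCriteria`, the method `isGarbage` and all of its parameters are hypothetical, and it is assumed here that the creation time of the leaderElectionId is already available as epoch milliseconds. How the PR actually derives that time is not shown in this hunk.

```java
import java.util.Set;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration of the garbage criteria from the class Javadoc. */
public class GarbageCriteria {

    /** 1 week in milliseconds, the same value as DEFAULT_MIN_CREATION_AGE_MILLIS (604800000). */
    public static final long DEFAULT_MIN_CREATION_AGE_MILLIS = TimeUnit.DAYS.toMillis(7);

    /**
     * A slingId is treated as garbage only if all three conditions hold:
     * it is not in the current topology, it is not in the current idmap,
     * and its leaderElectionId was created more than minCreationAgeMillis ago.
     */
    public static boolean isGarbage(String slingId, Set<String> activeSlingIds,
            Set<String> idMapSlingIds, long leaderElectionCreatedAtMillis,
            long nowMillis, long minCreationAgeMillis) {
        if (activeSlingIds.contains(slingId)) {
            return false; // still part of the current topology
        }
        if (idMapSlingIds.contains(slingId)) {
            return false; // still referenced by the idmap
        }
        // assumption: creation time is passed in as epoch milliseconds
        return nowMillis - leaderElectionCreatedAtMillis > minCreationAgeMillis;
    }
}
```

Checking the cheap set memberships first means the age comparison only matters for ids that are no longer referenced anywhere.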




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1133902386


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,598 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.CopyOnWriteArraySet;
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.stream.Collectors;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it was never seen in previous topologies by the current leader instance</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ * <p>
+ * Additionally, the cleanup is skipped for 13 hours after a successful cleanup.
+ * This is to avoid unnecessary load on the repository. The value of 13 hours
+ * is a heuristic : a maximum of about 2 cleanup rounds per day makes sense,
+ * and if a leader is very long-lived, the 1 additional hour (beyond 12) spreads
+ * the rounds somewhat throughout the day. This is to further minimize any
+ * load side-effects.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /** default minimal cleanup delay of 13h, to spread the load throughout the day */
+    final static long MIN_CLEANUP_DELAY_MILLIS = 46800000;
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    private long lastSuccessfulRun = -1;
+
+    /**
+     * Minimal delay after a successful cleanup round, in millis
+     */
+    private long minCleanupDelayMillis = MIN_CLEANUP_DELAY_MILLIS;
+
+    /**
+     * contains all slingIds ever seen by this instance - should not be a long list
+     * so not a memory issue
+     */
+    private Set<String> seenInstances = new CopyOnWriteArraySet<>();
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis, long minCleanupDelayMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.minCleanupDelayMillis = minCleanupDelayMillis;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.slingid_cleanup_initial_delay(),
+                config.slingid_cleanup_interval(),
+                config.slingid_cleanup_batchsize(),
+                config.slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: enabled = {}, initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                isEnabled(), initialDelayMillis, intervalMillis, batchSize,
+                minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        if (!isEnabled()) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+            logger.debug("handleTopologyEvent: slingId cleanup is disabled");
+            return;
+        }
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            seenInstances.addAll(getActiveSlingIds(newView));
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * Reads the system property that enables or disables this task
+     */
+    private static boolean isEnabled() {
+        final String systemPropertyValue = System
+                .getProperty(SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME);
+        return standardConverter().convert(systemPropertyValue).defaultValue(false)
+                .to(Boolean.class);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledDate = cal.getTime();
+        logger.debug(
+                "recreateSchedule: scheduling a cleanup in {} milliseconds from now, which is: {}",
+                delayMillis, scheduledDate);
+        ScheduleOptions options = localScheduler.AT(scheduledDate);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default should get called at most every 10 minutes until cleanup is done, or
+     * 10min after a topology change.
+     */
+    @Override
+    public void run() {
+        if (lastSuccessfulRun > 0 && System.currentTimeMillis()
+                - lastSuccessfulRun < minCleanupDelayMillis) {
+            logger.debug(
+                    "run: last cleanup was {} millis ago, which is less than {} millis, therefore not cleaning up yet.",
+                    System.currentTimeMillis() - lastSuccessfulRun,
+                    minCleanupDelayMillis);
+            recreateSchedule();
+            return;
+        }
+        runCount.incrementAndGet();
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info(
+                "run: slingId cleanup done, run counter = {}, delete counter = {}, completion counter = {}",
+                getRunCount(), getDeleteCount(), getCompletionCount());
+        lastSuccessfulRun = System.currentTimeMillis();
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    private boolean cleanup() {
+        logger.debug("cleanup: start");
+        if (!isEnabled()) {
+            // bit of overkill probably, as this shouldn't happen.
+            // but adds to a good night's sleep.
+            logger.debug("cleanup: not enabled, stopping.");
+            return false;
+        }
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.debug("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = getActiveSlingIds(localCurrentView);
+        try (ResourceResolver resolver = localFactory.getServiceResourceResolver(null)) {
+            final Resource clusterInstances = resolver
+                    .getResource(localConfig.getClusterInstancesPath());
+            final Resource idMap = resolver.getResource(localConfig.getIdMapPath());
+            final Resource syncTokens = resolver
+                    .getResource(localConfig.getSyncTokenPath());
+            if (clusterInstances == null || idMap == null || syncTokens == null) {
+                logger.warn("cleanup: no resource found at {}, {} or {}, stopping.",
+                        localConfig.getClusterInstancesPath(), localConfig.getIdMapPath(),
+                        localConfig.getSyncTokenPath());
+                return false;
+            }
+            resolver.refresh();
+            final ValueMap idMapMap = idMap.adaptTo(ValueMap.class);
+            final ModifiableValueMap syncTokenMap = syncTokens
+                    .adaptTo(ModifiableValueMap.class);
+            final Calendar now = Calendar.getInstance();
+            int removed = 0;
+            boolean mightHaveMore = false;
+            final int localBatchSize = batchSize;
+            final long localMinCreationAgeMillis = minCreationAgeMillis;
+            for (Resource resource : clusterInstances.getChildren()) {
+                if (topologyChanged(localCurrentView)) {
+                    return true;
+                }
+                final String slingId = resource.getName();
+                if (deleteIfOldSlingId(resource, slingId, syncTokenMap, idMapMap,
+                        activeSlingIds, now, localMinCreationAgeMillis)) {
+                    if (++removed >= localBatchSize) {
+                        // we need to stop
+                        mightHaveMore = true;
+                        break;
+                    }
+                }
+            }
+            // if we're not already at the batch limit, check syncTokens too
+            if (!mightHaveMore) {
+                for (String slingId : syncTokenMap.keySet()) {
+                    try {
+                        UUID.fromString(slingId);
+                    } catch (Exception e) {
+                        // not a uuid
+                        continue;
+                    }
+                    if (topologyChanged(localCurrentView)) {
+                        return true;
+                    }
+                    Resource resourceOrNull = clusterInstances.getChild(slingId);
+                    if (deleteIfOldSlingId(resourceOrNull, slingId, syncTokenMap,
+                            idMapMap, activeSlingIds, now, localMinCreationAgeMillis)) {
+                        if (++removed >= localBatchSize) {
+                            // we need to stop
+                            mightHaveMore = true;
+                            break;
+                        }
+                    }
+                }
+            }
+            if (topologyChanged(localCurrentView)) {
+                return true;
+            }
+            if (removed > 0) {
+                // only if we removed something we commit
+                resolver.commit();
+                logger.info(
+                        "cleanup : removed {} old slingIds (batch size : {}), potentially has more: {}",
+                        removed, localBatchSize, mightHaveMore);
+                deleteCount.addAndGet(removed);
+            }
+            firstRun = false;
+            completionCount.incrementAndGet();
+            return mightHaveMore;
+        } catch (LoginException e) {
+            logger.error("cleanup: could not log in administratively: " + e, e);
+            throw new RuntimeException("Could not log in to repository (" + e + ")", e);
+        } catch (PersistenceException e) {
+            logger.error("cleanup: got a PersistenceException: " + e, e);
+            throw new RuntimeException(
+                    "Exception while talking to repository (" + e + ")", e);
+        } finally {
+            logger.debug("cleanup: done.");
+        }
+    }
+
+    private Set<String> getActiveSlingIds(final TopologyView localCurrentView) {
+        return localCurrentView.getLocalInstance().getClusterView()
+                .getInstances().stream().map(InstanceDescription::getSlingId).collect(Collectors.toSet());
+    }
+
+    private boolean topologyChanged(TopologyView localCurrentView) {
+        if (!hasTopology || currentView != localCurrentView
+                || !localCurrentView.isCurrent()) {
+            // we got interrupted during cleanup
+            // let's not commit at all then
+            logger.debug(
+                    "topologyChanged : topology changing during cleanup - not committing this time - stopping for now.");
+            return true;
+        } else {
+            return false;
+        }
+    }
+
+    static long millisOf(Object leaderElectionIdCreatedAt) {
+        if (leaderElectionIdCreatedAt == null) {
+            return -1;
+        }
+        if (leaderElectionIdCreatedAt instanceof Date) {
+            final Date d = (Date) leaderElectionIdCreatedAt;
+            return d.getTime();
+        }
+        if (leaderElectionIdCreatedAt instanceof Calendar) {
+            final Calendar c = (Calendar) leaderElectionIdCreatedAt;
+            return c.getTimeInMillis();
+        }
+        return -1;
+    }
+
+    private boolean deleteIfOldSlingId(Resource resourceOrNull, String slingId,
+            ModifiableValueMap syncTokenMap, ValueMap idMapMap,
+            Set<String> activeSlingIds, Calendar now, long localMinCreationAgeMillis)
+            throws PersistenceException {
+        logger.trace("deleteIfOldSlingId : handling slingId = {}", slingId);
+        if (activeSlingIds.contains(slingId)) {
+            logger.trace("deleteIfOldSlingId : slingId is currently active : {}",
+                    slingId);
+            return false;
+        } else if (seenInstances.contains(slingId)) {
+            logger.trace("deleteIfOldSlingId : slingId seen active previously : {}",
+                    slingId);
+            return false;
+        }
+        // only check in idmap and for leaderElectionId details if the clusterInstance
+        // resource is there
+        if (resourceOrNull != null) {
+            Object clusterNodeId = idMapMap.get(slingId);
+            if (clusterNodeId == null) {
+                logger.trace("deleteIfOldSlingId : slingId {} not recently in use : {}",
+                        slingId, clusterNodeId);

Review Comment:
   +1, fixed in https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/388f3fc15d4f1cf2ff6d69a2e1a86dd5e8c3f8bd
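
For readers following the `deleteIfOldSlingId` discussion, here is a minimal stand-alone sketch of the garbage criteria as the class Javadoc describes them (not active, never seen, not in the idmap, older than the minimum creation age). The class `GarbageSlingIdSketch`, the method `isGarbage` and all of its parameters are hypothetical names chosen for illustration; they are not part of the PR. Treating an unknown creation time (`-1`, as returned by `millisOf`) as "keep" is an assumption, since that part of the method is not visible in the quoted hunk.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-alone sketch; not part of the PR. */
final class GarbageSlingIdSketch {

    /**
     * Returns true only if all garbage criteria from the Javadoc hold.
     * A negative createdAtMillis means "unknown" and is treated as
     * "not garbage" (assumption made for this sketch).
     */
    static boolean isGarbage(String slingId, Set<String> activeSlingIds,
            Set<String> seenSlingIds, Set<String> idMapKeys,
            long createdAtMillis, long nowMillis, long minCreationAgeMillis) {
        if (activeSlingIds.contains(slingId) || seenSlingIds.contains(slingId)) {
            return false; // part of the current or a previously seen topology
        }
        if (idMapKeys.contains(slingId)) {
            return false; // still registered in the idmap
        }
        if (createdAtMillis < 0) {
            return false; // unknown creation time: keep, to be safe
        }
        // old enough only if created more than minCreationAgeMillis ago
        return nowMillis - createdAtMillis > minCreationAgeMillis;
    }

    public static void main(String[] args) {
        final long week = 604800000L; // 7 days in milliseconds
        final long now = 10 * week;
        final Set<String> active = new HashSet<>(Arrays.asList("a"));
        final Set<String> seen = new HashSet<>(Arrays.asList("a", "b"));
        final Set<String> idMap = Collections.emptySet();
        System.out.println(isGarbage("old", active, seen, idMap, now - week - 1, now, week));
        System.out.println(isGarbage("young", active, seen, idMap, now - 1000, now, week));
        System.out.println(isGarbage("b", active, seen, idMap, now - 2 * week, now, week));
    }
}
```

Keeping the check a pure function of its inputs makes each criterion easy to unit-test without a repository; the real method additionally performs the deletion and updates the syncTokens map.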




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1133634099


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,601 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it was never seen in previous topologies by the current leader instance</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ * <p>
+ * Additionally, the cleanup is skipped for 13 hours after a successful cleanup.
+ * This is to avoid unnecessary load on the repository. The number of 13
+ * incorporates some heuristics such as : about 2 cleanup rounds per day maximum
+ * makes sense, if a leader is very long living, then the 1 additional hour
+ * makes it spread somewhat throughout the day. This is to further minimize any
+ * load side-effects.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /** default minimum cleanup delay of 13h, to balance load throughout the day */
+    final static long MIN_CLEANUP_DELAY_MILLIS = 46800000;
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    private long lastSuccessfulRun = -1;
+
+    /**
+     * Minimal delay after a successful cleanup round, in millis
+     */
+    private long minCleanupDelayMillis = MIN_CLEANUP_DELAY_MILLIS;
+
+    /**
+     * contains all slingIds ever seen by this instance - should not be a long list
+     * so not a memory issue
+     */
+    private Set<String> seenInstances = new HashSet<>();
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis, long minCleanupDelayMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.minCleanupDelayMillis = minCleanupDelayMillis;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: enabled = {}, initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                isEnabled(), initialDelayMillis, intervalMillis, batchSize,
+                minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        if (!isEnabled()) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+            logger.debug("handleTopologyEvent: slingId cleanup is disabled");
+            return;
+        }
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            seenInstances.addAll(getActiveSlingIdsFrom(newView));
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * Reads the system property that enables or disables this task
+     */
+    private static boolean isEnabled() {
+        final String systemPropertyValue = System
+                .getProperty(SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME);
+        return standardConverter().convert(systemPropertyValue).defaultValue(false)
+                .to(Boolean.class);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe: if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledDate = cal.getTime();
+        logger.debug(
+                "recreateSchedule: scheduling a cleanup in {} milliseconds from now, which is: {}",
+                delayMillis, scheduledDate);
+        ScheduleOptions options = localScheduler.AT(scheduledDate);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default should get called at most every 10 minutes until cleanup is done, or
+     * 10min after a topology change.
+     */
+    @Override
+    public void run() {
+        if (lastSuccessfulRun > 0 && System.currentTimeMillis()
+                - lastSuccessfulRun < minCleanupDelayMillis) {
+            logger.debug(
+                    "run: last cleanup was {} millis ago, which is less than {} millis, therefore not cleaning up yet.",
+                    System.currentTimeMillis() - lastSuccessfulRun,
+                    minCleanupDelayMillis);
+            recreateSchedule();
+            return;
+        }
+        runCount.incrementAndGet();
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info(
+                "run: slingId cleanup done, run counter = {}, delete counter = {}, completion counter = {}",
+                getRunCount(), getDeleteCount(), getCompletionCount());
+        lastSuccessfulRun = System.currentTimeMillis();
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    private boolean cleanup() {
+        logger.debug("cleanup: start");
+        if (!isEnabled()) {
+            // bit of overkill probably, as this shouldn't happen.
+            // but adds to a good night's sleep.
+            logger.debug("cleanup: not enabled, stopping.");
+            return false;
+        }
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.debug("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = getActiveSlingIdsFrom(localCurrentView);
+        try (ResourceResolver resolver = localFactory.getServiceResourceResolver(null)) {
+            final Resource clusterInstances = resolver
+                    .getResource(localConfig.getClusterInstancesPath());
+            final Resource idMap = resolver.getResource(localConfig.getIdMapPath());
+            final Resource syncTokens = resolver
+                    .getResource(localConfig.getSyncTokenPath());
+            if (clusterInstances == null || idMap == null || syncTokens == null) {
+                logger.warn("cleanup: no resource found at {}, {} or {}, stopping.",
+                        localConfig.getClusterInstancesPath(), localConfig.getIdMapPath(),
+                        localConfig.getSyncTokenPath());
+                return false;
+            }
+            resolver.refresh();
+            final ValueMap idMapMap = idMap.adaptTo(ValueMap.class);
+            final ModifiableValueMap syncTokenMap = syncTokens
+                    .adaptTo(ModifiableValueMap.class);
+            final Calendar now = Calendar.getInstance();
+            int removed = 0;
+            boolean mightHaveMore = false;
+            final int localBatchSize = batchSize;
+            final long localMinCreationAgeMillis = minCreationAgeMillis;
+            for (Resource resource : clusterInstances.getChildren()) {
+                if (!topologyUnchanged(localCurrentView)) {
+                    return true;
+                }
+                final String slingId = resource.getName();
+                if (deleteIfOldSlingId(resource, slingId, syncTokenMap, idMapMap,
+                        activeSlingIds, now, localMinCreationAgeMillis)) {
+                    if (++removed >= localBatchSize) {
+                        // we need to stop
+                        mightHaveMore = true;
+                        break;
+                    }
+                }
+            }
+            // if we're not already at the batch limit, check syncTokens too
+            if (!mightHaveMore) {
+                for (String slingId : syncTokenMap.keySet()) {
+                    try {
+                        UUID.fromString(slingId);
+                    } catch (Exception e) {
+                        // not a uuid
+                        continue;
+                    }
+                    if (!topologyUnchanged(localCurrentView)) {
+                        return true;
+                    }
+                    Resource resourceOrNull = clusterInstances.getChild(slingId);
+                    if (deleteIfOldSlingId(resourceOrNull, slingId, syncTokenMap,
+                            idMapMap, activeSlingIds, now, localMinCreationAgeMillis)) {
+                        if (++removed >= localBatchSize) {
+                            // we need to stop
+                            mightHaveMore = true;
+                            break;
+                        }
+                    }
+                }
+            }
+            if (!topologyUnchanged(localCurrentView)) {
+                return true;
+            }
+            if (removed > 0) {
+                // only if we removed something we commit
+                resolver.commit();
+                logger.info(
+                        "cleanup : removed {} old slingIds (batch size : {}), potentially has more: {}",
+                        removed, localBatchSize, mightHaveMore);
+                deleteCount.addAndGet(removed);
+            }
+            firstRun = false;
+            completionCount.incrementAndGet();
+            return mightHaveMore;
+        } catch (LoginException e) {
+            logger.error("cleanup: could not log in administratively: " + e, e);
+            throw new RuntimeException("Could not log in to repository (" + e + ")", e);
+        } catch (PersistenceException e) {
+            logger.error("cleanup: got a PersistenceException: " + e, e);
+            throw new RuntimeException(
+                    "Exception while talking to repository (" + e + ")", e);
+        } finally {
+            logger.debug("cleanup: done.");
+        }
+    }
+
+    private Set<String> getActiveSlingIdsFrom(final TopologyView localCurrentView) {

Review Comment:
   That was the original name; I then changed it to make it more object-oriented, but I'm fine with renaming it back. Done in https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/5959e4dbed5de3ea9e36fab6efdacf086f6e353e




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1128274084


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This backup task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but lets stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("resetSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date inFiveMinutes = cal.getTime();
+        logger.debug("resetSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(inFiveMinutes);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler triggered from recreateSchedule(). By
+     * default should get called at most every 10 minutes until cleanup is done,
+     * or 10min after a topology change.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slnigIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);
+
+            final Resource clusterInstances = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getClusterInstancesPath());
+            final Resource idMap = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getIdMapPath());
+            final Resource syncTokens = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getSyncTokenPath());
+            resolver.revert();
+            resolver.refresh();
+
+            final ValueMap idMapMap = idMap.adaptTo(ValueMap.class);
+            final ModifiableValueMap syncTokenMap = syncTokens
+                    .adaptTo(ModifiableValueMap.class);
+            final Calendar now = Calendar.getInstance();
+            int removed = 0;
+            boolean mightHaveMore = false;
+            int localBatchSize = batchSize;
+            long localMinCreationAgeMillis = minCreationAgeMillis;
+            for (Resource resource : clusterInstances.getChildren()) {
+                if (!hasTopology || currentView != localCurrentView
+                        || !localCurrentView.isCurrent()) {
+                    // we got interrupted during cleanup
+                    // let's not commit at all then
+                    logger.info(
+                            "cleanup : topology changing during cleanup - not committing this time - stopping for now.");
+                    return true;
+                }
+                final String slingId = resource.getName();
+                logger.info("cleanup : handling slingId = {}", slingId);
+                Object clusterNodeId = idMapMap.get(slingId);
+                if (clusterNodeId == null) {
+                    logger.info("cleanup : slingId not recently in use : {}",
+                            clusterNodeId);
+                } else {
+                    logger.info("cleanup : slingId WAS recently in use : {}",
+                            clusterNodeId);
+                    continue;
+                }
+                if (activeSlingIds.contains(slingId)) {
+                    logger.info("cleanup : slingId is currently active : {}", slingId);
+                    continue;
+                }
+                if (deleteIfOldSlingId(resource, syncTokenMap, now,
+                        localMinCreationAgeMillis)) {
+                    if (++removed >= localBatchSize) {
+                        // we need to stop
+                        mightHaveMore = true;
+                        break;
+                    }
+                }
+            }
+            if (!hasTopology) {
+                // we got interrupted during cleanup
+                // let's not commit at all then
+                logger.info(
+                        "cleanup : topology changing during cleanup - not committing this time - stopping for now.");
+                return true;
+            }
+            if (removed > 0) {

Review Comment:
   The goal with `removed` was to have a precise count of how many garbage slingIds were deleted; that's why it's tracked further up.
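
   As an illustration of the counting described in this comment, here is a minimal, self-contained sketch of a batch-limited deletion loop. It is not the PR's code: the class `BatchCleanupSketch`, the `Result` holder, and the `isGarbage` predicate are hypothetical names, and an in-memory list stands in for the repository. Only the `removed` / `mightHaveMore` pattern is taken from the diff.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: an in-memory list stands in for the slingId nodes.
final class BatchCleanupSketch {

    /** Outcome of one batch: exact number of deletions, and whether more may remain. */
    static final class Result {
        final int removed;
        final boolean mightHaveMore;

        Result(int removed, boolean mightHaveMore) {
            this.removed = removed;
            this.mightHaveMore = mightHaveMore;
        }
    }

    /**
     * Deletes garbage entries from the (mutable) candidate list, stopping once
     * batchSize deletions have been made. The counter is incremented only when
     * an entry was actually deleted, so it stays a precise deletion count.
     */
    static Result cleanupBatch(List<String> candidates, Predicate<String> isGarbage,
            int batchSize) {
        int removed = 0;
        boolean mightHaveMore = false;
        for (Iterator<String> it = candidates.iterator(); it.hasNext();) {
            final String slingId = it.next();
            if (!isGarbage.test(slingId)) {
                continue; // kept, does not count towards the batch
            }
            it.remove();
            if (++removed >= batchSize) {
                // batch limit reached: there might be more garbage left
                mightHaveMore = true;
                break;
            }
        }
        return new Result(removed, mightHaveMore);
    }
}
```

   Because the counter only moves on a real deletion, a caller can use `removed > 0` to decide whether a commit is needed at all, and `mightHaveMore` to decide whether to schedule another round.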




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1129810324


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,571 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active

Review Comment:
   Yes, that's the crux: an instance can also be long-lived and never change its slingId; it depends on how it is operated. For example, it could have crashed just a few seconds ago (hence not be active) and be in the process of restarting with the same slingId. The cleanup algorithm covers this case too: the clusterNodeId would (well, only "very likely") still be in /idmap and mapped to this same slingId, in which case the slingId would not be deleted.
   In the worst case there is indeed a race condition between this new cleanup and a Sling instance startup, and the cleanup could disturb that startup. The instance would then have to be restarted.
   But I'd consider that likelihood "very unlikely" as well ;)
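
   To make the three conditions discussed here concrete, the following is a rough sketch of the garbage check as a pure function. It is an assumption-laden illustration, not the PR's implementation: the class `SlingIdGarbageSketch`, the method `isGarbage`, and the modelling of the idmap as a `Map` from slingId to clusterNodeId are hypothetical, and timestamps are passed in as plain milliseconds.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the decision only; no repository access involved.
final class SlingIdGarbageSketch {

    /** One week in milliseconds, matching the default quoted in the diff (604800000). */
    static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 7L * 24 * 60 * 60 * 1000;

    /**
     * A slingId is treated as garbage only if all three conditions hold:
     * it is not in the current topology, it is not in the idmap, and it was
     * created more than minCreationAgeMillis ago.
     */
    static boolean isGarbage(String slingId, Set<String> activeSlingIds,
            Map<String, Long> idMap, long createdAtMillis, long nowMillis,
            long minCreationAgeMillis) {
        if (activeSlingIds.contains(slingId)) {
            return false; // currently part of the topology
        }
        if (idMap.containsKey(slingId)) {
            // still mapped to a clusterNodeId, e.g. a crashed instance that is restarting
            return false;
        }
        return nowMillis - createdAtMillis > minCreationAgeMillis;
    }
}
```

   The idmap check is what protects the restart scenario: a crashed instance is no longer active, but as long as its slingId is still mapped it is skipped regardless of its age. The remaining race is the window in which that mapping is already gone while the instance is starting up again.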




[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1461899482

   Kudos, SonarCloud Quality Gate passed!
   https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13

   0 Bugs (rating A)
   0 Vulnerabilities (rating A)
   0 Security Hotspots (rating A)
   31 Code Smells (rating A)

   87.0% Coverage
   0.0% Duplication



[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1130879106


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,585 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    private final static long MIN_CLEANUP_DELAY_MILLIS = 46800000; // 13 hours, to intraday load balance
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    private long lastSuccessfulRun = -1;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: enabled = {}, initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                isEnabled(), initialDelayMillis, intervalMillis, batchSize,
+                minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        if (!isEnabled()) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+            logger.info("handleTopologyEvent: slingId cleanup is disabled");
+            return;
+        }
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * Reads the system property that enables or disables this task
+     */
+    private static boolean isEnabled() {
+        final String systemPropertyValue = System
+                .getProperty(SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME);
+        return standardConverter().convert(systemPropertyValue).defaultValue(false)
+                .to(Boolean.class);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized means that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledDate = cal.getTime();
+        logger.debug(
+                "recreateSchedule: scheduling a cleanup in {} milliseconds from now, which is: {}",
+                delayMillis, scheduledDate);
+        ScheduleOptions options = localScheduler.AT(scheduledDate);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler triggered from recreateSchedule(). By
+     * default should get called at most every 10 minutes until cleanup is done,
+     * or 10min after a topology change.
+     */
+    @Override
+    public void run() {
+        if (lastSuccessfulRun > 0 && System.currentTimeMillis()
+                - lastSuccessfulRun < MIN_CLEANUP_DELAY_MILLIS) {
+            logger.debug(
+                    "run: last cleanup was {} millis ago, which is less than {} millis, therefore not cleaning up yet.",
+                    System.currentTimeMillis() - lastSuccessfulRun,
+                    MIN_CLEANUP_DELAY_MILLIS);
+            recreateSchedule();
+            return;
+        }
+        runCount.incrementAndGet();
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info(
+                "run: slingId cleanup done, run counter = {}, delete counter = {}, completion counter = {}",
+                getRunCount(), getDeleteCount(), getCompletionCount());
+        lastSuccessfulRun = System.currentTimeMillis();
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    private boolean cleanup() {

Review Comment:
   two small reductions in https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/119c8861fdf9ea97b2b4ff5fd68f560328cb358b , https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/658b0d1328bac4994b340d798f4b4e6e12c5b670 and https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/7fa73bd7a68020d4b3887fb7c9f4cf6104ff37d3
   
   it's still long though.
   
   I guess the main reason is those 2 for loops, one for the clusterInstances children and the other for syncTokens. I did consider combining the two, but I'm not sure that would actually make the code more readable.
   
   Do you have any concrete suggestion for splitting it up?
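   
   One possible split, as a minimal sketch rather than the PR's implementation: keep one helper per loop and let both draw from the same deletion budget. The class below is a repository-free model. `CleanupSplitSketch`, its method names, the two maps standing in for clusterInstances and syncTokens, and the shared budget are all assumptions for illustration; `activeIds` stands for the union of the slingIds in the topology and in the idmap.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Hypothetical, repository-free model of a split cleanup(); not the PR's code. */
class CleanupSplitSketch {

    private final int batchSize;
    private int deleted;

    CleanupSplitSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Garbage test for one instance: not active and created long enough ago. */
    static boolean isGarbage(String slingId, Set<String> activeIds, long createdMillis,
            long nowMillis, long minCreationAgeMillis) {
        return !activeIds.contains(slingId)
                && nowMillis - createdMillis > minCreationAgeMillis;
    }

    /** @return true if the batch budget was used up, i.e. there might be more garbage */
    boolean cleanup(Map<String, Long> clusterInstances, Map<String, Long> syncTokens,
            Set<String> activeIds, long nowMillis, long minCreationAgeMillis) {
        deleted = 0;
        cleanupClusterInstances(clusterInstances, activeIds, nowMillis, minCreationAgeMillis);
        cleanupSyncTokens(syncTokens, clusterInstances.keySet(), activeIds);
        return deleted >= batchSize;
    }

    /** Loop 1: remove garbage entries (slingId -> creation time) until the budget is spent. */
    private void cleanupClusterInstances(Map<String, Long> clusterInstances,
            Set<String> activeIds, long nowMillis, long minCreationAgeMillis) {
        final Iterator<Map.Entry<String, Long>> it = clusterInstances.entrySet().iterator();
        while (it.hasNext() && deleted < batchSize) {
            final Map.Entry<String, Long> e = it.next();
            if (isGarbage(e.getKey(), activeIds, e.getValue(), nowMillis, minCreationAgeMillis)) {
                it.remove();
                deleted++;
            }
        }
    }

    /** Loop 2: remove token keys that are neither active nor backed by a remaining instance. */
    private void cleanupSyncTokens(Map<String, Long> syncTokens,
            Set<String> remainingInstances, Set<String> activeIds) {
        final Iterator<String> it = syncTokens.keySet().iterator();
        while (it.hasNext() && deleted < batchSize) {
            final String slingId = it.next();
            if (!activeIds.contains(slingId) && !remainingInstances.contains(slingId)) {
                it.remove();
                deleted++;
            }
        }
    }
}
```

   With this shape cleanup() itself stays a few lines long, and each loop can be unit-tested separately. Whether node deletions and property removals should share one budget is a design choice this sketch simply assumes.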

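   For reference, the batching defaults in the diff above (50 deletions per 10min interval) can be checked with a few lines of arithmetic. This is only a sketch: `CleanupThroughputSketch` and its methods are hypothetical names, and it ignores the time a batch itself takes.

```java
/** Hypothetical helper; only the default values (50, 10min) come from the diff. */
class CleanupThroughputSketch {

    /** Upper bound of deletions per day when one batch runs per interval. */
    static long maxDeletionsPerDay(int batchSize, long intervalMillis) {
        final long millisPerDay = 24L * 60 * 60 * 1000;
        // number of whole intervals per day, times the deletions per interval
        return batchSize * (millisPerDay / intervalMillis);
    }

    /** Number of batches (rounded up) needed to remove the given amount of garbage. */
    static long runsNeeded(long garbageCount, int batchSize) {
        return (garbageCount + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        // 144 intervals of 10min per day, 50 deletions each
        System.out.println(maxDeletionsPerDay(50, 600000L)); // prints 7200
    }
}
```

   So a legacy cleanup of, say, 1000 garbage slingIds would take 20 batches, roughly 3 hours and 20 minutes at the default interval.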

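   The task is opt-in via the system property named in the diff above. A minimal stand-alone sketch of that toggle follows. It uses `Boolean.parseBoolean` as a stand-in for the OSGi converter used in the PR, which is an assumption and may treat unusual values differently; `CleanupToggleSketch` is a hypothetical name.

```java
/** Hypothetical stand-in for the isEnabled() check; not the PR's code. */
class CleanupToggleSketch {

    /** Property name as it appears in the diff. */
    static final String PROP = "org.apache.sling.discovery.oak.slingidcleanup.enabled";

    /** Disabled unless the property is set to "true" (case-insensitive). */
    static boolean isEnabled() {
        // Boolean.parseBoolean(null) is false, so an unset property means disabled
        return Boolean.parseBoolean(System.getProperty(PROP));
    }
}
```

   With this sketch, starting the JVM with `-Dorg.apache.sling.discovery.oak.slingidcleanup.enabled=true` would turn the task on, and leaving the property unset keeps it off.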


[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1461890441

   Kudos, SonarCloud Quality Gate passed!
   https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13
   
   Bugs: 0 (rating A)
   Vulnerabilities: 0 (rating A)
   Security Hotspots: 0 (rating A)
   Code Smells: 34 (rating A)
   
   Coverage: 86.2%
   Duplication: 0.0%
   
   



[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1465806108

   Kudos, SonarCloud Quality Gate passed!
   https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13
   
   Bugs: 0 (rating A)
   Vulnerabilities: 0 (rating A)
   Security Hotspots: 0 (rating A)
   Code Smells: 30 (rating A)
   
   Coverage: 86.4%
   Duplication: 0.0%
   
   



[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli merged pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli merged PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13



[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1458431940

   SonarCloud Quality Gate failed.
   https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13
   
   Bugs: 1 (rating B)
   Vulnerabilities: 0 (rating A)
   Security Hotspots: 0 (rating A)
   Code Smells: 17 (rating A)
   
   Coverage: 72.4%
   Duplication: 0.0%
   
   



[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1128259831


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology
+     * and was started more than 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - together with the batchSize this
+     * limits repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the repository with writes, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledDate = cal.getTime();
+        logger.debug("recreateSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(scheduledDate);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default it gets called 10min after a topology change and then at most
+     * every 10min until cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);

Review Comment:
   Good point, there is a service user already. Should switch to that. Will also try the try
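   For illustration, a minimal sketch of what switching to the existing service user could look like, assuming the remark about the "try" refers to Java's try-with-resources. SketchResolver and SketchResolverFactory are hypothetical stand-ins for the Sling ResourceResolver and ResourceResolverFactory so that the snippet compiles on its own; the sub service name passed in is likewise an assumption and not taken from the PR.

```java
import java.io.Closeable;
import java.util.Collections;
import java.util.Map;

// Hypothetical stand-in for org.apache.sling.api.resource.ResourceResolver,
// reduced to the close() method that try-with-resources relies on.
interface SketchResolver extends Closeable {
    @Override
    void close(); // declared without a checked exception
}

// Hypothetical stand-in for ResourceResolverFactory#getServiceResourceResolver(Map).
interface SketchResolverFactory {
    SketchResolver getServiceResourceResolver(Map<String, Object> authenticationInfo);
}

final class ServiceResolverSketch {

    // Assumed to match the value of ResourceResolverFactory.SUBSERVICE.
    static final String SUBSERVICE_KEY = "sling.service.subservice";

    private ServiceResolverSketch() {
    }

    /**
     * Opens a resolver for the given sub service (instead of passing null as
     * authentication info) and lets try-with-resources close it again.
     *
     * @return true if a resolver was obtained
     */
    static boolean cleanupWith(SketchResolverFactory factory, String subServiceName) {
        final Map<String, Object> authInfo = Collections
                .<String, Object>singletonMap(SUBSERVICE_KEY, subServiceName);
        try (SketchResolver resolver = factory.getServiceResourceResolver(authInfo)) {
            // the actual cleanup would use the resolver here
            return resolver != null;
        }
    }
}
```

   With this shape the explicit "ResourceResolver resolver = null" plus a finally block is no longer needed, as the resolver is closed on every exit path.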




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1130864559


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,571 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology
+     * and was started more than 1 week ago is very unlikely to still be active

Review Comment:
   Added code to address this in https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/b57b2e1d919f345a59359db179fe7fa8e265b2c2 (and a small documentation update in https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/775eca76c38b3afe059917e82fd7cbcb408aada8 )
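   As a reading aid, the three garbage criteria from the class Javadoc quoted above (not in the current topology, not in the current idmap, leaderElectionId older than the minimum creation age) can be sketched as a single predicate. SlingIdGarbageCheck and its parameters are hypothetical names for illustration only, not the code in the PR.

```java
import java.util.Set;

// Hypothetical helper, not part of the PR: combines the three criteria
// under which a slingId is considered garbage.
final class SlingIdGarbageCheck {

    private SlingIdGarbageCheck() {
    }

    /**
     * @param slingId                       the candidate slingId
     * @param activeSlingIds                slingIds in the current topology
     * @param idMapSlingIds                 slingIds referenced by the current idmap
     * @param leaderElectionCreatedAtMillis creation time of the candidate's leaderElectionId
     * @param nowMillis                     current time in milliseconds
     * @param minCreationAgeMillis          minimum age, 1 week (604800000) by default
     * @return true only if all three garbage criteria hold
     */
    static boolean isGarbage(String slingId, Set<String> activeSlingIds,
            Set<String> idMapSlingIds, long leaderElectionCreatedAtMillis,
            long nowMillis, long minCreationAgeMillis) {
        if (activeSlingIds.contains(slingId)) {
            return false; // still part of the current topology
        }
        if (idMapSlingIds.contains(slingId)) {
            return false; // still referenced by the idmap
        }
        // only instances started longer ago than the minimum age qualify
        return nowMillis - leaderElectionCreatedAtMillis > minCreationAgeMillis;
    }
}
```

   A slingId that fails any one of the checks is kept, so an instance that started less than a week ago is never deleted even if it is currently missing from the topology.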




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1130879106


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,585 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    private final static long MIN_CLEANUP_DELAY_MILLIS = 46800000; // 13 hours, to intraday load balance
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology
+     * and was started more than 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - together with the batchSize this
+     * limits repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the repository with writes, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    private long lastSuccessfulRun = -1;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: enabled = {}, initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                isEnabled(), initialDelayMillis, intervalMillis, batchSize,
+                minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        if (!isEnabled()) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+            logger.info("handleTopologyEvent: slingId cleanup is disabled");
+            return;
+        }
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * Reads the system property that enables or disables this task.
+     */
+    private static boolean isEnabled() {
+        final String systemPropertyValue = System
+                .getProperty(SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME);
+        return standardConverter().convert(systemPropertyValue).defaultValue(false)
+                .to(Boolean.class);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized means that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledDate = cal.getTime();
+        logger.debug(
+                "recreateSchedule: scheduling a cleanup in {} milliseconds from now, which is: {}",
+                delayMillis, scheduledDate);
+        ScheduleOptions options = localScheduler.AT(scheduledDate);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default it should get called at most every 5 minutes until cleanup is done,
+     * or 10min after a topology change.
+     */
+    @Override
+    public void run() {
+        if (lastSuccessfulRun > 0 && System.currentTimeMillis()
+                - lastSuccessfulRun < MIN_CLEANUP_DELAY_MILLIS) {
+            logger.debug(
+                    "run: last cleanup was {} millis ago, which is less than {} millis, therefore not cleaning up yet.",
+                    System.currentTimeMillis() - lastSuccessfulRun,
+                    MIN_CLEANUP_DELAY_MILLIS);
+            recreateSchedule();
+            return;
+        }
+        runCount.incrementAndGet();
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info(
+                "run: slingId cleanup done, run counter = {}, delete counter = {}, completion counter = {}",
+                getRunCount(), getDeleteCount(), getCompletionCount());
+        lastSuccessfulRun = System.currentTimeMillis();
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    private boolean cleanup() {

Review Comment:
   two small reductions in https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/119c8861fdf9ea97b2b4ff5fd68f560328cb358b and https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/658b0d1328bac4994b340d798f4b4e6e12c5b670
   
   it's still long though.
   
   I guess the main reason is those 2 for loops, one for the clusterInstances children and the other for syncTokens. I did consider combining the two, but I'm not sure that would actually make the code more readable.
   
   Do you have any concrete suggestion for splitting it up?
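
One possible shape for such a split, as a rough sketch only: each of the two loops becomes a call to the same helper, and both calls draw from one shared deletion budget. Everything here is hypothetical and not taken from the PR: the class `CleanupSplitSketch`, the `Budget` type, and the use of plain `Set<String>` collections in place of the Sling resources under clusterInstances and syncTokens.

```java
import java.util.ArrayList;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch: one helper per loop, sharing a single deletion budget. */
class CleanupSplitSketch {

    /** Remaining number of deletions allowed in the current batch. */
    static final class Budget {
        private int remaining;

        Budget(int batchSize) {
            this.remaining = batchSize;
        }

        boolean exhausted() {
            return remaining <= 0;
        }

        void consume() {
            remaining--;
        }
    }

    /**
     * Removes garbage ids from the given set until the budget is used up.
     *
     * @return true if garbage was left behind because the budget ran out
     */
    static boolean deleteGarbage(Set<String> slingIds, Predicate<String> isGarbage,
            Budget budget) {
        // iterate over a copy so that removing from the set is safe
        for (String id : new ArrayList<>(slingIds)) {
            if (!isGarbage.test(id)) {
                continue;
            }
            if (budget.exhausted()) {
                return true; // more garbage exists, but the batch is full
            }
            slingIds.remove(id);
            budget.consume();
        }
        return false;
    }

    /** Stand-in for cleanup(): returns true if there might be more garbage. */
    static boolean cleanup(Set<String> clusterInstances, Set<String> syncTokens,
            Predicate<String> isGarbage, int batchSize) {
        final Budget budget = new Budget(batchSize);
        boolean mightHaveMore = deleteGarbage(clusterInstances, isGarbage, budget);
        // non-short-circuit on purpose: the second pass always runs
        mightHaveMore |= deleteGarbage(syncTokens, isGarbage, budget);
        return mightHaveMore;
    }
}
```

With this shape the outer method only wires the two passes together, and the true/false contract of cleanup() stays the same. Whether it reads better than two explicit loops is a judgment call.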



[GitHub] [sling-org-apache-sling-discovery-oak] joerghoh commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "joerghoh (via GitHub)" <gi...@apache.org>.

joerghoh commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1128218485


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized means that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("resetSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date inFiveMinutes = cal.getTime();
+        logger.debug("resetSchedule: scheduling a cleanup in {} milliseconds from now.",

Review Comment:
   While a timestamp is normally provided, I would rather log the ```scheduledDate``` as well (assuming that the ```delayMillis``` number is large and requires some translation to get the actual firing time).
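
The request amounts to logging the relative delay together with the absolute firing time. A minimal sketch of that idea follows, assuming a hypothetical helper class `ScheduleLogSketch` that is not part of the PR. It takes the start time as a parameter so the result is reproducible, and it builds the message with plain string concatenation instead of the SLF4J logger.

```java
import java.util.Calendar;
import java.util.Date;

/** Hypothetical sketch: derive the firing time from a delay and describe both. */
class ScheduleLogSketch {

    /** Returns the point in time that lies delayMillis after the given start. */
    static Date scheduledDate(Date start, int delayMillis) {
        final Calendar cal = Calendar.getInstance();
        cal.setTime(start);
        cal.add(Calendar.MILLISECOND, delayMillis);
        return cal.getTime();
    }

    /** Builds a message that contains the delay and the resulting date. */
    static String describe(Date start, int delayMillis) {
        return "recreateSchedule: scheduling a cleanup in " + delayMillis
                + " milliseconds from now, which is: "
                + scheduledDate(start, delayMillis);
    }
}
```

A reader of the log then does not have to convert a value such as 600000 milliseconds into a wall-clock time by hand.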



##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized means that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("resetSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date inFiveMinutes = cal.getTime();

Review Comment:
   ```suggestion
           final Date scheduledDate = cal.getTime();
   ```



##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date inFiveMinutes = cal.getTime();
+        logger.debug("recreateSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(inFiveMinutes);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default it is called 10min after a topology change and then at most every
+     * 10min until cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);

Review Comment:
   Please use try-with-resources:
   
   ```
   try (ResourceResolver resolver = localFactory....) {
     ...
   }
   ```
   
   Also the use of a service user might be more appropriate than the use of an admin resolver.
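
A minimal sketch of the suggested try-with-resources shape follows. It is an illustration only: `FakeResolver` is a hypothetical stand-in for `ResourceResolver` (assumed here to be closeable, as in current Sling API versions), and `cleanup(boolean)` only mimics the return contract of the PR's `cleanup()` method. The point it shows is that the resource is closed on both the normal and the exceptional path, without an explicit `finally` block.

```java
public class TryWithResourcesSketch {

    /** Hypothetical stand-in for a ResourceResolver; records whether close() ran. */
    static final class FakeResolver implements AutoCloseable {
        boolean closed = false;

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Last resolver opened by cleanup(), kept only so the example can inspect it. */
    static FakeResolver lastResolver;

    /**
     * Mimics the contract of cleanup() in the PR: returns true if another
     * round should be scheduled, false if there is nothing left to do.
     */
    static boolean cleanup(boolean simulateFailure) {
        try (FakeResolver resolver = new FakeResolver()) {
            lastResolver = resolver;
            if (simulateFailure) {
                throw new IllegalStateException("simulated repository failure");
            }
            return false;
        } catch (IllegalStateException e) {
            // the resolver has already been closed when this handler runs
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(cleanup(true) + " closed=" + lastResolver.closed);
        System.out.println(cleanup(false) + " closed=" + lastResolver.closed);
    }
}
```

In the real class the resource expression would be the `localFactory.getServiceResourceResolver(...)` call, and the `resolver = null` declaration before the `try` would no longer be needed.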



##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date inFiveMinutes = cal.getTime();
+        logger.debug("recreateSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(inFiveMinutes);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default it is called 10min after a topology change and then at most every
+     * 10min until cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);
+
+            final Resource clusterInstances = ResourceHelper.getOrCreateResource(resolver,

Review Comment:
   I am not sure if the creation of that resource is appropriate here. It should rather be
   ```suggestion
               final Resource clusterInstances = resolver.getResource(localConfig.getClusterInstancesPath());
   ```
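
With a plain lookup, a missing parent node has to be handled explicitly, since `ResourceResolver.getResource` returns null when nothing exists at the given path. The sketch below illustrates that handling under stated assumptions: the `REPO` map and the `getResource` and `countCandidates` helpers are hypothetical stand-ins for the repository and the resolver, and treating a missing parent as "nothing to clean up" is an assumption, not something the PR states.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LookupWithoutCreateSketch {

    /** Hypothetical stand-in for the repository: path -> child node names. */
    static final Map<String, List<String>> REPO = new HashMap<>();

    /** Stand-in for a read-only lookup: returns null if the path does not exist. */
    static List<String> getResource(String path) {
        return REPO.get(path);
    }

    /** Returns the number of cleanup candidates, or 0 if the parent is missing. */
    static int countCandidates(String clusterInstancesPath) {
        final List<String> clusterInstances = getResource(clusterInstancesPath);
        if (clusterInstances == null) {
            // assumption: nothing was ever written there, so nothing to clean up
            return 0;
        }
        return clusterInstances.size();
    }

    public static void main(String[] args) {
        final String path = "/var/discovery/oak/clusterInstances";
        System.out.println(countCandidates(path));    // parent missing
        System.out.println(REPO.containsKey(path));   // the lookup created nothing
        REPO.put(path, Arrays.asList("slingId-a", "slingId-b"));
        System.out.println(countCandidates(path));    // parent present
    }
}
```

The cleanup task then never writes to the repository unless it actually has something to delete.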
   
   



##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("resetSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date inFiveMinutes = cal.getTime();
+        logger.debug("resetSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(inFiveMinutes);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default this gets called 10min after a topology change and then at most
+     * every 10min until cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);

Review Comment:
   ```suggestion
               logger.error("run: got Exception while cleaning up slingIds", e);
   ```
   It does not make sense to print both the stacktrace and e.toString() in the same log message.
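
To illustrate the point about redundancy: the text rendered by `Throwable.printStackTrace()` already begins with the result of `toString()`. The sketch below uses only the JDK and checks that for a sample exception. The class and method names (`StackTraceFirstLine`, `render`, `firstLine`) are made up for this illustration, and it assumes the logging backend renders a throwable in a similar way to `printStackTrace()`, which may differ per backend.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Illustrative helper, not part of the PR.
class StackTraceFirstLine {

    /** Renders the throwable the way printStackTrace() does and returns the text. */
    public static String render(Throwable t) {
        StringWriter sw = new StringWriter();
        PrintWriter pw = new PrintWriter(sw);
        t.printStackTrace(pw);
        pw.flush();
        return sw.toString();
    }

    /** Returns the first line of the rendered stack trace. */
    public static String firstLine(Throwable t) {
        String rendered = render(t);
        int eol = rendered.indexOf(System.lineSeparator());
        return eol < 0 ? rendered : rendered.substring(0, eol);
    }

    public static void main(String[] args) {
        Exception e = new IllegalStateException("repository not available");
        // Compares e.toString() with the first line of the rendered stack trace.
        System.out.println(e.toString().equals(firstLine(e)));
    }
}
```

If the backend behaves this way, appending `e` to the message string repeats information that passing `e` as the last argument already provides.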



##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("resetSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date inFiveMinutes = cal.getTime();
+        logger.debug("resetSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(inFiveMinutes);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default this gets called 10min after a topology change and then at most
+     * every 10min until cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;

Review Comment:
   Does it make sense to rerun this later?
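
One way to frame the question: the boolean returned by `cleanup()` covers both "more garbage may remain" and "could not run at all", and `run()` reschedules in either case. A minimal sketch, assuming a hypothetical three-valued result, shows how the two cases could be told apart. The names `CleanupOutcomeSketch`, `Outcome` and `shouldReschedule` are illustrative only and do not exist in the PR.

```java
// Illustrative sketch, not part of the PR.
class CleanupOutcomeSketch {

    /** Hypothetical result of one cleanup round. */
    enum Outcome {
        DONE,            // nothing left to delete
        MIGHT_HAVE_MORE, // batch limit reached, more garbage may remain
        NOT_READY        // required services missing, nothing was attempted
    }

    /** Decides whether another round should be scheduled for the given outcome. */
    public static boolean shouldReschedule(Outcome outcome, boolean retryWhenNotReady) {
        switch (outcome) {
            case MIGHT_HAVE_MORE:
                return true;
            case NOT_READY:
                // the policy under discussion: retry later or give up
                return retryWhenNotReady;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        for (Outcome o : Outcome.values()) {
            System.out.println(o + " -> " + shouldReschedule(o, false));
        }
    }
}
```

With such a split, the answer to the question above becomes an explicit policy flag instead of a side effect of returning `true`.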



##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledTime = cal.getTime();
+        logger.debug("recreateSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(scheduledTime);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default it runs 10min after a topology change and then at most every
+     * 10min until cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);
+
+            final Resource clusterInstances = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getClusterInstancesPath());
+            final Resource idMap = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getIdMapPath());
+            final Resource syncTokens = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getSyncTokenPath());
+            resolver.revert();
+            resolver.refresh();
+
+            final ValueMap idMapMap = idMap.adaptTo(ValueMap.class);
+            final ModifiableValueMap syncTokenMap = syncTokens
+                    .adaptTo(ModifiableValueMap.class);
+            final Calendar now = Calendar.getInstance();
+            int removed = 0;
+            boolean mightHaveMore = false;
+            int localBatchSize = batchSize;
+            long localMinCreationAgeMillis = minCreationAgeMillis;
+            for (Resource resource : clusterInstances.getChildren()) {
+                if (!hasTopology || currentView != localCurrentView
+                        || !localCurrentView.isCurrent()) {
+                    // we got interrupted during cleanup
+                    // let's not commit at all then
+                    logger.info(
+                            "cleanup : topology changing during cleanup - not committing this time - stopping for now.");
+                    return true;
+                }
+                final String slingId = resource.getName();
+                logger.info("cleanup : handling slingId = {}", slingId);
+                Object clusterNodeId = idMapMap.get(slingId);
+                if (clusterNodeId == null) {
+                    logger.info("cleanup : slingId not recently in use : {}",
+                            slingId);
+                } else {
+                    logger.info("cleanup : slingId WAS recently in use : {}",
+                            clusterNodeId);
+                    continue;
+                }
+                if (activeSlingIds.contains(slingId)) {
+                    logger.info("cleanup : slingId is currently active : {}", slingId);
+                    continue;
+                }
+                if (deleteIfOldSlingId(resource, syncTokenMap, now,
+                        localMinCreationAgeMillis)) {
+                    if (++removed >= localBatchSize) {
+                        // we need to stop
+                        mightHaveMore = true;
+                        break;
+                    }
+                }
+            }
+            if (!hasTopology) {
+                // we got interrupted during cleanup
+                // let's not commit at all then
+                logger.info(
+                        "cleanup : topology changing during cleanup - not committing this time - stopping for now.");
+                return true;
+            }
+            if (removed > 0) {

Review Comment:
   ```suggestion
               if (resolver.hasChanges()) {
   ```
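
   A minimal sketch of the commit-only-when-dirty pattern this suggestion
   points at. In the Sling API, `ResourceResolver.hasChanges()` returns a
   boolean, so the check needs no comparison. The `PendingChanges` interface
   and the `InMemoryChanges` class are hypothetical stand-ins for the resolver,
   introduced only so that the sketch compiles without Sling on the classpath;
   the real `commit()` additionally throws `PersistenceException`.

   ```java
   import java.util.ArrayList;
   import java.util.List;

   public class CommitIfChanged {

       /** Hypothetical stand-in for the two ResourceResolver methods used here. */
       interface PendingChanges {
           boolean hasChanges();

           void commit();
       }

       /** Hypothetical in-memory implementation, for illustration only. */
       static final class InMemoryChanges implements PendingChanges {
           private final List<String> pending = new ArrayList<>();
           private int commits = 0;

           /** Records a pending (uncommitted) removal. */
           void remove(String slingId) {
               pending.add(slingId);
           }

           @Override
           public boolean hasChanges() {
               return !pending.isEmpty();
           }

           @Override
           public void commit() {
               pending.clear();
               commits++;
           }

           int commitCount() {
               return commits;
           }
       }

       /** Commits only when something is pending; reports whether a commit happened. */
       static boolean commitIfChanged(PendingChanges changes) {
           if (!changes.hasChanges()) {
               return false;
           }
           changes.commit();
           return true;
       }

       public static void main(String[] args) {
           InMemoryChanges changes = new InMemoryChanges();
           // nothing pending yet, so no commit is issued
           System.out.println(commitIfChanged(changes));
           changes.remove("hypothetical-sling-id");
           // one pending removal, so exactly one commit is issued
           System.out.println(commitIfChanged(changes));
       }
   }
   ```

   Asking the resolver rather than a local counter would also cover changes
   that are not counted in `removed`, if there are any.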



##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledTime = cal.getTime();
+        logger.debug("recreateSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(scheduledTime);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default it runs 10min after a topology change and then at most every
+     * 10min until cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);
+
+            final Resource clusterInstances = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getClusterInstancesPath());
+            final Resource idMap = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getIdMapPath());
+            final Resource syncTokens = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getSyncTokenPath());
+            resolver.revert();
+            resolver.refresh();
+
+            final ValueMap idMapMap = idMap.adaptTo(ValueMap.class);
+            final ModifiableValueMap syncTokenMap = syncTokens
+                    .adaptTo(ModifiableValueMap.class);
+            final Calendar now = Calendar.getInstance();
+            int removed = 0;
+            boolean mightHaveMore = false;
+            int localBatchSize = batchSize;
+            long localMinCreationAgeMillis = minCreationAgeMillis;
+            for (Resource resource : clusterInstances.getChildren()) {
+                if (!hasTopology || currentView != localCurrentView
+                        || !localCurrentView.isCurrent()) {
+                    // we got interrupted during cleanup
+                    // let's not commit at all then
+                    logger.info(
+                            "cleanup : topology changing during cleanup - not committing this time - stopping for now.");
+                    return true;
+                }
+                final String slingId = resource.getName();
+                logger.info("cleanup : handling slingId = {}", slingId);
+                Object clusterNodeId = idMapMap.get(slingId);
+                if (clusterNodeId == null) {
+                    logger.info("cleanup : slingId not recently in use : {}",
+                            slingId);
+                } else {
+                    logger.info("cleanup : slingId WAS recently in use : {}",

Review Comment:
   Please downgrade the log level (this might be logged with every call of this method, e.g. every 10 minutes):
   ```suggestion
                       logger.debug("cleanup : slingId WAS recently in use : {}",
   ```
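
   For context, the checks the loop runs before deleting can be condensed into
   one predicate. This is only a sketch of the three criteria listed in the
   class javadoc. The class name, the method name and the `createdMillis`
   parameter are assumptions made for illustration (the javadoc says the age is
   derived from the leaderElectionId, which is not modelled here), and
   `Set.of` needs Java 9 or later.

   ```java
   import java.util.Set;

   public class SlingIdGarbageCheck {

       /**
        * Returns true only if all three criteria hold: the slingId is not a key
        * in the idmap, it is not part of the current topology, and it was
        * created longer ago than the minimum creation age.
        */
       static boolean isGarbage(String slingId, Set<String> idMapKeys,
               Set<String> activeSlingIds, long createdMillis, long nowMillis,
               long minCreationAgeMillis) {
           if (idMapKeys.contains(slingId)) {
               return false; // still in the idmap, i.e. recently in use
           }
           if (activeSlingIds.contains(slingId)) {
               return false; // part of the current topology
           }
           return nowMillis - createdMillis > minCreationAgeMillis;
       }

       public static void main(String[] args) {
           final long week = 7L * 24 * 60 * 60 * 1000; // 604800000 ms, the default age
           final Set<String> idMapKeys = Set.of("in-idmap");
           final Set<String> active = Set.of("active");
           final long now = 10 * week;

           // old, in neither set: qualifies as garbage
           System.out.println(isGarbage("stale", idMapKeys, active, 0L, now, week));
           // old, but still listed in the idmap: kept
           System.out.println(isGarbage("in-idmap", idMapKeys, active, 0L, now, week));
           // in neither set, but created just now: kept
           System.out.println(isGarbage("fresh", idMapKeys, active, now, now, week));
       }
   }
   ```

   The "WAS recently in use" branch corresponds to the first early return, which
   is why it can be hit for every live instance on every run.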



##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("resetSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date inFiveMinutes = cal.getTime();
+        logger.debug("resetSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(inFiveMinutes);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default it gets called at most every 10 minutes until cleanup is done, or
+     * 10min after a topology change.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);
+
+            final Resource clusterInstances = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getClusterInstancesPath());
+            final Resource idMap = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getIdMapPath());
+            final Resource syncTokens = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getSyncTokenPath());
+            resolver.revert();

Review Comment:
   Why is a revert required here? Are changes supposed to be done with the previous statements?
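
For readers following the hunk above: the Javadoc on DEFAULT_CLEANUP_BATCH_SIZE derives throughput figures from the batch size and the interval. The following is a minimal sketch of that arithmetic, assuming each batch takes negligible time compared to the interval. The class name CleanupThroughput and its methods are hypothetical and not part of the PR. With the defaults (50 deletions every 600000 ms) it gives one deletion per 12 seconds, 5 per minute and 7200 per day.

```java
// Hypothetical helper, not part of the PR: throughput implied by batchSize and interval.
final class CleanupThroughput {

    /** Average number of deletions per minute. */
    static double perMinute(int batchSize, long intervalMillis) {
        return batchSize * 60_000.0 / intervalMillis;
    }

    /** Average number of seconds between two deletions. */
    static double secondsPerDeletion(int batchSize, long intervalMillis) {
        return intervalMillis / 1000.0 / batchSize;
    }

    /** Deletions per day, counting only intervals that fit fully into 24 hours. */
    static long perDay(int batchSize, long intervalMillis) {
        final long millisPerDay = 86_400_000L;
        return batchSize * (millisPerDay / intervalMillis);
    }

    public static void main(String[] args) {
        // defaults taken from the constants in the hunk above
        final int batchSize = 50;
        final long intervalMillis = 600_000L;
        System.out.println("per minute: " + perMinute(batchSize, intervalMillis));
        System.out.println("seconds per deletion: " + secondsPerDeletion(batchSize, intervalMillis));
        System.out.println("per day: " + perDay(batchSize, intervalMillis));
    }
}
```

Since both values are configurable, the same helper can be used to estimate how long a one-time legacy cleanup would take for other settings.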



##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("resetSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date inFiveMinutes = cal.getTime();
+        logger.debug("resetSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(inFiveMinutes);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default it gets called at most every 10 minutes until cleanup is done, or
+     * 10min after a topology change.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);
+
+            final Resource clusterInstances = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getClusterInstancesPath());
+            final Resource idMap = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getIdMapPath());
+            final Resource syncTokens = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getSyncTokenPath());
+            resolver.revert();
+            resolver.refresh();
+
+            final ValueMap idMapMap = idMap.adaptTo(ValueMap.class);
+            final ModifiableValueMap syncTokenMap = syncTokens
+                    .adaptTo(ModifiableValueMap.class);
+            final Calendar now = Calendar.getInstance();
+            int removed = 0;
+            boolean mightHaveMore = false;
+            int localBatchSize = batchSize;
+            long localMinCreationAgeMillis = minCreationAgeMillis;
+            for (Resource resource : clusterInstances.getChildren()) {
+                if (!hasTopology || currentView != localCurrentView
+                        || !localCurrentView.isCurrent()) {
+                    // we got interrupted during cleanup
+                    // let's not commit at all then
+                    logger.info(
+                            "cleanup : topology changing during cleanup - not committing this time - stopping for now.");
+                    return true;
+                }
+                final String slingId = resource.getName();
+                logger.info("cleanup : handling slingId = {}", slingId);
+                Object clusterNodeId = idMapMap.get(slingId);
+                if (clusterNodeId == null) {
+                    logger.info("cleanup : slingId not recently in use : {}",
+                            clusterNodeId);
+                } else {
+                    logger.info("cleanup : slingId WAS recently in use : {}",
+                            clusterNodeId);
+                    continue;
+                }
+                if (activeSlingIds.contains(slingId)) {
+                    logger.info("cleanup : slingId is currently active : {}", slingId);

Review Comment:
   ditto
   ```suggestion
                       logger.debug("cleanup : slingId is currently active : {}", slingId);
   ```
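
As background for the loop this comment is attached to: the class Javadoc in the hunk above lists three conditions for a slingId to count as garbage (not in the current topology, not in the idmap, older than the minimum creation age) and caps deletions at batchSize per run. The following is a minimal, Sling-free sketch of those rules. All names (SlingIdCleanupSketch, cleanupBatch, Result) are hypothetical, and the way mightHaveMore is derived here is an assumption, since the corresponding part of the PR is not shown in this hunk.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of the selection rules; not the PR's implementation.
final class SlingIdCleanupSketch {

    /** Outcome of one batch: the removed slingIds and whether another run is needed. */
    static final class Result {
        final List<String> removed;
        final boolean mightHaveMore;

        Result(List<String> removed, boolean mightHaveMore) {
            this.removed = removed;
            this.mightHaveMore = mightHaveMore;
        }
    }

    /**
     * @param createdAtBySlingId slingId to creation time in millis (stands in for the
     *                           leaderElectionId creation time)
     */
    static Result cleanupBatch(Map<String, Long> createdAtBySlingId,
            Set<String> activeSlingIds, Set<String> idMapSlingIds, long nowMillis,
            long minCreationAgeMillis, int batchSize) {
        final List<String> removed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : createdAtBySlingId.entrySet()) {
            final String slingId = entry.getKey();
            if (idMapSlingIds.contains(slingId) || activeSlingIds.contains(slingId)) {
                continue; // recently or currently in use: never garbage
            }
            if (nowMillis - entry.getValue() <= minCreationAgeMillis) {
                continue; // too young: the instance might still come back
            }
            if (removed.size() >= batchSize) {
                // batch is full but another garbage candidate exists
                return new Result(removed, true);
            }
            removed.add(slingId);
        }
        return new Result(removed, false);
    }

    public static void main(String[] args) {
        final Map<String, Long> created = new LinkedHashMap<>();
        created.put("a", 0L);
        created.put("b", 0L);
        created.put("c", 0L);
        final Result r = cleanupBatch(created, Set.of("a"), Set.of(), 1000L, 500L, 1);
        System.out.println(r.removed + " mightHaveMore=" + r.mightHaveMore);
    }
}
```

In this model a caller would reschedule another run whenever mightHaveMore is true, which mirrors how run() in the hunk above calls recreateSchedule().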



##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but lets stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledTime = cal.getTime();
+        logger.debug("recreateSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(scheduledTime);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler triggered from recreateSchedule(). By
+     * default gets called at most every 10 minutes until cleanup is done, and
+     * 10min after a topology change.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);
+
+            final Resource clusterInstances = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getClusterInstancesPath());
+            final Resource idMap = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getIdMapPath());
+            final Resource syncTokens = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getSyncTokenPath());
+            resolver.revert();
+            resolver.refresh();
+
+            final ValueMap idMapMap = idMap.adaptTo(ValueMap.class);
+            final ModifiableValueMap syncTokenMap = syncTokens
+                    .adaptTo(ModifiableValueMap.class);
+            final Calendar now = Calendar.getInstance();
+            int removed = 0;
+            boolean mightHaveMore = false;
+            int localBatchSize = batchSize;
+            long localMinCreationAgeMillis = minCreationAgeMillis;
+            for (Resource resource : clusterInstances.getChildren()) {
+                if (!hasTopology || currentView != localCurrentView
+                        || !localCurrentView.isCurrent()) {
+                    // we got interrupted during cleanup
+                    // let's not commit at all then
+                    logger.info(
+                            "cleanup : topology changing during cleanup - not committing this time - stopping for now.");
+                    return true;
+                }
+                final String slingId = resource.getName();
+                logger.info("cleanup : handling slingId = {}", slingId);

Review Comment:
   please downgrade the log level:
   ```suggestion
                   logger.trace("cleanup : handling slingId = {}", slingId);
   ```
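
As a rough illustration of the intent, per-item messages can go to the finest level while one summary line per batch stays at info. This is a minimal, self-contained sketch, not the PR's code: the class `CleanupLogSketch` and the method `handleBatch` are hypothetical, and `java.util.logging` stands in for slf4j here (FINEST playing the role of trace).

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for the cleanup loop; only the logging pattern matters.
class CleanupLogSketch {

    private static final Logger LOG = Logger.getLogger(CleanupLogSketch.class.getName());

    /** Handles each slingId quietly and returns a one-line summary for the batch. */
    static String handleBatch(List<String> slingIds) {
        int handled = 0;
        for (String slingId : slingIds) {
            // per-item detail: FINEST stands in for slf4j's trace level
            LOG.log(Level.FINEST, "cleanup : handling slingId = {0}", slingId);
            handled++;
        }
        final String summary = "cleanup : handled " + handled + " slingIds";
        // a single summary line per batch remains visible at info
        LOG.info(summary);
        return summary;
    }

    public static void main(String[] args) {
        System.out.println(handleBatch(List.of("id-a", "id-b", "id-c")));
    }
}
```

With this shape the log volume at info grows with the number of batches rather than with the number of slingIds under clusterInstances.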




[GitHub] [sling-org-apache-sling-discovery-oak] joerghoh commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "joerghoh (via GitHub)" <gi...@apache.org>.

joerghoh commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1129764543


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,571 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: enabled = {}, initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                isEnabled(), initialDelayMillis, intervalMillis, batchSize,
+                minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        if (!isEnabled()) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+            logger.info("handleTopologyEvent: slingId cleanup is disabled");

Review Comment:
   ```suggestion
               logger.trace("handleTopologyEvent: slingId cleanup is disabled");
   ```
   Please downgrade this message, as it might get invoked quite often (whenever a topology event is sent).
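
An alternative, sketched here only for illustration, would be to keep the message at info but emit it once per transition into the disabled state, logging repeats at the finest level. The class `DisabledLogSketch` and its method are hypothetical, and `java.util.logging` stands in for slf4j; this is not what the PR does.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Logger;

// Hypothetical sketch: log "disabled" at info only on the first event after
// the feature was last seen enabled (or after startup).
class DisabledLogSketch {

    private static final Logger LOG = Logger.getLogger(DisabledLogSketch.class.getName());

    /** remembers whether the disabled state was already reported at info */
    private final AtomicBoolean disabledLogged = new AtomicBoolean(false);

    /**
     * @param enabled whether the cleanup is currently enabled
     * @return true if an info-level message was emitted for this event
     */
    boolean onTopologyEvent(boolean enabled) {
        if (enabled) {
            // re-arm, so a later switch to disabled is reported again
            disabledLogged.set(false);
            return false;
        }
        if (disabledLogged.compareAndSet(false, true)) {
            LOG.info("handleTopologyEvent: slingId cleanup is disabled");
            return true;
        }
        // repeated events while disabled: FINEST stands in for slf4j's trace
        LOG.finest("handleTopologyEvent: slingId cleanup is disabled");
        return false;
    }

    public static void main(String[] args) {
        DisabledLogSketch s = new DisabledLogSketch();
        System.out.println(s.onTopologyEvent(false));
        System.out.println(s.onTopologyEvent(false));
    }
}
```

Simply downgrading the level, as suggested above, is the smaller change; the sketch only shows how the message could stay visible without repeating on every topology event.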




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1133640023


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,601 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>was not ever seen in previous topologies by the now leader instance</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ * <p>
+ * Additionally, the cleanup is skipped for 13 hours after a successful cleanup.
+ * This is to avoid unnecessary load on the repository. The number of 13
+ * incorporates some heuristics such as : about 2 cleanup rounds per day maximum
+ * makes sense, if a leader is very long living, then the 1 additional hour
+ * makes it spread somewhat throughout the day. This is to further minimize any
+ * load side-effects.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /** default minimal cleanup delay at 13h, to intraday load balance */
+    final static long MIN_CLEANUP_DELAY_MILLIS = 46800000;
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 deletion every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    private long lastSuccessfulRun = -1;
+
+    /**
+     * Minimal delay after a successful cleanup round, in millis
+     */
+    private long minCleanupDelayMillis = MIN_CLEANUP_DELAY_MILLIS;
+
+    /**
+     * contains all slingIds ever seen by this instance - should not be a long list
+     * so not a memory issue
+     */
+    private Set<String> seenInstances = new HashSet<>();
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis, long minCleanupDelayMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.minCleanupDelayMillis = minCleanupDelayMillis;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: enabled = {}, initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                isEnabled(), initialDelayMillis, intervalMillis, batchSize,
+                minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        if (!isEnabled()) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+            logger.debug("handleTopologyEvent: slingId cleanup is disabled");
+            return;
+        }
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            seenInstances.addAll(getActiveSlingIdsFrom(newView));
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * Reads the system property that enables or disables this task
+     */
+    private static boolean isEnabled() {
+        final String systemPropertyValue = System
+                .getProperty(SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME);
+        return standardConverter().convert(systemPropertyValue).defaultValue(false)
+                .to(Boolean.class);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledDate = cal.getTime();
+        logger.debug(
+                "recreateSchedule: scheduling a cleanup in {} milliseconds from now, which is: {}",
+                delayMillis, scheduledDate);
+        ScheduleOptions options = localScheduler.AT(scheduledDate);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default it gets called at most every 10 minutes until cleanup is done, or
+     * 10min after a topology change.
+     */
+    @Override
+    public void run() {
+        if (lastSuccessfulRun > 0 && System.currentTimeMillis()
+                - lastSuccessfulRun < minCleanupDelayMillis) {
+            logger.debug(
+                    "run: last cleanup was {} millis ago, which is less than {} millis, therefore not cleaning up yet.",
+                    System.currentTimeMillis() - lastSuccessfulRun,
+                    minCleanupDelayMillis);
+            recreateSchedule();
+            return;
+        }
+        runCount.incrementAndGet();
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info(
+                "run: slingId cleanup done, run counter = {}, delete counter = {}, completion counter = {}",
+                getRunCount(), getDeleteCount(), getCompletionCount());
+        lastSuccessfulRun = System.currentTimeMillis();
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    private boolean cleanup() {
+        logger.debug("cleanup: start");
+        if (!isEnabled()) {
+            // bit of overkill probably, as this shouldn't happen.
+            // but adds to a good night's sleep.
+            logger.debug("cleanup: not enabled, stopping.");
+            return false;
+        }
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.debug("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = getActiveSlingIdsFrom(localCurrentView);
+        try (ResourceResolver resolver = localFactory.getServiceResourceResolver(null)) {
+            final Resource clusterInstances = resolver
+                    .getResource(localConfig.getClusterInstancesPath());
+            final Resource idMap = resolver.getResource(localConfig.getIdMapPath());
+            final Resource syncTokens = resolver
+                    .getResource(localConfig.getSyncTokenPath());
+            if (clusterInstances == null || idMap == null || syncTokens == null) {
+                logger.warn("cleanup: no resource found at {}, {} or {}, stopping.",
+                        localConfig.getClusterInstancesPath(), localConfig.getIdMapPath(),
+                        localConfig.getSyncTokenPath());
+                return false;
+            }
+            resolver.refresh();
+            final ValueMap idMapMap = idMap.adaptTo(ValueMap.class);
+            final ModifiableValueMap syncTokenMap = syncTokens
+                    .adaptTo(ModifiableValueMap.class);
+            final Calendar now = Calendar.getInstance();
+            int removed = 0;
+            boolean mightHaveMore = false;
+            final int localBatchSize = batchSize;
+            final long localMinCreationAgeMillis = minCreationAgeMillis;
+            for (Resource resource : clusterInstances.getChildren()) {
+                if (!topologyUnchanged(localCurrentView)) {
+                    return true;
+                }
+                final String slingId = resource.getName();
+                if (deleteIfOldSlingId(resource, slingId, syncTokenMap, idMapMap,
+                        activeSlingIds, now, localMinCreationAgeMillis)) {
+                    if (++removed >= localBatchSize) {
+                        // we need to stop
+                        mightHaveMore = true;
+                        break;
+                    }
+                }
+            }
+            // if we're not already at the batch limit, check syncTokens too
+            if (!mightHaveMore) {
+                for (String slingId : syncTokenMap.keySet()) {
+                    try {
+                        UUID.fromString(slingId);
+                    } catch (Exception e) {
+                        // not a uuid
+                        continue;
+                    }
+                    if (!topologyUnchanged(localCurrentView)) {
+                        return true;
+                    }
+                    Resource resourceOrNull = clusterInstances.getChild(slingId);
+                    if (deleteIfOldSlingId(resourceOrNull, slingId, syncTokenMap,
+                            idMapMap, activeSlingIds, now, localMinCreationAgeMillis)) {
+                        if (++removed >= localBatchSize) {
+                            // we need to stop
+                            mightHaveMore = true;
+                            break;
+                        }
+                    }
+                }
+            }
+            if (!topologyUnchanged(localCurrentView)) {
+                return true;
+            }
+            if (removed > 0) {
+                // only if we removed something we commit
+                resolver.commit();
+                logger.info(
+                        "cleanup : removed {} old slingIds (batch size : {}), potentially has more: {}",
+                        removed, localBatchSize, mightHaveMore);
+                deleteCount.addAndGet(removed);
+            }
+            firstRun = false;
+            completionCount.incrementAndGet();
+            return mightHaveMore;
+        } catch (LoginException e) {
+            logger.error("cleanup: could not log in administratively: " + e, e);
+            throw new RuntimeException("Could not log in to repository (" + e + ")", e);
+        } catch (PersistenceException e) {
+            logger.error("cleanup: got a PersistenceException: " + e, e);
+            throw new RuntimeException(
+                    "Exception while talking to repository (" + e + ")", e);
+        } finally {
+            logger.debug("cleanup: done.");
+        }
+    }
+
+    private Set<String> getActiveSlingIdsFrom(final TopologyView localCurrentView) {
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+        return activeSlingIds;
+    }
+
+    private boolean topologyUnchanged(TopologyView localCurrentView) {

Review Comment:
   +1, done in https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/073b4acc01ccf362a4ca4636d1ddc70626093361
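
   For reference, the throttling described in the Javadoc of the diff above (one batch of at most `batchSize` deletions per interval, plus a minimum pause after a successful round) can be sketched as below. This is only an illustration under stated assumptions: `CleanupThrottleSketch` and its method names are hypothetical and not part of the PR, and the per-day figure ignores the time a batch itself takes to run.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical helper, NOT part of the PR: reproduces the throttling
 * arithmetic described in the SlingIdCleanupTask Javadoc.
 */
final class CleanupThrottleSketch {

    private CleanupThrottleSketch() {
        // static helpers only
    }

    /** Upper bound of deletions per day, assuming one full batch per interval. */
    static long maxDeletionsPerDay(int batchSize, long intervalMillis) {
        final long batchesPerDay = TimeUnit.DAYS.toMillis(1) / intervalMillis;
        return batchesPerDay * batchSize;
    }

    /**
     * Mirrors the guard at the top of run(): a round is skipped while the
     * last successful run is less than minCleanupDelayMillis ago. A
     * non-positive lastSuccessfulRun means "never ran", so nothing is skipped.
     */
    static boolean shouldSkip(long lastSuccessfulRun, long now, long minCleanupDelayMillis) {
        return lastSuccessfulRun > 0 && now - lastSuccessfulRun < minCleanupDelayMillis;
    }

    public static void main(String[] args) {
        // defaults from the diff: batch size 50, interval 10min, min delay 13h
        final long perDay = maxDeletionsPerDay(50, TimeUnit.MINUTES.toMillis(10));
        System.out.println("max deletions per day: " + perDay);
        final long thirteenHours = TimeUnit.HOURS.toMillis(13);
        System.out.println("skip 1h after success: "
                + shouldSkip(1_000L, 1_000L + TimeUnit.HOURS.toMillis(1), thirteenHours));
    }
}
```

   With the defaults (50 deletions, 10min interval) there are 144 batches per day, so the upper bound works out to 7200 deletions per day, i.e. 5 per minute or one every 12 seconds on average.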




[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1462081192

   Kudos, SonarCloud Quality Gate passed! [Quality Gate passed](https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13)
   
   [0 Bugs](https://sonarcloud.io/project/issues?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13&resolved=false&types=BUG) (rating A)  
   [0 Vulnerabilities](https://sonarcloud.io/project/issues?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13&resolved=false&types=VULNERABILITY) (rating A)  
   [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13&resolved=false&types=SECURITY_HOTSPOT) (rating A)  
   [30 Code Smells](https://sonarcloud.io/project/issues?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13&resolved=false&types=CODE_SMELL) (rating A)
   
   [86.6% Coverage](https://sonarcloud.io/component_measures?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13&metric=new_coverage&view=list)  
   [0.0% Duplication](https://sonarcloud.io/component_measures?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13&metric=new_duplicated_lines_density&view=list)
   
   



[GitHub] [sling-org-apache-sling-discovery-oak] joerghoh commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "joerghoh (via GitHub)" <gi...@apache.org>.

joerghoh commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1130652890


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,585 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    private final static long MIN_CLEANUP_DELAY_MILLIS = 46800000; // 13 hours, to intraday load balance
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 deletion every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    private long lastSuccessfulRun = -1;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: enabled = {}, initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                isEnabled(), initialDelayMillis, intervalMillis, batchSize,
+                minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        if (!isEnabled()) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+            logger.info("handleTopologyEvent: slingId cleanup is disabled");
+            return;
+        }
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * Reads the system property that enables or disables this task
+     */
+    private static boolean isEnabled() {
+        final String systemPropertyValue = System
+                .getProperty(SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME);
+        return standardConverter().convert(systemPropertyValue).defaultValue(false)
+                .to(Boolean.class);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledDate = cal.getTime();
+        logger.debug(
+                "recreateSchedule: scheduling a cleanup in {} milliseconds from now, which is: {}",
+                delayMillis, scheduledDate);
+        ScheduleOptions options = localScheduler.AT(scheduledDate);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default it gets called 10min after a topology change and then at most every
+     * 10min until cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (lastSuccessfulRun > 0 && System.currentTimeMillis()
+                - lastSuccessfulRun < MIN_CLEANUP_DELAY_MILLIS) {
+            logger.debug(
+                    "run: last cleanup was {} millis ago, which is less than {} millis, therefore not cleaning up yet.",
+                    System.currentTimeMillis() - lastSuccessfulRun,
+                    MIN_CLEANUP_DELAY_MILLIS);
+            recreateSchedule();
+            return;
+        }
+        runCount.incrementAndGet();
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info(
+                "run: slingId cleanup done, run counter = {}, delete counter = {}, completion counter = {}",
+                getRunCount(), getDeleteCount(), getCompletionCount());
+        lastSuccessfulRun = System.currentTimeMillis();
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    private boolean cleanup() {

Review Comment:
   This method is quite long. Can you split it into smaller pieces?




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1128261618


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This backup task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date inFiveMinutes = cal.getTime();
+        logger.debug("resetSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(inFiveMinutes);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default it gets called 10min after a topology change and then at most every
+     * 10min until cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;

Review Comment:
   I consider this paranoia code, as it should not really occur, so I guess it's more of a philosophical question. I'm in favour of retrying here.
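
   To make the retry semantics concrete, here is a simplified, self-contained sketch. It is not the actual SlingIdCleanupTask code: the class name RetrySketch, the generic dependency field and the rescheduleCount counter are made up for illustration, and the counter merely stands in for the call to recreateSchedule(). It only shows the contract that cleanup() returning true (including the "should not happen" null case) leads run() to schedule another attempt.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Simplified sketch (NOT the actual SlingIdCleanupTask) of the
 * "return true means retry later" contract between cleanup() and run().
 */
public class RetrySketch {

    /** Hypothetical stand-in for a dependency such as the ResourceResolverFactory. */
    private final Object dependency;

    /** Counts requested retries; stands in for calling recreateSchedule(). */
    final AtomicInteger rescheduleCount = new AtomicInteger(0);

    RetrySketch(Object dependency) {
        this.dependency = dependency;
    }

    /** @return true if another attempt should be scheduled, false if done */
    boolean cleanup() {
        if (dependency == null) {
            // paranoia case: should not occur, but ask for a retry instead of giving up
            return true;
        }
        // the real work would happen here; this sketch pretends nothing is left to do
        return false;
    }

    void run() {
        boolean mightHaveMore = true; // an exception in cleanup() also leads to a retry
        try {
            mightHaveMore = cleanup();
        } catch (RuntimeException e) {
            System.err.println("cleanup failed, retrying later: " + e);
        }
        if (mightHaveMore) {
            rescheduleCount.incrementAndGet(); // stands in for recreateSchedule()
        }
    }

    public static void main(String[] args) {
        RetrySketch missing = new RetrySketch(null);
        missing.run();
        System.out.println("missing dependency, reschedules = " + missing.rescheduleCount.get());

        RetrySketch present = new RetrySketch(new Object());
        present.run();
        System.out.println("dependency present, reschedules = " + present.rescheduleCount.get());
    }
}
```

   The alternative would be to return false in the null case, which would end the cleanup round until the next topology event triggers a new schedule.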




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1128278311


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This backup task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe: if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date inFiveMinutes = cal.getTime();
+        logger.debug("recreateSchedule: scheduling a cleanup in {} milliseconds from now.",

Review Comment:
   done at https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/504012b9a3d1b3d30445209808ef98e328a62903
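The delay selection in the quoted `recreateSchedule()` hunk (initial delay on the first run, regular interval afterwards) can be sketched in isolation as follows. This is a minimal illustration under assumptions, not the PR's code: `CleanupDelaySketch`, `nextDelay` and `nextRunAt` are hypothetical names, and `java.time` is used here in place of the `Calendar` arithmetic in the hunk.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for illustration only; not part of SlingIdCleanupTask. */
final class CleanupDelaySketch {

    /** First run waits for the initial delay, later runs wait for the interval. */
    static Duration nextDelay(boolean firstRun, Duration initialDelay, Duration interval) {
        return firstRun ? initialDelay : interval;
    }

    /** Point in time at which the next cleanup round would be scheduled. */
    static Instant nextRunAt(Instant now, boolean firstRun, Duration initialDelay, Duration interval) {
        return now.plus(nextDelay(firstRun, initialDelay, interval));
    }

    public static void main(String[] args) {
        // Illustrative values only; the defaults in the PR are 10min for both.
        Instant now = Instant.now();
        Duration initialDelay = Duration.ofMinutes(10);
        Duration interval = Duration.ofMinutes(1);
        System.out.println("first run at: " + nextRunAt(now, true, initialDelay, interval));
        System.out.println("later run at: " + nextRunAt(now, false, initialDelay, interval));
    }
}
```

Naming the result after what it represents (for example `nextRunAt`) rather than a fixed duration keeps it accurate when the delay is configurable.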



[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1465796482

   Kudos, SonarCloud Quality Gate passed! (https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13)

   0 Bugs (rating A)
   0 Vulnerabilities (rating A)
   0 Security Hotspots (rating A)
   30 Code Smells (rating A)

   86.6% Coverage
   0.0% Duplication



[GitHub] [sling-org-apache-sling-discovery-oak] rishabhdaim commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "rishabhdaim (via GitHub)" <gi...@apache.org>.

rishabhdaim commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1133466597


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,601 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it was never seen in previous topologies by the current leader instance</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ * <p>
+ * Additionally, the cleanup is skipped for 13 hours after a successful cleanup,
+ * to avoid unnecessary load on the repository. The value 13 is a heuristic:
+ * at most about 2 cleanup rounds per day makes sense, and for a very
+ * long-lived leader the 1 additional hour spreads the rounds somewhat
+ * throughout the day. This is to further minimize any load side-effects.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /** default minimal cleanup delay of 13h, to balance load within a day */
+    final static long MIN_CLEANUP_DELAY_MILLIS = 46800000;
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the repository with writes, the deletion is
+     * batched into 50 at a time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    private long lastSuccessfulRun = -1;
+
+    /**
+     * Minimal delay after a successful cleanup round, in millis
+     */
+    private long minCleanupDelayMillis = MIN_CLEANUP_DELAY_MILLIS;
+
+    /**
+     * contains all slingIds ever seen by this instance - should not be a long list
+     * so not a memory issue
+     */
+    private Set<String> seenInstances = new HashSet<>();

Review Comment:
   Since this set is used by multiple threads, shouldn't we replace it with `CopyOnWriteArraySet`?
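A minimal sketch of what this suggestion could look like, assuming `seenInstances` is written from the topology listener thread and read from the scheduler thread (the hunk above does not show the accessing code, so that is an assumption). `SeenInstancesSketch`, `markSeen` and `wasSeen` are hypothetical names, not part of the PR. `CopyOnWriteArraySet` copies its backing array on every write, which suits a small, rarely modified set; `ConcurrentHashMap.newKeySet()` is a common alternative if writes become frequent.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;

/** Hypothetical illustration only; not the PR's SlingIdCleanupTask. */
final class SeenInstancesSketch {

    /** Thread-safe set; the concrete implementation is chosen by the caller. */
    private final Set<String> seenInstances;

    SeenInstancesSketch(Set<String> threadSafeSet) {
        this.seenInstances = threadSafeSet;
    }

    /** Would be called from the topology listener with the slingIds of a new view. */
    void markSeen(Collection<String> slingIds) {
        seenInstances.addAll(slingIds);
    }

    /** Would be called from the cleanup run before treating a slingId as garbage. */
    boolean wasSeen(String slingId) {
        return seenInstances.contains(slingId);
    }

    public static void main(String[] args) {
        // Both implementations are safe for concurrent readers and writers.
        SeenInstancesSketch copyOnWrite = new SeenInstancesSketch(new CopyOnWriteArraySet<>());
        SeenInstancesSketch hashBased = new SeenInstancesSketch(ConcurrentHashMap.newKeySet());
        copyOnWrite.markSeen(Arrays.asList("id-1", "id-2"));
        hashBased.markSeen(Arrays.asList("id-1", "id-2"));
        System.out.println(copyOnWrite.wasSeen("id-1") + " " + hashBased.wasSeen("id-3"));
    }
}
```

Either choice only makes individual set operations safe; compound check-then-act sequences across the set and other fields would still need their own coordination.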



[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1465806119

   Kudos, SonarCloud Quality Gate passed! (https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13)

   0 Bugs (rating A)
   0 Vulnerabilities (rating A)
   0 Security Hotspots (rating A)
   30 Code Smells (rating A)

   86.4% Coverage
   0.0% Duplication



[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1465806238

   Please retry analysis of this Pull-Request directly on [SonarCloud](https://sonarcloud.io/project/issues?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13).



[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1133641583


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,601 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it was never seen in previous topologies by the current leader instance</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ * <p>
+ * Additionally, the cleanup is skipped for 13 hours after a successful cleanup.
+ * This is to avoid unnecessary load on the repository. The number of 13
+ * incorporates some heuristics such as : about 2 cleanup rounds per day maximum
+ * makes sense, if a leader is very long living, then the 1 additional hour
+ * makes it spread somewhat throughout the day. This is to further minimize any
+ * load side-effects.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /** default minimal cleanup delay at 13h, to intraday load balance */
+    final static long MIN_CLEANUP_DELAY_MILLIS = 46800000;
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    private long lastSuccessfulRun = -1;
+
+    /**
+     * Minimal delay after a successful cleanup round, in millis
+     */
+    private long minCleanupDelayMillis = MIN_CLEANUP_DELAY_MILLIS;
+
+    /**
+     * contains all slingIds ever seen by this instance - should not be a long list
+     * so not a memory issue
+     */
+    private Set<String> seenInstances = new HashSet<>();

Review Comment:
   Indeed, nice catch! fixed in https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/7293c201c083cf0ad43b96d7c6143b84108f3718
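
For readers following the thread, the four garbage criteria from the class Javadoc quoted above can be condensed into a single predicate. This is only a sketch under stated assumptions: `GarbageCheckSketch`, `isGarbage` and all parameter names are hypothetical, and the real task reads these values from the repository rather than from in-memory sets.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical illustration of the Javadoc's criteria; not the PR's code.
class GarbageCheckSketch {

    /** 1 week in milliseconds, the default minimum creation age. */
    static final long WEEK_MILLIS = 7L * 24 * 60 * 60 * 1000;

    static boolean isGarbage(String slingId, Set<String> activeIds,
            Set<String> seenIds, Set<String> idMapIds, long createdAtMillis,
            long nowMillis, long minCreationAgeMillis) {
        if (activeIds.contains(slingId)) {
            return false; // still in the current topology
        }
        if (seenIds.contains(slingId)) {
            return false; // seen in an earlier topology by this leader
        }
        if (idMapIds.contains(slingId)) {
            return false; // still referenced by the idmap
        }
        // leaderElectionId must have been created more than the minimum age ago
        return nowMillis - createdAtMillis > minCreationAgeMillis;
    }

    public static void main(String[] args) {
        Set<String> active = Collections.singleton("a");
        Set<String> seen = Collections.singleton("b");
        Set<String> idMap = Collections.singleton("c");
        long now = 10 * WEEK_MILLIS;
        // unknown id created two weeks ago: prints true
        System.out.println(isGarbage("old", active, seen, idMap,
                now - 2 * WEEK_MILLIS, now, WEEK_MILLIS));
        // id that was seen before: prints false
        System.out.println(isGarbage("b", active, seen, idMap,
                now - 2 * WEEK_MILLIS, now, WEEK_MILLIS));
    }
}
```

Checking the cheap in-memory sets before the age comparison keeps the predicate conservative: any sign that an instance may still be alive rules out deletion.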




[GitHub] [sling-org-apache-sling-discovery-oak] rishabhdaim commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "rishabhdaim (via GitHub)" <gi...@apache.org>.

rishabhdaim commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1133696153


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,598 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.CopyOnWriteArraySet;
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.stream.Collectors;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it was never seen in previous topologies by the current leader instance</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ * <p>
+ * Additionally, the cleanup is skipped for 13 hours after a successful cleanup.
+ * This is to avoid unnecessary load on the repository. The number of 13
+ * incorporates some heuristics such as : about 2 cleanup rounds per day maximum
+ * makes sense, if a leader is very long living, then the 1 additional hour
+ * makes it spread somewhat throughout the day. This is to further minimize any
+ * load side-effects.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /** default minimal cleanup delay at 13h, to intraday load balance */
+    final static long MIN_CLEANUP_DELAY_MILLIS = 46800000;
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    private long lastSuccessfulRun = -1;
+
+    /**
+     * Minimal delay after a successful cleanup round, in millis
+     */
+    private long minCleanupDelayMillis = MIN_CLEANUP_DELAY_MILLIS;
+
+    /**
+     * contains all slingIds ever seen by this instance - should not be a long list
+     * so not a memory issue
+     */
+    private Set<String> seenInstances = new CopyOnWriteArraySet<>();
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis, long minCleanupDelayMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.minCleanupDelayMillis = minCleanupDelayMillis;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.slingid_cleanup_initial_delay(),
+                config.slingid_cleanup_interval(),
+                config.slingid_cleanup_batchsize(),
+                config.slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: enabled = {}, initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                isEnabled(), initialDelayMillis, intervalMillis, batchSize,
+                minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        if (!isEnabled()) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+            logger.debug("handleTopologyEvent: slingId cleanup is disabled");
+            return;
+        }
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            seenInstances.addAll(getActiveSlingIds(newView));
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * Reads the system property that enables or disables this task
+     */
+    private static boolean isEnabled() {
+        final String systemPropertyValue = System
+                .getProperty(SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME);
+        return standardConverter().convert(systemPropertyValue).defaultValue(false)
+                .to(Boolean.class);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledDate = cal.getTime();
+        logger.debug(
+                "recreateSchedule: scheduling a cleanup in {} milliseconds from now, which is: {}",
+                delayMillis, scheduledDate);
+        ScheduleOptions options = localScheduler.AT(scheduledDate);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default should get called at most every 10 minutes until cleanup is done, or
+     * 10min after a topology change.
+     */
+    @Override
+    public void run() {
+        if (lastSuccessfulRun > 0 && System.currentTimeMillis()
+                - lastSuccessfulRun < minCleanupDelayMillis) {
+            logger.debug(
+                    "run: last cleanup was {} millis ago, which is less than {} millis, therefore not cleaning up yet.",
+                    System.currentTimeMillis() - lastSuccessfulRun,
+                    minCleanupDelayMillis);
+            recreateSchedule();
+            return;
+        }
+        runCount.incrementAndGet();
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info(
+                "run: slingId cleanup done, run counter = {}, delete counter = {}, completion counter = {}",
+                getRunCount(), getDeleteCount(), getCompletionCount());
+        lastSuccessfulRun = System.currentTimeMillis();
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    private boolean cleanup() {
+        logger.debug("cleanup: start");
+        if (!isEnabled()) {
+            // bit of overkill probably, as this shouldn't happen.
+            // but adds to a good night's sleep.
+            logger.debug("cleanup: not enabled, stopping.");
+            return false;
+        }
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.debug("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = getActiveSlingIds(localCurrentView);
+        try (ResourceResolver resolver = localFactory.getServiceResourceResolver(null)) {
+            final Resource clusterInstances = resolver
+                    .getResource(localConfig.getClusterInstancesPath());
+            final Resource idMap = resolver.getResource(localConfig.getIdMapPath());
+            final Resource syncTokens = resolver
+                    .getResource(localConfig.getSyncTokenPath());
+            if (clusterInstances == null || idMap == null || syncTokens == null) {
+                logger.warn("cleanup: no resource found at {}, {} or {}, stopping.",
+                        localConfig.getClusterInstancesPath(), localConfig.getIdMapPath(),
+                        localConfig.getSyncTokenPath());
+                return false;
+            }
+            resolver.refresh();
+            final ValueMap idMapMap = idMap.adaptTo(ValueMap.class);
+            final ModifiableValueMap syncTokenMap = syncTokens
+                    .adaptTo(ModifiableValueMap.class);
+            final Calendar now = Calendar.getInstance();
+            int removed = 0;
+            boolean mightHaveMore = false;
+            final int localBatchSize = batchSize;
+            final long localMinCreationAgeMillis = minCreationAgeMillis;
+            for (Resource resource : clusterInstances.getChildren()) {
+                if (topologyChanged(localCurrentView)) {
+                    return true;
+                }
+                final String slingId = resource.getName();
+                if (deleteIfOldSlingId(resource, slingId, syncTokenMap, idMapMap,
+                        activeSlingIds, now, localMinCreationAgeMillis)) {
+                    if (++removed >= localBatchSize) {
+                        // we need to stop
+                        mightHaveMore = true;
+                        break;
+                    }
+                }
+            }
+            // if we're not already at the batch limit, check syncTokens too
+            if (!mightHaveMore) {
+                for (String slingId : syncTokenMap.keySet()) {
+                    try {
+                        UUID.fromString(slingId);
+                    } catch (Exception e) {
+                        // not a uuid
+                        continue;
+                    }
+                    if (topologyChanged(localCurrentView)) {
+                        return true;
+                    }
+                    Resource resourceOrNull = clusterInstances.getChild(slingId);
+                    if (deleteIfOldSlingId(resourceOrNull, slingId, syncTokenMap,
+                            idMapMap, activeSlingIds, now, localMinCreationAgeMillis)) {
+                        if (++removed >= localBatchSize) {
+                            // we need to stop
+                            mightHaveMore = true;
+                            break;
+                        }
+                    }
+                }
+            }
+            if (topologyChanged(localCurrentView)) {
+                return true;
+            }
+            if (removed > 0) {
+                // only if we removed something we commit
+                resolver.commit();
+                logger.info(
+                        "cleanup : removed {} old slingIds (batch size : {}), potentially has more: {}",
+                        removed, localBatchSize, mightHaveMore);
+                deleteCount.addAndGet(removed);
+            }
+            firstRun = false;
+            completionCount.incrementAndGet();
+            return mightHaveMore;
+        } catch (LoginException e) {
+            logger.error("cleanup: could not log in administratively: " + e, e);
+            throw new RuntimeException("Could not log in to repository (" + e + ")", e);
+        } catch (PersistenceException e) {
+            logger.error("cleanup: got a PersistenceException: " + e, e);
+            throw new RuntimeException(
+                    "Exception while talking to repository (" + e + ")", e);
+        } finally {
+            logger.debug("cleanup: done.");
+        }
+    }
+
+    private Set<String> getActiveSlingIds(final TopologyView localCurrentView) {
+        return localCurrentView.getLocalInstance().getClusterView()
+                .getInstances().stream().map(InstanceDescription::getSlingId).collect(Collectors.toSet());
+    }
+
+    private boolean topologyChanged(TopologyView localCurrentView) {
+        if (!hasTopology || currentView != localCurrentView
+                || !localCurrentView.isCurrent()) {
+            // we got interrupted during cleanup
+            // let's not commit at all then
+            logger.debug(
+                    "topologyChanged : topology changing during cleanup - not committing this time - stopping for now.");
+            return true;
+        } else {
+            return false;
+        }
+    }
+
+    static long millisOf(Object leaderElectionIdCreatedAt) {
+        if (leaderElectionIdCreatedAt == null) {
+            return -1;
+        }
+        if (leaderElectionIdCreatedAt instanceof Date) {
+            final Date d = (Date) leaderElectionIdCreatedAt;
+            return d.getTime();
+        }
+        if (leaderElectionIdCreatedAt instanceof Calendar) {
+            final Calendar c = (Calendar) leaderElectionIdCreatedAt;
+            return c.getTimeInMillis();
+        }
+        return -1;
+    }
+
+    private boolean deleteIfOldSlingId(Resource resourceOrNull, String slingId,
+            ModifiableValueMap syncTokenMap, ValueMap idMapMap,
+            Set<String> activeSlingIds, Calendar now, long localMinCreationAgeMillis)
+            throws PersistenceException {
+        logger.trace("deleteIfOldSlingId : handling slingId = {}", slingId);
+        if (activeSlingIds.contains(slingId)) {
+            logger.trace("deleteIfOldSlingId : slingId is currently active : {}",
+                    slingId);
+            return false;
+        } else if (seenInstances.contains(slingId)) {
+            logger.trace("deleteIfOldSlingId : slingId seen active previously : {}",
+                    slingId);
+            return false;
+        }
+        // only check in idmap and for leaderElectionId details if the clusterInstance
+        // resource is there
+        if (resourceOrNull != null) {
+            Object clusterNodeId = idMapMap.get(slingId);
+            if (clusterNodeId == null) {
+                logger.trace("deleteIfOldSlingId : slingId {} not recently in use : {}",
+                        slingId, clusterNodeId);

Review Comment:
   Since clusterNodeId is null here, it doesn't add any value to the log.
   
   ```suggestion
                   logger.trace("deleteIfOldSlingId : slingId {} not recently in use", slingId);
   ```
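
For readers following the quoted `cleanup()` loop, here is a minimal, repository-free sketch of the batch-limited pass it performs. The class `BatchCleanupSketch`, its `Result` type and the `cleanupOnce` method are hypothetical names used only for illustration; the real task works on Sling resources and also checks the idmap and creation age, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of one batch-limited cleanup pass (not the PR's code). */
final class BatchCleanupSketch {

    /** Outcome of one pass: what was removed and whether another pass is needed. */
    static final class Result {
        final List<String> removed;
        final boolean mightHaveMore;

        Result(List<String> removed, boolean mightHaveMore) {
            this.removed = removed;
            this.mightHaveMore = mightHaveMore;
        }
    }

    /**
     * Walks the candidates in order, skips active slingIds and stops as soon
     * as batchSize deletions have been collected.
     */
    static Result cleanupOnce(List<String> candidates, Set<String> active, int batchSize) {
        final List<String> removed = new ArrayList<>();
        boolean mightHaveMore = false;
        for (String slingId : candidates) {
            if (active.contains(slingId)) {
                continue; // never delete an instance that is in the current view
            }
            removed.add(slingId);
            if (removed.size() >= batchSize) {
                // batch limit reached: report that another round may be required
                mightHaveMore = true;
                break;
            }
        }
        return new Result(removed, mightHaveMore);
    }
}
```

As in the quoted code, hitting the limit only signals that more garbage might exist; a follow-up pass that removes nothing is harmless.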




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1128291733


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This backup task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but lets stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("resetSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date inFiveMinutes = cal.getTime();
+        logger.debug("resetSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(inFiveMinutes);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler triggered from recreateSchedule(). By
+     * default this gets called at most every 10 minutes until cleanup is done, or
+     * 10min after a topology change.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);
+
+            final Resource clusterInstances = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getClusterInstancesPath());
+            final Resource idMap = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getIdMapPath());
+            final Resource syncTokens = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getSyncTokenPath());
+            resolver.revert();

Review Comment:
   removed at https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/9f625b6b19882255141a9e4c18beaef8fbc5e0e7
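
For reference, the garbage criteria listed in the class Javadoc of the quoted diff can be sketched as a pure predicate. This is only an illustration under stated assumptions: `GarbageCheckSketch` and `isGarbage` are hypothetical names, the idmap is modelled as a plain set of slingIds, and treating an unknown creation time (a negative value, as `millisOf` returns `-1`) as "not garbage" is a conservative choice of this sketch rather than something the thread confirms.

```java
import java.util.Set;

/** Hypothetical sketch of the three garbage criteria (not the PR's code). */
final class GarbageCheckSketch {

    /**
     * A slingId counts as garbage only if it is not in the current topology,
     * not in the idmap, and its leaderElectionId is older than the minimum age.
     */
    static boolean isGarbage(String slingId, Set<String> activeSlingIds,
            Set<String> idMapSlingIds, long createdAtMillis, long nowMillis,
            long minCreationAgeMillis) {
        if (activeSlingIds.contains(slingId)) {
            return false; // still part of the current topology
        }
        if (idMapSlingIds.contains(slingId)) {
            return false; // still referenced by the idmap
        }
        if (createdAtMillis < 0) {
            return false; // assumption: unknown creation time is kept, to be safe
        }
        return nowMillis - createdAtMillis > minCreationAgeMillis;
    }
}
```

Keeping the check free of repository access makes the age threshold easy to unit test independently of the scheduler.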




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1129801802


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,571 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This backup task is in charge of cleaning up old SlingIds from the repository.")

Review Comment:
   +1
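
As a quick cross-check of the throughput that the Javadoc for `DEFAULT_CLEANUP_BATCH_SIZE` describes, the arithmetic can be sketched as below. `BatchThroughputSketch` and its methods are hypothetical names; the sketch assumes every round deletes a full batch and ignores the initial delay.

```java
/** Hypothetical sketch of the batch throughput arithmetic (not the PR's code). */
final class BatchThroughputSketch {

    /** Upper bound on deletions per day when every round deletes a full batch. */
    static long deletionsPerDay(int batchSize, long intervalMillis) {
        final long millisPerDay = 24L * 60 * 60 * 1000;
        return batchSize * millisPerDay / intervalMillis;
    }

    /** Average number of seconds between two deletions. */
    static double secondsPerDeletion(int batchSize, long intervalMillis) {
        return intervalMillis / 1000.0 / batchSize;
    }
}
```

With the defaults from the diff (batch size 50, interval 600000 ms) this gives one deletion every 12 seconds on average, or 7200 deletions per day.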




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1128269430


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates in the following places, from which it is
+ * cleaned up:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology
+     * and was started more than 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - together with the batchSize this
+     * keeps the repository load low
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, just a few slingIds, 1-5 perhaps. If there are
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can take a considerable amount of
+     * time. So, to not overload the repository with writes, the deletion is
+     * batched into 50 at a time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This backup task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledTime = cal.getTime();
+        logger.debug("recreateSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(scheduledTime);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default the first run happens 10min after a topology change and subsequent
+     * runs follow at most every 10min until cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);
+
+            final Resource clusterInstances = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getClusterInstancesPath());
+            final Resource idMap = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getIdMapPath());
+            final Resource syncTokens = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getSyncTokenPath());
+            resolver.revert();
+            resolver.refresh();
+
+            final ValueMap idMapMap = idMap.adaptTo(ValueMap.class);
+            final ModifiableValueMap syncTokenMap = syncTokens
+                    .adaptTo(ModifiableValueMap.class);
+            final Calendar now = Calendar.getInstance();
+            int removed = 0;
+            boolean mightHaveMore = false;
+            int localBatchSize = batchSize;
+            long localMinCreationAgeMillis = minCreationAgeMillis;
+            for (Resource resource : clusterInstances.getChildren()) {
+                if (!hasTopology || currentView != localCurrentView
+                        || !localCurrentView.isCurrent()) {
+                    // we got interrupted during cleanup
+                    // let's not commit at all then
+                    logger.info(
+                            "cleanup : topology changing during cleanup - not committing this time - stopping for now.");
+                    return true;
+                }
+                final String slingId = resource.getName();
+                logger.info("cleanup : handling slingId = {}", slingId);

Review Comment:
   lowered to debug already
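
For readers following the hunk above, the three garbage criteria listed in the class Javadoc (not in the current topology, not in the idmap, leaderElectionId older than the minimum creation age) can be sketched as a single predicate. This is a simplified illustration under stated assumptions, not the code in the PR: the class `GarbageCheckSketch`, the method `isGarbage` and its parameters are hypothetical, and the repository lookups are replaced by plain sets and timestamps.

```java
import java.util.Set;

final class GarbageCheckSketch {

    /**
     * Returns true only if all three criteria from the Javadoc hold
     * (hypothetical helper; the sets stand in for repository lookups).
     */
    static boolean isGarbage(String slingId, Set<String> activeSlingIds,
            Set<String> idMapSlingIds, long createdAtMillis, long nowMillis,
            long minCreationAgeMillis) {
        if (activeSlingIds.contains(slingId)) {
            // still part of the current topology
            return false;
        }
        if (idMapSlingIds.contains(slingId)) {
            // still referenced by the idmap
            return false;
        }
        // old enough: created more than minCreationAgeMillis ago
        return nowMillis - createdAtMillis > minCreationAgeMillis;
    }
}
```

Checking the cheap in-memory conditions first keeps the age comparison, which in the real task needs a repository read, for the candidates that remain.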




[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1461888787

   SonarCloud Quality Gate passed.
   Dashboard: https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13

   0 Bugs (rating A)
   0 Vulnerabilities (rating A)
   0 Security Hotspots (rating A)
   34 Code Smells (rating A)

   86.2% Coverage
   0.0% Duplication



[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1461895297

   SonarCloud Quality Gate passed.
   Dashboard: https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13

   0 Bugs (rating A)
   0 Vulnerabilities (rating A)
   0 Security Hotspots (rating A)
   33 Code Smells (rating A)

   87.1% Coverage
   0.0% Duplication



[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1128308189


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates in the following places, from which it is
+ * cleaned up:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology
+     * and was started more than 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - together with the batchSize this
+     * keeps the repository load low
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, just a few slingIds, 1-5 perhaps. If there are
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can take a considerable amount of
+     * time. So, to not overload the repository with writes, the deletion is
+     * batched into 50 at a time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but lets stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledTime = cal.getTime();
+        logger.debug("recreateSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(scheduledTime);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler, triggered from recreateSchedule(). By
+     * default it gets called 10min after a topology change and then at most every
+     * 10min until cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);

Review Comment:
   Fixed the try handling in https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/bd00eeb29de4ab668fc64f0c0bd8f711f31b40b6
   
   Now, regarding the service user: IIUC that is already happening via `getServiceResourceResolver`?
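
   For readers following along, here is a minimal sketch of the try-with-resources pattern, under the assumption (not confirmed by this thread) that the try fix is about closing the resolver reliably. `FakeResolver` is a hypothetical stand-in so that the example compiles without Sling on the classpath; it is not the Sling `ResourceResolver` API.

```java
import java.util.Arrays;
import java.util.List;

final class TryWithResourcesSketch {

    /** Hypothetical stand-in for a closeable resolver; not a Sling class. */
    static final class FakeResolver implements AutoCloseable {
        boolean closed = false;
        int deleted = 0;

        void delete(String path) {
            deleted++;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Deletes all given paths and returns the resolver so callers can inspect it. */
    static FakeResolver cleanup(List<String> paths) {
        final FakeResolver resolver = new FakeResolver();
        // try-with-resources closes the resolver even if delete() throws
        try (FakeResolver r = resolver) {
            for (String path : paths) {
                r.delete(path);
            }
        }
        return resolver;
    }

    public static void main(String[] args) {
        FakeResolver r = cleanup(Arrays.asList("/a", "/b"));
        System.out.println("deleted=" + r.deleted + ", closed=" + r.closed);
    }
}
```

   Compared to a `resolver = null; try { ... } finally { ... }` block, this form needs no null check before closing.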




[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1460481734

   Kudos, SonarCloud Quality Gate passed!
   https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13

   0 Bugs (rating A)
   0 Vulnerabilities (rating A)
   0 Security Hotspots (rating A)
   27 Code Smells (rating A)

   85.1% Coverage
   0.0% Duplication



[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1129804053


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,571 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology and
+     * was started more than 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: enabled = {}, initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                isEnabled(), initialDelayMillis, intervalMillis, batchSize,
+                minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        if (!isEnabled()) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+            logger.info("handleTopologyEvent: slingId cleanup is disabled");

Review Comment:
   done (I lowered it to debug only, since debug was intended to show start / done as well)
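
   As an illustration of the enabled check referenced in the diff above, here is a minimal sketch of a system-property toggle. The property name is taken from the diff; the body of `isEnabled()` is not shown in this thread, so the use of `Boolean.getBoolean` and the resulting default (disabled when the property is unset) are assumptions.

```java
final class CleanupToggle {

    /** Property name as it appears in the quoted diff. */
    static final String ENABLED_PROPERTY = "org.apache.sling.discovery.oak.slingidcleanup.enabled";

    /**
     * Assumption: the feature is off unless the system property is set to "true".
     * Boolean.getBoolean returns false when the property is missing.
     */
    static boolean isEnabled() {
        return Boolean.getBoolean(ENABLED_PROPERTY);
    }

    public static void main(String[] args) {
        System.out.println("cleanup enabled: " + isEnabled());
        System.setProperty(ENABLED_PROPERTY, "true");
        System.out.println("cleanup enabled after setting property: " + isEnabled());
    }
}
```

   Reading the property on every topology event, rather than caching it at activation, would let the toggle be flipped at runtime; whether the PR does so is not visible here.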




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1458589196

   @joerghoh thanks a lot for the review! I've now addressed or commented on all points.



[GitHub] [sling-org-apache-sling-discovery-oak] joerghoh commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "joerghoh (via GitHub)" <gi...@apache.org>.

joerghoh commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1130650899


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,571 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    final static String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology and
+     * was started more than 1 week ago is very unlikely to still be active

Review Comment:
   What would be the worst case if the slingId of a still-active instance (under the circumstances you described above) were deleted?
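
   To make the condition under discussion concrete, here is a minimal sketch of the garbage criteria listed in the class Javadoc: not in the current topology, not in the idmap, and created longer ago than the minimum age. The class, method and parameter names are hypothetical; the actual check in the PR reads these values from the repository and may differ in detail.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class SlingIdGarbageCheck {

    /** 1 week in milliseconds, matching DEFAULT_MIN_CREATION_AGE_MILLIS in the diff. */
    static final long MIN_CREATION_AGE_MILLIS = 7L * 24 * 60 * 60 * 1000;

    /**
     * Returns true only if the slingId is unknown to both the topology and the
     * idmap and its creation time lies further back than the minimum age.
     */
    static boolean isGarbage(String slingId, Set<String> activeSlingIds,
            Set<String> idMapSlingIds, long createdAtMillis, long nowMillis,
            long minCreationAgeMillis) {
        if (activeSlingIds.contains(slingId) || idMapSlingIds.contains(slingId)) {
            return false;
        }
        return nowMillis - createdAtMillis > minCreationAgeMillis;
    }

    public static void main(String[] args) {
        Set<String> active = new HashSet<>(Arrays.asList("id-active"));
        Set<String> idMap = Collections.emptySet();
        long now = System.currentTimeMillis();
        long eightDaysAgo = now - 8L * 24 * 60 * 60 * 1000;
        System.out.println(isGarbage("id-old", active, idMap, eightDaysAgo, now,
                MIN_CREATION_AGE_MILLIS));
    }
}
```

   In this sketch the age check is the only safeguard for an instance that is running but absent from both sets, which is the case the question above asks about.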




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1128268791


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology
+     * and was started more than 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledTime = cal.getTime();
+        logger.debug("recreateSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(scheduledTime);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler triggered from recreateSchedule(). By
+     * default gets called 10min after a topology change and then at most every
+     * 10min until cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);
+
+            final Resource clusterInstances = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getClusterInstancesPath());
+            final Resource idMap = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getIdMapPath());
+            final Resource syncTokens = ResourceHelper.getOrCreateResource(resolver,
+                    localConfig.getSyncTokenPath());
+            resolver.revert();

Review Comment:
   There must have been some issue with similar code, as this is copied from [here](https://github.com/apache/sling-org-apache-sling-discovery-impl/blob/master/src/main/java/org/apache/sling/discovery/impl/DiscoveryServiceImpl.java#L524), but it is probably overkill by now.
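
   As a side note on the `DEFAULT_CLEANUP_BATCH_SIZE` Javadoc in the hunk above, the throughput figures can be checked with a few lines of arithmetic. This is a rough sketch only: `CleanupThroughput` and its methods are hypothetical names, and the calculation assumes that every run deletes a full batch and ignores both the initial delay and the time a batch itself takes.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper to check the batching arithmetic; not part of the PR. */
final class CleanupThroughput {

    private static final long MILLIS_PER_DAY = TimeUnit.DAYS.toMillis(1);

    private CleanupThroughput() {
    }

    /** Upper bound of deletions per day when every run deletes a full batch. */
    static long deletionsPerDay(int batchSize, long intervalMillis) {
        return (long) batchSize * MILLIS_PER_DAY / intervalMillis;
    }

    /** Average number of seconds between two deletions. */
    static double secondsPerDeletion(int batchSize, long intervalMillis) {
        return intervalMillis / 1000.0 / batchSize;
    }

    public static void main(String[] args) {
        int batchSize = 50;           // DEFAULT_CLEANUP_BATCH_SIZE
        long intervalMillis = 600000; // DEFAULT_CLEANUP_INTERVAL (10min)
        System.out.println("seconds per deletion: " + secondsPerDeletion(batchSize, intervalMillis));
        System.out.println("deletions per day: " + deletionsPerDay(batchSize, intervalMillis));
    }
}
```

   With the defaults of 50 deletions every 10min there are 144 runs per day, which gives one deletion every 12 seconds on average and at most 7200 deletions per day.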



[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1130915606


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.apache.sling.discovery.commons.providers.util.ResourceHelper;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology
+     * and was started more than 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at any time - with 10min delays in between. That results in
+     * an average of 1 cleanup every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger runcount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This background task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;
+
+        @AttributeDefinition(name = "Cleanup interval milliseconds", description = "Number of milliseconds after which to do another batch of cleaning up (if necessary)")
+        int org_apache_sling_discovery_oak_slingid_cleanup_interval() default DEFAULT_CLEANUP_INTERVAL;
+
+        @AttributeDefinition(name = "Cleanup batch size", description = "Maximum number of slingIds to cleanup in one batch.")
+        int org_apache_sling_discovery_oak_slingid_cleanup_batchsize() default DEFAULT_CLEANUP_BATCH_SIZE;
+
+        @AttributeDefinition(name = "Cleanup minimum creation age", description = "Minimum number of milliseconds since the slingId was created.")
+        long org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age() default DEFAULT_MIN_CREATION_AGE_MILLIS;
+    }
+
+    /**
+     * Test constructor
+     */
+    static SlingIdCleanupTask create(Scheduler scheduler, ResourceResolverFactory factory,
+            Config config, int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        final SlingIdCleanupTask s = new SlingIdCleanupTask();
+        s.scheduler = scheduler;
+        s.resourceResolverFactory = factory;
+        s.config = config;
+        s.config(initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+        return s;
+    }
+
+    @Activate
+    protected void activate(final BundleContext bc, final Conf config) {
+        this.modified(bc, config);
+    }
+
+    @Modified
+    protected void modified(final BundleContext bc, final Conf config) {
+        if (config == null) {
+            return;
+        }
+        config(config.org_apache_sling_discovery_oak_slingid_cleanup_initial_delay(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_interval(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_batchsize(),
+                config.org_apache_sling_discovery_oak_slingid_cleanup_min_creation_age());
+    }
+
+    @Deactivate
+    protected void deactivate() {
+        logger.info("deactivate : deactivated.");
+        hasTopology = false;
+    }
+
+    private void config(int initialDelayMillis, int intervalMillis, int batchSize,
+            long minCreationAgeMillis) {
+        this.initialDelayMillis = initialDelayMillis;
+        this.intervalMillis = intervalMillis;
+        this.batchSize = batchSize;
+        this.minCreationAgeMillis = minCreationAgeMillis;
+        logger.info(
+                "config: initial delay milliseconds = {}, interval milliseconds = {}, batch size = {}, min creation age milliseconds = {}",
+                initialDelayMillis, intervalMillis, batchSize, minCreationAgeMillis);
+    }
+
+    @Override
+    public void handleTopologyEvent(TopologyEvent event) {
+        final TopologyView newView = event.getNewView();
+        if (newView == null || event.getType() == Type.PROPERTIES_CHANGED) {
+            hasTopology = false; // stops potentially ongoing deletion
+            currentView = null;
+            // cancel cleanup schedule
+            stop();
+        } else {
+            hasTopology = true;
+            currentView = newView;
+            if (newView.getLocalInstance().isLeader()) {
+                // only execute on leader
+                recreateSchedule();
+            } else {
+                // should not be necessary, but let's stop anyway on non-leaders:
+                stop();
+            }
+        }
+    }
+
+    /**
+     * Cancels a potentially previously registered cleanup schedule.
+     */
+    private void stop() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("stop: no scheduler set, giving up.");
+            return;
+        }
+        final boolean unscheduled = localScheduler.unschedule(SCHEDULE_NAME);
+        logger.debug("stop: unschedule result={}", unscheduled);
+    }
+
+    /**
+     * This method can be invoked at any time to reset the schedule to do a fresh
+     * round of cleanup.
+     * <p>
+     * This method is thread-safe : if called concurrently, the fact that
+     * scheduler.schedule is synchronized ensures that ultimately there will be
+     * just 1 schedule active (which is the desired outcome).
+     */
+    private void recreateSchedule() {
+        final Scheduler localScheduler = scheduler;
+        if (localScheduler == null) {
+            // should not happen
+            logger.warn("recreateSchedule: no scheduler set, giving up.");
+            return;
+        }
+        final Calendar cal = Calendar.getInstance();
+        int delayMillis;
+        if (firstRun) {
+            delayMillis = initialDelayMillis;
+        } else {
+            delayMillis = intervalMillis;
+        }
+        cal.add(Calendar.MILLISECOND, delayMillis);
+        final Date scheduledTime = cal.getTime();
+        logger.debug("recreateSchedule: scheduling a cleanup in {} milliseconds from now.",
+                delayMillis);
+        ScheduleOptions options = localScheduler.AT(scheduledTime);
+        options.name(SCHEDULE_NAME);
+        options.canRunConcurrently(false); // should not concurrently execute
+        localScheduler.schedule(this, options);
+    }
+
+    /**
+     * Invoked via sling.commons.scheduler triggered from recreateSchedule(). By
+     * default gets called 10min after a topology change and then at most every
+     * 10min until cleanup is done.
+     */
+    @Override
+    public void run() {
+        if (!hasTopology) {
+            return;
+        }
+        boolean mightHaveMore = true;
+        try {
+            mightHaveMore = cleanup();
+        } catch (Exception e) {
+            // upon exception just log and retry in 10min
+            logger.error("run: got Exception while cleaning up slingIds : " + e, e);
+        }
+        if (mightHaveMore) {
+            // then continue in 10min
+            recreateSchedule();
+            return;
+        }
+        // log successful cleanup done, yes, on info
+        logger.info("run: slingId cleanup done.");
+    }
+
+    /**
+     * Do the actual cleanup of garbage slingIds and report back with true if there
+     * might be more or false if we're at the end.
+     * 
+     * @return true if there might be more garbage or false if we're at the end
+     */
+    boolean cleanup() {
+        logger.debug("cleanup: start");
+
+        final ResourceResolverFactory localFactory = resourceResolverFactory;
+        final Config localConfig = config;
+        if (localFactory == null || localConfig == null) {
+            logger.warn("cleanup: cannot cleanup due to rrf={} or c={}", localFactory,
+                    localConfig);
+            return true;
+        }
+        final TopologyView localCurrentView = currentView;
+        if (localCurrentView == null || !localCurrentView.isCurrent()) {
+            logger.info("cleanup : cannot cleanup as topology recently changed : {}",
+                    localCurrentView);
+            return true;
+        }
+        final Set<String> activeSlingIds = new HashSet<>();
+        for (InstanceDescription id : localCurrentView.getLocalInstance().getClusterView()
+                .getInstances()) {
+            activeSlingIds.add(id.getSlingId());
+        }
+
+        ResourceResolver resolver = null;
+        try {
+            resolver = localFactory.getServiceResourceResolver(null);

Review Comment:
   marking this conversation as closed, please reopen if you disagree, thx
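
   For readers following the thread, the retry contract between `run()` and `cleanup()` in the quoted diff can be illustrated with a minimal, self-contained sketch. This is not the PR's implementation: `BatchedCleanupSketch`, `runOnce()`, `remaining()` and the in-memory queue are hypothetical stand-ins for the scheduler and the repository, and the batch-limit logic is an assumption about how a capped round might behave.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical illustration of a batched cleanup with "might have more" retries. */
final class BatchedCleanupSketch {

    /** Pending garbage ids; a stand-in for what the real task reads from the repository. */
    private final Deque<String> garbage;
    private final int batchSize;

    BatchedCleanupSketch(List<String> garbageIds, int batchSize) {
        this.garbage = new ArrayDeque<>(garbageIds);
        this.batchSize = batchSize;
    }

    /** Deletes at most batchSize ids; returns true if there might be more garbage. */
    boolean cleanup() {
        int deleted = 0;
        while (!garbage.isEmpty()) {
            if (deleted >= batchSize) {
                return true; // batch limit reached with work left: ask for another round
            }
            garbage.poll();
            deleted++;
        }
        return false; // nothing left to delete
    }

    /** One scheduler invocation; returns true if a further round should be scheduled. */
    boolean runOnce() {
        boolean mightHaveMore = true; // stays true if cleanup() throws, so the round is retried
        try {
            mightHaveMore = cleanup();
        } catch (RuntimeException e) {
            System.err.println("cleanup failed, will retry: " + e);
        }
        return mightHaveMore;
    }

    int remaining() {
        return garbage.size();
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 120; i++) {
            ids.add("id-" + i);
        }
        BatchedCleanupSketch task = new BatchedCleanupSketch(ids, 50);
        int rounds = 0;
        boolean more;
        do {
            more = task.runOnce();
            rounds++;
        } while (more);
        // prints "rounds=3 remaining=0"
        System.out.println("rounds=" + rounds + " remaining=" + task.remaining());
    }
}
```

   With 120 pending ids and a batch size of 50, the sketch needs three rounds (50, 50, 20). In the real task each further round would be a new scheduler invocation after the configured interval rather than an immediate loop iteration.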




[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1461903561

   Kudos, SonarCloud Quality Gate passed! (https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13)

   - 0 Bugs (rating A)
   - 0 Vulnerabilities (rating A)
   - 0 Security Hotspots (rating A)
   - 31 Code Smells (rating A)
   - 86.9% Coverage
   - 0.0% Duplication
   



[GitHub] [sling-org-apache-sling-discovery-oak] sonarcloud[bot] commented on pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "sonarcloud[bot] (via GitHub)" <gi...@apache.org>.

sonarcloud[bot] commented on PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#issuecomment-1461920840

   Kudos, SonarCloud Quality Gate passed! (https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-discovery-oak&pullRequest=13)

   - 0 Bugs (rating A)
   - 0 Vulnerabilities (rating A)
   - 0 Security Hotspots (rating A)
   - 30 Code Smells (rating A)
   - 86.6% Coverage
   - 0.0% Duplication
   



[GitHub] [sling-org-apache-sling-discovery-oak] rishabhdaim commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "rishabhdaim (via GitHub)" <gi...@apache.org>.

rishabhdaim commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1133466597


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,601 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.oak;
+
+import static org.osgi.util.converter.Converters.standardConverter;
+
+import java.util.Calendar;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.sling.api.resource.LoginException;
+import org.apache.sling.api.resource.ModifiableValueMap;
+import org.apache.sling.api.resource.PersistenceException;
+import org.apache.sling.api.resource.Resource;
+import org.apache.sling.api.resource.ResourceResolver;
+import org.apache.sling.api.resource.ResourceResolverFactory;
+import org.apache.sling.api.resource.ValueMap;
+import org.apache.sling.commons.scheduler.ScheduleOptions;
+import org.apache.sling.commons.scheduler.Scheduler;
+import org.apache.sling.discovery.InstanceDescription;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyEvent.Type;
+import org.apache.sling.discovery.TopologyEventListener;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.framework.BundleContext;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Modified;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.metatype.annotations.AttributeDefinition;
+import org.osgi.service.metatype.annotations.Designate;
+import org.osgi.service.metatype.annotations.ObjectClassDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A background task that cleans up garbage slingIds after topology changes.
+ * <p>
+ * A slingId is considered garbage when:
+ * <ul>
+ * <li>it is not in the current topology</li>
+ * <li>it was never seen in previous topologies by the current leader instance</li>
+ * <li>it is not in the current idmap (where clusterNodeIds are reused hence
+ * that list stays small and does not need cleanup)</li>
+ * <li>its leaderElectionId was created more than 7 days ago (the
+ * leaderElectionId is created at activate time of the discovery.oak bundle -
+ * hence this more or less corresponds to the startup time of that
+ * instance)</li>
+ * </ul>
+ * The garbage accumulates at the following places, where it will thus be
+ * cleaned up from:
+ * <ul>
+ * <li>as child node under /var/discovery/oak/clusterInstances : this is the
+ * most performance critical garbage</li>
+ * <li>as a property key in /var/discovery/oak/syncTokens</li>
+ * </ul>
+ * The task by default is executed:
+ * <ul>
+ * <li>only on the leader</li>
+ * <li>10min after a TOPOLOGY_INIT or TOPOLOGY_CHANGED event</li>
+ * <li>with a maximum number of delete operations to avoid repository overload -
+ * that maximum is called batchSize and is 50 by default</li>
+ * <li>in subsequent intervals of 10min after the initial run, if that had to
+ * stop at the batchSize of 50 deletions</li>
+ * </ul>
+ * All parameters mentioned above can be configured.
+ * <p>
+ * Additionally, the cleanup is skipped for 13 hours after a successful cleanup,
+ * to avoid unnecessary load on the repository. The value of 13 hours is a
+ * heuristic: at most about 2 cleanup rounds per day makes sense, and for a
+ * long-lived leader the 1 additional hour (beyond 12) spreads the rounds
+ * somewhat throughout the day, further minimizing any load side-effects.
+ */
+@Component
+@Designate(ocd = SlingIdCleanupTask.Conf.class)
+public class SlingIdCleanupTask implements TopologyEventListener, Runnable {
+
+    static final String SLINGID_CLEANUP_ENABLED_SYSTEM_PROPERTY_NAME = "org.apache.sling.discovery.oak.slingidcleanup.enabled";
+
+    /** default minimal cleanup delay of 13h, to balance the load within a day */
+    static final long MIN_CLEANUP_DELAY_MILLIS = 46800000; // 13h
+
+    /**
+     * default age is 1 week : an instance that is not in the current topology,
+     * started 1 week ago is very unlikely to still be active
+     */
+    private static final long DEFAULT_MIN_CREATION_AGE_MILLIS = 604800000; // 1 week
+
+    /**
+     * initial delay is 10min : after a TOPOLOGY_INIT or TOPOLOGY_CHANGED on the
+     * leader, there should be a 10min delay before starting a round of cleanup.
+     * This is to not add unnecessary load after a startup/change.
+     */
+    private static final int DEFAULT_CLEANUP_INITIAL_DELAY = 600000; // 10min
+
+    /**
+     * default cleanup interval is 10min - this is together with the batchSize to
+     * lower repository load
+     */
+    private static final int DEFAULT_CLEANUP_INTERVAL = 600000; // 10min
+
+    /**
+     * default batch size is 50 deletions : normally there should not be much
+     * garbage around anyway, so normally it's just a few, 1-5 perhaps. If there's
+     * more than 50, that is probably a one-time cleanup after this feature is first
+     * rolled out. That one-time cleanup can actually take a considerable amount of
+     * time. So, to not overload the write load on the repository, the deletion is
+     * batched into 50 at a time - with 10min delays in between. That results in
+     * an average of 1 deletion every 12 seconds, or 5 per minute, or 7200 per day,
+     * for a legacy cleanup.
+     */
+    private static final int DEFAULT_CLEANUP_BATCH_SIZE = 50;
+
+    /**
+     * The sling.commons.scheduler name, so that it can be cancelled upon topology
+     * changes.
+     */
+    private static final String SCHEDULE_NAME = "org.apache.sling.discovery.oak.SlingIdCleanupTask";
+
+    protected final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    @Reference
+    protected Scheduler scheduler;
+
+    @Reference
+    protected ResourceResolverFactory resourceResolverFactory;
+
+    @Reference
+    private Config config;
+
+    /**
+     * volatile flag to fast stop any ongoing deletion upon a change in the topology
+     */
+    private volatile boolean hasTopology = false;
+
+    /**
+     * volatile field to keep track of the current topology, shared between topology
+     * listener and deletion
+     */
+    @SuppressWarnings("all")
+    private volatile TopologyView currentView;
+
+    private int initialDelayMillis = DEFAULT_CLEANUP_INITIAL_DELAY;
+
+    private int intervalMillis = DEFAULT_CLEANUP_INTERVAL;
+
+    private int batchSize = DEFAULT_CLEANUP_BATCH_SIZE;
+
+    private long minCreationAgeMillis = DEFAULT_MIN_CREATION_AGE_MILLIS;
+
+    /** test counter that increments upon every scheduler invocation */
+    private AtomicInteger runCount = new AtomicInteger(0);
+
+    /** test counter that increments upon every batch deletion */
+    private AtomicInteger completionCount = new AtomicInteger(0);
+
+    /** test counter that keeps track of actually deleted slingIds */
+    private AtomicInteger deleteCount = new AtomicInteger(0);
+
+    /**
+     * flag to distinguish first from subsequent runs, as they might have different
+     * scheduler delays
+     */
+    private volatile boolean firstRun = true;
+
+    private long lastSuccessfulRun = -1;
+
+    /**
+     * Minimal delay after a successful cleanup round, in millis
+     */
+    private long minCleanupDelayMillis = MIN_CLEANUP_DELAY_MILLIS;
+
+    /**
+     * contains all slingIds ever seen by this instance - should not be a long list
+     * so not a memory issue
+     */
+    private Set<String> seenInstances = new HashSet<>();

Review Comment:
   Since multiple threads use this set, shouldn't we replace it with a `CopyOnWriteArraySet`?
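
   As an illustration of the suggestion, here is a minimal sketch of a small "seen ids" set that is safe to share between a topology listener thread and a cleanup thread. The class `SeenInstancesSketch` and its methods `onTopology`, `wasSeen` and `size` are hypothetical names and not part of the PR. `CopyOnWriteArraySet` copies its backing array on every write, so it fits best when the set stays small and reads dominate; `ConcurrentHashMap.newKeySet()` would be a possible alternative if writes were frequent.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

/** Hypothetical illustration of a thread-safe set of slingIds seen so far. */
final class SeenInstancesSketch {

    // Thread-safe without external locking; each add copies the backing array.
    private final Set<String> seenInstances = new CopyOnWriteArraySet<>();

    /** Called from the (assumed) topology listener thread. */
    void onTopology(Collection<String> slingIds) {
        seenInstances.addAll(slingIds);
    }

    /** Called from the (assumed) cleanup thread. */
    boolean wasSeen(String slingId) {
        return seenInstances.contains(slingId);
    }

    int size() {
        return seenInstances.size();
    }

    public static void main(String[] args) throws InterruptedException {
        SeenInstancesSketch seen = new SeenInstancesSketch();
        Thread[] writers = new Thread[4];
        for (int t = 0; t < writers.length; t++) {
            final int base = t * 100;
            // Each writer adds 100 distinct ids concurrently.
            writers[t] = new Thread(() -> {
                for (int i = 0; i < 100; i++) {
                    seen.onTopology(Collections.singletonList("id-" + (base + i)));
                }
            });
            writers[t].start();
        }
        for (Thread w : writers) {
            w.join();
        }
        System.out.println(seen.size()); // prints 400
    }
}
```

   A plain `HashSet` mutated from several threads gives no such guarantee: concurrent adds can be lost or leave the set in an inconsistent state.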




[GitHub] [sling-org-apache-sling-discovery-oak] stefan-egli commented on a diff in pull request #13: SLING-10854 : introducing SlingIdCleanupTask to clean up old slingIds…

Posted by "stefan-egli (via GitHub)" <gi...@apache.org>.

stefan-egli commented on code in PR #13:
URL: https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13#discussion_r1133631856


##########
src/main/java/org/apache/sling/discovery/oak/SlingIdCleanupTask.java:
##########
@@ -0,0 +1,601 @@
+
+    @ObjectClassDefinition(name = "Apache Sling Discovery Oak SlingId Cleanup Task", description = "This task is in charge of cleaning up old SlingIds from the repository.")
+    public @interface Conf {
+
+        @AttributeDefinition(name = "Cleanup initial delay milliseconds", description = "Number of milliseconds to initially wait for the first cleanup")
+        int org_apache_sling_discovery_oak_slingid_cleanup_initial_delay() default DEFAULT_CLEANUP_INITIAL_DELAY;

Review Comment:
   agree, they are unnecessarily large indeed - fixed in https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/13/commits/30d2e8e76bdfe40337d62ea908f8fbf30c13f39d


