Posted to dev@nutch.apache.org by GitBox <gi...@apache.org> on 2022/01/15 23:54:01 UTC

[GitHub] [nutch] lewismc commented on a change in pull request #724: NUTCH-2573 Suspend crawling if robots.txt fails to fetch with 5xx status

lewismc commented on a change in pull request #724:
URL: https://github.com/apache/nutch/pull/724#discussion_r785373191



##########
File path: src/java/org/apache/nutch/fetcher/FetchItemQueues.java
##########
@@ -195,11 +195,15 @@ public synchronized FetchItem getFetchItem() {
     return null;
   }
 
+  public boolean timelimitReached() {

Review comment:
       Maybe provide basic Javadoc?
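
       To illustrate the kind of Javadoc being asked for, here is a minimal
       sketch. It is not the code from the pull request: the class
       TimeLimitSketch, the field name timelimit, and the convention that -1
       means "no limit" are assumptions made for illustration only.

```java
/** Simplified, hypothetical stand-in for FetchItemQueues (assumption). */
class TimeLimitSketch {

  /** Assumed deadline in epoch milliseconds; -1 means no limit is set. */
  private final long timelimit;

  TimeLimitSketch(long timelimit) {
    this.timelimit = timelimit;
  }

  /**
   * Check whether the configured fetcher time limit has been reached.
   *
   * @return true if a time limit is set and the current time is at or past
   *         it, false otherwise
   */
  public boolean timelimitReached() {
    return timelimit != -1 && System.currentTimeMillis() >= timelimit;
  }
}
```

       A short description plus an @return tag is usually enough for a
       one-line predicate like this.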

##########
File path: src/java/org/apache/nutch/fetcher/FetchItemQueues.java
##########
@@ -263,6 +283,10 @@ public synchronized int checkExceptionThreshold(String queueid) {
     return 0;
   }
 
+  public int checkExceptionThreshold(String queueid) {

Review comment:
       Same here. Basic Javadoc?

##########
File path: src/java/org/apache/nutch/fetcher/FetcherThread.java
##########
@@ -600,6 +628,12 @@ private FetchItem queueRedirect(Text redirUrl, FetchItem fit)
       LOG.debug(" - ignoring redirect from {} to {} as duplicate", fit.url,
           redirUrl);
       return null;
+    } else if (fetchQueues.timelimitReached()) {
+      redirecting = false;
+      context.getCounter("FetcherStatus", "hitByTimeLimit").increment(1);

Review comment:
       Same with this one: please document this counter at https://cwiki.apache.org/confluence/display/NUTCH/Metrics

##########
File path: src/java/org/apache/nutch/fetcher/FetcherThread.java
##########
@@ -312,6 +322,24 @@ public void run() {
               outputRobotsTxt(robotsTxtContent);
               robotsTxtContent.clear();
             }
+            if (rules.isDeferVisits()) {
+              LOG.info("Defer visits for queue {} : {}", fit.queueID, fit.url);
+              // retry the fetch item
+              if (fetchQueues.timelimitReached()) {
+                fetchQueues.finishFetchItem(fit, true);
+              } else {
+                fetchQueues.addFetchItem(fit);
+              }
+              // but check whether it's time to cancel the queue
+              int killedURLs = fetchQueues.checkExceptionThreshold(
+                  fit.getQueueID(), this.robotsDeferVisitsRetries + 1,
+                  this.robotsDeferVisitsDelay);
+              if (killedURLs != 0) {
+                context.getCounter("FetcherStatus",

Review comment:
       Can you please augment https://cwiki.apache.org/confluence/display/NUTCH/Metrics with this counter?
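
       For readers following the hunk above, the deferral logic can be
       sketched roughly as follows. This is a simplified model under stated
       assumptions, not the Nutch implementation: the class DeferVisitsSketch,
       its fields, and the exact threshold and delay semantics are
       hypothetical. Only the argument order (threshold, then delay) and the
       returned count of removed URLs follow the call shown in the diff.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified, hypothetical model of a single fetch queue (assumption). */
class DeferVisitsSketch {
  private final Deque<String> urls = new ArrayDeque<>();
  private int exceptionCount = 0;
  private long nextFetchTime = 0L;

  void add(String url) {
    urls.addLast(url);
  }

  int size() {
    return urls.size();
  }

  long nextFetchTime() {
    return nextFetchTime;
  }

  /**
   * Record one failure for this queue and decide whether to keep it.
   *
   * @param maxExceptions number of failures at which the queue is dropped
   * @param delayMillis   pause applied to the queue after a tolerated failure
   * @param now           current time in milliseconds
   * @return number of URLs removed from the queue, or 0 if the queue is kept
   */
  int checkExceptionThreshold(int maxExceptions, long delayMillis, long now) {
    exceptionCount++;
    if (exceptionCount >= maxExceptions) {
      // Threshold reached: drop everything still queued for this host.
      int dropped = urls.size();
      urls.clear();
      return dropped;
    }
    // Below the threshold: keep the queue but delay the next fetch.
    nextFetchTime = now + delayMillis;
    return 0;
  }
}
```

       In this model, passing retries + 1 as the threshold means the queue
       survives the configured number of retries and is dropped on the
       failure after that. A non-zero return value is the point at which a
       caller would increment a counter.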



