Posted to dev@nutch.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2018/11/15 10:45:01 UTC
[jira] [Commented] (NUTCH-2652) Fetcher launches more fetch tasks than fetch lists
[ https://issues.apache.org/jira/browse/NUTCH-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687796#comment-16687796 ]
Hudson commented on NUTCH-2652:
-------------------------------
FAILURE: Integrated in Jenkins build Nutch-trunk #3589 (See [https://builds.apache.org/job/Nutch-trunk/3589/])
NUTCH-2652 Fetcher launches more fetch tasks than fetch lists - properly (snagel: [https://github.com/apache/nutch/commit/89b16ce29f3bf6618ec2bf9df0807b24c1e40339])
* (edit) src/java/org/apache/nutch/fetcher/Fetcher.java
> Fetcher launches more fetch tasks than fetch lists
> --------------------------------------------------
>
> Key: NUTCH-2652
> URL: https://issues.apache.org/jira/browse/NUTCH-2652
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.15
> Environment: Hadoop, distributed mode (cluster of 22 nodes), CDH 5.15.1, Nutch built on recent master.
> Seen for the first time just now, although this setup has been running with Nutch 1.15 for two months. However, the constraints that cause inputs to be split may change from run to run.
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Critical
> Fix For: 1.16
>
>
> Fetcher may launch more fetcher tasks than there are fetch lists:
> {noformat}
> 18/10/15 07:27:26 INFO input.FileInputFormat: Total input paths to process : 128
> 18/10/15 07:27:26 INFO mapreduce.JobSubmitter: number of splits:187
> {noformat}
> This violates a design principle of Nutch as a MapReduce-based crawler: to ensure politeness and a guaranteed delay between requests to the same host/domain/IP, the Generator puts all items of one host/domain/IP into the same fetch list. A fetch list must not be split, because that would violate the politeness constraints: multiple fetcher tasks processing the splits of one fetch list could then send requests to the same host/domain/IP in parallel. See [~ab]'s chapter about Nutch in [Hadoop: The Definitive Guide (3rd edition)|https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch16.html#NutchFetcher].
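
The politeness argument in the description can be illustrated with a small, self-contained sketch. It is not Nutch or Hadoop code: the class FetchListSplitSketch, its methods, and the host names are hypothetical, and a fetch list is modeled simply as a list of host names, one entry per URL. It only shows that splitting a single fetch list hands the same host to more than one task.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in Nutch.
public class FetchListSplitSketch {

  /** Cuts one fetch list into two halves, mimicking an input split. */
  static List<List<String>> splitInTwo(List<String> fetchList) {
    int mid = fetchList.size() / 2;
    List<List<String>> parts = new ArrayList<>();
    parts.add(new ArrayList<>(fetchList.subList(0, mid)));
    parts.add(new ArrayList<>(fetchList.subList(mid, fetchList.size())));
    return parts;
  }

  /** Returns the hosts that occur in the input of more than one task. */
  static Set<String> hostsInSeveralTasks(List<List<String>> taskInputs) {
    Map<String, Integer> tasksPerHost = new HashMap<>();
    for (List<String> input : taskInputs) {
      // Count each host once per task, however many URLs it has there.
      for (String host : new TreeSet<>(input)) {
        tasksPerHost.merge(host, 1, Integer::sum);
      }
    }
    Set<String> shared = new TreeSet<>();
    for (Map.Entry<String, Integer> e : tasksPerHost.entrySet()) {
      if (e.getValue() > 1) {
        shared.add(e.getKey());
      }
    }
    return shared;
  }

  public static void main(String[] args) {
    // Assumed generator output: all items of one host share a fetch list.
    List<String> listA = List.of("a.example", "a.example", "a.example", "a.example");
    List<String> listB = List.of("b.example", "b.example");

    // One task per fetch list: no host is fetched by two tasks.
    List<List<String>> unsplit = List.of(listA, listB);
    System.out.println(hostsInSeveralTasks(unsplit)); // prints []

    // listA split into two tasks: a.example is now fetched in parallel.
    List<List<String>> split = new ArrayList<>(splitInTwo(listA));
    split.add(listB);
    System.out.println(hostsInSeveralTasks(split)); // prints [a.example]
  }
}
```

In Hadoop, a common way to keep each input file in a single split is to override FileInputFormat's isSplitable method so that it returns false. Whether the commit referenced above takes exactly this approach is not stated in this message.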
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)