Posted to dev@nutch.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/10/15 11:49:00 UTC

[jira] [Commented] (NUTCH-2652) Fetcher launches more fetch tasks than fetch lists

    [ https://issues.apache.org/jira/browse/NUTCH-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650111#comment-16650111 ] 

ASF GitHub Bot commented on NUTCH-2652:
---------------------------------------

sebastian-nagel opened a new pull request #394: NUTCH-2652 Fetcher launches more fetch tasks than fetch lists
URL: https://github.com/apache/nutch/pull/394
 
 
   - properly override method [getSplits(JobContext context) of FileInputFormat](https://hadoop.apache.org/docs/r2.8.5/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html#getSplits(org.apache.hadoop.mapreduce.JobContext))
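To illustrate why the override matters, here is a minimal, Hadoop-free sketch of the two behaviours. It is a simplified model, not the code in this pull request and not Hadoop's actual split computation. The class `FetchListSplitModel`, its nested `Split` type, and the methods `splitBySize` and `onePerFile` are hypothetical names introduced only for this illustration, and the file sizes are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified model only: none of these names exist in Hadoop or Nutch.
class FetchListSplitModel {

    /** A byte range of one fetch list file (hypothetical stand-in for a file split). */
    static final class Split {
        final String file;
        final long start;
        final long length;

        Split(String file, long start, long length) {
            this.file = file;
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Splittable input: every file is cut into ranges of at most maxSplitSize
     * bytes, so one fetch list can end up in several tasks.
     */
    static List<Split> splitBySize(Map<String, Long> files, long maxSplitSize) {
        List<Split> splits = new ArrayList<>();
        for (Map.Entry<String, Long> f : files.entrySet()) {
            long remaining = f.getValue();
            long start = 0;
            while (remaining > 0) {
                long len = Math.min(maxSplitSize, remaining);
                splits.add(new Split(f.getKey(), start, len));
                start += len;
                remaining -= len;
            }
        }
        return splits;
    }

    /**
     * Unsplittable input: exactly one split per file, covering the whole file,
     * so each fetch list is processed by a single task.
     */
    static List<Split> onePerFile(Map<String, Long> files) {
        List<Split> splits = new ArrayList<>();
        for (Map.Entry<String, Long> f : files.entrySet()) {
            splits.add(new Split(f.getKey(), 0, f.getValue()));
        }
        return splits;
    }
}
```

With three files of 100, 300 and 50 bytes and a maximum split size of 128, `splitBySize` yields 5 splits while `onePerFile` yields 3, one per fetch list. The second behaviour is what the fetcher needs: the number of tasks never exceeds the number of fetch lists.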
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Fetcher launches more fetch tasks than fetch lists
> --------------------------------------------------
>
>                 Key: NUTCH-2652
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2652
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.15
>         Environment: Hadoop, distributed mode (cluster of 22 nodes), CDH 5.15.1, Nutch built on recent master.
> Seen for the first time just now, although Nutch 1.15 has been running for two months. However, the constraints that cause inputs to be split may change from run to run.
>            Reporter: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.16
>
>
> Fetcher may launch more fetcher tasks than there are fetch lists:
> {noformat}
> 18/10/15 07:27:26 INFO input.FileInputFormat: Total input paths to process : 128
> 18/10/15 07:27:26 INFO mapreduce.JobSubmitter: number of splits:187
> {noformat}
> This violates a design principle of Nutch as a MapReduce-based crawler: to ensure politeness and a guaranteed delay between requests to the same host/domain/IP, the Generator puts all items of one host/domain/IP into the same fetch list. A fetch list must not be split, because that would violate the politeness constraints: multiple fetcher tasks processing the splits of one fetch list could then send requests to the same host/domain/IP in parallel. See [~ab]'s chapter about Nutch in [Hadoop: The Definitive Guide (3rd edition)|https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch16.html#NutchFetcher].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)