Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2018/05/29 08:39:00 UTC

[jira] [Commented] (NUTCH-2588) Getting status code x01 (unfetched) on more than 80% crawled urls

    [ https://issues.apache.org/jira/browse/NUTCH-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493253#comment-16493253 ] 

Sebastian Nagel commented on NUTCH-2588:
----------------------------------------

Nutch discovers new URLs as outlinks of pages it has already fetched. It may easily happen that not all URLs are found in the first 9 rounds/cycles. URLs found in the last cycle can only be fetched in a later cycle, so they remain unfetched when the crawl stops. By default, bin/crawl spends only 3 hours fetching the URLs of one cycle. If the URLs come from only a few hosts, far fewer than 50,000 may be fetched per cycle. There are also many configuration properties with which you can change the behavior of Nutch; they are listed in conf/nutch-default.xml. You may also ask for help on the [Nutch user mailing list|http://nutch.apache.org/mailing_lists.html].
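
As a rough illustration of why few hosts limit the number of fetches per cycle, the following sketch computes an upper bound from the 3-hour fetch limit and the topN of 50,000 mentioned above. The per-host delay of 5 seconds and the one-request-at-a-time-per-host behavior are assumptions made for this example (check fetcher.server.delay and related properties in your conf/nutch-default.xml); the function name is hypothetical, and the time spent downloading each page is ignored, so real numbers will be lower.

```python
def max_fetches_per_cycle(hosts, time_limit_s=3 * 3600, delay_s=5.0, top_n=50_000):
    """Upper bound on URLs fetched in one cycle.

    Assumptions (not taken from the issue): each host is fetched
    sequentially with a fixed politeness delay of `delay_s` seconds,
    and download time is ignored.
    """
    # Number of requests a single host can receive within the time limit.
    per_host = int(time_limit_s // delay_s)
    # The cycle can never fetch more than topN URLs in total.
    return min(top_n, hosts * per_host)


if __name__ == "__main__":
    for hosts in (1, 10, 25):
        print(hosts, "host(s):", max_fetches_per_cycle(hosts), "URLs at most")
```

Under these assumptions a single host yields at most 2,160 fetches per cycle and 10 hosts at most 21,600, so the topN of 50,000 is only reachable once the URLs are spread over enough hosts.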

> Getting status code x01 (unfetched) on more than 80% crawled urls
> -----------------------------------------------------------------
>
>                 Key: NUTCH-2588
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2588
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, fetcher
>    Affects Versions: 2.3.1
>         Environment: I am using Apache Nutch 2.3.1 with Hadoop 2.7.6 and HBase 0.98.8 hadoop2.
> Operating System: Ubuntu 16.04
>            Reporter: Usama Tahir
>            Priority: Major
>
> When I run Nutch with external links enabled, a seed of 10 URLs, and 5 rounds, using the command
> bin/crawl <seed_path> <db>  [<solr url>] <number of rounds>
> I have the default topN value, which is 50000.
> The process completes in 11 to 12 hours and generates about 280000 URL rows.
> When we analyze the HBase table and check the status codes of all URLs, about 242000 URLs have status code x01 [unfetched].
> That means 242000 of the 280000 URLs that Nutch extracted remain unfetched.
> After some debugging of Nutch and analyzing its logs, I found that no attempt was even made to fetch the URLs with status code x01.
> Is this a bug in Nutch or a configuration issue?
> Kindly resolve my issue as soon as possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)