Posted to dev@nutch.apache.org by Pramod Setlur <se...@usc.edu> on 2015/09/30 03:13:10 UTC

dbunfetched URLs - team #32

Hello,

We left Nutch crawling with 25 URLs for 7 rounds. After about 15 hours
it had fetched only 10% of the URLs.

I have attached a screenshot for reference. Can you guide us on
which other configuration settings we should add to improve crawling?

Also, where can I learn more about configuring Nutch, e.g. increasing
the number of threads used for crawling?

Thank you,
Best Regards,
Pramod P. Setlur

alt email id: pramodsetlur@gmail.com
[M] - +1-(323)-637-5256
USC ID: 7369871317
LinkedIn <http://www.linkedin.com/pub/pramod-setlur/3b/726/270/>

Re: dbunfetched URLs - team #32

Posted by Michael Joyce <jo...@apache.org>.
That result count doesn't seem unreasonable to me if you're running in
local mode. Assuming you're partitioning by host, all of those URLs are
on the same host, and you have a 3-second politeness delay, you should end
up with a crawl lasting

21497 * 3 / 60 / 60 = 17.9 hours
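
The same estimate can be sketched as a small Python helper, which makes it
easy to try other delays. This is only a back-of-the-envelope model: it
ignores the download time itself, and the parallel_queues parameter is a
hypothetical addition (not from the calculation above) that assumes the URLs
are spread evenly across that many hosts fetched in parallel.

```python
def estimated_crawl_hours(url_count, delay_seconds, parallel_queues=1):
    """Rough lower bound on crawl time when politeness delay dominates.

    url_count       -- number of URLs to fetch
    delay_seconds   -- politeness delay between requests to one host
    parallel_queues -- hypothetical: number of hosts fetched in parallel,
                       assuming URLs are spread evenly across them
    """
    if parallel_queues < 1:
        raise ValueError("parallel_queues must be at least 1")
    total_seconds = url_count * delay_seconds / parallel_queues
    return total_seconds / 60 / 60


if __name__ == "__main__":
    # Single host, 3 second delay, as in the calculation above.
    print(f"{estimated_crawl_hours(21497, 3):.1f} hours")
```

With a single host the delay is the bottleneck, so adding fetcher threads
alone would not be expected to shorten the crawl much under this model.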

There's a wiki page on crawl optimization that might help you out:
https://wiki.apache.org/nutch/OptimizingCrawls

As for configuration documentation, check the property descriptions in
nutch-default.xml and try poking around the wiki for more info. I think
your best bet is going to be what's in the conf files.
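
One way to browse those descriptions is a short Python script that lists
the properties whose names contain a keyword. This is a minimal sketch,
assuming the conf file uses the usual Hadoop-style layout of
<configuration>/<property> elements with <name>, <value> and <description>
children. The SAMPLE text below is a made-up stand-in: its values and
descriptions are placeholders, not Nutch's real defaults, and the property
names should be checked against your own copy of the file.

```python
import xml.etree.ElementTree as ET

# Placeholder data for illustration only; not copied from a real conf file.
SAMPLE = """<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
    <description>Placeholder text about
    fetcher threads.</description>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>3.0</value>
    <description>Placeholder text about the politeness delay.</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value></value>
    <description>Placeholder text about the agent name.</description>
  </property>
</configuration>"""


def find_properties(root, keyword):
    """Return (name, value, description) for properties matching keyword."""
    matches = []
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        # Collapse the line breaks and indentation inside descriptions.
        description = " ".join((prop.findtext("description") or "").split())
        if keyword.lower() in name.lower():
            matches.append((name, value, description))
    return matches


if __name__ == "__main__":
    for name, value, description in find_properties(ET.fromstring(SAMPLE), "fetcher"):
        print(f"{name} = {value}\n    {description}")
```

To run it against a real installation, replace ET.fromstring(SAMPLE) with
something like ET.parse("conf/nutch-default.xml").getroot(); that path is an
assumption about where the file sits relative to your working directory.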

-- Jimmy
