Posted to user@nutch.apache.org by James Ford <si...@gmail.com> on 2012/05/09 17:09:09 UTC

Make Nutch to crawl internal urls only

Hello,

I am wondering how to crawl only the domains of an injected seed, without
adding external URLs to the database.

Let's say I have 5k URLs in my seed, and I want Nutch to crawl everything
(or some millions of URLs) for each domain in the fastest way possible.

What settings should I use?

I will have topN at about 20k, and I want db_unfetched to be around 20k
for each iteration.

What should I set "db.max.outlinks.per.page" to? I was thinking of setting
it to 4, to get 4*5k = 20k for the first iteration.

Can anyone help me? 

Thanks,
James Ford

--
View this message in context: http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Make Nutch to crawl internal urls only

Posted by Julien Nioche <li...@gmail.com>.
Just adding to what Markus said: even in distributed mode, the generate
and update steps will take more and more time as your crawldb gets bigger.
There are quite a few things you can do to alleviate that, e.g. set a
minimal score for generation, or generate multiple segments in one go, then
fetch them one by one and update them all at the same time. Having said
that, if your crawldb contains only 10M URLs, deactivating the
normalisation as Markus said will be the best thing to do in the short
term. One last comment: even if you are on a single machine, you should run
Nutch in pseudo-distributed mode rather than local mode. That way you'll be
able to monitor your crawl through the Hadoop web interfaces and use more
than one mapper and reducer.
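To make that concrete, here is a rough sketch of the multi-segment cycle on
a Nutch 1.x install; the paths and the -topN value are illustrative, not a
recommendation:

```shell
# Generate several segments in one go, skipping URL filtering and
# normalization during generation (paths and values are illustrative):
bin/nutch generate crawl/crawldb crawl/segments \
    -topN 20000 -maxNumSegments 4 -noFilter -noNorm

# Fetch and parse each segment one by one...
for seg in crawl/segments/*; do
  bin/nutch fetch "$seg"
  bin/nutch parse "$seg"
done

# ...then update the crawldb with all of them in a single pass,
# again skipping normalization and filtering:
bin/nutch updatedb crawl/crawldb crawl/segments/* -noNormalize -noFilter
```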

HTH

Julien

On 10 May 2012 09:56, Markus Jelsma <ma...@openindex.io> wrote:

> Hi
>
>
> On Thu, 10 May 2012 01:47:34 -0700 (PDT), James Ford <
> simon.forsb@gmail.com> wrote:
>
>> Thanks for your reply.
>>
>> The problem I have with using the suggested settings you described
>> above is that the normalizing in the Generator step is taking too long
>> after some iterations (that's why I want the crawldb to stay at a
>> reasonable size).
>>
>
> Then disable normalizing and filtering in that step. There's usually no
> good reason to do it unless you have some very specific set-up and exotic
> requirements.
>
>
>
>> It seems that I can crawl and index about one million URLs in a 24h
>> period from the first run, but this number decreases significantly if I
>> continue to crawl. This is because the normalize step can take up to an
>> hour after some iterations, as the crawldb gets bigger.
>>
>
> You run Nutch local?
>
>
>
>> I don't see why the generator step is taking so long. Surely it can't
>> take that much time to select X URLs from a database of about 10
>> million URLs?
>>
>
> Certainly it can! The GeneratorMapper is quite CPU intensive: it
> calculates a lot of things for most records, and then the reducer limits
> records by host or domain, taking a lot of additional CPU time and RAM.
>
> You should disable filtering and normalizing, but this will only help for
> a short while. If the CrawlDB grows again you must use a cluster to do
> the work.
>
>
>
>> Thanks,
>> James Ford
>>
>> --
>> View this message in context:
>>
>> http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397p3976511.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>
> --
> Markus Jelsma - CTO - Openindex
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Make Nutch to crawl internal urls only

Posted by Greg Fields <gr...@gmail.com>.
I have a similar problem. Is there a way I can force the fetcher to only
take URLs from the unfetched URL list?

--
View this message in context: http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397p3976568.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Make Nutch to crawl internal urls only

Posted by Markus Jelsma <ma...@openindex.io>.
 Hi

 On Thu, 10 May 2012 01:47:34 -0700 (PDT), James Ford 
 <si...@gmail.com> wrote:
> Thanks for your reply.
>
> The problem I have with using the suggested settings you described
> above is that the normalizing in the Generator step is taking too long
> after some iterations (that's why I want the crawldb to stay at a
> reasonable size).

 Then disable normalizing and filtering in that step. There's usually no 
 good reason to do it unless you have some very specific set-up and 
 exotic requirements.

>
> It seems that I can crawl and index about one million URLs in a 24h
> period from the first run, but this number decreases significantly if I
> continue to crawl. This is because the normalize step can take up to an
> hour after some iterations, as the crawldb gets bigger.

 You run Nutch local?

>
> I don't see why the generator step is taking so long. Surely it can't
> take that much time to select X URLs from a database of about 10
> million URLs?

 Certainly it can! The GeneratorMapper is quite CPU intensive: it
 calculates a lot of things for most records, and then the reducer limits
 records by host or domain, taking a lot of additional CPU time and RAM.

 You should disable filtering and normalizing, but this will only help
 for a short while. If the CrawlDB grows again you must use a cluster to
 do the work.

>
> Thanks,
> James Ford
>
> --
> View this message in context:
> 
> http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397p3976511.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

-- 
 Markus Jelsma - CTO - Openindex

Re: Make Nutch to crawl internal urls only

Posted by James Ford <si...@gmail.com>.
Thanks for your reply.

The problem I have with using the suggested settings you described above
is that the normalizing in the Generator step is taking too long after
some iterations (that's why I want the crawldb to stay at a reasonable
size).

It seems that I can crawl and index about one million URLs in a 24h period
from the first run, but this number decreases significantly if I continue
to crawl. This is because the normalize step can take up to an hour after
some iterations, as the crawldb gets bigger.

I don't see why the generator step is taking so long. Surely it can't take
that much time to select X URLs from a database of about 10 million URLs?

Thanks,
James Ford

--
View this message in context: http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397p3976511.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Make Nutch to crawl internal urls only

Posted by Ken Krugler <kk...@transpac.com>.
Hi James,

As Markus said, you can set db.ignore.external.links to true so that you only process outlinks within the same domain as the page they're found on.

This has one (usually minor) side effect: you toss links that go between domains that are in your seed list.

If that's an issue, you could take your 5K URLs, extract the domains, dedup them, and use that list with domain-based URL filtering.
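The extraction Ken describes can be done with standard shell tools. A
minimal sketch, assuming the seed file has one URL per line; the file names
are assumptions (Nutch's urlfilter-domain plugin reads its list from
conf/domain-urlfilter.txt), and note this yields hostnames, which you may
still want to trim down to registered domains:

```shell
# Example seed file (contents are illustrative):
cat > seed.txt <<'EOF'
http://www.example.com/page1
http://blog.example.org/
http://www.example.com/page2
EOF

# Strip the scheme and the path to get the host, then dedup:
sed -E 's|^[a-z]+://||; s|/.*$||' seed.txt | sort -u > domain-urlfilter.txt

cat domain-urlfilter.txt
# blog.example.org
# www.example.com
```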

-- Ken

On May 9, 2012, at 8:09am, James Ford wrote:

> Hello,
> 
> I am wondering how to crawl only the domains of an injected seed,
> without adding external URLs to the database.
> 
> Let's say I have 5k URLs in my seed, and I want Nutch to crawl everything
> (or some millions of URLs) for each domain in the fastest way possible.
> 
> What settings should I use?
> 
> I will have topN at about 20k, and I want db_unfetched to be around 20k
> for each iteration.
> 
> What should I set "db.max.outlinks.per.page" to? I was thinking of
> setting it to 4, to get 4*5k = 20k for the first iteration.
> 
> Can anyone help me? 
> 
> Thanks,
> James Ford
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: Make Nutch to crawl internal urls only

Posted by Markus Jelsma <ma...@openindex.io>.
 Hi

 On Wed, 9 May 2012 08:09:09 -0700 (PDT), James Ford 
 <si...@gmail.com> wrote:
> Hello,
>
> I am wondering how to crawl only the domains of an injected seed,
> without adding external URLs to the database.

 Check db.ignore.external.links.
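 For reference, that property goes in conf/nutch-site.xml:

```xml
<!-- nutch-site.xml: keep only outlinks within the same host/domain
     as the page they were found on -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
```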

>
> Let's say I have 5k URLs in my seed, and I want Nutch to crawl
> everything (or some millions of URLs) for each domain in the fastest
> way possible.
>
> What settings should I use?

 Well, the fastest is of course no delay and the maximum number of
 threads, but that's usually not a good idea. You will overload your
 connection or the servers.
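 The relevant knobs are fetcher.threads.fetch and fetcher.server.delay in
 conf/nutch-site.xml; the values below are illustrative only, not a
 recommendation:

```xml
<!-- illustrative values only: more threads and a shorter delay speed up
     fetching at the risk of overloading your connection or the servers -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
</property>
```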

>
> I will have topN at about 20k, and I want db_unfetched to be around
> 20k for each iteration.

 There is no guarantee of db_unfetched unless each page has exactly the 
 same number of outlinks. If your crawl is limited to a few domains then 
 just crawl until there's nothing left to crawl.

>
> What should I set "db.max.outlinks.per.page" to? I was thinking of
> setting it to 4, to get 4*5k = 20k for the first iteration.

 It's set to 100 by default. There's no reason to change it unless some 
 pages have more than 100 and the target pages have no other inlinks.
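 For reference, this is how the property looks with its default value;
 override it in conf/nutch-site.xml only if you really need to:

```xml
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
</property>
```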

>
> Can anyone help me?
>
> Thanks,
> James Ford
>
> --
> View this message in context:
> 
> http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

-- 
 Markus Jelsma - CTO - Openindex