Posted to user@nutch.apache.org by Marseld Dedgjonaj <ma...@ikubinfo.com> on 2011/03/04 17:21:21 UTC

How to crawl fast a large site

Hello everybody,

I am trying to use Nutch to provide search within my site.

I have configured a Nutch instance and started it crawling the whole
website. Now that all URLs of my site are crawled (about 150,000 URLs), I
only need to crawl the newest URLs (about 10-20 per hour), but a crawl run
with depth = 1 and topN = 50 takes more than 15 hours.
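For context, a depth-1 recrawl in Nutch 1.x of that era is typically driven by the individual steps sketched below; the command names are the standard bin/nutch tools, while the crawl/ directory layout is a hypothetical example:

```shell
# Sketch of one depth-1 crawl cycle (Nutch 1.x); crawl/ paths are placeholders
bin/nutch generate crawl/crawldb crawl/segments -topN 50
SEGMENT=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"
# The expensive steps mentioned in the post:
bin/nutch mergesegs crawl/merged_segments -dir crawl/segments
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
```

Since merging and indexing dominate the runtime here, one option is to skip mergesegs on the hourly cycle and only merge/reindex on a slower schedule.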

The most time-consuming steps are segment merging and indexing.

I need the newest URLs to be searchable on my website as soon as possible.

 

I tried configuring another Nutch instance just to pick up the
latest articles.

In this instance I injected 40 URLs that change very often; any new
article added to the site will appear on one of these pages (homepage,
latest news, etc.).

I set "db.fetch.interval.default" to 3600 (1 hour) so that the injected
pages are recrawled and all the newest URLs are fetched.

I clear this instance's crawldb, segments, and indexes every 24 hours,
because by then these URLs should have been crawled by the main instance.
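For reference, the interval override described above would look roughly like this in conf/nutch-site.xml (the property name is standard; the value is the one mentioned, in seconds):

```xml
<!-- conf/nutch-site.xml (sketch): recrawl pages every hour -->
<property>
  <name>db.fetch.interval.default</name>
  <value>3600</value>
  <description>Default re-fetch interval in seconds (1 hour).</description>
</property>
```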

When a user searches, I query both instances and merge the results.

 

My problem is:

I need the second instance to fetch only the injected URLs and the URLs
found on the injected pages, but when I run the crawl continually to pick
up the newest URLs quickly, the crawl process fetches every URL it finds.

 

Please suggest how, when updating the crawldb, I can add only the URLs
that match my requirements.
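One common way to constrain a Nutch instance like this (a sketch, not necessarily what the list settled on) is to tighten conf/regex-urlfilter.txt in the second instance so that only the injected hub pages and article URLs pass the filters; example.com and the path patterns below are hypothetical placeholders:

```
# conf/regex-urlfilter.txt (sketch); host and paths are placeholders
# accept only the injected hub pages and article pages
+^http://www\.example\.com/$
+^http://www\.example\.com/latest-news
+^http://www\.example\.com/article/
# reject everything else
-.
```

Filters of this form are applied during generate and crawldb updates, so URLs that do not match never enter the second instance's crawldb.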

 

Any other suggestion would be very valuable to me.

 

Thanks in advance, and

Best regards,

Marseldi 

 



[Signature] Find a Good Job, and Good People for the Job... Visit: www.punaime.al

RE: How to crawl fast a large site

Posted by Marseld Dedgjonaj <ma...@ikubinfo.com>.
Thank you Arkadi.
I will check if Arch will satisfy my requirements.

Best Regards,
Marseldi








RE: How to crawl fast a large site

Posted by Ar...@csiro.au.
Hello Marseld,

I think you should have a look at Arch:

http://www.atnf.csiro.au/computing/software/arch/

Arch is a free, open source extension of Nutch. Among other added features, it supports partial recrawls.

Regards,

Arkadi

