Posted to user@nutch.apache.org by Gianni Parini <gi...@gmail.com> on 2006/09/21 21:35:47 UTC

Automatic crawling

Hi,
	- Is it possible to set up automatic recrawling, or do I have to
write my own application for that? I need something running in the
background that re-crawls my intranet site 2-3 times a week.

	- Is it possible to crawl a site on port 8080? How? My site is
deployed in a servlet container (Tomcat), and Nutch does not seem to
see the JSP files on that port.
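
Could the default URL filter be the problem here? If I read
conf/crawl-urlfilter.txt correctly, the accept pattern only matches
the host followed directly by a "/", so a URL containing ":8080"
would never pass. Would a line along these lines (hostname made up)
be the right fix?

+^http://intranet\.example\.com:8080/

And since JSP pages often take query strings, maybe the "-[?*!@=]"
line has to be relaxed too?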

Gianni


Re: Automatic crawling

Posted by Tomi NA <he...@gmail.com>.
On 9/21/06, Jacob Brunson <ja...@gmail.com> wrote:
> On 9/21/06, Gianni Parini <gi...@gmail.com> wrote:
> >         - Is it possible to set up automatic recrawling, or do I have to
> > write my own application for that? I need something running in the
> > background that re-crawls my intranet site 2-3 times a week.
>
> On the nutch wiki you will find an intranet recrawl script.  That
> will probably work for you.  However, I think the script has a problem
> with duplicating segment data during the mergesegs step; I've asked
> about it here but haven't had any confirmation.
>
Well, I can confirm that my index grew from ~1.5 GB to ~5 GB after
(if I remember correctly) 2 recrawls.
It doesn't solve the problem I was trying to solve anyway, as it only
indexes pages according to the time of the last crawl, rather than
crawling everything, checking whether the content has a newer
modification/creation date, and indexing only that (the typical
intranet scenario). But I'm running like a madman in the opposite
direction of the topic: please ignore me. :)

t.n.a.
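
P.S. In case someone wants pages to become due for refetching more
often: as far as I can tell the schedule is driven by the fetch
interval, which can be overridden in conf/nutch-site.xml (property
name taken from the 0.8 defaults, if I remember right):

<property>
  <name>db.default.fetch.interval</name>
  <value>3</value>
  <description>Number of days between re-fetches of a page.</description>
</property>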

RE: Automatic crawling

Posted by ja...@thomson.com.
Gianni-
Here's the recrawl script that Jacob mentioned:
http://wiki.apache.org/nutch/IntranetRecrawl
[Note: There are 0.7.x and 0.8 versions]
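
To get the 2-3 runs a week you mentioned, cron is probably the
simplest way to drive the script.  Something along these lines in
your crontab ('crontab -e') should do; the path and log file are only
examples, and the script's arguments are whatever the wiki version
you pick expects:

# recrawl Mon/Wed/Fri at 2am and keep a log
0 2 * * 1,3,5 /opt/nutch/bin/recrawl [script args] >> /var/log/nutch-recrawl.log 2>&1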

Jacob-
I noticed that the 0.8 script has an issue after merging too.
After it merges the segments, it fails to remove all of the segments
that it used to create the merged segment.  (I think that's why there
are all these comments about it filling up your disk and recommending
that you rm your segments and do a periodic recrawl from scratch...)

I changed this line after the mergesegs:
for segment in `ls -d $segments_dir/* | tail -$depth`

to:
for segment in `ls -d $segments_dir/*`

[Note: There is no need for the for loop at all if you don't care
about printing out the segments you are removing; you can just use
'rm -rf $segments_dir/*'.]

Am I missing something? The mergesegs call runs over all of the
segments, so it seems right to nuke the segments folder contents
before moving in the resulting merged segment.
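
In context, the relevant part of the script would then look roughly
like this (a sketch only; $nutch_dir and $crawl_dir are names I'm
assuming, check them against the wiki version):

# merge everything fetched so far into a single new segment
$nutch_dir/nutch mergesegs $crawl_dir/MERGEDsegments -dir $segments_dir

# mergesegs read every old segment above, so all of them can go
for segment in `ls -d $segments_dir/*`
do
  echo "Removing segment $segment"
  rm -rf $segment
done

# move the merged segment into place
mv $crawl_dir/MERGEDsegments/* $segments_dir/
rmdir $crawl_dir/MERGEDsegments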

Jared-

-----Original Message-----
From: Jacob Brunson [mailto:jacob.brunson@gmail.com] 
Sent: Thursday, September 21, 2006 12:57 PM
To: nutch-user@lucene.apache.org
Subject: Re: Automatic crawling

On 9/21/06, Gianni Parini <gi...@gmail.com> wrote:
>         - Is it possible to set up automatic recrawling, or do I have to
> write my own application for that? I need something running in the
> background that re-crawls my intranet site 2-3 times a week.

On the nutch wiki you will find an intranet recrawl script.  That
will probably work for you.  However, I think the script has a problem
with duplicating segment data during the mergesegs step; I've asked
about it here but haven't had any confirmation.

Re: Automatic crawling

Posted by Jacob Brunson <ja...@gmail.com>.
On 9/21/06, Gianni Parini <gi...@gmail.com> wrote:
>         -Is it possible to have an automatic recrawling? have i got to write
> my own application by myself? I need an application running in
> background that re-crawl my intranet site 2-3 times a week..

On the nutch wiki you will find an intranet recrawl script.  That
will probably work for you.  However, I think the script has a problem
with duplicating segment data during the mergesegs step; I've asked
about it here but haven't had any confirmation.