Posted to user@nutch.apache.org by Ian Reardon <ir...@gmail.com> on 2005/05/13 19:54:31 UTC

How does this sound

I am going to crawl a small set of sites. I never want to go off
site, and I also want to strictly control my link depth.

I set up crawls for each site using the crawl command, then manually
move the segments folder to my "master" directory and re-index.  (This
can all be scripted.)  This gives me the flexibility to QA each
individual crawl.

Am I jumping through unnecessary hoops here or does this sound like a
reasonable plan?
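
Roughly, the script I have in mind looks something like this (assuming
the 0.x command names; the sites/*/urls.txt layout is just a
placeholder, and the indexing step may be named differently on your
release):

  #!/bin/sh
  # One self-contained crawl per site, then collect the segments
  # under a "master" directory and re-index them together.
  MASTER=master
  mkdir -p $MASTER/segments

  for site in sites/*; do
      name=`basename $site`
      # crawl this site into its own directory so it can be QA'd
      bin/nutch crawl $site/urls.txt -dir crawl-$name -depth 2
      # once the crawl looks good, move its segments across
      mv crawl-$name/segments/* $MASTER/segments/
  done

  # re-index each segment (check bin/nutch's usage message -- I
  # believe the 0.x index command takes one segment directory)
  for seg in $MASTER/segments/*; do
      bin/nutch index $seg
  done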

Re: How does this sound

Posted by Ian Reardon <ir...@gmail.com>.
I have created individual url-filters to specify exactly which pages I
want from each site, and I wrote a script to switch the different
filters in and out when I crawl.  That way I'm sure never to go off
site.
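
The swapping is nothing fancy.  Each site gets a filter file in the
crawl-urlfilter.txt format from the tutorial, e.g. (www.example.com
standing in for a real site):

  # filters/example-urlfilter.txt: accept only pages on this
  # site's host, reject everything else
  +^http://www\.example\.com/
  -.

and the script just copies the right one into conf/ before each crawl
(the filters/ and sites/ layout here is a placeholder):

  #!/bin/sh
  for f in filters/*-urlfilter.txt; do
      name=`basename $f -urlfilter.txt`
      # swap this site's filter in, then crawl with it active
      cp $f conf/crawl-urlfilter.txt
      bin/nutch crawl sites/$name/urls.txt -dir crawl-$name -depth 2
  done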


Re: How does this sound

Posted by EM <em...@cpuedge.com>.
Sounds fine to me, although more experienced people here may have
different opinions.

One small thing: if you are setting up each site individually, then
fully disable the spidering.  That way, you can inject the individual
sites yourself.
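
Something along these lines, assuming the 0.x whole-web tools (the
inject options vary a bit between releases, so check the usage
messages for your version):

  #!/bin/sh
  # create the web db and inject only your hand-picked URLs; nothing
  # gets spidered beyond the fetch rounds you run yourself
  bin/nutch admin db -create
  bin/nutch inject db -urlfile urls.txt
  # one generate/fetch/updatedb round per level of depth you want
  bin/nutch generate db segments
  s=`ls -d segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb db $s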

Good luck,
Emilijan