Posted to user@nutch.apache.org by Steve Kallestad <ka...@gmail.com> on 2007/02/08 10:17:56 UTC

Recrawl not following crawl-urlfilter.txt

Please oh please, don't shoot me for being a newbie.

I have set up a site search using nutch, and I have the
crawl-urlfilter.txt file configured so that everything works properly
when I call something similar to:

bin/nutch crawl urls -dir crawl -depth 3 -topN 100


I grabbed the Intranet Recrawl script from
http://wiki.apache.org/nutch/IntranetRecrawl

I noticed while it was running that nutch was actually grabbing files I
didn't want it to grab, and it was also going off site to get others.
Obviously I don't want it to do that.

On my site, without making a change to the crawl-urlfilter.txt file, nutch
is trying to fetch some nonexistent files, probably because of some
javascript that I have, so I really need my re-crawl to follow my original
guidelines.

My question is - how can I modify the IntranetRecrawl script so that it
follows crawl-urlfilter.txt, or barring that where can I find a documented
list of steps to recrawl my site?


Thanks,
Steve

My nutch is at:
http://www.stevekallestad.com/search/
in case anybody wanted to check it out.  I have the directory proxied
through apache which I thought was pretty cool.

Re: Recrawl not following crawl-urlfilter.txt

Posted by Steve Kallestad <ka...@gmail.com>.
Thanks!  You're the man!!!

Now I can automate this thing :).

Steve
http://www.stevekallestad.com/

On 2/8/07, chee wu <ch...@gmail.com> wrote:
> The crawl command uses "crawl-tool.xml" as its default nutch config, but the recrawl script uses "nutch-site.xml". So just copy all of the configuration in "crawl-tool.xml" into "nutch-site.xml". As for selecting "crawl-urlfilter.txt", refer to the property below in your "crawl-tool.xml":
> <property>
>   <name>urlfilter.regex.file</name>
>   <value>crawl-urlfilter.txt</value>
> </property>

Re: Recrawl not following crawl-urlfilter.txt

Posted by chee wu <ch...@gmail.com>.
The crawl command uses "crawl-tool.xml" as its default nutch config, but the recrawl script uses "nutch-site.xml". So just copy all of the configuration in "crawl-tool.xml" into "nutch-site.xml". As for selecting "crawl-urlfilter.txt", refer to the property below in your "crawl-tool.xml":
<property>
  <name>urlfilter.regex.file</name>
  <value>crawl-urlfilter.txt</value>
</property>
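
For illustration, here is a minimal sketch of how that property might sit inside "nutch-site.xml" once it has been copied over. It assumes the Hadoop-style <configuration> root element used by the nutch config files of this era; check the root element of your own "nutch-site.xml" and keep whatever it already uses. Any other properties found in your "crawl-tool.xml" would be copied in the same way and are not shown here.

```xml
<!-- nutch-site.xml (sketch, not a complete file) -->
<!-- Assumption: the root element is <configuration>; match your existing file. -->
<configuration>

  <!-- Copied from crawl-tool.xml so that the individual commands run by the
       recrawl script read the same URL filter file as "bin/nutch crawl". -->
  <property>
    <name>urlfilter.regex.file</name>
    <value>crawl-urlfilter.txt</value>
  </property>

</configuration>
```

With this in place, the recrawl steps should pick up crawl-urlfilter.txt instead of the default filter file, provided the file is in the conf directory that nutch loads.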
