Posted to user@nutch.apache.org by Manoharam Reddy <ma...@gmail.com> on 2007/05/28 12:22:39 UTC

Nutch crawls blocked sites - Why?

In my crawl-urlfilter.txt I have put a statement like

-^http://cdserver

Still, while running the crawl, it fetches this site. I am running the
crawl with these commands:

bin/nutch inject crawl/crawldb urls

Inside a loop:

bin/nutch generate crawl/crawldb crawl/segments -topN 10
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment -threads 10
bin/nutch updatedb crawl/crawldb $segment
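
Put together, the cycle might look like this as a shell script (a
sketch; the depth of three iterations is an illustrative assumption):

#!/bin/sh
# Inject the seed URLs once, then repeat the generate/fetch/update cycle.
bin/nutch inject crawl/crawldb urls
for i in 1 2 3; do   # assumed crawl depth of 3; adjust as needed
  bin/nutch generate crawl/crawldb crawl/segments -topN 10
  segment=`ls -d crawl/segments/* | tail -1`   # pick the newest segment
  bin/nutch fetch $segment -threads 10
  bin/nutch updatedb crawl/crawldb $segment
done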

Why does it fetch http://cdserver even though I have blocked it? Is it
being allowed by some other filter file? If so, what do I need
to check? Please help.

Re: Nutch crawls blocked sites - Why?

Posted by Manoharam Reddy <ma...@gmail.com>.
Thanks! It worked.

On 5/28/07, Doğacan Güney <do...@gmail.com> wrote:
> Hi,
>
> On 5/28/07, Manoharam Reddy <ma...@gmail.com> wrote:
> > In my crawl-urlfilter.txt I have put a statement like
> >
> > -^http://cdserver
> >
> > Still, while running the crawl, it fetches this site. I am running the
> > crawl with these commands:
> >
> > bin/nutch inject crawl/crawldb urls
> >
> > Inside a loop:
> >
> > bin/nutch generate crawl/crawldb crawl/segments -topN 10
> > segment=`ls -d crawl/segments/* | tail -1`
> > bin/nutch fetch $segment -threads 10
> > bin/nutch updatedb crawl/crawldb $segment
> >
> > Why does it fetch http://cdserver even though I have blocked it? Is it
> > being allowed by some other filter file? If so, what do I need
> > to check? Please help.
> >
>
> In your case, crawl-urlfilter.txt is not read because you are not
> running the 'crawl' command (as in bin/nutch crawl). You have to update
> regex-urlfilter.txt or prefix-urlfilter.txt and make sure that you
> enable them in your conf.
>
> --
> Doğacan Güney
>

Re: Nutch crawls blocked sites - Why?

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

On 5/28/07, Manoharam Reddy <ma...@gmail.com> wrote:
> In my crawl-urlfilter.txt I have put a statement like
>
> -^http://cdserver
>
> Still, while running the crawl, it fetches this site. I am running the
> crawl with these commands:
>
> bin/nutch inject crawl/crawldb urls
>
> Inside a loop:
>
> bin/nutch generate crawl/crawldb crawl/segments -topN 10
> segment=`ls -d crawl/segments/* | tail -1`
> bin/nutch fetch $segment -threads 10
> bin/nutch updatedb crawl/crawldb $segment
>
> Why does it fetch http://cdserver even though I have blocked it? Is it
> being allowed by some other filter file? If so, what do I need
> to check? Please help.
>

In your case, crawl-urlfilter.txt is not read because you are not
running the 'crawl' command (as in bin/nutch crawl). You have to update
regex-urlfilter.txt or prefix-urlfilter.txt and make sure that you
enable them in your conf.
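
For example (a sketch; the plugin.includes value below is a minimal
assumed set, so merge it with whatever your installation already
enables), you could move the rule into conf/regex-urlfilter.txt,
keeping it above the final catch-all line:

-^http://cdserver
+.

and check that the regex filter plugin is listed in plugin.includes in
conf/nutch-site.xml:

<property>
  <!-- assumed minimal plugin list; keep your existing plugins too -->
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic</value>
</property>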

-- 
Doğacan Güney