Posted to user@nutch.apache.org by Lyndon Maydwell <ma...@gmail.com> on 2007/07/13 03:53:41 UTC

fetch errors?

Hi list,

I'm running a crawl over a site, but it seems to be fetching pages
outside of the regex domain.

+^http://([a-z0-9]*\.)*curtin.edu.au/

e.g.

fetching http://www.environment.sa.gov.au/epa/used_packaging.html
fetching http://abc.net.au/triplej/hottest100/ringtones/default.htm
fetching http://dmoz.org/News/Newspapers/

This seems wrong to me. Is there some way to make sure I haven't made any
stupid mistakes?

Re: fetch errors?

Posted by Lyndon Maydwell <ma...@gmail.com>.
That did fix the problem, thank you.

On 7/13/07, Karol Rybak <ka...@gmail.com> wrote:
> Make sure that you configured the proper file. If you are using the crawl
> tool, crawl-urlfilter.txt is used. If you use fetch or fetch2,
> regex-urlfilter.txt is used.

Re: fetch errors?

Posted by Karol Rybak <ka...@gmail.com>.
Make sure that you configured the proper file. If you are using the crawl
tool, crawl-urlfilter.txt is used. If you use fetch or fetch2,
regex-urlfilter.txt is used.
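
To rule out the pattern itself, it may help to test it outside Nutch. The
sketch below is only an approximation, assuming the rule is applied as an
unanchored search (hence the leading ^) and that a URL matching no accept
rule is rejected. The helper name accepted() and the in-domain sample URL
are made up for illustration.

```python
import re

# The rule from the question, copied verbatim without the leading "+".
# Note: the unescaped dots in "curtin.edu.au" match any character, which
# makes the rule slightly looser than intended but does not explain the
# off-site fetches.
RULE = re.compile(r"^http://([a-z0-9]*\.)*curtin.edu.au/")


def accepted(url):
    """Return True if the URL matches the accept rule (hypothetical helper)."""
    return RULE.search(url) is not None


urls = [
    "http://www.environment.sa.gov.au/epa/used_packaging.html",
    "http://abc.net.au/triplej/hottest100/ringtones/default.htm",
    "http://dmoz.org/News/Newspapers/",
    "http://www.curtin.edu.au/",  # hypothetical in-domain URL
]

for url in urls:
    print("accept" if accepted(url) else "reject", url)
```

If the three off-site URLs are rejected here, the pattern is probably fine
and the likely cause is that the rule was placed in a file the command in
use does not read.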
