Posted to user@nutch.apache.org by devang pandey <de...@gmail.com> on 2013/07/10 10:28:51 UTC

nutch crawling issues

I have a website, e.g. www.example.com. When I crawl it with Nutch 1.4, I run
into duplicate crawling. There are a number of pages like
www.example.com/s38r84rejkfndn/xyz.aspx. The segment s38r84rejkfndn changes
every time you visit the page, so the crawler fetches it again and again; I
think that to Nutch it looks like a new URL every time. Please suggest how to
overcome this issue.

Re: nutch crawling issues

Posted by devang pandey <de...@gmail.com>.
Hey Markus, but if I specify a regex then those URLs won't be crawled at
all. I don't want that; all I want is to crawl them only once.
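
To make the "crawl only once" requirement concrete: one way to think about it is to map every variant of the page to a single canonical URL before deciding whether it has already been seen. The following is only a Python sketch of that idea, not Nutch code and not something proposed in this thread. The pattern (a 14-character lowercase alphanumeric first path segment) and the names SESSION_SEGMENT, canonical_url and urls_to_fetch are assumptions made for illustration, and the sketch assumes the site serves the same page whatever the changing segment is.

```python
import re

# Hypothetical pattern: the first path segment is a 14-character token
# such as "s38r84rejkfndn". Adjust it to whatever the real site emits.
SESSION_SEGMENT = re.compile(r"^((?:https?://)?[^/]+)/[a-z0-9]{14}(/.*)$")


def canonical_url(url):
    """Drop the changing segment so all variants share one key."""
    return SESSION_SEGMENT.sub(r"\1\2", url)


def urls_to_fetch(discovered):
    """Keep only the first URL seen for each canonical form."""
    seen = set()
    keep = []
    for url in discovered:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            keep.append(url)
    return keep


links = [
    "www.example.com/s38r84rejkfndn/xyz.aspx",
    "www.example.com/q91k20zmvbtw3x/xyz.aspx",
    "www.example.com/about.aspx",
]
print(urls_to_fetch(links))
```

Nutch itself ships a urlnormalizer-regex plugin, configured in conf/regex-normalize.xml, that rewrites URLs by pattern. Whether it suits this particular site is not discussed in the thread.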



On Wed, Jul 10, 2013 at 3:23 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> Hi - use conf/regex-urlfilter.txt, and make sure urlfilter-regex is
> enabled in the plugin.includes property of your nutch-site.xml.

Re: nutch crawling issues

Posted by devang pandey <de...@gmail.com>.
Hello Markus, I have one point of confusion: should I make the changes in
the crawl-url filter or the regex filter?


On Wed, Jul 10, 2013 at 3:12 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> Hi,
>
> Use a regex URL filter to filter those URLs and prevent them from being
> crawled again.
>
> Cheers

RE: nutch crawling issues

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

Use a regex URL filter to filter those URLs and prevent them from being crawled again.

Cheers 
 
-----Original message-----
> From:devang pandey <de...@gmail.com>
> Sent: Wednesday 10th July 2013 10:29
> To: user@nutch.apache.org
> Subject: nutch crawling issues
> 
> I have a website, e.g. www.example.com. When I crawl it with Nutch 1.4, I run
> into duplicate crawling. There are a number of pages like
> www.example.com/s38r84rejkfndn/xyz.aspx. The segment s38r84rejkfndn changes
> every time you visit the page, so the crawler fetches it again and again; I
> think that to Nutch it looks like a new URL every time. Please suggest how to
> overcome this issue.
> 

RE: nutch crawling issues

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - use conf/regex-urlfilter.txt, and make sure urlfilter-regex is enabled in the plugin.includes property of your nutch-site.xml.
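
As a rough illustration of how rules in that file behave: each line is a regular expression prefixed with + (accept) or - (reject), the first rule that matches a URL decides, and a URL that matches no rule is dropped. The Python sketch below mimics that evaluation. The two rules and the name is_accepted are hypothetical, and the 14-character pattern is an assumption about the changing segment, not a rule taken from this thread.

```python
import re

# Hypothetical rules in regex-urlfilter.txt style: a sign, then a regex.
RULES = [
    r"-^https?://www\.example\.com/[a-z0-9]{14}/",  # reject session-style paths
    r"+.",                                          # accept everything else
]


def is_accepted(url, rules=RULES):
    """First matching rule decides; no match means the URL is ignored."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


print(is_accepted("http://www.example.com/s38r84rejkfndn/xyz.aspx"))
print(is_accepted("http://www.example.com/about.aspx"))
```

With a reject rule like this the session-style URLs are skipped entirely, which is the behaviour questioned in the follow-up message.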
 
 
-----Original message-----
> From:devang pandey <de...@gmail.com>
> Sent: Wednesday 10th July 2013 11:51
> To: user@nutch.apache.org
> Subject: Re: nutch crawling issues
> 
> Hello Markus, I have one point of confusion: should I make the changes in
> the crawl-url filter or the regex filter?