Posted to user@nutch.apache.org by Vangelis karv <ka...@hotmail.com> on 2013/12/17 11:15:00 UTC

Crawling a specific site only

Hi again! My goal is to crawl a specific site: I want to crawl all the links that exist under it. For example, if I decide to crawl http://www.uefa.com/, I want to parse all of its links (photos, videos, HTML pages, etc.) and not only the top-scoring URLs for that site (the topN). So my question is: how can we tell Nutch to crawl everything in a site, and not only the pages that have the best score?

Re: Crawling a specific site only

Posted by Tejas Patil <te...@gmail.com>.
You need to provide the topN parameter to run Generate; you can't skip it. What
I meant was to set its value to something much larger than 2000.
Note: the maximum allowable value for topN is (2^63)-1. Don't exceed that.

Thanks,
Tejas



RE: Crawling a specific site only

Posted by Vangelis karv <ka...@hotmail.com>.
Thanks for the support, guys! I'll crawl again with generate.count.mode=host and generate.max.count=-1. However, if I don't set -topN in the nutch script, it won't let me run GeneratorJob.
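For reference, generate-scoped settings like these normally go in conf/nutch-site.xml. A minimal sketch (property names as discussed in this thread; whether -1 is honored as "no per-host cap" should be verified against your Nutch version's nutch-default.xml):

```xml
<!-- conf/nutch-site.xml (sketch):
     count generated URLs per host, and do not cap the number per host -->
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
<property>
  <name>generate.max.count</name>
  <value>-1</value>
</property>
```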


RE: Crawling a specific site only

Posted by Markus Jelsma <ma...@openindex.io>.
Increase it to a reasonably high value, or don't set it at all; it will then attempt to crawl as much as it can. Also check generate.count.mode and generate.max.count.
 
 

RE: Crawling a specific site only

Posted by Vangelis karv <ka...@hotmail.com>.
Can you be a little more specific about that, Tejas?


Re: Crawling a specific site only

Posted by Tejas Patil <te...@gmail.com>.
You should bump up the value of topN instead of setting it to 2000. That would
make more of the URLs eligible for fetching.

Thanks,
Tejas



RE: Crawling a specific site only

Posted by Vangelis karv <ka...@hotmail.com>.
Markus and Wang, thank you very much for your fast responses. I forgot to mention that I use Nutch 2.2.1 and MySQL. Both the DomainFilter and ignore.external.links ideas are awesome! What really bothers me is that dreaded "-topN"; I really want to live without it! :) I hate it when I open my database and see that I have, for example, 2000 links unfetched (meaning they are not parsed, hence useless) and only 2000 fetched.


Re: Crawling a specific site only

Posted by Wang Yi <wa...@gmail.com>.
Hi,
Just set
	  <name>db.ignore.external.links</name>
	  <value>true</value>
and run the crawl script several times; the default number of pages to be
added per round is 50,000.

Is that right?
Wang
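As written, the snippet above is only the inner part of a property; in conf/nutch-site.xml it needs the full <property> wrapper. A minimal sketch (the description text is paraphrased, not quoted from nutch-default.xml):

```xml
<!-- conf/nutch-site.xml: only follow outlinks that stay on the same site -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks pointing to external hosts are ignored,
  so the crawl stays within the seed site(s).</description>
</property>
```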





RE: Crawling a specific site only

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - You can use the DomainUrlFilter to restrict URLs to a specific site.
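For what it's worth, a sketch of how that might look, assuming the urlfilter-domain plugin is enabled via plugin.includes and reads conf/domain-urlfilter.txt (check the plugin and file names against your Nutch version):

```xml
<!-- conf/nutch-site.xml: include urlfilter-domain in the active plugins.
     The value below is an illustrative plugin list, not a recommended one. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-(html|tika)|index-basic|scoring-opic</value>
</property>
```

Then list the allowed domains, one per line, in conf/domain-urlfilter.txt, e.g. a single line containing `uefa.com`.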

 
 