Posted to user@nutch.apache.org by Dennis <ar...@yahoo.com.cn> on 2010/09/28 10:08:00 UTC

crawl www

Hi, all,
I want to crawl the whole WWW. How do I configure "crawl-urlfilter.txt"? It used to be:

  # accept hosts in MY.DOMAIN.NAME
  +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

Thanks,
Dennis

Re: crawl www

Posted by Markus Jelsma <ma...@buyways.nl>.
Oh, you don't need crawl-urlfilter.txt. It's used by the crawl command only, and 
if you're about to crawl the whole internet (!) you will need the steps I 
explained in the other e-mail. You can forget about the crawl command in this 
case.
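
For reference, a sketch of the usual whole-web change from the tutorial (for 
Nutch 1.x; check the defaults in your own conf/ directory): edit 
conf/regex-urlfilter.txt, the filter file the individual tools use, so the 
final rule accepts everything instead of restricting to one domain:

  # skip file:, ftp:, and mailto: urls
  -^(file|ftp|mailto):

  # accept anything else
  +.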



Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: crawl www

Posted by Dennis <ar...@yahoo.com.cn>.
Thanks, Markus
Dennis


Re: crawl www

Posted by Markus Jelsma <ma...@buyways.nl>.
You should read a bit; maybe this will help.

http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/Crawl

In short, in Nutch you need a CrawlDB, a DB listing your URLs. To start 
fetching URLs you need to generate a fetch list from your CrawlDB. These are 
the URLs you're going to fetch in the first and subsequent cycles. When done 
fetching, you can parse the fetched pages to extract the actual content. Now 
you've got a fully parsed segment.

Later you need to update your CrawlDB, adding the newly found URLs from your 
parsed segment. This way your CrawlDB grows, and the new URLs can be used to 
generate your subsequent fetch lists.

Finally, you need to update your LinkDB (which holds anchors to URLs) and 
index the parsed content, either with Nutch 1.x itself or in a Solr instance.
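
One cycle of the above, sketched with the Nutch 1.x command-line tools 
(directory names and the Solr URL are placeholders, and exact options vary by 
version, so treat this as an outline rather than a ready-made script):

  # put seed URLs into the CrawlDB
  bin/nutch inject crawl/crawldb urls

  # generate a fetch list, then fetch and parse the newest segment
  bin/nutch generate crawl/crawldb crawl/segments
  s1=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s1
  bin/nutch parse $s1

  # feed newly found URLs back into the CrawlDB
  bin/nutch updatedb crawl/crawldb $s1

  # build the LinkDB and index into Solr
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*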




Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: crawl www

Posted by Dennis <ar...@yahoo.com.cn>.
Sorry for interrupting, Markus,
But I don't quite understand. How do I "update your DBs"? What should I do about "crawl-urlfilter.txt"? Thanks
Dennis


Re: crawl www

Posted by Markus Jelsma <ma...@buyways.nl>.
Dennis, you shouldn't hijack my thread ;)

Anyway, it's all about crawl, update your DBs, and recrawl, repeating the same 
loop over and over.
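
As a sketch, that loop in Nutch 1.x terms (assuming the DBs already exist; 
names are placeholders):

  for i in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments
    segment=`ls -d crawl/segments/2* | tail -1`
    bin/nutch fetch $segment
    bin/nutch parse $segment
    bin/nutch updatedb crawl/crawldb $segment
  done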

Cheers,


Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350