Posted to dev@nutch.apache.org by Daniele Menozzi <me...@ngi.it> on 2005/09/16 18:50:11 UTC

Problems on Crawling

Hi all, I have questions regarding org.apache.nutch.tools.CrawlTool: I have
not really understood the relationship between depth, segments, and fetching.
Take the tutorial for example; I understand these 2 steps:

	bin/nutch admin db -create
	bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000

but, when I do this:
	
	bin/nutch generate db segments

what happens? I think that a dir called 'segments' is created, and inside
of it I can find the links I have previously injected. OK. Next steps:
	
	bin/nutch fetch $s1 	
	bin/nutch updatedb db $s1 

Ok, no problems here. 
But now I cannot understand what happens with this command:

	bin/nutch generate db segments

it is the same command as above, but now I have not injected anything into the
DB; it only contains the pages I have previously fetched.
So, does it mean that when I generate a segment, it will automagically be
filled with links found in fetched pages? And where are these links saved?
And who saves these links?

Thank you so much, this work is really interesting!
	Menoz

-- 
		      Free Software Enthusiast
		 Debian Powered Linux User #332564 
		     http://menoz.homelinux.org

Re: Problems on Crawling

Posted by Daniele Menozzi <me...@ngi.it>.
On 11:44:00 17/Sep, Piotr Kosiorowski wrote:
> Yes - depth is in fact the number of iterations of the
> generate/fetch/update cycle.

ok, now it's clear :)

> nutch generate will include already-fetched pages in a new segment for
> fetching after some time (I think the default is 30 days and you can change
> it in the config file). And if you deduplicate segments, the old page will
> be removed from the index.

ok, thank you for the explanation!!

> regards
> Piotr

regards
	Menoz

-- 
		      Free Software Enthusiast
		 Debian Powered Linux User #332564 
		     http://menoz.homelinux.org

Re: Problems on Crawling

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Daniele Menozzi wrote:

> ok, so the depth value is only used to stop the crawling at a certain
> point and then proceed with the indexing, right?
> 
Yes - depth is in fact the number of iterations of the
generate/fetch/update cycle.
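
(For concreteness, a minimal sketch of that cycle for a given depth, using only
the commands from the tutorial; the `ls -d segments/2* | tail -1` way of picking
up the newest timestamp-named segment directory is borrowed from the tutorial
and is an assumption about your layout, not something CrawlTool requires:)

	depth=3
	for i in $(seq 1 $depth); do
	  bin/nutch generate db segments       # write a fetchlist into a fresh segment dir
	  s=`ls -d segments/2* | tail -1`      # newest segment (timestamp-named; assumption)
	  bin/nutch fetch $s                   # fetch the pages listed in that segment
	  bin/nutch updatedb db $s             # feed newly discovered links back into the WebDB
	done
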
> 
> But, another thing: how can I refresh old pages? What class do I have to
> use?
>
nutch generate will include already-fetched pages in a new segment for
fetching after some time (I think the default is 30 days and you can change
it in the config file). And if you deduplicate segments, the old page will
be removed from the index.
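
(A hedged note on the mechanics: the interval mentioned here is, as far as I
know, the db.default.fetch.interval property, 30 days by default; treat the
exact property name as an assumption and check nutch-default.xml for your
version. Once pages are older than that, refreshing them is just another
round of the same cycle:)

	# pages fetched longer ago than the interval are put into the new fetchlist
	bin/nutch generate db segments
	s=`ls -d segments/2* | tail -1`
	bin/nutch fetch $s                     # re-fetches the stale pages
	bin/nutch updatedb db $s               # refreshed links go back into the WebDB
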
regards
Piotr

Re: Problems on Crawling

Posted by Daniele Menozzi <me...@ngi.it>.
On 19:33:57 16/Sep, Piotr Kosiorowski wrote:
> The 'bin/nutch updatedb db $s1' command updates the WebDB with the links
> found in the pages you fetched in segment $s1.

ok, so the depth value is only used to stop the crawling at a certain
point and then proceed with the indexing, right?


But, another thing: how can I refresh old pages? What class do I have to
use?

Thank you :)
	Menoz


-- 
		      Free Software Enthusiast
		 Debian Powered Linux User #332564 
		     http://menoz.homelinux.org

Re: Problems on Crawling

Posted by Piotr Kosiorowski <pk...@gmail.com>.
The 'bin/nutch updatedb db $s1' command updates the WebDB with the links
found in the pages you fetched in segment $s1.
Regards
Piotr
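
(To make the data flow concrete, here is a sketch of one full round; the $s1
used above is simply the segment directory created by generate, and the readdb
call at the end is an assumption about the WebDB inspection tool, so verify it
exists in your version:)

	bin/nutch generate db segments        # writes a fetchlist into a new segment dir
	s1=`ls -d segments/2* | tail -1`      # $s1 is just that newest directory (assumption)
	bin/nutch fetch $s1                   # fetched pages and their outlinks land in the segment
	bin/nutch updatedb db $s1             # updatedb copies those outlinks into the WebDB
	bin/nutch readdb db -stats            # assumption: prints page/link counts from the WebDB
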


Daniele Menozzi wrote:
> Hi all, I have questions regarding org.apache.nutch.tools.CrawlTool: I have
> not really understood the relationship between depth, segments, and fetching.
> Take the tutorial for example; I understand these 2 steps:
> 
> 	bin/nutch admin db -create
> 	bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000
> 
> but, when I do this:
> 	
> 	bin/nutch generate db segments
> 
> what happens? I think that a dir called 'segments' is created, and inside
> of it I can find the links I have previously injected. OK. Next steps:
> 	
> 	bin/nutch fetch $s1 	
> 	bin/nutch updatedb db $s1 
> 
> Ok, no problems here. 
> But now I cannot understand what happens with this command:
> 
> 	bin/nutch generate db segments
> 
> it is the same command as above, but now I have not injected anything into the
> DB; it only contains the pages I have previously fetched.
> So, does it mean that when I generate a segment, it will automagically be
> filled with links found in fetched pages? And where are these links saved?
> And who saves these links?
> 
> Thank you so much, this work is really interesting!
> 	Menoz
> 


Re: Problems on Crawling

Posted by Michael Ji <fj...@yahoo.com>.
Take a look at this good Nutch doc:

http://wiki.apache.org/nutch/DissectingTheNutchCrawler

Michael Ji

--- Daniele Menozzi <me...@ngi.it> wrote:

> Hi all, I have questions regarding org.apache.nutch.tools.CrawlTool: I have
> not really understood the relationship between depth, segments, and fetching.
> Take the tutorial for example; I understand these 2 steps:
> 
> 	bin/nutch admin db -create
> 	bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000
> 
> but, when I do this:
> 
> 	bin/nutch generate db segments
> 
> what happens? I think that a dir called 'segments' is created, and inside
> of it I can find the links I have previously injected. OK. Next steps:
> 
> 	bin/nutch fetch $s1
> 	bin/nutch updatedb db $s1
> 
> Ok, no problems here.
> But now I cannot understand what happens with this command:
> 
> 	bin/nutch generate db segments
> 
> it is the same command as above, but now I have not injected anything into the
> DB; it only contains the pages I have previously fetched.
> So, does it mean that when I generate a segment, it will automagically be
> filled with links found in fetched pages? And where are these links saved?
> And who saves these links?
> 
> Thank you so much, this work is really interesting!
> 	Menoz
> 
> -- 
> 		      Free Software Enthusiast
> 		 Debian Powered Linux User #332564 
> 		     http://menoz.homelinux.org
> 

