
Posted to dev@nutch.apache.org by Camilo Abel Monreal <km...@matcom.uh.cu> on 2005/09/05 13:15:37 UTC

separate Crawler from nutch

Hi,

 I am trying to separate the Nutch crawler from the rest of the project. I need to
download a page to a file. If anyone has done this, please help me.

 Thanks, kmilo
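
For the narrow goal of saving one page to a file, a full crawler may not be needed. The following is a minimal sketch in plain Java, not Nutch code: the class name `PageDownloader` and its `download` method are hypothetical, and it assumes a single known URL. It does no link following, robots.txt handling, or politeness delays, which is what a real crawler adds.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Nutch: fetches one URL and writes it to disk.
public class PageDownloader {

    /**
     * Copies the raw bytes served at pageUrl into target,
     * replacing the file if it already exists.
     * Returns the number of bytes written.
     */
    public static long download(String pageUrl, Path target) throws IOException {
        try (InputStream in = URI.create(pageUrl).toURL().openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java PageDownloader <url> <output-file>");
            System.exit(1);
        }
        long bytes = download(args[0], Paths.get(args[1]));
        System.out.println("Saved " + bytes + " bytes to " + args[1]);
    }
}
```

The page is stored byte for byte, so the file keeps whatever character encoding the server sent.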

link analysis in OC

Posted by Michael Ji <fj...@yahoo.com>.

Hi Kelvin,

Does OC compute page scores the same way Nutch crawling does?

I found that nutch/index computes the document boost value based
on the score/anchor data in the segment/fetchlist data
structure.

I guess OC won't generate this boost score by itself
or use its own data structure. So if we want to have
this score saved in the Lucene index, we need to use
nutch/generate.. to get the fetchlist and generate
the webdb.

That means OC will have to live with Nutch's webdb and other
data structures.

Is my thinking right?

thanks,

Michael Ji


Re: separate Crawler from nutch

Posted by Stefan Groschupf <sg...@media-style.com>.

Hi,
There are a number of standalone crawlers available;
the coolest one, from my point of view, is crawler.archive.org
Stefan

On 05.09.2005 at 13:15, Camilo Abel Monreal wrote:

> Hi :
>
> I try to separate the nutch crawler from entire project.  I need to  
> download the page to a file.Please if someone have  that please  
> help me.
>
> thanks kmilo
>
>